
NullReferenceException when doing PDF OCR

Posted: Mon Feb 13, 2017 5:23 am
by attila1977
I'm doing bulk PDF-to-PDF OCR conversion using GdPicture.NET SDK v12.0.57.

My application runs OCR on image PDF files one by one, following a predefined file list.
It runs on Windows Server 2008 R2 with two quad-core processors and 15 OCR threads.
The application crashed after 40,000 pages of OCR, and I found the following error message in Event Viewer.

Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.NullReferenceException
Stack:
at gdpicture_᠗.gdpicture_ᜀ(GdPicture12.Internal.Imaging.GdPictureBitmap ByRef, Int32, Int32, GdPictureRAWColorPalette, Byte[], Byte[])
at gdpicture_ហ.gdpicture_ᜀ(Byte[], gdpicture_᠖)
at gdpicture_ហ.gdpicture_ᜂ()
at gdpicture_ហ.gdpicture_ᜀ(gdpicture_ឧ, Boolean ByRef)
at gdpicture_ស.gdpicture_ᜀ(Int32, gdpicture_ᠯ ByRef)
at GdPicture12.GdPicturePDF.ExtractPageImage(Int32)
at GdPicture12.GdPicturePDF.gdpicture_ᜀ(Int32 ByRef, Boolean, Boolean, Boolean, Boolean, Boolean)
at GdPicture12.GdPicturePDF.gdpicture_ᜀ(Int32, System.String, System.String, System.String, Single, Boolean, gdpicture_២, Boolean, Int32)
at GdPicture12.GdPicturePDF.gdpicture_ᜀ(System.String, System.String, System.String, System.String, Single, gdpicture_២, Int32)
at GdPicture12.GdPicturePDF+ᜁ.gdpicture_()
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Threading.ThreadHelper.ThreadStart()

Re: NullReferenceException when doing PDF OCR

Posted: Tue Feb 14, 2017 3:03 pm
by David
Hi,

Thank you for contacting us.

May I ask you to share the required material, including the input PDFs and a code snippet, so I can reproduce the issue on my end?

Thank you

David

Re: NullReferenceException when doing PDF OCR

Posted: Wed Feb 15, 2017 5:00 am
by attila1977
Hi David,
Thank you for your reply.
I'm sorry, but I can't provide the image PDFs because they are confidential documents.
Basically, each PDF file contains 300-800 pages of scanned images, with page sizes from A4 to A3.
Most of them are black and white, but some PDFs may contain around 10% 24-bit color images.
We use our application to bulk-OCR these image PDFs; there are around 1000 PDF files in total. We created a PDF list, and the application runs OCR according to this list.
We noticed that an error occurred after the application had been running for more than 12 hours. GdPicture did not raise an exception to our code; the process crashed directly, and we could only see the error message in Windows Event Viewer. When we ran the application again, it continued OCR without problems and lasted for another 10 hours or even 2 days. The error is random and I don't know how to prevent it.

Below is my code snippet:

Code:

public void Start()
{
    lock (this)
    {
        CanStartOCR = false;
        if (_OCREntity != null)
        {
            Console.WriteLine(DateTime.Now + " OCR Start");
            _nativePdf.OcrPagesProgress += _nativePdf_OcrPagesProgress;
            _nativePdf.OcrPagesDone += _nativePdf_OcrPagesDone;
            if (_nativePdf.LoadFromFile(_OCREntity.OCRFilePath, false) == GdPictureStatus.OK)
            {
                string ocrlanguage = "eng";
                if (_OCREntity.OCRLanguage != null && !_OCREntity.OCRLanguage.Equals(""))
                    ocrlanguage = _OCREntity.OCRLanguage;

                if (OCRLanguageCheck)
                {
                    ocrlanguage = OCRLanguageText;
                }
                var status = _nativePdf.OcrPages("*", _OCREntity.OCRThreadMaxCount, ocrlanguage, _OCREntity.OCRPath, "", 300);
            }
        }
        else
        {
            Console.WriteLine(DateTime.Now + " OCR No Start");
            CanStartOCR = true;
        }
    }
}
        
private void _nativePdf_OcrPagesDone(GdPictureStatus Status)
{
    _nativePdf.OcrPagesProgress -= _nativePdf_OcrPagesProgress;
    _nativePdf.OcrPagesDone -= _nativePdf_OcrPagesDone;
    Console.WriteLine(DateTime.Now + " Page Done" + Status.ToString() + "{" + _OCREntity._fileName + "}");
    if (Status == GdPictureStatus.OK)
    {
        Status = _nativePdf.SaveToFileInc(_OCREntity.OCROutputPath);
        if (Status == GdPictureStatus.OK)
        {
            _nativePdf.CloseDocument();
            _nativePdf.ClosePath();
            CanStartOCR = true;

            Document doc = _Job.Document;
            doc.OCRPDF = true;
            doc.OCRPDFName = _OCREntity.OCROutputFileName;
            doc.OCRPDFPath = _OCREntity.OCROutputPath;
            doc.OCRPDFRootPath = _OCREntity.OCROutputRootPath;
            doc.Status = Constant.BatchStatus.Completed;
            doc.Station = Constant.BatchStation.PDF;

            _OCRBuilder.OCRCompleted(_Job, doc);
            _OCRBuilder.ExportWaiting(_Job);

            OnOCRPagesDoneRequest(Status.ToString(), _PageNo, _Processed, _Count);
        }
    }
}

private int _PageNo;
private int _Processed;
private int _Count;

private void _nativePdf_OcrPagesProgress(GdPictureStatus Status, int PageNo, int Processed, int Count, ref bool Cancel)
{
    _PageNo = PageNo;
    _Processed = Processed;
    _Count = Count;
    OnOCRPagesProgressRequest(Status.ToString(), PageNo, Processed, Count);
}

Re: NullReferenceException when doing PDF OCR

Posted: Thu Feb 16, 2017 4:40 pm
by David
Hi,

Having a look at your code I can detect a resource leak.

Please have a look at the _nativePdf_OcrPagesDone method. If the character recognition engine fails to read the document for some reason (document too large, not enough memory, etc.), the Status parameter may be something other than GdPictureStatus.OK. In that case your code never calls CloseDocument, so the memory used by the object is not released.
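
To illustrate the point, here is a minimal, self-contained sketch of the pattern, not a fix for the exact code above. Status, FakeDocument and OcrDoneHandler are hypothetical stand-ins and not GdPicture types; the only idea shown is that placing the close call in a finally block makes it run whether or not OCR and saving succeed.

```csharp
using System;

// Hypothetical stand-in for an SDK status enum; only the shape matters here.
public enum Status { OK, Failed }

// Hypothetical stand-in for a document object that holds resources until closed.
public class FakeDocument
{
    public bool IsOpen { get; private set; } = true;

    // Pretend to save the document; always succeeds in this sketch.
    public Status Save() { return Status.OK; }

    // Release the resources held by the document.
    public void Close() { IsOpen = false; }
}

public static class OcrDoneHandler
{
    // Returns true only when OCR succeeded and the result was saved.
    // The document is closed on every path, including failures.
    public static bool HandleDone(FakeDocument doc, Status ocrStatus)
    {
        try
        {
            if (ocrStatus != Status.OK)
            {
                return false; // the finally block below still runs
            }
            return doc.Save() == Status.OK;
        }
        finally
        {
            doc.Close();
        }
    }
}
```

In a real handler, any flag that lets the next file start (such as CanStartOCR in the code above) would probably also need to be reset on the failure path, otherwise a single failed file could stall the whole batch.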

We are used to dealing with confidential information. If you wish, we can sign an NDA so that you can provide your documents and we can reproduce the issue on our end.

Regards,

David

Re: NullReferenceException when doing PDF OCR

Posted: Tue Feb 21, 2017 4:57 am
by attila1977
Hi David,

Thank you for your reply.

I created a dummy image PDF to reproduce the error in my test environment. The same error occurred again.
I made 50 copies of this image PDF (https://drive.google.com/file/d/0BxI_4n ... sp=sharing), put them into a folder, and let my application process them one by one.
The error occurred while the application was processing the 27th PDF.
My test environment: Windows Server 2008 R2, 1 x E5410 quad-core CPU, 32 GB RAM, GdPicture.NET SDK v12.0.57, 64-bit platform, 3 OCR threads.

I hope it can help you to reproduce the error on your end.

Thanks.

Re: NullReferenceException when doing PDF OCR

Posted: Thu Feb 23, 2017 10:59 am
by David
Hi,

I'm sorry but I'm not able to reproduce the issue with the latest GdPicture.NET 12.

May I ask you to update and confirm the latest GdPicture.NET 12 solves the issue?

I'm looking forward to hearing from you.

David

Re: NullReferenceException when doing PDF OCR

Posted: Thu Mar 02, 2017 4:31 am
by attila1977
Hi David,

I've updated to the latest GdPicture.NET 12.
The issue still remains.

Re: NullReferenceException when doing PDF OCR

Posted: Fri Mar 03, 2017 11:35 am
by Cedric
Hello,

We are still trying to reproduce the issue, so far without success.
We are going to let the process run over the weekend to see whether it happens on a long run.
In any case we will let you know the result.

Re: NullReferenceException when doing PDF OCR

Posted: Mon Mar 06, 2017 10:42 am
by Cedric
Hi,

We are still unable to reproduce the issue even with very long runs.
Could you please share a reduced application that we can run as-is?

Re: NullReferenceException when doing PDF OCR

Posted: Tue Mar 07, 2017 6:17 am
by attila1977
Hi Cedric,
Thanks for your help, I will prepare a reduced application.

Re: NullReferenceException when doing PDF OCR

Posted: Wed Aug 08, 2018 12:01 pm
by benedikt
I ran into this issue when the imaging and PDF instances were disposed before the OCR process had finished. My solution for now is to set the sync option to true.

It is the last parameter here:

Code:

pdfInstance.OcrPages("*", 0, language, GdPictureHelper.OCRDirectory, "", resolution, 0, true);
Complete code that caused the error:

Code:

        public byte[] Convert(byte[] data, bool embeddOCRText = true, string language = "deu")
        {
            byte[] pdf = null;

            using (var pdfInstance = GdPictureHelper.GetPDFInstance())
            {
                using (var gdPictureImaging = GdPictureHelper.GetImagingInstance())
                {
                    int imageId = gdPictureImaging.CreateGdPictureImageFromByteArray(data);
                    if (gdPictureImaging.GetStat() == GdPictureStatus.OK)
                    {
                        float resolution = System.Math.Max(200, gdPictureImaging.GetVerticalResolution(imageId));
                        var state = pdfInstance.NewPDF(embeddOCRText);

                        if (state == GdPictureStatus.OK)
                        {
                            for (int i = 1; i <= gdPictureImaging.GetPageCount(imageId); i++)
                            {
                                if (gdPictureImaging.SelectPage(imageId, i) == GdPictureStatus.OK)
                                {
                                    var addImageResult = pdfInstance.AddImageFromGdPictureImage(imageId, false, true);
                                }
                            }

                            pdfInstance.OcrPages("*", 0, language, GdPictureHelper.OCRDirectory, "", resolution, 0, true);

                            using (var stream = new MemoryStream())
                            {
                                pdfInstance.SaveToStream(stream);
                                stream.Position = 0;
                                pdf = stream.ToArray();
                            }
                        }
                        else
                        {
                            throw new Exception($"Could not convert document. State: {state}");
                        }
                    }
                    else
                    {
                        throw new Exception("Could not create gdpicture imaging instance");
                    }

                    // Close pdf document
                    pdfInstance?.CloseDocument();

                    // Release gdpicture image
                    gdPictureImaging.ReleaseGdPictureImage(imageId);
                }
            }

            return pdf;
        }
The last two steps (CloseDocument and ReleaseGdPictureImage) can be skipped, as far as I know.

Re: NullReferenceException when doing PDF OCR

Posted: Wed Jan 16, 2019 9:54 pm
by Gabriela
Hello,

The GdPicturePDF.OcrPages() method runs asynchronously; in other words, you have to wait for the OCR process to finish before manipulating the document further. This is clearly documented here:
https://www.gdpicture.com/guides/gdpicture/web ... Pages.html
Setting the Sync parameter to true is a good option here, or you can use the OCR-related events instead:
https://www.gdpicture.com/guides/gdpicture/web ... vents.html
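
For the event-based route, the general idea can be sketched as follows. This is only an illustration under stated assumptions: FakeAsyncOcr and WaitDemo are hypothetical stand-ins written for this post, not GdPicture classes, and the completion event here merely plays the role that an event such as OcrPagesDone plays in the SDK. The sketch shows one way to block until an asynchronous operation signals completion before the instance is disposed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for an engine whose Start() returns immediately
// and which reports completion later through an event.
public class FakeAsyncOcr : IDisposable
{
    public event Action<bool> Done;
    public bool Disposed { get; private set; }

    public void Start()
    {
        Task.Run(() =>
        {
            Thread.Sleep(50);          // simulated OCR work on a background thread
            var handler = Done;
            if (handler != null)
            {
                handler(!Disposed);    // reports failure if disposed too early
            }
        });
    }

    public void Dispose() { Disposed = true; }
}

public static class WaitDemo
{
    // Starts the fake OCR and blocks until its completion event fires,
    // so the instance is disposed only after the work has finished.
    public static bool RunAndWait()
    {
        var completion = new TaskCompletionSource<bool>();
        using (var ocr = new FakeAsyncOcr())
        {
            ocr.Done += success => completion.TrySetResult(success);
            ocr.Start();
            return completion.Task.GetAwaiter().GetResult(); // wait before leaving the using block
        }
    }
}
```

Without the wait, the using block would dispose the instance while the background work is still running, which is the situation described earlier in this thread.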