Bug in OCR engine?

Posted: Sat Jan 19, 2013 12:28 pm
by acl
I am observing some very strange behaviour when OCRing a fairly clean region of a 400 dpi scanned page.

Please check out the attached files (which are 1bpp when fed to the OCR engine).

The file good.gif contains four clear numbers plus some noise at the bottom. All the numbers are correctly recognized. The noise leads to some garbage, but I don't care (my parser filters that). Here is the output of OCRTesseractDoOCR:
21.047,74
20.416,31
4.287,43
25.335,17
f/\A lf\
The file bad.gif contains the exact same four numbers (pixel-for-pixel identical, I checked this with Photoshop) but not the noise. Here, the OCR ignores part of the image entirely: in each number, everything up to and including the dot is missing. Here is the output of OCRTesseractDoOCR:
047,74
416,31
287,43
335,17
I don't get why it does this. Any help would be greatly appreciated.


Here is the relevant code excerpt (using GDPicture.NET 9.3):

        imagingApi.OCRTesseractReinit()
        imagingApi.OCRTesseractSetOCRContext(OCRContext.OCRContextDocument)
        imagingApi.OCRTesseractSetPassCount(3)
        Dim test = imagingApi.CreateGdPictureImageFromFile("good.bmp")
        Dim s = imagingApi.OCRTesseractDoOCR(test, "fra", "OCR", "") 
        imagingApi.ReleaseGdPictureImage(test)
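
For reference, the parser-side filtering mentioned above could look roughly like the following. This is a minimal sketch in Python, not the poster's actual parser: the function name keep_amounts and the assumed number format (dot as thousands separator, comma before two decimals, as in the output shown) are illustrative assumptions. The pattern itself should carry over to other regex engines with little change.

```python
import re

# Assumed format: amounts such as 21.047,74
# (dot as thousands separator, comma before exactly two decimals).
AMOUNT = re.compile(r"\d{1,3}(?:\.\d{3})*,\d{2}")


def keep_amounts(ocr_text):
    """Return only the lines of OCR output that look like amounts."""
    lines = (line.strip() for line in ocr_text.splitlines())
    return [line for line in lines if AMOUNT.fullmatch(line)]


# Output reported for good.gif, including the garbage line from the noise.
good_output = "21.047,74\n20.416,31\n4.287,43\n25.335,17\nf/\\A lf\\"
print(keep_amounts(good_output))
```

Note that a filter of this kind cannot detect the problem seen with bad.gif: a truncated value such as 047,74 still looks like a valid amount, so it passes the filter unchanged.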

Re: Bug in OCR engine?

Posted: Tue Jan 22, 2013 2:20 pm
by Cedric
Hello,

I strongly suggest you open a ticket on our support platform, as this issue needs investigation.

Thanks!