Please check out the attached files (which are 1bpp when fed to the OCR engine).
The file good.gif contains four clear numbers plus some noise at the bottom. All the numbers are correctly recognized. The noise leads to some garbage, but I don't care (my parser filters that). Here is the output of OCRTesseractDoOCR:
21.047,74
20.416,31
4.287,43
25.335,17
f/\A lf\
The file bad.gif contains the exact same four numbers (pixel by pixel identical, I checked this with Photoshop) but not the noise. Here, the OCR ignores part of the image entirely (the part in front of the dot). Here is the output of OCRTesseractDoOCR:
047,74
416,31
287,43
335,17
I don't get why it does this. Any help would be greatly appreciated.
Here is the relevant code excerpt (using GDPicture.NET 9.3):
imagingApi.OCRTesseractReinit()
imagingApi.OCRTesseractSetOCRContext(OCRContext.OCRContextDocument)
imagingApi.OCRTesseractSetPassCount(3)
Dim test = imagingApi.CreateGdPictureImageFromFile("good.bmp")
Dim s = imagingApi.OCRTesseractDoOCR(test, "fra", "OCR", "")
imagingApi.ReleaseGdPictureImage(test)