OCR on TIF converted from PDF

azeeth · Post by **azeeth** » Tue May 28, 2019 8:09 am

Hi,

We have a requirement to extract text from a scanned PDF document. We are doing English-language OCR. We have been able to use GdPicture to do that, but a lot of the extracted text is not correct.
We thought we might get better results if we converted the PDF to TIF first and then ran OCR on it. The results were a little better than before, but there were still a lot of inaccuracies in the text.

Then we tried converting the PDF to TIF using a separate product called 2Tiff. When we ran GdPicture OCR on that TIF, the results were much better and more accurate.
I have attached the original TIF files and their results.

Could you please tell us what 2Tiff does during conversion that GdPicture does not, given that the same GdPicture OCR engine gives worse results on the TIF produced by GdPicture? Is there a way to improve the TIF conversion from PDF?

Example files
https://drive.google.com/file/d/1mNfOCZ ... sp=sharing

Thanks
Ajit

Gabriela · Post by **Gabriela** » Thu May 30, 2019 2:45 pm

Hello,

May I ask you to provide us with the exact code snippet you are using for OCR so we can replicate your issues? We do not know what 2Tiff is doing. In order to support you on the GdPicture.NET toolkit, we need to reproduce your issues using the current release. Then we can investigate them further.
Thank you for your understanding; we are waiting for the code and the exact steps to replicate the issue.

azeeth · Post by **azeeth** » Fri May 31, 2019 2:40 am

Hi, below is the function that runs OCR on a TIF file and writes the extracted text to a text file.

Code:

Private Function ConvertTifToOCR(TifFilename As String, textFilename As String) As Boolean
        Dim inputTifObj As GdPictureImaging = New GdPictureImaging()
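        ' Default to 1 so a single-page TIF, which skips the multipage branch below, is still processed.
        Dim pageCount As Integer = 1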
        Dim imageID As Integer = inputTifObj.CreateGdPictureImageFromFile(TifFilename)
        If inputTifObj.GetStat() = GdPictureStatus.OK Then
            If inputTifObj.TiffIsMultiPage(imageID) Then
                pageCount = inputTifObj.TiffGetPageCount(imageID)
            End If

            Dim ocrObj As GdPictureOCR = New GdPictureOCR()
            ocrObj.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
            ocrObj.CharacterSet = ""
            ocrObj.AddLanguage(OCRLanguage.English)
            Dim resID As String = "page"
            Dim content As String = Nothing
            Dim stream As System.IO.StreamWriter = New System.IO.StreamWriter(textFilename)
            For i As Integer = 1 To pageCount
                inputTifObj.TiffSelectPage(imageID, i)
                If ocrObj.SetImage(imageID) = GdPictureStatus.OK Then
                    ocrObj.OCRMode = OCRMode.FavorAccuracy
                    ocrObj.RunOCR(resID)
                    If ocrObj.GetStat() = GdPictureStatus.OK Then
                        content = ocrObj.GetOCRResultText(resID)
                        If ocrObj.GetStat() = GdPictureStatus.OK Then
                            stream.WriteLine(content & vbFormFeed & vbCrLf)
                        End If
                    Else
                        MessageBox.Show("The Ocr didn't process. Error: " + ocrObj.GetStat().ToString())
                    End If
                Else
                    MessageBox.Show("The image can't be set. Error: " + ocrObj.GetStat().ToString())
                End If
                ocrObj.ReleaseOCRResult(resID)
            Next
            stream.Close()
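            inputTifObj.ReleaseGdPictureImage(imageID)
            ocrObj.Dispose()
            ' Dispose the imaging object on the success path too; otherwise it is only disposed on failure.
            inputTifObj.Dispose()
            MessageBox.Show("Tif file processed through OCR")

            Return True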
        Else
            MessageBox.Show("The Tif file can't be opened. Error: " + inputTifObj.GetStat().ToString())
        End If
        inputTifObj.Dispose()

        Return False
    End Function

Gabriela · Post by **Gabriela** » Mon Jun 03, 2019 3:03 pm

Hello,

I would like to explain some more details about OCR. From what I see, you saved the scanned pages in a PDF document. The GdPictureOCR class needs the scanned image, so I would recommend scanning directly to TIFF. Next, you need to scan at an appropriate DPI so that the scanned page is readable. You can also improve the precision of the OCRed text by using another set of language files; for further details read here:
https://github.com/tesseract-ocr/tesser ... Data-Files
There are different language files for fast OCR and accurate OCR. Finally, the OCRed text will be more accurate when you run OCR on regions rather than on whole pages. I hope this helps.

azeeth · Post by **azeeth** » Wed Jun 05, 2019 3:43 am

We get PDFs from third-party sources that need to be OCR'd, so scanning directly to TIF is out of the question.
Running GdPicture OCR on the PDFs produced the worst results in terms of text accuracy.
Running GdPicture OCR on a TIF converted from PDF using GdPicture produced better results in terms of text accuracy.
Running GdPicture OCR on a TIF converted from PDF using 2Tiff produced the best results in terms of text accuracy.

We are definitely using the accurate OCR trained files.

Gabriela · Post by **Gabriela** » Wed Jun 05, 2019 11:48 am

Hello,

Here is an interesting source that can be useful:
https://github.com/tesseract-ocr/tesser ... oveQuality

Thank you also for creating a support ticket.

Gabriela · Post by **Gabriela** » Thu Jun 06, 2019 11:57 am

Hi,

We have finally figured out that the source PDF has internal page rotation. After resolving this with the NormalizePage() method, the OCR results are excellent and there is no need to convert to TIFF.
Hopefully this helps others as well.
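
For anyone landing on this thread, here is a minimal sketch of how that fix might be wired up. Only NormalizePage() is named above; everything else is an assumption to check against the GdPicture.NET reference for your version: the GdPicture14 namespace, the GdPicturePDF calls (LoadFromFile, GetPageCount, SelectPage, RenderPageToGdPictureImageEx, CloseDocument), the 300 DPI rendering resolution, and the hypothetical function name OcrNormalizedPdf. The OCR calls mirror the function posted earlier in this thread.

```vbnet
Imports System.IO
Imports GdPicture14 ' assumed namespace for GdPicture.NET 14

Module NormalizeBeforeOcrSketch
    ' Hypothetical helper: normalizes each page's internal rotation, renders the page
    ' to an image, and runs OCR on that image. Not the vendor's reference code.
    Public Function OcrNormalizedPdf(pdfFilename As String, textFilename As String) As Boolean
        Dim pdfObj As GdPicturePDF = New GdPicturePDF()
        If pdfObj.LoadFromFile(pdfFilename, False) <> GdPictureStatus.OK Then
            pdfObj.Dispose()
            Return False
        End If

        Dim imagingObj As GdPictureImaging = New GdPictureImaging()
        Dim ocrObj As GdPictureOCR = New GdPictureOCR()
        ocrObj.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
        ocrObj.AddLanguage(OCRLanguage.English)
        ocrObj.OCRMode = OCRMode.FavorAccuracy

        Dim resID As String = "page"
        Dim allPagesOk As Boolean = True
        Using writer As New StreamWriter(textFilename)
            For pageNo As Integer = 1 To pdfObj.GetPageCount()
                pdfObj.SelectPage(pageNo)
                ' The step named in the post above: reset the page's internal rotation.
                pdfObj.NormalizePage()
                ' 300 DPI is an assumed value, not one given in this thread.
                Dim imageID As Integer = pdfObj.RenderPageToGdPictureImageEx(300.0F, True)
                If pdfObj.GetStat() <> GdPictureStatus.OK Then
                    allPagesOk = False
                    Continue For
                End If
                If ocrObj.SetImage(imageID) = GdPictureStatus.OK Then
                    ocrObj.RunOCR(resID)
                    If ocrObj.GetStat() = GdPictureStatus.OK Then
                        writer.WriteLine(ocrObj.GetOCRResultText(resID) & vbFormFeed & vbCrLf)
                    Else
                        allPagesOk = False
                    End If
                    ocrObj.ReleaseOCRResult(resID)
                Else
                    allPagesOk = False
                End If
                imagingObj.ReleaseGdPictureImage(imageID)
            Next
        End Using

        pdfObj.CloseDocument()
        ocrObj.Dispose()
        imagingObj.Dispose()
        pdfObj.Dispose()
        Return allPagesOk
    End Function
End Module
```

A call such as OcrNormalizedPdf("input.pdf", "output.txt") (both file names are placeholders) would replace the PDF to TIF step and the ConvertTifToOCR call. The function returns False if any page failed to render or OCR, so a caller can tell a partial result from a complete one.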
