Extracting Words and Coordinates using Tesseract

kchidlow · Post by **kchidlow** » Thu Nov 14, 2013 11:01 am

I want to extract words from image documents together with their coordinates. This data is processed later in my workflow program.

The OCRTesseractDoOCR method returns a single string containing all the words but no coordinates.
Using methods such as OCRTesseractGetCharCount and OCRTesseractGetCharLeft I can cycle through the individual characters.

Is there a method to extract the words and their coordinates from an image?
Clearly Tesseract already has logic to determine what a word is, since it must be used to build the string returned by the DoOCR method.

Thanks for any help.

rajalla · Post by **rajalla** » Tue Jun 30, 2015 10:26 pm

Were you able to find a solution? I am looking to achieve similar functionality.

EFernkaes · Post by **EFernkaes** » Sun Aug 30, 2015 3:08 pm

Yes, I'm wondering why there are so few answers about this.

I think the functionality to get words from an OCR engine is very important, too!

Every other OCR engine delivers words, which can then be searched for patterns, keywords, etc.
As far as I understand, the current Tesseract engine does not.

Strangely, the .SaveAsPDFOCR method creates words in the final PDF that can be searched.

Is there no other way than to cycle through the results and split the characters at the detected spaces to create words?
Or are we doing something wrong? Is there a method or a parameter that changes this behavior that we haven't found yet?

Greets

Post by **Loïc** » Mon Aug 31, 2015 6:41 pm

Hello,

The process is quite simple. Just retrieve all recognized characters of the document using the appropriate methods. During your iteration, if the method OCRTesseractGetCharSpaces() returns a value other than 0, you are at the beginning of a new word.
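
The rule above can be sketched in a few lines. This is only an illustration in Python on plain records: the `(character, spaces_before)` tuples and the `split_words` helper are hypothetical stand-ins for the per-character results and make no GdPicture calls.

```python
from typing import List, Tuple

# Each record is (character, spaces_before). Hypothetical sample data that
# stands in for the per-character OCR results; nothing is read from the SDK.
Char = Tuple[str, int]


def split_words(chars: List[Char]) -> List[str]:
    """Group characters into words: a non-zero space count starts a new word."""
    words: List[str] = []
    current = ""
    for ch, spaces_before in chars:
        if spaces_before != 0 and current:
            # The character is preceded by white space: close the previous word.
            words.append(current)
            current = ""
        current += ch
    if current:
        words.append(current)
    return words


sample = [("H", 0), ("i", 0), ("t", 1), ("h", 0), ("e", 0), ("r", 0), ("e", 0)]
print(split_words(sample))
```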

Let me know if you need further information.

Cheers,

Loïc

EFernkaes · Post by **EFernkaes** » Fri Sep 04, 2015 8:40 am

Hi Loïc,

thank you for your reply.
Sure, you're right, on the one hand it is easy, and I understand that the Tesseract engine comes from a Google project and is not one of your major programming tasks. But you are the ones providing an SDK with it. So if everyone has to do this "simple programming", why not offer it as a simpler method or class in the next update, like ".OCRTesseractGetWordCount"?

Here is my little bit of code:
Before GdPicture, I used to store the OCR results in an array of this simple structure:

Code:

    Public Structure OCRDataStruct
        Public Coord As RectangleF
        Public Text As String
        Public Confidence As Double
    End Structure

since I, and everybody else, want to know the coordinates of the bounding box of a certain word.
And there is the first problem: how to get the bounding box?

So many months ago I wrote this little helper routine to get the words from the characters with spaces, as you described in your post, and also to compute the bounding box from the existing data:

Code:

        ' Build the word list with coordinates from the per-character OCR results
        Dim wordList As New List(Of OCRDataStruct), word As String = "", newWord As OCRDataStruct
        Dim minTop As Long, maxBottom As Long, maxRight As Long
        For i = 1 To tOCRGdPictureImaging.OCRTesseractGetCharCount
            If i = 1 Then
                ' First character: start the first word
                newWord.Text = ""
                newWord.Coord = New RectangleF(tOCRGdPictureImaging.OCRTesseractGetCharLeft(i), tOCRGdPictureImaging.OCRTesseractGetCharTop(i), 0, 0)
                minTop = tOCRGdPictureImaging.OCRTesseractGetCharTop(i)
            Else
                ' A non-zero space count marks the first character of a new word
                If tOCRGdPictureImaging.OCRTesseractGetCharSpaces(i) <> 0 Then
                    ' Close the previous word; its box spans all of its characters
                    newWord.Text = word
                    newWord.Coord = New RectangleF(newWord.Coord.Left, minTop, maxRight - newWord.Coord.Left, maxBottom - minTop)
                    wordList.Add(newWord)
                    ' Start the next word at the current character
                    newWord.Text = ""
                    newWord.Coord = New RectangleF(tOCRGdPictureImaging.OCRTesseractGetCharLeft(i), tOCRGdPictureImaging.OCRTesseractGetCharTop(i), 0, 0)
                    word = ""
                    minTop = tOCRGdPictureImaging.OCRTesseractGetCharTop(i)
                    maxBottom = 0
                    maxRight = 0
                End If
            End If
            ' Append the character and grow the box of the current word
            word += ChrW(tOCRGdPictureImaging.OCRTesseractGetCharCode(i))
            minTop = Math.Min(minTop, tOCRGdPictureImaging.OCRTesseractGetCharTop(i))
            maxBottom = Math.Max(maxBottom, tOCRGdPictureImaging.OCRTesseractGetCharBottom(i))
            maxRight = Math.Max(maxRight, tOCRGdPictureImaging.OCRTesseractGetCharRight(i))
        Next
        ' Close the last word
        newWord.Text = word
        newWord.Coord = New RectangleF(newWord.Coord.Left, minTop, maxRight - newWord.Coord.Left, maxBottom - minTop)
        wordList.Add(newWord)

This works quite well, at least for me, and produces output similar to that of other OCR engines we tried and used in the past (e.g. Pegasus, FineReader). I know a little more programming is necessary if you want to provide a "For Each" feature, but most of this code should meet the requirements.
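
For readers who want to try the bounding-box idea without the SDK, here is a rough Python sketch of the same technique: the word box is the union of the boxes of its characters. The `CharBox` and `WordBox` records, the helper name and the pixel values are hypothetical; no GdPicture method is called.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharBox:
    """One recognized character (hypothetical stand-in for the SDK results)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int
    spaces_before: int


@dataclass
class WordBox:
    text: str
    left: int
    top: int
    right: int
    bottom: int


def build_word_boxes(chars: List[CharBox]) -> List[WordBox]:
    """Merge characters into words; each word box is the union of its character boxes."""
    words: List[WordBox] = []
    for c in chars:
        if not words or c.spaces_before != 0:
            # First character overall, or preceded by white space: new word.
            words.append(WordBox(c.text, c.left, c.top, c.right, c.bottom))
        else:
            w = words[-1]
            w.text += c.text
            w.left = min(w.left, c.left)
            w.top = min(w.top, c.top)
            w.right = max(w.right, c.right)
            w.bottom = max(w.bottom, c.bottom)
    return words


sample = [
    CharBox("a", 10, 20, 18, 30, 0),
    CharBox("t", 19, 12, 24, 30, 0),  # taller than the "a" before it
    CharBox("b", 30, 10, 38, 30, 1),
]
for word in build_word_boxes(sample):
    print(word)
```

The minimum top is tracked along with the maximum right and bottom, so a tall character later in the word (the "t" in "at") is not clipped.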

BUT:
I see a disadvantage in the missing confidence of the words. I cannot tell, since I have only skimmed the Tesseract article, where the Tesseract engine gets its confidence for a certain character.
For example, take the simple word "Look". The uppercase "L" and the lowercase "k" are characters that are recognized quite easily.
But the two lowercase "o"s can also be interpreted as two small zeros "0". So at the character level there is a real chance of deciding for zeros instead of "o"s, with a confidence of e.g. 60:40, but "L00k" would get a much lower confidence than "Look" if a dictionary were used on a word basis instead of single-character recognition.
I thought there was a dictionary that is used for OCR recognition on a word basis. And if so, why not deliver these results, too?
By the way, if a word is split because it is too long for the rest of the line, e.g.
"........ swinging his long-
sword over his......." then splitting at white space will not do the job. The only solution is a dictionary for these cases.
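
As a rough illustration of the dictionary idea for line-end hyphens, the following Python sketch rejoins a word only when the joined form is in a word list. The `join_hyphenated` helper and the tiny vocabularies are hypothetical; a real application would need a proper dictionary and already-separated lines.

```python
from typing import List, Set


def join_hyphenated(lines: List[List[str]], vocabulary: Set[str]) -> List[str]:
    """Flatten lines of words, rejoining a word that was split by a line-end hyphen."""
    words: List[str] = []
    for line in lines:
        for idx, word in enumerate(line):
            prev = words[-1] if words else ""
            # Only the first word of a line can continue the last word of the previous line.
            if idx == 0 and len(prev) > 1 and prev.endswith("-"):
                head = prev[:-1]
                if (head + word).lower() in vocabulary:
                    words[-1] = head + word  # hyphen was only a line-break hyphen
                    continue
                if (prev + word).lower() in vocabulary:
                    words[-1] = prev + word  # the word really contains a hyphen
                    continue
            words.append(word)
    return words


lines = [["swinging", "his", "long-"], ["sword", "over", "his"]]
print(join_hyphenated(lines, {"longsword"}))
print(join_hyphenated(lines, {"long-sword"}))
```

If neither joined form is in the vocabulary, both fragments are kept as separate words, so nothing is merged on a guess.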

Every professional programmer who is not only trying to make searchable PDF files will need this functionality, because decisions are made on a word basis, e.g. whether keywords are found at defined positions or not.

Thank you very much for your patience and for the update for the PDF and MRC-jpgs.

EF

win568 · Post by **win568** » Wed Sep 09, 2015 8:10 am

Hello

We are also not satisfied with the missing ability to get words. In our application we use OCR for paperless booking. We scan the OCR result for invoice numbers, dates, amounts and so on to book the invoice automatically. For this we need a good OCR result, or the ability to find out the confidence of the words so that we can show the user that automatic booking is not possible.

It would be a nice improvement if the Tesseract API returned words, coordinates and the confidence.

Best Regards
Roland

Post by **Loïc** » Thu Sep 10, 2015 2:12 pm

Hi,

EFernkaes wrote: "I see the disadvantage in the missing confidence of the words. I cannot tell - since I've read the Tesseract article only on the fly - where the Tesseract engine gets its confidence for a certain char."

The confidence of the engine is word based. Just get the confidence of the first char to get the confidence of the word. You will see that all chars in the word have the same confidence.

@Roland please check the snippet from EFernkaes; it shows a simple way to extract words. I don't know what more I can say.
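
A small Python sketch of that rule, on hypothetical `(character, spaces_before, confidence)` records rather than SDK calls: the word takes the confidence of its first character, and words below a threshold are flagged for manual review. The sample values, the threshold and the assumption that a higher value means more confidence are illustrative only.

```python
from typing import List, Tuple

# (character, spaces_before, confidence); hypothetical sample records.
CharConf = Tuple[str, int, float]


def word_confidences(chars: List[CharConf]) -> List[Tuple[str, float]]:
    """Return (word, confidence) pairs, using the first character's confidence."""
    words: List[List] = []
    for ch, spaces_before, confidence in chars:
        if not words or spaces_before != 0:
            words.append([ch, confidence])  # first character sets the word confidence
        else:
            words[-1][0] += ch
    return [(text, confidence) for text, confidence in words]


def needs_review(words: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Words whose confidence is below the (application-specific) threshold."""
    return [text for text, confidence in words if confidence < threshold]


sample = [("L", 0, 62.0), ("o", 0, 62.0), ("o", 0, 62.0), ("k", 0, 62.0),
          ("h", 1, 95.0), ("e", 0, 95.0), ("r", 0, 95.0), ("e", 0, 95.0)]
words = word_confidences(sample)
print(words)
print(needs_review(words, threshold=80.0))
```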

Cheers,

Loïc

Post by **Loïc** » Thu May 12, 2016 3:27 pm

Hi,

We have added a new method to simplify word association. It is: OCRTesseractGetCharWord()

Cheers,

Loïc

tfierens2 · Post by **tfierens2** » Tue Jan 09, 2018 1:30 pm

While the function provided by @EFernkaes is useful, it doesn't appear to be accurate, as it does not handle "end of line". Unless I'm missing something, the last word of a line is always appended to the first word of the next line. I assume this is because he/she is using the 'OCRTesseractGetCharSpaces' function, which finds spaces but not "end of line".

I've looked for a function to detect this but I cannot find one anywhere.

This is quite obvious when looking at the overall results returned from 'OCRTesseractDoOCR', splitting them into lines and removing the empty ones:

Lines:

"' 5th August 2017 Statement No. 66"
"Company plc"

Words:
'
5th
2017
Statement
No.
66Company

As you can see, the 66Company is a problem, as Company is actually located on line 2.

Any suggestions?

I'll try to build the word list differently using the method Loïc mentioned, but it would be handy to be able to use something like this:

if (OCRTesseractGetCharSpaces(i) > 0 || OCRTesseractEndOfLine(i))
...
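
Since the thread does not confirm that an end-of-line method exists, one possible workaround is to infer line breaks from the character coordinates. The Python sketch below is a heuristic only, assuming left-to-right text: a character starts a new line when it lies to the left of the previous character and its vertical centre is below the previous character's bottom. The `Char` record, the `row` helper and the pixel values are hypothetical sample data, not SDK output.

```python
from typing import List, NamedTuple, Optional


class Char(NamedTuple):
    text: str
    left: int
    top: int
    right: int
    bottom: int
    spaces_before: int


def starts_new_line(prev: Char, cur: Char) -> bool:
    """Heuristic: x jumps back to the left and the character sits below the previous one."""
    centre = (cur.top + cur.bottom) / 2
    return cur.left < prev.left and centre > prev.bottom


def split_words_with_lines(chars: List[Char]) -> List[str]:
    """Start a new word on white space or on an inferred line break."""
    words: List[str] = []
    prev: Optional[Char] = None
    for cur in chars:
        if prev is None or cur.spaces_before != 0 or starts_new_line(prev, cur):
            words.append(cur.text)
        else:
            words[-1] += cur.text
        prev = cur
    return words


def row(text: str, left: int, top: int, width: int = 8, height: int = 12) -> List[Char]:
    """Lay out sample characters side by side, all with a space count of 0."""
    return [Char(ch, left + i * width, top, left + (i + 1) * width - 1, top + height, 0)
            for i, ch in enumerate(text)]


# "66" ends line 1; "Company" starts line 2 without any reported space.
sample = row("66", 300, 10) + row("Company", 20, 40)
print(split_words_with_lines(sample))
```

Skewed scans, multi-column layouts or right-to-left text would need a more careful test than this.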

Thanks.

Gabriela · Post by **Gabriela** » Tue Jan 29, 2019 2:33 pm

Hello,

The new class GdPictureOCR is now available for such purposes:
https://www.gdpicture.com/guides/gdpicture/we ... reOCR.html
