Extracting Words and Coordinates using Tesseract
I want to extract words from image documents together with their coordinates. This data is processed later in my workflow program.
The OCRTesseractDoOCR method returns a single string including all words but no coordinates.
Using methods such as OCRTesseractGetCharCount and OCRTesseractGetCharLeft, I can cycle through the individual characters.
Is there a method to extract the words and coordinates from an image?
Clearly Tesseract applies some logic to determine what a word is, as this must be used when building the string returned by the DoOCR method.
Thanks for any help.
Re: Extracting Words and Coordinates using Tesseract
Were you able to find a solution? I am looking to achieve similar functionality.
Re: Extracting Words and Coordinates using Tesseract
Yes, I'm wondering why there are so few answers about this.
I think the functionality to get words from an OCR engine is very important, too!
Every other OCR engine delivers words, which can then be searched for patterns or keywords etc.
Not the current Tesseract engine, as I understand it.
Strangely, the .SaveAsPDFOCR method creates words in the final PDF that can be searched.
Is there no other way than to cycle through the results and split the chars at the found spaces to create words?
Or are we doing something wrong? Have we just not found a method or a parameter that changes this behavior yet?
Greets
Re: Extracting Words and Coordinates using Tesseract
Hello,
The process is quite simple. Just retrieve all recognized characters of the document using the appropriate method. During your iteration, if the method OCRTesseractGetCharSpaces() returns a value other than 0, you are at the beginning of a new word.
Let me know if you need further information.
Cheers,
Loïc
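The rule described in this reply can be sketched independently of the SDK. The following is a minimal Python sketch, not GdPicture code: the `Char` record and its `spaces` field are hypothetical stand-ins for the values that OCRTesseractGetCharCode and OCRTesseractGetCharSpaces would return for each character.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str    # the recognized character (hypothetical field)
    spaces: int  # number of spaces preceding it (hypothetical field)


def group_words(chars):
    """Start a new word whenever a character is preceded by one or more spaces."""
    words = []
    current = ""
    for ch in chars:
        if ch.spaces > 0 and current:
            words.append(current)
            current = ""
        current += ch.text
    if current:
        words.append(current)
    return words


chars = [Char("N", 0), Char("o", 0), Char(".", 0), Char("6", 1), Char("6", 0)]
print(group_words(chars))  # ['No.', '66']
```

Note that this rule only looks at spaces, so it treats a line break without a leading space as part of the same word.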
Re: Extracting Words and Coordinates using Tesseract
Hi Loïc,
thank you for your reply.
Sure, you're right: on the one hand it is easy, and I understand that the Tesseract engine comes from a Google project and is not one of your major programming tasks. But you are the ones providing an SDK with it. So if everyone has to do this "simple programming", why not offer it as a simpler method/class in the next update, like ".OCRTesseractGetWordCount"?
Here is my little bit of programming:
Before GdPicture, I used to store the OCR results in an array of this simple structure, since I, like everybody else, want to know the coordinates of the bounding box of a certain word.
Code: Select all
Public Structure OCRDataStruct
    Public Coord As RectangleF
    Public Text As String
    Public Confidence As Double
End Structure
And there is the first problem: how to get the bounding box?
So many months ago, I wrote this little helper routine to get the words from the chars with spaces, as you described in your post, and also to get the bounding box from the existing data:
Code: Select all
' Build WordList with Coordinates
Dim wordList As New List(Of OCRDataStruct), word As String = "", newWord As OCRDataStruct
Dim maxBottom As Long, maxRight As Long
For i = 1 To tOCRGdPictureImaging.OCRTesseractGetCharCount
    If i = 1 Then
        ' First character: the first word starts at its top-left corner
        newWord.Text = ""
        newWord.Coord = New RectangleF(tOCRGdPictureImaging.OCRTesseractGetCharLeft(i), tOCRGdPictureImaging.OCRTesseractGetCharTop(i), 0, 0)
    Else
        ' A non-zero space count marks the beginning of a new word
        If tOCRGdPictureImaging.OCRTesseractGetCharSpaces(i) > 0 Then
            ' Close the current word and compute its width and height
            newWord.Text = word
            newWord.Coord = New RectangleF(newWord.Coord.Left, newWord.Coord.Top, maxRight - newWord.Coord.Left, maxBottom - newWord.Coord.Top)
            wordList.Add(newWord)
            ' Start the next word at the current character
            newWord.Text = ""
            newWord.Coord = New RectangleF(tOCRGdPictureImaging.OCRTesseractGetCharLeft(i), tOCRGdPictureImaging.OCRTesseractGetCharTop(i), 0, 0)
            word = ""
            maxBottom = 0
            maxRight = 0
        End If
    End If
    ' Append the character and extend the right and bottom edges
    word += ChrW(tOCRGdPictureImaging.OCRTesseractGetCharCode(i))
    maxBottom = Math.Max(maxBottom, tOCRGdPictureImaging.OCRTesseractGetCharBottom(i))
    maxRight = Math.Max(maxRight, tOCRGdPictureImaging.OCRTesseractGetCharRight(i))
Next
' Store the last word
newWord.Text = word
newWord.Coord = New RectangleF(newWord.Coord.Left, newWord.Coord.Top, maxRight - newWord.Coord.Left, maxBottom - newWord.Coord.Top)
wordList.Add(newWord)
This works quite well, at least for me, and produces output similar to other OCR engines we have tried and used in the past (e.g. Pegasus, FineReader). I know a little more programming is necessary if you want to provide the "for....each" feature, but most of this code should meet the requirements.
BUT:
I see a disadvantage in the missing confidence of the words. I cannot tell, since I have only skimmed the Tesseract article, where the Tesseract engine gets its confidence for a certain char.
For example: the simple word "Look". The uppercase "L" and the lowercase "k" are chars that are recognized quite easily.
But the double lowercase "o" can also be interpreted as two small zeros "0". So there is a real chance of deciding that these are zeros instead of the letter "o", with a confidence of e.g. 60:40, but the confidence for "L00k" instead of "Look" is much lower if a dictionary is used on a word basis instead of single-character recognition.
I thought there is a dictionary that is used for OCR recognition on a word basis. And if so, why not deliver these results, too?
By the way, if a word is split because it is too long for the rest of the line, e.g.
"........ swinging his long-
sword over his......."
splitting on white space will not do the job. The only solution for these cases is a dictionary.
Every professional programmer who is not only trying to make searchable PDF files will need this functionality, because decisions on whether keywords are found at defined positions are made on a word basis.
Thank you very much for your patience and for the update for the PDF and MRC-jpgs.
EF
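On the bounding-box question raised in this post: one common approach is to take the union of the character boxes, tracking the minimum left and top as well as the maximum right and bottom, so that a character taller than the first one still fits inside the word box. The following is a minimal Python sketch under that assumption, not GdPicture code. The `CharBox` record and its fields are hypothetical stand-ins for the per-character values returned by the OCRTesseractGetChar* methods, and a top-left coordinate origin is assumed.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    text: str
    left: int
    top: int
    right: int
    bottom: int
    spaces: int  # spaces preceding the character (hypothetical field)


def words_with_boxes(chars):
    """Return (text, left, top, right, bottom) per word as the union of its character boxes."""
    words = []
    current = None  # [text, left, top, right, bottom] of the word being built
    for ch in chars:
        if current is None or ch.spaces > 0:
            if current is not None:
                words.append(tuple(current))
            current = [ch.text, ch.left, ch.top, ch.right, ch.bottom]
        else:
            current[0] += ch.text
            current[1] = min(current[1], ch.left)
            current[2] = min(current[2], ch.top)
            current[3] = max(current[3], ch.right)
            current[4] = max(current[4], ch.bottom)
    if current is not None:
        words.append(tuple(current))
    return words


chars = [
    CharBox("L", 10, 5, 18, 20, 0),
    CharBox("o", 19, 10, 26, 20, 0),
    CharBox("o", 27, 10, 34, 20, 0),
    CharBox("k", 35, 4, 43, 20, 0),   # taller than the first character
    CharBox("u", 53, 10, 60, 20, 1),
    CharBox("p", 61, 10, 68, 25, 0),  # descender extends the box downward
]
for word in words_with_boxes(chars):
    print(word)
# ('Look', 10, 4, 43, 20)
# ('up', 53, 10, 68, 25)
```

Width and height follow as `right - left` and `bottom - top`.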
Re: Extracting Words and Coordinates using Tesseract
Hello
We are also not satisfied with the missing ability to get words. In our application we use OCR for paperless booking. We scan the OCR result for invoice numbers, dates, amounts and so on to book the invoice automatically. For this we need a good OCR result, or the ability to find out the confidence of the words, so that we can show the user that automatic booking is not possible.
It would be a nice improvement if the Tesseract API returned words, coordinates and the confidence.
Best Regards
Roland
Re: Extracting Words and Coordinates using Tesseract
Hi,
Regarding "I see the disadvantage in the missing confidence of the words": the confidence of the engine is word based. Just get the confidence of the first char to get the confidence of the word. You will see that all chars in the word have the same confidence.
@Roland please check the snippet by EFernkaes; it shows a simple way to extract words. I don't know what more I can say.
Cheers,
Loïc
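Taking the first character's confidence as the word's confidence can be sketched as follows. This is a minimal Python sketch, not GdPicture code: the `Char` record, its `confidence` field, the confidence scale and the review threshold are all hypothetical placeholders for illustration.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    spaces: int        # spaces preceding the character (hypothetical field)
    confidence: float  # per-character confidence (hypothetical field and scale)


def words_with_confidence(chars):
    """Group characters into words; each word takes the confidence of its first character."""
    words = []
    for ch in chars:
        if not words or ch.spaces > 0:
            words.append([ch.text, ch.confidence])
        else:
            words[-1][0] += ch.text
    return [(text, conf) for text, conf in words]


def needs_review(words, threshold):
    """Return the words whose confidence falls below the threshold."""
    return [text for text, conf in words if conf < threshold]


chars = [
    Char("L", 0, 0.62), Char("o", 0, 0.62), Char("o", 0, 0.62), Char("k", 0, 0.62),
    Char("6", 1, 0.95), Char("6", 0, 0.95),
]
words = words_with_confidence(chars)
print(words)                     # [('Look', 0.62), ('66', 0.95)]
print(needs_review(words, 0.8))  # ['Look']
```

A list like the one returned by `needs_review` could be used to decide when to ask the user for confirmation instead of processing a document automatically.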
Re: Extracting Words and Coordinates using Tesseract
Hi,
We have added a new method to simplify word association. It is: OCRTesseractGetCharWord()
Cheers,
Loïc
Re: Extracting Words and Coordinates using Tesseract
While the function provided by @EFernkaes is useful, it doesn't appear to be accurate, as it does not handle "end of line". Unless I'm missing something, the last word of a line is always appended to the first word of the next line. I assume this is because he/she is using the 'OCRTesseractGetCharSpaces' function, which finds spaces but not "end of line".
I've looked for a function to detect this but cannot find one anywhere.
This is quite obvious when looking at the overall results returned from 'OCRTesseractDoOCR' when splitting each line and removing empty ones using something similar to this:
Line:
"' 5th August 2017 Statement No. 66"
"Company plc"
Words:
'
5th
2017
Statement
No.
66Company
As you can see the 66Company is a problem as Company is actually located on line 2.
Any suggestions?
I'll try to build the word list differently using the method Loic mentioned but it would be handy to be able to use something like this:
if (OCRTesseractGetCharSpaces(i) > 0 || OCRTesseractEndOfLine(i))
...
Thanks.
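One possible workaround for the missing end-of-line signal is to infer line breaks from the character coordinates: in left-to-right text, a character whose left edge is not to the right of the previous character's left edge has most likely wrapped to a new line. The following is a minimal Python sketch of that heuristic, not GdPicture code. The `Char` record and its fields are hypothetical stand-ins for the per-character values, and the rule assumes simple single-column, left-to-right text.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    left: int    # left edge of the character box (hypothetical field)
    spaces: int  # spaces preceding the character (hypothetical field)


def group_words(chars):
    """Split on spaces and on inferred line breaks (left edge jumps back)."""
    words = []
    prev = None
    for ch in chars:
        new_line = prev is not None and ch.left <= prev.left
        if prev is None or ch.spaces > 0 or new_line:
            words.append(ch.text)
        else:
            words[-1] += ch.text
        prev = ch
    return words


chars = [
    Char("N", 250, 0), Char("o", 259, 0), Char(".", 268, 0),
    Char("6", 300, 1), Char("6", 309, 0),
    Char("C", 10, 0), Char("o", 21, 0),  # next line, no leading space reported
]
print(group_words(chars))  # ['No.', '66', 'Co']
```

With the space rule alone, the last two words would come out as the single word "66Co", which matches the "66Company" symptom described in this post.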
Re: Extracting Words and Coordinates using Tesseract
Hello,
The new class GdPictureOCR is now available for such purposes:
https://www.gdpicture.com/guides/gdpicture/we ... reOCR.html