OCR bug - Same image area is included in 2 different chars

Slava
Posts: 66
Joined: Fri Jun 22, 2007 4:43 pm


Post by Slava » Wed Sep 09, 2009 4:03 pm

Hi,

I am facing a problem with character coordinates. In the attached image you can see that the upper dot of the ':' character is included in two different characters / words / lines. This becomes a problem when reading the bounding rectangle of a whole word (e.g. 'Factuur'): in this case the word's rectangle overlaps another line. A template-based document recognition system fails when this happens.
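To illustrate the effect, here is a minimal Python sketch, not GdPicture code. It assumes a word's rectangle is the union of its character rectangles; the `Rect` type, the helper names, and all coordinates are hypothetical and do not come from the GdPicture API or from the attached image.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box in pixel coordinates (hypothetical type, y grows downward)."""
    left: int
    top: int
    right: int
    bottom: int


def union(rects: Iterable[Rect]) -> Rect:
    """Smallest rectangle that contains every rectangle in rects."""
    items = list(rects)
    if not items:
        raise ValueError("at least one rectangle is required")
    return Rect(
        min(r.left for r in items),
        min(r.top for r in items),
        max(r.right for r in items),
        max(r.bottom for r in items),
    )


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share any interior area."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


# Made-up character boxes for a word on the upper line. The last box
# reaches down to y=38 because it swallowed a dot from the line below.
upper_word_bad = [Rect(10, 10, 20, 30), Rect(22, 10, 32, 30), Rect(34, 10, 40, 38)]
# The same word with a correctly segmented last character.
upper_word_ok = [Rect(10, 10, 20, 30), Rect(22, 10, 32, 30), Rect(34, 10, 40, 30)]
# Made-up character boxes for a word on the lower line.
lower_word = [Rect(30, 34, 50, 54), Rect(52, 34, 56, 54)]

bad_overlap = overlaps(union(upper_word_bad), union(lower_word))
ok_overlap = overlaps(union(upper_word_ok), union(lower_word))
```

With these made-up numbers, a single oversized character box is enough to make the whole word rectangle intersect the word on the next line, which is the situation that breaks the template matching.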

Are you able to fix this issue?

P.S. The 'r' character is nevertheless being recognized correctly, as is the ':'.

Version: GdPicture Pro v5.11.18 ActiveX

Kind regards,
Slava
Attachments
Tesseract OCR problem.jpg

Loïc
Site Admin
Posts: 5881
Joined: Tue Oct 17, 2006 10:48 pm
Location: France

Re: OCR bug - Same image area is included in 2 different chars

Post by Loïc » Wed Sep 09, 2009 4:23 pm

Hi,

Could you send the source image to esupport (at) gdpicture (dot) ?

Does the problem occur on only one image, or on several?


Kind regards,

Loïc

Slava
Posts: 66
Joined: Fri Jun 22, 2007 4:43 pm

Re: OCR bug - Same image area is included in 2 different chars

Post by Slava » Wed Sep 09, 2009 5:06 pm

Loic,

This particular document first needs permission before it can be sent outside the company. (This can take a while, if you still need to examine it.)

I've been investigating this problem and processed another 15 documents to see how often it occurs. It did not occur in any of them. However, we previously received feedback from a customer about overlapping word highlighting, and until now we were not aware of the cause.

I have also noticed that my guess about the dot was wrong in the case above. If you look closely, the 'm' and the ':' chars on the second line are marked together. So the upper dot is not included in two chars, but only in the 'r' char, and the lower dot is included in the 'm' char. So we got:

r
.


and

m .

recognized as 'r' and 'm' (without dots)

As this problem does not happen often, it is not a critical issue for now. If I encounter more problems with it, I will post here or contact you again.

Kind regards,
Slava
