Clearing the "natural learning algorithm"

Post by **Nigels** » Thu Sep 03, 2009 5:25 pm

Hi

I have been experiencing problems performing OCR on multiple documents. I get different results returned depending on the order in which the documents are processed.

I believe this is because of the “natural learning algorithm” employed by the Tesseract Engine as mentioned in other posts.

I am using the ActiveX version, which does not have an option to “clear” the “learnt” information. As a result, I guess, it learns from previous documents that can differ in size, font, quality, etc., so the results vary apparently at random, and quite significantly, according to what has been read before.

This is far from ideal – you really want to receive the same OCR results each time the same document is read!

Has this been fixed in a later version or does a workaround exist to get over this problem?

Thanks

Nigel

Post by **Loïc** » Fri Sep 04, 2009 10:21 am

Hi Nigel,

This bug is known to the Tesseract OCR development team.

We found a workaround in GdPicture.NET, but we were unable to include it in the ActiveX editions of GdPicture.
Unfortunately, I can't do anything for now. We just hope this bug will be fixed as soon as possible.

Kind regards,

Loïc

Post by **Nigels** » Tue Sep 08, 2009 12:56 pm

Thanks Loic

I hope so too!

Cheers

Nigel

Post by **Nigels** » Tue Oct 06, 2009 2:37 pm

Hi Loic

I have come up with a possible workaround for this. It is not ideal because it increases the amount of processing, but it does appear to clear the algorithm.

We are typically processing a number of documents (i.e. a batch of invoices). The solution I have come up with is to late-bind the cimaging control and then destroy and recreate it between each page/document. This means that the object needs to be created for each page, and the document reloaded for each page if it is a multipage document, which is a bit of an overhead!
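The destroy-and-recreate idea above can be sketched roughly as follows. This is only a minimal Python illustration of the pattern, not GdPicture or Tesseract code: `AdaptiveToyEngine`, `ocr_shared` and `ocr_isolated` are hypothetical names, and the toy engine merely imitates state that carries over from one page to the next.

```python
from typing import Callable, List


class AdaptiveToyEngine:
    """Hypothetical stand-in for an OCR engine that adapts as it reads."""

    def __init__(self) -> None:
        # State carried from one page to the next (the "learnt" information).
        self._pages_seen = 0

    def recognize(self, page: str) -> str:
        # Toy behaviour: the output depends on how many pages came before.
        result = f"{page}#{self._pages_seen}"
        self._pages_seen += 1
        return result


def ocr_shared(engine: AdaptiveToyEngine, pages: List[str]) -> List[str]:
    """Reuse one engine for every page, so earlier pages influence later ones."""
    return [engine.recognize(page) for page in pages]


def ocr_isolated(make_engine: Callable[[], AdaptiveToyEngine],
                 pages: List[str]) -> List[str]:
    """Create a fresh engine per page so nothing learnt is carried over."""
    results = []
    for page in pages:
        engine = make_engine()  # new instance: no history yet
        results.append(engine.recognize(page))
        del engine  # drop the reference before the next page
    return results


if __name__ == "__main__":
    batch = ["invoice-a", "invoice-b"]
    print(ocr_shared(AdaptiveToyEngine(), batch))
    print(ocr_isolated(AdaptiveToyEngine, batch))
```

With the shared engine, the result for a page changes with its position in the batch; with a fresh engine per page it does not, at the cost of constructing the engine (and, in the real control, reloading the document) every time.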

I ran a test on a 24 page document and received significantly different results using this technique compared to just OCR'ing each page in turn.

Can you confirm whether this method clears the "natural learning algorithm", and that this is why I am seeing different results?

Also, any idea when this problem will be fixed (if it is soon I will not change all my code!).

Cheers

Nigel

Post by **Loïc** » Tue Oct 06, 2009 2:41 pm

Hi Nigel,

Your solution is good, and it is the one we implemented in GdPicture.NET.

As for the bug on the Tesseract side, I think they fixed it in a current beta release. However, we have not tried it because we are waiting for a stable release.

Kind regards,

Loïc

Post by **jawa** » Sat Jan 29, 2011 4:00 am

I would like to turn off this learning feature.
You refer to a fix in Tesseract (beta). Do you know where I can get that version and how I can turn off the learning feature?
Thanks a lot in advance,
Jawa
