some magic in .GetPageText?

Posted: Wed Jan 06, 2021 3:59 pm
by fierikit
Hi
We have to extract text from PDFs, and we use .GetPageText from GdPicturePDF.
But something funny happens:
what I see is different from what I get.
i see: COC TELEMATICO
but i get: cac TELEMATiCa
i see: Q1J7N1LVL
but i get: Q1J7N1 LVL <-- there is a space between 1 and L

I have attached a PDF with this problem, but it happens with many PDFs of the same kind.
I should also say that select-and-paste from Acrobat produces the same problem as .GetPageText.
Maybe it is not a GdPicture problem, but can someone help me solve it?
Alberto

Re: some magic in .GetPageText?

Posted: Tue Jan 12, 2021 4:30 pm
by Hugo
Hi Fierikit,

I have taken a look at your PDF, and I can say this is not GdPicture's fault but the document's. The text was previously OCR'ed inaccurately, and Adobe shows the same result.

Have you tried using our OCR engine on your documents to get more accurate text results from your document?
This is the result you should be able to get from our OCR (attached file)
Screenshot_404.png
Let me know if you need anything else

Re: some magic in .GetPageText?

Posted: Tue Jan 12, 2021 5:13 pm
by fierikit
Thanks Hugo
As I said, it is probably a fault in the PDF.
The problem is that .GetPageText runs in an automatic batch job, because the number of PDFs to read can be very high.
The batch cannot tell whether the text from .GetPageText differs from what a human eye sees.

It is very rare to find a PDF like the one I sent, but when it happens it can be a problem for later activities.

Usually, if I use .GetPageText on a PDF produced by scanning (an image PDF), the result is an empty string,
but with a PDF of this kind I get the whole text (with some errors, caused by what you described).

It takes too long to use .GetPageText AND run OCR on every PDF, even if I only need the last 3 rows.
So is there a way, or a property, to tell that some parts of the PDF were generated by a previous OCR?

I hope I have explained it clearly.
Thanks for your support.
Alberto
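
One possible workaround for the batch, sketched here only as an illustration: instead of running OCR on every PDF, run a cheap plausibility check on the string returned by .GetPageText and send only the suspicious pages to OCR. The sketch below is plain Python and does not call GdPicture at all; the function names suspicious_tokens and needs_review are hypothetical, and the rule it applies (a mostly uppercase word containing a lowercase letter followed by an uppercase one, as in "TELEMATiCa") is an assumption based on the single example in this thread. It does not detect the OCR layer itself, and it will miss errors such as "cac" or the extra space in "Q1J7N1 LVL".

```python
import re


def suspicious_tokens(text):
    """Return words whose letter case looks like an OCR artifact.

    Hypothetical heuristic: a word of 4+ letters that is mostly uppercase
    but contains a lowercase letter directly followed by an uppercase one.
    """
    flagged = []
    for token in re.findall(r"[A-Za-z]+", text):
        if len(token) < 4:
            continue
        uppers = sum(1 for c in token if c.isupper())
        lowers = len(token) - uppers
        if uppers > lowers and re.search(r"[a-z][A-Z]", token):
            flagged.append(token)
    return flagged


def needs_review(text):
    """True if the page text is empty (image PDF) or looks unreliable."""
    return not text.strip() or bool(suspicious_tokens(text))


# Strings taken from the first post in this thread.
print(suspicious_tokens("cac TELEMATiCa"))
print(needs_review("COC TELEMATICO"))
```

Pages for which needs_review returns True could then be queued for OCR, while all other pages keep the fast .GetPageText result. The rule would need tuning against real documents, since legitimate mixed-case names can trigger it.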