
Extract text from a PDF page with formatting retained

Posted: Tue Mar 19, 2019 8:03 am
by azeeth
Hi, I have a requirement to extract text from a PDF page and write it to a text file.

I have used the GetPageText() function, which successfully extracts the text, but the formatting is not retained: all white space between words and paragraphs has been removed.

I have also tried the GetPageTextWithCoords() function, which returns the text with coordinates. This can be useful, but is there an easier way to transform this to text only?

I know it is hard with different fonts and text sizes. But is there a built-in function that can extract the text and approximate its position in terms of white space, new lines and paragraphs? In other words, I want the text file to retain the text positions from the page and resemble the PDF as closely as possible.

Thanks.
Ajit

Re: Extract text from a PDF page with formatting retained

Posted: Tue Mar 19, 2019 3:48 pm
by Gabriela
Hello,

The main "issue" here comes from the definition of the PDF format.
The text as you see it on the page in your PDF document is stored in a totally different way behind the scenes. So even if you see some text drawn on the same line on the page, it may not be a single line in the document content.
The GetPageText() method extracts text with all spaces if they are actually there, that is, if the text is really separated with space characters. But text that appears aligned on the same line can in reality be two separate pieces of text with no real spaces between them. So the text is extracted without "formatting".
And if we also take font and font size into consideration, it is almost impossible to implement such a feature without some heuristics behind it.
Using the GetPageTextWithCoords() and GetPageTextWithCoordsEx() methods, you can implement your own parser to obtain the page formatting you want:
https://www.gdpicture.com/guides/gdpicture/web ... oords.html
https://www.gdpicture.com/guides/gdpicture/web ... rdsEx.html
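
As a rough illustration of the kind of heuristic parser described above, here is a minimal sketch in Python. It assumes the coordinate output has already been parsed into (text, x, y) tuples with y growing downward; that tuple shape, the layout_text name, and the char_width, line_tolerance and line_height defaults are all assumptions made for this example, not the documented output format of GetPageTextWithCoords(). Check the linked guides for the real format and adapt the parsing step accordingly.

```python
from typing import List, Tuple

# Hypothetical fragment shape: (text, x, y), in points, with y growing downward.
# This is NOT the documented output of GetPageTextWithCoords(); parse the real
# output into this shape first.
Fragment = Tuple[str, float, float]


def layout_text(fragments: List[Fragment],
                char_width: float = 6.0,       # assumed average glyph width
                line_tolerance: float = 3.0,   # max y difference within one line
                line_height: float = 12.0) -> str:  # assumed line spacing
    """Rebuild an approximate plain-text layout from positioned fragments."""
    # Group fragments into lines: a fragment joins the current line when its y
    # is within line_tolerance of that line's first fragment.
    lines: List[Tuple[float, List[Fragment]]] = []
    for frag in sorted(fragments, key=lambda f: (f[2], f[1])):
        if lines and abs(frag[2] - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag[2], [frag]))

    out: List[str] = []
    prev_y = None
    for y, frags in lines:
        # Emit blank lines for vertical gaps larger than one line height.
        if prev_y is not None:
            blank = int(round((y - prev_y) / line_height)) - 1
            out.extend([""] * max(blank, 0))
        row = ""
        for text, x, _ in sorted(frags, key=lambda f: f[1]):
            # Map the x position to a character column and pad with spaces.
            col = int(round(x / char_width))
            # Keep at least one space between fragments on the same line.
            pad = max(col - len(row), 1) if row else col
            row += " " * pad + text
        out.append(row)
        prev_y = y
    return "\n".join(out)
```

With a fixed char_width the result only approximates the page, which is the heuristic trade-off mentioned above: mixed fonts and sizes will need per-fragment widths or a tuned tolerance.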