Extract text from a PDF page with formatting retained
Posted: Tue Mar 19, 2019 8:03 am
Hi, I have a requirement to extract text from a PDF page and write it to a text file.
I have used the GetPageText() function, which successfully extracts the text, but the formatting is not retained: all white space between words and paragraphs has been removed.
I have also tried the GetPageTextWithCoords() function, which returns the text with coordinates. This could be useful, but is there an easier way to transform its output into plain text?
I know this is hard with different fonts and text sizes. But is there a built-in function that can extract the text so that white space, new lines and paragraphs land as close as possible to their original positions? In other words, I want the text file to retain the text positions from the page and resemble the PDF as closely as possible.
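To illustrate the kind of output I mean, here is a minimal Python sketch of snapping positioned text onto a character grid. It assumes the coordinate output has already been parsed into hypothetical (x, y, text) tuples with y increasing up the page; the function name, the tuple shape, and the char_width and line_height values are all assumptions for illustration, not part of the library's API.

```python
def layout_text(fragments, char_width=6.0, line_height=12.0):
    """Approximate page layout by placing (x, y, text) fragments on a character grid.

    char_width and line_height are assumed cell sizes in page units;
    real pages with mixed fonts would need these tuned or estimated.
    """
    if not fragments:
        return ""
    top = max(y for _, y, _ in fragments)  # highest y is treated as the first line
    rows = {}
    for x, y, text in fragments:
        row = round((top - y) / line_height)  # distance from the top, in lines
        col = round(x / char_width)           # horizontal offset, in characters
        rows.setdefault(row, []).append((col, text))

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and col <= len(line):
                line += " "            # fragment would touch or overlap: keep one space
            else:
                line = line.ljust(col)  # pad with spaces up to the target column
            line += text
        lines.append(line)             # rows with no fragments become blank lines
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [(0, 700, "Hello"), (60, 700, "World"), (0, 676, "Next")]
    print(layout_text(sample))
```

With a fixed cell size this only approximates the layout, so proportional fonts and mixed text sizes will still drift.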
Thanks.
Ajit