Get text from pdf including spaces in the PDF document

benedikt
Posts: 3
Joined: Wed Aug 08, 2018 11:57 am

Get text from pdf including spaces in the PDF document

Post by benedikt » Wed Sep 12, 2018 9:15 am

Hey,

Is it possible to get the formatted text from a PDF? For example, a line in the PDF looks like this:

Code:

1  Test t3            3,5   14
But the result of "GetPageText" is:

Code:

1 Test t3 3,5 14
I need the space information to split a line into columns.
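
To make the goal concrete, here is a minimal Python sketch of the column split that the preserved spacing would allow. It is not GdPicture code; `split_columns` is a hypothetical helper, and the rule that two or more whitespace characters separate columns is an assumption about this particular layout.

```python
import re


def split_columns(line: str) -> list[str]:
    """Split a line into columns, treating 2+ whitespace characters as a separator."""
    # A single space stays inside a column (e.g. "Test t3");
    # a run of two or more whitespace characters starts a new column.
    return re.split(r"\s{2,}", line.strip())


if __name__ == "__main__":
    print(split_columns("1  Test t3            3,5   14"))
```

With the collapsed output ("1 Test t3 3,5 14") the same call returns a single column, which is why the original spacing matters here.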

Thanks a lot!

Gabriela
Posts: 436
Joined: Wed Nov 22, 2017 9:52 am

Re: Get text from pdf including spaces in the PDF document

Post by Gabriela » Tue Sep 18, 2018 1:55 pm

Hello,

With our latest release, the text is extracted "as is", that is, with all spaces. You can find an example in our documentation here:
https://www.gdpicture.com/guides/gdpicture/web ... eText.html

benedikt
Posts: 3
Joined: Wed Aug 08, 2018 11:57 am

Re: Get text from pdf including spaces in the PDF document

Post by benedikt » Tue Sep 18, 2018 3:06 pm

As you can read in my post, I'm already using GetPageText with the latest release, but I don't get any runs of more than one space.
So the visual gap is not "filled" with space characters.

Loïc
Site Admin
Posts: 5881
Joined: Tue Oct 17, 2006 10:48 pm
Location: France

Re: Get text from pdf including spaces in the PDF document

Post by Loïc » Tue Jan 15, 2019 10:52 am

Hello Benedikt,

Your PDF probably doesn't contain any spacing characters. Note that in this format it is possible to draw individual text runs at any offset, so a visual gap does not have to correspond to space characters in the content.
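
As an illustration of the point about text runs, here is a minimal Python sketch of how a line could be rebuilt from positioned runs. This is not the GdPicture API: the `(x, text)` tuples, the `layout_line` helper, and the fixed `char_width` are all hypothetical, and a real page would need per-font glyph widths.

```python
from typing import List, Tuple

# Hypothetical run representation: (x offset of the left edge in points, text).
Run = Tuple[float, str]


def layout_line(runs: List[Run], char_width: float) -> str:
    """Place each run at the character column nearest to its x offset."""
    line = ""
    for x, text in sorted(runs):
        column = round(x / char_width)
        # Keep at least one space between consecutive runs.
        if line and column <= len(line):
            column = len(line) + 1
        # Pad with spaces up to the target column, then append the run.
        line = line.ljust(column) + text
    return line


if __name__ == "__main__":
    # Assumed positions for the example line, with 6 pt per character.
    runs = [(0.0, "1"), (18.0, "Test t3"), (132.0, "3,5"), (168.0, "14")]
    print(layout_line(runs, char_width=6.0))
```

The sketch only works if the position of each run is available; plain extracted text, as returned in the question, no longer carries that information.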

OCR is a good approach to "recreate" such missing information. You can do that with GdPicture by first rasterizing the PDF page to a bitmap, then running OCR on it with GdPictureOCR.

Please let us know if you need a code snippet or further information.

With best regards,

Loïc Carrère
