Merge PDF and ocr text
Hi All,
I would like to use GDPicture.NET 10 to OCR PDF documents. Then I would extract the text with coordinates (I found the GetPageTextWithCoords() method).
I would like to store this text (with coordinates) separately from the non-OCR PDF. When the user clicks to view the PDF, I would like to merge the non-OCR PDF with the text (with coordinates) on the fly, so that the user gets a PDF with OCR text.
Is this possible with GDPicture.NET 10?
Thanks in advance,
Gabor
Re: Merge PDF and ocr text
I really don't understand the point of producing a PDF-OCR file only to extract its text and put it back into the original PDF. Doing that transforms the original PDF into a PDF-OCR, which is the whole point of the method in the first place.
It is achievable, but it is neither practical nor straightforward, and in the end it gives exactly the same result as producing a PDF-OCR directly.
Re: Merge PDF and ocr text
Dear Cedric,
The point is this: the customer has a database in which PDFs are stored in a blob field of a table. We have to OCR these PDFs, but we also have to keep the original files. The database is currently 400 GB; if we store a PDF-OCR next to each original PDF, we end up with a database of almost 800 GB. We thought it would be simple to store only the text layer and, when the user wants to view a document, merge the text and the PDF on the fly.
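As a rough illustration of "storing only the text layer", here is a minimal Python sketch that packs per-page word boxes into a compressed blob. The record layout (text, left, top, width, height) and the function names are assumptions made for this example; they are not the output format of GetPageTextWithCoords() or any GDPicture.NET API.

```python
import json
import zlib


def pack_text_layer(pages):
    """Serialize per-page word boxes to compressed bytes for a blob column.

    `pages` is a list with one entry per page; each entry is a list of
    dicts describing one word and its box (hypothetical layout).
    """
    raw = json.dumps(pages, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, 9)


def unpack_text_layer(blob):
    """Inverse of pack_text_layer: bytes from the database back to word boxes."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Two pages: one with a single recognized word, one with no text at all.
pages = [
    [{"text": "Invoice", "left": 72.0, "top": 100.0, "width": 48.5, "height": 12.0}],
    [],
]
blob = pack_text_layer(pages)
restored = unpack_text_layer(blob)
```

The blob would sit next to the original PDF in the table, so the original file stays untouched and only the text records are added.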
Re: Merge PDF and ocr text
This is still possible, but you will have to create a PDF file from scratch on the fly, and depending on the PDF size and content, that may not be a good idea.
The method you are looking for is the DrawText method documented here: https://www.gdpicture.com/guides/gdpicture/GdP ... tring.html
Basically what you will have to do is:
- Load the original PDF document
- Create a new blank PDF document
- Create each page in the new PDF document with the size based on the corresponding page from the original one
- Insert the text layer on each page
- Create a raster image of each original PDF page and insert it on the corresponding new PDF page, after the text, so that the image covers it (otherwise you will have the inserted text over the picture, which is not what you want)
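The order of the steps above can be sketched in a library-neutral way. The Python below does not call GDPicture.NET; it only builds the list of drawing operations for one rebuilt page, and the operation names, the `Word` record and the top-left coordinate convention for the stored text are assumptions made for illustration. It shows the two details that matter: the text is drawn before the raster image, and coordinates measured from the top of the page are flipped, because PDF user space has its origin at the bottom-left corner.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: float    # distance from the page's left edge, in points
    top: float     # distance from the page's top edge, in points
    height: float  # box height in points, used here as the font size


def plan_page(page_width, page_height, words):
    """Return the drawing operations for one rebuilt page, in paint order."""
    # The new page takes its size from the corresponding original page.
    ops = [("new_page", page_width, page_height)]
    for w in words:
        # Flip the y axis and put the baseline at the bottom of the word box.
        baseline_y = page_height - (w.top + w.height)
        ops.append(("draw_text", w.text, w.left, baseline_y, w.height))
    # The raster goes last so that it is painted over the text and hides it.
    ops.append(("draw_image", 0.0, 0.0, page_width, page_height))
    return ops


ops = plan_page(595.0, 842.0, [Word("Invoice", 72.0, 100.0, 12.0)])
for op in ops:
    print(op)
```

Each tuple would then be mapped to the matching call of whatever PDF library performs the rebuild.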
Re: Merge PDF and ocr text
Thank you for your help! We will consider your advice.
Best regards,
Gabor