Document sizes after PdfAddGdPictureImageToPdfOCR

lbleicher
Posts: 16
Joined: Fri Nov 04, 2011 4:51 am

Post by lbleicher » Thu Dec 22, 2011 7:13 pm

Hi-

My application runs OCR on scanned-image PDFs to create searchable PDF/A output. However, I have noticed that the output of the code below takes as much as 10x the disk space of the original. Can anyone explain why? Am I missing a step somewhere?

Attached is a sample PDF that goes from 12k before the process to 700k after.

Thanks,
Leo

Code:

        
        Dict = "eng"

        PdfID = oGdPictureImaging.PdfOCRStart(OutputFilePath, True, "", "", "", "", "DocDigester")
        oGdPictureImaging.OCRTesseractSetPassCount(2)

        If InputPDF.LoadFromFile(pdfPath, False) = GdPicture.GdPictureStatus.OK Then
            For i As Integer = 1 To InputPDF.GetPageCount()
                InputPDF.SelectPage(i)
                ImageID = InputPDF.RenderPageToGdPictureImage(200, True)

                curPageImage = InputPDF.ExtractPageImage(i)
                inPgPD = myPage.GetBitDepth(curPageImage)
                Select Case inPgPD
                    Case 1
                        oGdPictureImaging.ConvertTo1Bpp(ImageID) 'B/W 
                    Case 8
                        oGdPictureImaging.ConvertTo8BppGrayScale(ImageID) 'grayscale
                    Case 24
                        'do nothing default is 3x8bit color
                    Case Else
                        oGdPictureImaging.ConvertTo1Bpp(ImageID) 'B/W 
                End Select

                Dim pgText As String = oGdPictureImaging.PdfAddGdPictureImageToPdfOCR(PdfID, ImageID, Dict, sciroot & "docdigester\bin\win", "")
                oGdPictureImaging.ReleaseGdPictureImage(ImageID)
                oGdPictureImaging.ReleaseGdPictureImage(curPageImage)
            Next i
        Else
            'report out reason for problem.
            Dim errCode As Integer = InputPDF.GetStat()
        End If
        InputPDF.CloseDocument()
        oGdPictureImaging.PdfOCRStop(PdfID)

Attachments
123456A.zip
(5.93 KiB)

Loïc
Site Admin
Posts: 5881
Joined: Tue Oct 17, 2006 10:48 pm
Location: France

Re: Document sizes after PdfAddGdPictureImageToPdfOCR

Post by Loïc » Thu Dec 22, 2011 7:22 pm

Hello Leo,

If your input PDF is image-based, you should consider replacing:

Code:

ImageID = InputPDF.RenderPageToGdPictureImage(200, True)

with:

Code:

ImageID = InputPDF.RenderPageToGdPictureImageEx(200, True)

Let me know if this is better.

Kind regards,

Loïc

lbleicher
Posts: 16
Joined: Fri Nov 04, 2011 4:51 am

Re: Document sizes after PdfAddGdPictureImageToPdfOCR

Post by lbleicher » Fri Jan 13, 2012 7:58 pm

Hi Loic-

Thanks for the suggestion, but that does not help. I already had a select/case statement to do conversion back to the original bit depth (though the RenderPageToGdPictureImageEx method is a better way).

I still have this 11k input PDF coming out as 1148k!!!

Is it possible that the JPEG compression is not being applied? Could this be a result of generating the output as a PDF/A?

How could I make sure compression is being applied to the PDF created by the PdfOCRStart statement?

Thanks,
Leo
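
For a sense of scale on the compression question, here is a rough sketch of how large one page is as an uncompressed raster at the 200 DPI used in the render call. It uses no GdPicture calls. The US Letter page size (8.5 x 11 in) is an assumption, since the thread does not give the sample's dimensions, and the `RawRasterBytes` helper is a hypothetical name introduced only for this illustration.

```vbnet
Imports System

Module RasterSizeEstimate
    ' Uncompressed size in bytes of a page rasterized at the given DPI.
    ' Each row is rounded up to whole bytes; any further row padding is ignored.
    Function RawRasterBytes(widthInches As Double, heightInches As Double,
                            dpi As Integer, bitsPerPixel As Integer) As Long
        Dim widthPx As Long = CLng(Math.Round(widthInches * dpi))
        Dim heightPx As Long = CLng(Math.Round(heightInches * dpi))
        Dim bytesPerRow As Long = (widthPx * bitsPerPixel + 7) \ 8
        Return bytesPerRow * heightPx
    End Function

    Sub Main()
        ' Assumption: US Letter page; 200 DPI matches the render call in the posted code.
        For Each bpp As Integer In New Integer() {1, 8, 24}
            Dim rawBytes As Long = RawRasterBytes(8.5, 11.0, 200, bpp)
            Console.WriteLine("{0,2} bpp: {1,12:N0} bytes", bpp, rawBytes)
        Next
    End Sub
End Module
```

Under these assumptions the page is 1700 x 2200 pixels, which works out to 468,600 bytes at 1 bpp, 3,740,000 bytes at 8 bpp, and 11,220,000 bytes at 24 bpp. Output sizes of 700k and 1148k are much closer to these raw figures than to the 11k-12k inputs, which would be consistent with the page image being stored with far weaker compression than the source used. It does not show which compression scheme the library applies.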
