Server-Side OCR Processing PDF with Tesseract Using ASP.NET

csinkinson
Posts: 14
Joined: Wed Sep 30, 2009 2:31 am

Post by csinkinson » Wed Sep 30, 2009 2:49 am

Hi Guys,

Just bought GdPicture.NET Ultimate - love it! Great work! :)

I'm trying to OCR a document on the server using ASP.NET. I've created a new web application in Visual Studio 2008. I'm using the following code (from the forum post about OCR'ing a multipage PDF):

Code:

        Dim ImageID As Integer

        Dim oGdViewer As New GdPicture.GdViewer
        Dim oGdPictureImaging As New GdPicture.GdPictureImaging
        Dim PdfID As Integer

        ' License the viewer, the imaging component and the Tesseract OCR plug-in.
        oGdViewer.SetLicenseNumber("XXXXX")
        oGdPictureImaging.SetLicenseNumber("XXXXX")
        oGdPictureImaging.SetLicenseNumberOCRTesseract("XXXXX")

        ' Open the source PDF so that its pages can be rendered.
        oGdViewer.DisplayFromFile(Server.MapPath("sample.pdf"))

        ' Start the output PDF next to the source file (sample_out.pdf).
        PdfID = oGdPictureImaging.PdfOCRStart(Replace(Server.MapPath("sample.pdf"), ".pdf", "_out.pdf"), True, "", "", "", "", "")
        ' For each page: render it to an image, convert it to 1 bpp, OCR it into the output PDF, then release the image.
        For i As Integer = 1 To oGdViewer.PageCount
            ImageID = oGdViewer.PdfRenderPageToGdPictureImage(200, i)
            oGdPictureImaging.ConvertTo1Bpp(ImageID)
            oGdPictureImaging.PdfAddGdPictureImageToPdfOCR(PdfID, ImageID, TesseractDictionary.TesseractDictionaryEnglish, "C:\Program Files\GdPicture.NET\Redist\OCR", "")
            oGdViewer.ReleaseGdPictureImage(ImageID)
        Next
        ' Finish the output PDF and close the source document.
        oGdPictureImaging.PdfOCRStop(PdfID)
        oGdViewer.CloseDocument()

When I try to run the code above, I get this error: "DragDrop registration did not succeed - Current thread must be set to single thread apartment (STA) mode before OLE calls can be made. Ensure that your Main function has STAThreadAttribute marked on it." on this line of code:

Code:

Dim oGdViewer As New GdPicture.GdViewer

I'm running this code on the server side (not in a Windows Forms app). Is there something I need to do to get this code to run on the server side in a web application?
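
For readers who hit the same STA error with a UI-oriented component on a server, one general .NET approach is to run the work on a dedicated thread whose apartment state is set to STA. The sketch below is only an illustration of that technique and is not the fix that later shipped in GdPicture.NET; the StaRunner class and the RunOcr routine named in the usage note are hypothetical.

```vbnet
Imports System
Imports System.Threading

' Hypothetical helper: runs a piece of work on a dedicated STA thread.
Public Class StaRunner
    Private ReadOnly _work As ThreadStart
    Private _failure As Exception

    Public Sub New(ByVal work As ThreadStart)
        _work = work
    End Sub

    ' Thread entry point: capture any exception so the caller can see it.
    Private Sub Execute()
        Try
            _work.Invoke()
        Catch ex As Exception
            _failure = ex
        End Try
    End Sub

    ' Starts the STA thread, blocks until it finishes, then rethrows any failure.
    Public Sub Run()
        Dim worker As New Thread(AddressOf Execute)
        worker.SetApartmentState(ApartmentState.STA) ' must be set before Start()
        worker.Start()
        worker.Join()
        If _failure IsNot Nothing Then
            Throw New InvalidOperationException("Work on the STA thread failed.", _failure)
        End If
    End Sub
End Class
```

Usage would look like `Dim runner As New StaRunner(AddressOf RunOcr)` followed by `runner.Run()`, where RunOcr is a parameterless Sub containing the OCR code. HttpContext.Current is not set on the new thread, so it is safer to resolve Server.MapPath values before starting it. For Web Forms pages, the AspCompat="true" page directive is another way to have the page execute on an STA thread.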

I have a small project that reproduces the problem. I've attached it to this message. The forum has a 2 MB limit, so you will need to place copies of GdPicture.NET.dll, GdPicture.NET.image.gdimgplug.dll, GdPicture.NET.ocr.tesseract.dll and GdPicture.NET.pdf.gdpdfplug.dll into the "BIN" folder within the sample project. Also, you'll need to update the project with actual license codes.

I would appreciate any suggestions you have! Thank you! :)

Chris
Attachments
TesseractTest.zip (26.07 KiB): Sample project that reproduces the issue.

Loïc
Site Admin
Posts: 5881
Joined: Tue Oct 17, 2006 10:48 pm
Location: France

Re: Server-Side OCR Processing PDF with Tesseract Using ASP.NET

Post by Loïc » Wed Sep 30, 2009 2:34 pm

Hi,

Thank you for the clear explanation.

I just fixed your problem. Please wait for the next minor release (within 2 or 3 days), or contact us at esupport (at) gdpicture (dot) com to get it now.

Kind regards,

Loïc

Post by Loïc » Wed Sep 30, 2009 4:38 pm

Me again!

Please download GdPicture.NET 6.4.7. Your bug has been fixed in this release.

Kind regards,

Loïc

Post by csinkinson » Wed Sep 30, 2009 7:42 pm

Hi Loïc,

Wow, that did it! Thank you so much for the fast turnaround! I'm now able to get the sample to work properly through Visual Studio and its embedded test web server (ASP.NET Development Server). It generates a nice PDF with the embedded text from the OCR engine - perfect! :)

I ran into another small issue. I just tried publishing the sample through IIS. When running under IIS, the code creates a new PDF as it is supposed to, but the resulting PDF doesn't have the text embedded. It's an image-only PDF, not an Image+Text PDF. There is no error message either. I've tried IIS 7 on my Vista workstation and IIS 6 on a Windows 2003 server, and both give the same result.

Attached are two sample PDFs output by the code provided earlier: one from the ASP.NET Development Server, the other from IIS. You can see that the IIS sample does not contain the embedded text.

Would you mind creating a virtual directory in IIS to test the sample provided earlier? I'm wondering if I just have an issue with the IIS configuration...

Thanks again for your help! :)
Attachments
Samples.zip (41.38 KiB): Samples from IIS and the ASP.NET Development Server.

Post by csinkinson » Fri Oct 02, 2009 9:32 pm

Hi Loïc,

I've been trying to isolate the cause of the OCR engine not working under IIS. So far I haven't been successful in finding the problem. I've tried a number of different things that I thought I should share with you:

1. Tried running the virtual directory under the "administrator" account (Directory Security -> anonymous access). This didn't help.

2. I checked that IIS is running under the "LocalSystem" service account, so it should have any access it needs.

3. I tried adding Data Execution Prevention exceptions for all the GdPicture.NET DLLs (Control Panel -> System -> Advanced -> Performance section -> "Settings" button -> Data Execution Prevention tab -> added all the DLLs). Again no luck.

4. I loaded ASP.NET Cassini onto the server and tried running under it. As expected, this works. Cassini seems to run as a user process under the user account.

One thought I had was this: Does the Tesseract PlugIn use any of the local user's environment variables? Such as the "temp" folder path, etc? If so this might explain some of the issue. If the Tesseract Engine running under IIS doesn't have local environment variables, then it might not be functioning correctly when using IIS.

This is all just speculation, as I'm not certain exactly how the plug-in was designed. I hope some of this might be helpful!
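
One way to test this kind of speculation is to have the web application report which account and environment it actually runs under, then compare the output under IIS and under Cassini. A minimal sketch using only standard .NET calls; the HostDiagnostics module name is hypothetical, and nothing here is specific to GdPicture or Tesseract.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Security.Principal

' Hypothetical helper: describes the account and environment of the current process.
Public Module HostDiagnostics
    Public Function Describe() As String
        Dim report As New StringBuilder()
        ' Windows account the code is executing as (e.g. a service account under IIS).
        report.AppendLine("Identity:          " & WindowsIdentity.GetCurrent().Name)
        ' Temp folder as resolved by the framework for this process.
        report.AppendLine("Temp folder:       " & Path.GetTempPath())
        ' Directory that relative file names are resolved against.
        report.AppendLine("Current directory: " & Environment.CurrentDirectory)
        ' Raw TEMP environment variable; may be empty for some accounts.
        report.AppendLine("TEMP variable:     " & Environment.GetEnvironmentVariable("TEMP"))
        Return report.ToString()
    End Function
End Module
```

Calling `Response.Write(Server.HtmlEncode(HostDiagnostics.Describe()))` from the test page under both hosts should make any difference in identity or folders visible.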

Look forward to your thoughts! :)

Chris

Post by Loïc » Sat Oct 03, 2009 9:40 am

Hi Chris,

I suspect your IIS server doesn't allow the OCR engine to access the dictionary files.

You should check the return value of every GdPicture.NET method call to see where the error occurs.
If you need more help, try to reproduce the problem in a small application and send it to esupport (at) gdpicture (dot) net for further investigation.

Kind regards,

Loïc
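
The advice to check every result could be sketched as a small helper that is called after each step. This assumes that the installed GdPicture.NET version exposes a GetStat() method on GdPictureImaging and a GdPictureStatus enumeration with an OK member; verify both against your version's reference before relying on it. The OcrChecks module name is hypothetical.

```vbnet
Imports System
Imports GdPicture

' Hypothetical helper: stops at the first GdPicture call that reports a failure.
Public Module OcrChecks
    ' Assumption: GetStat() returns the status of the most recent call on this
    ' object, and GdPictureStatus.OK means success.
    Public Sub EnsureOk(ByVal imaging As GdPictureImaging, ByVal stepName As String)
        Dim status As GdPictureStatus = imaging.GetStat()
        If status <> GdPictureStatus.OK Then
            Throw New ApplicationException(stepName & " failed with status " & status.ToString())
        End If
    End Sub
End Module
```

In the loop from the first post, a call such as `OcrChecks.EnsureOk(oGdPictureImaging, "OCR of page " & i.ToString())` placed right after PdfAddGdPictureImageToPdfOCR would turn a silent failure into an exception that names the failing step.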

Post by csinkinson » Sat Oct 03, 2009 7:13 pm

Hi Loïc,

I've found some information that may be helpful! I installed Process Monitor (http://technet.microsoft.com/en-us/sysi ... 96645.aspx) onto my system so that I could watch for any File I/O errors while the Tesseract engine was running under IIS. I noticed this entry:

Code:

Date & Time:	10/3/2009 12:28:48 PM
Event Class:	File System
Operation:	CreateFile
Result:	ACCESS DENIED
Path:	C:\Windows\System32\inetsrv\GdPicture.NET.ocr.tesseract.dll
TID:	57548
Duration:	0.0000770
Desired Access:	Generic Read/Write
Disposition:	OverwriteIf
Options:	Synchronous IO Non-Alert, Non-Directory File, Open No Recall
Attributes:	n/a
ShareMode:	None
AllocationSize:	0
Description:	IIS Worker Process
Company:	Microsoft Corporation
Name:	w3wp.exe
Version:	7.0.6001.18000
Path:	c:\windows\system32\inetsrv\w3wp.exe
Command Line:	c:\windows\system32\inetsrv\w3wp.exe -a \\.\pipe\iisipm0d1e1454-db72-4f11-ac10-d8da255c9d17 -v "v2.0" -h "C:\inetpub\temp\apppools\DefaultAppPool.config" -w "" -m 0 -t 20 -ap "DefaultAppPool"
PID:	54284
Parent PID:	3480
Session ID:	0
User:	NT AUTHORITY\NETWORK SERVICE
Auth ID:	00000000:000003e4
Architecture:	32-bit
Virtualized:	False
Integrity:	System
Started:	10/3/2009 12:24:05 PM
Ended:	(Running)
Modules:
w3wp.exe

So, I gave the "Network Service" account access to the c:\windows\system32\inetsrv\ folder. This solved the problem. The ASP.NET application was running under the "DefaultAppPool" application pool in IIS, which runs under the "Network Service" account. Makes sense.

But going forward, this will cause some problems. We have a large customer base that we're hoping to upgrade to use the Tesseract OCR engine. Asking everyone to change their IIS settings is not appealing; most of these clients don't have dedicated IT staff. (They'd probably telephone us and complain! yikes! haha)

I've read some articles online (http://sjc.ironspeed.com/post?id=2496571) that seem to indicate that temporary files are created in the c:\windows\system32\inetsrv\ folder when a .NET library writes files to disk without specifying a full path. Is it possible that Tesseract is trying to write temporary files to the disk? Is it possible to provide Tesseract with a default "temp" folder? We have a temp folder within our system that does have the correct permissions (Network Service can access it). If we could specify a temp path, I think these issues would go away!
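
The behaviour described in those articles follows from how .NET resolves relative paths: a bare file name is resolved against the process's current directory. A minimal sketch of the difference between an implicit and an explicit location, using only System.IO; the PathDemo module and its function names are hypothetical, and this says nothing about what the Tesseract plug-in actually does internally.

```vbnet
Imports System
Imports System.IO

' Hypothetical helper: contrasts implicit and explicit file locations.
Public Module PathDemo
    ' Where a bare file name ends up: relative to the process's current directory.
    Public Function ImplicitLocation(ByVal fileName As String) As String
        Return Path.GetFullPath(fileName)
    End Function

    ' The same file name anchored to a folder chosen by the caller.
    Public Function ExplicitLocation(ByVal folder As String, ByVal fileName As String) As String
        Return Path.Combine(folder, Path.GetFileName(fileName))
    End Function
End Module
```

Under the IIS worker process shown in the trace, `PathDemo.ImplicitLocation("scratch.tmp")` would likely point into c:\windows\system32\inetsrv\, while `PathDemo.ExplicitLocation(tempFolder, "scratch.tmp")` stays inside whatever folder the application passes in.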

I do have a project that reproduces the problem. You can use the original sample that I provided in the first message of this thread (TesseractTest.zip):

1. Unzip the files to a folder and, if needed, adjust the path to the OCR libraries in default.aspx.vb.

2. Place copies of GdPicture.NET.dll, GdPicture.NET.image.gdimgplug.dll, GdPicture.NET.ocr.tesseract.dll and GdPicture.NET.pdf.gdpdfplug.dll into the "BIN" folder within the sample project.

3. In IIS, right-click "Default Web Site" and choose "New" -> "Virtual Directory". In the wizard that appears, enter the alias "TesseractTest" and click "Next", choose the path to the folder where you extracted TesseractTest.zip (e.g. c:\Inetpub\wwwroot\TesseractTest\) and click "Next", then allow permissions for "Read" and "Run Scripts (such as ASP)".

4. Open a browser, visit http://localhost/TesseractTest/ and click the "Run OCR" button.

You'll notice the PDF does not contain any text from the OCR engine.

Then, give the "Network Service" account access to the "c:\windows\system32\inetsrv\" folder and try again. You should notice that the resulting PDF does contain the OCR text! So, it seems like we can isolate this issue. But there doesn't seem to be anything else that I can do through the GdPicture interface.

If you have trouble setting this up, I can give you remote desktop access to one of our testing servers so that you can sign in and look at everything I've discussed. Just let me know if this would be helpful!

I must admit that I'm really impressed with your support! You guys do a great job! A big chunk of my day is involved with doing technical support. Always drives me nuts when users don't give enough info so that we can diagnose problems. I'm hoping this info will help! :) If there is anything else you need, please let me know!

Have a great weekend!

Chris

Post by Loïc » Mon Oct 05, 2009 1:24 pm

Hi,

I don't think it is a temporary file problem.
I think it is a problem with the way GdPicture.NET loads the Tesseract engine using LoadLibrary().

I am working on this problem and will solve it within a few days.

In the meantime, I can give you a workaround:

- Copy the GdPicture.NET.ocr.tesseract.dll file into the system32 folder or into any other folder listed in the PATH environment variable.

In any case, never change the permissions of the c:\windows\system32\inetsrv\ folder. It is not a good solution ;)

Kind regards,

Loïc
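
When LoadLibrary() is given a bare file name, Windows searches a fixed list of locations that includes the host executable's folder, the system folders and every folder on PATH, which is presumably why the workaround above helps. An application that cannot copy files into system32 could instead extend PATH for its own process before the engine is first used. The sketch below shows only that general technique with standard .NET calls; the NativeSearchPath module is hypothetical, and whether this particular GdPicture.NET build picks the DLL up this way is untested.

```vbnet
Imports System
Imports System.IO

' Hypothetical helper: makes a folder visible to later LoadLibrary() calls
' by prepending it to this process's PATH (other processes are unaffected).
Public Module NativeSearchPath
    Public Sub Prepend(ByVal folder As String)
        Dim current As String = Environment.GetEnvironmentVariable("PATH")
        If current Is Nothing Then current = String.Empty

        ' Do nothing if the folder is already listed.
        For Each entry As String In current.Split(Path.PathSeparator)
            If String.Equals(entry.Trim(), folder, StringComparison.OrdinalIgnoreCase) Then
                Return
            End If
        Next

        ' This overload changes the variable for the current process only.
        Environment.SetEnvironmentVariable("PATH", folder & Path.PathSeparator & current)
    End Sub
End Module
```

A call such as `NativeSearchPath.Prepend(Server.MapPath("~/bin"))` early in the application's lifetime, for example in Application_Start, would be one place to try it.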

Post by csinkinson » Mon Oct 05, 2009 5:18 pm

Hi Loïc,

Yes, I agree with you about giving Network Service access to the c:\windows\system32\inetsrv\ folder! Bad idea! haha ;)

Please keep us informed of your progress with the update!

Thank you

Post by Loïc » Mon Oct 05, 2009 5:43 pm

Hi,

Send me an email at esupport (at) gdpicture (dot) com. I will send you the next release before publishing it.
I have H5N1 and will only be back at work at the end of the week...

Kind regards,

Loïc

Post by csinkinson » Mon Oct 05, 2009 7:17 pm

Will do! And please take good care of yourself so that you recover quickly! :)

Thank you,
Chris

Post by Loïc » Mon Oct 05, 2009 7:18 pm

Thank you Chris :)

I just sent the download link for the next release to your email.

Loïc

Post by csinkinson » Mon Oct 05, 2009 8:03 pm

Hi Loïc,

Just got the new version and it works perfectly! Thank you so much for your incredibly fast support!

Get better soon! :)

Chris
