PDF OCR gives "Attempted to read or write protected memory. This is often an indication that other memory is corrupt."

6 Replies
4 Participants

Created by
JP jpalo

Platform
WinForms

Control
PDF

Created On
Dec 19, 2019 01:55 PM UTC

Last Activity On
Jan 6, 2020 12:02 PM UTC

I have the simplest WinForms app with a file path text box, an output text box, and a button. I am testing the PDF OCR capabilities, but processor.PerformOCR throws this error:

Unhandled Exception: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at Syncfusion.OCRProcessor.Native.OCRApi.InitializeDataPath(IntPtr pt, String path, String lang)
   at Syncfusion.OCRProcessor.OCRProcessor.DoOCR(String[] args)
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
   at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at Program.Main(String[] args)

This is the code, all there is:

try
{
    SyncfusionLicenseProvider.RegisterLicense("VALID LICENSE KEY HERE");

    // Initialize the OCR processor with the path to the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
    using (OCRProcessor processor = new OCRProcessor(@"C:\Temp\TesseractBinaries\3.02\"))
    {
        // Load the PDF document
        PdfLoadedDocument lDoc = new PdfLoadedDocument(file.Text);

        // Set the OCR language to process
        processor.Settings.Language = Languages.English;

        // Run OCR, passing the PDF document and the Tesseract data folder
        output.Text = processor.PerformOCR(lDoc, @"C:\Temp\TesseractData");

        // Close the document and release its resources (nothing is saved to disk here)
        lDoc.Close(true);
    }
}
catch (Exception ex)
{
    output.Text = ex.Message;
}

The TesseractBinaries folder contains the required *.dll files, the TesseractData folder contains the *.traineddata files, and the project has NuGet references to the Syncfusion packages. The actual .sln is attached.

I originally tested this on Windows Server 2012 R2 inside a SharePoint 2016 event handler. I now get the same error on my local machine (Windows 10) in a test WinForms application.
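The thread never establishes what caused the exception. As a first step, the folder layout described above can be checked in code before the OCR processor is constructed. The following is a minimal sketch using only System.IO; the class and method names (OcrPathCheck, FindProblems) are hypothetical, and the two DLL names are taken from the comment in the posted code, so adjust them if your package ships different files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class OcrPathCheck
{
    // File names taken from the comment in the posted code (an assumption for other versions).
    static readonly string[] ExpectedBinaries = { "SyncfusionTesseract.dll", "liblept168.dll" };

    // Returns the problems found; an empty list means the basic layout looks plausible.
    public static List<string> FindProblems(string binariesDir, string dataDir)
    {
        var problems = new List<string>();

        if (!Directory.Exists(binariesDir))
        {
            problems.Add("Binaries folder not found: " + binariesDir);
        }
        else
        {
            foreach (string name in ExpectedBinaries)
            {
                if (!File.Exists(Path.Combine(binariesDir, name)))
                    problems.Add("Missing native library: " + name);
            }
        }

        if (!Directory.Exists(dataDir))
            problems.Add("Data folder not found: " + dataDir);
        else if (!Directory.EnumerateFiles(dataDir, "*.traineddata").Any())
            problems.Add("No *.traineddata files in: " + dataDir);

        return problems;
    }
}
```

For example, FindProblems(@"C:\Temp\TesseractBinaries\3.02\", @"C:\Temp\TesseractData") could be called at the top of the button handler and its result written to the output text box. An empty list only rules out the simplest path mistakes; it does not prove that the native libraries and data files are compatible with each other.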

Attachment: OCRTester_f653dc74.zip

6 Replies

SL Sowmiya Loganathan Syncfusion Team December 20, 2019 11:56 AM UTC

Hi Jussi,

Thank you for contacting Syncfusion support.

We tried the provided sample on our end, but we regret to let you know that we were unable to reproduce the reported issue. Please find the modified sample below:

Sample: https://www.syncfusion.com/downloads/support/forum/150079/ze/OCRTester-1608099049

We suspect that this is a document-specific issue. Could you please share the input PDF document so that we can replicate the issue? It will be helpful for further analysis and for providing a better solution.

Regards,

Sowmiya Loganathan

JP jpalo December 20, 2019 12:10 PM UTC

I'd be happy to share it via email, as the documents are not public.

I can share this one sample.pdf here, though. It does not throw the exception, but no text is found in it.

Attachment: sample_968a931e.zip

PN Preethi Nesakkan Gnanadurai Syncfusion Team December 23, 2019 11:37 AM UTC

Hi Jussi,

We have created an incident under your Direct-Trac account. Kindly share your documents in the ticket.

Regards,

Preethi

PV Prakash Viswanathan Syncfusion Team December 23, 2019 12:08 PM UTC

Hi Jussi,

The Syncfusion OCR processor only recognizes text from images in the PDF document. The provided document does not contain any images; it contains only text. So the OCR processor does not find any text in the provided document. Kindly try OCR processing on a PDF document with images.

If you need to get the text from a PDF document, you can use the text extraction functionality. Please refer to the link below for more information:

UG: https://help.syncfusion.com/file-formats/pdf/working-with-text-extraction

We have created an incident under your Direct-Trac account. Kindly share your documents in the ticket.

Regards,

Prakash V

JP jpalo January 3, 2020 10:06 AM UTC

Thank you. I combined OCR with PDF text extraction to support PDFs that contain both images and text. However, I have two issues:

1. Why is the text extracted with (lots of) random line breaks here and there, like this:

(B)

Compress
ed Air (kgf/cm
2
)

whereas when I open the PDF in Adobe Acrobat, select the text, and copy and paste it here, there are no line breaks:

(B) Compressed Air (kgf/cm2)

The same text as an image copied from the document: [image]

The issue is not related to the text being extracted from a table, as it also occurs with text outside tables: [image]

is extracted as:

TEST DATE: J
AN
.
22
.2019~
F
EB
.1.2019

2. Is it possible to OCR multiple languages in one go? Currently it accepts only a single language.

SL Sowmiya Loganathan Syncfusion Team January 6, 2020 12:02 PM UTC

Hi Jussi,

Why is the text extracted with (lots of) random line breaks here and there, like this:

(B)

Compress
ed Air (kgf/cm
2
)

whereas when I open the PDF in Adobe Acrobat, select the text, and copy and paste it here, there are no line breaks:

(B) Compressed Air (kgf/cm2)

We use the Tesseract engine to perform OCR on the PDF document. Tesseract processes the document word by word, so the result depends on how the content is preserved in the PDF. This is why the extracted text breaks at seemingly random places; this is the expected behavior.

Please let us know if you have any concerns on this.
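If the fragments have to be rejoined after extraction, a post-processing step is one option. The sketch below is a heuristic that fits the two samples quoted in this thread, not a Syncfusion feature: it assumes that a single line break falls in the middle of a token and is dropped, while a blank line separates tokens and becomes one space. The names LineBreakCleanup and JoinFragments are hypothetical. Text that contains genuinely separate lines would be merged incorrectly, so treat it as a starting point only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class LineBreakCleanup
{
    // Joins fragments split by single line breaks; blank lines become one space.
    public static string JoinFragments(string text)
    {
        // Normalize Windows and old Mac line endings to "\n".
        string normalized = text.Replace("\r\n", "\n").Replace('\r', '\n');

        // Split into blocks wherever there is at least one blank line.
        string[] blocks = Regex.Split(normalized, @"\n\s*\n");

        // Inside a block, drop the line breaks without inserting a space.
        var cleaned = blocks
            .Select(block => block.Replace("\n", "").Trim())
            .Where(block => block.Length > 0);

        return string.Join(" ", cleaned);
    }
}
```

Applied to the first sample, JoinFragments("(B)\n\nCompress\ned Air (kgf/cm\n2\n)") returns "(B) Compressed Air (kgf/cm2)", which matches the text copied from Adobe Acrobat.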

Is it possible to OCR multiple languages in one go? Currently it accepts only a single language.

You can run OCR with multiple languages at once using the code snippet below:

// Initialize the OCR processor with the path to the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
using (OCRProcessor processor = new OCRProcessor(@"Tesseract Binaries/"))
{
    // Set the OCR languages to process: English and German, joined with "+"
    processor.Settings.Language = "eng+deu";
}

Note: Make sure to include the language data file for each language in the Tessdata folder.

Please download the language data files from the link below:

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-302
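The note about language data files can be checked programmatically. The sketch below assumes Tesseract's usual naming convention, where each language code maps to a file named <code>.traineddata in the data folder; the names OcrLanguageCheck and MissingDataFiles are hypothetical and not part of the Syncfusion API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class OcrLanguageCheck
{
    // Splits a language string such as "eng+deu" and returns the names of
    // the *.traineddata files that are not present in dataDir.
    public static List<string> MissingDataFiles(string dataDir, string languages)
    {
        var missing = new List<string>();
        string[] codes = languages.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string code in codes)
        {
            string fileName = code.Trim() + ".traineddata";
            if (!File.Exists(Path.Combine(dataDir, fileName)))
                missing.Add(fileName);
        }

        return missing;
    }
}
```

Calling MissingDataFiles with the data folder and the same string assigned to processor.Settings.Language gives a readable list of missing files before the native engine is involved.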

Regards,

Sowmiya Loganathan
