
PDF OCR gives "Attempted to read or write protected memory. This is often an indication that other memory is corrupt."

Thread ID: 150079
Created: Dec 19, 2019 01:55 PM UTC
Updated: Jan 6, 2020 12:02 PM UTC
Platform: WinForms
Replies: 6
Tags: PDF
jpalo
Asked On December 19, 2019 01:55 PM UTC

I have a very simple WinForms app with a file path text box, an output text box and a button. I am testing the PDF OCR capabilities, but processor.PerformOCR throws this error:

Unhandled Exception: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at Syncfusion.OCRProcessor.Native.OCRApi.InitializeDataPath(IntPtr pt, String path, String lang)
   at Syncfusion.OCRProcessor.OCRProcessor.DoOCR(String[] args)
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
   at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at Program.Main(String[] args)

This is all the code there is:
try
{
    SyncfusionLicenseProvider.RegisterLicense("VALID LICENSE KEY HERE");

    //Initialize the OCR processor by providing the path of tesseract binaries(SyncfusionTesseract.dll and liblept168.dll)
    using (OCRProcessor processor = new OCRProcessor(@"C:\Temp\TesseractBinaries\3.02\"))
    {
        //Load a PDF document
        PdfLoadedDocument lDoc = new PdfLoadedDocument(file.Text);

        //Set OCR language to process
        processor.Settings.Language = Languages.English;
        //Process OCR by providing the PDF document and Tesseract data
        output.Text = processor.PerformOCR(lDoc, @"C:\Temp\TesseractData");
        //Close the document and release its resources (the processed document is not saved to disk here)
        lDoc.Close(true);
    }
}
catch (Exception ex)
{
    output.Text = ex.Message;
}

TesseractBinaries does contain the required *.dll files, TesseractData contains the *.traineddata files, and the project itself has NuGet references to the Syncfusion packages. The actual .sln is attached.

I originally tested this on Windows Server 2012 R2 inside a SharePoint 2016 event handler, and it gave the same error there as it now does on my local machine (Windows 10) in a test WinForms application.

Attachment: OCRTester_f653dc74.zip
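
For anyone reproducing this setup, here is a minimal sketch of a pre-flight check that can be run before constructing the OCRProcessor. It only confirms that the files described above are present on disk; it does not explain or fix the AccessViolationException. OcrSetupCheck and FindMissingFiles are hypothetical names, not part of the Syncfusion API, and the two DLL names are taken from the comment in the code above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the Syncfusion API.
public static class OcrSetupCheck
{
    // Native binaries named in the code comment of the original post.
    private static readonly string[] RequiredBinaries =
    {
        "SyncfusionTesseract.dll",
        "liblept168.dll"
    };

    // Returns a description of every expected file that could not be found.
    public static List<string> FindMissingFiles(string binariesDir, string dataDir)
    {
        var missing = new List<string>();

        foreach (string name in RequiredBinaries)
        {
            string path = Path.Combine(binariesDir, name);
            if (!File.Exists(path))
                missing.Add(path);
        }

        // The data folder should hold at least one *.traineddata file.
        bool hasData = Directory.Exists(dataDir)
            && Directory.EnumerateFiles(dataDir, "*.traineddata").Any();
        if (!hasData)
            missing.Add(dataDir + " (no *.traineddata files)");

        return missing;
    }
}
```

With the paths from the post, the call would be OcrSetupCheck.FindMissingFiles(@"C:\Temp\TesseractBinaries\3.02\", @"C:\Temp\TesseractData"). An empty list means only that the files exist, not that they match the Tesseract version in use.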

Sowmiya Loganathan [Syncfusion]
Replied On December 20, 2019 11:56 AM UTC

Hi Jussi, 
 
Thank you for contacting Syncfusion support.  
 
We tried the provided sample on our end, but we regret to let you know that we were unable to reproduce the reported issue. Please find the modified sample below.
 
 
We suspect that the issue is specific to the document. Could you please share the input PDF document so that we can replicate the issue? It will be helpful for further analysis and for providing a better solution.
 
Regards, 
Sowmiya Loganathan 


jpalo
Replied On December 20, 2019 12:10 PM UTC

I'd be happy to share it via email, as the documents are not public.

I can share this one sample.pdf here, though. It does not throw the exception, but OCR does not find any text in it.

Attachment: sample_968a931e.zip

Preethi Nesakkan Gnanadurai [Syncfusion]
Replied On December 23, 2019 11:37 AM UTC

Hi Jussi, 
  
We have created an incident under your Direct-Trac account. Kindly share your documents in the ticket.
  
Regards, 
Preethi 


Prakash Viswanathan [Syncfusion]
Replied On December 23, 2019 12:08 PM UTC

Hi Jussi, 

The Syncfusion OCR processor only recognizes text from images in the PDF document. The provided document does not contain any images; it contains only text. That is why the OCR processor does not find any text in it. Kindly try OCR processing on a PDF document that contains images.

If you need to get the text from a PDF document, you can use the text extraction functionality. Please refer to the link below for more information.
 
We have created an incident under your Direct-Trac account. Kindly share your documents in the ticket.
 
Regards, 
Prakash V 


jpalo
Replied On January 3, 2020 10:06 AM UTC

Thank you. I combined OCR and PDF text extraction to support PDFs with both images and text. However, I have two issues:

1. Why is the text extracted with lots of random line breaks here and there, like this:

(B)
 
Compress
ed Air (kgf/cm
2
)

whereas opening the PDF in Adobe Acrobat, selecting the text, and copying and pasting it here gives no line breaks:
(B) Compressed Air (kgf/cm2)

Same as image copied from the document:

The issue is not related to the text being extracted from a table, as it also occurs on text outside tables. For example, one line outside a table is extracted as:
TEST DATE: J
AN
.
22
.2019~
F
EB
.1.2019

2. Is it possible to OCR multiple languages in one go? Currently it accepts only a single language.

Sowmiya Loganathan [Syncfusion]
Replied On January 6, 2020 12:02 PM UTC

Hi Jussi, 

Why text is extracted with (lots of) random line breaks here and there, like this: 
(B)
 
Compress
ed Air (kgf/cm
2
)
 
 
as when opening the PDF with Adobe Acrobat, and selecting text, and copy&pasting it here is without any line breaks: 
(B) Compressed Air (kgf/cm2) 

We use the Tesseract engine to perform OCR on PDF documents. The Tesseract engine processes the PDF document word by word, so the result depends on how the content is preserved in the PDF. Because of this, the extracted text breaks at seemingly random places, and this is the current behavior.

Please let us know if you have any concerns on this. 
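
As a possible workaround on the caller's side, here is a minimal post-processing sketch. For the two samples quoted in this thread, simply removing the line breaks and concatenating the fragments reproduces the text that Acrobat shows. ExtractedTextCleanup and RemoveLineBreaks are hypothetical names, not part of the Syncfusion API, and the approach also removes genuine line breaks, so it suits short fields better than whole pages.

```csharp
using System;

// Hypothetical helper, not part of the Syncfusion API.
public static class ExtractedTextCleanup
{
    // Removes every carriage return and line feed, so the fragments
    // are concatenated exactly as they were extracted.
    public static string RemoveLineBreaks(string text)
    {
        if (text == null)
            throw new ArgumentNullException(nameof(text));

        return text.Replace("\r", "").Replace("\n", "");
    }
}
```

Assuming the otherwise blank second line of the first sample holds a single space, RemoveLineBreaks("(B)\n \nCompress\ned Air (kgf/cm\n2\n)") gives "(B) Compressed Air (kgf/cm2)". If that line is actually empty, the space after "(B)" is lost.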
Is it possible to OCR multiple languages with one go? Now it just accepts single language. 
You can process OCR with multiple languages at once using the code snippet below:

//Initialize the OCR processor by providing the path of tesseract binaries(SyncfusionTesseract.dll and liblept168.dll) 
using (OCRProcessor processor = new OCRProcessor(@"Tesseract Binaries/")) 
{ 
    //Set OCR language to process 
    processor.Settings.Language = "eng+deu";
    //Load the PDF document and call processor.PerformOCR as before
}

Note: Make sure the language data file for each language is present in the tessdata folder.
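
The note above can be checked in code before running OCR. This is a minimal sketch, assuming the data files follow Tesseract's usual <code>.traineddata naming (eng.traineddata, deu.traineddata). LanguageDataCheck and FindMissingLanguageData are hypothetical names, not part of the Syncfusion API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the Syncfusion API.
public static class LanguageDataCheck
{
    // Splits a language string such as "eng+deu" and returns the path of
    // every <code>.traineddata file that is missing from the data folder.
    public static List<string> FindMissingLanguageData(string languages, string tessdataDir)
    {
        var missing = new List<string>();
        string[] codes = languages.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string code in codes)
        {
            string file = Path.Combine(tessdataDir, code.Trim() + ".traineddata");
            if (!File.Exists(file))
                missing.Add(file);
        }

        return missing;
    }
}
```

For example, FindMissingLanguageData("eng+deu", @"C:\Temp\TesseractData") returns an empty list only when both eng.traineddata and deu.traineddata are in that folder.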

Please download the language data files from the link below.



Regards, 
Sowmiya Loganathan 

