OCR Fails on Some PDFs

Hello,


I am trying to use Syncfusion PDF OCR with Tesseract, but for some PDFs it fails with the following error:


Error in findFileFormatStream: truncated file

Error in pixReadStream: Unknown format: no pix returned

Error in pixRead: pix not read

[15:17:39 ERR] Failed to scan: Syncfusion.Pdf.PdfException: Failed to load image 'C:\Users\username\AppData\Local\Temp\3ff09db1-1e03-4315-a2bb-e45a35659ce3.png'.

   at Syncfusion.OCRProcessor.OCRProcessor.ProcessOCR(String[] args, String[] imagePathList)

   at Syncfusion.OCRProcessor.OCRProcessor.GetHOCR(String imagePath, String dataPath, Boolean multiPageTiff, String[] imagePathList)

   at Syncfusion.OCRProcessor.OCRProcessor.PerformOCR(PdfLoadedDocument lDoc, Int32 startIndex, Int32 endIndex, String dataPath)


I don't know exactly what causes this error, but as far as I can tell it only happens when the PDF was saved from Word, contains multiple images, and is larger than 30 MB.
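
A note for anyone diagnosing this: the Leptonica message "truncated file" suggests the temporary PNG was incomplete or unreadable when the OCR engine tried to load it. Below is a minimal C# sketch, not part of the Syncfusion or Tesseract APIs, that checks whether a file starts with the PNG signature and ends with the standard IEND chunk. The names PngCheck and LooksComplete are hypothetical, and the check is only a heuristic: a valid PNG with extra bytes after IEND would be reported as incomplete.

```csharp
using System.IO;

// Hypothetical diagnostic helper; not part of Syncfusion or Tesseract.
public static class PngCheck
{
    // The 8-byte signature that starts every PNG file.
    private static readonly byte[] Signature =
        { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

    // The IEND chunk that normally ends a PNG:
    // length 0, type "IEND", CRC AE 42 60 82.
    private static readonly byte[] IendChunk =
        { 0x00, 0x00, 0x00, 0x00, 0x49, 0x45, 0x4E, 0x44, 0xAE, 0x42, 0x60, 0x82 };

    // Returns true when the bytes begin with the PNG signature
    // and end with the IEND chunk.
    public static bool LooksComplete(byte[] data)
    {
        if (data == null || data.Length < Signature.Length + IendChunk.Length)
            return false;

        return Matches(data, 0, Signature)
            && Matches(data, data.Length - IendChunk.Length, IendChunk);
    }

    // Convenience overload: reads the whole file, then applies the same check.
    // Returns false if the file does not exist.
    public static bool LooksComplete(string path)
    {
        return File.Exists(path) && LooksComplete(File.ReadAllBytes(path));
    }

    // Compares expected.Length bytes of data, starting at offset, to expected.
    private static bool Matches(byte[] data, int offset, byte[] expected)
    {
        for (int i = 0; i < expected.Length; i++)
        {
            if (data[offset + i] != expected[i])
                return false;
        }
        return true;
    }
}
```

Calling PngCheck.LooksComplete with the path from the exception message, while the temp file still exists, would show whether the image was fully written. A false result would point at image generation or early deletion rather than at the OCR step itself.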


Syncfusion Version : 20.4.0.38


Thank you in advance!


8 Replies

KS Karmegam Seerangan Syncfusion Team December 20, 2023 01:59 PM UTC

Hi Fiqi,


Thank you for contacting Syncfusion support.


We have checked the reported issue with our testing documents, but we didn't encounter any issues. We suspect that the reported issue may occur for a particular document. We kindly request you to share the input documents, complete code snippet, package name, and version to check this on our end. This will be helpful for us to analyze and assist you further on this.


Regards,

Karmegam




RC Rhett Curran January 25, 2024 06:02 AM UTC

Hi there,


I am having the same issue as above.


I have tried multiple PDF files, with the same result.


I am using an external OCR engine (the sample from GitHub) and can debug the process through the PerformOCR function. I can see the PNG file being created in the temp directory and can follow the process through to the end of the PerformOCR function in the IOcrEngine implementation. The error above is thrown when the program returns to the main function that called PerformOCR. At this point, the temp file is deleted by the process.


Below is the code I am running.


using (OCRProcessor processor = new OCRProcessor())
{
    //Load an existing PDF document.
    FileStream stream = new FileStream("input.pdf", FileMode.Open, FileAccess.Read);
    PdfLoadedDocument lDoc = new PdfLoadedDocument(stream);

    //Set OCR language.
    processor.Settings.Language = Languages.English;

    //Initialize the Azure vision OCR external engine.
    IOcrEngine azureOcrEngine = new AzureExternalOcrEngine();
    processor.ExternalEngine = azureOcrEngine;

    processor.Settings.TempFolder = "C:\\temp\\";

    //Perform OCR.
    processor.PerformOCR(lDoc); // --> This line runs in the new interface but fails when it returns to this function and will not go to the next line.

    //Create file stream.
    FileStream outputStream = new FileStream("OCR.pdf", FileMode.CreateNew);

    ....
}

I am using v24.1.41.

Any help would be appreciated.

Cheers


Rhett




KS Karmegam Seerangan Syncfusion Team January 25, 2024 10:11 AM UTC

Hi Rhett,


Thank you for contacting Syncfusion support.

We are able to reproduce the reported issue on our end. Currently, we are validating this and will provide further details on January 30th, 2024.


Regards,

Karmegam



KS Karmegam Seerangan Syncfusion Team January 31, 2024 04:36 AM UTC

The reported issue still needs more in-depth analysis, so we are continuing to look into it. We will provide further details on February 1st, 2024.



KS Karmegam Seerangan Syncfusion Team February 1, 2024 03:57 PM UTC

We have confirmed the reported "Failed to load image exception occurs while using External OCR Engine" issue as a defect. We will include this fix in our upcoming weekly NuGet release on February 6th, 2024.

Please use the below feedback link to track the status of the reported issue.
https://www.syncfusion.com/feedback/50500/failed-to-load-image-exception-occurs-while-using-external-ocr-engine 

Disclaimer: “Inclusion of this solution in the weekly release may change due to other factors including but not limited to QA checks and works reprioritization.”



RC Rhett Curran February 2, 2024 02:44 AM UTC

Thanks for following up, Karmegam.


I tried the link provided to track the status, but access appears to be restricted.


Access Denied

This private feedback is not associated with your account.

If you believe that this error message is incorrect, please feel free to contact us



SG Sivaram Gunabalan Syncfusion Team February 2, 2024 10:49 AM UTC

Hi Rhett,

We apologize for the inconvenience caused.

We have resolved the access permission issue on our end. We will let you know once the fix is included in the Weekly release.



KS Karmegam Seerangan Syncfusion Team February 6, 2024 11:34 AM UTC

We have included the fix for the "Failed to load image exception occurs while using External OCR Engine" issue in our latest weekly release (24.2.4). Please download the NuGet package from the link below.

NuGet link: NuGet Gallery | Syncfusion.PDF.OCR.NET 24.2.4

