Hello,
I am trying to use Syncfusion PDF OCR with Tesseract, but for some PDFs it fails with the following error:
Error in findFileFormatStream: truncated file
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read
[15:17:39 ERR] Failed to scan: Syncfusion.Pdf.PdfException: Failed to load image 'C:\Users\username\AppData\Local\Temp\3ff09db1-1e03-4315-a2bb-e45a35659ce3.png'.
at Syncfusion.OCRProcessor.OCRProcessor.ProcessOCR(String[] args, String[] imagePathList)
at Syncfusion.OCRProcessor.OCRProcessor.GetHOCR(String imagePath, String dataPath, Boolean multiPageTiff, String[] imagePathList)
at Syncfusion.OCRProcessor.OCRProcessor.PerformOCR(PdfLoadedDocument lDoc, Int32 startIndex, Int32 endIndex, String dataPath)
I don't know exactly what causes this error, but as far as I can tell it only happens when the PDF was saved from Word, contains multiple images, and is larger than 30 MB.
Syncfusion version: 20.4.0.38
Thank you in advance!
Hi Fiqi,
Thank you for contacting Syncfusion support.
We have checked the reported issue with our test documents but could not reproduce it. We suspect that the issue is specific to a particular document. Please share the input documents, complete code snippet, package name, and version so that we can check this on our end. This will help us analyze the issue and assist you further.
Regards,
Karmegam
Hi there,
I am having the same issue as above.
I have tried multiple PDF files, with the same result.
I am using an external OCR engine (the sample from GitHub) and can debug the process through the "PerformOCR" function. I can see the PNG file being created in the temp directory and can follow the process through to the end of the PerformOCR function in the IOcrEngine implementation. The error above is thrown when the program returns to the main function that called PerformOCR. At this point, the temp file is deleted by the process.
Below is the code I am running.
using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

using (OCRProcessor processor = new OCRProcessor())
using (FileStream stream = new FileStream("input.pdf", FileMode.Open, FileAccess.Read))
{
    // Load an existing PDF document.
    PdfLoadedDocument lDoc = new PdfLoadedDocument(stream);
    // Set the OCR language.
    processor.Settings.Language = Languages.English;
    // Initialize the Azure Vision external OCR engine (class from the GitHub sample).
    IOcrEngine azureOcrEngine = new AzureExternalOcrEngine();
    processor.ExternalEngine = azureOcrEngine;
    processor.Settings.TempFolder = "C:\\temp\\";
    // Perform OCR. The external engine's PerformOCR runs to completion, but the
    // exception is thrown when control returns here, so the next line is never reached.
    processor.PerformOCR(lDoc);
    // Create the output file stream.
    FileStream outputStream = new FileStream("OCR.pdf", FileMode.CreateNew);
    ....
}
I am using v24.1.41.
Any help would be appreciated.
Cheers
Rhett
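For anyone hitting the same error: the "truncated file" message suggests the temp image is incomplete or empty at the moment it is read. Below is a minimal sketch, using only the .NET base library, for checking whether a PNG on disk at least looks complete. The class name PngCheck and the method LooksComplete are hypothetical helpers written for illustration; they are not part of the Syncfusion API. The check only inspects the PNG signature and the trailing IEND chunk and does not decode the image, so it can rule a file out but cannot prove it is valid.

```csharp
using System;
using System.IO;

// Hypothetical diagnostic helper, not part of any Syncfusion package.
public static class PngCheck
{
    // The fixed 8-byte signature that starts every PNG file.
    private static readonly byte[] Signature =
        { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

    // ASCII "IEND", the type of the last chunk in a well-formed PNG.
    private static readonly byte[] Iend = { 0x49, 0x45, 0x4E, 0x44 };

    // Returns true when the file exists, starts with the PNG signature, and
    // ends with an IEND chunk type followed by its 4-byte CRC.
    public static bool LooksComplete(string path)
    {
        if (!File.Exists(path)) return false;

        byte[] data = File.ReadAllBytes(path);

        // Need at least the signature (8 bytes) plus one IEND chunk (12 bytes).
        if (data.Length < Signature.Length + 12) return false;

        for (int i = 0; i < Signature.Length; i++)
        {
            if (data[i] != Signature[i]) return false;
        }

        // The chunk type sits just before the trailing 4-byte CRC.
        int typeOffset = data.Length - 8;
        for (int i = 0; i < Iend.Length; i++)
        {
            if (data[typeOffset + i] != Iend[i]) return false;
        }

        return true;
    }
}
```

Calling PngCheck.LooksComplete on the temp image while stopped at a breakpoint just before control returns from the external engine should show whether the file is already unusable at that point. Note that File.ReadAllBytes can throw an IOException if another process still holds the file open exclusively, and that a PNG with extra bytes after IEND would be reported as incomplete by this sketch.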
Hi Rhett,
Thank you for contacting Syncfusion support.
We are able to reproduce the reported issue on our end. Currently, we are validating this and will provide further details on January 30th, 2024.
Regards,
Karmegam
The reported issue needs more in-depth research, so we are still looking into it. We will provide further information on February 1st, 2024.
We have confirmed the reported "Failed to load image exception occurs while using External OCR Engine" issue as a defect. We will include this fix in our upcoming weekly NuGet release on February 6th, 2024.
Please use the below feedback link to track the status of the reported issue.
https://www.syncfusion.com/feedback/50500/failed-to-load-image-exception-occurs-while-using-external-ocr-engine
Disclaimer: “Inclusion of this solution in the weekly release may change due to other factors including but not limited to QA checks and works reprioritization.”
Thanks for following up, Karmegam.
I tried the link provided to track the status, but access appears to be restricted:
This private feedback is not associated with your account.
If you believe that this error message is incorrect, please feel free to contact us
Hi Rhett,
We apologize for the inconvenience caused.
We have resolved the access permission issue on our end. We will let you know once the fix is included in the Weekly release.
We have included the fix for the issue “Failed to load image exception occurs while using External OCR Engine” in our latest weekly release (24.2.4). Please download the NuGet package from the link below.
NuGet link: NuGet Gallery | Syncfusion.PDF.OCR.NET 24.2.4