Hi,
I am trying to perform OCR, but the text is not being recognized correctly. Below is the sample code I am using.
using (OCRProcessor processor = new OCRProcessor())
{
    // Load the scanned PDF.
    FileStream stream = new FileStream("D:/Files/SyncFusionTest/1PageScanPdf.pdf", FileMode.Open, FileAccess.Read);
    PdfLoadedDocument pdfLoadedDocument = new PdfLoadedDocument(stream);
    // Run OCR with the English language data.
    processor.Settings.Language = Languages.English;
    string text = processor.PerformOCR(pdfLoadedDocument);
    // Save the OCRed document to a new file.
    using (FileStream outputFileStream = new FileStream("D:/Files/SyncFusionTest/1PageScanPdf_Ocr1.pdf", FileMode.Create, FileAccess.ReadWrite))
    {
        pdfLoadedDocument.Save(outputFileStream);
    }
    pdfLoadedDocument.Close(true);
}
But after OCR, searching the output PDF does not find all of the text.
For example, the image below contains 8 occurrences of 'Lorem Ipsum', but search finds only 6. When I pasted the text into Notepad to check, I saw that the other two occurrences of 'Lorem' had been converted to 'Torem'.
Can someone help with this, please?
Hi Mohd,
Thank you for reaching out to Syncfusion Support.
We are unable to reproduce the reported issue with our testing documents. We suspect the reported issue may occur for the particular document. However, we have attached the tested sample for your reference. Kindly try the sample and let us know the result.
If you are still facing issues, we kindly request you to share the input document, package name, package version, and environment details so that we can replicate the issue on our end. This information will help us analyze the problem and provide you with a prompt solution.
Regards,
Karmegam
Hi Karmegam,
Thanks for the response.
It is not limited to one file; the same behaviour is seen with multiple files.
One quick example is below.
I processed the PDF below using the application that you shared above.
Pdf snippet:
Text from SyncFusion Snippet:
Here we can clearly see that the PDF starts with 'Video provide a powerful way to help you......'
but the text we got after OCR is 'Vr'deo orouroesa powerful way to hetp you.......'
As a result, if we search for this text, we will not find it in the OCRed file.
Snippet:
This is happening with most of the files. I am attaching a link to them here so that you can process them.
Link: https://drive.google.com/drive/folders/1sjxrPVZMI3jjQtTgSPYECoD9eUtI_1Rc
Could you please have a look at this and guide us?
Thanks.
We are able to reproduce the reported issue with the provided documents on our end. The issue occurs because the images in the documents are of low quality, so we kindly request that you increase the image quality. Alternatively, you can improve the recognition results by using the tessdata-best or tessdata-fast trained data.
You can assign a custom tessdata path using the TessDataPath property of the OCRProcessor class. The GitHub links for tessdata-best and tessdata-fast are given below for your reference.
GitHub - tesseract-ocr/tessdata_best: Best (most accurate) trained LSTM models.
GitHub - tesseract-ocr/tessdata_fast: Fast integer versions of trained LSTM models
Tessdata-fast: https://www.syncfusion.com/downloads/support/directtrac/general/ze/tessdata-fast187458364
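For illustration, here is a minimal sketch of how the original sample might look with a custom tessdata folder. It assumes that TessDataPath is a settable string property on OCRProcessor, as described above, and that eng.traineddata from tessdata_best has been downloaded into the folder shown. The folder path and the output file name are hypothetical, and the exact property usage may differ between package versions.

```csharp
using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

class Program
{
    static void Main()
    {
        // Hypothetical folder that holds eng.traineddata from tessdata_best.
        const string tessDataFolder = "D:/Files/SyncFusionTest/tessdata_best/";

        using (OCRProcessor processor = new OCRProcessor())
        using (FileStream input = new FileStream("D:/Files/SyncFusionTest/1PageScanPdf.pdf", FileMode.Open, FileAccess.Read))
        {
            PdfLoadedDocument document = new PdfLoadedDocument(input);

            // Assumption: TessDataPath points the processor at the custom trained data.
            processor.TessDataPath = tessDataFolder;
            processor.Settings.Language = Languages.English;

            // Run OCR; the recognized text is returned and also embedded in the PDF.
            string text = processor.PerformOCR(document);

            // Hypothetical output name, chosen so the earlier result is not overwritten.
            using (FileStream output = new FileStream("D:/Files/SyncFusionTest/1PageScanPdf_Ocr_best.pdf", FileMode.Create, FileAccess.ReadWrite))
            {
                document.Save(output);
            }
            document.Close(true);
        }
    }
}
```

Comparing the returned text with the earlier output (for example, checking whether 'Torem' still appears) is a quick way to see whether the alternative trained data helps for a given scan.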
Please refer to the below documentation.
Text does not recognize properly when performing OCR on a PDF document with low-quality images
Perform OCR on PDF and image files | Syncfusion
Kindly try the provided solution and get back to us if you need further assistance.