What is the accuracy for performing OCR?

Hi,
I am trying to perform OCR, but it is not recognizing the text perfectly. Below is the sample code I am using.

    using (OCRProcessor processor = new OCRProcessor())
    using (FileStream stream = new FileStream("D:/Files/SyncFusionTest/1PageScanPdf.pdf", FileMode.Open, FileAccess.Read))
    {
        // Load the scanned PDF from the input stream.
        PdfLoadedDocument pdfLoadedDocument = new PdfLoadedDocument(stream);
        processor.Settings.Language = Languages.English;
        // Run OCR on the whole document and get the recognized text.
        string text = processor.PerformOCR(pdfLoadedDocument);
        // Save the OCRed document to a new file.
        using (FileStream outputFileStream = new FileStream("D:/Files/SyncFusionTest/1PageScanPdf_Ocr1.pdf", FileMode.Create, FileAccess.ReadWrite))
        {
            pdfLoadedDocument.Save(outputFileStream);
        }
        pdfLoadedDocument.Close(true);
    }

But after OCR, not all of the text is searchable.
For example, the image below contains 8 occurrences of 'Lorem Ipsum', but the search finds only 6. When I checked by pasting the text into Notepad, I saw that the other two 'Lorem' were converted to 'Torem'.
Can someone please help with this?
Image_1726_1716385057812


3 Replies

KS Karmegam Seerangan Syncfusion Team May 23, 2024 05:33 PM UTC

Hi Mohd,

Thank you for reaching out to Syncfusion Support.

 

We are unable to reproduce the reported issue with our test documents. We suspect the issue may be specific to your particular document. We have attached the sample we tested for your reference. Kindly try the sample and let us know the result.

 

Sample: https://www.syncfusion.com/downloads/support/directtrac/general/ze/Perform-OCR-for-the-entire-PDF-document-1207507298

 

If you are still facing issues, we kindly request that you share the input document, package name, package version, and environment details so that we can replicate the issue on our end. This information will help us analyze the problem and provide a prompt solution.


Regards,

Karmegam



MN Mohd Nasir May 24, 2024 07:35 AM UTC

Hi Karmegam,

Thanks for the response.

It's not limited to one file; the same behaviour is seen with multiple files.

One quick example is below.

I have processed the PDF below using the application that you shared above.

Pdf snippet:

Image_8089_1716535240356

Text from SyncFusion Snippet:

Image_8047_1716535219495


Here we can clearly see that the PDF starts with 'Video provide a powerful way to help you......'

but the text we got after OCR is 'Vr'deo orouroesa powerful way to hetp you.......'

Likewise, if we search for this text in the OCRed file, it is not found.
Snippet:

Image_2710_1716535528650

This is happening with most of the files. I am attaching a link to them here so that you can process them.

Link: https://drive.google.com/drive/folders/1sjxrPVZMI3jjQtTgSPYECoD9eUtI_1Rc



Could you please have a look at this and guide us?

Thanks.




KS Karmegam Seerangan Syncfusion Team May 27, 2024 11:44 AM UTC

We are able to reproduce the reported issue with the provided documents on our end. The issue occurs because of the low quality of the images, so we kindly request that you increase the image quality. However, we also have the option to improve the OCR results by using the tessdata_best or tessdata_fast trained data.

 

You can set a custom tessdata path using the TessDataPath property of the OCRProcessor class. We have attached the tessdata_best and tessdata_fast GitHub links for your reference.


GitHub - tesseract-ocr/tessdata_best: Best (most accurate) trained LSTM models.

GitHub - tesseract-ocr/tessdata_fast: Fast integer versions of trained LSTM models

Tessdata-fast: https://www.syncfusion.com/downloads/support/directtrac/general/ze/tessdata-fast187458364
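
As a minimal sketch of the suggestion above, the following shows one way the TessDataPath property might be set before running OCR. It reuses the file paths from the original question; the tessdata folder location is a hypothetical placeholder, and the namespaces and the exact property type are assumptions that should be checked against the installed package version.

```csharp
using System.IO;
using Syncfusion.OCRProcessor;   // assumed namespace for OCRProcessor and Languages
using Syncfusion.Pdf.Parsing;    // assumed namespace for PdfLoadedDocument

class OcrWithCustomTessData
{
    static void Main()
    {
        // Paths from the original question; adjust to your environment.
        string inputPath = "D:/Files/SyncFusionTest/1PageScanPdf.pdf";
        string outputPath = "D:/Files/SyncFusionTest/1PageScanPdf_Ocr1.pdf";

        // Hypothetical folder holding the downloaded tessdata_best (or tessdata_fast)
        // files, for example eng.traineddata for English.
        string tessDataFolder = "D:/Files/tessdata_best";

        using (OCRProcessor processor = new OCRProcessor())
        using (FileStream stream = new FileStream(inputPath, FileMode.Open, FileAccess.Read))
        {
            PdfLoadedDocument document = new PdfLoadedDocument(stream);
            processor.Settings.Language = Languages.English;

            // Point the processor at the custom trained data
            // (assumed here to be a string property).
            processor.TessDataPath = tessDataFolder;

            processor.PerformOCR(document);

            using (FileStream output = new FileStream(outputPath, FileMode.Create, FileAccess.ReadWrite))
            {
                document.Save(output);
            }
            document.Close(true);
        }
    }
}
```

Trained data alone may not fully compensate for low-quality scans, so it is worth comparing the output for both tessdata_best and tessdata_fast against the same input file.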

 

 

Please refer to the below documentation.

Text does not recognize properly when performing OCR on a PDF document with low-quality images

 

Perform OCR on PDF and image files | Syncfusion

 

Kindly try the provided solution and get back to us if you need further assistance.


