OCR from image returns garbled text

2 Replies
2 Participants

Created by
SK Santhosh Kumar

Platform
WinForms

Control
PDF

Created On
Dec 19, 2024 09:37 AM UTC

Last Activity On
Dec 24, 2024 02:49 PM UTC

Dear Team,

I'm trying to run OCR on an image file and extract tabular data, preferably preserving the layout. However, the extracted (OCRed) text does not match the text in the image. When I convert the image into a searchable PDF and then extract the text, the result is still garbled.

I'm attaching the image file, the extracted PDF, the OCRed text (from the image), and the sample code.

I'm trying to use Tesseract version 4.0, but the binaries path is not being picked up. It works only if I provide the x86 path. With the x64 path, it throws the exception "Exception has been thrown by the target of an invocation", even though the file exists.

I would like to know if I'm missing anything in the setup.

Syncfusion version used: 28.1.35 (WinForms application)

Thanks

Santhosh

Attachment: OCRed_text_Sample_d29b408a.zip

2 Replies

AM Arumugam Muppidathi Syncfusion Team December 20, 2024 12:54 PM UTC

Hi Santhosh,

We have checked the reported issue on our end. We were able to reproduce it with the provided details, and the resulting OCRed text was garbled. We are currently working on this and will share further details on December 24, 2024.

Regards,

Arumugam M

AM Arumugam Muppidathi Syncfusion Team December 24, 2024 02:49 PM UTC

Hi Santhosh,

Upon further analysis: we internally use Google’s Tesseract engine to recognize text from scanned documents and images. We tested the provided image directly in the Tesseract engine, and it does not return the expected results there either. To improve recognition accuracy, our OCR library supports external image processing. However, some text in this image is still not recognized accurately, so we are unable to proceed further.

Additionally, please note that we do not have direct support for extracting table data from a PDF document. Instead, you can extract the text from the PDF document using our ExtractText API. Please refer to the following user guide (UG) documentation for extracting text from a PDF document.

Working with Text Extraction | Syncfusion

You can also refer to the following link for further information.

9 Types of Useful Data You Can Extract from a PDF Using C# | Syncfusion Blogs

Also, we have logged a feature request for extracting table data from the PDF document. You can track its status using the following feedback link.

Add support to extract table data from the PDF document in ASP.NET Core | Feedback Portal (syncfusion.com)
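Until table extraction is supported directly, one possible workaround is to rebuild rows from word positions yourself. The following is a minimal sketch, not a Syncfusion API: `WordBox`, `TableSketch`, and `rowTolerance` are hypothetical names, and it assumes your extraction step can supply each word's text together with its left and top coordinates. Words whose top edges lie within the tolerance are treated as one row, and each row is then ordered left to right.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one recognized word and its position.
public sealed class WordBox
{
    public string Text;
    public double X; // left edge
    public double Y; // top edge

    public WordBox(string text, double x, double y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class TableSketch
{
    // Groups words into rows by vertical position, then sorts each row by X.
    // A word joins the current row when its top edge is within rowTolerance
    // of the first word in that row; otherwise it starts a new row.
    public static List<List<string>> ToRows(IEnumerable<WordBox> words, double rowTolerance)
    {
        var rows = new List<List<WordBox>>();
        foreach (var word in words.OrderBy(b => b.Y))
        {
            var current = rows.Count > 0 ? rows[rows.Count - 1] : null;
            if (current != null && Math.Abs(word.Y - current[0].Y) <= rowTolerance)
                current.Add(word);
            else
                rows.Add(new List<WordBox> { word });
        }

        return rows
            .Select(r => r.OrderBy(b => b.X).Select(b => b.Text).ToList())
            .ToList();
    }
}
```

This only recovers rows. Splitting rows into columns would need an extra step (for example, clustering the X coordinates), and garbled OCR output will stay garbled regardless of how it is grouped.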

Regards,

Arumugam M
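
On the x64 path exception from the original question: "Exception has been thrown by the target of an invocation" is the generic message of .NET's TargetInvocationException, and the real cause is carried in its InnerException. One common cause, which this thread does not confirm, is a bitness mismatch: a 32-bit process cannot load 64-bit native DLLs, and a .NET Framework WinForms project built as AnyCPU with "Prefer 32-bit" enabled runs as 32-bit. The sketch below picks a binaries folder that matches the running process and unwraps the exception to show the underlying error. `TesseractPathHelper` and the "x86"/"x64" folder names are assumptions for illustration; use the layout of your own Tesseract binaries.

```csharp
using System;
using System.IO;
using System.Reflection;

public static class TesseractPathHelper
{
    // Returns the binaries folder for the given bitness.
    // The "x86" and "x64" subfolder names are assumed, not taken from the thread.
    public static string SelectBinariesPath(string baseDir, bool is64BitProcess)
    {
        return Path.Combine(baseDir, is64BitProcess ? "x64" : "x86");
    }

    // Convenience overload that uses the bitness of the running process.
    public static string SelectBinariesPath(string baseDir)
    {
        return SelectBinariesPath(baseDir, Environment.Is64BitProcess);
    }

    // Strips reflection and type-initializer wrappers to reach the real error.
    public static Exception Unwrap(Exception ex)
    {
        while ((ex is TargetInvocationException || ex is TypeInitializationException)
               && ex.InnerException != null)
        {
            ex = ex.InnerException;
        }
        return ex;
    }
}

// Possible usage around the OCR call (left as a comment; the OCR call itself is yours):
// try { /* run OCR with TesseractPathHelper.SelectBinariesPath(baseDir) */ }
// catch (Exception ex)
// {
//     Exception root = TesseractPathHelper.Unwrap(ex);
//     Console.WriteLine(root.GetType().FullName + ": " + root.Message);
// }
```

If the unwrapped exception turns out to be a BadImageFormatException or DllNotFoundException, that points toward a bitness or missing native dependency problem rather than a missing file.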
