Dear Team,
I'm trying to perform OCR on an image file and extract tabular data, preferably preserving the layout. However, the extracted (OCRed) text does not match the content of the image. When I convert the image into a searchable PDF and then extract the text, the result is still garbled.
I'm attaching the image file, the extracted PDF, the OCRed text (from the image), and the sample code.
I'm trying to use Tesseract version 4.0, but the binary file path is not being picked up. It works only if I provide the x86 path. With the x64 path, it throws the exception "Exception has been thrown by the target of an invocation", even though the file exists.
I would like to know if I'm missing anything in the setup.
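One possible cause of the x64 failure (an assumption, not something confirmed in this thread) is a bitness mismatch: a 32-bit process cannot load 64-bit native binaries, and the resulting load error is often wrapped in a generic invocation exception. The sketch below is a standalone diagnostic, not part of any Syncfusion API. It reads the PE/COFF header of a DLL and reports which architecture it was built for; the function names are hypothetical.

```python
import struct

# Machine-type codes defined by the PE/COFF file format.
MACHINE_NAMES = {0x014C: "x86", 0x8664: "x64", 0xAA64: "ARM64"}

def pe_machine(data: bytes) -> str:
    """Return the target architecture recorded in a PE image's COFF header."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # Offset 0x3C of the DOS header stores the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # The 2-byte Machine field immediately follows the 4-byte signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, hex(machine))

def pe_machine_of_file(path: str) -> str:
    """Read the start of a file on disk and report its architecture."""
    with open(path, "rb") as f:
        return pe_machine(f.read(4096))
```

Running `pe_machine_of_file` on each native DLL in the x64 folder and comparing the result with the bitness of the running application (for example, whether the project is built with "Prefer 32-bit" enabled) would show whether the two agree.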
Syncfusion version used: 28.1.35 (WinForms application)
Thanks
Santhosh
Attachment: OCRed_text_Sample_d29b408a.zip
Hi Santhosh,
We have checked the reported issue on our end and were able to reproduce it with the provided details: the resulting OCRed text was garbled. We are currently working on this and will provide a further update on December 24th, 2024.
Regards,
Arumugam M
Hi Santhosh,
On further analysis: we internally use Google's Tesseract engine to recognize text from scanned documents and images. We tested the provided image directly in the Tesseract engine, and the engine itself does not return the expected results. To improve text recognition accuracy, we have incorporated external image processing support through our OCR library. However, some text is still not recognized accurately, and because the limitation lies in the Tesseract engine itself, we are unable to proceed further.
Additionally, please note that we do not have direct support for extracting table data from a PDF document. Instead, you can extract the text from the PDF document using our ExtractText API. Please refer to the following UG documentation on extracting text from a PDF document.
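As a generic illustration of the kind of image preprocessing that can help an OCR engine, here is a minimal sketch of Otsu binarization, which picks a global threshold separating dark text from a light background. This is a swapped-in textbook technique, not a description of what the Syncfusion library does internally, and the function names are hypothetical. It operates on a flat sequence of 8-bit grayscale values.

```python
from typing import List, Sequence

def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 0-255 threshold that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of the values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the dark class yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]
```

A global threshold like this tends to work on evenly lit scans; unevenly lit photographs usually need an adaptive (local) threshold instead.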
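Since there is no direct table extraction, one common workaround is to rebuild rows from the positions of the extracted words. The sketch below assumes you can obtain each word together with the top-left corner of its bounding box (for example, from OCR layout results). The `Word` type, the `group_into_rows` function, and the `y_tolerance` parameter are hypothetical names for illustration and are not part of any Syncfusion API.

```python
from typing import List, NamedTuple

class Word(NamedTuple):
    text: str
    x: float  # left edge of the word's bounding box
    y: float  # top edge of the word's bounding box

def group_into_rows(words: List[Word], y_tolerance: float = 5.0) -> List[List[str]]:
    """Group words into table rows by vertical position, then order each row left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # A word joins the current row if it is within y_tolerance of that row's first word.
        if rows and abs(word.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]

words = [
    Word("Qty", 120, 10.5), Word("Item", 10, 10),
    Word("2", 121, 31), Word("Apple", 10, 30),
]
table = group_into_rows(words)
```

Here `table` holds one list per row, with cells ordered left to right. Multi-word cells and skewed scans would need additional handling, such as clustering by x position to detect column boundaries.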
Working with Text Extraction | Syncfusion
You can also refer to the following link for further information.
9 Types of Useful Data You Can Extract from a PDF Using C# | Syncfusion Blogs
Also, we have logged this as a feature request on our end for extracting table data from the PDF document. You can track the status of this feature using the following feedback link.
Regards,
Arumugam M