OCR only works for sample doc (winforms)

Good day,

I have tried out the simple sample provided for OCR on WinForms. The sample works fine. However, when I use any PDF other than the provided sample.pdf, the text comes back null. It recognizes the correct number of pages, but the text is null on all pages.

Is there any known reason why this would only work for the sample doc?

Here is the code:

    openFileDialog1.ShowDialog();

    string filePath = openFileDialog1.FileName;

    // Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
    using (OCRProcessor processor = new OCRProcessor(@"../../Data/Tesseract binaries/"))
    {
        // Load a PDF document
        PdfLoadedDocument lDoc = new PdfLoadedDocument(filePath);

        // Set OCR language to process
        processor.Settings.Language = Languages.English;

        // Process OCR by providing the PDF document and Tesseract data
        OCRLayoutResult result;
        processor.PerformOCR(lDoc, @"../../Data/Tessdata/", out result);

        // This errors because the text returns null on all PDFs other than the sample
        MessageBox.Show(result.Pages[0].Lines[0].Text);

        // Save the OCR processed PDF document to disk
        lDoc.Save("Sample.pdf");

        lDoc.Close(true);

        System.Diagnostics.Process.Start("Sample.pdf");
    }


Thank you!

4 Replies 1 reply marked as answer

GK Gowthamraj Kumar Syncfusion Team March 24, 2021 02:12 PM UTC

Hi Travis, 
 
Thank you for contacting Syncfusion support. 
 
Syncfusion’s Essential PDF (.NET PDF library) supports OCR by using the Tesseract open-source engine. A scanned paper document containing raster images is converted to a searchable and selectable document with a few lines of code. We suspect that the other PDF document does not contain any scanned images, which is why it returns empty or null text. Please check whether the PDF document contains a raster image. Please refer to the documentation link below for more information.

Note: The OCR process returns text only when the PDF document contains a scanned image. Otherwise, it does not return any text.

If you are still facing the same issue, please provide more details, such as the input PDF document, the complete code snippet or sample, and the product version, so that we can check the issue on our end and assist you further.
 
Regards, 
Gowthamraj K 



TC Travis Chambers March 24, 2021 02:43 PM UTC

Thank you for your feedback! I suppose this does make some sense. The PDF documents I am using are predominantly images created with the Microsoft Snipping Tool or images of documents taken with a cell phone camera. None of these have any text.

That said, I have tested other Tesseract wrappers and they are all able to read these same PDFs with great accuracy, so I find that confusing.

As a workaround, I have tried to use the Syncfusion OCR tool directly on bitmap images of the documents. When I do this it does successfully find text, but the quality of the output is pretty terrible. Whereas these same images result in roughly 90% accuracy in other Tesseract wrappers, not a single word comes out correctly here. Most other OCR tools include options like "EnhanceImage" or "Sharpen" that help to boost performance. Is there any such option here?

Thank you.


GK Gowthamraj Kumar Syncfusion Team March 25, 2021 02:18 PM UTC

Hi Travis, 
 
We are currently analyzing your requirement on our end and will provide further details on March 29, 2021.
 
Regards, 
Gowthamraj K 



GK Gowthamraj Kumar Syncfusion Team March 29, 2021 02:11 PM UTC

Hi Travis, 
 
Thank you for your patience. 
 
We have analyzed the OCR conversion with all of the test documents on our end, and the quality and accuracy of the output are good. We suspect that the issue may be specific to that particular document. OCR can also be performed using Tesseract 4.0. The TesseractVersion property is used to switch the Tesseract version, and version 4.0 gives more accurate results. By default, OCR is performed with Tesseract version 3.02. There is also an option to set different performance levels on the OCRProcessor using the Performance enumeration.
 
You must use the pre-built Syncfusion Tesseract 4.0 binaries in the project to run the OCR properly. We have attached the sample and output document for your reference. Please try the sample with your input document on your end and let us know the result.
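
As a rough illustration of the settings described above, a minimal sketch of switching to Tesseract 4.0 might look like the following. The folder paths and file names are hypothetical placeholders. The Performance member names (Rapid, Fast, Slow) are given as recalled from the Syncfusion documentation and should be checked against the installed product version. The sketch also assumes that the tessdata folder holds language data matching Tesseract 4.0.

```csharp
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

class Tesseract4Sketch
{
    static void Main()
    {
        // Hypothetical folder containing the pre-built Syncfusion Tesseract 4.0 binaries.
        using (OCRProcessor processor = new OCRProcessor(@"../../Data/Tesseract binaries/4.0/"))
        {
            // Hypothetical input file: a PDF whose pages are scanned (raster) images.
            PdfLoadedDocument lDoc = new PdfLoadedDocument("Input.pdf");

            processor.Settings.Language = Languages.English;

            // Switch from the default engine (3.02) to Tesseract 4.0.
            processor.Settings.TesseractVersion = TesseractVersion.Version4_0;

            // Assumed trade-off: Slow favors accuracy over speed; verify for your version.
            processor.Settings.Performance = Performance.Slow;

            // Hypothetical tessdata folder; assumed to contain 4.0-compatible language data.
            processor.PerformOCR(lDoc, @"../../Data/Tessdata/4.0/");

            // Save the OCR processed document and release the loaded PDF.
            lDoc.Save("Output.pdf");
            lDoc.Close(true);
        }
    }
}
```

If the accuracy is still poor with these settings, comparing the output for the same input across the performance levels may help narrow down whether the problem lies in the engine configuration or in the document itself.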
 
 
Please refer to the link below for more information.

If you are still facing the same issue, please provide more details, such as the input PDF document, the resultant documents (from other OCR tools), the modified sample, and the product version, so that we can check the issue on our end and assist you further.
 
Regards, 
Gowthamraj K 


Marked as answer