OCR Sample doesn't seem to work. Required Tesseract version?

I'm trying to test out the OCR sample (with 18.4.0.39) and I'm not getting any results, nor any errors.

Q1.  The inline doc for PerformOCR indicates that the return value is the OCR'ed text.   But it is either blank or null (if I pass in garbage for the Tesseract Binary path).   Is this really where the results are returned?   

Q2.  The samples show a PDF (input.pdf) being input, then a stream saved as Output.pdf, with the PerformOCR return value ignored.   When I do this, the output seems identical to the input (which is fine), but I'm not seeing any differences.   What should I be seeing?

Q3.  Other samples refer to various Tesseract versions, but the most recent version of OCRSettings doesn't support these settings.   I have the 5.0 Alpha version installed, as various sources said it was essentially identical to 4.x.   If a specific version of Tesseract is required, that would be very helpful to know.

Thanks,
James

    <PackageReference Include="Syncfusion.Pdf.Net.Core" Version="18.4.0.43" />
    <PackageReference Include="Syncfusion.PDF.OCR.Net.Core" Version="18.4.0.39" />

5 Replies 1 reply marked as answer

SL Sowmiya Loganathan Syncfusion Team February 22, 2021 12:25 PM UTC

Hi James,   
 
Thank you for contacting Syncfusion support.   
 
We have analyzed the reported issue “OCR’ed text is not returned properly while processing OCR on the ASP.NET Core platform”. Could you please try the sample in the KB below on your end and let us know the result?   
 
 
Note: The default Tesseract version in ASP.NET Core (Windows) is 3.05.   
  
If you still face the issue, kindly provide us with the details below. They will help us provide a precise solution.   
 
  • Input PDF document
  • Output document (the one showing the issue)
  • OS details (Windows, Linux, or macOS)
 
Regards,  
Sowmiya Loganathan 



JB James Bennett February 23, 2021 06:05 PM UTC

First, is PerformOCR supposed to return the OCR'ed text?

I've tried the sample here https://help.syncfusion.com/file-formats/pdf/working-with-ocr/dot-net-core and PerformOCR returns an empty string.

Also, that sample indicates that the Tesseract binaries are installed as part of the NuGet package Syncfusion.PDF.OCR.Net.Core, but I don't see them installed there.

Nevertheless, I've tried both the standard install path directories and those defined by the sample:

[code]
    using System.IO;
    using Syncfusion.OCRProcessor;
    using Syncfusion.Pdf.Parsing;

    class Program
    {
        // Paths from the Tesseract 5.0 Alpha install
        static string binaryTesseractPath = @"C:\Program Files\Tesseract-OCR";
        static string dataTesseractPath = @"C:\Program Files\Tesseract-OCR\tessdata\";

        static void Main(string[] args)
        {
            // Override with the paths used in the sample
            binaryTesseractPath = @"TesseractBinaries\Windows";
            dataTesseractPath = @"tessdata\";

            // license key code omitted

            var name = "input.pdf";

            using (var docStream = new FileStream(name, FileMode.Open, FileAccess.Read))
            using (var processor = new OCRProcessor(binaryTesseractPath))
            {
                // Load a PDF document
                var lDoc = new PdfLoadedDocument(docStream);

                // Set OCR language to process
                processor.Settings.Language = Languages.English;

                // result comes back as an empty string
                var result = processor.PerformOCR(lDoc, dataTesseractPath);

                // Save the OCR-processed PDF document to disk
                var outputDirectory = @"C:\Users\jbennett\Desktop\PDFOUTPUT";
                var outputName = Path.Join(outputDirectory, "convertedfromInput.pdf");
                using (var bwStream = new FileStream(outputName, FileMode.Create))
                {
                    lDoc.Save(bwStream);
                }

                lDoc.Close(true);
            }
        }
    }
[/code]


GK Gowthamraj Kumar Syncfusion Team February 24, 2021 06:26 PM UTC

Hi James, 

"First, is PerformOCR supposed to return the OCR'ed text?"

"I've tried the sample here https://help.syncfusion.com/file-formats/pdf/working-with-ocr/dot-net-core and PerformOCR returns an empty string."

We checked the OCR processor with a simple sample on our end, and it returns the text properly. We suspect the issue is specific to your document.

Please share the input document so that we can check the issue on our end and assist you further.

"Also, that sample indicates that Tesseract is installed as part of the NuGet package Syncfusion.PDF.OCR.Net.Core, but I don't see them installed there."

"Nevertheless, I've tried both the standard install path directories and those defined by the sample."

You can find the Tesseract binaries and tessdata in the NuGet install location:

TesseractBinaries 
syncfusionocrprocessor\Tesseractbinaries_core (or) 
C:\Users\<username>\.nuget\packages\Syncfusion.PDF.OCR.Net.Core\XX.X.X.XX\lib\TesseractBinaries 

Tessdata 
syncfusionocrprocessor\tessdata (or) 
C:\Users\<username>\.nuget\packages\Syncfusion.PDF.OCR.Net.Core\XX.X.X.XX\lib\tessdata  

We have attached a sample with all the required binaries. Please run the sample on your end and let us know the result.  


Regards, 
Gowthamraj K 


Marked as answer

JB James Bennett March 12, 2021 04:47 AM UTC

Is there some trick then to find the location of the binaries and tessdata at runtime on a production system?  Or am I supposed to just copy the binaries into my project like the OCR sample you sent?


GK Gowthamraj Kumar Syncfusion Team March 15, 2021 12:06 PM UTC

Hi James, 

Thank you for your update. 

No, we do not have any way to find the location at runtime. You can copy the Tesseract binaries and tessdata from the NuGet install location to your project location to perform the OCR operation.  

Regards, 
Gowthamraj K 
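
For anyone following the copy-into-project approach above, here is a minimal sketch of resolving the two folders at runtime relative to the application's own directory. It assumes the folders are named TesseractBinaries\Windows and tessdata (taken from the sample paths earlier in this thread) and that they are deployed next to the executable, for example by marking them to be copied to the output directory in the project file. The TesseractPaths helper is hypothetical and not part of the Syncfusion API; it uses only standard .NET calls.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the Syncfusion API.
static class TesseractPaths
{
    public static (string Binaries, string TessData) Resolve()
    {
        // AppContext.BaseDirectory is the folder the application runs from,
        // so the same relative layout works in development and production.
        string baseDir = AppContext.BaseDirectory;

        // Folder names are assumptions based on the sample paths in this thread.
        string binaries = Path.Combine(baseDir, "TesseractBinaries", "Windows");
        string tessData = Path.Combine(baseDir, "tessdata");

        // Fail loudly if the folders were not deployed, rather than
        // continuing with a path that does not exist.
        if (!Directory.Exists(binaries))
            throw new DirectoryNotFoundException($"Tesseract binaries not found: {binaries}");
        if (!Directory.Exists(tessData))
            throw new DirectoryNotFoundException($"tessdata not found: {tessData}");

        return (binaries, tessData);
    }
}
```

The returned values would then go where the code earlier in the thread uses its path variables, that is, new OCRProcessor(paths.Binaries) and processor.PerformOCR(lDoc, paths.TessData). Checking the folders up front is a deliberate choice here, since the original question reports a null or blank result, with no error, when the binary path is wrong.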

