OCR Sample doesn't seem to work. Required Tesseract version?

Question

I'm trying to test out the OCR sample (with 18.4.0.39) and I'm not getting any results, nor any errors.

Q1. The inline doc for PerformOCR indicates that the return value is the OCR'ed text. But it is either blank or null (if I pass in garbage for the Tesseract Binary path). Is this really where the results are returned?

Q2. The samples show a PDF (input.pdf) being loaded, a stream saved as Output.pdf, and the PerformOCR return value ignored. When I do this, the output looks identical to the input (which is fine), but I'm not seeing any differences. What should I be seeing?

Q3. Other samples refer to various Tesseract versions, but the most recent version of OCRSettings doesn't support these settings. I have the 5.0 alpha version installed, as various sources said it was essentially identical to 4.x. If a specific version of Tesseract is required, that would be very helpful to know.

Thanks,

James

Gowthamraj Kumar · Accepted Answer

Hi James, 





"First, is PerformOCR supposed to return the OCR'ed text?

I've tried the sample here https://help.syncfusion.com/file-formats/pdf/working-with-ocr/dot-net-core and PerformOCR returns an empty string."


We checked the OCR processor with a simple sample on our end, and it returns the text properly. We suspect the issue is specific to your document.

Please share the input document so that we can reproduce the issue on our end and assist you further.


"Also, that sample indicates that Tesseract is installed as part of the NuGet package Syncfusion.PDF.OCR.Net.Core, but I don't see the binaries installed there.

Nevertheless, I've tried both the standard install path directories and those defined by the sample."

You can find the Tesseract binaries and tessdata in the NuGet install location.

TesseractBinaries 
syncfusionocrprocessor\Tesseractbinaries_core (or) 
C:\Users\username\.nuget\packages\Syncfusion.PDF.OCR.Net.Core\XX.X.X.XX\lib\TesseractBinaries

Tessdata
syncfusionocrprocessor\tessdata (or)
C:\Users\username\.nuget\packages\Syncfusion.PDF.OCR.Net.Core\XX.X.X.XX\lib\tessdata
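
For checking these locations on a development machine, the folder paths listed above can be built in code. This is only a rough sketch: NuGetOcrPaths, GlobalPackagesFolder, and PackageLibFolder are hypothetical names, not part of the Syncfusion API. It assumes the standard NuGet behavior (the NUGET_PACKAGES environment variable overrides the default .nuget\packages folder under the user profile, and package ID folders are stored lower-cased) and takes the "lib" layout from the paths above.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of the Syncfusion API.
public static class NuGetOcrPaths
{
    // Returns the NuGet global packages folder: the NUGET_PACKAGES
    // environment variable if it is set, otherwise
    // <user profile>/.nuget/packages.
    public static string GlobalPackagesFolder()
    {
        var overridden = Environment.GetEnvironmentVariable("NUGET_PACKAGES");
        if (!string.IsNullOrWhiteSpace(overridden))
            return overridden;

        var home = Environment.GetFolderPath(Environment.SpecialFolder.UserProfile);
        return Path.Combine(home, ".nuget", "packages");
    }

    // Builds <packagesRoot>/<package id, lower-cased>/<version>/lib/<subfolder>.
    // The "lib" layout is an assumption taken from the paths listed above.
    public static string PackageLibFolder(
        string packagesRoot, string packageId, string version, string subfolder)
    {
        return Path.Combine(
            packagesRoot, packageId.ToLowerInvariant(), version, "lib", subfolder);
    }
}
```

Calling PackageLibFolder with the result of GlobalPackagesFolder(), the package ID Syncfusion.PDF.OCR.Net.Core, the installed version, and "tessdata" or "TesseractBinaries" gives a candidate folder that can then be checked with Directory.Exists. A production machine normally has no NuGet cache, so this is useful only for locating the files to copy.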

We have attached a sample with all the required binaries. Please run it on your end and let us know the result.
Sample: https://www.syncfusion.com/downloads/support/forum/162830/ze/OCRCoreSample73503502 


Regards, 
Gowthamraj K

Sowmiya Loganathan · Answer

Hi James,   
 
Thank you for contacting Syncfusion support.   
 
We have analyzed the reported issue, "OCR'ed text is not returned properly while processing OCR on the ASP.NET Core platform". Could you please try the sample in the KB article below on your end and let us know the result?
 
Perform OCR in ASP.NET Core: https://www.syncfusion.com/kb/11696/how-to-perform-ocr-in-asp-net-core-platform   
UG: https://help.syncfusion.com/file-formats/pdf/working-with-ocr/dot-net-core   
 
Note: The default Tesseract version in ASP.NET Core (Windows) is 3.05.
  
If you are still facing the issue, please provide the details below. They will help us provide a precise solution.

Input PDF document
Output document (the one that shows the issue)
OS details (Windows, Linux, or macOS)
 
Regards,  
Sowmiya Loganathan

James Bennett · Answer

First, is PerformOCR supposed to return the OCR'ed text?

I've tried the sample here https://help.syncfusion.com/file-formats/pdf/working-with-ocr/dot-net-core and PerformOCR returns an empty string.

Also, that sample indicates that Tesseract is installed as part of the NuGet package Syncfusion.PDF.OCR.Net.Core, but I don't see the binaries installed there.

Nevertheless, I've tried both the standard install path directories and those defined by the sample:

[code]
using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

class Program
{
    // From the 5.0 alpha install
    static string binaryTesseractPath = @"C:\Program Files\Tesseract-OCR";
    static string dataTesseractPath = @"C:\Program Files\Tesseract-OCR\tessdata\";

    static void Main(string[] args)
    {
        // Try with the example paths instead
        binaryTesseractPath = @"TesseractBinaries\Windows";
        dataTesseractPath = @"tessdata\";

        // License key code omitted

        var name = "input.pdf";
        using (var docStream = new FileStream(name, FileMode.Open, FileAccess.Read))
        using (var processor = new OCRProcessor(binaryTesseractPath))
        {
            // Load a PDF document
            var lDoc = new PdfLoadedDocument(docStream);

            // Set the OCR language to process
            processor.Settings.Language = Languages.English;

            // result is the string that comes back empty
            var result = processor.PerformOCR(lDoc, dataTesseractPath);

            // Save the OCR-processed PDF document to disk
            var outputDirectory = @"C:\Users\jbennett\Desktop\PDFOUTPUT";
            var outputName = Path.Join(outputDirectory, "convertedfromInput.pdf");
            using (var bwStream = new FileStream(outputName, FileMode.Create))
            {
                lDoc.Save(bwStream);
            }
            lDoc.Close(true);
        }
    }
}
[/code]

James Bennett · Answer

Is there some trick, then, to find the location of the binaries and tessdata at runtime on a production system? Or am I supposed to just copy the binaries into my project, like the OCR sample you sent?

Gowthamraj Kumar · Answer

Hi James, 

Thank you for your update. 

No, we do not have a way to find the location at runtime. You can copy the Tesseract binaries and tessdata from the NuGet install location into your project and use them from there for the OCR operation.
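
Once the folders are copied into the project, one possible way to use them is to resolve them relative to the application directory and fail early when they are missing. This is a minimal sketch, not Syncfusion's recommended code: OcrDeploymentPaths and ResolveRequired are hypothetical names, and it assumes the folders are copied to the build output next to the application binaries.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of the Syncfusion API.
public static class OcrDeploymentPaths
{
    // Resolves a folder relative to baseDirectory and throws if it does
    // not exist, so a wrong path shows up as an exception before any OCR
    // call is made.
    public static string ResolveRequired(string baseDirectory, string relativeFolder)
    {
        var full = Path.GetFullPath(Path.Combine(baseDirectory, relativeFolder));
        if (!Directory.Exists(full))
            throw new DirectoryNotFoundException(
                $"Expected OCR folder not found: {full}");
        return full;
    }
}
```

For example, OcrDeploymentPaths.ResolveRequired(AppContext.BaseDirectory, "tessdata") would return the full tessdata path, which can be passed to PerformOCR; the same call with the binaries folder (TesseractBinaries\Windows in the sample above) gives the value for the OCRProcessor constructor. Checking the folders first helps distinguish a path problem from a document-specific problem when the OCR result comes back empty.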

Regards, 
Gowthamraj K