OCR in the Italian language

Question

I'm using ASP.NET Core 5.0 + Syncfusion Blazor 18.4.0.43.

The NuGet package Syncfusion.PDF.OCR.Net.Core is working fine and I get good results when loading PDF files, but the Tesseract library extracts the words in English.

I changed the language in my project this way:

// Requires: using Syncfusion.OCRProcessor; using Syncfusion.Pdf.Parsing;
// Initialize the OCR processor with the path to the Tesseract binaries
OCRProcessor processor = new OCRProcessor(hostingEnv.ContentRootPath + @"/TesseractBinaries/Windows");
// Load a PDF document
PdfLoadedDocument lDoc = new PdfLoadedDocument(source.ToArray());
// Set the OCR language to process
processor.Settings.Language = "ita";
//OCRLayoutResult hocrBounds;
// Run OCR using the language data in the tessdata folder
processor.PerformOCR(lDoc, hostingEnv.ContentRootPath + @"/tessdata/");

I set the language as a string because Italian is not selectable in the Languages options. I then downloaded "ita.traineddata" to replace "eng.traineddata" in the tessdata folder, but the extracted text is still in English.

Is it not possible to use this line?

processor.Settings.Language = "ita";

Is there any other way to use the Italian language?

Thanks

Gowthamraj Kumar · Accepted Answer

Hi Walter, 
 
Thank you for your patience. 
 
We checked the provided sample on our end and found that the PerformOCR method returns an empty string. On further analysis of the provided document, we found that it does not contain any scanned images, which is why the result is empty. The OCR library converts a scanned PDF document containing a raster image into a searchable PDF document. Please refer to the UG documentation and KB article below:
UG: https://help.syncfusion.com/file-formats/pdf/working-with-ocr/dot-net-core
KB: https://www.syncfusion.com/kb/11696/how-to-perform-ocr-in-asp-net-core-platform

Please let us know if you need any further assistance with this. 

Regards, 
Gowthamraj K

Gowthamraj Kumar · Answer

Hi Walter, 

Thank you for contacting Syncfusion support. 

We are currently checking the sample with the Italian language on our end and will share further details on March 17th, 2021.

In the meantime, please share the input and output documents, the tessdata files, and the product version so that we can check the issue on our end. This will help us analyze the problem and assist you further.

Regards, 
Gowthamraj K

Walter Martin · Answer

I have attached a project to show you what I'm trying to do.

Basically, I'd like to use the PDF OCR library to load a PDF file and calculate the 10 most frequent words.
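For reference, the word-counting step can be sketched roughly as below. This is only a minimal sketch that works on a plain string, assuming the text has already been obtained from the OCR step; the class name WordFrequency and the method TopWords are hypothetical names used for illustration and are not part of the Syncfusion library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not a Syncfusion API): counts words in already-extracted text.
public static class WordFrequency
{
    // Returns the `count` most frequent words in `text`, most frequent first.
    // Ties are broken by ordinal word order so the result is deterministic.
    public static List<(string Word, int Count)> TopWords(string text, int count = 10)
    {
        // \p{L}+ matches runs of Unicode letters, so accented Italian
        // characters such as "è" or "ù" are kept inside words.
        return Regex.Matches(text ?? string.Empty, @"\p{L}+")
            .Select(m => m.Value.ToLowerInvariant())
            .GroupBy(w => w)
            .Select(g => (Word: g.Key, Count: g.Count()))
            .OrderByDescending(p => p.Count)
            .ThenBy(p => p.Word, StringComparer.Ordinal)
            .Take(count)
            .ToList();
    }

    public static void Main()
    {
        // Sample text standing in for the OCR output.
        string text = "La pasta e la salsa: la pasta al forno.";
        foreach (var (word, n) in TopWords(text, 3))
        {
            Console.WriteLine($"{word}: {n}");
        }
        // Prints:
        // la: 3
        // pasta: 2
        // al: 1
    }
}
```

Note that this pattern splits on apostrophes, so "l'olio" is counted as "l" and "olio"; common words such as articles would also need to be filtered out separately if they are not wanted in the ranking.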

If you load the file Ricette.pdf in the project, it finds only English words, even though I used the ita.traineddata file.

Also, most of the other PDF files I tried to load generated an error, so they can't be "scanned".

Is there a way to use the Italian file, and to understand why most of the PDF files can't be "scanned"?

Thanks

Attachment: Ricette_b5bf45bb.zip

Gowthamraj Kumar · Answer

Hi Walter,  
 
Thank you for sharing the details.  
 
As mentioned earlier, we are checking the sample that performs OCR on a PDF document containing Italian text, and we will share further details on March 17th, 2021.
 
Regards,  
Gowthamraj K

Walter Martin · Answer

I'm sorry, my mistake. I gave you the wrong PDF: it was not the scanned one. Thanks to your suggestions, I understood the right way to proceed, and now the libraries are working fine with my sample.

Gowthamraj Kumar · Answer

Hi Walter, 
 
Thank you for your update. Please let us know if you need any further assistance with this. 

Regards, 
Gowthamraj K