OCR in Italian language

I'm using ASP.NET Core 5.0 with Syncfusion Blazor 18.4.0.43.
The NuGet package Syncfusion.PDF.OCR.Net.Core is working fine and I get good results when loading PDF files, but the Tesseract library extracts the words in English.

I changed the language in my project this way:

//Initialize OCR processor
OCRProcessor processor = new OCRProcessor(hostingEnv.ContentRootPath + @"/TesseractBinaries/Windows");
//Load a PDF document
PdfLoadedDocument lDoc = new PdfLoadedDocument(source.ToArray());
//Set OCR language to process
processor.Settings.Language = "ita";
//OCRLayoutResult hocrBounds;
//Perform OCR using the language data in the tessdata folder
processor.PerformOCR(lDoc, hostingEnv.ContentRootPath + @"/tessdata/");

I did this because Italian is not selectable in the Languages options. I then downloaded "ita.traineddata" to replace "eng.traineddata" inside the tessdata path, but the extracted text is still in English.
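
A quick way to rule out a missing language file: Tesseract resolves a language code to a file named `<code>.traineddata` in the tessdata folder, so with the language set to "ita" the folder should contain `ita.traineddata` under that exact name. The sketch below only checks that the file exists. The helper names `TrainedDataPath` and `HasTrainedData` and the example folder are hypothetical and not part of the Syncfusion API.

```csharp
using System;
using System.IO;

// Path Tesseract is expected to read for a language code,
// following its "<code>.traineddata" naming convention.
static string TrainedDataPath(string tessdataDir, string languageCode) =>
    Path.Combine(tessdataDir, languageCode + ".traineddata");

// True when the language file is present in the tessdata folder.
static bool HasTrainedData(string tessdataDir, string languageCode) =>
    File.Exists(TrainedDataPath(tessdataDir, languageCode));

// Hypothetical folder for illustration; in the project it would be
// the tessdata folder under hostingEnv.ContentRootPath.
string tessdataDir = Path.Combine(Directory.GetCurrentDirectory(), "tessdata");

Console.WriteLine(HasTrainedData(tessdataDir, "ita")
    ? "ita.traineddata found"
    : "ita.traineddata missing from " + tessdataDir);
```

If the file is present and the output is still unexpected, the language data is probably not the cause and the input document itself is worth checking.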

Is it not possible to use this line?
  processor.Settings.Language = "ita";

Is there any other way to use the Italian language?

Thanks







6 Replies 1 reply marked as answer

GK Gowthamraj Kumar Syncfusion Team March 15, 2021 11:47 AM UTC

Hi Walter, 

Thank you for contacting Syncfusion support. 

We are currently checking the sample with the Italian language on our end and will update you with further details on March 17th, 2021. 

In the meantime, please share the input and output documents, the tessdata, and the product version so that we can check the issue on our end. This will help us analyze the problem and assist you further. 

Regards, 
Gowthamraj K 



WM Walter Martin March 16, 2021 12:27 AM UTC

I have attached a project to show you what I'm trying to do.
Basically, I'd like to use the PDF OCR library to load a PDF file and calculate the 10 most frequent words.
If you load the file Ricette.pdf in the project, it finds only English words, even though I used the ita.traineddata file.
Also, most of the other PDF files I tried to load generated an error, so they can't be "scanned".
Is there a way to use the Italian file, and to understand why most of the PDF files can't be "scanned"?
Thanks
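
For the word-count step described above, a minimal sketch using only the .NET base library might look like the following. It assumes the OCR text is already available as a plain string; the function name `TopWords` and the sample sentence are hypothetical and are not taken from the attached project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Returns the `count` most frequent words in `text`, most frequent first.
// Ties are broken by ordinal word order so the result is deterministic.
static List<KeyValuePair<string, int>> TopWords(string text, int count)
{
    // \p{L}+ matches runs of Unicode letters, so accented Italian
    // characters stay inside their word. Apostrophes split words,
    // so "l'olio" is counted as "l" and "olio".
    return Regex.Matches(text.ToLowerInvariant(), @"\p{L}+")
        .Cast<Match>()
        .Select(m => m.Value)
        .GroupBy(w => w)
        .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
        .OrderByDescending(p => p.Value)
        .ThenBy(p => p.Key, StringComparer.Ordinal)
        .Take(count)
        .ToList();
}

// Hypothetical sample text standing in for the OCR output.
string sample = "Il sale, il pepe e l'olio: sale quanto basta.";

foreach (var pair in TopWords(sample, 10))
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}
```

Short function words such as articles will usually dominate the result, so a stop-word filter may be worth adding before the grouping step.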



Attachment: Ricette_b5bf45bb.zip


GK Gowthamraj Kumar Syncfusion Team March 16, 2021 12:15 PM UTC

Hi Walter,  
 
Thank you for sharing the details.  
 
As we said earlier, we are checking the sample for performing OCR on a PDF document that contains Italian text on our end, and we will update you with further details on March 17th, 2021.  
 
Regards,  
Gowthamraj K 



GK Gowthamraj Kumar Syncfusion Team March 17, 2021 05:02 PM UTC

Hi Walter, 
 
Thank you for your patience. 
 
We have checked the provided sample on our end, and we get an empty string returned from the PerformOCR method. On further analysis of the provided document, we found that it does not contain any scanned images, which is why the result is empty. The OCR library converts a scanned PDF document containing a raster image into a searchable PDF document. Please refer to the UG documentation for more details.

Please let us know if you need any further assistance with this. 

Regards, 
Gowthamraj K 


Marked as answer

WM Walter Martin March 17, 2021 10:02 PM UTC

I'm sorry, my mistake.
I gave you the wrong PDF; it was not the scanned one.
Thanks to your suggestions, I understood the right way to proceed, and now the libraries are working fine with my sample.



GK Gowthamraj Kumar Syncfusion Team March 18, 2021 09:19 AM UTC

Hi Walter, 
 
Thank you for your update. Please let us know if you need any further assistance with this. 

Regards, 
Gowthamraj K 
 

