OCR with a different language

Question

Hi,

I am trying to change your OCR sample (version 12.2.0.36 of Essential Studio) to use a language other than English. I downloaded the Polish language data, version 3.02, from the Tesseract download site (https://code.google.com/p/tesseract-ocr/downloads/list), added the file to the tessdata folder, and changed the language in code to Polish:

processor.Settings.Language = Languages.Polish;

but the result was the same PDF file (not searchable). Following your sample, I also tried to write my own OCR application that returns the recognized text as a string, but I only get an empty string.

What am I doing wrong? What should I change to get a proper result? I am using a PDF containing screen captures of some Polish pages with almost nothing but text (I have attached a sample PDF file).

I also have some questions about this library:

1. Can your engine use dictionaries (e.g. a Polish one to check the result)?
2. Is your engine adaptive (self-learning)?

Best regards,
Klaudiusz

Attachment: sample_pl_900788.zip

Praveenkumar H · Answer

Hi Klaudiusz,

Thank you for using Syncfusion products.

We are afraid we are not able to reproduce the issue. We have attached a sample project with the output document for your reference.

1. Can your engine use dictionaries (e.g. a Polish one to check the result)? We do not understand your exact requirement here; could you please provide more details?
2. Is your engine adaptive (self-learning)? Currently we do not have support for training the Tesseract engine.

Please let us know if you need further assistance.

With Regards,
Praveen

Attachment: OCR_Testing_43000c57.zip

enova · Answer

Hi,

I tried your sample and it does OCR something, but the quality is not good. The first few lines (from my sample PDF) look like this:

Osanna dudawašem Proszą funku: da „ma apląkaqą waaawaą uąuuzhwmnąe ścąągnąęua phku z dysku sąw la Dhką Extda ą zwajdowaw sąę v. kalabgu a Aappaza repuvt:

They should look like this:

Ostatnio dodawałem prostą funkcję do pewnej aplikacji webowej:
umożliwienie ściągnięcia pliku z dysku. Były to pliki Excela i
znajdowały się w katalogu ~/App_Data/reports.

What can I do to improve the OCR quality and get proper data?

Ad. 1: I mean using dictionaries, as Word does, to compare words and check whether each word was OCRed properly.

Best regards,
Klaudiusz
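The dictionary check described in "Ad. 1" could be done as a post-processing step on the OCR text, outside the OCR engine. The following is only a minimal sketch in Python using the standard library; the `POLISH_WORDS` list and the `check_words` helper are hypothetical names made up for illustration, and a real check would load a complete Polish word list instead of the few words shown here.

```python
import difflib

# Hypothetical, tiny word list for illustration only.
# A real check would load a full Polish dictionary file.
POLISH_WORDS = [
    "ostatnio", "dodawałem", "prostą", "funkcję",
    "do", "pewnej", "aplikacji", "webowej",
]


def check_words(ocr_text, dictionary, cutoff=0.6):
    """Return (word, suggestion) pairs for OCR words missing from the dictionary.

    The suggestion is the closest dictionary word, or None when nothing
    is similar enough according to difflib's similarity ratio.
    """
    known = set(dictionary)
    suspects = []
    for raw in ocr_text.split():
        # Drop surrounding punctuation and compare case-insensitively.
        word = raw.strip(".,:;!?\"„”()").lower()
        if not word or word in known:
            continue
        matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
        suspects.append((word, matches[0] if matches else None))
    return suspects


if __name__ == "__main__":
    for word, suggestion in check_words("Ostatnio dudawašem prostą funkcję", POLISH_WORDS):
        print(word, "->", suggestion)
```

With this word list, the misread word "dudawašem" from the output above is flagged and "dodawałem" is offered as the closest match. Such a check can only flag or suggest; it cannot recover text from an image the engine could not read.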

Praveenkumar H · Answer

Hi Klaudiusz,

Thank you for your update.

The Syncfusion OCR makes use of the Tesseract OCR engine, and we can see the same output in Tesseract OCR itself. Please increase the quality of the image to get proper output.

We do not have any dictionaries to compare the results against.

Please let us know if you need further assistance.

With Regards,
Praveen

enova · Answer

Hi,

Better image quality helps, but we have one more observation. Look at the attachments: there is the original PDF and the OCRed, searchable PDF. I have also attached a text file with the OCR result.

The problem is searching in the PDF. Try to find the word "Saga" in OCRed_Sample_pl.pdf. You will notice (I use the Adobe viewer) that the word is found, but the viewer selects more than it should: it should select only the first word, not "Saga o wiedźminie". I think the cause is in the PDF file, because when you copy the first words to the clipboard you get:

"Sagaowiedźminie"

If you look at the OCR result (OCRed_sample_pl_data.txt), there are spaces between "Saga" and "o" and between "o" and "wiedźminie".

Best regards,
Klaudiusz

Attachment: files_1bcfc5ee.zip
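One way to confirm that only the spaces are lost, and that the recognized characters themselves are the same, is to compare the OCR text with the text copied from the PDF after removing all whitespace from both. This is a small sketch in Python; the helper name `same_ignoring_whitespace` is made up for illustration, and the two strings are the ones quoted above.

```python
def same_ignoring_whitespace(ocr_text, copied_text):
    """Return True when both strings match once all whitespace is removed."""
    return "".join(ocr_text.split()) == "".join(copied_text.split())


if __name__ == "__main__":
    ocr_text = "Saga o wiedźminie"    # from OCRed_sample_pl_data.txt
    copied_text = "Sagaowiedźminie"   # copied from the PDF in the viewer
    print(same_ignoring_whitespace(ocr_text, copied_text))
```

If the comparison holds for the whole page, the difference between the two texts is limited to word separation.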

Praveenkumar H · Answer

Hi Klaudiusz,

Thank you for your update.

On further investigation, we found that the text selection issue occurs in Adobe Reader itself. The issue is not on our side: we draw the text at its correct position. We tested the same document with Microsoft Reader and it works properly. We have attached screenshots for your reference.

Please let us know if you need further assistance.

With Regards,
Praveen