
OCR with different language

Hi,

I tried to change your OCR sample (Essential Studio version 12.2.0.36) to use a language other than English. I downloaded the Polish language data, version 3.02, from the Tesseract download site (https://code.google.com/p/tesseract-ocr/downloads/list), added the file to the tessdata folder, and changed the language in code to Polish:

processor.Settings.Language = Languages.Polish;

but the result was the same PDF file (not searchable). Based on your sample, I also tried to write my own OCR application that returns the recognized text as a string, but I get only an empty string.

What am I doing wrong? What should I change to get a proper result? I am using a PDF that contains screen captures of some Polish pages with almost only text (sample PDF attached).
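One thing worth ruling out first is a language file that never reached the tessdata folder, for example because the downloaded archive was unpacked into a subfolder. The sketch below is plain Python (standard library only), not Syncfusion code. It assumes the Tesseract 3.02 naming convention `<code>.traineddata` with `pol` as the code for Polish; the function names are hypothetical.

```python
from pathlib import Path

def find_traineddata(tessdata_dir, lang_code="pol"):
    """Return the path to <lang_code>.traineddata if it sits directly
    inside tessdata_dir, otherwise None."""
    candidate = Path(tessdata_dir) / f"{lang_code}.traineddata"
    return candidate if candidate.is_file() else None

def misplaced_traineddata(tessdata_dir, lang_code="pol"):
    """List copies of the language file found in subfolders of tessdata_dir,
    which can happen when an archive is unpacked in place."""
    root = Path(tessdata_dir)
    if not root.is_dir():
        return []
    name = f"{lang_code}.traineddata"
    return [p for p in root.rglob(name) if p.parent != root]
```

If `find_traineddata` returns None while `misplaced_traineddata` returns a path, moving that file up into the tessdata folder is a reasonable first thing to try.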

I have some questions about this library:
1. Can your engine use dictionaries (e.g. a Polish dictionary to check the result)?
2. Is your engine adaptive (self-learning)?

Best regards,
Klaudiusz

Attachment: sample_pl_900788.zip

5 Replies

PH Praveenkumar H Syncfusion Team July 21, 2014 07:31 AM UTC

Hi Klaudiusz,

Thank you for using Syncfusion products.

We are afraid we could not reproduce the issue. We have attached a sample project with the output document for your reference.

1. Can your engine use dictionaries (e.g. Polish to check the result)? We do not fully understand your requirement here. Could you please provide more details?

2. Is your engine adaptive (self-learning)? Currently we do not support training the Tesseract engine.

Please let us know if you need further assistance.

With Regards,

Praveen


Attachment: OCR_Testing_43000c57.zip


EN enova July 21, 2014 10:56 AM UTC

Hi,

I tried your sample and it does recognize something, but the quality is poor. The first few lines (from my sample PDF) look like this:

Osanna dudawašem Proszą funku: da „ma apląkaqą waaawaą uąuuzhwmnąe ścąągnąęua phku z dysku sąw la Dhką Extda ą zwajdowaw sąę v. kalabgu a Aappaza repuvt:

It should look like this:

Ostatnio dodawałem prostą funkcję do pewnej aplikacji webowej: umożliwienie ściągnięcia pliku z dysku. Były to pliki Excela i znajdowały się w katalogu ~/App_Data/reports.

What can I do to improve the OCR quality and get the correct text?

Regarding point 1: I mean using dictionaries, as Word does, to compare words and check whether each word was recognized properly.

Best regards,
Klaudiusz


PH Praveenkumar H Syncfusion Team July 23, 2014 09:49 AM UTC

Hi Klaudiusz,

Thank you for your update.

Syncfusion OCR uses the Tesseract OCR engine, and we see the same output when running Tesseract itself.

Please increase the quality of the image to get proper output.
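For screen captures, "quality" mostly comes down to resolution. The Tesseract documentation suggests it works best at around 300 DPI or more, while screenshots are often closer to 96 DPI. The 96 DPI figure is an assumption about typical screen captures, not something measured on the attached file. A minimal Python sketch of the resize arithmetic, with hypothetical function names:

```python
def upscale_factor(source_dpi, target_dpi=300.0):
    """Factor by which to enlarge an image so its effective resolution
    reaches target_dpi. Never shrinks: returns at least 1.0."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return max(1.0, target_dpi / source_dpi)

def scaled_size(width_px, height_px, source_dpi, target_dpi=300.0):
    """Pixel dimensions after upscaling to the target resolution."""
    factor = upscale_factor(source_dpi, target_dpi)
    return round(width_px * factor), round(height_px * factor)

# A 1024x768 capture assumed to be 96 DPI would be resized to 3200x2400.
print(scaled_size(1024, 768, 96))
```

The resize itself would be done in an image tool or library before the page is passed to OCR. Capturing the source at a higher zoom level usually gives better results than enlarging a small image afterwards.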

We do not have any dictionaries to compare the results against.
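A dictionary check of the kind asked about can still be applied as a separate post-processing step on the recognized text. The following is a minimal sketch in Python using the standard `difflib` module; it is not part of the Syncfusion or Tesseract API. The word list, the function name and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
import difflib

def correct_words(ocr_text, dictionary, cutoff=0.8):
    """Replace each whitespace-separated token that is not in the dictionary
    with its closest dictionary word, if one is similar enough.
    Tokens with no sufficiently close match are left unchanged."""
    known = set(dictionary)
    corrected = []
    for token in ocr_text.split():
        if token in known:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, dictionary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)

# Hypothetical three-word dictionary; a real one would hold a full Polish word list.
words = ["katalogu", "dysku", "pliku"]
print(correct_words("pliku z dysku w katalogv", words))
```

This only helps when the recognized word is already close to the correct one. Heavily garbled output such as the lines quoted earlier in this thread will not be repaired this way, and punctuation and letter case would need extra handling.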

Please let us know if you need further assistance.

With Regards,

Praveen



EN enova July 24, 2014 02:00 PM UTC

Hi,

Better image quality helps, but we have one more observation. The attachments contain the original PDF and the OCRed, searchable PDF. I have also attached a text file with the OCR result.
The problem is searching in the PDF: try to find the word "Saga" in OCRed_Sample_pl.pdf. The word is found (I use the Adobe viewer), but the viewer selects more than it should. It should select only the first word, not "Saga o wiedźminie". I think the cause is the PDF file itself, because when you copy the first words to the clipboard you get:

"Sagaowiedźminie"

If you look at the OCR result (OCRed_sample_pl_data.txt), there are spaces between "Saga" and "o" and between "o" and "wiedźminie".

Best regards,
Klaudiusz

Attachment: files_1bcfc5ee.zip


PH Praveenkumar H Syncfusion Team July 28, 2014 05:16 AM UTC


Hi Klaudiusz,


Thank you for your update.


On further investigation, we found that the text selection issue occurs in Adobe Reader itself and not on our side; we draw the text at its correct position. We tested the same document with Microsoft Reader, where it works properly. We have attached screenshots for your reference.
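One plausible way for two viewers to disagree on the same file: when words are drawn at separate positions with no explicit space characters between them, each viewer has to infer word breaks from the horizontal gaps. The Python sketch below illustrates that general idea only. It does not describe the actual logic of Adobe Reader, Microsoft Reader or the Syncfusion library, and the coordinates and thresholds are invented for the example.

```python
def join_words(boxes, gap_ratio=0.25):
    """Join words on one text line, inserting a space only when the gap
    to the previous word exceeds gap_ratio times that word's average
    character width.

    boxes: list of (text, x_start, x_end), sorted left to right.
    """
    parts = []
    prev = None
    for text, x_start, x_end in boxes:
        if prev is not None:
            prev_text, prev_start, prev_end = prev
            char_width = (prev_end - prev_start) / max(len(prev_text), 1)
            if x_start - prev_end > gap_ratio * char_width:
                parts.append(" ")
        parts.append(text)
        prev = (text, x_start, x_end)
    return "".join(parts)

# Invented positions: 3-unit gaps between words whose characters are 10 units wide.
line = [("Saga", 0, 40), ("o", 43, 53), ("wiedźminie", 56, 156)]
print(join_words(line, gap_ratio=0.25))  # lenient threshold keeps the spaces
print(join_words(line, gap_ratio=0.5))   # stricter threshold merges the words
```

With the same positions, the lenient threshold yields "Saga o wiedźminie" and the stricter one yields "Sagaowiedźminie". Under this assumption, writing explicit space characters into the hidden text layer would remove the dependence on viewer heuristics.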



Please let us know if you need further assistance.


With Regards,


Praveen



