Optical Character Recognition in PDF Using Tesseract Open-Source Engine

Optical character recognition (OCR) is a technology used to convert scanned paper documents, in the form of PDF files or images, to searchable, editable data. Paper documents such as brochures, invoices, and contracts are often scanned so that they can be sent via email. The scanner converts the document to dots of different colors, known as a raster image. To extract the data and repurpose the content of the document, an OCR engine is necessary. The OCR engine detects the characters present in the image and assembles them into words and then sentences, enabling you to search and edit the content of the document.

Tesseract Engine

Tesseract is an open-source optical character recognition engine and is regarded as one of the most accurate open-source OCR engines currently available. It is licensed under Apache 2.0, and its development has been sponsored by Google since 2006.

Getting Started with Essential PDF and Tesseract Engine

Syncfusion Essential PDF supports OCR by using the Tesseract open-source engine. With a few lines of code, a scanned paper document containing raster images is converted to a searchable and selectable document.

Deployment Requirements

The following assemblies are required to deploy Essential PDF and the OCR process.

Syncfusion Assemblies

  • Syncfusion.Core.dll
  • Syncfusion.Compression.Base.dll
  • Syncfusion.Pdf.Base.dll
  • Syncfusion.OcrProcessor.Base.dll

Tesseract Assemblies

  • SyncfusionTesseract.dll (Tesseract Engine Version 3.02)
  • liblept168.dll (Leptonica image processing library used by Tesseract engine)

Referencing OCR Assemblies in a .NET Project

To reference the OCR assemblies in a .NET project:

1. Open the Solution Explorer of the application you have created. Right-click the References folder and then click Add Reference.

2. Add the following assemblies as references in the application.

  • Syncfusion.Core.dll
  • Syncfusion.Compression.Base.dll
  • Syncfusion.Pdf.Base.dll
  • Syncfusion.OcrProcessor.Base.dll

3. Do not add SyncfusionTesseract.dll and liblept168.dll as references. Keep them in a folder on the local machine, and pass the location of that folder as a parameter to the OCR processor.
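
Because the two Tesseract binaries are located by path rather than by reference, a missing file tends to surface only when the OCR processor is created. A minimal sketch of a pre-flight check follows. The names TesseractBinaries and FindMissing are hypothetical and not part of Essential PDF; the caller supplies the file names to look for.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of Essential PDF.
public static class TesseractBinaries
{
    // Returns the names of the required files that are absent from the folder.
    public static List<string> FindMissing(string folder, params string[] requiredFiles)
    {
        var missing = new List<string>();
        foreach (string name in requiredFiles)
        {
            // File.Exists returns false for missing files and missing folders alike.
            if (!File.Exists(Path.Combine(folder, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

For example, TesseractBinaries.FindMissing(@"TesseractBinaries\", "SyncfusionTesseract.dll", "liblept168.dll") returns an empty list when both binaries are in place, so the application can report a clear error before OCR starts.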

Performing OCR for a Scanned Paper Document

1. To perform optical character recognition, first create the OCR processor by instantiating the OCRProcessor class. The constructor requires the path of the folder that contains the Tesseract binaries, SyncfusionTesseract.dll and liblept168.dll.

//Initializes the OCR processor with the folder that contains the Tesseract binaries
//(SyncfusionTesseract.dll and liblept168.dll).
OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");

2. The PDF document that has to undergo the optical character recognition is loaded by using the PdfLoadedDocument class.

//Loads a PDF document.
PdfLoadedDocument lDoc = new PdfLoadedDocument("Input.pdf");

3. The next step is to set the language for the OCR process and then start the process, passing the location of the language dictionary (tessdata) folder. Tesseract supports a variety of languages. The following code runs OCR for English and supplies the English dictionary.

//Sets OCR language to process.
processor.Settings.Language = "eng";
//Performs OCR on the PDF document using the language data in the given folder.
processor.PerformOCR(lDoc, @"Tessdata\");

Note: The Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll) and the language pack (tessdata) are available at

<<Installation Location>>\Syncfusion\Essential Studio\<<Version Number>>\OCRProcessor

4. The final step is to save the PDF document and dispose of the PdfLoadedDocument object. The saved PDF document now contains the contents in a searchable form.

//Saves the OCR-processed PDF document to disk.
lDoc.Save("Sample.pdf");
lDoc.Close(true);
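
The four steps above can be combined into one small console program. This is a sketch rather than a definitive implementation: the two using directives are assumptions about the namespaces that hold OCRProcessor and PdfLoadedDocument, and the try/finally block is added here so that the document is closed even if OCR or saving fails. Every call is one already shown in the steps.

```csharp
using Syncfusion.OCRProcessor;   // Assumed namespace of the OCRProcessor class.
using Syncfusion.Pdf.Parsing;    // Assumed namespace of the PdfLoadedDocument class.

class OcrSample
{
    static void Main()
    {
        // Step 1: point the OCR processor at the folder holding the Tesseract binaries.
        OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");

        // Step 2: load the scanned PDF document.
        PdfLoadedDocument lDoc = new PdfLoadedDocument("Input.pdf");
        try
        {
            // Step 3: choose the language and run OCR with the tessdata folder.
            processor.Settings.Language = "eng";
            processor.PerformOCR(lDoc, @"Tessdata\");

            // Step 4: save the searchable PDF.
            lDoc.Save("Sample.pdf");
        }
        finally
        {
            // Release the document whether or not the steps above succeeded.
            lDoc.Close(true);
        }
    }
}
```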

Performing OCR on a Section of the Document

Optical character recognition can also be performed on a section of a document rather than the complete document. The following documentation link provides a code sample and explanation.

http://help.syncfusion.com/ug/windows%20forms/documents/performocrforaspecif.htm

Multi-Language Support for OCR

The Tesseract engine, starting from version 3, supports a variety of languages such as Arabic, English, Bulgarian, Catalan, Czech, Chinese, and German. The complete list appears in the table at the end of this section.

Essential PDF also supports all these languages in the OCR processor. By default, Syncfusion ships only the English dictionary pack in Essential Studio. The pack is available locally in <<Installation Location>>\Syncfusion\Essential Studio\<<Version Number>>\OCRProcessor\Tessdata. The dictionary packs for the other languages can be downloaded from the following online location:

https://code.google.com/p/tesseract-ocr/downloads/list

1. Download the dictionary packs from the above link, extract them to a folder, and pass the location of that folder to the PerformOCR() method of the OCRProcessor class.

2. Set the OCRProcessor.Settings.Language property to the corresponding language code. For example, to perform optical character recognition on German text, set processor.Settings.Language = "deu";
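
Setting a language code without the matching dictionary pack in the tessdata folder is an easy mistake to make. The sketch below checks for the pack before OCR starts. It assumes that Tesseract 3 dictionary packs are stored as files named after the language code with a .traineddata extension (for example, deu.traineddata); LanguagePack and IsInstalled are hypothetical names, not part of Essential PDF.

```csharp
using System.IO;

// Hypothetical helper, not part of Essential PDF.
public static class LanguagePack
{
    // Assumes Tesseract 3 naming: <language code>.traineddata inside the tessdata folder.
    public static bool IsInstalled(string tessdataFolder, string languageCode)
    {
        string file = Path.Combine(tessdataFolder, languageCode + ".traineddata");
        return File.Exists(file);
    }
}
```

An application could call LanguagePack.IsInstalled(@"Tessdata\", "deu") before setting processor.Settings.Language = "deu", and report a clear error if the pack has not been downloaded yet.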

The following table shows the complete set of supported languages and their language codes.

Supported Languages and Language Codes

Language                               Language code

Arabic                                 ara
Azerbaijani                            aze
Bulgarian                              bul
Catalan                                cat
Czech                                  ces
Simplified Chinese                     chi_sim
Traditional Chinese                    chi_tra
Cherokee                               chr
Danish                                 dan
Danish (Fraktur)                       dan-frak
German, standard and Fraktur script    deu
Greek                                  ell
English                                eng
Middle English                         enm
Esperanto                              epo
Estonian                               est
Finnish                                fin
French                                 fra
Middle French                          frm
Galician                               glg
Hebrew                                 heb
Hindi                                  hin
Croatian                               hrv
Hungarian                              hun
Indonesian                             ind
Italian                                ita
Japanese                               jpn
Korean                                 kor
Latvian                                lav
Lithuanian                             lit
Dutch                                  nld
Norwegian                              nor
Polish                                 pol
Portuguese                             por
Romanian                               ron
Russian                                rus
Slovakian                              slk
Slovenian                              slv
Albanian                               sqi
Spanish                                spa
Serbian                                srp
Swedish                                swe
Tamil                                  tam
Telugu                                 tel
Tagalog                                tgl
Thai                                   tha
Turkish                                tur
Ukrainian                              ukr
Vietnamese                             vie

Tips for Improving OCR Accuracy

You can improve the accuracy of the OCR process by choosing a suitable resolution, compression method, and page orientation when converting the scanned paper to a TIFF image and then to a PDF document.

1. Tesseract works best on text captured at 300 dots per inch (DPI) or more, so it can be beneficial to resize lower-resolution images.

2. Compression:

  • Use lossless (ZIP) compression for color or gray-scale images.
  • Use CCITT Group 4 or JBIG2 (lossless) compression for monochrome images. This ensures that optical character recognition works on the highest-quality image, thereby improving the OCR accuracy. This is especially useful in low-resolution scans.

3. In addition, rotated or skewed images can reduce the accuracy of the OCR output.
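
As a rough aid for the 300 DPI guideline in tip 1, the effective resolution of a scan can be estimated from its width in pixels and the physical width of the page. The sketch below is plain arithmetic with hypothetical names (ScanResolution, EffectiveDpi, ScaleFactorFor); it does not resize anything itself and is not part of Essential PDF or Tesseract.

```csharp
using System;

// Hypothetical helper, not part of Essential PDF or Tesseract.
public static class ScanResolution
{
    // Resolution recommended in the tips above.
    public const double TargetDpi = 300.0;

    // Effective DPI = pixels across the page / page width in inches.
    public static double EffectiveDpi(int pixelWidth, double pageWidthInches)
    {
        if (pixelWidth <= 0 || pageWidthInches <= 0)
        {
            throw new ArgumentException("Width values must be positive.");
        }
        return pixelWidth / pageWidthInches;
    }

    // Factor by which the image would need to be enlarged to reach the target;
    // 1.0 when the scan already meets or exceeds it.
    public static double ScaleFactorFor(int pixelWidth, double pageWidthInches)
    {
        double dpi = EffectiveDpi(pixelWidth, pageWidthInches);
        return dpi >= TargetDpi ? 1.0 : TargetDpi / dpi;
    }
}
```

For example, a page 8.5 inches wide that was scanned to an image 1275 pixels wide has an effective resolution of 150 DPI, so ScaleFactorFor(1275, 8.5) returns 2.0. Enlarging an image does not add detail, so rescanning at a higher resolution is likely the better option when the paper original is still available.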

For more details regarding quality improvement, refer to the following link:

https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality

Online Sample

An online sample that performs OCR on a PDF with a monochrome image is available at

http://asp.syncfusion.com/demos/web/pdf/pdfocr.aspx

Content Contributor: Pravin Joshua David | Content Editor: Usha Clementine Henry
