Nov 21, 2024

Optical Character Recognition in PDF Using Tesseract Open-Source Engine

Optical character recognition (OCR) is a technology that converts scanned paper documents, stored as PDF files or images, into searchable, editable data. Paper documents such as brochures, invoices, and contracts are often scanned so that they can be sent by email. The scanner converts each page into a grid of colored pixels, known as a raster image. To extract the data and repurpose the content of the document, an OCR engine is necessary. The OCR engine detects the characters present in the image, groups those characters into words and then into sentences, and so enables you to search and edit the content of the document.

Tesseract engine

Tesseract is an open-source optical character recognition engine, widely regarded as one of the most accurate OCR engines available. It is licensed under Apache 2.0, and its development has been sponsored by Google since 2006.

Getting Started with Essential PDF and Tesseract Engine

Syncfusion Essential PDF supports OCR by using the Tesseract open-source engine. With a few lines of code, a scanned paper document containing raster images can be converted to a searchable and selectable document.

You can download the OCR processor product setup here.

Deployment Requirements

The following assemblies are required to deploy Essential PDF and the OCR process.

Syncfusion assemblies

Syncfusion.Compression.Base.dll
Syncfusion.Pdf.Base.dll
Syncfusion.OcrProcessor.Base.dll

Tesseract assemblies

SyncfusionTesseract.dll (Tesseract engine version 3.02)
liblept168.dll (Leptonica image processing library used by Tesseract engine)

Referencing OCR assemblies in a .NET project

To reference the OCR assemblies in a .NET project:

1. Open the Solution Explorer of the application you have created. Right-click the References folder and then click Add Reference.
2. Add the following assemblies as references in the application:
- Syncfusion.Compression.Base.dll
- Syncfusion.Pdf.Base.dll
- Syncfusion.OcrProcessor.Base.dll
Do not add SyncfusionTesseract.dll and liblept168.dll as references. Keep them on the local machine and pass the location of the folder that contains them as a parameter to the OCR processor.
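
Because the Tesseract binaries are located through a folder path rather than through project references, a missing file would typically only show up at run time. The following is a minimal sketch of a pre-flight check, not part of Essential PDF: the RequiredFiles class and its Missing method are hypothetical names introduced here for illustration, and the caller supplies the file names to look for.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not part of Essential PDF.
static class RequiredFiles
{
    // Returns the names from fileNames that do not exist in the given folder.
    public static List<string> Missing(string folder, params string[] fileNames)
    {
        var missing = new List<string>();
        foreach (string name in fileNames)
        {
            // Path.Combine builds "<folder>/<name>" with the right separator.
            if (!File.Exists(Path.Combine(folder, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

Calling RequiredFiles.Missing with the binaries folder and the two Tesseract file names listed above, and stopping early when the returned list is not empty, should give a clearer error message than a load failure later in the OCR process.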

Performing OCR for a scanned paper document

1. To perform optical character recognition, first create the OCR processor by instantiating the OCRProcessor class. The constructor of the OCRProcessor class requires the path of the folder that contains the Tesseract binaries, SyncfusionTesseract.dll and liblept168.dll.

//Initializes the OCR processor with the path of the folder that contains the Tesseract binaries
//(SyncfusionTesseract.dll and liblept168.dll).
OCRProcessor processor = new OCRProcessor(@"TesseractBinaries");

2. The PDF document that has to undergo the optical character recognition is loaded by using the PdfLoadedDocument class.

//Loads a PDF document.
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");

3. The next step is to set the language for the OCR process and start the OCR process with the input of the language dictionary. Tesseract supports a variety of languages. The following code explains the OCR process for English and how to provide the English dictionary input.

//Sets OCR language to process.
processor.Settings.Language = "eng";
//Processes OCR by providing PDF document, data dictionary, and language.
processor.PerformOCR(loadedDocument, @"Tessdata");

Note: You can get the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll) and the language pack (tessdata) by downloading the OCR processor zip file from the following location: https://www.syncfusion.com/downloads/latest-version

4. The final step is to save the PDF document and dispose of the PdfLoadedDocument object. The saved PDF document now contains the contents in a searchable form.

//Saves the OCR-processed PDF document to a disk.
loadedDocument.Save("Sample.pdf");
loadedDocument.Close(true);

Performing OCR on a section of the document

Optical character recognition can also be performed on a section of a document rather than the complete document. The following documentation link provides a code sample and explanation.

Multiple language support for OCR

The Tesseract engine, starting from version 3, supports a variety of languages, including Arabic, English, Bulgarian, Catalan, Czech, Chinese, and German.

Essential PDF also supports all these languages in the OCR processor. By default, Syncfusion ships only the English dictionary in the package. The dictionary packs for the other languages can be downloaded from the following online location:

https://github.com/tesseract-ocr/tessdata

Download the dictionary packs from the above link, extract them to a folder, and pass the location of that folder to the PerformOCR() method of the OCRProcessor class.
You must also set the corresponding language code in the OCRProcessor.Settings.Language property. For example, to perform optical character recognition in German, set the property as processor.Settings.Language = "deu";
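
One thing that can go wrong here is setting a language code whose dictionary file is not in the folder passed to PerformOCR(). The sketch below checks for that up front. It assumes the dictionary for a language code is a file named <code>.traineddata directly inside the tessdata folder, which is how the files in the tessdata repository linked above are named. LanguagePack and its members are hypothetical names used for illustration, not Essential PDF API.

```csharp
using System.IO;

// Hypothetical helper for illustration; not part of Essential PDF.
static class LanguagePack
{
    // Assumption: the dictionary for a language code is "<code>.traineddata".
    public static string FileNameFor(string languageCode)
    {
        return languageCode + ".traineddata";
    }

    // True when the dictionary file for the code exists in the tessdata folder.
    public static bool IsInstalled(string tessdataFolder, string languageCode)
    {
        return File.Exists(Path.Combine(tessdataFolder, FileNameFor(languageCode)));
    }
}
```

For German, for example, LanguagePack.IsInstalled(@"Tessdata", "deu") would be expected to return true before the language property is set to "deu".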

The following table shows the complete set of supported languages and their language codes.

Language	Language code
Arabic	ara
Azerbaijani	aze
Bulgarian	bul
Catalan	cat
Czech	ces
Simplified Chinese	chi_sim
Traditional Chinese	chi_tra
Cherokee	chr
Danish	dan
Danish (Fraktur)	dan-frak
German, standard and Fraktur script	deu
Greek	ell
English	eng
Middle English	enm
Esperanto	epo
Estonian	est
Finnish	fin
French	fra
Middle French	frm
Galician	glg
Hebrew	heb
Hindi	hin
Croatian	hrv
Hungarian	hun
Indonesian	ind
Italian	ita
Japanese	jpn
Korean	kor
Latvian	lav
Lithuanian	lit
Dutch	nld
Norwegian	nor
Polish	pol
Portuguese	por
Romanian	ron
Russian	rus
Slovakian	slk
Slovenian	slv
Albanian	sqi
Spanish	spa
Serbian	srp
Swedish	swe
Tamil	tam
Telugu	tel
Tagalog	tgl
Thai	tha
Turkish	tur
Ukrainian	ukr
Vietnamese	vie

Tips for improving OCR accuracy

You can improve the accuracy of the OCR process by scanning at a sufficient resolution and choosing the correct compression method when converting the scanned paper to a TIFF image and then to a PDF document:

Tesseract works best on text scanned at 300 dots per inch (DPI) or more, so it is beneficial to resize lower-resolution images.
Compression:
- Use lossless (ZIP) compression for color or grayscale images.
- Use CCITT Group 4 or JBIG2 (lossless) compression for monochrome images.
Lossless compression ensures that optical character recognition works on the highest-quality image, thereby improving the OCR accuracy. This is especially useful for low-resolution scans.
In addition, rotated and skewed images can reduce the accuracy of the OCR process.
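
To make the 300 DPI guideline concrete, the sketch below computes the pixel length an image side would need after resizing. The ScanResolution class and its ScaledLength method are hypothetical names used only for illustration. The arithmetic is simply the current pixel length multiplied by the target DPI and divided by the current DPI, rounded up.

```csharp
using System;

// Hypothetical helper for illustration; not part of Essential PDF or Tesseract.
static class ScanResolution
{
    // Resolution guideline taken from the tip above.
    public const double TargetDpi = 300.0;

    // Returns the pixel length needed to reach the target DPI,
    // or the original length if the image already meets it.
    public static int ScaledLength(int pixels, double currentDpi)
    {
        if (pixels <= 0) throw new ArgumentOutOfRangeException(nameof(pixels));
        if (currentDpi <= 0) throw new ArgumentOutOfRangeException(nameof(currentDpi));
        if (currentDpi >= TargetDpi) return pixels;
        return (int)Math.Ceiling(pixels * TargetDpi / currentDpi);
    }
}
```

For example, a 1275 x 1650 pixel scan at 150 DPI would need to be resized to 2550 x 3300 pixels, while a scan that is already at 300 DPI or more is left at its original size.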

For more details regarding quality improvement, refer to the following link:

https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

The sample can be checked out from this GitHub repository. Give it a star if you find it useful.

Take a moment to peruse the documentation, where you’ll find other options and features, all with accompanying code examples.

If you are new to our PDF library, it is highly recommended that you follow our Getting Started guide.

If you have any questions or need clarification about these features, please let us know in the comments below. You can also contact us through our support forum or Direct-Trac. We are happy to assist you!

If you like this blog post, we think you’ll also like the following resources:

[Ebook] C# Succinctly
[Ebook] PDF Succinctly
[Ebook] Web Servers Succinctly
[Blog post] 7 ways to compress PDF files in C#, VB.NET
[Blog post] HTML to PDF Conversion Using ASP.NET Core in Linux Docker

This post was originally published on February 20, 2015.

Tags: Essential PDF, Optical Character Recognition, Tesseract

Meet the Author

George Livingston

George Livingston is the Product Manager for PDF at Syncfusion Software. He is passionate about web technologies. He loves creating productive software tools and shooting photographs.

Comments (11)

nick

February 15, 2019 at 6:11 pm

Does it support .NET Core projects?

George Livingston

February 20, 2019 at 1:19 am

Hi Nick,

At present, we do not support the OCR processor in .NET Core projects. We will consider your request and update you once the feature is implemented in one of our upcoming releases.

Regards,
George

Chandran

October 4, 2019 at 8:39 am

Does OCR support the Telugu language?

processor.Settings.Language = "tel";
//Set tesseract OCR engine
processor.Settings.TesseractVersion = TesseractVersion.Version3_02;
//Process OCR by providing the bitmap image, data dictionary and language
string ocrText = processor.PerformOCR(image, @"../../OCR/Tessdata/");

This code throws an error.

Can you please resolve this?

Chandran

October 4, 2019 at 8:41 am

For additional info: I have placed all the trained data in the respective path.

Sowmiya Loganathan

October 7, 2019 at 6:26 am

@ Chandran

Hi Chandran,

Could you please download the Telugu tessdata (tel.traineddata) from the link below to work with Telugu characters while performing OCR:
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-302

Please try with the above tessdata and let us know the result. If you are still facing any issue, kindly share the error details and the input file. This will help us provide a precise solution.

Regards,
Sowmiya Loganathan

Dayakar

January 22, 2020 at 6:32 am

Does it support .NET Core projects now?

Sowmiya Loganathan

January 22, 2020 at 8:21 am

@ Dayakar

Hi Dayakar,

At present, we do not support performing OCR on the ASP.NET Core platform. We have already logged a feature request for this and plan to implement it in our upcoming 2020 Volume 1 release, which is tentatively expected to be available by March 2020. We will let you know once the feature is implemented. The status of the implementation can be tracked through our Feature Management System:
https://www.syncfusion.com/feedback/4467/support-for-performing-ocr-in-a-pdf-document

Please let us know if you have any concerns about this.

Regards,
Sowmiya Loganathan

Sivakumar Balakrishnan

July 6, 2020 at 5:01 am

How can I read the text from a PDF file using Tesseract?

Sowmiya Loganathan

July 8, 2020 at 6:51 am

@ Sivakumar Balakrishnan

Hi Sivakumar,

The text from an OCRed document can be read in the following two ways:

• Code snippet
//Process OCR by providing the PDF document and Tesseract data
string str = processor.PerformOCR(lDoc, @"../../Tessdata/", true);

• Select the text (Ctrl+A) in the resultant OCRed PDF document and paste it into a text file.

Regards,
Sowmiya Loganathan

Sowmiya Loganathan

July 8, 2020 at 6:56 am

@ Sowmiya Loganathan

Hi Sivakumar,

We can also extract the text from a PDF document using the ExtractText method. Please refer to the documentation below for more details:
https://help.syncfusion.com/file-formats/pdf/working-with-text-extraction

Note: The PerformOCR method returns only the text OCRed by OCRProcessor. Other existing text can be extracted by using this feature.

Regards,
Sowmiya Loganathan

Sowmiya Loganathan

July 8, 2020 at 7:01 am

Hi Everyone,

We have provided support for the feature “Support for performing OCR in a PDF document in ASP.NET Core platform” from version 18.1.0.42. Please refer to the link below for more details:
https://www.syncfusion.com/blogs/post/easiest-way-to-ocr-process-pdf-documents-in-asp-net-core.aspx

Regards,
Sowmiya Loganathan
