Tesseract OPX in File Formats

Introduction

Tesseract is an open source Optical Character Recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly or, for programmers, through an API to extract typed, handwritten, or printed text from images. Tesseract OPX makes it easy to use Tesseract with Microsoft .NET. It is also optimized for working with Syncfusion Essential PDF for .NET, so that PDF documents containing images of text can be processed: together, Tesseract OPX and Essential PDF recognize the text in images within PDF documents and overlay those images with searchable text.

Assemblies Required

To use the OCR feature in your application, add references to the following assemblies:

Assembly Name	Description
Syncfusion.Pdf.Base	This assembly contains the core features for manipulating and saving PDF documents.
Syncfusion.Compression.Base	This assembly compresses the internal contents of a PDF document.
Syncfusion.OCRProcessor.Base	This assembly contains the core features for performing OCR on images and PDF documents.

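Add the following namespaces to the application. The System namespaces are needed for the List, RectangleF, and Bitmap types used in the snippets on this page:

using System.Collections.Generic;
using System.Drawing;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;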

Performing OCR on PDF document

You can perform OCR on a PDF document with the help of the OCRProcessor class. Place the SyncfusionTesseract.dll and liblept168.dll assemblies (available in the installed location <Installation Location>\Syncfusion\Essential Studio <version number>\OCRProcessor) on the local system and provide the path of the folder that contains them to the OCR processor.

OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");

Place the Tesseract language data (for example, eng.traineddata), available in the installed location <Installation Location>\Syncfusion\Essential Studio <version number>\OCRProcessor, on the local system and provide the path of the folder that contains it to the OCR processor.

OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");
//lDoc is a PdfLoadedDocument that has already been loaded
processor.PerformOCR(lDoc, @"Tessdata\");
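Because both paths are plain folder locations, it can help to confirm that the expected files are present before creating the processor. The following is a minimal sketch that uses only System.IO; the class name OcrPathCheck and the method MissingFiles are hypothetical helpers for illustration and are not part of the Syncfusion API. The folder and file names are taken from the text above.

```csharp
using System;
using System.IO;

static class OcrPathCheck
{
    // Hypothetical helper: returns the expected files that are absent from the folder.
    public static string[] MissingFiles(string folder, params string[] expectedFiles)
    {
        return Array.FindAll(expectedFiles, name => !File.Exists(Path.Combine(folder, name)));
    }

    static void Main()
    {
        // Folder names follow the snippets on this page; adjust them to your layout.
        string[] missingBinaries = MissingFiles(@"TesseractBinaries\", "SyncfusionTesseract.dll", "liblept168.dll");
        string[] missingData = MissingFiles(@"Tessdata\", "eng.traineddata");

        foreach (string name in missingBinaries)
            Console.WriteLine("Missing Tesseract binary: " + name);
        foreach (string name in missingData)
            Console.WriteLine("Missing language data: " + name);
    }
}
```

Running a check like this first makes a misconfigured path easier to diagnose than a failure raised during OCR processing.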

You can also download the language packages from the following link: https://github.com/tesseract-ocr/tessdata

The following code snippet shows the complete process for a PDF document.

//Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\"))
{
	//Load a PDF document
	PdfLoadedDocument lDoc = new PdfLoadedDocument("Input.pdf");
	//Set OCR language to process
	processor.Settings.Language = Languages.English;
	//Process OCR by providing the PDF document and Tesseract data
	processor.PerformOCR(lDoc, @"Tessdata\");
	//Save the OCR-processed PDF document to disk
	lDoc.Save("Sample.pdf");
	lDoc.Close(true);
}

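Performing OCR for a region of the document

You can restrict OCR to specific rectangular regions of a page by assigning a list of PageRegion objects to the Settings.Regions property, as in the following code snippet.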

//Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\"))
{
	//Load a PDF document
	PdfLoadedDocument lDoc = new PdfLoadedDocument("Input.pdf");
	//Set OCR language to process
	processor.Settings.Language = Languages.English;
	//Define the rectangular area of the page to process
	RectangleF rect = new RectangleF(0, 100, 950, 150);
	//Assign rectangles to the page
	List<PageRegion> pageRegions = new List<PageRegion>();
	PageRegion region = new PageRegion();
	region.PageIndex = 1;
	region.PageRegions = new RectangleF[] { rect };
	pageRegions.Add(region);
	processor.Settings.Regions = pageRegions;
	//Process OCR by providing the PDF document and Tesseract data
	processor.PerformOCR(lDoc, @"Tessdata\");
	//Save the OCR-processed PDF document to disk
	lDoc.Save("Sample.pdf");
	lDoc.Close(true);
}

Performing OCR on image

You can also perform OCR on an image. The following code snippet demonstrates this.

//Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\"))
{
	//Load the input image
	Bitmap image = new Bitmap("input.jpeg");
	//Set OCR language to process
	processor.Settings.Language = Languages.English;
	//Process OCR by providing the bitmap image and Tesseract data; the recognized text is returned as a string
	string ocrText = processor.PerformOCR(image, @"Tessdata\");
	image.Dispose();
}
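In the snippet above, the recognized text is only held in the ocrText variable. As a rough sketch of one way to keep it, the text could be written to a file next to the source image. The class OcrTextOutput and the method SaveRecognizedText are hypothetical names used for illustration; they are not part of the Syncfusion API, and the sample string stands in for the value returned by PerformOCR.

```csharp
using System;
using System.IO;
using System.Text;

static class OcrTextOutput
{
    // Hypothetical helper: writes recognized text beside the image (input.jpeg becomes input.txt).
    public static string SaveRecognizedText(string imagePath, string ocrText)
    {
        string textPath = Path.ChangeExtension(imagePath, ".txt");
        // Treat a null result as empty text so the file is still created.
        File.WriteAllText(textPath, ocrText ?? string.Empty, Encoding.UTF8);
        return textPath;
    }

    static void Main()
    {
        // Placeholder text; in practice pass the string returned by PerformOCR.
        string path = SaveRecognizedText("input.jpeg", "sample recognized text");
        Console.WriteLine("Text written to " + path);
    }
}
```

Writing with an explicit UTF-8 encoding is a design choice here so that text recognized in non-English languages is stored without loss.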
