How to convert scanned image to searchable PDF

Optical Character Recognition (OCR)

Optical Character Recognition (OCR) is a technology used to convert scanned paper documents, in the form of PDF files or images, into searchable and editable data. Paper documents such as brochures, invoices, and contracts are often scanned so that they can be sent via email. Scanning converts the document into dots of different colors, known as a raster image. To extract the data and repurpose the content of the document, an OCR engine is necessary. The OCR engine detects the characters present in an image and assembles them into words and then sentences, so that the content of the document can be searched and edited.

Tesseract is an open-source optical character recognition engine and is widely regarded as one of the most accurate open-source OCR engines available.

Syncfusion Essential PDF supports OCR by using the Tesseract open-source engine. With a few lines of code, a scanned paper document containing raster images can be converted to a searchable and selectable document.

The following assemblies are required to use the OCR feature in your application.

Syncfusion assemblies

  • Syncfusion.Compression.Base.dll
  • Syncfusion.Pdf.Base.dll
  • Syncfusion.OCRProcessor.Base.dll

Tesseract assemblies

  • SyncfusionTesseract.dll (Tesseract Engine Version 3.02)
  • liblept168.dll (Leptonica image processing library used by the Tesseract engine)

Steps to convert a scanned image to a searchable PDF programmatically:

  1. Create a new C# console application project.

  2. Install the Syncfusion.Pdf.Base and Syncfusion.OCRProcessor.Base NuGet packages from NuGet.org as references in your .NET Framework application.

 

 

 

  3. Include the following namespaces in the Program.cs file.
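The namespaces this walkthrough relies on can be sketched as follows. The original listing is not reproduced here, so treat this as an assumption based on the public Syncfusion class names used later in the article (OCRProcessor, PdfDocument, PdfBitmap, PdfLoadedDocument), plus System.IO for an in-memory stream.

```csharp
using System.IO;                  // MemoryStream, used to hand the PDF over in memory
using Syncfusion.OCRProcessor;    // OCRProcessor and the Languages enumeration
using Syncfusion.Pdf;             // PdfDocument, PdfPage
using Syncfusion.Pdf.Graphics;    // PdfBitmap (the scanned image)
using Syncfusion.Pdf.Parsing;     // PdfLoadedDocument (the document OCR runs on)
```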

 

 

  4. The Tesseract assemblies are not added as references. They must be kept on the local machine, and their location is passed as a parameter to the OCR processor.

 

 

  5. Place the Tesseract language data (e.g., eng.traineddata) on the local system and provide its path to the OCR processor.
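Because the Tesseract binaries and the language data are located only through folder paths, a wrong path tends to surface late, when OCR is attempted. The sketch below checks both folders up front using System.IO only. The helper FindMissingFiles and the folder locations are hypothetical, and the file names follow the assembly list above; verify the exact spelling against the files in your download.

```csharp
using System;
using System.IO;

static class OcrPaths
{
    // Hypothetical helper (not part of Syncfusion): returns the expected
    // files that are missing from the binaries and language-data folders.
    public static string[] FindMissingFiles(string binariesDir, string tessDataDir)
    {
        string[] expected =
        {
            // File names assumed from the article; check them against your download.
            Path.Combine(binariesDir, "SyncfusionTesseract.dll"),
            Path.Combine(binariesDir, "liblept168.dll"),
            Path.Combine(tessDataDir, "eng.traineddata"),
        };
        return Array.FindAll(expected, path => !File.Exists(path));
    }
}

static class Program
{
    static void Main()
    {
        // Assumed locations; adjust to wherever the OCR processor zip was extracted.
        string binariesDir = @"..\..\TesseractBinaries\";
        string tessDataDir = @"..\..\Tessdata\";

        foreach (string missing in OcrPaths.FindMissingFiles(binariesDir, tessDataDir))
        {
            Console.WriteLine("Missing: " + missing);
        }
    }
}
```

If nothing is printed, the two folder paths are ready to be passed to the OCR processor.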

 

 

The dictionary packs for the other languages can be downloaded from the following online location:

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-302

Note: You can get the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll) and the language pack (tessdata) by downloading the OCR processor zip file from the Add-On section at the following link.

https://www.syncfusion.com/downloads/latest-version

  6. Use the following code snippet to convert a scanned image to a searchable PDF.
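A minimal sketch of the conversion is shown here, assuming the Syncfusion.Pdf.Base and Syncfusion.OCRProcessor.Base packages from step 2 are referenced. The folder paths, the input file name Input.jpg, and the output name Output.pdf are placeholders, not values from the original sample. The image is drawn onto a new PDF page, the PDF is reloaded from memory, and OCR is then run on the loaded document.

```csharp
using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Graphics;
using Syncfusion.Pdf.Parsing;

class Program
{
    static void Main()
    {
        // Assumed folder and file locations; adjust to your project layout.
        const string binariesDir = @"..\..\TesseractBinaries\";
        const string tessDataDir = @"..\..\Tessdata\";
        const string inputImage = @"..\..\Data\Input.jpg";

        // Point the OCR processor at the folder that holds the Tesseract binaries.
        using (OCRProcessor processor = new OCRProcessor(binariesDir))
        {
            // Build a one-page PDF with the scanned image filling the page.
            PdfDocument document = new PdfDocument();
            document.PageSettings.Margins.All = 0;
            PdfPage page = document.Pages.Add();
            PdfBitmap image = new PdfBitmap(inputImage);
            page.Graphics.DrawImage(image, 0, 0,
                page.GetClientSize().Width, page.GetClientSize().Height);

            // Save to memory and reload, since OCR is performed on a loaded document.
            MemoryStream stream = new MemoryStream();
            document.Save(stream);
            document.Close(true);
            stream.Position = 0;
            PdfLoadedDocument loadedDocument = new PdfLoadedDocument(stream);

            // Select the language and pass the folder containing eng.traineddata.
            processor.Settings.Language = Languages.English;
            processor.PerformOCR(loadedDocument, tessDataDir);

            // Write the searchable PDF next to the executable.
            loadedDocument.Save("Output.pdf");
            loadedDocument.Close(true);
        }
    }
}
```

Setting the page margins to zero and drawing the image at the full client size keeps the recognized text aligned with the scanned page.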

 

 

You can download the working sample from OCRSample.Zip.

By executing the program, you will get the PDF document as follows.

Take a moment to peruse the documentation, where you will find other options, such as performing OCR on an image, on a region of a document, and on large PDF documents, along with code examples.

Refer here to explore the rich set of Syncfusion Essential PDF features.

Note:

Starting with v16.2.0.x, if you reference Syncfusion assemblies from the trial setup or from the NuGet feed, you must include a license key in your projects. Refer to the Syncfusion licensing documentation to learn how to generate and register a license key in your application so that the components run without a trial message.
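As a rough sketch, registration is typically a single call made at application startup, before any Syncfusion component is created. This assumes the Syncfusion.Licensing assembly is referenced; the key string below is a placeholder, not a real key.

```csharp
using Syncfusion.Licensing;

class Program
{
    static void Main()
    {
        // Placeholder value: replace with the key generated for your account.
        SyncfusionLicenseProvider.RegisterLicense("YOUR LICENSE KEY");

        // Create the OCRProcessor and run the conversion after this point.
    }
}
```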

 

Article ID: 9144
Published Date: 08/14/2018
Last Revised Date: 08/28/2018
Platform: Windows Forms
Control: PDF