
How to convert a scanned image to a searchable PDF by processing OCR

Platform: WinForms |
Control: PDF |
Published Date: August 14, 2018 |
Last Revised Date: May 3, 2019

Tesseract is an open-source optical character recognition (OCR) engine and one of the most accurate OCR engines currently available.

Syncfusion Essential PDF supports OCR by using the Tesseract open-source engine. With a few lines of code, a scanned paper document containing raster images can be converted to a searchable and selectable document.

The following assemblies are required to use the OCR feature in your application.

Syncfusion assemblies

  • Syncfusion.Compression.Base.dll
  • Syncfusion.Pdf.Base.dll
  • Syncfusion.OcrProcessor.Base.dll

Tesseract assemblies

  • SyncfusionTesseract.dll (Tesseract Engine Version 3.02)
  • liblept168.dll (Leptonica image processing library used by Tesseract engine)

Steps to convert a scanned image to a searchable PDF (OCR) programmatically:

  1. Create a new C# console application project.
  2. Install the Syncfusion.Pdf.WinForms and Syncfusion.OCRProcessor.Base NuGet packages from NuGet.org as references in your .NET Framework application.
  3. Include the following namespaces in the Program.cs file.

C#

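using Syncfusion.Pdf;
using Syncfusion.Pdf.Graphics;
using Syncfusion.Pdf.Parsing;
using Syncfusion.OCRProcessor;
using System.IO;
//Required for Process.Start, which opens the resulting PDF in the final step
using System.Diagnostics;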

 

VB.NET

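Imports Syncfusion.Pdf
Imports Syncfusion.Pdf.Graphics
Imports Syncfusion.Pdf.Parsing
Imports Syncfusion.OCRProcessor
Imports System.IO
'Required for Process.Start, which opens the resulting PDF in the final step
Imports System.Diagnostics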

 

  4. The Tesseract assemblies are located in the folder where the NuGet package is installed. Copy them to your application folder and pass the location of the assemblies as a parameter to the OCR processor.

C#

OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");

 

VB.NET

Dim processor As New OCRProcessor("TesseractBinaries\")
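Before constructing the processor, it can help to confirm that the native binaries really are in the folder you pass. The following is a minimal sketch, not part of the Syncfusion API: the class name `TesseractBinaryCheck` and the method `FindMissing` are hypothetical helpers that use only `System.IO`. How the OCR processor itself resolves a relative path is not stated in this article, so the helper resolves relative folders against the executable's directory as an assumption.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of Syncfusion): reports which of the
// expected files are absent from a folder.
static class TesseractBinaryCheck
{
    // If 'folder' is relative, it is resolved against the executable's
    // directory (an assumption of this sketch); an absolute path is used as is.
    // Returns the names of the files that were not found.
    public static string[] FindMissing(string folder, params string[] fileNames)
    {
        string resolved = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, folder);
        return fileNames
            .Where(name => !File.Exists(Path.Combine(resolved, name)))
            .ToArray();
    }
}
```

Call it with the same folder you intend to pass to the OCRProcessor constructor and the DLL names listed under "Tesseract assemblies". An empty result means every listed file was found.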

 

  5. Place the Tesseract language data (e.g., eng.traineddata) in a folder within your application folder and pass the path of that folder to the PerformOCR method.

C#

OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");
processor.PerformOCR(lDoc, @"TessData\");

 

VB.NET

Dim processor As New OCRProcessor("TesseractBinaries\")
processor.PerformOCR(lDoc, "TessData\")
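To see which language data files are actually present in the folder you pass to PerformOCR, a small check such as the following can be used. It is a sketch under stated assumptions: `TessDataCheck` and `ListLanguages` are hypothetical names, not Syncfusion APIs, and the helper assumes only that language data files end in `.traineddata`, as in the eng.traineddata example above.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of Syncfusion): lists the language codes
// for which a .traineddata file exists in a folder.
static class TessDataCheck
{
    // Returns file names without the .traineddata extension, sorted
    // alphabetically, or an empty array if the folder does not exist.
    public static string[] ListLanguages(string tessDataFolder)
    {
        if (!Directory.Exists(tessDataFolder))
            return new string[0];

        return Directory.GetFiles(tessDataFolder, "*.traineddata")
            .Select(path => Path.GetFileNameWithoutExtension(path))
            .OrderBy(code => code, StringComparer.OrdinalIgnoreCase)
            .ToArray();
    }
}
```

If the array returned for your TessData folder does not contain "eng", the English language data has not been copied to that location.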

 

The dictionary packs for the other languages can be downloaded from the following online location:

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-302

Note:

You can get the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll) and the language pack (tessdata) by downloading the OCR processor zip file from the Add-On section at the following link.

https://www.syncfusion.com/downloads/latest-version

  6. Use the following code snippet to convert a scanned image to a searchable PDF.

C#

//Create a new PDF document
PdfDocument document = new PdfDocument();
//Add a page to the document
PdfPage page = document.Pages.Add();
//Create PDF graphics for the page
PdfGraphics graphics = page.Graphics;
//Load the image from the disk
PdfBitmap image = new PdfBitmap("Input.jpg");
//Draw the image so that it fills the client area of the page
graphics.DrawImage(image, 0, 0, page.GetClientSize().Width, page.GetClientSize().Height);
//Save the document into a stream
MemoryStream stream = new MemoryStream();
document.Save(stream);
//Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\"))
{
    //Load the PDF document from the stream
    PdfLoadedDocument lDoc = new PdfLoadedDocument(stream);
    //Set the OCR language to process
    processor.Settings.Language = Languages.English;
    //Process OCR by providing the PDF document and the Tesseract language data folder
    processor.PerformOCR(lDoc, @"TessData\");
    //Save the OCR-processed PDF document to the disk
    lDoc.Save("OCR.pdf");
    //Close the document
    lDoc.Close(true);
}
//Open the PDF file in the default PDF viewer to see the result
Process.Start("OCR.pdf");

 

VB.NET

'Create a new PDF document
Dim document As New PdfDocument()
'Add a page to the document
Dim page As PdfPage = document.Pages.Add()
'Create PDF graphics for a page
Dim graphics As PdfGraphics = page.Graphics
'Load the image from the disk
Dim image As New PdfBitmap("Input.jpg")
'Draw the image
graphics.DrawImage(image, 0, 0, page.GetClientSize().Width, page.GetClientSize().Height)
'Save the document into stream
Dim stream As New MemoryStream()
document.Save(stream)
'Initialize the OCR processor by providing the path of tesseract binaries(SyncfusionTesseract.dll and liblept1
Using processor As New OCRProcessor("TesseractBinaries\")
    'Load the PDF document from the stream
    Dim lDoc As New PdfLoadedDocument(stream)
    'Set the OCR language to process
    processor.Settings.Language = Languages.English
    'Process OCR by providing the PDF document and the Tesseract language data folder
    processor.PerformOCR(lDoc, "TessData\")
    'Save the OCR-processed PDF document to the disk
    lDoc.Save("OCR.pdf")
    'Close the document
    lDoc.Close(True)
End Using
'Open the PDF file in the default PDF viewer to see the result
Process.Start("OCR.pdf")

 

You can download the working sample from OCRSample.Zip

By executing the program, you will get the searchable PDF document, OCR.pdf.

Take a moment to peruse the documentation, where you will find other options, such as performing OCR on an image, on a region of a document, and on large PDF documents, with code examples.

Refer here to explore the rich set of Syncfusion Essential PDF features.

Note:

Starting with v16.2.0.x, if you reference Syncfusion assemblies from the trial setup or from the NuGet feed, include a license key in your projects. Refer to the Syncfusion documentation to learn about generating and registering a Syncfusion license key in your application so that the components can be used without a trial message.
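As a rough illustration of the registration step, the sketch below assumes that the Syncfusion.Licensing assembly is referenced and that the key is registered through `SyncfusionLicenseProvider.RegisterLicense`. The wrapper class `LicenseSetup` and the key string are placeholders, not values from this article.

```csharp
using Syncfusion.Licensing;

// Hypothetical wrapper: call Register() once at application startup,
// before any Syncfusion PDF or OCR object is created.
static class LicenseSetup
{
    public static void Register()
    {
        // Placeholder string: replace it with the key generated for your account.
        SyncfusionLicenseProvider.RegisterLicense("YOUR LICENSE KEY");
    }
}
```

In a console application, `LicenseSetup.Register()` would typically be the first statement in Main.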

 

 

Comments
Jordan Capa
Apr 23, 2019

Thank you for the helpful guide and instructions on how to convert a scanned image to a searchable PDF by processing OCR. You should try zetpdf.com for converting to PDF and such. It's very convenient and easy to use!

Tova
Jun 14, 2020

I followed the instructions and got the following error: 'Tesseract engine has not been initialized'. This is what I created:

using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\"))
{
    //loading the input image
    Bitmap image = new Bitmap(strfilename);
    //Set OCR language to process
    processor.Settings.Language = Syncfusion.OCRProcessor.Languages.English;
    //Process OCR by providing the bitmap image, data dictionary and language
    ocrText = processor.PerformOCR(image, @"TessData\");
}

The failure occurs on this line: processor.PerformOCR(image, @"TessData\"). How can it be resolved?
