2X faster development
The ultimate WinForms UI toolkit to boost your development speed.
Tesseract is an optical character recognition (OCR) engine and one of the most accurate OCR engines currently available. Syncfusion Essential PDF supports OCR by using the Tesseract open-source engine. With a few lines of code, a scanned paper document containing raster images can be converted to a searchable and selectable document.

The following assemblies are required to use the OCR feature in your application:
Syncfusion assemblies
Tesseract assemblies
Steps to convert scanned image to searchable PDF (OCR) programmatically:
C#
using Syncfusion.Pdf;
using Syncfusion.Pdf.Graphics;
using Syncfusion.Pdf.Parsing;
using Syncfusion.OCRProcessor;
using System.IO;

VB.NET
Imports Syncfusion.Pdf
Imports Syncfusion.Pdf.Graphics
Imports Syncfusion.Pdf.Parsing
Imports Syncfusion.OCRProcessor
Imports System.IO

C#
OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");

VB.NET
Dim processor As New OCRProcessor("TesseractBinaries\")

C#
OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");
processor.PerformOCR(lDoc, @"TessData\");

VB.NET
Dim processor As New OCRProcessor("TesseractBinaries\")
processor.PerformOCR(lDoc, "TessData\")
The dictionary packs for other languages can be downloaded from the following online location:
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-302

Note: You can get the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll) and the language pack (tessdata) by downloading the OCR processor zip file from the Add-On section at the following link:
https://www.syncfusion.com/downloads/latest-version
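Because both the Tesseract binaries folder and the tessdata folder are passed as plain path strings, it can help to confirm that the expected files are present before OCR is started. The following is a minimal sketch that uses only System.IO. The OcrPreflight class and its FindMissing method are hypothetical helpers written for this illustration, not part of the Syncfusion API, and the "<language>.traineddata" file name is an assumption based on the usual Tesseract naming convention.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not a Syncfusion type): checks the folder layout
// that the note above describes before any OCR call is made.
public static class OcrPreflight
{
    // Returns the full paths of the files that are missing.
    // An empty list means every expected file was found.
    public static List<string> FindMissing(
        string binariesDir, string tessDataDir, string languageCode = "eng")
    {
        var required = new[]
        {
            // Binary names taken from the note above.
            Path.Combine(binariesDir, "SyncfusionTesseract.dll"),
            Path.Combine(binariesDir, "liblept168.dll"),
            // Assumption: language data is stored as "<code>.traineddata".
            Path.Combine(tessDataDir, languageCode + ".traineddata"),
        };

        var missing = new List<string>();
        foreach (string path in required)
        {
            if (!File.Exists(path))
            {
                missing.Add(Path.GetFullPath(path));
            }
        }
        return missing;
    }
}
```

A caller could run OcrPreflight.FindMissing(@"TesseractBinaries\", @"TessData\") at startup and log or display any returned paths before constructing the OCRProcessor. Relative paths are resolved against the current working directory, so the reported full paths also show where the application is actually looking.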
C#
//Create a new PDF document
PdfDocument document = new PdfDocument();
//Add a page to the document
PdfPage page = document.Pages.Add();
//Create PDF graphics for the page
PdfGraphics graphics = page.Graphics;
//Load the image from the disk
PdfBitmap image = new PdfBitmap("Input.jpg");
//Draw the image so that it fills the page's client area
graphics.DrawImage(image, 0, 0, page.GetClientSize().Width, page.GetClientSize().Height);
//Save the document into a stream
MemoryStream stream = new MemoryStream();
document.Save(stream);
//Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
using (OCRProcessor processor = new OCRProcessor(@"/Tesseract Binaries/"))
{
    //Load the PDF document from the stream
    PdfLoadedDocument lDoc = new PdfLoadedDocument(stream);
    //Set the OCR language to process
    processor.Settings.Language = Languages.English;
    //Process OCR by providing the PDF document and the Tesseract data folder
    processor.PerformOCR(lDoc, @"/Tessdata/");
    //Save the OCR-processed PDF document to the disk
    lDoc.Save("OCR.pdf");
    //Close the document
    lDoc.Close(true);
}
//Open the PDF file in the default PDF viewer to see the result (Process requires "using System.Diagnostics;")
Process.Start("OCR.pdf");
VB.NET
'Create a new PDF document
Dim document As New PdfDocument()
'Add a page to the document
Dim page As PdfPage = document.Pages.Add()
'Create PDF graphics for the page
Dim graphics As PdfGraphics = page.Graphics
'Load the image from the disk
Dim image As New PdfBitmap("Input.jpg")
'Draw the image so that it fills the page's client area
graphics.DrawImage(image, 0, 0, page.GetClientSize().Width, page.GetClientSize().Height)
'Save the document into a stream
Dim stream As New MemoryStream()
document.Save(stream)
'Initialize the OCR processor by providing the path of tesseract binaries(SyncfusionTesseract.dll and liblept1
Using processor As New OCRProcessor("Tesseract Binaries\")
    'Load the PDF document from the stream
    Dim lDoc As New PdfLoadedDocument(stream)
    'Set the OCR language to process
    processor.Settings.Language = Languages.English
    'Process OCR by providing the PDF document and the Tesseract data folder
    processor.PerformOCR(lDoc, "Tessdata\")
    'Save the OCR-processed PDF document to the disk
    lDoc.Save("OCR.pdf")
    'Close the document
    lDoc.Close(True)
End Using
'Open the PDF file in the default PDF viewer to see the result (Process requires "Imports System.Diagnostics")
Process.Start("OCR.pdf")
You can download the working sample from OCRSample.Zip.

By executing the program, you will get the PDF document as follows.

Take a moment to peruse the documentation, where you will find other options, such as performing OCR on an image, on a region of the document, and on large PDF documents, with code examples. Refer to the Syncfusion Essential PDF feature tour to explore its rich set of features.

Note: Starting with v16.2.0.x, if you reference Syncfusion assemblies from the trial setup or from the NuGet feed, you must include a license key in your projects. Refer to the licensing documentation to learn how to generate and register a Syncfusion license key in your application so that the components can be used without the trial message.
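The license registration mentioned in the note can be sketched as follows. This assumes the project references the Syncfusion.Licensing assembly and that SyncfusionLicenseProvider.RegisterLicense is the registration call, as described in Syncfusion's licensing documentation; the LicenseSetup class is a hypothetical wrapper for illustration, and the key string is a placeholder.

```csharp
using Syncfusion.Licensing;

// Hypothetical wrapper for illustration only.
public static class LicenseSetup
{
    // Call once at application startup, before any Syncfusion component is used.
    public static void Register()
    {
        // Placeholder: replace with the license key generated for your account.
        SyncfusionLicenseProvider.RegisterLicense("YOUR LICENSE KEY");
    }
}
```

Registering the key at the earliest point of startup, for example at the beginning of Main, keeps the call ahead of the first OCRProcessor or PdfDocument instance.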
Thank you for the helpful guide and instructions on how to convert a scanned image to a searchable PDF by processing OCR. You should try zetpdf.com for converting to PDF and such. It's very convenient and easy to use!
I followed the instructions and got the following error: 'Tesseract engine has not been initialized'. This is what I created: using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\"))
On this line, processor.PerformOCR(image, @"TessData\"), I get the crash. How can it be resolved?