How to Get the Page of the OCR'ed Text

The Syncfusion .NET Optical Character Recognition (OCR) Library is used to extract text from scanned PDFs and images. With a few lines of C# code, a scanned PDF document containing a raster image is converted into a searchable and selectable PDF document. The OCR result can be saved as text, structured data, or a searchable PDF document. The .NET OCR Library uses the powerful Tesseract OCR engine.
Using this library, the OCR process can be performed on individual pages of a PDF document to acquire the text of each page separately in C# and VB.NET.

Steps to get the page of the OCR’ed text programmatically

  1. Create a new C# Windows Forms application project.
  2. Install the Syncfusion.Pdf.OCR.WinForms NuGet package as a reference to your .NET Framework application from NuGet.org.
  3. Add a new button in the Form1.Designer.cs file.
 private System.Windows.Forms.Button button1;
 private System.Windows.Forms.Label label1;
 private void InitializeComponent() 
 {
      this.button1 = new System.Windows.Forms.Button();
      this.label1 = new System.Windows.Forms.Label();
      this.SuspendLayout();
      // 
      // button1
      // 
      this.button1.Location = new System.Drawing.Point(298, 226);
      this.button1.Name = "button1";
      this.button1.Size = new System.Drawing.Size(159, 46);
      this.button1.TabIndex = 0;
      this.button1.Text = "Perform OCR";
      this.button1.UseVisualStyleBackColor = true;
      this.button1.Click += new System.EventHandler(this.button1_Click);
      // 
      // label1
      // 
      this.label1.AutoSize = true;
      this.label1.Location = new System.Drawing.Point(136, 193);
      this.label1.Name = "label1";
      this.label1.Size = new System.Drawing.Size(503, 20);
      this.label1.TabIndex = 1;
      this.label1.Text = "Click button to perform OCR of the PDF document";
      // 
      // Form1
      // 
      this.AutoScaleDimensions = new System.Drawing.SizeF(9F, 20F);
      this.AutoScaleMode = System.Windows.Forms.AutoScaleMode.Font;
      this.ClientSize = new System.Drawing.Size(800, 450);
      this.Controls.Add(this.label1);
      this.Controls.Add(this.button1);
      this.Name = "Form1";
      this.Text = "Form1";
      this.ResumeLayout(false);
      this.PerformLayout();
 }
  4. Include the following namespaces in the Form1.cs file.

C#

using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;
using System.IO;

VB.NET

Imports Syncfusion.OCRProcessor
Imports Syncfusion.Pdf.Parsing
Imports System.IO
  5. Create the button1_Click event handler in the Form1.cs file and perform OCR on the PDF document one page at a time to acquire the text of each page separately, as in the following code example.

C#

string resulttext = string.Empty;
//Load the existing PDF document.
PdfLoadedDocument lDoc = new PdfLoadedDocument("Input.pdf");
for (int i = 0; i < lDoc.Pages.Count; i++)
{
    //Initialize the OCR processor.
    using (OCRProcessor processor = new OCRProcessor())
    {
        //Set the performance.
        processor.Settings.Performance = Performance.Slow;
        //Add a header with the zero-based page index.
        resulttext += " \n" + "page no " + i.ToString() + "\n";
        //Process OCR by providing the loaded PDF document page by page.
        resulttext += processor.PerformOCR(lDoc, i, i, processor.TessDataPath);
    }
}
//Save the OCRed text with the page number.
File.WriteAllText("Result.txt", resulttext);
//Close the document.
lDoc.Close(true);

VB.NET

Dim resulttext As String = String.Empty
'Load the existing PDF document.
Dim lDoc As PdfLoadedDocument = New PdfLoadedDocument("Input.pdf")
For i As Integer = 0 To lDoc.Pages.Count - 1
    'Initialize the OCR processor.
    Using processor As OCRProcessor = New OCRProcessor()
        'Set the performance.
        processor.Settings.Performance = Performance.Slow
        'Add a header with the zero-based page index. VB.NET has no "\n" escape, so vbLf is used.
        resulttext += " " + vbLf + "page no " + i.ToString() + vbLf
        'Process OCR by providing the loaded PDF document page by page.
        resulttext += processor.PerformOCR(lDoc, i, i, processor.TessDataPath)
    End Using
Next
'Save the OCRed text with the page number.
File.WriteAllText("Result.txt", resulttext)
'Close the document
lDoc.Close(True)
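
The example above appends every page to a single string. If each page's text is needed on its own, for instance to write one file per page, the per-page results can be kept in a list instead. The following is only a minimal sketch of that idea: PageTextHelper, CollectPageTexts, BuildReport, and the ocrPage delegate are hypothetical names introduced for illustration and are not part of the Syncfusion API. The sketch also labels pages starting from 1, which differs from the zero-based index printed by the example above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PageTextHelper
{
    // Calls ocrPage once per zero-based page index and returns one entry per page.
    // ocrPage is a hypothetical delegate; in the article's code it would wrap
    // processor.PerformOCR(lDoc, i, i, processor.TessDataPath).
    public static List<string> CollectPageTexts(int pageCount, Func<int, string> ocrPage)
    {
        if (ocrPage == null) throw new ArgumentNullException(nameof(ocrPage));

        var pages = new List<string>(Math.Max(pageCount, 0));
        for (int i = 0; i < pageCount; i++)
        {
            // Treat a null result as an empty page so later code can rely on non-null text.
            pages.Add(ocrPage(i) ?? string.Empty);
        }
        return pages;
    }

    // Joins the per-page texts into one report with 1-based "page no" headers.
    public static string BuildReport(IReadOnlyList<string> pages)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < pages.Count; i++)
        {
            sb.Append("page no ").Append(i + 1).Append('\n');
            sb.Append(pages[i]).Append('\n');
        }
        return sb.ToString();
    }
}
```

Inside button1_Click, the delegate could be i => processor.PerformOCR(lDoc, i, i, processor.TessDataPath), using the same call as the example above. Each list entry can then be saved separately with File.WriteAllText, or the combined text from BuildReport can be written to Result.txt.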

A complete working sample can be downloaded from OCRPageByPage.zip.

Executing the program produces a text file, Result.txt, that contains the extracted text of each page preceded by its page number.

Take a moment to peruse the documentation, where you will find other options, such as performing OCR on an image, a region of the document, a rotated page, or a large PDF document, along with code examples.

Refer here to explore the rich set of Syncfusion Essential PDF features.

Note: Starting with v16.2.0.x, if you reference Syncfusion assemblies from the trial setup or the NuGet feed, you must include a license key in your projects. Refer to this link to learn about generating and registering the Syncfusion license key in your application to use the components without a trial message.
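
As a rough sketch of the registration mentioned in the note, the key is typically registered once at application startup, before any Syncfusion type is used. The example below assumes the default Program.cs of a Windows Forms project and that the Syncfusion.Licensing assembly is referenced; "YOUR LICENSE KEY" is a placeholder, and the linked licensing documentation remains the authoritative reference.

```csharp
using System;
using System.Windows.Forms;

static class Program
{
    [STAThread]
    static void Main()
    {
        // Register the license key before any Syncfusion library is used.
        // "YOUR LICENSE KEY" is a placeholder, not a real key.
        Syncfusion.Licensing.SyncfusionLicenseProvider.RegisterLicense("YOUR LICENSE KEY");

        Application.EnableVisualStyles();
        Application.SetCompatibleTextRenderingDefault(false);
        // Form1 is the form created in the steps above.
        Application.Run(new Form1());
    }
}
```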
