Extract text from a page region

Hello,
I would like to know whether it is possible to extract the text from a specific region of a PDF page, instead of extracting it from the whole page or the whole document, and how to do it properly.

Thank you very much.

Alessio

2 Replies

BS Balasubramanian Sundararajan Syncfusion Team January 19, 2018 02:00 PM UTC

Hi Alessio, 
 
Thank you for using Syncfusion products. 
 
The PDF library cannot directly extract text from a specific region of a PDF page. However, you can achieve your requirement with the following steps: 
 
  1. Export the PDF document pages as images.
 
 
  //Export the first page (index 0) of the PDF document as an image at 200 x 200 DPI. 
  Bitmap image = loadedDocument.ExportAsImage(0,200,200); 
 
  *The accuracy of the extracted text depends on the quality of the exported image. You can improve the image quality by increasing the DPI when exporting the page from the PDF document. 
 
 
 
  2. Clone the required region of the exported image and run OCR on it to extract the text using the Syncfusion OCRProcessor.
 
 
//Region of the exported image (in pixels) from which the text is extracted. 
Rectangle region = new Rectangle(120, 122, 1360, 520); 

//Holds the text recognized by the OCR engine. 
string ocrText; 

using (OCRProcessor processor = new OCRProcessor("../../Tesseract binaries")) 
{ 
     //Language used for OCR processing. 
     processor.Settings.Language = Languages.English; 

     //Clone only the selected region of the exported image. 
     using (Bitmap clonedImage = image.Clone(region, System.Drawing.Imaging.PixelFormat.Format32bppArgb)) 
     { 
          //Extract the text from the cloned image using the OCR engine. 
          ocrText = processor.PerformOCR(clonedImage, @"../../Tessdata/"); 
     } 
} 
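
The region above is given in pixels of the exported image, so it changes whenever the export DPI changes. If the region is known in PDF points instead (72 points per inch), it can be scaled to pixels before cloning. The following is only a minimal sketch under these assumptions: the region is measured from the top-left corner of the page, the same DPI is used for both axes, and RegionHelper and ToPixelRegion are hypothetical names that are not part of the Syncfusion API. It also clamps the result to the image size, because Bitmap.Clone throws when the rectangle extends outside the bitmap.

```csharp
using System;
using System.Drawing;

static class RegionHelper
{
    // PDF coordinates are expressed in points; one inch is 72 points.
    const float PointsPerInch = 72f;

    // Converts a region given in PDF points to pixels of an image exported
    // at the given DPI, then clamps it to the bounds of that image.
    // Hypothetical helper for illustration only.
    public static Rectangle ToPixelRegion(RectangleF regionInPoints, float dpi, Size imageSize)
    {
        float scale = dpi / PointsPerInch;

        Rectangle pixels = Rectangle.Round(new RectangleF(
            regionInPoints.X * scale,
            regionInPoints.Y * scale,
            regionInPoints.Width * scale,
            regionInPoints.Height * scale));

        // Returns Rectangle.Empty when the region lies completely outside the image.
        return Rectangle.Intersect(pixels, new Rectangle(Point.Empty, imageSize));
    }
}
```

With the image exported at 200 DPI as in step 1, the result of RegionHelper.ToPixelRegion(regionInPoints, 200f, image.Size) could be passed to image.Clone in place of the hard-coded rectangle. Check that the returned rectangle is not empty first, since cloning a zero-sized region also fails.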
 
 
 
We have also created a sample that extracts the text from a specific region of the PDF document. Please try the sample below and let us know whether it fulfills your requirement. 
 
 
Note: We have created the above sample with .NET framework 4.6 and used Syncfusion Essential Studio 15.4.0.17 references. 
 
 
For your reference, we have marked the region in the below exported page from which we have extracted the text and also attached the resultant output in the below attachment. 
 
 
 
 
You can also refer to the documentation below to learn more about OCR, 
 
 
 
Thanks, 
Balasubramanian S 
 
 



T T March 13, 2019 06:37 AM UTC

Balasubramanian S,

Thank you very much for this example, it was a great help and exactly what we needed!
