Extract text from a page region

Question

Hello,
I would like to know whether it is possible to extract the text from a specific region of a PDF page, instead of extracting it from the whole page or the whole document, and how to do it properly.

Thank you very much.

Alessio

Balasubramanian Sundararajan · Answer

Hi Alessio, 
 
Thank you for using Syncfusion products. 
 
The PDF library cannot extract text directly from a specific region of a PDF page. However, you can achieve your requirement with the following steps: 
 

Export the PDF document page as an image.
 
  //loadedDocument is an already opened PdfLoadedDocument. 
  //Export the first page (index 0) as an image at 200 x 200 DPI. 
  Bitmap image = loadedDocument.ExportAsImage(0, 200, 200); 
 
  *The accuracy of the extracted text depends on the quality of the exported image. You can improve the image quality by increasing the DPI when exporting the page from the PDF document. 
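
To illustrate how the DPI value affects the exported image: PDF page sizes are measured in points (1/72 inch), so the pixel dimensions grow linearly with the DPI. The following is only a minimal sketch and uses no Syncfusion API. The helper name `PointsToPixels` and the US Letter page size (612 x 792 points) are assumptions made for illustration, and the exact size of the image returned by ExportAsImage may differ slightly because of rounding.

```csharp
using System;

static class DpiMath
{
    // Hypothetical helper: converts a length in PDF points (1/72 inch)
    // to pixels at the given DPI, rounded to the nearest pixel.
    public static int PointsToPixels(double points, double dpi)
    {
        return (int)Math.Round(points * dpi / 72.0);
    }

    static void Main()
    {
        // Assumed page size for illustration: US Letter, 612 x 792 points.
        foreach (double dpi in new[] { 96.0, 200.0, 300.0 })
        {
            Console.WriteLine("{0} DPI: {1} x {2} px", dpi,
                PointsToPixels(612, dpi), PointsToPixels(792, dpi));
        }
    }
}
```

A higher DPI gives the OCR engine more pixels per character, at the cost of a larger bitmap and longer processing time.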
  
 

Clone the required region from the exported image and run OCR on the cloned image to extract the text using the Syncfusion OCRProcessor.
 
//Region (in pixels of the exported image) from which the text is extracted. 
Rectangle region = new Rectangle(120, 122, 1360, 520); 
 
//Holds the text recognized by the OCR engine. 
string ocrText; 
 
using (OCRProcessor processor = new OCRProcessor("../../Tesseract binaries")) 
{ 
     //Language used for the OCR process. 
     processor.Settings.Language = Languages.English; 
 
     //Clone only the required region from the exported image. 
     using (Bitmap clonedImage = image.Clone(region, System.Drawing.Imaging.PixelFormat.Format32bppArgb)) 
     { 
          //Extract the text from the cloned image using the OCR engine. 
          ocrText = processor.PerformOCR(clonedImage, @"../../Tessdata/"); 
     } 
}  
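
Note that the region is expressed in pixels of the exported image, so it is tied to the DPI used during export, and Bitmap.Clone throws an exception when the rectangle lies outside the bitmap bounds. The sketch below shows one possible way to rescale a region when the export DPI changes and to trim it to the image size before cloning. The helper names `ScaleRegion` and `ClampToImage` and the 2550 x 3300 image size are assumptions for illustration; they are not part of the Syncfusion API or of the attached sample.

```csharp
using System;
using System.Drawing;

static class RegionHelper
{
    // Hypothetical helper: rescales a pixel region defined at one export DPI
    // so that it covers the same page area at another export DPI.
    public static Rectangle ScaleRegion(Rectangle region, float fromDpi, float toDpi)
    {
        float scale = toDpi / fromDpi;
        return Rectangle.Round(new RectangleF(
            region.X * scale, region.Y * scale,
            region.Width * scale, region.Height * scale));
    }

    // Hypothetical helper: trims the region to the image bounds.
    // Returns an empty rectangle when the region does not overlap the image.
    public static Rectangle ClampToImage(Rectangle region, Size imageSize)
    {
        return Rectangle.Intersect(region, new Rectangle(Point.Empty, imageSize));
    }

    static void Main()
    {
        // Region from the snippet above, defined for a 200 DPI export.
        Rectangle at200 = new Rectangle(120, 122, 1360, 520);

        // Same page area if the page were exported at 300 DPI instead.
        Rectangle at300 = ScaleRegion(at200, 200f, 300f);

        // Assumed image size for illustration (a Letter page at 300 DPI).
        Rectangle safe = ClampToImage(at300, new Size(2550, 3300));
        Console.WriteLine(safe);
    }
}
```

If the clamped rectangle is empty, it is safer to skip the clone and OCR step than to pass the rectangle to Bitmap.Clone.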
 
 
We have also created a sample that extracts the text from a specific region of a PDF document. Please try the sample below and let us know whether it meets your requirement. 
 
http://www.syncfusion.com/downloads/support/forum/135531/ze/ExtractTextFromRegion_48218741 
 
Note: We created the above sample with .NET Framework 4.6 and Syncfusion Essential Studio 15.4.0.17 references. 
 
 
For your reference, we have marked the region from which the text was extracted in the exported page below, and we have also attached the resulting output. 
 
Exported Page: http://www.syncfusion.com/downloads/support/forum/135531/ze/ExportedPage942801700  
 
Output: http://www.syncfusion.com/downloads/support/forum/135531/ze/output-2006893567  
 
 
You can also refer to the documentation below for more details about OCR: 
 
https://help.syncfusion.com/file-formats/pdf/working-with-ocr  
 
 
Thanks, 
Balasubramanian S

T · Answer

Balasubramanian S, thank you very much for this example. It was a great help and exactly what we needed!