
How to extract the text from specific coordinates of the PDF document?

Platform: ASP.NET Web Forms |
Control: PdfViewer

As a workaround, you can extract text from specific coordinates of a PDF document on the server side by exporting the page as an image, clipping the required region, and running OCR on the clipped image. Refer to the following code snippet.

HTML

<asp:Content ID="BodyContent" runat="server" ContentPlaceHolderID="MainContent">
    <table width="70%" style="height: 31px;left:15%;position:absolute">
        <tr style="width: 100%;text-align:center">
            <td>
                <asp:FileUpload ID="FileUpload1" runat="server" />
                <asp:Button ID="Button1" runat="server" onclick="Button1_Click"
                    Text="ExtractText" />
            </td>
        </tr>
    </table>
    <div style="left:15%;position:absolute; top:88px">
        <asp:TextBox ID="TextBox1" runat="server" TextMode="multiline" Height="259px" Width="888px"></asp:TextBox>
    </div>
</asp:Content>

C#

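// Namespaces used: System, System.IO, System.Drawing, System.Web,
// Syncfusion.Pdf.Parsing, Syncfusion.OCRProcessor
protected void Button1_Click(object sender, EventArgs e)
{
    // Proceed only when a file with a .pdf extension (any letter case) was uploaded
    if (FileUpload1.HasFile &&
        Path.GetExtension(FileUpload1.PostedFile.FileName).Equals(".pdf", StringComparison.OrdinalIgnoreCase))
    {
        string appPath = HttpContext.Current.Request.PhysicalApplicationPath;
        // Get the PDF file stream
        Stream pdfStream = FileUpload1.PostedFile.InputStream;
        // Load the PDF document from the stream
        PdfLoadedDocument ldoc = new PdfLoadedDocument(pdfStream);
        try
        {
            // Region to extract (x, y, width, height), in pixels of the exported page image.
            // It must lie inside the image; otherwise Bitmap.Clone throws.
            RectangleF bnds = new RectangleF(30, 60, 300, 300);
            // Export the page at index 1 as an image, then clip it to the region
            using (Bitmap image = ldoc.ExportAsImage(1))
            using (Bitmap clippedImage = image.Clone(bnds, image.PixelFormat))
            // Perform OCR on the clipped image to extract the text
            using (OCRProcessor pro = new OCRProcessor(appPath + "//Tesseract binaries//"))
            {
                pro.Settings.Language = Languages.English;
                var text = pro.PerformOCR(clippedImage, appPath + "//Tessdata//");
                TextBox1.Text += text;
            }
        }
        finally
        {
            // Release the loaded document and its underlying stream
            ldoc.Close(true);
        }
    }
}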

Sample:

https://www.syncfusion.com/downloads/support/directtrac/general/ze/PdfViewer_ExtractText-847350617

Note:

Clicking the ExtractText button in the above sample exports the PDF document page as an image using the ExportAsImage method of PdfLoadedDocument. The resulting image is then clipped to the provided rectangle, and OCR is performed on the clipped image to extract its text.
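
The rectangle passed to Bitmap.Clone in the snippet above is measured in pixels of the exported image, whereas coordinates taken from a PDF are usually expressed in points (1/72 inch). The following is a minimal sketch of how such a rectangle might be scaled and kept inside the image before clipping. ClipRegionHelper and ToPixelBounds are hypothetical names that are not part of the Syncfusion API, and the DPI at which the page image was rendered is an assumption you would need to confirm for your own export settings.

```csharp
using System;
using System.Drawing;

// Hypothetical helper, not part of the Syncfusion API.
public static class ClipRegionHelper
{
    // Converts a rectangle given in PDF points (1/72 inch) to pixel units
    // for an image rendered at the given DPI, then clamps it to the image.
    public static RectangleF ToPixelBounds(
        RectangleF pointBounds, float dpi, int imageWidth, int imageHeight)
    {
        if (dpi <= 0)
            throw new ArgumentOutOfRangeException("dpi", "DPI must be positive.");

        // Assumption: the page image was rendered at 'dpi' dots per inch.
        float scale = dpi / 72f;
        RectangleF scaled = new RectangleF(
            pointBounds.X * scale,
            pointBounds.Y * scale,
            pointBounds.Width * scale,
            pointBounds.Height * scale);

        // Keep only the part of the region that lies inside the image,
        // because Bitmap.Clone throws for out-of-range rectangles.
        RectangleF clamped = RectangleF.Intersect(
            scaled, new RectangleF(0, 0, imageWidth, imageHeight));

        if (clamped.Width <= 0 || clamped.Height <= 0)
            throw new ArgumentException("The region lies outside the image.", "pointBounds");

        return clamped;
    }
}
```

With such a helper, the clipping step could be written as image.Clone(ClipRegionHelper.ToPixelBounds(bnds, dpi, image.Width, image.Height), image.PixelFormat), where dpi is whatever resolution the page was actually exported at.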

 
