
How to extract the text from specific coordinates of the PDF document?

As a workaround, you can extract text from specific coordinates of a PDF document by performing OCR on the server side: export the page as an image, clip the image to the required rectangle, and run OCR on the clipped region. Refer to the following code snippet.

HTML

<asp:Content ID="BodyContent" runat="server" ContentPlaceHolderID="MainContent">
    <table width="70%" style="height: 31px;left:15%;position:absolute">
        <tr style="width: 100%;text-align:center">
            <td>
                <asp:FileUpload ID="FileUpload1" runat="server" />
                <asp:Button ID="Button1" runat="server" onclick="Button1_Click"
                    Text="ExtractText" />
            </td>
        </tr>
    </table>
    <div style="left:15%;position:absolute; top:88px">
        <asp:TextBox ID="TextBox1" runat="server" TextMode="multiline" Height="259px" Width="888px"></asp:TextBox>
    </div>
</asp:Content>

C#

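// Namespaces used: System, System.Drawing, System.IO, System.Web,
// Syncfusion.Pdf.Parsing, Syncfusion.OCRProcessor
protected void Button1_Click(object sender, EventArgs e)
{
    //Process the upload only if it is a PDF file (case-insensitive extension check)
    if (FileUpload1.HasFile &&
        Path.GetExtension(FileUpload1.PostedFile.FileName).Equals(".pdf", StringComparison.OrdinalIgnoreCase))
    {
        string appPath = HttpContext.Current.Request.PhysicalApplicationPath;
        //Get the pdf file stream
        Stream pdfStream = FileUpload1.PostedFile.InputStream;
        //Create an instance for PdfLoadedDocument
        PdfLoadedDocument ldoc = new PdfLoadedDocument(pdfStream);
        //Region to extract (x, y, width, height), in pixels of the exported image
        RectangleF bnds = new RectangleF(30, 60, 300, 300);
        //Export the page as an image; the page index is zero-based, so 1 is the second page
        using (Bitmap image = ldoc.ExportAsImage(1))
        //Clip the exported image to the region; Clone throws if bnds lies outside the image
        using (Bitmap clippedImage = image.Clone(bnds, image.PixelFormat))
        //Performs OCR for the cloned image to extract the text.
        using (OCRProcessor pro = new OCRProcessor(appPath + "//Tesseract binaries//"))
        {
            pro.Settings.Language = Languages.English;
            string text = pro.PerformOCR(clippedImage, appPath + "//Tessdata//");
            TextBox1.Text += text;
        }
        //Release the document and its resources
        ldoc.Close(true);
    }
}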

Sample:

https://www.syncfusion.com/downloads/support/directtrac/general/ze/PdfViewer_ExtractText-847350617

Note:

Clicking the ExtractText button in the above sample exports the PDF document page as an image using the ExportAsImage method of PdfLoadedDocument. The resulting image is then clipped to the provided rectangle, and OCR is performed on the clipped image to extract its text.
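The clipping step only works when the rectangle lies inside the exported image, because Bitmap.Clone throws an exception if the rectangle extends beyond the bitmap bounds. The following is a minimal sketch of one way to prepare the rectangle, not part of the sample above. ClipRegion and ToPixelBounds are hypothetical names, and the sketch assumes that the region is specified in PDF points (1/72 inch) measured from the top-left corner of the page and that you know the DPI at which the page image was rendered (96 is used below only as an illustrative value).

```csharp
using System;
using System.Drawing;

// Hypothetical helper (not part of the Syncfusion API).
public static class ClipRegion
{
    // Converts a rectangle given in PDF points (1/72 inch) to pixel
    // coordinates at the given DPI, then clamps it to the image bounds.
    public static RectangleF ToPixelBounds(RectangleF regionInPoints, float dpi, Size imageSize)
    {
        if (dpi <= 0)
            throw new ArgumentOutOfRangeException("dpi", "DPI must be positive.");

        // Assumption: the image was rendered at 'dpi' dots per inch.
        float scale = dpi / 72f;
        RectangleF scaled = new RectangleF(
            regionInPoints.X * scale,
            regionInPoints.Y * scale,
            regionInPoints.Width * scale,
            regionInPoints.Height * scale);

        // Keep only the part of the region that lies inside the image.
        RectangleF imageBounds = new RectangleF(0, 0, imageSize.Width, imageSize.Height);
        RectangleF clamped = RectangleF.Intersect(scaled, imageBounds);

        if (clamped.Width <= 0 || clamped.Height <= 0)
            throw new ArgumentException("The region does not overlap the image.", "regionInPoints");

        return clamped;
    }
}
```

With this helper, the clipping call in the sample could be written as image.Clone(ClipRegion.ToPixelBounds(bnds, 96f, image.Size), image.PixelFormat). If your rectangle is already expressed in image pixels, as in the sample, pass 72 as the DPI so that no scaling is applied and only the clamping takes effect.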

 

Note

A new version of Essential Studio for ASP.NET is available. Versions prior to the release of Essential Studio 2014, Volume 2 will now be referred to as classic versions. The new ASP.NET suite is powered by Essential Studio for JavaScript, providing client-side rendering of HTML5-JavaScript controls and offering better performance and better support for touch interactivity. The new version includes all the features of the old version, so migration is easy.

The classic controls can be used in existing projects; however, if you are starting a new project, we recommend using the latest version of Essential Studio for ASP.NET. Although Syncfusion will continue to support all classic versions, we are happy to assist you in migrating to the newest edition.

If you are a current customer, you can check out our components from the License and Downloads page. If you are new to Syncfusion, you can try our 30-day free trial to check out our other controls. If you have any queries or require clarification, please let us know in the comments section below.

You can also contact us through our support forums, Direct-Trac, or feedback portal. We are always happy to assist you!
