How to get the bounds of words by extracting text using PDF Viewer server library

Extract text using PDF Viewer server library

The PDF Viewer server library allows you to extract the text from a page along with its bounds. Use the ExtractText() method to do this: it returns the text extracted from the PDF document and, through an out parameter, the bounds of each character. Refer to the following user guide (UG) link for more details.

https://ej2.syncfusion.com/aspnetcore/documentation/pdfviewer/how-to/extract-text/

Getting bounds of words using ExtractText()

The ExtractText() method of the PDF Viewer server library returns the bounds of each character, not of each word. To get the bounds of the words, combine the bounds of the characters that make up each word, as in the following code.
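
As a small illustration of the idea, the bounds of a word are the smallest rectangle that encloses the bounds of all its characters. The sketch below uses System.Drawing.RectangleF with made-up coordinates for two characters; the UnionExample class, its Merge helper, and the values are assumptions for illustration, not output from the PDF Viewer library.

```csharp
using System;
using System.Drawing;

public static class UnionExample
{
    // Returns the smallest rectangle that encloses both character rectangles.
    public static RectangleF Merge(RectangleF first, RectangleF second)
    {
        return RectangleF.Union(first, second);
    }

    public static void Main()
    {
        // Hypothetical bounds of two neighbouring characters (x, y, width, height).
        var h = new RectangleF(10f, 20f, 6f, 12f);
        var i = new RectangleF(16f, 22f, 3f, 10f);

        RectangleF word = Merge(h, i);

        // Prints "10, 20, 9, 12": left 10, top 20, width 19 - 10, height 32 - 20.
        Console.WriteLine($"{word.X}, {word.Y}, {word.Width}, {word.Height}");
    }
}
```

The code in Step 2 computes the same result by tracking the minimum left and top and the maximum right and bottom by hand.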

Step 1: Extract the text from the PDF document.

    PdfRenderer renderer = new PdfRenderer();
    renderer.Load(@"currentDirectory\..\..\..\..\Data\HTTP Succinctly.pdf");
    List<TextData> textDataCollection = new List<TextData>();
    // "text" contains the whole text extracted from the PDF document
    string text = renderer.ExtractText(1, out textDataCollection);
    System.IO.File.WriteAllText(@"currentDirectory\..\..\..\..\Data\ExtractedText.txt", text);

 

Step 2: Get the bounds of the words from the extracted text.

    // "textBounds" will contain the bounds of each word
    List<TextBounds> textBounds = new List<TextBounds>();
    string finalText = "";
    var glyphBounds = new RectangleF(0, 0, 0, 0);
    for (int j = 0; j < textDataCollection.Count; j++)
    {
        // Check whether the character is a space or a line break
        if (!textDataCollection[j].Text.Contains("\r") && !textDataCollection[j].Text.Contains(" "))
        {
            // Start a new word with the bounds of its first character
            finalText += textDataCollection[j].Text;
            var minx = textDataCollection[j].Bounds.Left;
            var miny = textDataCollection[j].Bounds.Top;
            var maxx = textDataCollection[j].Bounds.Right;
            var maxy = textDataCollection[j].Bounds.Bottom;
            for (int k = j + 1; k < textDataCollection.Count; k++)
            {
                // A space or a line break ends the current word
                if (textDataCollection[k].Text.Contains(" ") || textDataCollection[k].Text.Contains("\r"))
                    break;
                // Expand the word bounds to include this character
                if (minx > textDataCollection[k].Bounds.Left)
                    minx = textDataCollection[k].Bounds.Left;
                if (miny > textDataCollection[k].Bounds.Top)
                    miny = textDataCollection[k].Bounds.Top;
                if (maxx < textDataCollection[k].Bounds.Right)
                    maxx = textDataCollection[k].Bounds.Right;
                if (maxy < textDataCollection[k].Bounds.Bottom)
                    maxy = textDataCollection[k].Bounds.Bottom;
                finalText += textDataCollection[k].Text;
                j = k;
            }
            // Store the completed word, including a word that ends at the last character
            glyphBounds = new RectangleF((float)minx, (float)miny, (float)(maxx - minx), (float)(maxy - miny));
            textBounds.Add(new TextBounds(finalText, glyphBounds));
            finalText = "";
        }
        else if (textDataCollection[j].Text.Contains("\r"))
        {
            // Skip the entry that follows a carriage return
            j++;
        }
    }
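
Step 2 relies on a TextBounds type that this article does not define; it presumably comes with the linked sample. The following is a minimal, self-contained sketch of what such a type could look like, together with the same grouping idea run against hand-made character data instead of a PDF. CharBox, WordGrouping, the shape of TextBounds, and the coordinates are all hypothetical and introduced only for illustration. Unlike the code above, this sketch treats any whitespace entry as a word separator.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

// Hypothetical stand-in for the per-character data returned by ExtractText().
public class CharBox
{
    public string Text { get; }
    public RectangleF Bounds { get; }

    public CharBox(string text, RectangleF bounds)
    {
        Text = text;
        Bounds = bounds;
    }
}

// Assumed shape of the TextBounds type used in Step 2: a word and its rectangle.
public class TextBounds
{
    public string Text { get; }
    public RectangleF Bounds { get; }

    public TextBounds(string text, RectangleF bounds)
    {
        Text = text;
        Bounds = bounds;
    }
}

public static class WordGrouping
{
    // Groups consecutive non-whitespace characters into words and unions their bounds.
    public static List<TextBounds> GroupWords(IReadOnlyList<CharBox> chars)
    {
        var words = new List<TextBounds>();
        string current = "";
        RectangleF box = RectangleF.Empty;

        foreach (CharBox c in chars)
        {
            if (string.IsNullOrWhiteSpace(c.Text))
            {
                // A separator closes the word collected so far, if any.
                if (current.Length > 0)
                    words.Add(new TextBounds(current, box));
                current = "";
                continue;
            }

            // The first character starts the box; later characters extend it.
            box = current.Length == 0 ? c.Bounds : RectangleF.Union(box, c.Bounds);
            current += c.Text;
        }

        // Flush a word that runs to the end of the input.
        if (current.Length > 0)
            words.Add(new TextBounds(current, box));

        return words;
    }

    public static void Main()
    {
        var chars = new List<CharBox>
        {
            new CharBox("H", new RectangleF(10, 20, 6, 12)),
            new CharBox("i", new RectangleF(16, 22, 3, 10)),
            new CharBox(" ", new RectangleF(19, 20, 3, 12)),
            new CharBox("A", new RectangleF(22, 20, 7, 12)),
        };

        foreach (TextBounds word in GroupWords(chars))
            Console.WriteLine($"{word.Text}: {word.Bounds.X}, {word.Bounds.Y}, {word.Bounds.Width}, {word.Bounds.Height}");
    }
}
```

With the sample data, the sketch reports two words, "Hi" and "A", each with a rectangle that encloses its characters. Flushing the pending word after the loop is what keeps a word at the very end of the input from being lost.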

 

Sample link:

https://www.syncfusion.com/downloads/support/directtrac/general/ze/WordBounds-1782596420
