How to get the bounds of words by extracting text using the PDF Viewer server library
Extract text using the PDF Viewer server library
The PDF Viewer server library allows you to extract the text from a page along with its bounds. Text extraction is done with the ExtractText() method, which extracts the text from the PDF document and returns the bounds of each character. Refer to the following user guide (UG) link for more details.
https://ej2.syncfusion.com/aspnetcore/documentation/pdfviewer/how-to/extract-text/
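Getting the bounds of words using ExtractText()
The ExtractText() method of the PDF Viewer server library returns the bounds of each character, not of each word. To get the bounds of a word, group the consecutive characters between spaces and line breaks, and merge their bounds into one rectangle. Refer to the following code to get the bounds of the words.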
Step 1: Extract the text from the PDF document.
PdfRenderer renderer = new PdfRenderer();
renderer.Load(@"currentDirectory\..\..\..\..\Data\HTTP Succinctly.pdf");
List<TextData> textDataCollection = new List<TextData>();
// "text" contains the whole text extracted from the PDF document
string text = renderer.ExtractText(1, out textDataCollection);
System.IO.File.WriteAllText(@"currentDirectory\..\..\..\..\Data\ExtractedText.txt", text);
Step 2: Get the bounds of the words from the extracted text.
// "textBounds" contains the bounds of each word.
// TextBounds is not defined in this article; it is assumed to be a simple class that stores a word and its RectangleF.
List<TextBounds> textBounds = new List<TextBounds>();
string finalText = "";
for (int j = 0; j < textDataCollection.Count; j++)
{
    // Check that the character is not a space or a carriage return
    if (!textDataCollection[j].Text.Contains("\r") && !textDataCollection[j].Text.Contains(" "))
    {
        // First character of a word: start the word bounds from its bounds
        finalText += textDataCollection[j].Text;
        var minx = textDataCollection[j].Bounds.Left;
        var miny = textDataCollection[j].Bounds.Top;
        var maxx = textDataCollection[j].Bounds.Right;
        var maxy = textDataCollection[j].Bounds.Bottom;
        for (int k = j + 1; k < textDataCollection.Count; k++)
        {
            // A space or a carriage return ends the current word
            if (textDataCollection[k].Text.Contains(" ") || textDataCollection[k].Text.Contains("\r"))
                break;
            // Grow the word bounds to include this character
            if (minx > textDataCollection[k].Bounds.Left)
                minx = textDataCollection[k].Bounds.Left;
            if (miny > textDataCollection[k].Bounds.Top)
                miny = textDataCollection[k].Bounds.Top;
            if (maxx < textDataCollection[k].Bounds.Right)
                maxx = textDataCollection[k].Bounds.Right;
            if (maxy < textDataCollection[k].Bounds.Bottom)
                maxy = textDataCollection[k].Bounds.Bottom;
            finalText += textDataCollection[k].Text;
            j = k;
        }
        // Store the word after the inner loop, so that a word ending at the last character
        // (including a single-character word) is not lost
        var glyphBounds = new RectangleF((float)minx, (float)miny, (float)(maxx - minx), (float)(maxy - miny));
        textBounds.Add(new TextBounds(finalText, glyphBounds));
        finalText = "";
    }
    else if (textDataCollection[j].Text.Contains("\r"))
    {
        // Skip the character that follows the carriage return
        j++;
    }
}
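The grouping logic in Step 2 can also be tried without a PDF file. The following is a minimal sketch, not part of the PDF Viewer server library: Glyph, Word, and WordGrouping are hypothetical stand-in types introduced here for illustration, and RectangleF.Union from System.Drawing replaces the manual min/max comparisons. Unlike the code above, it treats any whitespace character as a word separator and does not skip the character that follows a carriage return.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Text;

// Hypothetical stand-in for one extracted character and its bounds.
public record Glyph(string Text, RectangleF Bounds);

// Hypothetical holder for one word and the merged bounds of its characters.
public record Word(string Text, RectangleF Bounds);

public static class WordGrouping
{
    // Groups consecutive non-whitespace glyphs into words and merges their bounds.
    public static List<Word> GroupIntoWords(IEnumerable<Glyph> glyphs)
    {
        var words = new List<Word>();
        var text = new StringBuilder();
        RectangleF bounds = RectangleF.Empty;

        // Stores the word collected so far, if any, and resets the state.
        void Flush()
        {
            if (text.Length == 0) return;
            words.Add(new Word(text.ToString(), bounds));
            text.Clear();
            bounds = RectangleF.Empty;
        }

        foreach (var glyph in glyphs)
        {
            // Whitespace (space, carriage return, line feed, tab) ends the current word.
            if (string.IsNullOrWhiteSpace(glyph.Text))
            {
                Flush();
                continue;
            }

            // The first glyph starts the bounds; later glyphs extend them.
            bounds = text.Length == 0
                ? glyph.Bounds
                : RectangleF.Union(bounds, glyph.Bounds);
            text.Append(glyph.Text);
        }

        // Store the last word when the input does not end with whitespace.
        Flush();
        return words;
    }
}
```

To use the sketch with the result of Step 1, each TextData item would first be copied into a Glyph, converting its Bounds to a RectangleF in the same way as the casts in Step 2.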
Sample link:
https://www.syncfusion.com/downloads/support/directtrac/general/ze/WordBounds-1782596420