No proper line breaks from PdfLoadedPage's ExtractText()

Hi,

Text extracted from a PDF does not have proper line breaks. The code is below and the PDFs are attached.

Document: 1_BA20CBE6-4227-4E55-81E5-CF1BD280F4C8.txt
Issue: No proper line breaks. For example, "ARTICLE 1.1" should start on a new line.

Document: 1_E5E4AE9D-2C0E-4CCF-A39D-666D45D39B16.txt
Issue: No proper line breaks and no spaces between words.

Code:

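// Requires: using System.IO; using Syncfusion.Pdf; using Syncfusion.Pdf.Parsing;
string docText = string.Empty;
FileStream docStream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
PdfLoadedDocument loadedDocument = new PdfLoadedDocument(docStream);
PdfLoadedPageCollection loadedPages = loadedDocument.Pages;
foreach (PdfLoadedPage loadedPage in loadedPages)
{
    // Parameterless ExtractText(): the call that produces the output described above.
    docText += loadedPage.ExtractText();
}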


Attachment: Text_Extraction_Issue_91717c7e.zip

3 Replies

SK Shamini Kiruba Sobers Syncfusion Team April 13, 2022 09:47 PM UTC

Hi Kuan,


We were able to reproduce the issue “Line breaks are not proper in the PdfLoadedPage's ExtractText()”. We will validate the issue and provide further details within two business days, by April 19, 2022.


Regards,

Shamini



SK Shamini Kiruba Sobers Syncfusion Team April 19, 2022 07:04 PM UTC

Hi Kuan,


We suggest using the ExtractText(bool IsLayout) overload instead of the parameterless ExtractText() method. With the IsLayout parameter set to true, the text is extracted with the layout of the PDF document preserved, including its line breaks.
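
For readers who cannot download the sample, here is a minimal sketch of that suggestion applied to the loop from the original post. It is not the contents of the shared sample: the class and method names (LayoutTextExtractor, ExtractWithLayout) are hypothetical, and appending a newline after each page is an assumption added so that pages do not run together.

```csharp
using System.IO;
using System.Text;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

public static class LayoutTextExtractor
{
    // Hypothetical helper: returns the text of every page with layout preserved.
    // filePath is the path to the PDF, as in the original post.
    public static string ExtractWithLayout(string filePath)
    {
        StringBuilder docText = new StringBuilder();

        using (FileStream docStream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
        {
            PdfLoadedDocument loadedDocument = new PdfLoadedDocument(docStream);
            try
            {
                foreach (PdfLoadedPage loadedPage in loadedDocument.Pages)
                {
                    // ExtractText(true) is the IsLayout overload suggested above.
                    // AppendLine separates pages (an assumption, not in the original code).
                    docText.AppendLine(loadedPage.ExtractText(true));
                }
            }
            finally
            {
                // Close the document and release its resources.
                loadedDocument.Close(true);
            }
        }

        return docText.ToString();
    }
}
```

Calling LayoutTextExtractor.ExtractWithLayout(filePath) would then replace the docText loop in the original code. A StringBuilder is used instead of repeated string concatenation only to avoid reallocating the string on every page.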


We have shared the sample and the extracted text outputs for your reference which can be downloaded from the following links.


Sample: https://www.syncfusion.com/downloads/support/directtrac/general/ze/ExtractText-1425399247


Extracted text output: https://www.syncfusion.com/downloads/support/directtrac/general/ze/ExtractedOutput_WithLayout180655289


Kindly let us know if it helps.


Regards,

Shamini



KL Kuan Long Khiu April 20, 2022 01:10 AM UTC

Hi Shamini,


It works now.

Thanks!


Regards,

Kuan Long


