Extracting text font size and name

Question

Good morning. I am evaluating PDF.Winforms 17.4.0.44 for text extraction. It correctly identifies text color and fontStyle=bold, but the font size is always 1 and the font name is always Microsoft Sans Serif. Please advise.

Also, can you give any hint as to how paragraphs are recognized?

Thank you.

Uthandaraja Selva Sundara Kani · Answer

Hi Gyorgy, 
 
Thanks for using Syncfusion Product. 
 
We are unable to reproduce the issue “ExtractText method using PdfLoadedDocument returns wrong font name and size” on our side with the PDF document we have. Please refer to the sample at the link below, which we used when trying to reproduce the reported issue:
 
https://www.syncfusion.com/downloads/support/forum/150844/ze/Forum150844-1585674003  
 
We suspect that this issue is specific to the PDF document used. Please share the document with us. If it does not contain any confidential information, you can share it in the forum itself; if it does, please create a Direct Trac incident and post the document there. Also, please share the details below so that we can analyze this further and provide a better solution:
1. A simple or modified sample that reproduces the issue.
2. The replication procedure to reproduce the issue.
3. .NET Framework version
4. Visual Studio version
5. Operating system
 
Regards,  
Uthandaraja S

Gyorgy Gorog · Answer

Uthandaraja, thanks for update.

So far I have identified 4 types of PDF and named the files accordingly. I also attach my logs as .FNT text files.

1. always Sans Serif, always size 1
2. fonts seem OK, size always 1
3. fonts and sizes seem OK, headers/footers still size 1
4. everything seems OK.

Microsoft Visual Studio Community 2019 Version 16.4.3

Microsoft .NET Framework Version 4.8.03752

Microsoft Windows 10 Pro 10.0.18362 build 18362 (Hungarian)

Syncfusion.Pdf.Winforms Version 17.4.0.44

Code:

    public static void GetTextFromPdf(string pdfFile)
    {
        // The log is written next to the PDF, with a .FNT extension.
        string logFile = Path.ChangeExtension(pdfFile, ".FNT");
        File.Delete(logFile);
        File.AppendAllText(logFile, Environment.NewLine + pdfFile + Environment.NewLine + Environment.NewLine);

        PdfLoadedDocument loadedDocument = new PdfLoadedDocument(pdfFile);

        // Fonts reported at document level.
        PdfUsedFont[] usedFonts = loadedDocument.UsedFonts;
        File.AppendAllText(logFile, "PdfUsedFont[]: " +
            usedFonts.Select(f => f.Name + " :" + f.Size).Distinct().Join(Environment.NewLine) +
            Environment.NewLine);

        // Font name and size reported for each extracted text fragment.
        foreach (PdfPageBase page in loadedDocument.Pages)
        {
            List<TextData> TextFormat = new List<TextData>();
            string pageTexts = page.ExtractText(out TextFormat);
            File.AppendAllText(logFile, Settings.Default.TrText = TextFormat.Select(td => td.Text + ": " + td.FontName + ": " + td.FontSize)
                .Join(Environment.NewLine) + Environment.NewLine);
        }
    }

.Join is just an extension version of string.Join.

Thanks.

Attachment: PDFTypes_b7622964.zip

Vishnuraj Haridoss · Answer

Hi Gyorgy,   
   
We are able to reproduce the issue “Issues with font size and font name” on our side. We have forwarded this issue to our development team for further analysis, and we will provide an update on 29th January 2020.
   
Regards,   
Vishnuraj Haridoss

Uthandaraja Selva Sundara Kani · Answer

Hi Gyorgy, 
 
Thanks for your patience. We have confirmed that the issue “ExtractText method using PdfLoadedDocument returns wrong font name and size” is a defect, and we have logged a defect report. The fix for this issue will be included in our 2020 Volume 1 main release, which will be available in March 2020.
 
Regards, 
Uthandaraja S

Gyorgy Gorog · Answer

Uthandaraja, thanks for update.

Padmini Ramamurthy · Answer

Hi Gyorgy, 
  
You are welcome and we will update you once our volume 1 release is rolled out. 
  
Regards, 
Padmini

Gyorgy Gorog · Answer

Padmini, I have some further problems with Pdf.Winforms 17.4.0.50.

1. I attach a PDF in which there are clearly separated numbered paragraphs. Yet in the extracted text, e.g. paragraphs 2 and 3 come out as a single paragraph:

2. Az ajánlatkérő a támogatott projekt megvalósítása során gyártógépsor leszállítása és beüzemelése tárgyában közbeszerzéséi eljárást valósította meg. Az ajánlatkérő a Kbt. 112. § (1) bekezdés b) pontja alapján a Kbt. 117. § (1) bekezdése szerinti nyílt közbeszerzési eljárását 2018. április 14. napján indította meg. Az eljárást megindító ajánlattételi felhívás (a továbbiakban: felhívás) a Közbeszerzési Értesítőben a KÉ-6248/2018. szám alatt 2018. április 16. napján jelent meg. 3. A felhívás módosítására két alkalommal került sor. A második módosító hirdetmény 2018. május 25. napján került feladása, a korrigendum a Közbeszerzési Értesítőben a KÉ-8898/2018. szám alatt 2018. május 29. napján jelent meg. A második korrigendum érintette az értékelési szempontokat (felhívás II.2.5) pontja), az ellenszolgáltatás teljesítésének feltételeire vonatkozó rendelkezéseket (felhívás III.1.7) pontja) és ezekkel összefüggésben az ajánlati ár megadásának pénznemét (felhívás VI.3.4) pontjának 3. alpontja), az ajánlattétel határidejét (felhívás IV.2.2) pontja), az ajánlatok felbontásának feltételeit (felhívás IV.2.6) pontja), valamint az ajánlati kötöttség időtartamára vonatkozó rendelkezéseket (felhívás IV.2.5) pontja).

2. If I try (with any PDF) to extract with TextLines, I always get:

Index was out of range. Must be non-negative and less than the size of the collection.

Parameter name: index

at System.ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument argument, ExceptionResource resource)

at Syncfusion.Pdf.PdfPageBase.ExtractText(TextLines& textLines

Code is just

    PdfLoadedDocument loadedDocument = new PdfLoadedDocument(pdfBytes);

    for (int pageNo = 0; pageNo < Math.Min(maxPages, loadedDocument.Pages.Count); pageNo++)
    {
        PdfPageBase page = loadedDocument.Pages[pageNo];
        TextLines textLines = new TextLines();
        text += page.ExtractText(out textLines);
    }

Thanks!

Attachment: K2019HU3000KDX000550_72378a58.ZIP

Uthandaraja Selva Sundara Kani · Answer

Hi Gyorgy,    
    
We are able to reproduce the issues “ExtractedText method returns 2 paragraphs in a single paragraph” and “System.ArgumentOutOfRangeException thrown when using ExtractText(out TextLines textLines) method” on our side with the provided PDF document. We have forwarded these issues to our development team for further analysis, and we will provide an update on 24th February 2020.
 
Regards,  
Uthandaraja S

Uthandaraja Selva Sundara Kani · Answer

Hi Gyorgy, 
 
Thanks for your patience. 
 
Please find the details below.

Query: System.ArgumentOutOfRangeException thrown when using ExtractText(out TextLines textLines) method

Details: We have confirmed that the issue “System.ArgumentOutOfRangeException thrown when using ExtractText(out TextLines textLines) method” is a defect, and we have logged a defect report. The fix for this issue will be included in our weekly NuGet package, which is expected to be available on 17th March 2020.

Query: ExtractedText method returns 2 paragraphs in a single paragraph

Details: On further analysis, the page.ExtractText() method does not extract text based on the layout. In page.ExtractText(), text is extracted based on the text rendering operators, and a newline character is added wherever a text rendering operator occurs, which can make the extracted content less readable.

However, you can extract the text based on the layout by using the ExtractText(bool) overload. Please find the UG documentation for your reference:

https://help.syncfusion.com/file-formats/pdf/working-with-text-extraction#working-with-layout-based-text-extraction

That said, we could see a spacing issue in the layout overload suggested above. We will fix this spacing issue, and the fix will be included in our weekly NuGet package, which is expected to be available on 17th March 2020.


Please let us know if you need further assistance. 
 
Regards, 
Uthandaraja S
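
To make the difference between the two overloads discussed above easier to see on a given document, a minimal sketch along the following lines could print both results for each page. It uses only calls that appear in this thread (PdfLoadedDocument, PdfPageBase.ExtractText() and ExtractText(true)); the class and method names are hypothetical, and the sketch assumes the Syncfusion.Pdf assemblies are referenced.

```csharp
using System;
using Syncfusion.Pdf;          // PdfPageBase
using Syncfusion.Pdf.Parsing;  // PdfLoadedDocument

// Hypothetical helper, not part of the Syncfusion API.
static class ExtractionComparison
{
    // Prints the operator-based and the layout-based extraction of every page,
    // so that the two results can be compared side by side.
    public static void Compare(string pdfFile)
    {
        PdfLoadedDocument loadedDocument = new PdfLoadedDocument(pdfFile);
        try
        {
            int pageNo = 0;
            foreach (PdfPageBase page in loadedDocument.Pages)
            {
                pageNo++;

                // Extraction driven by the text rendering operators.
                string operatorBased = page.ExtractText();

                // Layout-based extraction via the ExtractText(bool) overload.
                string layoutBased = page.ExtractText(true);

                Console.WriteLine("--- Page " + pageNo + ": ExtractText() ---");
                Console.WriteLine(operatorBased);
                Console.WriteLine("--- Page " + pageNo + ": ExtractText(true) ---");
                Console.WriteLine(layoutBased);
            }
        }
        finally
        {
            // Release the document and its resources.
            loadedDocument.Close(true);
        }
    }
}
```

Which output reads better is likely to depend on the document, as the samples later in this thread suggest.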

Gyorgy Gorog · Answer

Uthandaraja, thanks for the update. My problem with the ExtractText(true) method is that my PDFs often have two columns, and the resulting text is messy in that case.

Meanwhile I found a PDF that looks quite similar to the others, which are extracted correctly, but this one has phantom line breaks even within words:

ExtractText(): very short lines all along

Dr. Petró Szilvia közbeszerzési biztos, az eljáró tanács elnöke, Gulyás

ExtractText(true): no extra lines or spaces, but columns are not exact (in this case still better, you are right).

A  tanács  tagjai: Dr. Petró Szilvia közbeszerzési biztos, az eljáró tanács elnöke, Gulyás Richárd közbeszerzési biztos, Dr. Virágh Norbert közbeszerzési biztos   A kezdeményező:          Kormányzati Ellenőrzési Hivatal (Budapest, Tartsay Vilmos u. 13.)  A kezdeményező képviselője:    Dr. Zombori Gábor Pál kamarai jogtanácsos               (e-elérhetőség: KRID 540329397)  A beszerző:            AUTÓ UNIVERZÁL Kft. (Kecskemét, Csáktornyai u. 6.) (e-elérhetőség: KRID 10664795) A beszerző képviselője:         Nagy és Kiss Ügyvédi Iroda               (Budapest, Szabadság tér 7.) (e-elérhetőség: KRID: 18118607)  A kérelmezett:           Cargo Service Zrt. (Budapest, Cinkotai út 34.) (e-elérhetőség: KRID: 26581280)  A beszerzés tárgya, értéke: Autóbusz javítás, nettó 223.550.950.-Ft

In other documents, ExtractText(true) tends to put lots of spaces within words (always before ő and ű, occasionally before any other character):

A Dönt   őbizottság megállapítja, hogy a beszerz               ő megsértette a közbeszerzésekr            ől szóló 2015. évi CXLIII. törvény (a továbbiakban: 2015. évi Kbt.) 19. § (3) bekezdé                             sére és a 2015. évi Kbt. 110. §-ára tekintettel a 2015. évi Kbt. 4. § (1) bekezdésé                        t. A  Dönt   őbizottság  a  beszerz       ővel  szemben  a  közbeszerzési  eljárás  jogtalan  mell                    őzése  miatt 2.500.000.-Ft, azaz kett         őmillió-ötszázezer forint bírságot szab ki. A  Dönt   őbizottság  megállapítja,  hogy  a  beszerz                 ő  és  a  kérelmezett  között  élelmezési szolgáltatás  beszerzése  tárgyú  2013.  november  1.  napján  kötött  s                           zerz ődés  a  határozat  91. 
pontja szerint semmis.

This is the same text from Foxit Reader:

A Döntőbizottság megállapítja, hogy a beszerző és a kérelmezett között élelmezési
szolgáltatás beszerzése tárgyú 2013. november 1. napján kötött szerződés a határozat 91.
pontja szerint semmis.

BTW, Foxit Reader extracts correct text from this PDF and keeps columns:

A kezdeményező:                 Kormányzati Ellenőrzési Hivatal
                                (Budapest, Tartsay Vilmos u. 13.)

A kezdeményező képviselője:     Dr. Zombori Gábor Pál kamarai jogtanácsos
                                (e-elérhetőség: KRID 540329397)

A beszerző:                     AUTÓ UNIVERZÁL Kft.
                                (Kecskemét, Csáktornyai u. 6.)
                                (e-elérhetőség: KRID  10664795)

A beszerző képviselője:         Nagy és Kiss Ügyvédi Iroda
                                (Budapest, Szabadság tér 7.)
                                (e-elérhetőség: KRID: 18118607)

Acrobat DC does not keep columns:

Az ügy iktatószáma: D.296/13/2019.
A tanács tagjai: Dr. Petró Szilvia közbeszerzési biztos, az eljáró tanács elnöke, Gulyás Richárd közbeszerzési biztos, Dr. Virágh Norbert közbeszerzési biztos  A kezdeményező:  Kormányzati Ellenőrzési Hivatal (Budapest, Tartsay Vilmos u. 13.)  A kezdeményező képviselője:  Dr. Zombori Gábor Pál kamarai jogtanácsos (e-elérhetőség: KRID 540329397) A beszerző: AUTÓ UNIVERZÁL Kft. (Kecskemét, Csáktornyai u. 6.) (e-elérhetőség: KRID 10664795) A beszerző képviselője: Nagy és Kiss Ügyvédi Iroda (Budapest, Szabadság tér 7.) 
(e-elérhetőség: KRID: 18118607)

Thanks!

Attachment: K2019HU3000KDX000296_d4ee6c93.ZIP

Uthandaraja Selva Sundara Kani · Answer

Hi Gyorgy, 

Thanks for your update. 

We are able to reproduce the issue “Spaces are not added properly for extracted text when using ExtractText(true) method” on our side with the provided PDF document. We have forwarded this issue to our development team, and we will provide an update on 3rd March 2020.

Regards, 
Uthandaraja S

Uthandaraja Selva Sundara Kani · Answer

Hi Gyorgy, 
 
Thanks for your patience. 
 
At present, we do not support extracting text in table format from a PDF document. However, the requirement can be met in the following way using the open source tabula. The sample can be downloaded from the location below.
 
https://www.syncfusion.com/downloads/support/forum/150844/ze/ExtractPDFTabularData-2044472463  
 
Please find the Excel file in the “Data” folder.
 
Components involved:

1. tabula (for converting PDF to CSV).
2. Syncfusion.XlsIO (for parsing the CSV file and getting the data).

Steps to achieve the same:

1. Ensure Java is installed on your machine and provide the Java install location correctly:

    ProcessStartInfo startInfo = new ProcessStartInfo(@"C:\Program Files\Java\jre1.8.0_121\bin\java.exe");

2. Ensure the “tabula-0.8.0-jar-with-dependencies.jar” dependency is available in the Data folder of the application.
3. Place the input PDF file (from which you need to extract the tabular data) alongside the above-mentioned dependency jar file.
4. Pass the arguments (dependency jar name, output CSV file name, input PDF file name):

    startInfo.Arguments = "-jar tabula-0.8.0-jar-with-dependencies.jar -p all -o ExportSales.csv ExportSales.pdf";

5. Start the process. Once the process has completed, the CSV file will be generated alongside the PDF file.
6. Use Syncfusion.XlsIO.ExcelEngine to read the details of the tabular data present in the CSV file:

    ExcelEngine excelEngine = new ExcelEngine();
    IApplication application = excelEngine.Excel;
    IWorkbook workbook = application.Workbooks.Open("ExportSales.csv");
 
 
Note:   
 
If you get the alert “PDF document cannot be converted to Excel” while uploading the PDF file, and the .csv file is not created in the Data folder, then the problem may be related to tabula or to the input PDF file.
 
Please let us know if you need further assistance.  
 
Regards, Uthandaraja S
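
Steps 1 to 5 above (configure Java, pass the tabula arguments, start the process and wait for the CSV) could be tied together roughly as in the following sketch. It relies only on System.Diagnostics and System.IO; the class and method names are hypothetical, the file names are assumed to contain no spaces, and the jar and the PDF are assumed to sit in the same Data folder, as described in the steps.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper that wraps the tabula command line from the steps above.
static class TabulaRunner
{
    // Builds the argument string in the same form as step 4:
    // -jar <jar> -p all -o <csv> <pdf>
    // Assumption: none of the names contain spaces (no quoting is applied).
    public static string BuildArguments(string jarName, string csvName, string pdfName)
    {
        return "-jar " + jarName + " -p all -o " + csvName + " " + pdfName;
    }

    // Runs tabula in the Data folder and returns the full path of the generated CSV.
    public static string ConvertPdfToCsv(string javaExe, string dataFolder,
                                         string jarName, string pdfName, string csvName)
    {
        ProcessStartInfo startInfo = new ProcessStartInfo(javaExe)
        {
            Arguments = BuildArguments(jarName, csvName, pdfName),
            WorkingDirectory = dataFolder,   // jar, PDF and CSV all live here
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            if (process == null)
                throw new InvalidOperationException("Could not start " + javaExe);

            // Step 5: wait until the conversion has finished.
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException("tabula exited with code " + process.ExitCode);
        }

        string csvPath = Path.Combine(dataFolder, csvName);
        if (!File.Exists(csvPath))
            throw new FileNotFoundException("tabula did not produce the CSV file.", csvPath);
        return csvPath;
    }
}
```

The returned path can then be passed to application.Workbooks.Open as in step 6. Checking the exit code and the existence of the CSV file is one way to detect the failure case mentioned in the note above.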

Uthandaraja Selva Sundara Kani · Answer

Hi Gyorgy,

The issues "System.ArgumentOutOfRangeException thrown when using ExtractText(out TextLines textLines) method" and “Extra spaces added between words using ExtractText(bool) method” have been fixed, and the patch for this fix can be downloaded from the following location.

Recommended approach - exe/nuget will perform automatic configuration

Please find the patch setup from below location:

Exe : http://syncfusion.com/Installs/support/patch/17.4.0.39/931611/F150844/SyncfusionPatch_17.4.0.39_931611_3172020094545196_F150844.exe

Nuget : http://syncfusion.com/Installs/support/patch/17.4.0.39/931611/F150844/SyncfusionNuget_17.4.0.39_931611_3172020094545196_F150844.zip

Advanced approach – use only if you have specific needs and can directly replace existing assemblies for your build environment

Please find the patch assemblies alone from below location:

http://syncfusion.com/Installs/support/patch/17.4.0.39/931611/F150844/SyncfusionPatch_17.4.0.39_931611_3172020094545196_F150844.zip

Assembly Version: 17.4.0.39

Installation Directions : This patch should replace the files “Syncfusion.Pdf.Base” under the following folder.

$system drive:\ Files\Syncfusion\Essential Studio\$Version # \precompiledassemblies\$Version#\4.6

Eg : $system drive:\Program Files\Syncfusion\Essential Studio\9.3.0.61\precompiledassemblies\9.3.0.61\4.0

To automatically run the Assembly Manager, please check the Run assembly manager checkbox option while installing the patch. If this option is unchecked, the patch will replace the assemblies in precompiled assemblies’ folder only. Then, you will have to manually copy and paste them to the preferred location or you will have to run the Syncfusion Assembly Manager application (available from the Syncfusion Dashboard, installed as a shortcut in the Application menu) to re-install assemblies.

Disclaimer : Please note that we have created this patch for version 17.4.0.39 specifically to resolve the following issue(s) reported in this/the incident(s).

150844

If you have received other patches for the same version for other products, please apply all patches in the order received.

This fix will be included in our main release 2020 Vol 1, which will be available at the end of March 2020.

Regards,
Uthandaraja S

Gyorgy Gorog · Answer

Uthandaraja, testing 18.1.0.36-beta, it seems from the release notes that you worked really a lot. Still, formatted text extraction puts unexpected space series into text. I attach the PDF.

In this case, unformatted extraction works as expected, but in another PDF, it puts unexpected line ends into the paragraphs.

I am unable to decide whether use formatted or unformatted extraction. I renamed the PDF-s accordingly.

A 
határozat  ellen  fellebbezésnek,  újrafelvételi 
elj                    
árásnak  nincs  helye. 
A 

határozat bírósági felülvizsgálatát
annak kézbesíté                    sétől számított tizenöt napon 

belül 
keresettel  a  felperes 
belföldi  székhelye  (lak                      óhelye)  szerint 
illetékes 

közigazgatási és munkaügyi bíróságtól
lehet kérni.                     A
keresetlevelet az illetékes 

bírósághoz címezve, kizárólag a
Döntőbizottsághoz lehet benyújtani. Tárgyalás 

tartását  a 
felperes  a  keresetlevélben  kérheti. 
A  ke                     resetlevél  benyújtásának 
a 

határozat végrehajtására nincs
halasztó hatálya.

Attachment: Formatted_vs_unformatted_extract_3f691ba6.zip

Uthandaraja Selva Sundara Kani · Answer

Hi Gyorgy, 

Thanks for your update. 

Please find the details below.

Query: Uthandaraja, testing 18.1.0.36-beta, it seems from the release notes that you worked really a lot. Still, formatted text extraction puts unexpected space series into text. I attach the PDF.

Details: The fix for the issue “Extra spaces added between words using ExtractText(bool) method” was not included in the 18.1.0.36-beta release, which is why the extra spaces still appear for the shared PDF document. The fix will be included in our 2020 Vol 1 main release, which is expected to be available at the end of March 2020, and the spacing issue will be resolved in that release.

Query: I am unable to decide whether use formatted or unformatted extraction.

Details: If you want to extract the text based on the layout, we suggest using the ExtractText(bool) overload. Otherwise, you can use the ExtractText method.

Please let us know if you need further assistance. 

Regards, 
Uthandaraja S
