PDF to HTML issues

6 Replies
3 Participants

Created by
GG Gyorgy Gorog

Platform
WinForms

Control
PDF

Created On
Feb 4, 2020 07:45 AM UTC

Last Activity On
Feb 18, 2020 11:02 AM UTC

Good Morning, I am evaluating PDF to HTML OPX using the downloadable solution (PdfToHtmlOPX1940268788.zip) with Syncfusion.Pdf.WinForms 17.4.0.46.

I think Syncfusion PDF controls are great and could be a major competitor to Adobe DC, especially together with DocIO when it comes to PDF -> Word conversion.

However there are a few issues in PDF -> HTML:

1. We Hungarians use some unusual characters (ő and ű) that don't make it into the HTML. They are correctly present in the output of PdfLoadedPage.ExtractText(). I can patch the HTML using ExtractText(), but that's a hack, isn't it?

2. Acrobat DC has options to skip page numbers and headers/footers. Is there any hope Syncfusion.Pdf can do that too?

3. Memory consumption grows upon repeated conversion even if I use the same converter and settings objects, set ldoc.EnableMemoryOptimization = true; and call the following after each document:

ldoc.Close(true);
GC.Collect();
GC.WaitForPendingFinalizers();
Application.DoEvents();

I suspect it's the images? When converting a short PDF with no images, memory leak is marginal:

4. Minor: IgnoreImage seems to have no effect. I'm OK with this as I like images.

Thanks!
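
Editorial sketch for point 1: a minimal way to confirm exactly which characters are lost, assuming you already have the text returned by ExtractText() and the generated HTML as strings. HtmlCharCheck and MissingCharacters are hypothetical names introduced here for illustration; they are not part of the Syncfusion API, and only standard .NET calls are used.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class HtmlCharCheck
{
    // Returns the non-whitespace characters that occur in extractedText
    // but nowhere in the HTML output. Entities such as &#337; are decoded
    // first, so a character written as an entity still counts as present.
    public static SortedSet<char> MissingCharacters(string extractedText, string html)
    {
        string decoded = WebUtility.HtmlDecode(html);
        var present = new HashSet<char>(decoded);
        var missing = new SortedSet<char>();

        foreach (char c in extractedText)
        {
            if (!char.IsWhiteSpace(c) && !present.Contains(c))
            {
                missing.Add(c);
            }
        }
        return missing;
    }
}
```

To use it, pass the concatenated ExtractText() output of all pages as the first argument and File.ReadAllText("output.html") as the second. A result containing only ő and ű would support the observation above. This only reports characters that never appear in the HTML at all; it does not detect a character that is dropped in some places but kept in others.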

6 Replies

SL Sowmiya Loganathan Syncfusion Team February 5, 2020 01:48 PM UTC

Hi Gyorgy,

Thank you for contacting Syncfusion support.

Query 1: We Hungarians use some strange characters (ő and ű) that don't make it to the HTML. They are nicely present in PdfLoadedPage:ExtractText(). Actually I can correct this from ExtractText() but it's hacking isn't it?
Response: We use the open-source software "OPX" to convert PDF to HTML, so only the characters supported by OPX are preserved.

Query 2: Acrobat DC has options to skip page numbers and heading/footing. Is there any hope Syncfusion.Pdf can do that too?
Response: We are able to skip the page numbers and header/footer in a PDF document by drawing the content in a PdfTemplate and then drawing it in the PDF document like a header and footer. Please find the sample below, which illustrates this.
Sample: https://www.syncfusion.com/downloads/support/forum/151213/ze/PdfSample-1618687314

Query 3: Memory consumption grows upon repeated conversion even if I use the same converter and settings objects ldoc.EnableMemoryOptimization = true; and call after each document: ldoc.Close(true); GC.Collect(); GC.WaitForPendingFinalizers(); Application.DoEvents();
Response: Could you please share with us the PDF document and the complete code snippet/sample to replicate this issue? It will help us provide a precise solution.

Regards,

Sowmiya Loganathan

GG Gyorgy Gorog February 7, 2020 09:14 AM UTC

Sowmiya, thanks for the update.

2. For skipping headers and page numbers, I meant during text extraction and HTML generation.

3. For the memory leak, this is the sample "barcode.pdf" you provide with PdfToHtmlOPX1940268788.zip:

This is a PDF with more images, please see attached (jimny_muszaki_1029_1c9d842.zip):

I also attach another PDF that completely fails (40064_2016_Article_3041.zip).

The code is yours, just put into a cycle.

//Initialize PdfToHtmlConverter
PdfToHtmlConverter converter = new PdfToHtmlConverter();

//Initialize and apply settings
PdfToHtmlConverterSettings setting = new PdfToHtmlConverterSettings();
setting.IsFrame = false;
setting.AbsolutePositioning = false;
converter.Settings = setting;

for (int i = 0; i < 100; i++)
{
    //Load the input PDF document
    PdfLoadedDocument ldoc = new PdfLoadedDocument(txtImageFile.Text);

    //Convert PDF to HTML
    converter.Convert(txtImageFile.Text, "output.html", ldoc.Pages.Count);
    ldoc.Close(true);

    //Show progress, then force a collection and let the UI update
    Tbi.Text = i.ToString();
    GC.Collect();
    GC.WaitForPendingFinalizers();
    Application.DoEvents();
    Thread.Sleep(1);
}

Thanks!

Attachment: 40064_2016_Article_3041_295d70a0.zip
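
Editorial sketch: one way to put numbers on the growth seen in the loop above is to compare the managed heap size before and after repeated runs, and to look at the process's private bytes alongside it. LeakProbe, ManagedGrowth, SettledManagedBytes and PrivateBytes are hypothetical names introduced here; the conversion itself is passed in as a delegate, so nothing about the Syncfusion API is assumed. Only standard .NET calls are used.

```csharp
using System;
using System.Diagnostics;

public static class LeakProbe
{
    // Runs `work` once as a warm-up, then `iterations` more times, and
    // returns how many bytes the managed heap grew across those iterations.
    public static long ManagedGrowth(Action work, int iterations)
    {
        work(); // warm-up, so one-time allocations are not counted as growth
        long baseline = SettledManagedBytes();

        for (int i = 0; i < iterations; i++)
        {
            work();
        }
        return SettledManagedBytes() - baseline;
    }

    // Forces a full collection and finalization before reading the heap size,
    // so that garbage awaiting collection does not inflate the number.
    private static long SettledManagedBytes()
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        return GC.GetTotalMemory(true);
    }

    // Private bytes of the current process, which include native allocations
    // that the managed heap figure cannot see.
    public static long PrivateBytes()
    {
        using (Process p = Process.GetCurrentProcess())
        {
            return p.PrivateMemorySize64;
        }
    }
}
```

For the loop above, the body of one iteration (load, convert, close) would go into the delegate passed to ManagedGrowth, with PrivateBytes() read before and after. If private bytes climb while managed growth stays near zero, the memory is probably held outside the managed heap, for example by native code or unmanaged image buffers. Whether that applies to this converter is not confirmed anywhere in this thread.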

SL Sowmiya Loganathan Syncfusion Team February 10, 2020 12:38 PM UTC

Hi Gyorgy,

Query: For skipping headers and page numbers, I meant during text extraction and html generation.
Response: We use the open-source software "OPX" to convert PDF to HTML, so we are not able to skip the header and page number during HTML generation.

Query: For memory leak, this is the sample "barcode.pdf" you provide with PdfToHtmlOPX1940268788.zip: I also attach another pdf that completely fails (40064_2016_Article_3041.zip).
Response: We were able to reproduce the reported issue and suspect it to be a defect. We are currently validating this and will provide further details on 12th February, 2020.

Regards,

Sowmiya Loganathan

SL Sowmiya Loganathan Syncfusion Team February 12, 2020 02:15 PM UTC

Hi Gyorgy,

Query: I also attach another pdf that completely fails (40064_2016_Article_3041.zip).
Response: We internally use the open-source xpdf to convert PDF to HTML, and the conversion fails because of an exception that occurs in that open-source library itself. So we cannot proceed further to resolve this issue on our end.

Query: For memory leak, this is the sample "barcode.pdf" you provide with PdfToHtmlOPX1940268788.zip:
Response: We have checked the reported memory leak issue, but the conversion does not consume more memory on our end. We have also ensured that the memory taken by PdfLoadedDocument is the actual memory needed to process the document. There is no bottleneck in our implementation, so we could not optimize this further.

Regards,

Sowmiya Loganathan

GG Gyorgy Gorog February 17, 2020 09:39 AM UTC

Sowmiya, thanks for the update. Sorry you don't want to beat Acrobat :) you are just a few steps away.

Thanks anyway.

PV Prakash Viswanathan Syncfusion Team February 18, 2020 11:02 AM UTC

Hi Gyorgy,

Thank you for the update. Please let us know if you need any further assistance on this.

Regards,

Prakash V
