PDF to HTML issues
Good Morning, I am evaluating PDF to HTML OPX using the downloadable solution (PdfToHtmlOPX1940268788.zip) with Syncfusion.Pdf.WinForms 17.4.0.46.

I think Syncfusion PDF controls are great and could be a major competitor to Adobe DC, especially together with DocIO when it comes to PDF -> Word conversion.
However, there are a few issues in PDF -> HTML:
1. We Hungarians use some strange characters (ő and ű) that don't make it to the HTML. They are nicely present in PdfLoadedPage:ExtractText(). Actually, I can correct this from ExtractText(), but that's a hack, isn't it?
2. Acrobat DC has options to skip page numbers and headers/footers. Is there any hope Syncfusion.Pdf can do that too?
3. Memory consumption grows upon repeated conversion even if I use the same converter and settings objects, set
ldoc.EnableMemoryOptimization = true;
and call after each document:
ldoc.Close(true);
GC.Collect();
GC.WaitForPendingFinalizers();
Application.DoEvents();
I suspect it's the images? When converting a short PDF with no images, the memory leak is marginal.
4. Minor: IgnoreImage seems to have no effect. I'm OK with this as I like images.
Thanks!
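To pin down exactly which characters are lost (item 1), the text from ExtractText() can be compared with the generated HTML. The following is only a minimal sketch: `CharacterDiff` and `MissingNonAscii` are hypothetical names, not part of any Syncfusion API, and it assumes both the extracted text and the HTML file contents are already available as strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

static class CharacterDiff
{
    // Returns the distinct non-ASCII characters that occur in extractedText
    // but nowhere in html (after decoding HTML entities), sorted by code unit.
    // Hypothetical helper for diagnosis only.
    public static List<char> MissingNonAscii(string extractedText, string html)
    {
        // Decode entities such as &#337; so encoded characters count as present.
        string decoded = WebUtility.HtmlDecode(html);
        var present = new HashSet<char>(decoded);

        return extractedText
            .Where(c => c > 127 && !present.Contains(c))
            .Distinct()
            .OrderBy(c => c)
            .ToList();
    }
}
```

For example, `CharacterDiff.MissingNonAscii("Erdő és fű", "<p>Erd és f</p>")` reports ő and ű, while é is not reported because it survived. The comparison works on UTF-16 code units, which is sufficient for Hungarian letters.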
Sowmiya Loganathan
Syncfusion Team
February 5, 2020 01:48 PM UTC
Hi Gyorgy,
Thank you for contacting Syncfusion support.
Query: 1. We Hungarians use some strange characters (ő and ű) that don't make it to the HTML. They are nicely present in PdfLoadedPage:ExtractText(). Actually I can correct this from ExtractText() but it's hacking isn't it?
Response: We use the open-source software “OPX” to convert PDF to HTML, so only the characters that OPX supports are preserved.

Query: 2. Acrobat DC has options to skip page numbers and heading/footing. Is there any hope Syncfusion.Pdf can do that too?
Response: You can skip the page numbers and header/footer in a PDF document by drawing the content in a PdfTemplate and drawing that template in the PDF document as the header and footer. Please find the sample below, which illustrates this.

Query: 3. Memory consumption grows upon repeated conversion even if I use the same converter and settings objects
ldoc.EnableMemoryOptimization = true;
and call after each document:
ldoc.Close(true);
GC.Collect();
GC.WaitForPendingFinalizers();
Application.DoEvents();
Response: Could you please share with us the PDF document and the complete code snippet/sample to replicate this issue? It will help us provide a precise solution.

Regards,
Sowmiya Loganathan
Gyorgy Gorog
February 7, 2020 09:14 AM UTC
Sowmiya, thanks for the update.


Attachment: 40064_2016_Article_3041_295d70a0.zip
3. For skipping headers and page numbers, I meant during text extraction and HTML generation.
4. For the memory leak, this is the sample "barcode.pdf" you provide with PdfToHtmlOPX1940268788.zip.
This is a PDF with more images, please see attached (jimny_muszaki_1029_1c9d842.zip).
I also attach another PDF that completely fails (40064_2016_Article_3041.zip).
The code is yours, just put into a loop.
//Initialize PdfToHtmlConverter
PdfToHtmlConverter converter = new PdfToHtmlConverter();
//Initialize and apply settings
PdfToHtmlConverterSettings setting = new PdfToHtmlConverterSettings();
setting.IsFrame = false;
setting.AbsolutePositioning = false;
converter.Settings = setting;
for (int i = 0; i < 100; i++)
{
    //Load the input PDF document.
    PdfLoadedDocument ldoc = new PdfLoadedDocument(txtImageFile.Text);
    //Convert PDF to HTML
    converter.Convert(txtImageFile.Text, "output.html", ldoc.Pages.Count);
    ldoc.Close(true);
    //Show the iteration counter, then force a collection
    Tbi.Text = i.ToString();
    GC.Collect();
    GC.WaitForPendingFinalizers();
    Application.DoEvents();
    Thread.Sleep(1);
}
Thanks!
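To put numbers on the growth rather than watching Task Manager, the body of a loop like the one above can be wrapped in a small probe. This is only a sketch: `LeakProbe` and `Measure` are hypothetical names, and the `work` delegate stands in for one conversion. It records both the managed heap size and the process private bytes, on the assumption (not confirmed in this thread) that part of the memory may be held outside the managed heap.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class LeakProbe
{
    // Runs work() the given number of times. After each run it forces a full
    // garbage collection and records the managed heap size and the private
    // bytes of the current process. Hypothetical helper for diagnosis only.
    public static List<(long ManagedBytes, long PrivateBytes)> Measure(Action work, int iterations)
    {
        var samples = new List<(long ManagedBytes, long PrivateBytes)>(iterations);
        for (int i = 0; i < iterations; i++)
        {
            work();

            // Collect, run finalizers, then collect what the finalizers released.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();

            long managed = GC.GetTotalMemory(forceFullCollection: true);
            long priv;
            // A fresh Process object gives up-to-date counters.
            using (var process = Process.GetCurrentProcess())
                priv = process.PrivateMemorySize64;

            samples.Add((managed, priv));
        }
        return samples;
    }
}
```

If the managed figure stays flat while the private bytes keep climbing across iterations, the growth is probably not in objects the garbage collector can see, and calling GC.Collect() will not reclaim it.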
Sowmiya Loganathan
Syncfusion Team
February 10, 2020 12:38 PM UTC
Hi Gyorgy,
Query: For skipping headers and page numbers, I meant during text extraction and html generation.
Response: We use the open-source software “OPX” to convert PDF to HTML, so we are not able to skip the header and page number during HTML generation.

We were able to reproduce the reported issue and suspect it to be a defect. We are currently validating this and will update you with further details on February 12, 2020.
Regards,
Sowmiya Loganathan
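Since the converter itself cannot skip headers and page numbers, one possible workaround on the text-extraction side is to filter the per-page text afterwards. The sketch below is a heuristic, not a Syncfusion feature: `HeaderFooterFilter`, `Strip`, and the `minShare` threshold are hypothetical, and it assumes one string per page (for example, as returned by ExtractText()) with a repeated header or footer appearing as the first or last line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class HeaderFooterFilter
{
    // Strips leading and trailing lines of each page while they are "noise":
    // either digits only (a bare page number) or a line that is the first or
    // last line on at least minShare of all pages. Heuristic, not exact.
    public static List<string> Strip(IReadOnlyList<string> pageTexts, double minShare = 0.6)
    {
        // Split each page into trimmed, non-empty lines.
        var pages = pageTexts
            .Select(t => t.Split('\n').Select(l => l.Trim()).Where(l => l.Length > 0).ToList())
            .ToList();

        // Count how often each line occurs as the first or last line of a page.
        var edgeCounts = new Dictionary<string, int>();
        foreach (var lines in pages)
        {
            if (lines.Count == 0) continue;
            foreach (var edge in new[] { lines[0], lines[lines.Count - 1] }.Distinct())
                edgeCounts[edge] = edgeCounts.TryGetValue(edge, out var seen) ? seen + 1 : 1;
        }

        bool IsNoise(string line) =>
            line.All(char.IsDigit) ||
            (edgeCounts.TryGetValue(line, out var count) && count >= minShare * pages.Count);

        var result = new List<string>();
        foreach (var lines in pages)
        {
            int start = 0, end = lines.Count; // keep lines in [start, end)
            while (start < end && IsNoise(lines[start])) start++;
            while (end > start && IsNoise(lines[end - 1])) end--;
            result.Add(string.Join("\n", lines.GetRange(start, end - start)));
        }
        return result;
    }
}
```

Given three pages that each start with the same journal title and end with their page number, `Strip` returns only the body line of each page. The heuristic can misfire on short documents or on body text that begins or ends a page with a bare number, so the threshold would need tuning per document set.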
Sowmiya Loganathan
Syncfusion Team
February 12, 2020 02:15 PM UTC
Hi Gyorgy,
Query: I also attach another pdf that completely fails (40064_2016_Article_3041.zip).
Response: We internally use the open-source xpdf to convert PDF to HTML, and the conversion fails due to an exception that occurs in that open-source library itself. So we cannot proceed further to resolve this issue on our end.

Query: For memory leak, this is the sample "barcode.pdf" you provide with PdfToHtmlOPX1940268788.zip:
Response: We have checked the reported memory leak issue, but the conversion does not consume more memory on our end. We have also confirmed that the memory taken by PdfLoadedDocument is the actual memory needed to process the document. There is no bottleneck in our implementation, so we could not optimize this further.
Regards,
Sowmiya Loganathan
Gyorgy Gorog
February 17, 2020 09:39 AM UTC
Sowmiya, thanks for the update. Sorry you don't want to beat Acrobat :) you are just a few steps away.
Thanks anyway.
Prakash Viswanathan
Syncfusion Team
February 18, 2020 11:02 AM UTC
Hi Gyorgy,
Thank you for the update. Please let us know if you need any further assistance with this.
Regards,
Prakash V
Thread started by Gyorgy Gorog on Feb 4, 2020 07:45 AM UTC; last reply on Feb 18, 2020 11:02 AM UTC.