PDF to HTML issues

Good Morning, I am evaluating PDF to HTML OPX using the downloadable solution (PdfToHtmlOPX1940268788.zip) with Syncfusion.Pdf.WinForms 17.4.0.46. 

I think Syncfusion PDF controls are great and could be a major competitor to Adobe DC, especially together with DocIO when it comes to PDF -> Word conversion. 

However, there are a few issues in PDF -> HTML:

1. We Hungarians use some strange characters (ő and ű) that don't make it to the HTML. They are nicely present in PdfLoadedPage:ExtractText(). Actually, I can correct this from ExtractText(), but that's a hack, isn't it?

2. Acrobat DC has options to skip page numbers and headers/footers. Is there any hope Syncfusion.Pdf can do that too?

3. Memory consumption grows upon repeated conversion, even if I use the same converter and settings objects, set, for example,
     ldoc.EnableMemoryOptimization = true;
and call after each document:
     ldoc.Close(true);
     GC.Collect();
     GC.WaitForPendingFinalizers();
     Application.DoEvents();

I suspect it's the images? When converting a short PDF with no images, memory leak is marginal:



4. Minor: IgnoreImage seems to have no effect. I'm OK with this as I like images.

Thanks!

6 Replies

SL Sowmiya Loganathan Syncfusion Team February 5, 2020 01:48 PM UTC

Hi Gyorgy,  

Thank you for contacting Syncfusion support.  

1. We Hungarians use some strange characters (ő and ű) that don't make it to the HTML. They are nicely present in PdfLoadedPage:ExtractText(). Actually, I can correct this from ExtractText(), but that's a hack, isn't it?
We use the open-source software “OPX” to convert PDF to HTML, so only the characters supported by OPX are preserved.
2. Acrobat DC has options to skip page numbers and headers/footers. Is there any hope Syncfusion.Pdf can do that too?
We can skip the page numbers and header/footer in a PDF document by drawing the content in a PdfTemplate and drawing that in the PDF document as a header and footer. Please find the sample below, which illustrates this:


3. Memory consumption grows upon repeated conversion, even if I use the same converter and settings objects, set
     ldoc.EnableMemoryOptimization = true;
and call after each document:
     ldoc.Close(true);
     GC.Collect();
     GC.WaitForPendingFinalizers();
     Application.DoEvents();
Could you please share the PDF document and the complete code snippet/sample needed to replicate this issue? It will help us provide a precise solution.

Regards, 
Sowmiya Loganathan 
 



GG Gyorgy Gorog February 7, 2020 09:14 AM UTC

Sowmiya, thanks for the update.

2. For skipping headers and page numbers, I meant during text extraction and HTML generation.

3. For the memory leak, this is the sample "barcode.pdf" that you provide with PdfToHtmlOPX1940268788.zip:


This is a PDF with more images; please see the attached file (jimny_muszaki_1029_1c9d842.zip):


I also attach another PDF that completely fails (40064_2016_Article_3041.zip).

The code is yours, just put into a loop.
    // Initialize PdfToHtmlConverter
    PdfToHtmlConverter converter = new PdfToHtmlConverter();
    // Initialize and apply settings
    PdfToHtmlConverterSettings setting = new PdfToHtmlConverterSettings();
    setting.IsFrame = false;
    setting.AbsolutePositioning = false;
    converter.Settings = setting;

    for (int i = 0; i < 100; i++)
    {
        // Load the input PDF document (used here only for its page count)
        PdfLoadedDocument ldoc = new PdfLoadedDocument(txtImageFile.Text);
        // Convert PDF to HTML
        converter.Convert(txtImageFile.Text, "output.html", ldoc.Pages.Count);
        ldoc.Close(true);

        // Show progress, force a collection, and keep the UI responsive
        Tbi.Text = i.ToString();
        GC.Collect();
        GC.WaitForPendingFinalizers();
        Application.DoEvents();
        Thread.Sleep(1);
    }
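
To check whether memory really grows from one iteration to the next, a loop like this could be wrapped in a small probe. This is a minimal sketch, not Syncfusion code: `MemoryProbe` and `MemorySample` are hypothetical names, and only standard .NET calls are used (`GC.GetTotalMemory`, `Process.PrivateMemorySize64`). The managed figure would not show memory held by a native library, so the process-wide figure is recorded as well.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper types for illustration; not part of any Syncfusion API.
public sealed class MemorySample
{
    public MemorySample(long managedBytes, long privateBytes)
    {
        ManagedBytes = managedBytes;
        PrivateBytes = privateBytes;
    }

    // Bytes currently allocated on the managed heap.
    public long ManagedBytes { get; }

    // Private bytes of the whole process, including unmanaged allocations.
    public long PrivateBytes { get; }
}

public static class MemoryProbe
{
    // Runs the action repeatedly and records memory after each run.
    public static List<MemorySample> Measure(Action action, int iterations)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (iterations < 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        var samples = new List<MemorySample>(iterations);
        for (int i = 0; i < iterations; i++)
        {
            action();

            // Collect, let finalizers run, then measure after one more
            // forced collection so only reachable objects are counted.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            long managed = GC.GetTotalMemory(true);

            using (Process current = Process.GetCurrentProcess())
            {
                samples.Add(new MemorySample(managed, current.PrivateMemorySize64));
            }
        }
        return samples;
    }
}
```

With the load/convert/close steps passed in as the action, a steadily rising `PrivateBytes` next to a flat `ManagedBytes` would point at unmanaged allocations rather than at .NET objects.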

Thanks!


Attachment: 40064_2016_Article_3041_295d70a0.zip


SL Sowmiya Loganathan Syncfusion Team February 10, 2020 12:38 PM UTC

Hi Gyorgy, 
 
For skipping headers and page numbers, I meant during text extraction and html generation.  
We use the open-source software “OPX” to convert PDF to HTML, so we cannot skip the header and page number during HTML generation.
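
Since the HTML conversion cannot skip them, running headers and page numbers would have to be filtered on the text-extraction side. The following is a rough heuristic sketch, not a Syncfusion feature: `RunningLineFilter` is a hypothetical helper that takes one string per page (for instance, the text returned by `ExtractText()`) and drops a page's first or last line when the same line, with digit runs treated as equal, appears in that position on most pages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the "most pages" rule is an assumption.
public static class RunningLineFilter
{
    private static readonly Regex DigitRun = new Regex(@"\d+");

    // "Page 12" and "Page 13" map to the same key, "Page #".
    private static string Key(string line)
    {
        return DigitRun.Replace(line.Trim(), "#");
    }

    private static List<string> NonEmptyLines(string pageText)
    {
        return (pageText ?? string.Empty)
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)
            .Where(line => line.Trim().Length > 0)
            .ToList();
    }

    private static void Count(Dictionary<string, int> counts, string key)
    {
        int n;
        counts.TryGetValue(key, out n);
        counts[key] = n + 1;
    }

    // Returns one cleaned string per input page.
    public static List<string> Strip(IReadOnlyList<string> pageTexts)
    {
        List<List<string>> pages = pageTexts.Select(text => NonEmptyLines(text)).ToList();

        // Count how often each normalized first and last line occurs.
        var first = new Dictionary<string, int>();
        var last = new Dictionary<string, int>();
        foreach (List<string> lines in pages)
        {
            if (lines.Count == 0) continue;
            Count(first, Key(lines[0]));
            Count(last, Key(lines[lines.Count - 1]));
        }

        // A line counts as "running" if it occurs on more than half of the
        // pages, and on at least two of them.
        int threshold = Math.Max(2, pages.Count / 2 + 1);

        var result = new List<string>(pages.Count);
        foreach (List<string> lines in pages)
        {
            bool dropFirst = lines.Count > 0 && first[Key(lines[0])] >= threshold;
            bool dropLast = lines.Count > 1 && last[Key(lines[lines.Count - 1])] >= threshold;
            int skip = dropFirst ? 1 : 0;
            int take = lines.Count - skip - (dropLast ? 1 : 0);
            result.Add(string.Join("\n", lines.Skip(skip).Take(take)));
        }
        return result;
    }
}
```

The heuristic only looks at the outermost line of each page, so multi-line headers, or body text that happens to repeat at the top of most pages, would need a more careful rule.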
  • For memory leak, this is the sample "barcode.pdf" you provide with PdfToHtmlOPX1940268788.zip:
  • I also attach another pdf that completely fails (40064_2016_Article_3041.zip).
We were able to reproduce the reported issue and suspect it to be a defect. We are currently validating this and will provide further details on 12th February, 2020.
 
Regards, 
Sowmiya Loganathan 



SL Sowmiya Loganathan Syncfusion Team February 12, 2020 02:15 PM UTC

Hi Gyorgy, 
 
I also attach another pdf that completely fails (40064_2016_Article_3041.zip). 
We internally use the open-source xpdf library to convert PDF to HTML, and the conversion fails because of an exception that occurs in that open-source library itself. So we cannot proceed further to resolve this issue on our end.
For memory leak, this is the sample "barcode.pdf" you provide with PdfToHtmlOPX1940268788.zip: 
We have checked the reported memory leak issue, but it does not consume more memory on our end. We have also ensured that the memory taken by the PdfLoadedDocument is the actual memory needed to process the document. There is no bottleneck in our implementation, so we could not optimize this further.
 
Regards, 
Sowmiya Loganathan 



GG Gyorgy Gorog February 17, 2020 09:39 AM UTC

Sowmiya, thanks for the update. Sorry you don't want to beat Acrobat :) you are just a few steps away.
Thanks anyway.


PV Prakash Viswanathan Syncfusion Team February 18, 2020 11:02 AM UTC

Hi Gyorgy, 
 
Thank you for the update. Please let us know if you need any further assistance on this.   
    
Regards,   
Prakash V   

