Syncfusion Feedback

PDF to HTML issues

Thread ID: 151213
Created: Feb 4, 2020 07:45 AM UTC
Updated: Feb 18, 2020 11:02 AM UTC
Platform: WinForms
Replies: 6
Tags: PDF
Gyorgy Gorog
Asked On February 4, 2020 08:13 AM UTC

Good Morning, I am evaluating PDF to HTML OPX using the downloadable solution (PdfToHtmlOPX1940268788.zip) with Syncfusion.Pdf.WinForms 17.4.0.46. 

I think Syncfusion PDF controls are great and could be a major competitor to Adobe DC, especially together with DocIO when it comes to PDF -> Word conversion. 

However there are a few issues in PDF -> HTML:

1. We Hungarians use some strange characters (ő and ű) that don't make it to the HTML. They are nicely present in PdfLoadedPage:ExtractText(). Actually I can correct this from ExtractText() but it's hacking isn't it?

2. Acrobat DC has options to skip page numbers and heading/footing. Is there any hope Syncfusion.Pdf can do that too?

3. Memory consumption grows upon repeated conversion even if I use the same converter and settings objects, set
     ldoc.EnableMemoryOptimization = true;
and call after each document:
     ldoc.Close(true);
     GC.Collect();
     GC.WaitForPendingFinalizers();
     Application.DoEvents();

I suspect it's the images? When converting a short PDF with no images, memory leak is marginal:



4. Minor: IgnoreImage seems to have no effect. I'm OK with this as I like images.

Thanks!
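
For item 1, a quick way to confirm which characters are lost is to compare the text returned by ExtractText() with the generated HTML. The sketch below is only an illustration and uses no Syncfusion calls: the class name HungarianCharCheck and the method MissingInHtml are hypothetical, and it assumes the extracted text and the HTML file contents have already been read into strings.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical diagnostic helper, not part of Syncfusion or OPX.
public static class HungarianCharCheck
{
    // The four characters named in the question: ő, ű and their capitals.
    private static readonly char[] Watched = { '\u0151', '\u0171', '\u0150', '\u0170' };

    // Returns the watched characters that occur in the extracted text
    // but are absent from the HTML output.
    public static IList<char> MissingInHtml(string extractedText, string html)
    {
        // Decode entities first so that "&#369;" counts as a present 'ű'.
        string decoded = WebUtility.HtmlDecode(html);
        var missing = new List<char>();
        foreach (char c in Watched)
        {
            if (extractedText.IndexOf(c) >= 0 && decoded.IndexOf(c) < 0)
            {
                missing.Add(c);
            }
        }
        return missing;
    }
}
```

An empty result means every watched character that appears in the extracted text also appears somewhere in the HTML; it does not prove that each occurrence survived.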

Sowmiya Loganathan [Syncfusion]
Replied On February 5, 2020 01:48 PM UTC

Hi Gyorgy,  

Thank you for contacting Syncfusion support.  

1. We Hungarians use some strange characters (ő and ű) that don't make it to the HTML. They are nicely present in PdfLoadedPage:ExtractText(). Actually I can correct this from ExtractText() but it's hacking isn't it? 
We use the open-source software “OPX” to convert PDF to HTML, so only the characters supported by OPX are preserved.  
2. Acrobat DC has options to skip page numbers and heading/footing. Is there any hope Syncfusion.Pdf can do that too? 
You can skip the page numbers and header/footer in a PDF document by drawing the content in a PdfTemplate and then drawing that template in the PDF document as the header and footer. Please find the sample below, which illustrates this: 


3. Memory consumption grows upon repeated conversion even if I use the same converter and settings objects 
     ldoc.EnableMemoryOptimization = true; 
and call after each document: 
     ldoc.Close(true); 
     GC.Collect(); 
     GC.WaitForPendingFinalizers(); 
     Application.DoEvents(); 
Could you please share the PDF document and the complete code snippet/sample needed to replicate this issue? It will help us provide a precise solution.  

Regards, 
Sowmiya Loganathan 
 


Gyorgy Gorog
Replied On February 7, 2020 09:14 AM UTC

Sowmiya, thanks for the update. 

2. For skipping headers and page numbers, I meant during text extraction and HTML generation.

3. For the memory leak, this is the sample "barcode.pdf" you provide with PdfToHtmlOPX1940268788.zip:


This is a PDF with more images, please see attached (jimny_muszaki_1029_1c9d842.zip):


I also attach another pdf that completely fails (40064_2016_Article_3041.zip).

The code is yours, just put into a loop. 
// Initialize PdfToHtmlConverter
PdfToHtmlConverter converter = new PdfToHtmlConverter();
// Initialize and apply settings
PdfToHtmlConverterSettings setting = new PdfToHtmlConverterSettings();
setting.IsFrame = false;
setting.AbsolutePositioning = false;
converter.Settings = setting;

for (int i = 0; i < 100; i++)
{
    // Load the input PDF document (used here only to get the page count)
    PdfLoadedDocument ldoc = new PdfLoadedDocument(txtImageFile.Text);
    // Convert PDF to HTML
    converter.Convert(txtImageFile.Text, "output.html", ldoc.Pages.Count);
    ldoc.Close(true);

    // Show the iteration counter, then force a collection before the next run
    Tbi.Text = i.ToString();
    GC.Collect();
    GC.WaitForPendingFinalizers();
    Application.DoEvents();
    Thread.Sleep(1);
}

Thanks!


Attachment: 40064_2016_Article_3041_295d70a0.zip
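
One way to narrow down the growth reported for the loop above is to sample the managed heap and the process private bytes separately. The sketch below is an illustration under stated assumptions, not Syncfusion code: MemorySample and MemoryProbe are hypothetical names, and only System APIs (GC.GetTotalMemory, Process.PrivateMemorySize64) are used.

```csharp
using System;
using System.Diagnostics;

// Hypothetical value holder: one reading of managed and process-wide memory.
public sealed class MemorySample
{
    public MemorySample(long managedBytes, long privateBytes)
    {
        ManagedBytes = managedBytes;
        PrivateBytes = privateBytes;
    }

    public long ManagedBytes { get; }
    public long PrivateBytes { get; }
}

// Hypothetical helper, not part of Syncfusion.
public static class MemoryProbe
{
    // Takes one reading after forcing a full garbage collection,
    // so that only live managed objects are counted.
    public static MemorySample Take()
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        long managed = GC.GetTotalMemory(true);

        using (Process process = Process.GetCurrentProcess())
        {
            return new MemorySample(managed, process.PrivateMemorySize64);
        }
    }

    // Runs the action once as a warm-up, then 'iterations' more times,
    // and returns the difference between the readings before and after.
    public static MemorySample MeasureGrowth(Action action, int iterations)
    {
        action();
        MemorySample before = Take();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        MemorySample after = Take();
        return new MemorySample(
            after.ManagedBytes - before.ManagedBytes,
            after.PrivateBytes - before.PrivateBytes);
    }
}
```

Wrapping the body of the loop in a method and passing it as the action gives two numbers. If PrivateBytes keeps growing while ManagedBytes stays roughly flat, the growth sits outside the managed heap, where GC.Collect() cannot reclaim it.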

Sowmiya Loganathan [Syncfusion]
Replied On February 10, 2020 12:38 PM UTC

Hi Gyorgy, 
 
For skipping headers and page numbers, I meant during text extraction and html generation.  
We use the open-source software “OPX” to convert PDF to HTML, so we are not able to skip the header and page number during HTML generation.  
  • For memory leak, this is the sample "barcode.pdf" you provide with PdfToHtmlOPX1940268788.zip:
  • I also attach another pdf that completely fails (40064_2016_Article_3041.zip).
We were able to reproduce the reported issue and suspect it to be a defect. We are currently validating it and will provide further details on February 12, 2020.  
 
Regards, 
Sowmiya Loganathan 


Sowmiya Loganathan [Syncfusion]
Replied On February 12, 2020 02:15 PM UTC

Hi Gyorgy, 
 
I also attach another pdf that completely fails (40064_2016_Article_3041.zip). 
We internally use the open-source xpdf to convert PDF to HTML, and the conversion fails because an exception occurs in that open-source library itself, so we cannot resolve this issue on our end.   
For memory leak, this is the sample "barcode.pdf" you provide with PdfToHtmlOPX1940268788.zip: 
We have checked the reported memory leak issue, but the conversion does not consume more memory on our end. We have also confirmed that the memory taken by PdfLoadedDocument is the memory actually needed to process the document. There is no bottleneck in our implementation, so we cannot optimize this further.  
 
Regards, 
Sowmiya Loganathan 


Gyorgy Gorog
Replied On February 17, 2020 09:39 AM UTC

Sowmiya, thanks for the update. Sorry you don't want to beat Acrobat :) you are just a few steps away. 
Thanks anyway.

Prakash Viswanathan [Syncfusion]
Replied On February 18, 2020 11:02 AM UTC

Hi Gyorgy, 
 
Thank you for the update. Please let us know if you need any further assistance with this.   
    
Regards,   
Prakash V   

