
Can we convert PDF into HTML/XML using Syncfusion?

Thread ID: 129023 | Created: Feb 22, 2017 08:46 AM UTC | Updated: Feb 23, 2017 11:27 AM UTC | Platform: ASP.NET Core | Replies: 1
Tags: PdfViewer
Vikas Sharma
Asked On February 22, 2017 08:46 AM UTC

I am using Syncfusion to read a PDF that a user has filled in electronically, but I have not found much in the library for this. ExtractText() gives me the PDF text, but that is not much use to me because I also need to read the data that the user has entered in the PDF. I also tried form.Fields, but that only helps if the PDF was originally created with form fields (using Adobe Acrobat Reader) and the user filled those in. I want to handle the scenario where the user has simply typed plain text into the PDF, and read the entire PDF, including both the PDF text and the user's input.

Is there any method exposed in the library for this task?

If not, can I export the PDF to XML or to an HTML table and then read that XML/HTML? Can anyone point me to the correct functions/methods (if available) in Syncfusion?

Mohan Selvaraj [Syncfusion]
Replied On February 23, 2017 11:27 AM UTC

Hi Vikas,  
At present, we do not support text extraction on the ASP.NET Core platform. However, on the other supported platforms you can read the form field values without exporting the PDF to XML or HTML. Please refer to the steps below to extract the form field values.  
1.      Use the PdfLoadedForm object to get the filled-in form field values. Please refer to the code snippet below for more details.  
        //Requires: using Syncfusion.Pdf.Parsing;
        //Load the existing PDF document
        PdfLoadedDocument ldoc = new PdfLoadedDocument("../../input.pdf");
        //Load the existing form
        PdfLoadedForm lForm = ldoc.Form;
        //Get the form fields value
        foreach (PdfLoadedField lField in lForm.Fields)
        {
            if (lField is PdfLoadedTextBoxField)
            {
                PdfLoadedTextBoxField lTextBox = lField as PdfLoadedTextBoxField;
                //Get the text box value
                string text = lTextBox.Text;
            }
        }
2.      You can also get the form field values by flattening the form and then extracting the text from the PDF. Please refer to the code snippet and sample below for more details.  
            //Requires: using System.IO; using Syncfusion.Pdf; using Syncfusion.Pdf.Parsing;
            //Load the existing PDF document
            PdfLoadedDocument ldoc = new PdfLoadedDocument("../../input.pdf");
            //Load the existing form
            PdfLoadedForm lForm = ldoc.Form;
            //Flatten the form fields
            lForm.Flatten = true;
            MemoryStream ms = new MemoryStream();
            //Save the PDF document
            ldoc.Save(ms);
            //Close the PDF document
            ldoc.Close(true);
            //Load the document
            PdfLoadedDocument doc = new PdfLoadedDocument(ms);
            string text = string.Empty;
            foreach (PdfLoadedPage lpage in doc.Pages)
            {
                //Extract the text
                text += lpage.ExtractText(true);
            }
            File.AppendAllText("text.txt", text);
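As a rough sketch of the first approach above, the text box values could be collected into a dictionary keyed by field name. The class name FormFieldReader and the method ReadTextBoxValues are hypothetical (they are not part of the Syncfusion API), the null check assumes that the Form property is null when the PDF contains no form, and only text box fields are handled.

```csharp
using System.Collections.Generic;
using Syncfusion.Pdf.Parsing;

public static class FormFieldReader
{
    // Hypothetical helper: returns the text of every text box field in the PDF,
    // keyed by field name. Other field types are ignored.
    public static Dictionary<string, string> ReadTextBoxValues(string path)
    {
        var values = new Dictionary<string, string>();
        PdfLoadedDocument ldoc = new PdfLoadedDocument(path);
        try
        {
            PdfLoadedForm lForm = ldoc.Form;
            // Assumption: Form is null when the PDF has no form fields at all.
            if (lForm == null)
            {
                return values;
            }

            foreach (PdfLoadedField lField in lForm.Fields)
            {
                PdfLoadedTextBoxField lTextBox = lField as PdfLoadedTextBoxField;
                if (lTextBox != null)
                {
                    // A later field with the same name overwrites an earlier one.
                    values[lField.Name] = lTextBox.Text;
                }
            }
        }
        finally
        {
            // Close the document and release its resources.
            ldoc.Close(true);
        }
        return values;
    }
}
```

An empty dictionary here would suggest that the user typed plain text into the PDF rather than filling in form fields, in which case the flatten-and-extract approach in step 2 may be the one to try.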
We have created a Windows Forms sample that demonstrates this, and you can download it from the link below.  
Please let us know which platform you are using. This detail will help us analyze further and assist you better.  
Please let us know if you have any concerns.  
Mohan S

