Can we convert PDF into HTML/XML using SyncFusion?

I am working on reading a PDF that a user has filled in electronically, using SyncFusion. I have not found much in the library for this. ExtractText() gives me the PDF text, but that is not much use on its own because I also need the data the user entered. I also tried form.Fields, but that only helps if the PDF was originally created with form fields (using Adobe Acrobat Reader) and the user filled those in. I want to handle the scenario where the user has simply typed plain text into the PDF, and read the entire PDF, including both the PDF text and the user's input.

Is there any method exposed in the library for this task?

If not, can I export the PDF to XML or to an HTML table and then read that XML/HTML? Can anyone point me to the correct functions/methods (if available) in SyncFusion?

1 Reply

Mohan Selvaraj, Syncfusion Team, February 23, 2017 11:27 AM UTC

Hi Vikas,  
At present, we do not support text extraction on the ASP.NET Core platform. However, on the other supported platforms you can read the form field values directly, without exporting the PDF to XML or HTML. Please follow the steps below to extract the form field values.
  
1.      Use the PdfLoadedForm object to read the filled-in form field values. Please refer to the code snippet below for details.
        //Requires: using Syncfusion.Pdf.Parsing;

        //Load the existing PDF document
        PdfLoadedDocument ldoc = new PdfLoadedDocument("../../input.pdf");

        //Load the existing form
        PdfLoadedForm lForm = ldoc.Form;

        //Get the form fields value
        foreach (PdfLoadedField lField in lForm.Fields)
        {
            //Only text box fields are handled here; other field types are skipped
            PdfLoadedTextBoxField lTextBox = lField as PdfLoadedTextBoxField;
            if (lTextBox != null)
            {
                //Get the text box value
                string text = lTextBox.Text;
            }
        }
  
2.      You can also get the form field values by flattening the form and then extracting the text from the PDF. Please refer to the code snippet and sample below for details.
  
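            //Requires: using System.IO; using Syncfusion.Pdf; using Syncfusion.Pdf.Parsing;

            //Load the existing PDF document
            PdfLoadedDocument ldoc = new PdfLoadedDocument("../../input.pdf");

            //Load the existing form
            PdfLoadedForm lForm = ldoc.Form;

            //Flatten the form fields so their values become part of the page content
            lForm.Flatten = true;

            MemoryStream ms = new MemoryStream();

            //Save the flattened PDF document to the stream
            ldoc.Save(ms);

            //Close the PDF document
            ldoc.Close(true);

            //Rewind the stream before reading it back (a precaution)
            ms.Position = 0;

            //Load the flattened document
            PdfLoadedDocument doc = new PdfLoadedDocument(ms);

            string text = string.Empty;

            foreach (PdfLoadedPage lpage in doc.Pages)
            {
                //Extract the text, including the flattened field values
                text += lpage.ExtractText(true);
            }

            //Append the extracted text to a file
            File.AppendAllText("text.txt", text);

            //Close the flattened document
            doc.Close(true);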
  
              
  
We have created a Windows Forms sample that demonstrates this, and you can download it from the link below.
  
  
Please let us know which platform you are using. This detail will help us analyze the issue further and assist you better.
  
Please let us know if you have any concerns.  
  
Regards,   
Mohan S
