Can we convert PDF into HTML/XML using SyncFusion?

1 Reply
2 Participants

Created by
VS Vikas Sharma

Platform
ASP.NET Core

Control
PdfViewer

Created On
Feb 22, 2017 08:46 AM UTC

Last Activity On
Feb 23, 2017 11:27 AM UTC

I am using Syncfusion to read a PDF that a user has filled in electronically, but I have not found much in the library that helps. ExtractText() returns the PDF text, but that is of little use to me because I also need the data the user entered. I also tried form.Fields, but that only helps when the PDF already contained form fields (created beforehand using Adobe Acrobat Reader) and the user filled those in. I want to handle the case where the user has simply typed plain text into the PDF, and read the entire document: both the original PDF text and the user's input.

Is there any method exposed in the library for this task?

If not, can I export the PDF to XML or to an HTML table and then read that XML/HTML? Can anyone point me to the correct functions or methods in Syncfusion, if any are available?

1 Reply

MS Mohan Selvaraj Syncfusion Team February 23, 2017 11:27 AM UTC

Hi Vikas,

At present, we do not support text extraction on the ASP.NET Core platform. However, on the other supported platforms you can read the form field values without exporting the PDF to XML or HTML. Please follow the steps below to extract the form field values.

1. Use the PdfLoadedForm object to read the filled-in form field values. Please refer to the code snippet below for details.

//Load the existing PDF document
PdfLoadedDocument ldoc = new PdfLoadedDocument("../../input.pdf");
//Load the existing form
PdfLoadedForm lForm = ldoc.Form;
//Get the form field values
foreach (PdfLoadedField lField in lForm.Fields)
{
    if (lField is PdfLoadedTextBoxField)
    {
        PdfLoadedTextBoxField lTextBox = lField as PdfLoadedTextBoxField;
        //Get the text box value
        string text = lTextBox.Text;
    }
}

2. Alternatively, flatten the form fields and then extract the text from the PDF; the extracted text then includes the field values. Please refer to the code snippet and sample below for details.

//Load the existing PDF document
PdfLoadedDocument ldoc = new PdfLoadedDocument("../../input.pdf");
//Load the existing form
PdfLoadedForm lForm = ldoc.Form;
//Flatten the form fields
lForm.Flatten = true;
MemoryStream ms = new MemoryStream();
//Save the PDF document
ldoc.Save(ms);
//Close the PDF document
ldoc.Close(true);
//Rewind the stream so it is read from the beginning
ms.Position = 0;
//Load the flattened document
PdfLoadedDocument doc = new PdfLoadedDocument(ms);
string text = string.Empty;
foreach (PdfLoadedPage lpage in doc.Pages)
{
    //Extract the text
    text += lpage.ExtractText(true);
}
File.AppendAllText("text.txt", text);
//Close the flattened document
doc.Close(true);
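
Since the original question asked about XML: once the field values have been read as in step 1, they could be written out as XML with the standard System.Xml.Linq classes. The following is only a minimal sketch and is not part of the Syncfusion API. The names FormFieldXml and ToXml are hypothetical, and the name/value pairs are assumed to have been collected in the step 1 loop (for example, each field's name together with lTextBox.Text).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not a Syncfusion type.
public static class FormFieldXml
{
    // Builds <form><field name="...">value</field>...</form>
    // from name/value pairs gathered elsewhere (assumed input).
    public static XElement ToXml(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return new XElement("form",
            fields.Select(f => new XElement("field",
                new XAttribute("name", f.Key),
                // Treat a missing value as an empty string
                f.Value ?? string.Empty)));
    }
}
```

For example, passing a Dictionary<string, string> filled inside the foreach loop to FormFieldXml.ToXml returns an XElement, which can be saved with its Save method or turned into text with ToString().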

We have created a Windows Forms sample that demonstrates this; you can download it from the link below.

http://www.syncfusion.com/downloads/support/directtrac/173350/ze/WindowsFormsApplication1-903902823

Please let us know which platform you are using. This detail will help us analyze further and assist you better.

Please let us know if you have any concerns.

Regards,

Mohan S
