PDF to Docx conversion

Question

Hello forum,

I'm looking for an easy way to convert a Syncfusion.PdfLoadedDocument into a WordDocument. All the examples show how to convert into PDF documents, but none of them show the reverse. Any ideas how to achieve this?

Best regards,
Sascha

Sowmiya Loganathan · Answer

Hi Sascha, 

Thank you for contacting Syncfusion support. 

At present we do not have direct support for converting a PDF to a Word document. However, as a workaround, we can achieve your requirement by exporting PDF pages as image then add that images to Word Document using the PDF and DocIO libraries. Please refer to the KB link below for your reference:
https://www.syncfusion.com/kb/8084/how-to-convert-pdf-document-to-word-document  

Please let us know if you need any further assistance on this. 

Regards, 
Sowmiya Loganathan

Sascha Nebel · Answer

"...by exporting PDF pages as image then add that images to Word Document..."

Thank you for the fast response!

The approach you describe uses images. This does not work on our side. We currently use the Syncfusion libraries as a pre-processing step to generate documents that an API can read and interpret in order to extract data from them. Using images would force us to add OCR on top, and that is something we would like to avoid.

Meanwhile I was trying out a different approach:

I found the OPX example (https://www.syncfusion.com/products/opx/xpdf) where you converted a PDF into HTML. My idea is that this approach could serve as a workaround: convert the PDF to HTML first, and then generate a WordDocument from the HTML output.

Unfortunately, the wrapper you provide for the XPDF lib cannot work with streams or byte arrays. It only accepts file paths. Furthermore, it is not ideal that there is no NuGet package available, so I had to grab the DLLs from the example manually.

So there is still no solution to this issue.

Help is appreciated!

Sascha

Sowmiya Loganathan · Answer

Hi Sascha, 

We can convert PDF to HTML using XPDF (https://www.syncfusion.com/products/opx/xpdf) and then convert the resulting HTML file into a Word document using the DocIO library. However, the HTML file produced by the PDF to HTML conversion is not well formatted, and the DocIO library supports only HTML files that validate against either the XHTML 1.0 Strict or the XHTML 1.0 Transitional schema.  

Please refer to the documentation link below for more details:
https://help.syncfusion.com/file-formats/docio/html  

Note: If you load an HTML file that is not well formatted into a Word document, DocIO throws an error. You therefore need to convert that HTML file into well-formatted HTML first and then perform the HTML to Word conversion.  
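
As a rough illustration of the note above, the following Python sketch checks whether a piece of markup is at least well-formed XML before it is handed to an HTML to Word step. It is not part of the Syncfusion libraries, and the function name `check_well_formed` and the sample strings are made up for this example. Well-formedness is necessary but not sufficient for XHTML 1.0 Strict or Transitional validity, so this is only a cheap pre-check. Note also that this parser may reject HTML named entities such as `&nbsp;` when they are not declared.

```python
import xml.etree.ElementTree as ET


def check_well_formed(markup: str):
    """Return (True, None) if markup parses as XML, else (False, reason)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        # The message includes the line and column of the first problem.
        return False, str(exc)
    return True, None


# Loose HTML: <br> and <p> are never closed, which XHTML does not allow.
loose_html = "<html><body><p>Hello<br>World</body></html>"

# The same content with every element properly closed.
strict_xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Sample</title></head>"
    "<body><p>Hello<br/>World</p></body></html>"
)

print(check_well_formed(loose_html)[0])    # False
print(check_well_formed(strict_xhtml)[0])  # True
```

If the check fails, the returned reason points at the first offending position, which can help decide whether the converter output needs a clean-up pass before the Word conversion is attempted.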

Please let us know if you have any concerns on this. 

Regards, 
Sowmiya Loganathan