
How to convert HTML document to plain text in C# and VB.NET?


Essential DocIO converts HTML files to Word documents and vice versa. It can also convert an HTML document to plain text format and vice versa.

The Word library (DocIO) uses XmlReader to parse the content of the input HTML. The input HTML must therefore be well-formed XML (every opening tag needs a matching closing tag), even if you specify the XHTMLValidationType parameter as XHTMLValidationType.None.
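Because the requirement above is about XML well-formedness, it can be checked before the file is handed to DocIO. The following is a minimal sketch that uses only the .NET System.Xml API. The class and method names (HtmlWellFormedCheck, IsWellFormedXml) are hypothetical and are not part of DocIO, and the reader settings are an assumption: DocIO's internal parser may be configured differently, so treat a failure here as a hint rather than a definitive verdict.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper (not part of DocIO): reports whether markup parses as well-formed XML.
public static class HtmlWellFormedCheck
{
    // Returns true when the markup can be read to the end without an XmlException.
    // On failure, 'error' holds the parser message, which includes line and position.
    public static bool IsWellFormedXml(string markup, out string error)
    {
        var settings = new XmlReaderSettings
        {
            // Assumption: skip any DOCTYPE instead of throwing or resolving it.
            DtdProcessing = DtdProcessing.Ignore
        };

        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(markup), settings))
            {
                // Reading every node forces the parser to check tag nesting.
                while (reader.Read()) { }
            }
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

For example, calling IsWellFormedXml(File.ReadAllText("Input.html"), out string error) returns false for markup such as <p>Hello<br></p>, because the br element is never closed. Since this sketch ignores the DTD, it also rejects named HTML entities such as &nbsp;, which DocIO itself may or may not accept.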

XHTML Validation

The HTML content is validated against a Document Type Definition (DTD), which is a set of markup declarations that define a document type for an SGML-family markup language (GML, SGML, XML, HTML).

XHTML validation types

Essential DocIO supports the following XHTML validation types when importing HTML content.

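XHTML validation type: Description

XHTMLValidationType.None: Performs no schema validation, but the given HTML content must still conform to XHTML 1.0 format.

XHTMLValidationType.Transitional: Validates the content against the XHTML 1.0 Transitional DTD, which allows presentational attributes (for example, bgcolor) inside tags.

XHTMLValidationType.Strict: Validates the content against the XHTML 1.0 Strict DTD, which does not allow those presentational attributes inside tags.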

 

Steps to convert HTML document to plain text in C#

  1. Create a new C# console application project.

[Image: Create new C# console app in WinForms]

  2. Install the Syncfusion.DocIO.WinForms NuGet package from NuGet.org as a reference in your .NET Framework application.

[Image: Install WinForms NuGet packages]

  3. Include the following namespaces in the Program.cs file.

C#

using Syncfusion.DocIO;
using Syncfusion.DocIO.DLS;

VB

Imports Syncfusion.DocIO
Imports Syncfusion.DocIO.DLS

  4. Use the following code to convert an HTML document to plain text.

C#

// Loads the HTML document with no XHTML schema validation
WordDocument document = new WordDocument("Input.html", FormatType.Html, XHTMLValidationType.None);
// Saves the document as a plain text file
document.Save("HTMLtoText.txt", FormatType.Txt);
// Closes the document and releases its resources
document.Close();

VB

' Loads the HTML document with no XHTML schema validation
Dim document As WordDocument = New WordDocument("Input.html", FormatType.Html, XHTMLValidationType.None)
' Saves the document as a plain text file
document.Save("HTMLtoText.txt", FormatType.Txt)
' Closes the document and releases its resources
document.Close()

 

A complete working example of converting an HTML document to plain text in C# can be downloaded from here.

The input HTML document is as follows:

[Image: Input HTML document]

Executing the program produces the following plain text:

[Image: Output Text file]

The documentation covers basic Word document processing options along with features such as mail merge, merging and splitting documents, finding and replacing text, protecting Word documents, and PDF and image conversions, all with code examples.

Explore more about the rich set of Syncfusion Word Framework features.

An online example shows how to protect a Word document from editing using Essential DocIO.

See Also:

Word to HTML and HTML to Word Conversions

Note:

Starting with v16.2.0.x, if you reference Syncfusion assemblies from the trial setup or from the NuGet feed, you must include a license key in your projects. Refer to the link to learn how to generate and register a Syncfusion license key in your application so that the components run without a trial message.

 
