Easy Extraction of Plain Text from HTML in C#

1. Introduction

Working with HTML content often means extracting its plain text for processing, analysis, or display, without the clutter of HTML tags. In this post, we will explore a simple and effective way to use regular expressions (Regex) in C# to strip HTML tags and decode HTML entities, leaving plain text. This technique is particularly useful when reading crawled web content, cleaning up formatted email, or preparing text data for machine learning preprocessing.

2. Problem Statement

HTML content is designed for web browsers and is not suitable for direct text processing. Extracting only the text can be tricky because HTML tags can be nested and complex. Developers need a reliable way to efficiently convert HTML to plain text.

3. Solution Overview

We will write a C# method that removes HTML tags with a regular expression and then decodes HTML-encoded entities into their text equivalents. This approach provides a quick way to extract clean text from typical HTML.

4. Define the Text Extraction Method

First, we will create a method that accepts a string containing HTML and returns a cleaned plain text string.

Code walkthrough

using System;
using System.Net;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        // Define a string containing HTML content
        string htmlContent = "<p>Hello <b>World!</b></p>";

        // Call the ExtractTextFromHtml method to extract plain text from the HTML
        string plainText = ExtractTextFromHtml(htmlContent);

        // Output the extracted plain text.
        // Prints "Hello  World!" with two inner spaces: the original one plus the one replacing <b>.
        Console.WriteLine(plainText);
    }

    // Extract plain text from an HTML string
    public static string ExtractTextFromHtml(string html)
    {
        // If the input HTML string is null, return an empty string
        if (html == null)
        {
            return "";
        }

        // Replace every HTML tag with a space using a regular expression
        string plainText = Regex.Replace(html, "<[^>]+?>", " ");

        // Decode HTML entities and trim leading and trailing whitespace
        plainText = WebUtility.HtmlDecode(plainText).Trim();

        // Return the processed plain text
        return plainText;
    }
}

5. Explanation

**Input verification:** The method first checks whether the input html string is null. If it is, an empty string is returned, ensuring that the method does not throw an exception when null is passed.

**Regular expression replacement:** Regex.Replace is used to delete all HTML tags. The pattern <[^>]+?> matches any sequence that begins with <, is followed by one or more characters other than >, and ends with >. These sequences are replaced by spaces, ensuring that words previously separated by HTML tags are not concatenated together.
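
To make the pattern's behaviour concrete, here is a minimal sketch that lists what it matches in the sample string and shows the result of the replacement. The class name TagPatternDemo and the helper StripTags are hypothetical names used only for illustration. Because [^>] can never match >, the lazy +? and a greedy + find the same matches here.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative, not part of the article's code.
public static class TagPatternDemo
{
    // Replace every tag-like sequence with a single space.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]+?>", " ");
    }

    public static void Main()
    {
        string html = "<p>Hello <b>World!</b></p>";

        // Print each match with its start index.
        foreach (Match m in Regex.Matches(html, "<[^>]+?>"))
        {
            Console.WriteLine($"{m.Index}: {m.Value}");
        }
        // Prints:
        // 0: <p>
        // 9: <b>
        // 18: </b>
        // 22: </p>

        // Brackets make the leading and trailing spaces visible.
        Console.WriteLine($"[{StripTags(html)}]"); // [ Hello  World!  ]
    }
}
```

Note that the pattern only recognises tag delimiters. A > inside an attribute value ends the match early, and the contents of script or style elements are kept as ordinary text, so this approach is best suited to simple markup.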

**Decode HTML entities:** The stripped text may still contain HTML entities (such as &amp;, &lt;, etc.). WebUtility.HtmlDecode is used to convert these entities back to their respective characters.
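
As a small illustration of the decoding step, the sketch below passes a string containing a few common entities through WebUtility.HtmlDecode from System.Net. The class name EntityDecodeDemo, the helper Decode, and the sample string are assumptions made for this example.

```csharp
using System;
using System.Net;

// Hypothetical demo class; the names are illustrative, not part of the article's code.
public static class EntityDecodeDemo
{
    // Convert HTML entities back to the characters they represent.
    public static string Decode(string text)
    {
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        string stripped = "Tom &amp; Jerry &lt;3 &quot;cartoons&quot;";
        Console.WriteLine(Decode(stripped)); // Tom & Jerry <3 "cartoons"
    }
}
```

Decoding after the tags are removed matters: text that was written as &lt;b&gt; in the source is meant to be displayed literally, and decoding it first would turn it into something the tag pattern would then strip.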

**Trim:** Lastly, Trim is used to remove any leading or trailing whitespace from the generated plain text.
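
Trim only affects the two ends of the string, so runs of spaces left in the middle by removed tags remain. The sketch below shows this and, as an optional extra step that is not part of the method above, collapses internal whitespace with a second regular expression. The class name WhitespaceDemo and the helper Collapse are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; Collapse is an optional extension, not the article's method.
public static class WhitespaceDemo
{
    // Trim the ends, then replace each run of whitespace with one space.
    public static string Collapse(string text)
    {
        return Regex.Replace(text.Trim(), @"\s+", " ");
    }

    public static void Main()
    {
        // What the tag replacement produces for "<p>Hello <b>World!</b></p>".
        string afterTagRemoval = " Hello  World!  ";

        Console.WriteLine($"[{afterTagRemoval.Trim()}]");    // [Hello  World!]
        Console.WriteLine($"[{Collapse(afterTagRemoval)}]"); // [Hello World!]
    }
}
```

Whether to collapse whitespace depends on the use case: it also flattens line breaks, which may or may not be wanted.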

6. Conclusion

By following the steps above, developers can extract text from HTML content using a simple regular expression-based method in C#. This is valuable for applications that need to process or display text extracted from HTML sources, keeping the data clear and usable.

This guide provides a practical solution to a common text processing problem and can be a valuable addition to your toolkit. Whether you are dealing with web crawling, data cleaning, or content management systems, knowing how to efficiently convert HTML to plain text is a key skill.
