C#: Using a for Loop to Remove HTML Tags

Regular expressions are probably the most common way to remove HTML tags from a piece of text in order to strip out its styling and paragraph markup. Note, however, that regular expressions cannot handle every HTML document, so it is sometimes better to use an iterative approach such as a for loop.
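
As one illustration of that limitation, here is a minimal sketch (the class and method names are hypothetical and not part of the article's code). By default, '.' in a .NET regular expression does not match a newline, so the simple lazy pattern "<.*?>" skips a tag that spans two lines unless RegexOptions.Singleline is set:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class RegexLimitDemo
{
    // Simple lazy pattern; '.' does not match '\n' by default,
    // so a tag broken across lines is left in place.
    public static string StripDefault(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Same pattern, but RegexOptions.Singleline lets '.' match '\n' too.
    public static string StripSingleline(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty, RegexOptions.Singleline);
    }
}
```

For the input "<p\nclass=\"x\">hi</p>", StripDefault removes only the closing </p> tag, while StripSingleline returns "hi".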

Look at the following code:

using System;
using System.Text.RegularExpressions;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        // The result can never be longer than the input.
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                // Entering a tag: skip characters until '>'.
                inside = true;
                continue;
            }
            if (let == '>')
            {
                // Leaving a tag.
                inside = false;
                continue;
            }
            if (!inside)
            {
                // Copy only characters that lie outside tags.
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}

The class shows two different approaches to removing HTML tags from a string: regular expressions (in a plain and a compiled variant) and a character array filled inside a for loop. The following test program calls all three methods:

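using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string html = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>";

        Console.WriteLine(HtmlRemoval.StripTagsRegex(html));
        Console.WriteLine(HtmlRemoval.StripTagsRegexCompiled(html));
        Console.WriteLine(HtmlRemoval.StripTagsCharArray(html));
    }
}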

The output is as follows:

There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.

The program calls the three methods of the HtmlRemoval class, and all of them return the same result: the input string with its HTML tags removed. Of the two regular expression methods, the second is recommended. It reuses a Regex object stored in a static field, which is faster than the first method. It has a drawback, however: compiling the regular expression can in some cases increase startup time by dozens of times. For details, see:

Regex Performance

Regular expressions are generally not the fastest option, which is why the HtmlRemoval class also offers a method that processes the string with a character array. The test program reads 1,000 HTML files of about 8,000 characters each and runs them through the three methods. The results show that the character array method is the fastest.

Performance test for HTML removal

HtmlRemoval.StripTagsRegex:         2404 ms
HtmlRemoval.StripTagsRegexCompiled: 1366 ms
HtmlRemoval.StripTagsCharArray:      287 ms [Fastest]
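
The article does not show its timing code. As a rough sketch of how such a comparison could be set up, the hypothetical helper below (StripBenchmark and Measure are assumed names, not the author's) times any string-to-string method with System.Diagnostics.Stopwatch:

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper; not the benchmark used in the article.
public static class StripBenchmark
{
    // Runs 'strip' over every input 'iterations' times and
    // returns the elapsed wall-clock time in milliseconds.
    public static long Measure(Func<string, string> strip, string[] inputs, int iterations)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            foreach (string input in inputs)
            {
                strip(input);
            }
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }
}
```

A call such as StripBenchmark.Measure(HtmlRemoval.StripTagsCharArray, pages, 1) would time one pass over an array of page strings named pages (also an assumed name). Absolute numbers depend on the machine and runtime, so only the relative ordering is meaningful.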

File length test for HTML removal

File length before: 8085 chars
HtmlRemoval.StripTagsRegex:         4382 chars
HtmlRemoval.StripTagsRegexCompiled: 4382 chars
HtmlRemoval.StripTagsCharArray:     4382 chars

Using a character array can therefore save time when processing large batches of files. The character array method copies only the characters that lie outside HTML tags into an array buffer. For efficiency, it then passes the array and a range to the string constructor, which is faster than using StringBuilder.
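
For comparison, this is roughly what a StringBuilder version of the same loop might look like. It is a sketch under assumed names (StringBuilderStrip and StripTagsStringBuilder do not appear in the article's code) and uses the same inside/outside flag as the character array method:

```csharp
using System;
using System.Text;

// Hypothetical StringBuilder variant, shown only for comparison.
public static class StringBuilderStrip
{
    // Same state machine as the char-array method, but each kept
    // character is appended to a StringBuilder instead of an array.
    public static string StripTagsStringBuilder(string source)
    {
        StringBuilder builder = new StringBuilder(source.Length);
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

Both versions produce the same output. The difference is only in how the kept characters are collected.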

Self-closing HTML tags

In XHTML, some elements have no separate closing tag, such as <br/> and <img/>. The code above handles these self-closing tags correctly. Some supported tags are listed below. Note that the regular expression methods may not handle invalid HTML tags correctly.

Supported tags

<img src="" />
<img src=""/>
<br />
<br/>
< div >
<!-- -->
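
The list above can be checked with a small sketch like the following. SupportedTagsCheck and its members are hypothetical names, and the two strip methods are compact copies of the loop and regular expression approaches, repeated here so the block stands alone:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical check class; the samples are the tags listed above.
public static class SupportedTagsCheck
{
    public static readonly string[] Samples =
    {
        "<img src=\"\" />", "<img src=\"\"/>", "<br />", "<br/>", "< div >", "<!-- -->"
    };

    // Compact copy of the character array loop.
    public static string StripLoop(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = c; }
        }
        return new string(array, 0, arrayIndex);
    }

    // Compact copy of the regular expression approach.
    public static string StripRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // True when both approaches reduce every sample to an empty string.
    public static bool AllSamplesStripped()
    {
        foreach (string sample in Samples)
        {
            if (StripLoop(sample).Length != 0 || StripRegex(sample).Length != 0)
            {
                return false;
            }
        }
        return true;
    }
}
```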

Comments in HTML documents

The code in this article may not correctly remove HTML comments. Comments sometimes contain invalid HTML tags or stray angle brackets, and these are not removed completely. In some cases it may be necessary to scan for such markup separately.
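
For example, the character array loop turns "<!-- a > b -->text" into " b --text", because the first '>' inside the comment ends the "inside a tag" state too early. One possible workaround, sketched here with a hypothetical helper that is not part of the article's code, is to remove complete <!-- ... --> blocks before stripping tags:

```csharp
using System;
using System.Text;

// Hypothetical pre-pass, for illustration only.
public static class CommentStripSketch
{
    // Removes every "<!-- ... -->" block. An unterminated comment
    // causes the rest of the input to be dropped.
    public static string RemoveComments(string source)
    {
        StringBuilder result = new StringBuilder(source.Length);
        int position = 0;
        while (position < source.Length)
        {
            int start = source.IndexOf("<!--", position, StringComparison.Ordinal);
            if (start < 0)
            {
                // No more comments: keep the remaining text.
                result.Append(source, position, source.Length - position);
                break;
            }
            // Keep the text before the comment.
            result.Append(source, position, start - position);
            int end = source.IndexOf("-->", start + 4, StringComparison.Ordinal);
            if (end < 0)
            {
                break; // unterminated comment
            }
            position = end + 3;
        }
        return result.ToString();
    }
}
```

Running this pre-pass first and then one of the strip methods on its output would leave just "text" for the example above. Whether dropping the remainder on an unterminated comment is the right policy depends on the input, so treat that choice as an assumption.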

How to verify

There are many ways to validate XHTML, and one of them is to iterate over the characters in the same way as the code above. A simple check is to count the '<' and '>' characters and confirm that they match; regular expressions can also be used. The following resources describe these methods:

HTML Brackets: Validation

Validate XHTML
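
The bracket-matching idea can be sketched as follows. BracketValidator is a hypothetical name, and this version is slightly stricter than a plain count: it also rejects a '>' that appears before any '<' and a second '<' inside an open tag. It checks bracket pairing only and is not an XHTML validator:

```csharp
using System;

// Hypothetical helper; checks angle bracket pairing only.
public static class BracketValidator
{
    // Returns true when every '<' is closed by a '>' before the
    // next '<' appears, and no '>' occurs outside a tag.
    public static bool HasBalancedBrackets(string source)
    {
        bool open = false;
        foreach (char c in source)
        {
            if (c == '<')
            {
                if (open) return false; // '<' inside an open tag
                open = true;
            }
            else if (c == '>')
            {
                if (!open) return false; // stray '>'
                open = false;
            }
        }
        return !open; // false if the last tag was never closed
    }
}
```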

There are several ways to remove HTML tags from a string, and for the test input above all of them return the same, correct result. In the benchmark, iterating with a character array was by far the fastest.
