SoFunction
Updated on 2025-03-06

C#: crawling web page data, parsing the title, description, and images, and removing HTML tags

1. First, fetch the entire web page. The content arrives as a byte[] (data is transmitted over the network as bytes) and is then converted to a string so that it is easier to work with. An example follows:

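// Requires: using System.Net; using System.Text;
private static string GetPageData(string url)
{
    // Reject null, empty, or whitespace-only URLs.
    if (url == null || url.Trim() == "")
        return null;
    WebClient wc = new WebClient();
    // Send the current user's default credentials in case the server asks for them.
    wc.Credentials = CredentialCache.DefaultCredentials;
    // Download the raw response body as bytes.
    Byte[] pageData = wc.DownloadData(url);
    // Decode with the system default encoding; use a specific encoding
    // (for example Encoding.UTF8) if the page declares a different charset.
    return Encoding.Default.GetString(pageData);
}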
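
Converting the downloaded bytes into a string means choosing a character encoding, and the wrong choice garbles non-ASCII text. The following is a minimal sketch, assuming the charset is read from a Content-Type header value and that UTF-8 is an acceptable fallback. `PageDecoder` and `Decode` are hypothetical names that are not part of the original code.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class PageDecoder
{
    // Decodes raw page bytes using the charset named in a Content-Type value
    // such as "text/html; charset=utf-8". Falls back to UTF-8 (an assumption)
    // when no charset is given or the runtime does not recognize the name.
    public static string Decode(byte[] pageData, string contentType)
    {
        Encoding encoding = Encoding.UTF8;
        Match m = Regex.Match(contentType ?? "", @"charset\s*=\s*[""']?([\w-]+)", RegexOptions.IgnoreCase);
        if (m.Success)
        {
            try
            {
                encoding = Encoding.GetEncoding(m.Groups[1].Value);
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name: keep the UTF-8 fallback.
            }
        }
        return encoding.GetString(pageData);
    }
}
```

With `WebClient`, the header value should be available as `wc.ResponseHeaders["Content-Type"]` once `DownloadData` has returned. Pages that declare their charset only in a `<meta>` tag are not covered by this sketch.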

2. Once the data is available as a string, the page can be parsed (in practice, this means applying string operations and regular expressions):

Several common parsing tasks follow:

1. Get the title

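The code is as follows:

// Requires: using System.Text.RegularExpressions;
Match TitleMatch = Regex.Match(strResponse, "<title>([^<]*)</title>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
// Group 1 holds the text between <title> and </title>.
title = TitleMatch.Groups[1].Value;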
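
Real pages sometimes put attributes on the `<title>` tag, spread the title over several lines, or use entities such as `&amp;`. The sketch below is one way to tolerate those cases, assuming a regular expression is still acceptable for the job. `TitleParser` and `GetTitle` are hypothetical names.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TitleParser
{
    // Returns the page title as plain text, or an empty string if none is found.
    public static string GetTitle(string html)
    {
        // Singleline lets '.' match newlines, so multi-line titles are captured.
        Match m = Regex.Match(html ?? "", @"<title\b[^>]*>(.*?)</title\s*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (!m.Success)
            return string.Empty;
        // Decode entities such as &amp;, then collapse whitespace runs.
        string raw = WebUtility.HtmlDecode(m.Groups[1].Value);
        return Regex.Replace(raw, @"\s+", " ").Trim();
    }
}
```

A call such as `TitleParser.GetTitle(strResponse)` would take the place of the two statements above. For heavily malformed markup, an HTML parser is likely to be more reliable than any regular expression.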

2. Obtain description information

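The code is as follows:

// Requires: using System.Text.RegularExpressions;
Match Desc = Regex.Match(strResponse, "<meta name=\"DESCRIPTION\" content=\"([^<]*)\">", RegexOptions.IgnoreCase | RegexOptions.Multiline);
// Group 1 holds the value of the content attribute.
strdesc = Desc.Groups[1].Value;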
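
The pattern above expects `name` to come before `content`, double quotes around both values, and no other attributes in the tag. A possible sketch that relaxes those assumptions by scanning every `<meta>` tag and reading its attributes separately is shown below. `MetaParser`, `GetDescription`, and `GetAttribute` are hypothetical names, and attribute values that contain `>` are not handled.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class MetaParser
{
    // Returns the value of one quoted attribute inside a single tag,
    // or an empty string if the attribute is absent.
    private static string GetAttribute(string tag, string attribute)
    {
        string pattern = @"(?<![\w-])" + Regex.Escape(attribute)
            + @"\s*=\s*(?:""([^""]*)""|'([^']*)')";
        Match m = Regex.Match(tag, pattern, RegexOptions.IgnoreCase);
        if (!m.Success)
            return string.Empty;
        return m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
    }

    // Returns the content of <meta name="description">, or an empty string.
    public static string GetDescription(string html)
    {
        foreach (Match tag in Regex.Matches(html ?? "", @"<meta\b[^>]*>", RegexOptions.IgnoreCase))
        {
            string name = GetAttribute(tag.Value, "name");
            if (string.Equals(name, "description", StringComparison.OrdinalIgnoreCase))
                return WebUtility.HtmlDecode(GetAttribute(tag.Value, "content"));
        }
        return string.Empty;
    }
}
```

Separating "find the tag" from "read the attribute" keeps each pattern short, at the cost of a second regular expression match per tag.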

3. Get images

The code is as follows:

// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public class HtmlHelper
{
    /// <summary>
    /// Extract all image addresses from HTML
    /// </summary>
    public static List<string> PickupImgUrl(string html)
    {
        // The named group imgUrl captures the value of each <img> src attribute.
        Regex regImg = new Regex(@"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);
        MatchCollection matches = regImg.Matches(html);
        List<string> lstImg = new List<string>();
        foreach (Match match in matches)
        {
            lstImg.Add(match.Groups["imgUrl"].Value);
        }
        return lstImg;
    }
    /// <summary>
    /// Extract the first image address from HTML
    /// </summary>
    public static string PickupImgUrlFirst(string html)
    {
        List<string> lstImg = PickupImgUrl(html);
        // Return an empty string when the page contains no images.
        return lstImg.Count == 0 ? string.Empty : lstImg[0];
    }
}
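
The `src` values extracted from `<img>` tags are often relative (for example `img/x.png`), so they usually need to be resolved against the address of the page before the images can be downloaded. A minimal sketch using `System.Uri` follows; `ImageUrlResolver` and `ToAbsolute` are hypothetical names, and a `<base href>` element in the page is not taken into account.

```csharp
using System;
using System.Collections.Generic;

public static class ImageUrlResolver
{
    // Resolves each image URL against the URL of the page it was found on.
    // Empty entries and URLs that cannot be resolved are skipped.
    public static List<string> ToAbsolute(string pageUrl, IEnumerable<string> imageUrls)
    {
        Uri baseUri = new Uri(pageUrl);
        List<string> result = new List<string>();
        foreach (string url in imageUrls)
        {
            if (string.IsNullOrWhiteSpace(url))
                continue;
            Uri absolute;
            // URLs that are already absolute pass through unchanged.
            if (Uri.TryCreate(baseUri, url.Trim(), out absolute))
                result.Add(absolute.AbsoluteUri);
        }
        return result;
    }
}
```

It could be combined with the helper above as `ImageUrlResolver.ToAbsolute(url, HtmlHelper.PickupImgUrl(html))`, where `url` is the address the page was fetched from.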

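4. Remove HTML tags

The code is as follows:

// Requires: using System.Text.RegularExpressions;
private string StripHtml(string strHtml)
{
    // Match anything between < and >, including tags that span several lines.
    Regex objRegExp = new Regex("<(.|\n)+?>");
    string strOutput = objRegExp.Replace(strHtml, "");
    // Encode any angle brackets left over so they cannot be read as markup.
    strOutput = strOutput.Replace("<", "&lt;");
    strOutput = strOutput.Replace(">", "&gt;");
    return strOutput;
}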
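
A tag-only pattern leaves the contents of `<script>` and `<style>` elements in the output, because that text sits between tags rather than inside them. The sketch below shows one possible variant that drops those elements first and then decodes entities to produce plain text. `TagStripper` and `StripTags` are hypothetical names, and, unlike the method above, the result is not HTML-encoded, so it would need encoding again before being written back into a page.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Converts an HTML fragment to plain text. Whitespace is not normalized here.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;
        // Drop <script> and <style> elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Drop HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);
        // Replace each remaining tag with a space so adjacent words stay apart.
        text = Regex.Replace(text, @"<[^>]+>", " ");
        // Turn entities such as &amp; back into the characters they stand for.
        return WebUtility.HtmlDecode(text);
    }
}
```

Replacing tags with a space rather than an empty string keeps words in neighbouring elements from running together, which is why the output tends to contain runs of spaces.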

Some edge cases can leave the removal incomplete, so it is recommended to run the conversion twice in a row. Removing the tags also tends to leave runs of whitespace where the markup used to be, and too many consecutive spaces will interfere with later string operations. So add this statement:

The code is as follows:

// Collapse every run of whitespace into a single space
Regex r = new Regex(@"\s+");
wordsOnly = r.Replace(strResponse, " ");
// Trim returns a new string, so assign the result back
wordsOnly = wordsOnly.Trim();
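
Strings in .NET are immutable, so both `Regex.Replace` and `Trim` return a new string and have no effect unless the result is used. A small sketch that wraps the whitespace cleanup in a reusable helper follows; `WhitespaceNormalizer` and `Collapse` are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceNormalizer
{
    // Compiled once and reused, since the same pattern is applied to every page.
    private static readonly Regex Whitespace = new Regex(@"\s+", RegexOptions.Compiled);

    // Collapses each run of whitespace (spaces, tabs, line breaks) into one
    // space and removes leading and trailing whitespace.
    public static string Collapse(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

For example, `WhitespaceNormalizer.Collapse("  Hello \r\n\t world  ")` yields `"Hello world"`.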