Crawling web page data in C#: parsing the title, description, and images, and removing HTML tags

1. First, fetch the entire page. The content arrives as a byte[] (data travels over the network as bytes) and is then converted to a string so that it is easier to work with. For example:


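// Requires: using System.Net; using System.Text;
private static string GetPageData(string url)
{
    if (url == null || url.Trim() == "")
        return null;

    // WebClient is IDisposable, so release it once the download has finished.
    using (WebClient wc = new WebClient())
    {
        // Reconstructed assignment (assumed): send the current user's default credentials.
        wc.Credentials = CredentialCache.DefaultCredentials;
        Byte[] pageData = wc.DownloadData(url);
        // Encoding.Default depends on the runtime and system settings; switch to
        // Encoding.UTF8 or the page's declared charset if the text comes out garbled.
        return Encoding.Default.GetString(pageData);
    }
}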
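
Decoding with a fixed encoding can garble pages served in a different charset. As a minimal sketch (PageDecoder and Decode are hypothetical names, and the UTF-8 fallback is an assumption rather than part of the original code), the charset can be read from the Content-Type header value before decoding:

```csharp
using System;
using System.Text;

public static class PageDecoder
{
    // Hypothetical helper: picks an encoding from a Content-Type value such as
    // "text/html; charset=utf-8" and decodes the downloaded bytes with it.
    public static string Decode(byte[] pageData, string contentType)
    {
        Encoding encoding = Encoding.UTF8; // assumed fallback when no usable charset is declared
        if (!string.IsNullOrEmpty(contentType))
        {
            foreach (string part in contentType.Split(';'))
            {
                string p = part.Trim();
                if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                {
                    string name = p.Substring("charset=".Length).Trim('"', '\'', ' ');
                    try { encoding = Encoding.GetEncoding(name); }
                    catch (ArgumentException) { /* unknown or unsupported charset: keep the fallback */ }
                }
            }
        }
        return encoding.GetString(pageData);
    }
}
```

With WebClient, the header value should be available as wc.ResponseHeaders["Content-Type"] once DownloadData has returned. On .NET Core and later, legacy code pages such as GB2312 are only available after registering the System.Text.Encoding.CodePages provider; without it this sketch falls back to UTF-8.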

2. Once the data is in string form, you can parse the page. In practice this comes down to string operations and regular expressions.

The following kinds of parsing come up most often:

1. Get the title


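// Requires: using System.Text.RegularExpressions;
Match TitleMatch = Regex.Match(strResponse, "<title>([^<]*)</title>", RegexOptions.IgnoreCase | RegexOptions.Multiline);

// Groups[1] is the text between the tags; it is an empty string when nothing matched.
title = TitleMatch.Groups[1].Value;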
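
The pattern above only matches a bare <title> tag, and RegexOptions.Multiline only changes how ^ and $ behave. A slightly more tolerant variant might look like the following sketch; TitleParser and ExtractTitle are hypothetical names, and decoding entities and collapsing whitespace are assumptions about what a caller usually wants:

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TitleParser
{
    // Hypothetical helper: returns the page title, or an empty string if none is found.
    public static string ExtractTitle(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Allow attributes on the tag; Singleline lets '.' cross line breaks inside the title.
        Match m = Regex.Match(html, @"<title\b[^>]*>(.*?)</title\s*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (!m.Success) return string.Empty;

        // Turn entities such as &amp; back into characters, then tidy the whitespace.
        string decoded = WebUtility.HtmlDecode(m.Groups[1].Value);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```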

2. Get the description


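// Requires: using System.Text.RegularExpressions;
Match Desc = Regex.Match(strResponse, "<meta name=\"DESCRIPTION\" content=\"([^<]*)\">", RegexOptions.IgnoreCase | RegexOptions.Multiline);

// Groups[1] is the content attribute; it is an empty string when nothing matched.
strdesc = Desc.Groups[1].Value;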
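
The pattern above expects name before content, double quotes, and no trailing slash. Real pages vary on all three, so a looser match can help. The following is only a sketch under those assumptions; MetaParser and ExtractDescription are hypothetical names:

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class MetaParser
{
    // The lookahead checks that the tag carries name=description in any position;
    // group 1 remembers the opening quote so that \1 can require the same closing quote.
    private static readonly Regex DescriptionTag = new Regex(
        @"<meta\b(?=[^>]*\bname\s*=\s*[""']?description\b)[^>]*\bcontent\s*=\s*([""'])(?<desc>.*?)\1[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns the description text, or an empty string if none is found.
    public static string ExtractDescription(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        Match m = DescriptionTag.Match(html);
        return m.Success ? WebUtility.HtmlDecode(m.Groups["desc"].Value).Trim() : string.Empty;
    }
}
```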

3. Get the images


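// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public class HtmlHelper
{
    /// <summary>
    /// Extract all image addresses from HTML
    /// </summary>
    public static List<string> PickupImgUrl(string html)
    {
        Regex regImg = new Regex(@"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);
        MatchCollection matches = regImg.Matches(html);
        List<string> lstImg = new List<string>();

        foreach (Match match in matches)
        {
            // The named group holds the value of the src attribute.
            lstImg.Add(match.Groups["imgUrl"].Value);
        }

        return lstImg;
    }

    /// <summary>
    /// Extract the first image address from HTML
    /// </summary>
    public static string PickupImgUrlFirst(string html)
    {
        List<string> lstImg = PickupImgUrl(html);
        // Reconstructed fallback (assumed): return an empty string when the page has no images.
        return lstImg.Count == 0 ? string.Empty : lstImg[0];
    }
}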
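
The src values extracted this way are often relative to the page they came from. One way to make them usable is to resolve each one against the page URL with System.Uri, roughly as in this sketch; ImageUrlResolver and ToAbsolute are hypothetical names, and skipping empty or unparsable values is an assumption:

```csharp
using System;
using System.Collections.Generic;

public static class ImageUrlResolver
{
    // Hypothetical helper: resolves each extracted src value against the page URL.
    public static List<string> ToAbsolute(string pageUrl, IEnumerable<string> imgUrls)
    {
        var result = new List<string>();
        Uri baseUri = new Uri(pageUrl); // throws UriFormatException if pageUrl is not an absolute URL

        foreach (string src in imgUrls)
        {
            if (string.IsNullOrWhiteSpace(src)) continue; // skip empty src attributes

            Uri absolute;
            if (Uri.TryCreate(baseUri, src, out absolute))
                result.Add(absolute.AbsoluteUri);
        }

        return result;
    }
}
```

For example, resolving "img/x.png" against "http://example.com/a/b.html" yields "http://example.com/a/img/x.png", while a src that is already absolute is returned unchanged.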

4. Remove HTML tags


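// Requires: using System.Text.RegularExpressions;
private string StripHtml(string strHtml)
{
    // Match everything from '<' to the next '>' (line breaks included) and remove it.
    Regex objRegExp = new Regex("<(.|\n)+?>");
    string strOutput = objRegExp.Replace(strHtml, "");

    // Encode any stray angle brackets that were not part of a complete tag.
    strOutput = strOutput.Replace("<", "&lt;");
    strOutput = strOutput.Replace(">", "&gt;");

    return strOutput;
}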
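
Stripping tags alone leaves the contents of script and style elements in the output, because only the tags around them are removed. A possible refinement, sketched here with hypothetical names (HtmlCleaner, NonContent, Tag) and with tags replaced by a space rather than an empty string, removes those blocks and HTML comments first:

```csharp
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Matches HTML comments and whole <script>/<style> elements, including their contents.
    private static readonly Regex NonContent = new Regex(
        @"<!--.*?-->|<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // Hypothetical helper: returns the text with non-content blocks and tags replaced by spaces.
    public static string StripHtml(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        string withoutBlocks = NonContent.Replace(html, " ");
        return Tag.Replace(withoutBlocks, " ");
    }
}
```

This is still a regex approximation: text that contains a literal "<" followed later by ">" is treated as a tag, so a real HTML parser is the safer choice when accuracy matters.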

Some edge cases leave fragments of tags behind after a single pass, so it is advisable to run the conversion twice in a row. Removing tags also tends to leave long runs of whitespace where the markup used to be, which gets in the way of later string operations. To deal with that, add these statements:


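// Collapse every run of whitespace into a single space
Regex r = new Regex(@"\s+");
wordsOnly = r.Replace(strResponse, " ");

// Trim() returns a new string, so the result has to be assigned back
wordsOnly = wordsOnly.Trim();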
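
The stripped text usually still contains entities such as &amp;amp;. When the goal is plain text rather than HTML-safe output, they can be decoded in the same step as the whitespace cleanup. The following is a small sketch; TextNormalizer and Normalize are hypothetical names, and decoding entities is an assumption about the intended use:

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TextNormalizer
{
    // Hypothetical helper: decodes HTML entities, collapses whitespace runs, and trims the ends.
    public static string Normalize(string text)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;

        string decoded = WebUtility.HtmlDecode(text);

        // Strings are immutable: Replace and Trim both return new strings,
        // so the result has to be returned or assigned, not discarded.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```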