1. First, fetch the entire web page into a byte[] (data travels over the network as bytes), then convert it to a string so it is easier to work with. For example:
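// Requires: using System.Net; using System.Text;
private static string GetPageData(string url)
{
    if (string.IsNullOrWhiteSpace(url))
        return null;
    WebClient wc = new WebClient();
    // Use the current user's default credentials (only matters for sites that require authentication).
    wc.Credentials = CredentialCache.DefaultCredentials;
    // Download the raw bytes of the page.
    byte[] pageData = wc.DownloadData(url);
    // Decode with the system default encoding; switch to Encoding.UTF8 or
    // Encoding.ASCII if the page uses a different encoding.
    return Encoding.Default.GetString(pageData);
}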
2. Once you have the data as a string, you can parse the page (in practice, this is just string operations and regular expressions).
Some commonly used extractions:
1. Get the title
Match TitleMatch = Regex.Match(strResponse, "<title>([^<]*)</title>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
title = TitleMatch.Groups[1].Value;
2. Obtain description information
Match Desc = Regex.Match(strResponse, "<meta name=\"DESCRIPTION\" content=\"([^<]*)\">", RegexOptions.IgnoreCase | RegexOptions.Multiline);
strdesc = Desc.Groups[1].Value;
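The title and description patterns above can be tried offline against a sample string before pointing them at a real page. The following is only a minimal sketch: the class name `MetaExtractionDemo`, the helper `FirstGroup`, and the sample HTML are hypothetical, and the `RegexOptions` flags are an assumption about what the original snippets passed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of the original article.
public static class MetaExtractionDemo
{
    // Returns capture group 1 of the first match, trimmed,
    // or an empty string when the pattern does not match.
    public static string FirstGroup(string input, string pattern)
    {
        // Assumed flags: case-insensitive so <TITLE> and name="description" also match.
        Match m = Regex.Match(input, pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);
        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }

    public static void Main()
    {
        // Made-up sample markup standing in for a downloaded page.
        string strResponse =
            "<html><head><TITLE> Sample page </TITLE>" +
            "<meta name=\"description\" content=\"A short summary\">" +
            "</head><body></body></html>";

        string title = FirstGroup(strResponse, "<title>([^<]*)</title>");
        string strdesc = FirstGroup(strResponse, "<meta name=\"DESCRIPTION\" content=\"([^<]*)\">");

        Console.WriteLine(title);   // Sample page
        Console.WriteLine(strdesc); // A short summary
    }
}
```

Checking `Match.Success` makes the "not found" case explicit. Note that the description pattern only matches when `name` comes before `content` and the tag ends with `">`, so pages that order the attributes differently or self-close the tag will return an empty string.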
3. Get pictures
// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public class HtmlHelper
{
    /// <summary>
    /// Extract all image addresses from HTML
    /// </summary>
    public static List<string> PickupImgUrl(string html)
    {
        Regex regImg = new Regex(@"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);
        MatchCollection matches = regImg.Matches(html);
        List<string> lstImg = new List<string>();
        foreach (Match match in matches)
        {
            // The named group "imgUrl" holds the value of the src attribute.
            lstImg.Add(match.Groups["imgUrl"].Value);
        }
        return lstImg;
    }
    /// <summary>
    /// Extract the first image address from HTML
    /// </summary>
    public static string PickupImgUrlFirst(string html)
    {
        List<string> lstImg = PickupImgUrl(html);
        // Return an empty string when the page contains no images.
        return lstImg.Count == 0 ? string.Empty : lstImg[0];
    }
}
4. Remove the HTML tag
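private string StripHtml(string strHtml)
{
    // Lazily match anything between '<' and '>', including across line breaks.
    Regex objRegExp = new Regex("<(.|\n)+?>");
    string strOutput = objRegExp.Replace(strHtml, "");
    // Escape any stray angle brackets that were not part of a tag.
    strOutput = strOutput.Replace("<", "&lt;");
    strOutput = strOutput.Replace(">", "&gt;");
    return strOutput;
}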
Some irregular markup can survive a single pass, so it is recommended to run the stripping twice in a row. Removing tags also leaves runs of whitespace behind, and too many consecutive spaces get in the way of later string operations, so add this statement:
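// Collapse every run of whitespace into a single space
Regex r = new Regex(@"\s+");
wordsOnly = r.Replace(strResponse, " ");
// Trim() returns a new string, so the result must be assigned back.
wordsOnly = wordsOnly.Trim();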
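The tag stripping and whitespace cleanup described above can be combined into one helper. This is a rough sketch rather than the article's own code: the class name `HtmlTextCleaner` and the method `ToPlainText` are hypothetical, and it replaces each tag with a space (instead of an empty string) so that words from neighboring elements do not run together.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not from the original article.
public static class HtmlTextCleaner
{
    private static readonly Regex TagPattern = new Regex("<(.|\n)+?>");
    private static readonly Regex WhitespacePattern = new Regex(@"\s+");

    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Strip tags twice in a row, as recommended in the text.
        // Assumption: each tag becomes a space rather than an empty string.
        string text = TagPattern.Replace(html, " ");
        text = TagPattern.Replace(text, " ");

        // Collapse whitespace runs to one space and trim both ends.
        return WhitespacePattern.Replace(text, " ").Trim();
    }

    public static void Main()
    {
        string sample = "<div>\n  <p>Hello,   <b>world</b></p>\n</div>";
        Console.WriteLine(ToPlainText(sample)); // Hello, world
    }
}
```

Keeping the two `Regex` instances in static fields avoids rebuilding them on every call, which helps when many pages are processed in a loop.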