I've been working on a project recently, and one of its functions is to fetch the source code of a web page from its URL. C# offers many ways to do this. I started with a simple WebClient, which is quick and easy. But then an annoying problem turned up: garbled Chinese characters.
After some research, I found that Chinese web pages almost always use one of two encodings: GB2312 or UTF-8. That led to the following code:
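// Requires: using System; using System.Net; using System.Text;
/// <summary>
/// Gets the HTML source of the page at the given URL.
/// </summary>
/// <param name="url">Address of the page to download.</param>
/// <returns>The decoded HTML, or null if the download fails.</returns>
public static string GetHtmlByUrl(string url)
{
    using (WebClient wc = new WebClient())
    {
        try
        {
            wc.UseDefaultCredentials = true;
            wc.Proxy = new WebProxy();
            wc.Proxy.Credentials = CredentialCache.DefaultCredentials;
            wc.Credentials = CredentialCache.DefaultCredentials;
            byte[] bt = wc.DownloadData(url);
            // First pass: decode as GB2312 so the charset declaration can be read.
            string txt = Encoding.GetEncoding("GB2312").GetString(bt);
            // Second pass: decode again if the page declares a different encoding.
            switch (GetCharset(txt).ToUpper())
            {
                case "UTF-8":
                    txt = Encoding.UTF8.GetString(bt);
                    break;
                case "UNICODE":
                    txt = Encoding.Unicode.GetString(bt);
                    break;
                default:
                    break;
            }
            return txt;
        }
        catch (Exception)
        {
            return null;
        }
    }
}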
A brief explanation: we create a WebClient object named wc (not the most elegant name), call its DownloadData method with the URL, and get back a byte array. The byte array is first decoded as GB2312 into a string. We then search that string for the encoding the page declares, for example charset="utf-8", and decode the bytes again if the page declares something other than GB2312. This works because the charset declaration is plain ASCII, which reads the same under both GB2312 and UTF-8.
The GetCharset function extracts the declared encoding from the page source. Its code is as follows:
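The two-pass idea can be tried without any network access. The following is only a minimal sketch under a few assumptions: the class name TwoPassDecodeDemo, the helper FindCharset, and the sample page are made up for illustration, and FindCharset uses a simplified pattern rather than the article's GetCharset. It also assumes .NET Core 3.0 or later, where GB2312 has to be enabled through CodePagesEncodingProvider before it can be used.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class TwoPassDecodeDemo
{
    // Hypothetical helper: returns the first charset name declared in the text,
    // or an empty string if there is none.
    public static string FindCharset(string html)
    {
        Match m = Regex.Match(
            html,
            @"charset\s*=\s*[""']?(?<charset>[A-Za-z0-9_\-]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["charset"].Value : "";
    }

    // Decode as GB2312 first, then decode again if the page declares UTF-8.
    public static string Decode(byte[] bytes)
    {
        string text = Encoding.GetEncoding("GB2312").GetString(bytes);
        if (FindCharset(text).Equals("UTF-8", StringComparison.OrdinalIgnoreCase))
        {
            text = Encoding.UTF8.GetString(bytes);
        }
        return text;
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where GB2312 is not
        // available until the code pages provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Made-up sample page, encoded as UTF-8 to stand in for downloaded bytes.
        string page = "<html><head><meta charset=\"utf-8\"></head><body>中文网页</body></html>";
        byte[] bytes = Encoding.UTF8.GetBytes(page);

        string firstPass = Encoding.GetEncoding("GB2312").GetString(bytes);
        Console.WriteLine(firstPass.Contains("中文网页")); // False: the Chinese text is garbled
        Console.WriteLine(FindCharset(firstPass));         // utf-8: the ASCII declaration survives
        Console.WriteLine(Decode(bytes) == page);          // True: the second pass recovers the page
    }
}
```

The first pass garbles the Chinese text but leaves the meta tag readable, which is all the second pass needs.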
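// Requires: using System.Text.RegularExpressions;
/// <summary>
/// Gets the charset declared in the HTML.
/// </summary>
/// <param name="html">HTML source to search.</param>
/// <returns>The declared charset, or an empty string if none is found.</returns>
public static string GetCharset(string html)
{
    string charset = "";
    // Matches the form <meta http-equiv="Content-Type" content="text/html; charset=...">.
    // The lazy .*? stops at the first charset after content= instead of the last one on the line.
    Regex regCharset = new Regex(@"content=[""'].*?\s*charset\b\s*=\s*""?(?<charset>[^""']*)", RegexOptions.IgnoreCase);
    if (regCharset.IsMatch(html))
    {
        charset = regCharset.Match(html).Groups["charset"].Value;
    }
    if (charset.Equals(""))
    {
        // Falls back to the shorter form <meta charset="...">.
        regCharset = new Regex(@"<\s*meta\s*charset\s*=\s*[""']?(?<charset>[^""']*)", RegexOptions.IgnoreCase);
        if (regCharset.IsMatch(html))
        {
            charset = regCharset.Match(html).Groups["charset"].Value;
        }
    }
    return charset;
}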
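To see what the two patterns pick up, here is a small harness that runs GetCharset on a few inputs. It is only a sketch: the class name CharsetDemo and the sample strings are invented for illustration, and the method body assumes the regular expressions are built with RegexOptions.IgnoreCase.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharsetDemo
{
    // Same two-step lookup as GetCharset in the article.
    public static string GetCharset(string html)
    {
        string charset = "";
        Regex regCharset = new Regex(
            @"content=[""'].*?\s*charset\b\s*=\s*""?(?<charset>[^""']*)",
            RegexOptions.IgnoreCase);
        if (regCharset.IsMatch(html))
        {
            charset = regCharset.Match(html).Groups["charset"].Value;
        }
        if (charset.Equals(""))
        {
            regCharset = new Regex(
                @"<\s*meta\s*charset\s*=\s*[""']?(?<charset>[^""']*)",
                RegexOptions.IgnoreCase);
            if (regCharset.IsMatch(html))
            {
                charset = regCharset.Match(html).Groups["charset"].Value;
            }
        }
        return charset;
    }

    public static void Main()
    {
        // Made-up sample inputs covering both declaration styles and a miss.
        string httpEquiv = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\" />";
        string shortForm = "<meta charset=\"utf-8\">";
        string noMeta = "<html><body>no declaration</body></html>";

        Console.WriteLine(GetCharset(httpEquiv));            // gb2312
        Console.WriteLine(GetCharset(shortForm));            // utf-8
        Console.WriteLine(GetCharset(noMeta).Length == 0);   // True
    }
}
```

An empty result falls through to the default branch of the switch in GetHtmlByUrl, so such pages keep the GB2312 decoding.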
I don't think the code is particularly well written, but it is good enough to use for now. Original post by the editor.