Parsing HTML in C# with HtmlAgilityPack

HtmlAgilityPack is an open-source C# class library for parsing HTML quickly. Put simply, it turns HTML into a tree of nodes that can be queried with XPath, much like parsing XML, and it supports modifying nodes and their attributes.

Links: official website | GitHub source code

Loading HTML in various ways

There are three main ways to load a document: from a URL, from a string, and from a file.

var doc = new HtmlDocument();
// Load directly from a URL
doc = new HtmlWeb().Load("/");
// Load from a string
doc.LoadHtml(result);
// Load from an HTML file, specifying the encoding
doc.Load(@"c://", Encoding.UTF8);
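As a small illustration of the string-loading path, the sketch below parses an in-memory document and reads its title. It assumes the HtmlAgilityPack NuGet package is referenced; the class name LoadDemo and the sample markup are made up for this example.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class LoadDemo
{
    // Parses an HTML string and returns the text of its <title> element,
    // or an empty string when the document has no title.
    public static string ReadTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var title = doc.DocumentNode.SelectSingleNode("//title");
        return title == null ? "" : title.InnerText;
    }

    public static void Main()
    {
        // Hypothetical markup, used only for illustration.
        var html = "<html><head><title>Demo</title></head><body></body></html>";
        Console.WriteLine(ReadTitle(html)); // prints "Demo"
    }
}
```

Loading from a string is also the easiest path to unit test, because it needs neither a network connection nor a file on disk.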

Common methods for HtmlNode

Target nodes are obtained with the SelectNodes() and SelectSingleNode() methods (similar to XmlDocument for XML data), which return an HtmlNodeCollection and an HtmlNode respectively.

In XPath, a leading slash "/" starts the search from the root node; two slashes "//" match all descendant nodes; one slash "/" between steps matches only direct children (that is, not grandchildren); dot slash "./" starts the search from the current node rather than the root (it only appears at the beginning of an XPath expression).

Note:

Matching on id and class attribute values is case-sensitive
XPath position indexes start at 1
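The rules above can be checked with a short sketch. The markup, the class name XPathDemo, and the Count helper are assumptions made for this example. The helper exists because SelectNodes returns null, not an empty collection, when nothing matches.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class XPathDemo
{
    // Hypothetical markup, used only for illustration.
    public const string Html =
        "<div id='box'><span>one</span><span>two</span><div><span>three</span></div></div>";

    // Returns the number of matches, treating the null that SelectNodes
    // returns for "no match" as zero.
    public static int Count(HtmlNode start, string xpath)
    {
        var nodes = start.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        var root = doc.DocumentNode;
        var box = root.SelectSingleNode("//div[@id='box']");

        Console.WriteLine(Count(root, "//span"));                // all descendants: 3
        Console.WriteLine(Count(root, "//div[@id='box']/span")); // direct children only: 2
        Console.WriteLine(Count(box, "./div/span"));             // relative to box: 1
        Console.WriteLine(Count(root, "//div[@id='BOX']"));      // case-sensitive id: 0

        // Indexes start at 1, so [2] is the second span.
        Console.WriteLine(root.SelectSingleNode("//div[@id='box']/span[2]").InnerText); // two
    }
}
```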

1. Select the corresponding node by matching attributes and paths

var node = doc.DocumentNode;

// Select div nodes that have no class attribute
var result = node.SelectNodes(".//div[not(@class)]");

// Select div nodes that have neither a class nor an id attribute
result = node.SelectNodes(".//div[not(@class) and not(@id)]");

// Select span nodes whose class contains "expire"
result = node.SelectNodes(".//span[contains(@class,'expire')]");

// Select span nodes whose class does not contain "expire"
result = node.SelectNodes(".//span[not(contains(@class,'expire'))]");

// Select span nodes with class="expire"
result = node.SelectNodes(".//span[@class='expire']");

// Select the first div node under the div node with id="expire"
result = node.SelectNodes(".//div[@id='expire']/div[1]");

2. Get the node text content

Depending on what you need, the content can be read in different ways.
OuterHtml: returns the HTML of the current node, including the node itself
InnerHtml: returns the HTML of all child nodes of the current node
InnerText: returns the text content of the current node with all HTML tags removed

<div id="title">
  <p>
   <a class="MainTitle" href="/cplemom/" rel="external nofollow">Fu Xiaohui</a>
  </p>
</div>

Take the HTML above as an example:

var node = doc.DocumentNode.SelectSingleNode("//div[@id='title']/p");

var outerHtml = node.OuterHtml; // Result (whitespace omitted): <p><a class="MainTitle" href="/cplemom/" rel="external nofollow">Fu Xiaohui</a></p>
var innerHtml = node.InnerHtml; // Result (whitespace omitted): <a class="MainTitle" href="/cplemom/" rel="external nofollow">Fu Xiaohui</a>
var innerText = node.InnerText; // Result (whitespace omitted): Fu Xiaohui

3. Get/modify node attribute values

Continuing with the HTML above, suppose node is the a tag. We want to read the link address it points to and change it to an address of our own. The href attribute is used here as an example; the same calls work for attributes such as class, src, and id.

var node = doc.DocumentNode.SelectSingleNode("//div[@id='title']/p/a");

// The second parameter is the default value returned when the attribute is not found
var url = node.GetAttributeValue("href", ""); // Result: /cplemom/
// Set the attribute value
node.SetAttributeValue("href", "/");

// Get all attributes
var list = node.GetAttributes();
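To make the default-value and overwrite behaviour concrete, here is a minimal sketch. The class name AttributeDemo, the FirstLink helper, and the "/new/" and "_blank" values are assumptions for illustration. It relies on SetAttributeValue creating the attribute when it does not exist yet.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class AttributeDemo
{
    // Parses the markup and returns its first <a> element (null if there is none).
    public static HtmlNode FirstLink(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode("//a");
    }

    public static void Main()
    {
        var a = FirstLink("<a class='MainTitle' href='/cplemom/'>Fu Xiaohui</a>");

        Console.WriteLine(a.GetAttributeValue("href", ""));      // /cplemom/
        Console.WriteLine(a.GetAttributeValue("title", "none")); // attribute missing: none

        a.SetAttributeValue("href", "/new/");    // overwrites the existing attribute
        a.SetAttributeValue("target", "_blank"); // creates an attribute that was missing

        Console.WriteLine(a.GetAttributeValue("href", ""));      // /new/
        Console.WriteLine(a.Attributes.Count);                   // 3
    }
}
```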

4. Delete/replace nodes

Continuing with the HTML example above, suppose node is the a tag.
To drop content we don't need, just call the node's Remove method.

var node = doc.DocumentNode.SelectSingleNode("//div[@id='title']/p/a");

node.Remove(); // Delete the node

A very common scenario is removing the a tag while keeping its text in the surrounding HTML.
Note: the text inside the a tag is itself a text-type node in the HtmlDocument, so we can reach the goal by deleting the a tag and keeping its text node.

node.ParentNode.RemoveChild(node, true);

true means only the a tag itself is removed and its child nodes are kept, so the "Fu Xiaohui" text node stays in place; false means the node is removed together with all of its child nodes.

Looking at it another way, node here is a single a tag. What if several a tags under the p tag need processing, or node points to the p tag itself? We could of course fetch every a tag and loop over them, but is there a better way?

One idea is to take all of the node's text content, create a text node from it, and replace the current node with that text node.

node.ParentNode.ReplaceChild(HtmlNode.CreateNode(node.InnerText), node);
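For comparison, the looping approach mentioned above might look like the following sketch. The class name UnwrapDemo, the UnwrapLinks helper, and the sample markup are assumptions for illustration. It passes true to RemoveChild so each link's text node stays in the parent.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class UnwrapDemo
{
    // Removes every <a> element but keeps its children (the link text) in place,
    // then returns the resulting markup.
    public static string UnwrapLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = doc.DocumentNode.SelectNodes("//a");
        if (links != null) // SelectNodes returns null when nothing matches
        {
            foreach (var link in links)
            {
                // true: delete only the <a> tag, keep its child nodes
                link.ParentNode.RemoveChild(link, true);
            }
        }
        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Hypothetical markup, used only for illustration.
        var html = "<p><a href='/x'>one</a> and <a href='/y'>two</a></p>";
        Console.WriteLine(UnwrapLinks(html));
    }
}
```

The loop keeps any markup nested inside the links, whereas replacing a node with its InnerText flattens everything below it to plain text. Which one fits depends on whether nested tags should survive.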

Several common usage scenarios and solutions

1. Get all img tags

// Get the img tags among all descendants through Descendants
var list = doc.DocumentNode.Descendants("img");

// Get all img tags through XPath matching
var nodes = doc.DocumentNode.SelectNodes("//img");
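A typical follow-up is collecting the image addresses. The sketch below uses Descendants, which yields an empty sequence when there are no matches, so no null check is needed (unlike SelectNodes). The class name ImageDemo, the ImageSources helper, and the markup are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class ImageDemo
{
    // Returns the src value of every <img> that has a non-empty one.
    public static List<string> ImageSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.Descendants("img")
            .Select(img => img.GetAttributeValue("src", ""))
            .Where(src => src != "")
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical markup, used only for illustration.
        var html = "<div><img src='a.png'><span>text</span><img src='b.png'/><img alt='no source'></div>";
        foreach (var src in ImageSources(html))
        {
            Console.WriteLine(src); // prints a.png, then b.png
        }
    }
}
```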

2. Carrying cookies and other verification information when loading from a URL

Some pages, such as a user center or an order list, can only be accessed with verification information. Fetching their HTML directly through the HtmlWeb class will be rejected. One simple approach is to request the HTML with HttpClient and then load it with HtmlDocument. Alternatively, since HtmlWeb itself wraps HttpWebRequest for its network requests, it exposes a delegate that lets you modify the request before it is sent.

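var web = new HtmlWeb();
web.PreRequest = new HtmlWeb.PreRequestHandler(GetRequest);
var doc = web.Load("/");

// Called before each request is sent; return true to continue with the request
public static bool GetRequest(HttpWebRequest req)
{
  // On .NET Framework, Host is a restricted header and must be set through req.Host instead
  req.Headers.Add("Host", "");
  req.Headers.Add("Cookie", "xxxxxxxxxxxxx");
  return true;
}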
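The HttpClient route mentioned above could be sketched as follows. The class name AuthenticatedFetch, both method names, and the URL and cookie values are hypothetical. The sketch turns off the handler's own cookie handling so that the manually supplied Cookie header is the one that gets sent.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class AuthenticatedFetch
{
    // Builds a GET request that carries the given Cookie header value.
    public static HttpRequestMessage BuildRequest(string url, string cookie)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.TryAddWithoutValidation("Cookie", cookie);
        return request;
    }

    // Downloads the page with the cookie attached and parses it into an HtmlDocument.
    public static async Task<HtmlDocument> LoadAsync(string url, string cookie)
    {
        // UseCookies = false stops the handler from managing the Cookie header itself.
        using (var handler = new HttpClientHandler { UseCookies = false })
        using (var client = new HttpClient(handler))
        using (var request = BuildRequest(url, cookie))
        using (var response = await client.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            var html = await response.Content.ReadAsStringAsync();

            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            return doc;
        }
    }
}
```

A call such as await AuthenticatedFetch.LoadAsync(url, cookie) would then return a document that can be queried exactly like one loaded through HtmlWeb. In a long-running application, a single shared HttpClient is usually preferable to creating one per call.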

Summary

So far, I personally feel the methods above cover more than 90% of typical HTML parsing needs. For more convenient and faster approaches, see the API documentation on the official website.
