SoFunction
Updated on 2025-03-03

Matching HTML tag attribute values with regular expressions

Regular expressions are an essential skill for text parsing, for example in web server log analysis and web front-end development. Many advanced text editors support a subset of regular expressions, and being proficient with them often lets you get twice the result for half the effort. Counting lines of code, for instance, takes only a single regular expression. Matching nested HTML tags is one of the harder topics in applying regular expressions, because it draws on a lot of regex syntax, some of it advanced. That also makes it a worthwhile topic to study.

Today at work I needed to extract attribute values from HTML tags, and a regular expression was the first thing that came to mind. The tags look like this:

<circle  cx="200" cy="2000" r="2" stroke="black" stroke-width="0" fill="red"/>
<circle  cx="201" cy="2001" r="2" stroke="black" stroke-width="0" fill="red"/>
<circle  cx="202" cy="2002" r="2" stroke="black" stroke-width="0" fill="red"/>
<circle  cx="203" cy="2003" r="2" stroke="black" stroke-width="0" fill="red"/>

I needed the values of the cx and cy attributes of each <circle /> tag. After some thought, I wrote the following:

$circle holds the <circle> markup shown above

// Each value must be closed by the quote that opened it: \\1 for cx, \\3 for cy.
preg_match_all('/<\s*circle\s+[^>]*?\bcx\s*=\s*([\'"])(.*?)\\1[^>]*?\bcy\s*=\s*([\'"])(.*?)\\3[^>]*?\/?\s*>/i', $circle, $arr);
var_dump($arr);

$arr[2] is the value of cx, and $arr[4] is the value of cy.
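
For readers who do not use PHP, here is a minimal sketch of the same extraction in Python. It is only an illustration: the names svg, CIRCLE_RE and circle_centers are made up for this example, and named groups stand in for the numeric indexes 2 and 4. Like the PHP pattern, it assumes cx appears before cy inside the tag.

```python
import re

# Sample input copied from the <circle> tags shown earlier.
svg = '''<circle  cx="200" cy="2000" r="2" stroke="black" stroke-width="0" fill="red"/>
<circle  cx="201" cy="2001" r="2" stroke="black" stroke-width="0" fill="red"/>'''

# q1/q2 capture the opening quote of each attribute; (?P=q1)/(?P=q2) require
# the same character to close the value. cx/cy capture the values themselves.
CIRCLE_RE = re.compile(
    r"""<\s*circle\s+[^>]*?\bcx\s*=\s*(?P<q1>['"])(?P<cx>.*?)(?P=q1)"""
    r"""[^>]*?\bcy\s*=\s*(?P<q2>['"])(?P<cy>.*?)(?P=q2)[^>]*?/?\s*>""",
    re.IGNORECASE,
)


def circle_centers(text):
    """Return a list of (cx, cy) string pairs, one per matching <circle> tag."""
    return [(m.group("cx"), m.group("cy")) for m in CIRCLE_RE.finditer(text)]


print(circle_centers(svg))
```

The values come back as strings, so convert them with int() or float() if you need numbers.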

Below are regular expressions that match paired (closed) HTML tags, with support for nesting

Any complex regular expression is built from simple subexpressions. To write one, you need to be able to break a complicated problem into simple pieces, and you also need to think about the problem from the point of view of the regex engine. On how regex engines work, "Mastering Regular Expressions" (also available in a Chinese edition) is a very good book.

First, let's pin down the problem we want to solve: finding the innerHTML of the tag with a specific id in a piece of HTML text.

The biggest difficulty is that HTML tags can be nested, so how do we find the closing tag that corresponds to a given opening tag?

We can think of it this way: first match the opening tag, say a div (<div). From then on, every nested opening <div we encounter is "pushed onto a stack", and every closing </div> "pops" one entry off. If the stack is already empty when a closing tag is encountered, the match ends, and that closing tag is the correct one.

I can think about it this way because I know what the regex engine offers: balancing groups, a feature of the .NET regex engine written as (?<Name>...) and (?<-Name>), can implement the "stack" operations just described. So to write complex regular expressions, you need at least some familiarity with the advanced features of your engine, so that you know in which direction to look for a solution.
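
The push and pop idea can also be written out by hand, which may help if your regex engine has no balancing groups. Below is a minimal sketch in Python, whose built-in re module does not provide them. The function name find_element is made up for this illustration, and the sketch makes simplifying assumptions: it does not treat self-closing tags, comments or script content specially.

```python
import re


def find_element(html, tag="div"):
    """Return (outer_html, inner_html) for the first <tag> element.

    Nested elements of the same name are skipped over with an explicit
    stack. Returns None if no opening tag is found or it is never closed.
    """
    # One pattern finds both kinds of tag; group 1 is "/" for a closing tag.
    tag_re = re.compile(r"<(/?)%s\b[^>]*>" % re.escape(tag), re.IGNORECASE)

    # Locate the first opening tag, skipping any stray closing tags.
    first = tag_re.search(html)
    while first is not None and first.group(1):
        first = tag_re.search(html, first.end())
    if first is None:
        return None

    stack = [first]  # the outermost opening tag is already on the stack
    for m in tag_re.finditer(html, first.end()):
        if m.group(1):
            stack.pop()  # a closing tag pops one opening tag
            if not stack:
                # The stack is empty: this is the matching closing tag.
                return html[first.start():m.end()], html[first.end():m.start()]
        else:
            stack.append(m)  # a nested opening tag is pushed
    return None  # ran out of input before the stack emptied


sample = '<p>x</p><div id="a"><div>inner</div>tail</div><div>next</div>'
print(find_element(sample))
```

Because only the stack depth matters here, a simple counter would work equally well. The list is kept to mirror the "stack" wording above.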

================================

A regular expression that matches any paired HTML tag:

<(?<HtmlTag>[\w]+)[^>]*?>((?<Nested><\k<HtmlTag>[^>]*>)|</\k<HtmlTag>>(?<-Nested>)|.*?)*</\k<HtmlTag>>

If you only want to match the div tag, you can use the following regular expression:

<(?<HtmlTag>div)[^>]*?>((?<Nested><\k<HtmlTag>[^>]*>)|</\k<HtmlTag>>(?<-Nested>)|.*?)*</\k<HtmlTag>>

You can replace div with any other HTML tag you want to match.

If you want to match several HTML tags at the same time, you can use the following regular expression:

<(?<HtmlTag>(div|span|h1))[^>]*?>((?<Nested><\k<HtmlTag>[^>]*>)|</\k<HtmlTag>>(?<-Nested>)|.*?)*</\k<HtmlTag>>

You can add more tag names to the alternation as needed.

If you want to match a tag that has a specific id, you can use the following regular expression:

<(?<HtmlTag>[\w]+)[^>]*\s[iI][dD]=(?<Quote>["']?)footer(?(Quote)\k<Quote>)[^>]*?(/>|>((?<Nested><\k<HtmlTag>[^>]*>)|</\k<HtmlTag>>(?<-Nested>)|.*?)*</\k<HtmlTag>>)

This expression matches any HTML tag whose id is footer, whatever the tag name. Replace footer with the id you are looking for.
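
To tie this back to the original goal of getting the innerHTML of the tag with a given id, here is a rough Python sketch. It does not use the balancing-group expression above. Instead it finds the opening tag with a simpler pattern and then counts nesting depth by hand. The function name inner_html_by_id is made up for this illustration, and the sketch assumes reasonably well-formed markup with no comments or scripts containing tag-like text.

```python
import re


def inner_html_by_id(html, element_id):
    """Return the inner HTML of the first tag whose id equals element_id.

    Returns "" for a self-closing tag and None if the id is not found or
    the element is never closed.
    """
    # Opening tag of any name with id=element_id, quoted or unquoted.
    # (?P=q) requires the closing quote to match the opening one (or none).
    open_re = re.compile(
        r"""<(?P<tag>\w+)\b[^>]*\s[iI][dD]\s*=\s*(?P<q>["']?)%s(?P=q)(?=[\s/>])[^>]*>"""
        % re.escape(element_id)
    )
    start = open_re.search(html)
    if start is None:
        return None
    if start.group(0).endswith("/>"):
        return ""  # a self-closing tag has no content

    # Opening and closing tags with the same name as the one just found.
    tag_re = re.compile(
        r"<(/?)%s\b[^>]*>" % re.escape(start.group("tag")), re.IGNORECASE
    )
    depth = 1  # the tag carrying the id is already open
    for m in tag_re.finditer(html, start.end()):
        depth += -1 if m.group(1) else 1
        if depth == 0:
            return html[start.end():m.start()]
    return None


page = '<div id="main"><div id=\'footer\' class="f"><div>a</div><span>b</span></div></div>'
print(inner_html_by_id(page, "footer"))
```

For anything beyond quick extraction from markup you control, a real HTML parser is usually the safer choice.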