PHP and Regular Expressions Tutorial Collection 2, Page 1/2

Quick Start of Regular Expressions (II)
[Introduction] This article introduces subpatterns, back references and quantifiers.
The previous article introduced the pattern modifiers and metacharacters of regular expressions. Careful readers may have noticed that it was brief and gave few practical examples. That is mainly because existing regular expression material on the Internet already covers that ground in detail, with many examples; if you feel unsure about it, refer to those materials. This article aims to cover as many advanced regular expression features as possible.
It introduces subpatterns, back references and quantifiers, with a focus on extended uses of these concepts, such as non-capturing subpatterns and greedy versus ungreedy quantifier matching.
Subpatterns and backreferences
A regular expression can contain multiple subpatterns. Subpatterns are delimited by parentheses and can be nested; this is the role of the two metacharacters "(" and ")". A subpattern does two things:
1. It localizes a set of alternatives.
For example, the pattern cat(aract|erpillar|) matches one of "cat", "cataract" or "caterpillar". Without the parentheses, it would match "cataract", "erpillar" or the empty string.
2. It sets the subpattern up as a capturing subpattern (as in the example above). When the whole pattern matches, the part of the target string that matched the subpattern can be retrieved through a back reference. Capturing subpatterns are numbered by counting opening parentheses from left to right, starting from 1.
Note that subpatterns can be nested. For example, if the string "the red king" is matched against the pattern /the ((red|white) (king|queen))/, the captured substrings are "red king", "red" and "king", numbered 1, 2 and 3. They can be referenced as "\1", "\2" and "\3". "\1" contains "\2" and "\3"; the numbers follow the order of the opening parentheses.
In some older Linux/Unix tools, the parentheses of a subpattern must be escaped with backslashes, as in \(subpattern\). Modern tools no longer require this, and the examples in this article are not escaped.
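The numbering rule is easy to check with a short script. The following is a minimal sketch that uses Python's re module as a stand-in for whichever tool you use (an assumption; the article itself is not tied to Python). The helper name captured_groups is made up for this illustration.

```python
import re

def captured_groups(pattern, text):
    """Return the captured substrings in group-number order, or None if nothing matches."""
    match = re.search(pattern, text)
    return None if match is None else match.groups()

# Groups are numbered by the position of their opening parenthesis, left to right.
print(captured_groups(r"the ((red|white) (king|queen))", "the red king"))
# ('red king', 'red', 'king')

# The empty alternative lets the group match nothing, so "cat" alone is accepted.
print(re.fullmatch(r"cat(aract|erpillar|)", "cat") is not None)

# Without the parentheses the alternatives split the whole pattern instead.
print(re.fullmatch(r"cataract|erpillar|", "cat") is not None)
```

The outer group comes first in the result because its opening parenthesis appears first, even though it closes last.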
Non-capturing subpatterns
Having one pair of parentheses perform both of the functions above is sometimes inconvenient. The number of back references can be limited (in some tools to no more than 9), and a subpattern is often needed only for grouping, with no need to capture it. In that case, add a question mark and a colon after the opening parenthesis to mark the subpattern as non-capturing, as in /the ((?:red|white) (king|queen))/.
If "the white queen" is the target string, the captured substrings are "white queen" and "queen", which are "\1" and "\2" respectively. Although "white" matches the subpattern "(?:red|white)", it is not captured.
We have already seen how parentheses and a question mark are used to set pattern modifiers inside a pattern. For convenience, if a non-capturing subpattern needs pattern modifiers, they can be placed directly between the question mark and the colon. For example, the following two patterns are equivalent:
/(?i:saturday|sunday)/ and /(?:(?i)saturday|sunday)/.
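As a quick check of how a non-capturing subpattern changes the numbering, here is a small sketch in Python's re module (used only as a convenient stand-in). It assumes the pattern is written in full as the ((?:red|white) (king|queen)), and it shows only the (?i:...) spelling of the modifier example, since Python's handling of a modifier placed in the middle of a pattern differs from the tools the article describes.

```python
import re

# (?:...) groups the alternatives without taking a group number.
match = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
print(match.groups())  # ('white queen', 'queen')

# A modifier placed between "?" and ":" applies only inside that group.
day = re.compile(r"(?i:saturday|sunday)")
print(day.fullmatch("SUNDAY") is not None)

# The modifier group is non-capturing too, so the pattern has no numbered groups.
print(day.groups)
```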
Back references
As mentioned when the backslash was introduced, one of its uses is to denote a back reference. Outside a character class, a backslash followed by a decimal number greater than 0 is likely to be a back reference. As the name suggests, it refers back to a capturing subpattern that appears earlier in the pattern. The number is the position of the referenced opening parenthesis in the pattern. We already saw back references when introducing subpatterns, where "\1", "\2" and "\3" stood for the contents captured by the subpatterns opened by the first, second and third parentheses.
Note that when the number after the backslash is less than 10, it is always treated as a back reference. Such a reference may therefore appear before the opening parenthesis it refers to; no error occurs as long as the pattern as a whole contains that many capturing subpatterns. This can seem confusing, so consider a variation on the earlier example, in which the string "the red king" matched the pattern /the ((red|white) (king|queen))/ and the captured substrings "red king", "red" and "king" were numbered 1, 2 and 3. Now change the string to "king, the red king" and the pattern to /\3, the ((red|white) (king|queen))/. The pattern is accepted, because a third capturing subpattern exists. Whether it matches is another matter: in PCRE, the engine behind PHP's preg functions, a reference to a subpattern that has not yet captured anything fails to match, so a forward reference is only useful inside a repetition, where the referenced subpattern was set in an earlier iteration. Not all regular expression tools support forward references at all, so it is safest to use a back reference only after the opening parenthesis with the corresponding number.
Also note that the value of a back reference is the fragment of the target string that the subpattern actually captured, not the subpattern itself. For example, /(sens|respons)e and \1ibility/ matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". When the referenced subpattern is followed by a quantifier and matches several times, the back reference takes the value of the last iteration. For example, when /([abc]){3}/ matches the string "abc", the value of "\1" is the last matched character, "c".
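Both points can be seen in a few lines. This sketch again uses Python's re module as a stand-in, and assumes the example pattern is written in full as (sens|respons)e and \1ibility.

```python
import re

same_stem = re.compile(r"(sens|respons)e and \1ibility")

# \1 must repeat the text that group 1 actually captured, not merely
# something else that the subpattern could have matched.
for text in ("sense and sensibility",
             "response and responsibility",
             "sense and responsibility"):
    print(text, "->", same_stem.fullmatch(text) is not None)

# When a capturing group is repeated, it keeps only its final iteration.
repeated = re.fullmatch(r"([abc]){3}", "abc")
print(repeated.group(1))  # c
```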
Named subpatterns
Some tools (such as Python) let you give a capturing subpattern a name and refer to it by that name, which is known as a named subpattern. In Python, regular expressions are used through function and method calls, so the surrounding syntax differs considerably from the examples given here. Interested readers can check whether the tools they use support named subpatterns.
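Since Python is mentioned, here is a short sketch of its named-group syntax, (?P<name>...) for the definition and (?P=name) for the reference. The group names colour, piece and word are invented for this example.

```python
import re

pattern = re.compile(r"the (?P<colour>red|white) (?P<piece>king|queen)")
match = pattern.search("the white queen")

# Captures can be read by name; named groups also keep their ordinary numbers.
print(match.group("colour"), match.group("piece"))
print(match.groupdict())
print(match.group(1))

# (?P=name) refers back to the text captured by the named group.
doubled = re.compile(r"(?P<word>\w+) (?P=word)")
print(doubled.search("the the") is not None)
```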
Repetition and quantifiers
The section on back references already touched on quantifiers. The earlier example /([abc]){3}/ matches three consecutive characters, each of which must be one of "a", "b" or "c". In this pattern, {3} is a quantifier: it specifies how many times the preceding item is repeated.
Quantifiers can be placed after the following items:
● A single character (possibly an escaped one, such as \xhh)
● The "." metacharacter
● A character class in square brackets
● A back reference
● A subpattern in parentheses (unless it is an assertion, which is introduced later)
The general form of a quantifier is two comma-separated numbers enclosed in curly braces, {min,max}. For example, /z{2,4}/ matches "zz", "zzz" or "zzzz". The maximum can be omitted while keeping the comma: /\d{3,}/ matches three or more digits, with no upper limit. If the comma is omitted as well, as in /\d{3}/, the pattern matches exactly 3 digits. When a curly brace appears where a quantifier is not allowed, or its contents do not follow the syntax above, it stands for the literal brace character and has no special meaning. For example, {,6} is not a quantifier; it simply stands for those four characters.
For convenience, the three most commonly used quantifiers have single-character abbreviations:
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
These are the meanings of the three metacharacters when they are used as quantifiers.
When using quantifiers, especially those without an upper limit, take care not to create an infinite loop, as in /(a?)*/, where a subpattern that can match the empty string is repeated without limit. Some regular expression tools report a compilation error for this construct and others accept it, but there is no guarantee that every tool handles it well.
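The quantifier forms can be tried out directly. This sketch uses Python's re module as a stand-in and writes the digit class with its backslash (\d); the helper name matches_whole is made up for the illustration. The {,6} case is left out on purpose, because Python treats it differently from what the article describes.

```python
import re

def matches_whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# {min,max}: only runs of two to four "z" characters are accepted.
candidates = ["z", "zz", "zzz", "zzzz", "zzzzz"]
print([s for s in candidates if matches_whole(r"z{2,4}", s)])
# ['zz', 'zzz', 'zzzz']

# {min,}: no upper limit.  {n}: exactly n repetitions.
print(matches_whole(r"\d{3,}", "12345"))  # True
print(matches_whole(r"\d{3}", "1234"))    # False

# The single-character quantifiers agree with their brace equivalents.
pairs = [(r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}")]
agree = all(matches_whole(short, t) == matches_whole(full, t)
            for short, full in pairs
            for t in ("", "a", "aa", "aaa"))
print(agree)
```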
"greedy" and "ungreedy" matching quantifiers
When a pattern contains quantifiers, there is often more than one way for it to match the same target string. Take /\d{0,1}\d/, which can match one or two decimal digits. If the target string is "123", the pattern matches "1" when the quantifier takes its lower limit of 0, and "12" when it takes its upper limit of 1. Both results are valid matches. If we turn it into a capturing subpattern, /(\d{0,1}\d)/, is the value of "\1" "1" or "12"?
In practice the result is generally the latter, because most regular expression tools match greedily by default. Greedy matching means that, within the limits of the quantifier, the item is repeated as many times as possible, provided the rest of the pattern can still match. A simple example makes this easier to follow.
Take /(\d{1,5})\d/ matched against the string "12345". The pattern means one to five digits, which are captured, followed by one more digit. The whole pattern can match when the quantifier takes any value from 1 to 4, so "\1" could be "1", "12", "123" or "1234". With greedy matching the quantifier takes the largest value that works, so the final value of "\1" is "1234".
In most cases this is what we want, but not always. Suppose we want to extract comments from C source code, where a comment is placed between /* and */. The obvious regular expression is /\*.*\*/ (shown here without delimiters), but the result is quite different from what is needed. When the engine reaches ".*" after "/*", the "." can stand for any character, including the characters of the "*/" that should end the comment. Because the quantifier is greedy, the match runs past the first "*/" and continues to the last "*/" in the text, which is obviously not the result we need.
To handle cases like this, regular expressions provide ungreedy matching. In contrast to greedy matching, an ungreedy quantifier takes the smallest number of repetitions that still lets the whole pattern match. A quantifier is made ungreedy by adding a question mark "?" after it. To match C comments, we write the regular expression as /\*.*?\*/; the question mark after the quantifier "*" gives the desired result. Likewise, if the earlier example /(\d{1,5})\d/ is rewritten in ungreedy form as /(\d{1,5}?)\d/ and matched against "12345", the value of "\1" is "1".
Strictly speaking, the explanation above is not quite accurate. The question mark after a quantifier actually inverts the current greediness of the regular expression. You can make a whole pattern ungreedy by default with the pattern modifier "U", and a question mark after a quantifier then switches that quantifier back to greedy.
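The difference between the two modes shows up clearly when the examples are run. This sketch uses Python's re module as a stand-in and writes the patterns with their backslashes; the sample C fragment in source is invented for the illustration. Python has no counterpart to the "U" modifier, so only the per-quantifier question mark is shown.

```python
import re

# Greedy: the quantifier takes as many digits as it can while still
# leaving one digit for the trailing \d.
print(re.match(r"(\d{1,5})\d", "12345").group(1))   # 1234

# Ungreedy: the quantifier takes as few digits as it can.
print(re.match(r"(\d{1,5}?)\d", "12345").group(1))  # 1

source = "a = 1; /* first */ b = 2; /* second */"

# Greedy .* runs from the first "/*" to the last "*/".
print(re.findall(r"/\*.*\*/", source))
# ['/* first */ b = 2; /* second */']

# Ungreedy .*? stops at the nearest "*/".
print(re.findall(r"/\*.*?\*/", source))
# ['/* first */', '/* second */']
```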
Once-only subpatterns
Another interesting topic related to quantifiers is the once-only subpattern. To understand it, you first need to understand how a regular expression containing quantifiers is matched. Here is an example.
Suppose we use the pattern /\d+foo/ to match the string "123456bar". The result is, of course, no match. But how does the regular expression engine get there? It starts with \d+, which stands for one or more digits, and checks the first character of the target string, "1", which fits. It then repeats the item as the quantifier allows, consuming "123456". The next character, "b", cannot match \d, so the engine moves on to the rest of the pattern, "foo", which cannot match the rest of the target string, "bar". Now something interesting happens. The engine backtracks into \d+, reduces the number of repetitions by one, and checks whether the rest can match. \d+ now holds "12345", and the engine tests whether the remaining target string "6bar" matches the remaining pattern "foo". It does not, so the repetition count is reduced by one again, and so on until the quantifier's minimum is reached. If the pattern still cannot match, the engine moves on and finally reports that there is no match.
Now we can introduce the once-only subpattern. A once-only subpattern is one that the engine does not backtrack into once it has matched. It is written with a question mark and a greater-than sign after the opening parenthesis, like this: (?>. Rewritten with a once-only subpattern, the example above becomes:
/(?>\d+)foo/. When the engine reaches the mismatching "bar", it gives up at once, without going through the backtracking described above.
Note that a once-only subpattern is a non-capturing subpattern, so its match cannot be back-referenced.
When a subpattern with an unlimited repeat contains an item that itself has an unlimited repeat, a once-only subpattern is the only way to keep your program from running for a very long time. For example, suppose you use the pattern /(\D+|<\d+>)*[!?]/ to match a long string of "a" characters, such as "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa". The pattern means any number of items, each either a run of non-digit characters or a run of digits in angle brackets, followed by an exclamation mark or a question mark. The string can be split between the inner repeat and the outer repeat in a very large number of ways, and every one of them has to be tried before the engine can report that there is no match, which makes the amount of work enormous. You will wait in front of the computer for quite a long time before you see the result. If you rewrite the pattern with a once-only subpattern, as /((?>\D+)|<\d+>)*[!?]/, the run of non-digits cannot be broken up and the result comes back quickly.
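The effect of switching off backtracking can be observed with a small experiment. This is only a sketch, and it swaps in a different technique: Python's re module accepts the (?>...) syntax only from version 3.11, so the code emulates a once-only subpattern with a lookahead that captures the digits followed by a back reference that consumes them. The engine does not backtrack into a lookahead that has already succeeded, which gives the same all-or-nothing behaviour. Unlike a real once-only subpattern, the emulation does use a capturing group.

```python
import re

# Ordinary repetition: \d+ gives back one digit so the final \d can match.
backtracking = re.compile(r"\d+\d")

# Emulated once-only subpattern: the lookahead grabs every digit, \1 consumes
# exactly that text, and nothing is handed back to the final \d.
once_only = re.compile(r"(?=(\d+))\1\d")

print(backtracking.search("123") is not None)  # True
print(once_only.search("123") is not None)     # False

# The article's example, in emulated form: the digits are taken in one piece.
once_only_foo = re.compile(r"(?=(\d+))\1foo")
print(once_only_foo.search("123456foo") is not None)  # True
print(once_only_foo.search("123456bar") is not None)  # False
```

On Python 3.11 or later the same patterns can be written with (?>\d+) directly.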
Quick Start of Regular Expressions (III)
The previous article introduced subpatterns, back references and quantifiers. This article focuses on assertions in regular expressions.
Assertions
An assertion is a test performed at the current matching position in the target string. The test does not consume any of the target string; that is, it does not move the current matching position.
That may sound abstract, so here are a few simple examples.
The two most common assertions are the metacharacters "^" and "$", which test whether the current position is at the beginning or the end of a line.
Consider the pattern /^\d\d\d$/ matched against the target string "123". "\d\d\d" stands for three digit characters and matches the three characters of the target string, while ^ and $ assert that those three characters start at the beginning of the line and stop at the end of it. Neither ^ nor $ corresponds to any character in the target string.
There are some other simple assertions, \b, \B, \A, \Z, \z and \G, which all start with a backslash. We have already introduced this use of the backslash. Their meanings are as follows.
Assertion   Meaning
\b          Word boundary
\B          Not a word boundary
\A          Start of the target (independent of multi-line mode)
\Z          End of the target, or before a newline at the end (independent of multi-line mode)
\z          End of the target (independent of multi-line mode)
\G          First matching position in the target
Note that these assertions cannot be used inside a character class, where the same sequences have other meanings. For example, \b inside a character class stands for the backspace character (0x08).
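The zero-width nature of these assertions can be seen from the span of a match. This sketch uses Python's re module as a stand-in and writes the escapes with their backslashes. It deliberately covers only ^, $ and \b, because Python's treatment of the end-of-target assertions differs from the table above and it has no \G.

```python
import re

# ^ and $ consume nothing: the match spans exactly the three digits.
match = re.search(r"^\d\d\d$", "123")
print(match.group(), match.span())           # 123 (0, 3)
print(re.search(r"^\d\d\d$", "1234"))        # None

# \b asserts a word boundary without consuming a character.
print(re.findall(r"\bcat\b", "cat concat cat's catalog"))  # ['cat', 'cat']

# Inside a character class, \b is the backspace character instead.
print(re.fullmatch(r"[\b]", "\x08") is not None)  # True
```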
The assertions introduced so far all test the current position only. Assertions also support more complex conditions, which are written as subpatterns. These are the lookahead assertions and the lookbehind assertions.
Lookahead assertions
A lookahead assertion tests whether a condition holds looking forward from the current position in the target string. Lookahead assertions are either positive or negative, written (?= and (?! respectively. For example, the pattern /\w+(?=;)/ matches a run of word characters that is followed by a semicolon, without including the semicolon in the match. A pattern that is easy to misread is /(?!;)\w+/: it does not mean a run of word characters that is not preceded by a semicolon. In fact it matches whether or not a semicolon comes before the word characters, because the assertion looks ahead at the first word character, which is never a semicolon. To express "not preceded by", we need the lookbehind assertions described below.
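Both lookahead patterns can be tried on short strings. This sketch uses Python's re module as a stand-in, writes the word class with its backslash (\w), and uses the negative form (?!;) for the second pattern; the sample strings are invented.

```python
import re

# Positive lookahead: the semicolon must follow, but is not part of the match.
print(re.findall(r"\w+(?=;)", "alpha;beta gamma;"))
# ['alpha', 'gamma']

# Negative lookahead at the start: it inspects the first word character
# itself, so a semicolon before the word makes no difference.
print(re.findall(r"(?!;)\w+", ";delta"))
# ['delta']
```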
Lookbehind assertions
Lookbehind assertions use (?<= and (?<! for the positive and negative forms respectively. For example, /(?<!foo)bar/ finds an occurrence of "bar" that is not preceded by "foo". In general, the subpattern of a lookbehind assertion must have a fixed length; otherwise a compilation error occurs.
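A short sketch shows both the matching behaviour and the fixed-length restriction. It uses Python's re module as a stand-in, which happens to enforce the same restriction; the helper name compiles and the sample string are made up for the illustration.

```python
import re

def compiles(pattern):
    """True if the engine accepts the pattern, False if it reports an error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Only the occurrences of "bar" not preceded by "foo" are reported.
not_after_foo = re.compile(r"(?<!foo)bar")
print([m.start() for m in not_after_foo.finditer("foobar bar rebar")])
# [7, 13]

# A fixed-length lookbehind is accepted; a variable-length one is rejected.
print(compiles(r"(?<=abcd)$"))  # True
print(compiles(r"(?<=a+)b"))    # False
```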
A lookbehind assertion combined with a once-only subpattern gives an efficient way to match the end of a text. Here is an example.
Suppose you use a simple pattern such as /abcd$/ on a long text to test whether it ends with "abcd". Because matching proceeds from left to right, the engine looks for each "a" in the text and then tries to match the rest of the pattern from there. If the long text contains many "a" characters, this is clearly inefficient. If the pattern is replaced by /^.*abcd$/, the leading "^.*" first matches the entire text, and the engine then finds that the next item, "a", cannot match. The backtracking described earlier takes place: the engine shortens the text matched by "^.*" one character at a time, from right to left, looking for the rest of the pattern, which again leads to many attempts. Now rewrite the pattern with a once-only subpattern and a lookbehind assertion, as /^(?>.*)(?<=abcd)/. The once-only subpattern matches the whole text in one go, and the lookbehind assertion then checks whether the last four characters are "abcd". A single test is enough to decide whether the whole pattern matches. For very long texts this approach can improve efficiency significantly.
A pattern can contain several assertions in succession, and assertions can also be nested. The subpattern of an assertion is itself non-capturing and cannot be back-referenced.
An important application of assertions is as the condition of a conditional subpattern. So what is a conditional subpattern?
Conditional subpatterns
Regular expressions allow a pattern to use different subpatterns depending on a condition. This is the conditional subpattern. Its format is (?(condition)yes-pattern) or (?(condition)yes-pattern|no-pattern). If the condition is satisfied, yes-pattern is used; otherwise no-pattern is used (if one is given).
There are two kinds of condition in a conditional subpattern. One is the result of an assertion; the other tests whether an earlier capturing subpattern has captured anything.
If the text in the parentheses of the condition is a number, the condition is true when the capturing subpattern with that number has matched. Consider the following example, /( \( )? [^()]+ (?(1) \) )/x. (The "x" pattern modifier means that whitespace outside character classes is ignored, as is everything from a # outside a character class to the end of the line.)
The first part of this pattern, "( \( )?", matches an optional opening parenthesis "(". The second part, "[^()]+", matches one or more characters that are not parentheses. The last part, "(?(1) \) )", is a conditional subpattern: if subpattern 1, the optional opening parenthesis, has been captured, then a closing parenthesis ")" must appear at this point.
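The behaviour of this conditional can be checked on a few inputs. The sketch uses Python's re module as a stand-in, with re.VERBOSE playing the role of the "x" modifier and the literal parentheses escaped as \( and \). The name balanced is made up for the example.

```python
import re

# An optional "(", then one or more non-parenthesis characters, then ")"
# only if group 1 (the opening parenthesis) took part in the match.
balanced = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

for text in ("(abc)", "abc", "(abc", "abc)"):
    print(text, "->", balanced.fullmatch(text) is not None)
```

The first two inputs are accepted in full; the last two are rejected, because the closing parenthesis is required exactly when the opening one was matched.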
If the text in the parentheses of the condition is the letter "R", the condition is true when this pattern or subpattern has been called recursively; at the top level of the recursion, the condition is false. Recursion in regular expressions is introduced in a later section.
If the condition is neither a number nor the letter R, it must be an assertion. The assertion can be a positive or negative lookahead or lookbehind assertion. Consider the following example.
/(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )/x
To make this regular expression easier to read, we have used the "x" pattern modifier, which lets us add spaces and line breaks to separate the parts of the pattern without affecting how it is parsed.
The condition on the first line is a positive lookahead assertion that an optional run of characters other than lowercase letters is followed by a lowercase letter. In other words, it tests whether the rest of the target string contains at least one lowercase letter. If so, the target is matched against the pattern before "|", which requires the format two digits, three lowercase letters, two digits, separated by "-". Otherwise, the target is matched against the pattern after "|", which requires three two-digit numbers separated by "-".
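Python's re module does not accept an assertion as the condition of a conditional subpattern, so the following sketch swaps in an equivalent alternation: the first branch is guarded by the positive lookahead and the second by its negation. This is an emulation of the example above, not the same syntax, and the name date_like and the sample strings are invented.

```python
import re

date_like = re.compile(r"""
    (?= [^a-z]* [a-z] )  \d{2}-[a-z]{3}-\d{2}   # a lowercase letter lies ahead
  | (?! [^a-z]* [a-z] )  \d{2}-\d{2}-\d{2}      # no lowercase letter ahead
""", re.VERBOSE)

for text in ("12-aug-99", "12-08-99", "12-AUG-99", "12-08-9a"):
    print(text, "->", date_like.fullmatch(text) is not None)
```

Only the first two strings are accepted: the third has no lowercase letter but is not purely numeric, and the fourth contains a lowercase letter but not in the month position.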
Comments in regular expressions
To make regular expressions easier to read, comments can be added to them. A comment begins with an opening parenthesis, a question mark and a pound sign, "(?#", and continues up to the next closing parenthesis ")". Comments cannot be nested.
If the "x" pattern modifier is set, everything from a pound sign (#) outside a character class (that is, outside []) up to the next newline is also treated as a comment.
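Both comment styles can be compared side by side. This sketch uses Python's re module as a stand-in, with re.VERBOSE in place of the "x" modifier and the inline comment written as (?#...); the phone-number pattern is invented purely for illustration.

```python
import re

# Inline comments: everything from "(?#" to the next ")" is ignored.
inline = re.compile(r"\d{3}(?# area code)-\d{4}(?# line number)")

# Verbose mode: "#" outside a character class starts a comment that
# runs to the end of the line, and whitespace is ignored.
verbose = re.compile(r"""
    \d{3}   # area code
    -       # separator
    \d{4}   # line number
""", re.VERBOSE)

print(inline.fullmatch("555-1234") is not None)
print(verbose.fullmatch("555-1234") is not None)

# Inside a character class, "#" is an ordinary character even in verbose mode.
hash_number = re.compile(r"[#] \d+", re.VERBOSE)
print(hash_number.fullmatch("#42") is not None)
```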
Quick Start of Regular Expressions (IV)
The previous article introduced assertions and related concepts in regular expressions. This article introduces recursion in regular expressions and the use of regular expressions to modify target strings.
Recursion in regular expressions
Readers who have worked with program code will have met all kinds of paired brackets. These brackets are often nested inside one another, and the depth of nesting cannot be known in advance. Imagine that you want to extract a piece of code enclosed in brackets, which may itself contain further pairs of brackets nested to varying depths. How can this be done with regular expressions?
