Updated on 2025-04-11

PHP and regular expression tutorial collection 2


Before Perl 5.6 this was indeed rather difficult, but Perl 5.6 introduced recursive regular expressions, which solved the problem. Usually "(?R)" is used in a regular expression to refer to the whole pattern itself. Let's see which regular expression solves the problem raised just now.
/\( ( (?>[^()]+) | (?R) )* \)/x
Now let's analyze the meaning of this pattern. It uses the "x" pattern modifier so that spaces can be added to the pattern for easier reading.
The pattern begins by matching the first left parenthesis, followed by a capturing subpattern. Note that the subpattern is followed by the quantifier "*", which means it can be repeated zero or more times. The pattern ends with a closing right parenthesis. Now let's analyze the content of the subpattern ( (?>[^()]+) | (?R) ). It is an alternation, which means the subpattern covers two cases. The first is (?>[^()]+), a once-only subpattern that matches one or more non-parenthesis characters. The other is (?R), a recursive call to the whole regular expression \( ( (?>[^()]+) | (?R) )* \), which again looks for a left parenthesis and then for the content enclosed by a pair of nested parentheses.
With this analysis the meaning of the regular expression is basically clear, but have you wondered why the once-only subpattern (?>[^()]+) is needed to find the non-parenthesis strings?
In fact, because the depth of recursion is unbounded, this treatment is very necessary: when the pattern meets a string that does not match, it keeps you from getting stuck in a long wait. Consider the following target string.
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
Without the once-only subpattern, the parser would try every possible way of splitting the target string before finally reporting that there is no match, which wastes a lot of time.
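The structure of the recursive pattern can be mirrored in ordinary code. The following is only a minimal Python sketch, not how PCRE works internally: Python's built-in re module does not support "(?R)", so the hypothetical helper match_balanced follows the same steps by hand. It consumes a left parenthesis, then either non-parenthesis characters or a nested group (the recursive call), and finally a right parenthesis.

```python
def match_balanced(text, pos=0):
    """Return the index just past the balanced group starting at pos, or None.

    Hypothetical helper that mirrors  \\( ( [^()]+ | (?R) )* \\)  by hand.
    """
    # The group must start with a left parenthesis.
    if pos >= len(text) or text[pos] != "(":
        return None
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == ")":
            # Closing parenthesis of this group: the match is complete.
            return i + 1
        if ch == "(":
            # Nested group: recurse, like (?R) in the pattern.
            end = match_balanced(text, i)
            if end is None:
                return None
            i = end
        else:
            # Any non-parenthesis character is consumed once and never revisited.
            i += 1
    # Ran out of input before the group was closed.
    return None


print(match_balanced("(a(b)c)"))
print(match_balanced("(aaaaaaaa()"))
```

Because each character is consumed exactly once, the unmatched string fails after a single pass, which is the behaviour the once-only subpattern is meant to secure in the regular expression.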
Modify the target with regular expressions
Not all regular expression tools allow you to modify the target string. Some of them only use regular expressions to find strings that match a given pattern. On Linux, the most widely used tool that supports regular expressions is the grep command, which is dedicated to searching. There are also text editors, some of which allow regular expression replacement while others do not. Check the manual of the tool you are using.
For tools that do allow you to modify target strings with regular expressions, keep in mind some differences between them:
The first difference is the form the replacement takes. Some tools use a dialog box for entering the pattern to find and the replacement text. Some use commands in which delimiters separate the pattern from the replacement text. Programming languages usually pass the pattern and the replacement text as separate parameters of a function.
Another difference to note is what these tools actually modify. Most Linux command-line tools work on a buffered copy and write the result to standard output or a pipe rather than modifying the file stored on disk, while text editors and programming languages usually modify the target file directly.
Let's use the format of the sed command under Linux to give a few examples of replacement with regular expressions:
Pattern: s/cat/dog/g
Input: wild dogs, bobcats, lions, and other wild cats
Output: wild dogs, bobdogs, lions, and other wild dogs
Pattern: s/[a-z]+i[a-z]*/nice/g
Input: wild dogs, bobcats, lions, and other wild cats
Output: nice dogs, bobcats, nice, and other nice cats
When a pattern is used for a replacement operation like this, every string in the target that matches the pattern is replaced.
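The two replacements above can be reproduced in a programming language, where the pattern and the replacement text are passed as separate function arguments. This is a small sketch using Python's re.sub rather than sed itself; the variable names are only illustrative.

```python
import re

text = "wild dogs, bobcats, lions, and other wild cats"

# Counterpart of s/cat/dog/g: replace every occurrence of "cat".
first = re.sub(r"cat", "dog", text)

# Counterpart of s/[a-z]+i[a-z]*/nice/g: replace every run of lowercase
# letters that has at least one letter before an "i".
second = re.sub(r"[a-z]+i[a-z]*", "nice", text)

print(first)
print(second)
```

Note that "bobcats" becomes "bobdogs" in the first result: the pattern matches inside words as well, which is exactly the kind of over-matching to watch for.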
Here is another example, which uses backreferences in the replacement:
Pattern: s/([A-Z])([0-9]{2,4}) /\2:\1 /g
Input: A37 B4 C107 D54112 E1103 XXX
Output: 37:A B4 107:C D54112 1103:E XXX
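The backreference example can be checked the same way. The sketch below uses Python's re.sub and assumes the replacement is meant to be the second group, a colon, the first group, and the trailing space that the pattern consumed.

```python
import re

text = "A37 B4 C107 D54112 E1103 XXX"

# One capital letter, then two to four digits, then a space.
pattern = r"([A-Z])([0-9]{2,4}) "

# \2 and \1 refer back to the digits and the letter captured above.
result = re.sub(pattern, r"\2:\1 ", text)

print(result)
```

"B4" and "D54112" are left alone because they have too few or too many digits before the space.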
As mentioned earlier, matching is greedy by default, which often makes the actual match larger than what you intended. This is especially dangerous in a replacement operation: replacing the wrong match in effect deletes valid content from the target, and the harm is even greater when the operation is applied to files. So remember that a loose character class combined with a loose quantifier is enough to cause irreparable damage. Before performing such operations, test against a variety of target strings to avoid this as far as possible.
In the next article of this tutorial, we will introduce a tool that makes learning regular expressions easier, along with some ideas for writing regular expressions.
Quick Start of Regular Expressions (V)
In the previous article we introduced recursion and replacement in regular expressions. Now let's look at a tool that is convenient for testing while learning regular expressions, and at some ideas for writing them.
A convenient tool for learning regular expressions
The best way to learn regular expressions is of course practice. Many tools support regular expressions, but using them purely for exercises is not very convenient.
Here I recommend a dedicated tool for writing and testing regular expressions: the Regular Expression Editor from PHPEdit. It is free software, mainly used to debug the Perl-compatible regular expression functions in PHP. With it you can easily enter a target string and a regular expression and see the matching results in real time. You can download this tool on its download page.
The interface of the program is very simple, but while using it I found that some of its functions do not seem to work properly; only the preg_match_all and preg_replace functions behave normally. Also, do not add pattern delimiters in the pattern input box, because the program seems to treat everything in the box as the pattern.
Fortunately, as a practice tool for regular expressions, its features are sufficient. Below is its running interface.
Program running interface
All the examples mentioned in this article can be tested in it: enter the pattern in the top box, type the target string into the middle box, and click the "run the regxwp" button to see the matching result below.
Ideas for writing regular expressions
A tip to avoid too many matches
We have talked a lot about the over-matching caused by poorly written regular expressions. The question now is how to avoid it as far as possible. Here is a small trick.
If you find that your pattern matches too much, a good approach is to change your way of thinking. Instead of asking what the pattern needs to match next, ask what the pattern needs to avoid matching next. This is easy to do with the metacharacter "^" inside a character class, and it often gives a more precise match.
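As a small illustration of this tip, the Python sketch below extracts quoted strings from a made-up sample line. The first pattern says what to match (any characters), the second says what to avoid (a quote character).

```python
import re

text = 'say "hello" and then "goodbye" loudly'

# Greedy ".*" runs on to the last quote in the line, so both quoted
# strings and everything between them end up in a single match.
greedy = re.findall(r'".*"', text)

# "[^"]*" refuses to cross a quote, so each quoted string is matched
# on its own.
negated = re.findall(r'"[^"]*"', text)

print(greedy)
print(negated)
```

The first call returns one oversized match, while the second returns the two quoted strings separately.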
To illustrate the benefits of this idea, let's first look at an example that has nothing to do with regular expressions. The probability of rolling a six with a single throw of a die is 1/6. If you throw the die six times, what is the probability of rolling at least one six?
Some people may calculate it this way: the probability for one throw is 1/6, so for six throws it is 6 × 1/6, which adds up to 1. This result is obviously wrong. Even with six throws there is no guarantee that a six will come up. Solving the problem head-on seems a bit difficult.
If we change our thinking, the solution becomes much clearer. We can restate the question like this: if you roll the die six times, what is the probability that you do not roll a six on any throw? This question is much easier to answer. By the multiplication rule of probability, the chance of not rolling a six on one throw is 5/6, so the chance of not rolling a six on any of the six throws is 5/6 to the power of 6, which is about 33%. Subtracting this from 1 gives the answer we need, about 67%.
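The arithmetic can be checked with a few lines of Python. This sketch uses exact fractions so that no rounding creeps in; the variable names are only illustrative.

```python
from fractions import Fraction

# Probability of NOT rolling a six on a single throw.
p_not_six = Fraction(5, 6)

# Probability of not rolling a six on any of six independent throws.
p_no_six_in_six = p_not_six ** 6

# Complement: at least one six in six throws.
p_at_least_one_six = 1 - p_no_six_in_six

print(p_no_six_in_six, float(p_no_six_in_six))
print(p_at_least_one_six, float(p_at_least_one_six))
```

The exact values are 15625/46656 (about 0.335) for no six at all and 31031/46656 (about 0.665) for at least one six.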
You can think of the matching of each part of a pattern as one roll of the die. The relationship between the matching probability of each part and the overall matching probability is very similar to the example above.
How to improve the parsing efficiency of regular expressions
Among regular expressions that match the same content, some patterns are more efficient than others. To give a simple example, the character class "[aeiou]" is more efficient than the alternation "(a|e|i|o|u)". In general, using the simplest and most basic constructs gives the best efficiency.
Nested unbounded quantifiers should be used with great caution. When they meet a target string that does not match, parsing may take considerable time. For example, when the pattern fragment "(a+)*" is applied to the target string "aaaa", there are 16 different ways for it to match (the "*" can repeat zero to four times, and in most of those cases each "+" can take a different number of characters), and this number grows extremely rapidly as the string gets longer.
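To get a feel for how fast the number of alternatives grows, the Python sketch below counts the ways a fragment like "(a+)*" can carve up a run of "a" characters. It uses one simple way of counting (every way of cutting a prefix of the run into non-empty pieces), which is an assumption for illustration and not necessarily how a particular engine tallies its attempts. The function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ways_to_split(n):
    """Ways to cut a run of n characters into an ordered list of non-empty pieces."""
    if n == 0:
        return 1  # the empty run has exactly one (empty) split
    # The first piece takes k characters; the remainder is split recursively.
    return sum(ways_to_split(n - k) for k in range(1, n + 1))


def ways_to_match_prefix(n):
    """Ways the fragment can consume some prefix (possibly empty) of n characters."""
    return sum(ways_to_split(k) for k in range(n + 1))


for n in (4, 10, 25):
    print(n, ways_to_match_prefix(n))
```

Under this count the total doubles with every extra character: 16 for four characters, 1024 for ten, and over 33 million for twenty-five.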
Some regular expression tools optimize certain specific patterns to improve efficiency. Knowing which optimizations your regular expression tool performs, and using the optimized patterns wherever possible, can greatly improve execution efficiency. For example, PHP optimizes the parsing of patterns such as "(a+)*b". When the pattern ends with a literal character, the parser first checks whether that character appears later in the target; if it does not, a failed match is returned immediately and parsing stops. If the pattern is changed to "(a+)*\d", the ending is no longer a literal character, so the pattern is parsed by the normal process. To see the difference between the two, set the target string in the tool mentioned earlier to 25 lowercase "a" characters and test the two patterns separately. The former finishes immediately, while the latter takes about one second (the author uses an XP 1700+ processor).
Besides using optimized patterns wherever possible, restructuring a pattern can also greatly improve efficiency. The method introduced earlier with lookbehind assertions, which combines a lookbehind assertion with a once-only subpattern to match the end of a string, is a good example.
This is where the tutorial ends. Because of limited space and my own limited knowledge, the article may contain many omissions, and I ask for your understanding. The most comprehensive introductions to regular expressions are probably the documents and books related to Perl. If you want a deeper understanding of regular expressions, you can refer to the book "Mastering Regular Expressions" by Jeffrey Friedl, which contains many examples. However, I think that once you understand the basic concepts, it is more practical to read the regular expression sections in the documentation of the tools you use most often. Finally, the old saying still holds: true knowledge comes from practice. I hope everyone will master regular expressions better through continued practice.
Special characters in regular expressions
Reprinted from: /articles/
Character Description
\
Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".
^
Matches the start position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$
Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*
Matches the preceding subexpression zero or more times. For example, 'zo*' can match "z" and "zoo". * is equivalent to {0,}.
+
Matches the preceding subexpression one or more times. For example, 'zo+' can match "zo" and "zoo", but cannot match "z". + is equivalent to {1,}.
?
Matches the preceding subexpression zero or one time. For example, 'do(es)?' can match "do" or "does". ? is equivalent to {0,1}.
{n}
n is a non-negative integer. Matches exactly n times. For example, 'o{2}' cannot match the 'o' in "Bob", but can match the two o's in "food".
{n,}
n is a non-negative integer. Matches at least n times. For example, 'o{2,}' cannot match the 'o' in "Bob", but can match all the o's in "fooooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}
m and n are both non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' will match the first three o's in "foooooood". 'o{0,1}' is equivalent to 'o?'. Note that there cannot be spaces between the comma and the two numbers.
?
When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. The non-greedy pattern matches as little of the searched string as possible, while the default greedy pattern matches as much as possible. For example, for the string "oooo", 'o+?' will match a single "o", and 'o+' will match all the o's.
.
Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'.
(pattern)
Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)
Matches pattern but does not capture the match, that is, it is a non-capturing match and is not stored for later use. This is useful when combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern)
Positive lookahead. Matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' can match "Windows" in "Windows 2000" but cannot match "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)
Negative lookahead. Matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' can match "Windows" in "Windows 3.1", but cannot match "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y
Matches x or y. For example, 'z|food' can match "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]
Character set. Matches any one of the enclosed characters. For example, '[abc]' can match the 'a' in "plain".
[^xyz]
Negated character set. Matches any character not enclosed. For example, '[^abc]' can match the 'p' in "plain".
[a-z]
Character range. Matches any character in the specified range. For example, '[a-z]' can match any lowercase alphabetic character in the range 'a' to 'z'.
[^a-z]
Negated character range. Matches any character not in the specified range. For example, '[^a-z]' can match any character that is not in the range 'a' to 'z'.
\b
Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' can match the 'er' in "never", but not the 'er' in "verb".
\B
Matches a non-word boundary. 'er\B' can match the 'er' in "verb", but cannot match the 'er' in "never".
\cx
Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d
Matches a digit character. Equivalent to [0-9].
\D
Matches a non-digit character. Equivalent to [^0-9].
\f
Matches a form feed character. Equivalent to \x0c and \cL.
\n
Matches a newline character. Equivalent to \x0a and \cJ.
\r
Matches a carriage return character. Equivalent to \x0d and \cM.
\s
Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S
Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t
Matches a tab character. Equivalent to \x09 and \cI.
\v
Matches a vertical tab character. Equivalent to \x0b and \cK.
\w
Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W
Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn
Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' & "1". This allows ASCII codes to be used in regular expressions.
\num
Matches num, where num is a positive integer. It is a reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n
Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm
Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml
If n is an octal digit (0-3), and m and l are both octal digits (0-7), matches the octal escape value nml.
\un
Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
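A few of the entries above can be tried out directly. The table describes the JScript and VBScript dialect; the sketch below uses Python's re module instead, on the assumption that these particular constructs (lazy quantifiers, non-capturing groups, lookaheads, word boundaries, and backreferences) behave the same way there. The sample word "bookkeeper" is made up for illustration.

```python
import re

# Greedy versus non-greedy quantifier on "oooo".
greedy = re.search(r"o+", "oooo").group()
lazy = re.search(r"o+?", "oooo").group()

# Non-capturing group: alternation without storing a submatch.
plural = re.fullmatch(r"industr(?:y|ies)", "industries")

# Positive and negative lookahead: the version text is checked but not consumed.
positive = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000")
negative = re.search(r"Windows (?!95|98|NT|2000)", "Windows 2000")

# Word boundary \b versus non-boundary \B.
er_at_end = re.search(r"er\b", "never")
er_inside = re.search(r"er\B", "verb")
er_b_in_verb = re.search(r"er\b", "verb")

# Backreference \1: two consecutive identical characters.
double = re.search(r"(.)\1", "bookkeeper").group()

print(greedy, lazy, double)
print(positive.group(), negative)
```

Here lazy is a single "o" while greedy is the whole string, the positive lookahead matches only "Windows " without the version number, and the negative lookahead finds nothing in "Windows 2000".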