SoFunction
Updated on 2025-03-04

A brief discussion on regular expression efficiency Greedy, non-greed and backtracking

Let’s first literate what is greed for regular expressions and what is non-greed? Or what is a matching priority quantifier, and what is an ignorant priority quantifier?
OK, I don’t know what the concept is, let’s give an example.
A classmate wants to filter the content between them, so he writes the rules and procedures like this.
Copy the codeThe code is as follows:

$str = preg_replace('%<script>.+?</script>%i','',$str);//Non-greedy

It seems that there is no problem, but it is not. like
Copy the codeThe code is as follows:

$str = '<script<script>alert()</script>>alert()</script>';

Then after the above program, the result is
Copy the codeThe code is as follows:

$str = '<script<script>alert()</script>>alert()</script>';
$str = preg_replace('%<script>.+?</script>%i','',$str);//Non-greedy
print_r($str);
//$str output is <script>alert()</script>

Still not achieving the effect he wanted. The above is non-greed, and some are also called laziness. The non-greedy mark is marked as adding ? after the numeric character, such as +?, *?,? (It is quite special, I will write it in the future BLOG). That is, the mark is not greed, if not written, it is greed. for example
Copy the codeThe code is as follows:

$str = '<script<script>alert()</script>>alert()</script>';
$str = preg_replace('%<script>.+</script>%i','',$str);//Non-greedy
print_r($str);
//The output of $str is <script. There is only this, and it seems that it is still not suitable. Ha, do you know how to rewrite that regular?

The above is an introduction to the difference between greed and not greed. Next, let’s talk about the backtracking issues caused by greed and non-greed. Let’s take a look at a small example first.
The regular expression is \w*(\d+) and the string is cfc456n. So, what is the $1 that this regular match? ?

If your answer is 456, then, congratulations, the answer is wrong, the result is not 456, but 6. Do you know why?

CFC4N explains that when the regular engine uses regular \w*(\d+) to match the string cfc456n, it will first use \w* to match the string cfc456n. First, \w* will match all the characters of the string cfc456n, and then hand it to \d+ to match the remaining string, and the rest will be gone. At this time, the \w* rule will reluctantly spit out a character to match for \d+. At the same time, before spitting out the characters, record a point. This point is the point used for backtracking, and then \d+ match n. It is found that the match cannot be successful. It will ask \w* to spit out another character, and \w* will record a backtracking point again and then spit out a character. At this time, the result of the \w* match is only cfc45, and 6n has been spit out. \d+ matches 6 again. If the match is successful, the engine will be notified. If the match is successful, it will be displayed directly. So, the result of (\d+) is 6, not 456.

When the above regular expression is changed to \w*?(\d+) (note that this is non-greedy), the string is still cfc456n, then, what is the $1 of the regular match at this time? ?
Classmate A replied: The result is 456.
Well, yes, right, it's 456. CFC4N asked weakly, why is it 456?
I'm going to explain why it's 456
Regular expressions have rules, and quantifiers are matched first, so \w*? will first match the string cfc456. Because \w*? is not greedy, the regular engine will use the expression \w+? to match only one string at a time, and then hand over control to the subsequent \d+ to match the next character. At the same time, a point is recorded, which is used to return here when the match is unsuccessful, and match again, that is, the traceback point. Since the quantifier after \w is *, * means 0 to countless times, so first it is 0 times, that is, \w*? matches an empty, record a traceback point, and hand over control to \d+,\d+ to match the first character c of cfc456n, and then, the match fails, so, then hand over control to \w*? to match c of cfc456n, \w*? to match c successfully. Because it is not greedy, it only matches one character at a time, records the traceback point, and then hand over control to \d+ to match. f, then, \d+ matches f and fails again, then control is given to \w*?, \w*? matches c, record the traceback point (this time \w*? matches cfc), then control is given to \d+, \d+ matches 4, and the match is successful, then, since the quantifier is +, that is, 1 to countless times, then, then matches 5, succeed, then, matches 6, succeed, and then continue to match the operation. The next character is n, match failed, at this time, \d+ will hand over the control. Since there is no regular expression after \d+, the entire regular expression declares that the match is completed, and the result is cfc456, where the first group of results is 456. Dear classmate, do you understand the result of the question just now, why is it 456?

OK, have you learned about the matching principle of greed, non-greed, from the example above? So do you understand when you need to use greed, non-greed, to process your strings?
Bird's article talks about targeting
Expression, program
Copy the codeThe code is as follows:

$reg = "/<script>.*?<\/script>/is";
$str = "<script>*********</script>"; //Length greater than 100014
$ret = preg_repalce($reg, "", $str); //Return NULL

The reason is that there are too many backtracks until the stack space is exhausted.

Let’s take a look at another example.
String
Copy the codeThe code is as follows:

$str = '<script>123456</script>';

The regular expression is
Copy the codeThe code is as follows:

$strRegex1 = '%<script>.+<\/script>%';
$strRegex2 = '%<script>.+?<\/script>%';
$strRegex3 = '%<script>(?:(?!<\/script>).)+<\/script>%';

How many backtracks will these three rules cause? ?

See the next article The efficiency of PHP regular expressions: backtracking and solidified grouping