SoFunction
Updated on 2025-04-06

Using recursive regular expressions in PHP

When are recursive regular expressions needed? When the pattern you want to match can itself appear nested inside a match. The classic example is matching nested brackets, as follows.

Suppose your text contains correctly paired, nested brackets. The nesting can be arbitrarily deep, and you want to capture such bracket groups.

<?php
$string = "some text (a(b(c)d)e) more text";
if (preg_match("/\(([^()]+|(?R))*\)/", $string, $matches)) {
    echo "<pre>"; print_r($matches); echo "</pre>";
}
?>

Output:

Array
(
[0] => (a(b(c)d)e)
[1] => e
)

It can be seen that the text we need has been captured into $matches[0].

How it works

Now let's look at how this works.

The key to the regular expression above is (?R), which recurses into the entire regular expression that contains it. Conceptually, each time the regex engine (PCRE, which PHP's preg functions use) reaches (?R), it behaves as though (?R) had been replaced with "\(([^()]+|(?R))*\)".
So for the example above, the regular expression is equivalent to:

"/\(([^()]+|\(([^()]+|\(([^()]+)*\))*\))*\)/"

However, the expanded pattern above only handles brackets nested up to 3 levels deep. For nesting of unknown depth, you have to use this pattern:

"/\(([^()]+|(?R))*\)/"

It not only matches nesting of any depth but is also much shorter to write: powerful and concise at the same time.
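As a rough illustration of the difference, the sketch below runs both patterns against a string nested four levels deep. The test string and variable names are made up for this example; the two patterns are the ones given above.

```php
<?php
// Hypothetical test string with four levels of nesting.
$deep = "x (a(b(c(d)e)f)g) y";

// Hand-expanded pattern from the text: covers at most three levels.
$fixed = "/\(([^()]+|\(([^()]+|\(([^()]+)*\))*\))*\)/";
// Recursive pattern from the text: no depth is written into it.
$recursive = "/\(([^()]+|(?R))*\)/";

preg_match($fixed, $deep, $fixedMatch);
preg_match($recursive, $deep, $recursiveMatch);

echo $fixedMatch[0], "\n";     // (b(c(d)e)f)
echo $recursiveMatch[0], "\n"; // (a(b(c(d)e)f)g)
```

The expanded pattern cannot match the outermost group, so it settles for the first group that is only three levels deep, while the recursive pattern captures the whole thing.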

Now let's take a closer look at how "/\(([^()]+|(?R))*\)/" matches "(a(b(c)d)e)":

"(c)" is matched by "\(([^()]+)*\)", which is the whole expression with the recursive branch left unused. In other words, (c) is a miniature of the entire recursion: small as it is, it is matched by the complete regular expression.
That is why, one level up, (c) can be matched by (?R).
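One way to check this, sketched here with made-up variable names, is to run the full pattern against the innermost group on its own and then against the group one level up:

```php
<?php
$pattern = "/\(([^()]+|(?R))*\)/";

// Each nested level taken on its own as the subject string.
preg_match($pattern, "(c)", $inner);
preg_match($pattern, "(b(c)d)", $middle);

print_r($inner);  // [0] => (c), [1] => c
print_r($middle); // [0] => (b(c)d), [1] => d
```

In both cases $matches[0] is the whole subject, which shows that every level is matched by the same complete expression.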

The matching process for (b(c)d) is:
"\(" matches "(";
"[^()]+" matches "b";
(?R) matches "(c)";
"[^()]+" matches "d";
"\)" matches ")".

Given this matching process, it is easy to see why the second element of the array, $matches[1], is 'e'. The substring 'e' is what the capture group matched on its last iteration, and when a capture group repeats, only its last capture is saved to the array.
To see this behaviour for yourself, match the string abc123xyz890 against ([a-z]+[0-9]+)+ and check what ends up in $1. Note that the result does not conflict with the leftmost-longest principle.
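The experiment suggested above can be run as a short script. This is only a minimal sketch; the variable name $m is arbitrary.

```php
<?php
// The group ([a-z]+[0-9]+) repeats twice: first "abc123", then "xyz890".
preg_match("/([a-z]+[0-9]+)+/", "abc123xyz890", $m);

echo $m[0], "\n"; // abc123xyz890 (the whole match)
echo $m[1], "\n"; // xyz890 (only the last iteration of the group is kept)
```

The overall match still covers the entire string; only the content of the repeated capture group is overwritten on each iteration.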

If we only need $matches[0], we can write:

<?php
$string = "some text (a(b(c)d)e) more text";
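// The literal brackets must be escaped as \( and \); only the inner group becomes non-capturing.
if (preg_match("/\((?:[^()]+|(?R))*\)/", $string, $matches)) {
    echo "<pre>"; print_r($matches); echo "</pre>";
}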
?>
The overall match is the same, but nothing else is captured:

Array
(
[0] => (a(b(c)d)e)
)

The only change is replacing the capturing brackets () with non-capturing brackets (?:).

This can be improved further:

<?php
$string = "some text (a(b(c)d)e) more text";
// As before, the literal brackets are escaped; (?> ) makes the inner group atomic.
if (preg_match("/\((?>[^()]+|(?R))*\)/", $string, $matches)) {
    echo "<pre>"; print_r($matches); echo "</pre>";
}
?>

Here we use a so-called once-only subpattern (rex's note: Mastering Regular Expressions, 3rd edition, calls this "atomic grouping"; see that book for details). The PHP manual also recommends using this construct wherever conditions allow, in order to speed up regular expressions.
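The speed difference tends to show up when part of the subject cannot match. The sketch below is only a rough illustration: the subject string and variable names are invented, the timings depend on the PHP build, and the non-atomic version may even return false if the pcre.backtrack_limit setting is reached.

```php
<?php
// Hypothetical subject: an unclosed bracket followed by a well-formed group.
$subject = "(" . str_repeat("a", 20) . " (ok)";

$plain  = "/\((?:[^()]+|(?R))*\)/"; // non-capturing group, can backtrack
$atomic = "/\((?>[^()]+|(?R))*\)/"; // once-only (atomic) group

$start = microtime(true);
$plainResult = preg_match($plain, $subject, $plainMatch);
$plainTime = microtime(true) - $start;

$start = microtime(true);
$atomicResult = preg_match($atomic, $subject, $atomicMatch);
$atomicTime = microtime(true) - $start;

// The atomic version gives up on the unclosed bracket quickly and finds "(ok)".
var_dump($atomicResult);      // int(1)
echo $atomicMatch[0], "\n";   // (ok)

// No fixed result is claimed for the non-atomic version; inspect it yourself.
var_dump($plainResult);
printf("plain: %.6fs, atomic: %.6fs\n", $plainTime, $atomicTime);
```

With the atomic group, the engine does not retry every way of splitting the run of "a" characters once the closing bracket fails to appear, which is where the non-atomic version can spend its time.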