PHP Regular Expressions Tutorial: The Basics

Regular expressions are now widely used in software. You can find them in *nix systems (Linux, Unix, etc.), in HP operating systems, in development environments such as PHP, C#, and Java, and in many applications.

Regular expressions let you achieve powerful results with very little code.

The price of being this compact without losing power is that regular expression syntax is dense and not easy to learn.

Example: ^.+@.+\..+$

Code like this has scared me away many times, and it has probably scared away many other people too.

After completing this tutorial, you will be able to read and write such patterns with confidence.

History of regular expressions

The "ancestors" of regular expressions can be traced back to early research on how the human nervous system works. Warren McCulloch and Walter Pitts, two neurophysiologists, developed a mathematical way of describing these neural networks.

In 1956, the mathematician Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper titled "Representation of Events in Nerve Nets and Finite Automata" that introduced the concept of regular expressions. Regular expressions were the expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression".

His work later found its way into early computational search algorithms written by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was in the qed editor.

Regular expressions have been an important part of text-based editors and search tools since then.

This section defines what a regular expression is.

A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a certain substring, to replace matching substrings, or to extract the substrings that meet a certain condition.

When listing directories, the *.txt in dir *.txt or ls *.txt is not a regular expression, because * has a different meaning there than it has in a regular expression.

Regular expressions are text patterns composed of normal characters (such as the letters a to z) and special characters (called metacharacters). The pattern acts as a template that is matched against the string being searched.
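As a minimal sketch of the "template" idea, the snippet below applies a pattern of the form text@text.text to a few strings with PHP's preg_match(). The sample strings are made up for illustration, and the pattern only checks the rough shape of the input; it is not a real email validator.

```php
<?php
// The pattern is wrapped in / delimiters, as the preg_* functions require.
// Single quotes keep the backslash in \. intact for the regex engine.
$pattern = '/^.+@.+\..+$/';

// Hypothetical sample inputs, chosen only for illustration.
$samples = ['user@example.com', 'no-at-sign.example.com', 'user@localhost'];

$results = [];
foreach ($samples as $s) {
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    $results[$s] = preg_match($pattern, $s);
    echo $s, ' => ', $results[$s] === 1 ? 'match' : 'no match', PHP_EOL;
}
```

Only the first string matches: the second has no @, and the third has no dot after the @.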

1. Normal characters

Normal characters are all printing and non-printing characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some symbols.

2. Non-printing characters

Character   Meaning

\cx   Matches the control character specified by x. For example, \cM matches Control-M, the carriage return. x is normally a letter in A-Z or a-z.
\f    Matches a form feed (page break). Equivalent to \x0c and \cL.
\n    Matches a newline character. Equivalent to \x0a and \cJ.
\r    Matches a carriage return. Equivalent to \x0d and \cM.
\s    Matches any whitespace character, including spaces, tabs, form feeds, etc. Equivalent to [ \f\n\r\t\x0b].
\S    Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\x0b].
\t    Matches a tab character. Equivalent to \x09 and \cI.
\v    Matches a vertical tab character (\x0b, \cK). Note that in the PCRE engine used by PHP's preg functions, \v also matches other vertical whitespace such as newlines.
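To make the table concrete, here is a small sketch that uses \s to split a string on whitespace and checks that \t and \x09 match the same character. The input string is a made-up example.

```php
<?php
// A hypothetical input containing tabs, a carriage return, and a newline.
// Double quotes make PHP itself turn \t, \r, \n into the real characters.
$text = "name:\tAlice\r\nage:\t30";

// Single quotes pass \s+ to the regex engine unchanged.
// \s+ treats any run of whitespace as one separator.
$fields = preg_split('/\s+/', $text);

// \x09 is the hexadecimal form of the tab character, so both patterns
// count the same characters. preg_match_all() returns the number of matches.
$tabsHex = preg_match_all('/\x09/', $text);
$tabsEsc = preg_match_all('/\t/', $text);

print_r($fields);                               // name:, Alice, age:, 30
echo "tabs: $tabsHex / $tabsEsc", PHP_EOL;      // tabs: 2 / 2
```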

3. Special characters

Special characters are characters that carry a special meaning, such as the * in "*.txt" mentioned above, which roughly means "any string". To find a file with a literal * in its name, you need to escape the *, that is, put a backslash (\) in front of it: ls \*.txt. Regular expressions have the following special characters.

Special Character   Description

$ Matches the end of the input string. If the m (multiline) modifier is set, $ also matches just before each newline. To match the $ character itself, use \$.

( ) Mark the start and end of a subexpression. The text matched by a subexpression is saved for later use. To match these characters, use \( and \).

* Matches the preceding subexpression zero or more times. To match the * character, use \*.

+ Matches the preceding subexpression one or more times. To match the + character, use \+.

. Matches any single character except the newline \n. To match ., use \.

[ Marks the beginning of a bracket expression. To match [, use \[.

? Matches the preceding subexpression zero or one time, or marks a qualifier as non-greedy. To match the ? character, use \?.

\ Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", while '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".

^ Matches the start of the input string, unless it is used at the start of a bracket expression, in which case it negates the character set. To match the ^ character itself, use \^.

{ Marks the start of a qualifier expression. To match {, use \{.

| Specifies a choice between two alternatives. To match |, use \|.

Regular expressions are built the same way as arithmetic expressions: small expressions are combined with metacharacters and operators to create larger ones. The components of a regular expression can be a single character, a set of characters, a range of characters, a choice between characters, or any combination of these.
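The following sketch shows why escaping matters, using a made-up file name that contains a literal *. It compares an unescaped pattern, a hand-escaped pattern, and one built with PHP's preg_quote(), which adds the backslashes for you.

```php
<?php
$filename = 'notes*.txt'; // hypothetical file name containing a literal *

// Unescaped: * means "zero or more s" and . means "any character".
$unescaped = '/^notes*.txt$/';
// Escaped by hand: \* and \. match the literal characters.
$escaped = '/^notes\*\.txt$/';
// Escaped by preg_quote(); the second argument also escapes the / delimiter.
$quoted = '/^' . preg_quote($filename, '/') . '$/';

$r1 = preg_match($unescaped, $filename);     // 0: the pattern does not fit the literal name
$r2 = preg_match($escaped, $filename);       // 1
$r3 = preg_match($quoted, $filename);        // 1
$r4 = preg_match($unescaped, 'notesss_txt'); // 1: an unintended match

echo "$r1 $r2 $r3 $r4", PHP_EOL;             // 0 1 1 1
```

When the text to match comes from user input, building the pattern with preg_quote() is usually safer than escaping by hand.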

4. Qualifiers

Qualifiers specify how many times a given component of a regular expression must appear for the match to succeed. There are six of them: *, +, ?, {n}, {n,}, and {n,m}.

The *, +, and ? qualifiers are all greedy: they match as much text as possible. Adding a ? directly after a qualifier makes it non-greedy, so that it matches as little text as possible.

The qualifiers are:

Character   Description

* Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.

+ Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.

? Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or "does". ? is equivalent to {0,1}.

{n} n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".

{n,} n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "fooooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.

{n,m} m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no spaces between the comma and the two numbers.
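The difference between greedy and non-greedy matching is easiest to see side by side. The sketch below uses made-up sample strings; str_repeat() is used only to make the number of o's explicit.

```php
<?php
$html = '<b>bold</b>'; // hypothetical sample text

// Greedy: .+ grabs as much as it can, so the match runs to the last >.
preg_match('/<.+>/', $html, $greedy);
// Non-greedy: .+? grabs as little as it can, so the match stops at the first >.
preg_match('/<.+?>/', $html, $lazy);

// A word with exactly five o's between f and d.
$word = 'f' . str_repeat('o', 5) . 'd';
preg_match('/o{1,3}/', $word, $upToThree); // at most three o's
preg_match('/o{2,}/', $word, $twoOrMore);  // two or more o's, greedy
$bob = preg_match('/o{2}/', 'Bob');        // "Bob" has only one o

echo $greedy[0], PHP_EOL;    // <b>bold</b>
echo $lazy[0], PHP_EOL;      // <b>
echo $upToThree[0], PHP_EOL; // ooo
echo $twoOrMore[0], PHP_EOL; // ooooo
echo $bob, PHP_EOL;          // 0
```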

5. Locators

Locators (also called anchors) describe the boundary of a string or a word. ^ and $ match the beginning and the end of a string, \b matches a word boundary (the start or end of a word), and \B matches a position that is not a word boundary. Qualifiers cannot be applied to locators.
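A short sketch of the locators in action, with made-up sample strings. Note that locators match positions, not characters.

```php
<?php
// \b requires a word boundary on both sides of "cat".
$a = preg_match('/\bcat\b/', 'the cat sat');   // 1: "cat" stands alone as a word
$b = preg_match('/\bcat\b/', 'concatenate');   // 0: "cat" is inside a longer word

// \B requires the opposite: no word boundary on either side.
$c = preg_match('/\Bcat\B/', 'concatenate');   // 1

// ^ and $ force the pattern to cover the whole string.
$d = preg_match('/^[0-9]+$/', '12345');        // 1: digits only
$e = preg_match('/^[0-9]+$/', '123 45');       // 0: the space breaks the run of digits
$f = preg_match('/[0-9]+/', '123 45');         // 1: without locators, a partial match is enough

echo "$a $b $c $d $e $f", PHP_EOL;             // 1 0 1 1 0 1
```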

6. Selection

Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the text they match is captured and saved. You can put ?: at the start of the group to eliminate this side effect.

?: is one of the non-capturing elements. The other two are ?= and ?!, which do more than just suppress capturing. The former is a positive lookahead: it matches at any position where the text that follows matches the pattern in parentheses. The latter is a negative lookahead: it matches at any position where the text that follows does not match the pattern in parentheses.
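The sketch below contrasts a capturing group with a non-capturing one and then tries both kinds of lookahead. The file name and version strings are made-up examples.

```php
<?php
$file = 'photo.png'; // hypothetical file name

// Capturing group: the chosen alternative is stored as element 1.
preg_match('/\.(jpg|png)$/', $file, $capturing);
// Non-capturing group: same alternatives, nothing extra is stored.
preg_match('/\.(?:jpg|png)$/', $file, $nonCapturing);

// Positive lookahead: match "Windows" only when 95 or 98 follows.
// The lookahead itself is not part of the matched text.
preg_match('/Windows(?=95|98)/', 'Windows98', $ahead);

// Negative lookahead: match "Windows" only when 95 or 98 does not follow.
$neg98 = preg_match('/Windows(?!95|98)/', 'Windows98'); // 0
$negXP = preg_match('/Windows(?!95|98)/', 'WindowsXP'); // 1

print_r($capturing);     // [0] => .png, [1] => png
print_r($nonCapturing);  // [0] => .png
echo $ahead[0], PHP_EOL; // Windows
```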

7. Backreferences

Placing parentheses around a regular expression pattern, or around part of one, causes the text it matches to be stored in a temporary buffer. Captured submatches are stored in the order in which they are encountered from left to right in the pattern. Buffer numbers start at 1 and continue up to a maximum of 99 subexpressions. Each buffer can be accessed with '\n', where n is a one- or two-digit decimal number that identifies a particular buffer.

The non-capturing metacharacters '?:', '?=', and '?!' can be used to avoid saving the related match.
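A common use of backreferences is finding a word that is accidentally repeated. In this sketch, with a made-up sentence, \1 refers back to whatever the first pair of parentheses captured, and '$1' in the preg_replace() replacement string refers to the same buffer.

```php
<?php
$text = 'this is is a test'; // hypothetical input with a doubled word

// ([a-z]+) captures a word into buffer 1; \1 then requires the same word again.
// The \b locators keep the pattern from matching inside longer words.
$pattern = '/\b([a-z]+) +\1\b/';

preg_match($pattern, $text, $m);
// $m[0] is the whole match, $m[1] is the content of buffer 1.

// In the replacement string, $1 inserts the content of buffer 1,
// so the doubled word collapses to a single copy.
$fixed = preg_replace($pattern, '$1', $text);

echo $m[0], PHP_EOL;  // is is
echo $m[1], PHP_EOL;  // is
echo $fixed, PHP_EOL; // this is a test
```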

That concludes this article on the basics of PHP regular expressions. A follow-up article on more advanced usage will be published later, so please stay tuned.