SoFunction
Updated on 2025-04-10

Regular Expression Learning Notes

A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace a matching substring, or to extract from a string the substrings that meet a certain condition.
When listing directories, the *.txt in dir *.txt or ls *.txt is not a regular expression, because * there means something different from * in a regular expression.
For ease of understanding and memorization, these notes start with some concepts. A summary table of all special characters and combinations follows, and some examples at the end illustrate the corresponding concepts.
Regular expressions
A regular expression is a text pattern composed of ordinary characters (such as the characters a to z) and special characters (called metacharacters). The regular expression acts as a template that is matched against the searched string.
A regular expression can be constructed by putting the components of the expression pattern between a pair of delimiters, that is, /expression/.
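As a minimal sketch of the /expression/ form, assuming a JavaScript engine such as Node.js, the snippet below builds the same pattern once as a literal and once from a string. The pattern ab+c and the variable names are illustrative and do not come from these notes.

```javascript
// A regular expression literal: the pattern sits between two slashes.
const literal = /ab+c/;

// The same pattern built from a string with the RegExp constructor.
const constructed = new RegExp("ab+c");

// test() reports whether the pattern matches anywhere in the searched string.
const inText = literal.test("xxabbbcxx");
const alsoInText = constructed.test("xxabbbcxx");
const notInText = literal.test("ac"); // at least one b is required

console.log(inText, alsoInText, notInText); // true true false
```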
Normal characters
Normal characters are all printing and non-printing characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks and some symbols.
Non-printing characters
Character  Meaning
\cx  Matches the control character specified by x. For example, \cM matches Control-M, the carriage return. The value of x must be a letter in A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\f  Matches a form feed. Equivalent to \x0c and \cL.
\n  Matches a newline. Equivalent to \x0a and \cJ.
\r  Matches a carriage return. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including spaces, tabs, form feeds and so on. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Matches a tab. Equivalent to \x09 and \cI.
\v  Matches a vertical tab. Equivalent to \x0b and \cK.
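The escapes in the table above can be tried out with a short sketch. It assumes a JavaScript engine such as Node.js, and the sample text and variable names are made up for illustration.

```javascript
// Sample text containing tabs, a carriage return and newlines.
const text = "name:\tAda\r\nrole:\tengineer\n";

// \s matches spaces, tabs, carriage returns and newlines alike,
// so splitting on \s+ leaves only the visible fields.
const fields = text.split(/\s+/).filter((part) => part.length > 0);

// \x09 names the same character as \t, so this counts the tabs.
const tabCount = (text.match(/\x09/g) || []).length;

// \cM is Control-M, the carriage return.
const hasCarriageReturn = /\cM/.test(text);

console.log(fields, tabCount, hasCarriageReturn);
```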
Special characters
Special characters are characters with a special meaning, such as the * in "*.txt" mentioned above, which simply stands for any string. To find a file with a literal * in its name, escape the * by putting a \ before it: ls \*.txt. Regular expressions have the following special characters.
Special character  Description
$  Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )  Mark the start and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
*  Matches the preceding subexpression zero or more times. To match the * character, use \*.
+  Matches the preceding subexpression one or more times. To match the + character, use \+.
.  Matches any single character except the newline \n. To match ., use \.
[  Marks the beginning of a bracket expression. To match [, use \[.
?  Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy. To match the ? character, use \?.
\  Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", while '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".
^  Matches the start position of the input string, unless it is used inside a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{  Marks the beginning of a quantifier expression. To match {, use \{.
|  Indicates a choice between two items. To match |, use \|.
Regular expressions are constructed the same way as arithmetic expressions: metacharacters and operators combine small expressions into larger ones. The components of a regular expression can be single characters, character sets, character ranges, choices between characters, or any combination of these.
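The escaping rule for special characters can be illustrated with a small sketch, again assuming a JavaScript engine such as Node.js. The file names and the price string are invented for the example.

```javascript
// Unescaped, the dot is a metacharacter and matches any character except a newline.
const looseMatch = /a.txt/.test("a-txt");

// Escaped, it matches only a literal dot.
const strictMatch = /a\.txt/.test("a-txt");
const literalDot = /a\.txt/.test("a.txt");

// A literal dollar sign, asterisk and parentheses each need a backslash.
const price = /\$\d+ \(\*\)/.test("cost: $25 (*)");

console.log(looseMatch, strictMatch, literalDot, price); // true false true true
```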
Quantifiers
Quantifiers (also called qualifiers) specify how many times a given component of a regular expression must appear for a match to succeed. There are six of them: *, +, ?, {n}, {n,} and {n,m}.
The quantifiers *, + and ? are greedy: they match as much text as possible. Adding a ? after any of them makes the match non-greedy, or minimal.
The quantifiers of regular expressions are:
Character  Description
*  Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" and "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "fooooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}  m and n are both non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.
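The quantifier examples from the table can be checked with a brief sketch, assuming a JavaScript engine such as Node.js. The ^ and $ anchors are added here so that each test covers the whole string, and the variable names are illustrative.

```javascript
// * allows zero or more o's, + requires at least one.
const starMatches = /^zo*$/.test("z") && /^zo*$/.test("zoo");
const plusNeedsOne = /^zo+$/.test("z");

// ? makes the group (es) optional.
const optionalGroup = ["do", "does"].every((word) => /^do(es)?$/.test(word));

// {n} needs exactly n repetitions in a row.
const bobHasTwo = /o{2}/.test("Bob");
const foodHasTwo = "food".match(/o{2}/)[0];

// {n,m} is greedy up to m: only the first three of eight o's are taken.
const bounded = ("f" + "o".repeat(8) + "d").match(/o{1,3}/)[0];

// Greedy versus non-greedy on the string "oooo".
const greedy = "oooo".match(/o+/)[0];
const lazy = "oooo".match(/o+?/)[0];

console.log(starMatches, plusNeedsOne, optionalGroup, bobHasTwo);
console.log(foodHasTwo, bounded, greedy, lazy);
```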
Anchors
Anchors (also called locators) describe the boundary of a string or a word. ^ and $ refer to the beginning and end of a string respectively, \b describes the leading or trailing boundary of a word, and \B means a non-word boundary. Quantifiers cannot be applied to anchors.
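A short sketch of the anchors, assuming a JavaScript engine such as Node.js. In JavaScript the m flag plays the role of the Multiline property, and the sample strings are illustrative.

```javascript
// ^ and $ pin the pattern to the start and end of the string.
const atStart = /^Chapter/.test("Chapter 1");
const notAtStart = /^Chapter/.test("See Chapter 1");
const digitAtEnd = /\d$/.test("Chapter 1");

// With the m (multiline) flag, ^ also matches right after a line break.
const secondLinePlain = /^b/.test("a\nb");
const secondLineMultiline = /^b/m.test("a\nb");

// \b requires a word boundary after "er"; \B requires that there is none.
const boundaryNever = /er\b/.test("never");
const boundaryVerb = /er\b/.test("verb");
const nonBoundaryVerb = /er\B/.test("verb");
const nonBoundaryNever = /er\B/.test("never");

console.log(atStart, notAtStart, digitAtEnd, secondLinePlain, secondLineMultiline);
console.log(boundaryNever, boundaryVerb, nonBoundaryVerb, nonBoundaryNever);
```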
Alternation
Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Using parentheses has a side effect: the related match is cached. To eliminate this side effect, put ?: before the first alternative.
?: is one of the non-capturing elements. The other two are ?= and ?!, which carry additional meaning. The former is a positive lookahead: it matches the search string at any position where text matching the pattern in the parentheses begins. The latter is a negative lookahead: it matches the search string at any position where text matching the pattern does not begin.
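The difference between capturing groups, non-capturing groups and lookaheads can be sketched as follows, assuming a JavaScript engine such as Node.js. The "industries" and "Windows" strings are sample inputs chosen for illustration.

```javascript
// A capturing group stores its submatch in the result array.
const capturing = "industries".match(/industr(y|ies)/);

// With ?: the group still selects an alternative but nothing is stored.
const nonCapturing = "industries".match(/industr(?:y|ies)/);

// Positive lookahead: "Windows " must be followed by one of the versions.
const positiveHit = /Windows (?=95|98|NT|2000)/.test("Windows 2000");
const positiveMiss = /Windows (?=95|98|NT|2000)/.test("Windows 3.1");

// Negative lookahead: "Windows " must not be followed by one of the versions.
const negativeHit = /Windows (?!95|98|NT|2000)/.test("Windows 3.1");
const negativeMiss = /Windows (?!95|98|NT|2000)/.test("Windows 2000");

// The lookahead text is checked but not included in the match.
const matchedText = "Windows 2000".match(/Windows (?=2000)/)[0];

console.log(capturing.length, nonCapturing.length, JSON.stringify(matchedText));
```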
Backreferences
Putting parentheses around a regular expression pattern, or part of one, causes the related match to be stored in a temporary buffer. The captured submatches are stored in the order in which they are encountered from left to right in the pattern. Buffer numbers start at 1 and go up to a maximum of 99 subexpressions. Each buffer can be accessed with '\n', where n is a one- or two-digit decimal number that identifies a specific buffer.
The non-capturing metacharacters '?:', '?=' and '?!' can be used to skip saving the related match.
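A minimal sketch of backreferences, assuming a JavaScript engine such as Node.js. The sample sentence is invented for illustration. \1 refers back to whatever the first pair of parentheses captured, and $1 does the same inside a replacement string.

```javascript
// (.)\1 matches any character followed by the same character again.
const hasDoubleLetter = /(.)\1/.test("hello");
const noDoubleLetter = /(.)\1/.test("world");

// A whole word, a space, then the same word again (case-insensitive, global).
const repeatedWord = /\b([a-z]+) \1\b/gi;
const sentence = "Is is the cost of of gasoline going up up?";

// All doubled words found in the sentence.
const doubled = sentence.match(repeatedWord);

// $1 in the replacement keeps a single copy of each doubled word.
const cleaned = sentence.replace(repeatedWord, "$1");

console.log(doubled, cleaned);
```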
Operator precedence
Operators with the same priority are evaluated from left to right; among operators with different priorities, the higher one is evaluated first. From highest to lowest, the priorities are as follows:
Operator  Description
\  Escape character
(), (?:), (?=), []  Parentheses and square brackets
*, +, ?, {n}, {n,}, {n,m}  Quantifiers
^, $, \anymetacharacter  Anchors and sequences
|  "OR" operation
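The effect of these priorities can be seen in a small sketch, assuming a JavaScript engine such as Node.js. The patterns and test words are illustrative.

```javascript
// | has the lowest priority, so /^a|b$/ reads as "(starts with a) or (ends with b)".
const loose = /^a|b$/;
const looseApple = loose.test("apple");
const looseCrab = loose.test("crab");

// Grouping the alternatives makes the anchors apply to both of them.
const strict = /^(?:a|b)$/;
const strictApple = strict.test("apple");
const strictB = strict.test("b");

// A quantifier binds tighter than a sequence: ab+ means a followed by b+.
const sequenceAbbb = /^ab+$/.test("abbb");
const sequenceAbab = /^ab+$/.test("abab");
const groupedAbab = /^(?:ab)+$/.test("abab");

console.log(looseApple, looseCrab, strictApple, strictB);
console.log(sequenceAbbb, sequenceAbab, groupedAbab);
```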
Summary of all symbols
Character  Description
\  Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", while '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".
^  Matches the start position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$  Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*  Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" and "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "fooooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}  m and n are both non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.
?  When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, while the default greedy pattern matches as much of it as possible. For example, for the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
.  Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'.
(pattern)  Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)  Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use. This is useful when combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern)  Positive lookahead: matches the search string at any position where a string matching pattern begins. This is a non-capturing match; that is, the match is not captured for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000", but not "Windows" in "Windows 3.1". A lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)  Negative lookahead: matches the search string at any position where a string not matching pattern begins. This is a non-capturing match; that is, the match is not captured for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1", but not "Windows" in "Windows 2000". A lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y  Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]  A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]  A negated character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]  A character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter in the range 'a' to 'z'.
[^a-z]  A negated character range. Matches any character not in the specified range. For example, '[^a-z]' matches any character not in the range 'a' to 'z'.
\b  Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never", but not the 'er' in "verb".
\B  Matches a non-word boundary. 'er\B' matches the 'er' in "verb", but not the 'er' in "never".
\cx  Matches the control character specified by x. For example, \cM matches Control-M, the carriage return. The value of x must be a letter in A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d  Matches a digit character. Equivalent to [0-9].
\D  Matches a non-digit character. Equivalent to [^0-9].
\f  Matches a form feed. Equivalent to \x0c and \cL.
\n  Matches a newline. Equivalent to \x0a and \cJ.
\r  Matches a carriage return. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including spaces, tabs, form feeds and so on. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Matches a tab. Equivalent to \x09 and \cI.
\v  Matches a vertical tab. Equivalent to \x0b and \cK.
\w  Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W  Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn  Matches n, where n is a hexadecimal escape value. A hexadecimal escape value must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num  Matches num, where num is a positive integer. A reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n  Identifies an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm  Identifies an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and both n and m are octal digits (0-7), \nm matches the octal escape value nm.
\nml  If n is an octal digit (0-3) and both m and l are octal digits (0-7), matches the octal escape value nml.
\un  Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
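A few rows of the summary table can be exercised with a short sketch, assuming a JavaScript engine such as Node.js. The sample strings are illustrative, and [\s\S] is used here as one common way to match any character including a newline.

```javascript
const twoLines = "line1\nline2";

// The dot stops at a newline, so it cannot span both lines.
const dotSpansLines = /^.+$/.test(twoLines);

// [\s\S] (whitespace or non-whitespace) matches every character, newline included.
const classSpansLines = /^[\s\S]+$/.test(twoLines);

// \x41 is the two-digit hexadecimal escape for "A".
const hexEscape = /\x41/.test("A");

// \u00A9 is the four-digit Unicode escape for the copyright symbol.
const unicodeEscape = /\u00A9/.test("\u00A9 2025");

// \d matches digits, \w matches letters, digits and the underscore.
const digits = "Room 101".match(/\d+/)[0];
const word = "snake_case!".match(/\w+/)[0];

console.log(dotSpansLines, classSpansLines, hexEscape, unicodeEscape, digits, word);
```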
Some examples
Regular expression  Description
/\b([a-z]+) \1\b/gi  Finds positions where a word occurs twice in a row
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/  Breaks a URL into protocol, domain, port and relative path
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/  Matches a chapter or section heading
/[-a-z]/  Matches the 26 letters a to z plus the hyphen (-)
/ter\b/  Matches "chapter" but not "terminal"
/\Bapt/  Matches "chapter" but not "aptitude"
/Windows(?=95 |98 |NT )/  Matches "Windows" in "Windows95 ", "Windows98 " or "WindowsNT ". After a match is found, the next search starts right after "Windows".
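Several of these example patterns can be run as a sketch, assuming a JavaScript engine such as Node.js. The URL on example.com and the other input strings are hypothetical and were chosen only for illustration.

```javascript
// Break a URL into protocol, domain, port and relative path.
const urlPattern = /(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/;
const parts = "http://www.example.com:80/docs/page.html".match(urlPattern);
const [, protocol, domain, port, path] = parts;

// Chapter or section headings numbered 1 to 99.
const heading = /^(?:Chapter|Section) [1-9][0-9]{0,1}$/;
const chapter12 = heading.test("Chapter 12");
const chapter0 = heading.test("Chapter 0");
const section100 = heading.test("Section 100");

// Word-boundary examples.
const terInChapter = /ter\b/.test("chapter");
const terInTerminal = /ter\b/.test("terminal");
const aptInChapter = /\Bapt/.test("chapter");
const aptInAptitude = /\Bapt/.test("aptitude");

// The lookahead checks for "95 " but only "Windows" is part of the match.
const windowsMatch = "Windows95 ".match(/Windows(?=95 |98 |NT )/)[0];

console.log(protocol, domain, port, path, windowsMatch);
```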