Getting started with the basics of regular expressions

Preface

Regular expressions can be tedious to learn, but they are powerful. Once you have learned them, they will not only make you more efficient but also give you a real sense of accomplishment. If you read this material carefully and keep a reference at hand when applying it, mastering regular expressions is well within reach.

1. Introduction

Regular expressions are now widely used. You can find them in *nix operating systems (Linux, Unix, HP and others), in development environments such as PHP, C# and Java, and in many application programs.

Regular expressions let you express complex matching tasks in a compact way. The price of that compactness and power is that the notation is hard to read and takes some effort to learn. Once you are past the basics and have a reference at hand, they are relatively simple and effective to use.

Example: ^.+@.+\..+$

Code like this has scared me away many times, and it has probably scared away many other people as well. Reading on should let you use such patterns with confidence.
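
As a rough illustration of what the example pattern does, here is a minimal Python sketch using the standard re module. It assumes the pattern is meant to be read as ^.+@.+\..+$ (some text, an @, more text, a literal dot, more text). The function name looks_like_email and the sample addresses are made up for this example, and the check is only a loose shape test, not real email validation.

```python
import re

# Assumed reading of the example pattern: text, "@", text, a literal ".", text.
EMAIL_SHAPE = re.compile(r"^.+@.+\..+$")

def looks_like_email(text: str) -> bool:
    """Return True if text has the rough shape something@something.something."""
    return EMAIL_SHAPE.search(text) is not None

print(looks_like_email("user@example.com"))  # True
print(looks_like_email("user@example"))      # False: no dot after the @
print(looks_like_email("example.com"))       # False: no @
```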

Note: Part 7 repeats some of the earlier content. It restates parts of the earlier tables in a different way to make them easier to understand.

2. History of regular expressions

The "ancestor" of regular expressions can be traced back to early research on how the human nervous system works. Warren McCulloch and Walter Pitts, two neurophysiologists, developed a mathematical way of describing these neural networks.

In 1956, a mathematician named Stephen Kleene, building on the early work of McCulloch and Pitts, published a paper titled "Representation of Events in Nerve Nets and Finite Automata", which introduced the concept of regular expressions. Regular expressions were the expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression".

Later, this work was found to be applicable to early research on computational search algorithms by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was the qed editor in Unix.

As they say, the rest is history. Regular expressions have been an important part of text-based editors and search tools ever since.

3. Regular expression definition

A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a certain substring, to replace the matching substring, or to extract a substring that meets some condition.

When listing directories, the *.txt in dir *.txt or ls *.txt is not a regular expression, because * means something different there than it does in a regular expression.
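
To make the difference between a shell wildcard and a regular expression concrete, the following Python sketch compares the two. It assumes the standard fnmatch and re modules; the file names are invented for illustration.

```python
import fnmatch
import re

# Shell-style wildcard: * means "any run of characters".
print(fnmatch.fnmatchcase("notes.txt", "*.txt"))      # True

# As a regular expression, a leading * has nothing to repeat,
# so Python refuses to compile it.
try:
    re.compile("*.txt")
    star_is_valid_regex = True
except re.error:
    star_is_valid_regex = False
print(star_is_valid_regex)                            # False

# A regular expression with the same intent as the wildcard *.txt.
TXT_FILE = re.compile(r"^.*\.txt$")
print(TXT_FILE.search("notes.txt") is not None)       # True
print(TXT_FILE.search("notes.txt.bak") is not None)   # False
```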

A regular expression is a text pattern composed of ordinary characters (such as the letters a to z) and special characters (called metacharacters). It acts as a template that is matched against the string being searched.

3.1 Ordinary characters

Ordinary characters are all printing and non-printing characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some symbols.

3.2 Non-printing characters

Character  Meaning
\cx  Matches the control character specified by x. For example, \cM matches Control-M, the carriage return character. The value of x must be in the range A-Z or a-z; otherwise, c is treated as a literal 'c' character.
\f  Matches a form feed character. Equivalent to \x0c and \cL.
\n  Matches a newline character. Equivalent to \x0a and \cJ.
\r  Matches a carriage return character. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Matches a tab character. Equivalent to \x09 and \cI.
\v  Matches a vertical tab character. Equivalent to \x0b and \cK.
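
The escapes in this table can be tried out with a short Python sketch. Two caveats apply: as far as I know, Python's re module does not support the \cx control-character form, so it is left out here, and for text strings Python's \s also matches some Unicode whitespace beyond the six characters listed. The sample string is made up.

```python
import re

# One of each whitespace character from the table, between letters.
sample = "a b\tc\nd\r\fe\vf"

# \s picks out every whitespace character in the sample, in order.
print(re.findall(r"\s", sample))

# \S picks out everything else.
print("".join(re.findall(r"\S", sample)))   # abcdef

# The hexadecimal escapes name the same characters as the short escapes.
print(re.fullmatch(r"\x09\x0a\x0d\x0c\x0b", "\t\n\r\f\v") is not None)  # True
```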

3.3 Special characters

Special characters are characters with a special meaning, such as the * in "*.txt" mentioned above, which roughly means "any string". To find files with a literal * in their names, you need to escape the *, that is, put a \ in front of it: ls \*.txt. Regular expressions have the following special characters.

Special character  Description
$  Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )  Mark the start and end of a subexpression. The subexpression's match can be retrieved for later use. To match these characters, use \( and \).
*  Matches the preceding subexpression zero or more times. To match the * character, use \*.
+  Matches the preceding subexpression one or more times. To match the + character, use \+.
.  Matches any single character except the newline character \n. To match ., use \.
[  Marks the beginning of a bracket expression. To match [, use \[.
?  Matches the preceding subexpression zero or one time, or marks a qualifier as non-greedy. To match the ? character, use \?.
\  Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n', while '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".
^  Matches the start of the input string, unless it is used inside a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{  Marks the beginning of a qualifier expression. To match {, use \{.
|  Indicates a choice between two items. To match |, use \|.

The method of constructing regular expressions is the same as the method of creating mathematical expressions. That is, use multiple metacharacters and operators to combine small expressions to create larger expressions. Components of regular expressions can be a single character, a set of characters, a range of characters, a choice between characters, or any combination of all of these components.
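
The effect of escaping special characters can be sketched in Python as follows. The file name a*b.txt is a made-up example; re.escape is a standard library helper that inserts the backslashes automatically.

```python
import re

filename = "a*b.txt"

# Unescaped, * and . are metacharacters, so this pattern matches
# strings that look nothing like the file name.
loose = re.compile(r"^a*b.txt$")
print(loose.search("aaabXtxt") is not None)    # True

# Escaped by hand, each metacharacter stands for itself.
strict = re.compile(r"^a\*b\.txt$")
print(strict.search(filename) is not None)     # True
print(strict.search("aaabXtxt") is not None)   # False

# re.escape adds the backslashes for us.
auto = re.compile("^" + re.escape(filename) + "$")
print(auto.search(filename) is not None)       # True
print(auto.search("aaabXtxt") is not None)     # False
```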

3.4 Qualifiers

Qualifiers specify how many times a given component of a regular expression must occur for a match to succeed. There are six of them: *, +, ?, {n}, {n,} and {n,m}.

The *, + and ? qualifiers are all greedy, because they match as much text as possible. Appending a ? to a qualifier makes it non-greedy, so that it matches as little text as possible.

The qualifiers of regular expressions are:

Character  Description
*  Matches the preceding subexpression zero or more times. For example, zo* matches "z" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "fooooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}  m and n are both non-negative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
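
The qualifiers in the table can be checked with a few lines of Python. This is only a sketch with made-up inputs; the string of five o's is built explicitly so the counts are easy to follow.

```python
import re

# * allows zero or more, + requires at least one, ? allows zero or one.
print(re.fullmatch(r"zo*", "z") is not None)          # True
print(re.fullmatch(r"zo+", "z") is not None)          # False
print(re.fullmatch(r"do(es)?", "does") is not None)   # True

# {n}, {n,} and {n,m} count repetitions explicitly.
print(re.search(r"o{2}", "Bob"))                      # None
print(re.search(r"o{2}", "food").group())             # oo
print(re.findall(r"o{1,3}", "f" + "o" * 5 + "d"))     # ['ooo', 'oo']

# Greedy versus non-greedy on the same input.
print(re.search(r"o+", "oooo").group())               # oooo
print(re.search(r"o+?", "oooo").group())              # o
```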

3.5 Locators (anchors)

Locators describe the boundary of a string or a word. ^ and $ refer to the beginning and end of a string respectively. \b describes the boundary at the front or back of a word, and \B represents a non-word boundary. Qualifiers cannot be applied to locators.
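
A short Python sketch of the boundary markers described above. The helper name found is hypothetical, and the test words are chosen only for illustration.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# \b requires a word boundary right after "er"; \B requires that there is none.
print(found(r"er\b", "never"))   # True
print(found(r"er\b", "verb"))    # False
print(found(r"er\B", "verb"))    # True
print(found(r"er\B", "never"))   # False

# ^ and $ tie the pattern to the start and end of the string.
print(found(r"^cat$", "cat"))           # True
print(found(r"^cat$", "concatenate"))   # False
```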

3.6 Alternation

Enclose the alternatives in parentheses and separate adjacent alternatives with |. Using parentheses has a side effect, however: the match is captured (cached). Placing ?: at the start of the group eliminates this side effect.

?: is one of the non-capturing elements. The other two are ?= and ?!, which carry additional meaning. The former is a positive lookahead, which matches at any position where the pattern in the parentheses matches the text that follows. The latter is a negative lookahead, which matches at any position where the pattern in the parentheses does not match the text that follows.
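
The three non-capturing forms can be compared side by side in Python. This is a sketch using the "industry" and "Windows" patterns that appear elsewhere in this article; the variable names are made up.

```python
import re

# A capturing group remembers what it matched; (?:...) does not.
print(re.search(r"industr(y|ies)", "industries").groups())     # ('ies',)
print(re.search(r"industr(?:y|ies)", "industries").groups())   # ()

# Positive lookahead: match "Windows " only when a listed version follows.
ahead = re.compile(r"Windows (?=95|98|NT|2000)")
print(repr(ahead.search("Windows 2000").group()))   # 'Windows ' (version not consumed)
print(ahead.search("Windows 3.1"))                  # None

# Negative lookahead: match "Windows " only when no listed version follows.
not_ahead = re.compile(r"Windows (?!95|98|NT|2000)")
print(repr(not_ahead.search("Windows 3.1").group()))   # 'Windows '
print(not_ahead.search("Windows 2000"))                # None
```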

3.7 Backreferences

Putting parentheses around a regular expression pattern or part of a pattern causes the corresponding match to be stored in a temporary buffer. Captured submatches are stored in the order in which their groups appear from left to right in the pattern. Buffer numbers start at 1 and run consecutively up to a maximum of 99 subexpressions. Each buffer can be accessed with '\n', where n is a one- or two-digit decimal number identifying a particular buffer.

The non-capturing metacharacters '?:', '?=' and '?!' can be used to avoid saving the related match.
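
Here is a minimal Python sketch of how capture buffers are numbered and referenced. The sample strings are invented for illustration.

```python
import re

# \1 refers back to whatever the first group captured.
print(re.search(r"(.)\1", "hello").group())    # ll

# Groups are numbered by their opening parenthesis, left to right.
m = re.search(r"((\d+)-(\d+))", "pages 10-25")
print(m.group(1), m.group(2), m.group(3))      # 10-25 10 25

# A non-capturing group does not use up a buffer number.
m2 = re.search(r"(?:ab)+(c)", "ababc")
print(m2.group(1))                             # c
```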

4. Operator precedence

Operators with the same priority are evaluated from left to right; operators with different priorities are evaluated from highest to lowest. From highest to lowest priority, the operators are:

Operator  Description
\  Escape character
(), (?:), (?=), []  Parentheses and square brackets
*, +, ?, {n}, {n,}, {n,m}  Qualifiers
^, $, \anymetacharacter  Locators and sequences
|  "OR" operation
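
The consequences of this ordering can be seen in a small Python sketch. The helper name whole is hypothetical; it simply checks whether a pattern matches an entire string.

```python
import re

def whole(pattern: str, text: str) -> bool:
    """Return True if pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# A qualifier binds more tightly than a sequence: in ab+ only the b repeats.
print(whole(r"ab+", "abbb"))      # True
print(whole(r"ab+", "abab"))      # False
print(whole(r"(ab)+", "abab"))    # True

# | has the lowest priority: z|food means "z" or "food", not "zood" or "food".
print(whole(r"z|food", "z"))         # True
print(whole(r"z|food", "zood"))      # False
print(whole(r"(z|f)ood", "zood"))    # True
```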

5. Explanation of all symbols

Characters Description
\  Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", while '\n' matches a newline character. The sequence '\\' matches "\" and "\(" matches "(".
^  Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$  Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*  Matches the preceding subexpression zero or more times. For example, zo* matches "z" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "fooooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}  m and n are both non-negative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
?  When this character immediately follows any other qualifier (*, +, ?, {n}, {n,}, {n,m}), the matching is non-greedy. Non-greedy matching matches as little of the string as possible, while the default greedy matching matches as much as possible. For example, for the string "oooo", 'o+?' matches a single "o", and 'o+' matches all the o's.
.  Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'.
(pattern)  Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)  Matches pattern but does not capture the match, that is, it is a non-capturing match that is not stored for later use. This is useful when combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more compact expression than 'industry|industries'.
(?=pattern)  Positive lookahead. Matches the search string at any position where a string matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookahead does not consume characters: after a match, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)  Negative lookahead. Matches the search string at any position where a string not matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1" but not "Windows" in "Windows 2000". Lookahead does not consume characters: after a match, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y  match x or y. For example, 'z|food' can match "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]  Character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]  Negated character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]  Character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter in the range 'a' to 'z'.
[^a-z]  Negated character range. Matches any character not in the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' to 'z'.
\b  Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B  Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\cx  Matches the control character specified by x. For example, \cM matches Control-M, the carriage return character. The value of x must be in the range A-Z or a-z; otherwise, c is treated as a literal 'c' character.
\d  Matches a digit. Equivalent to [0-9].
\D  Matches a non-digit character. Equivalent to [^0-9].
\f  Matches a form feed character. Equivalent to \x0c and \cL.
\n  Matches a newline character. Equivalent to \x0a and \cJ.
\r  Matches a carriage return character. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Match a tab character. Equivalent to \x09 and \cI.
\v  Match a vertical tab character. Equivalent to \x0b and \cK.
\w  Match any word character that includes an underscore. Equivalent to '[A-Za-z0-9_]'.
\W  Match any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn  Matches n, where n is a hexadecimal escape value. The hexadecimal escape value must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num  Matches num, where num is a positive integer. A reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n  Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm  Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and both n and m are octal digits (0-7), \nm matches the octal escape value nm.
\nml  If n is an octal digit (0-3) and both m and l are octal digits (0-7), matches the octal escape value nml.
\un  Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
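
A few entries from this table can be tried out with the Python sketch below. It uses [\s\S] as one way to match any character including a newline, and passes re.ASCII so that \d and \w are limited to the ASCII ranges given in the table (by default Python lets them match other Unicode digits and letters as well). The sample strings are made up.

```python
import re

text = "line one\nline two"

# "." stops at a newline, so ".+" only covers the first line.
print(re.match(r".+", text).group())                  # line one

# [\s\S] matches any character at all, including "\n".
print(re.match(r"[\s\S]+", text).group() == text)     # True

# Shorthand classes, restricted to ASCII to mirror the table.
print(re.findall(r"\d+", "call 555 0199", flags=re.ASCII))
# ['555', '0199']
print(re.findall(r"\w+", "snake_case, kebab-case", flags=re.ASCII))
# ['snake_case', 'kebab', 'case']

# Hexadecimal and Unicode escapes.
print(re.fullmatch(r"\x41", "A") is not None)         # True
print(re.fullmatch(r"\u00A9", "\u00a9") is not None)  # True
```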

6. Some examples

Regular expression  Description
/\b([a-z]+) \1\b/gi  Finds places where a word is repeated consecutively
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/  Splits a URL into protocol, domain, port and relative path
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/  Matches a chapter or section heading numbered 1 to 99
/[-a-z]/  Matches any of the 26 letters a to z, or a hyphen (-)
/ter\b/  Matches the ter in chapter, but not in terminal
/\Bapt/  Matches the apt in chapter, but not in aptitude
/Windows(?=95 |98 |NT )/  Matches the Windows in "Windows95 ", "Windows98 " or "WindowsNT " (each alternative includes a trailing space). After a match is found, the next search starts immediately after Windows.
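
Three of these examples are sketched below in Python. The /.../gi literal syntax is replaced by re.compile with re.IGNORECASE, and the escaped slashes are not needed in a Python string. The sample sentence and the example.com URL are made up for illustration.

```python
import re

# Consecutive repeated words.
doubled = re.compile(r"\b([a-z]+) \1\b", re.IGNORECASE)
sentence = "the cost of of gasoline is going up up"
print([m.group(0) for m in doubled.finditer(sentence)])
# ['of of', 'up up']

# Split a URL into protocol, domain, port and path.
url = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^# ]*)")
parts = url.search("http://www.example.com:80/docs/index.html").groups()
print(parts)
# ('http', 'www.example.com', ':80', '/docs/index.html')

# Chapter or section headings numbered 1 to 99.
heading = re.compile(r"^(?:Chapter|Section) [1-9][0-9]{0,1}$")
print(heading.search("Chapter 12") is not None)    # True
print(heading.search("Section 100") is not None)   # False
```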

7. Regular expression matching rules

7.1 Basic Pattern Matching

Let's start with the basics. Patterns are the most basic element of regular expressions: a set of characters that describes the characteristics of a string. A pattern can be very simple, consisting of an ordinary string, or very complex, often using special characters to represent a range of characters, repetition, or context. For example:

^once

This pattern contains the special character ^, which means the pattern only matches strings that start with once. For example, it matches the string "once upon a time" but not "There once was a man from NewYork". Just as ^ marks the beginning, the $ symbol matches strings that end with a given pattern.

bucket$

This pattern matches "Who kept all of his cash in a bucket" but not "buckets". When ^ and $ are used together, they specify an exact match (the string is identical to the pattern). For example:

^bucket$

This matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern

once

matches the string

There once was a man from NewYork
Who kept all of his cash in a bucket.

The letters in this pattern (o-n-c-e) are literal characters, that is, they stand for themselves; digits work the same way. Some other characters, such as certain punctuation marks and whitespace characters (spaces, tabs, and so on), need escape sequences. Every escape sequence begins with a backslash (\). The escape sequence for the tab character is \t, so to check whether a string starts with a tab, we can use this pattern:

^\t

Similarly, \n means "newline" and \r means "carriage return". Other special symbols can be matched by putting a backslash in front of them: the backslash itself is written \\, a period is written \., and so on.
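
The patterns in this section can be tried out with a minimal Python sketch. The helper name found is hypothetical; re.search looks for the pattern anywhere in the string, so the ^ and $ markers do the work of tying it to the start or end.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

line = "There once was a man from NewYork"

print(found(r"^once", "once upon a time"))   # True
print(found(r"^once", line))                 # False: "once" is not at the start
print(found(r"once", line))                  # True: no anchors, so anywhere will do

print(found(r"bucket$", "Who kept all of his cash in a bucket"))   # True
print(found(r"bucket$", "buckets"))                                # False
print(found(r"^bucket$", "bucket"))                                # True

print(found(r"^\t", "\tindented"))   # True: the string starts with a tab
```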

7.2 Character clusters

In Internet applications, regular expressions are usually used to validate user input. After the user submits a form, you need to check whether the phone number, address, email address, credit card number, and so on are valid, and patterns made of plain literal characters are not enough for that.

So we need a more flexible way to describe the pattern we want: the character cluster. To create a character cluster representing all vowels, put all the vowel characters inside square brackets:

[AaEeIiOoUu]

This pattern matches any vowel, but it stands for only one character. A hyphen can be used to represent a range of characters, such as:

[a-z] // Matches all lowercase letters
[A-Z] // Matches all uppercase letters
[a-zA-Z] // Matches all letters
[0-9] // Matches all digits
[0-9\.\-] // Matches all digits, periods and minus signs
[ \f\r\t\n] // Matches all whitespace characters

Similarly, these also represent only one character, which is very important. If you want to match a string consisting of a lowercase letter and a digit, such as "z2", "t6" or "g7", but not "ab2", "r2d3" or "b52", use this pattern:

^[a-z][0-9]$

Although [a-z] stands for a range of 26 letters, here it matches only one character: the first character of the string, which must be a lowercase letter.

It was mentioned earlier that ^ represents the beginning of a string, but it has another meaning. When ^ is used inside square brackets, it means "not" or "exclude", and is often used to rule out certain characters. Continuing the previous example, suppose we require that the first character is not a digit:

^[^0-9][0-9]$

This pattern matches "&5", "g7" and "-2", but does not match "12" and "66". Here are a few examples of excluding specific characters:

[^a-z] // All characters except lowercase letters
[^\\\/\^] // All characters except (\), (/) and (^)
[^\"\'] // All characters except double quotes (") and single quotes (')

The special character "." (dot, period) represents any character except a newline. So the pattern "^.5$" matches any two-character string that ends with the digit 5 and starts with any character other than a newline. The pattern "." on its own matches any string except the empty string and a string that consists of just a newline.
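
The character cluster examples above can be checked with the following Python sketch. It reuses the sample strings from the text; the variable names are made up.

```python
import re

# One lowercase letter followed by one digit, and nothing else.
letter_digit = re.compile(r"^[a-z][0-9]$")
candidates = ["z2", "t6", "g7", "ab2", "r2d3", "b52"]
print([s for s in candidates if letter_digit.search(s)])
# ['z2', 't6', 'g7']

# Inside brackets, ^ negates: any non-digit followed by one digit.
not_digit_then_digit = re.compile(r"^[^0-9][0-9]$")
candidates2 = ["&5", "g7", "-2", "12", "66"]
print([s for s in candidates2 if not_digit_then_digit.search(s)])
# ['&5', 'g7', '-2']

# "." matches any character except a newline.
print(re.search(r"^.5$", "a5") is not None)    # True
print(re.search(r"^.5$", "\n5") is not None)   # False
```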

PHP regular expressions have some built-in general character clusters, and the list is as follows:

Character cluster  Meaning

[[:alpha:]]  Any letter
[[:digit:]]  Any digit
[[:alnum:]]  Any letter or digit
[[:space:]]  Any whitespace character
[[:upper:]]  Any uppercase letter
[[:lower:]]  Any lowercase letter
[[:punct:]]  Any punctuation mark
[[:xdigit:]]  Any hexadecimal digit; equivalent to [0-9a-fA-F]
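
Not every regular expression engine understands these named clusters. Python's re module, for example, does not, so the sketch below spells out ASCII-only stand-ins by hand. The table POSIX_EQUIVALENTS and the helper matches_class are hypothetical names introduced for this example, and the equivalents assume plain ASCII text rather than any locale-specific behavior.

```python
import re
import string

# Hypothetical lookup table: ASCII-only stand-ins for the named clusters.
POSIX_EQUIVALENTS = {
    "alpha": "[A-Za-z]",
    "digit": "[0-9]",
    "alnum": "[A-Za-z0-9]",
    "space": r"[ \t\n\r\f\v]",
    "upper": "[A-Z]",
    "lower": "[a-z]",
    # string.punctuation lists the ASCII punctuation characters;
    # re.escape protects the ones that are special inside brackets.
    "punct": "[" + re.escape(string.punctuation) + "]",
    "xdigit": "[0-9a-fA-F]",
}

def matches_class(name: str, char: str) -> bool:
    """Return True if the single character char belongs to the named cluster."""
    return re.fullmatch(POSIX_EQUIVALENTS[name], char) is not None

print(matches_class("alpha", "q"))    # True
print(matches_class("punct", "]"))    # True
print(matches_class("xdigit", "g"))   # False
print(matches_class("space", "\t"))   # True
```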

7.3 Specifying repetition

So far you know how to match a single letter or digit, but more often you will want to match a word or a number. A word consists of several letters, and a number consists of several digits. Braces ({}) placed after a character or character cluster specify how many times the preceding item is repeated.

Pattern  Meaning
^[a-zA-Z_]$  Any single letter or underscore
^[[:alpha:]]{3}$  Any three-letter word
^a$  The letter a
^a{4}$  aaaa
^a{2,4}$  aa, aaa or aaaa
^a{1,3}$  a, aa or aaa
^a{2,}$  A string of two or more a's
^a{2,}  Strings that begin with two or more a's, such as aardvark and aaab, but not apple
a{2,}  Strings that contain two or more consecutive a's, such as baad and aaa, but not Nantucket
\t{2}  Two tab characters
.{2}  Any two characters

These examples show three different uses of braces. A single number, {x}, means "the preceding character or character cluster occurs exactly x times"; a number followed by a comma, {x,}, means "the preceding item occurs x or more times"; and two comma-separated numbers, {x,y}, mean "the preceding item occurs at least x times but no more than y times". We can extend the patterns to cover more words or numbers:

^[a-zA-Z0-9_]{1,}$ // All strings consisting of one or more letters, digits or underscores
^[0-9]{1,}$ // All non-negative integers
^\-{0,1}[0-9]{1,}$ // All integers
^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$ // All decimals

The last example is not easy to read, is it? Look at it this way: it starts with an optional minus sign (\-{0,1}), followed by zero or more digits ([0-9]{0,}), an optional decimal point (\.{0,1}), zero or more digits again ([0-9]{0,}), and nothing else ($). Simpler ways of writing this are described below.

The special character "?" is equivalent to {0,1}; both mean "zero or one of the preceding item", or "the preceding item is optional". So the example just now can be simplified to:

^\-?[0-9]{0,}\.?[0-9]{0,}$

The special character "*" is equivalent to {0,}; both mean "zero or more of the preceding item". Finally, the character "+" is equivalent to {1,}, meaning "one or more of the preceding item". So the four examples above can be written as:

^[a-zA-Z0-9_]+$ // All strings consisting of one or more letters, digits or underscores
^[0-9]+$ // All non-negative integers
^\-?[0-9]+$ // All integers
^\-?[0-9]*\.?[0-9]*$ // All decimals

Of course, this does not technically reduce the complexity of the regular expressions, but it does make them easier to read.
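
The number patterns above can be exercised with the Python sketch below. It also shows a limitation worth knowing: because every part of the decimal pattern is optional, it accepts the empty string, a lone minus sign, and a lone dot. STRICT_DECIMAL is a stricter variant added here as an assumption; it is not part of the original text.

```python
import re

INTEGER = re.compile(r"^-?[0-9]+$")
LONG_INTEGER = re.compile(r"^\-{0,1}[0-9]{1,}$")   # the same pattern in brace form
DECIMAL = re.compile(r"^-?[0-9]*\.?[0-9]*$")

samples = ["42", "-7", "3.14", "-.5", "abc", "1.2.3"]
print([s for s in samples if INTEGER.search(s)])   # ['42', '-7']
print([s for s in samples if DECIMAL.search(s)])   # ['42', '-7', '3.14', '-.5']

# The brace form and the short form accept exactly the same samples.
same = all(bool(INTEGER.search(s)) == bool(LONG_INTEGER.search(s)) for s in samples)
print(same)                                        # True

# Every part of DECIMAL is optional, so degenerate strings pass too.
degenerate = [s for s in ["", "-", "."] if DECIMAL.search(s)]
print(degenerate)                                  # ['', '-', '.']

# Assumed stricter variant (not from the text): at least one digit is required.
STRICT_DECIMAL = re.compile(r"^-?(?:[0-9]+\.?[0-9]*|\.[0-9]+)$")
strict_ok = [s for s in ["", "-", ".", "-.5", "3."] if STRICT_DECIMAL.search(s)]
print(strict_ok)                                   # ['-.5', '3.']
```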