Regular expression basic tutorial

5. Complete symbol reference
Characters Description
\  Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", while '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".
^  Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$  Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*  Matches the preceding subexpression zero or more times. For example, 'zo*' can match "z" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, 'zo+' can match "zo" and "zoo", but cannot match "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, 'do(es)?' can match "do" or "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, 'o{2}' cannot match the 'o' in "Bob", but matches the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' cannot match the 'o' in "Bob", but matches all the o's in "fooooood". 'o{1,}' is equivalent to 'o+', and 'o{0,}' is equivalent to 'o*'.
{n,m}  m and n are both non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no spaces between the comma and the two numbers.
?  When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. A non-greedy pattern matches as little of the searched string as possible, while the default greedy pattern matches as much as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
.  Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'. Inside square brackets the dot is a literal period, so '[.\n]' matches only a period or a newline.
(pattern)  Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)  Matches pattern but does not capture the match, that is, this is a non-capturing match that is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern)  Positive lookahead. Matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)  Negative lookahead. Matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1" but not "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y  Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]  A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]  A negated character set. Matches any character that is not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]  A character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter from 'a' to 'z'.
[^a-z]  A negated character range. Matches any character that is not in the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' to 'z'.
\b  Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never", but not the 'er' in "verb".
\B  Matches a non-word boundary. 'er\B' matches the 'er' in "verb", but not the 'er' in "never".
\cx  Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d  Matches a digit character. Equivalent to [0-9].
\D  Matches a non-digit character. Equivalent to [^0-9].
\f  Matches a form feed character. Equivalent to \x0c and \cL.
\n  Matches a newline character. Equivalent to \x0a and \cJ.
\r  Matches a carriage return character. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Matches a tab character. Equivalent to \x09 and \cI.
\v  Matches a vertical tab character. Equivalent to \x0b and \cK.
\w  Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W  Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn  Matches n, where n is a hexadecimal escape value. A hexadecimal escape value must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num  Matches num, where num is a positive integer. This is a backreference to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n  Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm  Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and both n and m are octal digits (0-7), \nm matches the octal escape value nm.
\nml  If n is an octal digit (0-3) and both m and l are octal digits (0-7), matches the octal escape value nml.
\un  Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
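A few rows of the table can be checked with a short sketch. It uses Python's re module, which is an assumption on our part: the table describes JScript and VBScript, and Python shares the syntax for these particular constructs but differs elsewhere (for example, it has no \cx escape). The variable names are illustrative only.

```python
import re

# Greedy vs. non-greedy: 'o+' takes as many o's as possible, 'o+?' as few.
greedy = re.search(r"o+", "oooo").group()
lazy = re.search(r"o+?", "oooo").group()

# Capturing vs. non-capturing groups: only (pattern) stores a submatch.
capturing = re.search(r"industr(y|ies)", "industries").groups()
non_capturing = re.search(r"industr(?:y|ies)", "industries").groups()

# Lookahead checks what follows without making it part of the match.
positive = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000")
negative = re.search(r"Windows (?!95|98|NT|2000)", "Windows 2000")

# Backreference: \1 must repeat whatever group 1 captured.
doubled = re.search(r"(.)\1", "food").group()

print(greedy, lazy)              # oooo o
print(capturing, non_capturing)  # ('ies',) ()
print(repr(positive.group()))    # 'Windows ' (the version is not consumed)
print(negative)                  # None
print(doubled)                   # oo
```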
6. Some examples
Regular expression  Description
/\b([a-z]+) \1\b/gi  Finds every place where the same word appears twice in a row
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/  Breaks a URL down into protocol, domain, port and relative path
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/  Matches a chapter or section heading numbered from 1 to 99
/[-a-z]/  Matches any of the 26 lowercase letters a to z, or a hyphen (-)
/ter\b/  Matches the "ter" in chapter, but not in terminal
/\Bapt/  Matches the "apt" in chapter, but not in aptitude
/Windows(?=95 |98 |NT )/  Matches the "Windows" in "Windows95 ", "Windows98 " or "WindowsNT " (each followed by a space). After a match is found, the next search starts immediately after "Windows".
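The first two examples can be tried out as follows. This is a sketch in Python rather than JScript (our assumption), so the /.../gi literal becomes re.compile with re.IGNORECASE plus findall for the global search. The sample sentence and the example.com URL are made up for illustration.

```python
import re

# Repeated word: \1 must equal the word captured by group 1.
repeated = re.compile(r"\b([a-z]+) \1\b", re.IGNORECASE)
text = "Is is the cost of of gasoline going up up?"
doubles = repeated.findall(text)  # one captured word per repeated pair

# URL broken down into protocol, domain, port and relative path.
url_pattern = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^# ]*)")
parts = url_pattern.match("http://example.com:8080/docs/index.html").groups()

print(doubles)  # ['Is', 'of', 'up']
print(parts)    # ('http', 'example.com', ':8080', '/docs/index.html')
```

Note that the port group keeps its leading colon, because the colon is inside the parentheses.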
7. Regular expression matching rules
7.1 Basic Pattern Matching
Everything starts with the basics. Patterns are the most basic element of regular expressions: a set of characters that describes the features of a string. A pattern can be very simple, made up of ordinary characters, or very complex, often using special characters to represent a range of characters, repetition, or context. For example:
^once
This pattern contains the special character ^, which means that the pattern matches only strings that start with once. For example, it matches the string "once upon a time" but not "There once was a man from NewYork". Just as the ^ symbol marks the beginning, the $ symbol is used to match strings that end with a given pattern.
bucket$
This pattern matches "Who kept all of his cash in a bucket" but not "buckets". When ^ and $ are used together, they require an exact match (the string is identical to the pattern). For example:
^bucket$
Matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern
once
matches the string
There once was a man from NewYork
Who kept all of his cash in a bucket.
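The anchoring rules above can be sketched in a few lines. Python is our choice here, not the tutorial's, and the helper name matches is hypothetical; re.search is used because it looks for the pattern anywhere in the string, so any anchoring comes only from ^ and $.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern is found somewhere in the text."""
    return re.search(pattern, text) is not None

print(matches(r"^once", "once upon a time"))                      # True
print(matches(r"^once", "There once was a man from NewYork"))     # False
print(matches(r"bucket$", "Who kept all of his cash in a bucket")) # True
print(matches(r"bucket$", "buckets"))                             # False
print(matches(r"^bucket$", "bucket"))                             # True
print(matches(r"^bucket$", "a bucket"))                           # False
print(matches(r"once", "There once was a man from NewYork"))      # True
```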
The letters (o-n-c-e) in this pattern are literal characters, that is, they stand for the letters themselves; the same goes for digits. Some slightly more complex characters, such as punctuation and whitespace characters (spaces, tabs, etc.), require escape sequences. All escape sequences begin with a backslash (\). The escape sequence for the tab character is \t. So to check whether a string starts with a tab, we can use this pattern:
^\t
Similarly, \n means "newline" and \r means carriage return. Other special symbols can be matched by putting a backslash in front of them: the backslash itself is written \\, a period is written \., and so on.
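A small sketch of these escape sequences, written in Python as an assumption (the tutorial does not name a language at this point). Raw strings (r"...") are used so that the backslashes reach the regular expression engine unchanged; the variable names and sample strings are illustrative.

```python
import re

# ^\t : the string must start with a tab character.
starts_with_tab = re.compile(r"^\t")
tab_first = starts_with_tab.search("\tindented") is not None
space_first = starts_with_tab.search("  indented") is not None

# An unescaped period matches any character; \. matches only a literal period.
any_char = re.fullmatch(r"a.c", "abc") is not None
escaped_dot_on_abc = re.fullmatch(r"a\.c", "abc") is not None
escaped_dot_on_dot = re.fullmatch(r"a\.c", "a.c") is not None

# \\ in the pattern matches one literal backslash in the text.
has_backslash = re.search(r"\\", "C:\\temp") is not None

print(tab_first, space_first)                            # True False
print(any_char, escaped_dot_on_abc, escaped_dot_on_dot)  # True False True
print(has_backslash)                                     # True
```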
7.2 Character clusters
In Internet programs, regular expressions are usually used to validate user input. After the user submits a form, literal patterns alone are not enough to determine whether the entered phone number, address, email address, credit card number, and so on are valid.
So we need a more flexible way to describe the pattern we want: the character cluster. To create a character cluster representing all vowels, put all the vowel characters inside square brackets:
[AaEeIiOoUu]
This pattern matches any vowel, but it represents only one character. A hyphen can be used to express a range of characters, such as:
[a-z] //Match all lowercase letters
[A-Z] //Match all capital letters
[a-zA-Z] //Match all letters
[0-9] //Match all digits
[0-9\.\-] //Match all digits, periods and minus signs
[ \f\r\t\n] //Match all whitespace characters
Again, each of these represents only one character, which is very important. If you want to match a string consisting of one lowercase letter followed by one digit, such as "z2", "t6" or "g7", but not "ab2", "r2d3" or "b52", use this pattern:
^[a-z][0-9]$
Although [a-z] covers a range of 26 letters, here it matches only a single lowercase letter, which must be the first character of the string.
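The point that a cluster stands for exactly one character can be checked with a brief sketch (Python is our assumption; the candidate strings are the ones listed in the text, and the phrase used for the vowel search is made up):

```python
import re

# One lowercase letter followed by one digit, and nothing else.
letter_digit = re.compile(r"^[a-z][0-9]$")
candidates = ["z2", "t6", "g7", "ab2", "r2d3", "b52"]
accepted = [s for s in candidates if letter_digit.search(s)]

# The vowel cluster matches one vowel at a time.
vowels = re.findall(r"[AaEeIiOoUu]", "Regular Expression")

print(accepted)  # ['z2', 't6', 'g7']
print(vowels)    # ['e', 'u', 'a', 'E', 'e', 'i', 'o']
```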
As mentioned earlier, ^ represents the beginning of a string, but it has another meaning. When ^ is used as the first character inside square brackets, it means "not" or "exclude", and is often used to rule out certain characters. Continuing the previous example, suppose we require that the first character must not be a digit:
^[^0-9][0-9]$
This pattern matches "&5", "g7" and "-2", but does not match "12" or "66". Here are a few examples of excluding specific characters:
[^a-z] //All characters except lowercase letters
[^\\\/\^] //All characters except (\)(/)(^)
[^\"\'] //All characters except double quotes (") and single quotes (')
The special character "." (dot, period) is used in regular expressions to represent any character except a newline. So the pattern "^.5$" matches any two-character string that ends with the digit 5 and starts with any character other than a newline. The pattern "." on its own matches any string except the empty string and a string consisting of only a newline.
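Negated clusters and the dot can be tried the same way. This is a sketch in Python (our assumption); the test strings for the first pattern come from the text, and those for the second are made up.

```python
import re

# First character must NOT be a digit, second character must be a digit.
non_digit_then_digit = re.compile(r"^[^0-9][0-9]$")
accepted = [s for s in ["&5", "g7", "-2", "12", "66"]
            if non_digit_then_digit.search(s)]

# Any single non-newline character followed by the digit 5.
ends_in_5 = re.compile(r"^.5$")
dot_results = {s: ends_in_5.search(s) is not None
               for s in ["a5", "55", "\n5", "abc5"]}

print(accepted)     # ['&5', 'g7', '-2']
print(dot_results)  # {'a5': True, '55': True, '\n5': False, 'abc5': False}
```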
PHP regular expressions have some built-in general-purpose character clusters, listed below:
Character cluster  Meaning
[[:alpha:]]  Any letter
[[:digit:]]  Any digit
[[:alnum:]]  Any letter or digit
[[:space:]]  Any whitespace character
[[:upper:]]  Any capital letter
[[:lower:]]  Any lowercase letter
[[:punct:]]  Any punctuation mark
[[:xdigit:]]  Any hexadecimal digit; equivalent to [0-9a-fA-F]
7.3 Matching repeated occurrences
By now you know how to match a single letter or digit, but more often you will want to match a word or a group of digits. A word consists of several letters, and a group of digits consists of several single digits. Braces ({}) placed after a character or character cluster specify how many times the preceding item must be repeated.
Pattern  Meaning
^[a-zA-Z_]$  A single letter or underscore
^[[:alpha:]]{3}$  Any three-letter word
^a$  The letter a
^a{4}$  aaaa
^a{2,4}$  aa, aaa or aaaa
^a{1,3}$  a, aa or aaa
^a{2,}$  A string made up of two or more a's
^a{2,}  A string starting with two or more a's, such as aardvark and aaab, but not apple
a{2,}  A string containing two or more consecutive a's, such as baad and aaa, but not Nantucket
\t{2}  Two tab characters
.{2}  Any two characters
These examples show three different uses of braces. A single number, {x}, means "the preceding character or character cluster appears exactly x times"; a number followed by a comma, {x,}, means "the preceding item appears x or more times"; two comma-separated numbers, {x,y}, means "the preceding item appears at least x times, but no more than y times". We can extend the pattern to more words or numbers:
^[a-zA-Z0-9_]{1,}$ //All strings consisting of one or more letters, digits or underscores
^[0-9]{1,}$ //All strings of one or more digits (non-negative integers)
^\-{0,1}[0-9]{1,}$ //All integers
^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$ //All decimal numbers
The last example is not easy to understand, is it? Look at it this way: everything starts with an optional minus sign (\-{0,1}), followed by 0 or more digits ([0-9]{0,}), then an optional decimal point (\.{0,1}), followed by 0 or more digits ([0-9]{0,}), and nothing else ($). A simpler way of writing this is described below.
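Because every part of that pattern is optional, it is worth checking what it actually accepts. The sketch below (Python, our assumption) runs the pattern from the text unchanged against a handful of made-up inputs. The second pattern, strict_decimal, is a hypothetical variant of ours, not part of the tutorial, shown only to illustrate one way of requiring at least one digit.

```python
import re

# The pattern from the text, unchanged.
decimal = re.compile(r"^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$")

samples = ["3.14", "-0.5", "42", ".5", "1.2.3", "abc", "", "-", "."]
accepted = [s for s in samples if decimal.search(s)]

# Hypothetical stricter variant (not from the tutorial): digits are required
# either before the optional point, or after a leading point.
strict_decimal = re.compile(r"^-?(?:[0-9]+\.?[0-9]*|\.[0-9]+)$")
strict_accepted = [s for s in samples if strict_decimal.search(s)]

print(accepted)         # ['3.14', '-0.5', '42', '.5', '', '-', '.']
print(strict_accepted)  # ['3.14', '-0.5', '42', '.5']
```

The original pattern accepts the empty string, a lone minus sign and a lone period, so in practice it is usually combined with a separate check that at least one digit is present.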
The special character "?" is equivalent to {0,1}; both mean "0 or 1 of the preceding item", or "the preceding item is optional". So the example above can be simplified to:
^\-?[0-9]{0,}\.?[0-9]{0,}$
The special character "*" is equivalent to {0,}; both mean "0 or more of the preceding item". Finally, the character "+" is equivalent to {1,}, meaning "1 or more of the preceding item", so the 4 examples above can be written as:
^[a-zA-Z0-9_]+$ //All strings consisting of one or more letters, digits or underscores
^[0-9]+$ //All strings of one or more digits (non-negative integers)
^\-?[0-9]+$ //All integers
^\-?[0-9]*\.?[0-9]*$ //All decimal numbers
Of course, this does not technically reduce the complexity of the regular expressions, but it does make them easier to read.
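The equivalence between the brace forms and the ?, * and + shorthands, along with the effect of anchoring a{2,}, can be spot-checked as follows. This is a sketch in Python (our assumption); the sample strings and the helper name agree are made up, and agreement on a few samples illustrates the equivalence rather than proving it.

```python
import re

# (brace form, shorthand form) pairs taken from the text.
pairs = [
    (r"^[a-zA-Z0-9_]{1,}$", r"^[a-zA-Z0-9_]+$"),
    (r"^[0-9]{1,}$", r"^[0-9]+$"),
    (r"^\-{0,1}[0-9]{1,}$", r"^\-?[0-9]+$"),
    (r"^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$", r"^\-?[0-9]*\.?[0-9]*$"),
]
samples = ["", "abc_1", "42", "-42", "3.14", "-.5", "1.2.3", "a b"]

def agree(long_form: str, short_form: str) -> bool:
    """True if both patterns accept and reject the same sample strings."""
    return all(bool(re.search(long_form, s)) == bool(re.search(short_form, s))
               for s in samples)

all_agree = all(agree(a, b) for a, b in pairs)

# Brace counts with and without anchors.
exact_range = [s for s in ["a", "aa", "aaa", "aaaa", "aaaaa"]
               if re.search(r"^a{2,4}$", s)]
starts_with = [w for w in ["aardvark", "aaab", "apple"]
               if re.search(r"^a{2,}", w)]
anywhere = [w for w in ["baad", "aaa", "Nantucket"]
            if re.search(r"a{2,}", w)]

print(all_agree)    # True
print(exact_range)  # ['aa', 'aaa', 'aaaa']
print(starts_with)  # ['aardvark', 'aaab']
print(anywhere)     # ['baad', 'aaa']
```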