SoFunction
Updated on 2025-03-03

Regular expression quick lookup table

Characters

x    The character x
\\    The backslash character
\0n    The character with octal value 0n (0 <= n <= 7)
\0nn    The character with octal value 0nn (0 <= n <= 7)
\0mnn    The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh    The character with hexadecimal value 0xhh
\uhhhh    The character with hexadecimal value 0xhhhh
\t    The tab character ('\u0009')
\n    The newline (line feed) character ('\u000A')
\r    The carriage-return character ('\u000D')
\f    The form-feed character ('\u000C')
\a    The alert (bell) character ('\u0007')
\e    The escape character ('\u001B')
\cx    The control character corresponding to x
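
As a small illustration of the numeric escapes above, the following sketch writes the letter A in octal, hexadecimal, and Unicode notation. The class name EscapeDemo and the helper matches are hypothetical names chosen for this example only.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: true when the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Each regex denotes the letter 'A' (code point 65) in a different notation.
        System.out.println(matches("\\0101", "A"));  // octal 101 -> true
        System.out.println(matches("\\x41", "A"));   // hexadecimal 41 -> true
        System.out.println(matches("\\u0041", "A")); // four hex digits, parsed by the regex engine -> true
        System.out.println(matches("\\t", "\t"));    // regex tab escape against a real tab -> true
    }
}
```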

Character classes

[abc]    a, b, or c (simple class)
[^abc]    Any character except a, b, or c (negation)
[a-zA-Z]    a through z or A through Z, inclusive (range)
[a-d[m-p]]    a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]    d, e, or f (intersection)
[a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]    a through z, and not m through p: [a-lq-z] (subtraction)
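
The union, intersection, and subtraction rows can be checked with a short sketch like the one below. CharClassDemo and its matches helper are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: true when the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-d[m-p]]", "n"));    // union -> true
        System.out.println(matches("[a-z&&[def]]", "e"));  // intersection -> true
        System.out.println(matches("[a-z&&[def]]", "g"));  // outside the intersection -> false
        System.out.println(matches("[a-z&&[^bc]]", "b"));  // b is subtracted -> false
        System.out.println(matches("[a-z&&[^m-p]]", "q")); // q survives the subtraction -> true
    }
}
```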

Predefined character classes

.    Any character (may or may not match line terminators)
\d    A digit: [0-9]
\D    A non-digit: [^0-9]
\s    A whitespace character: [ \t\n\x0B\f\r]
\S    A non-whitespace character: [^\s]
\w    A word character: [a-zA-Z_0-9]
\W    A non-word character: [^\w]
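
A minimal sketch of the predefined classes in everyday use, assuming default flags. The class PredefinedDemo and its three helper methods are hypothetical and exist only for this example.

```java
class PredefinedDemo {
    // Replaces every ASCII digit with '#'.
    static String maskDigits(String s) {
        return s.replaceAll("\\d", "#");
    }

    // Splits on runs of whitespace after trimming the ends.
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    // True when the input consists only of word characters [a-zA-Z_0-9].
    static boolean isWordOnly(String s) {
        return s.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(maskDigits("tel 555-0199"));          // tel ###-####
        System.out.println(String.join("|", words("  a\tb\n c "))); // a|b|c
        System.out.println(isWordOnly("snake_case1"));           // true
        System.out.println(isWordOnly("two words"));             // false
    }
}
```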

POSIX character classes (US-ASCII only)

\p{Lower}    A lowercase alphabetic character: [a-z]
\p{Upper}    An uppercase alphabetic character: [A-Z]
\p{ASCII}    All ASCII: [\x00-\x7F]
\p{Alpha}    An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}    A decimal digit: [0-9]
\p{Alnum}    An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}    Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}    A visible character: [\p{Alnum}\p{Punct}]
\p{Print}    A printable character: [\p{Graph}\x20]
\p{Blank}    A space or a tab: [ \t]
\p{Cntrl}    A control character: [\x00-\x1F\x7F]
\p{XDigit}    A hexadecimal digit: [0-9a-fA-F]
\p{Space}    A whitespace character: [ \t\n\x0B\f\r]
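
The sketch below exercises a few POSIX classes and shows the US-ASCII restriction under default flags. PosixDemo and its helpers are hypothetical names for this example.

```java
class PosixDemo {
    // Removes every ASCII punctuation character.
    static String stripPunctuation(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    // True when the input is a non-empty run of hexadecimal digits.
    static boolean isHex(String s) {
        return s.matches("\\p{XDigit}+");
    }

    // True when the input is a non-empty run of ASCII uppercase letters.
    static boolean isAsciiUpper(String s) {
        return s.matches("\\p{Upper}+");
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuation("Hello, world!")); // Hello world
        System.out.println(isHex("CAFEbabe"));                 // true
        System.out.println(isAsciiUpper("ABC"));               // true
        // U+00C9 (E with acute) is uppercase but not ASCII, so it is rejected by default.
        System.out.println(isAsciiUpper("\u00C9"));            // false
    }
}
```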

java.lang.Character classes (simple java character type)

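\p{javaLowerCase}    Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}    Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}    Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}    Equivalent to java.lang.Character.isMirrored()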

Unicode blocks and classes

\p{InGreek}    A character in the Greek block (simple block)
\p{Lu}    An uppercase letter (simple category)
\p{Sc}    A currency symbol
\P{InGreek}    Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]    Any letter except an uppercase letter (subtraction)
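
A short sketch of the Unicode block and category rows. Non-ASCII inputs are written as Java Unicode escapes so the example does not depend on source-file encoding. UnicodeDemo and matches are hypothetical names.

```java
import java.util.regex.Pattern;

class UnicodeDemo {
    // Hypothetical helper: true when the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{InGreek}", "\u03B1"));    // Greek small alpha -> true
        System.out.println(matches("\\P{InGreek}", "a"));         // Latin a is outside the block -> true
        System.out.println(matches("\\p{Lu}", "\u00C9"));         // uppercase E with acute -> true
        System.out.println(matches("\\p{Sc}", "\u20AC"));         // euro sign -> true
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "a")); // letter, not uppercase -> true
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "A")); // uppercase is subtracted -> false
    }
}
```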

Boundary matchers

^    The beginning of a line
$    The end of a line
\b    A word boundary
\B    A non-word boundary
\A    The beginning of the input
\G    The end of the previous match
\Z    The end of the input but for the final terminator, if any
\z    The end of the input
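
The difference between \Z and \z, and the effect of \b, is easiest to see on concrete input. The sketch below assumes default flags; BoundaryDemo and find are hypothetical names.

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: true when the regex matches somewhere in the input.
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(find("\\bcat\\b", "a cat sat"));   // whole word -> true
        System.out.println(find("\\bcat\\b", "concatenate")); // inside a word -> false
        System.out.println(find("end\\Z", "the end\n"));      // \Z tolerates a final terminator -> true
        System.out.println(find("end\\z", "the end\n"));      // \z requires the very end -> false
        System.out.println(find("\\Aend", "the end"));        // "end" is not at the start -> false
    }
}
```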

Greedy quantifiers

X?    X, once or not at all
X*    X, zero or more times
X+    X, one or more times
X{n}    X, exactly n times
X{n,}    X, at least n times
X{n,m}    X, at least n but not more than m times

Reluctant quantifiers

X??    X, once or not at all
X*?    X, zero or more times
X+?    X, one or more times
X{n}?    X, exactly n times
X{n,}?    X, at least n times
X{n,m}?    X, at least n but not more than m times

Possessive quantifiers

X?+    X, once or not at all
X*+    X, zero or more times
X++    X, one or more times
X{n}+    X, exactly n times
X{n,}+    X, at least n times
X{n,m}+    X, at least n but not more than m times
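
The three quantifier tables list identical counts; what differs is how much input each kind consumes and whether it gives any back. A minimal sketch, with QuantifierDemo and firstMatch as hypothetical names, runs the same input against each kind:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: returns the first match, or null when there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: consumes everything, then backtracks just enough to match "foo".
        System.out.println(firstMatch(".*foo", input));  // xfooxxxxxxfoo
        // Reluctant: consumes as little as possible before "foo".
        System.out.println(firstMatch(".*?foo", input)); // xfoo
        // Possessive: consumes everything and never backtracks, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input)); // null
    }
}
```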

Logical operators

XY    X followed by Y
X|Y    Either X or Y
(X)    X, as a capturing group

Back references

\n    Whatever the nth capturing group matched
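
A back reference matches the same text a group captured earlier in the pattern, not the group's sub-pattern again. The sketch below uses \1 to detect doubled letters and to collapse repeated words. BackrefDemo and its method names are hypothetical.

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // True when some word character is immediately followed by itself.
    static boolean hasDoubledLetter(String s) {
        return Pattern.compile("(\\w)\\1").matcher(s).find();
    }

    // Collapses an immediately repeated word ("is is") into one occurrence.
    static String collapseRepeatedWords(String s) {
        return s.replaceAll("\\b(\\w+)\\s+\\1\\b", "$1");
    }

    public static void main(String[] args) {
        System.out.println(hasDoubledLetter("letter"));                    // true
        System.out.println(hasDoubledLetter("abc"));                       // false
        System.out.println(collapseRepeatedWords("this is is a test"));    // this is a test
    }
}
```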

Quotation

\    Nothing, but quotes the following character
\Q    Nothing, but quotes all characters until \E
\E    Nothing, but ends the quoting started by \Q
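
Quoting matters whenever literal text contains metacharacters. The sketch below matches the literal string 1+1=2 three ways: with a single-character escape, with \Q...\E, and with the standard Pattern.quote method. QuoteDemo and matches are hypothetical names.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical helper: true when the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String literal = "1+1=2";
        // Unquoted, + is a quantifier, so the pattern means "one or more 1s, then 1=2".
        System.out.println(matches("1+1=2", literal));                // false
        System.out.println(matches("1+1=2", "11=2"));                 // true
        System.out.println(matches("1\\+1=2", literal));              // single escape -> true
        System.out.println(matches("\\Q1+1=2\\E", literal));          // quoted span -> true
        System.out.println(matches(Pattern.quote(literal), literal)); // library quoting -> true
    }
}
```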

Special constructs (non-capturing)

(?:X)    X, as a non-capturing group
(?idmsux-idmsux)    Nothing, but turns match flags i d m s u x on - off
(?idmsux-idmsux:X)    X, as a non-capturing group with the given flags i d m s u x on - off
(?=X)    X, via zero-width positive lookahead
(?!X)    X, via zero-width negative lookahead
(?<=X)    X, via zero-width positive lookbehind
(?<!X)    X, via zero-width negative lookbehind
(?>X)    X, as an independent, non-capturing group
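
The following sketch touches each family in the table: an inline flag, lookahead, lookbehind, and an independent group. SpecialDemo and its helpers are hypothetical names, and the sample patterns are illustrative rather than recommended validation rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpecialDemo {
    // Hypothetical helper: true when the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Hypothetical helper: returns the first match, or null when there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(matches("(?i)hello", "HELLO"));                    // inline flag -> true
        // Two positive lookaheads: a digit and a lowercase letter must appear somewhere.
        System.out.println(matches("(?=.*\\d)(?=.*[a-z]).{8,}", "passw0rd")); // true
        System.out.println(matches("(?=.*\\d)(?=.*[a-z]).{8,}", "password")); // false
        System.out.println(firstMatch("\\bq(?!u)\\w*", "quit qat"));          // qat
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost $42"));           // 42
        // The independent group keeps all three a's, leaving none for the trailing "ab".
        System.out.println(matches("a+ab", "aaab"));                          // true
        System.out.println(matches("(?>a+)ab", "aaab"));                      // false
    }
}
```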

Backslashes, escapes, and references

The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that would otherwise be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash, and \{ matches a left brace.

It is an error to use a backslash before any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used before a non-alphabetic character regardless of whether that character is part of an unescaped construct.

As required by the Java Language Specification, backslashes within string literals in Java source code are interpreted as either Unicode escapes or other character escapes. It is therefore necessary to double backslashes in string literals that represent regular expressions, to protect them from interpretation by the Java compiler. For example, the string literal "\b" matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; to match the string (hello), the string literal "\\(hello\\)" must be used.
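
The two layers of escaping (Java string literal first, regex engine second) can be sketched as follows. LiteralDemo is a hypothetical class name used only for this example.

```java
import java.util.regex.Pattern;

class LiteralDemo {
    // True when the regex matches somewhere in the input.
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\b" is a one-character string (backspace); as a regex it matches that character.
        System.out.println(Pattern.matches("\b", "\b"));             // true
        // "\\b" reaches the regex engine as \b, the word-boundary matcher.
        System.out.println(find("\\bcat\\b", "a cat"));              // true
        // "\\(hello\\)" reaches the engine as \(hello\), i.e. literal parentheses.
        System.out.println(Pattern.matches("\\(hello\\)", "(hello)")); // true
        // Four backslashes in source give the regex \\, which matches one backslash.
        System.out.println(Pattern.matches("\\\\", "\\"));           // true
    }
}
```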

Character classes

Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.

The precedence of character-class operators is as follows, from highest to lowest:
Literal escape    \x
Grouping    [...]
Range    a-z
Union    [a-e][i-u]
Intersection    [a-z&&[aeiou]]
Note that a different set of metacharacters is in effect inside a character class than outside it. For example, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range-forming metacharacter.

Line terminators

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029').
If UNIX_LINES mode is activated, then the only line terminator recognized is the newline character.

The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.

By default, the regular expressions ^ and $ ignore line terminators and match only at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated, ^ matches at the beginning of the input and after any line terminator except at the end of the input. In MULTILINE mode, $ matches just before a line terminator or at the end of the input sequence.
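
The effect of the three flags on line terminators can be sketched like this. The flag constants are the standard ones on java.util.regex.Pattern; LineDemo and its helpers are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineDemo {
    // Counts non-overlapping matches of the regex compiled with the given flags.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    // True when the entire input matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        System.out.println(count("^\\w+$", 0, text));                 // 0: anchors see only the whole input
        System.out.println(count("^\\w+$", Pattern.MULTILINE, text)); // 3: anchors apply per line
        System.out.println(matches("a.b", 0, "a\nb"));                // false: . skips line terminators
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));   // true
        System.out.println(matches("a.b", 0, "a\rb"));                // false: \r is a terminator
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\rb")); // true: only \n counts
    }
}
```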

Groups and capturing

Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
       1    ((A)(B(C)))
       2    (A)
       3    (B(C))
       4    (C)
Group zero always stands for the entire expression.

Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification, its previously captured value, if any, is retained if the second evaluation fails. For example, matching the string "aba" against the expression (a(b)?)+ leaves group two set to "b". All captured input is discarded at the beginning of each match.

Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.
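
The numbering rule and the "aba" example can be checked directly with a Matcher. In this sketch, GroupDemo and groups are hypothetical names; group, groupCount, and matches are the standard Matcher methods.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns groups 0..groupCount() after a full match, or null when the input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) return null;
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i < out.length; i++) out[i] = m.group(i);
        return out;
    }

    public static void main(String[] args) {
        // Groups are numbered by opening parenthesis; index 0 is the whole match.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC"))); // [ABC, ABC, A, BC, C]
        // Group 2 keeps "b" from the first iteration because (b)? matches nothing in the second.
        System.out.println(Arrays.toString(groups("(a(b)?)+", "aba")));    // [aba, a, b]
        // (?:a) does not count towards the group total.
        System.out.println(Arrays.toString(groups("(?:a)(b)", "ab")));     // [ab, b]
    }
}
```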
