Characters
x The character x
\\ The backslash character
\0n The character with octal value 0n (0 <= n <= 7)
\0nn The character with octal value 0nn (0 <= n <= 7)
\0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh The character with hexadecimal value 0xhh
\uhhhh The character with hexadecimal value 0xhhhh
\t The tab character ('\u0009')
\n The newline (line feed) character ('\u000A')
\r The carriage-return character ('\u000D')
\f The form-feed character ('\u000C')
\a The alert (bell) character ('\u0007')
\e The escape character ('\u001B')
\cx The control character corresponding to x
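The octal, hexadecimal and Unicode escapes in the table above are different spellings of the same character. The following is a minimal sketch of that idea; the class name EscapeDemo and the helper matchesA are hypothetical and exist only for illustration.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    // True when the given regex matches exactly the one-character input "A".
    static boolean matchesA(String regex) {
        return Pattern.matches(regex, "A");
    }

    public static void main(String[] args) {
        System.out.println(matchesA("\\0101"));  // octal 101 is 65, the code of 'A'
        System.out.println(matchesA("\\x41"));   // hexadecimal 41 is also 65
        System.out.println(matchesA("\\u0041")); // four hex digits, same character
        System.out.println(Pattern.matches("\\cI", "\t")); // control-I is the tab character
    }
}
```

Each backslash is doubled because the patterns are written as Java string literals.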
Character classes
[abc] a, b, or c (simple class)
[^abc] Any character except a, b, or c (negation)
[a-zA-Z] a through z or A through Z, inclusive (range)
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] d, e, or f (intersection)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)
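Union, intersection and subtraction can be checked one character at a time. The sketch below is only an illustration; the class ClassOpsDemo and its helper in are hypothetical names.

```java
import java.util.regex.Pattern;

final class ClassOpsDemo {
    // True when the single character c belongs to the given character class.
    static boolean in(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        System.out.println(in("[a-d[m-p]]", 'n'));    // union: n lies in m-p
        System.out.println(in("[a-z&&[def]]", 'a'));  // intersection: a is not d, e or f
        System.out.println(in("[a-z&&[^bc]]", 'b'));  // subtraction: b is excluded
        System.out.println(in("[a-z&&[^m-p]]", 'q')); // subtraction: q is outside m-p
    }
}
```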
Predefined character classes
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
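The predefined classes are most often used for splitting and cleaning text. A small sketch follows, assuming plain ASCII input; PredefinedDemo and its three helpers are hypothetical names.

```java
import java.util.regex.Pattern;

final class PredefinedDemo {
    // Splits the text on runs of whitespace (\s+).
    static String[] words(String text) {
        return text.trim().split("\\s+");
    }

    // Removes every non-digit character (\D).
    static String digitsOnly(String text) {
        return text.replaceAll("\\D", "");
    }

    // True when the whole string consists of word characters (\w).
    static boolean isWord(String s) {
        return Pattern.matches("\\w+", s);
    }

    public static void main(String[] args) {
        System.out.println(words(" a\tb \n c ").length);     // 3
        System.out.println(digitsOnly("+1 (555) 010-9999")); // 15550109999
        System.out.println(isWord("snake_case1"));           // true
    }
}
```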
POSIX character classes (US-ASCII only)
\p{Lower} A lowercase alphabetic character: [a-z]
\p{Upper} An uppercase alphabetic character: [A-Z]
\p{ASCII} All ASCII: [\x00-\x7F]
\p{Alpha} An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit} A decimal digit: [0-9]
\p{Alnum} An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct} Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph} A visible character: [\p{Alnum}\p{Punct}]
\p{Print} A printable character: [\p{Graph}\x20]
\p{Blank} A space or a tab: [ \t]
\p{Cntrl} A control character: [\x00-\x1F\x7F]
\p{XDigit} A hexadecimal digit: [0-9a-fA-F]
\p{Space} A whitespace character: [ \t\n\x0B\f\r]
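As a rough illustration of the "US-ASCII only" restriction, the sketch below tests whole strings against a POSIX class. It assumes the default flags (no UNICODE_CHARACTER_CLASS); PosixDemo and all are hypothetical names.

```java
import java.util.regex.Pattern;

final class PosixDemo {
    // True when every character of input belongs to the named POSIX class.
    static boolean all(String posixClass, String input) {
        return Pattern.matches("\\p{" + posixClass + "}+", input);
    }

    public static void main(String[] args) {
        System.out.println(all("XDigit", "1aF"));     // true
        System.out.println(all("Punct", "!?"));       // true
        System.out.println(all("Alnum", "abc123"));   // true
        // The last letter is e with an acute accent, which is outside US-ASCII.
        System.out.println(all("Alpha", "caf\u00e9")); // false
    }
}
```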
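java.lang.Character classes (simple java character type)
\p{javaLowerCase} Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase} Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace} Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored} Equivalent to java.lang.Character.isMirrored()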
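Unlike the POSIX classes, these classes follow the java.lang.Character predicates and therefore cover non-ASCII characters. A minimal sketch; JavaClassDemo and its matches helper are hypothetical names.

```java
import java.util.regex.Pattern;

final class JavaClassDemo {
    // True when the regex matches the single character with the given code point.
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        int eAcute = 0xE9; // lowercase e with an acute accent
        System.out.println(matches("\\p{javaLowerCase}", eAcute)); // true, like Character.isLowerCase
        System.out.println(matches("\\p{Lower}", eAcute));         // false, POSIX class is ASCII only
        System.out.println(matches("\\p{javaMirrored}", '('));     // true, like Character.isMirrored
        System.out.println(matches("\\p{javaWhitespace}", 0xA0));  // false, no-break space is excluded
    }
}
```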
Classes for Unicode blocks and categories
\p{InGreek} A character in the Greek block (simple block)
\p{Lu} An uppercase letter (simple category)
\p{Sc} A currency symbol
\P{InGreek} Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)
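The block and category classes can be tried on a few individual characters. This is only a sketch; UnicodeDemo and matches are hypothetical names, and the characters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

final class UnicodeDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03b1";        // Greek small letter alpha
        String capitalAlpha = "\u0391"; // Greek capital letter alpha
        String euro = "\u20ac";         // euro sign

        System.out.println(matches("\\p{InGreek}", alpha));        // true
        System.out.println(matches("\\P{InGreek}", "a"));          // true
        System.out.println(matches("\\p{Lu}", capitalAlpha));      // true
        System.out.println(matches("\\p{Sc}", euro));              // true
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "A"));  // false
    }
}
```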
Boundary matchers
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final line terminator, if any
\z The end of the input
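The difference between \Z and \z, and the effect of \b and \G, is easiest to see on small inputs. The sketch below is illustrative only; BoundaryDemo, found and count are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundaryDemo {
    // True when the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Counts successive matches found by repeated calls to find().
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(found("\\bcat\\b", "a cat sat"));   // true
        System.out.println(found("\\bcat\\b", "concatenate")); // false
        System.out.println(found("end\\Z", "end\n"));          // true, final terminator is ignored
        System.out.println(found("end\\z", "end\n"));          // false, input ends after the newline
        System.out.println(count("\\G\\d", "12a3"));           // 2, the chain of matches breaks at 'a'
    }
}
```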
Greedy quantifiers
X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n but not more than m times
Reluctant quantifiers
X?? X, once or not at all
X*? X, zero or more times
X+? X, one or more times
X{n}? X, exactly n times
X{n,}? X, at least n times
X{n,m}? X, at least n but not more than m times
Possessive quantifiers
X?+ X, once or not at all
X*+ X, zero or more times
X++ X, one or more times
X{n}+ X, exactly n times
X{n,}+ X, at least n times
X{n,m}+ X, at least n but not more than m times
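The three quantifier families accept the same repetition counts and differ only in how much input they try first and whether they give it back. A minimal sketch of the difference; QuantifierDemo and first are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    // Returns the first match of regex in input, or null when there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(first("<.*>", "<a><b>"));  // <a><b>  greedy takes as much as possible
        System.out.println(first("<.*?>", "<a><b>")); // <a>     reluctant takes as little as possible
        System.out.println(first(".*foo", "xfoo"));   // xfoo    greedy backtracks to let foo match
        System.out.println(first(".*+foo", "xfoo"));  // null    possessive never gives characters back
    }
}
```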
Logical operators
XY X followed by Y
X|Y Either X or Y
(X) X, as a capturing group
Back references
\n Whatever the nth capturing group matched
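A common use of a back reference is finding an immediately repeated word. The following is a sketch under that assumption; BackReferenceDemo and firstRepeatedWord are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackReferenceDemo {
    // \1 must match exactly the text that capturing group 1 matched.
    private static final Pattern REPEATED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns the first word that is immediately repeated, or null when there is none.
    static String firstRepeatedWord(String text) {
        Matcher m = REPEATED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("it is is fine"));   // is
        System.out.println(firstRepeatedWord("no repeats here")); // null
    }
}
```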
Quotation
\ Nothing, but quotes the following character
\Q Nothing, but quotes all characters until \E
\E Nothing, but ends quoting started by \Q
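Quoting turns metacharacters back into literals. The sketch below compares an unquoted pattern with its quoted forms; QuoteDemo and matches are hypothetical names, while Pattern.quote is the standard library method that produces a quoted literal pattern for a string.

```java
import java.util.regex.Pattern;

final class QuoteDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String text = "1+1=2";
        System.out.println(matches("1+1=2", text));             // false, + is a quantifier here
        System.out.println(matches("1\\+1=2", text));           // true, single character quoted
        System.out.println(matches("\\Q1+1=2\\E", text));       // true, whole span quoted
        System.out.println(matches(Pattern.quote(text), text)); // true, quoting done by the library
    }
}
```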
Special constructs (non-capturing)
(?:X) X, as a non-capturing group
(?idmsux-idmsux) Nothing, but turns match flags i d m s u x on - off
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags i d m s u x on - off
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
(?<=X) X, via zero-width positive lookbehind
(?<!X) X, via zero-width negative lookbehind
(?>X) X, as an independent, non-capturing group
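The constructs above either switch flags or test the surrounding text without consuming it. A short sketch with made-up inputs; SpecialConstructDemo and first are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpecialConstructDemo {
    // Returns the first match of regex in input, or null when there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(first("\\d+(?= USD)", "pay 100 USD")); // 100, lookahead is not part of the match
        System.out.println(first("(?<=\\$)\\d+", "cost $45"));    // 45, lookbehind is not part of the match
        System.out.println(first("q(?!u)", "quit"));              // null, q is followed by u
        System.out.println(Pattern.matches("(?i)hello", "HeLLo"));  // true, case-insensitive flag
        System.out.println(Pattern.matches("a(?i:b)c", "ABc"));     // false, the flag covers only b
        System.out.println(Pattern.matches("(?>a+)ab", "aaab"));    // false, the group keeps all the a's
    }
}
```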
Backslashes, escapes, and quoting
The backslash character ('\') introduces escaped constructs, as defined in the table above, and also quotes characters that would otherwise be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash, and \{ matches a left brace.
It is an error to use a backslash before any alphabetic character that does not denote an escaped construct; these are reserved for future extensions of the regular-expression language. A backslash may be used before a non-alphabetic character regardless of whether that character is part of an unescaped construct.
Backslashes within string literals in Java source code are interpreted, as required by the Java Language Specification, as either Unicode escapes or other character escapes. Backslashes must therefore be doubled in string literals that represent regular expressions, to protect them from interpretation by the Java compiler. For example, the string literal "\b" matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and causes a compile-time error; to match the string (hello), the string literal "\\(hello\\)" must be used.
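The two layers of escaping (Java string literal first, regular expression second) can be sketched as follows. BackslashDemo and compiles are hypothetical names; the pattern \y is used as an example of a backslash before a letter with no assigned meaning, on the assumption that current Java versions still reject it.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class BackslashDemo {
    // True when the regex compiles without a syntax error.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Four backslashes in source are two in the regex, which match one in the input.
        System.out.println(Pattern.matches("\\\\", "\\"));            // true
        System.out.println(Pattern.matches("\\(hello\\)", "(hello)")); // true
        // A single-backslash b in a literal is the backspace character itself.
        System.out.println(Pattern.matches("\b", "\b"));               // true
        // Backslash before a non-alphabetic character is always allowed.
        System.out.println(compiles("\\{"));                           // true
        // Backslash before an unassigned letter is rejected (assumed for \y).
        System.out.println(compiles("\\y"));                           // false
    }
}
```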
Character classes
Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.
The precedence of character-class operators is as follows, from highest to lowest:
1 Literal escape \x
2 Grouping [...]
3 Range a-z
4 Union [a-e][i-u]
5 Intersection [a-z&&[aeiou]]
Note that a different set of metacharacters is in effect inside a character class than outside it. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range-forming metacharacter.
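The change of metacharacters inside a class can be checked directly. A minimal sketch; MetacharDemo and matches are hypothetical names.

```java
import java.util.regex.Pattern;

final class MetacharDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches(".", "a"));       // true, outside a class . matches any character
        System.out.println(matches("[.]", "a"));     // false, inside a class . is a literal dot
        System.out.println(matches("[a-c]", "b"));   // true, - forms the range a through c
        System.out.println(matches("[a\\-c]", "b")); // false, escaped - is a literal hyphen
        System.out.println(matches("[a\\-c]", "-")); // true
    }
}
```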
Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:
A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029').
If UNIX_LINES mode is activated, the newline character is the only line terminator recognized.
The regular expression . matches any character except a line terminator, unless the DOTALL flag is specified.
By default, the regular expressions ^ and $ ignore line terminators and match only at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated, ^ matches at the beginning of the input and after any line terminator except at the end of the input. In MULTILINE mode, $ matches just before a line terminator or at the end of the input sequence.
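The effect of the MULTILINE, DOTALL and UNIX_LINES flags can be sketched on a three-line input. LineTerminatorDemo, count and matches are hypothetical names; the flag constants are the ones defined on java.util.regex.Pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineTerminatorDemo {
    // Counts the matches of regex in input when compiled with the given flags.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // True when regex, compiled with the given flags, matches the whole input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lines = "one\ntwo\nthree";
        System.out.println(count("^\\w+$", 0, lines));                 // 0, ^ and $ see only the whole input
        System.out.println(count("^\\w+$", Pattern.MULTILINE, lines)); // 3, one match per line
        System.out.println(matches(".", 0, "\n"));                     // false
        System.out.println(matches(".", Pattern.DOTALL, "\n"));        // true
        System.out.println(matches(".", Pattern.UNIX_LINES, "\r"));    // true, only newline ends a line
    }
}
```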
Groups and capturing
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
1 ((A)(B(C)))
2 (A)
3 (B(C))
4 (C)
Group zero always stands for the entire expression.
Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification, its previously captured value, if any, is retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total (named capturing groups of the form (?<name>X), available since Java 7, are the exception).
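The numbering rule and the "most recent match" rule can both be observed by listing every group after a successful match. The following is a sketch; GroupDemo and groups are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    // Returns groups 0..groupCount() for a full match, or null when input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parenthesis, left to right.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC"))); // [ABC, ABC, A, BC, C]
        // Group 2 keeps "b" although the last iteration of group 1 matched only "a".
        System.out.println(Arrays.toString(groups("(a(b)?)+", "aba")));    // [aba, a, b]
        // The (?: ...) group is not counted, so (b) is group 1.
        System.out.println(Arrays.toString(groups("(?:a)(b)", "ab")));     // [ab, b]
    }
}
```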