Comparison of Linux shell regular expression types (BREs, EREs, PREs)

In computer science, a regular expression is a single string that describes or matches a set of strings conforming to a certain syntactic rule. Many text editors and other tools use regular expressions to search for and/or replace text that fits a pattern, and many programming languages support string manipulation with regular expressions; Perl, for example, has a powerful regular expression engine built in. The concept was originally popularized by Unix tools such as sed and grep. "Regular expression" is usually abbreviated as "regex" or "regexp" in the singular, and as "regexps", "regexes", or "regexen" in the plural.

Because regular expressions originated on Unix, many syntax rules are shared between tools. Over time, however, the syntax was extended into the types listed below, and knowing them is the basis for learning regular expressions.

1. Regular expression classification:

1. Basic Regular Expression (Basic RegEx, also known as BREs)
2. Extended Regular Expression (Extended RegEx, also known as EREs)
3. Perl Regular Expression (Perl RegEx, also known as PREs)

Note: Only by mastering regular expressions can you fully understand how to use the common text tools in Linux (for example: grep, egrep, GNU sed, awk, etc.).

2. The relationship between commonly used text tools and regular expressions in Linux

Knowing which type each tool supports makes it much easier to use regular expressions well.

Features of grep and egrep regular expressions

1) grep supports BREs, EREs, and PREs
grep without any of the options below uses BREs
grep with the "-E" option uses EREs
grep with the "-P" option uses PREs (a GNU grep option that requires PCRE support)

2) egrep supports EREs
egrep is equivalent to "grep -E", so it uses EREs by default
Combining egrep with "-P" is not reliable (some GNU grep versions reject it as conflicting matchers); use "grep -P" for PREs

3) What grep and egrep match against, and how they process files
a. Input processed by grep and egrep: text files (or standard input)
b. How grep and egrep work: they check whether the text contains the "keyword" being searched for (the keyword can be a regular expression). If it does, each line containing the "keyword" is written to standard output by default, which is the terminal unless the output is redirected, for example with ">".
c. grep and egrep process text files line by line
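
As a minimal sketch of the options summarized above, the following runs the same search in BRE and ERE syntax. The sample words and variable names are made up for illustration, the "\?" form in a BRE is a GNU extension, and the "-P" part is guarded because it only works where GNU grep was built with PCRE support.

```shell
# Hypothetical sample input: one word per line.
words=$(printf '%s\n' where whereis whereisis wher)

# BRE (default): the group and the "?" must be backslash-escaped (GNU syntax).
bre=$(printf '%s\n' "$words" | grep -x 'where\(is\)\?')

# ERE (-E): the same pattern without backslashes.
ere=$(printf '%s\n' "$words" | grep -xE 'where(is)?')

printf 'BRE matched:\n%s\n' "$bre"
printf 'ERE matched:\n%s\n' "$ere"

# PRE (-P): only attempted if this grep accepts the option.
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
  pre=$(printf '%s\n' "$words" | grep -xP 'where(?:is)?')
  printf 'PRE matched:\n%s\n' "$pre"
fi
```

The "-x" option makes grep match whole lines only, so both commands select "where" and "whereis" and skip the other two words.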

Features of sed regular expressions

1) sed supports BREs and EREs
sed uses BREs by default
sed with the "-r" option (or "-E") uses EREs
2) What sed does
a. Input processed by sed: text files (or standard input)
b. sed operations: search, replace, delete, insert, and other edits on the content of the text
c. sed also processes text line by line
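
The difference between the two sed modes can be sketched with one substitution written both ways. The log line below is invented for the example; "-E" is used for the ERE form, which GNU sed treats the same as "-r".

```shell
# Hypothetical input line.
line='2024-01-15 error: disk full'

# BRE (default): capture groups are written \( \) and referenced as \1 \2 \3.
bre=$(printf '%s\n' "$line" | sed 's/^\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\3.\2.\1/')

# ERE (-E): plain ( ) groups and {n} intervals, no backslashes needed.
ere=$(printf '%s\n' "$line" | sed -E 's/^([0-9]{4})-([0-9]{2})-([0-9]{2})/\3.\2.\1/')

printf '%s\n%s\n' "$bre" "$ere"
```

Both commands rewrite the leading date from year-month-day to day.month.year and leave the rest of the line untouched.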

Characteristics of Awk (gawk) regular expressions

1) awk supports EREs
awk uses EREs by default
2) How awk processes text
a. Input processed by awk: text files (or standard input)
b. awk operations: mainly works on the columns (fields) of each line
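
A short sketch of column-oriented matching in awk, assuming a made-up three-column input (name, number, role). The "~" operator tests a single field against an ERE, while a bare /pattern/ tests the whole line.

```shell
# Hypothetical input: name, number, role.
data=$(printf '%s\n' 'alice 42 admin' 'bob 7 user' 'carol 108 admin')

# Print the first field of every line whose second field has two or more digits.
names=$(printf '%s\n' "$data" | awk '$2 ~ /^[0-9][0-9]+$/ { print $1 }')

# Match the whole line with ERE alternation; no backslashes are needed.
roles=$(printf '%s\n' "$data" | awk '/(admin|root)$/ { print $1 ":" $3 }')

printf '%s\n' "$names"
printf '%s\n' "$roles"
```

Because the pattern is applied per field, "^" and "$" here anchor to the start and end of that field rather than to the line.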

3. Comparison of the three common types of regular expressions

Character	Description	Basic RegEx	Extended RegEx	Python RegEx	Perl RegEx
\	Escape character	\	\	\	\
^	Match the beginning of a line, for example: '^dog' matches lines that start with the string dog (note: in awk, '^' matches the beginning of the string)	^	^	^	^
$	Match the end of a line, for example: 'dog$' matches lines that end with the string dog (note: in awk, '$' matches the end of the string)	$	$	$	$
^$	Match empty lines	^$	^$	^$	^$
^string$	Match a whole line, for example: '^dog$' matches lines that contain only the string dog	^string$	^string$	^string$	^string$
\<	Match the beginning of a word, for example: '\<frog' (equivalent to '\bfrog') matches words starting with frog	\<	\<	Not supported	Not supported (but you can use \b to match words, for example: '\bfrog')
\>	Match the end of a word, for example: 'frog\>' (equivalent to 'frog\b') matches words ending with frog	\>	\>	Not supported	Not supported (but you can use \b to match words, for example: 'frog\b')
\<x\>	Match a whole word or a specific character, for example: '\<frog\>' (equivalent to '\bfrog\b'), '\<G\>'	\<x\>	\<x\>	Not supported	Not supported (but you can use \b to match words, for example: '\bfrog\b')
()	Group a subexpression, for example: '(frog)'	Not supported (use \(\) instead, for example: '\(dog\)')	()	()	()
\(\)	Group a subexpression, for example: '\(frog\)'	\(\)	Not supported (same as ())	Not supported (same as ())	Not supported (same as ())
?	Match the preceding subexpression 0 or 1 times (equivalent to {0,1}), for example: 'where(is)?' can match "where" and "whereis"	Not supported (same as \?)	?	?	?
\?	Match the preceding subexpression 0 or 1 times (equivalent to '\{0,1\}'), for example: 'where\(is\)\?' can match "where" and "whereis"	\?	Not supported (same as ?)	Not supported (same as ?)	Not supported (same as ?)
?	When it immediately follows another quantifier (*, +, ?, {n}, {n,}, {n,m}), the match becomes non-greedy. A non-greedy pattern matches as little of the string as possible, while the default greedy pattern matches as much as possible. For example, for the string "oooo", 'o+?' matches a single "o", and 'o+' matches all of the "o" characters	Not supported	Not supported	?	?
.	Match any single character except the newline character ('\n') (note: in awk, a period can also match a newline character)	.	. (in line-based tools, to also match lines that contain only "\n", i.e. empty lines, use: '(^$)|(.)')	.	. (to match any character including "\n", use the s modifier or '[\s\S]')
*	Match the preceding subexpression 0 or more times (equivalent to {0,}), for example: 'zo*' can match "z" and "zoo"	*	*	*	*
\+	Match the preceding subexpression 1 or more times (equivalent to '\{1,\}'), for example: 'where\(is\)\+' can match "whereis" and "whereisis"	\+	Not supported (same as +)	Not supported (same as +)	Not supported (same as +)
+	Match the preceding subexpression 1 or more times (equivalent to {1,}), for example: 'zo+' can match "zo" and "zoo", but cannot match "z"	Not supported (same as \+)	+	+	+
{n}	n must be 0 or a positive integer; match the preceding subexpression exactly n times, for example: 'zo{2}' can match "zooz", but cannot match "Bob"	Not supported (same as \{n\})	{n}	{n}	{n}
{n,}	n must be 0 or a positive integer; match the preceding subexpression at least n times, for example: 'go{2,}' can match "good", but cannot match "god"	Not supported (same as \{n,\})	{n,}	{n,}	{n,}
{n,m}	m and n are non-negative integers with n <= m; match at least n times and at most m times, for example: 'o{1,3}' matches the first three o in "fooooood" (note: there must be no spaces between the comma and the two numbers)	Not supported (same as \{n,m\})	{n,m}	{n,m}	{n,m}
x|y	Match x or y, for example: 'z|(food)' can match "z" or "food"; '(z|f)ood' matches "zood" or "food"	Not supported (same as x\|y)	x|y	x|y	x|y
[0-9]	Match any numeric character from 0 to 9 (note: write in increments)	[0-9]	[0-9]	[0-9]	[0-9]
[xyz]	Character set, match any character contained, for example: '[abc]' can match 'a' in "lay" (note: if metacharacters, such as: . * etc., are placed in [ ], then they will become a normal character)	[xyz]	[xyz]	[xyz]	[xyz]
[^xyz]	Negated character set: match any character that is not listed (note: newlines are not matched). For example: '[^abc]' can match 'L' in "Lay" (note: in awk, [^xyz] matches any character that is not listed, including newlines)	[^xyz]	[^xyz]	[^xyz]	[^xyz]
[A-Za-z]	Match any character in uppercase or lowercase letters (note: write in increments)	[A-Za-z]	[A-Za-z]	[A-Za-z]	[A-Za-z]
[^A-Za-z]	Match any character except uppercase and lowercase letters (note: write in increments)	[^A-Za-z]	[^A-Za-z]	[^A-Za-z]	[^A-Za-z]
\d	Match any numeric character from 0 to 9 (equivalent to [0-9])	Not supported	Not supported	\d	\d
\D	Match non-numeric characters (equivalent to [^0-9])	Not supported	Not supported	\D	\D
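\S	Match any non-whitespace character (equivalent to [^ \f\n\r\t\v])	Not supported by POSIX (GNU grep and sed accept \S as an extension)	Not supported by POSIX (GNU grep and sed accept \S as an extension)	\S	\S
\s	Match any whitespace character, including spaces, tabs, form feeds, etc. (equivalent to [ \f\n\r\t\v])	Not supported by POSIX (GNU grep and sed accept \s as an extension)	Not supported by POSIX (GNU grep and sed accept \s as an extension)	\s	\s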
\W	Match any non-word character (equivalent to [^A-Za-z0-9_])	\W	\W	\W	\W
\w	Match any word character including an underscore (equivalent to [A-Za-z0-9_])	\w	\w	\w	\w
\B	Match non-word boundaries, for example: 'er\B' can match 'er' in "verb", but cannot match 'er' in "never"	\B	\B	\B	\B
\b	Match a word boundary, i.e. the position between a word character and a non-word character (such as a space) or the start or end of the string. For example: 'er\b' can match 'er' in "never", but cannot match 'er' in "verb"	\b	\b	\b	\b
\t	Match a horizontal tab character (equivalent to \x09 and \cI)	Not supported	Not supported	\t	\t
\v	Match a vertical tab character (equivalent to \x0b and \cK)	Not supported	Not supported	\v	\v
\n	Match a newline character (equivalent to \x0a and \cJ)	Not supported	Not supported	\n	\n
\f	Match a form feed character (equivalent to \x0c and \cL)	Not supported	Not supported	\f	\f
\r	Match a carriage return (equivalent to \x0d and \cM)	Not supported	Not supported	\r	\r
\\	Match the escape character itself "\"	\\	\\	\\	\\
\cx	Match the control character specified by x, for example: \cM matches Control-M, i.e. a carriage return. The value of x must be one of A-Z or a-z; otherwise, c is treated as a literal 'c' character	Not supported	Not supported	Not supported	\cx
\xn	Match n, where n is a hexadecimal escape value that must be exactly two digits long, for example: '\x41' matches "A", and '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions	Not supported	Not supported	\xn	\xn
\num	Match num, where num is a positive integer: a back-reference to the text matched by the corresponding captured group	\num	\num (GNU extension)	\num	\num
[:alnum:]	Match any letter or digit ([A-Za-z0-9]), for example: '[[:alnum:]]'	[:alnum:]	[:alnum:]	Not supported	[:alnum:]
[:alpha:]	Match any letter ([A-Za-z]), for example: '[[:alpha:]]'	[:alpha:]	[:alpha:]	Not supported	[:alpha:]
[:digit:]	Match any digit ([0-9]), for example: '[[:digit:]]'	[:digit:]	[:digit:]	Not supported	[:digit:]
[:lower:]	Match any lowercase letter ([a-z]), for example: '[[:lower:]]'	[:lower:]	[:lower:]	Not supported	[:lower:]
[:upper:]	Match any uppercase letter ([A-Z]), for example: '[[:upper:]]'	[:upper:]	[:upper:]	Not supported	[:upper:]
[:space:]	Any whitespace character: space, tab, newline, carriage return, form feed, and vertical tab, for example: '[[:space:]]'	[:space:]	[:space:]	Not supported	[:space:]
[:blank:]	Space and horizontal tab, for example: '[[:blank:]]' (equivalent to '[ \t]')	[:blank:]	[:blank:]	Not supported	[:blank:]
[:graph:]	Any visible character (note: spaces and line breaks are not included), for example: '[[:graph:]]'	[:graph:]	[:graph:]	Not supported	[:graph:]
[:print:]	Any printable character (note: excludes [:cntrl:] characters, the string terminator '\0', and the EOF marker (-1), but includes the space character), for example: '[[:print:]]'	[:print:]	[:print:]	Not supported	[:print:]
[:cntrl:]	Any control character (in ASCII, the first 32 characters, i.e. decimal 0 to 31, plus DEL (127); for example: newlines, tabs, etc.), for example: '[[:cntrl:]]'	[:cntrl:]	[:cntrl:]	Not supported	[:cntrl:]
[:punct:]	Any punctuation mark (excluding the characters in [:alnum:], [:cntrl:], and [:space:])	[:punct:]	[:punct:]	Not supported	[:punct:]
[:xdigit:]	Any hexadecimal digit (i.e. 0-9, a-f, A-F)	[:xdigit:]	[:xdigit:]	Not supported	[:xdigit:]
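
A few rows of the table can be tried out directly with grep. This is a small sketch on an invented sentence; it assumes a grep that supports "-o" (print only the matched text) and the GNU word anchors "\<" and "\>".

```shell
# Hypothetical sample sentence.
text='The frog ate 3 bullfrogs and 12 frogs.'

# POSIX class inside a bracket expression: runs of digits.
numbers=$(printf '%s\n' "$text" | grep -oE '[[:digit:]]+')

# \< and \> anchor both ends of a word: only the standalone word "frog".
whole=$(printf '%s\n' "$text" | grep -o '\<frog\>')

# Words that start with "frog" (frog, frogs), but not "bullfrogs".
starts=$(printf '%s\n' "$text" | grep -oE '\<frog[[:alpha:]]*')

printf '%s\n' "$numbers" "$whole" "$starts"
```

Note the doubled brackets: "[:digit:]" is only valid inside a bracket expression, so the full pattern is "[[:digit:]]".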

4. Comparison of the three different types of regular expressions

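Note: When using BREs (basic regular expressions), the characters "?", "+", "|", "{", "}", "(", and ")" must be prefixed with the escape character "\" to give them their special meaning; without the backslash they are matched literally. EREs and PREs work the other way around: the unescaped characters are special, and the backslash blocks their special meaning. (In BREs, "\?", "\+", and "\|" are GNU extensions.)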
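
The effect of the escape character can be seen by running the same pattern text through grep with and without "-E". The two input lines are made up for the illustration.

```shell
# Hypothetical input: one line with a literal plus sign, one without.
input=$(printf '%s\n' 'a+b' 'aab')

# BRE: an unescaped "+" is an ordinary character, so this finds the literal text "a+b".
literal=$(printf '%s\n' "$input" | grep 'a+b')

# ERE: "+" is a quantifier, so this finds one or more "a" followed by "b".
quantifier=$(printf '%s\n' "$input" | grep -E 'a+b')

# ERE: escaping the "+" turns it back into a literal character.
escaped=$(printf '%s\n' "$input" | grep -E 'a\+b')

printf 'BRE a+b: %s\nERE a+b: %s\nERE a\\+b: %s\n' "$literal" "$quantifier" "$escaped"
```

The same three characters select different lines depending on the mode, which is why a pattern copied between grep, "grep -E", sed, and awk often needs its backslashes adjusted.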

Note: Modifiers are placed at the end of a regular expression, as in /dog/i, where "i" is a modifier that makes matching case-insensitive. Common modifiers are as follows:

g   Global matching (i.e. every occurrence on a line, not just the first)
s   Treat the whole string being matched as a single line
m   Multi-line matching
i   Ignore case
x   Allow comments and whitespace in the pattern
U   Non-greedy matching
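
Two of these modifiers have close counterparts in the shell tools, sketched below on an invented line. The "g" flag of the sed "s" command behaves like the g modifier; grep has no modifier suffix, so its "-i" option is used here as a stand-in for the i modifier (an analogy, not the same syntax).

```shell
# Hypothetical input line.
line='dog dog DOG'

# Without "g", only the first match on the line is replaced.
first=$(printf '%s\n' "$line" | sed 's/dog/cat/')

# With "g", every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/dog/cat/g')

# Case-insensitive matching in grep: count every "dog" regardless of case.
count=$(printf '%s\n' "$line" | grep -oi 'dog' | wc -l | tr -d ' ')

printf '%s\n%s\n%s\n' "$first" "$all" "$count"
```

The uppercase "DOG" survives both sed commands because neither of them ignores case; only the grep call with "-i" counts it.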

The above are the similarities and differences between the three common types of regular expressions in Linux. With an overall understanding of them, it should be much clearer how to use these tools.