1. Overview
This topic can be confusing, even bewildering, and that is exactly why it deserves discussion.
In regular expressions, characters or character sequences with special meanings are called metacharacters. For example, "?" means that the preceding subexpression matches 0 or 1 time, and "(?i)" turns on case-insensitive matching. When a metacharacter needs to match itself, it must be escaped.
Different languages and application scenarios define regular expressions and metacharacters differently, so the escaping methods also vary widely.
2. Character escaping in .NET regular expressions
2.1 Escape characters in .NET regular expressions
In most languages, "\" is used as the escape character for characters or character sequences with special meanings: "\n" means a line feed, "\t" means a horizontal tab, and so on. Such escapes can behave in surprising ways when they are used in a regular expression.
This topic arose from a regular expression question in C#:
// requires: using System; using System.Text.RegularExpressions;
string[] test = new string[] { "\\", "\\\\" };
Regex reg = new Regex("^\\\\$");
foreach (string s in test)
{
    Console.WriteLine("Source string: " + s.PadRight(5, ' ') + "Match result: " + reg.IsMatch(s));
}
/*----------------------------------------
Source string: \    Match result: True
Source string: \\   Match result: False
*/
Some people may be puzzled by this result. Doesn't "\\" in a string represent one escaped "\" character, and "\\\\" two escaped "\" characters? If so, shouldn't the first match result be False and the second be True?
This is not easy to explain directly, so let me approach it another way.
Suppose the string to match is this:
string test = "(";
How should the regular expression be written? Because "(" has a special meaning in a regular expression, it must be escaped there as "\(". In a string literal, "\\" is needed to represent "\" itself, so the declaration becomes:
Regex reg = new Regex("^\\($");
If that makes sense, now put "\" back in place of "(". The regular expression needs "\\" to match a literal "\", and in a string literal each of those "\" characters must itself be written as "\\", which gives:
Regex reg = new Regex("^\\\\$");
This analysis shows that, in a regular expression declared as a string literal, "\\\\" actually matches a single "\" character. The relationship between the layers can be summarized as follows:
String output to the console or interface: \
String declared in the program: string test = "\\";
Regular declared in the program: Regex reg = new Regex("^\\\\$");
Is the explanation clear now? Does it still feel clumsy? It does: a regular expression declared as a string literal is clumsy wherever escape characters are involved.
For this reason, C# provides another way to declare strings. Prefixing a string literal with "@" (a verbatim string) makes the compiler ignore "\" escapes.
string[] test = new string[] { @"\", @"\\" };
Regex reg = new Regex(@"^\\$");
foreach (string s in test)
{
    Console.WriteLine("Source string: " + s.PadRight(5, ' ') + "Match result: " + reg.IsMatch(s));
}
/*----------------------------------------
Source string: \    Match result: True
Source string: \\   Match result: False
*/
This is much more concise and matches the usual intuition.
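The two declaration styles can be compared directly. The following is a minimal sketch, assuming a console program; the class name EscapeLayers and the helper IsSingleBackslash are illustrative names, not part of the original example. It shows that both literals hand the same four-character pattern to the regular expression engine.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeLayers
{
    // Both literals produce the same pattern text: ^ \ \ $
    public const string RegularLiteral = "^\\\\$";
    public const string VerbatimLiteral = @"^\\$";

    // True only when the input is exactly one backslash.
    public static bool IsSingleBackslash(string input) =>
        Regex.IsMatch(input, VerbatimLiteral);

    public static void Main()
    {
        Console.WriteLine(RegularLiteral == VerbatimLiteral); // True
        Console.WriteLine(RegularLiteral.Length);             // 4
        Console.WriteLine(IsSingleBackslash("\\"));           // True  (one backslash)
        Console.WriteLine(IsSingleBackslash(@"\\"));          // False (two backslashes)
    }
}
```

Counting the characters of the pattern string is a quick way to see what the regular expression engine actually receives after the compiler has processed the string literal.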
However, this introduces another problem: escaping double quotes. In a normal string literal, a double quote is escaped with "\".
string test = "<a href=\"\">only a test</a>";
Once "@" is added to the string, "\" is treated as the "\" character itself, so a double quote can no longer be escaped with "\". Instead, it is escaped by doubling it, that is, by writing two consecutive double quotes.
string test = @"<a href="""">only a test</a>";
In VB.NET, there is only one way to define a regular expression, and it is consistent with the "@" form in C#.
Dim test As String() = New String() {"\", "\\"}
Dim reg As Regex = New Regex("^\\$")
For Each s As String In test
    Console.WriteLine("Source string: " & s.PadRight(5, " "c) & "Match result: " & reg.IsMatch(s))
Next
'----------------------------------------
'Source string: \    Match result: True
'Source string: \\   Match result: False
'----------------------------------------
2.2 Metacharacters that need to be escaped in .NET regular expressions
According to MSDN, the following characters are metacharacters in regular expressions and must be escaped in order to match themselves.
. $ ^ { [ ( | ) * + ? \
In practice, however, this depends on context: some of these characters may not need escaping, and characters beyond this list may need it.
When a regular expression is written by hand, the author can usually handle the escaping of these characters. When a regular expression is generated dynamically, extra care is required; otherwise, if a variable contains metacharacters, the generated regular expression may throw an exception when it is compiled. Fortunately, .NET provides the Regex.Escape method for this problem. For example, to extract the content of a div tag based on a dynamically obtained id:
string id = Regex.Escape(rawId); // rawId holds the dynamically obtained id (illustrative variable name)
Regex reg = new Regex(@"(?is)<div(?:(?!id=).)*id=(['""]?)" + id + @"\1[^>]*>(?><div[^>]*>(?<o>)|</div>(?<-o>)|(?:(?!</?div\b).)*)*(?(o)(?!))</div>");
Without this escaping, a dynamically obtained id such as "abc(def" would cause an exception at run time.
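The effect of escaping a dynamic value can be seen with a much smaller pattern. This is a sketch under stated assumptions: the class DynamicPattern, the helper BuildIdRegex, and the simplified pattern are hypothetical and stand in for the full div-matching expression; Regex.Escape is the only .NET API being demonstrated.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DynamicPattern
{
    // Builds a simplified pattern that matches id=<value>, optionally quoted.
    // When escape is false, the raw value is concatenated as-is.
    public static Regex BuildIdRegex(string rawId, bool escape)
    {
        string id = escape ? Regex.Escape(rawId) : rawId;
        return new Regex(@"id=(['""]?)" + id + @"\1");
    }

    public static void Main()
    {
        string rawId = "abc(def"; // hypothetical input containing a metacharacter

        Console.WriteLine(Regex.Escape(rawId)); // abc\(def

        Regex safe = BuildIdRegex(rawId, escape: true);
        Console.WriteLine(safe.IsMatch("<div id=\"abc(def\">")); // True

        try
        {
            BuildIdRegex(rawId, escape: false); // unbalanced "(" in the pattern
        }
        catch (ArgumentException ex)
        {
            Console.WriteLine("Invalid pattern: " + ex.Message);
        }
    }
}
```

Escaping only the dynamic part keeps the hand-written metacharacters around it working as intended.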
2.3 Escaping in character groups in .NET regular expressions
Inside a character group [], metacharacters usually do not need to be escaped; even "[" does not.
string test = @"the test string: . $ ^ { [ ( | ) * + ? \";
Regex reg = new Regex(@"[.$^{[(|)*+?\\]");
MatchCollection mc = reg.Matches(test);
foreach (Match m in mc)
{
    Console.WriteLine(m.Value);
}
/*----------------------------------------
.
$
^
{
[
(
|
)
*
+
?
\
*/
However, when writing a regular expression, it is still recommended to escape it as "\[" inside a character group. Regular expressions are already abstract and hard to read, and an unescaped "[" mixed into a character group makes readability worse. Moreover, incorrect nesting may cause an exception when the regular expression is compiled. The following regular expression throws an exception at compile time.
Regex reg = new Regex(@"[.$^{[(]|)*+?\\]");
That said, .NET character groups support set subtraction, and in that syntax nesting of character groups is allowed.
string test = @"abcdefghijklmnopqrstuvwxyz";
Regex reg = new Regex(@"[a-z-[aeiou]]+");
MatchCollection mc = reg.Matches(test);
foreach (Match m in mc)
{
    Console.WriteLine(m.Value);
}
/*----------------------------------------
bcd
fgh
jklmn
pqrst
vwxyz
*/
This usage hurts readability and is rarely needed. Even when such a requirement arises, it can be implemented in other ways, so it is enough to know that it exists without going deeper.
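As one illustration of the "other ways" mentioned here, the sketch below compares set subtraction with two alternatives: a negative lookahead and hand-written ranges. The alternatives, the class name ConsonantRuns, and the helper Runs are assumptions added for illustration, not taken from the original.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ConsonantRuns
{
    public const string Subtraction = @"[a-z-[aeiou]]+";        // .NET set subtraction
    public const string Lookahead = @"(?:(?![aeiou])[a-z])+";   // lookahead rejects vowels one letter at a time
    public const string ExplicitRanges = @"[b-df-hj-np-tv-z]+"; // consonant ranges spelled out by hand

    // Returns all matches of the pattern, joined by single spaces.
    public static string Runs(string pattern, string input) =>
        string.Join(" ", Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value));

    public static void Main()
    {
        string test = "abcdefghijklmnopqrstuvwxyz";
        foreach (string pattern in new[] { Subtraction, Lookahead, ExplicitRanges })
        {
            Console.WriteLine(Runs(pattern, test)); // bcd fgh jklmn pqrst vwxyz
        }
    }
}
```

The lookahead form also works in regular expression flavors that lack set subtraction, which is one reason to prefer it in code that may be ported.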
Back to the topic: the only character that must always be escaped in a character group is "\", and when "[" or "]" appears in a character group, escaping it is also recommended. Two more characters, "^" and "-", are special at certain positions in a character group and must be escaped there if they are to match themselves.
"^" at the start of a character group denotes a negated character group: "[^Char]" matches any character except those contained in the group, so "[^0-9]" matches any character except a digit. To match the "^" character itself in a character group, either place it somewhere other than the start of the group or escape it as "\^".
Regex reg1 = new Regex(@"[0-9^]");
Regex reg2 = new Regex(@"[\^0-9]");
Both of these match any digit or the literal character "^".
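A small sketch can make the position rule for "^" concrete by comparing the two literal forms with the negated group. The class name CaretInGroup and the helper Matches are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaretInGroup
{
    public static bool Matches(string pattern, string input) =>
        Regex.IsMatch(input, pattern);

    public static void Main()
    {
        string[] patterns = { @"[0-9^]", @"[\^0-9]", @"[^0-9]" };
        string[] inputs = { "5", "^", "a" };

        foreach (string pattern in patterns)
        {
            foreach (string input in inputs)
            {
                Console.WriteLine($"{pattern,-9} {input}  {Matches(pattern, input)}");
            }
        }
        // [0-9^] and [\^0-9]: "5" True, "^" True, "a" False
        // [^0-9] (negated):   "5" False, "^" True, "a" True
    }
}
```

Note that the negated group also matches "^", but only because "^" is not a digit; it matches "a" for the same reason.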
The special role of "-" in a character group is best shown by an example.
string test = @"$";
Regex reg = new Regex(@"[#-*%&]");
Console.WriteLine("Match result: " + reg.IsMatch(test));
/*----------------------------------------
Match result: True
*/
There is clearly no "$" in the regular expression, so why is the match result True?
Inside [], the hyphen "-" joins two characters to represent a character range. Note that the order of the two characters matters: within the same encoding, the code point of the second character must be greater than or equal to that of the first.
for (int i = '#'; i <= '*'; i++)
{
    Console.WriteLine((char)i);
}
/*----------------------------------------
#
$
%
&
'
(
)
*
*/
Since "#" and "*" satisfy this requirement, "[#-*]" is a character range, and that range contains "$", so the regular expression above matches "$". To treat "-" as an ordinary character, either move it to the start or end of the group or escape it.
Regex reg1 = new Regex(@"[#*%&-]");
Regex reg2 = new Regex(@"[#\-*%&]");
Both forms match any one of the characters listed in the character group, with "-" treated as an ordinary character.
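The difference between a range hyphen and a literal hyphen can be checked side by side. This is a minimal sketch; the class name HyphenInGroup and the helper Matches are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HyphenInGroup
{
    public static bool Matches(string pattern, string input) =>
        Regex.IsMatch(input, pattern);

    public static void Main()
    {
        string[] patterns = { @"[#-*%&]", @"[#*%&-]", @"[#\-*%&]" };
        string[] inputs = { "$", "-" };

        foreach (string pattern in patterns)
        {
            foreach (string input in inputs)
            {
                Console.WriteLine($"{pattern,-10} {input}  {Matches(pattern, input)}");
            }
        }
        // [#-*%&]  : "$" True  (inside the range # to *), "-" False
        // [#*%&-]  : "$" False, "-" True (trailing hyphen is literal)
        // [#\-*%&] : "$" False, "-" True (escaped hyphen is literal)
    }
}
```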
There is also a special escape sequence in character groups. When "\b" appears in an ordinary position in a regular expression, it denotes a word boundary, that is, a position with a word character on one side and not on the other. When "\b" appears inside a character group, it denotes a backspace character, the same meaning "\b" has in a normal string.
Similarly, the character "|" is easily overlooked. In an ordinary position in a regular expression, "|" expresses alternation ("or") between its left and right sides. Inside a character group, "|" is just the "|" character itself with no special meaning, so trying to use it for alternation there is a mistake. For example, the regular expression "[a|b]" matches any one of "a", "b", and "|", rather than "a" or "b".
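Both position-dependent meanings can be verified with a few one-line checks. The following is a sketch; the class name GroupSpecials and the helper Matches are illustrative, and the test strings are made up for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupSpecials
{
    public static bool Matches(string pattern, string input) =>
        Regex.IsMatch(input, pattern);

    public static void Main()
    {
        string backspace = "\b"; // C# string escape for U+0008

        Console.WriteLine(Matches(@"[\b]", backspace)); // True:  backspace inside a character group
        Console.WriteLine(Matches(@"[\b]", "a b"));     // False: no backspace character in the input
        Console.WriteLine(Matches(@"a\b", "a b"));      // True:  word boundary after "a"

        Console.WriteLine(Matches(@"^[a|b]$", "|"));    // True:  "|" is an ordinary member of the group
        Console.WriteLine(Matches(@"^(?:a|b)$", "|"));  // False: here "|" means alternation
        Console.WriteLine(Matches(@"^[ab]$", "b"));     // True:  the usual way to write "a or b" as a group
    }
}
```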
2.4 Escaping invisible characters in .NET regular expressions
Some invisible characters must be written with escape sequences in a string; common ones include "\r", "\n", and "\t". Using them in regular expressions gets a little magical. Consider this code first.
string test = "one line. \n another line.";
List<Regex> list = new List<Regex>();
list.Add(new Regex("\n"));
list.Add(new Regex("\\n"));
list.Add(new Regex(@"\n"));
list.Add(new Regex(@"\\n"));
foreach (Regex reg in list)
{
    Console.WriteLine("Regular expression: " + reg.ToString());
    MatchCollection mc = reg.Matches(test);
    foreach (Match m in mc)
    {
        Console.WriteLine("Match content: " + m.Value + " Match start position: " + m.Index + " Match length: " + m.Length);
    }
    Console.WriteLine("Total number of matches: " + reg.Matches(test).Count + "\n----------------");
}
/*----------------------------------------
(<LF> marks a place where a line feed character is printed)
Regular expression: <LF>
Match content: <LF> Match start position: 10 Match length: 1
Total number of matches: 1
----------------
Regular expression: \n
Match content: <LF> Match start position: 10 Match length: 1
Total number of matches: 1
----------------
Regular expression: \n
Match content: <LF> Match start position: 10 Match length: 1
Total number of matches: 1
----------------
Regular expression: \\n
Total number of matches: 0
----------------
*/
Although the first three regular expressions are declared differently, and the first one even prints differently, their results are exactly the same; only the last one finds no match.
Regex("\n") declares the regular expression as a normal string. The compiler turns "\n" into a line feed character before the regular expression engine sees it, so the engine matches that character literally, just as Regex("a") matches the character "a"; the engine performs no escape processing.
Regex("\\n") declares the regular expression in its string-escaped form. Just as "\\\\" in a string literal yields "\\" in the regular expression, "\\n" in a string literal yields "\n" in the regular expression, which the engine then interprets as a line feed.
Regex(@"\n") is equivalent to Regex("\\n"); it is the same regular expression written with "@" before the string.
Regex(@"\\n") actually represents the character "\" followed by the character "n", two characters in total, which naturally cannot be found in the source string.
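One way to see what each of the four declarations hands to the regular expression engine is to dump the pattern string as character codes. This is a sketch only; the class name PatternBytes and the helper Hex are hypothetical and exist purely for inspection.

```csharp
using System;
using System.Linq;

public static class PatternBytes
{
    // Renders each character of the pattern as a two-digit hexadecimal code.
    public static string Hex(string pattern) =>
        string.Join(" ", pattern.Select(c => ((int)c).ToString("X2")));

    public static void Main()
    {
        Console.WriteLine(Hex("\n"));    // 0A        a real line feed character
        Console.WriteLine(Hex("\\n"));   // 5C 6E     backslash + n, the regex escape for a line feed
        Console.WriteLine(Hex(@"\n"));   // 5C 6E     same pattern, verbatim form
        Console.WriteLine(Hex(@"\\n"));  // 5C 5C 6E  escaped backslash followed by a literal n
    }
}
```

The second and third declarations are character-for-character identical, which is why their results cannot differ.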
What needs special attention here is "\b": its meaning differs depending on how the regular expression is declared.
string test = "one line. \n another line.";
List<Regex> list = new List<Regex>();
list.Add(new Regex("line\b"));
list.Add(new Regex("line\\b"));
list.Add(new Regex(@"line\b"));
list.Add(new Regex(@"line\\b"));
foreach (Regex reg in list)
{
    Console.WriteLine("Regular expression: " + reg.ToString());
    MatchCollection mc = reg.Matches(test);
    foreach (Match m in mc)
    {
        Console.WriteLine("Match content: " + m.Value + " Match start position: " + m.Index + " Match length: " + m.Length);
    }
    Console.WriteLine("Total number of matches: " + reg.Matches(test).Count + "\n----------------");
}
/*----------------------------------------
(<BS> marks a place where a backspace character is printed)
Regular expression: line<BS>
Total number of matches: 0
----------------
Regular expression: line\b
Match content: line Match start position: 4 Match length: 4
Match content: line Match start position: 20 Match length: 4
Total number of matches: 2
----------------
Regular expression: line\b
Match content: line Match start position: 4 Match length: 4
Match content: line Match start position: 20 Match length: 4
Total number of matches: 2
----------------
Regular expression: line\\b
Total number of matches: 0
----------------
*/
Regex("line\b"): here "\b" is a backspace character, which reaches the regular expression engine as-is and is not escaped by it. There is no backspace character in the source string, so the number of matches is 0.
Regex("line\\b") declares the regular expression in its string-escaped form. Here "\\b" reaches the engine as "\b", which it interprets as a word boundary.
Regex(@"line\b") is equivalent to Regex("line\\b"); "\b" denotes a word boundary.
Regex(@"line\\b") actually represents "line" followed by the character "\" and the character "b". This naturally cannot be found in the source string.
2.5 Other escape handling in .NET regular expressions
There are a few other escaping mechanisms in .NET regular expressions. They are not used often, but they are worth a brief mention.
Requirement: prefix each number between "<" and ">" in the string with "$".
string test = "one test <123>, another test <321>";
Regex reg = new Regex(@"<(\d+)>");
string result = reg.Replace(test, "<$$1>");
Console.WriteLine(result);
/*----------------------------------------
one test <$1>, another test <$1>
*/
You may be surprised to find that the replacement did not put "$" in front of the numbers; instead, every number was replaced with "$1".
Why does this happen? In a replacement pattern, "$" has a special meaning: followed by a number, it refers to the match of the capture group with that number. When the replacement needs a literal "$" that is followed by digits, "$$" must be used to escape it. The example above produces the unexpected result precisely because of this escape: "$$" became a literal "$" and the "1" after it was left as plain text. One way to avoid the problem is not to reference the capture group in the replacement at all.
string test = "one test <123>, another test <321>";
Regex reg = new Regex(@"(?<=<)(?=\d+>)");
string result = reg.Replace(test, "$");
Console.WriteLine(result);
/*----------------------------------------
one test <$123>, another test <$321>
*/
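As a supplement that is not part of the original example, the capture group can still be referenced if the literal "$" is escaped and the group reference is written after it. The sketch below shows two such variants; the class name DollarReplacement and its method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DollarReplacement
{
    private const string Pattern = @"<(\d+)>";

    // "$$" yields a literal "$", and the following "$1" is the group reference.
    public static string WithEscapedDollar(string input) =>
        Regex.Replace(input, Pattern, "<$$$1>");

    // A MatchEvaluator returns plain text, so "$" needs no escaping at all.
    public static string WithEvaluator(string input) =>
        Regex.Replace(input, Pattern, m => "<$" + m.Groups[1].Value + ">");

    public static void Main()
    {
        string test = "one test <123>, another test <321>";
        Console.WriteLine(WithEscapedDollar(test)); // one test <$123>, another test <$321>
        Console.WriteLine(WithEvaluator(test));     // one test <$123>, another test <$321>
    }
}
```

The evaluator form is longer but avoids the replacement-pattern syntax entirely, which can be easier to read when the replacement contains several "$" characters.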
3. Escape characters in JavaScript and Java
When a regular expression is declared in string form, escape handling in JavaScript and Java is essentially the same as in .NET. Here is a brief introduction.
In JavaScript, declaring a regular expression in string form is just as clumsy as in C#.
<script type="text/javascript">
var data = ["\\", "\\\\"];
var reg = new RegExp("^\\\\$", "");
for (var i = 0; i < data.length; i++)
{
    document.write("Source string: " + data[i] + " Match result: " + reg.test(data[i]) + "<br />");
}
</script>
/*----------------------------------------
Source string: \ Match result: true
Source string: \\ Match result: false
*/
Although JavaScript has no string declaration like the "@" form in C#, it provides a dedicated literal syntax for regular expressions.
<script type="text/javascript">
var data = ["\\", "\\\\"];
var reg = /^\\$/;
for (var i = 0; i < data.length; i++)
{
    document.write("Source string: " + data[i] + " Match result: " + reg.test(data[i]) + "<br />");
}
</script>
/*----------------------------------------
Source string: \ Match result: true
Source string: \\ Match result: false
*/
In JavaScript, the general form of this literal is:
var reg = /Expression/igm;
This declaration form also simplifies regular expressions that contain escape characters.
Of course, when a regular expression is declared in this form, "/" naturally becomes a metacharacter and must be escaped when it appears in the regular expression. For example, a regular expression that matches the domain name in a link:
var reg = /http:\/\/([^\/]+)/ig;
Unfortunately, Java currently has only one way to declare a regular expression, namely as a string:
// requires: import java.util.regex.Pattern;
String[] test = new String[] { "\\", "\\\\" };
String reg = "^\\\\$";
for (int i = 0; i < test.length; i++)
{
    System.out.println("Source string: " + test[i] + " Match result: " + Pattern.compile(reg).matcher(test[i]).find());
}
/*----------------------------------------
Source string: \ Match result: true
Source string: \\ Match result: false
*/
One can only hope that later versions of Java will improve on this.