Regular expressions in C#
Jeffrey E. F. Friedl wrote a book about regular expressions, "Mastering Regular Expressions". To help readers understand and master regular expressions, the author builds the material around a story. The book's examples are mainly in Perl. As far as I know, the regular expressions in C# are also based on Perl 5, so the two should have a lot in common. http://ike.
In fact, I do not intend to translate the book as it is. First, the book has too much content and I am not qualified to translate it; second, if I translated the book and replaced its code with C#, that could be copyright infringement, since I have not obtained the original author's permission. So please just treat this as reading notes.
Skipping the lengthy preface, we can go directly to Chapter 1:
Introducing regular expressions
The author says this chapter is written for absolute novices in regular expressions, with the aim of laying a solid foundation for later chapters. If you are not a novice, you can skip this chapter.
The scenario:
The head of your archives department wants a tool that checks for doubled words (such as "this this"), a problem you often run into when editing large numbers of documents. Your job is to create a solution that will:
Accept any number of files to check, report each line of each file that contains doubled words, highlight the doubled words, and make sure the original file name appears with each line in the report.
Work across lines, finding cases where the last word of one line is repeated as the first word of the next line.
Find doubled words regardless of differences in capitalization (such as "The the") and allow any amount of whitespace (spaces, tabs, newlines, etc.) between the words.
Find doubled words even when they are separated by HTML tags. (For example: …it is <B>very</B> very important.)
To solve this practical problem, the first thing we have to do is write a regular expression that finds the text we want and ignores the text we don't need; then we use our C# code to process the text it finds.
Even if you have never used regular expressions, you may already know more or less what a regular expression is. And even if you don't, you are almost certainly already familiar with the basic concept.
A name such as "report.txt" identifies one specific file, but if you have any Unix or DOS/Windows experience, you also know that "*.txt" can be used to select multiple files. In this kind of file name pattern some characters have special meanings: an asterisk matches anything, and a question mark matches exactly one character. For example, "*.txt" means any file whose name ends with .txt.
File name patterns are a form of pattern matching, but they use only a limited set of special characters. Some search engines on the web also allow a limited form of pattern matching when searching content. Regular expressions offer a much richer set of matching characters and can handle far more complex problems.
First, we introduce two position matching characters:
^ : Indicates the starting position of a line of text
$ : Indicates the end position of a line of text
For example, the expression "^Cat" matches the word Cat only when it appears at the beginning of a line. Note that ^ matches a position; it does not match any character itself.
Similarly, the expression "Cat$" matches the word Cat only when it appears at the end of a line.
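Here is a minimal C# sketch of these two position characters. The class and method names are made up for illustration. One detail the sketch assumes is worth stating: in .NET, ^ and $ match at the start and end of the whole string by default, and only match at each line when RegexOptions.Multiline is given.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration; the names are not from the book.
public static class AnchorDemo
{
    // True if any line of the text starts with "Cat".
    public static bool LineStartsWithCat(string text) =>
        Regex.IsMatch(text, "^Cat", RegexOptions.Multiline);

    // True if any line of the text ends with "Cat".
    public static bool LineEndsWithCat(string text) =>
        Regex.IsMatch(text, "Cat$", RegexOptions.Multiline);

    // Without RegexOptions.Multiline, ^ matches only at the start of the whole string.
    public static bool StringStartsWithCat(string text) =>
        Regex.IsMatch(text, "^Cat");
}
```

For the two-line text "A dog\nCat food", LineStartsWithCat returns true, while StringStartsWithCat returns false because the string as a whole starts with "A dog".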
Next, we introduce square brackets "[]" in an expression, which match any one of the characters inside the brackets. For example:
Expression: "[0123456789]" will match any one of the digits 0 to 9.
For example, if we want to find all text that contains gray or grey, the expression can be written like this: "gr[ea]y"
[ea] matches either e or a, not the sequence ea.
If we want to match the tags <H1><H2><H3><H4><H5><H6> in HTML, we can write the expression:
"<H[123456]>". But what if we want to match any one of a large set of characters? Do we have to write every one of them inside the square brackets? Fortunately not: we introduce the range symbol "-".
With the range symbol, we only need to give the two boundary characters of the range. In the HTML example above, we can write "<H[1-6]>".
Is the expression "[0-9a-zA-Z]" now clear? It matches a single character that is a digit, one of the 26 lowercase letters, or one of the 26 uppercase letters.
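The bracket expressions above can be tried out in C# with a small sketch like the following. The class and method names are hypothetical, chosen only for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration; the names are not from the book.
public static class CharClassDemo
{
    // True if the text contains "gray" or "grey".
    public static bool ContainsGrayOrGrey(string text) =>
        Regex.IsMatch(text, "gr[ea]y");

    // Counts the opening heading tags <H1> through <H6> using a range.
    public static int CountHeadingTags(string html) =>
        Regex.Matches(html, "<H[1-6]>").Count;

    // True if the whole string is a single digit or ASCII letter.
    public static bool IsSingleAlphanumeric(string text) =>
        Regex.IsMatch(text, @"\A[0-9a-zA-Z]\z");
}
```

Note that ContainsGrayOrGrey("greay") is false: [ea] stands for exactly one character, so the expression cannot consume both e and a.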
The "^" symbol that appears in []
If you see an expression such as "[^0-9]", the "^" is no longer the position symbol mentioned earlier. Inside square brackets it is a negation symbol that means exclusion: the expression above matches any single character that is not one of the digits 0 to 9.
Exercise 1: What does the expression "q[^u]" mean? Which of the following words will it match?
Iraqi
Iraqian
miqra
qasida
qintar
qoph
zaqqum
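One way to check your answer to the exercise is a short C# sketch such as the one below (the class and method names are invented for this example). The point to keep in mind is that "[^u]" still has to match a character: it means "a character that is not u", not "no character".

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration; the names are not from the book.
public static class NegatedClassDemo
{
    // True if the word contains a lowercase q followed by some character other than u.
    public static bool HasQNotFollowedByU(string word) =>
        Regex.IsMatch(word, "q[^u]");
}
```

Every word in the list above contains a q followed by a character other than u, so the method returns true for all of them. It returns false for a word such as "Iraq", where nothing follows the q, and for "Qantas", because the match is case-sensitive and the expression asks for a lowercase q.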
Besides character ranges, there is also the dot character ".", which matches any single character in an expression (by default, any character except a newline).
For example, the expression "07.04.76" will match text of the form:
07/04/76, 07-04-76, 07.04.76.
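A short C# sketch shows both the convenience and the looseness of the dot. The class and method names are hypothetical, and the stricter variant with a character class is my own addition for comparison, not something taken from the book.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration; the names are not from the book.
public static class DotDemo
{
    // The dot accepts any character between the number groups.
    public static bool MatchesLoosely(string text) =>
        Regex.IsMatch(text, "07.04.76");

    // A character class restricts the separator to '-', '.' or '/'.
    public static bool MatchesStrictly(string text) =>
        Regex.IsMatch(text, "07[-./]04[-./]76");
}
```

MatchesLoosely accepts all three date forms, but it also accepts "07104276", because each dot is free to match a digit. MatchesStrictly rejects that string.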
If we need to choose between several alternatives, we can use the alternation character "|":
The alternation character means "or": the expression "Bob|Robert" matches either Bob or Robert. Note that it must not be placed inside square brackets: "[Bob|Robert]" is a character class that matches just one of the characters B, o, b, |, R, e, r, t.
Now look again at the earlier expression "gr[ea]y": using alternation we can write it as "grey|gray", and the two expressions match the same text.
Use of parentheses: parentheses are also metacharacters in expressions. For example, the previous expression can be written as "gr(e|a)y". The parentheses here are necessary: without them, the expression "gre|ay" would match gre or ay, which is not what we want. If this is not yet clear, take a look at the following example:
Suppose we want to find all lines in an email that start with "From: ", "Subject: ", or "Date: ". Compare the following two expressions:
Expression 1: "^From|Subject|Date: "
Expression 2: "^(From|Subject|Date): "
Which one is what we want?
Expression 1 is clearly not what we want: it matches From at the beginning of a line, or Subject anywhere, or "Date: " anywhere. Expression 2 uses parentheses to limit the alternation, so it meets our needs.
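The difference between the two expressions is easy to see in a small C# sketch. The class and method names are made up for this example, and the third alternative is spelled Date, as in the task description.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration; the names are not from the book.
public static class AlternationDemo
{
    // Expression 1: the alternation splits the whole expression into three parts,
    // so only "From" is tied to the start of the line.
    public static bool Ungrouped(string line) =>
        Regex.IsMatch(line, "^From|Subject|Date: ");

    // Expression 2: the parentheses limit the alternation, so ^ and ": " apply
    // to all three header names.
    public static bool Grouped(string line) =>
        Regex.IsMatch(line, "^(From|Subject|Date): ");
}
```

For the line "Re: Subject: hello", Ungrouped returns true because Subject appears somewhere in the line, while Grouped returns false because the line does not start with one of the three header names.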
Word boundaries
We can already anchor a match to the beginning or end of a line, but what if we want to anchor it somewhere else? For that we need the word boundary symbol, "\b". The backslash cannot be omitted; without it, the expression simply matches the letter b. With word boundaries we can require that a match begin or end at the edge of a word, not in the middle of one. For example, the expression "\bis\b" matches the word "is" in the string "This is a cat." but not the "is" inside the word "This".
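The "This is a cat." example can be checked with a few lines of C#. This is only a sketch, and the class and method names are invented for it.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration; the names are not from the book.
public static class WordBoundaryDemo
{
    // Counts "is" only where it stands as a whole word.
    public static int CountWholeWordIs(string text) =>
        Regex.Matches(text, @"\bis\b").Count;

    // Counts every occurrence of the letters "is", including inside other words.
    public static int CountAnyIs(string text) =>
        Regex.Matches(text, "is").Count;

    // Position of the first whole-word "is" (0 if there is no match).
    public static int FirstWholeWordIsIndex(string text) =>
        Regex.Match(text, @"\bis\b").Index;
}
```

For "This is a cat.", CountAnyIs finds two occurrences but CountWholeWordIs finds only one, at index 5.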
String boundary symbol
In addition to the above positional symbols, if we want to match the entire string (including multiple words), then we can use the following two symbols:
\A : Indicates the beginning of the string;
\z : Indicates the end of the string.
Expression: "\AThis is a cat\z" will match this string "This is a cat".
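Boundary symbols rely on an important concept: word characters. Word characters are the characters that can make up a word: letters, digits, and the underscore (roughly [a-zA-Z0-9_]; in .NET, "\w" also covers letters and digits from other scripts). A period is not a word character. So if we use word boundaries instead, the expression "\bThis is a cat\b" also matches inside the sentence "This is a cat.", and the match does not include the period. The expression "\AThis is a cat\z", by contrast, does not match that sentence, because the period sits between "cat" and the end of the string.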
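The following C# sketch contrasts string boundaries with word boundaries. The class and method names are hypothetical, and the word-boundary variant of the expression is written out here only for comparison.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration; the names are not from the book.
public static class StringBoundaryDemo
{
    // True only if the entire string is exactly "This is a cat".
    public static bool IsWholeString(string text) =>
        Regex.IsMatch(text, @"\AThis is a cat\z");

    // Returns the phrase if it occurs between word boundaries, otherwise "".
    public static string FindPhrase(string text) =>
        Regex.Match(text, @"\bThis is a cat\b").Value;
}
```

IsWholeString("This is a cat.") is false because of the trailing period, while FindPhrase("This is a cat.") returns "This is a cat" without the period.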
Repeat quantity symbol
Let's look at the expression "Colou?r". It contains a question mark, which we have not seen in an expression before (its meaning here differs from the question mark in file name patterns). A quantifier says how many times the character in front of it may be repeated, and "?" means 0 or 1 time. In this expression the question mark means that u can appear 0 or 1 time, so the expression matches "Color" or "Colour".
The other quantifiers are:
+ : means one or more times
* : means zero or more times
For example, if we want to represent one or more spaces, we can write the expression: " +";
What if you want to specify an exact number of repetitions? For that we introduce curly braces {}.
{n} : n is a specific number, meaning exactly n repetitions.
{n,m}: means at least n and at most m repetitions.
All of these symbols limit the number of repetitions of the single character before them. But what if you want to repeat several characters, such as a whole word? We use parentheses again. In the earlier example, parentheses marked the scope of an alternation. Here is another use: they form a group. For example, in the expression "(this)", this is a group. That solves the problem easily: a quantifier placed after a group sets the number of repetitions of the whole group.
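The quantifiers described above can be sketched in C# as follows. The class and method names are made up for this example, and the digit patterns are my own illustrations of {n} and {n,m}. The string anchors \A and \z are used so that each expression has to account for the whole input.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration; the names are not from the book.
public static class QuantifierDemo
{
    // "?" makes the u optional: matches "Color" and "Colour".
    public static bool IsColor(string word) =>
        Regex.IsMatch(word, @"\AColou?r\z");

    // " +" matches one or more spaces; here each run is replaced by a single space.
    public static string CollapseSpaces(string text) =>
        Regex.Replace(text, " +", " ");

    // {3} means exactly three repetitions of the preceding character class.
    public static bool IsThreeDigits(string text) =>
        Regex.IsMatch(text, @"\A[0-9]{3}\z");

    // {2,4} means at least two and at most four repetitions.
    public static bool IsTwoToFourDigits(string text) =>
        Regex.IsMatch(text, @"\A[0-9]{2,4}\z");

    // A quantifier after a group repeats the whole group.
    public static bool IsThisTwice(string text) =>
        Regex.IsMatch(text, @"\A(this){2}\z");
}
```

For example, IsThisTwice("thisthis") is true, whereas the expression "this{2}" without parentheses would instead look for "thiss", because the quantifier would apply only to the final s.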
Now back to the question of finding duplicate words, if we want to find "the the", based on what we have learned so far, we can write the expression:
"\bthe +the\b"
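This expression matches two occurrences of the word "the" separated by one or more spaces.
Similarly, we can write it with a group and a quantifier, although this version is not exactly equivalent, because it also requires at least one space after the second "the":
"\b(the +){2}"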
But what if you want to find every possible doubled word? What we have learned so far is not enough to solve this problem, so we now introduce the concept of a backreference. We have seen that parentheses mark the boundaries of a group. An expression can contain several groups, and by default each group is assigned a number according to the order in which it appears: the first group is number 1, and so on. A backreference lets a later part of the expression refer to a group with "\n", where n is the group number, and it matches the same text that the group matched. Backreferences are like variables in a program. Let's look at specific examples:
The earlier doubled-word expression can now be written with a backreference:
"\b(the) +\1\b"
Now, if we want to match any doubled word, we can rewrite the expression as:
"\b([a-zA-Z]+) +\1\b"
One last question: what if the character we want to match is itself a metacharacter in regular expressions? The answer is to use the escape symbol "\". For example, to match a decimal point, write "\.". Also note that when you put an expression in a C# program, each "\" must be written as "\\" according to the rules for string literals, unless you put @ in front of the string to make it a verbatim string literal.
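To tie the doubled-word expression and the escaping rule together, here is a minimal C# sketch. The class, constant, and method names are invented for this example, and it covers only the simple case of words separated by spaces on one line, not the full set of requirements from the scenario.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration; the names are not from the book.
public static class DuplicateWordDemo
{
    // The same pattern written two ways. In a regular string literal every
    // backslash is doubled; in a verbatim literal (@"...") it is written once.
    public const string EscapedPattern = "\\b([a-zA-Z]+) +\\1\\b";
    public const string VerbatimPattern = @"\b([a-zA-Z]+) +\1\b";

    // Returns the first word of every doubled pair found in the text.
    public static List<string> FindDuplicates(string text)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(text, VerbatimPattern))
        {
            words.Add(m.Groups[1].Value); // group 1 is the text that \1 refers back to
        }
        return words;
    }

    // "\." matches a literal period, while an unescaped "." matches any character.
    public static bool HasDecimalPoint(string text) =>
        Regex.IsMatch(text, @"[0-9]\.[0-9]");
}
```

Calling DuplicateWordDemo.FindDuplicates("this this is the the test") returns the words "this" and "the". The text "the theory" gives no result, because the final \b requires the second copy to end at a word boundary.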
This chapter only gives newcomers the basic knowledge about regular expressions, and it covers just part of the subject. There is still a lot to learn, which the following chapters will introduce one by one. In fact, learning regular expressions is not difficult; what you need to master them is patience and practice. Maybe someone will say, "I don't want to know the details of the car, I just want to learn how to drive." If you think that way, you will never know how to use regular expressions to solve your own problems, and you will never understand their true power.