A brief discussion on the use of regular expressions in C#

Many programming languages and tools now support regular expressions, and C# is no exception. The .NET base class library contains a namespace (System.Text.RegularExpressions) and a set of classes (Regex, Match, Group, and so on) that give full access to the power of regular expressions. So what is a regular expression, and how do you define one?

1. Regular expression basics

What is a regular expression

When writing string-processing code, you often need to find strings that follow some complex rule. Regular expressions are a tool for describing such rules. In other words, a regular expression is code that records a text rule.

When searching for files in Windows, we usually use wildcards (* and ?). To find all Word documents in a directory, you can search for *.doc, where * stands for an arbitrary string. Like wildcards, regular expressions are a tool for text matching, but they can describe what you need far more precisely, at the cost of being more complicated.

A simple example: verifying a phone number

The best way to learn regular expressions is to start with examples. Let's start by verifying a phone number and build up an understanding of regular expressions step by step.

In our country, a phone number (e.g. 0379-65624150) usually consists of an area code of 3 or 4 digits starting with 0 and a local number of 7 or 8 digits, normally separated by a hyphen '-'. For this example we first introduce the metacharacter \d, which matches a digit from 0 to 9. The regular expression can be written as: ^0\d{2,3}-\d{7,8}$

Let's analyze it. ^ matches the beginning of the string, 0 matches the digit "0", \d matches a digit, and {2,3} repeats that \d 2 to 3 times. - matches only "-" itself. The next \d also matches a digit, {7,8} repeats it 7 to 8 times, and $ matches the end of the string. Of course, the phone number can also be written as (0379)65624150; handling that form is left to the reader.

Metacharacters

In the example above we came across the metacharacter \d. As you might expect, regular expressions have many other metacharacters like \d. The following table lists some commonly used ones:

Metacharacter	Description
.	Match any character except a line break
\b	Match the beginning or end of a word
\d	Match a digit
\s	Match any whitespace character
\w	Match a letter, digit, underscore, or Chinese character
^	Match the beginning of the string
$	Match the end of the string

Table 1. Commonly used metacharacters
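The metacharacters in Table 1 can be tried out with a few calls to the static Regex methods. This is only a minimal sketch; the sample text is hypothetical and chosen purely for illustration.

```csharp
// Console sketch using top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical sample text, not taken from the article.
string text = "Order 42 shipped to Room_7 today";

// \d+ : the first run of one or more digits
string firstNumber = Regex.Match(text, @"\d+").Value;

// \b\w+\b : whole words made of letters, digits or underscores
int wordCount = Regex.Matches(text, @"\b\w+\b").Count;

// ^Order\s : "Order" at the very start, followed by whitespace
bool startsWithOrder = Regex.IsMatch(text, @"^Order\s");

// today$ : "today" at the very end of the string
bool endsWithToday = Regex.IsMatch(text, @"today$");

Console.WriteLine(firstNumber);      // 42
Console.WriteLine(wordCount);        // 6
Console.WriteLine(startsWithOrder);  // True
Console.WriteLine(endsWithToday);    // True
```

The patterns are written as verbatim strings (@"...") so that each backslash reaches the regular expression engine unchanged.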

Escape characters

If you want to find a metacharacter itself, for example . or *, there is a problem: you cannot write it directly, because it will be interpreted as something else. In that case you have to use \ to cancel the special meaning of the character, so you should write \. and \*. Of course, to find \ itself, you have to write \\.

For example: Unibetter\.com matches Unibetter.com, and C:\\Windows matches C:\Windows.
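The effect of escaping can be checked with a short sketch like the one below. The input "UnibetterXcom" is a made-up string used only to show what the unescaped dot lets through; Regex.Escape is the library method that adds the backslashes for you.

```csharp
// Console sketch using top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Unescaped: . matches any character, so "UnibetterXcom" also matches.
bool loose = Regex.IsMatch("UnibetterXcom", @"Unibetter.com");

// Escaped: \. matches only a literal dot.
bool strict = Regex.IsMatch("UnibetterXcom", @"Unibetter\.com");
bool exact = Regex.IsMatch("Unibetter.com", @"Unibetter\.com");

// In a verbatim string, \\ in the pattern matches a single backslash.
bool path = Regex.IsMatch(@"C:\Windows", @"C:\\Windows");

// Regex.Escape inserts the escape characters automatically.
string escaped = Regex.Escape("a.b*c");

Console.WriteLine(loose);    // True
Console.WriteLine(strict);   // False
Console.WriteLine(exact);    // True
Console.WriteLine(path);     // True
Console.WriteLine(escaped);  // a\.b\*c
```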

Qualifiers

Qualifiers (more commonly called quantifiers) specify how many times the preceding element may repeat. For example, the {2,3} we used when matching the phone number means 2 to 3 times. Commonly used qualifiers are:

Qualifier	Description
*	Repeat zero or more times
+	Repeat one or more times
?	Repeat zero times or once
{n}	Repeat exactly n times
{n,}	Repeat n or more times
{n,m}	Repeat n to m times

Table 2. Commonly used qualifiers
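Each row of Table 2 can be exercised with a one-line check. The sketch below assumes a small hypothetical helper, Test, and anchors every pattern with ^ and $ so that the whole input has to satisfy the repetition count.

```csharp
// Console sketch using top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true when the pattern matches the input.
bool Test(string input, string pattern) => Regex.IsMatch(input, pattern);

Console.WriteLine(Test("ac", @"^ab*c$"));       // True:  * allows zero b's
Console.WriteLine(Test("ac", @"^ab+c$"));       // False: + needs at least one b
Console.WriteLine(Test("color", @"^colou?r$")); // True:  ? makes the u optional
Console.WriteLine(Test("1234", @"^\d{3}$"));    // False: {3} means exactly three
Console.WriteLine(Test("12345", @"^\d{2,}$"));  // True:  {2,} means two or more
Console.WriteLine(Test("1234", @"^\d{2,3}$"));  // False: {2,3} allows two or three
```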

2. Support for regular expressions in .NET

The System.Text.RegularExpressions namespace contains classes that provide access to the .NET Framework regular expression engine. It provides regular expression functionality that can be used from any platform or language running within the Microsoft .NET Framework.

1. Use regular expressions in C#

Now that we know which classes support regular expressions in C#, let's put the phone number expression from above into C# code and carry out the verification.

The first step is to create a Windows project called SimpleCheckPhoneNumber.

The second step is to import the System.Text.RegularExpressions namespace.

The third step is to write out the regular expression. It is based on the phone number expression above. Since that expression can only verify numbers that use a hyphen between the area code and the number, we modify it: 0\d{2,3}-\d{7,8}|\(0\d{2,3}\)\d{7,8}. In this expression, the part before the | sign is what we discussed above, and the part after it verifies the (0379)65624150 form. Since ( and ) are also metacharacters, they must be escaped. | denotes alternation: the input may match either the part before it or the part after it.

The fourth step is to construct a Regex object from the regular expression.

The fifth step is to verify the input with the IsMatch method of the Regex class. IsMatch() returns a bool: true if there is a match, false otherwise.
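Steps two to five can be sketched as follows. This is a console sketch rather than the Windows project described in the text, and it makes one assumption that is not in the article: the alternation is wrapped in ^(?:...)$ so that IsMatch accepts only input that is a phone number from start to end, instead of any string that merely contains one.

```csharp
// Console sketch using top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;  // step two: import the namespace

// Steps three and four: write the expression and build a Regex from it.
// The ^(?: ... )$ wrapper is an addition made for this sketch.
var phoneRegex = new Regex(@"^(?:0\d{2,3}-\d{7,8}|\(0\d{2,3}\)\d{7,8})$");

// Step five: IsMatch returns true when the input matches.
bool hyphenForm = phoneRegex.IsMatch("0379-65624150");
bool parenForm = phoneRegex.IsMatch("(0379)65624150");
bool tooShort = phoneRegex.IsMatch("0379-1234");

Console.WriteLine(hyphenForm);  // True
Console.WriteLine(parenForm);   // True
Console.WriteLine(tooShort);    // False
```

Without the anchors, IsMatch would also return true for longer text that has a valid number somewhere inside it, which may or may not be what a validation form wants.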

3. Advanced regular expressions

Grouping

When matching phone numbers we repeated single characters. Now let's learn how to use grouping to match an IP address.

As we all know, an IP address is written as four decimal numbers separated by dots, so we can match it by grouping its fields. First, let's match a single field: 2[0-4]\d|25[0-5]|[01]?\d\d? This expression matches one field of an IP address, that is, a number from 0 to 255. 2[0-4]\d matches a three-digit field that starts with 2, has a tens digit from 0 to 4, and any units digit (200 to 249). 25[0-5] matches a three-digit field that starts with 25 and has a units digit from 0 to 5 (250 to 255). [01]?\d\d? matches a field with an optional leading 0 or 1 followed by one or two digits (0 to 199). The ? means zero or one occurrence, so both [01] and the last \d may be absent.

If we add a \. after the field, we get one dotted segment. Because | has the lowest precedence, the alternation needs its own parentheses so that the \. applies to every alternative: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.). Next we use this group: repeat it three times, then append one more field without the dot. The complete regular expression is: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)
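A minimal sketch of the IP address check is shown below. It wraps the field alternation in its own parentheses so that \. follows every alternative, and it adds ^ and $ anchors; the anchors are an assumption made here so that the whole input must be an address.

```csharp
// Console sketch using top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// One field (0-255) followed by a dot, three times, then a final field.
// The ^ and $ anchors are an addition made for this sketch.
var ipRegex = new Regex(
    @"^((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)$");

bool privateAddress = ipRegex.IsMatch("192.168.0.1");
bool maxAddress = ipRegex.IsMatch("255.255.255.255");
bool fieldTooLarge = ipRegex.IsMatch("256.1.1.1");
bool tooFewFields = ipRegex.IsMatch("1.2.3");

Console.WriteLine(privateAddress);  // True
Console.WriteLine(maxAddress);      // True
Console.WriteLine(fieldTooLarge);   // False
Console.WriteLine(tooFewFields);    // False
```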

Backreferences

Once we understand grouping, we can use backreferences. A backreference uses the text captured by an earlier group to match later characters. It is mostly used to match repeated text. For example, to match a repetition like go go, we can use (go) \1.

By default, each group automatically receives a group number. The rule is: counting from left to right by the opening parenthesis of each group, the first group is number 1, the second is number 2, and so on. You can also name a subexpression yourself. To do so, use the syntax (?<Word>\w+) (or replace the angle brackets with single quotes: (?'Word'\w+)), which gives the group \w+ the name Word. To backreference the text captured by this group, use \k<Word>. A generalized version of the previous example, which matches any repeated word, can therefore be written as: \b(?<Word>\w+)\b\s+\k<Word>\b.

Naming groups has another advantage. In a C# program, when we need the value of a group, we can retrieve it by the name we defined instead of by index.

When we do not need backreferences, there is no need for a group to remember anything. In that case we can use the (?:exp) syntax to tell the regular expression engine not to treat the content of the parentheses as a capture group, which improves efficiency.
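Numbered and named backreferences, and reading a group by name, can be sketched like this. The input "this is is a test" is a hypothetical sentence containing a doubled word.

```csharp
// Console sketch using top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Numbered backreference: \1 repeats whatever group 1 captured.
bool numbered = Regex.IsMatch("go go", @"(go) \1");

// Named group plus \k<Word>: any word immediately followed by itself.
Match m = Regex.Match("this is is a test", @"\b(?<Word>\w+)\b\s+\k<Word>\b");

string byName = m.Groups["Word"].Value;  // read the group by its name
string byIndex = m.Groups[1].Value;      // the same group by its number

Console.WriteLine(numbered);  // True
Console.WriteLine(m.Value);   // is is
Console.WriteLine(byName);    // is
Console.WriteLine(byIndex);   // is
```

Reading Groups["Word"] keeps working if more groups are later added in front of it, whereas an index would shift.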
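The difference between a capturing and a non-capturing group shows up in the Groups collection of the resulting Match. A small sketch, using the made-up input "gogo":

```csharp
// Console sketch using top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

Match capturing = Regex.Match("gogo", @"(go)+");
Match nonCapturing = Regex.Match("gogo", @"(?:go)+");

// Both patterns match the same text...
Console.WriteLine(capturing.Value);     // gogo
Console.WriteLine(nonCapturing.Value);  // gogo

// ...but only the first one stores a numbered group besides group 0.
Console.WriteLine(capturing.Groups.Count);     // 2
Console.WriteLine(nonCapturing.Groups.Count);  // 1
```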

Zero-width assertion

From the earlier introduction to metacharacters, we already know that some of them match the beginning and end of the string (^ and $) or the beginning and end of a word (\b). These metacharacters match only a position, requiring that the position satisfy a certain condition, rather than matching any characters, so they are called zero-width assertions. "Zero-width" means that they match a position and consume no characters; "assertion" means a test. In a regular expression, matching continues only when the assertion is true.

Sometimes we need to match a position that is not simply the start or end of a string or word, and then we have to write the assertion ourselves. Here is the syntax for assertions:

Assertion syntax	Description
(?=pattern)	Positive lookahead: matches a position that is followed by pattern
(?!pattern)	Negative lookahead: matches a position that is not followed by pattern
(?<=pattern)	Positive lookbehind: matches a position that is preceded by pattern
(?<!pattern)	Negative lookbehind: matches a position that is not preceded by pattern

Table 3. Syntax and description of assertions

Is it difficult to understand? Let's take a look at an example.

Suppose there is a tag: <book>. We want to get the tag name (book) from the tag <book>. We can use assertions for this. Look at the following expression: (?<=\<)(?<tag>\w*)(?=\>) . This expression matches the characters between < and >, which is book here. Assertions can also be used to write more complex expressions, but I won't give more examples here.

Another important point: the parentheses used in assertion syntax do not form capture groups, so they cannot be referenced by number or by name.
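The tag example can be sketched as below. The sketch writes < and > without the backslashes, since on their own they are not metacharacters; otherwise the pattern is the same lookbehind, named group, and lookahead.

```csharp
// Console sketch using top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// (?<=<) : position preceded by '<'   (lookbehind, consumes nothing)
// (?<tag>\w*) : the tag name, captured in the named group "tag"
// (?=>)  : position followed by '>'   (lookahead, consumes nothing)
Match m = Regex.Match("<book>", @"(?<=<)(?<tag>\w*)(?=>)");

Console.WriteLine(m.Value);                // book
Console.WriteLine(m.Groups["tag"].Value);  // book
Console.WriteLine(m.Index);                // 1
Console.WriteLine(m.Length);               // 4
```

The match starts at index 1 and has length 4, which shows that the angle brackets were tested but not included in the result.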

Greed and laziness

When a regular expression contains qualifiers that allow repetition, the default behavior is to match as many characters as possible (provided the expression as a whole can still match). Consider the expression a\w*b . When it is applied to the string aabab , the match is aabab . This is called greedy matching.

Sometimes we want the repetition to be as short as possible, so that the example above yields aab. This is lazy matching. To make a qualifier lazy, add a ? after it; the expression above becomes a\w*?b . When it is applied to the string aabab, the matches are aab and ab.

At this point you may ask: ab has fewer repetitions than aab, so why isn't ab matched first? The answer is that one rule has higher priority than greed or laziness: the match that begins earliest wins.
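Greedy and lazy matching on the string aabab can be compared directly with a short sketch:

```csharp
// Console sketch using top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Greedy: \w* takes as much as it can while still ending on a b.
string greedy = Regex.Match("aabab", @"a\w*b").Value;

// Lazy: \w*? takes as little as it can, so two shorter matches are found.
MatchCollection lazy = Regex.Matches("aabab", @"a\w*?b");

Console.WriteLine(greedy);         // aabab
Console.WriteLine(lazy.Count);     // 2
Console.WriteLine(lazy[0].Value);  // aab
Console.WriteLine(lazy[1].Value);  // ab
```

The first lazy match is aab rather than ab because the engine starts at the leftmost a and only then prefers the shortest repetition.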

Comments

Syntax: (?#comment)

For example: 2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199)

Note: If you use comments, be careful not to let characters such as spaces or line breaks appear in front of the parentheses of a comment. If such characters should be ignored, it is best to enable the "ignore whitespace in the pattern" option, that is, the IgnorePatternWhitespace member of the RegexOptions enumeration in C# (the RegexOptions enumeration is mentioned below).
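Both commenting styles can be sketched side by side. The first pattern uses inline (?#...) comments; the second spreads the same expression over several lines and relies on RegexOptions.IgnorePatternWhitespace, which also allows # comments that run to the end of the line. The ^(?:...)$ wrapper is an assumption added here so that only a complete field matches.

```csharp
// Console sketch using top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Inline comments: no whitespace may sneak into the pattern.
var inline = new Regex(
    @"^(?:2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199))$");

// With IgnorePatternWhitespace, layout and # comments are ignored.
var verbose = new Regex(@"
    ^(?:
        2[0-4]\d     # 200-249
      | 25[0-5]      # 250-255
      | [01]?\d\d?   # 0-199
    )$", RegexOptions.IgnorePatternWhitespace);

Console.WriteLine(inline.IsMatch("249"));   // True
Console.WriteLine(verbose.IsMatch("249"));  // True
Console.WriteLine(inline.IsMatch("256"));   // False
Console.WriteLine(verbose.IsMatch("256"));  // False
```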

Processing options in C#

In C#, the RegexOptions enumeration lets you choose how the regular expression engine processes a pattern. Its members are described in the MSDN documentation.
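As an illustration, the sketch below uses two members of the enumeration, IgnoreCase and Multiline, and combines them with the | operator. The input strings are hypothetical.

```csharp
// Console sketch using top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// IgnoreCase: letters match regardless of case.
bool caseSensitive = Regex.IsMatch("HELLO", "hello");
bool ignoreCase = Regex.IsMatch("HELLO", "hello", RegexOptions.IgnoreCase);

// Multiline: ^ and $ match at every line, not only at the string ends.
string lines = "a1\nb2\nc3";
int singleLineStarts = Regex.Matches(lines, @"^\w").Count;
int multiLineStarts = Regex.Matches(lines, @"^\w", RegexOptions.Multiline).Count;

// Options are flags and can be combined with |.
var combined = new Regex(@"^b", RegexOptions.IgnoreCase | RegexOptions.Multiline);
bool found = combined.IsMatch("a1\nB2");

Console.WriteLine(caseSensitive);     // False
Console.WriteLine(ignoreCase);        // True
Console.WriteLine(singleLineStarts);  // 1
Console.WriteLine(multiLineStarts);   // 3
Console.WriteLine(found);             // True
```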

Capture class, Group class, and Match class in C#

Capture class: Represents the result of a single subexpression capture, that is, one successfully captured substring. This class has no public constructor; you obtain a collection of Capture objects from the Group class or the Match class. The Capture class has three commonly used properties: Index, Length and Value. Index is the position of the first character of the captured substring, Length is the length of the captured substring, and Value is the captured substring itself.

Group class: Represents the information for one group in a regular expression and supports matching with groups. This class has no public constructor; you obtain a collection of Group objects from the Match class. If a group in the regular expression is named, it can be accessed by name; if it is not named, it can be accessed by index. Note: The 0th element (Groups[0]) of each Match's Groups collection is the entire string captured by that Match, and its Value equals the Match's Value.

Match class: Represents the result of a single regular expression match. This class also has no public constructor. You obtain an instance from the Match() method of the Regex class, or a collection of instances from the Matches() method of the Regex class.

All three classes can represent the result of a single regular expression match, but the Match class is the most detailed, since it includes both the capture and the group information. For this reason, Match is the most commonly used of the three.
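How the three classes relate can be seen in a short sketch. The group names area and number and the input strings are hypothetical, chosen to reuse the phone number format from earlier.

```csharp
// Console sketch using top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

Match m = Regex.Match("Call 0379-65624150 now",
                      @"(?<area>0\d{2,3})-(?<number>\d{7,8})");

Group whole = m.Groups[0];        // group 0 is the entire match
Group area = m.Groups["area"];    // a named group, accessed by name
Capture first = area.Captures[0]; // the single capture made by that group

Console.WriteLine(m.Success);                 // True
Console.WriteLine(whole.Value == m.Value);    // True
Console.WriteLine(m.Index);                   // 5
Console.WriteLine(area.Value);                // 0379
Console.WriteLine(m.Groups["number"].Index);  // 10
Console.WriteLine(first.Length);              // 4

// Matches() returns every match as a MatchCollection.
MatchCollection all = Regex.Matches("010-1234567, 0379-65624150",
                                    @"0\d{2,3}-\d{7,8}");
Console.WriteLine(all.Count);                 // 2
```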
