SoFunction
Updated on 2025-03-02

Further Study of Regular Expressions Page 1/2

I wanted to learn regular expressions (RE), but I am lazy by nature and always look for a shortcut. So I asked the great master Google, and with its divine power I found Jim Hollenhorst's article on the Internet. After reading it I thought it was really good, so I wrote up this short report on what I learned to share with my friends, hoping it will help you when you study RE. Jim Hollenhorst's article is at the following address if you want to go straight to it.
The 30 Minute Regex Tutorial by Jim Hollenhorst
/useritems/
What is RE?
I believe everyone has used the wildcard character "*" when searching for files. For example, to find all Word files in the Windows directory you might search for "*.doc", because "*" stands for any string of characters. RE does the same kind of thing, but is far more powerful.
When writing programs, you often need to check whether a string conforms to a specific pattern. The main job of RE is to describe such a pattern, so an RE can be regarded as a description of a pattern. For example, "\w+" represents a non-empty string composed of letters and digits. The .NET Framework provides a very powerful class library that makes it easy to use RE to find and replace text, parse complex headers, and validate text.
The best way to learn RE is to try examples yourself. Jim Hollenhorst also provides a tool, Expresso (think of a cup of coffee), to help us learn RE. The download URL is /useritems/regextutorial/expresssetup2_1c.zip.
Next, let’s experience some examples.
Some simple examples
Suppose you want to find the word elvis followed by alive in an article. Using RE, the search might go through the following steps; the text in parentheses explains each RE:
1. elvis (Find elvis)
The RE above says that the characters to search for are elvis. In .NET you can set an option to ignore case, so "Elvis", "ELVIS", and "eLvIs" all match RE 1. But because RE 1 only requires the characters of elvis to appear in order, pelvis matches it too. RE 2 improves on this.
2. \belvis\b (Find elvis as a whole word, such as Elvis or ELVIS when case is ignored)
"\b" has a special meaning in RE: it matches the boundary of a word. \belvis\b therefore uses \b to mark the boundaries before and after elvis, so only the whole word elvis matches.
Suppose you want to find elvis followed by alive on the same line. Two more special characters, "." and "*", are needed. "." means any character except a newline, while "*" means the preceding item is repeated any number of times, including zero. So ".*" means any number of characters other than newlines, and elvis followed by alive on the same line can be found with RE 3.
3. \belvis\b.*\balive\b (Find elvis followed by alive, such as elvis is alive)
Simple special characters combine into a powerful RE, but the more special characters an RE uses, the harder it becomes to read.
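The three expressions above can be tried out in a few lines of code. The following is a minimal sketch that uses Python's re module rather than .NET (an assumption made for brevity; re.IGNORECASE stands in for the .NET ignore-case option). The sample text and the helper name find_all are invented for this illustration.

```python
import re

# Sample text invented for this illustration.
TEXT = "Elvis hurt his pelvis.\nSome say elvis is alive today."

def find_all(pattern, text=TEXT):
    """Return every match of pattern in text, ignoring case."""
    return [m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)]

# RE 1: plain characters also match inside "pelvis".
print(find_all(r"elvis"))                  # ['Elvis', 'elvis', 'elvis']
# RE 2: \b restricts the match to the whole word.
print(find_all(r"\belvis\b"))              # ['Elvis', 'elvis']
# RE 3: "." does not cross a newline, so only the second line matches.
print(find_all(r"\belvis\b.*\balive\b"))   # ['elvis is alive']
```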
Take a look at another example
Validating a phone number
If you want to collect 7-digit phone numbers in the format xxx-xxxx from a web page, where x is a digit, the RE might be written like this.
4. \b\d\d\d-\d\d\d\d (Find a seven-digit phone number, such as 123-1234)
Each \d represents a digit, and "-" is an ordinary hyphen. To avoid so much repetition, the RE can be rewritten as in 5.
5. \b\d{3}-\d{4} (A better way to find seven-digit phone numbers, such as 123-1234)
{3} after \d means repeating the previous item three times, so \d{3} is equivalent to \d\d\d.
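To check that RE 4 and RE 5 really describe the same thing, both can be run over the same text. This is a small sketch in Python's re module (an assumption, the article itself targets .NET); the sample sentence and the names LONG_FORM, SHORT_FORM, and text are made up for the example.

```python
import re

LONG_FORM = re.compile(r"\b\d\d\d-\d\d\d\d")   # RE 4
SHORT_FORM = re.compile(r"\b\d{3}-\d{4}")      # RE 5

# Sample sentence invented for this illustration.
text = "Call 123-1234 or 555-0199 before noon."

# Both expressions find exactly the same phone numbers.
print(LONG_FORM.findall(text))    # ['123-1234', '555-0199']
print(SHORT_FORM.findall(text))   # ['123-1234', '555-0199']
```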
Expresso, a tool for learning and testing RE
Because RE is hard to read and easy to get wrong, Jim developed a tool, Expresso, to help users learn and test RE. In addition to the URL mentioned above, you can also get it from the Ultrapico website. After installing Expresso, you will find that Jim has put all the examples from his article into the expression library. You can test them while reading the article, or modify the RE in an example and see the result immediately. I find it very useful, and everyone can give it a try.
Basic concepts of RE in .NET
Special characters
Some characters have special meanings, such as "\b", ".", "*", and "\d" seen earlier. "\s" represents any whitespace character, such as a space, tab, or newline. "\w" means any letter or digit.
Let's see some examples
6. \ba\w*\b (Find the word starting with a, such as able)
This RE says: find the starting boundary of a word (\b), then the letter "a", then any number of alphanumeric characters (\w*), and finally the ending boundary of the word (\b).
7. \d+ (Find a string of digits)
"+" is very similar to "*", except that "+" requires at least one repetition. That is, there must be at least one digit.
8. \b\w{6}\b (Find a word of six alphanumeric characters, such as ab123c)
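Examples 6 to 8 can be compared side by side on one sentence. The sketch below uses Python's re module as a stand-in for .NET (an assumption); the sentence and the variable names are invented for the illustration. Note that \w also accepts the underscore, which the short description above leaves out.

```python
import re

# Sentence invented for this illustration.
sentence = "able ants ate 42 apples near crate ab123c"

words_starting_with_a = re.findall(r"\ba\w*\b", sentence)   # RE 6
digit_runs = re.findall(r"\d+", sentence)                   # RE 7
six_char_words = re.findall(r"\b\w{6}\b", sentence)         # RE 8

print(words_starting_with_a)   # ['able', 'ants', 'ate', 'apples', 'ab123c']
print(digit_runs)              # ['42', '123']
print(six_char_words)          # ['apples', 'ab123c']
```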
The following table lists the special characters commonly used in RE
.    Any character except a newline
\w   Any alphanumeric character
\s   Any whitespace character
\d   Any digit
\b   A word boundary
^    The beginning of the string; for example, "^the" matches "the" at the beginning of the string
$    The end of the string; for example, "end$" matches "end" at the end of the string
The special characters "^" and "$" require the match to sit at the beginning or end of the string. This is especially useful when validating that input fits a pattern. For example, to validate a seven-digit phone number, you might use RE 9.
9. ^\d{3}-\d{4}$ (Verify the phone number of seven digits)
This is like RE 5, except that no other characters may come before or after it; that is, the entire string must consist of just the seven-digit phone number. If you set the Multiline option in .NET, "^" and "$" are compared against each line, so a match only needs to start and end at the beginning and end of a line rather than of the whole string.
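The difference the Multiline option makes is easiest to see by running RE 9 both ways. This is a minimal sketch in Python's re module, where re.MULTILINE plays the role of the .NET Multiline option (an assumption about the mapping between the two libraries); the helper is_phone and the sample strings are hypothetical.

```python
import re

PHONE = r"^\d{3}-\d{4}$"   # RE 9

def is_phone(s):
    """Validate that the whole string is a seven-digit phone number."""
    return re.search(PHONE, s) is not None

print(is_phone("123-1234"))        # True
print(is_phone("call 123-1234"))   # False: extra text before the number
print(is_phone("123-12345"))       # False: extra digit after the number

# Three lines of text invented for this illustration.
lines = "123-1234\nabc\n555-0199"

# Without MULTILINE, ^ and $ refer to the whole string, so nothing matches.
print(re.findall(PHONE, lines))                 # []
# With MULTILINE, ^ and $ are checked against every line.
print(re.findall(PHONE, lines, re.MULTILINE))   # ['123-1234', '555-0199']
```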
Escape characters
Sometimes you need the plain literal meaning of "^" or "$" rather than their special meaning. In that case, put a "\" in front of the character to remove its special meaning, so "\^", "\.", and "\\" stand for the literal "^", ".", and "\".
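A short experiment shows what escaping changes. The sketch below uses Python's re module in place of .NET (an assumption); the sample string is invented, and re.escape is Python's own helper for escaping a literal string, not something the original article mentions.

```python
import re

# Sample string invented for this illustration (raw, so it holds one backslash).
sample = r"Price: 3.50 (was 3x50), path C:\temp, 2^10"

unescaped_dot = re.findall(r"3.50", sample)    # "." matches any character
escaped_dot = re.findall(r"3\.50", sample)     # "\." matches only a real dot
carets = re.findall(r"\^", sample)             # a literal "^"
backslashes = re.findall(r"\\", sample)        # a literal "\"

print(unescaped_dot)   # ['3.50', '3x50']
print(escaped_dot)     # ['3.50']
print(carets)          # ['^']
print(backslashes)     # ['\\']

# re.escape adds the backslashes for you when the text is meant literally.
print(re.escape("2^10"))   # 2\^10
```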
Repetition
We have already seen that "{3}" and "*" repeat the preceding character. Later we will see how to repeat an entire subexpression with the same syntax. The following table shows the ways to repeat the preceding item.
*       Repeat any number of times
+       Repeat at least once
?       Repeat zero times or once
{n}     Repeat exactly n times
{n,m}   Repeat at least n times, but not more than m times
{n,}    Repeat at least n times
Let's try some examples
10. \b\w{5,6}\b (Find words of five or six alphanumeric characters, such as as25d, d58sdf, etc.)
11. \b\d{3}\s\d{3}-\d{4} (Find a ten-digit phone number, such as 800 123-1234)
12. \d{3}-\d{2}-\d{4} (Find a Social Security number, such as 123-45-6789)
13. ^\w* (The first word of each line or of the entire string)
In Expresso, you can try RE 13 with and without the Multiline option to see the difference.
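If Expresso is not at hand, the same experiment can be approximated in code. The following sketch runs examples 10 to 13 with Python's re module, where re.MULTILINE is assumed to correspond to the .NET Multiline option; the two-line sample text and the variable names are invented for the illustration.

```python
import re

# Two-line sample text invented for this illustration.
sample = "as25d called 800 123-1234\nd58sdf has id 123-45-6789"

words_5_or_6 = re.findall(r"\b\w{5,6}\b", sample)            # RE 10
ten_digit = re.findall(r"\b\d{3}\s\d{3}-\d{4}", sample)      # RE 11
ssn_like = re.findall(r"\d{3}-\d{2}-\d{4}", sample)          # RE 12
first_word = re.findall(r"^\w*", sample)                     # RE 13
first_words = re.findall(r"^\w*", sample, re.MULTILINE)      # RE 13, multiline

print(words_5_or_6)   # ['as25d', 'called', 'd58sdf']
print(ten_digit)      # ['800 123-1234']
print(ssn_like)       # ['123-45-6789']
print(first_word)     # ['as25d']
print(first_words)    # ['as25d', 'd58sdf']
```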
Match a range of characters
Sometimes you need to find one of a specific set of characters. This is where square brackets "[]" come in handy. [aeiou] matches any one of the vowels "a", "e", "i", "o", and "u", and [.?!] matches any one of the punctuation marks ".", "?", and "!". Special characters inside the brackets lose their special meaning and are interpreted literally. You can also specify ranges of characters, such as "[a-z0-9]", which means any lowercase letter or any digit.
Next, let’s take a look at a more complicated example of finding phone numbers.
14. \(?\d{3}[) ]\s?\d{3}[- ]\d{4} (Find a ten-digit phone number, such as (080) 333-1234)
This RE finds phone numbers in more formats, such as (080) 123-4567 and 511 254 6654. "\(?" means zero or one left parenthesis "(", "[) ]" means a right parenthesis ")" or a space, and "\s?" means zero or one whitespace character. But this RE also finds a number like "800) 123-1234", in which the parentheses are not balanced. Alternatives can be used to solve this kind of problem.
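The strengths and the weakness of RE 14 both show up when it is run against a few candidate strings. This sketch uses Python's re module instead of .NET (an assumption) and writes the character class as "[) ]", a right parenthesis or a space, as the explanation describes. The helper name looks_like_phone and the test strings are hypothetical.

```python
import re

# RE 14, with the class "[) ]" meaning a right parenthesis or a space.
PHONE_10 = re.compile(r"\(?\d{3}[) ]\s?\d{3}[- ]\d{4}")

def looks_like_phone(s):
    """True when the whole string matches RE 14."""
    return PHONE_10.fullmatch(s) is not None

print(looks_like_phone("(080) 123-4567"))   # True
print(looks_like_phone("511 254 6654"))     # True
print(looks_like_phone("800-123-1234"))     # False: "-" is not in "[) ]"
# The weakness: an unbalanced parenthesis is accepted as well.
print(looks_like_phone("800) 123-1234"))    # True
```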
Negation (characters not in a specific group)
Sometimes you need to find characters that are not in a specific character group. The following table shows how to describe this.
\W         Any character that is not alphanumeric
\S         Any character that is not whitespace
\D         Any character that is not a digit
\B         A position that is not a word boundary
[^x]       Any character that is not x
[^aeiou]   Any character that is not a, e, i, o, or u
15. \S+ (A string that contains no whitespace characters)
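The negated forms are written with uppercase letters (\W, \S, \D, \B) or with "^" as the first character inside brackets. A quick sketch in Python's re module (used here as a stand-in for .NET, which is an assumption) shows each one at work; the sample phrase and variable names are invented for the illustration.

```python
import re

# Sample phrase invented for this illustration.
phrase = "room 42b, east wing"

non_space_runs = re.findall(r"\S+", phrase)        # RE 15
non_digit_runs = re.findall(r"\D+", phrase)
non_word_runs = re.findall(r"\W+", phrase)
non_vowels = re.findall(r"[^aeiou]", "audio")
inside_word = re.findall(r"\Bea", "east season")   # "ea" not at a word start

print(non_space_runs)   # ['room', '42b,', 'east', 'wing']
print(non_digit_runs)   # ['room ', 'b, east wing']
print(non_word_runs)    # [' ', ', ', ' ']
print(non_vowels)       # ['d']
print(inside_word)      # ['ea']
```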
Alternatives
Sometimes you need to match one of several alternatives, and the special character "|" comes in handy. For example, you may need to find postal codes of either five digits or nine digits (with a "-").
16. \b\d{5}-\d{4}\b|\b\d{5}\b (Find postal codes of five digits or of nine digits (with a "-"))
When using alternatives, pay attention to their order, because RE prefers the leftmost alternative that matches. In 16, if the five-digit alternative were placed first, the RE would find only five-digit postal codes. Once you understand alternatives, you can make a better version of 14.
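The effect of the order of alternatives can be verified directly. The sketch below uses Python's re module rather than .NET (an assumption; both try alternatives from left to right) and writes the alternation with the "|" character. The sample sentence and the names NINE_FIRST and FIVE_FIRST are invented for the illustration.

```python
import re

NINE_FIRST = r"\b\d{5}-\d{4}\b|\b\d{5}\b"   # RE 16
FIVE_FIRST = r"\b\d{5}\b|\b\d{5}-\d{4}\b"   # same alternatives, swapped

# Sample sentence invented for this illustration.
address = "Send to 12345-6789 or 54321."

# Nine-digit alternative first: the full code is found.
print(re.findall(NINE_FIRST, address))   # ['12345-6789', '54321']
# Five-digit alternative first: it wins at "12345", so "-6789" is lost.
print(re.findall(FIVE_FIRST, address))   # ['12345', '54321']
```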