Further Study of Regular Expressions

17. (\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4} (Ten-digit phone number)
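Example 17 can be tried outside Expresso with a short script. This is only a sketch: it assumes Python's standard re module rather than the engine the article uses, it reads the pattern as an alternation between a parenthesized and a bare area code, and the sample numbers are made up.

```python
import re

# Example 17: area code either as "(ddd)" or as bare "ddd",
# then an optional space, three digits, a hyphen or space, four digits.
PHONE = re.compile(r"(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}")

# Hypothetical sample inputs, chosen only for illustration.
samples = ["(800) 555-1212", "800 555 1212", "800-555-1212"]

# fullmatch requires the whole string to fit the pattern.
results = {s: bool(PHONE.fullmatch(s)) for s in samples}
for s, ok in results.items():
    print(f"{s!r}: {'match' if ok else 'no match'}")
```

The third sample fails because the pattern allows only an optional space, not a hyphen, after the area code.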
Grouping
Parentheses can be used to delimit a subexpression, so that the subexpression as a whole can be repeated or given other special treatment.
18. (\d{1,3}\.){3}\d{1,3} (A simple regex for finding network addresses)
The first part of this regex, (\d{1,3}\.){3}, means one to three digits followed by a literal ".". That group is repeated three times and is then followed by one more run of one to three digits, so the whole regex matches an address such as 192.72.28.1.
This has a drawback: each number in a network address can be at most 255, yet the regex above accepts any run of one to three digits, such as 999. What is needed is a check that each number is less than 256, and a regex alone cannot compare numbers. Example 19 uses alternation instead to limit each number to the desired range, 0 to 255.
19. ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) (Find a network address)
Have you noticed that regexes look more and more like something aliens would say? This one merely searches for a network address, and it is already very hard to read directly.
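The difference between examples 18 and 19 is easy to check by running both against a few addresses. The sketch below assumes Python's standard re module and writes the alternation bars of example 19 out explicitly; the helper name check and the sample addresses are made up for illustration.

```python
import re

# Example 18: any one to three digits in each position.
LOOSE = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# Example 19: each number limited to 0-255 by alternation.
OCTET = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"
STRICT = re.compile(rf"({OCTET}\.){{3}}{OCTET}")


def check(address):
    """Return (loose, strict) booleans for a whole-string match."""
    return bool(LOOSE.fullmatch(address)), bool(STRICT.fullmatch(address))


for address in ["192.72.28.1", "999.999.999.999", "256.1.1.1"]:
    print(address, check(address))
```

Building the strict pattern from a named OCTET piece is one way to keep such a regex readable; it expands to exactly the text of example 19.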
Expresso analyzer view
Expresso can turn the regex being edited into a tree-like description that breaks it into its separate subexpressions, which makes a good debugging environment. Other functions, such as Partial Match (match only the highlighted part of the regex) and Exclude Match (match everything except the highlighted part of the regex), are left for you to try.
When a subexpression is grouped with parentheses, the text it matches can be used later, either in the program or in the regex itself. By default, groups are named by number, starting from 1 and counting from left to right. These automatic group names can be seen in the skeleton view or the result view in Expresso.
A backreference is used to find a repeat of the text that was captured by a group. For example, "\1" refers to the text captured by group 1.
20. \b(\w+)\b\s*\1\b (Find repeated words. "Repeated" here means the same word twice in a row, separated by whitespace, as in "dog dog")
(\w+) captures a run of at least one letter or digit as group 1; the regex then looks for any amount of whitespace, followed by the same text as group 1.
If you don't like the automatic name 1, you can name the group yourself. In the example above, (\w+) is rewritten as (?<word>\w+). This names the captured group word, and the backreference must then be rewritten as \k<word>.
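A short script shows what example 20 captures. This is a sketch that assumes Python's standard re module, which accepts the pattern unchanged; the sample sentence is made up.

```python
import re

# Example 20: a word, optional whitespace, then the same word again.
REPEATED = re.compile(r"\b(\w+)\b\s*\1\b")

text = "the dog dog barked at at the cat"  # made-up sample text

# group(0) is the whole match, group(1) is the captured word.
matches = [(m.group(0), m.group(1)) for m in REPEATED.finditer(text)]
print(matches)
```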
21. \b(?<word>\w+)\b\s*\k<word>\b (Find repeated words using a named group)
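Named groups are spelled differently from one regex dialect to another. The sketch below assumes Python's standard re module, where the group is written (?P<word>...) and the backreference (?P=word) instead of the (?<word>...) and \k<word> forms used in example 21; the sample sentence is made up.

```python
import re

# Example 21 rewritten in Python's dialect:
#   article: (?<word>\w+)  ... \k<word>
#   Python:  (?P<word>\w+) ... (?P=word)
REPEATED = re.compile(r"\b(?P<word>\w+)\b\s*(?P=word)\b")

match = REPEATED.search("it was a dog dog day")  # made-up sample text
print(match.group("word"))  # the group, looked up by name
print(match.group(1))       # the same group is still reachable by number
```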
Parentheses have many special syntax elements. The more common ones are listed below:
Captures
(exp) Match exp and capture it in an automatically numbered group
(?<name>exp) Match exp and capture it in a group named name
(?:exp) Match exp, but do not capture it
Lookarounds
(?=exp) Match a position that is followed by exp
(?<=exp) Match a position that is preceded by exp
(?!exp) Match a position that is not followed by exp
(?<!exp) Match a position that is not preceded by exp
Comment
(?#comment) Comment
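The difference between (exp) and (?:exp) shows up in what the match object records. A minimal sketch, assuming Python's standard re module and a made-up sample string:

```python
import re

text = "host 192.72.28.1 is up"  # made-up sample text

# Capturing: the repeated group keeps its last repetition as group 1.
capturing = re.search(r"(\d{1,3}\.){3}\d{1,3}", text)

# Non-capturing: the same overall match, but no group is recorded.
non_capturing = re.search(r"(?:\d{1,3}\.){3}\d{1,3}", text)

print(capturing.group(0), capturing.groups())
print(non_capturing.group(0), non_capturing.groups())
```

Both patterns match the same text; only the capturing form stores a group, which matters when later groups are referred to by number.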
Positive lookaround
Next come the lookahead and lookbehind assertions. They look for text that comes before or after the current match without including it in the match itself. Like the special characters "^" and "\b", they match no text of their own (they only define a position), which is why they are called zero-width assertions. Some examples should make this clearer.
(?=exp) is a "zero-width positive lookahead assertion". It matches text that is followed by exp, without including exp itself.
22. \b\w+(?=ing\b) (The beginning of a word that ends in ing; for example, in "filling" it matches "fill")
(?<=exp) is a "zero-width positive lookbehind assertion". It matches text that is preceded by exp, without including exp itself.
23. (?<=\bre)\w+\b (The end of a word that begins with re; for example, in "repeated" it matches "peated")
24. (?<=\d)\d{3}\b (Three digits at the end of a word, preceded by a digit)
25. (?<=\s)\w+(?=\s) (An alphanumeric string with whitespace on both sides)
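Examples 22 to 25 can be run as they stand in an engine that supports fixed-width lookbehind. The sketch below assumes Python's standard re module; the variable names and sample strings are made up for illustration.

```python
import re

# Examples 22 to 25, each applied to a small made-up sample string.
ing_stems = re.findall(r"\b\w+(?=ing\b)", "filling the sink")
re_tails = re.findall(r"(?<=\bre)\w+\b", "repeated")
last_three = re.findall(r"(?<=\d)\d{3}\b", "12345 678")
between_spaces = re.findall(r"(?<=\s)\w+(?=\s)", "one two three")

print(ing_stems, re_tails, last_three, between_spaces)
```

In each case the text tested by the lookaround ("ing", "re", the leading digit, the surrounding spaces) is left out of the result.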
Negative lookaround
Earlier we saw how to match a character that is not a specific character, or is not in a specific character class. But what if you only want to verify that a certain character is absent, without matching anything in its place? For example, suppose you want to find a word that contains a q whose next letter is not u. You could try the following regex.
26. \b\w*q[^u]\w*\b (A word containing a q whose next letter is not u)
This regex has a problem: [^u] must always match a character, so if q is the last letter of the word, [^u] matches the whitespace after it, and the result may span two words, such as "Iraq haha". A negative lookaround solves this problem.
27. \b\w*q(?!u)\w*\b (A word containing a q that is not followed by u)
This is a "zero-width negative lookahead assertion".
28. \d{3}(?!\d) (Three digits that are not followed by another digit)
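Running examples 26 and 27 side by side makes the difference visible. This is a sketch that assumes Python's standard re module; the sample strings follow the text and are otherwise made up.

```python
import re

text = "Iraq haha"  # sample string along the lines of the one in the text

# Example 26: [^u] must consume a character, so it swallows the space.
with_class = re.findall(r"\b\w*q[^u]\w*\b", text)

# Example 27: (?!u) only inspects the next character and consumes nothing.
with_lookahead = re.findall(r"\b\w*q(?!u)\w*\b", text)

# Example 28: three digits that are not followed by another digit.
digit_runs = re.findall(r"\d{3}(?!\d)", "1234 567")

print(with_class, with_lookahead, digit_runs)
```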
Similarly, you can use (?<!exp), the "zero-width negative lookbehind assertion", to match text that is not preceded by exp.
29. (?<![a-z ])\w{7} (Seven alphanumeric characters that are not preceded by a letter or a space)
30. (?<=<(\w+)>).*(?=<\/\1>) (Text between html tags)
This uses a lookbehind and a lookahead assertion to retrieve the text between a pair of html tags, excluding the tags themselves.
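Not every engine accepts example 30 as written, because the lookbehind contains \w+ and so has no fixed width. The sketch below assumes Python's standard re module, which rejects such a lookbehind, and swaps in a different technique: it matches the tags as well and captures the inner text in a group. The names TAGGED and inner_text and the sample markup are made up for illustration.

```python
import re

# Example 30 relies on a variable-width lookbehind. Python's re module
# refuses to compile it, so record that instead of using it.
try:
    re.compile(r"(?<=<(\w+)>).*(?=<\/\1>)")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Swapped-in alternative: match the tags too, and capture the inner text.
TAGGED = re.compile(r"<(\w+)>(.*)</\1>")


def inner_text(html):
    """Return the text between a matching pair of tags, or None."""
    match = TAGGED.search(html)
    return match.group(2) if match else None


print(lookbehind_supported)
print(inner_text("<b>bold text</b>"))
```

The capture-group version consumes the tags as part of the match, which the lookaround version avoids; for simply extracting the inner text the result is the same.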
Comments, please
Parentheses have one more special use: wrapping comments. The syntax is "(?#comment)". If the "ignore pattern whitespace" option is set, whitespace in the regex is ignored when the regex is used. With this option set, the text from a "#" to the end of the line is ignored as well.
31. Text between html tags, with comments
(?<=      # Look for a prefix, but do not include it
<(\w+)>   # An html tag; the tag name is captured as group 1
)         # End of the prefix
.*        # Match any text
(?=       # Look for a suffix, but do not include it
<\/\1>    # "/" followed by the text captured by group 1, that is, the closing html tag
)         # End of the suffix
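Commented patterns of this kind can be sketched in Python, assuming its standard re module, where the re.VERBOSE flag plays the role of the "ignore pattern whitespace" option. Because Python does not accept the variable-width lookbehind of example 31, capture groups stand in for the lookarounds here; the names and the sample markup are made up.

```python
import re

# With re.VERBOSE, whitespace in the pattern is skipped and "#" starts a
# comment that runs to the end of the line.
TAGGED = re.compile(
    r"""
    <(\w+)>   # an opening html tag; its name becomes group 1
    (.*)      # any text, kept as group 2
    </\1>     # the closing tag: "/" plus the name captured in group 1
    """,
    re.VERBOSE,
)

# The same pattern on one line, using inline (?#comment) groups instead.
INLINE = re.compile(r"<(\w+)>(?#opening tag)(.*)(?#any text)</\1>(?#closing tag)")

match = TAGGED.search("<em>hello</em>")
inline_match = INLINE.search("<em>hello</em>")
print(match.group(1), match.group(2))
```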
Greedy and lazy
When a regex contains a quantifier that accepts a range of repetitions (such as ".*"), it normally matches as many characters as possible. This is greedy matching. For example:
32. a.*b (The longest string that starts with a and ends with b)
If the string is "aabab", the regex above matches "aabab", because that is the match with the most characters. Sometimes the match with the fewest characters is wanted instead; this is lazy matching. Any of the quantifiers described earlier can be made lazy by adding a question mark (?) after it. So "*?" means repeat any number of times, but use as few repetitions as possible. For example:
33. a.*?b (The shortest string that starts with a and ends with b)
If the string is "aabab", the first match found by the regex above is "aab" and the next is "ab", because these are the matches with the fewest characters.
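The two results described for "aabab" can be reproduced directly. A minimal sketch, assuming Python's standard re module:

```python
import re

text = "aabab"

greedy = re.findall(r"a.*b", text)   # example 32: longest possible match
lazy = re.findall(r"a.*?b", text)    # example 33: shortest matches, left to right

print(greedy)
print(lazy)
```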
*? Repeat any number of times, but as few as possible
+? Repeat one or more times, but as few as possible
?? Repeat zero or one time, but as few as possible
{n,m}? Repeat at least n times but no more than m times, and as few as possible
{n,}? Repeat at least n times, but as few as possible
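Each lazy quantifier can be compared with its greedy counterpart on the same input. The sketch below assumes Python's standard re module and a made-up digit string; re.match anchors every attempt at the start of the string.

```python
import re

digits = "12345"  # made-up sample input

# Each pair maps a greedy quantifier to its lazy counterpart.
pairs = {
    r"\d*": r"\d*?",
    r"\d+": r"\d+?",
    r"\d?": r"\d??",
    r"\d{2,4}": r"\d{2,4}?",
    r"\d{2,}": r"\d{2,}?",
}

# For every lazy pattern, record (greedy result, lazy result).
results = {
    lazy: (re.match(greedy, digits).group(), re.match(lazy, digits).group())
    for greedy, lazy in pairs.items()
}
for lazy, (g, l) in results.items():
    print(f"{lazy:10} greedy={g!r:8} lazy={l!r}")
```

With nothing after the quantifier to force it further, each lazy form stops at its minimum repetition count, which is zero characters for "*?" and "??".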
Anything else not mentioned?
So far many of the elements used to build a regex have been covered, but of course many have not. The following table lists some of the elements not covered so far. The number in the leftmost column is the number of the example in Expresso that illustrates the element.
#   Syntax            Description
    \a                Bell character
    \b                Normally a word boundary; inside a character class it means backspace
    \t                Tab
34  \r                Carriage return
    \v                Vertical tab
    \f                Form feed
35  \n                New line
    \e                Escape
36  \nnn              Character with octal code nnn
37  \xnn              Character with hexadecimal code nn
38  \unnnn            Character with Unicode code nnnn
39  \cN               Control-N character; for example, Ctrl-M is \cM
40  \A                Beginning of the string (like ^, but does not depend on the multiline option)
41  \Z                End of the string, or just before a newline at the end of the string
    \z                End of the string
42  \G                Position where the current search starts
43  \p{name}          Any character in the Unicode character class called name; for example, \p{lowercase_letter} refers to lowercase characters
    (?>exp)           Greedy subexpression, also known as a non-backtracking subexpression. It is matched only once and then takes no part in backtracking.
44  (?<x-y>exp)
    or (?<-y>exp)     Balancing group. Although complicated, it is very useful. It lets named capture groups be manipulated on a stack. (I don't know much about this, either)
45  (?im-nsx:exp)     Change the regex options for the subexpression exp. For example, (?-i:elvis) turns off the case-insensitive option for elvis.
46  (?im-nsx)         Change the regex options for the rest of the enclosing group.
    (?(exp)yes|no)    exp is treated as a zero-width positive lookahead. If it matches at this point, the yes subexpression is the next thing to match; if not, the no subexpression is.
    (?(exp)yes)       Same as above, but with no "no" subexpression
    (?(name)yes|no)   If name is a valid group name and that group has matched, the yes subexpression is the next thing to match; if not, the no subexpression is.
47  (?(name)yes)      Same as above, but with no "no" subexpression
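A few of the elements in this table can be tried in a short script. The sketch below assumes Python's standard re module, which shares \A, scoped options such as (?-i:exp), and group conditionals with the syntax listed above but does not support everything in the table (balancing groups, for instance). All patterns and sample strings are made up for illustration.

```python
import re

# \A anchors at the start of the string even when ^ matches at every line.
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)
string_start = re.findall(r"\A\w+", "one\ntwo", re.MULTILINE)

# Scoped options: (?i) turns case-insensitivity on for the whole pattern,
# and (?-i:...) turns it back off for one subexpression only.
KING = re.compile(r"(?i)king (?-i:Elvis)")
scoped = [bool(KING.fullmatch(s)) for s in ("KING Elvis", "KING elvis")]

# Conditional on a group: require ")" only if group 1 captured "(".
AREA = re.compile(r"(\()?\d{3}(?(1)\))")
conditional = [bool(AREA.fullmatch(s)) for s in ("(800)", "800", "(800")]

print(line_starts, string_start)
print(scoped)
print(conditional)
```

The conditional pattern is one way to accept an area code with or without parentheses while rejecting an unbalanced one.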
Conclusion
After this series of examples, and with Expresso's help, you should now have a basic understanding of regexes. There are also many articles about regexes on the Internet if you want to go further.