Basics of Matching Strings with Regular Expression Patterns

Introduction

In real projects, some features need to parse strings that follow specific patterns. In our existing code base, some functions do this by checking for specific characters by hand. This approach has several disadvantages:

The logic is error-prone
Boundary conditions are easily missed
The code is complex and hard to understand and maintain
Performance is poor

I found one cpp file in the code base with more than 2,000 lines of code, in which a single method spends more than 400 lines on string parsing alone! It compares characters one by one and is painful to read. Many of the comments are out of date, and the coding style varies from place to place, so it is safe to assume that many people have worked on it.

In this situation there was basically no way to keep going down the old path, so I naturally thought of regular expressions. However, I had no practical experience with them, especially with writing matching rules. My first thought was to look up some material online to get a general idea, but the results from Baidu were very disappointing. (Whenever you look for more specialized knowledge, Baidu's results are disheartening: page after page of identical copies. For everyday questions it is still fine.) In the end I gave up on Baidu, searched from outside the firewall, and found some fairly basic videos (a VPN is required to watch them).

This article is a summary of what I learned and introduces the basics of writing regular expressions to match strings. It is divided into two main parts:

Basic rules for matching strings
Regex matching, searching, and replacing

The regular expression grammar used in this article is ECMAScript, which is the default grammar of std::regex. The programming language is C++. Other grammars and languages are not covered.

Basic rules for matching strings

1. Match fixed strings

regex e("abc");

2. Match fixed strings, case-insensitive

regex e("abc", regex_constants::icase);

3. Match the fixed string followed by any one character, case-insensitive

regex e("abc.", regex_constants::icase); // . Any single character except a newline

4. Match zero or one occurrence of a character

regex e("abc?"); // ? Zero or one of the preceding character (here, c)

5. Match zero or more occurrences of a character

regex e("abc*"); // * Zero or more of the preceding character

6. Match one or more occurrences of a character

regex e("abc+"); // + One or more of the preceding character

7. Match any character from a set

regex e("ab[cd]*"); // [...] Any one character listed inside the square brackets

8. Match any character not in a set

regex e("ab[^cd]*"); // [^...] Any one character not listed inside the square brackets

9. Match a set a fixed number of times

regex e("ab[cd]{3}"); // {n} The element before {} must occur exactly 3 times

10. Match a set a number of times within a range

regex e("ab[cd]{3,}");  // {n,} The element before {} must occur 3 or more times
regex e("ab[cd]{3,5}");  // {n,m} The element before {} must occur at least 3 and at most 5 times

11. Match either of two alternatives

regex e("abc|de[fg]"); // | Matches the rule on either side of |

12. Match with a group

regex e("(abc)de+"); // () Defines a group

13. Match with back-references to groups

regex e("(abc)de+\\1");  // () defines a group; \1 matches the text captured by group 1 again at this position
regex e("(abc)c(de+)\\2\\1");  // \2 matches the text captured by group 2 again at this position

14. Match at the beginning of a string

regex e("^abc.");
// ^ Beginning of the string: finds a match for abc. only at the start of the string

15. Match at the end of a string

regex e("abc.$");
// $ End of the string: finds a match for abc. only at the end of the string

These are the most basic matching patterns. To match a character that has a special meaning literally, escape it with \. For example, to match a literal "." in the input, write \. in the pattern, which is "\\." inside a C++ string literal. If these basic rules do not cover your needs, consult a fuller reference on the ECMAScript grammar. Once you understand the basic patterns, the next step is to use regular expressions to match, search, or replace.

Regex matching, searching, and replacing

Once the pattern is written, the target string has to be checked against it in some way. There are three operations: matching (regex_match), searching (regex_search), and replacing (regex_replace).

Matching is simple: pass the target string and the regex to regex_match, and it returns a bool indicating whether the string satisfies the pattern. The entire string str must match.

bool match = regex_match(str, e);
// Match against the entire string str

Searching looks for a substring anywhere in the string that satisfies the pattern. That is, as long as some substring of str matches, it returns true.

bool match = regex_search(str, e);
// Search str for a substring that matches the rule e

In many cases, though, a bool is not enough and we need the matched substring itself. To get it, group the parts of interest in the pattern (see point 12 under [Basic rules for matching strings]), then pass an smatch object to regex_search to obtain the substring captured by each group.

smatch m;
bool found = regex_search(str, m, e);
// m[0] is the whole match; m[1], m[2], ... are the groups
for (size_t n = 0; n < m.size(); ++n)
{
  cout << "m[" << n << "].str()=" << m[n].str() << endl;
}

Replacement also works with the groups defined in the pattern.

cout << regex_replace(str, e, "$1 is on $2");

Here, each match is replaced with the text of group 1, followed by " is on ", followed by the text of group 2.

All three functions have many overloads to cover different situations.

Hands-on example

Requirements: Find strings of the form sectionA("sectionB") or sectionA ("sectionB"), and extract sectionA and sectionB separately. sectionA and sectionB contain no digits, may mix uppercase and lowercase letters, and are at least one character long.

Analysis: The pattern splits roughly into two parts, sectionA and sectionB, so it needs groups.

Step 1: Write the pattern that matches a section

[a-zA-Z]+

Step 2: A space may appear between sectionA and sectionB. Let's assume there is at most one

\\s?

Combining the two gives a pattern that covers the text we need. But how should it be organized so that it yields two groups?

[a-zA-Z]+\\s?[a-zA-Z]+

Written this way, it is definitely wrong: nothing is grouped. According to the grouping rules, each group must be enclosed in ().

regex e("([a-zA-Z]+)\\s?\\(\"([a-zA-Z]+)\"\\)");

Here, the \\(\" after \\s? escapes the opening parenthesis and quotation mark around sectionB, and the \"\\) at the end does the same for the closing pair.

With the pattern in place, first use regex_match to check the whole string. If it matches, use regex_search to extract the groups.

if (regex_match(str, e))
{
  smatch m;
  auto found = regex_search(str, m, e);
  // m[0] is the whole match; m[1] and m[2] are sectionA and sectionB
  for (size_t n = 0; n < m.size(); ++n)
  {
    cout << "m[" << n << "].str()=" << m[n].str() << endl;
  }
}
else
{
  cout << "Not matched" << endl;
}

The first element of m is the entire matched substring; the next two are the substrings captured by group 1 and group 2.

Summary

This article covered the basics of writing regular expression patterns and of using them to match, search, and replace strings in C++. I hope it is helpful to you.