Detailed introduction to regular expressions (Part 1)

This article is a translation of a tutorial written by Jan Goyvaerts for RegexBuddy. Let’s take a look below!

1. What is a regular expression

Basically, a regular expression is a pattern that describes a certain amount of text. "Regex" is short for "regular expression". This article writes <<regex>> to denote a specific regular expression.

The most basic pattern is a piece of literal text, which simply matches that same text.

2. Different regular expression engines

A regular expression engine is a piece of software that processes regular expressions. Typically, the engine is part of a larger application. Different regular expression engines are not fully compatible with each other. This tutorial focuses on the Perl 5 style of engine, because it is the most widely used. We will also mention some differences from other engines. Many modern engines, such as the .NET regular expression library and the JDK regex package, are very similar to it, but not identical.

3. Text symbols

The most basic regular expression consists of a single literal character. For example, <<a>> matches the first occurrence of the character "a" in the string. In the string "Jack is a boy", the "a" after the "J" is matched, and the second "a" is not.

A regular expression can also match the second "a", but only if you tell the regex engine to resume the search after the first match. In a text editor, you can use Find Next. In programming languages, there is usually a function that continues searching forward from the end of the previous match.

Similarly, <<cat>> will match "cat" in "About cats and dogs". This is equivalent to telling the regular expression engine to find a <<c>>, followed by an <<a>>, and then a <<t>>.

Note that the regex engine is case sensitive by default. Unless you tell the engine to ignore case, <<cat>> won't match "Cat".
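
As a rough illustration of the points above, here is a minimal sketch using Python's built-in re module. The choice of Python and all variable names are ours, not the tutorial's; finditer() stands in for the "Find Next" behavior described earlier.

```python
import re

text = "Jack is a boy"

# search() reports only the first (leftmost) match.
first = re.search(r"a", text)
print(first.start())

# finditer() acts like "Find Next": each search resumes after the previous match.
positions = [m.start() for m in re.finditer(r"a", text)]
print(positions)

# Matching is case sensitive unless a flag says otherwise.
strict = re.search(r"cat", "Cat")
relaxed = re.search(r"cat", "Cat", re.IGNORECASE)
print(strict, relaxed.group())
```

Positions are counted from 0 in Python, so the "a" after the "J" is at index 1.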

(1) Special characters

The twelve characters listed below are reserved for special purposes, although the closing bracket "]" is special only after an opening "[". They are:

[ ] \ ^ $ . | ? * + ( )

These special characters are also called metacharacters.

If you want to use these characters as text characters in a regular expression, you need to escape them with the backslash "\". For example, if you want to match "1+1=2", the correct expression is <<1\+1=2>>.

Note that <<1+1=2>> is also a valid regular expression. However, it does not match "1+1=2". Instead it matches "111=2" in "123+111=234", because "+" has a special meaning here (repeat one or more times).

In programming languages, keep in mind that some special characters are processed by the compiler first, before the string is passed to the regex engine. Therefore, the regular expression <<1\+1=2>> must be written as "1\\+1=2" in C++. To match "C:\temp", you need the regular expression <<C:\\temp>>, which in C++ source code becomes "C:\\\\temp".

(2) Non-printable characters
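
The same escaping rules can be tried out in Python. This is only a sketch under the assumption that Python's re module stands in for the Perl-style engine; raw strings (r"...") and re.escape() are Python conveniences that the tutorial itself does not mention.

```python
import re

# Escaped plus: matches the literal text "1+1=2".
escaped = re.search(r"1\+1=2", "1+1=2")

# Unescaped plus means "one or more 1s", so a different text matches.
unescaped = re.search(r"1+1=2", "123+111=234")
literal_miss = re.search(r"1+1=2", "1+1=2")

# In an ordinary string literal every backslash is doubled for the compiler,
# just like the C++ example; a raw string avoids the second level of escaping.
path_plain = re.search("C:\\\\temp", "C:\\temp")
path_raw = re.search(r"C:\\temp", "C:\\temp")

# re.escape() produces an escaped pattern for arbitrary literal text.
auto = re.escape("1+1=2")
print(escaped.group(), unescaped.group(), literal_miss, auto)
```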

Special character sequences can be used to represent certain non-displayable characters:

<<\t>> stands for Tab(0x09)

<<\r>> represents carriage return character (0x0D)

<<\n>> stands for newline character (0x0A)

It should be noted that text files in Windows use "\r\n" to end a line while Unix uses "\n".

4. The internal working mechanism of the regular expression engine

Knowing how the regex engine works helps you quickly understand why a certain regex doesn't work as you expect.

There are two types of engines: text-directed engines and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines respectively. This article deals with regex-directed engines, because some very useful features, such as lazy quantifiers and backreferences, can only be implemented in them. So it is no surprise that this kind of engine is currently the most popular.

You can easily tell whether the engine you are using is text-directed or regex-directed. If it supports backreferences or lazy quantifiers, you can be sure it is regex-directed. You can also run the following test: apply the regular expression <<regex|regex not>> to the string "regex not". If the match is "regex", the engine is regex-directed. If the match is "regex not", it is text-directed. The reason is that a regex-directed engine is "eager": it reports the first match it finds as soon as it finds it.

Regex-directed engines always return the leftmost match

This is an important point to understand: even if a "better" match could be found later in the string, a regex-directed engine always returns the leftmost match.

When <<cat>> is applied to "He captured a catfish for his cat", the engine first compares <<c>> with "H", which fails. It then compares <<c>> with "e", which also fails, and so on until the fourth character, where <<c>> matches "c". <<a>> then matches the fifth character. At the sixth character, <<t>> fails to match "p". The engine therefore starts over from the fifth character and tries <<c>> again. This continues until the fifteenth character, where <<cat>> matches the "cat" in "catfish". The regular expression engine eagerly returns this first match without looking for any other, possibly better, matches.
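
The alternation test is easy to run. The sketch below assumes Python's re module as the engine under test; the variable names are illustrative.

```python
import re

# Alternation test from the text: which alternative wins on "regex not"?
match = re.search(r"regex|regex not", "regex not")

# A regex-directed (backtracking) engine stops at the first alternative that works.
engine_kind = "regex-directed" if match.group() == "regex" else "text-directed"
print(match.group(), "->", engine_kind)
```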
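
To see the leftmost-match rule in practice, one can ask an engine where it found <<cat>>. This is a small sketch with Python's re module (our choice of language); note that Python counts positions from 0, so index 14 is the fifteenth character.

```python
import re

text = "He captured a catfish for his cat"

# search() returns the leftmost match: the "cat" inside "catfish".
match = re.search(r"cat", text)
print(match.start(), text[match.start():match.start() + 7])

# The standalone word "cat" is only found if the search is continued.
all_starts = [m.start() for m in re.finditer(r"cat", text)]
print(all_starts)
```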

5. Character Set

A character set is a collection of characters enclosed in a pair of square brackets "[]". With a character set, you tell the regex engine to match exactly one out of several characters. If you want to match an "a" or an "e", use <<[ae]>>. You can use <<gr[ae]y>> to match gray or grey. This is especially useful when you are not sure whether the text you are searching uses American or British spelling. Conversely, <<gr[ae]y>> will not match graay or graey. The order of the characters inside a character set does not matter; the results are the same.

You can use the hyphen "-" inside a character set to define a range of characters. <<[0-9]>> matches a single digit between 0 and 9. You can use more than one range: <<[0-9a-fA-F]>> matches a single hexadecimal digit, in either upper or lower case. You can also combine ranges with single characters: <<[0-9a-fxA-FX]>> matches a hexadecimal digit or the letter X. Again, the order of the characters and ranges has no effect on the results.

(1) Some applications of character set

Find a word even when it may be misspelled, such as <<sep[ae]r[ae]te>> or <<li[cs]en[cs]e>>.

Find an identifier in a programming language: <<[A-Za-z_][A-Za-z_0-9]*>>. (* means repeat 0 or more times)

Find a C-style hexadecimal number: <<0[xX][A-Fa-f0-9]+>>. (+ means repeat one or more times)
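
The character sets above can be sketched in Python as follows. The sample strings are made up for illustration, and Python's re module is assumed to behave like the Perl-style engine for these patterns.

```python
import re

# One character out of the set: both spellings match, "graay" and "graey" do not.
spellings = re.findall(r"gr[ae]y", "gray grey graay graey")

# Tolerating a common misspelling.
separate = re.findall(r"sep[ae]r[ae]te", "separate seperate")

# Identifier: a letter or underscore, then letters, digits or underscores.
identifiers = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", "total_1 = _tmp + 42")

# C-style hexadecimal numbers.
hex_numbers = re.findall(r"0[xX][A-Fa-f0-9]+", "mask = 0xFF & 0X1a; n = 10")

print(spellings, separate, identifiers, hex_numbers)
```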

(2) Inverse character set

A caret "^" immediately after the opening square bracket "[" negates the character set: the set then matches any character that is not listed between the brackets. Unlike ".", a negated character set also matches carriage returns and line feeds.

It is important to remember that a negated character set must still match one character. <<q[^u]>> does not mean "a q that is not followed by a u". It means "a q followed by a character that is not a u". So it will not match the q in "Iraq", but in "Iraq is a country" it matches the q together with the space after it. The space is part of the match because it is "a character that is not a u".

If you only want to match the q itself, on the condition that it is not followed by a u, you can use lookahead, which we will cover later.
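
The difference between "not followed by u" and "followed by something that is not u" shows up clearly in a quick experiment. This is a sketch in Python; the third sample string is our own addition, used to show that a negated set also matches a line break.

```python
import re

# Nothing follows the q, so the negated set has no character to match.
at_end = re.search(r"q[^u]", "Iraq")

# Here the space after the q is consumed as part of the match.
in_sentence = re.search(r"q[^u]", "Iraq is a country")

# A negated character set also matches a newline.
across_lines = re.search(r"q[^u]", "Iraq\nIran")

print(at_end, repr(in_sentence.group()), repr(across_lines.group()))
```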

(3) Metacharacters in the character set

Note that only 4 characters have a special meaning inside a character set: "]", "\", "^" and "-". "]" ends the character set definition; "\" escapes the next character; "^" negates the set; "-" defines a range. All other common metacharacters are ordinary characters inside a character set and do not need to be escaped. For example, to search for an asterisk or a plus sign, you can use <<[+*]>>. Your regex will work just as well if you escape those characters anyway, but it will be less readable.

Inside a character set definition, to use the backslash "\" as a literal character rather than as the escape character, escape it with another backslash. <<[\\x]>> matches a backslash or an x. The characters "]", "^" and "-" can either be escaped with a backslash or placed where they cannot take on their special meaning. We recommend the latter because it is more readable. For example, put "^" anywhere except right after the opening bracket "[" and it is taken literally instead of negating the set: <<[x^]>> matches an x or a ^. <<[]x]>> matches a "]" or an "x". <<[-x]>> or <<[x-]>> matches a "-" or an "x".
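
The placement rules can be checked one by one. The sketch below assumes Python's re module, which follows the same conventions for these particular cases; other engines may treat <<[]x]>> differently, so treat this as an illustration rather than a guarantee.

```python
import re

# "+" and "*" need no escaping inside a set.
plus_star = re.findall(r"[+*]", "2+3*4")

# A literal backslash still has to be escaped: matches a backslash or an x.
backslash_or_x = re.findall(r"[\\x]", r"a\xb")

# "^" is literal when it is not the first character.
caret = re.findall(r"[x^]", "x^y")

# "]" is literal when it comes first.
bracket = re.findall(r"[]x]", "[x]")

# "-" is literal at the start or end of the set.
hyphen_first = re.findall(r"[-x]", "x-y")
hyphen_last = re.findall(r"[x-]", "x-y")

print(plus_star, backslash_or_x, caret, bracket, hyphen_first, hyphen_last)
```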

(4) Shorthand character sets

Because some character sets are used very often, shorthand notations exist for them.

<<\d>> stands for <<[0-9]>>;

<<\w>> stands for "word characters". Exactly which characters these are depends on the regular expression implementation. In most implementations, the word character set includes <<[A-Za-z0-9_]>>.

<<\s>> stands for "whitespace characters". This also depends on the implementation. In most implementations it includes the space and tab characters, as well as carriage return and line feed <<\r\n>>.

Shorthand character sets can be used both inside and outside square brackets. <<\s\d>> matches a whitespace character followed by a digit. <<[\s\d]>> matches a single character that is either whitespace or a digit. <<[\da-fA-F]>> matches a hexadecimal digit.

Negated shorthand character sets

<<[\S]>> = <<[^\s]>>

<<[\W]>> = <<[^\w]>>

<<[\D]>> = <<[^\d]>>
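
A short Python sketch of the shorthands follows. One assumption worth flagging: in Python 3, \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is given, which is an example of the implementation differences mentioned above.

```python
import re

text = "room 7, floor\t3"

# A whitespace character followed by a digit (two characters per match).
pairs = re.findall(r"\s\d", text)

# A single character that is either whitespace or a digit.
singles = re.findall(r"[\s\d]", text)

# Shorthand inside a set; re.ASCII restricts \d to 0-9.
is_hex = re.fullmatch(r"[\da-fA-F]+", "1aF", re.ASCII) is not None

# The negated shorthand and the negated set are equivalent.
non_digits = re.findall(r"\D", "a1b2")
same = re.findall(r"[^\d]", "a1b2")

print(pairs, singles, is_hex, non_digits, same)
```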

(5) Repeating character sets

If you repeat a character set with the "?", "*" or "+" operator, you repeat the entire character set, not just the character it happened to match. The regular expression <<[0-9]+>> therefore matches 837 as well as 222.

If you want to repeat only the character that was matched, you can use a backreference. We will discuss backreferences later.
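
The contrast can be sketched as follows. The backreference pattern ([0-9])\1+ is our own illustrative example of the technique the text defers to a later part; it is not taken from this section.

```python
import re

text = "837 222"

# Repeating the set: each repetition may match a different digit.
any_digits = re.findall(r"[0-9]+", text)

# Repeating the matched character: \1 must equal whatever group 1 captured.
same_digit = [m.group() for m in re.finditer(r"([0-9])\1+", text)]

print(any_digits, same_digit)
```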

6. Repetition with ?, * and +

?: tells the engine to match the preceding token zero times or once. In effect, it makes the preceding token optional.

+: tells the engine to match the preceding token one or more times.

*: tells the engine to match the preceding token zero or more times.

<< <[A-Za-z][A-Za-z0-9]*> >> matches an HTML tag without attributes. The "<" and ">" are literal characters. The first character set matches a letter, and the second matches a letter or a digit.

It might seem that << <[A-Za-z0-9]+> >> would do as well, but that regex also matches <1>. It is still good enough when you know that the text you are searching does not contain such invalid tags.

(1) Limiting repetition

Many modern regular expression implementations let you specify how many times a token is repeated. The syntax is {min,max}, where min and max are non-negative integers. If the comma is present but max is omitted, there is no upper limit. If both the comma and max are omitted, the token is repeated exactly min times.

Therefore {0,} is the same as *, and {1,} is the same as +.

You can use <<\b[1-9][0-9]{3}\b>> to match a number between 1000 and 9999 ("\b" represents a word boundary). <<\b[1-9][0-9]{2,4}\b>> matches a number between 100 and 99999.
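
The two range patterns can be exercised with a few boundary values. This is a minimal sketch in Python; the sample numbers are ours.

```python
import re

# Exactly four digits, the first one non-zero: 1000 to 9999.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000 0123")

# Three to five digits, the first one non-zero: 100 to 99999.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

# {1,} behaves like +.
plus_equivalent = re.findall(r"[0-9]{1,}", "7 42") == re.findall(r"[0-9]+", "7 42")

print(four_digits, three_to_five, plus_equivalent)
```

The word boundaries matter: without them, the first pattern would happily match "1000" inside "10000".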

(2) Watch out for greediness

Suppose you want to match an HTML tag with a regular expression. You know that the input will be a valid HTML file, so the regular expression does not need to exclude invalid tags. So if it is content between two angle brackets, it should be an HTML tag.

Many newcomers to regular expressions first think of << <.+> >>. They are then surprised by the result for the test string "This is a <EM>first</EM> test". You might expect the regex to return <EM>, and then </EM> when the search continues after that match.

But it does not. The regular expression matches "<EM>first</EM>". Obviously this is not the result we want. The reason is that "+" is greedy: it makes the regular expression engine repeat the preceding token as many times as possible. The engine backtracks only if this repetition causes the rest of the regular expression to fail. That is, it gives up the last repetition and then tries the rest of the regular expression again.

Like "+", the "?" and "*" quantifiers are greedy.
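
The greedy behavior is easy to reproduce. The sketch below uses Python's re module on the test string with its <EM> tags.

```python
import re

text = "This is a <EM>first</EM> test"

# The greedy "+" grabs as much as possible, then backs off to the last ">".
greedy = re.search(r"<.+>", text)
print(greedy.group())
```

The match spans both tags and everything between them, not just the opening tag.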

(3) A closer look inside the regular expression engine

Let's look at how the regex engine processes the previous example. The first token is "<", a literal character, which matches the first "<" in the string. The second token is ".", which matches the character "E". The "+" then lets the dot match all the remaining characters up to the end of the line. There the dot fails, because "." does not match the newline character. So the engine moves on to the next token of the regular expression and tries to match ">". So far, "<.+" has matched "<EM>first</EM> test". The engine tries to match ">" against the newline, which fails. So the engine backtracks, and "<.+" now matches "<EM>first</EM> tes". The engine then tries to match ">" against "t", which fails as well. This process continues until "<.+" matches "<EM>first</EM" and ">" matches ">". The engine has now found the match "<EM>first</EM>". Remember that the regex engine is "eager", so it reports the first match it finds. It does not keep backtracking, even though there might be a better match, such as "<EM>". So we can see that, because "+" is greedy, the regular expression engine returns the leftmost, longest match.

(4) Replace greed with laziness

One way to fix the problem above is to make "+" lazy instead of greedy. You do this by placing a question mark "?" after the "+". The same works for "*", "{}" and "?". So in the example above we can use "<.+?>". Let's look at how the regular expression engine processes it.

Once again, the token "<" matches the first "<" in the string. The next token is ".", this time repeated by a lazy "+". This tells the regex engine to repeat the dot as few times as possible. So the engine matches "." with the character "E" and then tries to match ">" against "M", which fails. The engine backtracks, but unlike in the previous example, backtracking a lazy repetition extends it rather than shrinking it. So "<.+" is now extended to "<EM". The engine tries the next token ">" again, and this time it succeeds. The engine then reports "<EM>" as a successful match. That is roughly how the whole process goes.

(5) An alternative to lazy repetition

There is a better alternative: a greedy repetition of a negated character set, "<[^>]+>". It is better because, with lazy repetition, the engine backtracks once for every character of the tag before it finds a successful match. With the negated character set, no backtracking is needed.
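
Both fixes can be compared side by side. This sketch (Python, our choice) only shows that the two patterns return the same tags on this input; it does not measure the backtracking difference described above.

```python
import re

text = "This is a <EM>first</EM> test"

# Lazy repetition: stop at the first ">" that allows a match.
lazy = re.findall(r"<.+?>", text)

# Negated character set: the repetition cannot run past a ">" in the first place.
negated = re.findall(r"<[^>]+>", text)

print(lazy, negated)
```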

One last thing to remember: this tutorial only covers regex-directed engines. Text-directed engines do not backtrack, but they do not support lazy repetition either.

7. Use "." to match almost any character

In regular expressions, "." is one of the most commonly used symbols. Unfortunately, it is also one of the most misused.

"." matches a single character, no matter what that character is. The only exception is the newline character: in the engines discussed in this tutorial, the dot does not match newline characters by default. So by default, "." is shorthand for the character set [^\n\r] (Windows) or [^\n] (Unix).

This exception exists for historical reasons. Early tools that used regular expressions were line-based. They read a file line by line and applied the regular expression to each line separately. In these tools, the string never contained a newline character, so "." never had to match one.

Modern tools and languages can apply regular expressions to large strings and even entire files. All the regular expression implementations discussed in this tutorial provide an option to make "." match all characters, including newlines. In tools such as RegexBuddy, EditPad Pro or PowerGREP, you can simply tick the "Dot matches newline" option. In Perl, the mode in which "." matches newline characters is called "single-line mode". Unfortunately, this is a very confusing name, because there is also a so-called "multi-line mode". Multi-line mode only affects the anchors for the beginning and end of a line, while single-line mode only affects ".".

Other languages and regular expression libraries have adopted Perl's terminology. When using the regular expression classes of the .NET Framework, you activate single-line mode with a statement like the following: Regex.Match("string", "regex", RegexOptions.Singleline)
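
For comparison, here is a sketch of the same switch in Python, where the flag is called re.DOTALL. Note one difference from the character sets given above: Python's dot excludes only "\n" by default, so a carriage return is matched.

```python
import re

text = "first line\nsecond line"

# By default the dot refuses to match the newline between the two lines.
default = re.search(r"line.second", text)

# re.DOTALL is Python's name for what Perl calls single-line mode.
single_line = re.search(r"line.second", text, re.DOTALL)

# Python's dot still matches a carriage return.
carriage_return = re.search(r".", "\r")

print(default, repr(single_line.group()), carriage_return is not None)
```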

Use the dot "." sparingly

The dot is arguably the most powerful metacharacter. It lets you be lazy: with a dot, you can match almost any character. The problem is that it often matches characters that should not be matched.

I'll illustrate this with a simple example. Let's see how to match a date in the format "mm/dd/yy" while letting the user choose the delimiter. The first solution that comes to mind is <<\d\d.\d\d.\d\d>>. It looks as if it matches the date "02/12/03". The problem is that 02512703 is also considered a valid date.

<<\d\d[-/.]\d\d[-/.]\d\d>> looks like a better solution. Remember that the dot is not a metacharacter inside a character set. This solution is still far from perfect: it matches "99/99/99". <<[0-1]\d[-/.][0-3]\d[-/.]\d\d>> goes one step further, although it still matches "19/39/99". How perfect your regex needs to be depends on what you want to achieve. If you are validating user input, it needs to be as strict as possible. If you are just parsing a known source that you know contains no bad data, a reasonably good regular expression that matches the text you are looking for is enough.
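
The three date patterns can be compared on the inputs mentioned above. This is a sketch in Python; accepts() is a hypothetical helper introduced only for this illustration.

```python
import re

loose = r"\d\d.\d\d.\d\d"
better = r"\d\d[-/.]\d\d[-/.]\d\d"
stricter = r"[0-1]\d[-/.][0-3]\d[-/.]\d\d"

def accepts(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

for candidate in ["02/12/03", "02512703", "99/99/99", "19/39/99"]:
    results = [accepts(p, candidate) for p in (loose, better, stricter)]
    print(candidate, results)
```

Each step rejects one more bad input, yet even the strictest pattern still lets "19/39/99" through.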

8. Anchors for the beginning and end of a string

Anchors differ from ordinary regular expression tokens: they do not match any characters. Instead, they match a position before or after characters. "^" matches the position before the first character of the string. <<^a>> matches the a in the string "abc". <<^b>> does not match anything in "abc".

Similarly, $ matches the position after the last character in the string. So <<c$>> matches c in "abc".

(1) Application of anchoring

Using anchors is very important when verifying user input in programming languages. If you want to verify that the user's input is an integer, use <<^\d+$>>.

There are often extra leading or ending spaces in user input. You can use <<^\s*>> and <<\s*$>> to match leading or ending spaces.
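
A minimal validation sketch in Python follows. The helper names is_integer() and trim() are hypothetical, introduced only for this illustration.

```python
import re

def is_integer(user_input):
    # Both anchors are needed: without them "42 apples" would pass on "42".
    return re.search(r"^\d+$", user_input) is not None

def trim(user_input):
    # Remove leading whitespace first, then trailing whitespace.
    without_leading = re.sub(r"^\s*", "", user_input)
    return re.sub(r"\s*$", "", without_leading)

print(is_integer("42"), is_integer("42 apples"), is_integer(trim("  42  ")))
```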

(2) Use "^" and "$" as the start and end anchors of a line

Suppose you have a string that contains multiple lines, for example "first line\nsecond line" (where \n represents a newline character). It is often necessary to process each line separately rather than the string as a whole. Therefore, almost all regular expression engines provide an option that extends the meaning of both anchors. "^" then matches at the start of the string (before the f) and also right after each newline character (between \n and s). Similarly, "$" matches at the end of the string (after the last e) and also right before each newline character (between e and \n).

In .NET, the following call makes the anchors match before and after each newline character: Regex.Match("string", "regex", RegexOptions.Multiline)

Application: string str = Regex.Replace(Original, "^", "> ", RegexOptions.Multiline) inserts "> " at the beginning of each line.
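
The same quoting trick can be sketched in Python, where the corresponding flag is re.MULTILINE. The sample text is the two-line string from above.

```python
import re

original = "first line\nsecond line"

# With re.MULTILINE, "^" also matches right after each newline.
quoted = re.sub(r"^", "> ", original, flags=re.MULTILINE)
print(quoted)

# Without the flag, "^" matches only at the very start of the string.
only_first = re.sub(r"^", "> ", original)
print(only_first)
```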

(3) Absolute anchoring

<<\A>> only matches at the start of the entire string, and <<\Z>> only at the end of the entire string. Even in multi-line mode, <<\A>> and <<\Z>> never match at line breaks inside the string.

Although \Z and $ match only at the end of the string, there is one exception. If the string ends with a newline character, \Z and $ also match at the position before that newline, not just at the very end of the string. This "improvement" was introduced by Perl and has been copied by many regular expression implementations, including Java and .NET. If <<^[a-z]+$>> is applied to "joe\n", the match is "joe" rather than "joe\n".
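
The trailing-newline exception for "$" can be observed in Python as well. One caveat: Python's \Z does not follow the Perl behavior described above. In Python, \Z matches only at the very end of the string, so the sketch below also shows how a strict check differs.

```python
import re

text = "joe\n"

# "$" matches before the trailing newline, so the match is "joe".
dollar = re.search(r"^[a-z]+$", text)

# In Python, \Z is strict and the trailing newline makes the match fail.
strict = re.search(r"\A[a-z]+\Z", text)

# fullmatch() is another way to insist on the whole string.
whole = re.fullmatch(r"[a-z]+", text)

print(repr(dollar.group()), strict, whole)
```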

That concludes Part 1 of this detailed introduction to regular expressions. I hope it helps everyone understand regular expressions better.