Advanced Usage of Python Regular Expressions: Character Sets and Character Ranges Explained

Regular expressions are indispensable tools for text processing and data cleaning. In earlier articles we covered the basics of regular expression matching, such as matching single characters and the start and end of a string. Today we move on to a more advanced topic: character sets and character ranges.

These two concepts are the foundation of much of the matching power of regular expressions. Once you have mastered them, you can handle a wide variety of text patterns more efficiently and carry out more complex matching and data cleaning.

1. The concept of character sets

A character set is the part of a regular expression that matches any one character from a specified set. The set can list individual characters explicitly, or it can use a special character class to match a common group of characters.

The basic form of a character set is a list of characters enclosed in square brackets, []. For example, [abc] matches any one of the characters a, b or c. Character sets are a very common and important building block of regular expressions.

Example: Simple character set matching

Suppose we have the string text = "apple banana cherry" and we want to match all words that start with a, b or c:

import re
text = "apple banana cherry"
pattern = r"\b[abc]\w*\b"  # Match words starting with a, b or c
matches = re.findall(pattern, text)
print(matches)

In this example:

\b represents a word boundary, ensuring that we match complete words.
[abc] is a character set: the first character of the word must be a, b or c.
\w* matches word characters (letters, digits and underscores); * means zero or more repetitions.

Output result:

['apple', 'banana', 'cherry']

The regular expression found every word that begins with a, b or c.
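
A character set on its own always consumes exactly one character; in the pattern above it is the \w* after the set that extends the match to a whole word. A minimal sketch of the difference, using a made-up sample string and illustrative variable names:

```python
import re

sample = "cabbage"  # hypothetical sample string

# [abc] alone matches one character at a time
single_chars = re.findall(r"[abc]", sample)
print(single_chars)  # ['c', 'a', 'b', 'b', 'a']

# Adding + lets the set repeat, so a run of a/b/c is returned as one match
runs = re.findall(r"[abc]+", sample)
print(runs)  # ['cabba']
```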

Special characters in the character set

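Character sets are not limited to literal letters and digits. Python also provides shorthand classes that match a whole category of characters. They can be used on their own or inside brackets (for example, [\d_]):

\d: Matches any decimal digit. For ASCII text this is equivalent to [0-9]; with str patterns in Python 3 it also matches digits from other scripts unless the re.ASCII flag is set.
\w: Matches any letter, digit or underscore. For ASCII text this is equivalent to [a-zA-Z0-9_]; by default it also matches non-ASCII letters and digits.
\s: Matches any whitespace character (spaces, tabs, newlines, etc.).
.: Matches any single character except a newline. Unlike the shorthands above, it is not a character class: inside brackets, . is just a literal dot.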
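
The following sketch shows how these shorthands behave inside and outside brackets. The sample string and variable names are made up for illustration:

```python
import re

sample = "a1 b.2"  # hypothetical sample string

# Shorthand classes can be combined inside brackets: a digit or a whitespace character
digits_or_space = re.findall(r"[\d\s]", sample)
print(digits_or_space)  # ['1', ' ', '2']

# Inside brackets, "." is a literal dot
literal_dots = re.findall(r"[.]", sample)
print(literal_dots)  # ['.']

# Outside brackets, "." matches any character except a newline
any_char_count = len(re.findall(r".", sample))
print(any_char_count)  # 6
```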

Example: Match words containing numbers

If we want to match the words in a text that contain a digit, we can use \d to match the digit:

import re
text = "apple 123banana cherry 4567"
pattern = r"\b\w*\d\w*\b"  # Match words containing at least one digit
matches = re.findall(pattern, text)
print(matches)

Output result:

['123banana', '4567']

Here, instead of a bracketed set such as [abc], the pattern uses \d with \w* on either side, so it matches any word that contains at least one digit.

2. The concept of character ranges

A character range is a way of writing a contiguous run of characters inside a character set. It uses a hyphen (-) between the first and last character of the range. For example, [a-z] matches any single lowercase letter.

Example: Match lowercase letters

If we want to match the lowercase letters in a text, we can use the character range [a-z]:

import re
text = "apple Banana Cherry"
pattern = r"[a-z]"  # Match every lowercase letter
matches = re.findall(pattern, text)
print(matches)

Output result:

['a', 'p', 'p', 'l', 'e', 'a', 'n', 'a', 'n', 'a', 'h', 'e', 'r', 'r', 'y']

In this example, the regular expression [a-z] matches each lowercase letter individually (a, p, l, e and so on). The uppercase B and C are skipped.
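
To get runs of lowercase letters instead of single characters, the range can be followed by a quantifier. A minimal sketch with the same sample text (the variable name runs is illustrative):

```python
import re

text = "apple Banana Cherry"

# [a-z]+ matches one or more consecutive lowercase letters
runs = re.findall(r"[a-z]+", text)
print(runs)  # ['apple', 'anana', 'herry']
```

The capital letters are not part of the range, so they split the words at that point.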

Combination of character ranges

A character set is not limited to a single range: several ranges can be combined to match different kinds of characters. For example, [a-zA-Z] matches any letter, lowercase or uppercase.

import re
text = "apple Banana Cherry 123"
pattern = r"[a-zA-Z]"  # Match every letter (upper and lower case)
matches = re.findall(pattern, text)
print(matches)

Output result:

['a', 'p', 'p', 'l', 'e', 'B', 'a', 'n', 'a', 'n', 'a', 'C', 'h', 'e', 'r', 'r', 'y']

Example: Match letters, numbers and underscores

We can use \w to match letters, digits and underscores. For ASCII text this is equivalent to [a-zA-Z0-9_]:

import re
text = "apple_123 Banana Cherry_456 789"
pattern = r"\w+"  # Match words made up of letters, digits and underscores
matches = re.findall(pattern, text)
print(matches)

Output result:

['apple_123', 'Banana', 'Cherry_456', '789']

In this example, \w+ matches one or more letters, digits or underscores.
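
One caveat: for str patterns in Python 3, \w also matches non-ASCII letters and digits by default, so it matches exactly what [a-zA-Z0-9_] matches only when the re.ASCII flag is set. A small sketch with a made-up sample string (escape sequences are used so the accented characters are unambiguous):

```python
import re

sample = "caf\u00e9_1 na\u00efve"  # hypothetical sample: "café_1 naïve"

unicode_words = re.findall(r"\w+", sample)
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)
explicit_range = re.findall(r"[a-zA-Z0-9_]+", sample)

print(unicode_words)   # ['café_1', 'naïve']
print(ascii_words)     # ['caf', '_1', 'na', 've']
print(explicit_range)  # ['caf', '_1', 'na', 've']
```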

3. Use character sets and character ranges to match complex patterns

Character sets and character ranges make regular expressions very flexible and help us match more complex text patterns. The practical examples below show how to use them to match more complex strings.

Example: Match words made up of letters, numbers and underscores

Suppose we have the following text and want to match every word made up of letters, digits and underscores:

import re
text = "apple 123banana Cherry_456 789"
pattern = r"\b[a-zA-Z0-9_]+\b"  # Match words made up of letters, digits and underscores
matches = re.findall(pattern, text)
print(matches)

Output result:

['apple', '123banana', 'Cherry_456', '789']

In this example, \b[a-zA-Z0-9_]+\b matches whole words made up of letters, digits and underscores.
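
If the goal is to keep only the words that contain both a letter and a digit, one option is to filter the matches with two further character-set searches. A minimal sketch, reusing the sample text (the variable names words and mixed are illustrative):

```python
import re

text = "apple 123banana Cherry_456 789"

# First collect every word made up of letters, digits and underscores
words = re.findall(r"\b[a-zA-Z0-9_]+\b", text)

# Then keep only the words with at least one letter and at least one digit
mixed = [w for w in words if re.search(r"[a-zA-Z]", w) and re.search(r"[0-9]", w)]
print(mixed)  # ['123banana', 'Cherry_456']
```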

Example: Match date format (yyyy-mm-dd)

Suppose we have a string containing dates in the yyyy-mm-dd format and want to extract all of them:

import re
text = "Today is 2023-11-05, tomorrow is 2023-11-06."
pattern = r"\b\d{4}-\d{2}-\d{2}\b"  # Match the yyyy-mm-dd date format
matches = re.findall(pattern, text)
print(matches)

Output result:

['2023-11-05', '2023-11-06']

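In this example, \d{4}-\d{2}-\d{2} matches a date made up of 4 digits, 2 digits and 2 digits, separated by hyphens. It checks only the shape of the date, not whether the month and day are valid.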
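
Because \d{2} accepts any two digits, a string such as 2023-13-05 also matches. Character sets and ranges can narrow this down. The sketch below is one possible tightening, not a full date validator: it still accepts impossible dates such as 2023-02-31, and it relies on non-capturing groups (?:...), which this article does not otherwise cover. The sample text and variable names are made up:

```python
import re

text = "Valid: 2023-11-05. Invalid: 2023-13-05, 2023-11-32."

loose_pattern = r"\b\d{4}-\d{2}-\d{2}\b"

# Month 01-12 and day 01-31, expressed with character sets and ranges
strict_pattern = r"\b\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])\b"

loose = re.findall(loose_pattern, text)
strict = re.findall(strict_pattern, text)

print(loose)   # ['2023-11-05', '2023-13-05', '2023-11-32']
print(strict)  # ['2023-11-05']
```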

Example: Match phone numbers

If we want to extract phone numbers from text, we can use character classes to match numbers that follow a given format. Assume that the format of the phone number is "(xxx) xxx-xxxx":

import re
text = "My number is (123) 456-7890 and your number is (987) 654-3210."
pattern = r"\(\d{3}\) \d{3}-\d{4}"  # Match phone numbers like (xxx) xxx-xxxx
matches = re.findall(pattern, text)
print(matches)

Output result:

['(123) 456-7890', '(987) 654-3210']

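In this example, \(\d{3}\) \d{3}-\d{4} matches three digits in parentheses, a space, and then a number in the format "xxx-xxxx". The parentheses are escaped with backslashes because unescaped parentheses would create a group.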
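
Real text often mixes separators. A character set such as [ .-] can accept a space, a dot or a hyphen at the same position. The formats below are assumptions chosen for illustration rather than a complete phone-number grammar, and the sample text is made up:

```python
import re

text = "Call (123) 456-7890, 987.654.3210 or 555-123-4567."

# Optional parentheses around the area code; [ .-] allows space, dot or hyphen.
# The hyphen is placed last in the set so it is treated as a literal character.
pattern = r"\(?\d{3}\)?[ .-]?\d{3}[ .-]\d{4}"

numbers = re.findall(pattern, text)
print(numbers)  # ['(123) 456-7890', '987.654.3210', '555-123-4567']
```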

4. Notes on character sets and character ranges

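Character sets and character ranges are very powerful features of regular expressions, but there are a few things to keep in mind when using them:

Greediness comes from quantifiers, not from the set: A character set always matches exactly one character. When it is followed by a quantifier such as *, + or {m,n}, the pattern matches as many characters as possible. Append ? (as in [a-z]+?) for a lazy match, or use anchors such as ^ and $ and word boundaries (\b) to limit where a match may start and end.
The order inside a range matters, the order of the ranges does not: [a-zA-Z] and [A-Za-z] are equivalent and match exactly the same characters. Within a single range, however, the start must not come after the end, so [z-a] is an error. Ranges follow character code order, which means [A-z] is not a shorthand for all letters: it also matches punctuation such as [, ^ and _. To match a literal hyphen, put it first or last in the set (for example, [a-z-]) or escape it.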
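
These points can be checked directly in the interpreter. A minimal sketch with a made-up sample string and illustrative variable names:

```python
import re

sample = "Ab_z-9"  # hypothetical sample string

# The order in which ranges are listed does not change the result
order_same = re.findall(r"[a-zA-Z]", sample) == re.findall(r"[A-Za-z]", sample)
print(order_same)  # True

# [A-z] also covers the punctuation between Z and a, such as the underscore
wide_range = re.findall(r"[A-z]", sample)
print(wide_range)  # ['A', 'b', '_', 'z']

# A reversed range is rejected when the pattern is compiled
try:
    re.compile(r"[z-a]")
    reversed_ok = True
except re.error:
    reversed_ok = False
print(reversed_ok)  # False

# A hyphen placed last in the set is a literal hyphen
with_hyphen = re.findall(r"[a-z-]", sample)
print(with_hyphen)  # ['b', 'z', '-']

# Greediness comes from the quantifier, not from the set itself
greedy = re.findall(r"[a-z]+", "abc")
lazy = re.findall(r"[a-z]+?", "abc")
print(greedy)  # ['abc']
print(lazy)    # ['a', 'b', 'c']
```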

5. Summary

In this article we took a closer look at character sets and character ranges in Python regular expressions. These features let us describe exactly which characters may appear at each position in a pattern, which covers a wide range of text processing tasks. Used sensibly, they make it possible to extract information from text and to clean and convert data efficiently.

In practice, regular expressions offer much more than these two concepts. Character sets and ranges can be combined with other advanced features, such as grouping and backreferences, to handle more complex matching tasks. Mastering character sets and character ranges will make the regular expressions we write both quicker to produce and more accurate.