Regular expressions: an organized summary

To use regular expressions in Python, you need to import the re module (re is short for "regular expression"). Regular expressions are a tool for processing strings: a string often contains more information than we need, and a pattern lets us find, extract, or replace just the parts we want.

They come up constantly, because file processing is very common in Python and files are made of strings. Let's take a look at the functions the re module provides:

（1）match(pattern, string, flags=0)

    def match(pattern, string, flags=0):
        """Try to apply the pattern at the start of the string, returning
        a match object, or None if no match was found."""
        return _compile(pattern, flags).match(string)

As the docstring says, match() tries to apply the pattern at the start of the string. It returns a match object if the pattern matches there, and None otherwise.

Key points: (1) Start searching from the beginning; (2) Return None if it cannot be found.

Let’s take a look at a few examples:

    import re

    string = "abcdef"
    m = re.match("abc", string)    # match "abc" at the start and see what is returned (1)
    print(m)
    print(m.group())               # the matched text (2)
    n = re.match("abcf", string)   # the pattern is not in the string (3)
    print(n)
    l = re.match("bcd", string)    # the pattern occurs only in the middle of the string (4)
    print(l)

The output is as follows:

    <_sre.SRE_Match object; span=(0, 3), match='abc'>    (1)
    abc     (2)
    None    (3)
    None    (4)

Output (1) shows that a successful match() returns a match object (Python 3.7 and later print it as <re.Match object; ...>). To get the matched text, call group() on it, as in (2). If the pattern is not in the string at all, the result is None (3). And because match() only ever tries the start of the string, a pattern that occurs only in the middle also gives None (4).
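Because a failed match returns None, calling group() on the result without checking it first raises an AttributeError. The following is a minimal sketch of the usual guard; the variable names are only illustrative.

```python
import re

m = re.match("abc", "abcdef")

# Check for None before using the result: a failed match returns None,
# and None has no group() method.
if m is not None:
    matched_text = m.group()   # the text that matched
    matched_span = m.span()    # (start, end) indices of the match
    print(matched_text, matched_span)

missing = re.match("bcd", "abcdef")   # "bcd" is not at the start
print(missing is None)
```

The span is the same pair of indices that appears in the printed match object.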

（2）fullmatch(pattern, string, flags=0)

    def fullmatch(pattern, string, flags=0):
        """Try to apply the pattern to all of the string, returning
        a match object, or None if no match was found."""
        return _compile(pattern, flags).fullmatch(string)

As the docstring says, fullmatch() applies the pattern to the whole string: it returns a match object only if the entire string matches the pattern, and None otherwise.
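The original gives no example for fullmatch(), so here is a small sketch contrasting it with match() on the same string used earlier. The variable names are illustrative.

```python
import re

string = "abcdef"

prefix_only = re.match("abc", string)      # succeeds: "abc" is at the start
whole_fail = re.fullmatch("abc", string)   # None: "def" is left unmatched
whole_ok = re.fullmatch("abc.*", string)   # succeeds: the pattern covers everything

print(prefix_only.group())
print(whole_fail)
print(whole_ok.group())
```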

（3）search(pattern, string, flags=0)

    def search(pattern, string, flags=0):
        """Scan through string looking for a match to the pattern, returning
        a match object, or None if no match was found."""
        return _compile(pattern, flags).search(string)

As the docstring says, search() scans through the string looking for the pattern at any position. It returns a match object for the first place where the pattern matches, or None if it matches nowhere.

Key points: (1) search() looks for the pattern at any position in the string, unlike match(), which only tries the beginning; (2) it returns None if nothing is found.

    import re

    string = "ddafsadadfadfafdafdadfasfdafafda"
    m = re.search("a", string)   # the first match is in the middle of the string (1)
    print(m)
    print(m.group())             # the matched text (2)
    n = re.search("N", string)   # the pattern does not match (3)
    print(n)

The output is as follows:

    <_sre.SRE_Match object; span=(2, 3), match='a'>    (1)
    a       (2)
    None    (3)

Result (1) shows that search(pattern, string, flags=0) can match at any position, which makes it more widely applicable than match(), and that it returns a match object. (2) To see the matched text, call group() on the match object. (3) If nothing is found, None is returned.

（4）sub(pattern,repl,string,count=0,flags=0)

    def sub(pattern, repl, string, count=0, flags=0):
        """Return the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in string by the
        replacement repl. repl can be either a string or a callable;
        if a string, backslash escapes in it are processed. If it is
        a callable, it's passed the match object and must return
        a replacement string to be used."""
        return _compile(pattern, flags).sub(repl, string, count)

sub(pattern, repl, string, count=0, flags=0) finds and replaces. pattern is the regular expression to look for in string; repl is what each match is replaced with; count limits how many matches are replaced, and the default of 0 means all of them. An example:
    import re

    string = "ddafsadadfadfafdafdadfasfdafafda"
    m = re.sub("a", "A", string)            # no count given: replace every match (1)
    print(m)
    n = re.sub("a", "A", string, count=2)   # limit the number of replacements (2)
    print(n)
    l = re.sub("F", "B", string)            # the pattern does not match (3)
    print(l)

The output is as follows:

    ddAfsAdAdfAdfAfdAfdAdfAsfdAfAfdA        (1)
    ddAfsAdadfadfafdafdadfasfdafafda        (2)
    ddafsadadfadfafdafdadfasfdafafda        (3)

In (1) no count is given, so every match is replaced by default. In (2) a count is given, so only that many matches are replaced, starting from the left. In (3) the pattern does not occur in the string, so the original string is returned unchanged.

Key points: (1) count limits the number of replacements; without it, all matches are replaced; (2) if nothing matches, the original string is returned.
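The docstring also mentions that repl can be a callable that receives the match object and returns the replacement text. A minimal sketch of that form follows; the helper double_number and the sample string are hypothetical, chosen only for illustration.

```python
import re

def double_number(match):
    # Hypothetical helper: receives the match object and returns
    # the replacement text for that match.
    return str(int(match.group()) * 2)

# Every run of digits is replaced by twice its numeric value.
result = re.sub(r"\d+", double_number, "a1b22c333")
print(result)
```

A callable is useful whenever the replacement depends on what was matched, which a fixed replacement string cannot express.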

（5）subn(pattern,repl,string,count=0,flags=0)

    def subn(pattern, repl, string, count=0, flags=0):
        """Return a 2-tuple containing (new_string, number).
        new_string is the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in the source
        string by the replacement repl. number is the number of
        substitutions that were made. repl can be either a string or a
        callable; if a string, backslash escapes in it are processed.
        If it is a callable, it's passed the match object and must
        return a replacement string to be used."""
        return _compile(pattern, flags).subn(repl, string, count)

As the docstring says, subn() returns a 2-tuple (new_string, number): the string after replacement and the number of substitutions that were made.

    import re

    string = "ddafsadadfadfafdafdadfasfdafafda"
    m = re.subn("a", "A", string)            # replace every match (1)
    print(m)
    n = re.subn("a", "A", string, count=3)   # replace only some of the matches (2)
    print(n)
    l = re.subn("F", "A", string)            # the pattern does not occur in the string (3)
    print(l)

The output is as follows:

    ('ddAfsAdAdfAdfAfdAfdAdfAsfdAfAfdA', 11)     (1)
    ('ddAfsAdAdfadfafdafdadfasfdafafda', 3)      (2)
    ('ddafsadadfadfafdafdadfasfdafafda', 0)      (3)

The output shows that sub() and subn(pattern, repl, string, count=0, flags=0) replace in exactly the same way; only the return value differs. sub() returns a string, while subn() returns a tuple holding the new string and the number of replacements made.

（6）split(pattern,string,maxsplit=0,flags=0)

    def split(pattern, string, maxsplit=0, flags=0):
        """Split the source string by the occurrences of the pattern,
        returning a list containing the resulting substrings. If
        capturing parentheses are used in pattern, then the text of all
        groups in the pattern are also returned as part of the resulting
        list. If maxsplit is nonzero, at most maxsplit splits occur,
        and the remainder of the string is returned as the final element
        of the list."""
        return _compile(pattern, flags).split(string, maxsplit)

split(pattern, string, maxsplit=0, flags=0) splits a string: it cuts the string wherever the pattern matches and returns the resulting pieces in a list. An example:
    import re

    string = "ddafsadadfadfafdafdadfasfdafafda"
    m = re.split("a", string)               # split on every "a" (1)
    print(m)
    n = re.split("a", string, maxsplit=3)   # limit the number of splits (2)
    print(n)
    l = re.split("F", string)               # the delimiter does not occur in the string (3)
    print(l)

The output is as follows:

    ['dd', 'fs', 'd', 'df', 'df', 'fd', 'fd', 'df', 'sfd', 'f', 'fd', '']    (1)
    ['dd', 'fs', 'd', 'dfadfafdafdadfasfdafafda']    (2)
    ['ddafsadadfadfafdafdadfasfdafafda']    (3)

From (1) we can see that if the string begins or ends with the delimiter, the list contains an empty string "" at that end; here the string ends with "a", so the last element is "". In (2), maxsplit limits the number of splits, and the rest of the string is returned as the final element. In (3), the delimiter does not occur in the string, so the result is a list containing just the original string.
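The docstring notes that capturing parentheses in the pattern cause the delimiters themselves to be returned as part of the list. A short sketch of the difference, using a hypothetical date string as sample input:

```python
import re

string = "2024-01-15"   # hypothetical sample input

without_group = re.split("-", string)    # delimiters are dropped
with_group = re.split("(-)", string)     # delimiters are kept as list items

print(without_group)
print(with_group)
```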

（7）findall(pattern, string, flags=0)

    def findall(pattern, string, flags=0):
        """Return a list of all non-overlapping matches in the string.
        If one or more capturing groups are present in the pattern, return
        a list of groups; this will be a list of tuples if the pattern
        has more than one group.
        Empty matches are included in the result."""
        return _compile(pattern, flags).findall(string)

findall(pattern, string, flags=0) returns a list containing every non-overlapping match of the pattern in the string. An example:
    import re

    string = "dd12a32d46465fad1648fa1564fda127fd11ad30fa02sfd58afafda"
    m = re.findall("[a-z]", string)   # match every letter and return a list (1)
    print(m)
    n = re.findall("[0-9]", string)   # match every digit and return a list (2)
    print(n)
    l = re.findall("[ABC]", string)   # the pattern does not match (3)
    print(l)

The output is as follows:

    ['d', 'd', 'a', 'd', 'f', 'a', 'd', 'f', 'a', 'f', 'd', 'a', 'f', 'd', 'a', 'd', 'f', 'a', 's', 'f', 'd', 'a', 'f', 'a', 'f', 'd', 'a']    (1)
    ['1', '2', '3', '2', '4', '6', '4', '6', '5', '1', '6', '4', '8', '1', '5', '6', '4', '1', '2', '7', '1', '1', '3', '0', '0', '2', '5', '8']    (2)
    []    (3)

In result (1), every letter in the string is matched, one character per list element; (2) matches every digit in the string and returns them in a list; (3) nothing matches, so an empty list is returned.

Key points: (1) findall() returns an empty list when nothing matches; (2) without a quantifier, a character class such as [a-z] matches one character at a time, so each list element is a single character.
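The docstring also says that capturing groups change what findall() returns: with one group the list holds that group's text, and with several groups it holds tuples. A minimal sketch, using a hypothetical "key=value" string as sample input:

```python
import re

string = "width=640, height=480"   # hypothetical sample input

whole = re.findall(r"\w+=\d+", string)           # no groups: the full matches
one_group = re.findall(r"\w+=(\d+)", string)     # one group: only that group's text
two_groups = re.findall(r"(\w+)=(\d+)", string)  # two groups: a list of tuples

print(whole)
print(one_group)
print(two_groups)
```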

（8）finditer(pattern,string,flags=0)

    def finditer(pattern, string, flags=0):
        """Return an iterator over all non-overlapping matches in the
        string. For each match, the iterator returns a match object.
        Empty matches are included in the result."""
        return _compile(pattern, flags).finditer(string)

finditer(pattern, string) finds every non-overlapping match of the pattern in the string and returns an iterator; each item the iterator yields is a match object.

The code is as follows:

    import re

    string = "dd12a32d46465fad1648fa1564fda127fd11ad30fa02sfd58afafda"
    m = re.finditer("[a-z]", string)   # the pattern matches (1)
    print(m)
    n = re.finditer("AB", string)      # the pattern does not match (2)
    print(n)

The output is as follows:

    <callable_iterator object at 0x7fa126441898>   (1)
    <callable_iterator object at 0x7fa124d6b710>   (2)

From the output we can see that finditer(pattern, string, flags=0) returns an iterator object in both cases, even when the pattern does not match (2); in that case the iterator simply yields nothing.
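Printing the iterator only shows its address, so to see the matches you have to loop over it. A small sketch, assuming a shortened version of the sample string:

```python
import re

string = "dd12a32d46465"   # shortened sample string

# Each item produced by the iterator is a match object.
matches = [(m.group(), m.span()) for m in re.finditer(r"\d+", string)]
print(matches)

# When nothing matches, the iterator is simply empty.
no_matches = list(re.finditer("AB", string))
print(no_matches)
```

Compared with findall(), finditer() gives access to the position of every match through the match object, not just the matched text.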

（9）compile(pattern,flags=0)

    def compile(pattern, flags=0):
        "Compile a regular expression pattern, returning a pattern object."
        return _compile(pattern, flags)

（10）purge()

    def purge():
        "Clear the regular expression caches"
        _cache.clear()
        _cache_repl.clear()

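（11）template(pattern, flags=0)

    def template(pattern, flags=0):
        "Compile a template pattern, returning a pattern object"
        return _compile(pattern, flags|T)

Note that template() was never documented; it was deprecated in Python 3.11 and removed in Python 3.13, so it is listed here only for completeness.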

Regular expression:

Syntax:

    import re

    string = "dd12a32d46465fad1648fa1564fda127fd11ad30fa02sfd58afafda"
    p = re.compile("[a-z]+")   # first compile the pattern with compile(pattern)
    m = p.match(string)        # then match against the string
    print(m.group())

The compile and match steps can also be combined into a single call:

    m = re.match("^[0-9]", '14534Abc')

The effect is the same. The difference is that the first form compiles (parses) the pattern once in advance, so nothing has to be prepared at match time. The second, shorter form has to look the pattern up, and compile it if necessary, on every call; the re module keeps a cache of recently compiled patterns, so this is mostly a lookup cost, not a full recompile each time. Still, if you need to find every line that starts with a digit in a 50,000-line file, it is recommended to compile the pattern first and then reuse it, which is faster.
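The following sketch shows the compile-once pattern for that "lines starting with a digit" case. The list lines is hypothetical sample data standing in for the lines of a large file, and starts_with_digit is an illustrative name.

```python
import re

# Hypothetical sample data standing in for the lines of a large file.
lines = ["14534Abc", "abc123", "9 lives", "", "x"]

# Compiled once, then reused for every line.
starts_with_digit = re.compile(r"^[0-9]")

digit_lines = [line for line in lines if starts_with_digit.match(line)]
print(digit_lines)
```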

Matching format:

(1)^ Match the beginning of the string

    import re

    string = "dd12a32d41648f27fd11a0sfdda"
    # ^ matches the beginning of the string; here we use it with search()
    m = re.search("^[0-9]", string)    # does the string start with a digit? (1)
    print(m)
    n = re.search("^[a-z]+", string)   # does the string start with letters? Anchored like this, search() behaves much like match() (2)
    print(n.group())

The output is as follows:

    None
    dd

In (1), ^ anchors the pattern to the beginning of the string and we test whether the string starts with a digit. The string starts with a letter, so the match fails and None is returned. In (2) we test whether the string starts with letters; it does, so the leading letters are returned. Used this way, ^ makes search() behave like match(), which always starts from the beginning.

(2) $ Match the end of the string

    import re

    string = "15111252598"
    # ^ anchors the pattern at the beginning of the string and $ at the end
    m = re.match("^[0-9]{11}$", string)
    print(m.group())

The output is as follows:

    15111252598

re.match("^[0-9]{11}$", string) matches only if the string starts with a digit, ends with a digit, and consists of exactly 11 digits in total.

(3) . (dot) Match any character except a newline. When the re.DOTALL flag is specified, the dot matches any character, including a newline

    import re

    string = "1511\n1252598"
    # The dot (.) matches any character except a newline
    m = re.match(".", string)    # without a quantifier, the dot matches a single character (1)
    print(m.group())
    n = re.match(".+", string)   # .+ matches one or more characters, but stops at a newline (2)
    print(n.group())

The output is as follows:

    1
    1511

From the output we can see that in (1) the dot (.) matches a single character. In (2) we match one or more characters, but because the string contains a newline, only the content before the newline is matched and the content after it is not.

Key points: (1) the dot (.) matches any character except a newline; (2) .+ matches one or more characters other than a newline.
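To illustrate the flag mentioned in the heading, here is a minimal sketch of the same .+ pattern with and without re.DOTALL. The variable names are illustrative.

```python
import re

string = "1511\n1252598"

default = re.match(".+", string)                   # stops at the newline
dotall = re.match(".+", string, flags=re.DOTALL)   # the dot also matches "\n"

print(repr(default.group()))
print(repr(dotall.group()))
```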

(4) [...] Match any one of the characters in the brackets; for example, [abc] matches "a", "b" or "c"

[...] matches any single character listed inside the brackets. [A-Za-z0-9] means matching A-Z or a-z or 0-9.

    import re

    string = "1511\n125dadfadf2598"
    # [] matches any one of the characters inside the brackets
    m = re.findall("[5fd]", string)   # match every 5, f and d in the string
    print(m)

The output is as follows:

    ['5', '5', 'd', 'd', 'f', 'd', 'f', '5']

In the above code, we match every 5, f and d in the string, and findall() returns them in a list.

(5) [^...] Match any character not in the brackets; for example, [^abc] matches any character except a, b and c

    import re

    string = "1511\n125dadfadf2598"
    # [^] matches any character that is not inside the brackets
    m = re.findall("[^5fd]", string)   # match every character other than 5, f and d
    print(m)

The output is as follows:

    ['1', '1', '1', '\n', '1', '2', 'a', 'a', '2', '9', '8']

In the above code, we match every character other than 5, f and d; [^...] matches any single character that is not listed inside the brackets, including the newline.

(6) * Match 0 or more occurrences of the preceding expression

    import re

    string = "1511\n125dadfadf2598"
    # * matches 0 or more occurrences of the preceding expression
    m = re.findall(r"\d*", string)   # match 0 or more digits
    print(m)

The output is as follows:

    ['1511', '', '125', '', '', '', '', '', '', '', '2598', '']

From the output we can see that * matches 0 or more occurrences of the preceding expression; here \d* matches 0 or more digits. Wherever there is no digit, the pattern still succeeds by matching zero characters, which is why the list contains empty strings ("").

(7) + Match 1 or more occurrences of the preceding expression

    import re

    string = "1511\n125dadfadf2598"
    # + matches 1 or more occurrences of the preceding expression
    m = re.findall(r"\d+", string)   # match 1 or more digits
    print(m)

The output is as follows:

    ['1511', '125', '2598']

The plus sign (+) matches 1 or more occurrences of the preceding expression. Above, \d+ matches runs of at least one digit, so no empty strings appear in the result.

(8) ? Match 0 or 1 occurrence of the preceding expression; placed after another quantifier (as in *? or +?), it makes that quantifier non-greedy

    import re

    string = "1511\n125dadfadf2598"
    # ? matches 0 or 1 occurrence of the preceding expression
    m = re.findall(r"\d?", string)   # match 0 or 1 digit
    print(m)

The output is as follows:

    ['1', '5', '1', '1', '', '1', '2', '5', '', '', '', '', '', '', '', '2', '5', '9', '8', '']

The question mark (?) matches 0 or 1 occurrence of the preceding expression, so each digit is matched on its own. Wherever there is no digit, the pattern matches zero characters and an empty string ("") is returned.
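The non-greedy use of ? is easiest to see side by side with the greedy form. A small sketch, using a hypothetical tag-like string as sample input:

```python
import re

string = "<b>bold</b> and <i>italic</i>"   # hypothetical sample input

greedy = re.findall(r"<.+>", string)   # .+ grabs as much as possible
lazy = re.findall(r"<.+?>", string)    # .+? stops at the first possible ">"

print(greedy)
print(lazy)
```

The greedy pattern returns one long match from the first "<" to the last ">", while the non-greedy one returns each tag separately.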

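(9) {n} Match exactly n occurrences of the preceding expression

(10) {n,m} Match between n and m occurrences of the preceding expression, as many as possible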
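These two quantifiers have no example in the original, so here is a minimal sketch using the same sample string as the neighbouring sections. The variable names are illustrative.

```python
import re

string = "1511\n125dadfadf2598"

exactly_two = re.findall(r"\d{2}", string)      # runs of exactly 2 digits
two_to_three = re.findall(r"\d{2,3}", string)   # 2 to 3 digits, preferring 3

print(exactly_two)
print(two_to_three)
```

Leftover digits that cannot form a full run, such as the final "5" of "125" in the first call, are simply skipped.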

(11) \w Match a letter, digit, or underscore

\w matches the word characters in a string: letters, digits, and the underscore. The code is as follows:

    import re

    string = "1511\n125dadfadf2598"
    # \w matches letters, digits, and the underscore
    m = re.findall(r"\w", string)   # match every word character
    print(m)

The output is as follows:

    ['1', '5', '1', '1', '1', '2', '5', 'd', 'a', 'd', 'f', 'a', 'd', 'f', '2', '5', '9', '8']

As can be seen from the above code, \w matches the letters and digits in the string; the newline is the only character left out.

(12) \W Match any character that is not a letter, digit, or underscore; the exact opposite of lowercase \w

An example:

    import re

    string = "1511\n125dadfadf2598"
    # \W matches characters that are not letters, digits, or the underscore
    m = re.findall(r"\W", string)
    print(m)

The output is as follows:

    ['\n']

In the above code, \W matches the characters that are not letters or digits, so only the newline character is matched.

(13) \s Match any whitespace character, equivalent to [ \t\n\r\f\v]

An example:

    import re

    string = "1511\n125d\ta\rdf\fadf2598"
    # \s matches any whitespace character in a string
    m = re.findall(r"\s", string)
    print(m)

The output is as follows:

    ['\n', '\t', '\r', '\x0c']

From the output we can see that \s matches any whitespace character: here the newline, tab, carriage return and form feed (printed as '\x0c') are matched.

(14) \S Match any non-whitespace character

An example:

    import re

    string = "1511\n125d\ta\rdf\fadf2598"
    # \S matches any non-whitespace character
    m = re.findall(r"\S", string)
    print(m)

The output is as follows:

    ['1', '5', '1', '1', '1', '2', '5', 'd', 'a', 'd', 'f', 'a', 'd', 'f', '2', '5', '9', '8']

From the above code, we can see that \S matches any non-whitespace character: the result contains every character of the string except the four whitespace characters.

(15) \d Match any digit, equivalent to [0-9] for ASCII text

(16) \D Match any non-digit character
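These two classes have no example in the original, so here is a minimal sketch in the same style as the sections above, using the same sample string.

```python
import re

string = "1511\n125dadfadf2598"

digits = re.findall(r"\d", string)       # every digit, one per list element
non_digits = re.findall(r"\D", string)   # everything else, including "\n"

print(digits)
print(non_digits)
```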

Summary: findall() and split() both return lists, but they are opposites: split() treats the pattern as a delimiter and returns the text between the matches, while findall() returns the matches themselves.