The re module gives the Python language full regular expression capabilities.
Syntax used in this article
Pattern | Meaning | Example
+ | The preceding element appears one or more times | ab+: ab, abbbb, etc.
* | The preceding element appears zero or more times | ab*: a, ab, abb, etc.
? | The preceding element appears zero times or once | Ab?: A, Ab
^ | Anchors the match at the start of the string | ^a: abc, aaaaaa, etc.
$ | Anchors the match at the end of the string | c$: abc, cccc, etc.
\d | A digit | 3, 4, 9, etc.
\D | A non-digit character | A, a, -, etc.
[a-z] | Any lowercase letter from a to z | a, p, m, etc.
[0-9] | Any digit from 0 to 9 | 0, 2, 9, etc.
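As a quick check of the table, the following Python 3 sketch runs each pattern against one string that should match and one that should not. The helper name `matches` and the test strings in `cases` are made up for this illustration; `re.search` is used so that only `^` and `$` control the position of the match.

```python
import re

def matches(pattern, text):
    """Return True if pattern is found anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# (pattern, string that should match, string that should not match)
cases = [
    (r"ab+", "abbbb", "a"),   # + needs at least one b
    (r"ab*", "a", "b"),       # * allows zero b's, but the a is required
    (r"Ab?", "A", "b"),       # ? makes the b optional
    (r"^a", "abc", "bca"),    # ^ anchors at the start
    (r"c$", "abc", "cab"),    # $ anchors at the end
    (r"\d", "3", "A"),        # a digit
    (r"\D", "-", "7"),        # a non-digit
    (r"[a-z]", "p", "P"),     # a lowercase letter
    (r"[0-9]", "9", "x"),     # a digit, written as a range
]

for pattern, good, bad in cases:
    print(pattern, matches(pattern, good), matches(pattern, bad))
```

Each printed line should show `True False` after the pattern.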
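Note:
1. Escape characters: to match a literal special character such as ( or ), put a backslash in front of it.
>>> import re
>>> s = '(abc)def'
>>> m = re.search(r"(\(.*\)).*", s)
>>> print(m.group(1))
(abc)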
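The escaping rule can also be applied automatically. The Python 3 sketch below repeats the example with a raw string and then uses `re.escape`, which backslash-escapes the special characters in a literal string; the variable names are only illustrative.

```python
import re

s = "(abc)def"

# \( and \) match literal parentheses; the outer unescaped pair is a capture group.
m = re.search(r"(\(.*\)).*", s)
print(m.group(1))

# re.escape() builds the escaped form of a literal string for us.
literal = "(abc)"
escaped = re.escape(literal)
m2 = re.search(escaped, s)
print(escaped, m2.group())
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before the regex engine sees them.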
The re.match function
re.match() tries to match the pattern at the start of the string. If the pattern does not match at the start, it returns None.
Example 1:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import re

# 'www.example.net' is a placeholder string
print(re.match('www', 'www.example.net').span())  # matches at the start position
print(re.match('net', 'www.example.net'))         # does not match at the start position
Output results:
(0, 3)
None
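To make the contrast concrete, here is a small Python 3 sketch that runs `re.match` and `re.search` on the same text. The string `www.example.net` is a placeholder chosen for this illustration.

```python
import re

text = "www.example.net"  # placeholder string for illustration

at_start = re.match("www", text)      # pattern found at position 0
not_at_start = re.match("net", text)  # "net" is not at the start, so None
anywhere = re.search("net", text)     # search scans the whole string

print(at_start.span())
print(not_at_start)
print(anywhere.span())
```

`re.search` finds "net" at the end of the string, while `re.match` gives up because the string does not begin with it.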
Example 2:
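#!/usr/bin/python
import re

line = "Cats are smarter than dogs"
# re.M | re.I: multi-line and case-insensitive flags
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)
if matchObj:
    print "matchObj.group() : ", matchObj.group()
    print "matchObj.group(1) : ", matchObj.group(1)
    print "matchObj.group(2) : ", matchObj.group(2)
else:
    print "No match!!"
Output results:
matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter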
The code above uses the Python 2 print statement. In Python 3, print is a function, so wrap the arguments in parentheses, e.g. print(matchObj.group(1)).
Python group()
In a regular expression, parentheses () define groups, and group() returns the string captured by a group.
Repeating the preceding substring: (dk){0,1} matches the group dk zero times or once.
>>> import re
>>> a = "kdlal123dk345"
>>> b = "kdlal123345"
>>> m = re.search("([0-9]+(dk){0,1})[0-9]+", a)
>>> m.group(1), m.group(2)
('123dk', 'dk')
>>> m = re.search("([0-9]+(dk){0,1})[0-9]+", b)
>>> m.group(1)
'12334'
>>> m.group(2)
>>>
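The same behavior can be seen in one pass with `Match.groups()`, which returns every captured group as a tuple. This is a Python 3 sketch; `(dk)?` is used here as shorthand for `(dk){0,1}`, and the variable names are illustrative.

```python
import re

pattern = re.compile(r"([0-9]+(dk)?)[0-9]+")  # (dk)? is equivalent to (dk){0,1}

results = {}
for text in ("kdlal123dk345", "kdlal123345"):
    m = pattern.search(text)
    # A group that did not take part in the match is reported as None
    results[text] = m.groups()
    print(text, m.groups())
```

In the second string the optional group matches nothing, so its slot in the tuple is `None`, and the last digit is left for the trailing `[0-9]+`.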
Typical examples
1. Determining whether a string is all lowercase
# -*- coding: cp936 -*-
import re

s1 = 'adkkdk'
s2 = 'abc123efg'

an = re.search('^[a-z]+$', s1)
if an:
    print 's1:', an.group(), 'All lowercase'
else:
    print s1, "Not all lowercase!"

an = re.match('[a-z]+$', s2)
if an:
    print 's2:', an.group(), 'All lowercase'
else:
    print s2, "Not all lowercase!"
Explanation
1. Regular expressions are not built into the Python language itself; to use them, import the re module.
2. A match is performed with either re.search(pattern, string) or re.match(pattern, string). The difference is that re.match is anchored at the start of the string, as if the pattern began with ^. Therefore,
re.search('^[a-z]+$', s) is equivalent to re.match('[a-z]+$', s)
3. If the match fails, re.search('^[a-z]+$', s1) returns None, so an is None.
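The equivalence in point 2 can be checked directly. The Python 3 sketch below also adds `re.fullmatch`, which requires the whole string to match and so needs no anchors; the helper name `is_all_lowercase` is made up for this example.

```python
import re

def is_all_lowercase(s):
    """True if s is one or more ASCII lowercase letters and nothing else (illustrative helper)."""
    return re.fullmatch(r"[a-z]+", s) is not None

for s in ("adkkdk", "abc123efg"):
    by_search = re.search(r"^[a-z]+$", s) is not None
    by_match = re.match(r"[a-z]+$", s) is not None
    # The two anchored forms and fullmatch agree on these inputs
    print(s, by_search, by_match, is_all_lowercase(s))
```

One difference worth knowing: `$` also matches just before a trailing newline, so `re.search(r"^[a-z]+$", "abc\n")` succeeds, whereas `re.fullmatch` rejects that string.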
group() retrieves the groups captured by a match.
For example:
import re

a = "123abc456"
print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(0)  # 123abc456, the whole match
print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(1)  # 123
print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(2)  # abc
print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(3)  # 456
Output:
123abc456
123
abc
456
1) The three sets of parentheses in the regular expression divide the match into three groups.
group() is the same as group(0): the whole text matched by the regular expression.
group(1) returns the part matched by the first pair of parentheses, group(2) the second, and group(3) the third.
2) If nothing matches, re.search() returns None.
3) If the regular expression contains no parentheses, calling group(1) is an error: it raises IndexError (no such group).
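These three points can be exercised in a few lines. The following is a Python 3 sketch with illustrative variable names; it shows `groups()`, fetching several groups in one call, and the error raised when a group number does not exist.

```python
import re

m = re.search(r"([0-9]*)([a-z]*)([0-9]*)", "123abc456")
print(m.group(0))     # the whole match, same as m.group()
print(m.groups())     # all three groups as a tuple
print(m.group(1, 3))  # several groups at once, returned as a tuple

# A pattern without parentheses has no group 1.
m2 = re.search(r"[0-9]+", "123abc456")
try:
    m2.group(1)
    error_name = None
except IndexError as err:
    error_name = type(err).__name__
    print("group(1) failed:", err)
```

Because `re.search` returns `None` on failure, real code should check the result before calling `group()` on it.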
2. Acronym expansion
Examples
FEMA Federal Emergency Management Agency
IRA Irish Republican Army
DUP Democratic Unionist Party
FDA Food and Drug Administration
OLC Office of Legal Counsel
Analysis
Abbreviation: FEMA
Decomposed into: F*** E*** M*** A***
Rule: uppercase letter + lowercase letters (one or more) + space
Reference code
import re

def expand_abbr(sen, abbr):
    lenabbr = len(abbr)
    ma = ''
    # one "capital letter + lowercase letters + space" unit per letter
    for i in range(0, lenabbr):
        ma += abbr[i] + "[a-z]+" + ' '
    print 'ma:', ma
    ma = ma.strip(' ')  # drop the trailing space
    p = re.search(ma, sen)
    if p:
        return p.group()
    else:
        return ''

print expand_abbr("Welcome to Agriculture Bank China", 'ABC')
Problem
The code above handles the first three examples correctly but fails on the last two, because a lowercase word ("and", "of") sits between the words that begin with capital letters.
Rule
Uppercase letter + lowercase letters (one or more) + space + [lowercase letters + space] (zero times or once)
Reference code
import re

def expand_abbr(sen, abbr):
    lenabbr = len(abbr)
    ma = ''
    # after each word except the last, allow one optional lowercase word
    for i in range(0, lenabbr-1):
        ma += abbr[i] + "[a-z]+" + ' ' + '([a-z]+ )?'
    ma += abbr[lenabbr-1] + "[a-z]+"
    print 'ma:', ma
    ma = ma.strip(' ')
    p = re.search(ma, sen)
    if p:
        return p.group()
    else:
        return ''

print expand_abbr("Welcome to Agriculture Bank of China", 'ABC')
Tip
The lowercase word in the middle and the space after it are wrapped in parentheses and treated as one unit. Either both appear or neither does, which is exactly what ? expresses when applied to the whole preceding group.
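The rule above can be sketched in Python 3 as follows. This is one possible rewrite, not the article's code: it builds the pattern with `str.join`, uses a non-capturing group `(?:...)` for the optional lowercase word, and applies `re.escape` to each letter as a precaution. The sample sentences are invented for the illustration.

```python
import re

def expand_abbr(sentence, abbr):
    """Return the phrase in sentence that abbr abbreviates, or '' if none is found."""
    # One capitalized word per letter of the abbreviation
    words = [re.escape(ch) + r"[a-z]+" for ch in abbr]
    # Between two words: a space, then optionally one lowercase word plus a space
    pattern = r" (?:[a-z]+ )?".join(words)
    m = re.search(pattern, sentence)
    return m.group() if m else ""

print(expand_abbr("Welcome to Agriculture Bank of China", "ABC"))
print(expand_abbr("the Food and Drug Administration said", "FDA"))
print(expand_abbr("the Office of Legal Counsel", "OLC"))
```

Since the separator allows at most one lowercase word, a phrase with two in a row between capitalized words would still not be found.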
3. Removing commas from numbers
Example
In natural language processing, a number such as 123,000,000 causes trouble when the text is split on punctuation: a perfectly good number gets chopped apart at its commas. So clean up the numbers first by removing the commas.
Analysis
Numbers are usually written in groups of three digits followed by a comma, so the pattern is: ***,***,***
Regular expression
\d+,\d+?
Reference Code 3-1
import re

sen = "abc,123,456,789,mnp"
p = re.compile(r"\d+,\d+?")
# finditer walks over the matches found in the original string
for com in p.finditer(sen):
    mm = com.group()
    print "hi:", mm
    print "sen_before:", sen
    # replace the matched text with a copy that has its comma removed
    sen = sen.replace(mm, mm.replace(",", ""))
    print "sen_back:", sen, '\n'
Tip
Use the compiled-pattern method finditer(string[, pos[, endpos]]) or the module function re.finditer(pattern, string[, flags]).
It scans the string and returns an iterator that yields each match (a Match object) in order.
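Both forms of `finditer` are shown in the Python 3 sketch below. It uses the simpler pattern `\d+` so that each match and its position are easy to read; the variable names are illustrative.

```python
import re

sen = "abc,123,456,789,mnp"

# Module-level function: the pattern and the string are both arguments.
found = [(m.group(), m.span()) for m in re.finditer(r"\d+", sen)]
print(found)

# Compiled-pattern method: the optional pos argument starts the scan at index 8.
p = re.compile(r"\d+")
later = [m.group() for m in p.finditer(sen, 8)]
print(later)
```

Each item produced by the iterator is a Match object, so `group()`, `span()`, `start()` and `end()` are all available per match.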
Reference Code 3-2
import re

sen = "abc,123,456,789,mnp"
while 1:
    mm = re.search(r"\d,\d", sen)
    if mm:
        mm = mm.group()
        # remove the comma inside the matched "digit,digit" text
        sen = sen.replace(mm, mm.replace(",", ""))
        print sen
    else:
        break
Extension
The program above targets one specific case, groups of three digits. When digits are mixed with letters, only the commas between digits should be removed, i.e. "abc,123,4,789,mnp" should become "abc,1234789,mnp".
Approach
More specifically, search for the pattern "digit,digit" and replace each occurrence with the same text minus its comma.
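Reference Code 3-3
import re

sen = "abc,123,4,789,mnp"
while 1:
    mm = re.search(r"\d,\d", sen)
    if mm:
        mm = mm.group()
        # remove the comma inside the matched "digit,digit" text
        sen = sen.replace(mm, mm.replace(",", ""))
        print sen
    else:
        break
print sen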
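As an alternative to the search-and-replace loop, the same "comma between two digits" idea can be expressed in a single `re.sub` call with lookaround assertions. This Python 3 sketch is a different technique from the reference code, and the function name `strip_digit_commas` is made up for the example.

```python
import re

def strip_digit_commas(text):
    """Remove every comma that has a digit on both sides (illustrative helper)."""
    # (?<=\d) and (?=\d) test the neighboring characters without consuming them
    return re.sub(r"(?<=\d),(?=\d)", "", text)

print(strip_digit_commas("abc,123,4,789,mnp"))
print(strip_digit_commas("abc,123,456,789,mnp"))
print(strip_digit_commas("worth 123,000,000 in total"))
```

Because the lookarounds consume no characters, adjacent commas such as those in `3,4,7` are all handled in one pass, and commas next to letters are left alone.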
4. Year conversion in Chinese text processing (e.g. 一九四九年 --> 1949年)
Chinese text processing runs into encoding issues. For example, the following program tries to recognize years written in Chinese numerals (****年):
# -*- coding: cp936 -*-
import re

m0 = "在一九四九年新中国成立"          # At the founding of the new China in 1949
m1 = "比一九九零年低百分之五点二"      # 5.2 percent below 1990
m2 = "人一九九六年击败俄军,取得实质独立"  # defeated the Russian army in 1996 and achieved substantial independence

def fuc(m):
    # byte-string pattern: one or more Chinese numerals followed by 年 (year)
    a = re.findall("[零一二三四五六七八九]+年", m)
    if a:
        for key in a:
            print key
    else:
        print "NULL"

fuc(m0)
fuc(m1)
fuc(m2)
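Result
The second and third strings are matched incorrectly. The pattern is a byte string, so the character class matches individual bytes of the cp936 encoding rather than whole characters.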
Improvement: convert to Unicode before matching
# -*- coding: cp936 -*-
import re

m0 = "在一九四九年新中国成立"          # At the founding of the new China in 1949
m1 = "比一九九零年低百分之五点二"      # 5.2 percent below 1990
m2 = "人一九九六年击败俄军,取得实质独立"  # defeated the Russian army in 1996 and achieved substantial independence

def fuc(m):
    m = m.decode('cp936')  # bytes -> unicode, so the pattern works on characters
    # 零一二三四五六七八九 followed by 年; no | is needed inside a character class
    a = re.findall(u"[\u96f6\u4e00\u4e8c\u4e09\u56db\u4e94\u516d\u4e03\u516b\u4e5d]+\u5e74", m)
    if a:
        for key in a:
            print key
    else:
        print "NULL"

fuc(m0)
fuc(m1)
fuc(m2)
Once the years are recognized, the Chinese numerals can be converted to digits by substitution.
Reference code
# -*- coding: utf-8 -*-
import re

# Chinese numeral (as unicode) -> digit
numHash = {}
numHash['零'.decode('utf-8')] = '0'
numHash['一'.decode('utf-8')] = '1'
numHash['二'.decode('utf-8')] = '2'
numHash['三'.decode('utf-8')] = '3'
numHash['四'.decode('utf-8')] = '4'
numHash['五'.decode('utf-8')] = '5'
numHash['六'.decode('utf-8')] = '6'
numHash['七'.decode('utf-8')] = '7'
numHash['八'.decode('utf-8')] = '8'
numHash['九'.decode('utf-8')] = '9'

def change2num(words):
    # replace each Chinese numeral in words with its digit; keep other characters
    print "words:", words
    newword = ''
    for key in words:
        print key
        if key in numHash:
            newword += numHash[key]
        else:
            newword += key
    return newword

def Chi2Num(line):
    # line must be a unicode string
    a = re.findall(u"[\u96f6\u4e00\u4e8c\u4e09\u56db\u4e94\u516d\u4e03\u516b\u4e5d]+\u5e74", line)
    if a:
        print "------"
        print line
        for words in a:
            newwords = change2num(words)
            print words
            print newwords
            line = line.replace(words, newwords)
    return line
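In Python 3 every `str` is already Unicode, so the decode step disappears and the conversion becomes much shorter. The sketch below is an alternative formulation, not the article's code: it uses `str.maketrans` with `str.translate` for the digit mapping and `re.sub` with a function for the replacement. The names `CN_DIGITS`, `YEAR_RE` and `chi2num` are made up for the illustration.

```python
import re

# Translation table: each Chinese numeral maps to the digit in the same position
CN_DIGITS = str.maketrans("零一二三四五六七八九", "0123456789")
# One or more Chinese numerals followed by 年 (year)
YEAR_RE = re.compile("[零一二三四五六七八九]+年")

def chi2num(line):
    """Rewrite years such as 一九四九年 as 1949年, leaving other text unchanged."""
    return YEAR_RE.sub(lambda m: m.group().translate(CN_DIGITS), line)

print(chi2num("在一九四九年新中国成立"))
print(chi2num("比一九九零年低百分之五点二"))
```

Numerals that are not followed by 年, such as the 五 and 二 in the second sentence, are left as they are because the pattern only matches complete year expressions.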
5. Multiple cell phone numbers separated by |.
Examples:
(empty value)
12222222222
12222222222|12222222222
12222222222|12222222222|12222222444
Expression
s = "[\\d]{11}(\\|[\\d]{11})*|"
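One way to apply this expression is with `re.fullmatch`, so that the whole value has to fit the pattern. The Python 3 sketch below writes the same expression as a raw string with `\d` in place of `[\d]`; the names `PHONES_RE` and `is_valid_phone_list` are assumptions made for the example.

```python
import re

# Eleven digits, then zero or more "|" + eleven digits; or nothing at all
PHONES_RE = re.compile(r"\d{11}(\|\d{11})*|")

def is_valid_phone_list(value):
    """True if value is empty or a |-separated list of 11-digit numbers (illustrative helper)."""
    return PHONES_RE.fullmatch(value) is not None

for value in ("", "12222222222", "12222222222|12222222444", "12222222222|", "1222222222"):
    print(repr(value), is_valid_phone_list(value))
```

With `re.match` or `re.search` the empty alternative would let any input succeed, which is why the sketch checks the full string.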
Recommended reading
Python Regular Expressions Guide