Python regular expression example code

The re module gives the Python language full regular expression capabilities.

Syntax that will be used

Pattern	Meaning	Example
+	The preceding element appears one or more times	ab+: ab, abbbb, etc.
*	The preceding element appears zero or more times	ab*: a, ab, abb, etc.
?	The preceding element appears zero or one time	ab?: a, ab
^	Anchors the match at the start of the string	^a: abc, aaaaaa, etc.
$	Anchors the match at the end of the string	c$: abc, cccc, etc.
\d	Any digit	3, 4, 9, etc.
\D	Any non-digit character	A, a, -, etc.
[a-z]	Any lowercase letter from a to z	a, p, m, etc.
[0-9]	Any digit from 0 to 9	0, 2, 9, etc.
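The table can be checked directly against the re module. The following is a small sketch rather than part of the original article; the list name cases and the use of re.fullmatch (which requires the whole string to match) are choices made for this illustration.

```python
import re

# Each entry: (pattern, candidate string, should the whole string match?)
cases = [
    (r"ab+", "abbbb", True),
    (r"ab+", "a", False),     # + needs at least one b
    (r"ab*", "a", True),      # * allows zero b's
    (r"ab?", "abb", False),   # ? allows at most one b
    (r"\d", "7", True),
    (r"\D", "7", False),
    (r"[a-z]", "p", True),
    (r"[0-9]", "x", False),
]

results = [bool(re.fullmatch(p, s)) == expected for p, s, expected in cases]
print(all(results))

# ^ and $ are anchors, so they are easiest to see with re.search
starts_with_a = re.search(r"^a", "abc") is not None
not_at_start = re.search(r"^a", "bac") is None
ends_with_c = re.search(r"c$", "abc") is not None
print(starts_with_a, not_at_start, ends_with_c)
```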

Note:

1. Escape characters

>>> import re
>>> s = '(abc)def'
>>> m = re.search(r"(\(.*\)).*", s)
>>> print(m.group(1))
(abc)
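To make the role of the backslashes explicit, here is a short sketch (not from the original article). It assumes the same sample string; the variable names literal, needle and escaped are made up for the example. re.escape() is the standard-library helper that adds the backslashes for you.

```python
import re

s = "(abc)def"

# \( and \) match literal parentheses; the unescaped pair forms group 1.
literal = re.search(r"(\(.*\)).*", s)
print(literal.group(1))

# re.escape() backslash-escapes the metacharacters in a plain string,
# so the string can be searched for literally.
needle = "(abc)"
escaped = re.search(re.escape(needle), s)
print(escaped.group())
```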

The re.match function

re.match() tries to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.

Example 1:

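#!/usr/bin/python
# -*- coding: UTF-8 -*-

import re
# 'www.example.net' is a placeholder host name; the original string was lost
print(re.match('www', 'www.example.net').span())  # Matches at the start position
print(re.match('net', 'www.example.net'))         # Does not match at the start position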

Output results:

(0, 3)
None
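For comparison, the sketch below puts re.match next to re.search and re.fullmatch. It is an illustration added here, not part of the original example; the host name www.example.net is a placeholder.

```python
import re

text = "www.example.net"  # placeholder host name

at_start = re.match(r"net", text)       # only tries position 0
anywhere = re.search(r"net", text)      # scans the whole string
whole = re.fullmatch(r"www\..*", text)  # the entire string must match

print(at_start)
print(anywhere.span())
print(whole is not None)
```

If the position of the text is unknown, re.search is usually the function to reach for; re.match is the better fit when the input must start with the pattern.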

Example 2:

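#!/usr/bin/python
import re

line = "Cats are smarter than dogs"

# re.M and re.I are the multi-line and ignore-case flags
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() :", matchObj.group())
    print("matchObj.group(1) :", matchObj.group(1))
    print("matchObj.group(2) :", matchObj.group(2))
else:
    print("No match!!")

Output results:

matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter

This example uses the Python 3 print() function. In Python 2, print is a statement and is written without the parentheses. As in other languages, escape sequences such as \n can be used in the printed strings.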
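The pattern in Example 2 mixes a greedy (.*) with a lazy (.*?). The sketch below, added here for illustration, shows what changes when the second group is made greedy as well, and what the re.I flag does. The variable names lazy and greedy are made up for the example.

```python
import re

line = "Cats are smarter than dogs"

lazy = re.match(r"(.*) are (.*?) .*", line)    # (.*?) stops at the first space
greedy = re.match(r"(.*) are (.*) .*", line)   # (.*) runs up to the last space

print(lazy.group(2))
print(greedy.group(2))

# Without re.I the lowercase pattern does not match the capital C
case_sensitive = re.match(r"cats", line)
case_insensitive = re.match(r"cats", line, re.I)
print(case_sensitive is None, case_insensitive is not None)
```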

Python group()

In a regular expression, parentheses () define groups, and group() returns the string captured by a group.

Repeating the preceding substring a given number of times

>>> import re
>>> a = "kdlal123dk345"
>>> b = "kdlal123345"
>>> m = re.search("([0-9]+(dk){0,1})[0-9]+", a)
>>> m.group(1), m.group(2)
('123dk', 'dk')
>>> m = re.search("([0-9]+(dk){0,1})[0-9]+", b)
>>> m.group(1)
'12334'
>>> m.group(2)
>>>
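The empty result of the last call means that the optional group (dk) did not take part in the match, so group(2) is None. A small sketch of how to inspect this, added for illustration (groups() and its default argument are part of the standard match object):

```python
import re

b = "kdlal123345"
m = re.search(r"([0-9]+(dk){0,1})[0-9]+", b)

print(m.groups())            # all groups at once; unmatched groups are None
print(m.group(2) is None)
print(m.groups(default=""))  # substitute a default for unmatched groups
```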

Typical examples

1. Determining whether a string is all lowercase

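import re
s1 = 'adkkdk'
s2 = 'abc123efg'

# re.search needs the explicit ^ anchor to start at the beginning
an = re.search('^[a-z]+$', s1)
if an:
    print('s1:', an.group(), 'All lowercase')
else:
    print(s1, "Not all lowercase!")

# re.match always starts at the beginning of the string
an = re.match('[a-z]+$', s2)
if an:
    print('s2:', an.group(), 'All lowercase')
else:
    print(s2, "Not all lowercase!")

Result

s1: adkkdk All lowercase
abc123efg Not all lowercase!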

Explanation

1. Regular expressions are not built into the core Python language; the re module must be imported to use them.

2. Matching takes the form re.search(regular expression, string to match) or re.match(regular expression, string to match). The difference is that re.match only matches at the beginning of the string, as if the pattern started with the anchor ^. Therefore, when no flags are passed,

re.search('^[a-z]+$', s) is equivalent to re.match('[a-z]+$', s)

3. If the match fails, an = re.search('^[a-z]+$', s1) sets an to None.
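The equivalence in point 2 can be checked on a few inputs. This is a sketch added for illustration; the list samples and the comparison with re.fullmatch are not from the original article. One detail worth knowing: $ also matches just before a trailing newline, so re.fullmatch is the stricter test.

```python
import re

samples = ["adkkdk", "abc123efg", "", "ABC"]

agree = all(
    (re.search(r"^[a-z]+$", s) is None) == (re.match(r"[a-z]+$", s) is None)
    for s in samples
)
print(agree)

# $ tolerates one trailing newline, fullmatch does not
with_newline = "abc\n"
dollar_ok = re.match(r"[a-z]+$", with_newline) is not None
full_ok = re.fullmatch(r"[a-z]+", with_newline) is not None
print(dollar_ok, full_ok)
```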

group() returns the text matched by each group

For example:

import re
a = "123abc456"
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(0))  # 123abc456, the whole match
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(1))  # 123
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(2))  # abc
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(3))  # 456

Output:

123abc456
123
abc
456

1) The three pairs of parentheses in the regular expression divide the match into three groups.

group() is the same as group(0): the whole text matched by the regular expression.

group(1) returns the text matched by the first pair of parentheses, group(2) the second, and group(3) the third.

2) If the match fails, re.search() returns None.

3) If the regular expression contains no parentheses, calling group(1) raises an error.
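Points 2) and 3) are the two ways a call to group() can go wrong. The sketch below, added for illustration, shows both; the variable names are made up for the example.

```python
import re

m = re.search(r"[0-9]+", "123abc456")  # no parentheses, so only group 0 exists
print(m.group(0))

try:
    m.group(1)
    group1_failed = False
except IndexError:
    # Asking for a group the pattern does not define raises IndexError
    group1_failed = True
print(group1_failed)

# A failed search returns None, so test the result before calling group()
no_match = re.search(r"[0-9]+", "abc")
text = no_match.group() if no_match else ""
print(repr(text))
```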

2. Acronym expansion

Concrete examples

FEMA Federal Emergency Management Agency
IRA Irish Republican Army
DUP Democratic Unionist Party

FDA Food and Drug Administration
OLC Office of Legal Counsel

Analysis

Take the abbreviation FEMA.
It decomposes into F*** E*** M*** A***.
Pattern: an uppercase letter + one or more lowercase letters + a space

Reference code

import re

def expand_abbr(sen, abbr):
    lenabbr = len(abbr)
    ma = ''
    # Build "X[a-z]+ " for every letter X of the abbreviation
    for i in range(0, lenabbr):
        ma += abbr[i] + "[a-z]+" + ' '
    print('ma:', ma)
    ma = ma.strip(' ')
    p = re.search(ma, sen)
    if p:
        return p.group()
    else:
        return ''

print(expand_abbr("Welcome to Agriculture Bank China", 'ABC'))

Result

ma: A[a-z]+ B[a-z]+ C[a-z]+ 
Agriculture Bank China

Problem

The code above handles the first three examples correctly but fails on the last two, because lowercase words (and, of) appear between the words that begin with capital letters.

Pattern

An uppercase letter + one or more lowercase letters + a space + [lowercase word + space] (zero or one time)

Reference code

import re

def expand_abbr(sen, abbr):
    lenabbr = len(abbr)
    ma = ''
    # After every word except the last, allow one optional lowercase word
    for i in range(0, lenabbr - 1):
        ma += abbr[i] + "[a-z]+" + ' ' + '([a-z]+ )?'
    ma += abbr[lenabbr - 1] + "[a-z]+"
    print('ma:', ma)
    ma = ma.strip(' ')
    p = re.search(ma, sen)
    if p:
        return p.group()
    else:
        return ''

print(expand_abbr("Welcome to Agriculture Bank of China", 'ABC'))

Tip

The lowercase word in the middle and the space after it are treated as one unit and wrapped in parentheses. The word and the space must either both appear or both be absent, so ? is applied to the whole group.
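The same idea can be written a little more compactly. The version below is a sketch, not the article's reference code: it assumes the same rule (at most one lowercase word between capitalized words), uses a non-capturing group (?:...) because the group's content is never read, and passes each letter through re.escape in case the abbreviation contains a metacharacter.

```python
import re

def expand_abbr(sentence, abbr):
    """Return the phrase whose capitalized words spell abbr, or ''."""
    words = [re.escape(ch) + "[a-z]+" for ch in abbr]
    # Between two capitalized words: a space, then optionally one
    # lowercase word followed by a space.
    pattern = " (?:[a-z]+ )?".join(words)
    m = re.search(pattern, sentence)
    return m.group() if m else ""

print(expand_abbr("Welcome to Agriculture Bank of China", "ABC"))
print(expand_abbr("Food and Drug Administration", "FDA"))
print(expand_abbr("Office of Legal Counsel", "OLC"))
```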

3. Removing commas from numbers

Concrete example

When processing natural language, a number such as 123,000,000 is a problem if the text is split on punctuation, because the commas break one number into pieces. The number should therefore be cleaned up first by removing the commas.

Analysis

Numbers are usually written in groups of three digits separated by commas, so the pattern is: ***,***,***

Regular expression

\d+,\d+?

Reference code 3-1

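import re

sen = "abc,123,456,789,mnp"
p = re.compile(r"\d+,\d+?")

# finditer scans the original string; sen is rebuilt on each pass
for com in p.finditer(sen):
    mm = com.group()
    print("hi:", mm)
    print("sen_before:", sen)
    sen = sen.replace(mm, mm.replace(",", ""))
    print("sen_back:", sen, '\n')

Result

hi: 123,4
sen_before: abc,123,456,789,mnp
sen_back: abc,123456,789,mnp

hi: 56,7
sen_before: abc,123456,789,mnp
sen_back: abc,123456789,mnp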

Technique

Use finditer: pattern.finditer(string[, pos[, endpos]]) on a compiled pattern, or re.finditer(pattern, string[, flags]).

It searches the string and returns an iterator that yields each match result (Match object) in order.
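Both call forms are shown in the short sketch below, which is an illustration added here. Each Match object carries the matched text and its position; the pos argument of the compiled-pattern form sets where scanning starts.

```python
import re

sen = "abc,123,456,789,mnp"

# Module-level form: every run of digits with its (start, end) span
found = [(m.group(), m.span()) for m in re.finditer(r"\d+", sen)]
print(found)

# Compiled form with pos: start scanning at index 8
p = re.compile(r"\d+")
from_pos = [m.group() for m in p.finditer(sen, 8)]
print(from_pos)
```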

Reference Code 3-2

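import re

sen = "abc,123,456,789,mnp"
while True:
    # Find the next comma that sits between two digits
    mm = re.search(r"\d,\d", sen)
    if mm:
        mm = mm.group()
        sen = sen.replace(mm, mm.replace(",", ""))
        print(sen)
    else:
        break

Result

abc,123456,789,mnp
abc,123456789,mnp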

Extension

The program above targets one specific case, digits in groups of three. More generally, when digits are mixed with letters, only the commas between digits should be removed, i.e. "abc,123,4,789,mnp" should become "abc,1234789,mnp".

Approach

More specifically, search for the pattern "digit, comma, digit" and replace each occurrence with the same text minus the comma.

Reference Code 3-3

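import re

sen = "abc,123,4,789,mnp"
while True:
    # Find the next comma that sits between two digits
    mm = re.search(r"\d,\d", sen)
    if mm:
        mm = mm.group()
        sen = sen.replace(mm, mm.replace(",", ""))
        print(sen)
    else:
        break
print(sen)

Result

abc,1234,789,mnp
abc,1234789,mnp
abc,1234789,mnp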
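The search-and-replace loop can also be collapsed into a single re.sub call. This is an alternative sketch, not the article's method: it relies on lookbehind (?<=\d) and lookahead (?=\d), which check for a digit on each side of the comma without consuming it. The function name strip_digit_commas is made up for the example.

```python
import re

def strip_digit_commas(text):
    # Remove only the commas that have a digit directly before and after
    return re.sub(r"(?<=\d),(?=\d)", "", text)

print(strip_digit_commas("abc,123,456,789,mnp"))
print(strip_digit_commas("abc,123,4,789,mnp"))
```

Because the lookarounds consume nothing, adjacent commas such as those in 1,2,3 are all removed in one pass.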

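4. Year conversion in Chinese text processing (e.g. 一九四九年 --> 1949年)

Chinese text processing involves encoding issues. The programs in this section are written for Python 2, where a plain string literal is a byte string. For example, the following program tries to recognize years written in Chinese numerals (****年):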

# -*- coding: cp936 -*-
# Python 2: these literals are byte strings in the cp936 (GBK) encoding
import re
m0 = "在一九四九年新中国成立"             # "At the founding of the new China in 1949"
m1 = "比一九九零年低百分之五点二"         # "5.2 percent below 1990"
m2 = '人一九九六年击败俄军,取得实质独立'  # "Defeated the Russian army in 1996 and achieved substantial independence"

def fuc(m):
    a = re.findall("[零|一|二|三|四|五|六|七|八|九]+年", m)
    if a:
        for key in a:
            print key
    else:
        print "NULL"

fuc(m0)
fuc(m1)
fuc(m2)

Running result

The second and third sentences are matched incorrectly, most likely because the character class is applied to the byte string byte by byte rather than character by character.

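Improvement: match on Unicode strings

# -*- coding: cp936 -*-
import re
m0 = "在一九四九年新中国成立"             # "At the founding of the new China in 1949"
m1 = "比一九九零年低百分之五点二"         # "5.2 percent below 1990"
m2 = '人一九九六年击败俄军,取得实质独立'  # "Defeated the Russian army in 1996 and achieved substantial independence"

def fuc(m):
    m = m.decode('cp936')  # byte string -> unicode (Python 2)
    # Inside [], | is a literal character, not alternation; it is harmless here
    a = re.findall(u"[\u96f6|\u4e00|\u4e8c|\u4e09|\u56db|\u4e94|\u516d|\u4e03|\u516b|\u4e5d]+\u5e74", m)

    if a:
        for key in a:
            print key
    else:
        print "NULL"

fuc(m0)
fuc(m1)
fuc(m2)

Result

With Unicode strings the pattern is applied character by character, so the year in each sentence (一九四九年, 一九九零年, 一九九六年) is recognized.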

Once a year has been recognized, its Chinese numerals can be replaced with digits by substitution.

Reference code

# -*- coding: utf-8 -*-
# Python 2: the keys are decoded to unicode so they compare equal to
# the characters of a unicode line
import re

numHash = {}
numHash['零'.decode('utf-8')] = '0'
numHash['一'.decode('utf-8')] = '1'
numHash['二'.decode('utf-8')] = '2'
numHash['三'.decode('utf-8')] = '3'
numHash['四'.decode('utf-8')] = '4'
numHash['五'.decode('utf-8')] = '5'
numHash['六'.decode('utf-8')] = '6'
numHash['七'.decode('utf-8')] = '7'
numHash['八'.decode('utf-8')] = '8'
numHash['九'.decode('utf-8')] = '9'

def change2num(words):
    # Replace every Chinese numeral in words; keep other characters as they are
    print "words:", words
    newword = ''
    for key in words:
        print key
        if key in numHash:
            newword += numHash[key]
        else:
            newword += key
    return newword

def Chi2Num(line):
    # line must be a unicode string
    a = re.findall(u"[\u96f6|\u4e00|\u4e8c|\u4e09|\u56db|\u4e94|\u516d|\u4e03|\u516b|\u4e5d]+\u5e74", line)
    if a:
        print "------"
        print line
        for words in a:
            newwords = change2num(words)
            print words
            print newwords
            line = line.replace(words, newwords)
    return line
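In Python 3, where str is already Unicode, the same conversion can be sketched more compactly with re.sub and str.translate. This is an illustrative alternative rather than the article's code; the names DIGITS, YEAR and chi2num are made up for the example, and only the ten numerals 零 to 九 are handled.

```python
import re

# Translation table: Chinese numeral -> ASCII digit
DIGITS = str.maketrans("零一二三四五六七八九", "0123456789")

# One or more Chinese numerals followed by 年 (year)
YEAR = re.compile("[零一二三四五六七八九]+年")

def chi2num(line):
    # Only the matched year is translated; the rest of the line is untouched
    return YEAR.sub(lambda m: m.group().translate(DIGITS), line)

print(chi2num("一九四九年"))
print(chi2num("一九九零年 and 一九九六年"))
```

Passing a function to re.sub keeps the replacement limited to the matched years, so numerals elsewhere in the sentence are left alone.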

5. Multiple cell phone numbers separated by |.

Examples:

(empty value)
12222222222
12222222222|12222222222
12222222222|12222222222|12222222444

Expression

s = "[\\d]{11}(\\|[\\d]{11})*|"
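A sketch of how the expression might be used for validation, added for illustration. It assumes the whole field must match, hence re.fullmatch; the helper name is_valid is made up for the example. The trailing | in the expression is an empty alternative, which is what lets the empty value through.

```python
import re

s = "[\\d]{11}(\\|[\\d]{11})*|"

def is_valid(value):
    # The entire value must be 11-digit numbers joined by |, or be empty
    return re.fullmatch(s, value) is not None

for value in ["", "12222222222", "12222222222|12222222222", "1222222222", "12222222222|"]:
    print(repr(value), is_valid(value))
```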
