You may think you don't need to learn them, but mastering regular expressions can greatly improve your productivity: they are a concise and flexible tool for describing patterns in strings. Learning them is not exactly fun, but regular expressions (regex) are not as out of reach as you might imagine.
TelecomNumber = [133, 153, 173, 177, 180, 181, 189, 191, 193, 199]

def isTelecomNumber(text):
    if len(text) != 11:
        return False  # wrong number of digits
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False  # not a digit
    if int(text[:3]) not in TelecomNumber:
        return False  # wrong number prefix
    for i in range(3, 11):
        if not text[i].isdecimal():
            return False  # not a digit
    return True  # it is a China Telecom number

message = 'Please contact: 15639391212 or 17300000000 or +8617756478211'
for i in range(len(message)):
    chunk = message[i:i+11]  # slide an 11-character window over the text
    if isTelecomNumber(chunk):
        print('Found a Telecom number: ' + chunk)
# Found a Telecom number: 17300000000
# Found a Telecom number: 17756478211
Let's first briefly introduce some commonly used regular expression syntax:
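Expression    Description
\d            Matches any digit from 0 to 9
{n}           Matches the preceding item exactly n times
|             Alternation ("or")
[]            Matches any single character inside the brackets; for example, [012345] or [0-5] matches a digit from 0 to 5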
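import re

# Inside [], characters are listed without separators; a comma would match a literal comma.
pattern = re.compile(r'1[357]3\d{8}|177\d{8}|18[019]\d{8}|19[139]\d{8}')
TelecomNumbers = pattern.findall('Please contact: 15639391212 or 17300000000 or +8617756478211')
for p in TelecomNumbers:
    print('Found a Telecom number: ' + p)
# Found a Telecom number: 17300000000
# Found a Telecom number: 17756478211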
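Next, let's look at how regular expressions are used in practice. First, pass a string representing the regular expression to re.compile(). Here the pattern is ring, and (?# I am a comment) is an inline comment.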
import re

pattern = re.compile("ring(?# I am a comment)")
pattern
# re.compile('ring(?# I am a comment)')

text = "The ring on the spring string rings during springtime."
print(re.match("ring", text))  # re.match only matches at the start of the string
# None
print(re.match("The", text))
# <re.Match object; span=(0, 3), match='The'>

re.sub(pattern, "RING", text)
# 'The RING on the spRING stRING RINGs duRING spRINGtime.'
re.sub(pattern, "RING", text, count=2)  # replace only the first two matches
# 'The RING on the spRING string rings during springtime.'
You can control how many times a pattern matches with the repetition operators:
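In Python's re module the repetition operators are `*` (zero or more), `+` (one or more), `?` (zero or one), `{n}` (exactly n) and `{n,m}` (between n and m). The following is a minimal sketch of how they behave; the sample strings and the `results` dictionary are made up for illustration.

```python
import re

# Hypothetical sample strings: 'c', then zero to three 'a', then 't'.
samples = ["ct", "cat", "caat", "caaat"]

patterns = {
    "ca*t": "zero or more 'a'",
    "ca+t": "one or more 'a'",
    "ca?t": "zero or one 'a'",
    "ca{2}t": "exactly two 'a'",
    "ca{2,3}t": "two or three 'a'",
}

results = {}
for pattern, meaning in patterns.items():
    # fullmatch requires the whole sample to match the pattern
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]
    print(f"{pattern:<10} {meaning:<18} {results[pattern]}")
```

Each quantifier applies only to the item immediately before it, here the single character `a`.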
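The uppercase shorthand classes match the complement of their lowercase counterparts:
Expression    Description
\D            Any character other than a digit from 0 to 9
\S            Any character other than a space, tab, or newline
\W            Any character other than a letter, digit, or underscore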
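As a quick illustration of these classes, the sketch below splits one string in four different ways. The sample text and variable names are made up for the example.

```python
import re

text = "Tel: 555-1234"  # hypothetical sample text

digits = re.findall(r"\d+", text)      # runs of digits
non_digits = re.findall(r"\D+", text)  # runs of anything except digits
non_space = re.findall(r"\S+", text)   # runs of non-whitespace characters
non_word = re.findall(r"\W+", text)    # runs of non-word characters

print(digits)      # ['555', '1234']
print(non_digits)  # ['Tel: ', '-']
print(non_space)   # ['Tel:', '555-1234']
print(non_word)    # [': ', '-']
```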
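The sequences \A and \Z match the beginning and the end of a string, respectively. In Python's re module, \Z matches only at the very end of the string, whereas in Perl-style engines \Z also matches before a final \n and \z marks the absolute end.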
\b matches the empty string between \w and \W characters, or \w characters and the beginning or end of the string. Informally, it represents the boundary between words.
\B matches the empty string anywhere else.
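\G anchors a match at the position where the previous match ended. It is available in Perl-style engines but is not supported by Python's re module.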
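The anchors that Python's re module supports can be compared side by side. This is a small sketch on a made-up string that ends with a newline, which is where \Z and $ differ.

```python
import re

text = "cat concatenate cat\n"  # hypothetical sample ending in a newline

at_start = re.findall(r"\Acat", text)      # only the 'cat' at the very start
at_end = re.search(r"cat\Z", text)         # \Z is the absolute end, after '\n'
before_newline = re.search(r"cat$", text)  # $ also matches before a final '\n'
whole_words = re.findall(r"\bcat\b", text) # 'cat' as a standalone word
inside_word = re.findall(r"\Bcat\B", text) # 'cat' buried inside 'concatenate'

print(at_start)               # ['cat']
print(at_end)                 # None
print(before_newline.span())  # (16, 19)
print(whole_words)            # ['cat', 'cat']
print(inside_word)            # ['cat']
```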
import re

pattern = r"\b(cat)\b"

match = re.search(pattern, "The cat sat!")
if match:
    print("Match 1")

# '>' and '<' are non-word characters, so 'cat' has a boundary on both sides
match = re.search(pattern, "We s>cat<tered?")
if match:
    print("Match 2")

match = re.search(pattern, "We scattered.")
if match:
    print("Match 3")

# Match 1
# Match 2
re.findall(r'\bc[a-z]*', 'I scream, you scream, we all scream for ice-cream!')
# ['cream']
Grouping - Groups
import re

pattern = r"a(bc)(de)(f(g)h)i"
match = re.match(pattern, "abcdefghi")  # returns None if there is no match
print(match)
# <re.Match object; span=(0, 9), match='abcdefghi'>
if match:
    print(match.group())   # abcdefghi (the whole match)
    print(match.group(0))  # abcdefghi (same as group())
    print(match.group(1))  # bc
    print(match.group(3))  # fgh
    print(match.groups())  # ('bc', 'de', 'fgh', 'g')
    print(match.span())    # (0, 9)
Named groups have the format (?P<name>...), where name is the name of the group, and ... is the content.
Non-capturing groups have the format (?:…). They are not accessible by the group method, so they can be added to an existing regular expression without breaking the numbering.
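Named groups can also be read back as a dictionary and referenced by name in a replacement string. The sketch below assumes a made-up `user@domain` string; the group names `user` and `domain` are chosen only for illustration.

```python
import re

# (?:@) is non-capturing, so 'domain' is still group number 2.
pattern = re.compile(r"(?P<user>\w+)(?:@)(?P<domain>\w+\.\w+)")

match = pattern.match("alice@example.com")
named = match.groupdict()  # mapping of group name -> matched text
numbered = match.groups()  # capturing groups only, in order

# \g<name> refers to a named group inside the replacement string.
swapped = pattern.sub(r"\g<domain>:\g<user>", "alice@example.com")

print(named)     # {'user': 'alice', 'domain': 'example.com'}
print(numbered)  # ('alice', 'example.com')
print(swapped)   # example.com:alice
```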
import re

pattern = r"(?P<first>abc)(?:def)(ghi)"

match = re.match(pattern, "abcdefghi")
if match:
    print(match.group("first"))
    print(match.groups())

# abc
# ('abc', 'ghi')
Substitution - Sub
Note that “(.+) \1” is not the same as “(.+) (.+)”, because \1 refers to the first group’s subexpression, which is the matched expression itself, and not the regex pattern.
import re

pattern = r"(.+) \1"

match = re.match(pattern, "word word")
if match:
    print("Match 1")

match = re.match(pattern, "?! ?!!")
if match:
    print("Match 2")

match = re.match(pattern, "abc cde")
if match:
    print("Match 3")

# Match 1
# Match 2
re.sub(r'(\b[a-z]+) \1', r'\1', 'two two two cat in the the hat hat!')
# 'two two cat in the hat!'
text = 'She sells sea shells by the seashore.The shells she sells are surely sea shells.'

re.sub(r'sea \w+', 'ocean', text)
# 'She sells ocean by the seashore.The shells she sells are surely ocean.'
re.sub(r'sea (\w)\w*', r'\1ea', text)
# 'She sells sea by the seashore.The shells she sells are surely sea.'
Greedy and Non-Greedy Matching - Greedy, Lazy, Possessive
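Python's regular expressions are "greedy" by default: when a match is ambiguous, they match the longest string possible.
The "non-greedy" version of the curly-brace quantifier matches the shortest string possible. It is written by placing a question mark after the closing brace.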
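The difference is easy to see on a run of identical characters. In this minimal sketch the sample strings are made up; the same trailing question mark also turns `+` and `*` into their non-greedy forms.

```python
import re

text = "aaaaaa"  # hypothetical sample

greedy = re.match(r"a{2,4}", text).group()  # takes as many as allowed
lazy = re.match(r"a{2,4}?", text).group()   # takes as few as allowed

html = "<b>bold</b>"
greedy_tags = re.findall(r"<.+>", html)  # one match spanning both tags
lazy_tags = re.findall(r"<.+?>", html)   # each tag matched separately

print(greedy)       # aaaa
print(lazy)         # aa
print(greedy_tags)  # ['<b>bold</b>']
print(lazy_tags)    # ['<b>', '</b>']
```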
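Operator      Description
(?=exp)       Matches the position before exp; for example, \b\w+(?=ing) matches "danc" in "I'm dancing"
(?<=exp)      Matches the position after exp; for example, (?<=\bdanc)\w+\b matches the first "ing" in "I love dancing and reading"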
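Both lookaround examples can be run directly. The sketch below reuses the two patterns from the table and adds one made-up price string to show that the text inside the lookaround is not part of the match.

```python
import re

# Lookahead: word characters that are followed by 'ing' (the 'ing' is not consumed).
ahead = re.findall(r"\b\w+(?=ing)", "I'm dancing")

# Lookbehind: word characters that are preceded by 'danc' at a word start.
behind = re.findall(r"(?<=\bdanc)\w+\b", "I love dancing and reading")

# Hypothetical example: digits preceded by a dollar sign, without the sign itself.
prices = re.findall(r"(?<=\$)\d+", "apples $3, pears $12, 7 plums")

print(ahead)   # ['danc']
print(behind)  # ['ing']
print(prices)  # ['3', '12']
```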
import re

pattern1 = r"(t.*st)"   # greedy
pattern2 = r"(t.*?st)"  # non-greedy
match = re.search(pattern1, "A twister of twists once twisted a twist.")
print(match.group())
# twister of twists once twisted a twist
match = re.search(pattern2, "A twister of twists once twisted a twist.")
print(match.group())
# twist

pattern3 = r"(twi.*)\W"   # greedy
pattern4 = r"(twi.*?)\W"  # non-greedy
match = re.findall(pattern3, "A twister of twists once twisted a twist.")
print(match)  # the final '.' is consumed by \W, so it is not part of the group
# ['twister of twists once twisted a twist']
match = re.findall(pattern4, "A twister of twists once twisted a twist.")
print(match)
# ['twister', 'twists', 'twisted', 'twist']