Regular Expressions

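We are used to pressing Ctrl-F or clicking a Find button, typing some text, and then searching or replacing. That simple and pleasant approach falls short once the thing we are looking for gets complicated. For example, finding and extracting every phone number and email address in a file is not realistic with a plain text search.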

You may think you do not need regular expressions, but they are a concise and flexible tool for describing patterns in strings, and mastering them can noticeably improve your productivity. Learning them is not always fun, but regular expressions (regex) are not as far out of reach as they might seem.

Operators

Suppose one day we are asked to find every mainland China mobile phone number in a file. The number prefixes of the three carriers are:

China Telecom prefixes
133, 153, 173, 177, 180, 181, 189, 191, 193, 199
China Unicom prefixes
130, 131, 132, 155, 156, 166, 175, 176, 185, 186
China Mobile prefixes
134 (0-8), 135, 136, 137, 138, 139, 147, 150, 151, 152, 157, 158, 159, 172, 178, 182, 183, 184, 187, 188, 198

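Take China Telecom, which has a short and simple prefix list, as an example. Without regular expressions, you might write something like this: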

TelecomNumber = [133, 153, 173, 177, 180, 181, 189, 191, 193, 199]
def isTelecomNumber(text):
    if len(text) != 11:
        return False  # wrong number of digits
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False  # not a digit
    if int(text[:3]) not in TelecomNumber:
        return False  # not a China Telecom prefix
    for i in range(3, 11):
        if not text[i].isdecimal():
            return False  # not a digit
    return True  # it is a China Telecom number

message = 'Contact: 15639391212 or 17300000000 or +8617756478211'
for i in range(len(message)):
    chunk = message[i:i+11]
    if isTelecomNumber(chunk):
        print('Found a China Telecom number: ' + chunk)
# Found a China Telecom number: 17300000000
# Found a China Telecom number: 17756478211

Let us first briefly introduce some commonly used regular expression syntax:

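Expression	Meaning
\d	Matches any digit from 0 to 9
`{n}`	Matches the preceding item exactly n times
`|`	Alternation (or)
[]	Matches any single character inside the brackets; for example, [012345] or [0-5] matches a digit from 0 to 5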

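import re
# Inside [], characters are listed without commas; a comma would itself be matched literally
pattern = re.compile(r'1[357]3\d{8}|177\d{8}|18[019]\d{8}|19[139]\d{8}')
TelecomNumbers = pattern.findall('Contact: 15639391212 or 17300000000 or +8617756478211')
for p in TelecomNumbers:
    print('Found a China Telecom number: ' + p)
# Found a China Telecom number: 17300000000
# Found a China Telecom number: 17756478211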
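
The same idea extends to the other two carriers. The following is a minimal sketch, not a definitive implementation: it builds one alternation per carrier from the prefix lists given earlier. The names PREFIXES, build_pattern, and find_numbers are made up for illustration, the sample message is invented, and the (0-8) qualifier on the 134 prefix is ignored for simplicity.

```python
import re

# Prefix lists copied from the carrier lists above.
# Simplifying assumption: the "(0-8)" qualifier on 134 is ignored.
PREFIXES = {
    "China Telecom": [133, 153, 173, 177, 180, 181, 189, 191, 193, 199],
    "China Unicom": [130, 131, 132, 155, 156, 166, 175, 176, 185, 186],
    "China Mobile": [134, 135, 136, 137, 138, 139, 147, 150, 151, 152, 157,
                     158, 159, 172, 178, 182, 183, 184, 187, 188, 198],
}


def build_pattern(prefixes):
    """Return a compiled pattern: one of the prefixes followed by 8 digits."""
    alternation = "|".join(str(p) for p in prefixes)
    return re.compile(rf"(?:{alternation})\d{{8}}")


# Hypothetical helper table: one compiled pattern per carrier.
CARRIER_PATTERNS = {name: build_pattern(p) for name, p in PREFIXES.items()}


def find_numbers(text):
    """Map each carrier name to the list of its numbers found in text."""
    return {name: pat.findall(text) for name, pat in CARRIER_PATTERNS.items()}


message = "Contact: 15639391212 or 17300000000 or +8617756478211"
for carrier, numbers in find_numbers(message).items():
    print(carrier, numbers)
```

Generating the alternation from a list keeps the prefixes readable and easy to update, at the cost of a slightly longer pattern than a hand-tuned one.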

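Next, let us look at how regular expressions are used in practice. First pass re.compile() a string containing the regular expression, here ring; (?# I am a comment) is an inline comment that the regex engine ignores.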

import re
pattern = re.compile("ring(?# I am a comment)")
pattern
# re.compile('ring(?# I am a comment)')
text = "The ring on the spring string rings during springtime."
print(re.match("ring", text))
# None
print(re.match("The", text))
# <re.Match object; span=(0, 3), match='The'>

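# findall() returns a list of strings with every match in the searched text;
# with several groups () it returns a list of tuples, and with no match it returns []
re.findall(pattern, text)
# ['ring', 'ring', 'ring', 'ring', 'ring', 'ring']

# search() returns only the first match in the text
re.search(pattern, text)  # returns None if nothing matches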
# <re.Match object; span=(4, 8), match='ring'>
re.search(pattern, text).span()
# (4, 8)
re.search(pattern, text).group()
# 'ring'
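
When the position of every match is needed, not only the first, re.finditer() is an option. The sketch below reuses the same sentence; the variable names are illustrative. It also shows fullmatch(), which succeeds only when the entire string fits the pattern.

```python
import re

text = "The ring on the spring string rings during springtime."
pattern = re.compile("ring")

# finditer() yields one Match object per occurrence, so each span is available.
spans = [m.span() for m in pattern.finditer(text)]
print(spans)

# fullmatch() succeeds only if the whole string fits the pattern.
print(pattern.fullmatch("ring") is not None)   # True
print(pattern.fullmatch("rings") is not None)  # False
```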

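Regular expressions can not only find text patterns but also replace them with new text.

The sub() method returns the string after the replacements have been made.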

re.sub(pattern, "RING", text)
# 'The RING on the spRING stRING RINGs duRING spRINGtime.'
re.sub(pattern, "RING", text, count=2)
# 'The RING on the spRING string rings during springtime.'
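
Two related variations may be useful here, shown as a small sketch with made-up sample strings: re.subn() also reports how many replacements were made, and the replacement argument can be a function that receives each Match object. The helper name double is hypothetical.

```python
import re

text = "The ring on the spring string rings during springtime."

# subn() returns a (new_string, number_of_replacements) tuple.
new_text, count = re.subn("ring", "RING", text)
print(count)  # 6


def double(match):
    """Replacement function: return the matched number multiplied by two."""
    return str(int(match.group()) * 2)


doubled = re.sub(r"\d+", double, "3 hens, 2 doves, 1 partridge")
print(doubled)  # 6 hens, 4 doves, 2 partridge
```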

You can control how many times a pattern matches with the repetition operators:

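Operator	Meaning
.	Wildcard: matches any character except a newline
`+`	Matches one or more times
`*`	Matches zero or more times
`{n,}`	Matches n or more times
`{n,m}`	Matches between n and m times
`?`	Matches zero or one time (optional match)
[^]	Matches any character not listed inside the brackets
^ $	Match the start and the end of the string (mnemonic: "Carrots cost dollars", the caret comes first and the dollar sign last)
\s	A space, tab, or newline character; in ASCII mode equivalent to [ \t\n\r\f\v]
\w	Any letter, digit, or underscore; in ASCII mode equivalent to [a-zA-Z0-9_]. In Unicode mode it also matches Chinese characters and letters from other scripts, including accented letters.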

pattern = re.compile(r'\d+\s\w+')
re.findall(pattern, '12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
# ['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans', '6 geese', '5 rings', '4 birds', '3 hens', '2 doves', '1 partridge']
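
The remaining operators from the table can be tried the same way. The snippet below is a small sketch with invented sample strings and illustrative variable names, covering ?, {n,m}, [^], and the ^ $ anchors.

```python
import re

# ? makes the preceding character optional.
optional = re.findall(r"colou?r", "color and colour")
print(optional)  # ['color', 'colour']

# {2,3} allows two or three digits; \b keeps longer numbers out.
two_or_three = re.findall(r"\b\d{2,3}\b", "7 42 365 2024")
print(two_or_three)  # ['42', '365']

# [^...] matches anything except the listed characters (here vowels and whitespace).
not_vowels = re.findall(r"[^aeiou\s]+", "regex rules")
print(not_vowels)  # ['r', 'g', 'x', 'r', 'l', 's']

# ^ and $ force the pattern to cover the string from start to end.
all_digits = [bool(re.search(r"^\d+$", s)) for s in ("12345", "123a")]
print(all_digits)  # [True, False]
```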

Operator	Meaning
\D	Any character that is not a digit from 0 to 9
\S	Any character that is not a space, tab, or newline
\W	Any character that is not a letter, digit, or underscore

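In Python's re module, the sequences \A and \Z match only at the very beginning and the very end of the string, respectively; unlike ^ and $, they are not affected by re.MULTILINE. In Perl-style engines, \Z also matches just before a final \n, and \z matches only at the absolute end of the string.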

\b matches the empty string between \w and \W characters, or \w characters and the beginning or end of the string. Informally, it represents the boundary between words.
\B matches the empty string anywhere else.

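\G, found in engines such as Perl and PCRE, anchors a match at the position where the previous match ended. Python's re module does not support it.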

import re

pattern = r"\b(cat)\b"

match = re.search(pattern, "The cat sat!")
if match:
   print ("Match 1")
match = re.search(pattern, "We s>cat<tered?")
if match:
   print ("Match 2")
match = re.search(pattern, "We scattered.")
if match:
   print ("Match 3")

# Match 1
# Match 2

re.findall(r'\bc[a-z]*', 'I scream, you scream, we all scream for ice-cream!') # ['cream']
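
The difference between ^ $ and \A \Z described earlier shows up once re.MULTILINE is set. The following sketch uses an invented two-line string and illustrative variable names; it also includes a quick check of \D.

```python
import re

text = "first line\nsecond line"

# With MULTILINE, ^ and $ match at every line; \A and \Z still match only
# at the start and end of the whole string.
starts_m = re.findall(r"^\w+", text, re.MULTILINE)
starts_a = re.findall(r"\A\w+", text, re.MULTILINE)
ends_m = re.findall(r"\w+$", text, re.MULTILINE)
ends_z = re.findall(r"\w+\Z", text, re.MULTILINE)
print(starts_m, starts_a)  # ['first', 'second'] ['first']
print(ends_m, ends_z)      # ['line', 'line'] ['line']

# \D is the complement of \d.
non_digits = re.findall(r"\D+", "a1bc22d")
print(non_digits)  # ['a', 'bc', 'd']
```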

Groups

pattern = r"a(bc)(de)(f(g)h)i"
match = re.match(pattern, "abcdefghi")  # returns None if there is no match
print(match)
# <re.Match object; span=(0, 9), match='abcdefghi'>
if match:
   print(match.group())
   print(match.group(0))
   print(match.group(1))
   print(match.group(3))
   print(match.groups())
   print(match.span())

# abcdefghi
# abcdefghi
# bc
# fgh
# ('bc', 'de', 'fgh', 'g')
# (0, 9)
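
Groups also change what re.findall() returns, as noted earlier. A minimal sketch, using made-up landline-style numbers and illustrative variable names:

```python
import re

text = "Call 010-12345678 or 021-87654321."

# No groups: findall() returns the whole matches.
no_groups = re.findall(r"\d{3}-\d{8}", text)
print(no_groups)   # ['010-12345678', '021-87654321']

# One group: findall() returns only that group's text.
one_group = re.findall(r"(\d{3})-\d{8}", text)
print(one_group)   # ['010', '021']

# Several groups: findall() returns a list of tuples.
two_groups = re.findall(r"(\d{3})-(\d{8})", text)
print(two_groups)  # [('010', '12345678'), ('021', '87654321')]
```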

Special groups

Named groups have the format (?P<name>…), where name is the name of the group, and … is the content.
Non-capturing groups have the format (?:…). They are not accessible by the group method, so they can be added to an existing regular expression without breaking the numbering.

import re

pattern = r"(?P<first>abc)(?:def)(ghi)"

match = re.match(pattern, "abcdefghi")
if match:
   print(match.group("first"))
   print(match.groups())

# abc
# ('abc', 'ghi')
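
Named groups can also be read back as a dictionary and referenced in a replacement string with \g<name>. The sketch below assumes a simple YYYY-MM-DD date format; the group names and the sample sentence are illustrative.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# groupdict() maps each group name to the text it captured.
m = date.search("Born on 1941-01-05 in Tokyo")
parts = m.groupdict()
print(parts)  # {'year': '1941', 'month': '01', 'day': '05'}

# \g<name> refers to a named group inside the replacement string.
reordered = date.sub(r"\g<day>/\g<month>/\g<year>", "1941-01-05")
print(reordered)  # 05/01/1941
```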

Substitution

Note that “(.+) \1” is not the same as “(.+) (.+)”, because \1 refers to the first group’s subexpression, which is the matched expression itself, and not the regex pattern.

import re

pattern = r"(.+) \1"

match = re.match(pattern, "word word")
if match:
   print ("Match 1")

match = re.match(pattern, "?! ?!!")
if match:
   print ("Match 2")    

match = re.match(pattern, "abc cde")
if match:
   print ("Match 3")

# Match 1
# Match 2

re.sub(r'(\b[a-z]+) \1', r'\1', 'two two two cat in the the hat hat!') # 'two two cat in the hat!'

text = 'She sells sea shells by the seashore.The shells she sells are surely sea shells.'

re.sub(r'sea \w+', 'ocean',text) 
# 'She sells ocean by the seashore.The shells she sells are surely ocean.'
re.sub(r'sea (\w)\w*', r'\1ea',text) 
# 'She sells sea by the seashore.The shells she sells are surely sea.'

Greedy, Lazy, and Possessive Matching

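Python's regular expressions are greedy by default: when a match is ambiguous, they match the longest string possible.

The non-greedy (lazy) version of a quantifier matches the shortest string possible; it is written by putting a question mark after the quantifier, for example right after the closing curly brace.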

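Operator	Meaning
(?=exp)	Lookahead: matches a position followed by exp; for example, \b\w+(?=ing) matches danc in I'm dancing
(?<=exp)	Lookbehind: matches a position preceded by exp; for example, (?<=\bdanc)\w+\b matches the first ing in I love dancing and reading
(?!exp)	Negative lookahead: matches a position not followed by exp
(?<!exp)	Negative lookbehind: matches a position not preceded by exp
*?	Repeats any number of times, but as few as possible; for example, applied to aabab, a.*b matches the whole string aabab, while a.*?b matches the two strings aab and ab
+?	Repeats one or more times, but as few as possible
??	Repeats zero or one time, but as few as possible
{M,N}?	Repeats M to N times, but as few as possible
{M,}?	Repeats M or more times, but as few as possible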

import re

pattern1 = r"(t.*st)"
pattern2 = r"(t.*?st)"
match = re.search(pattern1, "A twister of twists once twisted a twist.")
print(match.group())
# twister of twists once twisted a twist
match = re.search(pattern2, "A twister of twists once twisted a twist.")
print(match.group())
# twist

pattern3 = r"(twi.*)\W"
pattern4 = r"(twi.*?)\W"
match = re.findall(pattern3, "A twister of twists once twisted a twist.")
print(match)
# ['twister of twists once twisted a twist']
match = re.findall(pattern4, "A twister of twists once twisted a twist.")
print(match)
# ['twister', 'twists', 'twisted', 'twist']
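
The lookaround operators listed in the table in this section can be checked in the same way. This is a small sketch: the first two patterns come from the table, while the negative lookaround patterns and their sample strings are invented for illustration.

```python
import re

# Lookahead: the word part that is followed by "ing".
lookahead = re.findall(r"\b\w+(?=ing)", "I'm dancing")
print(lookahead)  # ['danc']

# Lookbehind: the word part that comes right after "danc".
lookbehind = re.findall(r"(?<=\bdanc)\w+\b", "I love dancing and reading")
print(lookbehind)  # ['ing']

# Negative lookahead: whole numbers not followed by a percent sign.
not_percent = re.findall(r"\b\d+\b(?!%)", "50% off, 30 left")
print(not_percent)  # ['30']

# Negative lookbehind: whole numbers not preceded by a dollar sign.
not_dollars = re.findall(r"(?<!\$)\b\d+\b", "$100 or 200")
print(not_dollars)  # ['200']
```

The \b anchors in the last two patterns matter: without them the engine could backtrack and return part of a number, such as the 5 in 50%.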

Flags

Pass a flag as the second argument to re.compile() to change how the pattern matches.

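flag	match
re.DOTALL	. matches every character, including a newline
re.L	Locale-aware matching (re.LOCALE)
re.M	Multi-line mode (re.MULTILINE): ^ and $ also match at the start and end of each line
re.S	Single-line mode, an alias for re.DOTALL: . also matches \n
re.U	Unicode matching (re.UNICODE) for \w, \W, \b, and \B; this is the default for str patterns in Python 3
re.X	Alias for re.VERBOSE
re.I/re.IGNORECASE	Case-insensitive matching
re.VERBOSE	Ignores whitespace and comments inside the regular expression string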
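
A short sketch of how these flags behave in practice, with invented sample strings and illustrative variable names; several flags can be combined with the | operator.

```python
import re

# IGNORECASE: letter case no longer matters.
ignore_case = re.findall(r"ring", "Ring RING ring", re.IGNORECASE)
print(ignore_case)  # ['Ring', 'RING', 'ring']

# By default . does not match a newline; DOTALL changes that.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)
print(dot_default is None, dot_all is not None)  # True True

# Flags are combined with |.
combined = re.findall(r"^the \w+", "The cat\nthe dog", re.IGNORECASE | re.MULTILINE)
print(combined)  # ['The cat', 'the dog']

# VERBOSE: whitespace and # comments inside the pattern are ignored.
verbose = re.compile(r"""
    \d{3}   # three-digit prefix
    \d{8}   # remaining eight digits
    """, re.VERBOSE)
print(verbose.fullmatch("17300000000") is not None)  # True
```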

Examples

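Suppose you have the tedious task of finding every phone number and email address in a long web page or article. Scrolling through it by hand could take a long time. If you had a program that searched the text on the clipboard for phone numbers and email addresses, you would only need to press Ctrl-A to select all the text, press Ctrl-C to copy it to the clipboard, and then run your program. It would replace the text on the clipboard with the phone numbers and email addresses it found.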
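
The core of such a program could look roughly like the sketch below. It is only an illustration under several assumptions: the clipboard step is left out and a plain string stands in for the copied text, the phone pattern 1[3-9] followed by nine digits is a deliberate simplification of the prefix lists, and the names EMAIL, PHONE, and extract_contacts are hypothetical.

```python
import re

# Assumption: an email ends in one of a few common top-level domains.
EMAIL = re.compile(
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.(?:com|org|cn|net|edu)\b"
)
# Assumption: an 11-digit number starting with 1 and a second digit of 3-9,
# not embedded in a longer run of digits.
PHONE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def extract_contacts(text):
    """Return the phone numbers and email addresses found in text."""
    return {"phones": PHONE.findall(text), "emails": EMAIL.findall(text)}


sample = "Write to A_Bird_Story@vogel.edu.cn or call 17300000000. Office: 010-1234."
contacts = extract_contacts(sample)
print(contacts)
```

Non-capturing groups (?:…) are used so that findall() returns whole addresses rather than tuples of group contents.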

Email

import re
pattern = re.compile(
    r'''
    ^[a-zA-Z0-9._%+-]+      # email user name
    @                       # @ symbol
    [a-zA-Z0-9._-]+         # domain name prefix
    (\.((com)|(org)|(cn)|(net)|(edu))){1,3}$    # common email domains / top-level domains
    ''',
    re.UNICODE | re.VERBOSE)
def valid_email():
    email = input('Please enter an email address: ')
    # re.match() returns None when the address does not fit the pattern
    if re.match(pattern, email):
        print('valid email')
    else:
        print('invalid email')
valid_email()

Please enter an email address: A_Bird_Story@vogel.edu.cn
valid email
Please enter an email address: A_Bird_Story@vogel.moon
invalid email
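
For reuse and testing, the same check can be wrapped in a function that returns a boolean instead of reading from input(). This is a sketch under assumptions: the name is_valid_email is hypothetical, the pattern is the one above without the anchors and capturing groups, and stripping surrounding whitespace is a design choice not made in the original code.

```python
import re

# Same character classes and domain list as the pattern above; fullmatch()
# replaces the explicit ^ and $ anchors.
EMAIL = re.compile(
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._-]+(?:\.(?:com|org|cn|net|edu)){1,3}"
)


def is_valid_email(address):
    """Return True if the address, stripped of outer whitespace, fits the pattern."""
    return EMAIL.fullmatch(address.strip()) is not None


print(is_valid_email("A_Bird_Story@vogel.edu.cn"))      # True
print(is_valid_email("A_Bird_Story@vogel.moon"))        # False
print(is_valid_email("  A_Bird_Story@vogel.edu.cn  "))  # True
```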

With the shelve module, you can save variables from a Python program to binary shelf files.
The shelve module lets you add Save and Open features to your program.

import shelve
shelfFile = shelve.open('re_pattern')
# raw string, so that \. reaches the regex engine unchanged
pat_email = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._-]+(\.((com)|(org)|(cn)|(net)|(edu))){1,3}$'
shelfFile['pat_email'] = pat_email
shelfFile.close()

After running the previous code on Windows, you will see three new files in the current working directory:
re_pattern.bak, re_pattern.dat, and re_pattern.dir. On OS X, only a single re_pattern.db file is created.

import re
import shelve
shelfFile = shelve.open('re_pattern')
type(shelfFile)
# shelve.DbfilenameShelf
list(shelfFile.keys())
# ['pat_email']
list(shelfFile.values())
# ['^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._-]+(\\.((com)|(org)|(cn)|(net)|(edu))){1,3}$']
pattern = re.compile(shelfFile['pat_email'], re.UNICODE)
shelfFile.close()
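
A shelf can also be used as a context manager, which closes it automatically. The sketch below is one possible variant, not the original workflow: it writes to a temporary directory (an assumption made so the example leaves no files behind) and the names PAT_EMAIL and shelf_path are illustrative.

```python
import os
import re
import shelve
import tempfile

PAT_EMAIL = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._-]+(\.((com)|(org)|(cn)|(net)|(edu))){1,3}$"

with tempfile.TemporaryDirectory() as tmp_dir:
    shelf_path = os.path.join(tmp_dir, "re_pattern")

    # Save: the with-block closes the shelf when it ends.
    with shelve.open(shelf_path) as shelf:
        shelf["pat_email"] = PAT_EMAIL

    # Open again and read the stored pattern string back.
    with shelve.open(shelf_path) as shelf:
        loaded = shelf["pat_email"]

pattern = re.compile(loaded)
print(bool(pattern.match("A_Bird_Story@vogel.edu.cn")))  # True
```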

We can also write frequently used patterns to a .py file for later use:

pat_email = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._-]+(\.((com)|(org)|(cn)|(net)|(edu))){1,3}$'
# repr() writes a valid Python string literal, with the backslash escaped
with open('re_pattern.py', 'w') as re_pattern:
    re_pattern.write('pat_email = ' + repr(pat_email) + '\n')


import re_pattern 
re_pattern.pat_email 
# '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._-]+(\\.((com)|(org)|(cn)|(net)|(edu))){1,3}$'

Douban


First, we use the requests module to scrape a director's information from Douban:

import requests
from lxml import etree

headers = {
    'USER-AGENT':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
response = requests.get(url="https://movie.douban.com/celebrity/1054439/",headers=headers)
html = response.text
director_info = etree.HTML(html).xpath('//div[@class="info"]/ul//text()')
director_info = "".join(director_info)


import re
infos = re.sub(r"\n", "",director_info)
infos

'                        性别:         男                                星座:         摩羯座                                出生日期:         1941-01-05                                出生地:         日本,东京                                职业:         编剧 / 导演 / 制片人 / 演员 / 动画                                更多外文名:         宮﨑駿 / みやざき はやお / the Japanese Walt Disney (昵称)                                    家庭成员:         宫崎吾朗(长子)                                imdb编号:         nm0594503                                官方网站:         www.ghibli.jp                           '

pat1 = r"性别:\s+(\w+)"  # 性别 = gender
if re.search(pat1, infos):
    print(re.search(pat1, infos).group(1))
# 男
pat2 = r"星座:\s+(\w+)"  # 星座 = zodiac sign
if re.search(pat2, infos):
    print(re.search(pat2, infos).group(1))
# 摩羯座
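
Instead of one pattern per field, all fields could be collected in a single pass. The sketch below assumes what the scraped text above suggests: every field looks like key: value and fields are separated by at least two whitespace characters. The sample string is a shortened copy of that output, and the names FIELD and fields are illustrative.

```python
import re

# Shortened copy of the scraped text shown above (assumed layout: "key: value",
# with runs of spaces between fields).
infos = (
    "    性别:         男            星座:         摩羯座            "
    "出生日期:         1941-01-05            出生地:         日本,东京            "
    "职业:         编剧 / 导演 / 制片人 / 演员 / 动画            "
    "imdb编号:         nm0594503            官方网站:         www.ghibli.jp    "
)

# Key, colon, spaces, then a lazy value that stops before two or more
# whitespace characters or before the end of the string.
FIELD = re.compile(r"(\S+):\s+(.+?)(?=\s{2,}|\s*$)")

fields = dict(FIELD.findall(infos))
for key, value in fields.items():
    print(key, "->", value)
```

The lazy quantifier together with the lookahead lets values keep their single inner spaces, as in the list of occupations, while still stopping at the wide gap before the next field.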