
Python Basics: Regular Expression Matching

2016-10-17 22:44

Python provides the re module for regular expression matching.

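Common special characters in regular expressions

Symbol	Description
()	Groups part of a regular expression; each pair of parentheses forms one group
\w	Matches a letter, digit, or underscore
\W	Matches any character that is not a letter, digit, or underscore
\s	Matches a whitespace character
\S	Matches a non-whitespace character
\d	Matches a digit
\D	Matches a non-digit character
\b	Matches a word boundary (the start or end of a word)
\B	Matches a position that is not a word boundary
.	Matches any single character except a newline, including Chinese characters
[m]	Matches the single character m
[m1m2m3m4]	Matches a single character that is any one of m1, m2, m3, m4
[m-n]	Matches a single character in the range m to n
[^m]	Matches a single character other than m
^	Matches the start of the string
$	Matches the end of the string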
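
The table above can be tried out with a short sketch. The sample strings and variable names below are made up for illustration, and the snippet is written so that it should run on both Python 2.7 and Python 3:

```python
import re

# Hypothetical sample text, chosen only to exercise the character classes
sample = "user_01 paid 250 yuan"

# \w+ : runs of letters, digits, and underscores
words = re.findall(r"\w+", sample)

# \d+ : every run of digits, including the one inside "user_01"
all_digits = re.findall(r"\d+", sample)

# \b\d+\b : digit runs that form a whole word on their own
whole_numbers = re.findall(r"\b\d+\b", sample)

# [^aeiou\s] : any single character that is neither a listed vowel nor whitespace
consonants = re.findall(r"[^aeiou\s]", "a b c")

print(words)
print(all_digits)
print(whole_numbers)
print(consonants)
```

Comparing all_digits with whole_numbers shows what \b adds: "01" is dropped because there is no word boundary between "_" and "0".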

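Common quantifiers in regular expressions

Symbol	Description
*	Matches 0 or more times
+	Matches 1 or more times
?	Matches 0 or 1 time
{m}	Matches exactly m times
{m,n}	Matches m to n times; n may be omitted, meaning m or more times
*?	Matches 0 or more times, as few times as possible
+?	Matches 1 or more times, as few times as possible
??	Matches 0 or 1 time, as few times as possible
{m,n}?	Matches m to n times, as few times as possible
(?P<mark>pattern)	Names a group: mark is the group name and pattern is the expression the group matches
(?P=mark)	Backreference: matches the same text that the group named mark matched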
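
As a rough illustration of the counted quantifiers in the table, the sketch below uses a made-up phone-number format (three digits, a hyphen, then four to eight digits); the format and the variable names are assumptions for the example, not something defined by the article:

```python
import re

# Hypothetical format: 3 digits, "-", then 4 to 8 digits, then end of string
phone = re.compile(r"\d{3}-\d{4,8}$")

ok = phone.match("010-1234567")      # 7 digits after the hyphen: accepted
too_short = phone.match("010-123")   # only 3 digits after the hyphen: rejected

# {2,} has no upper bound, so it takes every available "a"
open_ended = re.match(r"a{2,}", "aaaa").group()

# {2,3} is greedy and takes 3; {2,3}? is non-greedy and stops at 2
greedy = re.match(r"a{2,3}", "aaaa").group()
lazy = re.match(r"a{2,3}?", "aaaa").group()

print(ok.group())
print(too_short)
print(open_ended)
print(greedy)
print(lazy)
```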

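There are two common ways to perform a match:

#coding=utf-8
import re

patternString = r"\d+"   # example value: patternString is the regular expression
string = "123abc"        # example value: string is the text to match against

# Approach 1: compile the pattern first, then match with the Pattern object
pat = re.compile(patternString)
mat = pat.match(string)
mat.group()

# Approach 2: call the module-level function directly
mat = re.match(patternString, string)
mat.group()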
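
One practical reason to prefer the compiled form is reuse across many inputs. The following is a minimal sketch under that assumption; the pattern and the list of names are invented for the example. It also checks for None before calling group(), since match() returns None when nothing matches:

```python
import re

# Compile once, reuse for every input (hypothetical pattern: a capitalized word)
pat = re.compile(r"[A-Z][a-z]+")

names = ["Alice", "bob", "Carol9"]   # made-up sample inputs
matched = []
for name in names:
    mat = pat.match(name)
    if mat is not None:              # calling group() on None would raise AttributeError
        matched.append(mat.group())

print(matched)
```

"Carol9" still matches because match() only anchors at the start of the string; the trailing "9" is simply left unconsumed.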

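Commonly used functions in Python's re module:

Function	Description
findall(pattern, string, flags=0)	Finds every match of pattern in string and returns them as a list, or an empty list if there is none. If pattern contains one group, the list holds that group's text for each match; if it contains several groups, the list holds one tuple per match, with one element per group. flags selects the matching options.
sub(pattern, repl, string, count=0)	Replaces the substrings of string that match pattern with repl. With count=0 every match is replaced; with count greater than 0 at most count matches are replaced
subn(pattern, repl, string, count=0)	Same as sub(), but returns a 2-tuple: the first element is the resulting string and the second is the number of replacements made
match(pattern, string, flags=0)	Matches pattern at the beginning of string; returns a Match object on success, otherwise None
search(pattern, string, flags=0)	Scans string for the first position where pattern matches; returns a Match object for that first match, otherwise None
compile(pattern, flags=0)	Compiles a regular expression into a Pattern object
split(pattern, string, maxsplit=0)	Splits string at each match of pattern; maxsplit is the maximum number of splits (0 means no limit)
escape(pattern)	Escapes the special characters in pattern (such as "*", "+", "?") so that they are matched literally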
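
A few of the functions in the table behave in ways that are easy to misread, so here is a small sketch of findall() with groups, sub() with count, subn(), and escape(). The sample text is invented for the example:

```python
import re

text = "a=1, b=22, c=333"   # made-up sample input

# Two groups: findall returns one tuple per match
pairs = re.findall(r"(\w)=(\d+)", text)

# One group: findall returns just that group's text for each match
values = re.findall(r"\w=(\d+)", text)

# count=1 replaces only the first match
first_only = re.sub(r"\d+", "N", text, count=1)

# subn returns (new_string, number_of_replacements)
replaced, n = re.subn(r"\d+", "N", text)

# escape() makes "+" and "?" literal, so the pattern matches the raw text
literal = re.escape("1+1=2?")
escaped_hit = re.match(literal, "1+1=2?")
unescaped_hit = re.match("1+1=2?", "1+1=2?")

print(pairs)
print(values)
print(first_only)
print(replaced, n)
print(escaped_hit is not None, unescaped_hit is None)
```

The exact string returned by escape() varies between Python versions, which is why the sketch checks its effect rather than printing it.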

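Most functions in the re module accept an optional flags argument that selects matching options. The available options are:

Option	Description
I or IGNORECASE	Ignore case
L or LOCALE	Make matching depend on the current locale, for multilingual environments
M or MULTILINE	Multi-line matching: "^" and "$" also match at the start and end of each line
S or DOTALL	Make "." match every character, including \n
X or VERBOSE	Ignore whitespace and line breaks in the pattern, so that comments can be added
U or UNICODE	Make "\w", "\W", "\b", "\B", "\d", "\D", "\s", "\S" follow the Unicode character properties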
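
The effect of the most common options can be seen in a short sketch. The three-line sample text and the date pattern are assumptions made up for this example:

```python
import re

text = "Python\npython\nPYTHON"   # made-up three-line sample

# IGNORECASE: all three spellings match
any_case = re.findall(r"python", text, re.I)

# Without MULTILINE, ^ only matches at the very start, where the text has "P"
no_multiline = re.findall(r"^p\w+$", text)

# MULTILINE: ^ and $ apply to each line
multiline = re.findall(r"^p\w+$", text, re.M)

# Flags can be combined with |
both = re.findall(r"^p\w+$", text, re.I | re.M)

# DOTALL: "." also matches "\n", so ".+" runs to the end of the text
first_line = re.match(r".+", text).group()
everything = re.match(r".+", text, re.S).group()

# VERBOSE: whitespace and comments inside the pattern are ignored
year_month = re.compile(r"""
    \d{4}   # year
    -
    \d{2}   # month
""", re.X)
ym = year_month.match("2016-10").group()

print(any_case)
print(no_multiline, multiline, both)
print(repr(first_line), repr(everything))
print(ym)
```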

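Python offers two ways to perform a match: calling the functions of the re module directly, or compiling a Pattern object with re.compile() and calling its methods. With match() and search(), both ways yield a Match object on success. The attributes and methods of Match are:

Attribute/Method	Description
pos	The position at which the search started
endpos	The position at which the search ended
string	The string that was searched
re	The regular expression object that was used (e.g. match.re.pattern gives the pattern string)
lastindex	The index of the last matched group
lastgroup	The name of the last matched group
group(index=0)	Returns the text matched by a group; index=0 means the whole match
groups()	Returns the text matched by all groups
groupdict()	Returns a dictionary with group names as keys and the text matched by each named group as values
start([group])	Returns the start position of a group
end([group])	Returns the end position of a group
span([group])	Returns the start and end positions of a group as a 2-tuple
expand(template)	Substitutes the group results into template and returns the resulting string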
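
To make the table concrete, the sketch below runs one search and inspects the resulting Match object. The input string and the group names year and month are made up for the example:

```python
import re

# Hypothetical input; the group names "year" and "month" are illustrative
mat = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "date: 2016-10-17")

whole = mat.group()            # text of the whole match
parts = mat.groups()           # text of every group, in order
named = mat.groupdict()        # group name -> text
span = mat.span()              # (start, end) of the whole match
month_start = mat.start("month")
swapped = mat.expand(r"\g<month>/\g<year>")   # fill a template from the groups

print(whole, parts, named)
print(span, month_start)
print(mat.lastindex, mat.lastgroup)
print(mat.pos, mat.endpos)
print(mat.re.pattern)
print(swapped)
```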

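The rest of this article walks through regular expression matching in Python by example.

Regular expression support lives in the re module, so import it first: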

# coding=utf-8
import re

($) Matching a single character with .

# . matches any single character (except \n)
mat = re.match(r".", "a")
print mat.group()

mat_1 = re.match(r".", "0")
print mat_1.group()

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
a
0

Process finished with exit code 0

. can also match a single character inside a more complex string:

mat = re.match(r".", "{a}")
print mat.group()

mat_0 = re.match(r"{.}", "{c}")
print mat_0.group()

mat_1 = re.match(r"{\[\(.\)\]}", "{[(a)]}")
print mat_1.group()

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
{
{c}
{[(a)]}

Process finished with exit code 0

A single . matches one character, so several . in a row match several characters:

mat = re.match(r"..", "a1")
print mat.group()

mat_0 = re.match(r"{..}", "{0h}")
print mat_0.group()

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
a1
{0h}

Process finished with exit code 0

($) . matches any character (except \n). To match one character from a specific set, use square brackets [...]:

# Match any one character from the set [abcd]
mat = re.match(r"[abcd]", "a")  # the set can also be written as r"[a-d]"
print mat.group()

# The set is the 10 characters 0 through 9
mat_0 = re.match(r"{[0-9]}", "{0}")
print mat_0.group()

# "a" is not in the set [0123], so mat_1 is None
mat_1 = re.match(r"[0123]", "a")
print mat_1

mat_2 = re.match(r"[a-zA-Z0-9]", "q")
print mat_2.group()

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
a
{0}
None
q

Process finished with exit code 0

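($) Square brackets denote a character set in a regular expression, so matching a literal square bracket requires an escape character.

# The inner brackets are a character set; the outer, escaped brackets are literal characters
pat = re.compile(r"\[[a-zA-Z0-9]\]")
mat = pat.match("[a]")
print mat.group()

Output: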

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
[a]

Process finished with exit code 0

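($) Matching multiple characters

Match a string whose first letter is uppercase and whose remaining characters are lowercase letters or digits (* matches 0 or more characters):

mat = re.match(r"[A-Z][a-z0-9]*", "Abcd01s")
print mat.group()

Output: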

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
Abcd01s

Process finished with exit code 0

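Match a Python variable name, which starts with an underscore or a letter (+ matches 1 or more characters, so this pattern requires names of at least two characters; use * instead to accept single-character names):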

pat = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]+")

mat = pat.match("_Ax89")
print mat.group()

mat_0 = pat.match("a_s12")
print mat_0.group()

mat_1 = pat.match("3_ah7")
print mat_1

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
_Ax89
a_s12
None

Process finished with exit code 0

Match a numeric string between 0 and 99 (? matches 0 or 1 time):

pat = re.compile(r"[1-9]?[0-9]") # ? matches 0 or 1 time

mat_0 = pat.match("99")
print mat_0.group()

mat_1 = pat.match("8")
print mat_1.group()

mat_2 = pat.match("03")
print mat_2.group()

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
99
8
0

Process finished with exit code 0

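As the output shows, "03" also matches, but only its first character, 0. [1-9]? cannot match "0", so it matches zero times; [0-9] then matches the 0, and match() does not require the rest of the string to be consumed. To reject strings such as "03", anchor the pattern at both ends:

pat = re.compile(r"^[1-9]?[0-9]$") # ? matches 0 or 1 time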

mat_0 = pat.match("99")
print mat_0.group()

mat_1 = pat.match("8")
print mat_1.group()

mat_2 = pat.match("03")
print mat_2

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
99
8
None

Process finished with exit code 0

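($) Match an email address whose prefix is 6 to 10 letters, digits, or underscores:

# [_a-zA-Z0-9]{6,10} means between 6 and 10 of these characters (no space is allowed inside the braces)
# (qq|163|126) means the mail provider is one of qq, 163, 126
# \. matches a literal dot; an unescaped . would match any character
pat = re.compile(r"[_a-zA-Z0-9]{6,10}@(qq|163|126)\.com")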

mat_0 = pat.match("123456@qq.com")
print mat_0.group()

mat_1 = pat.match("mxd_huster@163.com")
print mat_1.group()

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
123456@qq.com
mxd_huster@163.com

Process finished with exit code 0

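($) The quantifiers "*", "+", and "?" are greedy: they match as many characters as the pattern allows. Appending "?" turns them into "*?", "+?", and "??", the non-greedy forms, which match as few characters as the pattern allows:

# [a-z]+ matches a-z one or more times, as many times as possible
mat_0 = re.match(r"[0-9][a-z]+", "1abcde")
print mat_0.group()

# [a-z]+? matches a-z one or more times, as few times as possible (here, once)
mat_1 = re.match(r"[0-9][a-z]+?", "1abcde")
print mat_1.group()

# [A-Z]? matches A-Z zero or one time, preferring one
mat_2 = re.match(r"[_][A-Z]?", "_ABCDE")
print mat_2.group()

# [A-Z]?? matches A-Z zero or one time, preferring zero
mat_3 = re.match(r"[_][A-Z]??", "_ABCDE")
print mat_3.group()

Output: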

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
1abcde
1a
_A
_

Process finished with exit code 0
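
A common place where the greedy and non-greedy forms give visibly different results is text with repeated delimiters. The sketch below uses a made-up HTML fragment to show this; it is only an illustration, not a recommendation to parse HTML with regular expressions:

```python
import re

html = "<b>bold</b> and <i>italic</i>"   # made-up sample input

# Greedy: .+ runs to the last ">" in the string, giving one long match
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? stops at the first ">" it can, giving one match per tag
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```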

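($) In Python, "^" and "$" anchor a regular expression to the start and end of the string:

# ^ means the string must start with characters from the set [_a-zA-Z0-9]
# $ means the string must end with text matched by @(qq|163|126)\.com
pat = re.compile(r"^[_a-zA-Z0-9]{6,10}@(qq|163|126)\.com$")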

mat_0 = pat.match("_123456a@qq.com")
print mat_0.group()

mat_1 = pat.match("abcdefg@163.coma")
print mat_1

mat_2 = pat.match("$ABCDEFG@126.com")
print mat_2

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
_123456a@qq.com
None
None

Process finished with exit code 0

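[Note] Besides "^" and "$", the start and end of the string can also be anchored with \A and \Z:

# \A means the string must start with characters from the set [_a-zA-Z0-9]
# \Z means the string must end with text matched by @(qq|163|126)\.com
pat = re.compile(r"\A[_a-zA-Z0-9]{6,10}@(qq|163|126)\.com\Z")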

mat_0 = pat.match("_123456a@qq.com")
print mat_0.group()

mat_1 = pat.match("abcdefg@163.coma")
print mat_1

mat_2 = pat.match("$ABCDEFG@126.com")
print mat_2

The output is the same as above.
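
The two pairs of anchors are not completely interchangeable, though. As a small sketch with made-up inputs: "$" also matches just before a trailing newline and follows the MULTILINE option, while \A and \Z always refer to the very start and end of the string:

```python
import re

pat_dollar = re.compile(r"^\d+$")
pat_z = re.compile(r"\A\d+\Z")

# "$" tolerates a single trailing newline; \Z does not
dollar_hit = pat_dollar.match("123\n")
z_hit = pat_z.match("123\n")

# With MULTILINE, ^ and $ apply per line, but \A and \Z still mean
# the start and end of the whole string
lines = "1\n22\n333"
per_line = re.findall(r"^\d+$", lines, re.M)
whole_only = re.findall(r"\A\d+\Z", lines, re.M)

print(dollar_hit is not None)
print(z_hit)
print(per_line)
print(whole_only)
```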

($) Matching the numbers 1-100:

pat = re.compile(r"[1-9]?\d$")

mat_0 = pat.match("99")
print mat_0.group()

mat_1 = pat.match("02")
print mat_1

mat_2 = pat.match("4")
print mat_2.group()

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
99
None
4

Process finished with exit code 0

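The pattern above correctly handles one- and two-digit numbers (note that it also accepts a lone 0). To additionally accept the three-digit number 100, the simplest approach is the | alternation operator:

# each alternative carries its own $, so input such as "100abc" is rejected
pat = re.compile(r"[1-9]?\d$|100$")

mat = pat.match("100")
print mat.group()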

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
100

Process finished with exit code 0

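($) Using parentheses () to group part of a regular expression

pat_0 = re.compile(r"\d{6}@163\.com$|\d{6}@126\.com$")
mat_0 = pat_0.match("123456@163.com")
print mat_0.group()

# Listing every mail provider as a separate alternative makes the pattern very long.
# Grouping lets the alternatives share the common parts of the pattern.
pat_1 = re.compile(r"\d{6}@(qq|163|sina)\.com")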
mat_1 = pat_1.match("654321@sina.com")
print mat_1.group()

mat_2 = pat_1.match("234567@qq.com")
print mat_2.group()

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
123456@163.com
654321@sina.com
234567@qq.com

Process finished with exit code 0

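Within a pattern, \<number> reuses the text matched by an earlier group:

# Match <book>Python Book<book>
pat = re.compile(r"(<[\w]+>).*\1")
mat = pat.match("<book>Python Book<book>")
print mat.group()

# Match <book>Python Book</book>
# (the closing tag is written as the literal text </ followed by \1;
#  a character class such as [</] would match only one of the two characters)
pat_0 = re.compile(r"<([\w]+>).*</\1")
mat_0 = pat_0.match("<book>Python Book</book>")
print mat_0.group()

# Reusing a group with \<number> keeps several parts of the match identical:
# once the first group has matched, the later part of the pattern is fixed.
mat_1 = pat_0.match("<book>Java Book</Book>")
print mat_1
# The first group matched book>, so \1 must match exactly book>
# and no longer behaves like the pattern [\w]+>.
# "Book>" therefore fails and mat_1 is None.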
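
Numbered backreferences are also handy outside of tag matching. As a minimal sketch with a made-up sentence, the pattern below finds a word that is accidentally repeated, and sub() uses \1 in the replacement to keep a single copy:

```python
import re

# \b(\w+) \1\b : a whole word, a space, then the very same word again
doubled = re.compile(r"\b(\w+) \1\b")

sentence = "this is is a test"   # made-up input with a repeated word
mat = doubled.search(sentence)

found = mat.group()       # the repeated pair
word = mat.group(1)       # the text captured by group 1

# In the replacement string, \1 refers to the captured word
fixed = doubled.sub(r"\1", sentence)

print(found)
print(word)
print(fixed)
```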

The examples above reuse groups by their numeric index (1, 2, 3, 4, ...). A group can also be given a name, which makes reuse easier to read in complex patterns:

[Note] Writing (?P<name>...) at the start of a group names that group name; (?P=name) then reuses the text matched by the group called name.

# Match <book>Python Book</book>
# (?P<mark>...) names the group mark
# (?P=mark) reuses what that group matched
pat = re.compile(r"<(?P<mark>[\w]+>).*</(?P=mark)")
mat = pat.match("<book>Python Book</book>")
print mat.group()

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
<book>Python Book</book>

Process finished with exit code 0
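
Named groups are useful for reading results as well as for backreferences. The sketch below is a variation on the tag example; the group names tag and body are chosen for this illustration and do not come from the article:

```python
import re

# "tag" and "body" are illustrative group names
pat = re.compile(r"<(?P<tag>\w+)>(?P<body>.*)</(?P=tag)>")

mat = pat.match("<book>Python Book</book>")
tag = mat.group("tag")          # look a group up by name
body = mat.group("body")
as_dict = mat.groupdict()       # all named groups at once

# The closing tag must repeat the opening tag exactly, including case
mismatch = pat.match("<book>Java Book</Book>")

print(tag)
print(body)
print(as_dict)
print(mismatch)
```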

($) A brief look at other functions in the re module

match() only matches at the beginning of the string. To find a substring that may start anywhere in the string, use search(pattern, string, flags=0):

# Find the number in the string
mat = re.search(r"\d+", "count of boys is 2000")
print mat.group()

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
2000

Process finished with exit code 0

($) search() returns only the first matching substring. To get every match, use the re module's findall(pattern, string, flags=0):

# Find all the numbers in the string
mat = re.findall(r"\d+", "Python=123, Java=345, NodeJS=222")
print mat

# named add rather than sum, to avoid shadowing the built-in sum()
def add(a, b):
    return int(a) + int(b)

result = reduce(add, mat)
print "Sum:", result

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
['123', '345', '222']
Sum: 690

Process finished with exit code 0

[Note] findall() returns all the matches found in the string as a list.

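($) sub(pattern, repl, string, count=0, flags=0) first finds the substrings that match pattern and then replaces each of them with the string repl:

# Replace every number in text with '101'
# (named text rather than str, to avoid shadowing the built-in str)
text = "Python=100, Java=200, NodeJS=300"
result = re.sub(r"\d+", '101', text)
print result

Output: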

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
Python=101, Java=101, NodeJS=101

Process finished with exit code 0

In the call above repl is a string, so every match is replaced with the same text. When different matches need different replacements, repl can be a function instead. sub() then finds the matches of pattern in string, passes the Match object of each one to repl, and replaces the matched text with the string that repl returns:

def addOne(match):
    value = match.group()
    num = int(value)
    if num == 100:
        return "%d" % (num + 1)
    elif num == 200:
        return "{0}".format(num + 2)
    elif num == 300:
        return "%d" % (num + 3)
    # the return value must be a string
    return "%d" % num

# Replace each number in text with the value computed by addOne
text = "Python = 100, Java = 200, NodeJS = 300"
result = re.sub(r"\d+", addOne, text)
print result

Output:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
Python = 101, Java = 202, NodeJS = 303

Process finished with exit code 0
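
Between a fixed string and a full function there is a middle option: a replacement string may refer to the groups of the match. The sketch below, with a made-up date string, shows numbered references, named references, and a short lambda used as the repl function:

```python
import re

date = "2016-10-17"   # made-up sample input

# \1, \2, \3 in the replacement refer to the numbered groups
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)

# \g<name> refers to a named group ("y" and "m" are illustrative names)
partial = re.sub(r"(?P<y>\d{4})-(?P<m>\d{2})", r"\g<m>.\g<y>", date)

# A lambda works as repl too; it must return a string
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "a=1, b=20")

print(reordered)
print(partial)
print(doubled)
```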

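($) split(pattern, string, maxsplit=0, flags=0) splits a string at every position where pattern matches. maxsplit is the maximum number of splits; maxsplit=0 means split at every match:

# named text rather than str, to avoid shadowing the built-in str
text = "computer:Java Python Php C#,C++"
results = re.split(":| |,", text)
print results

Output: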

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
['computer', 'Java', 'Python', 'Php', 'C#', 'C++']

Process finished with exit code 0
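
Two further behaviours of split() are worth a quick sketch: limiting the number of splits with maxsplit, and keeping the separators by wrapping the pattern in a group. The short inputs are invented for the example:

```python
import re

text = "computer:Java Python Php C#,C++"

# maxsplit=2: split at the first two separators only, keep the rest intact
limited = re.split(r"[: ,]", text, maxsplit=2)

# A capturing group in the pattern makes split() return the separators too
with_seps = re.split(r"([: ,])", "a:b c")

# A quantifier on the separator collapses runs such as ", " or ",,"
collapsed = re.split(r"[ ,]+", "a, b,,c")

print(limited)
print(with_seps)
print(collapsed)
```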

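($) The following example uses the urllib2 module to fetch a page from imooc.com (慕课网), uses the re module to extract the image URLs from it, and downloads the images to the local file system:

# coding=utf-8
# Step 1: fetch the page with urllib2
import urllib2
req = urllib2.urlopen('http://www.imooc.com/course/list')
buf = req.read()

# Step 2: extract the image URLs with the re module
# (.+? is non-greedy, so two URLs on the same line are not merged into one match)
import re
results = re.findall(r"http://.+?\.jpg", buf)
print results

# Step 3: download each image to the local file system
i = 0
for url in results:
    # open in binary mode, since image data is not text
    f = open(str(i) + ".jpg", 'wb')

    req = urllib2.urlopen(url)
    buf = req.read()
    f.write(buf)
    f.close()
    i += 1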

After running the script, the results list contains all the image URLs found on the page:

/usr/bin/python2.7 /home/mxd/文档/WorkPlace/python/PythonStudy/test.py
['http://szimg.mukewang.com/5763761f0001c35e05400300-360-202.jpg',
'http://img.mukewang.com/529dc3380001379906000338-240-135.jpg',
'http://img.mukewang.com/57035ff200014b8a06000338-240-135.jpg',
'http://img.mukewang.com/53a28e960001311b06000338-240-135.jpg',
'http://img.mukewang.com/53e1d0470001ad1e06000338-240-135.jpg',
'http://img.mukewang.com/574669dc0001993606000338-240-135.jpg',
'http://img.mukewang.com/570360620001390f06000338-240-135.jpg',
'http://img.mukewang.com/53daee770001dd0706000338-240-135.jpg',
'http://img.mukewang.com/574678bd00010a7206000338-240-135.jpg',
'http://img.mukewang.com/53bf89100001684e06000338-240-135.jpg',
'http://img.mukewang.com/53d068840001a89906000338-240-135.jpg',
'http://img.mukewang.com/540e57300001d6d906000338-240-135.jpg',
'http://img.mukewang.com/57035f110001a57906000338-240-135.jpg',
'http://img.mukewang.com/5703604a0001694406000338-240-135.jpg',
'http://img.mukewang.com/570756b0000146fc06000338-240-135.jpg',
'http://img.mukewang.com/5704a5d50001582f06000338-240-135.jpg',
'http://img.mukewang.com/5704ae850001f59906000338-240-135.jpg',
'http://img.mukewang.com/53eafb44000146c706000338-240-135.jpg',
'http://img.mukewang.com/578741d3000151e806000338-240-135.jpg',
'http://img.mukewang.com/5705d3a3000129d006000338-240-135.jpg',
'http://img.mukewang.com/576b7c04000144dc06000338-240-135.jpg',
'http://img.mukewang.com/5704a54300013d5d06000338-240-135.jpg',
'http://img.mukewang.com/5707604300018d0406000338-240-135.jpg',
'http://img.mukewang.com/5707699500012d5a06000338-240-135.jpg']

Process finished with exit code 0

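Running ls on the command line then shows the downloaded files in the local file system.

[Attached: a chart of common regular expressions, from a WeChat official account]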
