
Python Study Notes: Regular Expressions with the re Module

2016-12-30 00:30

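1. Matching strings with match() and search(), and viewing the result with group()

match(): matches only at the beginning of the string; returns a match object on success and None on failure.

search(): scans the whole string for the first match; returns a match object on success and None on failure.

Example 1: The difference between match() and search()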

import re

m = re.match('foo', 'seafood')
if m is not None: print("match-" + m.group())

m = re.search('foo', 'seafood')
if m is not None: print("search-" + m.group())

# Output: search-foo
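
The same contrast can be sketched with position information, since a match object also reports where it matched through span(). This is only an illustration: the helper name describe is made up here and is not part of the re module.

```python
import re

def describe(pattern, text):
    """Return (match() result, search() result) as (matched text, span) or None.

    describe is a hypothetical helper used only for this illustration.
    """
    def info(m):
        return None if m is None else (m.group(), m.span())
    return info(re.match(pattern, text)), info(re.search(pattern, text))

print(describe('foo', 'seafood'))  # match() fails, search() finds 'foo' at 3..6
print(describe('foo', 'food'))     # both succeed at 0..3
```

match() only ever reports a span starting at 0, while search() may report any position.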

Example 2: match() matches only from the start of the string

import re

m = re.match('foo', 'foo')
if m is not None:
    print("matched-" + m.group())

m = re.match('foo', 'bar')
if m is not None: print("cannot match-" + m.group())

m = re.match('foo', 'food on the table')
if m is not None: print("matched from the start-" + m.group())

# Output:
#   matched-foo
#   matched from the start-foo

Example 3: Matching more than one value (alternation with "|")

import re

bt = 'bat|bet|bit'

m = re.match(bt, 'bat')
if m is not None:
    print("1 matched-" + m.group())

m = re.match(bt, 'blt')
if m is not None:
    print("2 matched-" + m.group())

m = re.match(bt, 'he bit me')
if m is not None:
    print("3 matched-" + m.group())

m = re.search(bt, 'he bit me')
if m is not None:
    print("4 matched-" + m.group())

# Output:
#   1 matched-bat
#   4 matched-bit

Example 4: Matching any single character

The dot "." matches any single character except the newline \n. It must match exactly one character, so it cannot match an empty string.

import re

bt = ".end"

m = re.match(bt, 'bend')
if m is not None:
    print("bend matched-" + m.group())

m = re.match(bt, 'end')
if m is not None:
    print("end matched-" + m.group())

m = re.match(bt, '\nend')
if m is not None:
    print("\\nend matched-" + m.group())

m = re.search(bt, 'the end.')
if m is not None:
    print("the end. matched-" + m.group())

# Output:
#   bend matched-bend
#   the end. matched- end

Example 5: Matching a decimal point

import re

bt = "3.14"
pi_bt = r"3\.14"  # a literal dot (decimal point)

m = re.match(bt, '3.14')    # the dot matches any character, including '.'
if m is not None:
    print("3.14 matched-" + m.group())

m = re.match(pi_bt, '3.14')  # exact match
if m is not None:
    print("exact match-" + m.group())

m = re.match(bt, '3014')    # the dot matches '0'
if m is not None:
    print("3014 matched-" + m.group())

# Output:
#   3.14 matched-3.14
#   exact match-3.14
#   3014 matched-3014

Example 6: Using character sets "[ ]"

import re

bt = "[cr][23][dp][o2]"

m = re.match(bt, 'c3po')    # each character set matches one character
if m is not None:
    print("c3po matched-" + m.group())

# Output:
#   c3po matched-c3po

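Example 7: Repetition and special characters

The regular expression \w+@\w+\.com matches email addresses such as nobody@xxx.com, but it cannot match addresses such as nobody@xxx.yyy.aaa.com. For those we can use the * operator, which means the preceding pattern occurs zero or more times: \w+@(\w+\.)*\w+\.com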
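
The two email patterns can be checked with a short sketch. The variable names are chosen for illustration, and the dots are escaped so that they match a literal ".".

```python
import re

# (\w+\.)* allows zero or more "name." parts before the final domain label.
simple = re.compile(r'\w+@\w+\.com')
with_subdomains = re.compile(r'\w+@(\w+\.)*\w+\.com')

for addr in ['nobody@xxx.com', 'nobody@xxx.yyy.aaa.com']:
    print(addr, bool(simple.match(addr)), bool(with_subdomains.match(addr)))
```

Only the second pattern accepts the address with several dot-separated parts.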

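Example 8: Groups

group() gives access to each individual subgroup (called with no argument, it returns the whole match).

groups() returns a tuple containing all matched subgroups.

>>> import re
>>> m = re.match(r'(\w\w\w)-(\d\d\d)', 'abc-123')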
>>> m.group()
'abc-123'
>>> m.group(1)
'abc'
>>> m.group(2)
'123'
>>> m.groups()
('abc', '123')

>>> m = re.match('ab', 'ab')
>>> m.group()
'ab'
>>> m.groups()
()

Example 9: Matching the start of a string and word boundaries

>>> m = re.search('^the','the end.')
>>> m.group()
'the'
>>> m = re.search('^the','sthe end.')
>>> m.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

>>> m = re.search(r'\bthe','bite the dog')
>>> m.group()
'the'

>>> m = re.search(r'\bthe','bitethe dog')
>>> m.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

>>> m = re.search(r'\Bthe','bitethe dog')
>>> m.group()
'the'

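2. Finding every occurrence with findall() and finditer()

findall() returns all non-overlapping matches as a list.

>>> import re
>>> re.findall('car', 'car sscare')
['car', 'car']

finditer() returns an iterator that yields each match (a Match object) in order.

>>> s = 'This and That.'  # assumed sample string; its definition is missing from the original notes
>>> next(re.finditer(r'(th\w+) and (th\w+)', s, re.I)).group(1)
'This'
>>> next(re.finditer(r'(th\w+) and (th\w+)', s, re.I)).group(2)
'That'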
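
As a minimal sketch of what the iterator adds over findall(), the loop below reuses the 'car sscare' string and prints the position of each match, which a plain list of strings cannot provide.

```python
import re

text = 'car sscare'
for m in re.finditer('car', text):
    # each m is a match object, so position information is available
    print(m.group(), m.span())

spans = [m.span() for m in re.finditer('car', text)]
```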

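3. Searching and replacing with sub() and subn()

Both functions search and replace: every part of a string that matches the regular expression is replaced. The difference is that subn() also returns the number of substitutions made, returned together with the new string as a tuple.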

>>> re.sub('[ae]','X','abcdef')
'XbcdXf'
>>> re.subn('[ae]','X','abcdef')
('XbcdXf', 2)

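When replacing, you can also reorder the matched parts. Besides retrieving groups by number with the match object's group() method, you can write \N in the replacement string, where N is the number of a group in the pattern. The group numbers determine the order in which the pieces appear in the result.

For example, convert a US-style date in MM/DD/YY{,YY} format to DD/MM/YY{,YY} format: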

>>> re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})',r'\2/\1/\3','2/20/91')
'20/2/91'
>>> re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})',r'\2/\1/\3','2/20/1991')
'20/2/1991'
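
One detail may be worth checking here. Alternation is tried from left to right, so with (\d{2}|\d{4}) the third group captures only the first two digits of a four-digit year; the result above still looks right because the remaining digits are simply left in place. The sketch below compares it with a variant that tries \d{4} first. That variant is an assumption added for comparison, not part of the original notes.

```python
import re

short_first = r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})'  # pattern used in the notes
long_first = r'(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})'   # assumed variant for comparison

date = '2/20/1991'
year_a = re.match(short_first, date).group(3)
year_b = re.match(long_first, date).group(3)
print(year_a, year_b)

swapped = re.sub(long_first, r'\2/\1/\3', date)
print(swapped)
```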

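4. Splitting strings on a delimiter pattern with split()

The re module's split() splits a string on a regular expression pattern. When the pattern contains no special symbols and does not match multiple alternatives, re.split() works the same way as str.split(), as shown below: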

>>> re.split(':', 'str1:str2')
['str1', 'str2']
>>> 'str1:str2'.split(':')
['str1', 'str2']

For more complex delimiters, however, something more powerful than plain string splitting is needed, as in the following case:

>>> DATA = ('Mountain View, CA 94040', 'Sunnyvale, CA', 'Los Altos, 94023', 'Palo Alto CA', 'Cupertino 95014')
>>> for datum in DATA: print(re.split(r', |(?= (?:\d{5}|[A-Z]{2})) ', datum))
...
['Mountain View', 'CA', '94040']
['Sunnyvale', 'CA']
['Los Altos', '94023']
['Palo Alto', 'CA']
['Cupertino', '95014']

The regular expression above splits on a space only when that space is immediately followed by five digits or two uppercase letters. It also splits on a comma followed by a space.
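
For reuse, the same pattern could be compiled once and wrapped in a function. This is a sketch only: SEPARATOR and split_address are hypothetical names introduced for illustration.

```python
import re

# Split on ", " or on a space that is followed by five digits or two capitals.
SEPARATOR = re.compile(r', |(?= (?:\d{5}|[A-Z]{2})) ')

def split_address(line):
    """Hypothetical helper: split 'City, ST 12345'-style text into its parts."""
    return SEPARATOR.split(line)

print(split_address('Mountain View, CA 94040'))
print(split_address('Palo Alto CA'))
print('Palo Alto CA'.split(' '))  # plain str.split also breaks the city name apart
```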

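5. Extension notations

With the (?iLmsux) family of options, you can specify one or more flags directly inside the regular expression. The first two examples below use re.I/IGNORECASE; the third also uses re.M/MULTILINE to match across multiple lines.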

>>> re.findall(r'(?i)yes','yes? Yes. YES!!!')
['yes', 'Yes', 'YES']
>>> re.findall(r'(?i)th\w+','The quickest way is through this tunnel.')
['The', 'through', 'this']

>>> re.findall(r'(?im)(^th[\w ]+)', """
... This is the first,
... another line,
... that line,it's the best
... """)
['This is the first', 'that line']

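With the multiline flag, ^ and $ match at the start and end of every line of the target string, so the search works line by line instead of treating the whole string as a single unit.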
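
The effect of the flag can be seen by running the same pattern with and without it. The sketch below passes the flags as arguments instead of inline, and the sample text is a shortened version of the one above.

```python
import re

text = "This is the first,\nanother line,\nthat line, it's the best\n"
pattern = r'^th[\w ]+'

without_m = re.findall(pattern, text, re.I)      # ^ matches only at the string start
with_m = re.findall(pattern, text, re.I | re.M)  # ^ matches at the start of each line
print(without_m)
print(with_m)
```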

The next example demonstrates re.S/DOTALL, a flag that lets the dot (.) also match the \n character.

>>> re.findall(r'th.+',"""
... The first line
... the second line
... the third line
... """)
['the second line', 'the third line']
>>> re.findall(r'(?s)th.+',"""
... The first line
... the second line
... the third line
... """)
['the second line\nthe third line\n']

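The re.X/VERBOSE flag lets you write more readable regular expressions: whitespace inside the pattern is ignored, and comments can be added after #.

>>> re.search(r'''(?x)
... \((\d{3})\) # area code
... [ ]  # a space
... (\d{3}) # prefix
... -  # dash
... (\d{4}) # final four digits
... ''','(800) 555-1212').groups()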
('800', '555', '1212')
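
The same pattern can also be compiled once with the re.VERBOSE flag passed as an argument instead of the inline (?x). This is a sketch, and the name PHONE is chosen for illustration.

```python
import re

PHONE = re.compile(r'''
    \((\d{3})\)   # area code in parentheses
    [ ]           # a literal space; bare spaces are ignored in verbose mode
    (\d{3})       # prefix
    -             # dash
    (\d{4})       # final four digits
''', re.VERBOSE)

print(PHONE.search('(800) 555-1212').groups())
```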

The (?:...) notation groups part of a regular expression without saving that group for later retrieval or use.

>>> re.findall(r'http://(?:\w+\.)*(\w+\.com)',
... 'http://google.com http://www.google.com http://code.google.com')
['google.com', 'google.com', 'google.com']

>>> re.search(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?:\d{4})',
... '(800) 555-1212').groupdict()
{'areacode': '800', 'prefix': '555'}

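The (?P<name>) and (?P=name) notations can be used together. The former saves a match under a name instead of under a number that increases from 1 to N. Matches saved by number are referenced with \1, \2, ..., \N; named matches are retrieved in a similar style with \g<name>, as shown below.

>>> re.sub(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?:\d{4})',
... r'(\g<areacode>) \g<prefix>-xxxx', '(800) 555-1212')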
'(800) 555-xxxx'
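
Named groups are still numbered as well, so a match object can be queried either way. A minimal sketch using the same pattern:

```python
import re

m = re.search(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?:\d{4})',
              '(800) 555-1212')
by_name = m.group('areacode')  # look up by group name
by_number = m.group(1)         # the same group, by position
print(by_name, by_number, m.groups())
```

The non-capturing (?:\d{4}) part does not appear in groups().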

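With the latter, (?P=name), you can match again the same text that a named group matched earlier in the same regular expression. For example, verify that several written forms of a phone number agree:

>>> bool(re.match(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?P<number>\d{4}) (?P=areacode)-(?P=prefix)-(?P=number) 1(?P=areacode)(?P=prefix)(?P=number)', '(800) 555-1212 800-555-1212 18005551212'))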
True

Using (?x) makes the code more readable:

>>> bool(re.match(r'''(?x)
... \((?P<areacode>\d{3})\)[ ](?P<prefix>\d{3})-(?P<number>\d{4})
... [ ]
... (?P=areacode)-(?P=prefix)-(?P=number)
... [ ]
... 1(?P=areacode)(?P=prefix)(?P=number)
... ''','(800) 555-1212 800-555-1212 18005551212'))
True

The (?=...) and (?!...) notations perform lookahead assertions on the target string:

(?=...) matches only if the text that follows matches ...:

>>> re.findall(r'\w+(?= van Rossum)',
... '''
... Guido van Rossum
... Tim Peters
... Alex Martelli
... Just van Rossum
... Raymond Hettinger
... ''')
['Guido', 'Just']
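
A lookahead checks the following text without consuming it. The sketch below compares the spans of a lookahead pattern and a pattern that matches the surname directly.

```python
import re

name = 'Guido van Rossum'
ahead = re.search(r'\w+(?= van Rossum)', name)  # surname is only checked
plain = re.search(r'\w+ van Rossum', name)      # surname is part of the match
print(ahead.group(), ahead.span())
print(plain.group(), plain.span())
```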

(?!...) matches only if the text that follows does not match ...:

>>> re.findall(r'(?m)^\s+(?!noreply|postmaster)(\w+)',
... '''
...  sales@phptr.com
...  postmaster@phptr.com
...  eng@phptr.com
...  noreply@phptr.com
...  admin@phptr.com
... ''')
['sales', 'eng', 'admin']

Comparing re.findall() and re.finditer():

>>> ['%s@awcom' % e.group(1) for e in re.finditer(r'(?m)^\s+(?!noreply|postmaster)(\w+)',
... '''
...  postmaster@phptr.com
...  noreply@phptr.com
...  admin@phptr.com
...  eng@phptr.com
...  sales@phptr.com
... ''')]
['admin@awcom', 'eng@awcom', 'sales@awcom']

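Conditional regular expression matching: suppose we have a special string that contains only the letters x and y, where one letter must be followed by the other, so the same letter cannot appear twice in a row: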

>>> bool(re.search(r'(?:(x)|y)(?(1)y|x)', 'xy'))
True
>>> bool(re.search(r'(?:(x)|y)(?(1)y|x)', 'xx'))
False
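
The conditional pattern can be tried against all four two-letter combinations. In this sketch the name ALTERNATING is chosen for illustration: (?(1)y|x) requires "y" if group 1 (the "x") took part in the match, and "x" otherwise.

```python
import re

ALTERNATING = re.compile(r'(?:(x)|y)(?(1)y|x)')

results = {s: bool(ALTERNATING.search(s)) for s in ('xy', 'yx', 'xx', 'yy')}
print(results)
```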
