您的位置：首页 > 其它

第一章：文本-re:正则表达式-用组解析匹配

2019-01-27 09:47 281 查看

1.3.6 用组解析匹配
搜索模式匹配是正则表达式强大能力的基础。为模式提供组可以隔离匹配文本的各个部分，以扩展这些功能来创建一个解析器。可以用小括号包围模式来定义组。

# re_test_patterns.py
import re

def test_patterns(text,patterns):
"""Given source text and a list of patterns,look for
matches for each pattern within the text and print
them to stdout.
"""

# Look for each pattern in the text and print the results.
for pattern,desc in patterns:
print("'{}' ({})\n".format(pattern,desc))
print(" '{}'".format(text))
for match in re.finditer(pattern,text):
s = match.start()
e = match.end()
substr = text[s:e]
n_backslashes = text[:s].count('\\')
prefix = '.' * (s + n_backslashes)
print(" {}'{}'".format(prefix,substr))
print()
return

if __name__ == '__main__':
test_patterns('abbaaabbbbaaaaa',[('ab',"'a' followed by 'b'")])

from re_test_patterns import test_patterns

test_patterns(
'abbaaabbbbaaaaa',
[('a(ab)','a followed by literal ab'),
('a(a*b*)','a followed by 0-n a and 0-n b'),
('a(ab)*','a followed by 0-n ab'),
('a(ab)+','a followed by 1-n ab')
],
)

可以把完整的正则表达式转换为一个组，并嵌入到一个更大的表达式中。所有重复修饰符都可以应用到整个组，要求整个组模式重复。
运行结果：

要访问与模式中各个组匹配的子串，可以使用match对象的groups()方法。

import re

text = 'This is some text -- with punctuation.'

print(text)
print()

patterns = [
(r'^(\w+)','word at start of string'),
(r'(\w+)\S*$','word at end,with optional punctuation'),
(r'(\bt\w+)\W+(\w+)','word starting with t,another word'),
(r'(\w+t)\b','word ending with t'),
]

for pattern,desc in patterns:
regex = re.compile(pattern)
match = regex.search(text)
print("'{}' ({})\n".format(pattern,desc))
print(' ',match.groups())
print()

运行结果：

This is some text – with punctuation.

‘^(\w+)’ (word at start of string)

(‘This’,)

‘(\w+)\S*$’ (word at end,with optional punctuation)

(‘punctuation’,)

‘(\bt\w+)\W+(\w+)’ (word starting with t,another word)

(‘text’, ‘with’)

‘(\w+t)\b’ (word ending with t)

(‘text’,)

要访问单个组的匹配，可以使用group()方法。当使用组查找字符串的各个部分时，有些部分尽管与组匹配但在结果中并不需要，此时group()方法就很有用。

import re

text = 'This is sone text -- with punctuation.'

print('Input text    :',text)

# Word starting with 't' then another word
regex = re.compile(r'(\bt\w+)\W+(\w+)')
print('Pattern       :',regex.pattern)

match = regex.search(text)
print('Entire match   :',match.group(0))
print('Word starting with "t"：',match.group(1))
print('Word after "t" word ：',match.group(2))

组0表示与整个表达式匹配的字符串，子组按其左括号在表达式中出现的顺序编号，从1开始。
运行结果：

Input text : This is sone text – with punctuation.
Pattern : (\bt\w+)\W+(\w+)
Entire match : text – with
Word starting with “t”: text
Word after “t” word : with

python扩展了基本组语法，还增加了命名组，通过使用名字来指示组可以更容易地修改模式，而不必同时修改使用了匹配结果的代码。要设置一个组的名字，可以使用语法(?Ppattern).

import re

text = 'This is some text -- with punctuation.'

print(text)
print()

patterns = [
r'^(?P<first_word>\w+)',
r'(?P<last_word>\w+)\S*$',
r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)',
r'(?P<ends_with_t>\w+t)\b',
]

for pattern in patterns:
regex = re.compile(pattern)
match = regex.search(text)
print("'{}'".format(pattern))
print(' ',match.groups())
print(' ',match.groupdict())
print()

可以使用groupdict()获取一个字典，它将组名映射为匹配的子串。命名模式也包含在groups()返回的有序序列中。
运行结果：

This is some text – with punctuation.

‘^(?P<first_word>\w+)’
(‘This’,)
{‘first_word’: ‘This’}

‘(?P<last_word>\w+)\S*$’
(‘punctuation’,)
{‘last_word’: ‘punctuation’}

‘(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)’
(‘text’, ‘with’)
{‘t_word’: ‘text’, ‘other_word’: ‘with’}

‘(?P<ends_with_t>\w+t)\b’
(‘text’,)
{‘ends_with_t’: ‘text’}

更新的test_patterns()会显示一个模式匹配的编号组和命名组，使后面的例子更容易理解。

# re_test_patterns_groups.py
import re

def test_patterns(text,patterns):
"""Given source text and a list of patterns,look for
matches for each pattern within the text and print
them to stdout.
"""
# Look for each pattern in the text and print the results.

for pattern,desc in patterns:
print('{!r} ({})\n'.format(pattern,desc))
print(' {!r}'.format(text))
for match in re.finditer(pattern,text):
s = match.start()
e = match.end()
prefix = ' ' * (s)
print(
' {}{!r}{} '.format(prefix,text[s:e],
' ' * (len(text) - e)),
end=' ',
)
print(match.groups())
if match.groupdict():
print('{}{}'.format(' ' * (len(text) - s),
match.groupdict()),
)
print()
return

组本身是一个完整的正则表达式，可以嵌套在其他组中来创建更复杂的表达式。

from re_test_patterns_groups import test_patterns

test_patterns(
'abbaabbba',
[(r'a((a*)(b*))','a followed by 0-n a and 0-n b')]
)

在这里，组(a*)匹配一个空串，所以groups()的返回值包含空串作为匹配值。
运行结果：

组还可以用于指定替代模式。可以使用管道符号（|）指示应当匹配某一个模式。不过，要仔细考虑管道符号的放置位置。下面这个例子中的第一个表达式匹配一个a序列，该序列后面跟着一个完全由某一个字母（a或b）组成的序列。第二个表达式匹配a，其后面跟着一个可能包含a或b的序列。模式很相似，但是得到的匹配完全不同。

from re_test_patterns_groups import test_patterns

test_patterns(
'abbaabbba',
[(r'a((a+)|(b+))','a then seq. of a or seq. of b'),
(r'a((a|b)+)','a then seq. of [ab]')
],
)

如果一个替代组不匹配，但是整个模式确实匹配，那么groups()的返回值会在序列中本应该出现替代组的位置包含一个None值。
运行结果：

如果匹配子模式的字符串不必从整个文本中抽取出来，那么在这种情况下，定义包含子模式的组也很有用。这些组被称为非捕获组（non-capturing）.非捕获组可以用来描述重复模式或替代，而不会隔离返回值中字符串的匹配部分。可以使用语法（?:pattern）创建一个非捕获组。

from re_test_patterns_groups import test_patterns

test_patterns(
'abbaabbba',
[(r'a((a+)|(b+))','capturing form'),
(r'a((?:a+)|(?:b+))','noncapturing')],

尽管一个模式的捕获和非捕获形式会匹配得到相同的结果，但它们会返回不同的组，下面来做一个比较。
运行结果：

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签：

相关文章推荐

新的分享

章节导航