
Chapter 1: Text - re: Regular Expressions - Pattern Syntax (2)

2019-01-27 08:31

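1.3.4.2 Character Sets
A character set is a group of characters, any one of which can match at the current position in the pattern. For example, [ab] matches either a or b.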

# re_test_patterns.py
import re


def test_patterns(text, patterns):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Look for each pattern in the text and print the results.
    for pattern, desc in patterns:
        print("'{}' ({})\n".format(pattern, desc))
        print(" '{}'".format(text))
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            # Add one extra dot for each backslash that precedes the match.
            n_backslashes = text[:s].count('\\')
            prefix = '.' * (s + n_backslashes)
            print(" {}'{}'".format(prefix, substr))
        print()
    return


if __name__ == '__main__':
    test_patterns('abbaaabbbbaaaaa', [('ab', "'a' followed by 'b'")])

# Character set examples, using the helper from re_test_patterns.py
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('[ab]', 'either a or b'),
     ('a[ab]+', 'a followed by 1 or more a or b'),
     ('a[ab]+?', 'a followed by 1 or more a or b, not greedy')],
)

The greedy form of the expression (a[ab]+) consumes the entire string, because the first letter is a and every subsequent character is either a or b.
Output:

'[ab]' (either a or b)

 'abbaabbba'
 'a'
 .'b'
 ..'b'
 ...'a'
 ....'a'
 .....'b'
 ......'b'
 .......'b'
 ........'a'

'a[ab]+' (a followed by 1 or more a or b)

 'abbaabbba'
 'abbaabbba'

'a[ab]+?' (a followed by 1 or more a or b, not greedy)

 'abbaabbba'
 'ab'
 ...'aa'
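
The difference between the greedy and non-greedy forms can also be checked without the helper. The following is a minimal sketch using re.findall() on the same input string; the variable names are only illustrative.

```python
import re

text = 'abbaabbba'

# Greedy: '+' takes as many characters from the set as it can.
greedy = re.findall(r'a[ab]+', text)

# Non-greedy: '+?' stops as soon as one character from the set has matched.
lazy = re.findall(r'a[ab]+?', text)

print(greedy)  # ['abbaabbba']
print(lazy)    # ['ab', 'aa']
```

The final a in the string is not matched by either pattern on its own, because both forms require at least one more character from the set after the leading a.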

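A character set can also be used to exclude specific characters. A caret (^) at the start of the set means to look for characters that are not in the set following the caret.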

from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [('[^-. ]+', 'sequences without -, ., or space')],
)

Output:

'[^-. ]+' (sequences without -, ., or space)

 'This is some text -- with punctuation.'
 'This'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'
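
One detail worth seeing in isolation is that the caret only negates the set when it is the first character inside the brackets. The sketch below assumes nothing beyond the standard re module; the second input string 'a^b^' is made up purely for illustration.

```python
import re

text = 'This is some text -- with punctuation.'

# '^' as the first character negates the set: match runs of anything
# except '-', '.', and space.
words = re.findall(r'[^-. ]+', text)
print(words)

# '^' anywhere else in the set is an ordinary character.
carets = re.findall(r'[a^]', 'a^b^')
print(carets)
```

The first call returns the same six words shown in the output above; the second returns the literal a and both caret characters.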

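As character sets grow larger, typing every character that should or should not match becomes tedious. A more compact format uses a character range to define a set that includes all of the contiguous characters between a specified start point and end point.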

from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [('[a-z]+', 'sequences of lowercase letters'),
     ('[A-Z]+', 'sequences of uppercase letters'),
     ('[a-zA-Z]+', 'sequences of lower- or uppercase letters'),
     ('[A-Z][a-z]+', 'one uppercase followed by lowercase')],
)

Output:

'[a-z]+' (sequences of lowercase letters)

 'This is some text -- with punctuation.'
 .'his'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'

'[A-Z]+' (sequences of uppercase letters)

 'This is some text -- with punctuation.'
 'T'

'[a-zA-Z]+' (sequences of lower- or uppercase letters)

 'This is some text -- with punctuation.'
 'This'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'

'[A-Z][a-z]+' (one uppercase followed by lowercase)

 'This is some text -- with punctuation.'
 'This'
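
To make the shorthand concrete, a range such as [a-z] should behave the same as a set that lists every lowercase ASCII letter explicitly. This is a small sketch of that comparison; building the explicit set from string.ascii_lowercase is just one convenient way to do it, not something the original example uses.

```python
import re
import string

text = 'This is some text -- with punctuation.'

# Build the explicit set '[abc...xyz]+' from the standard library constant.
explicit = '[' + string.ascii_lowercase + ']+'

by_range = re.findall(r'[a-z]+', text)
by_listing = re.findall(explicit, text)

print(by_range)
print(by_range == by_listing)  # True
```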

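As a special case of a character set, the metacharacter dot (.) indicates that the pattern should match any single character in that position (by default, any character except a newline).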

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('a.', 'a followed by any one character'),
     ('b.', 'b followed by any one character'),
     ('a.*b', 'a followed by anything, ending in b'),
     ('a.*?b', 'a followed by anything, ending in b, not greedy')],
)

Output:

'a.' (a followed by any one character)

 'abbaabbba'
 'ab'
 ...'aa'

'b.' (b followed by any one character)

 'abbaabbba'
 .'bb'
 .....'bb'
 .......'ba'

'a.*b' (a followed by anything, ending in b)

 'abbaabbba'
 'abbaabbb'

'a.*?b' (a followed by anything, ending in b, not greedy)

 'abbaabbba'
 'ab'
 ...'aab'
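
The sample string above contains no line breaks, so it does not show how the dot treats a newline. As a minimal sketch, assuming a made-up two-line input, the standard re.DOTALL flag can be compared against the default behavior:

```python
import re

text = 'ab\nab'

# By default '.' matches any character except a newline.
default = re.findall(r'b.a', text)

# With re.DOTALL, '.' also matches the newline.
dotall = re.findall(r'b.a', text, re.DOTALL)

print(default)  # []
print(dotall)   # ['b\na']
```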
