
Chapter 1: Text - re: Regular Expressions - Pattern Syntax (2)

2019-01-27 08:31

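1.3.4.2 Character Sets
A character set is a group of characters, any one of which can match at the current position in the pattern. For example, [ab] matches either a or b.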

# re_test_patterns.py
import re


def test_patterns(text, patterns):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Look for each pattern in the text and print the results.
    for pattern, desc in patterns:
        print("'{}' ({})\n".format(pattern, desc))
        print(" '{}'".format(text))
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            # Build a prefix of dots that shows where the match starts.
            n_backslashes = text[:s].count('\\')
            prefix = '.' * (s + n_backslashes)
            print(" {}'{}'".format(prefix, substr))
        print()
    return


if __name__ == '__main__':
    test_patterns('abbaaabbbbaaaaa', [('ab', "'a' followed by 'b'")])

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('[ab]', 'either a or b'),
     ('a[ab]+', 'a followed by 1 or more a or b'),
     ('a[ab]+?', 'a followed by 1 or more a or b,not greedy')],
)

The greedy form of the expression (a[ab]+) consumes the entire string, because the first letter is a and every subsequent character is either a or b.
Output:

'[ab]' (either a or b)

 'abbaabbba'
 'a'
 .'b'
 ..'b'
 ...'a'
 ....'a'
 .....'b'
 ......'b'
 .......'b'
 ........'a'

'a[ab]+' (a followed by 1 or more a or b)

 'abbaabbba'
 'abbaabbba'

'a[ab]+?' (a followed by 1 or more a or b,not greedy)

 'abbaabbba'
 'ab'
 ...'aa'
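
The difference between the greedy and non-greedy forms can also be checked directly with re.search, without the test_patterns helper. This is only a minimal sketch; the variable names greedy and lazy are illustrative and do not come from the original article.

```python
import re

text = 'abbaabbba'

# Greedy: + takes as many characters from the set as it can.
greedy = re.search(r'a[ab]+', text)

# Non-greedy: +? stops once the minimum of one character is matched.
lazy = re.search(r'a[ab]+?', text)

print(greedy.group(), greedy.span())  # abbaabbba (0, 9)
print(lazy.group(), lazy.span())      # ab (0, 2)
```

Because the non-greedy match stops early, re.finditer can go on to find a second match ('aa') later in the same string, which is why the last listing above shows two results.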

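A character set can also be used to exclude specific characters. A caret (^) at the start of the set means to look for characters that are not in the set following the caret.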

from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [('[^-. ]+', 'sequences without -, ., or space')],
)

Output:

'[^-. ]+' (sequences without -, ., or space)

 'This is some text -- with punctuation.'
 'This'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'
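
One detail worth illustrating is that the caret only negates the set when it is the first character inside the brackets. The following is a small sketch under that reading; the sample string and the names negated and literal are made up for illustration.

```python
import re

text = 'a^b-c'

# Caret first: a negated set, matching anything except a and b.
negated = re.findall(r'[^ab]', text)

# Caret anywhere else: an ordinary member of the set.
literal = re.findall(r'[a^b]', text)

print(negated)  # ['^', '-', 'c']
print(literal)  # ['a', '^', 'b']

# A hyphen placed right after the caret is literal as well, which is
# how the pattern [^-. ] excludes the hyphen.
print(re.findall(r'[^-. ]+', 'a-b.c d'))  # ['a', 'b', 'c', 'd']
```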

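As character sets grow larger, typing every character that should or should not match becomes tedious. A more compact format uses character ranges to define a set that includes all of the contiguous characters between the specified start and end points.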

from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [('[a-z]+', 'sequences of lowercase letters'),
     ('[A-Z]+', 'sequences of uppercase letters'),
     ('[a-zA-Z]+', 'sequences of lower- or uppercase letters'),
     ('[A-Z][a-z]+', 'one uppercase followed by lowercase')],
)

Output:

'[a-z]+' (sequences of lowercase letters)

 'This is some text -- with punctuation.'
 .'his'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'

'[A-Z]+' (sequences of uppercase letters)

 'This is some text -- with punctuation.'
 'T'

'[a-zA-Z]+' (sequences of lower- or uppercase letters)

 'This is some text -- with punctuation.'
 'This'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'

'[A-Z][a-z]+' (one uppercase followed by lowercase)

 'This is some text -- with punctuation.'
 'This'
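
Since a range covers every character between its end points in code point order, it may help to compare a range with the equivalent explicit set, and to see what happens when a range spans more than intended. This is a rough sketch; the names explicit, same, and extras are illustrative only.

```python
import re
import string

text = 'This is some text -- with punctuation.'

# [a-z] is shorthand for listing every lowercase ASCII letter.
explicit = '[' + string.ascii_lowercase + ']+'
same = re.findall(r'[a-z]+', text) == re.findall(explicit, text)

# Ranges follow code point order, so [A-z] also covers the six
# punctuation characters that sit between 'Z' and 'a' in ASCII.
extras = re.findall(r'[A-z]', '[\\]^_`')

print(same)    # True
print(extras)  # ['[', '\\', ']', '^', '_', '`']
```

This is why the listing above writes [a-zA-Z] as two separate ranges rather than a single [A-z].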

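As a special case of a character set, the metacharacter dot, or period (.), indicates that the pattern should match any single character at that position (by default, any character except a newline).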

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('a.', 'a followed by any one character'),
     ('b.', 'b followed by any one character'),
     ('a.*b', 'a followed by anything, ending in b'),
     ('a.*?b', 'a followed by anything, ending in b')],
)

Output:

'a.' (a followed by any one character)

 'abbaabbba'
 'ab'
 ...'aa'

'b.' (b followed by any one character)

 'abbaabbba'
 .'bb'
 .....'bb'
 .......'ba'

'a.*b' (a followed by anything, ending in b)

 'abbaabbba'
 'abbaabbb'

'a.*?b' (a followed by anything, ending in b)

 'abbaabbba'
 'ab'
 ...'aab'
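
The sample text above contains no line breaks, so it does not show how the dot treats newlines. As a brief sketch (the sample string and variable names here are assumptions for illustration), the dot skips newlines by default and matches them only when the re.DOTALL flag is passed:

```python
import re

text = 'a\nab'

# By default the dot matches any character except a newline.
default = re.findall(r'a.', text)

# With re.DOTALL the dot matches newlines as well.
dotall = re.findall(r'a.', text, re.DOTALL)

print(default)  # ['ab']
print(dotall)   # ['a\n', 'ab']
```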
