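Chapter 1: Text - re: Regular Expressions - Pattern Syntax (2)
1.3.4.2 Character Sets
A character set is a group of characters, any one of which can match at the current position in the pattern. For example, [ab] matches either a or b.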
# re_test_patterns.py
import re


def test_patterns(text, patterns):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Look for each pattern in the text and print the results.
    for pattern, desc in patterns:
        print("'{}' ({})\n".format(pattern, desc))
        print(" '{}'".format(text))
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            # Pad with one dot per character before the match so the
            # match lines up under its position in the source text.
            n_backslashes = text[:s].count('\\')
            prefix = '.' * (s + n_backslashes)
            print(" {}'{}'".format(prefix, substr))
        print()
    return


if __name__ == '__main__':
    test_patterns('abbaaabbbbaaaaa',
                  [('ab', "'a' followed by 'b'")])
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('[ab]', 'either a or b'),
     ('a[ab]+', 'a followed by 1 or more a or b'),
     ('a[ab]+?', 'a followed by 1 or more a or b,not greedy')],
)
The greedy form of the expression (a[ab]+) consumes the entire string, because the first letter is a and every subsequent character is either a or b.
Output:

'[ab]' (either a or b)

 'abbaabbba'
 'a'
 .'b'
 ..'b'
 ...'a'
 ....'a'
 .....'b'
 ......'b'
 .......'b'
 ........'a'

'a[ab]+' (a followed by 1 or more a or b)

 'abbaabbba'
 'abbaabbba'

'a[ab]+?' (a followed by 1 or more a or b,not greedy)

 'abbaabbba'
 'ab'
 ...'aa'
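The difference between the greedy and non-greedy forms can also be checked without the helper module. The following is a minimal sketch that uses re.findall directly; the variable names are only illustrative.

```python
import re

text = 'abbaabbba'

# Greedy: + takes as many characters from the set as it can,
# so the single match covers the whole string.
greedy = re.findall(r'a[ab]+', text)

# Non-greedy: +? stops as soon as one character from the set
# has been matched after the leading a.
lazy = re.findall(r'a[ab]+?', text)

print(greedy)  # ['abbaabbba']
print(lazy)    # ['ab', 'aa']
```

The trailing a at the end of the string is not matched by the non-greedy form, because no further character follows it to satisfy [ab]+?.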
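A character set can also be used to exclude specific characters. The caret (^) means to look for characters that are not in the set following the caret.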
from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [('[^-. ]+', 'sequences without -, ., or space')],
)
Output:

'[^-. ]+' (sequences without -, ., or space)

 'This is some text -- with punctuation.'
 'This'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'
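The caret only negates a set when it is the first character inside the brackets; anywhere else in the set it is an ordinary character. A short sketch of that distinction, using a made-up sample string:

```python
import re

sample = 'a^b-c'  # hypothetical input chosen for illustration

# A leading ^ negates the set: match any character except a, b, or c.
negated = re.findall(r'[^abc]', sample)

# A ^ that is not first in the set is matched literally.
literal = re.findall(r'[a^]', sample)

print(negated)  # ['^', '-']
print(literal)  # ['a', '^']
```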
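As character sets grow larger, typing every character that should or should not match becomes tedious. A more compact format uses a character range to define a set that includes all of the contiguous characters between the specified start and end points.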
from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [('[a-z]+', 'sequences of lowercase letters'),
     ('[A-Z]+', 'sequences of uppercase letters'),
     ('[a-zA-Z]+', 'sequences of lower- or uppercase letters'),
     ('[A-Z][a-z]+', 'one uppercase followed by lowercase')],
)
Output:

'[a-z]+' (sequences of lowercase letters)

 'This is some text -- with punctuation.'
 .'his'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'

'[A-Z]+' (sequences of uppercase letters)

 'This is some text -- with punctuation.'
 'T'

'[a-zA-Z]+' (sequences of lower- or uppercase letters)

 'This is some text -- with punctuation.'
 'This'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'

'[A-Z][a-z]+' (one uppercase followed by lowercase)

 'This is some text -- with punctuation.'
 'This'
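Ranges are not limited to letters, and several ranges can be combined in one set. A hyphen placed at the end of the set is treated as a literal hyphen rather than as a range operator. The sketch below illustrates both points; the sample string is invented for the example.

```python
import re

sample = 'id-42, ref_7B'  # hypothetical input chosen for illustration

# A range of digits.
digits = re.findall(r'[0-9]+', sample)

# Two ranges in one set: digits plus the uppercase letters A through F.
hex_like = re.findall(r'[0-9A-F]+', sample)

# A trailing hyphen is a literal character, not a range operator.
with_hyphen = re.findall(r'[a-z-]+', sample)

print(digits)       # ['42', '7']
print(hex_like)     # ['42', '7B']
print(with_hyphen)  # ['id-', 'ref']
```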
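As a special case of a character set, the metacharacter dot (.) indicates that the pattern should match any single character in that position.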
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('a.', 'a followed by any one character'),
     ('b.', 'b followed by any one character'),
     ('a.*b', 'a followed by anything,ending in b'),
     ('a.*?b', 'a followed by anything,ending in b')],
)

Output:

'a.' (a followed by any one character)

 'abbaabbba'
 'ab'
 ...'aa'

'b.' (b followed by any one character)

 'abbaabbba'
 .'bb'
 .....'bb'
 .......'ba'

'a.*b' (a followed by anything,ending in b)

 'abbaabbba'
 'abbaabbb'

'a.*?b' (a followed by anything,ending in b)

 'abbaabbba'
 'ab'
 ...'aab'
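One detail worth keeping in mind is that, by default, the dot does not match a newline character. The following sketch shows the default behavior next to the re.DOTALL flag, which lets the dot match newlines as well; the sample string is invented for the example.

```python
import re

sample = 'a\nab'  # hypothetical input containing a newline

# Default: the dot matches any character except a newline,
# so the a at the start of the string does not produce a match.
default = re.findall(r'a.', sample)

# With re.DOTALL the dot also matches the newline.
dotall = re.findall(r'a.', sample, re.DOTALL)

print(default)  # ['ab']
print(dotall)   # ['a\n', 'ab']
```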