
Using Regular Expressions in Python - 2

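2015-08-20 15:14
This post continues working through the book 《正则表达式必知必会》 (the Chinese edition of Ben Forta's Sams Teach Yourself Regular Expressions in 10 Minutes), covering the chapters on subexpressions, backreferences, and lookaround.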

1. Using Subexpressions

A subexpression is a part of a pattern enclosed in parentheses, which means ( and ) are metacharacters in regular expressions. In the example below, &nbsp;{2,} finds nothing, because {2,} applies only to the ; immediately before it, whereas (&nbsp;){2,} matches, because the content of the parentheses is treated as a single unit. Note also that when a pattern contains a group, findall returns what the group captured rather than the whole match; search shows clearly that the second pattern matched two consecutive &nbsp; entities.

In [68]: str = '''Hello, my name is Ben Forta, and I am
....: the author of books on SQL, ColdFusion, WAP,
....: Windows&nbsp;&nbsp;2000, and other subjects.
....: '''

In [69]: pattern = r'&nbsp;{2,}'

In [70]: tuple = re.findall(pattern,str)

In [71]: tuple
Out[71]: []

In [72]: pattern = r'(&nbsp;){2,}'

In [73]: tuple = re.findall(pattern,str)

In [74]: tuple
Out[74]: ['&nbsp;']

In [75]: match = re.search(pattern,str)

In [76]: match.group()
Out[76]: '&nbsp;&nbsp;'
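
The same grouping behaviour can be reproduced as a standalone script. This is only a minimal sketch: it assumes Python 3 (the session above is Python 2) and a sample text containing literal &nbsp; entities, and the variable names are illustrative. The last call uses a non-capturing group, (?:...), which is not covered in the text above but is the usual way to group without changing what findall returns.

```python
import re

# Assumed sample text with two literal &nbsp; entities in a row.
text = "Windows&nbsp;&nbsp;2000, and other subjects."

# Without parentheses, {2,} binds only to the preceding ';'.
ungrouped = re.findall(r"&nbsp;{2,}", text)
print(ungrouped)            # []

# With parentheses, {2,} repeats the whole entity.
whole_match = re.search(r"(&nbsp;){2,}", text).group()
print(whole_match)          # &nbsp;&nbsp;

# findall returns the group's capture (its last repetition), not the full match.
captured = re.findall(r"(&nbsp;){2,}", text)
print(captured)             # ['&nbsp;']

# A non-capturing group groups without capturing, so findall
# returns the full match again.
non_capturing = re.findall(r"(?:&nbsp;){2,}", text)
print(non_capturing)        # ['&nbsp;&nbsp;']
```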
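Another example: to match four-digit years that begin with 19 or 20, use the pattern (19|20)\d{2}. The subexpression in parentheses is treated as a unit, and | means "or". Without the parentheses, 19|20\d{2} means "19, or 20 followed by two digits", so on 1967 it matches only 19. Subexpressions can also be nested.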

In [1]: import re

In [2]: str = '''
...: ID: 042
...: SEX: m
...: DOB: 1967-08-17
...: Status: Active
...: '''

In [3]: pattern = r'19|20\d{2}'

In [4]: match = re.search(pattern, str)

In [5]: match
Out[5]: <_sre.SRE_Match at 0x7fd54573f308>

In [6]: match.group()
Out[6]: '19'

In [7]: pattern = r'(19|20)\d{2}'

In [8]: match = re.search(pattern, str)

In [9]: match.group()
Out[9]: '1967'
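
To make the precedence of | and the nesting of subexpressions concrete, here is a small sketch. It assumes Python 3; the nested date pattern and the variable names are illustrative additions, not from the book.

```python
import re

record = "ID: 042\nSEX: m\nDOB: 1967-08-17\nStatus: Active\n"

# '|' has the lowest precedence: this reads as "19" OR "20\d{2}".
loose = re.search(r"19|20\d{2}", record).group()
print(loose)     # 19

# Parentheses limit the alternation to the first two digits.
year = re.search(r"(19|20)\d{2}", record).group()
print(year)      # 1967

# Nested subexpressions: groups are numbered by their opening parenthesis.
nested = re.search(r"((19|20)\d{2})-(\d{2})-(\d{2})", record)
print(nested.groups())   # ('1967', '19', '08', '17')
```

Group 1 is the outer parenthesis (the full year) and group 2 is the inner one (the century prefix).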

2. Backreferences: Matching Consistently Before and After

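The next example extracts every heading from an HTML document. The natural first attempt is <[hH][1-6]>.*?</[hH][1-6]>, but it has a flaw: malformed markup such as <H2>This is not valid HTML</H3> also matches, which is clearly wrong. The pattern <[hH]([1-6])>.*?</[hH]\1> avoids the problem. ([1-6]) is a subexpression that matches one digit from 1 to 6; because it is parenthesized, the digit is captured, and \1 later in the pattern refers back to that capture. So once <H2> has matched and captured the digit 2, </[hH]\1> will match only </h2> or </H2>.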

In [11]: str = '''
....: <BODY>
....: <H1>Welcome to my Homepage</H1>
....: Content is divided into two sections:<BR>
....: <H2>ColdFusion</H2>
....: Information about Macromedia ColdFusion.
....: <H2>Wiressless</H2>
....: Information about Bluetooth, 802.11, and more.
....: <H2>This is not valid HTML</H3>
....: </BODY>
....:
....: '''

In [12]: pattern = r'<[hH]([1-6])>.*?</[hH]\1>'

In [13]: tuple = re.finditer(pattern,str)

In [14]: tuple
Out[14]: <callable-iterator at 0x7fd545766d10>

In [15]: match = tuple.next()

In [16]: match.group()
Out[16]: '<H1>Welcome to my Homepage</H1>'

In [17]: match = tuple.next()

In [18]: match.group()
Out[18]: '<H2>ColdFusion</H2>'

In [19]: match = tuple.next()

In [20]: match.group()
Out[20]: '<H2>Wiressless</H2>'

In [21]: match = tuple.next()
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-21-5a16e7999233> in <module>()
----> 1 match = tuple.next()

StopIteration:
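
For comparison, a Python 3 sketch of the same idea is shown here. It assumes a shortened sample document, and it swaps the [hH] character classes for the re.IGNORECASE flag, which is a stylistic choice of this sketch rather than the book's pattern. In Python 3 the iterator is consumed with a for loop or next(it) instead of it.next().

```python
import re

html = """
<BODY>
<H1>Welcome to my Homepage</H1>
<H2>ColdFusion</H2>
<H2>This is not valid HTML</H3>
</BODY>
"""

# Without a backreference, the mismatched <H2>...</H3> pair is accepted.
naive = re.findall(r"<h[1-6]>.*?</h[1-6]>", html, flags=re.IGNORECASE)
print(len(naive))    # 3

# \1 must repeat whatever digit group 1 captured in the opening tag.
pattern = re.compile(r"<h([1-6])>.*?</h\1>", re.IGNORECASE)
headings = [m.group() for m in pattern.finditer(html)]
print(headings)
# ['<H1>Welcome to my Homepage</H1>', '<H2>ColdFusion</H2>']
```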

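A replacement example: the pattern (\w+[\w\.]*@[\w\.]+\.\w+) is wrapped in parentheses, and the replacement text refers to the matched address with \g<0>, which is Python's replacement syntax for the entire match (\g<1> or \1 would refer to the first parenthesized group, which here is the same text). Note that numbered backreferences can only refer to parts of the pattern enclosed in parentheses.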

In [25]: str = 'Hello, ben@forta.com is my email address.'

In [26]: pattern = r'(\w+[\w\.]*@[\w\.]+\.\w+)'

In [27]: tuple = re.findall(pattern, str)

In [28]: tuple
Out[28]: ['ben@forta.com']

In [29]: repl = r'<A HREF="mailto:\g<0>">\g<0></A>'

In [30]: str2 =  re.sub(pattern, repl, str)

In [31]: str2
Out[31]: 'Hello, <A HREF="mailto:ben@forta.com">ben@forta.com</A> is my email address.'
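
The difference between \g<0> and a numbered group reference can be checked with a short sketch. It assumes Python 3, and the variable names are illustrative. Because \g<0> always means the whole match, the first pattern needs no parentheses at all.

```python
import re

text = "Hello, ben@forta.com is my email address."

# \g<0> is the entire match, so no group is required in the pattern.
whole = re.sub(r"\w+[\w.]*@[\w.]+\.\w+",
               r'<A HREF="mailto:\g<0>">\g<0></A>', text)

# \1 (or \g<1>) needs a parenthesized group to refer to.
grouped = re.sub(r"(\w+[\w.]*@[\w.]+\.\w+)",
                 r'<A HREF="mailto:\1">\1</A>', text)

print(whole)
# Hello, <A HREF="mailto:ben@forta.com">ben@forta.com</A> is my email address.
print(whole == grouped)   # True
```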
Replacement example 2: reformatting phone numbers

In [35]: str = '''
....: 313-555-1234
....: 248-555-9999
....: 810-555-9000
....: '''

In [36]: pattern = r'(\d{3})(-)(\d{3})(-)(\d{4})'

In [37]: repl = r'(\g<1>) \g<3>-\g<5>'

In [38]: str2 = re.sub(pattern, repl, str)

In [39]: str2
Out[39]: '\n(313) 555-1234\n(248) 555-9999\n(810) 555-9000\n'

In [40]: print str2

(313) 555-1234
(248) 555-9999
(810) 555-9000
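
The numbered groups above work, but the hyphens do not need to be captured. One possible variant, sketched here under the assumption of Python 3, uses named groups so that the replacement reads more clearly; the group names area, prefix, and line are hypothetical labels chosen for this sketch.

```python
import re

phones = "313-555-1234\n248-555-9999\n810-555-9000\n"

# Named groups: (?P<name>...) in the pattern, \g<name> in the replacement.
pattern = r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})"
formatted = re.sub(pattern, r"(\g<area>) \g<prefix>-\g<line>", phones)

print(formatted)
# (313) 555-1234
# (248) 555-9999
# (810) 555-9000
```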


3. Lookaround (Lookahead and Lookbehind)

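The example below extracts the protocol name from each URL. The pattern .+: includes the : in the match. The pattern .+(?=:) solves this: the : must still be present for a match, but it is not part of the returned text. Put another way, the engine looks one character ahead of .+ to check for the :, without consuming it.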

In [63]: str = '''
....: http://www.forta.com
....: https://mail.forta.com/
....: ftp://ftp.forta.com/
....: '''

In [64]: pattern = r'.+:'

In [65]: tuple = re.findall(pattern,str)

In [66]: tuple
Out[66]: ['http:', 'https:', 'ftp:']

In [67]: pattern = r'.+(?=:)'

In [68]: it = re.finditer(pattern,str)

In [69]: match = it.next()

In [70]: match.group()
Out[70]: 'http'

In [71]: match = it.next()

In [72]: match.group()
Out[72]: 'https'

In [73]: match = it.next()

In [74]: match.group()
Out[74]: 'ftp'
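
Since the lookahead pattern contains no capturing group, findall returns the matches directly. A minimal Python 3 sketch follows, with illustrative variable names.

```python
import re

urls = "http://www.forta.com\nhttps://mail.forta.com/\nftp://ftp.forta.com/\n"

# The colon is consumed and therefore returned.
with_colon = re.findall(r".+:", urls)
print(with_colon)    # ['http:', 'https:', 'ftp:']

# (?=:) requires a colon next but leaves it out of the match.
protocols = re.findall(r".+(?=:)", urls)
print(protocols)     # ['http', 'https', 'ftp']
```

Because .+ is greedy, a URL with a second colon on the same line (for example a port number) would match up to the last colon; the sample data here has only one colon per line.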


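The next example extracts the amount after each $. The pattern \$[0-9.]+ returns the $ as well, and [0-9.]+ on its own also matches unwanted numbers. The pattern (?<=\$)[0-9.]+ does what is needed: whenever [0-9.]+ matches, the engine also checks that the character immediately to its left is $, but the $ is not returned. Within the parentheses, ?<= means look behind (to the left) and \$ is the character to look for; it must follow ?<= directly, and the whole of ?<=\$ must be enclosed in ( ). Although it is written with parentheses, a lookbehind does not capture anything.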

In [76]: str = '''
....: ABC01: $23.45
....: HGG42: $5.31
....: CFMX1: $899.00
....: XTC99: $69.96
....: Total items found: 4
....: '''

In [77]: pattern = r'\$[0-9.]+'

In [79]: tuple = re.findall(pattern,str)

In [80]: tuple
Out[80]: ['$23.45', '$5.31', '$899.00', '$69.96']

In [81]: pattern = r'[0-9.]+'

In [82]: tuple = re.findall(pattern,str)

In [83]: tuple
Out[83]: ['01', '23.45', '42', '5.31', '1', '899.00', '99', '69.96', '4']

In [84]: pattern = r'(?<=\$)[0-9.]+'

In [85]: it = re.finditer(pattern, str)

In [86]: match = it.next()

In [87]: match.group()
Out[87]: '23.45'

In [88]: match = it.next()

In [89]: match.group()
Out[89]: '5.31'

In [90]: match = it.next()

In [91]: match.group()
Out[91]: '899.00'

In [92]: match = it.next()

In [93]: match.group()
Out[93]: '69.96'
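
A compact Python 3 sketch of the lookbehind follows. Summing the prices with Decimal is an illustrative extra step, not part of the original example.

```python
import re
from decimal import Decimal

report = (
    "ABC01: $23.45\n"
    "HGG42: $5.31\n"
    "CFMX1: $899.00\n"
    "XTC99: $69.96\n"
    "Total items found: 4\n"
)

# (?<=\$) asserts that a '$' sits immediately to the left, without matching it.
prices = re.findall(r"(?<=\$)[0-9.]+", report)
print(prices)    # ['23.45', '5.31', '899.00', '69.96']

# Decimal avoids binary floating-point rounding when adding money.
total = sum(Decimal(p) for p in prices)
print(total)     # 997.72
```

One restriction to keep in mind: Python's re module only accepts lookbehind patterns that match a fixed number of characters, so (?<=\$) is fine but something like (?<=\$+) is rejected.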
Combining lookahead (?=) with lookbehind (?<=)

In [95]: str = '''
....: <HEAD>
....: <TITLE>Ben Forta's Homepage</TITLE>
....: </HEAD>
....: '''

In [96]: pattern = r'(?<=<[tT][iI][tT][lL][eE]>).*(?=</[tT][iI][tT][lL][eE]>)'

In [97]: match = re.search(pattern,str)

In [98]: match
Out[98]: <_sre.SRE_Match at 0x7fd545699b28>

In [99]: match.group()
Out[99]: "Ben Forta's Homepage"
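
The per-letter character classes can be replaced by a flag. The sketch below assumes Python 3 and uses re.IGNORECASE, plus a lazy .*? so that the match stops at the first closing tag; both are choices of this sketch rather than the book's pattern. It also shows the capturing-group alternative for comparison.

```python
import re

page = "<HEAD>\n<TITLE>Ben Forta's Homepage</TITLE>\n</HEAD>\n"

# Lookbehind and lookahead bracket the text without being part of the match.
lookaround = re.compile(r"(?<=<title>).*?(?=</title>)", re.IGNORECASE)
match = lookaround.search(page)
title = match.group() if match else None
print(title)       # Ben Forta's Homepage

# Equivalent result with a capturing group instead of lookaround.
captured = re.search(r"<title>(.*?)</title>", page, re.IGNORECASE).group(1)
print(captured)    # Ben Forta's Homepage
```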
Exercise: insert commas into the digit string 1234567890 every three digits, counting from the right, to get 1,234,567,890.

In [109]: str = '1234567890'

In [110]: pattern = r'\d{1,3}(?=(\d{3})+(?!\d))'

In [111]: repl = r'\g<0>,'

In [112]: str2 = re.sub(pattern, repl, str)

In [113]: str2
Out[113]: '1,234,567,890'
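
The pattern works because \d{1,3} may only match where the digits that follow form one or more complete groups of three, (\d{3})+, and are then followed by no further digit, (?!\d). Wrapped in a function, it can be sketched as follows; this assumes Python 3, and the function name add_commas is hypothetical.

```python
import re

def add_commas(digits: str) -> str:
    """Insert a comma after every digit run that is followed by
    a whole number of three-digit groups."""
    return re.sub(r"\d{1,3}(?=(\d{3})+(?!\d))", r"\g<0>,", digits)

print(add_commas("1234567890"))   # 1,234,567,890
print(add_commas("1234"))         # 1,234
print(add_commas("123"))          # 123

# For actual integers, the built-in format specifier gives the same result.
print(format(1234567890, ","))    # 1,234,567,890
```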