Using Regular Expressions in Python - 2
2015-08-20 15:14
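Continuing my study of the book 《正则表达式必知必会》, this post covers the chapters on subexpressions, backreferences, and lookaround (lookahead and lookbehind). The IPython sessions were recorded under Python 2, hence calls such as it.next() and print str2.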
1. Using subexpressions
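A subexpression is a part of a pattern that is enclosed in parentheses, so ( and ) are metacharacters in regular expressions. In the example below, the pattern &nbsp;{2,} finds nothing, because {2,} applies only to the character right before it (the ;). The pattern (&nbsp;){2,} does match, because the contents of the parentheses are treated as a single unit. Note also that when a pattern contains a group, findall returns the captured group rather than the whole match. Using search and group() shows clearly that the second pattern matched two consecutive &nbsp; entities.

In [68]: str = '''Hello, my name is Ben&nbsp;Forta, and I am
   ....: the author of books on SQL, ColdFusion, WAP,
   ....: Windows&nbsp;&nbsp;2000, and other subjects.
   ....: '''

In [69]: pattern = r'&nbsp;{2,}'

In [70]: tuple = re.findall(pattern,str)

In [71]: tuple
Out[71]: []

In [72]: pattern = r'(&nbsp;){2,}'

In [73]: tuple = re.findall(pattern,str)

In [74]: tuple
Out[74]: ['&nbsp;']

In [75]: match = re.search(pattern,str)

In [76]: match.group()
Out[76]: '&nbsp;&nbsp;'

Another example: matching four-digit years that start with 19 or 20 requires the pattern (19|20)\d{2}. The subexpression in parentheses is treated as a unit, and | means "or". Without the parentheses, 19|20\d{2} means "19, or 20 followed by two digits", so in the text below it matches only 19. Subexpressions can also be nested.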
In [1]: import re

In [2]: str = '''
   ...: ID: 042
   ...: SEX: m
   ...: DOB: 1967-08-17
   ...: Status: Active
   ...: '''

In [3]: pattern = r'19|20\d{2}'

In [4]: match = re.search(pattern, str)

In [5]: match
Out[5]: <_sre.SRE_Match at 0x7fd54573f308>

In [6]: match.group()
Out[6]: '19'

In [7]: pattern = r'(19|20)\d{2}'

In [8]: match = re.search(pattern, str)

In [9]: match.group()
Out[9]: '1967'
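As a small supplement, here is a Python 3 sketch of how parentheses change what re.findall returns. The sample text and the variable names (text, ungrouped, capturing, non_capturing) are made up for illustration and are not from the book. The non-capturing group (?:...) is a standard feature of Python's re module that groups without capturing.

```python
import re

text = "DOB: 1967-08-17, renewed 2015-08-20"

# Without grouping, | splits the whole pattern: "19" OR "20\d{2}".
ungrouped = re.findall(r"19|20\d{2}", text)

# A capturing group fixes the alternation, but findall now returns
# only the content of the group, not the whole match.
capturing = re.findall(r"(19|20)\d{2}", text)

# A non-capturing group (?:...) groups without capturing,
# so findall returns the full four-digit years.
non_capturing = re.findall(r"(?:19|20)\d{2}", text)

print(ungrouped)      # ['19', '2015']
print(capturing)      # ['19', '20']
print(non_capturing)  # ['1967', '2015']
```

If only the whole match is needed, the non-capturing form is probably the more convenient choice with findall.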
2. Backreferences: keeping the front and back of a match consistent
The next example finds all heading text in an HTML document. The natural first attempt is <[hH][1-6]>.*?</[hH][1-6]>, but this pattern has a problem: if the text contains a mistake such as <H2>This is not valid HTML</H3>, that is matched too, which is clearly wrong. The pattern <[hH]([1-6])>.*?</[hH]\1> avoids the problem. Here ([1-6]) is a subexpression that matches one digit from 1 to 6. Because it is in parentheses, the digit is captured, and \1 refers back to that capture later in the match. For example, once <H2> has matched and the digit 2 has been captured, </[hH]\1> can only match </h2> or </H2>.

In [11]: str = '''
   ....: <BODY>
   ....: <H1>Welcome to my Homepage</H1>
   ....: Content is divided into two sections:<BR>
   ....: <H2>ColdFusion</H2>
   ....: Information about Macromedia ColdFusion.
   ....: <H2>Wireless</H2>
   ....: Information about Bluetooth, 802.11, and more.
   ....: <H2>This is not valid HTML</H3>
   ....: </BODY>
   ....:
   ....: '''

In [12]: pattern = r'<[hH]([1-6])>.*?</[hH]\1>'

In [13]: tuple = re.finditer(pattern,str)

In [14]: tuple
Out[14]: <callable-iterator at 0x7fd545766d10>

In [15]: match = tuple.next()

In [16]: match.group()
Out[16]: '<H1>Welcome to my Homepage</H1>'

In [17]: match = tuple.next()

In [18]: match.group()
Out[18]: '<H2>ColdFusion</H2>'

In [19]: match = tuple.next()

In [20]: match.group()
Out[20]: '<H2>Wireless</H2>'

In [21]: match = tuple.next()
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-21-5a16e7999233> in <module>()
----> 1 match = tuple.next()

StopIteration:

Now a substitution example. The pattern (\w+[\w\.]*@[\w\.]+\.\w+) is wrapped in parentheses, and the replacement refers to the match with \g<0> (this is Python's syntax for references inside a replacement string). Note that \g<0> stands for the entire match, while numbered references such as \g<1> or \1 can only refer to parts of the pattern that are enclosed in parentheses.
In [25]: str = 'Hello, ben@forta.com is my email address.'

In [26]: pattern = r'(\w+[\w\.]*@[\w\.]+\.\w+)'

In [27]: tuple = re.findall(pattern, str)

In [28]: tuple
Out[28]: ['ben@forta.com']

In [29]: repl = r'<A HREF="mailto:\g<0>">\g<0></A>'

In [30]: str2 = re.sub(pattern, repl, str)

In [31]: str2
Out[31]: 'Hello, <A HREF="mailto:ben@forta.com">ben@forta.com</A> is my email address.'

Substitution example 2:
In [35]: str = '''
   ....: 313-555-1234
   ....: 248-555-9999
   ....: 810-555-9000
   ....: '''

In [36]: pattern = r'(\d{3})(-)(\d{3})(-)(\d{4})'

In [37]: repl = r'(\g<1>) \g<3>-\g<5>'

In [38]: str2 = re.sub(pattern, repl, str)

In [39]: str2
Out[39]: '\n(313) 555-1234\n(248) 555-9999\n(810) 555-9000\n'

In [40]: print str2

(313) 555-1234
(248) 555-9999
(810) 555-9000
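The two uses of backreferences in this section, inside the pattern (\1) and inside the replacement (\g<...>), can be put side by side in a short Python 3 sketch. The names heading_re, phone_re, headings, and formatted are illustrative only. The named groups (?P<name>...) are an optional variation on the numbered groups used above, not something the original examples rely on.

```python
import re

html = """
<H1>Welcome to my Homepage</H1>
<H2>ColdFusion</H2>
<H2>This is not valid HTML</H3>
"""

# \1 must repeat whatever digit group 1 captured in the opening tag,
# so the mismatched <H2>...</H3> line is skipped.
heading_re = re.compile(r"<[hH]([1-6])>(.*?)</[hH]\1>")
headings = [(m.group(1), m.group(2)) for m in heading_re.finditer(html)]

# In a replacement string, \g<name> refers to a named group,
# just as \g<1> refers to the first numbered group.
phone_re = re.compile(r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})")
formatted = phone_re.sub(r"(\g<area>) \g<prefix>-\g<line>", "313-555-1234")

print(headings)   # [('1', 'Welcome to my Homepage'), ('2', 'ColdFusion')]
print(formatted)  # (313) 555-1234
```

Named groups avoid having to capture the hyphens just to keep the group numbering readable, which may make longer replacement strings easier to follow.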
3. Lookaround (lookahead and lookbehind)
In the example below we want to extract the protocol name from each URL, but the pattern .+: also returns the colon. The pattern .+(?=:) solves this: the colon is still required for a match, but it is not included in the result. Put another way, the search looks one character ahead of .+ to check for a :, without consuming it.

In [63]: str = '''
   ....: http://www.forta.com
   ....: https://mail.forta.com/
   ....: ftp://ftp.forta.com/
   ....: '''

In [64]: pattern = r'.+:'

In [65]: tuple = re.findall(pattern,str)

In [66]: tuple
Out[66]: ['http:', 'https:', 'ftp:']

In [67]: pattern = r'.+(?=:)'

In [68]: it = re.finditer(pattern,str)

In [69]: match = it.next()

In [70]: match.group()
Out[70]: 'http'

In [71]: match = it.next()

In [72]: match.group()
Out[72]: 'https'

In [73]: match = it.next()

In [74]: match.group()
Out[74]: 'ftp'
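A Python 3 sketch of the same lookahead idea follows. It also shows one caveat of the greedy .+ form: when a URL contains a second colon (for example a port number), .+(?=:) runs up to the last colon. The anchored pattern ^[a-z]+(?=:) and the names scheme_re, urls, schemes, and greedy are assumptions made for this illustration, not part of the book's example.

```python
import re

urls = ["http://www.forta.com", "https://mail.forta.com/", "ftp://ftp.forta.com/"]

# (?=:) asserts that a colon follows but does not consume it,
# so the colon is not part of the returned match.
scheme_re = re.compile(r"^[a-z]+(?=:)")
schemes = [m.group() for m in map(scheme_re.search, urls) if m]

# The greedy .+ backtracks only as far as the LAST colon on the line.
greedy = re.search(r".+(?=:)", "http://localhost:8080").group()
anchored = scheme_re.search("http://localhost:8080").group()

print(schemes)   # ['http', 'https', 'ftp']
print(greedy)    # http://localhost
print(anchored)  # http
```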
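The next example extracts the amounts that follow a $ sign. The pattern \$[0-9.]+ returns the $ as well, and [0-9.]+ also matches numbers we do not want, while (?<=\$)[0-9.]+ does what is needed. It can be read as: when text matching [0-9.]+ is found, also check that the character immediately to its left is $, but do not include the $ in the result. The parentheses delimit a subexpression. Inside it, ?<= means "look behind" (to the left) and \$ is the character to look for. That character must directly follow ?<=, and the whole ?<=\$ must be enclosed in parentheses.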
In [76]: str = '''
   ....: ABC01: $23.45
   ....: HGG42: $5.31
   ....: CFMX1: $899.00
   ....: XTC99: $69.96
   ....: Total items found: 4
   ....: '''

In [77]: pattern = r'\$[0-9.]+'

In [79]: tuple = re.findall(pattern,str)

In [80]: tuple
Out[80]: ['$23.45', '$5.31', '$899.00', '$69.96']

In [81]: pattern = r'[0-9.]+'

In [82]: tuple = re.findall(pattern,str)

In [83]: tuple
Out[83]: ['01', '23.45', '42', '5.31', '1', '899.00', '99', '69.96', '4']

In [84]: pattern = r'(?<=\$)[0-9.]+'

In [85]: it = re.finditer(pattern, str)

In [86]: match = it.next()

In [87]: match.group()
Out[87]: '23.45'

In [88]: match = it.next()

In [89]: match.group()
Out[89]: '5.31'

In [90]: match = it.next()

In [91]: match.group()
Out[91]: '899.00'

In [92]: match = it.next()

In [93]: match.group()
Out[93]: '69.96'

Combining lookahead (?=) with lookbehind (?<=):
In [95]: str = '''
   ....: <HEAD>
   ....: <TITLE>Ben Forta's Homepage</TITLE>
   ....: </HEAD>
   ....: '''

In [96]: pattern = r'(?<=<[tT][iI][tT][lL][eE]>).*(?=</[tT][iI][tT][lL][eE]>)'

In [97]: match = re.search(pattern,str)

In [98]: match
Out[98]: <_sre.SRE_Match at 0x7fd545699b28>

In [99]: match.group()
Out[99]: "Ben Forta's Homepage"

Exercise: take the digit string 1234567890 and insert a comma after every three digits, counting from the right, so that the result is 1,234,567,890.
In [109]: str = '1234567890'

In [110]: pattern = r'\d{1,3}(?=(\d{3})+(?!\d))'

In [111]: repl = r'\g<0>,'

In [112]: str2 = re.sub(pattern, repl, str)

In [113]: str2
Out[113]: '1,234,567,890'
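The lookbehind and the exercise can also be written as small Python 3 functions. This is only a sketch: the function names extract_prices and add_thousands_separators are hypothetical, the price pattern \d+(?:\.\d+)? is a slightly stricter variant of the [0-9.]+ used above, and the separator version replaces zero-width positions instead of re-inserting \g<0>.

```python
import re


def extract_prices(text):
    """Return the numbers that directly follow a dollar sign."""
    # (?<=\$) checks that the previous character is "$" without consuming it.
    return re.findall(r"(?<=\$)\d+(?:\.\d+)?", text)


def add_thousands_separators(digits):
    """Insert a comma every three digits, counting from the right."""
    # (?<=\d)              : there is a digit to the left of this position
    # (?=(?:\d{3})+(?!\d)) : a multiple of three digits remains to the right
    return re.sub(r"(?<=\d)(?=(?:\d{3})+(?!\d))", ",", digits)


print(extract_prices("ABC01: $23.45\nHGG42: $5.31\nTotal items found: 2"))
print(add_thousands_separators("1234567890"))
```

For plain integers, Python's built-in format(1234567890, ",") gives the same grouping, so the regex version is mainly useful as a lookaround exercise or when the digits are embedded in longer text.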