
Python Study Notes (2): Using the re Regular Expression Module

2012-06-21 16:24

1.1 Creating and using a pattern

In [87]: import re

In [88]: re_string = "{{(.*?)}}"

In [89]: test_string = " this{{is}} a test {{string}} and {{may be}} [wrong], ok?"

In [90]: for match in re.findall(re_string, test_string):  # re.findall() returns whatever sits inside each {{ }}
   ....:     print "match result are: ", match
   ....:

match result are: is

match result are: string

match result are: may be
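The "?" after ".*" is what keeps each match inside a single pair of braces. As a small sketch (written with Python 3 style print() calls, and with the braces escaped, which is my own choice to make the literal intent explicit), here is the same test string searched with a non-greedy and a greedy pattern:

```python
import re

test_string = " this{{is}} a test {{string}} and {{may be}} [wrong], ok?"

# Non-greedy: .*? stops at the first "}}" it can reach, giving one result per pair.
lazy = re.findall(r"\{\{(.*?)\}\}", test_string)

# Greedy: .* runs on to the last "}}" in the string, so everything
# between the first "{{" and the last "}}" collapses into a single result.
greedy = re.findall(r"\{\{(.*)\}\}", test_string)

print(lazy)
print(greedy)
```

The greedy version returns one long string that still contains the inner braces, which is rarely what you want when extracting delimited fields.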

1.2 Comparing compiled and non-compiled regular expressions

(1) Performance of the non-compiled code

#!/usr/bin/python

# filename: re_test.py

import re

def run_re():
    pattern = 'pDq'  # the string to search for

    infile = open('large_re_file.txt', 'r')  # you need to supply large_re_file.txt yourself
    match_count = 0  # number of matching lines, starts at 0
    lines = 0  # number of lines scanned, starts at 0
    for line in infile:  # process each line in turn
        match = re.search(pattern, line)  # search this line for the pattern
        if match:
            match_count += 1  # count the match
        lines += 1  # count the line
    return (lines, match_count)  # return both counters

if __name__ == "__main__":
    lines,match_count = run_re()  # run the function
    print 'LINES::', lines  # print the results
    print 'MATCHES::', match_count

Result:

root@test-desktop:/home/lijy# python re_test.py
LINES:: 65
MATCHES:: 0

root@test-desktop:/home/lijy# time python re_test.py    # use the time command to measure the run
LINES:: 65
MATCHES:: 0

real 0m0.023s
user 0m0.016s
sys 0m0.004s

In [93]: import re_test

In [94]: timeit -n 10 re_test.run_re()  # use timeit to measure the run

10 loops, best of 3: 158 us per loop

That is, each timing run called run_re() 10 times, and the best of 3 runs averaged 158 us per call.

(2) Performance of the compiled code

Use re.compile to create a compiled pattern object, with the aim of improving performance.

#!/usr/bin/python

# filename: re_test.py

import re

def run_re():
    pattern = 'pDq'
    re_obj = re.compile(pattern)  # compile the pattern once, before the loop
    infile = open('large_re_file.txt', 'r')
    match_count = 0
    lines = 0
    for line in infile:
        match = re_obj.search(pattern, line)  # raises the TypeError shown below; the call should be re_obj.search(line)
        if match:
            match_count += 1
        lines += 1
    return (lines, match_count)

if __name__ == "__main__":
    lines,match_count = run_re()
    print 'LINES::', lines
    print 'MATCHES::', match_count

Result:

root@test-desktop:/home/lijy# time python re_test.py
Traceback (most recent call last):
  File "re_test.py", line 21, in <module>
    lines,match_count = run_re()
  File "re_test.py", line 14, in run_re
    match = re_obj.search(pattern, line)
TypeError: an integer is required

real 0m0.023s
user 0m0.016s
sys 0m0.004s

In [95]: timeit -n 10 re_test.run_re()

10 loops, best of 3: 314 us per loop

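My test file 'large_re_file.txt' is probably too small to show a difference, but I expect the compiled version to perform better. Note, however, that the compiled script above fails with a TypeError: a compiled pattern's search() method takes the string as its first argument and an optional integer start position as its second, so the call should be re_obj.search(line). Until that is fixed, the two timings are not a meaningful comparison.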
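A minimal sketch of a fairer comparison follows. It is written for Python 3, and it replaces large_re_file.txt with a hypothetical in-memory list of 65 lines (SAMPLE_LINES is my own stand-in, not the original file), so the absolute timings will not match the numbers above. The re module also keeps an internal cache of recently compiled patterns, so on a small input the gap between the two versions may be slight.

```python
import re
import timeit

# Hypothetical stand-in for large_re_file.txt: 65 lines, none containing 'pDq'.
SAMPLE_LINES = ["line %d of sample text\n" % i for i in range(65)]
PATTERN = 'pDq'

def run_re_plain(lines):
    """Count lines and matches with the module-level re.search()."""
    match_count = 0
    line_count = 0
    for line in lines:
        if re.search(PATTERN, line):
            match_count += 1
        line_count += 1
    return (line_count, match_count)

def run_re_compiled(lines):
    """Same loop, but the pattern is compiled once up front."""
    re_obj = re.compile(PATTERN)
    match_count = 0
    line_count = 0
    for line in lines:
        if re_obj.search(line):  # the string is the first argument here
            match_count += 1
        line_count += 1
    return (line_count, match_count)

plain_result = run_re_plain(SAMPLE_LINES)
compiled_result = run_re_compiled(SAMPLE_LINES)
print(plain_result, compiled_result)

# Time 100 calls of each version; the numbers depend on the machine.
plain_time = timeit.timeit(lambda: run_re_plain(SAMPLE_LINES), number=100)
compiled_time = timeit.timeit(lambda: run_re_compiled(SAMPLE_LINES), number=100)
print("plain: %.6f s, compiled: %.6f s" % (plain_time, compiled_time))
```

Both functions should report the same counts; only the timing is expected to differ, and by how much is something to measure rather than assume.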

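Commonly used regular expression functions include findall(), finditer(), match(), and search().

I have not needed these yet, and they take some effort to understand, so I will leave it here for now.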
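For later reference, here is a short sketch of how the four functions differ. It uses Python 3 and a made-up sample string (text is my own example), and it also shows the group(), start(), end() and span() methods of the match objects that match(), search() and finditer() produce.

```python
import re

text = "abc 123 def 456"
digits = re.compile(r"\d+")

# match() only succeeds if the pattern matches at the very start of the string.
at_start = digits.match(text)  # None here, because text starts with "abc"

# search() scans forward and returns a match object for the first hit.
first = digits.search(text)
first_info = (first.group(), first.start(), first.end(), first.span())

# findall() returns all matching substrings as a list of strings.
all_strings = digits.findall(text)

# finditer() yields match objects, so positions are available as well.
all_spans = [m.span() for m in digits.finditer(text)]

print(at_start)
print(first_info)
print(all_strings)
print(all_spans)
```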

Appendix: study material

1. Concepts:

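# Regular expressions (REs) are a small, highly specialized programming language
# embedded inside Python and made available through the re module. Using this little language,
# you specify rules for the set of strings you want to match; that set might contain English sentences, email
# addresses, TeX commands, or anything else you like. You can then ask questions such as "Does this string match
# the pattern?" or "Is there a match for the pattern anywhere in this string?" You can also use REs
# to modify a string or split it apart in various ways.
#
# The regular expression language is relatively small and restricted, so not every string processing task can be
# done with regular expressions. Some tasks can be done with them, but the resulting expressions turn out
# to be very complicated. In those cases you may be better off writing Python code to do the processing; although
# Python code will be slower than an elaborate regular expression, it will be easier to understand.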

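# 2. The following characters have special meaning in a regular expression:
# . (any character) ^ $ * (0 to N times) + (1 to N times) ? (0 or 1 time) { } [ ] \ | ( )
# 1) "[" and "]" specify a character class, that is, a set of characters you want to match.
# 2) A "^" as the first character of a class complements the set: [^5] matches any character except "5".
#    Anywhere else inside a class, "^" simply matches the "^" character itself.
# 3) A backslash can be followed by various characters to signal special sequences. It is also used to
#    escape metacharacters so that they match literally.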

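# 3. RE functions and methods:
# findall(rule, target [, flag])  find the strings in target that match rule
# match()     determine whether the RE matches at the beginning of the string
# search()    scan through the string, looking for a location where the RE matches
# findall()   find all substrings where the RE matches and return them as a list
# finditer()  find all substrings where the RE matches and return them as an iterator
# group()     (match object) return the string matched by the RE
# start()     (match object) return the starting position of the match
# end()       (match object) return the ending position of the match
# span()      (match object) return a tuple containing the (start, end) positions of the match
# compile(rule [, flag])  compile the rule into a Pattern object for later use; the first argument
#             is the rule and the second is the optional flags (compile once to speed up repeated use)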

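# 4. Meanings:
# Predefined escape sequences such as "\d", "\w" and "\s": a '\' followed by a specific
# character stands for a predefined meaning
# '^' and '$'  match the start and end of the string
# '.'   matches any character except \n
# '\d'  matches a digit
# '\D'  matches a non-digit
# '\w'  matches a letter, digit, or underscore
# '\W'  matches any character that is not a letter, digit, or underscore
# '\s'  matches whitespace
# '\S'  matches non-whitespace
# '\A'  matches the start of the string
# '\Z'  matches the end of the string
# '\b'  matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric
#       characters, so the end of a word is marked by whitespace or a non-alphanumeric character.
#       (In a non-raw Python string, "\b" is the backspace character, so write the pattern as r"\b".)
# '\B'  the opposite of \b: matches only where the current position is not at a word boundary.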
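The note about "\b" and backspace is easy to trip over, so here is a small Python 3 sketch of it. The sample string s is my own example; the block also shows that "\w" treats the underscore as a word character.

```python
import re

s = "foo_bar baz42 qux"

# In a raw string, \b reaches the regex engine as the word-boundary assertion.
# \w covers letters, digits, and the underscore, so "foo_bar" is one word.
words = re.findall(r"\b\w+\b", s)

# In a normal string, "\b" is the single backspace character (0x08), so this
# pattern looks for literal backspaces around "baz42" and finds none.
without_raw = re.findall("\bbaz42\b", s)

# The same pattern written as a raw string matches as intended.
with_raw = re.findall(r"\bbaz42\b", s)

print(words)
print(without_raw)
print(with_raw)
```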

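# 5. Lookbehind and lookahead assertions:
# '(?<=...)'  positive lookbehind: '...' is the string that must appear immediately before the text you want to match.
# '(?=...)'   positive lookahead: '...' is the string that must appear immediately after the text you want to match.
# '(?<!...)'  negative lookbehind: matches only if the text you want is not preceded by '...'.
# '(?!...)'   negative lookahead: matches only if the text you want is not followed by '...'.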
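The four assertions can be compared side by side on one string. This is a Python 3 sketch; the patterns are my own simplified variants, chosen so that each result is easy to check by eye. It also shows that the re module rejects a lookbehind whose width is not fixed, such as (?<=[a-z]+).

```python
import re

s2 = "aaa111aaa , bbb222 , 333ccc"

# Positive lookahead: digits that are followed by a letter.
followed_by_letter = re.findall(r"\d+(?=[a-z])", s2)

# Positive lookbehind: digits that are preceded by a letter.
preceded_by_letter = re.findall(r"(?<=[a-z])\d+", s2)

# Negative lookahead: digit runs not followed by a letter or another digit.
not_followed = re.findall(r"\d+(?![a-z\d])", s2)

# Negative lookbehind: digit runs not preceded by a letter or another digit.
not_preceded = re.findall(r"(?<![a-z\d])\d+", s2)

# A lookbehind must have a fixed width, so [a-z]+ inside it is an error.
try:
    re.compile(r"(?<=[a-z]+)\d+")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

print(followed_by_letter, preceded_by_letter)
print(not_followed, not_preceded)
print(lookbehind_error)
```

Including \d in the negative classes stops the engine from backtracking to a shorter run of digits and reporting a partial number.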

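# 6. Group basics:
# '(' ')'          unnamed group, e.g. [a-z]+(\d+)[a-z]+
# '(?P<name>...)'  named group, e.g. (?P<g1>[a-z]+)\d+(?P=g1)
# '(?P=name)'      refers back to a named group that has already matched
# '\number'        refers back to a matched group by number. Every group in a pattern has a number,
#                  assigned from left to right starting at 1, and a matched group can be referenced like this:
# ( r"(\d+)([a-z]+)(\d+)(\2)(\1)" )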

import rhinoscriptsyntax as rs  # Rhino-only module; not used by the examples below

# regular expressions
import re

str1 = "abc \\ 123 456"
print re.findall("\\\\", str1)  # the difference between not using r and using r: without r, one literal backslash takes four
print re.findall(r"\d\Z", str1)  # use "r" to define the pattern as a raw string

p = re.compile('(a)b')
m = p.match('ab')
print m.group()

s = "aaa1 22 gg 333 ccc 4444 pppp55555 666"
print re.findall(r"\b\d{3}\b", s)
print re.findall(r"\b\d{2,4}\b", s)

s2 = "aaa111aaa , bbb222 , 333ccc"
print re.findall(r"(?<=[a-z])\d+(?=[a-z]+)", s2)  # a lookbehind must be fixed-width in re, so [a-z] rather than [a-z]+
print re.findall(r"\d+(?=[a-z]+)", s2)
## one or more a-z letters before the target, one or more digits in the middle
print re.findall(r"\d+(?!\w+)", s2)

# unnamed groups
print re.findall(r"[a-z]+(\d+)[a-z]+", s2)  # returns only what is inside ()
s3 = 'aaa111aaa,bbb222,333ccc,444ddd444,555eee666,fff777ggg,hhh888hhh'
print re.findall(r"([a-z]+)\d+([a-z]+)", s3)  # returns the parenthesized parts
# '(?P<name>...)' named group
print re.findall(r"(?P<g1>[a-z]+)\d+(?P=g1)", s3)  # find identical letters on both sides of the digits
print re.findall(r"([a-z]+)\d+\1", s3)
s4 = "111aaa222aaa111,333bbb444bb33"
print re.findall(r"(\d+)([a-z]+)(\d+)(\2)(\1)", s4)  # digits, letters, digits, letters, digits, arranged symmetrically
print re.compile(r"(\d+)([a-z]+)(\d+)(\2)(\1)").findall(s4)

# compile(rule [, flag]): use compile for speed
s5 = "111,222,aaa,bbb,ccc333,444ddd"
print re.compile(r"\d+\b").findall(s5)  # digits that end at a word boundary; the M option is not used

s6 = "123 456\n789 012\n345 678"
print re.compile(r"^\d+", re.M).findall(s6)  # matches the number at the start of each line (M = multiline)

rcm = re.compile(r"\d+$")  # without the M option, '$' matches only the number at the very end, '678'; with M it matches the three line-final numbers 456, 012 and 678
print re.compile(r"\d+$", re.M).findall(s6)
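To see the effect of the M option directly, the following Python 3 sketch runs the same two anchored patterns over s6 with and without re.M. It uses re.findall() with a flags argument instead of a compiled pattern, which is only a matter of convenience here.

```python
import re

s6 = "123 456\n789 012\n345 678"

# Without re.M, '$' anchors only at the end of the whole string
# and '^' only at its start.
last_only = re.findall(r"\d+$", s6)
first_only = re.findall(r"^\d+", s6)

# With re.M, '$' also matches before each newline and '^' after each newline.
each_line_end = re.findall(r"\d+$", s6, re.M)
each_line_start = re.findall(r"^\d+", s6, re.M)

print(last_only, first_only)
print(each_line_end, each_line_start)
```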
