
Python Regular Expressions

2016-05-17 15:36

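Regular expressions (RE) are a way to match strings. I used them while writing a web crawler and came to appreciate how elegant they are. I am writing down the most important usage patterns here so that I can recall them when I need them.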


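For regular expressions (RE: Regular Expression), the main things to learn are: 1. the common symbols, 2. the common methods, 3. the common combinations.

(1) Common RE symbols

(2) Common RE methods

(3) Common combinations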

(1) Common RE symbols


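Symbol	Meaning
.	Matches any character except the newline '\n'
*	Matches the preceding element 0 or more times
+	Matches the preceding element 1 or more times
?	Matches the preceding element 0 or 1 times
.*	Greedy match (the longest possible string)
.*?	Non-greedy match (the shortest possible string, which usually yields more separate matches)
()	Capturing group: only the part inside the parentheses is returned
^	Matches the start of the string (the start of each line with re.M)
$	Matches the end of the string (the end of each line with re.M)
[a-zA-Z0-9]	Matches one upper- or lowercase letter or digit
\d	Matches a digit, equivalent to [0-9] for ASCII text
\D	Matches a non-digit, equivalent to [^0-9]
\s	Matches a whitespace character, [ \t\r\n\f\v]
\S	Matches a non-whitespace character
\w	Equivalent to [a-zA-Z0-9_] for ASCII text
\W	The negation of \w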
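As a quick check of the character classes in the table, here is a minimal sketch for Python 3. The sample strings are made up for illustration and do not come from the original post.

```python
import re

# Hypothetical sample text: words, a tab, and a date.
sample = "Order_42 shipped\ton 2016-05-17"

digits = re.findall(r"\d+", sample)        # runs of digits
non_digits = re.findall(r"\D+", "a1b22c")  # runs of non-digits
words = re.findall(r"\w+", sample)         # letters, digits and underscore
spaces = re.findall(r"\s", sample)         # each whitespace character

print(digits)      # ['42', '2016', '05', '17']
print(non_digits)  # ['a', 'b', 'c']
print(words)       # ['Order_42', 'shipped', 'on', '2016', '05', '17']
print(spaces)      # [' ', '\t', ' ']
```

Note that "Order_42" comes back as one word because \w includes the underscore.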

(2) Common RE methods


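Function	Description
findall	Finds every non-overlapping match of the pattern and returns the results as a list
search	Finds the first match anywhere in the string and returns a match object (read it with group())
match	Matches only at the beginning of the string (rarely needed)
sub	Replaces the matches of the pattern and returns the resulting string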
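The four functions in the table can be compared side by side. This is a small sketch for Python 3; the text and patterns are illustrative assumptions, not taken from the original examples.

```python
import re

text = "cat bat rat"          # hypothetical input
pattern = r"[a-z]at"

all_hits = re.findall(pattern, text)     # every match, as a list of strings
first_hit = re.search(r"[br]at", text)   # first match anywhere in the string
at_start = re.match(r"[br]at", text)     # only tries position 0, so no match here
replaced = re.sub(pattern, "dog", text)  # new string with every match replaced

print(all_hits)           # ['cat', 'bat', 'rat']
print(first_hit.group())  # bat
print(at_start)           # None
print(replaced)           # dog dog dog
```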

Let's look at re.findall(pattern, string, flags=0) first.

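import re

string_temp = 'ljlgjkXXIXXlkjglkjXXLoveXXkljlgjXXYouXXlkjlgj'

model_1 = 'XX.XX'      # exactly one character between the XX markers
model_2 = 'XX.?XX'     # zero or one character between the markers
model_3 = 'XX.*XX'     # greedy: as much as possible between the markers
model_4 = 'XX.*?XX'    # non-greedy: as little as possible between the markers
model_5 = 'XX(.*?)XX'  # non-greedy, returning only the captured part

result_1 = re.findall(model_1, string_temp)
result_2 = re.findall(model_2, string_temp)
result_3 = re.findall(model_3, string_temp)
result_4 = re.findall(model_4, string_temp)
result_5 = re.findall(model_5, string_temp)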

If we print result_1 through result_5, we see:

>>> result_1
['XXIXX']
>>> result_2
['XXIXX']
>>> result_3
['XXIXXlkjglkjXXLoveXXkljlgjXXYouXX']
>>> result_4
['XXIXX', 'XXLoveXX', 'XXYouXX']
>>> result_5
['I', 'Love', 'You']

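The symbols we use most are '.*' and '.*?'. Comparing the output of result_3 and result_4 makes the difference clear: '.*' matches the longest string it can, while '.*?' matches the shortest string it can, which is why it returns more separate matches.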
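The same greedy versus non-greedy contrast shows up when pulling tags out of markup. A minimal sketch for Python 3, using a made-up HTML fragment:

```python
import re

# Hypothetical fragment, chosen only to show the contrast.
html = "<b>one</b> and <b>two</b>"

greedy = re.findall(r"<b>.*</b>", html)    # runs from the first <b> to the last </b>
lazy = re.findall(r"<b>.*?</b>", html)     # stops at the nearest </b>
inner = re.findall(r"<b>(.*?)</b>", html)  # same, but keeps only the captured text

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
print(inner)   # ['one', 'two']
```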

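In addition, re.findall() accepts a third argument, flags, which changes how the matching is performed. The common ones are re.S (by default '.' does not match a newline, so a match is effectively confined to one line; with re.S, '.' also matches '\n' and a match can span lines) and re.I (ignore case). There are others, such as re.L, re.M and re.X, which you can look up when you need them.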
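To make the flags concrete, here is a small sketch for Python 3 showing re.I and re.M, and how flags are combined with "|". The sample text is an assumption made for illustration.

```python
import re

text = "Python\npython\nPYTHON"  # hypothetical three-line input

case_sensitive = re.findall(r"python", text)          # exact case only
ignore_case = re.findall(r"python", text, re.I)       # any case
first_line_only = re.findall(r"^p\w+", text, re.I)    # ^ anchors at string start
every_line = re.findall(r"^p\w+", text, re.I | re.M)  # ^ anchors at each line start

print(case_sensitive)   # ['python']
print(ignore_case)      # ['Python', 'python', 'PYTHON']
print(first_line_only)  # ['Python']
print(every_line)       # ['Python', 'python', 'PYTHON']
```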

>>> result_10 = re.search('XXX.XXX', string_temp)
>>> result_10
>>> print(result_10)
None
>>> temp = 'wearetheone-\n-wearethechildren'
>>> temp
'wearetheone-\n-wearethechildren'
>>> model = '-.*-'
>>> result = re.findall(model, temp)
>>> result
[]
>>> result = re.findall(model, temp, re.S)
>>> result
['-\n-']
>>>

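In the example above, '\n' is the newline character. Without re.S, '.' does not match '\n', so the pattern cannot cross the line break and nothing is found. Once we pass re.S, '.' matches every character, including '\n', and the whole string is searched as one piece.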
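This is the usual situation in a crawler, where the content of interest is spread over several lines. A minimal sketch for Python 3, with a made-up page fragment:

```python
import re

# Hypothetical fragment whose paragraph spans two lines.
page = "<p>first\nsecond</p>"

without_s = re.findall(r"<p>(.*?)</p>", page)     # '.' stops at the newline
with_s = re.findall(r"<p>(.*?)</p>", page, re.S)  # '.' also matches '\n'

print(without_s)  # []
print(with_s)     # ['first\nsecond']
```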

Next, let's look at re.search(pattern, string, flags=0) and its output.

result_11 = re.search(model_1, string_temp)
result_12 = re.search(model_2, string_temp)
result_13 = re.search(model_3, string_temp)
result_14 = re.search(model_4, string_temp)
result_15 = re.search(model_5, string_temp)
result_10 = re.search('XXX.XXX', string_temp)

>>> result_11
<_sre.SRE_Match object at 0x...>
>>> result_12
<_sre.SRE_Match object at 0x02610720>
>>> result_13
<_sre.SRE_Match object at 0x026108E0>
>>> result_14
<_sre.SRE_Match object at 0x02610918>
>>> result_15
<_sre.SRE_Match object at 0x02229360>
>>> result_10
>>> print(result_10)
None
>>>

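If there is a match, re.search() returns a match object (how it is displayed varies with the Python version); otherwise it returns None. We read the matched data through the object's group() method.

To get a single captured substring, ask for it by position: the i-th group is result.group(i), where result is the object returned by re.search().

Called without an argument, result.group() defaults to group 0, which is the entire matched text, as the output below shows (the '-' characters in the last example serve as visible markers, because '\n' itself cannot be seen).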

>>> print(result_11.group())
XXIXX
>>> print(result_12.group())
XXIXX
>>> print(result_13.group())
XXIXXlkjglkjXXLoveXXkljlgjXXYouXX
>>> print(result_14.group())
XXIXX
>>> print(result_15.group())
XXIXX
>>> model = '-(.*)-we(.*)the'
>>> result = re.search(model, temp)
>>> result
>>> result = re.search(model, temp, re.S)
>>> result
<_sre.SRE_Match object at 0x01E1D8D8>
>>> print(result.group())
-
-wearethe
>>> print(result.group(1))


>>> print(result.group(2))
are
>>>
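The group() calls above can be summarised in a short sketch for Python 3. The date string and the variable names are assumptions chosen for illustration; groups() is the standard match-object method that returns all captured groups as a tuple.

```python
import re

# Hypothetical input containing a date.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "posted 2016-05-17 15:36")

if m is not None:           # always check for None before calling group()
    whole = m.group()       # same as m.group(0): the entire match
    year = m.group(1)       # first parenthesised group
    parts = m.groups()      # all groups as a tuple
    print(whole)            # 2016-05-17
    print(year)             # 2016
    print(parts)            # ('2016', '05', '17')
```

Checking for None first matters because calling group() on a failed search raises AttributeError.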

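As the examples show, re.search(), like re.findall(), accepts a third argument with the same purpose. re.match(pattern, string, flags=0) is essentially the same as re.search(); the only difference is that re.match() matches only at the beginning of the string.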
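The difference between re.match() and re.search() is easiest to see on a two-line string. A minimal sketch for Python 3, with made-up input:

```python
import re

text = "first line\nsecond line"  # hypothetical input

m1 = re.match(r"second", text)         # None: the string does not start with it
m2 = re.search(r"second", text)        # found on the second line
m3 = re.match(r"second", text, re.M)   # still None: match() only tries position 0
m4 = re.search(r"^second", text, re.M) # with re.M, ^ matches at each line start

print(m1)          # None
print(m2.group())  # second
print(m3)          # None
print(m4.group())  # second
```

So to anchor at the start of every line, the combination to reach for is re.search() or re.findall() with '^' and re.M, not re.match().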

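Finally there is re.sub(pattern, repl, string, count=0, flags=0). It finds the non-overlapping occurrences of pattern anywhere in string (not only at the start, unlike re.match()) and replaces them with repl. count is the maximum number of replacements; leaving it out, or passing 0, replaces every occurrence. The return value is the string after replacement, and the original string is left unchanged.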

>>> temp
'wearetheone-\n-wearethechildren'
>>> model
'-(.*)-we(.*)the'
>>> result = re.sub(model, '**', temp, count=0)
>>> result
'wearetheone-\n-wearethechildren'
>>> model = 'we(.*)the'
>>> result = re.sub(model, '000', temp)
>>> temp
'wearetheone-\n-wearethechildren'
>>> result
'000one-\n-000children'
>>> result = re.sub(model, '000', temp, count=2)
>>> result
'000one-\n-000children'
>>> temp
'wearetheone-\n-wearethechildren'
>>> result = re.sub(model, '000', temp, count=1)
>>> result
'000one-\n-wearethechildren'
>>>
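Beyond a fixed replacement string, re.sub() also accepts backreferences such as \1 in repl, or a function that receives each match object. The sketch below is for Python 3; the inputs and the helper name double are illustrative assumptions.

```python
import re

# Backreferences: reorder the captured groups (hypothetical date string).
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2016-05-17")

def double(match):
    """Hypothetical helper: return the matched number doubled, as text."""
    return str(int(match.group()) * 2)

# Function as repl: called once per match.
doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# count limits how many matches are replaced.
limited = re.sub(r"\d+", "N", "1 2 3", count=2)

print(swapped)  # 17/05/2016
print(doubled)  # 6 apples and 20 pears
print(limited)  # N N 3
```

Passing count by keyword keeps the call unambiguous, since the positional slot after string is easy to confuse with flags.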

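(3) Common combinations

From the examples above it is easy to see that findall is the method used most often, and that '.*' and '.*?' are the core symbols. Everyday usage mostly revolves around these three; bring in the other methods and symbols as the need arises.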
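Put together, a typical crawler-style extraction might look like the sketch below (Python 3). The page content is invented for illustration: first narrow the text to one block with re.search() and re.S, then pull out the repeated items with re.findall() and '.*?'.

```python
import re

# Hypothetical page fragment; a real crawler would download this text.
page = """<ul>
<li><a href="/a.html">Alpha</a></li>
<li><a href="/b.html">Beta</a></li>
</ul>"""

# Step 1: isolate the list; re.S lets '.' cross the line breaks.
block = re.search(r"<ul>(.*?)</ul>", page, re.S)

# Step 2: extract (url, title) pairs from inside the block.
links = re.findall(r'<a href="(.*?)">(.*?)</a>', block.group(1)) if block else []

print(links)  # [('/a.html', 'Alpha'), ('/b.html', 'Beta')]
```

With more than one capturing group in the pattern, findall returns a list of tuples, one tuple per match.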

Reference: http://www.jb51.net/article/50511.htm
