
Python Web Scraping Tutorial (4): Parsing Web Pages with Regular Expressions

2019-02-16 16:34

Welcome to the Python web scraping classroom. Let's start your scraping journey!

Parsing web pages with regular expressions

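A regular expression is a pattern that describes a set of strings. Applying it to a string works like a 'filter': only the parts that fit the pattern come through.

We can treat a page's source code as one long string and use regular expressions to extract what we need from it. Regular expressions feel a little hard at first, but keep going!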

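The table below lists the most common symbols:

Pattern    Meaning
.          Matches any character except a newline
*          Matches the preceding element 0 or more times
+          Matches the preceding element 1 or more times
?          Matches the preceding element 0 or 1 times
^          Matches the start of the string
$          Matches the end of the string
\s         Matches a whitespace character
\S         Matches any non-whitespace character
\d         Matches a digit
\D         Matches any non-digit
\w         Matches a letter, digit, or underscore
\W         Matches any character that is not a letter, digit, or underscore
[]         A character set: matches any one of the characters listed
()         A group: a subexpression whose match is captured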
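To make the table more concrete, here is a small sketch that tries several of these symbols with re.findall and re.match. The sample strings are made up for illustration and do not come from any real page.

```python
import re

# Made-up sample text for illustration
sample = "Order 66 shipped on 2019-02-15 to user_42"

# \d+ : one or more digits
numbers = re.findall(r"\d+", sample)

# \w+ : one or more word characters (letters, digits, underscore)
words = re.findall(r"\w+", sample)

# [aeiou] : any single character from the set
vowels = re.findall(r"[aeiou]", "regex")

# ^ and $ anchor the pattern to the start and end of the string
is_date = re.match(r"^\d+-\d+-\d+$", "2019-02-15") is not None

# ? makes the preceding character optional
optional = re.findall(r"colou?r", "color colour")

print(numbers)
print(words)
print(vowels)
print(is_date)
print(optional)
```

Note that \w+ keeps user_42 together as one item, because the underscore counts as a word character.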

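For a more detailed introduction to regular expressions, see: http://www.runoob.com/regexp/regexp-tutorial.html

Here we go straight to the match, search, and findall functions of Python's re module, with most of the attention on re.match: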

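The re.match method

re.match tries to match the pattern at the very start of the string. If the start does not match, it returns None.

Its signature is re.match(pattern,string,flags=0), where pattern is the regular expression, string is the text to match against, and flags is an optional set of modifiers such as re.IGNORECASE.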
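Because re.match returns None when nothing matches, calling a method on the result without checking will raise an AttributeError. The sketch below wraps that check in a helper; starts_with is a hypothetical name introduced here for illustration, not part of the re module.

```python
import re

def starts_with(pattern, string, flags=0):
    """Return the matched text if `string` starts with `pattern`, else None.

    Hypothetical helper for illustration; it only wraps re.match.
    """
    m = re.match(pattern, string, flags)
    return m.group(0) if m is not None else None

print(starts_with("www", "www.baidu.com"))                 # matches at the start
print(starts_with("www", "WWW.BAIDU.COM"))                 # case differs, no match
print(starts_with("www", "WWW.BAIDU.COM", re.IGNORECASE))  # flag ignores case
```

The third call shows what the flags argument is for: re.IGNORECASE lets the lowercase pattern match uppercase text.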

Let's look at an example:

import re
test=re.match('www','www.baidu.com')
print('result:',test)
print('begin and end tuple:',test.span())
print('begin:',test.start())
print('end:',test.end())

The output is:

result: <re.Match object; span=(0, 3), match='www'>
begin and end tuple: (0, 3)
begin: 0
end: 3

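span() returns a tuple holding the start and end positions of the match. The end position is exclusive.

start() and end() return the start and end positions individually.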
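Since the end position is exclusive, the two numbers from span() can be used directly as slice indices. A minimal sketch, reusing the string from the example above:

```python
import re

text = "www.baidu.com"
m = re.match("www", text)

# span() gives (start, end); slicing with it recovers the matched text
start, end = m.span()
matched = text[start:end]

# everything after the match starts at m.end()
rest = text[m.end():]

print(matched)
print(rest)
```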

Next, let's try a more complex pattern that contains groups:

import re
# the text to search; the regular expression is the first argument of re.match
line='Cats are smarter than dogs'
test=re.match(r'(.*) are (.*?) dogs',line)
print('the whole sentence:',test.group(0))
print('the first result:',test.group(1))
print('the second result:',test.group(2))
print('a tuple for result:',test.groups())

You will get:

the whole sentence: Cats are smarter than dogs
the first result: Cats
the second result: smarter than
a tuple for result: ('Cats', 'smarter than')

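The r prefix marks a raw string, so Python does not treat backslashes in it as escape characters.

Here each pair of parentheses () wraps a sub-pattern and captures whatever it matches, so the two groups capture Cats and smarter than. group(0) is the whole match, group(1) and group(2) are the individual groups, and groups() returns all groups as a tuple.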
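The pattern above uses both (.*) and (.*?). The difference matters a lot when scraping: .* is greedy and takes as much text as it can, while .*? is lazy and stops at the first place the rest of the pattern can match. A small sketch on a made-up HTML fragment:

```python
import re

# Made-up fragment for illustration
html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the LAST </b>, swallowing the tags in between
greedy = re.findall(r"<b>(.*)</b>", html)

# Lazy: .*? stops at the FIRST </b> after each <b>
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)
print(lazy)
```

This is why patterns for extracting the text between two tags usually use (.*?).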

The re.search method

re.match only matches at the start of the string, while re.search scans the whole string and returns the first match. Compare:

import re
link='www.baidu.com'
print(re.match('baidu',link))
print(re.search('baidu',link))

You will get:

None
<re.Match object; span=(4, 9), match='baidu'>

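This shows the difference: match finds no baidu at the start of the string and returns None right away, while search moves through the string until it finds a match.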
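One way to see the relationship between the two: adding ^ to the front of a search pattern pins it to the start of the string, which (without the re.MULTILINE flag) gives the same result as match. The sketch below also shows the None check that search results need just as much as match results.

```python
import re

link = "www.baidu.com"

# With a leading ^ the pattern must start at index 0, as with re.match
anchored = re.search(r"^baidu", link)

# Without the anchor, search finds baidu wherever it first appears
unanchored = re.search(r"baidu", link)

# Check for None before calling methods on the result
position = unanchored.span() if unanchored else None

print(anchored)
print(position)
```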

The re.findall method

re.findall returns every match in the string as a list:

import re
link='www.baidu.com www.baidu.com'
print(re.match('www',link))
print(re.search('www',link))
print(re.findall('www',link))

You will get:

<re.Match object; span=(0, 3), match='www'>
<re.Match object; span=(0, 3), match='www'>
['www', 'www']

Here match and search each return only the first www, while findall returns all of them as a list of strings.
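What findall returns depends on how many groups the pattern has: with no group it returns the whole matches, with one group it returns that group's text, and with several groups it returns tuples. When positions are needed as well, re.finditer yields match objects instead. A short sketch on a made-up string:

```python
import re

# Made-up sample text for illustration
text = "a=1, b=22, c=333"

# One group: a list of that group's text
values = re.findall(r"\w=(\d+)", text)

# Two groups: a list of tuples, one tuple per match
pairs = re.findall(r"(\w)=(\d+)", text)

# finditer yields match objects, so span() is available for each match
spans = [m.span() for m in re.finditer(r"\d+", text)]

print(values)
print(pairs)
print(spans)
```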

Practice

Now let's use a regular expression to scrape the post dates from a blog:

import re
import requests
link='https://blog.csdn.net/weixin_42183408'
headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'}

r=requests.get(link,headers=headers)
html=r.text

# capture the text between the opening and closing span tags
date=re.findall(r'<span class="date">(.*?)</span>',html)
for each in date:
    print(each)

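First we import the requests and re libraries, then set a custom request header, request the page, and read the page source from r.text. The element that holds each date looks like this: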

<span class="date">2019-02-15 17:20:11</span>

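So we replace the date itself with the regular expression (.*?), and findall returns whatever that group captured, which is the date. Next we can add paging. At the time of writing, my blog had three pages (adjust the range to match the current number of pages):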
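The extraction step can be tried without any network access. In the sketch below, the html string is a hypothetical stand-in for r.text: the first date is the one shown above, the second is made up, and the surrounding div markup is an assumption, so the real page may look different.

```python
import re

# Hypothetical fragment standing in for r.text; real page markup may differ
html = """
<div class="info"><span class="date">2019-02-15 17:20:11</span></div>
<div class="info"><span class="date">2019-02-14 09:05:42</span></div>
"""

# Compile once so the same pattern can be reused for every page
DATE_PATTERN = re.compile(r'<span class="date">(.*?)</span>')

dates = DATE_PATTERN.findall(html)
for each in dates:
    print(each)
```

Testing the pattern on a saved fragment like this makes it easier to tell a wrong pattern apart from a failed request.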

import re
import requests
headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'}

for i in range(1,4):
    # build the URL for page i
    link='https://blog.csdn.net/weixin_42183408/article/list/'+str(i)+'?'
    r=requests.get(link,headers=headers)
    html=r.text

    # extract and print every date on this page
    date=re.findall(r'<span class="date">(.*?)</span>',html)
    for each in date:
        print(each)

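The URL of the second page is:

https://blog.csdn.net/weixin_42183408/article/list/2?

and the URL of the third page is:

https://blog.csdn.net/weixin_42183408/article/list/3?

Only the number after list/ changes, so we can page through with a for loop:

for i in range(1,4)

and build link from i on each pass through the loop.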

This way we can collect the publication dates of all the blog posts!

See you next time!
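The paging logic can also be split into small functions so that the URL building and the extraction can be checked offline. This is only a sketch under stated assumptions: page_url, collect_dates, and fake_fetch are hypothetical names introduced here, and fake_fetch returns made-up HTML instead of contacting the site.

```python
import re
import time

BASE_URL = "https://blog.csdn.net/weixin_42183408/article/list/"
DATE_PATTERN = re.compile(r'<span class="date">(.*?)</span>')

def page_url(page):
    """Build the list-page URL for a 1-based page number."""
    return f"{BASE_URL}{page}?"

def collect_dates(fetch, first=1, last=3, delay=0.0):
    """Call fetch(url) for each page and return all dates found, in page order.

    `fetch` is any function that takes a URL and returns the page source.
    """
    dates = []
    for page in range(first, last + 1):
        html = fetch(page_url(page))
        dates.extend(DATE_PATTERN.findall(html))
        time.sleep(delay)  # pause between requests to go easy on the server
    return dates

# Stand-in fetcher (hypothetical) so the sketch runs without network access
def fake_fetch(url):
    return f'<span class="date">fetched {url}</span>'

print(collect_dates(fake_fetch))
```

To scrape the real pages, pass a fetcher built on requests instead, for example collect_dates(lambda url: requests.get(url, headers=headers).text, delay=1.0), with requests and headers set up as in the code above.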
