
Python Crawler: Scraping Images from Zhihu

2017-07-13 09:35, 459 views

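Lately I've found Python web scraping pretty fun, so I looked up some tutorials online and taught myself for a few days. It really is interesting. I recommend a Python crawler course on the China University MOOC platform: the instructor explains things very well and it is a great fit for beginners. Here is the link.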

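I remembered seeing something really fun in a Zhihu column a while ago. Back then I didn't know how to write a crawler, so I just bookmarked the article; the code is here. Looking at it again now, it turns out to be quite simple. The column article parses the HTML with lxml. I looked into it and found that XPath is really handy (haha, actually I'm not very good at it yet, so I'll come back to it once I've found a tutorial), but since BeautifulSoup is what I've learned, I'll do a simple implementation with BeautifulSoup.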
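Before the full script, here is a small sketch of the parsing idea it relies on: the image links live in <img> tags nested inside <noscript> elements. To keep it runnable without extra installs, this sketch uses the standard library's html.parser instead of lxml or BeautifulSoup. The HTML fragment, the class name NoscriptImageParser and the function extract_noscript_images are made up for illustration; the real Zhihu page may well look different.

```python
from html.parser import HTMLParser


class NoscriptImageParser(HTMLParser):
    """Collect the src of every <img> that sits inside a <noscript> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <noscript> elements currently open
        self.links = []   # image URLs found so far

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.depth += 1
        elif tag == "img" and self.depth > 0:
            src = dict(attrs).get("src")
            if src:
                self.links.append(src)

    def handle_endtag(self, tag):
        if tag == "noscript" and self.depth > 0:
            self.depth -= 1


# Hypothetical fragment that imitates the structure the script looks for.
SAMPLE_HTML = """
<div class="RichContent-inner">
  <p>some text</p>
  <noscript><img src="https://example.com/a.jpg"></noscript>
  <img src="https://example.com/lazy-placeholder.gif">
  <noscript><img src="https://example.com/b.png"></noscript>
</div>
"""


def extract_noscript_images(html):
    """Return the image URLs found inside <noscript> elements, in document order."""
    parser = NoscriptImageParser()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(extract_noscript_images(SAMPLE_HTML))
```

The placeholder image outside <noscript> is skipped on purpose, which mirrors why the script collects the noscript tags rather than every img on the page.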

Enough talk, here is the code. (It only grabs the images in the top-ranked answer under a single question.)

import os
import sys
import time

import requests
from bs4 import BeautifulSoup

cookie = ''  # the column article explains how to use your cookie
headers = {
    'User-Agent': 'Mozilla/5.0',  # make the request look like it comes from a browser
    'Cookie': cookie,
}


def getHtmlText(url):
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        response.encoding = 'utf-8'
        return response.text
    except requests.RequestException:
        sys.exit('Cookie-based login failed')


def savePictures():
    html_text = getHtmlText('https://www.zhihu.com/question/40063489')
    # If you can't get the cookie login to work, save the page source by hand as an HTML file:
    # soup = BeautifulSoup(open('zhihu.html', 'r', encoding='utf-8'), 'html.parser')
    soup = BeautifulSoup(html_text, 'html.parser')

    # question and author: open the page source, find them, and check which tags they sit in
    question = soup.h1.text.strip()
    author = str(soup.find_all(name='a', attrs='UserLink-link')[1].text)
    # info holds everything in author's answer to question.
    # A question page shows two top-ranked answers; take the first one.
    info = soup.find_all(name='div', attrs='RichContent-inner')[0]
    x = info.find_all(name='noscript')  # the tags that contain all the image links
    links = []
    for i in x:
        link = i.img.attrs['src']
        links.append(link)

    try:
        filename = question + ' - ' + author
        # print(filename)
        if not os.path.exists(filename):
            os.mkdir(filename)
        for i in range(len(links)):
            img_source = requests.get(links[i], timeout=10).content
            img_path = filename + '/' + str(i) + '.' + links[i].split('.')[-1]
            with open(img_path, 'wb') as f:
                f.write(img_source)
            print(links[i], 'saved')
    except (OSError, requests.RequestException) as e:
        print('error:', e)


start = time.time()
savePictures()
end = time.time()

print('Total time: ', end - start, 'seconds')
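One fragile spot in the script is how it builds the folder name and the file extension: a question title can contain characters such as "/" or "?" that are not valid in a folder name on every system, and links[i].split('.')[-1] picks up any query string that follows the extension. Below is a minimal sketch of a more defensive version, using only the standard library. The helper names safe_dirname, image_extension and image_path are hypothetical, and the ".jpg" fallback is an assumption, not something the script above does.

```python
import os
import re
from urllib.parse import urlparse


def safe_dirname(name, replacement="_"):
    """Replace characters that are commonly rejected in folder names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, name).strip(" .")
    return cleaned or "untitled"


def image_extension(url, default=".jpg"):
    """Return the extension from the URL path, ignoring any query string."""
    path = urlparse(url).path
    ext = os.path.splitext(path)[1]
    return ext if ext else default  # assumed fallback when the URL has no extension


def image_path(folder, index, url):
    """Build the path for the index-th image inside folder."""
    return os.path.join(folder, str(index) + image_extension(url))


if __name__ == "__main__":
    folder = safe_dirname("What is a/b? - author")
    print(image_path(folder, 0, "https://example.com/pic/abc_b.jpg?source=1"))
```

In the script, these would stand in for the line that builds filename and the line that builds img_path; the download loop itself would stay the same.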

Haha, here is a screenshot: [screenshot not preserved]
