
[Python] Web Scraping: Batch-Downloading Images

2017-01-28 17:30

Description

Yixiaohan/show-me-the-code, problems 0008, 0009, and 0013

0008: Given an HTML file, extract its body text.

0009: Given an HTML file, extract its links.

0013: Write an image scraper in Python.

Notes

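This small project covers using the BeautifulSoup module, file I/O, and downloading files from the web. A few key points:

Using the requests module and the Response object

The requests module handles HTTP requests: GET, POST, DELETE, PUT, and so on.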

response = requests.get(url)

This call returns a Response object.

Because different sites may use different encodings, this project explicitly sets the response's encoding to utf-8.

response.encoding = "utf-8"

The HTML source is then available from the Response object's text attribute.

html_code = response.text
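Why the encoding setting matters can be shown without any network access. The sketch below is only an illustration: the sample string is taken from the thread title mentioned later in this post, and decoding with errors="replace" is an assumption chosen so that the wrong codec produces garbled text instead of raising an exception.

```python
# The same bytes decoded with two different codecs.
text = "黑白简约壁纸"
raw = text.encode("utf-8")                   # what a UTF-8 page sends over the wire

wrong = raw.decode("gbk", errors="replace")  # decoded as gbk: garbled characters
right = raw.decode("utf-8")                  # decoded as utf-8: the original text

print(repr(wrong))
print(right)
```

Setting response.encoding before reading response.text plays the same role as choosing the codec passed to decode() here.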

Another way to get the HTML source is shown below. Note that read() returns bytes rather than str:

html_code = urllib.request.urlopen(url).read()
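With urlopen() the decoding step is left to the caller. The following is a minimal sketch, assuming a data: URL as a stand-in for a real page so that it runs offline; the page content is made up for the illustration.

```python
import base64
import urllib.request

# Assumption: a data: URL carrying UTF-8 HTML stands in for a real http:// page.
page = "<html><body><p>黑白简约壁纸</p></body></html>"
url = "data:text/html;base64," + base64.b64encode(page.encode("utf-8")).decode("ascii")

with urllib.request.urlopen(url) as resp:
    raw = resp.read()              # bytes, not str

html_code = raw.decode("utf-8")    # decode explicitly to get a str
print(type(raw).__name__, type(html_code).__name__)
```

BeautifulSoup accepts either the bytes or the decoded string, but decoding explicitly keeps the choice of codec visible.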

The BeautifulSoup module

Create a BeautifulSoup object. The first argument is the HTML source and the second names the parser.

soup = BeautifulSoup(html_code, "html.parser")

To parse a local HTML file instead, pass an open file object:

soup = BeautifulSoup(open('index.html'), "html.parser")

Find the content you want to scrape. For example, to find all links:

links = soup.findAll('a')

To get just the URL from each 'a' tag, use get() to pick the attribute you want:

print(link.get('href'))

To get the link's text, use:

print(link.string)

findAll() also accepts extra filters, for example to select only img tags with a particular class:

imgs = soup.findAll('img', {'class' : "BDE_Image"})
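The calls above can be tried on a small inline document. This is a sketch under assumptions: the HTML snippet and its URLs are invented for the example, and it uses find_all(), the current name for which findAll() is the older alias.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, made up for this illustration.
html_code = """
<html><body>
<a href="/p/1">First post</a>
<a href="/p/2"><b>Bold</b> post</a>
<img class="BDE_Image" src="http://example.com/wallpaper.jpg">
<img class="ad" src="http://example.com/ad.jpg">
</body></html>
"""
soup = BeautifulSoup(html_code, "html.parser")

links = soup.find_all("a")
hrefs = [link.get("href") for link in links]      # attribute values
strings = [link.string for link in links]         # None when a tag has several children
texts = [link.get_text() for link in links]       # all nested text joined together

# Only images with the wanted class; the "ad" image is skipped.
srcs = [img.get("src") for img in soup.find_all("img", {"class": "BDE_Image"})]

print(hrefs, strings, texts, srcs)
```

The second link shows a detail worth knowing: .string is None once a tag contains more than one child, while get_text() still returns the full text.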

Downloading files from the web with Python

Use the urlretrieve() function from the urllib.request module.

urllib.request.urlretrieve(src, fileName)
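A minimal sketch of numbered downloads with urlretrieve() follows. The image sources are hypothetical: data: URLs with placeholder bytes stand in for real image links so the example runs offline, and the files go to a temporary directory.

```python
import base64
import os
import tempfile
import urllib.request

# Assumption: placeholder payloads wrapped in data: URLs instead of real image URLs.
payloads = [b"fake-image-0", b"fake-image-1"]
srcs = ["data:image/jpeg;base64," + base64.b64encode(p).decode("ascii") for p in payloads]

out_dir = tempfile.mkdtemp()
saved = []
for cnt, src in enumerate(srcs):
    file_name = os.path.join(out_dir, str(cnt) + ".jpg")  # 0.jpg, 1.jpg, ...
    urllib.request.urlretrieve(src, file_name)            # fetch src and write it to file_name
    saved.append(file_name)

print(saved)
```

With real URLs, a failed request raises an exception from urlretrieve(), so a long batch may want a try/except around the call.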

My Code

"""
* 0008 && 0009 && 0013
by VegB
2017/1/26
"""

from bs4 import BeautifulSoup
import requests
import urllib.request

"""
The requests module handles HTTP requests: GET, POST, DELETE, PUT, and so on.
requests.get() returns a Response object.
"""

raw_url = "http://tieba.baidu.com/p/4945979003?see_lz=1&pn="
cnt = 0

for pageNum in range(1, 2):  # only page 1; widen the range to fetch more pages
    url = raw_url + str(pageNum)

    response = requests.get(url)
    response.encoding = "utf-8"
    # Baidu's pages may be declared as gb2312 or similar, and the Windows default is gbk;
    # decoding with gbk garbles the text, so set utf-8 explicitly
    # print(response.text)

    html_code = response.text  # the text attribute of the Response object holds the HTML source
    soup = BeautifulSoup(html_code, "html.parser")

    # websiteCode = urllib.request.urlopen(url).read()
    # soup = BeautifulSoup(websiteCode, "html.parser")  # build a BeautifulSoup object from HTML, or use BeautifulSoup(open('index.html'))

    # Scrape links
    # Write them to a file?
    """
    links = soup.findAll('a')
    link_cnt = 0
    for link in links:
        print("LINK %d:" % link_cnt)
        print(link.get('href'))
        print(link.string)
        link_cnt += 1
    """

    # Download images
    imgs = soup.findAll('img', {'class': "BDE_Image"})  # filter by class name to skip ad images
    for img in imgs:
        src = img.get('src')
        print("IMAGE %d:" % cnt)
        print(src)
        fileName = str(cnt) + ".jpg"
        urllib.request.urlretrieve(src, fileName)
        cnt += 1

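# Find the body and write it to a file
url = "http://162.105.146.180:8130/"  # scraping a site I wrote myself, haha
response = requests.get(url)
response.encoding = 'utf-8'
html_code = response.text

soup = BeautifulSoup(html_code, "html.parser")
html_body = soup.findAll('body')
print(html_body)

fp = open('html_body.txt', 'w', encoding='utf-8')
# The original listing is cut off after "for li"; the loop below is an assumed reconstruction
for line in html_body:
    fp.write(str(line))
fp.close()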
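For problem 0008, the body text without its tags can be pulled out with get_text(). This is a sketch and not the author's code: the HTML string is invented for the example, and the output goes to a temporary directory so nothing in the working folder is overwritten.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Hypothetical page, made up for this illustration.
html_code = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Wallpapers</h1><p>Black and <b>white</b></p></body></html>"
)
soup = BeautifulSoup(html_code, "html.parser")

# Join every piece of text inside <body> with single spaces, trimming whitespace.
body_text = soup.body.get_text(" ", strip=True)

out_path = os.path.join(tempfile.mkdtemp(), "html_body.txt")
with open(out_path, "w", encoding="utf-8") as fp:
    fp.write(body_text)

print(body_text)
```

Opening the file with an explicit encoding avoids depending on the platform default, which is the same gbk-versus-utf-8 concern raised in the code comments above.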

Result

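With this small project you can easily batch-download wallpapers from Tieba.

For example, suppose we want to download the wallpapers in the thread 【壁纸】黑白简约壁纸.

After running the program, the wallpapers appear in the local folder, along with the html_body.txt file.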


Tags: python, web scraping, file download
