
Python Web Scraping with a Delicious Soup - BeautifulSoup

2017-09-06 15:18 · 369 views


Further reading:

Building a web crawler with Python 3 (2) - Using BeautifulSoup (1)

Building a web crawler with Python 3 (3) - Using BeautifulSoup (2)

Building a web crawler with Python 3 (4) - Using BeautifulSoup (3)

Installation

1. Install the package in PyCharm's package manager: bs4, or

2.
pip install beautifulsoup4


Extras

Install lxml –> via the package manager: lxml, or
pip install lxml
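To confirm the install worked, here is a quick sanity check; it prefers the lxml parser and falls back to the standard-library parser if lxml is missing:

```python
from bs4 import BeautifulSoup

# Prefer the lxml parser installed above; fall back to the built-in html.parser.
try:
    soup = BeautifulSoup('<p>hello</p>', 'lxml')
except Exception:
    soup = BeautifulSoup('<p>hello</p>', 'html.parser')

print(soup.p.get_text())  # hello
```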


The simplest usage

# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://tieba.baidu.com/')
bsObj = BeautifulSoup(html, 'lxml')  # convert the HTML response into a BeautifulSoup object
print(bsObj.title)
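The same pattern works on any HTML string, not just a live response. A minimal offline sketch with a made-up snippet (the `Tieba` title is invented for illustration):

```python
from bs4 import BeautifulSoup

# A hypothetical document standing in for the fetched page.
html_doc = '<html><head><title>Tieba</title></head><body></body></html>'
soup = BeautifulSoup(html_doc, 'html.parser')  # stdlib parser, no lxml required

print(soup.title)             # the whole <title> tag
print(soup.title.get_text())  # just its text: Tieba
```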


Finding tags by name and attribute

The find_all method

# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://movie.douban.com')
bsObj = BeautifulSoup(html, 'lxml')
liList = bsObj.find_all('li', {'class': 'title'})  # find tags by name and attribute

for li in liList:
    print(li.a.get_text())  # get the text inside the <a> tag
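The attribute filter can be tried offline against a hand-written snippet, which makes the matching behavior easy to see (the list items and movie names below are invented for illustration):

```python
from bs4 import BeautifulSoup

# A hypothetical fragment mimicking the Douban list structure.
html_doc = '''
<ul>
  <li class="title"><a href="/m1">Movie One</a></li>
  <li class="title"><a href="/m2">Movie Two</a></li>
  <li class="other"><a href="/m3">Movie Three</a></li>
</ul>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Only the <li> tags whose class is "title" are returned.
titles = [li.a.get_text() for li in soup.find_all('li', {'class': 'title'})]
print(titles)  # ['Movie One', 'Movie Two']
```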


Handling tags without attribute values via the parent node

# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://movie.douban.com')
bsObj = BeautifulSoup(html, 'lxml')
liList = bsObj.find_all('li', {'class': 'ui-slide-item'})

for li in liList:
    for child in li.children:  # children is an iterator over the tag's direct children
        print(child)
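What `children` yields can be seen on a small invented fragment; each direct child of the parent tag comes back as its own object:

```python
from bs4 import BeautifulSoup

# A hypothetical parent tag whose children carry no identifying attributes.
html_doc = '<li class="ui-slide-item"><div>poster</div><span>rating</span></li>'
soup = BeautifulSoup(html_doc, 'html.parser')

li = soup.find('li', {'class': 'ui-slide-item'})
child_names = [child.name for child in li.children]  # names of the direct children
print(child_names)  # ['div', 'span']
```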


Batch-downloading images with regular expressions

# coding:utf-8
import random
import re
from urllib.request import urlopen, Request, urlretrieve

from bs4 import BeautifulSoup


def get_html(url, headers):
    """
    Fetch pages that answer 403 Forbidden to the default user agent.
    :param url: page URL
    :param headers: list of User-Agent strings to choose from
    :return: the opened response
    """
    random_header = random.choice(headers)
    req = Request(url)
    req.add_header('User-Agent', random_header)
    req.add_header('Host', 'tieba.baidu.com')
    req.add_header('Referer', 'http://tieba.baidu.com/p/4792769205')
    html = urlopen(req)
    return html


url = 'http://tieba.baidu.com/p/4792769205'

# Build the headers list below with your own browser's User-Agent string
my_headers = ['Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36']

html = get_html(url, my_headers)
bsObj = BeautifulSoup(html, 'lxml')
imageList = bsObj.find_all('img', {'src': re.compile(r'http://imgsrc\.baidu\.com/forum/w%3D580/sign=.+\.jpg')})

for index, image in enumerate(imageList):
    imageUrl = image['src']
    imageLocation = '/home/wangdongdong/test/' + str(index + 1) + '.jpg'
    urlretrieve(imageUrl, imageLocation)
    print('Image', index + 1, 'downloaded')
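The regex filter passed to find_all can be exercised offline too. A minimal sketch with invented image URLs (only the sign= filename part is made up) shows that BeautifulSoup keeps just the tags whose src matches the pattern:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical <img> tags; only the first src follows the forum image pattern.
html_doc = '''
<img src="http://imgsrc.baidu.com/forum/w%3D580/sign=abc.jpg">
<img src="http://example.com/logo.png">
'''
soup = BeautifulSoup(html_doc, 'html.parser')

pattern = re.compile(r'http://imgsrc\.baidu\.com/forum/w%3D580/sign=.+\.jpg')
matches = [img['src'] for img in soup.find_all('img', {'src': pattern})]
print(matches)  # only the matching forum image URL
```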