
Python Web Scraping with a Delicious Soup - BeautifulSoup

2017-09-06 15:18 · 369 views


Further reading:

Building a web crawler with Python 3 (2) - Using BeautifulSoup (1)

Building a web crawler with Python 3 (3) - Using BeautifulSoup (2)

Building a web crawler with Python 3 (4) - Using BeautifulSoup (3)

Installation

1. Install the package in PyCharm's package manager: bs4, or

2.
pip install beautifulsoup4


Extras

Install lxml –> via the package manager: lxml, or
pip install lxml
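To confirm the install worked, here is a quick sanity check; it prefers the lxml parser and falls back to the standard-library parser if lxml is missing:

```python
from bs4 import BeautifulSoup

# Prefer the lxml parser installed above; fall back to the built-in html.parser.
try:
    soup = BeautifulSoup('<p>hello</p>', 'lxml')
except Exception:
    soup = BeautifulSoup('<p>hello</p>', 'html.parser')

print(soup.p.get_text())  # hello
```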


The simplest usage

# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://tieba.baidu.com/')
bsObj = BeautifulSoup(html, 'lxml')  # convert the HTML response into a BeautifulSoup object
print(bsObj.title)
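The same pattern works on any HTML string, not just a live response. A minimal offline sketch with a made-up snippet (the `Tieba` title is invented for illustration):

```python
from bs4 import BeautifulSoup

# A hypothetical document standing in for the fetched page.
html_doc = '<html><head><title>Tieba</title></head><body></body></html>'
soup = BeautifulSoup(html_doc, 'html.parser')  # stdlib parser, no lxml required

print(soup.title)             # the whole <title> tag
print(soup.title.get_text())  # just its text: Tieba
```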


Finding tags by name and attribute

The find_all method

# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://movie.douban.com')
bsObj = BeautifulSoup(html, 'lxml')
liList = bsObj.find_all('li', {'class': 'title'})  # find tags by name and attribute

for li in liList:
    print(li.a.get_text())  # get the text inside the <a> tag
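The attribute filter can be tried offline against a hand-written snippet, which makes the matching behavior easy to see (the list items and movie names below are invented for illustration):

```python
from bs4 import BeautifulSoup

# A hypothetical fragment mimicking the Douban list structure.
html_doc = '''
<ul>
  <li class="title"><a href="/m1">Movie One</a></li>
  <li class="title"><a href="/m2">Movie Two</a></li>
  <li class="other"><a href="/m3">Movie Three</a></li>
</ul>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Only the <li> tags whose class is "title" are returned.
titles = [li.a.get_text() for li in soup.find_all('li', {'class': 'title'})]
print(titles)  # ['Movie One', 'Movie Two']
```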


Handling tags without attribute values via the parent node

# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://movie.douban.com')
bsObj = BeautifulSoup(html, 'lxml')
liList = bsObj.find_all('li', {'class': 'ui-slide-item'})

for li in liList:
    for child in li.children:  # children is an iterator over the tag's direct children
        print(child)
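What `children` yields can be seen on a small invented fragment; each direct child of the parent tag comes back as its own object:

```python
from bs4 import BeautifulSoup

# A hypothetical parent tag whose children carry no identifying attributes.
html_doc = '<li class="ui-slide-item"><div>poster</div><span>rating</span></li>'
soup = BeautifulSoup(html_doc, 'html.parser')

li = soup.find('li', {'class': 'ui-slide-item'})
child_names = [child.name for child in li.children]  # names of the direct children
print(child_names)  # ['div', 'span']
```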


Batch-downloading images with regular expressions

# coding:utf-8
import random
import re
from urllib.request import urlopen, Request, urlretrieve

from bs4 import BeautifulSoup


def get_html(url, headers):
    """
    Fetch pages that answer 403 Forbidden to the default user agent.
    :param url: page URL
    :param headers: list of User-Agent strings to choose from
    :return: the opened response
    """
    random_header = random.choice(headers)
    req = Request(url)
    req.add_header('User-Agent', random_header)
    req.add_header('Host', 'tieba.baidu.com')
    req.add_header('Referer', 'http://tieba.baidu.com/p/4792769205')
    html = urlopen(req)
    return html


url = 'http://tieba.baidu.com/p/4792769205'

# Build the headers list below with your own browser's User-Agent string
my_headers = ['Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36']

html = get_html(url, my_headers)
bsObj = BeautifulSoup(html, 'lxml')
imageList = bsObj.find_all('img', {'src': re.compile(r'http://imgsrc\.baidu\.com/forum/w%3D580/sign=.+\.jpg')})

for index, image in enumerate(imageList):
    imageUrl = image['src']
    imageLocation = '/home/wangdongdong/test/' + str(index + 1) + '.jpg'
    urlretrieve(imageUrl, imageLocation)
    print('Image', index + 1, 'downloaded')
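The regex filter passed to find_all can be exercised offline too. A minimal sketch with invented image URLs (only the sign= filename part is made up) shows that BeautifulSoup keeps just the tags whose src matches the pattern:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical <img> tags; only the first src follows the forum image pattern.
html_doc = '''
<img src="http://imgsrc.baidu.com/forum/w%3D580/sign=abc.jpg">
<img src="http://example.com/logo.png">
'''
soup = BeautifulSoup(html_doc, 'html.parser')

pattern = re.compile(r'http://imgsrc\.baidu\.com/forum/w%3D580/sign=.+\.jpg')
matches = [img['src'] for img in soup.find_all('img', {'src': pattern})]
print(matches)  # only the matching forum image URL
```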