Python Web Scraping with a Delicious Soup: BeautifulSoup
2017-09-06 15:18
Further reading: Implementing a web crawler in Python 3 (2) – Using BeautifulSoup (1)
Implementing a web crawler in Python 3 (3) – Using BeautifulSoup (2)
Implementing a web crawler in Python 3 (4) – Using BeautifulSoup (3)
Installation
1. In PyCharm, install the package bs4; or
2. pip install beautifulsoup4
Extras
Install lxml: either as a package in PyCharm, or via pip install lxml
Simplest usage

# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://tieba.baidu.com/')
bsObj = BeautifulSoup(html, 'lxml')  # convert the HTML response into a BeautifulSoup object
print(bsObj.title)
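The same idea can be tried without a network connection by parsing a literal HTML string. The markup below is invented for illustration, and Python's built-in 'html.parser' stands in for lxml (BeautifulSoup accepts either):

```python
# coding:utf-8
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a downloaded page
html_doc = "<html><head><title>Tieba</title></head><body><p>hello</p></body></html>"

# 'html.parser' ships with Python; 'lxml' works the same way once installed
bsObj = BeautifulSoup(html_doc, 'html.parser')
print(bsObj.title)             # <title>Tieba</title>
print(bsObj.title.get_text())  # Tieba
```

Accessing bsObj.title returns the first <title> tag as a Tag object; get_text() strips the markup and leaves only the text.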
Finding tags by name and attributes
The find_all method:

# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://movie.douban.com')
bsObj = BeautifulSoup(html, 'lxml')
liList = bsObj.find_all('li', {'class': 'title'})  # find tags by name and attributes
for li in liList:
    print(li.a.get_text())  # get the text inside the <a> tag
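The same find_all pattern can be checked offline on a fabricated snippet (the tag contents and class values below are invented, mimicking the structure scraped above):

```python
# coding:utf-8
from bs4 import BeautifulSoup

# Invented markup in the shape of the page scraped above
html_doc = """
<ul>
  <li class="title"><a href="/1">Movie One</a></li>
  <li class="title"><a href="/2">Movie Two</a></li>
  <li class="other"><a href="/3">Skipped</a></li>
</ul>
"""
bsObj = BeautifulSoup(html_doc, 'html.parser')
liList = bsObj.find_all('li', {'class': 'title'})  # match tag name and attribute
for li in liList:
    print(li.a.get_text())
```

Only the two <li class="title"> items are returned; the attribute dict filters out the third.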
When a tag has no usable attribute values, work through its parent node

# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://movie.douban.com')
bsObj = BeautifulSoup(html, 'lxml')
liList = bsObj.find_all('li', {'class': 'ui-slide-item'})
for li in liList:
    # children is a collection of child nodes, so iterate over it to inspect each one
    for child in li.children:
        print(child)
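A small offline sketch (invented markup) shows what children yields, and that navigating back up with parent works too:

```python
# coding:utf-8
from bs4 import BeautifulSoup

# Invented markup: an <li> with two children that carry no attributes themselves
html_doc = '<li class="ui-slide-item"><a href="/x">First</a><span>Second</span></li>'
bsObj = BeautifulSoup(html_doc, 'html.parser')
li = bsObj.find('li', {'class': 'ui-slide-item'})

# children is an iterator over the direct child nodes
for child in li.children:
    print(child.name, child.get_text())

# navigating back up: the parent of the <a> tag is the <li> itself
print(li.a.parent.name)  # li
```

Note that on real pages, whitespace between tags also appears among the children as text nodes, which is why iterating and inspecting each child is useful.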
Batch-downloading images with a regular expression

# coding:utf-8
import random
import re
from urllib.request import urlopen, Request, urlretrieve
from bs4 import BeautifulSoup


def get_html(url, headers):
    """Fetch pages that return 403 Forbidden to the default user agent.

    :param url: page URL to fetch
    :param headers: list of User-Agent strings to choose from
    :return: the HTTP response object
    """
    random_header = random.choice(headers)
    req = Request(url)
    req.add_header('User-Agent', random_header)
    req.add_header('GET', url)
    req.add_header('Host', 'tieba.baidu.com')
    req.add_header('Referer', 'http://tieba.baidu.com/p/4792769205')
    html = urlopen(req)
    return html


url = 'http://tieba.baidu.com/p/4792769205'
# Build the headers list below from your own machine's User-Agent
my_headers = ['Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36']
html = get_html(url, my_headers)
bsObj = BeautifulSoup(html, 'lxml')
imageList = bsObj.find_all('img', {'src': re.compile(r'http://imgsrc\.baidu\.com/forum/w%3D580/sign=.+\.jpg')})
for index, image in enumerate(imageList):
    imageUrl = image['src']
    imageLocation = '/home/wangdongdong/test/' + str(index + 1) + '.jpg'
    urlretrieve(imageUrl, imageLocation)
    print('Image', index + 1, 'downloaded')
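The regular expression passed to find_all can be sanity-checked on its own before touching the network. The candidate URLs below are made up in the shape the pattern expects (only the first should match):

```python
# coding:utf-8
import re

# Same pattern used above to match the forum's image URLs (dots escaped)
pattern = re.compile(r'http://imgsrc\.baidu\.com/forum/w%3D580/sign=.+\.jpg')

candidates = [
    'http://imgsrc.baidu.com/forum/w%3D580/sign=abc123/pic.jpg',  # should match
    'http://imgsrc.baidu.com/forum/other/pic.jpg',                # wrong path
    'http://example.com/pic.jpg',                                 # wrong host
]
matched = [url for url in candidates if pattern.match(url)]
print(matched)
```

Testing the pattern against a few hand-written URLs like this catches mistakes (such as unescaped dots matching unintended hosts) before running the full download.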