Python Web Scraping - Beautiful Soup4 (Part 1) - Parsing Local Files
2017-11-22 19:04
1. Installing Beautiful Soup4 (BS4 for short)
Install with either easy_install or pip:
easy_install beautifulsoup4
pip install beautifulsoup4
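After installing, a quick sanity check is to import the package (note the import name is bs4, not beautifulsoup4) and print its version:

```python
# Verify the install: the package imports as "bs4" and exposes __version__.
import bs4

print(bs4.__version__)  # version string of the installed release
```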
2. Installing an HTML Parser
Available parser types include html.parser (bundled with Python), lxml, html5lib, and others.
Install lxml with either easy_install or pip:
easy_install lxml
pip install lxml
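Whichever parser you install, you select it via the second argument to the BeautifulSoup constructor. A minimal sketch (html.parser needs no extra install; the lxml and html5lib lines assume those packages are present, so they are left commented out):

```python
from bs4 import BeautifulSoup

fragment = "<p>hello</p>"

# the second argument names the parser (tree builder) to use
soup = BeautifulSoup(fragment, "html.parser")  # bundled with Python
print(soup.p.string)  # -> hello

# soup = BeautifulSoup(fragment, "lxml")      # fast C-based parser
# soup = BeautifulSoup(fragment, "html5lib")  # most browser-like, slowest
```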
3. Using Beautiful Soup4
The official Chinese documentation is at https://beautifulsoup.readthedocs.io/zh_CN/latest/#replace-with. Create a python.py file with the following content and run it:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<font><!--Hey, buddy. Want to buy a used parser?--></font>
<p class="story">...</p>
"""
# parse HTML from a string; the default parser is Python's built-in "html.parser"
soup = BeautifulSoup(html_doc, "html.parser")
# parse a local HTML file instead:
# soup = BeautifulSoup(open('index.html'), "lxml")
# parsing HTML fetched over the network is covered in the follow-up post
# print(soup.prettify())

# [document]
print(soup.name)
# <title>The Dormouse's story</title>
print(soup.title)
# <class 'bs4.element.Tag'>
print(type(soup.title))
# title
print(soup.title.name)
# The Dormouse's story
print(soup.title.string)
# <class 'bs4.element.NavigableString'>
print(type(soup.title.string))
# Hey, buddy. Want to buy a used parser?
print(soup.font.string)
# <class 'bs4.element.Comment'>
print(type(soup.font.string))
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
print(soup.a.attrs)
# http://example.com/elsie
print(soup.a["href"])
# all <a> tags in the document
print(soup.find_all("a"))
# direct children of <html> (a list)
print(soup.html.contents)
# number of direct children
print(len(soup.html.contents))
# first direct child
print(soup.html.contents[0])
# iterate over direct children
for child in soup.html.children:
    print(child)
# iterate over all descendants, recursively
for child in soup.html.descendants:
    print(child)
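Since this post is about parsing local files, the commented-out open('index.html') branch above can be sketched as a runnable example (using html.parser to avoid the lxml dependency; the index.html written below is a throwaway file created just for the demo):

```python
from bs4 import BeautifulSoup

# create a throwaway local HTML file for the demo
with open("index.html", "w", encoding="utf-8") as f:
    f.write("<html><head><title>Local demo</title></head><body></body></html>")

# pass an open file object to the BeautifulSoup constructor
with open("index.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

print(soup.title.string)  # -> Local demo
```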