
Python Web Scraping - Beautiful Soup4 (Part 1) - Parsing Local Files

2017-11-22 19:04



1. Installing Beautiful Soup4 (BS4 for short)

Install with pip or easy_install:

easy_install beautifulsoup4
pip install beautifulsoup4
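
To confirm the installation worked, a quick check along these lines may help. It is only a minimal sketch: it imports the package and prints the version string it reports, which will differ from machine to machine.

```python
import bs4

# If this import succeeds, Beautiful Soup4 is installed for this interpreter
version = bs4.__version__
print("Beautiful Soup version:", version)
```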


2. Installing an HTML parser

Available parsers: html.parser (bundled with Python), lxml, html5lib

Install lxml with pip or easy_install:

easy_install lxml
pip install lxml
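
Since lxml and html5lib are optional, a script may want to fall back to html.parser when they are missing. The following is a minimal sketch of that idea, not part of the original post: make_soup is a hypothetical helper name, and it relies on Beautiful Soup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: try each parser in order, return (soup, parser_name)."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("no usable HTML parser found")


soup, parser_used = make_soup("<p>Hello <b>world</b></p>")
print(parser_used)    # which parser was actually picked depends on what is installed
print(soup.b.string)
```

Because html.parser ships with Python, the last entry in the list should always be available.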


3. Using Beautiful Soup4

https://beautifulsoup.readthedocs.io/zh_CN/latest/#replace-with

Create a file named python.py with the following content, then run it:

from bs4 import BeautifulSoup

html_doc = """

<html><head><title>The Dormouse's story</title></head>

    <body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<font><!--Hey, buddy. Want to buy a used parser?--></font>

<p class="story">...</p>

"""

# parse HTML from a string; "html.parser" is the parser bundled with Python

soup = BeautifulSoup(html_doc, "html.parser")

# parse HTML from a local file (requires lxml)

#soup = BeautifulSoup(open('index.html'), "lxml")

# parse HTML from the network: pass the downloaded markup in place of html_doc

#print (soup.prettify())

#[document]

print (soup.name)

#<title>The Dormouse's story</title>

print (soup.title)

#<class 'bs4.element.Tag'>

print (type(soup.title))

#title

print (soup.title.name)

#The Dormouse's story

print (soup.title.string)

#<class 'bs4.element.NavigableString'> 

print (type(soup.title.string))

#Hey, buddy. Want to buy a used parser?

print (soup.font.string)

#<class 'bs4.element.Comment'> 

print (type(soup.font.string))

#{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

print (soup.a.attrs)

#http://example.com/elsie

print (soup.a["href"])

# list of all <a> tags in the document

print (soup.find_all("a"))

# direct children of <html>, as a list

print (soup.html.contents)

# number of direct children of <html>

print (len(soup.html.contents))

# first direct child of <html>

print (soup.html.contents[0])

# iterate over the direct children of <html>

for child in soup.html.children:

    print (child)

# iterate over all descendants of <html>, recursively

for child in soup.html.descendants:

    print (child)
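
The script above leaves the local-file case commented out. As a hedged sketch of how that case might look end to end, the example below writes a small sample page to a temporary directory, parses it from the file handle, and collects the link text and href of each <a> tag. The sample markup, the temporary path, and the variable names are assumptions made for illustration. It uses html.parser so that no extra parser needs to be installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample page, written to a temporary directory so the sketch runs anywhere
sample_html = """<html><head><title>Local page</title></head>
<body>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</body></html>"""

path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample_html)

# Open the local file with an explicit encoding; the with-block closes the handle
with open(path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

title = soup.title.string
# Collect (link text, href) pairs from every <a> tag
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(title)
for text, href in links:
    print(text, href)
```

Passing the open file object directly lets Beautiful Soup read the markup itself. Opening the file in a with-block avoids leaving the handle open, which the bare open('index.html') call in the script would do.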