Data Science Engineer Interview Guide, Part 1: Python Web Scraping
2017-02-13 14:32
1. Understanding how a web page is built
HTML == structure; CSS == style; JavaScript == behavior.
<div></div> marks a region of the page; <p></p> holds text content; <li></li> is a list item; <img> is an image (it has no closing tag); <h1>...<h6> are headings of decreasing size; <a href=""> is a link. A typical page is laid out as header + content + footer.
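To make the building blocks concrete, here is a minimal sketch that parses a small hand-written page (the markup and file names are invented for illustration) and reads a heading, an image source, and a link target:

```python
from bs4 import BeautifulSoup

# A tiny page using the tags described above (hypothetical markup).
html = """
<div id="header"><h1>My Page</h1></div>
<div id="content">
  <p>Hello, world</p>
  <ul><li>item one</li><li>item two</li></ul>
  <img src="logo.png">
  <a href="https://example.com">a link</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')  # built-in parser, no extra install
print(soup.h1.get_text())   # heading text: My Page
print(soup.img.get('src'))  # image source attribute: logo.png
print(soup.a.get('href'))   # link target: https://example.com
```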
==================================================================================================
2. Parsing the elements of a web page
Step 1: parse the page with BeautifulSoup:
Soup = BeautifulSoup(html, 'lxml')
BeautifulSoup supports four parsers: 'html.parser' (built in), 'lxml' (fast HTML), 'lxml-xml' (XML), and 'html5lib' (the most lenient).
Two ways to describe where an element lives: CSS Selector and XPath (Soup.select() takes CSS selectors).
Step 2: describe where the data you want is located:
... = Soup.select()
Step 3: pull the information out of the tags and pack it into a dictionary, e.g. from <p>Something</p> you might build title = 'Something', rate = 4.0.
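The three steps can be run end-to-end on an inline snippet; the class names here are made up for the demo:

```python
from bs4 import BeautifulSoup

html = '<div class="item"><p class="title">Something</p><span class="rate">4.0</span></div>'

# Step 1: parse the page
soup = BeautifulSoup(html, 'html.parser')
# Step 2: describe where the targets are (CSS selectors)
titles = soup.select('p.title')
rates = soup.select('span.rate')
# Step 3: pull text out of the tags and pack it into a dict
for title, rate in zip(titles, rates):
    data = {'title': title.get_text(), 'rate': float(rate.get_text())}
    print(data)   # {'title': 'Something', 'rate': 4.0}
```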
==================================================================================================
3. Parsing real-world web pages
Scraping Tripadvisor with Requests + BeautifulSoup.
Step 1: the exchange mechanism between the server and your machine.
HTTP is a request/response protocol.
HTTP/1.0 methods: GET, POST, HEAD
HTTP/1.1 methods: GET, POST, HEAD, PUT, OPTIONS, CONNECT, TRACE, DELETE
When scraping, the request methods you will use are GET and POST.
A request looks like:
GET /page_one.html HTTP/1.1
Host: www.sample.com
The response carries a status code, e.g. status_code: 200.
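The round trip can be observed without leaving the machine: a sketch using only the standard library, which starts a tiny HTTP server on localhost and issues the GET request shown above:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)          # status_code: 200
        self.end_headers()
        self.wfile.write(b'hello')
    def log_message(self, fmt, *args):   # silence per-request logging
        pass

server = HTTPServer(('127.0.0.1', 0), Handler)   # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection('127.0.0.1', server.server_port)
conn.request('GET', '/page_one.html')            # sends: GET /page_one.html HTTP/1.1
resp = conn.getresponse()
body = resp.read()
print(resp.status, body)                         # 200 b'hello'
server.shutdown()
```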
Step 2: how to parse a real page.
from bs4 import BeautifulSoup
import requests

# Fetch the whole page
url = "http://..."
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')
print(soup)

# Inspect an element in the browser, then copy its CSS selector
titles = soup.select('div.property_title > a[target="_blank"]')  # all listing titles
imgs = soup.select('img[width="160"]')                           # listing images
cates = soup.select('div.path_reasoning_v2')                     # listing categories
print(titles, imgs, cates)
for title, img, cate in zip(titles, imgs, cates):
    data = {
        'title': title.get_text(),
        'img': img.get('src'),
        'cate': list(cate.stripped_strings),
    }
    print(data)
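What stripped_strings yields is every text fragment inside a tag with surrounding whitespace removed, skipping whitespace-only nodes; a small sketch on made-up markup mirroring the cate div above:

```python
from bs4 import BeautifulSoup

# Invented markup standing in for a category div on the real page.
html = '<div class="cate">\n  <span> Museums </span>\n  <span> Parks </span>\n</div>'
cate = BeautifulSoup(html, 'html.parser').select('div.cate')[0]
print(list(cate.stripped_strings))   # ['Museums', 'Parks']
```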
# The cookie identifies you to the server; it is needed to scrape your saved items
headers = {
    'User-Agent': '...........................................................................',
    'Cookie': ' ......................................................................................',
}
url_saves = 'https://cn.tripadvisor.com/Saves%37685322'
wb_data = requests.get(url_saves, headers=headers)
soup = BeautifulSoup(wb_data.text, 'lxml')
titles = soup.select('a.location-name')
imgs = soup.select('img.photo_image')
metas = soup.select('span.format_address')
for title, img, meta in zip(titles, imgs, metas):
    data = {
        'title': title.get_text(),
        'img': img.get('src'),
        'meta': list(meta.stripped_strings),
    }
    print(data)
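Attaching identity headers can be checked without hitting the network: the standard-library urllib lets us build the request object and inspect what would be sent. The header values below are placeholders, not real credentials:

```python
import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (placeholder)',
    'Cookie': 'session=placeholder',
}
req = urllib.request.Request('https://cn.tripadvisor.com/', headers=headers)
# urllib normalizes header names with str.capitalize()
print(req.get_header('User-agent'))   # Mozilla/5.0 (placeholder)
print(req.get_header('Cookie'))       # session=placeholder
```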
from bs4 import BeautifulSoup
import requests
import time

url_saves = 'http://.....'
url = 'http://...'
urls = ['.....'.format(str(i)) for i in range(30, 938, 30)]
headers = {......}

def get_attractions(url, data=None):
    wb_data = requests.get(url)
    time.sleep(2)  # pause two seconds between requests to avoid anti-scraping blocks
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('div.property_title > a[target="_blank"]')
    imgs = soup.select('img[width="160"]')
    cates = soup.select('div.pl3n_reasoning_v2')
    for title, img, cate in zip(titles, imgs, cates):
        data = {
            'title': title.get_text(),
            'img': img.get('src'),
            'cate': list(cate.stripped_strings),
        }
        print(data)

def get_favs(url, data=None):
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('a.location-name')
    imgs = soup.select('div.photo > div.sizedThumb > img.photo_image')
    metas = soup.select('span.format_address')
    if data is None:
        for title, img, meta in zip(titles, imgs, metas):
            data = {
                'title': title.get_text(),
                'img': img.get('src'),
                'meta': list(meta.stripped_strings),
            }
            print(data)

# get_attractions(url)
# get_favs(url_saves)
# print(urls)
for single_url in urls:
    get_attractions(single_url)
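The urls list works because Tripadvisor pages its listings in offsets of 30, and str.format splices each offset into the URL template. A sketch with an illustrative template (not the site's real path):

```python
# Hypothetical URL template; the real one comes from the site's pagination links.
template = 'https://example.com/Attractions-oa{}.html'
urls = [template.format(str(i)) for i in range(30, 938, 30)]
print(len(urls))   # 31 pages: offsets 30, 60, ..., 930
print(urls[0])     # https://example.com/Attractions-oa30.html
```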
# Scraping the mobile version of the site
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': '......'
}
url = '......'
wb_data = requests.get(url, headers=headers)
soup = BeautifulSoup(wb_data.text, 'lxml')
imgs = soup.select('div.thumb.thumbLLR.soThumb > img')
for i in imgs:
    print(i.get('src'))
# print(soup)
==================================================================================================
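Pulling image URLs out of a page reduces to selecting the <img> tags and reading their src attribute with .get(); a sketch on invented markup:

```python
from bs4 import BeautifulSoup

# Made-up thumbnails standing in for the mobile page's image divs.
html = ('<div class="thumb"><img src="a.jpg"></div>'
        '<div class="thumb"><img src="b.jpg"></div>')
soup = BeautifulSoup(html, 'html.parser')
imgs = soup.select('div.thumb > img')
srcs = [i.get('src') for i in imgs]
print(srcs)   # ['a.jpg', 'b.jpg']
```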