
Data Science Engineer Interview Guide, Part 1: Python Web Scraping

2017-02-13 14:32
1. Understanding how a web page is built

HTML == structure; CSS == style; JavaScript == behavior (functionality);

<div></div> marks a region (block) of the page;
<p></p> holds paragraph content;
<li></li> is a list item;
<img> is an image (a void element with no closing tag);
<h1></h1> through <h6></h6> are headings of decreasing size;
<a href=""></a> is a link in the page;
a typical layout is header + content + footer.
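These building blocks can be seen in a minimal hand-written page. The sketch below (all names and content invented for illustration) parses one with BeautifulSoup, the library the rest of this article relies on:

```python
from bs4 import BeautifulSoup

# A minimal hand-written page: header + content + footer
html = """
<div id="header"><h1>My Site</h1></div>
<div id="content">
  <p>Welcome to the page.</p>
  <ul><li>First item</li><li>Second item</li></ul>
  <img src="logo.png">
  <a href="https://example.com">a link</a>
</div>
<div id="footer"><p>Copyright</p></div>
"""

soup = BeautifulSoup(html, 'html.parser')  # stdlib parser, no extra install needed
print(soup.h1.get_text())                  # heading text: My Site
print(soup.a.get('href'))                  # link target: https://example.com
print([li.get_text() for li in soup.select('li')])
```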

==================================================================================================

2. Extracting elements from a page

Step 1: parse the page with BeautifulSoup

Soup = BeautifulSoup(html,'lxml')

BeautifulSoup supports several parsers: 'html.parser' (built in), 'lxml' (fast HTML parser), 'lxml-xml' (XML), and 'html5lib' (most lenient).

There are two common ways to describe where an element lives: CSS selectors and XPath.

Step 2: describe where the thing you want to scrape lives

... = Soup.select('...')


Step 3: extract the information you need from the tags and pack it into a dictionary

<p>Something</p>

title = Something
rate = 4.0
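Putting the three steps together on an in-memory snippet (the HTML, class names, and values here are invented; a real run would download the page first):

```python
from bs4 import BeautifulSoup

# Step 1: parse (an invented snippet stands in for a downloaded page;
# 'html.parser' is used so no extra parser needs to be installed)
html = '<div class="movie"><p class="title">Something</p><span class="rate">4.0</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# Step 2: describe where the targets live, using CSS selectors
titles = soup.select('p.title')
rates = soup.select('span.rate')

# Step 3: pull the text out of the tags and pack it into a dictionary
for t, r in zip(titles, rates):
    data = {'title': t.get_text(), 'rate': float(r.get_text())}
    print(data)  # {'title': 'Something', 'rate': 4.0}
```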

==================================================================================================

3. Parsing real-world pages

Use Requests + BeautifulSoup to scrape Tripadvisor.

Step 1: how the server and the client exchange data

HTTP is a request/response protocol.

HTTP/1.0 methods: GET, POST, HEAD
HTTP/1.1 methods: GET, POST, HEAD, PUT, OPTIONS, CONNECT, TRACE, DELETE

For scraping, the two methods you will meet are GET and POST.

An example request:

GET /page_one.html HTTP/1.1
Host: www.sample.com

and its response:

status_code: 200
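The requests library composes exactly this kind of message for you. A sketch that builds (without sending) the GET request above, so the method and URL can be inspected offline:

```python
import requests

# Build the GET request from the example without sending it over the network
req = requests.Request('GET', 'http://www.sample.com/page_one.html').prepare()
print(req.method)  # GET
print(req.url)     # http://www.sample.com/page_one.html

# Actually sending it would yield a Response object:
# resp = requests.Session().send(req)
# print(resp.status_code)  # 200 on success
```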
Step 2: how to parse a real page

from bs4 import BeautifulSoup
import requests

# Fetch the whole page
url = "http://..."
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')
print(soup)

# Right-click an element, choose Inspect, then Copy > CSS selector
titles = soup.select('div.property_title > a[target="_blank"]')  # all titles on the page
imgs = soup.select('img[width="160"]')                           # all thumbnails on the page
cates = soup.select('div.path_reasoning_v2')                     # all categories on the page
print(titles, imgs, cates)
for title, img, cate in zip(titles, imgs, cates):
    data = {
        'title': title.get_text(),
        'img': img.get('src'),
        'cate': list(cate.stripped_strings),
    }
    print(data)

# The Cookie identifies your session; it is needed to scrape your saved items
headers = {
    'User-Agent': '...........................................................................',
    'Cookie': ' ......................................................................................'
}
url_saves = 'https://cn.tripadvisor.com/Saves%37685322'
wb_data = requests.get(url_saves, headers=headers)
soup = BeautifulSoup(wb_data.text, 'lxml')
titles = soup.select('a.location-name')
imgs = soup.select('img.photo_image')
metas = soup.select('span.format_address')
for title, img, meta in zip(titles, imgs, metas):
    data = {
        'title': title.get_text(),
        'img': img.get('src'),
        'meta': list(meta.stripped_strings),
    }
    print(data)
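The stripped_strings generator used for the 'cate' and 'meta' fields walks a tag's text nodes and yields each one with surrounding whitespace removed (the fragment and class name below are invented):

```python
from bs4 import BeautifulSoup

# A fragment shaped like the category <div>s selected above
html = '<div class="cate">  Museums\n  <span> History </span> <span> Art </span></div>'
soup = BeautifulSoup(html, 'html.parser')

# get_text() concatenates everything, whitespace included;
# stripped_strings yields the individual cleaned pieces
print(list(soup.div.stripped_strings))  # ['Museums', 'History', 'Art']
```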
from bs4 import BeautifulSoup
import requests
import time

url_saves = 'http://.....'
url = 'http://...'
urls = ['.....'.format(str(i)) for i in range(30, 938, 30)]

headers = {......}

def get_attractions(url, data=None):
    wb_data = requests.get(url)
    time.sleep(2)  # pause between requests to avoid triggering anti-scraping measures
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('div.property_title > a[target="_blank"]')
    imgs = soup.select('img[width="160"]')
    cates = soup.select('div.pl3n_reasoning_v2')
    for title, img, cate in zip(titles, imgs, cates):
        data = {
            'title': title.get_text(),
            'img': img.get('src'),
            'cate': list(cate.stripped_strings),
        }
        print(data)

def get_favs(url, data=None):
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('a.location-name')
    imgs = soup.select('div.photo > div.sizedThurb > img.photo_image')
    metas = soup.select('span.format_address')
    if data is None:
        for title, img, meta in zip(titles, imgs, metas):
            data = {
                'title': title.get_text(),
                'img': img.get('src'),
                'meta': list(meta.stripped_strings),
            }
            print(data)

# get_attractions(url)
# get_favs(url_saves)
# print(urls)

for single_url in urls:
    get_attractions(single_url)
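The urls list comprehension above generates one URL per results page; Tripadvisor list pages advance by an offset of 30. With a hypothetical template (the real template is elided in the original), the pattern looks like:

```python
# Hypothetical URL template; the real one is elided in the article
template = 'https://example.com/attractions-oa{}.html'
urls = [template.format(i) for i in range(30, 938, 30)]  # offsets 30, 60, ..., 930

print(len(urls))  # 31 pages
print(urls[0])    # https://example.com/attractions-oa30.html
print(urls[-1])   # https://example.com/attractions-oa930.html
```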
# Scrape the mobile version of the site (set a mobile User-Agent)
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': '......'
}
url = '......'
wb_data = requests.get(url, headers=headers)
soup = BeautifulSoup(wb_data.text, 'lxml')
imgs = soup.select('div.thurb.thurbLLR.soThurb > img')
for i in imgs:
    print(i.get('src'))
# print(soup)

==================================================================================================