
Data Science Engineer Interview Handbook Series, Part 1: Python Web Scraping in Practice

2017-03-02 10:56

1. Data extraction, transformation, and loading (ETL: extract/transform/load):

Raw data ==> ETL script ==> tidy (structured) data
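
The raw data → ETL script → tidy data flow can be illustrated with a small sketch that uses only the standard library. The raw lines, the field names, and the `extract`/`transform`/`load` helper names are all made up for illustration; a real crawler would extract from a web page instead.

```python
import csv
import io

# Hypothetical raw input: loosely formatted lines (not from a real site).
RAW = """
2016-08-20 | Headline A |  http://example.com/a
2016-08-21|Headline B|http://example.com/b
"""

def extract(raw):
    """Extract: split the raw text into non-empty lines."""
    return [line for line in raw.splitlines() if line.strip()]

def transform(lines):
    """Transform: turn each line into a dict with fixed field names."""
    rows = []
    for line in lines:
        date, title, url = (part.strip() for part in line.split("|"))
        rows.append({"date": date, "title": title, "url": url})
    return rows

def load(rows):
    """Load: write the rows as CSV text (a stand-in for a real data store)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["date", "title", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = transform(extract(RAW))
csv_text = load(rows)
print(csv_text)
```

Keeping the three stages as separate functions makes it easy to swap the extract step for an HTTP fetch later without touching the rest.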

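2. Web crawler: converts unstructured web page data into structured information

3. Web crawler architecture:

                                               ======> request

Data center <== Data parser <== Web connector                   Web page

                                               <====== response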

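4. Using the developer tools

Right-click on the page -> Inspect

5. Observe the HTTP request and response: select the Network tab, click Doc, then click china/

6. What is GET: the content of a GET request is written in the request header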
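
Because a GET request has no separate body, its parameters travel in the URL itself. The sketch below shows this with the standard library; the endpoint and parameter names are hypothetical and only serve as an example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and parameters, for illustration only.
base = "http://example.com/search"
params = {"q": "python", "page": 2}

# A GET request encodes its parameters into the URL's query string.
url = base + "?" + urlencode(params)
print(url)

# The same information can be read back out of the URL.
parts = urlsplit(url)
query = parse_qs(parts.query)
print(parts.path, query)
```

This is the same URL you would see in the Network tab for such a request.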

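7. Before writing a web crawler:

(1) Install the packages with pip: pip install requests, pip install BeautifulSoup4, pip install jupyter (then run jupyter notebook to open the notebook)

(2) Chrome users: use the built-in developer tools

(3) Firefox users: install Firebug

8. Requests:

Requests: (1) a package for fetching network resources (URLs); (2) it improves on the shortcomings of urllib2 and lets users fetch network resources in the simplest way; (3) it supports REST operations (POST, PUT, GET, DELETE) on network resources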


import requests

newsurl='http://news.sina.com.cn/china/'

res=requests.get(newsurl)

print(res.text)

8.DOM Tree


<html>

<body>

<h1 id="title">Hello World</h1>

<a href="#" class="link">This is link1</a>

<a href="#link2" class="link">This is link2</a>

</body>

</html>
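
To see why this markup is called a tree, the following sketch walks the same HTML with the standard library's `html.parser` and records each element indented by its nesting depth. The `TreePrinter` class name is made up for this example; it is not part of BeautifulSoup.

```python
from html.parser import HTMLParser

HTML = """
<html>
<body>
<h1 id="title">Hello World</h1>
<a href="#" class="link">This is link1</a>
<a href="#link2" class="link">This is link2</a>
</body>
</html>
"""

class TreePrinter(HTMLParser):
    """Collects one indented line per element, mirroring the DOM nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        attr_text = " ".join(f'{k}="{v}"' for k, v in attrs)
        label = f"{tag} [{attr_text}]" if attr_text else tag
        self.lines.append("  " * self.depth + label)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

printer = TreePrinter()
printer.feed(HTML)
print("\n".join(printer.lines))
```

The h1 and the two a elements end up at the same depth, as children of body.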

9. BeautifulSoup example

Load the page into BeautifulSoup:

from bs4 import BeautifulSoup

# Each trailing backslash continues the string literal onto the next line
html_sample = ' \
<html> \
<body> \
<h1 id="title">Hello World</h1> \
<a href="#" class="link">This is link1</a> \
<a href="#link2" class="link">This is link2</a> \
</body> \
</html>'

soup = BeautifulSoup(html_sample, 'html.parser')
print(soup.text)

10. Find all HTML elements with a given tag

Use select to find the elements with the h1 tag:

soup = BeautifulSoup(html_sample, 'html.parser')
header = soup.select('h1')
print(header)
print(header[0])
print(header[0].text)

Use select to find the elements with the a tag:

soup = BeautifulSoup(html_sample,'html.parser')

alink = soup.select('a')

print(alink)

for link in alink:

 #print(link)

 print(link.text)

11. Get elements with specific CSS attributes

Use select to find all elements whose id is title (prefix the id with #):

alink = soup.select('#title')

print(alink)

Use select to find all elements whose class is link (prefix the class with .):

soup = BeautifulSoup(html_sample, 'html.parser')

for link in soup.select('.link'):
    print(link)

Get the links inside all a tags

Use select to find the href of every a tag:

alinks = soup.select('a')

for link in alinks:
    print(link['href'])

a = '<a href="#" qoo=123 abc=456> i am a link</a>'
soup2 = BeautifulSoup(a, 'html.parser')
print(soup2.select('a')[0]['href'])
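
A tag's attributes can be read like a dictionary. As a small sketch (assuming BeautifulSoup is installed as described in the prerequisites), the following reuses the sample link and contrasts square-bracket access with `.get()`; the attribute name `missing` is hypothetical and chosen because the tag does not have it.

```python
from bs4 import BeautifulSoup

a = '<a href="#" qoo=123 abc=456> i am a link</a>'
tag = BeautifulSoup(a, 'html.parser').select('a')[0]

# All attributes of the tag, as a dictionary.
print(tag.attrs)

# Square brackets raise KeyError for a missing attribute;
# .get() returns None (or a default) instead.
print(tag['qoo'])
print(tag.get('missing'))
print(tag.get('missing', 'n/a'))
```

Using `.get()` is convenient when crawling real pages, where some links may lack an href.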


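12. Connect to the Sina News page

13. Find the CSS selector for an element

Chrome developer tools; Firefox developer tools; InfoLite: https://chrome.google.com/webstore/detail/infolite/ipjbadabbpedegielkhgpiekdlmfpgal

14. Observe where the target elements are located

15. Extract content according to the HTML tags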

import requests

from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/china/')

res.encoding = 'utf-8'

soup = BeautifulSoup(res.text,'html.parser')



for news in soup.select('.news-item'):

     if len(news.select('h2'))>0:

         h2 = news.select('h2')[0].text

         time = news.select('.time')[0].text

         a=news.select('a')[0]['href']

         print(time,h2,a)
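
The loop above prints each item; to obtain structured data, the same selectors can fill a list of dictionaries instead. The sketch below runs offline on hypothetical markup that only imitates the `.news-item` structure (the real page may differ), and assumes BeautifulSoup is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the list structure used above.
html = """
<div class="news-item"><h2><a href="http://example.com/1">Title 1</a></h2>
  <div class="time">8月20日 13:27</div></div>
<div class="news-item"><div class="time">no headline here</div></div>
"""

soup = BeautifulSoup(html, 'html.parser')
records = []
for news in soup.select('.news-item'):
    h2 = news.select('h2')
    if not h2:
        continue  # skip items without a headline, as the loop above does
    records.append({
        'title': h2[0].text,
        'time': news.select('.time')[0].text,
        'url': news.select('a')[0]['href'],
    })
print(records)
```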

16. Fetch the article body

import requests

from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/c/nd/2016-08-20/doc-ifxvctcc8121090.shtml')

res.encoding = 'utf-8'

print(res.text)

soup = BeautifulSoup(res.text,'html.parser')

17. Extract the title

soup.select('#artibodyTitle')[0].text

18. Extract the time and source

# Source name (the link inside the .time-source element)
soup.select('.time-source span a')[0].text

timesource = soup.select('.time-source')[0].contents[0].strip()
timesource

# Converting the time string
from datetime import datetime

# String to datetime: strptime
dt = datetime.strptime(timesource, '%Y年%m月%d日%H:%M')
dt

# Datetime to string: strftime
dt.strftime('%Y-%m-%d')
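
The two conversions can be tried without fetching the page. This sketch uses a hypothetical timestamp in the same shape as the page's time string and also shows what happens when the separator in the input does not match the format.

```python
from datetime import datetime

# Hypothetical timestamp in the same shape as the page's time string.
timesource = '2016年08月20日13:27'

# strptime: parse text into a datetime; literal characters in the
# format (年, 月, 日, :) must match the input exactly.
dt = datetime.strptime(timesource, '%Y年%m月%d日%H:%M')
print(dt)

# strftime: format the datetime back into text.
print(dt.strftime('%Y-%m-%d'))
print(dt.strftime('%Y-%m-%d %H:%M'))

# A mismatched separator (full-width colon here) raises ValueError.
try:
    datetime.strptime('2016年08月20日13：27', '%Y年%m月%d日%H:%M')
    matched = True
except ValueError:
    matched = False
print(matched)
```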

19. Get the article body

Append each paragraph to a list:

article = []

for p in soup.select('#artibody p')[:-1]:

 article.append(p.text.strip())

#print(article)

''.join(article)

#'@'.join(article)

A more concise version:

''.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])

20. Get the editor's name

editor = soup.select('.article-editor')[0].text.strip('责任编辑：')
editor
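
One caveat with `strip`: its argument is a set of characters, not a fixed prefix. The sketch below uses a made-up editor name to show how the name itself can be trimmed, and one possible alternative that removes only the label.

```python
# Hypothetical text in the shape "label + name"; the name is made up.
text = '责任编辑：编辑部'
label = '责任编辑：'

# strip() treats its argument as a set of characters, so it also eats
# matching characters from the name itself.
print(text.strip(label))

# Removing the label as a prefix keeps the name intact.
editor = text[len(label):] if text.startswith(label) else text
print(editor)
```

For most names the two approaches give the same result; they differ only when the name begins or ends with a character that also appears in the label.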


21. Get the comment count



<pre name="code" class="html"><pre name="code" class="python">soup.select('#commentCount1')</pre>


<pre name="code" class="html">Finding where the comments come from:</pre>



<pre name="code" class="html"><pre name="code" class="python">import json

comments = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-fxvctcc8121090&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20')</pre>


22. Parse the news identifier

How to get the news ID:



<pre name="code" class="python">newsurl = 'http://news.sina.com.cn/c/nd/2016-08-20/doc-ifxvctcc8121090.shtml'

newsid = newsurl.split('/')[-1].strip('.shtml').lstrip('doc-i')

newsid</pre>



How to get the news ID (using a regular expression):



<pre name="code" class="python">import re

# Escape the dot so that only a literal ".shtml" matches
m = re.search(r'doc-i(.+)\.shtml', newsurl)

print(m.group(1))</pre>
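
The two approaches can be compared side by side. In this sketch, `get_news_id` is a hypothetical helper name and `doc-icabc123.shtml` is a made-up file name; it is chosen to show that chained `strip`/`lstrip` calls remove sets of characters and can over-trim, while the regular expression captures exactly the text between the markers.

```python
import re

def get_news_id(url):
    """Return the ID between 'doc-i' and '.shtml', or None if absent."""
    m = re.search(r'doc-i(.+)\.shtml', url)
    return m.group(1) if m else None

newsurl = 'http://news.sina.com.cn/c/nd/2016-08-20/doc-ifxvctcc8121090.shtml'
print(get_news_id(newsurl))
print(get_news_id('http://news.sina.com.cn/china/'))

# strip()/lstrip() remove any of the given characters, not a fixed
# prefix or suffix, so the chained-strip approach can over-trim.
name = 'doc-icabc123.shtml'
print(name.strip('.shtml').lstrip('doc-i'))
print(get_news_id(name))
```

Returning None for a URL without an ID also avoids an AttributeError when the pattern does not match.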



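23. Wrap the comment-count logic in a function





<pre name="code" class="python">import re
import json
import requests

# The URL is elided here; it needs a {} placeholder where the news ID goes
commentURL = 'http:comment5.news.sina.......'

def getCommentCount(newsurl):
    m = re.search(r'doc-i(.+)\.shtml', newsurl)
    newsid = m.group(1)
    comments = requests.get(commentURL.format(newsid))
    # The response is "var data={...}"; drop the prefix and parse the JSON
    jd = json.loads(comments.text.strip('var data='))
    return jd['result']['count']['total']</pre>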
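
The parsing step can be tried offline. The response body below is hypothetical: it follows the `var data={...}` shape and the `result` / `count` / `total` keys used by the function, but the number is invented.

```python
import json

# Hypothetical response body in the 'var data={...}' shape; the number
# is invented and only the keys used by the function are included.
body = 'var data={"result": {"count": {"total": 42}}}'

# Drop the JavaScript assignment prefix, then parse the rest as JSON.
prefix = 'var data='
payload = body[len(prefix):] if body.startswith(prefix) else body
jd = json.loads(payload)
print(jd['result']['count']['total'])
```

Slicing off the exact prefix is a slightly stricter alternative to `strip('var data=')`, which removes a set of characters from both ends.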



24. Wrap the article-detail logic in a function



<pre name="code" class="python">import requests
from bs4 import BeautifulSoup
from datetime import datetime

def getNewsDetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select('#artibodyTitle')[0].text
    result['newssource'] = soup.select('.time-source span a')[0].text
    timesource = soup.select('.time-source')[0].contents[0].strip()
    result['dt'] = datetime.strptime(timesource, '%Y年%m月%d日%H:%M')
    result['article'] = ''.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
    result['comments'] = getCommentCount(newsurl)
    return result</pre>




 





</pre></pre>
