您的位置:首页 > 编程语言 > Python开发

Python爬虫教程——实战一之爬取糗事百科段子

2015-12-15 00:32 751 查看
大家好,前面入门已经说了那么多基础知识了,下面我们做几个实战项目来挑战一下吧。那么这次为大家带来,Python爬取糗事百科的小段子的例子。
首先,糗事百科大家都听说过吧?糗友们发的搞笑的段子一抓一大把,这次我们尝试一下用爬虫把他们抓取下来。

友情提示

糗事百科在前一段时间进行了改版,导致之前的代码没法用了,会导致无法输出和CPU占用过高的情况,是因为正则表达式没有匹配到的缘故。
现在,博主已经对程序进行了重新修改,代码亲测可用,包括截图和说明,之前一直在忙所以没有及时更新,望大家海涵!
更新时间:2015/8/2

本篇目标

1.抓取糗事百科热门段子

2.过滤带有图片的段子

3.实现每按一次回车显示一个段子的发布时间,发布人,段子内容,点赞数。

糗事百科是不需要登录的,所以也没必要用到Cookie,另外糗事百科有的段子是附图的,我们把图抓下来图片不便于显示,那么我们就尝试过滤掉有图的段子吧。
好,现在我们尝试抓取一下糗事百科的热门段子吧,每按下一次回车我们显示一个段子。

1.确定URL并抓取页面代码

首先我们确定好页面的URL是 http://www.qiushibaike.com/hot/page/1,其中最后一个数字1代表页数,我们可以传入不同的值来获得某一页的段子内容。
我们初步构建如下的代码来打印页面代码内容试试看,先构造最基本的页面抓取方式,看看会不会成功
<code class="hljs python has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">\<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># -*- coding:utf-8 -*-</span>
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> urllib
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> urllib2

page = <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>
url = <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'http://www.qiushibaike.com/hot/page/'</span> + str(page)
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">try</span>:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">print</span> response.read()
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">except</span> urllib2.URLError, e:
    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> hasattr(e,<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"code"</span>):
        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">print</span> e.code
    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> hasattr(e,<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"reason"</span>):
        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">print</span> e.reason</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li></ul>
运行程序,哦不,它竟然报错了,真是时运不济,命途多舛啊
<code class="hljs livecodeserver has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">line</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">373</span>, <span class="hljs-operator" style="box-sizing: border-box;">in</span> <span class="hljs-title" style="box-sizing: border-box;">_read</span>_status
 raise BadStatusLine(<span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">line</span>)
httplib.BadStatusLine: <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">''</span></code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li></ul>
好吧,应该是headers验证的问题,我们加上一个headers验证试试看吧,将代码修改如下
<code class="hljs python has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">\<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># -*- coding:utf-8 -*-</span>
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> urllib
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> urllib2

page = <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>
url = <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'http://www.qiushibaike.com/hot/page/'</span> + str(page)
user_agent = <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'</span>
headers = { <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'User-Agent'</span> : user_agent }
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">try</span>:
    request = urllib2.Request(url,headers = headers)
    response = urllib2.urlopen(request)
    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">print</span> response.read()
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">except</span> urllib2.URLError, e:
    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> hasattr(e,<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"code"</span>):
        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">print</span> e.code
    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> hasattr(e,<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"reason"</span>):
        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">print</span> e.reason</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li></ul>

嘿嘿,这次运行终于正常了,打印出了第一页的HTML代码,大家可以运行下代码试试看。在这里运行结果太长就不贴了。

2.提取某一页的所有段子

好,获取了HTML代码之后,我们开始分析怎样获取某一页的所有段子。
首先我们审查元素看一下,按浏览器的F12,截图如下



我们可以看到,每一个段子都是
<code class="hljs handlebars has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="xml" style="box-sizing: border-box;">content = response.read().decode('utf-8')
pattern = re.compile('<span class="hljs-tag" style="color: rgb(0, 102, 102); box-sizing: border-box;"><<span class="hljs-title" style="box-sizing: border-box; color: rgb(0, 0, 136);">div.*?author"</span>></span>.*?<span class="hljs-tag" style="color: rgb(0, 102, 102); box-sizing: border-box;"><<span class="hljs-title" style="box-sizing: border-box; color: rgb(0, 0, 136);">a.*?</span><<span class="hljs-attribute" style="box-sizing: border-box; color: rgb(102, 0, 102);">img.</span>*?></span>(.*?)<span class="hljs-tag" style="color: rgb(0, 102, 102); box-sizing: border-box;"></<span class="hljs-title" style="box-sizing: border-box; color: rgb(0, 0, 136);">a</span>></span>.*?<span class="hljs-tag" style="color: rgb(0, 102, 102); box-sizing: border-box;"><<span class="hljs-title" style="box-sizing: border-box; color: rgb(0, 0, 136);">div.*?'+
</span>                         '<span class="hljs-attribute" style="box-sizing: border-box; color: rgb(102, 0, 102);">content</span>"></span>(.*?)<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"><!--(.*?)--></span>.*?<span class="hljs-tag" style="color: rgb(0, 102, 102); box-sizing: border-box;"></<span class="hljs-title" style="box-sizing: border-box; color: rgb(0, 0, 136);">div</span>></span>(.*?)<span class="hljs-tag" style="color: rgb(0, 102, 102); box-sizing: border-box;"><<span class="hljs-title" style="box-sizing: border-box; color: rgb(0, 0, 136);">div</span> <span class="hljs-attribute" style="box-sizing: border-box; color: rgb(102, 0, 102);">class</span>=<span class="hljs-value" style="box-sizing: border-box; color: rgb(0, 136, 0);">"stats.*?class="</span><span class="hljs-value" style="box-sizing: border-box; color: rgb(0, 136, 0);">number"</span>></span>(.*?)<span class="hljs-tag" style="color: rgb(0, 102, 102); box-sizing: border-box;"></<span class="hljs-title" style="box-sizing: border-box; color: rgb(0, 0, 136);">i</span>></span>',re.S)
items = re.findall(pattern,content)
for item in items:
    print item[0],item[1],item[2],item[3],item[4]</span></code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li></ul>

现在正则表达式在这里稍作说明

1. ? 是一个固定的搭配,.和代表可以匹配任意无限多个字符,加上?表示使用非贪婪模式进行匹配,也就是我们会尽可能短地做匹配,以后我们还会大量用到 .*? 的搭配。

2. (.?)代表一个分组,在这个正则表达式中我们匹配了五个分组,在后面的遍历item中,item[0]就代表第一个(.?)所指代的内容,item[1]就代表第二个(.*?)所指代的内容,以此类推。

3. re.S 标志代表在匹配时为点任意匹配模式,点 . 也可以代表换行符。
这样我们就获取了发布人,发布时间,发布内容,附加图片以及点赞数。
在这里注意一下,我们要获取的内容如果是带有图片,直接输出出来比较繁琐,所以这里我们只获取不带图片的段子就好了。
所以,在这里我们就需要对带图片的段子进行过滤。
我们可以发现,带有图片的段子会带有类似下面的代码,而不带图片的则没有,所以,我们的正则表达式的item[3]就是获取了下面的内容,如果不带图片,item[3]获取的内容便是空。
<code class="hljs xml has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-tag" style="color: rgb(0, 102, 102); box-sizing: border-box;"><<span class="hljs-title" style="box-sizing: border-box; color: rgb(0, 0, 136);">div</span> <span class="hljs-attribute" style="box-sizing: border-box; color: rgb(102, 0, 102);">class</span>=<span class="hljs-value" style="box-sizing: border-box; color: rgb(0, 136, 0);">"thumb"</span>></span>

<span class="hljs-tag" style="color: rgb(0, 102, 102); box-sizing: border-box;"><<span class="hljs-title" style="box-sizing: border-box; color: rgb(0, 0, 136);">a</span> <span class="hljs-attribute" style="box-sizing: border-box; color: rgb(102, 0, 102);">href</span>=<span class="hljs-value" style="box-sizing: border-box; color: rgb(0, 136, 0);">"/article/112061287?list=hot&s=4794990"</span> <span class="hljs-attribute" style="box-sizing: border-box; color: rgb(102, 0, 102);">target</span>=<span class="hljs-value" style="box-sizing: border-box; color: rgb(0, 136, 0);">"_blank"</span>></span>
<span class="hljs-tag" style="color: rgb(0, 102, 102); box-sizing: border-box;"><<span class="hljs-title" style="box-sizing: border-box; color: rgb(0, 0, 136);">img</span> <span class="hljs-attribute" style="box-sizing: border-box; color: rgb(102, 0, 102);">src</span>=<span class="hljs-value" style="box-sizing: border-box; color: rgb(0, 136, 0);">"http://pic.qiushibaike.com/system/pictures/11206/112061287/medium/app112061287.jpg"</span> <span class="hljs-attribute" style="box-sizing: border-box; color: rgb(102, 0, 102);">alt</span>=<span class="hljs-value" style="box-sizing: border-box; color: rgb(0, 136, 0);">"但他们依然乐观"</span>></span>
<span class="hljs-tag" style="color: rgb(0, 102, 102); box-sizing: border-box;"></<span class="hljs-title" style="box-sizing: border-box; color: rgb(0, 0, 136);">a</span>></span>

<span class="hljs-tag" style="color: rgb(0, 102, 102); box-sizing: border-box;"></<span class="hljs-title" style="box-sizing: border-box; color: rgb(0, 0, 136);">div</span>></span></code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li></ul>

所以我们只需要判断item[3]中是否含有img标签就可以了。
好,我们再把上述代码中的for循环改为下面的样子
<code class="hljs livecodeserver has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">for</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">item</span> <span class="hljs-operator" style="box-sizing: border-box;">in</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">items</span>:
        haveImg = re.search(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"img"</span>,<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">item</span>[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>])
        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> <span class="hljs-operator" style="box-sizing: border-box;">not</span> haveImg:
            print <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">item</span>[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0</span>],<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">item</span>[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>],<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">item</span>[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>],<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">item</span>[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4</span>]</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li></ul>

现在,整体的代码如下
<code class="hljs livecodeserver has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">\<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># -*- coding:utf-8 -*-</span>
import urllib
import urllib2
import re

page = <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>
url = <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'http://www.qiushibaike.com/hot/page/'</span> + str(page)
user_agent = <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'</span>
headers = { <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'User-Agent'</span> : user_agent }
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">try</span>:
    request = urllib2.Request(url,headers = headers)
    response = urllib2.urlopen(request)
    content = response.<span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">read</span>().decode(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'utf-8'</span>)
    pattern = re.compile(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?'</span>+
                         <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>'</span>,re.S)
    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">items</span> = re.findall(pattern,content)
    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">for</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">item</span> <span class="hljs-operator" style="box-sizing: border-box;">in</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">items</span>:
        haveImg = re.search(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"img"</span>,<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">item</span>[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>])
        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> <span class="hljs-operator" style="box-sizing: border-box;">not</span> haveImg:
            print <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">item</span>[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0</span>],<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">item</span>[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>],<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">item</span>[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>],<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">item</span>[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4</span>]
except urllib2.URLError, e:
    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> hasattr(e,<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"code"</span>):
        print e.code
    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> hasattr(e,<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"reason"</span>):
        print e.reason</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li></ul>

运行一下看下效果



恩,带有图片的段子已经被剔除啦。是不是很开森?

3.完善交互,设计面向对象模式

好啦,现在最核心的部分我们已经完成啦,剩下的就是修一下边边角角的东西,我们想达到的目的是:
按下回车,读取一个段子,显示出段子的发布人,发布日期,内容以及点赞个数。
另外我们需要设计面向对象模式,引入类和方法,将代码做一下优化和封装,最后,我们的代码如下所示

(from yockie:我改写了此段代码)
<code class="hljs python has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">#coding=utf-8</span>

<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> urllib, urllib2, re
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> datetime <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> datetime

<span class="hljs-class" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">class</span> <span class="hljs-title" style="box-sizing: border-box; color: rgb(102, 0, 102);">QSBK</span>:</span>
    <span class="hljs-function" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">def</span> <span class="hljs-title" style="box-sizing: border-box;">__init__</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(self)</span>:</span>
        self.enable = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">False</span>
        self.page = <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>
        self.user_agent = <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">r"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"</span>
        self.headers = {
            <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"User-Agent"</span> : self.user_agent
        }
        self.catch_url = <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">r'http://www.qiushibaike.com/hot/page/'</span>
        self.pattern = re.compile(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?'</span>+
                         <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>.*?'</span> +
                         <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'class="number">(.*?)</i>.*?</div>'</span>,re.S)

    <span class="hljs-function" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">def</span> <span class="hljs-title" style="box-sizing: border-box;">get_content_with_page</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(self, page)</span>:</span>
        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">try</span>:
            request = urllib2.Request(self.catch_url+str(page), headers=self.headers)
            response = urllib2.urlopen(request)
            <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">return</span> response.read()
        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">except</span> urllib2.URLError, e:
            <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> hasattr(e, <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"code"</span>):
                print(e.code)
            <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> hasattr(e, <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"reason"</span>):
                print(e.reason)
            <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">return</span> <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">""</span>

    <span class="hljs-function" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">def</span> <span class="hljs-title" style="box-sizing: border-box;">get_stories_from_content</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(self, content)</span>:</span>
        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">return</span> re.findall(self.pattern, content)

    <span class="hljs-function" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">def</span> <span class="hljs-title" style="box-sizing: border-box;">start</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(self)</span>:</span>
        self.enable = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">True</span>

        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">while</span> self.enable:
            content = self.get_content_with_page(self.page)
            stories = self.get_stories_from_content(content)

            <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> len(stories) > <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0</span>:
                <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">for</span> item <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">in</span> stories:
                    hasPicture = re.search(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'img'</span>, item[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>])
                    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">not</span> hasPicture:
                        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">print</span> <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'author:'</span> + item[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0</span>].strip()
                        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">print</span> <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'content:'</span> + item[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>].strip()
                        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">print</span> <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'time:'</span> + str(datetime.fromtimestamp(float(item[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>])) )
                        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">print</span> <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'praise:'</span> + item[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4</span>]
                        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">print</span> <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'comment:'</span> +item[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5</span>]
                        print(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'按Enter继续...'</span>)
                        input = raw_input()

            self.page += <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>

spider = QSBK()
spider.start()</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li><li style="box-sizing: border-box; padding: 0px 5px;">30</li><li style="box-sizing: border-box; padding: 0px 5px;">31</li><li style="box-sizing: border-box; padding: 0px 5px;">32</li><li style="box-sizing: border-box; padding: 0px 5px;">33</li><li style="box-sizing: border-box; padding: 0px 5px;">34</li><li style="box-sizing: border-box; padding: 0px 5px;">35</li><li style="box-sizing: border-box; padding: 0px 5px;">36</li><li style="box-sizing: border-box; padding: 0px 5px;">37</li><li style="box-sizing: border-box; padding: 0px 5px;">38</li><li style="box-sizing: border-box; padding: 0px 5px;">39</li><li style="box-sizing: border-box; padding: 0px 5px;">40</li><li style="box-sizing: border-box; padding: 0px 5px;">41</li><li style="box-sizing: border-box; padding: 0px 5px;">42</li><li style="box-sizing: border-box; padding: 0px 5px;">43</li><li style="box-sizing: border-box; padding: 0px 5px;">44</li><li style="box-sizing: border-box; padding: 0px 5px;">45</li><li style="box-sizing: border-box; padding: 0px 5px;">46</li><li style="box-sizing: border-box; padding: 0px 5px;">47</li><li style="box-sizing: border-box; padding: 0px 5px;">48</li><li style="box-sizing: border-box; padding: 0px 5px;">49</li><li style="box-sizing: border-box; padding: 0px 5px;">50</li><li style="box-sizing: border-box; padding: 0px 5px;">51</li><li style="box-sizing: border-box; padding: 0px 5px;">52</li><li style="box-sizing: border-box; padding: 0px 5px;">53</li><li style="box-sizing: border-box; padding: 0px 5px;">54</li><li style="box-sizing: border-box; padding: 0px 5px;">55</li><li style="box-sizing: border-box; padding: 0px 5px;">56</li></ul>

好啦,大家来测试一下吧,点一下回车会输出一个段子,包括发布人,发布时间,段子内容以及点赞数,是不是感觉爽爆了!
我们第一个爬虫实战项目介绍到这里,欢迎大家继续关注,小伙伴们加油!
【转载:静觅 » Python爬虫实战一之爬取糗事百科段子
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: