
A small Python program: fetching posts from Baidu Tieba with urllib2

2013-10-20 20:22

This post is based on http://www.oschina.net/code/snippet_1156122_21491

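Goal: the program takes a thread URL as input. Only one URL form is supported: http://tieba.baidu.com/p/2164260230?pn=N, where N is the page number (1, 2, 3, ...). Because the program appends the page number itself, enter the address up to and including "pn=", without the number.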
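As a minimal sketch of how the per-page addresses are formed, the snippet below appends the page number to a base URL ending in "pn=", which is what the program does with its input. The function name build_page_urls is hypothetical and not part of the original program, and the page count of 3 is an arbitrary example value.

```python
def build_page_urls(base_url, page_count):
    """Return the URL of every page, from page 1 up to page_count."""
    # base_url is assumed to end with "pn=", so the page number
    # can simply be appended to it.
    return [base_url + str(page) for page in range(1, page_count + 1)]


# Example thread address taken from the post; 3 is an arbitrary page count.
urls = build_page_urls('http://tieba.baidu.com/p/2164260230?pn=', 3)
for u in urls:
    print(u)
```

If the input already ended in a page number, the appended digits would produce the wrong page, which is why the base URL has to stop at "pn=".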

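Approach: fetch the page at the given URL, then parse its content and extract the information of interest, such as the thread's total number of pages and the text of each post.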
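The extraction step can be tried without any network access. The sketch below applies the same two regular expressions the program uses to a small, made-up HTML fragment. SAMPLE_HTML and the helper names are assumptions for illustration only; real Tieba pages are much larger and their markup may have changed since 2013.

```python
import re

# Hypothetical, heavily simplified markup modeled on the patterns
# the program searches for. It is not a real Tieba page.
SAMPLE_HTML = u'''
<li class="l_reply_num"><span class="red">2</span> pages</li>
<div id="post_content_1" class="d_post_content">first post</div>
<div id="post_content_2" class="d_post_content">second
post</div>
'''


def extract_page_count(html):
    """Return the total page count found in the pager markup."""
    matches = re.findall(r'<span class="red">(.*?)</span>', html, re.S)
    if not matches:
        raise ValueError('page count not found')
    return int(matches[0])


def extract_posts(html):
    """Return the inner text of every post_content div."""
    # re.S lets "." match newlines, so posts spanning lines are captured.
    return re.findall(r'id="post_content.*?>(.*?)</div>', html, re.S)


page_count = extract_page_count(SAMPLE_HTML)
posts = extract_posts(SAMPLE_HTML)
print(page_count)
print(posts)
```

Regular expressions are fragile against markup changes; they work here only because the patterns match the page structure closely.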

The implementation is as follows:

'''
Created on Oct 20, 2013

@author: lsy
'''

# Python 2 only: urllib2 and the print statement do not exist in Python 3.
import urllib2
import re


class BaiduSpider:

    # Fetch page 1 and read the total page count from the pager markup.
    def getPageNum(self, url):
        content = urllib2.urlopen(url + str(1)).read()
        unicodePage = content.decode('gbk')
        num = re.findall('<span class="red">(.*?)</span>', unicodePage, re.S)
        return int(num[0])

    # Fetch every page of the thread and save the text of each post.
    def getPage(self, url):
        num = self.getPageNum(url)
        print num
        outfile = open('abc.txt', 'w')
        try:
            for i in range(1, num + 1):
                new_url = url + str(i)
                response = urllib2.urlopen(new_url)
                # The page is served as gbk; decode to unicode before matching.
                content = response.read().decode('gbk')
                phrase = re.findall('id="post_content.*?>(.*?)</div>', content, re.S)
                for item in phrase:
                    # Encode back to utf-8 bytes for the console and the file.
                    print item.encode('utf-8')
                    # One post per line, so posts do not run together.
                    outfile.write(item.encode('utf-8') + '\n')
        finally:
            # Close the file even if a request or a decode fails.
            outfile.close()


# The address must end with "pn=" because the page number is appended to it.
url = raw_input('Please input the first page address:')
bs = BaiduSpider()
bs.getPage(url)

Problems encountered:

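String encoding: Python processes text as unicode, while a web page may be served as gbk, utf-8, or another encoding, so the data has to be converted between encodings.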

decode: converts a byte string in a given encoding to unicode.

encode: converts a unicode string to a byte string in another encoding.
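A short sketch of the two operations, written to run on both Python 2 and Python 3. The sample text (the two characters of "Tieba", given as escapes) is an arbitrary choice for illustration.

```python
# Unicode text: the two Chinese characters of "Tieba".
text = u'\u8d34\u5427'

# encode: unicode -> bytes in a specific encoding.
gbk_bytes = text.encode('gbk')      # 2 bytes per character
utf8_bytes = text.encode('utf-8')   # 3 bytes per character
print(len(gbk_bytes))
print(len(utf8_bytes))

# decode: bytes in a specific encoding -> unicode.
restored = gbk_bytes.decode('gbk')
print(restored == text)

# Decoding with the wrong codec does not give the text back.
# 'replace' substitutes undecodable bytes instead of raising an error.
garbled = gbk_bytes.decode('utf-8', 'replace')
print(garbled == text)
```

The same text has different byte representations in gbk and utf-8, so bytes must always be decoded with the encoding they were written in.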

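Baidu Tieba pages are encoded in gbk, so the content must first be decoded to unicode before the program can process it. The output file is encoded in utf-8, so each string is encoded to utf-8 before it is written.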
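As an alternative to encoding each string by hand before writing, io.open accepts an encoding argument and performs the conversion on write. This is a swapped-in technique, not what the program above does. The post texts, the file name posts.txt, and the temporary directory are all hypothetical.

```python
import io
import os
import tempfile

# Hypothetical post texts, already decoded to unicode.
posts = [u'\u7b2c\u4e00\u697c', u'second post']

# Write into a throwaway directory so nothing in the working directory is touched.
path = os.path.join(tempfile.mkdtemp(), 'posts.txt')

# io.open encodes unicode text to utf-8 as it is written.
with io.open(path, 'w', encoding='utf-8') as outfile:
    for post in posts:
        outfile.write(post + u'\n')

# Reading with the same encoding returns the original unicode text.
with io.open(path, 'r', encoding='utf-8') as infile:
    saved = infile.read().splitlines()

print(saved == posts)
```

The with statement also closes the file automatically, including when an exception is raised inside the block.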
