
Crawling My Own Blog Posts with the Scrapy Framework (2)

2014-05-05 15:14

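I previously wrote a post about crawling my own blog posts with the Scrapy framework, but later found that Chinese text was never handled properly.

When printed, the title came out as [u'python\u4e0b\u722c\u67d0\u4e2a\u7f51\u9875\u7684\u56fe\u7247 - huhuuu - \u535a\u5ba2\u56ed'] instead of python下爬某个网页的图片 - huhuuu - 博客园, because printing a list in Python 2 shows the escaped form of each element. That is clearly not the result we want.

So how do we turn the string held in the list into a plain string? Calling str on the list directly clearly will not work! Instead, walk through the list and pull the text out.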

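def change_word(s):  # Convert the string held in the list so that it displays as Chinese
    print s  # s is a list, so this prints the escaped form
    length = len(s[0])
    ss2 = ''

    for i in range(0, length):
        ss2 += s[0][i]

    s = ss2
    print s  # s is now a string, so this prints the readable text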
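
For comparison, here is a minimal Python 3 sketch of the same idea (the spider in this post is Python 2, so this is not the code that actually ran). The list `titles` stands in for what `extract()` returns, and the names `titles` and `first_title` are made up for illustration. It suggests that the character-by-character copy ends up equal to simply taking element 0 of the list.

```python
# Hypothetical stand-in for the list that extract() returns in the spider.
titles = ['python\u4e0b\u722c\u67d0\u4e2a\u7f51\u9875\u7684\u56fe\u7247 - huhuuu - \u535a\u5ba2\u56ed']


def first_title(strings):
    """Copy the first string of the list character by character."""
    copied = ''
    for ch in strings[0]:
        copied += ch
    return copied


title = first_title(titles)

# ascii() of the list shows escape sequences, much like printing a list
# of unicode strings does in Python 2.
print(ascii(titles))

# The copy is equal to the first element, so titles[0] alone would do.
print(title == titles[0])
```

On a terminal that can display Chinese, `print(title)` would show the readable text rather than the escapes.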

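Running this seemed to work, but some characters were still not converted to Chinese. Judging from the message printed by the interpreter, the character \u2014 does not seem to be well supported, so let's strip that character out.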

At first I did not understand what counts as one character, so I wrote the condition like this, which naturally had no effect at all:

if (s[0][i] == '\\') and (s[0][i+1] == 'u'):
    if (s[0][i+2] == '2') and (s[0][i+3] == '0') and (s[0][i+4] == '1') and (s[0][i+5] == '4'):

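It turns out that in a Python unicode string a Chinese character and an English character each count as a single unit, so u'\u2014' is one character rather than the six characters \, u, 2, 0, 1, 4. The check therefore becomes:

if (s[0][i] == u'\u2014'):
    continue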
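
To make the "one unit" point concrete, here is a small Python 3 sketch (again, the spider itself is Python 2). The helper name `drop_char` and the sample text are made up for illustration; they are not part of the spider.

```python
EM_DASH = '\u2014'    # one character: the em dash
ESCAPED = '\\u2014'   # six characters: backslash, u, 2, 0, 1, 4


def drop_char(text, unwanted=EM_DASH):
    """Return text with every occurrence of `unwanted` removed."""
    kept = ''
    for ch in text:
        if ch == unwanted:
            continue  # skip this character and move on to the next one
        kept += ch
    return kept


sample = 'scrapy' + EM_DASH + 'huhuuu'  # hypothetical sample text

print(len(EM_DASH))       # 1
print(len(ESCAPED))       # 6
print(drop_char(sample))  # scrapyhuhuuu
```

The loop mirrors the check in the post; `text.replace(unwanted, '')` would give the same result in one call.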

Finally, all of the Chinese characters display correctly.

The complete spider code:

#!/usr/bin/env python
#coding=utf-8
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from dirbot.items import Website
import sys

sys.stdout = open('output.txt', 'w')  # Redirect printed output to output.txt

add = 0  # Number of pages parsed so far


def change_word(s):  # Convert the string held in the list so that it displays as Chinese
    print s
    length = len(s[0])
    ss2 = ''

    for i in range(0, length):
        # Skip \u2014
        if s[0][i] == u'\u2014':
            continue
        ss2 += s[0][i]

    s = ss2
    print s


class DmozSpider(CrawlSpider):

    name = "huhu"
    allowed_domains = ["cnblogs.com"]
    start_urls = [
        "http://www.cnblogs.com/huhuuu",
    ]

    rules = (
        # Extract links matching huhuuu/default.html\?page\=([\w]+) and follow them
        # (no callback means follow defaults to True)
        Rule(SgmlLinkExtractor(allow=(r'huhuuu/default.html\?page\=([\w]+)', ),)),

        # Extract links matching 'huhuuu/p/' and parse them with the spider's parse_item method
        Rule(SgmlLinkExtractor(allow=('huhuuu/p/', )), callback='parse_item'),
        # Some older posts use the archive URL form
        Rule(SgmlLinkExtractor(allow=('huhuuu/archive/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        global add  # Counts the pages parsed
        print add
        add += 1

        sel = HtmlXPathSelector(response)
        items = []

        item = Website()

        temp = sel.xpath('/html/head/title/text()').extract()

        item['headTitle'] = temp  # XPath found by inspecting the page's HTML source
        item['url'] = response.url  # Store the URL string rather than the response object

        #print temp

        print item['url']
        change_word(temp)

        items.append(item)
        return items
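
If the trouble with \u2014 comes from the encoding of the output rather than from the string itself (an assumption; the post does not show the exact error), another option might be to keep the character and write the output as UTF-8 instead of dropping it. The following is a Python 3 sketch of that idea only; `save_titles`, `load_titles`, the sample title, and the temporary path are all made up for illustration.

```python
import io
import os
import tempfile


def save_titles(titles, path):
    """Write each title on its own line, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as out:
        for title in titles:
            out.write(title + '\n')


def load_titles(path):
    """Read the titles back, decoding the file as UTF-8."""
    with io.open(path, 'r', encoding='utf-8') as src:
        return [line.rstrip('\n') for line in src]


# Hypothetical sample data that contains the em dash (\u2014).
sample_titles = ['scrapy \u2014 huhuuu - \u535a\u5ba2\u56ed']

output_path = os.path.join(tempfile.mkdtemp(), 'output.txt')
save_titles(sample_titles, output_path)
restored = load_titles(output_path)

print(restored == sample_titles)  # True: the em dash survives the round trip
```

UTF-8 can encode every Unicode character, so nothing has to be filtered out before writing.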

Crawl result: nearly four hundred blog posts.
