
An overlooked space when scraping with a calibre recipe

2016-04-06 09:58

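I have written calibre recipes to scrape web content several times. This time, some pages kept failing to download, and because calibre's error messages are fairly terse, I could not diagnose the problem from them. The error message is shown in the screenshot: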

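If the recipe itself were wrong, I would not expect all the other chapters to download normally, yet they did. After checking repeatedly, I found that it was always chapters 1, 5, and 9 that failed, and the failure was fully reproducible. If the recipe was not the problem, then something in the markup of those pages had to be different. But each of those chapter pages opened fine in a browser, and the text was visible in the page source, so those chapters were not protected by any special anti-scraping measure either. So where was the problem?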

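Since it was always chapters 1, 5, and 9, could something be wrong with the URLs of those chapters, so that they open in a browser but cannot be downloaded by a program? After going over the index page again and again, I finally found that the URLs of these pages had an extra space after .html, giving href="xxxxxxxxxxxxxxxxxxxxx.html ". A browser opens such a link correctly, but calibre's downloader does not strip the space automatically, so opening the page fails and the download is reported as failed. Once found, the problem was easy to fix: call strip() on the string holding the URL to remove the space, then build the feeds. The pages then downloaded successfully, the failure messages disappeared, and the generated e-book no longer had missing chapters.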
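The effect of the stray space can be reproduced outside calibre. The following is a minimal sketch, not part of the recipe: the helper name build_chapter_url and the file name 12345.html are made up for illustration, and only the URL prefix comes from the recipe.

```python
# Prefix taken from the recipe; the href below is a hypothetical example.
URL_PREFIX = "http://www.lexinren.com/files/article/html/0/815/"


def build_chapter_url(prefix, href):
    """Join prefix and href after trimming whitespace around the href."""
    return prefix + href.strip()


raw_href = "12345.html "  # trailing space, as on the problem chapters

broken = URL_PREFIX + raw_href                    # what a naive join produces
fixed = build_chapter_url(URL_PREFIX, raw_href)   # what the strip() fix produces

# repr() makes the trailing space visible in the output.
print(repr(broken))
print(repr(fixed))
```

Printing with repr() is the important part: in a normal print the trailing space is invisible, which is exactly why the problem was hard to spot.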

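The lesson: when analyzing an index page, a quick look in the browser's developer tools is not always enough. Sometimes you have to read the raw source carefully and check whether any chapter title or URL has something unusual about it, especially an extra space or anything else a program will not handle. Some sites use tricks to deter bulk downloading by scripts or download tools. The trick does not have to be sophisticated: even a tiny change like this one is quite effective against bulk downloads.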
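Checking the raw source by eye is error-prone, so a small script can do it instead. This is a sketch using only the Python standard library. The class and function names are hypothetical, and the sample HTML is invented to mimic the index page; in practice you would feed it the downloaded page text.

```python
from html.parser import HTMLParser


class HrefWhitespaceFinder(HTMLParser):
    """Collect href values of <a> tags that have leading or trailing whitespace."""

    def __init__(self):
        super().__init__()
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value is not None and value != value.strip():
                self.suspicious.append(value)


def find_padded_hrefs(html_text):
    """Return every href in html_text that differs from its stripped form."""
    finder = HrefWhitespaceFinder()
    finder.feed(html_text)
    finder.close()
    return finder.suspicious


# Invented sample: the first link has a trailing space inside the quotes.
sample = (
    '<div class="novel_list">'
    '<a href="1.html ">Chapter 1</a>'
    '<a href="2.html">Chapter 2</a>'
    '</div>'
)

for href in find_padded_hrefs(sample):
    print(repr(href))
```

Any href the script prints is a candidate for the kind of failure described above.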

The full recipe:

# coding=gbk

from calibre.web.feeds.recipes import BasicNewsRecipe

class ddwx(BasicNewsRecipe):
    title = u"邪器"
    description = u"竟然有人把法器之魂吞下去了？"
    recursions = 0
    max_articles_per_feed = 1500
    oldest_article = 5000
    remove_javascript = True
    timeout = 300

    cover_url = "http://www.lexinren.com/files/article/image/0/815/815s.jpg"
    url_prefix = "http://www.lexinren.com/files/article/html/0/815/"
    no_stylesheets = True
    # Tags to keep
    keep_only_tags = [dict(name='div', attrs={'class':'novel_head'}),
                      dict(name='div', attrs={'class':'novel_content'})]
    remove_tags = [dict(name='div', attrs={'class':'top'}),
                   dict(name='div', attrs={'class':'logo'}),
                   dict(name='div', attrs={'id':'pagetop'}),
                   ]

    def get_title(self, link):
        return link.contents[0].strip()

    def parse_index(self):
        ans0 = []
        soup = self.index_to_soup('http://www.lexinren.com/files/article/html/0/815/index.html')
        aa = soup.find('div', attrs={'class':'novel_volume'})
        # First collect the title of every volume into tv. The titles are in the
        # <h2> inside <div class="novel_title">; read them all in one pass.
        tv = []
        for nt_div in aa.findAll('h2'):
            til_vol = nt_div.string
            tv.append(til_vol)
        i = 0   # index of the current volume in tv
        nc = 0  # number of <div class="novel_list"> blocks seen for this volume
        articles = []
        # Chapter titles and URLs are in <div class="novel_list">. Note that each
        # volume is split across three such <div>s, hence the nc counter.
        for nl_div in aa.findAll('div', attrs={'class':'novel_list'}):
            for link in nl_div.findAll('a'):
                title = self.get_title(link)
                # strip() removes the stray space that some hrefs carry after .html
                url = self.url_prefix + link['href'].strip()
                a = {'title':title, 'url':url}
                articles.append(a)
            nc = nc + 1
            if nc == 3:  # volume complete: pair it with its title and add to ans0
                ans = (tv[i], articles)
                ans0.append(ans)
                articles = []
                i = i + 1
                nc = 0
        return ans0
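
The volume-grouping loop in parse_index can be tried without calibre. The sketch below mirrors its counting logic on plain Python data. The function name group_into_volumes, its parameters, and the sample titles and hrefs are all hypothetical. Like the recipe, it assumes a fixed number of chapter blocks per volume and ignores a trailing incomplete group.

```python
def group_into_volumes(volume_titles, chapter_blocks, blocks_per_volume=3):
    """Build a list of (volume title, articles) pairs.

    chapter_blocks is a list of blocks; each block is a list of
    (chapter title, href) tuples, standing in for one <div class="novel_list">.
    """
    feeds = []
    articles = []
    volume_index = 0
    for count, block in enumerate(chapter_blocks, start=1):
        for title, href in block:
            # strip() guards against stray whitespace in titles and hrefs
            articles.append({'title': title.strip(), 'url': href.strip()})
        if count % blocks_per_volume == 0:
            # A full volume has been collected; attach it to its title.
            feeds.append((volume_titles[volume_index], articles))
            articles = []
            volume_index += 1
    return feeds


# Invented sample data: two volumes, three blocks each, one chapter per block.
titles = ["Volume 1", "Volume 2"]
blocks = [
    [("Ch 1", "1.html ")], [("Ch 2", "2.html")], [("Ch 3", "3.html")],
    [("Ch 4", "4.html")], [("Ch 5", "5.html ")], [("Ch 6", "6.html")],
]

for volume_title, chapter_list in group_into_volumes(titles, blocks):
    print(volume_title, [chapter['url'] for chapter in chapter_list])
```

The result has the same shape as the list that the recipe's parse_index returns: one tuple per volume, each holding the volume title and a list of article dictionaries.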


Tags: calibre recipe python
