
Scraping data with the bs (BeautifulSoup) module

2015-11-28 13:52 · 316 views

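I just scraped some data with the bs module and ran into quite a few problems along the way. Working through them turned out to be really useful, since a few of them had been bothering me for a long time, so let me go through them.

The target was the Douban Movie Top 250 page, http://movie.douban.com/top250. Using bs is actually very simple: one function does the whole job. Still, I hit problems. The fields in the output were not clearly separated, and I kept going around in circles until I finally read the section on get_text() in the bs documentation, http://beautifulsoup.readthedocs.org/zh_CN/latest/. Then I felt really stupid: I had been living in a syntax I had made up myself, which was close to the correct usage but caused a lot of trouble. The lesson is never to take anything for granted. Be careful, and remember what each call is for and how it is used, or you end up like me, patching things here and there and still getting it wrong.

Here is the concrete case. I called get_text(strip=True) and put the separator in the write call instead:

file.write(item.encode('UTF-8')+'|')

so the fields never got separated. Without a separator argument, get_text() joins all the text fragments of a tag into one string, so the '|' only lands once, at the very end. The same approach had worked in other scripts of mine, but not here. After another look at the documentation I passed the separator to get_text() itself, and the output finally split properly. (I will absolutely not admit that I also changed the tag I was selecting along the way, or that I was pretty discouraged while doing it.)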
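The difference between the two calls can be seen on a small fragment. This is only a sketch in Python 3 with bs4 installed; the HTML below is made up for illustration and is not the real Douban markup.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; it is not the real Douban markup.
html = (
    "<ol>"
    "<li><span>Title A</span><span>9.6</span></li>"
    "<li><span>Title B</span><span>9.4</span></li>"
    "</ol>"
)
item = BeautifulSoup(html, "html.parser").ol

# No separator: every text fragment is joined directly to the next one.
joined = item.get_text(strip=True)

# Separator as the first argument: '|' is placed between every fragment.
separated = item.get_text("|", strip=True)

print(joined + "|")  # Title A9.6Title B9.4|
print(separated)     # Title A|9.6|Title B|9.4
```

Appending '|' after the call only marks the end of the whole string, which is why the fields stayed glued together.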

Another point is how encode() behaves. For example, take a <b> tag (tag below) whose only content is the SNOWMAN character.

The SNOWMAN character displays fine in UTF-8 (it looks like ☃), but it has no representation in ISO-Latin-1 or ASCII, so for those encodings it is converted to the character reference &#9731;:

print(tag.encode("utf-8"))
# <b>☃</b>

print(tag.encode("latin-1"))
# <b>&#9731;</b>

print(tag.encode("ascii"))
# <b>&#9731;</b>

So encode() is a pretty magical call too (although really it is just a matter of encoding).
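The same kind of fallback can be reproduced with plain Python strings. The sketch below uses Python 3's built-in str.encode with the "xmlcharrefreplace" error handler; it only illustrates the idea of replacing an unencodable character with a numeric reference and is not a statement about how BeautifulSoup implements it internally.

```python
snowman = "\N{SNOWMAN}"  # U+2603, decimal 9731

# UTF-8 can represent the character directly (three bytes).
as_utf8 = snowman.encode("utf-8")

# ASCII and Latin-1 cannot, so the error handler substitutes &#9731;.
as_ascii = snowman.encode("ascii", "xmlcharrefreplace")
as_latin1 = snowman.encode("latin-1", "xmlcharrefreplace")

print(as_utf8)    # b'\xe2\x98\x83'
print(as_ascii)   # b'&#9731;'
print(as_latin1)  # b'&#9731;'

# With the default "strict" handling the same call raises instead.
try:
    snowman.encode("ascii")
except UnicodeEncodeError as err:
    print("strict encoding failed:", err.reason)
```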

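OK, here is my code. I did not spend long writing it, so please point out anything I have missed.

I only scraped 2 pages; to get the remaining pages, just change the range() call.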
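As a quick check of what range() yields here, the sketch below builds the page URLs from the start offsets. The helper name page_urls is made up for illustration, and it assumes each page holds 25 entries, which is what the step of 25 in the script suggests.

```python
BASE_URL = "http://movie.douban.com/top250?start="


def page_urls(stop, step=25):
    """Build one URL per start offset, mirroring range(0, stop, step)."""
    return [BASE_URL + str(start) for start in range(0, stop, step)]


two_pages = page_urls(30)   # offsets 0 and 25, the two pages scraped here
all_pages = page_urls(250)  # offsets 0, 25, ..., 225

print(two_pages)
print(len(all_pages))  # 10
```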

#coding=utf-8
# Python 2 script: urllib2 and the "except X, e" syntax do not exist in Python 3.
from bs4 import BeautifulSoup

import urllib2
import time


class DB():
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.24 (KHTML, like '}

    def gethtml(self, page):
        try:
            # Whatever follows the start parameter makes no difference, so I dropped it
            full_url = "http://movie.douban.com/top250?start=" + str(page)
            req = urllib2.Request(full_url, None, self.headers)
            response = urllib2.urlopen(req)
            html = response.read()
            return html
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print u"Connection failed", e.reason
            return None

    def getItem(self):
        # Open the output file once; every page is appended to it
        file = open("DB.txt", "a")
        for m in range(0, 30, 25):  # change the page range here
            html = self.gethtml(m)
            if html is None:  # the request failed, skip this page
                continue
            soup = BeautifulSoup(html, "html.parser")
            Trlist = soup.find_all('ol')
            for item in Trlist:
                if item not in ['\n', '\t', ' ']:
                    # '|' goes between every text fragment inside the <ol>
                    item = item.get_text('|', strip=True)
                    file.write('\n')
                    file.write(item.encode('utf-8'))

            time.sleep(5)  # pause between pages
        file.close()


if __name__ == '__main__':
    DB().getItem()
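For readers on Python 3, where urllib2 no longer exists, the same flow might look roughly like this. It is a sketch, not the original script: the helper names fetch, extract and main are made up for illustration, and the User-Agent value is a placeholder because the header in the post is cut off.

```python
import time
import urllib.error
import urllib.request

from bs4 import BeautifulSoup

BASE_URL = "http://movie.douban.com/top250?start="
# Placeholder value (an assumption); the header in the post is truncated.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch(start):
    """Return the page HTML as bytes, or None if the request fails."""
    req = urllib.request.Request(BASE_URL + str(start), headers=HEADERS)
    try:
        with urllib.request.urlopen(req) as response:
            return response.read()
    except urllib.error.URLError as err:
        print("Connection failed:", err.reason)
        return None


def extract(html):
    """Return one '|'-separated string per <ol> element in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [ol.get_text("|", strip=True) for ol in soup.find_all("ol")]


def main(starts=range(0, 30, 25), delay=5, path="DB.txt"):
    """Fetch each page, append its text to the file, and pause between pages."""
    with open(path, "a", encoding="utf-8") as out:
        for start in starts:
            html = fetch(start)
            if html is not None:
                for line in extract(html):
                    out.write("\n" + line)
            time.sleep(delay)
```

Calling main() performs real HTTP requests, so it is left out of the block. Opening the file with encoding="utf-8" replaces the explicit item.encode('utf-8') from the Python 2 version.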
