
Python: Scraping NetEase Dynamic Comments with BeautifulSoup

2016-12-19 19:28, 555 views

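1. NetEase loads its comments dynamically as JSON, so the first step in scraping them with Python is to find the URL of that JSON resource. Press F12 to open the browser's developer tools and look at the requests the page makes.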

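2. On the NetEase comment page, two of those requests are hotlist (the hottest comments) and newList (the latest comments). The code below requests newList, the latest comments.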

3. Open that link in the browser. It shows the raw response: JSON wrapped in a getData( ... ); callback.

4. From there you can use the structure of the JSON to pull out whatever information you want.
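Steps 3 and 4 can be tried offline before touching the network. The sketch below is only an illustration: the sample payload is made up, `strip_jsonp` and `extract_rows` are hypothetical helper names, and the only things taken from this post are the `getData` callback and the field names (`commentIds`, `comments`, `user`, `nickname`, `content`, `location`, `createTime`). It assumes each `commentIds` entry contains one or more 10-digit comment ids.

```python
import json
import re

# Made-up payload shaped like the response described above (an assumption).
payload = {
    'commentIds': ['1000000001', '1000000001,1000000002'],
    'comments': {
        '1000000001': {'content': 'first', 'createTime': '2016-11-22 10:00:00',
                       'user': {'nickname': 'alice', 'location': 'Beijing'}},
        '1000000002': {'content': 'reply', 'createTime': '2016-11-22 10:05:00',
                       'user': {'location': 'Shanghai'}},  # no nickname
    },
}
SAMPLE = 'getData(\n' + json.dumps(payload) + ');'


def strip_jsonp(text, callback='getData'):
    """Return the JSON text found inside callback( ... ); ."""
    pattern = r'\s*' + re.escape(callback) + r'\((.*)\)\s*;?\s*$'
    match = re.match(pattern, text, re.S)
    if match is None:
        raise ValueError('not a %s(...) response' % callback)
    return match.group(1)


def extract_rows(data):
    """Collect (nickname, content, location, createTime) once per comment id."""
    rows, seen = [], set()
    for chain in data['commentIds']:
        for cid in re.findall(r'\d{10}', chain):
            if cid in seen or cid not in data['comments']:
                continue
            seen.add(cid)
            comment = data['comments'][cid]
            user = comment.get('user', {})
            rows.append((user.get('nickname', 'null'),
                         comment.get('content', 'null'),
                         user.get('location', 'null'),
                         comment.get('createTime', 'null')))
    return rows


rows = extract_rows(json.loads(strip_jsonp(SAMPLE)))
for row in rows:
    print('|'.join(row))
```

Matching the whole wrapper with one anchored pattern avoids accidentally deleting a `);` that happens to appear inside a comment body. Skipping ids that were already seen is a design choice of this sketch, so a comment quoted in several chains is written only once.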

# coding: utf-8
import json  # needed to parse the response body
import re
import urllib.request


def getpage():
    # This fetches the latest comments (newList), which span several pages.
    # Clicking "next page" issues another newList request, so first work out
    # how the URL changes between pages; this assumes offset steps by limit (30).
    for offset in range(0, 60, 30):
        url = ('http://comment.news.163.com/api/v1/products/a2869674571f77b5a0867c3d71db5856/threads/C6BUSTPO000187VI/'
               'comments/newList?offset=' + str(offset) +
               '&limit=30&showLevelThreshold=72&headLimit=1&tailLimit=2&callback=getData&ibc=newspc&_=1479812321476')
        page = urllib.request.urlopen(url)
        yield page.read().decode('utf-8')


def getItems(html, w):
    # Strip the leading "getData(" and the trailing ");" first; what is left
    # is plain JSON that loads as a dict of keys and values.
    data = re.sub(r'^\s*getData\(', '', html)
    data = re.sub(r'\);\s*$', '', data)
    data = json.loads(data)
    for i in data['commentIds']:
        # Pull the 10-digit comment ids out of each entry...
        for n in re.findall(r'\d{10}', i):
            # ...and use each id as the key into data['comments'].
            try:
                comment = data['comments'][n]
                line = '|'.join([comment['user']['nickname'],
                                 comment['content'],
                                 comment['user']['location'],
                                 comment['createTime']])
                w.write(line + '|\n')
            except (KeyError, TypeError):
                # A field is missing or empty for this comment.
                w.write('null\n')


# Opening the file as UTF-8 replaces the per-field .encode('utf-8') calls.
with open('wypinglun.text', 'w', encoding='utf-8') as w:
    for html in getpage():
        getItems(html, w)
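
The paging logic can also be separated from the network call, which makes it easy to check by eye. The following is a minimal sketch, not the author's method: `page_urls` is a hypothetical helper, the names `PRODUCT_KEY` and `THREAD_ID` are my labels for the two path segments in the URL above, and the rule that `offset` advances by `limit` from one page to the next is an assumption inferred from the `limit=30` parameter rather than something confirmed against the API.

```python
from urllib.parse import urlencode

API_ROOT = 'http://comment.news.163.com/api/v1/products/'
PRODUCT_KEY = 'a2869674571f77b5a0867c3d71db5856'  # path segment from the post
THREAD_ID = 'C6BUSTPO000187VI'                    # path segment from the post


def page_urls(pages, limit=30, list_name='newList'):
    """Build one request URL per page, assuming offset = page * limit."""
    base = '%s%s/threads/%s/comments/%s' % (
        API_ROOT, PRODUCT_KEY, THREAD_ID, list_name)
    urls = []
    for page in range(pages):
        query = urlencode([
            ('offset', page * limit),
            ('limit', limit),
            ('showLevelThreshold', 72),
            ('headLimit', 1),
            ('tailLimit', 2),
            ('callback', 'getData'),
            ('ibc', 'newspc'),
            ('_', '1479812321476'),  # value copied from the post's URL
        ])
        urls.append(base + '?' + query)
    return urls


urls = page_urls(2)
for url in urls:
    print(url)
```

Each URL in the list can then be passed to `urllib.request.urlopen` one at a time, so every page is fetched and parsed rather than only one.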


