
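selenium, time, requests, json: crawling WeChat official account articles through WeChat's own official account platform interface. A simple example; later you can add more official accounts or build a list of account names to search for articles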

2018-11-22 15:01, 405 views
Copyright notice: in case of infringement, please contact the author to have this post removed. https://blog.csdn.net/Programmer_huangtao/article/details/84339304

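With this approach the crawler can fetch roughly 60 articles per official account before rate limiting kicks in. To crawl every article, use the proxy pool described in the next post, that is, attach a proxy to each request. The right sleep interval is something you will have to find by experiment.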
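
As a rough illustration of "add a proxy to each request and sleep between requests", here is a minimal sketch. The helper name `get_via_proxy`, the `proxy_pool` list and the delay bounds are assumptions made for this example, not part of the original script; `session` is assumed to behave like `requests.Session`.

```python
import random
import time


def get_via_proxy(session, url, proxy_pool, params=None,
                  min_delay=2.0, max_delay=5.0):
    """Send one GET through a randomly chosen proxy, then pause.

    `session` is assumed to behave like requests.Session.
    `proxy_pool` is a hypothetical list of proxy URLs,
    for example ['http://127.0.0.1:8080'].
    """
    proxy = random.choice(proxy_pool)
    # requests expects a mapping from URL scheme to proxy URL.
    proxies = {'http': proxy, 'https': proxy}
    response = session.get(url, params=params, proxies=proxies, timeout=10)
    # Random pause so requests are not evenly spaced; tune the bounds yourself.
    time.sleep(random.uniform(min_delay, max_delay))
    return response
```

With the real library this could be called as `get_via_proxy(session, appmsg_url, proxy_pool, params=appmsg_data)`, where `session = requests.Session()` and the login cookies have been loaded with `session.cookies.update(cookie)`.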

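The search keyword 'query' can be replaced with user input, or you can prepare a list of official accounts and loop over it. If you do that, it is best to put the logic in a function and call the function in a loop, which reads more clearly than this flat script.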
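
One possible shape for that refactoring is sketched below. The function names `build_search_params` and `crawl_accounts` are made up for this example; the parameter values mirror the `search_data` dictionary used in the script in this post.

```python
import random


def build_search_params(token, query, begin=0, count=5):
    """Build the searchbiz query parameters for one account name."""
    return {
        'action': 'search_biz',
        'token': token,
        'lang': 'zh_CN',
        'f': 'json',
        'ajax': '1',
        'random': random.random(),
        'query': query,
        'begin': str(begin),
        'count': str(count),
    }


def crawl_accounts(account_names, crawl_one):
    """Call crawl_one(name) for every account and collect results by name.

    `crawl_one` is a hypothetical callable that performs the search and
    article listing for a single account.
    """
    results = {}
    for name in account_names:
        results[name] = crawl_one(name)
    return results
```

Passing the per-account work in as `crawl_one` keeps the loop separate from the network code, so the account list can come from a file, user input or a hard-coded list without touching the crawler itself.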

[code]# -*- coding: utf-8 -*-
# @date: 2018\11\20 00200:39
# @Author  : huangtao!!
# @FileName: get_cook.py
# @Software: PyCharm
# @Blog    :https://blog.csdn.net/Programmer_huangtao
from selenium import webdriver
from selenium.webdriver.common.by import By
from pprint import pprint
from fake_useragent import UserAgent
import random
import requests
import re
import time
import json

# Log in through a real browser so the session cookies can be reused by requests.
cookie = {}
driver = webdriver.Chrome()
driver.get('https://mp.weixin.qq.com')
time.sleep(2)
# find_element(By.XPATH, ...) replaces find_element_by_xpath, which Selenium 4 removed.
driver.find_element(By.XPATH, './*//input[@name="account"]').clear()
driver.find_element(By.XPATH, './*//input[@name="account"]').send_keys('your official account login')
driver.find_element(By.XPATH, './*//input[@name="password"]').clear()
time.sleep(5)
driver.find_element(By.XPATH, './*//input[@name="password"]').send_keys('your password')
driver.find_element(By.XPATH, '//label[@class="frm_checkbox_label"]').click()
driver.find_element(By.XPATH, '//a[@class="btn_login"]').click()
# Give the login time to finish before reading the cookies.
time.sleep(15)
cookies = driver.get_cookies()
for item in cookies:
    cookie[item.get('name')] = item.get('value')
pprint(cookie)
# Save the cookies to disk so they can be reloaded below.
with open('cookie.txt', 'w', encoding='utf-8') as f:
    f.write(json.dumps(cookie))

headers = {'User-Agent': UserAgent().random}
with open('cookie.txt', 'r', encoding='utf-8') as f:
    cookie = json.load(f)
url = 'https://mp.weixin.qq.com'
response = requests.get(url,headers=headers,cookies=cookie)
print(response.url)
# print(response.text)
search_url = 'https://mp.weixin.qq.com/cgi-bin/searchbiz?'
token = re.findall(r'token=(\d*)',response.url)[0]
print(token)
search_data = {
    'action': 'search_biz',
    'token': token,
    'lang': 'zh_CN',
    'f': 'json',
    'ajax': '1',
    'random': random.random(),
    'query': 'jikexueyuan00',
    'begin': '0',
    'count': '5'
}
# print(response.url)
search_response = requests.get(search_url,cookies=cookie,params=search_data)
# print(search_response.text)
result = search_response.json().get('list')[0]
fakeid = result.get('fakeid')
appmsg_data = {
    'token': token,
    'lang': 'zh_CN',
    'f': 'json',
    'ajax': '1',
    'random': random.random(),
    'action': 'list_ex',
    'begin': '0',
    'count': '5',
    'query': '',
    'fakeid': fakeid,
    'type': '9'
}
appmsg_url = 'https://mp.weixin.qq.com/cgi-bin/appmsg?'
appmsg_response = requests.get(appmsg_url,cookies=cookie,params=appmsg_data)
# print(appmsg_response.text)
total = int(appmsg_response.json().get('app_msg_cnt'))
# Walk through the article list five items at a time.
for begin in range(0, total, 5):
    appmsg_data = {
        'token': token,
        'lang': 'zh_CN',
        'f': 'json',
        'ajax': '1',
        'random': random.random(),
        'action': 'list_ex',
        'begin': str(begin),
        'count': '5',
        'query': '',
        'fakeid': fakeid,
        'type': '9'
    }
    print('Page offset:', begin)
    appmsg_response = requests.get(appmsg_url, cookies=cookie, params=appmsg_data)
    appmsg_response_list = appmsg_response.json().get('app_msg_list')
    # Stop if the response carries no article list (for example, a refused request).
    if not appmsg_response_list:
        break
    for item in appmsg_response_list:
        print('Title:', item.get('title'))
        print('Link:', item.get('link'))
    time.sleep(2)

 

 

 

 
