
Python crawling with Scrapy: sending POST requests and writing a custom middleware (using a Cookie pool to obtain Cookies) to log in and crawl Weibo

2018-07-10 21:53, 1061 views

Weibo URL: https://weibo.cn/

See also: Scrapy basics, sending POST requests.

Inspecting the search page in the browser shows that the search is sent as a POST request that carries both URL parameters and form data.

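POST requests

Note: Scrapy issues GET requests for the start URLs by default, so sending the initial request as a POST requires overriding the start_requests() method.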

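Main modules used:

from urllib.parse import quote (percent-encodes the query parameter; this is URL encoding, not encryption)

from scrapy.http import FormRequest (the class Scrapy provides for building POST requests)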
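As a rough illustration of what these two imports contribute, the sketch below uses only the standard library. quote() percent-encodes the keyword for the query string, and urlencode() stands in for the form encoding that FormRequest applies to its formdata argument (that stand-in is an assumption made for illustration). The field names mp and page are the ones used by the spider in this post.

```python
from urllib.parse import quote, urlencode

start_url = 'https://weibo.cn/search/mblog'
key_word = '周杰伦'

# quote() percent-encodes the UTF-8 bytes of the keyword; nothing is encrypted.
encoded_kw = quote(key_word)
url = '{url}?keyword={kw}'.format(url=start_url, kw=encoded_kw)

# Stand-in for the POST body: form fields travel as URL-encoded key=value pairs.
form_body = urlencode({'mp': '100', 'page': '1'})

print(url)
print(form_body)
```

The keyword ends up in the URL while mp and page go into the request body, which matches the "parameters plus form data" shape of the search request.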

[Screenshot: Scrapy source code for the initial URL requests]

[Screenshot: source code of FormRequest, the POST request class]

Example code:

wb.py

# -*- coding: utf-8 -*-
import scrapy
from urllib.parse import quote
from scrapy.http import FormRequest


class WbSpider(scrapy.Spider):
    name = 'wb'
    allowed_domains = ['weibo.cn']
    start_url = 'https://weibo.cn/search/mblog'
    # Maximum page number
    max_page = 100

    # ******************** Sending the POST request ********************
    # Scrapy uses GET by default. Overriding start_requests() lets the
    # initial URL be requested with POST instead.
    def start_requests(self):
        # https://weibo.cn/search/mblog?keyword=%E5%91%A8%E6%9D%B0%E4%BC%A6
        key_word = '周杰伦'
        url = '{url}?keyword={kw}'.format(url=self.start_url, kw=quote(key_word))

        # Only page 1 is requested here; widen the range to fetch more pages.
        for page_num in range(1, 2):
            form_data = {
                'mp': str(self.max_page),
                'page': str(page_num)
            }
            # FormRequest is the class used to build POST requests.
            request = FormRequest(url, formdata=form_data, callback=self.parse_list_page)
            yield request

    def parse_list_page(self, response):
        """
        Parse the detail-page URLs on the list page: the detail URL of
        reposted posts and the detail URL of original posts.
        :param response:
        :return:
        """
        # Combined XPath query: both conditions must hold
        weibo_div = response.xpath('//div[@class="c" and contains(@id, "M_")]')
        for weibo in weibo_div:
            # Original posts and reposts need to be told apart
            has_cmt = weibo.xpath('.//span[@class="cmt"]').extract_first('')
            if has_cmt:
                # span[@class="cmt"] was found, so this is a repost
                # @class and @id select attributes
                # . refers to the node's text content
                detail_url = weibo.xpath('.//a[contains(., "原文评论[")]/@href').extract_first('')
            else:
                # Not found, so this is an original post
                detail_url = weibo.xpath('.//a[contains(., "评论[")]/@href').extract_first('')

            # Skip posts without a comment link: scrapy.Request rejects an empty URL
            if not detail_url:
                continue
            # Build the request for the detail page (urljoin also handles relative hrefs)
            yield scrapy.Request(response.urljoin(detail_url), callback=self.parse_detail_page)

    def parse_detail_page(self, response):
        """
        Parse the data on the detail page.
        :param response:
        :return:
        """
        print(response.url, response.status)
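One detail worth guarding in parse_list_page: extract_first('') returns an empty string when no comment link is found, and scrapy.Request needs an absolute URL. The sketch below shows, with the standard library only, how an href could be normalized before a request is built. The helper name to_absolute and the sample hrefs are hypothetical; inside a spider, response.urljoin() plays the same role as urljoin() here.

```python
from urllib.parse import urljoin

# Hypothetical list-page URL used as the base for relative links.
page_url = 'https://weibo.cn/search/mblog?keyword=test'


def to_absolute(base_url, href):
    """Return an absolute URL for href, or None when href is empty."""
    if not href:
        return None
    return urljoin(base_url, href)


print(to_absolute(page_url, '/comment/abc123'))                  # relative href
print(to_absolute(page_url, 'https://weibo.cn/comment/abc123'))  # already absolute
print(to_absolute(page_url, ''))                                 # no link found
```

Returning None for a missing link makes it easy to skip that post instead of yielding a request that would fail.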

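Weibo Cookie pool: a custom middleware

Note: the Cookie attached to a request must be a dict (Scrapy also accepts a list of dicts); it cannot be set directly as a raw string.

Start the Cookie pool program from the command line (cmd) so that it is running before the spider starts!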
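If a Cookie is only available as a raw header string (for example, copied from the browser), it has to be split into name/value pairs first. The following is a minimal sketch of that conversion; the function name cookie_header_to_dict and the sample cookie names are hypothetical, and the parsing is deliberately simple (it does not handle quoted values or cookie attributes).

```python
def cookie_header_to_dict(header):
    """Convert 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in header.split(';'):
        # partition() splits on the first '=', so '=' inside a value is preserved.
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


raw = 'name_a=value1; name_b=x=y; '
print(cookie_header_to_dict(raw))
```

The resulting dict is the shape that can be assigned to request.cookies.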

Printing messages with logging looks like this:

def spider_opened(self, spider):
    spider.logger.info('Spider opened: %s' % spider.name)
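Outside of a spider, the same kind of message can be produced with a named logger from the standard library, which is what the middleware in this post does. The sketch below is only an illustration: the ListHandler class is a hypothetical helper that collects messages in memory so the output can be inspected.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps formatted messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("WeiBoMiddleWare")
logger.setLevel(logging.INFO)
handler = ListHandler()
logger.addHandler(handler)

# Passing arguments separately lets logging do the formatting lazily.
logger.info('Spider opened: %s', 'wb')
print(handler.messages)
```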

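import requests
import logging
import json


# Custom downloader middleware for Weibo requests
class WeiBoMiddleWare(object):
    def __init__(self, cookies_pool_url):
        self.logging = logging.getLogger("WeiBoMiddleWare")
        self.cookies_pool_url = cookies_pool_url

    def get_random_cookies(self):
        try:
            response = requests.get(self.cookies_pool_url)
            # In a middleware, the Cookies set on a request must be a dict,
            # not a raw string, so the JSON text is parsed here.
            cookies = json.loads(response.text)
        except Exception as e:
            self.logging.info('Get Cookies failed: {}'.format(e))
            # Fall back to an empty dict instead of implicitly returning None
            return {}
        else:
            self.logging.info('Get Cookies success: {}'.format(response.text))
            return cookies

    @classmethod
    def from_settings(cls, settings):
        obj = cls(
            cookies_pool_url=settings['WEIBO_COOKIES_URL']
        )
        return obj

    # process_request() is called many times: every request passes through it
    # on its way to the downloader.
    def process_request(self, request, spider):
        request.cookies = self.get_random_cookies()
        return None

    def process_response(self, request, response, spider):
        """
        Handle the response to this request.
        :param request:
        :param response:
        :param spider:
        :return:
        """
        # Cookies may have expired by the time a page is requested. A failed
        # visit shows up in two ways: 1. a 302 redirect to the login page;
        # 2. a verification page may appear.

        # Seeing redirect responses here requires configuration in settings.
        if response.status in [302, 301]:
            # Header values are bytes, so decode before the substring tests
            redirect_url = (response.headers.get('Location') or b'').decode('utf-8', 'ignore')
            if 'passport' in redirect_url:
                # Redirected to the login page: the Cookie has expired.
                self.logging.info('Cookies Invalid!')
            # '验证页面' ("verification page") appears to be a placeholder for
            # the URL fragment that identifies the verification page.
            if '验证页面' in redirect_url:
                # The Cookie is still usable; this is anti-crawling aimed at the account.
                self.logging.info('Current Cookie cannot be used; verification is required.')

            # A redirect means this request failed: fetch a new Cookie and retry.
            # Returning a Request stops the remaining response middlewares and
            # puts it back into the scheduler queue; dont_filter=True keeps the
            # duplicate filter from dropping the retry.
            return request.replace(cookies=self.get_random_cookies(), dont_filter=True)

        # No redirect: pass the response on to the following middlewares.
        return response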
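The redirect check in process_response can be isolated as a small pure function, which makes one pitfall easier to see: Scrapy stores header values as bytes, so a str substring test needs a decode first. The sketch below assumes, as the post does, that a login redirect contains 'passport' in its URL; the function name classify_redirect and the sample URLs are hypothetical.

```python
def classify_redirect(location, login_marker='passport'):
    """Return 'login' if the redirect target looks like the login page, else 'other'."""
    if location is None:
        return 'other'
    if isinstance(location, bytes):
        # Header values arrive as bytes; decode before comparing with str.
        location = location.decode('utf-8', errors='ignore')
    return 'login' if login_marker in location else 'other'


print(classify_redirect(b'https://passport.weibo.cn/signin/login'))
print(classify_redirect(b'https://weibo.cn/some/other/page'))
print(classify_redirect(None))
```

Keeping the decision separate from the middleware also makes it simple to add further markers, such as one for the verification page, once the real URL fragment is known.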

Note the configuration in settings.py:
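A minimal sketch of settings that would fit the middleware above, offered as an assumption rather than the author's actual configuration. The module path weibo_project.middlewares and the pool URL are hypothetical; WEIBO_COOKIES_URL is the key the middleware reads in from_settings(). The order 650 is chosen because Scrapy's built-in RedirectMiddleware runs at 600 and CookiesMiddleware at 700: a value in between lets process_request set request.cookies before the cookie middleware reads them, and lets process_response see a 301/302 before the redirect middleware follows it.

```python
# settings.py (sketch)

# Hypothetical address of the locally running Cookie pool service.
WEIBO_COOKIES_URL = 'http://127.0.0.1:5000/weibo/random'

# Cookie handling must stay enabled for request.cookies to be sent.
COOKIES_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    # Hypothetical project path; 650 sits between RedirectMiddleware (600)
    # and CookiesMiddleware (700).
    'weibo_project.middlewares.WeiBoMiddleWare': 650,
}
```

Another option would be to turn redirects off entirely with REDIRECT_ENABLED = False, at the cost of no longer following legitimate redirects.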
