Scraping a Captcha-Protected Login Site with the Scrapy Framework
2017-10-25 21:12
Crawling the 91pron site with Scrapy
**Disclaimer: this project exists only for learning the Scrapy crawler framework and the MongoDB database; it must not be used for commercial or other personal purposes. Any misuse is entirely the user's own responsibility.**
First, get all the packages the Scrapy framework needs installed, and then we can begin!

Open the folder that will hold the project and create the Scrapy project from cmd:

```shell
scrapy startproject yelloweb
```

I won't go over what each file in the generated Scrapy project does; if you are not familiar with them, please look that up first.
Open items.py under the yelloweb folder:
```python
import scrapy


class YellowebItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # video title
    link = scrapy.Field()   # video link
    img = scrapy.Field()    # cover image link
```
Then, in the spiders folder under yelloweb, create yellowebSpider.py and open it. The code first:
```python
import scrapy


class yellowebSpider(scrapy.Spider):
    name = "webdata"  # the spider's identifier; it must be unique
    allowed_domains = ["91.91p17.space"]
    start_urls = [  # the list of URLs the spider starts crawling from
        "http://91.91p17.space/index.php"
    ]

    def parse(self, response):
        pass
Here we go!

The first problem to solve is also the hardest one: how to log in.

Start by requesting the site's login page:
```python
def start_requests(self):
    return [Request("http://91.91p17.space/login.php",
                    callback=self.login,
                    meta={"cookiejar": 1})]

def login(self, response):
    pass  # the login handling goes here in a moment
```
A Request is used to jump to the login page. callback=self.login means the response will be handled by the login function. meta={"cookiejar": 1} tells Scrapy's CookiesMiddleware which cookie session this request belongs to, so the cookies the site sets at login can be carried along on all the later requests.
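As a conceptual model (this is an illustration with the stdlib, not Scrapy's actual middleware code): the middleware keeps one independent cookie jar per "cookiejar" key, so several login sessions could coexist in one spider without mixing cookies.

```python
from http.cookiejar import CookieJar

# one jar per "cookiejar" meta key, as CookiesMiddleware does conceptually
jars = {}

def get_jar(session_key):
    # reuse the jar for this session key, creating it on first use
    if session_key not in jars:
        jars[session_key] = CookieJar()
    return jars[session_key]

# all requests tagged meta={"cookiejar": 1} share one session ...
print(get_jar(1) is get_jar(1))  # True
# ... while meta={"cookiejar": 2} gets an independent one
print(get_jar(1) is get_jar(2))  # False
```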
Next, handle the login itself. The code:
```python
import urllib.request
from urllib.parse import urljoin

from scrapy.http import FormRequest

# class attribute of the spider; these values let requests masquerade as Chrome
headers = {
    "Host": "91.91p17.space",
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/webp,image/apng,*/*;q=0.8",
    "Referer": "http://91.91p17.space/login.php",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.8"
}

def login(self, response):
    print("Preparing to simulate the login!")
    captcha_image = response.xpath('//*[@id="safecode"]/@src').extract()
    if len(captcha_image) > 0:
        print(urljoin("http://91.91p17.space", captcha_image[0]))
        # choose a file name and local path to save the captcha image to
        localpath = r"D:\SoftWare\Soft\WorkSpace\Python\scrapy\code\captcha.png"
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent',
                              'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                              '(KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(
            urljoin("http://91.91p17.space", captcha_image[0]), localpath)
        print("This login has a captcha; check the local captcha image and type it in:")
        captcha_value = input()
        data = {
            "username": "your username here",
            "password": "your password here",
            "fingerprint": "1838373130",
            "fingerprint2": "1a694ef42547498d2142328d89e38c22",
            "captcha_input": captcha_value,
            "action_login": "Log In",
            "x": "54",
            "y": "21"
        }
    else:
        print("No captcha at login -- the code must be wrong again!")
        return []
    # print(data)
    print("The captcha was right!!!!")
    return [FormRequest.from_response(response,
                                      # keep the cookie session
                                      meta={'cookiejar': response.meta['cookiejar']},
                                      # set headers to mimic a browser
                                      headers=self.headers,
                                      formdata=data,
                                      callback=self.next)]

def next(self, response):
    pass  # the post-login processing of the site goes here
```
The code is a bit long, so let me walk through it.

We need the headers to masquerade as a browser; otherwise the site's anti-crawling measures kick in. The header values were looked up in Chrome via F12 → Network.

Next comes the login handling, which bothered me for several days. Captchas are such a pain!

Using Chrome's F12 tools, locate the captcha image's link and copy its XPath:
```python
captcha_image = response.xpath('//*[@id="safecode"]/@src').extract()
```
Since this link is a relative address, it needs a little processing to become an absolute one:

```python
urljoin("http://91.91p17.space", captcha_image[0])
```
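For illustration, urljoin resolves a relative src against the page's base URL like this:

```python
from urllib.parse import urljoin

# a captcha "src" like "captcha.php" is relative to the page it appears on
print(urljoin("http://91.91p17.space", "captcha.php"))
# -> http://91.91p17.space/captcha.php

# the last path segment of the base URL is replaced, so this gives the same result
print(urljoin("http://91.91p17.space/login.php", "captcha.php"))
# -> http://91.91p17.space/captcha.php
```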
Since I haven't learned machine learning (so no automatic captcha recognition), the captcha has to be typed in by hand!
```python
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)
```
These three lines are the crucial ones, because even downloading the captcha image turned out to be protected by anti-crawling measures. Infuriating.

Luckily there is a way around it: urlretrieve cannot attach headers when fetching a remote resource, so the code above installs a global opener that pretends to be a browser during the download as well.
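A per-request alternative worth knowing (a sketch; the captcha URL here is just the sample from above): build a urllib Request with its own User-Agent instead of installing a global opener, so the spoofed header applies only to this one download.

```python
import urllib.request

# attach the User-Agent to this single request only
req = urllib.request.Request(
    "http://91.91p17.space/captcha.php",
    headers={"User-Agent": "Mozilla/5.0"})
print(req.get_header("User-agent"))  # urllib stores header names capitalized

# the actual download would then be:
# with urllib.request.urlopen(req) as resp, open("captcha.png", "wb") as f:
#     f.write(resp.read())
```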
Bingo!! The captcha downloads just fine!
The data dictionary holds exactly the form fields we need to submit to the site.
We submit them to the site with FormRequest.from_response().
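from_response is convenient because it reads the form on the login page, pre-fills its existing fields (including hidden ones like fingerprint), and then merges our formdata on top. A stdlib-only sketch of that merge idea (the form HTML below is a simplified stand-in, not the site's real markup):

```python
from html.parser import HTMLParser

LOGIN_FORM = """
<form action="login.php" method="post">
  <input type="hidden" name="fingerprint" value="1838373130">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

class FormFieldParser(HTMLParser):
    """Collect name/value pairs from <input> tags, as from_response does."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            a = dict(attrs)
            if "name" in a:
                self.fields[a["name"]] = a.get("value") or ""

parser = FormFieldParser()
parser.feed(LOGIN_FORM)
# merge the user-supplied formdata over the pre-filled form fields
formdata = {**parser.fields, "username": "user", "password": "pass"}
print(formdata["fingerprint"])  # the hidden field is carried over automatically
```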
And just like that, we are inside the site. Ha! And now that we are in, we are certainly not going to "take" just a little bit!
Looking around: oh, there is a "more videos" link! Jump straight to that page:
```python
def next(self, response):
    href = response.xpath('//*[@id="tab-featured"]/div/a/@href').extract()
    url = urljoin("http://91.91p17.space", href[0])
    # print("\n\n\n" + url + "\n\n\n")
    yield scrapy.http.Request(url,
                              meta={'cookiejar': response.meta['cookiejar']},
                              # set headers to mimic a browser
                              headers=response.headers,
                              callback=self.parse)

def parse(self, response):
    pass  # parse the listing page
```
Same idea as before: the relative address has to be turned into an absolute one.

And now the actual crawling begins!
```python
def parse(self, response):
    sel = Selector(response)
    print("Entered the 'more videos' page")
    web_list = sel.css('.listchannel')
    for web in web_list:
        item = YellowebItem()
        try:
            item['link'] = web.xpath('a/@href').extract()[0]
            url = response.urljoin(item['link'])
            yield scrapy.Request(url,
                                 meta={'cookiejar': response.meta['cookiejar']},
                                 callback=self.parse_content,
                                 dont_filter=True)
        except Exception:
            print("That one failed...")
    # jump to the next page
    href = response.xpath('//*[@id="paging"]/div/form/a[6]/@href').extract()
    nextPage = urljoin("http://91.91p17.space/video.php", href[0])
    print(nextPage)
    if nextPage:
        yield scrapy.http.Request(nextPage,
                                  meta={'cookiejar': response.meta['cookiejar']},
                                  # set headers to mimic a browser
                                  headers=response.headers,
                                  callback=self.parse)

def parse_content(self, response):
    try:
        name = response.xpath('//*[@id="head"]/h3/a[1]/text()').extract()[0]
        item = YellowebItem()
        item['link'] = response.xpath('//*[@id="vid"]//@src').extract()[0]
        item['title'] = response.xpath('//*[@id="viewvideo-title"]/text()').extract()[0].strip()
        item['img'] = response.xpath('//*[@id="vid"]/@poster').extract()[0]
        yield item
    except Exception:
        print("That one failed... could not scrape it...")
```
The extraction combines a CSS selector with XPath selectors.
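The same "select the blocks, then pull a child attribute out of each" pattern can be tried self-contained with the stdlib's ElementTree (which supports a limited subset of XPath); the markup below is a simplified, well-formed stand-in for the real listing page, not its actual HTML:

```python
import xml.etree.ElementTree as ET

# simplified stand-in for two ".listchannel" blocks on the listing page
HTML = """
<div>
  <div class="listchannel"><a href="view_video.php?viewkey=abc">clip one</a></div>
  <div class="listchannel"><a href="view_video.php?viewkey=def">clip two</a></div>
</div>
"""

root = ET.fromstring(HTML)
# equivalent of sel.css('.listchannel') followed by web.xpath('a/@href')
links = [div.find("a").get("href")
         for div in root.findall(".//div[@class='listchannel']")]
print(links)
# -> ['view_video.php?viewkey=abc', 'view_video.php?viewkey=def']
```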
With that, the 91 site is scraped. Here is a run:
```
D:\SoftWare\Soft\WorkSpace\Python\scrapy\yelloweb>scrapy crawl webdata
2017-10-25 21:03:44 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: yelloweb)
2017-10-25 21:03:44 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['yelloweb.spiders'], 'NEWSPIDER_MODULE': 'yelloweb.spiders', 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)', 'BOT_NAME': 'yelloweb'}
2017-10-25 21:03:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-10-25 21:03:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-25 21:03:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-25 21:03:45 [scrapy.middleware] INFO: Enabled item pipelines:
['yelloweb.pipelines.YellowebPipeline']
2017-10-25 21:03:45 [scrapy.core.engine] INFO: Spider opened
2017-10-25 21:03:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-25 21:03:45 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-25 21:03:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/login.php> (referer: None)
Preparing to simulate the login!
http://91.91p17.space/captcha.php
This login has a captcha; check the local captcha image and type it in:
```
Open the path where the captcha was saved and read it off:
```
This login has a captcha; check the local captcha image and type it in:
4541
The captcha was right!!!!
2017-10-25 21:05:11 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2017-10-25 21:05:12 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://91.91p17.space/index.php> from <POST http://91.91p17.space/login.php>
2017-10-25 21:05:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/index.php> (referer: http://91.91p17.space/login.php)
2017-10-25 21:05:45 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2017-10-25 21:06:45 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
```
At this point the output looks stuck, but wait a bit and it continues on its own. Most likely the site is just answering slowly and Scrapy is pacing its requests, rather than anything actually hanging.
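If you want to control that pacing yourself, the standard Scrapy knobs live in settings.py. A sketch (the values here are arbitrary examples, not what this project used):

```python
# settings.py -- request-pacing settings (standard Scrapy setting names)
DOWNLOAD_DELAY = 1                 # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4 # parallel requests per domain
AUTOTHROTTLE_ENABLED = True        # adapt the delay to the observed latency
DOWNLOAD_TIMEOUT = 30              # give up on a response after 30 seconds
```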
Once it finishes, the results come out:
```
2017-10-25 21:08:14 [scrapy.core.scraper] DEBUG: Scraped from <200 http://91.91p17.space/view_video.php?viewkey=e231628214a5c5ea54ba&page=1&viewtype=basic&category=rf>
{'img': 'http://img2.t6k.co/thumb/240427.jpg',
 'link': 'http://192.240.120.100//mp43/240427.mp4?st=iQXkdUjR5J_1H2KjVY8WgQ&e=1509009304',
 'title': 'woman on top Guanyin sit lotus, [help to apply for highlight, '
          'thanks 91PORN platform, management audit fortunately]'}
2017-10-25 21:08:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/view_video.php?viewkey=247433dbac92ae91f6ff&page=1&viewtype=basic&category=rf> (referer: http://91.91p17.space/video.php?category=rf)
2017-10-25 21:08:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/view_video.php?viewkey=5ff48ed3ecc37745251b&page=1&viewtype=basic&category=rf> (referer: http://91.91p17.space/video.php?category=rf)
2017-10-25 21:08:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/view_video.php?viewkey=d5d24ee2936c086eb342&page=1&viewtype=basic&category=rf> (referer: http://91.91p17.space/video.php?category=rf)
2017-10-25 21:08:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/view_video.php?viewkey=358683d42298681fabe0&page=1&viewtype=basic&category=rf> (referer: http://91.91p17.space/video.php?category=rf)
2017-10-25 21:08:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/view_video.php?viewkey=835bf1fac457e9a8e9f6&page=1&viewtype=basic&category=rf> (referer: http://91.91p17.space/video.php?category=rf)
```
Next, save the scraped items to the database. Open pipelines.py.
The code is as follows:

```python
import pymysql as db


class YellowebPipeline(object):
    def __init__(self):
        self.con = db.connect(user="root", passwd="root", host="localhost",
                              db="python", charset="utf8")
        self.cur = self.con.cursor()
        self.cur.execute('drop table if exists 91pron_content')
        self.cur.execute("create table 91pron_content("
                         "id int auto_increment primary key, "
                         "title varchar(200), img varchar(244), link varchar(244))")

    def process_item(self, item, spider):
        self.cur.execute("insert into 91pron_content(id,title,img,link) "
                         "values(NULL,%s,%s,%s)",
                         (item['title'], item['img'], item['link']))
        self.con.commit()
        return item
```
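The same pipeline pattern (create the table up front, then one parameterized insert per item) can be tried self-contained with the stdlib's sqlite3 instead of MySQL. This is a demo sketch, not the pipeline itself; the table name is shortened to pron_content because an SQLite identifier starting with a digit would need quoting:

```python
import sqlite3

# an in-memory database stands in for the MySQL connection
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table if not exists pron_content("
            "id integer primary key autoincrement, "
            "title text, img text, link text)")

def process_item(item):
    # parameterized insert, same shape as YellowebPipeline.process_item
    cur.execute("insert into pron_content(title, img, link) values(?, ?, ?)",
                (item["title"], item["img"], item["link"]))
    con.commit()
    return item

process_item({"title": "demo", "img": "http://img2.t6k.co/thumb/240427.jpg",
              "link": "http://example.invalid/v.mp4"})
print(cur.execute("select title from pron_content").fetchall())
# -> [('demo',)]
```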
At the same time, set this in settings.py:

```python
DOWNLOADER_MIDDLEWARES = {
    'yelloweb.middlewares.MyCustomDownloaderMiddleware': None,
}
# the pipeline must also be enabled for process_item to run
ITEM_PIPELINES = {
    'yelloweb.pipelines.YellowebPipeline': 300,
}
```
And with that, a nice and simple crawler is done!!