[scrapy] Example: crawling jobbole pages
2017-12-14 16:25
Project overview:
Create the project
Create the spider
items.py
pipelines.py

Create the project
scrapy startproject ArticleSpider
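The command generates Scrapy's standard project skeleton (this is the default layout; exact files may vary slightly by Scrapy version):

```text
ArticleSpider/
├── scrapy.cfg              # deploy configuration
└── ArticleSpider/
    ├── __init__.py
    ├── items.py            # item definitions
    ├── pipelines.py        # item pipelines
    ├── settings.py         # project settings
    └── spiders/            # spider modules go here
        └── __init__.py
```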
Create the spider
In ArticleSpider/spiders/, create jobbole.py:

```python
# -*- coding: utf-8 -*-
import re
from urllib import parse

import scrapy
from scrapy.http import Request

from ArticleSpider.items import ArticlespiderItem


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    # Scrapy downloads this page first and hands the response to parse()
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        # 1. parse() extracts every article URL on the listing page; each one is
        #    downloaded by Request and handed to parse_detail() for extraction.
        # 2. Once the loop has yielded all article requests, next_url grabs the
        #    next listing page and calls back into parse(), so the whole cycle
        #    repeats page by page until there is no "next" link.
        post_urls = response.css("#archive .floated-thumb .post-thumb a::attr(href)").extract()
        for post_url in post_urls:
            yield Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail)

        # Next-page URL: download it and parse it with this same method
        next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
        if next_url:
            yield Request(url=next_url, callback=self.parse)

    def parse_detail(self, response):
        title = response.css('.entry-header h1::text').extract()[0]
        create_date = response.css("p.entry-meta-hide-on-mobile::text").extract()[0]
        praise_nums = response.css(".vote-post-up h10::text").extract()[0]

        fav_nums = response.css(".bookmark-btn::text").extract()[0]
        match_re = re.match(r".*?(\d+).*", fav_nums)
        fav_nums = int(match_re.group(1)) if match_re else 0

        comment_nums = response.css("a[href='#article-comment'] span::text").extract()[0]
        match_re = re.match(r".*?(\d+).*", comment_nums)
        comment_nums = int(match_re.group(1)) if match_re else 0

        item = ArticlespiderItem()  # instantiate the item
        item['name'] = title        # the item's name field holds this title
        yield item                  # hand the item off to the pipeline
        print(title, create_date, praise_nums, fav_nums, comment_nums)
```
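Two helpers used in the spider are easy to verify in isolation: `parse.urljoin` resolves the (possibly relative) article links against the listing-page URL, and the regex pulls the leading run of digits out of strings like " 2 收藏". A quick standalone check (the sample URL path and strings are made up for illustration):

```python
from urllib import parse
import re

# urljoin resolves a root-relative link against the listing-page URL
full = parse.urljoin("http://blog.jobbole.com/all-posts/", "/114638/")
print(full)  # http://blog.jobbole.com/114638/

def extract_num(text):
    # same pattern as in parse_detail: grab the first run of digits, default to 0
    match_re = re.match(r".*?(\d+).*", text)
    return int(match_re.group(1)) if match_re else 0

print(extract_num(" 2 收藏"))  # 2
print(extract_num("收藏"))     # 0 (no digits in the text)
```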
items.py
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ArticlespiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
```
pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ArticlespiderPipeline(object):
    def process_item(self, item, spider):
        # append each scraped title to a text file
        with open("my_meiju.txt", 'a') as fp:
            fp.write(item['name'] + '\n')
        return item  # return the item so any later pipelines receive it
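As the scaffold comment says, Scrapy only invokes `process_item` once the class is registered in the `ITEM_PIPELINES` setting; in a fresh project it is commented out. A minimal settings.py entry (300 is just a priority in the 0-1000 range, lower runs first):

```python
# settings.py — enable the pipeline so yielded items reach process_item
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
}
```

With that in place, running `scrapy crawl jobbole` from the project root starts the crawl and appends each article title to my_meiju.txt.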