10. Scrapy project: crawl the homepage http://cuiqingcai.com/ and collect every URL and title
2017-07-10 10:50
1. Analysis: use a CrawlSpider with a Rule-based LinkExtractor to pull article URLs, setting follow=True so matched pages are followed and crawled recursively.
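The allow pattern used in the rule below only matches post permalinks (a numeric id followed by ".html"). A minimal sketch with the stdlib `re` module checks what the regex does and does not match (the listing URL is an invented example):

```python
import re

# Same pattern as in the CrawlSpider rule: numeric post id ending in ".html"
pattern = re.compile(r'\d+\.html$')

urls = [
    'http://cuiqingcai.com/993.html',      # article page: matched, sent to the callback
    'http://cuiqingcai.com/category/web',  # listing page: not matched (but still followed)
]
for url in urls:
    print(url, bool(pattern.search(url)))
```

Non-matching pages still get followed because follow=True applies to every extracted link, not just the ones that trigger the callback.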
2. The spider:
#coding:utf-8
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from ..items import CuiqingcaiItem


class MySpider(CrawlSpider):
    name = 'cqc'
    allowed_domains = ['cuiqingcai.com']
    count_all = 0
    url_all = []
    start_urls = ['http://cuiqingcai.com']
    # Keywords used to pick out crawler-related posts (u'爬虫' means "crawler")
    label_tags = [u'爬虫', 'scrapy', 'selenium']

    rules = (
        Rule(LinkExtractor(allow=(r'\d+\.html$',)), callback='parse_all', follow=True),
        # Rule(LinkExtractor(allow=(r'\d+\.html$',)), callback='parse_pachong', follow=True),
    )

    '''
    # Store only the crawler-related posts in the database
    def parse_pachong(self, response):
        title_name = response.xpath(
            '//header/h1[1][@class="article-title"]/a/text()').extract_first() or u''
        if any(tag in title_name.lower() for tag in self.label_tags):
            self.count_all += 1
            self.url_all.append(response.url)
            item = CuiqingcaiItem()
            item['url'] = response.url
            item['title'] = title_name
            return item
    '''

    # Store the whole site's data (dumped to a JSON file)
    def parse_all(self, response):
        # extract_first() returns None instead of raising IndexError
        # on pages that have no article title
        title_name = response.xpath(
            '//header/h1[1][@class="article-title"]/a/text()').extract_first()
        if title_name:
            item = CuiqingcaiItem()
            item['url'] = response.url
            item['title'] = title_name
            return item
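The XPath in parse_all targets the link text inside the article header. The same expression can be exercised offline with the stdlib `xml.etree` module, whose limited XPath support covers this case (the HTML snippet and title below are invented for illustration; Scrapy's selectors accept the identical path):

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for a cuiqingcai.com article header
html = """
<html><body>
  <header>
    <h1 class="article-title"><a href="http://cuiqingcai.com/993.html">Some post title</a></h1>
  </header>
</body></html>
"""

root = ET.fromstring(html)
# Same idea as //header/h1[@class="article-title"]/a/text() in the spider
a = root.find(".//header/h1[@class='article-title']/a")
title = a.text if a is not None else None
print(title)  # → Some post title
```

Guarding with `a is not None` mirrors the extract_first() check in the spider: pages without an article header yield None rather than an exception.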
3. The pipeline:
from pymongo import MongoClient

from .. import settings
from ..items import CuiqingcaiItem


class CuiqingcaiPipeline(object):
    def __init__(self):
        cn = MongoClient('127.0.0.1', 27017)
        db = cn[settings.Mongodb_DBNAME]
        self.table = db[settings.Mongodb_DBTable]

    def process_item(self, item, spider):
        if isinstance(item, CuiqingcaiItem):
            try:
                self.table.insert(dict(item))
            except Exception:
                # Ignore write errors and keep the item moving
                pass
        return item
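The pipeline swallows insert errors and always returns the item, so any later pipelines still receive it. That control flow can be sketched without a running MongoDB by swapping in an in-memory stand-in for the collection (the FakeTable class is invented for illustration only):

```python
class FakeTable(object):
    """Invented stand-in for a pymongo collection; insert just appends."""
    def __init__(self):
        self.rows = []

    def insert(self, doc):
        self.rows.append(doc)


def process_item(table, item):
    # Mirrors CuiqingcaiPipeline.process_item: try to store, never raise,
    # always return the item for downstream pipelines.
    try:
        table.insert(dict(item))
    except Exception:
        pass
    return item


table = FakeTable()
item = {'url': 'http://cuiqingcai.com/993.html', 'title': 'Some post'}
assert process_item(table, item) is item
print(len(table.rows))  # → 1
```

Returning the item even on failure is deliberate: dropping it would silently starve every pipeline registered after this one.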
4. The item:
import scrapy


class CuiqingcaiItem(scrapy.Item):
    title = scrapy.Field()  # post title
    url = scrapy.Field()    # page URL