Scrapy crawler: scraping weather data and storing it as txt, JSON and other formats
2017-09-02 15:16
489 views
I. Create the Scrapy project
scrapy startproject weather
II. Create the spider file
scrapy genspider wuhanSpider wuhan.tianqi.com
III. The files of the Scrapy project
(1) items.py
import scrapy

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    cityDate = scrapy.Field()
    week = scrapy.Field()
    img = scrapy.Field()
    temperature = scrapy.Field()
    weather = scrapy.Field()
    wind = scrapy.Field()
(2) wuhanSpider.py
# -*- coding: utf-8 -*-
import scrapy
from weather.items import WeatherItem

class WuhanspiderSpider(scrapy.Spider):
    name = "wuHanSpider"
    allowed_domains = ["tianqi.com"]
    citys = ['wuhan', 'shanghai']
    start_urls = []
    for city in citys:
        start_urls.append('http://' + city + '.tianqi.com/')

    def parse(self, response):
        subSelector = response.xpath('//div[@class="tqshow1"]')
        items = []
        for sub in subSelector:
            item = WeatherItem()
            cityDates = ''
            for cityDate in sub.xpath('./h3//text()').extract():
                cityDates += cityDate
            item['cityDate'] = cityDates
            item['week'] = sub.xpath('./p//text()').extract()[0]
            item['img'] = sub.xpath('./ul/li[1]/img/@src').extract()[0]
            temps = ''
            for temp in sub.xpath('./ul/li[2]//text()').extract():
                temps += temp
            item['temperature'] = temps
            item['weather'] = sub.xpath('./ul/li[3]//text()').extract()[0]
            item['wind'] = sub.xpath('./ul/li[4]//text()').extract()[0]
            items.append(item)
        return items
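The extraction logic in parse() above can be tried without Scrapy or a live page. The sketch below reproduces it with only the standard library, joining the descendant text of each node the way the `//text()` XPath joins do; the HTML fragment is a made-up assumption about what tianqi.com's `tqshow1` blocks looked like, not the real markup.

```python
# Self-contained sketch of the parse() extraction logic using xml.etree.
# The SAMPLE fragment is invented for illustration; the real tianqi.com
# page structure may differ.
import xml.etree.ElementTree as ET

SAMPLE = """
<body>
  <div class="tqshow1">
    <h3>wuhan <b>2017-09-02</b></h3>
    <p>Saturday</p>
    <ul>
      <li><img src="http://example.com/icons/sunny.png"/></li>
      <li>22<b>~</b>31&#8451;</li>
      <li>Sunny</li>
      <li>NE wind, level 3</li>
    </ul>
  </div>
</body>
"""

def parse_fragment(html):
    root = ET.fromstring(html)
    items = []
    for sub in root.iter('div'):
        if sub.get('class') != 'tqshow1':
            continue
        lis = sub.find('ul').findall('li')
        items.append({
            # ''.join(itertext()) mirrors joining './h3//text()' in the spider
            'cityDate': ''.join(sub.find('h3').itertext()),
            'week': sub.find('p').text,
            'img': lis[0].find('img').get('src'),
            'temperature': ''.join(lis[1].itertext()),
            'weather': lis[2].text,
            'wind': lis[3].text,
        })
    return items
```

Running `parse_fragment(SAMPLE)` yields one dict with the same six fields as WeatherItem.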
(3) pipelines.py — processes the items returned by the spider and stores them in a txt file (Python 2 code: note urllib2 and the .encode() calls)
import time
import os.path
import urllib2

# Store the scraped data in a txt file
class WeatherPipeline(object):
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.txt'
        with open(fileName, 'a') as fp:
            fp.write(item['cityDate'].encode('utf8') + '\t')
            fp.write(item['week'].encode('utf8') + '\t')
            imgName = os.path.basename(item['img'])
            fp.write(imgName + '\t')
            if not os.path.exists(imgName):
                # use a separate handle so the outer fp is not shadowed
                # (reusing fp here would leave it closed for the writes below)
                with open(imgName, 'wb') as imgFp:
                    response = urllib2.urlopen(item['img'])
                    imgFp.write(response.read())
            fp.write(item['temperature'].encode('utf8') + '\t')
            fp.write(item['weather'].encode('utf8') + '\t')
            fp.write(item['wind'].encode('utf8') + '\n\n')
        time.sleep(1)
        return item
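On Python 3 the same pipeline would need a few mechanical changes: urllib2 becomes urllib.request, and opening the file with an explicit encoding removes the `.encode('utf8')` calls. A hedged sketch (the extra `download` keyword is only a convenience for exercising the method without network access; Scrapy itself calls `process_item(item, spider)`, and throttling is better left to Scrapy's DOWNLOAD_DELAY setting than to time.sleep):

```python
# Python 3 variant of the txt pipeline above (sketch, not the original code).
import os.path
import time
import urllib.request

class WeatherPipeline(object):
    def process_item(self, item, spider, download=True):
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.txt'
        # text mode with an explicit encoding: write str directly
        with open(fileName, 'a', encoding='utf-8') as fp:
            imgName = os.path.basename(item['img'])
            fp.write(item['cityDate'] + '\t')
            fp.write(item['week'] + '\t')
            fp.write(imgName + '\t')
            if download and not os.path.exists(imgName):
                # separate handle for the image so fp is not shadowed
                with urllib.request.urlopen(item['img']) as resp, \
                        open(imgName, 'wb') as imgFp:
                    imgFp.write(resp.read())
            fp.write(item['temperature'] + '\t')
            fp.write(item['weather'] + '\t')
            fp.write(item['wind'] + '\n\n')
        return item
```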
pipelines2json.py — stores the scraped data in a JSON file (referenced as weather.pipelines2json in settings.py)
import time
import json
import codecs

# Store the scraped data in a json file
class WeatherPipeline(object):
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.json'
        with codecs.open(fileName, 'a', encoding='utf8') as fp:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            fp.write(line)
        return item
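Because this pipeline appends one JSON object per line, the output file is in JSON Lines layout rather than a single JSON document, so reading it back means one json.loads() per line:

```python
# Read back a JSON Lines file produced by the pipeline above:
# one json.loads() per non-empty line.
import json

def load_items(fileName):
    items = []
    with open(fileName, encoding='utf-8') as fp:
        for line in fp:
            if line.strip():
                items.append(json.loads(line))
    return items
```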
(4) settings.py — decides which pipelines process the scraped data
BOT_NAME = 'weather'
SPIDER_MODULES = ['weather.spiders']
NEWSPIDER_MODULE = 'weather.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'weather (+http://www.yourdomain.com)'
#### user add
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
    'weather.pipelines2json.WeatherPipeline': 2,
    'weather.pipelines2mysql.WeatherPipeline': 3,
}
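The integers in ITEM_PIPELINES are priorities (conventionally 0-1000): every item flows through all enabled pipelines in ascending order of that number, so here txt, then JSON, then MySQL. A tiny plain-Python illustration of that ordering:

```python
# Scrapy runs enabled item pipelines in ascending order of their
# priority value; this mimics that ordering with a plain sort.
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
    'weather.pipelines2json.WeatherPipeline': 2,
    'weather.pipelines2mysql.WeatherPipeline': 3,
}

def pipeline_order(pipelines):
    return [path for path, prio in
            sorted(pipelines.items(), key=lambda kv: kv[1])]
```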
(5) Run the crawl
scrapy crawl wuHanSpider
(6) Results
1. txt data
2. JSON data
3. Data stored in the MySQL database
pipelines2mysql.py — stores the scraped data in MySQL (referenced as weather.pipelines2mysql in settings.py)
import MySQLdb
import os.path

# Store the scraped data in a MySQL database
class WeatherPipeline(object):
    def process_item(self, item, spider):
        cityDate = item['cityDate'].encode('utf8')
        week = item['week'].encode('utf8')
        img = os.path.basename(item['img'])
        temperature = item['temperature'].encode('utf8')
        weather = item['weather'].encode('utf8')
        wind = item['wind'].encode('utf8')
        conn = MySQLdb.connect(
            host='localhost',
            port=3306,
            user='crawlUSER',
            passwd='crawl123',
            db='scrapyDB',
            charset='utf8')
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO weather(cityDate,week,img,temperature,weather,wind) "
            "values(%s,%s,%s,%s,%s,%s)",
            (cityDate, week, img, temperature, weather, wind))
        cur.close()
        conn.commit()
        conn.close()
        return item
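The parameterized INSERT at the heart of the MySQL pipeline can be exercised without a MySQL server by swapping in the standard library's sqlite3 module; the only change to the statement itself is the placeholder style (`?` instead of MySQLdb's `%s`). The table and item values below are made up to match the schema the pipeline assumes:

```python
# Stand-in for the MySQL pipeline's insert logic using stdlib sqlite3,
# so the query can be tested locally; placeholders are ? instead of %s.
import sqlite3

def store_item(conn, item):
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO weather(cityDate,week,img,temperature,weather,wind) "
        "VALUES (?,?,?,?,?,?)",
        (item['cityDate'], item['week'], item['img'],
         item['temperature'], item['weather'], item['wind']))
    cur.close()
    conn.commit()

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE weather(cityDate TEXT, week TEXT, img TEXT, "
             "temperature TEXT, weather TEXT, wind TEXT)")
store_item(conn, {'cityDate': 'wuhan 2017-09-02', 'week': 'Saturday',
                  'img': 'sunny.png', 'temperature': '22~31',
                  'weather': 'Sunny', 'wind': 'NE 3'})
```

Passing the values as a parameter tuple (rather than formatting them into the SQL string) is what keeps both versions safe from SQL injection.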