
Scrapy Example: Scraping Lianjia New Home Data

weixin_37891983 · 2020-04-05 12:16

Goal

Use Scrapy to scrape new home listings from Lianjia at https://bj.fang.lianjia.com/loupan and store the scraped data in a json file.

Environment

The crawler was written in PyCharm on Windows, using Python 3.7 and Scrapy 2.0.1. Environment setup is not covered here.

Create the project

Right-click the Scrapy folder and choose "Open in Terminal".
In the terminal, run

scrapy startproject lianjia

where lianjia is the project name.

Create begin.py

In the project folder, create begin.py with the following content:

from scrapy import cmdline
cmdline.execute("scrapy crawl lianjia".split())

Here lianjia is the spider name (it does not have to match the project name). This file exists only to make running the spider convenient; without it, you would have to type scrapy crawl lianjia in the terminal each time.

Edit items.py

import scrapy

class MyItem(scrapy.Item):
    name = scrapy.Field()
    resblock_type = scrapy.Field()
    sale_status = scrapy.Field()
    location0 = scrapy.Field()
    location1 = scrapy.Field()
    location2 = scrapy.Field()
    num_room = scrapy.Field()
    area = scrapy.Field()
    price_pre_spm = scrapy.Field()
    price_pre_suite = scrapy.Field()
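A scrapy.Item behaves much like a dict that only accepts its declared field names. As a rough stdlib sketch of that behavior (FakeItem is a hypothetical stand-in, not part of the project):

```python
# Hypothetical stand-in mimicking how scrapy.Item restricts keys to
# the declared fields while otherwise acting like a dict.
FIELDS = {"name", "resblock_type", "sale_status", "location0", "location1",
          "location2", "num_room", "area", "price_pre_spm", "price_pre_suite"}

class FakeItem(dict):
    def __setitem__(self, key, value):
        if key not in FIELDS:
            raise KeyError("item does not support field: {}".format(key))
        dict.__setitem__(self, key, value)

item = FakeItem()
item["name"] = ["示例楼盘"]
print(dict(item))  # {'name': ['示例楼盘']}
```

Assigning to an undeclared key (say, item["price"]) raises a KeyError, which catches typos in field names early — the same guarantee the real MyItem gives in the spider below.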

Analyze the page

Right-click an element you want to scrape and choose Inspect; the name field is used as the example below.
Copy the xpath of three different name elements and the pattern is easy to spot:

/html/body/div[4]/ul[2]/li[1]/div/div[1]/a
/html/body/div[4]/ul[2]/li[2]/div/div[1]/a
/html/body/div[4]/ul[2]/li[3]/div/div[1]/a
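The three paths differ only in the li index, so iterating over all li children of ul[2] covers every listing. A quick stdlib illustration of that idea on a hypothetical, heavily simplified fragment of the page (the real page is parsed by Scrapy's own selectors, not ElementTree):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring the layout: the 4th div under body
# holds a 2nd ul whose li elements are the listing cards.
html = """
<body>
  <div/><div/><div/>
  <div>
    <ul/>
    <ul>
      <li><div><div><a>LoupanA</a></div></div></li>
      <li><div><div><a>LoupanB</a></div></div></li>
    </ul>
  </div>
</body>
"""
root = ET.fromstring(html)
# Iterating li replaces the hard-coded li[1], li[2], li[3] indices.
names = [li.find("div/div/a").text for li in root.findall("div[4]/ul[2]/li")]
print(names)  # ['LoupanA', 'LoupanB']
```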

Create and edit spider.py

In the spiders folder, create spider.py and enter:

import scrapy
from lianjia.items import MyItem

# Global variable for easy tweaking: number of pages to scrape
page = 5

class mySpider(scrapy.spiders.Spider):
    # Spider name; must match the name used in begin.py. It need not
    # match the project name; they match here only for convenience.
    name = "lianjia"
    # Allowed domains
    allowed_domains = ["bj.fang.lianjia.com"]
    # Build the list of start urls
    start_urls = []
    if page >= 1:
        for i in range(1, page + 1):
            # Queue pages 1 through page
            start_urls.append("https://bj.fang.lianjia.com/loupan/nhs1pg{}".format(i))
    else:
        print("page must >= 1")

    def parse(self, response):
        # Iterate over each li block (one listing per block)
        for each in response.xpath("/html/body/div[4]/ul[2]/*"):
            # Create a fresh item for every listing
            item = MyItem()
            item['name'] = each.xpath('div/div[1]/a/text()').extract()
            item['resblock_type'] = each.xpath('div/div[1]/span[@class="resblock-type"]/text()').extract()
            item['sale_status'] = each.xpath('div/div[1]/span[@class="sale-status"]/text()').extract()
            item['location0'] = each.xpath('div/div[2]/span[1]/text()').extract()
            item['location1'] = each.xpath('div/div[2]/span[2]/text()').extract()
            item['location2'] = each.xpath('div/div[2]/a/text()').extract()
            item['num_room'] = []
            for room in each.xpath('div/a/*'):
                item['num_room'] += room.xpath('text()').extract()
            # Drop a trailing '/' element, if present
            if item['num_room'] and item['num_room'][-1] == '/':
                del item['num_room'][-1]
            item['area'] = each.xpath('div/div[3]/span/text()').extract()
            item['price_pre_spm'] = ['均价{}元/平方米'.format(each.xpath('div/div[6]/div[1]/span[1]/text()').extract()[0])]
            item['price_pre_suite'] = each.xpath('div/div[6]/div[2]/text()').extract()
            # Hand the item to the pipeline
            yield item
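The start_urls construction can be checked on its own; with page = 5 it produces the five paging URLs of the new-home listing filter:

```python
# Same pagination logic as in the spider, isolated for a quick check
page = 5
start_urls = ["https://bj.fang.lianjia.com/loupan/nhs1pg{}".format(i)
              for i in range(1, page + 1)]
print(len(start_urls))  # 5
print(start_urls[0])    # https://bj.fang.lianjia.com/loupan/nhs1pg1
print(start_urls[-1])   # https://bj.fang.lianjia.com/loupan/nhs1pg5
```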

Edit settings.py

BOT_NAME = 'lianjia'
SPIDER_MODULES = ['lianjia.spiders']
NEWSPIDER_MODULE = 'lianjia.spiders'
# Do not obey robots.txt
ROBOTSTXT_OBEY = False
# Enable the item pipeline
ITEM_PIPELINES = {'lianjia.pipelines.MyPipeline': 300}
# Masquerade as a regular browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36'

Edit pipelines.py

import json

class MyPipeline(object):
    # Open the output file
    def open_spider(self, spider):
        try:
            self.file = open('MyData.json', "w", encoding="utf-8")
        except Exception as err:
            print(err)

    # Write one item per line
    def process_item(self, item, spider):
        dict_item = dict(item)
        # Serialize to a json string
        json_str = json.dumps(dict_item, ensure_ascii=False) + "\n"
        self.file.write(json_str)
        return item

    # Close the file
    def close_spider(self, spider):
        self.file.close()
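The pipeline writes one JSON object per line (the JSON Lines format). A self-contained sketch of the same round trip, using an in-memory buffer instead of MyData.json and made-up sample items:

```python
import io
import json

# Hypothetical sample items, shaped like the spider's output
items = [
    {"name": ["金地旭辉·江山风华"], "sale_status": ["在售"]},
    {"name": ["中海寰宇时代"], "sale_status": ["在售"]},
]

buf = io.StringIO()
for item in items:
    # ensure_ascii=False keeps the Chinese text readable in the file
    buf.write(json.dumps(item, ensure_ascii=False) + "\n")

# Each line parses independently, so the file can be read back line by line.
parsed = [json.loads(line) for line in buf.getvalue().splitlines()]
print(parsed == items)  # True
```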

Run the crawler

Run begin.py; the data is scraped successfully. Only the first three records are shown here.

{"name": ["金地旭辉·江山风华"], "resblock_type": ["住宅"], "sale_status": ["在售"], "location0": ["大兴"], "location1": ["黄村中"], "location2": ["地铁4号线清源路站西侧800米"], "num_room": ["3室", "4室"], "area": ["建面 89-136㎡"], "price_pre_spm": ["均价55800元/平方米"], "price_pre_suite": ["总价480万/套"]}
{"name": ["中海寰宇时代"], "resblock_type": ["住宅"], "sale_status": ["在售"], "location0": ["大兴"], "location1": ["瀛海"], "location2": ["黄亦路与京福路西南口交叉口"], "num_room": ["2室", "3室", "4室"], "area": ["建面 48-112㎡"], "price_pre_spm": ["均价52449元/平方米"], "price_pre_suite": ["总价350万/套"]}
{"name": ["合景天汇广场"], "resblock_type": ["住宅"], "sale_status": ["在售"], "location0": ["顺义"], "location1": ["马坡"], "location2": ["昌金路与通顺路交汇处天汇广场售楼处"], "num_room": ["3室", "4室"], "area": ["建面 89-117㎡"], "price_pre_spm": ["均价38000元/平方米"], "price_pre_suite": ["总价330万/套"]}

Tips

Do not set page too high, or the site may ban your IP.
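One way to reduce that risk is to throttle the crawler. A sketch of optional additions to settings.py (these are standard Scrapy settings; the exact values are only suggestions):

```python
# Optional throttling settings to be gentler on the site
DOWNLOAD_DELAY = 1.0            # pause between requests
AUTOTHROTTLE_ENABLED = True     # let Scrapy adapt the delay to server load
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
```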

For discussion and further details, please visit my blog.

Original post: https://boyinthesun.cn/post/python-scrapy/
