
Crawler Notes (10/4): Scrapy Project Management

2017-10-04 16:08
1. Crawler project

1) Create a crawler project

scrapy startproject <project name>

scrapy startproject myfirstpjt
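
Running the command above generates a project skeleton, roughly like the following (based on the standard Scrapy 1.x template; the exact files can vary slightly by version):

myfirstpjt/
    scrapy.cfg            # deploy/configuration file
    myfirstpjt/           # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider files go here
            __init__.py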
2) Enter the project

cd <directory containing the crawler project>

> cd myfirstpjt

3) Scrapy options

scrapy startproject -h

4) --logfile=FILE specifies the log file

Common log levels:

Level     Meaning
CRITICAL  the most severe errors occurred
ERROR     a problem occurred that must be handled immediately
WARNING   a warning; a potential problem exists
INFO      informational messages
DEBUG     debugging information, mostly used during development
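
These logging options can be combined with any scrapy command. A hypothetical run that writes only WARNING-and-above messages to a file (the file name is illustrative):

> scrapy crawl <spider name> --logfile=crawl.log --loglevel=WARNING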
5) Global commands

> scrapy -h
Scrapy 1.4.0 - project: myfirstpjt

Usage:
scrapy <command> [options] [args]

Available commands:
bench         Run quick benchmark test
check         Check spider contracts
crawl         Run a spider
edit          Edit spider
fetch         Fetch a URL using the Scrapy downloader
genspider     Generate new spider using pre-defined templates
list          List available spiders
parse         Parse URL (using its spider) and print the results
runspider     Run a self-contained spider (without creating a project)
settings      Get settings values
shell         Interactive scraping console
startproject  Create new project
version       Print Scrapy version
view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command


fetch command: show how a URL is fetched by the Scrapy downloader

> scrapy fetch --headers --nolog http://news.sina.com.cn/
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Server: nginx
< Date: Wed, 04 Oct 2017 04:14:24 GMT
< Content-Type: text/html
< Last-Modified: Wed, 04 Oct 2017 04:12:07 GMT
< Vary: Accept-Encoding
< Expires: Wed, 04 Oct 2017 04:14:21 GMT
< Cache-Control: max-age=60
< X-Powered-By: shci_v1.03
< Age: 32
< Via: http/1.1 ctc.ningbo.ha2ts4.81 (ApacheTrafficServer/4.2.1.1 [cHs f ]), http/1.1 ctc.ningbo.ha2ts4.106 (ApacheTrafficServer/4.2.1.1 [cRs f ])
< X-Cache: HIT.81
< X-Cache: HIT.106
< X-Via-Cdn: f=edge,s=ctc.ningbo.ha2ts4.107.nb.sinaedge.com,c=61.164.56.98;f=Edge,s=ctc.ningbo.ha2ts4.106,c=61.164.56.98;f=edge,s=ctc.ningbo.ha2ts4.73.nb.sinaedge.com,c=115.238.190.106;f=Edge,s=ctc.ningbo.ha2ts4.81,c=106.38.241.153
< X-Via-Edge: jgwjigaqtn
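
When --headers is not given, fetch writes the response body to standard output, so redirecting it saves the page locally; a hypothetical example:

> scrapy fetch --nolog http://news.sina.com.cn/ > sina.html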
runspider command: run a spider file directly, without relying on a Scrapy project

> scrapy runspider --loglevel=INFO runspider.py
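
The runspider.py used above is not shown; a minimal sketch of what such a self-contained spider file might contain (the class name and URL are illustrative assumptions):

import scrapy

class StandaloneSpider(scrapy.Spider):
    # self-contained: no project is required, this single file is enough for runspider
    name = "standalone"
    start_urls = ["http://www.baidu.com"]

    def parse(self, response):
        # log the page title extracted with an XPath selector
        title = response.xpath("/html/head/title/text()").extract_first()
        self.logger.info("page title: %s", title)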
settings command: view the corresponding Scrapy settings
> scrapy settings --get BOT_NAME
scrapybot
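
Note that scrapybot is Scrapy's default BOT_NAME, which suggests the command above was run outside the project directory; inside myfirstpjt it would print the project's own bot name. Any setting can be queried the same way, for example the default download delay:

> scrapy settings --get DOWNLOAD_DELAY
0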
shell command: launch Scrapy's interactive scraping shell

> scrapy shell http://www.baidu.com --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000023F01A34630>
[s]   item       {}
[s]   request    <GET http://www.baidu.com>
[s]   response   <200 http://www.baidu.com>
[s]   settings   <scrapy.settings.Settings object at 0x0000023F02EFB940>
[s]   spider     <DefaultSpider 'default' at 0x23f0318e978>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: ti=response.xpath("/html/head/title")

In [2]: print(ti)
[<Selector xpath='/html/head/title' data='<title>百度一下,你就知道</title>'>]

In [3]: exit()
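
To get just the title text rather than the Selector object, add text() to the XPath; an illustrative snippet from such a session (the Out value assumes Baidu's title at the time):

In [1]: response.xpath("/html/head/title/text()").extract_first()
Out[1]: '百度一下,你就知道'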
version command: print the Scrapy version

> scrapy version
Scrapy 1.4.0

> scrapy version -v
Scrapy    : 1.4.0
lxml      : 3.7.3.0
libxml2   : 2.9.4
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.18.0
Twisted   : 17.9.0
Python    : 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
pyOpenSSL : 17.0.0 (OpenSSL 1.0.2l  25 May 2017)
Platform  : Windows-10-10.0.15063-SP0
view command: download a page and open it in the browser, as Scrapy sees it

> scrapy view http://news.163.com/
6) Project commands

Run scrapy -h inside the project directory to see the commands available in the project:

> scrapy -h

(The output is identical to the listing in section 5; project-only commands such as check, crawl, edit, list, and parse appear because the command is run inside a project directory.)
bench command: run a quick benchmark to test local hardware performance

genspider command: generate a new Scrapy spider file from a template

scrapy genspider -l : list the spider templates currently available

Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

Generate a spider file from the basic template: scrapy genspider -t basic weisuen iqianyue.com (arguments: template, new spider name, domain to crawl)

View the contents of the csvfeed template: scrapy genspider -d csvfeed
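
For reference, the weisuen.py produced by the -t basic command above would look roughly like this (reconstructed from the standard Scrapy 1.x basic template, not copied from the author's run):

# -*- coding: utf-8 -*-
import scrapy

class WeisuenSpider(scrapy.Spider):
    name = 'weisuen'
    allowed_domains = ['iqianyue.com']
    start_urls = ['http://iqianyue.com/']

    def parse(self, response):
        # the template leaves the parsing logic to be filled in
        pass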

check command: run contract checks on a spider

scrapy check <spider name>
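
Contracts live in the spider callback's docstring; a minimal sketch of what scrapy check evaluates (the URL and expected counts are illustrative):

import scrapy

class WeisuenSpider(scrapy.Spider):
    name = 'weisuen'

    def parse(self, response):
        """This docstring is the contract that scrapy check runs.

        @url http://iqianyue.com/
        @returns items 0 10
        @returns requests 0
        """
        pass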

crawl command: start a given spider

scrapy crawl <spider name>
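
For example, running the spider generated in the genspider example above, with logging suppressed:

> scrapy crawl weisuen --nolog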

list command: list the spiders available in the current project

scrapy list

edit command: open an editor to edit a spider file (problematic on Windows; generally fine on Linux)

parse command: fetch the given URL and process and analyze it with the corresponding spider

Parameters accepted by the parse command:

Parameter                          Meaning
--spider=SPIDER                    force a specific spider to handle the URL
-a NAME=VALUE                      set a spider argument (may be repeated)
--pipelines                        process items through the pipelines
--nolinks                          don't show extracted links
--nocolour                         don't colorize the output
--rules, -r                        use CrawlSpider rules to discover the callback
--callback=CALLBACK, -c CALLBACK   spider callback used to process the response
--noitems                          don't show scraped items
--depth=DEPTH, -d DEPTH            crawl depth (default 1)
--verbose, -v                      show detailed information for each depth level
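
A hypothetical invocation combining several of these options (the spider name and URL are reused from the genspider example):

> scrapy parse --spider=weisuen --callback=parse --depth=2 --verbose http://iqianyue.com/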