
Practicing web crawling with Python's Scrapy framework

2013-09-21 21:25
I recently found that Python's Scrapy library is very handy for writing crawlers: as long as you have an idea, you can turn it into a working scraper. My test environment is Ubuntu 12.04.

Step 1: install Scrapy:

pip install Scrapy

easy_install scrapy

Either of these commands installs Scrapy.
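To double-check that the install worked, you can print the installed version (the version subcommand is listed in the help output further down):

scrapy version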

Next, let's use it to put together a simple crawl.

Command: scrapy

Running it by itself prints the list of available subcommands, which are used together over the course of a project:

wangyu@ubuntu:~$ scrapy
Scrapy 0.12.0.2546 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

The following command generates a project folder in the current directory:

scrapy startproject wo


wangyu@ubuntu:~/wo$ tree
.
├── scrapy.cfg
└── wo
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2 directories, 6 files


The tree shows what the newly created wo folder contains.

Let's first open items.py with vim:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html 
from scrapy.item import Item, Field

class WoItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

This is the default stub; let's edit it:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class WoItem(Item):
    # define the fields for your item here like:
    title = Field()
    link = Field()
    # two fields are defined: a title and a link

I added a couple of things: a coding declaration near the top (so the file can hold non-ASCII comments), and the two fields inside the class.
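As a quick sketch of how these fields get used later (this snippet is not part of the project's files, and the values are made-up placeholders), a WoItem behaves much like a Python dict:

from wo.items import WoItem

# create an item and fill the two fields defined above
item = WoItem()
item['title'] = 'some page title'      # placeholder value
item['link'] = 'http://example.com/'   # placeholder value
print(item)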

To build a spider, you must create a subclass of scrapy.spider.BaseSpider and define three main, mandatory attributes:

name: must be unique

start_urls: where the spider starts crawling

parse(): the spider's callback; it is invoked with the Response object downloaded from each URL, and that response is the method's only argument.

from scrapy.spider import BaseSpider

class wo(BaseSpider):
    name = "wo"
    allowed_domains = ["jandan.net"]
    start_urls = [
        "http://jandan.net/new",
        "http://jandan.net/fml",
    ]

    def parse(self, response):
        # name the output file after the second-to-last "/"-separated
        # segment of the URL, then dump the raw response body into it
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

Here we keep the example as simple as possible.

Now run: scrapy crawl wo

wangyu@ubuntu:~/wo$ scrapy crawl wo
2013-09-21 20:53:40+0800 [scrapy] INFO: Scrapy 0.12.0.2546 started (bot: wo)
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled item pipelines: 
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-09-21 20:53:40+0800 [wo] INFO: Spider opened
2013-09-21 20:53:40+0800 [wo] DEBUG: Crawled (200) <GET http://jandan.net/fml> (referer: None)
2013-09-21 20:53:40+0800 [wo] DEBUG: Crawled (200) <GET http://jandan.net/new> (referer: None)
2013-09-21 20:53:40+0800 [wo] INFO: Closing spider (finished)
2013-09-21 20:53:40+0800 [wo] INFO: Spider closed (finished)

This is the run output. Going back into the wo folder, we find a new file named jandan.net (both start URLs have the same second-to-last "/" segment, so they end up written to one file), but opening it shows nothing except raw HTML source rather than a directly usable copy of the page. Let's tweak the code a bit more:

from scrapy.spider import BaseSpider

class wo(BaseSpider):
    name = "wo"
    allowed_domains = ["dmoz.org", "jandan.net"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        "http://jandan.net/ooxx",
    ]

    def parse(self, response):
        # the file is named after the word just before the last "/" in the
        # URL (the second-to-last segment); the response body is written to it
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
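As a quick check of the naming rule (a standalone snippet, not part of the project), here is what split("/")[-2] yields for each of the three start URLs:

urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    "http://jandan.net/ooxx",
]
for url in urls:
    # the second-to-last "/"-separated segment becomes the file name
    print(url.split("/")[-2])   # Books, Resources, jandan.net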


Run it again:

wangyu@ubuntu:~/wo$ scrapy crawl wo

2013-09-21 21:21:16+0800 [scrapy] INFO: Scrapy 0.12.0.2546 started (bot: wo)
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled item pipelines: 
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-09-21 21:21:16+0800 [wo] INFO: Spider opened
2013-09-21 21:21:16+0800 [wo] DEBUG: Crawled (200) <GET http://jandan.net/ooxx> (referer: None)
2013-09-21 21:21:18+0800 [wo] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2013-09-21 21:21:18+0800 [wo] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2013-09-21 21:21:18+0800 [wo] INFO: Closing spider (finished)
2013-09-21 21:21:18+0800 [wo] INFO: Spider closed (finished)


Now the project's root folder contains three new files, named Books, Resources, and jandan.net, and with that our little crawler is done.
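The title and link fields defined in items.py were never actually used above. As a possible next step, here is a rough sketch of how they could be filled with the selector API that ships with this Scrapy version; the spider name and the XPath expressions are illustrative guesses, not tested against the real pages:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from wo.items import WoItem

class WoItemSpider(BaseSpider):
    name = "wo_items"                  # hypothetical second spider
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        # assume each listing is a plain <a> inside a list item;
        # adjust the XPath to match the real page structure
        for link in hxs.select('//ul/li/a'):
            item = WoItem()
            item['title'] = link.select('text()').extract()
            item['link'] = link.select('@href').extract()
            items.append(item)
        return items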
