Scrapy Web Crawler -- A First Look
I have just started with Scrapy and am learning step by step from the official documentation; these notes record my own understanding. Source: the official tutorial at https://docs.scrapy.org/en/latest/intro/tutorial.html
When using Scrapy we mostly work inside the framework it generates for us, filling in the blanks with our own code. Concretely:
First, create a project; it already contains skeleton files in a fixed layout:
```
scrapy startproject tutorial
```
This generates the following project layout in the current directory:
```
tutorial/
    scrapy.cfg            # configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # definitions of the items to be scraped
        pipelines.py      # pipelines for saving the scraped data
        settings.py       # project settings
        spiders/          # where the actual spiders go; create a new file here and fill it in
            __init__.py
            quotes_spider.py   # the file we create ourselves
```
The contents of quotes_spider.py can be written as follows:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
```
name is the identifier used later when running the crawl command.
start_requests sends requests for the starting URLs; the URLs to begin crawling from must be given here, and each successful request returns a Response.
parse then parses the content of that Response.
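As a side note, when the initial requests are plain GETs like these, Scrapy also lets you drop start_requests entirely and list the URLs in a start_urls class attribute; responses are then sent to parse() by default. A minimal sketch of that shorter form:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Scrapy builds the initial Requests from this list and,
    # by default, passes each Response to parse()
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
```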
After that, run:
```
scrapy crawl quotes
```
The run produces output like this:
```
... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...
```
From the log we can see the order of execution: the engine opens the spider (with a telnet console listening on 127.0.0.1), then crawls robots.txt and the URLs we listed, and finally parse saves the downloaded pages to files.
Next, let's look at how Scrapy extracts data.
Run:
```
scrapy shell 'http://quotes.toscrape.com/page/1/'
```
which produces something like:
```
[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
```
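Two of the shortcuts listed above are worth trying right away; the page/2 URL below is only for illustration:

```
>>> fetch('http://quotes.toscrape.com/page/2/')   # fetch another URL; rebinds `response`
>>> response.url
'http://quotes.toscrape.com/page/2/'
>>> view(response)                                # open the fetched page in a browser
>>> fetch('http://quotes.toscrape.com/page/1/')   # fetch page 1 again for the examples below
```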
Once the shell is up, we can query elements with CSS selectors in the interactive environment:
```
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
```
To extract the actual content:
```
>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']
>>> response.css('title::text')[0].extract()   # note what ::text does
'Quotes to Scrape'
```
The only difference between extract() and extract_first() is that extract_first() does not fail when nothing was matched during the crawl: it simply returns None (or a default you supply) instead of raising an error. re() offers similar extraction, but with regular expressions:
```
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
```
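As a quick illustration of that difference (the 'noelement' selector here is made up and matches nothing):

```
>>> response.css('noelement::text').extract()
[]
>>> response.css('noelement::text').extract_first()              # no error, just None (prints nothing)
>>> response.css('noelement::text').extract_first('not found')   # or supply a default
'not found'
```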
Besides CSS selectors, you can also use XPath; see http://zvon.org/comp/m/xpath.html for a reference.
```
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'
```
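A couple more XPath queries against the same page; the span class name mirrors the quote markup dissected below:

```
>>> response.xpath('//title/text()').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.xpath('//span[@class="text"]/text()').extract_first()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
```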
Next, let's see how the two methods above are used to parse the following piece of HTML source:
<div class="quote"> <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span> <span> by <small class="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> </span> <div class="tags"> Tags: <a class="tag" href="/tag/change/page/1/">change</a> <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a> <a class="tag" href="/tag/thinking/page/1/">thinking</a> <a class="tag" href="/tag/world/page/1/">world</a> </div> </div>
Parsing the source above works like this:
```
$ scrapy shell 'http://quotes.toscrape.com'
>>> quote = response.css("div.quote")[0]   # this returns a list; [0] means we only take the first element
>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'

# the strings in the tags block of the source can be handled like this
>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

# finally, the whole page can be parsed with a loop
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
... a few more of these, omitted for brevity
>>>
```
Part 2: storing the data
For small amounts of data, the following command is enough:
```
scrapy crawl quotes -o quotes.jl   # the -o flag works a bit like gcc's
```
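The export format is picked from the file extension; .jl is JSON Lines, and the usual feed formats such as .json and .csv work the same way:

```
scrapy crawl quotes -o quotes.json   # one JSON array
scrapy crawl quotes -o quotes.csv    # comma-separated values
```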
For larger amounts of data you need to use an Item Pipeline.
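As a rough sketch of what that looks like (assuming the spider yields plain dicts as above; the class and file names here are just examples), a pipeline is a class with a process_item method, registered in settings.py:

```python
# tutorial/pipelines.py -- a minimal, illustrative pipeline
import json


class JsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('quotes_pipeline.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # item is the dict yielded by the spider; write it out and pass it on
        self.file.write(json.dumps(item) + "\n")
        return item
```

```python
# tutorial/settings.py -- enable the pipeline (lower number = runs earlier)
ITEM_PIPELINES = {
    'tutorial.pipelines.JsonWriterPipeline': 300,
}
```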
Following links is also where a crawler really earns its name: for a web page it means walking its child pages, such as the "Next" pager below:
<ul class="pager"> <li class="next"> <a href="/page/2/">Next <span aria-hidden="true">→</span></a> </li> </ul>
The link can be extracted with:
```
>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'
```
After processing the starting page, we can then jump to the second page. Modify the original spider as follows and it will keep crawling, level by level:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```
There is a path issue here: the href is a relative URL. response.follow handles relative URLs for us, and it also accepts a selector or an &lt;a&gt; element directly, so the last three lines can be replaced by either of the following:
```python
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

# or
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)
```
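For comparison, without response.follow you would have to turn the relative href into an absolute URL yourself before building a Request; a sketch of that more verbose form inside parse():

```python
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    # href is relative ('/page/2/'), so join it onto the current page's URL first
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
```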
This has been only the shallowest look at Scrapy: from requesting a site, to parsing the response, to saving what was downloaded, it covers a rough implementation of the main workflow. Later posts will dig into the design of each part of the framework!