Asynchronous web scraping with Python 3

2016-03-19 12:51

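For newcomers to Python, the odds of starting with Python 3 are much higher; people like new things, after all. Python 2's awkward handling of Chinese text (byte strings by default, with Unicode as an opt-in) is a real drawback! That is what made me commit to Python 3 in the first place. Early on I also found that plenty of good libraries had no Python 3 version, but looking back now, Python 2 feels dated.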

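So this post is naturally about something new in Python 3: asynchronous web scraping. urllib is blocking, and many of the existing async frameworks were built for Python 2, so there is still little material online on the topic. Some related material can be found on Liao Xuefeng's blog, and there is also the article "python:利用asyncio进行快速抓取" (Python: fast scraping with asyncio). You will then find that the examples there are incomplete and possibly wrong. You will have questions about this fairly new area, and the low-level nature of the standard library's asyncio will catch you off guard. How do you make database access asynchronous? Is asynchronous file I/O even necessary? Python's file handling differs across operating systems, and some functions are not supported on Windows... If any of that sounds familiar, read on.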
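To make the point about blocking concrete, here is a minimal sketch of why async helps. It makes no real HTTP requests: `fake_fetch` is a hypothetical stand-in that uses `asyncio.sleep` to simulate waiting on a server, and the page names and delays are made up for illustration.

```python
import asyncio
import time


async def fake_fetch(name, delay):
    # asyncio.sleep stands in for waiting on a network response
    # (assumption: no real request is sent in this sketch).
    await asyncio.sleep(delay)
    return name


async def crawl_concurrently():
    # The three waits overlap, so the total time is close to the longest
    # single delay rather than the sum of all delays.
    return await asyncio.gather(
        fake_fetch('page1', 0.3),
        fake_fetch('page2', 0.3),
        fake_fetch('page3', 0.3),
    )


start = time.perf_counter()
results = asyncio.run(crawl_concurrently())
elapsed = time.perf_counter() - start
print(results, round(elapsed, 2))
```

A blocking client such as urllib would wait for each page in turn, so the same three requests would take roughly the sum of the delays.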

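I spent a day reading documentation and then built my own small async scraping framework. With it, you only fill in the request URLs, request parameters and so on, and the pages are fetched asynchronously, quickly and efficiently. How the fetched pages are processed is of course something you write yourself; the framework uses pyquery. After a hard search I found an async database library, aiomysql, which got only a passing mention in someone's blog... I then read an article arguing that file reads and writes in Python do not need to be asynchronous. For one thing it is too complicated and I could not find a simple library; if you want to work with Python's low-level functions, that is up to you. For another, how big can the gap between Python's speed and disk I/O speed really be? Is it truly necessary? And for very large data there is already the database option.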
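If a file write ever did turn out to stall the event loop, one option that needs no extra library is to hand the ordinary blocking call to a thread. The sketch below is not part of the framework; `save_page`, `save_page_async` and the temporary path are hypothetical names chosen for illustration.

```python
import asyncio
import os
import tempfile


def save_page(path, html):
    # Ordinary blocking file write.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)
    return len(html)


async def save_page_async(path, html):
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default thread pool executor, so the
    # event loop stays free while the write runs in a worker thread.
    return await loop.run_in_executor(None, save_page, path, html)


demo_path = os.path.join(tempfile.mkdtemp(), 'page.html')
written = asyncio.run(save_page_async(demo_path, '<html>demo</html>'))
```

For small scraped pages the plain blocking write is probably fine, which matches the argument above.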

The framework is on GitHub. There is very little code, and the GitHub page has a detailed description. Here is the main code, so you can see my design:

import asyncio

import aiohttp

import config               # request settings: HEADERS, REQ_AMOUNTS
import processData as pd    # callbacks that process each fetched page


async def fetchData(url, callback=pd.processData, params=None):
    # Set the request URL and parameters here, or pass them in from outside.
    # limit caps the number of simultaneous connections for this session.
    conn = aiohttp.TCPConnector(limit=config.REQ_AMOUNTS)
    async with aiohttp.ClientSession(headers=config.HEADERS, connector=conn) as s:
        # Requests made through the same session reuse keep-alive connections
        # automatically, so several requests to one domain can be chained
        # here without closing the connection, e.g.
        #     await s.post('***/page1')
        #     await s.get('***/page2')
        async with s.get(url, params=params) as r:
            # The response is released automatically when this block exits.
            data = await r.text()  # the page body as an HTML string
    # The session is closed at this point; hand the HTML to the callback.
    return await callback(data)


async def main():
    calendar_url = 'http://ec.cn.forexprostools.com/?ecoDayFontColor=%23c5c5c5&ecoDayBackground=%23ffffff&innerBorderColor=%23edeaea&borderColor=%23edeaea&columns=exc_flags,exc_currency,exc_importance,exc_actual,exc_forecast,exc_previous&category=_employment,_economicActivity,_inflation,_credit,_centralBanks,_confidenceIndex,_balance,_Bonds&importance=1,2,3&features=datepicker,timezone,timeselector,filters&countries=29,25,54,145,34,163,32,70,6,27,37,122,15,113,107,55,24,121,59,89,72,71,22,17,51,39,93,106,14,48,33,23,10,35,92,57,94,97,68,96,103,111,42,109,188,7,105,172,21,43,20,60,87,44,193,125,45,53,38,170,100,56,80,52,36,90,112,110,11,26,162,9,12,46,85,41,202,63,123,61,143,4,5,138,178,84,75&calType=week&timeZone=28&lang=1'

    # Add every URL you want to crawl; there is no limit. Just remember to
    # pass the callback function that processes the data.
    tasks = [
        fetchData('http://www.xm.com', pd.Xm),
        fetchData('http://www.aetoscg.com/cn/long-and-short.html', pd.Asto),
        fetchData(calendar_url, pd.Calendar),
    ]
    # gather schedules the coroutines and runs them concurrently.
    await asyncio.gather(*tasks)


if __name__ == '__main__':
    asyncio.run(main())  # requires Python 3.7+

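Forgive the absurdly long URL in the middle; I am not padding the word count, I wanted to scrape a forex economic calendar. The code is concise and the design is clean. The per-site request configuration lives in config.py, and the data processing, that is, the callback that runs after each asynchronous request returns, lives in processData.py. I combined the yield/await style, which looks synchronous but is actually asynchronous, with a callback parameter, hoping it clicks for you instantly.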
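The await-plus-callback pattern can be tried offline with a minimal sketch like the following. It makes several assumptions: the aiohttp request is replaced by a canned HTML string, the callback body is invented for illustration (the real callbacks live in processData.py and are not shown here), and a standard-library regex stands in for pyquery.

```python
import asyncio
import re


async def processData(data):
    # Hypothetical default callback: pull the <title> text out of the HTML.
    match = re.search(r'<title>(.*?)</title>', data, re.S | re.I)
    return match.group(1).strip() if match else None


async def fetchData(url, callback=processData):
    # A canned page replaces the real aiohttp request so the sketch runs
    # without network access (assumption, not the framework's behavior).
    await asyncio.sleep(0)
    data = '<html><title> Demo page </title></html>'
    # Reads top to bottom like synchronous code, yet the caller still
    # decides what happens to the page by choosing the callback.
    return await callback(data)


title = asyncio.run(fetchData('http://example.invalid/'))
print(title)
```

Swapping in a different coroutine as `callback` changes how a page is processed without touching the fetching code, which is the point of the design.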
