pykoala - A simple, small and fast web crawler - Google Project Hosting
2012-08-22 15:35
http://code.google.com/p/pykoala/
[Introduction]
pykoala is a simple, small, and fast web-crawler module. The real-world koala is a cute, slow-moving animal, but pykoala is fast and very easy to use. It can be embedded wherever you need a crawler. Some basic usage follows:

# Simplest usage
>>> from pykoala import Koala
>>> koalaBaby = Koala.Koala('http://www.cnbeta.com/')
>>> for url in koalaBaby.go():
...     print(url)

# Set the crawl depth (default: 10)
>>> from pykoala import Koala
>>> koalaBaby = Koala.Koala('http://www.cnbeta.com/')
>>> for url in koalaBaby.go(maxDepth = 5):
...     print(url)

# Only enter URLs matching www.cnbeta.com/articles/, and only yield URLs ending in .htm or .jp(e)g
>>> entryFilter = dict()
>>> entryFilter['Type'] = 'allow'
>>> entryFilter['List'] = [r'www\.cnbeta\.com/articles/', ]
>>> yieldFilter = dict()
>>> yieldFilter['Type'] = 'allow'
>>> yieldFilter['List'] = [r'\.htm$', r'\.jpe?g$', ]
>>> from pykoala import Koala
>>> koalaBaby = Koala.Koala('http://www.cnbeta.com/', entryFilter, yieldFilter)
>>> for url in koalaBaby.go():
...     print(url)

# Only yield URLs that do not start with mailto:
>>> yFilter = dict()
>>> yFilter['Type'] = 'deny'
>>> yFilter['List'] = [r'^mailto:', ]
>>> from pykoala import Koala
>>> koalaBaby = Koala.Koala('http://www.cnbeta.com/', yieldFilter = yFilter)
>>> for url in koalaBaby.go():
...     print(url)
For more usage, see the documentation in the source code.
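To make the roles of the entry filter, the yield filter, and the depth limit more concrete, here is a minimal sketch of how a crawler of this kind could apply them. It is not pykoala's actual implementation. The function names `passes_filter` and `crawl`, the in-memory `LINKS` graph that stands in for fetching pages over HTTP, the use of `re.search` for pattern matching, and the way depth is counted are all assumptions made for illustration.

```python
import re
from collections import deque


def passes_filter(url, url_filter=None):
    """Return True if `url` is accepted by an allow/deny filter dict.

    Assumed semantics: 'allow' accepts only URLs matching some pattern,
    'deny' accepts only URLs matching none. No filter accepts everything.
    """
    if not url_filter:
        return True
    matched = any(re.search(pattern, url) for pattern in url_filter['List'])
    if url_filter['Type'] == 'allow':
        return matched
    if url_filter['Type'] == 'deny':
        return not matched
    raise ValueError("filter 'Type' must be 'allow' or 'deny'")


def crawl(start, links, max_depth=10, entry_filter=None, yield_filter=None):
    """Breadth-first walk over `links` (dict: URL -> list of linked URLs).

    The start page is at depth 0. A URL is yielded if the yield filter
    accepts it, and is expanded further only if the entry filter accepts
    it and the depth limit has not been reached.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        for url in links.get(page, []):
            if url in seen:
                continue  # never report or visit the same URL twice
            seen.add(url)
            if passes_filter(url, yield_filter):
                yield url
            if depth + 1 < max_depth and passes_filter(url, entry_filter):
                queue.append((url, depth + 1))


# Hypothetical link graph used instead of real HTTP requests.
LINKS = {
    'http://www.cnbeta.com/': [
        'http://www.cnbeta.com/articles/1.htm',
        'mailto:mail@example.org',
        'http://www.cnbeta.com/about.htm',
    ],
    'http://www.cnbeta.com/articles/1.htm': [
        'http://www.cnbeta.com/articles/pic.jpeg',
        'http://www.cnbeta.com/',
    ],
    'http://www.cnbeta.com/about.htm': [
        'http://www.cnbeta.com/hidden.htm',
    ],
}

entry_filter = {'Type': 'allow', 'List': [r'www\.cnbeta\.com/articles/']}
yield_filter = {'Type': 'allow', 'List': [r'\.htm$', r'\.jpe?g$']}

for url in crawl('http://www.cnbeta.com/', LINKS,
                 entry_filter=entry_filter, yield_filter=yield_filter):
    print(url)
```

In this sketch the two filters are independent: `about.htm` is reported because it passes the yield filter, but its links are never followed because it fails the entry filter.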
For usage questions, bug reports, joint development, or technical discussion,
please contact me: Email/Gtalk: mail@zhang-chun.org
QQ: 123721771