Avoid URLs Matching Any of a Set of Patterns (Chilkat/Python Study, Part 4): Filtering URLs
2008-08-23 22:05
Everyone knows that the hyperlinks on a page can point to all sorts of things: useful ones, very useful ones, boring ones, useless ones, even broken ones, empty ones, and downright baffling ones. Writing a crawler is hard work, and href links will betray your affections again and again. So what do you do when you run into links like these? Filter them out and kick them far away. My crawler has plenty of affection to give, but it is certainly not promiscuous with it.

Code:
import chilkat

spider = chilkat.CkSpider()

# The spider object crawls a single web site at a time. As you'll see
# in later examples, you can collect outbound links and use them to
# crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com")

# Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/")

# Avoid URLs matching these patterns:
spider.AddAvoidPattern("*java*")
spider.AddAvoidPattern("*python*")
spider.AddAvoidPattern("*perl*")

# Begin crawling the site by calling CrawlNext repeatedly.
for i in range(10):
    success = spider.CrawlNext()
    if success:
        # Show the URL of the page just spidered.
        print(spider.lastUrl())
        # The HTML is available in the LastHtml property
    else:
        # Did we get an error, or are there no more URLs to crawl?
        if spider.get_NumUnspidered() == 0:
            print("No more URLs to spider")
        else:
            print(spider.lastErrorText())
        break
    # Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000)
As the code shows, it filters out "java", "python", and "perl". In practice, though, what we really want to filter out are things like "dtd", "xsd", "javascript", "(", "zip", "rar", and so on, depending on the needs of the situation.

Note: Chilkat provides many functions for restricting which URLs get crawled; for details see http://www.example-code.com/python/pythonspider.asp
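To see what these avoid-patterns do without pulling in the chilkat module, here is a minimal sketch of the same wildcard matching using Python's standard fnmatch. The should_avoid helper and the combined pattern list are illustrative assumptions of mine, not part of the Chilkat API; Chilkat's own matching may differ in details such as case handling.

```python
# Sketch of wildcard avoid-pattern matching, assuming fnmatch-style
# semantics ('*' matches any run of characters). Illustrative only.
from fnmatch import fnmatch

AVOID_PATTERNS = [
    "*java*", "*python*", "*perl*",      # patterns from the example above
    "*.dtd", "*.xsd", "*.zip", "*.rar",  # the more practical extensions
]

def should_avoid(url):
    """Return True if the URL matches any avoid pattern (lowercased first)."""
    url = url.lower()
    return any(fnmatch(url, pat) for pat in AVOID_PATTERNS)

print(should_avoid("http://www.chilkatsoft.com/java/intro.asp"))  # True
print(should_avoid("http://www.chilkatsoft.com/schema.xsd"))      # True
print(should_avoid("http://www.chilkatsoft.com/index.html"))      # False
```

A crawler loop would simply call a check like this on each extracted href and skip any URL for which it returns True.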