
Avoid URLs Matching Any of a Set of Patterns (Chilkat/Python tutorial, part 4): filtering URLs

2008-08-23 22:05
As everyone knows, the hyperlinks on a page can point to all sorts of things: useful ones, very useful ones, and also boring, useless, broken, empty, or downright baffling ones. Writing a crawler is hard work, and it constantly gets its feelings hurt by href. So what do you do when you run into these links? Filter them out and kick them far away. My crawler has rich feelings, but it absolutely refuses to be indiscriminate.

Code:
import chilkat

spider = chilkat.CkSpider()

# The spider object crawls a single web site at a time. As you'll see
# in later examples, you can collect outbound links and use them to
# crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com")

# Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/")

# Avoid URLs matching these patterns:
spider.AddAvoidPattern("*java*")
spider.AddAvoidPattern("*python*")
spider.AddAvoidPattern("*perl*")

# Begin crawling the site by calling CrawlNext repeatedly.
for i in range(0, 10):
    success = spider.CrawlNext()
    if success:
        # Show the URL of the page just spidered.
        print(spider.lastUrl())
        # The HTML is available in the LastHtml property
    else:
        # Did we get an error or are there no more URLs to crawl?
        if spider.get_NumUnspidered() == 0:
            print("No more URLs to spider")
        else:
            print(spider.lastErrorText())

    # Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000)
In this code you can see it filters out "java", "python", and "perl", but in practice what we should usually filter out is things like "dtd,xsd,javascript,(,zip,rar" and so on, depending on what the situation calls for.
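To see how this kind of wildcard filtering behaves, here is a small standalone sketch using only Python's standard-library `fnmatch` module. It illustrates the matching idea with some of the file-type patterns suggested above; it is my own illustration, not Chilkat's actual implementation, and the pattern list and URLs are made up for the example.

```python
# Sketch of wildcard URL filtering with fnmatch (illustration only,
# not Chilkat's implementation). Patterns and URLs are examples.
from fnmatch import fnmatch

AVOID_PATTERNS = ["*.dtd", "*.xsd", "*javascript*", "*.zip", "*.rar"]

def should_avoid(url):
    """Return True if the URL matches any avoid pattern."""
    return any(fnmatch(url.lower(), pat) for pat in AVOID_PATTERNS)

urls = [
    "http://www.chilkatsoft.com/download/setup.zip",   # skipped
    "http://www.chilkatsoft.com/schema/config.xsd",    # skipped
    "http://www.chilkatsoft.com/help/index.html",      # crawled
]
for u in urls:
    print(u, "-> skip" if should_avoid(u) else "-> crawl")
```

Note that `fnmatch` treats `*` as "any sequence of characters", which is the same convention the `AddAvoidPattern` wildcard strings in the example above appear to use.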

Note: Chilkat provides many functions for restricting which URLs get crawled; for details, see http://www.example-code.com/python/pythonspider.asp