Avoid URLs Matching Any of a Set of Patterns (Chilkat/Python Study, Part 4): Filtering URLs
2008-08-23 22:05
Everyone knows that the hyperlinks on a page can point to all sorts of things: useful ones, very useful ones, boring ones, useless ones, even broken ones, empty ones, and downright baffling ones. Writing a crawler is hard work, and href links will betray your affections again and again. So what do you do when you run into links like these? Filter them out and kick them far away. My crawler has plenty of affection to give, but it is certainly not promiscuous with it.

Code:
import chilkat

spider = chilkat.CkSpider()

# The spider object crawls a single web site at a time. As you'll see
# in later examples, you can collect outbound links and use them to
# crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com")

# Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/")

# Avoid URLs matching these patterns:
spider.AddAvoidPattern("*java*")
spider.AddAvoidPattern("*python*")
spider.AddAvoidPattern("*perl*")

# Begin crawling the site by calling CrawlNext repeatedly.
for i in range(10):
    success = spider.CrawlNext()
    if success:
        # Show the URL of the page just spidered.
        print(spider.lastUrl())
        # The HTML is available in the LastHtml property
    else:
        # Did we get an error, or are there no more URLs to crawl?
        if spider.get_NumUnspidered() == 0:
            print("No more URLs to spider")
        else:
            print(spider.lastErrorText())
        break
    # Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000)
As the code shows, it filters out "java", "python", and "perl". In practice, though, what we really want to filter out are things like "dtd", "xsd", "javascript", "(", "zip", "rar", and so on, depending on the needs of the situation.

Note: Chilkat provides many functions for restricting which URLs get crawled; for details see http://www.example-code.com/python/pythonspider.asp
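To see what these avoid-patterns do without pulling in the chilkat module, here is a minimal sketch of the same wildcard matching using Python's standard fnmatch. The should_avoid helper and the combined pattern list are illustrative assumptions of mine, not part of the Chilkat API; Chilkat's own matching may differ in details such as case handling.

```python
# Sketch of wildcard avoid-pattern matching, assuming fnmatch-style
# semantics ('*' matches any run of characters). Illustrative only.
from fnmatch import fnmatch

AVOID_PATTERNS = [
    "*java*", "*python*", "*perl*",      # patterns from the example above
    "*.dtd", "*.xsd", "*.zip", "*.rar",  # the more practical extensions
]

def should_avoid(url):
    """Return True if the URL matches any avoid pattern (lowercased first)."""
    url = url.lower()
    return any(fnmatch(url, pat) for pat in AVOID_PATTERNS)

print(should_avoid("http://www.chilkatsoft.com/java/intro.asp"))  # True
print(should_avoid("http://www.chilkatsoft.com/schema.xsd"))      # True
print(should_avoid("http://www.chilkatsoft.com/index.html"))      # False
```

A crawler loop would simply call a check like this on each extracted href and skip any URL for which it returns True.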