
A Brief Summary of Getting Started with Lucene and Nutch (Repost)

2013-01-21 00:42, 756 views

Original post: http://cid-47027e68f36cbaf5.spaces.live.com/blog/cns!47027E68F36CBAF5!443.entry

A disclaimer up front: this is a beginner's summary of a very beginner-level exercise.

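It started on January 2. In the lab, I noticed a copy of "Lucene in Action" on Huanghuang's bookshelf, found it interesting, and picked it up. I then learned that Nutch is a popular open-source search engine built on Lucene, so I decided to give it a try.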

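First, the main references.

A Nutch getting-started tutorial written by someone from BUPT (Beijing University of Posts and Telecommunications). Download link:

download.csdn.net/source/619615

Of course, to download anything from CSDN you need to register an account first.

After that, it is basically a matter of following the tutorial step by step.

(Also: Nutch_tutorial8.pdf is a tutorial that looks quite good as well, but it is in English, which I read with great difficulty, so I did not go through it.)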

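Some issues I ran into along the way

Installing Cygwin: download setup.exe, have it download the packages to a local directory first, and then install from that local directory once the download finishes. I installed it under D:\cygwin.

I already had Java installed, so it only needed configuring. PATH, CLASSPATH, and JAVA_HOME must all be set correctly; otherwise you get annoying problems such as ClassLoader errors. To verify, type java or javac at the command line and check for error messages. One thing I still do not understand: it complained that tools.jar could not be found, although the file was actually in jre\lib; copying it into jdk\lib fixed the problem.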
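The checks above can also be scripted. The following Python sketch is an illustration only, not part of the original tutorial: the function name check_java_setup is hypothetical, and the tools.jar location assumes the layout of JDK 8 and earlier (JAVA_HOME pointing at the JDK, with the file under lib), since tools.jar no longer exists from JDK 9 onward.

```python
import os
import shutil


def check_java_setup(env=None, which=shutil.which):
    """Return a list of detected problems; an empty list means none were found.

    `env` and `which` are injectable so the check can be exercised without
    touching the real environment (hypothetical helper, not a Nutch API).
    """
    env = os.environ if env is None else env
    problems = []

    # java and javac should both be reachable through PATH.
    for tool in ("java", "javac"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    # JAVA_HOME should be set and, on JDK 8 or earlier, contain lib/tools.jar.
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isfile(os.path.join(java_home, "lib", "tools.jar")):
        problems.append("tools.jar not found under JAVA_HOME/lib")

    return problems


if __name__ == "__main__":
    for problem in check_java_setup():
        print("WARNING:", problem)
```

Running the script directly prints one warning per detected problem and prints nothing when none of these checks fails.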

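Next, run Nutch. The first step is crawling pages; just follow the tutorial step by step.

crawl-urlfilter.txt: pay attention to the accept rule for your domain (shown in bold in the original post).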

# The url filter file used by the crawl command.

# Better for intranet crawling.

# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression

# prefixed by '+' or '-'. The first matching pattern in the file

# determines whether a URL is included or ignored. If no pattern

# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls

-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops

-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*163.com/

# skip everything else

-.
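The comments in the filter file describe a first-match-wins rule list, which can be sketched in a few lines of Python. This is an illustration of the described semantics, not Nutch's own code: parse_rules and accepts are hypothetical names, only some of the rules are reproduced, and Python's re module stands in for the Java regular expressions Nutch uses (the patterns shown here are valid in both).

```python
import re

# A shortened copy of the rules, in the same order as in crawl-urlfilter.txt.
RULES_TEXT = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept hosts in 163.com
+^http://([a-z0-9]*\.)*163.com/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn filter-file text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first matching pattern decides; if nothing matches, the URL is ignored."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


RULES = parse_rules(RULES_TEXT)
```

For example, accepts(RULES, "http://news.163.com/index.html") passes the two skip rules and is decided by the accept rule, while a URL containing "?" is rejected by the earlier skip rule before the accept rule is ever consulted. Rule order therefore matters.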

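The nutch-site.xml under the nutch directory needs little explanation; just make sure every <value> element actually has a value.

The crawl itself is trivial: just keep an eye on crawl.log to see what is being fetched.

One problem I hit was "No URLs to fetch": the crawl exited without doing anything, which was frustrating. Later it somehow started working; it seems it only worked after I set the following.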

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*163.com/
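One way to narrow down a "No URLs to fetch" exit is to test the seed URLs against the accept rule by hand. The Python sketch below does this for the 163.com rule. It is an illustration under assumptions: rejected_seeds is a hypothetical helper, whether a rejected seed was the actual cause in the case described above is not established, and Nutch may normalize URLs before filtering them, so a rejection here is only a hint.

```python
import re

# The accept rule from crawl-urlfilter.txt, without its leading '+'.
ACCEPT = re.compile(r"^http://([a-z0-9]*\.)*163.com/")


def rejected_seeds(seed_urls, accept=ACCEPT):
    """Return the seed URLs that the accept pattern does not match."""
    return [url for url in seed_urls if not accept.search(url)]
```

With a made-up seed list such as ["http://www.163.com/", "https://www.163.com/"], the second entry is reported because the rule only allows the http:// scheme. If every seed is reported, the filter would leave nothing for the crawler to fetch.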

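Searching runs on Tomcat. I configured the nutch-site.xml under Tomcat as follows (the relevant part was in bold in the original post). In addition, searcher.dir in nutch-default.xml apparently needs to be changed as well, so I set it to the same value.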

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>

<property>

<name>searcher.dir</name>

<value>D:/cygwin/nutch/crawldemo</value>

</property>

</configuration>

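Note: my fatal mistake was that the D:/cygwin/nutch/crawldemo path I entered contained a space, so searches returned nothing no matter what I tried, and it nearly drove me crazy. I hope nobody else has to suffer the same way.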
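Both pitfalls mentioned in this post, an empty <value> and a stray space in searcher.dir, are easy to check mechanically. The following Python sketch is an illustration only: check_searcher_dir is a hypothetical helper, not part of Nutch, and it treats any whitespace in the value as suspicious, which is stricter than Nutch itself may require.

```python
import xml.etree.ElementTree as ET


def check_searcher_dir(xml_text):
    """Return a list of problems found with searcher.dir in nutch-site.xml text."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name != "searcher.dir":
            continue
        value = prop.findtext("value", default="")
        if not value.strip():
            return ["searcher.dir has an empty <value>"]
        if any(ch.isspace() for ch in value):
            # A space anywhere in the path is the mistake described above.
            return ["searcher.dir contains whitespace: " + repr(value)]
        return []
    return ["searcher.dir is not defined"]
```

To check a real file, read its contents and pass the text to check_searcher_dir. If the file declares an encoding in its XML header, read it in binary mode, because ElementTree rejects an encoding declaration inside an already decoded string.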

Finally, to console myself: this one small thing basically took me three days. How embarrassing.

Some links (all quite good for beginners):

Installing Lucene on Linux (detailed)

http://blog.c1gstudio.com/archives/142

Installing Nutch on Windows

http://read.newbooks.com.cn/info/196850.html

Compiling and running Nutch in Eclipse

http://zhangxiang390.iteye.com/blog/257373

Nutch-0.9 source code: an overall analysis of the Crawl class

http://hi.baidu.com/shirdrn/blog/item/b7de0813a865a8d6f7039e18.html

Some fine points of getting Nutch running

http://blog.csdn.net/fancyhf/archive/2007/08/29/1763629.aspx

How to add Chinese word segmentation to Nutch

http://www.chinawiss.com/docs/docs/14/1194.html

Nutch project configuration 1 (intranet search)

http://wind-bell.iteye.com/blog/80135
