Fixing common errors when installing and using Nutch 1.6

2017-08-22 09:36

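This post collects the errors I ran into while using Nutch 1.6 and how I fixed them. The first was "Fetcher: No agents listed in 'http.agent.name' property". The fix below comes from this post: http://blog.csdn.net/chaishen10000/article/details/7183382

Most explanations online say to find nutch-default.xml under {nutch}/conf.

If the property is initially set like this: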

<property>
<name>http.agent.name</name>
<value></value>
</property>

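then Nutch may report the error "Fetcher: No agents listed in 'http.agent.name' property". The cause is that the <value></value> element is empty. Put something in it (as far as I can tell, any non-empty string works), for example: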

<property>
<name>http.agent.name</name>
<value>ZB nutch agent</value>
</property>

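Editing the file under {nutch}/conf had no effect for me in Nutch 1.6. On closer inspection, 1.6 has a runtime/local directory and everything actually runs from there. Find conf/nutch-default.xml under that directory and apply the change described above; that solves the problem.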
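To confirm which copy of the configuration really has the value set, a small check like the following can help. It is only a sketch: `read_property` and `agent_name_is_set` are names I made up for illustration, and the helper ignores whitespace around names and values for simplicity. For a real file, read its text first (for example from runtime/local/conf/nutch-default.xml) and pass it in.

```python
import xml.etree.ElementTree as ET


def read_property(conf_xml, name):
    """Return the stripped <value> of the named <property>, or None if absent."""
    root = ET.fromstring(conf_xml)
    for prop in root.iter("property"):
        prop_name = (prop.findtext("name") or "").strip()
        if prop_name == name:
            return (prop.findtext("value") or "").strip()
    return None


def agent_name_is_set(conf_xml):
    """True only if http.agent.name exists and is not blank."""
    return bool(read_property(conf_xml, "http.agent.name"))


# Hypothetical sample mirroring the empty property shown above.
sample = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value></value>
  </property>
</configuration>"""

print(agent_name_is_set(sample))  # False
```

Running it against both {nutch}/conf and runtime/local/conf shows quickly whether the edit landed in the copy that is actually used.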

The second problem:

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

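Fix: in nutch-default.xml, set the value of the http.robots.agents property to spider,* so that the agent name comes first. Editing nutch-default.xml directly is not officially recommended, though. It is better to copy the properties below into nutch-site.xml, whose settings override those in nutch-default.xml (recommended).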

<property>
<name>http.agent.name</name>
<value>spider</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version

and set their values appropriately.

</description>
</property>

<property>
<name>http.robots.agents</name>
<value>spider,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
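The rule in the second description ("put the value of http.agent.name as the first agent name") can be checked before starting a crawl. The function below is a hypothetical helper, not Nutch's own code, and it compares the names exactly after trimming whitespace, which is a deliberately strict simplification.

```python
def robots_agents_ok(agent_name, robots_agents):
    """True if the first comma-separated entry of http.robots.agents
    equals http.agent.name (both trimmed)."""
    agents = [a.strip() for a in robots_agents.split(",") if a.strip()]
    return bool(agents) and agents[0] == agent_name.strip()


print(robots_agents_ok("spider", "spider,*"))  # True
print(robots_agents_ok("spider", "*"))         # False
```

With the values from the two properties above, the first call passes; leaving http.robots.agents at a bare * fails, which corresponds to the situation behind the second error.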

The third problem: Input path doesn't exist

Deleting the folder under data/segments that the error message points to fixes it.
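Rather than waiting for the error, one can look for segment folders that are obviously incomplete. This is a sketch under assumptions: the list of subdirectories is what a fully fetched and parsed Nutch 1.x segment normally contains (content is only present when page content is stored), and `incomplete_segments` is a name invented here. It only reports candidates and deletes nothing.

```python
from pathlib import Path

# Assumed layout of a complete, fetched and parsed Nutch 1.x segment.
EXPECTED_PARTS = (
    "crawl_generate",
    "crawl_fetch",
    "crawl_parse",
    "content",
    "parse_data",
    "parse_text",
)


def incomplete_segments(segments_dir, expected=EXPECTED_PARTS):
    """Return the sorted names of segment folders missing any expected part."""
    root = Path(segments_dir)
    bad = []
    for seg in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = [part for part in expected if not (seg / part).is_dir()]
        if missing:
            bad.append(seg.name)
    return bad
```

Calling `incomplete_segments("data/segments")` lists the folders worth inspecting before removing them by hand.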

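1. Use git for version control, with GitHub as the server. bitbucket.org offers free private repositories.

2. Getting better at Nutch comes down to understanding what each configuration item in nutch-default.xml actually means (this has to be read alongside the source code).

3. ant runs according to the build.xml file, which specifies how Nutch is compiled and packaged. A good way into custom Nutch development is to study build.xml.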

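My first crawl attempt failed. Looking at the log with cat nohup.out showed that the cause was a hostname problem.

After that I ran nohup bin/nutch crawl urls -dir data -threads 20 -depth 1 & to start the crawl. This time the pages were fetched successfully and stored in files under the data directory.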

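PS: some notes picked up while learning.

crawlDb: the global directory that records the URLs known to the crawl; it can grow very large.

logs/hadoop.log contains the detailed crawl log.

Solr: a standalone enterprise search server that exposes a Web-service-style API.

slf4j: a logging facade for Java.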

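1. Hadoop, Tika, and Gora all grew out of Nutch.

2. Nutch uses Ivy for dependency management (since 1.2).

3. Nutch uses SVN for source control.

4. Lucene, Nutch, and Hadoop are well known in the search world.

5. How are Nutch and Hadoop connected?

Through the nutch script, which uses the hadoop command to submit apache-nutch-1.6.job to Hadoop's JobTracker.

6. The key to getting started with Nutch is reading through the nutch script.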
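The hand-off in point 5 can be pictured as building one `hadoop jar` command line. The sketch below only shows the general shape, `hadoop jar <job file> <main class> <args>`. The real nutch script does more (it picks the class from the subcommand and sets up the environment), `hadoop_submit_command` is a made-up helper, and the class name and arguments are assumptions based on the crawl command used earlier.

```python
import shlex


def hadoop_submit_command(job_file, main_class, args, hadoop_bin="hadoop"):
    """Build the argument list for submitting a .job file via 'hadoop jar'."""
    return [hadoop_bin, "jar", job_file, main_class, *args]


cmd = hadoop_submit_command(
    "apache-nutch-1.6.job",
    "org.apache.nutch.crawl.Crawl",  # assumed class behind 'bin/nutch crawl'
    ["urls", "-dir", "data", "-threads", "20", "-depth", "1"],
)
print(shlex.join(cmd))
```

Reading bin/nutch with this shape in mind makes it easier to see where the job file and the class name are chosen.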
