
Apache Nutch 1.3 Study Notes 10 (Extending Plugins)

2011-10-24 00:10

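1. Writing a simple plugin of your own

Here we extend Nutch with a URLFilter plugin called MyURLFilter.

1.1 Create a package

First create a package structure similar to that of urlfilter-regex, for example org.apache.nutch.urlfilter.my.

1.2 Create the extension class in this package

Then create a MyURLFilter.java file with the following content: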

package org.apache.nutch.urlfilter.my;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Implements Nutch's URLFilter extension point.
public class MyURLFilter implements URLFilter {

    private Configuration conf;

    public MyURLFilter() {
    }

    // Filters a URL string.
    // Demo behavior only: the URL is prefixed so the plugin's effect is visible.
    // In a real crawl, filter() should return the URL to accept it or null to reject it.
    @Override
    public String filter(String urlString) {
        return "My Filter:" + urlString;
    }

    @Override
    public Configuration getConf() {
        return this.conf;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    // Reads URLs from standard input, one per line, and prints the filter result.
    public static void main(String[] args) throws IOException {
        MyURLFilter filter = new MyURLFilter();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String out = filter.filter(line);
            if (out != null) {
                System.out.println(out);
            }
        }
    }
}
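The MyURLFilter class above only prepends a marker to each URL. A filter used in an actual crawl is expected to return the URL to accept it and null to reject it. The following is a minimal sketch of that contract, not the author's code. To keep it compilable without Nutch on the classpath, it uses a hypothetical stand-in interface, SimpleUrlFilter, in place of org.apache.nutch.net.URLFilter (the real interface additionally carries the getConf/setConf methods shown above). The class name SuffixRejectingFilter and the suffix list are likewise assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for org.apache.nutch.net.URLFilter, so the sketch
// compiles without Nutch. The real interface also has getConf/setConf.
interface SimpleUrlFilter {
    String filter(String urlString);
}

// Rejects URLs that end in one of the configured suffixes; accepts the rest.
class SuffixRejectingFilter implements SimpleUrlFilter {
    private final List<String> rejectedSuffixes;

    SuffixRejectingFilter(String... rejectedSuffixes) {
        this.rejectedSuffixes = Arrays.asList(rejectedSuffixes);
    }

    @Override
    public String filter(String urlString) {
        if (urlString == null) {
            return null;
        }
        String lower = urlString.toLowerCase();
        for (String suffix : rejectedSuffixes) {
            if (lower.endsWith(suffix)) {
                return null; // null means "reject this URL"
            }
        }
        return urlString; // returning the URL unchanged means "accept"
    }
}

class SuffixRejectingFilterDemo {
    public static void main(String[] args) {
        SimpleUrlFilter filter = new SuffixRejectingFilter(".jpg", ".zip");
        System.out.println(filter.filter("http://example.com/index.html"));
        System.out.println(filter.filter("http://example.com/photo.JPG"));
    }
}
```

To turn this into a real plugin, the class would implement org.apache.nutch.net.URLFilter instead of the stand-in and add the two configuration methods.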

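1.3 Package it as a jar and write the plugin.xml file

You can build the jar with Ivy or with Eclipse. Every plugin has a descriptor file, plugin.xml. For this plugin it looks like this: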

<plugin
   id="urlfilter-my"
   name="My URL Filter"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <library name="urlfilter-my.jar">
         <export name="*"/>
      </library>
      <!-- If your plugin depends on third-party libraries, list them like this:
      <library name="fontbox-1.4.0.jar"/>
      <library name="geronimo-stax-api_1.0_spec-1.jar"/>
      -->
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.net.urlfilter.my"
              name="Nutch My URL Filter"
              point="org.apache.nutch.net.URLFilter">
      <implementation id="MyURLFilter"
                      class="org.apache.nutch.urlfilter.my.MyURLFilter"/>
      <!-- by default, attribute "file" is undefined, to keep classic behavior.
      <implementation id="PrefixURLFilter"
                      class="org.apache.nutch.net.PrefixURLFilter">
         <parameter name="file" value="urlfilter-prefix.txt"/>
      </implementation>
      -->
   </extension>

</plugin>
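The class attribute of the implementation element must be the fully qualified name of the class packaged in the jar (here org.apache.nutch.urlfilter.my.MyURLFilter), and a mismatch is an easy mistake to make when a descriptor is copied from another plugin. The sketch below is one way to list the declared implementation classes with the JDK's built-in XML parser so they can be compared by eye. The class name PluginXmlCheck and the shortened descriptor embedded in main are assumptions for illustration; this is not part of Nutch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class PluginXmlCheck {
    // Returns the "class" attribute of every <implementation> element.
    // Elements inside XML comments are not elements and are therefore skipped.
    static List<String> implementationClasses(String pluginXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pluginXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("implementation");
            List<String> classes = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                classes.add(((Element) nodes.item(i)).getAttribute("class"));
            }
            return classes;
        } catch (Exception e) {
            throw new IllegalArgumentException("plugin.xml could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Shortened copy of the descriptor, embedded so the sketch is self-contained.
        String xml =
              "<plugin id=\"urlfilter-my\" name=\"My URL Filter\" version=\"1.0.0\" provider-name=\"nutch.org\">"
            + "<extension id=\"org.apache.nutch.net.urlfilter.my\" name=\"Nutch My URL Filter\""
            + " point=\"org.apache.nutch.net.URLFilter\">"
            + "<implementation id=\"MyURLFilter\" class=\"org.apache.nutch.urlfilter.my.MyURLFilter\"/>"
            + "</extension></plugin>";
        for (String cls : implementationClasses(xml)) {
            System.out.println(cls);
        }
    }
}
```

In practice the string would be read from the plugin.xml file on disk rather than embedded in the program.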

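1.4 Put the jar and the descriptor into the plugins directory

Finally, put the jar you built and plugin.xml into a folder named urlfilter-my, then copy that folder into Nutch's plugins directory.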
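The resulting layout is plugins/urlfilter-my/plugin.xml next to plugins/urlfilter-my/urlfilter-my.jar. As a rough sketch, the check below verifies that both files are present. It assumes the jar is named after the plugin folder, which holds for this example because plugin.xml declares urlfilter-my.jar but is not a general rule. The class name PluginLayoutCheck is hypothetical, and main builds a throwaway directory rather than touching a real Nutch installation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class PluginLayoutCheck {
    // True if <pluginsDir>/<pluginId>/ contains plugin.xml and <pluginId>.jar.
    // Assumption: the jar carries the same name as the plugin folder.
    static boolean isInstalled(Path pluginsDir, String pluginId) {
        Path folder = pluginsDir.resolve(pluginId);
        return Files.isRegularFile(folder.resolve("plugin.xml"))
            && Files.isRegularFile(folder.resolve(pluginId + ".jar"));
    }

    public static void main(String[] args) throws IOException {
        // Temporary stand-in for Nutch's plugins directory.
        Path plugins = Files.createTempDirectory("plugins");
        Path folder = Files.createDirectory(plugins.resolve("urlfilter-my"));

        Files.createFile(folder.resolve("plugin.xml"));
        System.out.println(isInstalled(plugins, "urlfilter-my")); // jar still missing

        Files.createFile(folder.resolve("urlfilter-my.jar"));
        System.out.println(isInstalled(plugins, "urlfilter-my")); // both files present
    }
}
```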

2. Testing with bin/nutch plugin

Before running the bin/nutch plugin command, edit the nutch-site.xml configuration file and add the new plugin to the plugin.includes property, as follows:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
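Because plugin.includes is a regular expression, it is worth checking that the edited value really selects urlfilter-my. The sketch below tests a few plugin names against the value with java.util.regex. It assumes the whole name must match the expression (Matcher.matches rather than Matcher.find), which is my understanding of how Nutch applies the property but is stated here as an assumption. The class name PluginIncludesCheck is hypothetical.

```java
import java.util.regex.Pattern;

class PluginIncludesCheck {
    // The plugin.includes value from nutch-site.xml above.
    static final String INCLUDES =
        "protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)|index-(basic|anchor)"
        + "|scoring-opic|urlnormalizer-(pass|regex|basic)";

    // Assumption: a plugin is included only if its whole name matches the expression.
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"urlfilter-my", "urlfilter-suffix", "parse-tika", "protocol-httpclient"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(INCLUDES, id));
        }
    }
}
```

Under the full-match assumption, protocol-httpclient is not selected even though the expression contains protocol-http, which is consistent with the description's advice to enable it explicitly for HTTPS.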

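The result of a test run on my machine is shown below. Each line typed on standard input is echoed back with the "My Filter:" prefix:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch plugin urlfilter-my org.apache.nutch.urlfilter.my.MyURLFilter
urlString1
My Filter:urlString1
urlString2
My Filter:urlString2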

3. Summary

This is only a simple plugin. You can of course write more complex plugins to suit your own needs.

4. References

http://wiki.apache.org/nutch/WritingPluginExample#The_Example

Author: http://blog.csdn.net/amuseme_lu
