
Building and Installing Nutch 2.3 Integrated with HBase 0.98.8

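2015-12-04 14:20
Out of the box, Nutch 2.3 officially supports HBase 0.94.14. To move to a newer HBase release, the Gora version must be raised to 0.6 or later. Because Nutch 2.3 is still fairly new, there are not many installation guides for it online. What follows is a consolidated walkthrough of installing Nutch 2.3, pieced together from those guides: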

1. Edit ivy/ivy.xml

<dependency org="org.apache.gora" name="gora-core" rev="0.6" conf="*->default"/>
<!-- Uncomment this dependency -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6" conf="*->default" />
<dependency org="org.apache.gora" name="gora-compiler-cli" rev="0.6" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-compiler" rev="0.6" conf="*->default"/>
<!-- Remove the Hadoop 1.2 dependencies, then add: -->
<dependency org="org.apache.hadoop" name="hadoop-client" rev="2.5.2" conf="*->default"/>
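
Since all four Gora artifacts need to end up on the same revision, it can help to list what ivy.xml actually declares before building. The following is a minimal sketch, not part of the original write-up: the function name `gora_revs` is hypothetical, and it assumes each `<dependency .../>` element sits on a single line, as in the snippet above. It does not understand XML comments, so a dependency that is still commented out will be listed too.

```shell
#!/bin/sh
# Hypothetical helper: print "name rev" for every org.apache.gora dependency
# line found in the given ivy.xml.
# Assumption: one <dependency .../> element per line; XML comments are NOT parsed.
gora_revs() {
  ivy_file="$1"
  grep 'org="org.apache.gora"' "$ivy_file" \
    | sed -n 's/.*name="\([^"]*\)".*rev="\([^"]*\)".*/\1 \2/p'
}
```

Called as `gora_revs ivy/ivy.xml`, it prints one line per Gora dependency, which makes a leftover older revision easy to spot.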


2. Edit ivy/ivysettings.xml

Some JARs may fail to download during the build. If that happens, change the following setting:

<property name="repository.apache.org" value="http://maven.restlet.org/" override="false"/>


3. Edit nutch-site.xml

<configuration>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
<property>
<name>plugin.folders</name>
<value>plugins</value>
</property>
</configuration>


4. Edit gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

5. Set the JDK version
If you are using JDK 1.7, edit default.properties:
javac.version= 1.7
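
Steps 4 and 5 both come down to a single key in a properties file, so a quick read-back can confirm the value the build will see. This is a rough sketch under stated assumptions: `prop_value` is a hypothetical helper name, the file uses plain `key=value` lines, and dots in the key are treated as regular-expression wildcards, which is good enough for a manual check.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a key from a Java-style properties file.
# $1 = properties file, $2 = key.
# Lines starting with '#' do not match, because only leading whitespace is
# allowed before the key. Dots in the key act as regex wildcards.
prop_value() {
  sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1"
}
```

For example, `prop_value default.properties javac.version` or `prop_value conf/gora.properties gora.datastore.default` (the `conf/` location is an assumption about where gora.properties lives in your checkout).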

6. Build
ant runtime

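Once the build succeeds, you can run the crawl step by step from the command line:
1. The injector job injects the seed URLs into the crawl queue to initialize it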
cd runtime/local
mkdir urls
echo "http://nutch.apache.org/" > urls/seed.txt
bin/nutch inject urls -crawlId test
The steps above ran without problems: a new table, test_webpage, was created in HBase and the corresponding data was written to it.
2. generate
bin/nutch generate -crawlId test
Running the command above failed with this error:

2015-12-03 16:19:43,423 WARN  mapred.LocalJobRunner - job_local246507986_0001
java.lang.Exception: java.io.EOFException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.io.EOFException
at org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473)
at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:128)
at org.apache.avro.io.BinaryDecoder.readIndex(BinaryDecoder.java:423)
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
at org.apache.hadoop.io.serializer.avro.AvroSerialization$AvroDeserializer.deserialize(AvroSerialization.java:127)
at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:146)
at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2015-12-03 16:19:43,493 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=generate: 1449130778-1365697545, jobid=job_local246507986_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:213)
at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:241)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:308)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:316)
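While tracking down the problem, I suspected an Avro version incompatibility, because Hadoop 2.5.2 ships Avro 1.7.4 while Nutch 2.3 uses Avro 1.7.6. After aligning the two versions, however, the same error still occurred. Another article mentions this problem as well and calls it an Avro bug, and there is a matching issue in the Apache JIRA. That issue was filed in 2011 and was still unresolved in 2015, which seemed implausible. On reflection, the two cases probably just share the same symptom, so I kept digging. Judging from the log, the failure is related to serialization. I eventually found an article saying that the following setting must be added to nutch-site.xml: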

<property>
<name>io.serializations</name>
<value>org.apache.hadoop.io.serializer.WritableSerialization</value>
<description>A list of serialization classes that can be used for obtaining serializers and deserializers.</description>
</property>


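After running ant runtime again, the error was gone. Since this problem took several days to solve, it seemed worth writing down. I hope it helps anyone who hits the same issue.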
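
Because the fix only takes effect once the rebuilt runtime picks up the edited file, a quick check that the property really reached the runtime configuration can save a wasted crawl attempt. The sketch below is an illustration rather than part of the original procedure: `has_property` is a hypothetical helper, and the path used in the usage note assumes the default layout that ant runtime produces under runtime/local.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the given Hadoop-style XML
# config file contains a <name>KEY</name> element.
# $1 = config file, $2 = property name.
# grep -F matches the string literally, so dots in the name are not wildcards.
# This is a plain text match and does not check the <value> or skip comments.
has_property() {
  grep -qF "<name>$2</name>" "$1"
}
```

A typical call would be `has_property runtime/local/conf/nutch-site.xml io.serializations && echo present`. If nothing is printed, the rebuilt runtime is probably still using the old configuration.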