Problems running the Nutch 1.2 crawler in Eclipse
2016-02-23 19:09
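I have been studying Nutch recently and imported the crawler's source code into Eclipse, configuring the project by following this Apache wiki page: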
http://wiki.apache.org/nutch/RunNutchInEclipse1.0
However, running the unit tests produced the following warnings, errors, and exception:
2011-05-27 11:15:46,747 WARN regex.RegexURLNormalizer (RegexURLNormalizer.java:setConf(113)) - Can't load the default config file! regex-normalize.xml
2011-05-27 11:15:46,760 INFO conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - prefix-urlfilter.txt not found
2011-05-27 11:15:46,773 INFO conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - suffix-urlfilter.txt not found
2011-05-27 11:15:46,775 WARN suffix.SuffixURLFilter (SuffixURLFilter.java:readConfigurationFile(175)) - Missing urlfilter.suffix.file, all URLs will be rejected!
2011-05-27 11:15:46,785 INFO conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - regex-urlfilter.txt not found
2011-05-27 11:15:46,786 ERROR api.RegexURLFilterBase (RegexURLFilterBase.java:setConf(138)) - Can't find resource: regex-urlfilter.txt
2011-05-27 11:15:46,794 INFO conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - automaton-urlfilter.txt not found
2011-05-27 11:15:46,795 ERROR api.RegexURLFilterBase (RegexURLFilterBase.java:setConf(138)) - Can't find resource: automaton-urlfilter.txt
2011-05-27 11:15:46,800 WARN domain.DomainURLFilter (DomainURLFilter.java:setConf(135)) - Attribute "file" is not defined in plugin.xml for plugin urlfilter-domain
2011-05-27 11:15:46,801 INFO conf.Configuration (Configuration.java:getConfResourceAsReader(968)) - found resource domain-urlfilter.txt at file:/boot/wx-zone/nutch_all/bin/domain-urlfilter.txt
2011-05-27 11:15:46,868 WARN domain.DomainSuffixes (DomainSuffixes.java:<init>(47)) - java.net.MalformedURLException
at java.net.URL.<init>(URL.java:601)
at java.net.URL.<init>(URL.java:464)
at java.net.URL.<init>(URL.java:413)
at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at org.apache.nutch.util.domain.DomainSuffixesReader.read(DomainSuffixesReader.java:54)
at org.apache.nutch.util.domain.DomainSuffixes.<init>(DomainSuffixes.java:44)
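The log shows that several configuration files (the .txt filter files and regex-normalize.xml) were not loaded, yet the same code runs fine from the command line.
My eventual fix was to copy all the configuration files from the crawler's root directory into the src/test package, after which the problem went away. It seems Nutch's unit tests depend heavily on what is available under the test source folder, which is rather messy.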
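To check which files are actually visible before and after copying them, a small classpath probe can help. The sketch below rests on the assumption that the "not found" messages come from Hadoop's Configuration resolving these names through the class loader, so a file is only found if its directory is on the Eclipse run configuration's classpath. The class name ConfClasspathCheck and the method findMissing are hypothetical names of mine, not part of Nutch; the file names are taken from the log above.

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Nutch): reports which configuration
// files cannot be resolved through the given class loader.
public class ConfClasspathCheck {

    // File names copied from the log output above.
    static final String[] CONF_FILES = {
        "regex-normalize.xml",
        "prefix-urlfilter.txt",
        "suffix-urlfilter.txt",
        "regex-urlfilter.txt",
        "automaton-urlfilter.txt",
        "domain-urlfilter.txt"
    };

    // Returns the names that the class loader cannot locate.
    static List<String> findMissing(ClassLoader loader, String... names) {
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            URL url = loader.getResource(name);
            if (url == null) {
                missing.add(name);
            } else {
                System.out.println("found " + name + " at " + url);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the tests look resources up via the context class loader.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        for (String name : findMissing(loader, CONF_FILES)) {
            System.out.println("NOT on classpath: " + name);
        }
    }
}
```

Running this class from the same Eclipse project as the tests should list the files still missing. If copying the files feels too messy, adding the directory that holds them as a source or class folder in the project's build path may achieve the same effect, though I have not verified that against this exact setup.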