Problems running the Nutch 1.2 crawler under Eclipse

2016-02-23 19:09
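I have recently been studying Nutch and imported the crawler's source code into Eclipse. I configured the project by following this Apache wiki page:

http://wiki.apache.org/nutch/RunNutchInEclipse1.0
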
However, running the unit tests produced the following warnings, errors, and an exception:

2011-05-27 11:15:46,747 WARN  regex.RegexURLNormalizer (RegexURLNormalizer.java:setConf(113)) - Can't load the default config file! regex-normalize.xml
2011-05-27 11:15:46,760 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - prefix-urlfilter.txt not found
2011-05-27 11:15:46,773 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - suffix-urlfilter.txt not found
2011-05-27 11:15:46,775 WARN  suffix.SuffixURLFilter (SuffixURLFilter.java:readConfigurationFile(175)) - Missing urlfilter.suffix.file, all URLs will be rejected!
2011-05-27 11:15:46,785 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - regex-urlfilter.txt not found
2011-05-27 11:15:46,786 ERROR api.RegexURLFilterBase (RegexURLFilterBase.java:setConf(138)) - Can't find resource: regex-urlfilter.txt
2011-05-27 11:15:46,794 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - automaton-urlfilter.txt not found
2011-05-27 11:15:46,795 ERROR api.RegexURLFilterBase (RegexURLFilterBase.java:setConf(138)) - Can't find resource: automaton-urlfilter.txt
2011-05-27 11:15:46,800 WARN  domain.DomainURLFilter (DomainURLFilter.java:setConf(135)) - Attribute "file" is not defined in plugin.xml for plugin urlfilter-domain
2011-05-27 11:15:46,801 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(968)) - found resource domain-urlfilter.txt at file:/boot/wx-zone/nutch_all/bin/domain-urlfilter.txt
2011-05-27 11:15:46,868 WARN  domain.DomainSuffixes (DomainSuffixes.java:<init>(47)) - java.net.MalformedURLException
    at java.net.URL.<init>(URL.java:601)
    at java.net.URL.<init>(URL.java:464)
    at java.net.URL.<init>(URL.java:413)
    at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
    at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at org.apache.nutch.util.domain.DomainSuffixesReader.read(DomainSuffixesReader.java:54)
    at org.apache.nutch.util.domain.DomainSuffixes.<init>(DomainSuffixes.java:44)

 

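The log shows that several configuration files (the .txt filter files and regex-normalize.xml) were not loaded, even though the same crawler runs fine from the command line.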
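
One way to narrow this down is to check whether the file names from the log are visible on the classpath that Eclipse uses for the test run. The sketch below is only a diagnostic aid and not part of Nutch: it assumes the files are looked up as classpath resources (as the "not found" messages from conf.Configuration suggest), and the class name ConfResourceCheck and the helper locate are made up for illustration.

```java
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic helper (not part of Nutch): reports which of the
// resource names seen in the log can be resolved through a class loader.
public class ConfResourceCheck {

    // File names taken from the log output above.
    static final String[] NAMES = {
        "regex-normalize.xml",
        "prefix-urlfilter.txt",
        "suffix-urlfilter.txt",
        "regex-urlfilter.txt",
        "automaton-urlfilter.txt",
        "domain-urlfilter.txt"
    };

    // Maps each name to the URL it resolves to, or to null if it is not found.
    static Map<String, URL> locate(ClassLoader loader, String... names) {
        Map<String, URL> result = new LinkedHashMap<>();
        for (String name : names) {
            result.put(name, loader.getResource(name));
        }
        return result;
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        for (Map.Entry<String, URL> entry : locate(loader, NAMES).entrySet()) {
            String where = entry.getValue() == null
                    ? "NOT on classpath"
                    : entry.getValue().toString();
            System.out.println(entry.getKey() + " -> " + where);
        }
    }
}
```

Running this class with the same Eclipse launch classpath as the failing test should list which files are missing. If most of them print "NOT on classpath", the directory holding the configuration files is probably not a source or class folder of the project.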

 

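My eventual fix was to copy all the configuration files under the crawler's root directory into the src/test package, which solved the problem. It seems that Nutch's unit tests depend heavily on what sits in the test directory, which is rather messy.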
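
The copy step can also be scripted so that it is easy to repeat after a fresh checkout. This is a minimal sketch under stated assumptions, not something Nutch ships: the class name CopyConfForTests, the default directories conf and src/test, and the choice to copy only .xml, .txt and .properties files are all illustrative and should be adjusted to the actual project layout.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Nutch): copies configuration files from
// one directory into the directory that the unit tests read from.
public class CopyConfForTests {

    // Copies the *.xml, *.txt and *.properties files directly inside sourceDir
    // (no recursion) into targetDir, overwriting existing copies.
    static List<Path> copyConfigFiles(Path sourceDir, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        List<Path> copied = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(sourceDir, "*.{xml,txt,properties}")) {
            for (Path file : stream) {
                if (Files.isRegularFile(file)) {
                    Path dest = targetDir.resolve(file.getFileName());
                    Files.copy(file, dest, StandardCopyOption.REPLACE_EXISTING);
                    copied.add(dest);
                }
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults; pass the real source and target directories as arguments.
        Path source = Paths.get(args.length > 0 ? args[0] : "conf");
        Path target = Paths.get(args.length > 1 ? args[1] : "src/test");
        for (Path p : copyConfigFiles(source, target)) {
            System.out.println("copied " + p);
        }
    }
}
```

After running it, refresh the project in Eclipse so that the copied files are picked up by the next build.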