
My Architecture Evolution Notes 12: Building an Internet Crawler with Nutch 1.7

2014-06-24
Nutch is a fairly popular web crawler.

The requirement at hand: build a web crawler with Nutch, then analyze the fetched page content further to extract the fields we need.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The pages need further parsing, so which class does the parsing?
Let's look at the code in Crawl.java:
Path tmpDir = job.getLocalPath("crawl"+Path.SEPARATOR+getDate());
Injector injector = new Injector(getConf());
Generator generator = new Generator(getConf());
Fetcher fetcher = new Fetcher(getConf());
ParseSegment parseSegment = new ParseSegment(getConf());
CrawlDb crawlDbTool = new CrawlDb(getConf());
LinkDb linkDbTool = new LinkDb(getConf());

In other words, the parsing class is:
org.apache.nutch.parse.ParseSegment
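(As an aside: assuming the standard 1.7 tool layout, ParseSegment is also what the standalone command bin/nutch parse <segment_dir> invokes, so the parse step can be re-run on its own against an already fetched segment.)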


Let's see what's inside this class!

ParseSegment.java has this map() function:

public void map(WritableComparable<?> key, Content content,
    OutputCollector<Text, ParseImpl> output, Reporter reporter)

which contains this code:

ParseResult parseResult = null;
try {
  parseResult = new ParseUtil(getConf()).parse(content);
} catch (Exception e) {
  LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e));
  return;
}

Now let's see what secrets the ParseUtil class holds.

Inside its public ParseResult parse(Content content) throws ParseException method there is this fragment:

try {
  parsers = this.parserFactory.getParsers(content.getContentType(),
      content.getUrl() != null ? content.getUrl() : "");
} catch (ParserNotFound e) {
  if (LOG.isWarnEnabled()) {
    LOG.warn("No suitable parser found when trying to parse content " + content.getUrl() +
        " of type " + content.getContentType());
  }
  throw new ParseException(e.getMessage());
}

ParseResult parseResult = null;

In other words, the matching parser classes are looked up by the page's Content-Type.

So which class actually does the HTML parsing? That's what we're after!
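Before chasing that down, here is a minimal standalone sketch of driving this lookup by hand: it wraps some bytes in a Content with type text/html and hands it to ParseUtil, which picks the parser exactly as in the fragment above. The URL and page bytes are made up for illustration, and it assumes the Nutch 1.7 jars plus the conf/ directory are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class ParseUtilDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // hypothetical URL and page body, for illustration only
    String url = "http://www.oschina.net/";
    byte[] html = "<html><head><title>demo</title></head><body>hello nutch</body></html>"
        .getBytes("UTF-8");
    // the "text/html" content type is what drives the parser lookup
    Content content = new Content(url, url, html, "text/html", new Metadata(), conf);
    ParseResult result = new ParseUtil(conf).parse(content);
    Parse parse = result.get(url);
    System.out.println("title: " + parse.getData().getTitle());
    System.out.println("text : " + parse.getText());
  }
}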

~~~~~~~~~~~~~

To test this, I added a line of code at line 92 of ParseUtil.java:

LOG.info("Parsing [" + content.getUrl() + "] with [" + parsers[i] + "]");

Rebuild with ant and run the crawl again.

Then check the logs:

2014-06-24 15:15:21,751 INFO  parse.ParseUtil - Parsing [http://my.oschina.net/hanzhankang] with [org.apache.nutch.parse.html.HtmlParser@3be131e5]
2014-06-24 15:15:21,758 INFO  parse.ParseUtil - Parsing [http://my.oschina.net/leejun2005] with [org.apache.nutch.parse.html.HtmlParser@3be131e5]
2014-06-24 15:15:21,771 INFO  parse.ParseUtil - Parsing [http://my.oschina.net/u/1259678] with [org.apache.nutch.parse.html.HtmlParser@3be131e5]
2014-06-24 15:15:21,780 INFO  parse.ParseUtil - Parsing [http://my.oschina.net/u/937625] with [org.apache.nutch.parse.html.HtmlParser@3be131e5]
2014-06-24 15:15:21,793 INFO  parse.ParseUtil - Parsing [http://www.oschina.net/news/53024/jboss-tools-4-2-beta2] with [org.apache.nutch.parse.html.HtmlParser@3be131e5]
2014-06-24 15:15:21,803 INFO  parse.ParseUtil - Parsing [http://www.oschina.net/p/jfinal] with [org.apache.nutch.parse.html.HtmlParser@3be131e5]
2014-06-24 15:15:21,812 INFO  parse.ParseUtil - Parsing [http://www.oschina.net/question/ask] with [org.apache.nutch.parse.html.HtmlParser@3be131e5]
2014-06-24 15:15:21,818 INFO  parse.ParseUtil - Parsing [http://www.oschina.net/translate/tag/docker] with [org.apache.nutch.parse.html.HtmlParser@3be131e5]

So the default HTML parser is org.apache.nutch.parse.html.HtmlParser.

In fact, a look at conf/parse-plugins.xml makes this explicit: each content type is mapped one-to-one to an array of parser classes.
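For reference, the mapping in conf/parse-plugins.xml looks roughly like this (trimmed to the HTML part; the aliases section is what ties the plugin id to the concrete extension class):

<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  ...
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>
</parse-plugins>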

~~~~~~~~~~~~ Now let's look at this class.

The class actually ships as a plugin:

plugins/parse-html/parse-html.jar

The method that performs the parsing is getParse:

public ParseResult getParse(Content content)
{
  HTMLMetaTags metaTags = new HTMLMetaTags();
  URL base;
  try {
    base = new URL(content.getBaseUrl());
  } catch (MalformedURLException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  }

  String text = "";
  String title = "";
  Outlink[] outlinks = new Outlink[0];
  Metadata metadata = new Metadata();

  DocumentFragment root;
  try {
    byte[] contentInOctets = content.getContent();
    InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

    // guess the character encoding before handing the bytes to the DOM parser
    EncodingDetector detector = new EncodingDetector(this.conf);
    detector.autoDetectClues(content, true);
    detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
    String encoding = detector.guessEncoding(content, this.defaultCharEncoding);

    metadata.set("OriginalCharEncoding", encoding);
    metadata.set("CharEncodingForConversion", encoding);

    input.setEncoding(encoding);
    if (LOG.isTraceEnabled()) {
      LOG.trace("Parsing...");
    }
    root = parse(input); // builds the DOM fragment
  } catch (IOException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (DOMException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (SAXException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (Exception e) {
    LOG.error("Error: ", e);
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  }

  // evaluate the robots meta tags (noindex/nofollow/nocache/refresh)
  HTMLMetaProcessor.getMetaTags(metaTags, root, base);
  if (LOG.isTraceEnabled()) {
    LOG.trace("Meta tags for " + base + ": " + metaTags.toString());
  }

  if (!metaTags.getNoIndex()) { // extract text and title unless noindex is set
    StringBuffer sb = new StringBuffer();
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting text...");
    }
    this.utils.getText(sb, root);
    text = sb.toString();
    sb.setLength(0);
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting title...");
    }
    this.utils.getTitle(sb, root);
    title = sb.toString().trim();
  }

  if (!metaTags.getNoFollow()) { // extract outlinks unless nofollow is set
    ArrayList<Outlink> l = new ArrayList<Outlink>();
    URL baseTag = this.utils.getBase(root);
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting links...");
    }
    this.utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
    outlinks = (Outlink[]) l.toArray(new Outlink[l.size()]);
    if (LOG.isTraceEnabled()) {
      LOG.trace("found " + outlinks.length + " outlinks in " + content.getUrl());
    }
  }

  ParseStatus status = new ParseStatus(1); // 1 == ParseStatus.SUCCESS
  if (metaTags.getRefresh()) {
    status.setMinorCode((short) 100); // 100 == ParseStatus.SUCCESS_REDIRECT
    status.setArgs(new String[] { metaTags.getRefreshHref().toString(),
        Integer.toString(metaTags.getRefreshTime()) });
  }
  ParseData parseData = new ParseData(status, title, outlinks,
      content.getMetadata(), metadata);

  ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
      new ParseImpl(text, parseData));

  // every registered HtmlParseFilter gets a crack at the result
  ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult,
      metaTags, root);
  if (metaTags.getNoCache()) { // mark the page as not to be cached
    for (Map.Entry<Text, Parse> entry : filteredParse) {
      ((Parse) entry.getValue()).getData().getParseMeta()
          .set("caching.forbidden", this.cachingPolicy);
    }
  }
  return filteredParse;
}
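One detail worth noting: the parse(input) call above does not contain an HTML parser of its own. As far as I know it delegates to NekoHTML by default (TagSoup being the alternative), selected by the parser.html.impl property in nutch-default.xml:

<property>
  <name>parser.html.impl</name>
  <value>neko</value>
</property>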

~~~~~~~~~~~~ A note on plugins

Open conf/nutch-default.xml and you will see:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

This property is the master switch: for a plugin to take effect it must also be listed here, or it is simply ignored.

Further analysis shows that the parse-time process works like this:

1. Get the HTML parsers registered for the content type from the configuration and try them one by one; as soon as one parses successfully, its result is returned.

ParseResult parseResult = null;
for (int i = 0; i < parsers.length; i++) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Parsing [" + content.getUrl() + "] with [" + parsers[i] + "]");
  }
  LOG.info("Parsing [" + content.getUrl() + "] with [" + parsers[i] + "]"); // the line added above for testing
  if (maxParseTime != -1)
    parseResult = runParser(parsers[i], content); // parse with a time limit
  else
    parseResult = parsers[i].getParse(content);

  if (parseResult != null && !parseResult.isEmpty())
    return parseResult;
}
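If I read ParseUtil correctly, maxParseTime comes from the parser.timeout property (in seconds, default 30 in nutch-default.xml); setting it to -1 disables the timeout watchdog and takes the direct getParse branch:

<property>
  <name>parser.timeout</name>
  <value>30</value>
</property>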


2. Before returning, the result must also pass through the HtmlParseFilter chain. All filters are run in sequence, unless one of them fails, which aborts the chain early.

/** Run all defined filters. */
public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {

  // loop on each filter
  for (int i = 0; i < this.htmlParseFilters.length; i++) {
    // call filter interface
    parseResult = htmlParseFilters[i].filter(content, parseResult, metaTags, doc);

    // any failure on parse obj, return
    if (!parseResult.isSuccess()) {
      // TODO: What happens when parseResult.isEmpty() ?
      // Maybe clone parseResult and use parseResult as backup...

      // remove failed parse before return
      parseResult.filter();
      return parseResult;
    }
  }

  return parseResult;
}


That is the parse-time flow and how it works.

Now the crucial question: where do we add our own self-defined fields?

At parse time, or at index time?

Since the goal is to add fields to the index, let's write our own indexing plugin.

So the task now is to write one indexing plugin per target site.

IndexingFilter -- Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from the javadoc).

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now let's add the code, step by step.

1. Go into the directory $NUTCH_HOME/src/plugin and create the plugin folder:

mkdir indexingfilter-youku

2. Create files in the following layout:

indexingfilter-youku/
    plugin.xml
    build.xml
    ivy.xml
    src/
        java/
            org/
                apache/
                    nutch/
                        indexer/
                            IndexingFilterYouKu.java

3. Edit plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="indexingfilter-youku" name="Add YouKu Field to Index"
        version="1.0.0" provider-name="nutch.org">

  <runtime>
    <library name="indexingfilter-youku.jar">
      <export name="*"/>
    </library>
  </runtime>

  <requires>
  </requires>

  <extension id="org.apache.nutch.indexer.youku"
             name="Add YouKu Field to Index"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="IndexingFilterYouKu"
                    class="org.apache.nutch.indexer.IndexingFilterYouKu"/>
  </extension>

</plugin>

4. About ivy.xml

Copy the corresponding ivy.xml from plugin/urlmeta; it works without any changes.

5. About build.xml

Change it to the following:

<?xml version="1.0" encoding="UTF-8"?>
<project name="indexingfilter-youku" default="jar">
  <import file="../build-plugin.xml"/>
</project>

6. Write IndexingFilterYouKu.java as follows:

package org.apache.nutch.indexer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.parse.Parse;

public class IndexingFilterYouKu implements IndexingFilter {

  private static final Log LOG = LogFactory.getLog(IndexingFilterYouKu.class);

  private Configuration conf;

  // implements the filter method, which gives access to important objects
  // such as the NutchDocument
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) {
    String content = parse.getText();

    // adds the new field to the document
    doc.add("pageLength", content.length());
    LOG.info("oh, the pagelength is " + content.length());
    System.out.println("oh, the pagelength is " + content.length());
    return doc;
  }

  // boilerplate
  public Configuration getConf() {
    return conf;
  }

  // boilerplate
  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}

7. Edit src/plugin/build.xml

Find

<!-- ====================================================== -->
<!-- Build & deploy all the plugin jars.                    -->
<!-- ====================================================== -->

and add this line below it:

<ant dir="indexingfilter-youku" target="deploy"/>

8. Edit nutch-site.xml

My crawl runs in local mode, so the file to edit is local/conf/nutch-site.xml.

Copy the plugin.includes property block from nutch-default.xml into nutch-site.xml, then change the copied value to:

<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|indexingfilter-youku</value>

And that's it.

9. The last step: edit $NUTCH_HOME/conf/schema.xml

Inside the <fields>...</fields> section, add:

<field name="pageLength" type="long" stored="true" indexed="true"/>

(If you index into Solr, keep in mind that this schema.xml is the template for the Solr core, so the same field also has to end up in the schema your Solr server actually uses.)

10. Run ant again, and you're done. In local mode the filter's LOG.info output ends up in logs/hadoop.log, which is a quick way to confirm the plugin is actually being invoked.

However, if the fields you need are harder to extract, say they can only be pulled out of the raw page markup, you will need an HtmlParseFilter plugin.

Next we will look at how to write our own HtmlParseFilter plugin.

Before that, though, the parsing flow of the HtmlParser class deserves a careful look; see the next article for details.
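As a preview, here is a bare-bones skeleton of what such a filter class could look like, matching the HtmlParseFilter interface signature we saw in the filter chain above. The class name and the metadata key are my own placeholders; wiring it up takes the same plugin.xml / build.xml / plugin.includes dance as steps 1-10, only against the extension point org.apache.nutch.parse.HtmlParseFilter.

package org.apache.nutch.parse;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// Skeleton only: it extracts nothing yet, it just shows where the DOM arrives.
public class HtmlParseFilterYouKu implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // `doc` is the DOM fragment built by HtmlParser; walk it here to pull out
    // site-specific values, then stash them in the parse metadata so that an
    // indexing filter can pick them up later.
    Metadata parseMeta = parseResult.get(content.getUrl()).getData().getParseMeta();
    parseMeta.set("youku.placeholder", "value-extracted-from-dom"); // placeholder
    return parseResult;
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}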