
A Walkthrough of the Nutch Crawl Process

2013-12-11 18:47
To work out how to handle recrawls, I took a close look today at what actually happens in each step of a Nutch crawl.

==============Preparation======================

(On Windows you need Cygwin.)

Check out the code from SVN;

cd into the crawler directory.
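The inject step below reads seed URLs from a local urls directory, which has to exist before inject runs. A minimal sketch (the file name seed.txt and the seed URL are placeholders of my choosing; any file in the directory works, one URL per line):

```shell
# Create the seed directory that inject reads; the file name is arbitrary,
# with one URL per line.
mkdir -p urls
echo "http://www.complaints.com/" > urls/seed.txt
```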

==============inject==========================

$ bin/nutch inject crawl/crawldb urls

Injector: starting

Injector: crawlDb: crawl/crawldb

Injector: urlDir: urls

Injector: Converting injected urls to crawl db entries.

Injector: Merging injected urls into crawl db.

Injector: done

The crawldb directory is created at this point.

Inspect its contents:

$ bin/nutch readdb crawl/crawldb -stats

CrawlDb statistics start: crawl/crawldb

Statistics for CrawlDb: crawl/crawldb

TOTAL urls: 1

retry 0: 1

min score: 1.0

avg score: 1.0

max score: 1.0

status 1 (db_unfetched): 1

CrawlDb statistics: done

===============generate=========================

$ bin/nutch generate crawl/crawldb crawl/segments

Generator: Selecting best-scoring urls due for fetch.

Generator: starting

Generator: segment: crawl/segments/20080112224520

Generator: filtering: true

Generator: jobtracker is 'local', generating exactly one partition.

Generator: Partitioning selected urls by host, for politeness.

Generator: done.

$ s1=`ls -d crawl/segments/2* | tail -1`

The segments directory is created at this point, but it contains only a crawl_generate subdirectory:

$ bin/nutch readseg -list $s1

NAME            GENERATED  FETCHER START  FETCHER END  FETCHED  PARSED
20080112224520  1          ?              ?            ?        ?

The crawldb is unchanged at this point: still one unfetched URL.
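The backtick assignment relies on segment names being generation timestamps, so a plain lexicographic sort always puts the newest segment last. A self-contained sketch of that selection, with dummy segment names I made up:

```shell
# Segment directories are named by generation timestamp, so the
# lexicographically last entry is always the newest segment.
mkdir -p demo/segments/20080112224520 demo/segments/20080113093000
s1=$(ls -d demo/segments/2* | tail -1)
echo "$s1"   # → demo/segments/20080113093000
```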

=================fetch==============================

$ bin/nutch fetch $s1

Fetcher: starting

Fetcher: segment: crawl/segments/20080112224520

Fetcher: threads: 10

fetching http://www.complaints.com/directory/directory.htm
Fetcher: done

The segment now contains several additional subdirectories.
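What fetch wrote can be seen in the segment itself. With parsing enabled during fetch (as in this run, since PARSED is 1 below), a segment ends up with the six subdirectories shown here; they are recreated empty with mkdir purely so the listing is runnable on its own:

```shell
# A fetched-and-parsed segment contains these six subdirectories; they
# are recreated empty here only to illustrate the layout.
mkdir -p seg_demo/crawl_generate seg_demo/crawl_fetch seg_demo/content \
         seg_demo/crawl_parse seg_demo/parse_data seg_demo/parse_text
ls seg_demo
```

crawl_generate holds the fetch list, crawl_fetch the per-URL fetch status, content the raw downloaded content, and crawl_parse/parse_data/parse_text the parse output that updatedb and the indexer consume later.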

$ bin/nutch readseg -list $s1

NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20080112224520  1          2008-01-12T22:52:00  2008-01-12T22:52:00  1        1

The crawldb is still unchanged: one unfetched URL. Fetch writes only into the segment; the crawldb is not touched until updatedb runs.

================updatedb=============================

$ bin/nutch updatedb crawl/crawldb $s1

CrawlDb update: starting

CrawlDb update: db: crawl/crawldb

CrawlDb update: segments: [crawl/segments/20080112224520]

CrawlDb update: additions allowed: true

CrawlDb update: URL normalizing: false

CrawlDb update: URL filtering: false

CrawlDb update: Merging segment data into db.

CrawlDb update: done

Now the crawldb's contents have changed:

$ bin/nutch readdb crawl/crawldb -stats

CrawlDb statistics start: crawl/crawldb

Statistics for CrawlDb: crawl/crawldb

TOTAL urls: 97

retry 0: 97

min score: 0.01

avg score: 0.02

max score: 1.0

status 1 (db_unfetched): 96

status 2 (db_fetched): 1

CrawlDb statistics: done
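With generate, fetch and updatedb in hand, a deeper crawl (or a recrawl) is just this cycle repeated: each round selects the best-scoring due URLs, fetches them, and merges the results back so the next generate sees the newly discovered links. A sketch of that loop; bin/nutch is replaced here by an echo stub so the control flow is runnable on its own (a real run would call bin/nutch directly, and the depth of 3 is an arbitrary choice):

```shell
# Stub standing in for bin/nutch: logs the call, and for `generate`
# creates a new timestamped segment directory like the real tool does.
seg=0
nutch() {
  echo "nutch $*"
  if [ "$1" = "generate" ]; then
    seg=$((seg + 1))
    mkdir -p "crawl/segments/2008011300000$seg"
  fi
}

depth=3
i=1
while [ "$i" -le "$depth" ]; do
  nutch generate crawl/crawldb crawl/segments
  s=$(ls -d crawl/segments/2* | tail -1)   # newest segment from this round
  nutch fetch "$s"
  nutch updatedb crawl/crawldb "$s"
  i=$((i + 1))
done
```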

==============invertlinks ==============================

$ bin/nutch invertlinks crawl/linkdb crawl/segments/*

LinkDb: starting

LinkDb: linkdb: crawl/linkdb

LinkDb: URL normalize: true

LinkDb: URL filter: true

LinkDb: adding segment: crawl/segments/20080112224520

LinkDb: done

The linkdb directory is created at this point.

===============index====================================

$ bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Indexer: starting

Indexer: linkdb: crawl/linkdb

Indexer: adding segment: crawl/segments/20080112224520

Indexing [http://www.complaints.com/directory/directory.htm] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@ba4211 (null)

Optimizing index.

merging segments _ram_0 (1 docs) into _0 (1 docs)

Indexer: done

The indexes directory is created at this point.

================Testing the crawl results==========================

$ bin/nutch org.apache.nutch.searcher.NutchBean complaints

Total hits: 1

0 20080112224520/http://www.complaints.com/directory/directory.htm

Complaints.com - Sitemap by date ?Complaints ...

References:

[1] Nutch version 0.8.x tutorial: http://lucene.apache.org/nutch/tutorial8.html

[2] Introduction to Nutch, Part 1: Crawling: http://today.java.net/lpt/a/255
Source: http://xruby.iteye.com/blog/258128