
Analyzing Nutch's crawl data and implementing the crawl workflow with commands

2015-12-13 22:38
1. What exactly is in each of the folders and files under Nutch's data storage directory?

Running bin/nutch with no arguments prints the list of available commands:

crawl one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)

readdb read / dump crawl db

mergedb merge crawldb-s, with optional filtering

readlinkdb read / dump link db

inject inject new urls into the database

generate generate new segments to fetch from crawl db

freegen generate new segments to fetch from text files

fetch fetch a segment's pages

parse parse a segment's pages

readseg read / dump segment data

mergesegs merge several segments, with optional filtering and slicing

updatedb update crawl db from segments after fetching

invertlinks create a linkdb from parsed segments

mergelinkdb merge linkdb-s, with optional filtering

solrindex run the solr indexer on parsed segments and linkdb

solrdedup remove duplicates from solr

solrclean remove HTTP 301 and 404 documents from solr

parsechecker check the parser for a given url

indexchecker check the indexing filters for a given url

domainstats calculate domain statistics from crawldb

webgraph generate a web graph from existing segments

linkrank run a link analysis program on the generated web graph

scoreupdater updates the crawldb with linkrank scores

nodedumper dumps the web graph's node scores

plugin load a plugin and run one of its classes main()

junit runs the given JUnit test

or

CLASSNAME run the class named CLASSNAME

This time we use the readdb, readseg, and readlinkdb commands to inspect what each of these directories contains.

crawldb

bin/nutch | grep read

bin/nutch readdb data/crawldb -stats

bin/nutch readdb data/crawldb -dump data/crawldb/crawldb_dump

bin/nutch readdb data/crawldb -url http://4008209999.tianyaclub.com/

bin/nutch readdb data/crawldb -topN 10 data/crawldb/crawldb_topN

bin/nutch readdb data/crawldb -topN 10 data/crawldb/crawldb_topN_m 1

segments

crawl_generate:

bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nocontent -nofetch -noparse -noparsedata -noparsetext

crawl_fetch:

bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nocontent -nogenerate -noparse -noparsedata -noparsetext

content:

bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -noparse -noparsedata -noparsetext

crawl_parse:

bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -nocontent -noparsedata -noparsetext

parse_data:

bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -nocontent -noparse -noparsetext

parse_text:

bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -nocontent -noparse -noparsedata

All parts:

bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump

Listing and looking up segments:

bin/nutch readseg -list -dir data/segments

bin/nutch readseg -list data/segments/20130325043023

bin/nutch readseg -get data/segments/20130325042858 http://blog.tianya.cn/
linkdb

bin/nutch readlinkdb data/linkdb -url http://4008209999.tianyaclub.com/
bin/nutch readlinkdb data/linkdb -dump data/linkdb_dump

2. Implementing the Nutch crawl workflow with commands

Step 1: inject (seed the crawldb)

bin/nutch inject

Usage: Injector <crawldb> <url_dir>

The first argument is the directory where the crawldb will be created; the second is the directory containing the seed URL files.
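
For example, assuming the seed URLs are kept in text files under a hypothetical urls directory, the crawl could be seeded with:

bin/nutch inject data/crawldb urls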

Step 2: generate (build a fetch list as a new segment)
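
For example, the following writes a new timestamped segment under data/segments; the -topN value is only illustrative and limits the fetch list to the 10 highest-scoring URLs:

bin/nutch generate data/crawldb data/segments -topN 10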

Step 3: fetch (download the pages in the segment)
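
For example, assuming the segment just generated is data/segments/20130325042858 (substitute the timestamp from your own run), it can be fetched with 10 threads:

bin/nutch fetch data/segments/20130325042858 -threads 10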

Step 4: parse (parse the fetched content)
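
For example, parsing the same segment:

bin/nutch parse data/segments/20130325042858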

Step 5: updatedb (merge the fetch results back into the crawldb)
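
For example, updating the crawldb with the fetched pages and newly discovered URLs from that segment:

bin/nutch updatedb data/crawldb data/segments/20130325042858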

To crawl for multiple rounds, simply repeat steps 2 through 5; a sketch of such a loop follows.
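
A minimal shell sketch of three rounds, assuming the data layout used above; the newest segment is picked with ls, which works because segment directory names are timestamps:

for i in 1 2 3; do
  bin/nutch generate data/crawldb data/segments -topN 10
  segment=$(ls -d data/segments/* | sort | tail -1)
  bin/nutch fetch $segment -threads 10
  bin/nutch parse $segment
  bin/nutch updatedb data/crawldb $segment
done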

Finally, when crawling is finished, run invertlinks to build the linkdb.
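
For example, building the linkdb from all segments under data/segments:

bin/nutch invertlinks data/linkdb -dir data/segments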