Nutch: analyzing the crawl data and running the crawl workflow command by command
2015-12-13 22:38
1. What exactly do the folders and files under Nutch's data directory contain?
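For a typical Nutch 1.x crawl, the layout under data/ is roughly the following (the segment timestamp is simply the one used in the examples below):

data/crawldb                    the URL database: fetch status, score and metadata for every URL Nutch knows about
data/linkdb                     the inverted link database: the known inlinks (and anchor text) for each URL
data/segments/20130325042858    one segment per generate/fetch round, containing:
    crawl_generate    the fetch list produced by generate
    crawl_fetch       the fetch status of each URL
    content           the raw content retrieved from each URL
    crawl_parse       the outlink entries later used to update the crawldb
    parse_data        parsed metadata and outlinks for each page
    parse_text        the extracted plain text of each page

Running bin/nutch with no arguments prints the list of available commands: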
crawl one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
solrindex run the solr indexer on parsed segments and linkdb
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
domainstats calculate domain statistics from crawldb
webgraph generate a web graph from existing segments
linkrank run a link analysis program on the generated web graph
scoreupdater updates the crawldb with linkrank scores
nodedumper dumps the web graph's node scores
plugin load a plugin and run one of its classes main()
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
This time we will use the readdb, readseg and readlinkdb commands to inspect what each of these directories contains.
crawldb
bin/nutch | grep read
bin/nutch readdb data/crawldb -stats
bin/nutch readdb data/crawldb -dump data/crawldb/crawldb_dump
bin/nutch readdb data/crawldb -url http://4008209999.tianyaclub.com/
bin/nutch readdb data/crawldb -topN 10 data/crawldb/crawldb_topN
bin/nutch readdb data/crawldb -topN 10 data/crawldb/crawldb_topN_m 1
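The trailing 1 on the last command is the optional minimum-score argument that -topN accepts after the output directory. Both -dump and -topN write plain-text part files into the given output directory (in local mode usually a single part-00000), so the result can be inspected with ordinary shell tools, for example:

less data/crawldb/crawldb_dump/part-00000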
segments
crawl_generate:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
crawl_fetch:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nocontent -nogenerate -noparse -noparsedata -noparsetext
content:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -noparse -noparsedata -noparsetext
crawl_parse:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -nocontent -noparsedata -noparsetext
parse_data:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -nocontent -noparse -noparsetext
parse_text:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -nocontent -noparse -noparsedata
All parts:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump
segments (list / get)
bin/nutch readseg -list -dir data/segments
bin/nutch readseg -list data/segments/20130325043023
bin/nutch readseg -get data/segments/20130325042858 http://blog.tianya.cn/
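If there are several segments, a small shell loop can dump them all in one go. This is only a sketch using the same paths as above; it skips any *_dump directories already created inside data/segments:

for seg in data/segments/*/; do
  seg="${seg%/}"
  case "$seg" in *_dump) continue ;; esac   # skip earlier dump output
  bin/nutch readseg -dump "$seg" "${seg}_dump"
done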
linkdb
bin/nutch readlinkdb data/linkdb -url http://4008209999.tianyaclub.com/
bin/nutch readlinkdb data/linkdb -dump data/linkdb_dump
2. Running the Nutch crawl workflow with individual commands
Step 1: inject (seed the crawldb with the initial URLs)
bin/nutch inject
Usage: Injector <crawldb> <url_dir>
The first argument is the directory where the crawldb will be created; the second is the directory containing the seed URL files.
Step 2: generate (build the fetch list, i.e. a new segment)
Step 3: fetch (fetch the pages in the segment)
Step 4: parse (parse the fetched pages)
Step 5: updatedb (update the crawldb from the fetched segment)
To crawl for several rounds, just repeat steps 2-5.
When crawling is finished, run invertlinks to build the linkdb; a command-by-command sketch of the whole workflow follows below.
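Putting the steps together, here is a minimal sketch of a single crawl round. It assumes the seed URLs sit in a local urls/ directory and that everything is written under data/ as in the examples above; the -topN value and the way the newest segment is picked up are only illustrative:

bin/nutch inject data/crawldb urls

bin/nutch generate data/crawldb data/segments -topN 1000
SEGMENT=$(ls -d data/segments/* | sort | tail -1)   # the segment just generated (assumes only segment dirs live here)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb data/crawldb "$SEGMENT"

Repeat the generate/fetch/parse/updatedb block for each additional round, and when crawling is done build the linkdb from all segments:

bin/nutch invertlinks data/linkdb -dir data/segments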