

2013-12-12 20:26



[hadoop@hadoop bin]$ ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
  inject          inject new urls into the database
  hostinject      creates or updates an existing host table from a text file
  generate        generate new batches to fetch from crawl db
  fetch           fetch URLs marked during generate
  parse           parse URLs marked during fetch
  updatedb        update web table after parsing
  updatehostdb    update host table after parsing
  readdb          read/dump records from page database
  readhostdb      display entries from the hostDB
  elasticindex    run the elasticsearch indexer
  solrindex       run the solr indexer on parsed batches
  solrdedup       remove duplicates from solr
  parsechecker    check the parser for a given url
  indexchecker    check the indexing filters for a given url
  plugin          load a plugin and run one of its classes main()
  nutchserver     run a (local) Nutch server on a user defined port
  junit           runs the given JUnit test
 or CLASSNAME     run the class named CLASSNAME
Most commands print help when invoked w/o parameters.


[hadoop@hadoop bin]$ ./nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]

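<crawlID> <solrURL> <numberOfRounds> runs the whole crawl process in one go, so you no longer have to run inject, generate, fetch, parse, and the other commands one at a time as in Nutch 2.1. As a beginner, though, I decided not to use the one-shot command (the crawl command), mainly because I wanted to see how the data in HBase changes after each step. So I read the crawl script carefully and found the following lines: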

$bin/nutch inject $SEEDDIR -crawlId $CRAWL_ID
$bin/nutch generate $commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId $CRAWL_ID -batchId $batchId
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId $CRAWL_ID -threads 50
$bin/nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId $CRAWL_ID
$bin/nutch updatedb $commonOptions -crawlId $CRAWL_ID
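
To repeat these five steps by hand and stop after each one, a small driver like the following might help. It is only a sketch: it reuses the sub-commands and flags shown in the script lines above, while the `bin/nutch` path, the function names, and the pause-for-Enter behaviour are my own assumptions.

```python
import random
import subprocess
import time

NUTCH = "bin/nutch"  # assumed path, relative to the Nutch runtime directory


def new_batch_id() -> str:
    """Build a '<epoch seconds>-<random int>' id, the same shape as the batch ids in this post."""
    return f"{int(time.time())}-{random.randint(0, 32767)}"


def build_steps(seed_dir, crawl_id, batch_id, top_n=5, threads=50):
    """Return the five commands of one crawl round, in execution order."""
    return [
        [NUTCH, "inject", seed_dir, "-crawlId", crawl_id],
        [NUTCH, "generate", "-topN", str(top_n), "-crawlId", crawl_id, "-batchId", batch_id],
        [NUTCH, "fetch", batch_id, "-crawlId", crawl_id, "-threads", str(threads)],
        [NUTCH, "parse", batch_id, "-crawlId", crawl_id],
        [NUTCH, "updatedb", "-crawlId", crawl_id],
    ]


def run_round(steps, pause=input, runner=subprocess.run):
    """Run each step, then wait so the HBase table can be inspected before the next one."""
    for cmd in steps:
        runner(cmd, check=True)  # stop the round if a step exits with an error
        pause(f"Finished '{cmd[1]}'. Inspect HBase, then press Enter to continue... ")


# Example (needs a working Nutch + HBase setup):
# run_round(build_steps("urls", "bbs", new_batch_id()))
```

Passing the same batch id to generate, fetch, and parse mirrors what the script lines do with `$batchId`; the pause is where a `scan` in the HBase shell fits in.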




[hadoop@hadoop local]$ bin/nutch inject urls -crawlId bbs
InjectorJob: starting at 2013-12-12 10:51:28
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2013-12-12 10:51:34, elapsed: 00:00:05


hbase(main):007:0> scan 'bbs_webpage'
ROW                                         COLUMN+CELL
cn.tianya.bbs:http/                        column=f:fi, timestamp=1386817647216, value=\x00'\x8D\x00
cn.tianya.bbs:http/                        column=f:ts, timestamp=1386817647216, value=\x00\x00\x01B\xE4\xC5\xE1\x84
cn.tianya.bbs:http/                        column=mk:_injmrk_, timestamp=1386817647216, value=y
cn.tianya.bbs:http/                        column=mk:dist, timestamp=1386817647216, value=0
cn.tianya.bbs:http/                        column=mtdt:_csh_, timestamp=1386817647216, value=?\x80\x00\x00
cn.tianya.bbs:http/                        column=s:s, timestamp=1386817647216, value=?\x80\x00\x00
1 row(s) in 0.0460 seconds
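
The values in this scan look opaque, but they can be decoded. The sketch below assumes the cells hold big-endian Java primitives (which is how HBase's byte utilities serialise them) and guesses the types from the byte lengths: a 4-byte int for `f:fi`, an 8-byte long for `f:ts`, and a 4-byte float for `s:s`. It also assumes the row key is the URL with its host name reversed. The helper names are my own.

```python
import struct
from datetime import datetime, timezone


def parse_hbase_binary(text: str) -> bytes:
    """Turn the shell's printable form of a value back into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text.startswith("\\x", i) and i + 4 <= len(text):
            out.append(int(text[i + 2:i + 4], 16))  # \xNN escape -> one byte
            i += 4
        else:
            out.append(ord(text[i]))  # printable ASCII is shown as-is
            i += 1
    return bytes(out)


def unreverse_key(row_key: str) -> str:
    """Rebuild a URL from a reversed-host row key such as 'cn.tianya.bbs:http/'."""
    host_part, rest = row_key.split(":", 1)
    slash = rest.find("/")
    scheme_port, path = (rest, "") if slash < 0 else (rest[:slash], rest[slash:])
    scheme, _, port = scheme_port.partition(":")
    host = ".".join(reversed(host_part.split(".")))
    netloc = f"{host}:{port}" if port else host
    return f"{scheme}://{netloc}{path}"


# Values copied from the scan output above.
fetch_interval = struct.unpack(">i", parse_hbase_binary(r"\x00'\x8D\x00"))[0]
fetch_time_ms = struct.unpack(">q", parse_hbase_binary(r"\x00\x00\x01B\xE4\xC5\xE1\x84"))[0]
score = struct.unpack(">f", parse_hbase_binary(r"?\x80\x00\x00"))[0]

print(unreverse_key("cn.tianya.bbs:http/"))  # http://bbs.tianya.cn/
print(fetch_interval)                        # 2592000 (seconds, i.e. 30 days)
print(datetime.fromtimestamp(fetch_time_ms // 1000, tz=timezone.utc))
print(score)                                 # 1.0
```

Read this way, the injected page has a 30-day fetch interval, a score of 1.0, and a fetch time a few seconds before the cell timestamp 1386817647216.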

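After that I ran, in order, ./nutch generate -topN 5 -crawlId bbs, ./nutch fetch 1386818590-1938811668 -crawlId bbs -threads 50, ./nutch parse 1386818590-1938811668 -crawlId bbs, and ./nutch updatedb -crawlId bbs. After each of these commands I ran scan 'bbs_webpage' in the HBase shell to check whether the table contents had changed. You will find that the stored data changes after every command, which shows that Nutch is correctly writing the crawled data to HBase.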


<table name="webpage">
<family name="p" maxVersions="1"/>
<family name="f" maxVersions="1"/>
<family name="s" maxVersions="1"/>
<family name="il" maxVersions="1"/>
<family name="ol" maxVersions="1"/>
<family name="h" maxVersions="1"/>
<family name="mtdt" maxVersions="1"/>
<family name="mk" maxVersions="1"/>
</table>
<class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">

<!-- fetch fields                                       -->
<field name="baseUrl" family="f" qualifier="bas"/>
<field name="status" family="f" qualifier="st"/>
<field name="prevFetchTime" family="f" qualifier="pts"/>
<field name="fetchTime" family="f" qualifier="ts"/>
<field name="fetchInterval" family="f" qualifier="fi"/>
<field name="retriesSinceFetch" family="f" qualifier="rsf"/>
<field name="reprUrl" family="f" qualifier="rpr"/>
<field name="content" family="f" qualifier="cnt"/>
<field name="contentType" family="f" qualifier="typ"/>
<field name="protocolStatus" family="f" qualifier="prot"/>
<field name="modifiedTime" family="f" qualifier="mod"/>
<field name="prevModifiedTime" family="f" qualifier="pmod"/>
<field name="batchId" family="f" qualifier="bid"/>

<!-- parse fields                                       -->
<field name="title" family="p" qualifier="t"/>
<field name="text" family="p" qualifier="c"/>
<field name="parseStatus" family="p" qualifier="st"/>
<field name="signature" family="p" qualifier="sig"/>
<field name="prevSignature" family="p" qualifier="psig"/>

<!-- score fields                                       -->
<field name="score" family="s" qualifier="s"/>
<field name="headers" family="h"/>
<field name="inlinks" family="il"/>
<field name="outlinks" family="ol"/>
<field name="metadata" family="mtdt"/>
<field name="markers" family="mk"/>
</class>
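
This mapping is what ties the short column names in the scan output to WebPage fields. As a rough illustration, the sketch below builds a lookup from a trimmed copy of the mapping. The `<gora-orm>` wrapper element is added here only so the fragment parses on its own, and the function names are hypothetical. Fields declared without a qualifier (such as markers and metadata) are treated as maps whose keys become the column qualifiers.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the mapping above; the wrapper element makes it well-formed.
MAPPING = """
<gora-orm>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="score" family="s" qualifier="s"/>
    <field name="metadata" family="mtdt"/>
    <field name="markers" family="mk"/>
  </class>
</gora-orm>
"""


def build_lookup(xml_text: str):
    """Return ({'family:qualifier': field}, {'family': field}) from a mapping."""
    exact, by_family = {}, {}
    for field in ET.fromstring(xml_text).iter("field"):
        family, qualifier = field.get("family"), field.get("qualifier")
        if qualifier is None:
            by_family[family] = field.get("name")  # whole family backs one map field
        else:
            exact[f"{family}:{qualifier}"] = field.get("name")
    return exact, by_family


def field_for(column: str, exact: dict, by_family: dict):
    """Translate a scan column such as 'f:fi' into a WebPage field name."""
    if column in exact:
        return exact[column]
    family, _, qualifier = column.partition(":")
    if family in by_family:
        return f"{by_family[family]}[{qualifier}]"
    return None  # column not covered by this trimmed mapping


exact, by_family = build_lookup(MAPPING)
for column in ["f:fi", "f:ts", "s:s", "mk:_injmrk_", "mtdt:_csh_"]:
    print(column, "->", field_for(column, exact, by_family))
```

With the full mapping loaded instead of the trimmed copy, the same lookup would cover the fetch and parse columns that appear after the later steps.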

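I am not yet very familiar with HBase, so I will study it further before continuing to analyze the crawled content. I also want to find out whether the HBase shell can display Chinese text.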