
Nutch 2.2.1 Study Notes, Part 9: Filtering URLs in Practice

2014-01-04 20:14

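After analyzing Nutch's configuration file nutch-default.xml and reading part of the source code, I came to understand Nutch's plugin mechanism and how to filter what gets crawled by editing the files under conf. By default, URL filtering is implemented by the RegexURLFilter class, and its rules live in regex-urlfilter.txt. With that file unmodified, Nutch rejects URLs ending in one of the suffixes gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS, rejects URLs containing any of the characters ?*!@=, rejects URLs in which a slash-delimited segment (/SameSomething/) is repeated three times, and accepts every other URL. Taking http://hadoop.apache.org as the seed URL, this post compares two cases: crawling with the default filter, and crawling only URLs that contain hadoop.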
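The default behaviour described above can be approximated in a few lines of Python. This is only a sketch: the rule list is transcribed from the description in this post (the file shipped with a given Nutch version may contain additional rules), the function name accepts is hypothetical, and Python's re module stands in for the Java regex engine that Nutch actually uses. It assumes that rules are tried from top to bottom, that the first pattern found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected.

```python
import re

# Rules transcribed from the description above: sign, then pattern.
# "-" rejects, "+" accepts. The real regex-urlfilter.txt may differ by version.
DEFAULT_RULES = [
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
          r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
          r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r"[?*!@=]"),                        # probable query strings etc.
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),    # same path segment seen 3 times
    ("+", r"."),                              # accept anything else
]


def accepts(url, rules=DEFAULT_RULES):
    """Return True if the first rule whose pattern occurs in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


if __name__ == "__main__":
    # The logo.gif and example.org URLs are made up for illustration.
    for candidate in [
        "http://hadoop.apache.org/index.html",
        "http://hadoop.apache.org/images/logo.gif",
        "http://example.org/search?q=hadoop",
        "http://example.org/a/b/a/b/a/",
    ]:
        print(candidate, accepts(candidate))
```

Because the first matching rule wins, the order of the lines in the file matters: the catch-all accept rule has to come last, otherwise the reject rules after it would never be consulted.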

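Consider the first case, in which regex-urlfilter.txt is left unmodified. The command and its result are shown below. As the result shows, 38 rows were stored in total.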

[hadoop@hadoop deploy]$ bin/crawl urls hadoop http://localhost:8983/solr/ 1

hbase(main):012:0> scan 'hadoop_webpage', {COLUMNS=>'f:ts'}
ROW                                      COLUMN+CELL
com.apachecon.eu.www:http/c/aceu2009/   column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1D1
com.apachecon.us:http/c/acus2008/       column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Ft
com.cafepress.www:http/hadoop/          column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fw
com.yahoo.developer:http/blogs/hadoop/2 column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV\x1D2
008/07/apache_hadoop_wins_terabyte_sort
_benchmark.html
org.apache.avro:http/                   column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fw
org.apache.cassandra:http/              column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fx
org.apache.forrest:http/                column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV\x1F\xAC
org.apache.hadoop:http/                 column=f:ts, timestamp=1388471264808, value=\x00\x00\x01C\xE1\xD1\xB2\xFC
org.apache.hadoop:http/bylaws.html      column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fy
org.apache.hadoop:http/docs/current/    column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x06
org.apache.hadoop:http/docs/r0.23.10/   column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fy
org.apache.hadoop:http/docs/r1.2.1/     column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fz
org.apache.hadoop:http/docs/r2.1.1-beta column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x07
/
org.apache.hadoop:http/docs/r2.2.0/     column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F{
org.apache.hadoop:http/docs/stable/     column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x07
org.apache.hadoop:http/index.html       column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F|
org.apache.hadoop:http/index.pdf        column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x08
org.apache.hadoop:http/issue_tracking.h column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x08
tml
org.apache.hadoop:http/mailing_lists.ht column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x08
ml
org.apache.hadoop:http/privacy_policy.h column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x09
tml
org.apache.hadoop:http/releases.html    column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F|
org.apache.hadoop:http/who.html         column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F}
org.apache.hbase:http/                  column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x09
org.apache.hive:http/                   column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F~
org.apache.incubator:http/ambari/       column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x7F
org.apache.incubator:http/chukwa/       column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x09
org.apache.incubator:http/hama/         column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x0A
org.apache.mahout:http/                 column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x7F
org.apache.pig:http/                    column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x82
org.apache.wiki:http/hadoop             column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x0A
org.apache.wiki:http/hadoop/PoweredBy   column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x0A
org.apache.www:http/                    column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x12
org.apache.www:http/foundation/sponsors column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x83
hip.html
org.apache.www:http/foundation/thanks.h column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x12
tml
org.apache.www:http/licenses/           column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x83
org.apache.zookeeper:http/              column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x85
org.sortbenchmark:http/                 column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x13
uk.co.guardian.www:http/technology/2011 column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x14
/mar/25/media-guardian-innovation-award
s-apache-hadoop
38 row(s) in 0.2590 seconds
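The row keys in the scan are not plain URLs: the host name is written in reverse order, followed by a colon, the scheme, and then the path. Keys that are wider than the ROW column are wrapped onto the following lines by the shell. The Python sketch below turns such a key back into a URL. It is based only on the layout visible in the output above plus the assumption that an optional port follows the scheme after a second colon; the function name unreverse_url is hypothetical and this is not Nutch's own code.

```python
def unreverse_url(row_key):
    """Convert a key such as 'org.apache.hadoop:http/docs/' back to a URL."""
    # Everything before the first '/' is "reversed.host:scheme[:port]".
    path_begin = row_key.find("/")
    if path_begin == -1:
        path_begin = len(row_key)
    head, path = row_key[:path_begin], row_key[path_begin:]

    parts = head.split(":")
    host = ".".join(reversed(parts[0].split(".")))  # org.apache.hadoop -> hadoop.apache.org
    url = parts[1] + "://" + host
    if len(parts) == 3:                             # assumed optional port
        url += ":" + parts[2]
    return url + path


if __name__ == "__main__":
    print(unreverse_url("org.apache.hadoop:http/docs/current/"))
    print(unreverse_url("org.apache.wiki:http/hadoop/PoweredBy"))
```

Joining the wrapped pieces of a long key before calling the function gives the full URL, for example the three fragments of the uk.co.guardian.www key.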

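In the second case, the last line of regex-urlfilter.txt is changed to +^http://.*hadoop.*, so that only URLs containing hadoop are crawled. The result is shown below: it contains only 20 rows, and every row key corresponds to a URL that contains hadoop.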

[hadoop@hadoop deploy]$ bin/crawl urls hadoopWithFilter http://localhost:8983/solr/ 1

hbase(main):016:0> scan 'hadoopWithFilter_webpage',{COLUMNS=>'f:ts'}
ROW                                      COLUMN+CELL
com.cafepress.www:http/hadoop/          column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtGl
com.yahoo.developer:http/blogs/hadoop/2 column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtG\x88
008/07/apache_hadoop_wins_terabyte_sort
_benchmark.html
org.apache.hadoop:http/                 column=f:ts, timestamp=1388473240778, value=\x00\x00\x01C\xE1\xF0\xEB\xCB
org.apache.hadoop:http/bylaws.html      column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHv
org.apache.hadoop:http/docs/current/    column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA7
org.apache.hadoop:http/docs/r0.23.10/   column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHw
org.apache.hadoop:http/docs/r1.2.1/     column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHw
org.apache.hadoop:http/docs/r2.1.1-beta column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA8
/
org.apache.hadoop:http/docs/r2.2.0/     column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHx
org.apache.hadoop:http/docs/stable/     column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA8
org.apache.hadoop:http/index.html       column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHx
org.apache.hadoop:http/index.pdf        column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA8
org.apache.hadoop:http/issue_tracking.h column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA9
tml
org.apache.hadoop:http/mailing_lists.ht column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA9
ml
org.apache.hadoop:http/privacy_policy.h column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA9
tml
org.apache.hadoop:http/releases.html    column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHy
org.apache.hadoop:http/who.html         column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHz
org.apache.wiki:http/hadoop             column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xAA
org.apache.wiki:http/hadoop/PoweredBy   column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xAA
uk.co.guardian.www:http/technology/2011 column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xAB
/mar/25/media-guardian-innovation-award
s-apache-hadoop
20 row(s) in 0.3090 seconds
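The effect of the rule +^http://.*hadoop.* can be checked against a few of the URLs behind the row keys above. The snippet is a minimal sketch that uses Python's re module as a stand-in for the Java regex engine; the pattern is the one from this post with the leading + sign removed, and the https URL is an extra case added here for illustration.

```python
import re

# Pattern part of the rule "+^http://.*hadoop.*" (the "+" means "accept").
CUSTOM_PATTERN = r"^http://.*hadoop.*"

# URLs reconstructed from row keys in the two scans, plus one https variant.
urls = [
    "http://hadoop.apache.org/docs/current/",   # hadoop in the host name
    "http://wiki.apache.org/hadoop",            # hadoop in the path
    "http://hbase.apache.org/",                 # no hadoop anywhere
    "http://www.apache.org/licenses/",          # no hadoop anywhere
    "https://hadoop.apache.org/",               # wrong scheme for this pattern
]

kept = [u for u in urls if re.search(CUSTOM_PATTERN, u)]

if __name__ == "__main__":
    for u in kept:
        print(u)
```

Two details follow from the pattern itself. It accepts hadoop anywhere after the scheme, so host names and paths both qualify, which is why pages on wiki.apache.org and www.guardian.co.uk remain in the table. It is also anchored to http://, so an https URL would not be accepted unless the pattern were widened, for example to ^https?://.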

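As the results above show, editing the regular expressions in regex-urlfilter.txt makes it possible to customize which URLs are crawled, so that only the content of interest is fetched.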
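The edit itself, replacing the final catch-all rule with a narrower one, can also be scripted. The helper below is a hypothetical sketch that works on the text of the filter file: it assumes that lines starting with # are comments and that the last rule in the file is the catch-all +., and it refuses to change anything if that is not the case. Reading and writing the actual file under conf is left out.

```python
def replace_catch_all(filter_text, new_rule):
    """Return filter_text with its final '+.' rule replaced by new_rule."""
    lines = filter_text.splitlines()
    # Walk backwards to find the last line that is an actual rule.
    for i in range(len(lines) - 1, -1, -1):
        stripped = lines[i].strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if stripped != "+.":
            raise ValueError("last rule is not the catch-all '+.'")
        lines[i] = new_rule
        return "\n".join(lines) + "\n"
    raise ValueError("no rule found in filter text")


if __name__ == "__main__":
    sample = "-[?*!@=]\n\n# accept anything else\n+.\n"
    print(replace_catch_all(sample, "+^http://.*hadoop.*"))
```

Keeping the earlier reject rules in place means the suffix and special-character filters still apply before the new accept rule is reached.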
