Building a Search Engine with Nutch + MongoDB + ElasticSearch + Kibana
2017-12-04 22:41
Preface:
This article describes how to build a crawler-based search engine with Nutch, MongoDB, ElasticSearch, and Kibana: Nutch crawls the web pages, MongoDB stores the crawled data, ElasticSearch builds the index, and Kibana visualizes the indexed results. The steps are as follows:
Environment:
System: Ubuntu 14.04
JDK version: jdk1.8.0_45
Download the installation package with wget and unpack it:
gannyee@ubuntu:~/download$ wget https://www.reucon.com/cdn/java/jdk-8u45-linux-x64.tar.gz
gannyee@ubuntu:~/download$ tar zxvf jdk-8u45-linux-x64.tar.gz
Unpacking produces the jdk1.8.0_45 directory. Check whether /usr/lib/ already contains a jvm directory, and create one if it does not:
gannyee@ubuntu:~/download$ sudo mkdir /usr/lib/jvm
Move the unpacked jdk1.8.0_45 into /usr/lib/jvm:
gannyee@ubuntu:~/download$ sudo mv jdk1.8.0_45 /usr/lib/jvm
Open profile to set the environment variables:
gannyee@ubuntu:~/download$ sudo vim /etc/profile
Append the following to the end of profile:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_45
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
Then reload the file so the variables take effect:
gannyee@ubuntu:~/download$ source /etc/profile
The JDK is now installed. Check the JDK version:
gannyee@ubuntu:~/download$ java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
If the version information is not displayed, something in the previous steps probably went wrong; re-check them carefully.
Ant version: 1.9.4
Download the installation package with wget:
gannyee@ubuntu:~/download$ wget https://archive.apache.org/dist/ant/binaries/apache-ant-1.9.4-bin.tar.gz
Unpacking produces the apache-ant-1.9.4 directory; move it into /usr/local/ant:
gannyee@ubuntu:~/download$ tar -zxvf apache-ant-1.9.4-bin.tar.gz
gannyee@ubuntu:~/download$ sudo mkdir /usr/local/ant
gannyee@ubuntu:~/download$ sudo mv apache-ant-1.9.4 /usr/local/ant
Open profile to set the environment variables:
gannyee@ubuntu:~/download$ sudo vim /etc/profile
Append the following to the end of profile:
export ANT_HOME=/usr/local/ant/apache-ant-1.9.4
export PATH=$PATH:$ANT_HOME/bin
Reload the file so the variables take effect:
gannyee@ubuntu:~/download$ source /etc/profile
Check the Ant version:
gannyee@ubuntu:~/download$ ant -version
Apache Ant(TM) version 1.9.4 compiled on April 29 2014
At this point, the environment the engine needs is fully prepared!
The engine's data flow is shown in the figure below:
Image source: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/
(figure: data-flow diagram)
MongoDB download, installation, and startup
An open-source document database and one of the best-known NoSQL databases.
Version: MongoDB-2.6.11
gannyee@ubuntu:~/download$ wget https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-2.6.11.tgz
gannyee@ubuntu:~/download$ tar -zxvf mongodb-linux-x86_64-2.6.11.tgz
gannyee@ubuntu:~/download$ mv mongodb-linux-x86_64-2.6.11/ ../mongodb/
gannyee@ubuntu:~$ cd mongodb/
gannyee@ubuntu:~/mongodb$ mkdir log/ conf/ data/
Since version 2.6, MongoDB uses a YAML-based configuration file format. Create conf/se.yml:
gannyee@ubuntu:~/mongodb$ vim conf/se.yml
net:
  port: 27017
  bindIp: 127.0.0.1
systemLog:
  destination: file
  path: "/home/gannyee/mongodb/log/mongodb.log"
  logAppend: true
processManagement:
  fork: true
  pidFilePath: "/home/gannyee/mongodb/log/mongodb.pid"
storage:
  dbPath: "/home/gannyee/mongodb/data"
  directoryPerDB: true
  smallFiles: true
Start MongoDB:
gannyee@ubuntu:~/mongodb$ ./bin/mongod -f conf/se.yml
Enter the mongo shell to check that MongoDB started successfully:
gannyee@ubuntu:~/mongodb$ ./bin/mongo
MongoDB shell version: 2.6.11
connecting to: test
> show dbs
admin  (empty)
local  0.031GB
> exit
bye
To shut MongoDB down, from the mongo shell:
> use admin
> db.shutdownServer()
For a graphical MongoDB management tool on Ubuntu, Robomongo is recommended. Download and install it, then use it to connect to the database:
gannyee@ubuntu:~/mongodb$ sudo wget http://app.robomongo.org/files/linux/robomongo-0.8.5-x86_64.deb
gannyee@ubuntu:~/mongodb$ sudo dpkg -i robomongo-0.8.5-x86_64.deb
Running robomongo then opens the client. To create a new connection, only the host and port are needed. Note: after my first installation the connection succeeded but no data was visible; reinstalling with root privileges solved it. The interface looks like this:
(figure: Robomongo interface)
If external access is needed, change bindIp: 127.0.0.1 to bindIp: 0.0.0.0 in the configuration file.
Then open http://localhost:27017 in a browser; if the following message appears, the server is reachable:
It looks like you are trying to access MongoDB over HTTP on the native driver port.
If ./mongod refuses to start, it is usually because MongoDB was not shut down cleanly (for example after a crash) and mongod is still locked. To fix this, delete the mongod.lock file and, if necessary, the log files (e.g. mongodb.log.2016-1-26T06-55-20), then repair:
mongod --repair --dbpath /home/gannyee/mongodb/data/db --repairpath /home/gannyee/mongodb
ElasticSearch download and installation
A high-performance distributed search engine built on Apache Lucene.
Version: ElasticSearch-1.4.4
gannyee@ubuntu:~/download$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.4.tar.gz
gannyee@ubuntu:~/download$ tar -zxvf elasticsearch-1.4.4.tar.gz
gannyee@ubuntu:~/download$ mv elasticsearch-1.4.4 ../elasticsearch
gannyee@ubuntu:~$ cd elasticsearch
Edit elasticsearch.yml under config:
gannyee@ubuntu:~/elasticsearch$ vim config/elasticsearch.yml
...
cluster.name: gannyee
node.name: "gannyee"
node.master: true
node.data: true
path.conf: /home/gannyee/elasticsearch/config
path.data: /home/gannyee/elasticsearch/data
http.enabled: true
network.bind_host: 127.0.0.1
network.publish_host: 127.0.0.1
network.host: 127.0.0.1
...
Start ElasticSearch in the background:
gannyee@ubuntu:~/elasticsearch$ ./bin/elasticsearch -d
To stop the ElasticSearch process:
Shut down the local node:
gannyee@ubuntu:~/elasticsearch$ curl -XPOST http://localhost:9200/_cluster/nodes/_shutdown
Shut down the node BlrmMvBdSKiCeYGsiHijdg:
gannyee@ubuntu:~/elasticsearch$ curl -XPOST http://localhost:9200/_cluster/nodes/BlrmMvBdSKiCeYGsiHijdg/_shutdown
Check that ElasticSearch is running:
gannyee@ubuntu:~/elasticsearch$ curl -XGET 'http://localhost:9200'
{
  "status" : 200,
  "name" : "gannyee",
  "cluster_name" : "gannyee",
  "version" : {
    "number" : "1.4.4",
    "build_hash" : "c88f77ffc81301dfa9dfd81ca2232f09588bd512",
    "build_timestamp" : "2015-02-19T13:05:36Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.3"
  },
  "tagline" : "You Know, for Search"
}
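Once the node answers, it can also be queried over the same REST interface. As a minimal sketch, the snippet below builds and validates a search request body (the query shape follows the Elasticsearch 1.x query DSL; the index and field names are illustrative assumptions, not something this guide has created yet):

```shell
# Build a search request body and validate it locally before sending it.
cat > query.json <<'EOF'
{
  "query": {
    "match": { "title": "search" }
  },
  "size": 10
}
EOF

# Validate with the stdlib JSON parser; prints "query.json OK" on success.
python3 -m json.tool query.json > /dev/null && echo "query.json OK"

# The actual request would be (requires ES running on :9200, not executed here):
#   curl -XGET 'http://localhost:9200/nutch/_search' -d @query.json
```

Validating the body locally avoids chasing opaque 400 responses from the server when the JSON is malformed.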
elasticsearch-head is a cluster management tool for elasticsearch. It is a standalone web application written entirely in HTML5, and it can be integrated into ES as a plugin.
Install the elasticsearch-head plugin:
gannyee@ubuntu:~$ cd elasticsearch
gannyee@ubuntu:~/elasticsearch$ ./bin/plugin -install mobz/elasticsearch-head
Restart elasticsearch, then open http://localhost:9200/_plugin/head/ in a browser.
The right side of the page has buttons such as node stats and cluster nodes; these call the ES status APIs directly and return JSON, as shown below:
(figure: elasticsearch-head status view)
Kibana download and installation
An open-source browser-based dashboard for analytics and search on Elasticsearch.
Version: kibana-4.0.1
gannyee@ubuntu:~/download$ wget https://download.elasticsearch.org/kibana/kibana/kibana-4.0.1-linux-x64.tar.gz
gannyee@ubuntu:~/download$ tar -zxvf kibana-4.0.1-linux-x64.tar.gz
gannyee@ubuntu:~/download$ mv kibana-4.0.1-linux-x64/ ../kibana/
gannyee@ubuntu:~/download$ cd ../kibana/
gannyee@ubuntu:~/kibana$ ./bin/kibana
Kibana is now reachable at http://127.0.0.1:5601, with an interface like this:
(figure: Kibana interface)
Apache Nutch installation, build, and configuration:
An open-source web crawler that grew out of Lucene. This setup requires the Nutch 2.x series; the 1.x series does not support MongoDB or other stores such as MySQL and HBase.
Version: apache-nutch-2.3.1
Nutch 2.3.1 download, build, and configuration:
gannyee@ubuntu:~/download$ wget http://www.apache.org/dyn/closer.lua/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz
gannyee@ubuntu:~/download$ tar -zxvf apache-nutch-2.3.1-src.tar.gz
gannyee@ubuntu:~/download$ mv apache-nutch-2.3.1 ../nutch
gannyee@ubuntu:~/download$ cd ../nutch
gannyee@ubuntu:~/nutch$ export NUTCH_HOME=$(pwd)
Edit conf/nutch-site.xml to make MongoDB the Gora storage backend:
gannyee@ubuntu:~/nutch/conf$ vim nutch-site.xml
<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.mongodb.store.MongoStore</value>
 <description>Default class for storing data</description>
</property>
Uncomment the gora-mongodb dependency in $NUTCH_HOME/ivy/ivy.xml:
gannyee@ubuntu:~/nutch$ vim $NUTCH_HOME/ivy/ivy.xml
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
…
Make sure MongoStore is set as the default data store:
gannyee@ubuntu:~/nutch$ vim conf/gora.properties
############################
# MongoDBStore properties  #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
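A misconfigured gora.properties fails late, during the first crawl step, so it is worth a quick check up front. A minimal sketch (the file is recreated here with the values above so the check is self-contained; in practice you would grep the real conf/gora.properties):

```shell
# Recreate the gora.properties content from this guide.
cat > gora.properties <<'EOF'
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
EOF

# Assert that Gora's default datastore is MongoStore.
grep -q '^gora.datastore.default=org.apache.gora.mongodb.store.MongoStore$' gora.properties \
  && echo "gora.properties OK"
```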
Build Nutch:
gannyee@ubuntu:~/nutch$ ant runtime
If the build produces errors like the following:
Trying to override old definition of task javac
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
Trying to override old definition of task javac
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
they are caused by a missing library, and the fix is as follows (in practice they can be ignored):
download sonar-ant-task-2.1.jar and copy it into the $NUTCH_HOME/lib directory,
then edit $NUTCH_HOME/build.xml so it references the jar added above.
The build output is placed in the newly created nutch/runtime directory.
Finally, confirm that Nutch was built correctly and runs; the output should look like this:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch
Usage: nutch COMMAND
where COMMAND is one of:
inject inject new urls into the database
hostinject creates or updates an existing host table from a text file
generate generate new batches to fetch from crawl db
fetch fetch URLs marked during generate
parse parse URLs marked during fetch
updatedb update web table after parsing
updatehostdb update host table after parsing
readdb read/dump records from page database
readhostdb display entries from the hostDB
index run the plugin-based indexer on parsed batches
elasticindex run the elasticsearch indexer - DEPRECATED use the index command instead
solrindex run the solr indexer on parsed batches - DEPRECATED use the index command instead
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
clean remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
plugin load a plugin and run one of its classes main()
nutchserver run a (local) Nutch server on a user defined port
webapp run a local Nutch web application
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Customize your crawl settings:
gannyee@ubuntu:~$ vim ~/nutch/runtime/local/conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.mongodb.store.MongoStore</value>
 <description>Default class for storing data</description>
</property>
<property>
 <name>http.agent.name</name>
 <value>Hist Crawler</value>
</property>
<property>
 <name>plugin.includes</name>
 <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
 <name>elastic.host</name>
 <value>localhost</value>
</property>
<property>
 <name>elastic.cluster</name>
 <value>hist</value>
</property>
<property>
 <name>elastic.index</name>
 <value>nutch</value>
</property>
<property>
 <name>parser.character.encoding.default</name>
 <value>utf-8</value>
</property>
<property>
 <name>http.content.limit</name>
 <value>6553600</value>
</property>
</configuration>
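Nutch reads this file with an XML parser, so a stray tag makes every later command fail with a parse error. A quick well-formedness check with the Python stdlib parser is cheap insurance (a sketch: the file is recreated here in reduced form so the check is self-contained; point the parser at the real conf/nutch-site.xml in practice):

```shell
# Write a reduced nutch-site.xml to check the mechanism.
cat > nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
  </property>
</configuration>
EOF

# Parse it; prints "nutch-site.xml OK" only if the XML is well-formed.
python3 -c "import xml.dom.minidom as m; m.parse('nutch-site.xml'); print('nutch-site.xml OK')"
```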
Crawl your first web page
Create a URL seed list:
gannyee@ubuntu:~$ mkdir -p ~/nutch/runtime/local/urls
gannyee@ubuntu:~$ echo 'http://www.aossama.com/' > ~/nutch/runtime/local/urls/seed.txt
Edit conf/regex-urlfilter.txt and replace this part:
# accept anything else
+.
with a regular expression matching the domain you want to crawl:
+^http://([a-z0-9]*\.)*aossama.com/
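To reuse the same filter shape for other sites, a tiny helper can generate the line (a hypothetical convenience, not part of Nutch; it only mirrors the regex pattern above):

```shell
# Print a regex-urlfilter.txt line restricting the crawl to one domain.
make_filter() {
  domain="$1"
  printf '+^http://([a-z0-9]*\\.)*%s/\n' "$domain"
}

make_filter aossama.com
# prints: +^http://([a-z0-9]*\.)*aossama.com/
```

Append its output to conf/regex-urlfilter.txt in place of the catch-all `+.` rule.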
Initialize the crawldb:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch inject urls/
Generate URLs from the crawldb:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch generate -topN 80
Fetch all generated URLs:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch fetch -all
Parse the fetched URLs:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch parse -all
Update the database:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch updatedb -all
Index the parsed URLs:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch index -all
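The steps above form one crawl round, and deeper crawls just repeat the generate/fetch/parse/updatedb cycle. A sketch of the whole sequence as a loop (a dry run: it only prints the commands so it works without a Nutch install; the path and round count are assumptions, and dropping "echo" would execute for real):

```shell
# Where the Nutch local runtime lives (assumption matching this guide).
NUTCH_LOCAL=~/nutch/runtime/local
ROUNDS=2   # number of generate/fetch/parse/updatedb rounds

# Dry-run wrapper: prints each nutch command instead of executing it.
run() { echo "$NUTCH_LOCAL/bin/nutch $*"; }

run inject urls/
for i in $(seq 1 "$ROUNDS"); do
  run generate -topN 80
  run fetch -all
  run parse -all
  run updatedb -all
done
run index -all
```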
After the crawl finishes, MongoDB contains a new database, nutch_1:
gannyee@ubuntu:~/mongodb$ ./bin/mongo
MongoDB shell version: 2.6.11
connecting to: test
> show dbs
admin    (empty)
local    0.031GB
nutch_1  0.031GB
test     (empty)
> use nutch_1
switched to db nutch_1
> show tables
system.indexes
webpage
The stored data can be inspected with commands in the terminal, or directly by clicking through the GUI.
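For terminal inspection, the mongo shell can run a script file. A sketch of a minimal inspection script (the commands are standard mongo shell calls against Gora's default "webpage" collection; writing to a file keeps the sketch runnable without a live MongoDB):

```shell
# Inspection commands for the crawled data, saved as a mongo shell script.
cat > inspect.js <<'EOF'
// how many pages were crawled
print(db.webpage.count());
// one full stored page document
printjson(db.webpage.findOne());
EOF

echo "run with: ~/mongodb/bin/mongo nutch_1 inspect.js"
```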