Building a Search Engine with Nutch + MongoDB + ElasticSearch + Kibana
2017-12-04 22:41
Preface:
This article describes how to build a crawler-based search engine with Nutch, MongoDB, ElasticSearch, and Kibana: Nutch crawls the web pages, MongoDB stores the crawled data, ElasticSearch builds the index, and Kibana visualizes the indexed results. The steps are as follows:
Environment:
System: Ubuntu 14.04
JDK version: jdk1.8.0_45
Download the installation package with wget and unpack it:
gannyee@ubuntu:~/download$ wget https://www.reucon.com/cdn/java/jdk-8u45-linux-x64.tar.gz
gannyee@ubuntu:~/download$ tar zxvf jdk-8u45-linux-x64.tar.gz
Unpacking produces the jdk1.8.0_45 directory. Check whether /usr/lib/ already contains a jvm directory, and create one if it does not:
gannyee@ubuntu:~/download$ sudo mkdir /usr/lib/jvm
Move the unpacked jdk1.8.0_45 into /usr/lib/jvm:
gannyee@ubuntu:~/download$ sudo mv jdk1.8.0_45 /usr/lib/jvm
Open profile to set the environment variables:
gannyee@ubuntu:~/download$ sudo vim /etc/profile
Append the following to the end of profile:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_45
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
Then reload the file so the variables take effect:
gannyee@ubuntu:~/download$ source /etc/profile
The JDK is now installed. Check the JDK version:
gannyee@ubuntu:~/download$ java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
If the version information is not displayed, something in the previous steps probably went wrong; re-check them carefully.
Ant version: 1.9.4
Download the installation package with wget:
gannyee@ubuntu:~/download$ wget https://archive.apache.org/dist/ant/binaries/apache-ant-1.9.4-bin.tar.gz
Unpacking produces the apache-ant-1.9.4 directory; move it into /usr/local/ant:
gannyee@ubuntu:~/download$ tar -zxvf apache-ant-1.9.4-bin.tar.gz
gannyee@ubuntu:~/download$ sudo mkdir /usr/local/ant
gannyee@ubuntu:~/download$ sudo mv apache-ant-1.9.4 /usr/local/ant
Open profile to set the environment variables:
gannyee@ubuntu:~/download$ sudo vim /etc/profile
Append the following to the end of profile:
export ANT_HOME=/usr/local/ant/apache-ant-1.9.4
export PATH=$PATH:$ANT_HOME/bin
Reload the file so the variables take effect:
gannyee@ubuntu:~/download$ source /etc/profile
Check the Ant version:
gannyee@ubuntu:~/download$ ant -version
Apache Ant(TM) version 1.9.4 compiled on April 29 2014
At this point, the environment the engine needs is fully prepared!
The engine's data flow is shown in the figure below:
Image source: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/
(figure: data-flow diagram)
MongoDB download, installation, and startup
An open-source document database and one of the best-known NoSQL databases.
Version: MongoDB-2.6.11
gannyee@ubuntu:~/download$ wget https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-2.6.11.tgz
gannyee@ubuntu:~/download$ tar -zxvf mongodb-linux-x86_64-2.6.11.tgz
gannyee@ubuntu:~/download$ mv mongodb-linux-x86_64-2.6.11/ ../mongodb/
gannyee@ubuntu:~$ cd mongodb/
gannyee@ubuntu:~/mongodb$ mkdir log/ conf/ data/
Since version 2.6, MongoDB uses a YAML-based configuration file format. Create conf/se.yml:
gannyee@ubuntu:~/mongodb$ vim conf/se.yml
net:
  port: 27017
  bindIp: 127.0.0.1
systemLog:
  destination: file
  path: "/home/gannyee/mongodb/log/mongodb.log"
  logAppend: true
processManagement:
  fork: true
  pidFilePath: "/home/gannyee/mongodb/log/mongodb.pid"
storage:
  dbPath: "/home/gannyee/mongodb/data"
  directoryPerDB: true
  smallFiles: true
Start MongoDB:
gannyee@ubuntu:~/mongodb$ ./bin/mongod -f conf/se.yml
Enter the mongo shell to check that MongoDB started successfully:
gannyee@ubuntu:~/mongodb$ ./bin/mongo
MongoDB shell version: 2.6.11
connecting to: test
> show dbs
admin  (empty)
local  0.031GB
> exit
bye
To shut MongoDB down, from the mongo shell:
> use admin
> db.shutdownServer()
For a graphical MongoDB management tool on Ubuntu, Robomongo is recommended. Download and install it, then use it to connect to the database:
gannyee@ubuntu:~/mongodb$ sudo wget http://app.robomongo.org/files/linux/robomongo-0.8.5-x86_64.deb
gannyee@ubuntu:~/mongodb$ sudo dpkg -i robomongo-0.8.5-x86_64.deb
Running robomongo then opens the client. To create a new connection, only the host and port are needed. Note: after my first installation the connection succeeded but no data was visible; reinstalling with root privileges solved it. The interface looks like this:
(figure: Robomongo interface)
If external access is needed, change bindIp: 127.0.0.1 to bindIp: 0.0.0.0 in the configuration file.
Then open http://localhost:27017 in a browser; if the following message appears, the server is reachable:
It looks like you are trying to access MongoDB over HTTP on the native driver port.
If ./mongod refuses to start, it is usually because MongoDB was not shut down cleanly (for example after a crash) and mongod is still locked. To fix this, delete the mongod.lock file and, if necessary, the log files (e.g. mongodb.log.2016-1-26T06-55-20), then repair:
mongod --repair --dbpath /home/gannyee/mongodb/data/db --repairpath /home/gannyee/mongodb
ElasticSearch download and installation
A high-performance distributed search engine built on Apache Lucene.
Version: ElasticSearch-1.4.4
gannyee@ubuntu:~/download$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.4.tar.gz
gannyee@ubuntu:~/download$ tar -zxvf elasticsearch-1.4.4.tar.gz
gannyee@ubuntu:~/download$ mv elasticsearch-1.4.4 ../elasticsearch
gannyee@ubuntu:~$ cd elasticsearch
Edit elasticsearch.yml under config:
gannyee@ubuntu:~/elasticsearch$ vim config/elasticsearch.yml
...
cluster.name: gannyee
node.name: "gannyee"
node.master: true
node.data: true
path.conf: /home/gannyee/elasticsearch/config
path.data: /home/gannyee/elasticsearch/data
http.enabled: true
network.bind_host: 127.0.0.1
network.publish_host: 127.0.0.1
network.host: 127.0.0.1
...
Start ElasticSearch in the background:
gannyee@ubuntu:~/elasticsearch$ ./bin/elasticsearch -d
To stop the ElasticSearch process:
Shut down the local node:
gannyee@ubuntu:~/elasticsearch$ curl -XPOST http://localhost:9200/_cluster/nodes/_shutdown
Shut down the node BlrmMvBdSKiCeYGsiHijdg:
gannyee@ubuntu:~/elasticsearch$ curl -XPOST http://localhost:9200/_cluster/nodes/BlrmMvBdSKiCeYGsiHijdg/_shutdown
Check that ElasticSearch is running:
gannyee@ubuntu:~/elasticsearch$ curl -XGET 'http://localhost:9200'
{
  "status" : 200,
  "name" : "gannyee",
  "cluster_name" : "gannyee",
  "version" : {
    "number" : "1.4.4",
    "build_hash" : "c88f77ffc81301dfa9dfd81ca2232f09588bd512",
    "build_timestamp" : "2015-02-19T13:05:36Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.3"
  },
  "tagline" : "You Know, for Search"
}
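Once the node answers, it can also be queried over the same REST interface. As a minimal sketch, the snippet below builds and validates a search request body (the query shape follows the Elasticsearch 1.x query DSL; the index and field names are illustrative assumptions, not something this guide has created yet):

```shell
# Build a search request body and validate it locally before sending it.
cat > query.json <<'EOF'
{
  "query": {
    "match": { "title": "search" }
  },
  "size": 10
}
EOF

# Validate with the stdlib JSON parser; prints "query.json OK" on success.
python3 -m json.tool query.json > /dev/null && echo "query.json OK"

# The actual request would be (requires ES running on :9200, not executed here):
#   curl -XGET 'http://localhost:9200/nutch/_search' -d @query.json
```

Validating the body locally avoids chasing opaque 400 responses from the server when the JSON is malformed.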
elasticsearch-head is a cluster management tool for elasticsearch. It is a standalone web application written entirely in HTML5, and it can be integrated into ES as a plugin.
Install the elasticsearch-head plugin:
gannyee@ubuntu:~$ cd elasticsearch
gannyee@ubuntu:~/elasticsearch$ ./bin/plugin -install mobz/elasticsearch-head
Restart elasticsearch, then open http://localhost:9200/_plugin/head/ in a browser.
The right side of the page has buttons such as node stats and cluster nodes; these call the ES status APIs directly and return JSON, as shown below:
(figure: elasticsearch-head status view)
Kibana download and installation
An open-source browser-based dashboard for analytics and search on Elasticsearch.
Version: kibana-4.0.1
gannyee@ubuntu:~/download$ wget https://download.elasticsearch.org/kibana/kibana/kibana-4.0.1-linux-x64.tar.gz
gannyee@ubuntu:~/download$ tar -zxvf kibana-4.0.1-linux-x64.tar.gz
gannyee@ubuntu:~/download$ mv kibana-4.0.1-linux-x64/ ../kibana/
gannyee@ubuntu:~/download$ cd ../kibana/
gannyee@ubuntu:~/kibana$ ./bin/kibana
Kibana is now reachable at http://127.0.0.1:5601, with an interface like this:
(figure: Kibana interface)
Apache Nutch installation, build, and configuration:
An open-source web crawler that grew out of Lucene. This setup requires the Nutch 2.x series; the 1.x series does not support MongoDB or other stores such as MySQL and HBase.
Version: apache-nutch-2.3.1
Nutch 2.3.1 download, build, and configuration:
gannyee@ubuntu:~/download$ wget http://www.apache.org/dyn/closer.lua/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz
gannyee@ubuntu:~/download$ tar -zxvf apache-nutch-2.3.1-src.tar.gz
gannyee@ubuntu:~/download$ mv apache-nutch-2.3.1 ../nutch
gannyee@ubuntu:~/download$ cd ../nutch
gannyee@ubuntu:~/nutch$ export NUTCH_HOME=$(pwd)
Edit conf/nutch-site.xml to make MongoDB the Gora storage backend:
gannyee@ubuntu:~/nutch/conf$ vim nutch-site.xml
<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.mongodb.store.MongoStore</value>
 <description>Default class for storing data</description>
</property>
Uncomment the gora-mongodb dependency in $NUTCH_HOME/ivy/ivy.xml:
gannyee@ubuntu:~/nutch$ vim $NUTCH_HOME/ivy/ivy.xml
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
…
Make sure MongoStore is set as the default data store:
gannyee@ubuntu:~/nutch$ vim conf/gora.properties
############################
# MongoDBStore properties  #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
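A misconfigured gora.properties fails late, during the first crawl step, so it is worth a quick check up front. A minimal sketch (the file is recreated here with the values above so the check is self-contained; in practice you would grep the real conf/gora.properties):

```shell
# Recreate the gora.properties content from this guide.
cat > gora.properties <<'EOF'
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
EOF

# Assert that Gora's default datastore is MongoStore.
grep -q '^gora.datastore.default=org.apache.gora.mongodb.store.MongoStore$' gora.properties \
  && echo "gora.properties OK"
```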
Build Nutch:
gannyee@ubuntu:~/nutch$ ant runtime
If the build produces errors like the following:
Trying to override old definition of task javac
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
Trying to override old definition of task javac
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
they are caused by a missing library, and the fix is as follows (in practice they can be ignored):
download sonar-ant-task-2.1.jar and copy it into the $NUTCH_HOME/lib directory,
then edit $NUTCH_HOME/build.xml so it references the jar added above.
The build output is placed in the newly created nutch/runtime directory.
Finally, confirm that Nutch was built correctly and runs; the output should look like this:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch
Usage: nutch COMMAND
where COMMAND is one of:
inject inject new urls into the database
hostinject creates or updates an existing host table from a text file
generate generate new batches to fetch from crawl db
fetch fetch URLs marked during generate
parse parse URLs marked during fetch
updatedb update web table after parsing
updatehostdb update host table after parsing
readdb read/dump records from page database
readhostdb display entries from the hostDB
index run the plugin-based indexer on parsed batches
elasticindex run the elasticsearch indexer - DEPRECATED use the index command instead
solrindex run the solr indexer on parsed batches - DEPRECATED use the index command instead
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
clean remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
plugin load a plugin and run one of its classes main()
nutchserver run a (local) Nutch server on a user defined port
webapp run a local Nutch web application
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Customize your crawl settings:
gannyee@ubuntu:~$ vim ~/nutch/runtime/local/conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.mongodb.store.MongoStore</value>
 <description>Default class for storing data</description>
</property>
<property>
 <name>http.agent.name</name>
 <value>Hist Crawler</value>
</property>
<property>
 <name>plugin.includes</name>
 <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
 <name>elastic.host</name>
 <value>localhost</value>
</property>
<property>
 <name>elastic.cluster</name>
 <value>hist</value>
</property>
<property>
 <name>elastic.index</name>
 <value>nutch</value>
</property>
<property>
 <name>parser.character.encoding.default</name>
 <value>utf-8</value>
</property>
<property>
 <name>http.content.limit</name>
 <value>6553600</value>
</property>
</configuration>
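Nutch reads this file with an XML parser, so a stray tag makes every later command fail with a parse error. A quick well-formedness check with the Python stdlib parser is cheap insurance (a sketch: the file is recreated here in reduced form so the check is self-contained; point the parser at the real conf/nutch-site.xml in practice):

```shell
# Write a reduced nutch-site.xml to check the mechanism.
cat > nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
  </property>
</configuration>
EOF

# Parse it; prints "nutch-site.xml OK" only if the XML is well-formed.
python3 -c "import xml.dom.minidom as m; m.parse('nutch-site.xml'); print('nutch-site.xml OK')"
```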
Crawl your first web page
Create a URL seed list:
gannyee@ubuntu:~$ mkdir -p ~/nutch/runtime/local/urls
gannyee@ubuntu:~$ echo 'http://www.aossama.com/' > ~/nutch/runtime/local/urls/seed.txt
Edit conf/regex-urlfilter.txt and replace this part:
# accept anything else
+.
with a regular expression matching the domain you want to crawl:
+^http://([a-z0-9]*\.)*aossama.com/
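To reuse the same filter shape for other sites, a tiny helper can generate the line (a hypothetical convenience, not part of Nutch; it only mirrors the regex pattern above):

```shell
# Print a regex-urlfilter.txt line restricting the crawl to one domain.
make_filter() {
  domain="$1"
  printf '+^http://([a-z0-9]*\\.)*%s/\n' "$domain"
}

make_filter aossama.com
# prints: +^http://([a-z0-9]*\.)*aossama.com/
```

Append its output to conf/regex-urlfilter.txt in place of the catch-all `+.` rule.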
Initialize the crawldb:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch inject urls/
Generate URLs from the crawldb:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch generate -topN 80
Fetch all generated URLs:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch fetch -all
Parse the fetched URLs:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch parse -all
Update the database:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch updatedb -all
Index the parsed URLs:
gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch index -all
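The steps above form one crawl round, and deeper crawls just repeat the generate/fetch/parse/updatedb cycle. A sketch of the whole sequence as a loop (a dry run: it only prints the commands so it works without a Nutch install; the path and round count are assumptions, and dropping "echo" would execute for real):

```shell
# Where the Nutch local runtime lives (assumption matching this guide).
NUTCH_LOCAL=~/nutch/runtime/local
ROUNDS=2   # number of generate/fetch/parse/updatedb rounds

# Dry-run wrapper: prints each nutch command instead of executing it.
run() { echo "$NUTCH_LOCAL/bin/nutch $*"; }

run inject urls/
for i in $(seq 1 "$ROUNDS"); do
  run generate -topN 80
  run fetch -all
  run parse -all
  run updatedb -all
done
run index -all
```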
After the crawl finishes, MongoDB contains a new database, nutch_1:
gannyee@ubuntu:~/mongodb$ ./bin/mongo
MongoDB shell version: 2.6.11
connecting to: test
> show dbs
admin    (empty)
local    0.031GB
nutch_1  0.031GB
test     (empty)
> use nutch_1
switched to db nutch_1
> show tables
system.indexes
webpage
The stored data can be inspected with commands in the terminal, or directly by clicking through the GUI.
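For terminal inspection, the mongo shell can run a script file. A sketch of a minimal inspection script (the commands are standard mongo shell calls against Gora's default "webpage" collection; writing to a file keeps the sketch runnable without a live MongoDB):

```shell
# Inspection commands for the crawled data, saved as a mongo shell script.
cat > inspect.js <<'EOF'
// how many pages were crawled
print(db.webpage.count());
// one full stored page document
printjson(db.webpage.findOne());
EOF

echo "run with: ~/mongodb/bin/mongo nutch_1 inspect.js"
```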