
Basic Usage of the CDH HBase Indexer

2016-05-03 15:12

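1. Introduction

The Key-Value Store Indexer service in CDH is built on the Lily HBase Indexer, a flexible, extensible, fault-tolerant tool that indexes HBase column data in near real time. It is part of the Lily system developed by NGDATA and is open source, with the code hosted on GitHub. The Lily HBase Indexer relies on HBase replication: when rows are written, updated, or deleted in HBase, the indexer picks up those events and applies the corresponding changes to Solr. The index data itself is stored in SolrCloud. The indexer supports user-defined extraction and transformation rules for indexing HBase column data. Solr results contain the fields mapped from the configured columnfamily:qualifier columns, so applications can search HBase column data directly through Solr. Because indexing and searching are fully decoupled from HBase and run asynchronously, they affect neither the stability of HBase nor its write throughput.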

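2. Usage

CDH 5.4 bundles the Lily HBase Indexer as the Key-Value Store Indexer service. Once the service has been installed through the Cloudera Manager admin console, the hbase-indexer features can be tried out.

In CDH 5.4.2, the Key-Value Store Indexer is the Lily HBase Indexer service. To run in CDH 5 it depends on the HBase, SolrCloud, and ZooKeeper services.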

2.1 Enable replication on the HBase table's column families

For an existing HBase table, set REPLICATION_SCOPE to 1 on each column family that needs to be indexed:

$ hbase shell
hbase shell> disable 'record'
hbase shell> alter 'record', {NAME => 'data', REPLICATION_SCOPE => 1}
hbase shell> enable 'record'

For a new table, set REPLICATION_SCOPE to 1 on the column families to be indexed when creating it:

$ hbase shell
hbase shell> create 'record', {NAME => 'data', REPLICATION_SCOPE => 1}

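2.2 Create the corresponding SolrCloud collection

The fields of the SolrCloud collection must cover every HBase column to be indexed. The following commands generate the configuration (instance directory) and create the collection: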

$ solrctl instancedir --generate $HOME/hbase-collection1
$ edit $HOME/hbase-collection1/conf/schema.xml
$ solrctl instancedir --create hbase-collection1 $HOME/hbase-collection1
$ solrctl collection --create hbase-collection1

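Notes:

(1) Each HBase column to be indexed corresponds to one <field> in the schema.

(2) In schema.xml, the uniqueKey must hold the HBase table's row key. The row key is mapped to the id field by default, so the <field> definitions must include an id field.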
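
The two schema requirements above can be checked mechanically before the collection is created. The following is a minimal sketch, not part of the Lily or Solr tooling: the check_schema helper, the sample schema text, and the data field name are all hypothetical, and a real schema.xml generated by solrctl contains many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed schema.xml used only for illustration.
SCHEMA = """<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="data" type="text_general" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>"""


def check_schema(schema_xml, indexed_fields, row_key_field="id"):
    """Return a list of problems; an empty list means both notes are satisfied."""
    root = ET.fromstring(schema_xml)
    # Collect the name of every <field> declared anywhere in the schema.
    declared = {f.get("name") for f in root.iter("field")}
    unique_key = (root.findtext("uniqueKey") or "").strip()

    problems = []
    if unique_key != row_key_field:
        problems.append(
            f"uniqueKey is '{unique_key}', expected '{row_key_field}'")
    # The row key field and every indexed column need a <field> entry.
    for name in [row_key_field, *indexed_fields]:
        if name not in declared:
            problems.append(f"no <field> named '{name}'")
    return problems


print(check_schema(SCHEMA, ["data"]))
print(check_schema(SCHEMA, ["title"]))
```

An empty list means the schema declares the row key field as uniqueKey and has a field for each column to be indexed.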

2.3 Create the Lily HBase Indexer configuration

$ cat $HOME/morphline-hbase-mapper.xml

<?xml version="1.0"?>
<indexer table="record" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">

<!-- If morphlines.conf is managed through CM, use the relative path "morphlines.conf" instead -->
<param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>

<!-- The optional morphlineId identifies a morphline if there are multiple morphlines in morphlines.conf; its value corresponds to the id attribute in morphlines.conf -->
<!-- <param name="morphlineId" value="morphline1"/> -->

</indexer>

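Notes: table names the HBase table to index, which is record in the configuration above. mapper names the class that loads and runs the specified Morphline configuration; it is always MorphlineResultToSolrMapper. The morphlineFile parameter gives the path to the morphlines.conf file: if morphlines.conf is managed by Cloudera Manager, use the value "morphlines.conf"; otherwise give the absolute path to the file. The morphlineId parameter selects a morphline by its id attribute in morphlines.conf.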

Within the <indexer> element of morphline-hbase-mapper.xml, the unique-key-field attribute names the Solr field that the HBase row key is mapped to. It defaults to id; to map the row key to a different field, set unique-key-field as follows:

<indexer table="record" unique-key-field="rowkey" ...>
...
</indexer>

Note: the value of unique-key-field should match the uniqueKey field name in the SolrCloud schema.xml.
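
To make the defaulting rule concrete, here is a small sketch that reads an indexer definition and resolves the effective settings. It is only an illustration: summarize_indexer is a hypothetical helper, not part of hbase-indexer, and the sample XML simply combines the attributes and parameters discussed above.

```python
import xml.etree.ElementTree as ET

# Sample definition assembled from the attributes described in this section.
INDEXER_XML = """<?xml version="1.0"?>
<indexer table="record" unique-key-field="rowkey"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
  <param name="morphlineFile" value="morphlines.conf"/>
</indexer>"""


def summarize_indexer(indexer_xml):
    """Return the settings an indexer definition resolves to."""
    root = ET.fromstring(indexer_xml)
    params = {p.get("name"): p.get("value") for p in root.findall("param")}
    return {
        "table": root.get("table"),
        "mapper": root.get("mapper"),
        # The row key goes to the "id" field unless unique-key-field is set.
        "unique_key_field": root.get("unique-key-field", "id"),
        "morphline_file": params.get("morphlineFile"),
        # None means no morphlineId parameter was given.
        "morphline_id": params.get("morphlineId"),
    }


summary = summarize_indexer(INDEXER_XML)
print(summary["table"], summary["unique_key_field"])
```

The resolved unique_key_field is the name to compare against the uniqueKey element in schema.xml.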

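2.4 Create the Morphline configuration file

Morphlines is an open-source framework that reduces the time needed to build Hadoop ETL data flows. It can replace the traditional approach of extracting, transforming, and loading data with MapReduce, and it ships with a set of ready-made commands. For the HBase Indexer it provides the extractHBaseCells command, which reads HBase column data. Here we let Cloudera Manager manage the morphlines.conf file.

Managing morphlines.conf through CM has a practical advantage when index columns are added later: with a local file path, the Lily HBase Indexer configuration has to be registered again, whereas with CM it is enough to edit morphlines.conf and restart the Key-Value Store Indexer service.

To do so, open the Key-Value Store Indexer panel -> Configuration -> Service-Wide -> Morphlines -> Morphlines File, and enter the following configuration: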

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]

    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:id"
              outputField : "id"
              type : string
              source : value
            }
          ]
        }
      }

      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

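Notes:

id : matches the morphlineId parameter configured in morphline-hbase-mapper.xml.

importCommands : the packages from which morphline commands are imported.

extractHBaseCells : reads HBase column data and writes it into a SolrInputDocument object. The command takes a list of zero or more mappings objects.

mappings : defines the mapping between HBase columns and Solr fields.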

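inputColumn : the HBase column to write to Solr, given as the column family and the qualifier separated by a colon (:). The qualifier may use the wildcard *: for example, data:* indexes every column in the data family, and data:my* indexes the columns in data whose qualifier starts with my.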

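outputField : the name of the Solr field, a <field> in schema.xml, that inputColumn is mapped to. It must match a field defined in the schema; otherwise the data is not written correctly.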

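type : the data type of the HBase cell value. HBase stores everything as byte[], whereas Solr indexes content as text, so the bytes have to be converted to their actual data type first; the type parameter selects that conversion. Supported types are byte[] (copies the HBase bytes unchanged), int, long, string, boolean, float, double, short, and bigdecimal. A custom type can also be used by implementing the com.ngdata.hbaseindexer.parse.ByteArrayValueMapper interface.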

source : which part of the HBase KeyValue is used as index input, either value or qualifier. With value, the cell value is indexed; with qualifier, the column qualifier is indexed.
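
The interplay of inputColumn, type, and source is easier to see on a toy example. The sketch below is a plain-Python imitation of what one mapping does to one row; it is not the Lily implementation. It assumes values were written with HBase's usual big-endian encoding, treats the wildcard as a trailing-* prefix match, covers only some of the supported types, and uses hypothetical names (apply_mapping, DECODERS, the sample cells) throughout.

```python
import struct

# Decoders for a few of the supported types, assuming HBase's usual
# big-endian byte encoding of numeric values.
DECODERS = {
    "string": lambda b: b.decode("utf-8"),
    "int": lambda b: struct.unpack(">i", b)[0],
    "long": lambda b: struct.unpack(">q", b)[0],
    "short": lambda b: struct.unpack(">h", b)[0],
    "float": lambda b: struct.unpack(">f", b)[0],
    "double": lambda b: struct.unpack(">d", b)[0],
    "boolean": lambda b: b[0] != 0,
    "byte[]": lambda b: b,
}


def qualifier_matches(qualifier, pattern):
    """Exact match, or prefix match when the pattern ends with '*'."""
    if pattern.endswith("*"):
        return qualifier.startswith(pattern[:-1])
    return qualifier == pattern


def apply_mapping(cells, mapping):
    """cells maps (family, qualifier) to the raw cell value bytes."""
    family, pattern = mapping["inputColumn"].split(":", 1)
    decode = DECODERS[mapping["type"]]
    values = []
    for (fam, qual), value in cells.items():
        if fam != family or not qualifier_matches(qual, pattern):
            continue
        # source selects which part of the cell becomes the index input.
        raw = value if mapping["source"] == "value" else qual.encode("utf-8")
        values.append(decode(raw))
    return {mapping["outputField"]: values}


# One hypothetical row: a string cell and two 4-byte integer cells.
cells = {
    ("data", "id"): b"1",
    ("data", "my_count"): struct.pack(">i", 42),
    ("data", "my_total"): struct.pack(">i", 7),
    ("meta", "my_flag"): b"\x01",
}

print(apply_mapping(cells, {"inputColumn": "data:my*", "outputField": "counts",
                            "type": "int", "source": "value"}))
print(apply_mapping(cells, {"inputColumn": "data:my*", "outputField": "names",
                            "type": "string", "source": "qualifier"}))
```

Note that values written from the HBase shell with put are stored as string bytes, which is why the configuration in this article uses type : string for data:id.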

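2.5 Register the Lily HBase Indexer configuration

Once all the previous steps are complete, the Lily HBase Indexer configuration file has to be registered in ZooKeeper with the following command: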

hbase-indexer add-indexer -n myIndexer \
-c $HOME/morphline-hbase-mapper.xml \
-cp solr.zk=Node03:2181,Node04:2181,Node05:2181/solr \
-cp solr.collection=hbase-collection1 \
-z Node03:2181,Node04:2181,Node05:2181

-n : --name

-c : --indexer-conf

-cp : --connection-param

-z : --zookeeper

For more details, run:

hbase-indexer add-indexer --help

After registering, verify that the indexer was added:

$ hbase-indexer list-indexers
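
When the registration has to be repeated across environments, the command line can be assembled in a script. The sketch below only builds the argument list from the flags shown above; build_add_indexer_cmd and the configuration file path are hypothetical, and hbase-collection1 is assumed to be the collection created in section 2.2.

```python
import shlex


def build_add_indexer_cmd(name, indexer_conf, solr_zk, collection, zookeeper):
    """Assemble the hbase-indexer add-indexer argument list."""
    return [
        "hbase-indexer", "add-indexer",
        "-n", name,                              # --name
        "-c", indexer_conf,                      # --indexer-conf
        "-cp", f"solr.zk={solr_zk}",             # --connection-param
        "-cp", f"solr.collection={collection}",  # --connection-param
        "-z", zookeeper,                         # --zookeeper
    ]


ZK = "Node03:2181,Node04:2181,Node05:2181"
cmd = build_add_indexer_cmd(
    name="myIndexer",
    indexer_conf="/home/user/morphline-hbase-mapper.xml",  # hypothetical path
    solr_zk=f"{ZK}/solr",
    collection="hbase-collection1",
    zookeeper=ZK,
)
# Print the command for review; it could then be executed with
# subprocess.run(cmd, check=True) on a host where hbase-indexer is installed.
print(shlex.join(cmd))
```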

2.6 Verify that indexing works

Write some data to HBase:

$ hbase shell
hbase(main):001:0> put 'record', 'row1', 'data:id', '1'
hbase(main):002:0> put 'record', 'row2', 'data:id', '2'

Open the Solr web UI to check that the data has been synchronized.
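
The same check can be done without the web UI by querying Solr's select handler over HTTP. This is a minimal sketch under stated assumptions: the base URL http://localhost:8983/solr and the collection name hbase-collection1 are placeholders for the actual deployment, and build_select_url and fetch_docs are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, collection, query="*:*", rows=10):
    """Build a URL for the collection's /select handler with JSON output."""
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return f"{base_url.rstrip('/')}/{collection}/select?{params}"


def fetch_docs(url, timeout=10):
    """Run the query and return the matching documents."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["response"]["docs"]


# Assumed host, port, and collection; adjust to the actual cluster.
url = build_select_url("http://localhost:8983/solr", "hbase-collection1")
print(url)
# On a machine that can reach Solr:
#     for doc in fetch_docs(url):
#         print(doc)
```

Since indexing is near real time and depends on Solr's commit settings, documents may take a short while to show up after the put.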

3. References

Using the Lily HBase NRT Indexer Service

Using the Lily HBase Batch Indexer for Indexing
