
Elasticsearch: custom hash function (routing)

2016-05-31 16:28 (495 views)

This investigation is based on Elasticsearch version 2.1.1.

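Why customize the hash function?

The original goal was to speed up Elasticsearch indexing.

The idea was:

1. When generating the data, partition it up front according to a predefined routing logic.

2. When the large bulk import actually runs, each bulk request then contains data for a single shard only, so it can be written directly to that shard instead of being redistributed across shards.

3. This removes a large amount of network communication overhead.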
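The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `group_by_shard`, `placeholder_shard_of`, and the `{"routing": ...}` document shape are hypothetical names invented here, and the CRC32-based function is merely a stand-in for whatever routing logic is agreed on in advance. It does not reproduce how Elasticsearch itself assigns shards.

```python
import zlib
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def placeholder_shard_of(routing: str, num_shards: int) -> int:
    """Stand-in routing function (CRC32 mod shard count); NOT Elasticsearch's hash."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def group_by_shard(
    docs: Iterable[dict],
    num_shards: int,
    shard_of: Callable[[str, int], int] = placeholder_shard_of,
) -> Dict[int, List[dict]]:
    """Group documents so that each batch only holds documents for one shard.

    Each document is assumed to carry its routing value under the
    hypothetical key "routing".
    """
    batches: Dict[int, List[dict]] = defaultdict(list)
    for doc in docs:
        batches[shard_of(doc["routing"], num_shards)].append(doc)
    return dict(batches)


if __name__ == "__main__":
    docs = [{"routing": f"user-{i}", "body": {"n": i}} for i in range(10)]
    for shard, batch in sorted(group_by_shard(docs, num_shards=3).items()):
        # Each batch would then be sent as its own bulk request.
        print(shard, [d["routing"] for d in batch])
```

For the batches to really land on a single shard, `shard_of` would have to agree exactly with the hash function the cluster uses, which is why the question of customizing that function comes up at all.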

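Unfortunately, closer investigation showed that ES no longer recommends customizing the hash function:

The original text is here: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/breaking_20_crud_and_routing_changes.html#_routing_hash_function

The key part is as follows: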

In addition, the following routing-related node settings have been deprecated:

cluster.routing.operation.hash.type

This was an undocumented setting that allowed to configure which hash function to use for routing. murmur3 is now enforced on new indices.

cluster.routing.operation.use_type

This was an undocumented setting that allowed to take the _type of the document into account when computing its shard (default: false). false is now enforced on new indices.

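Although the conclusion is that the hash function cannot be customized, a few findings from the investigation are still worth listing: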

1. Elasticsearch's default hash function is Murmur3HashFunction

The default hash function that is used for routing has been changed from djb2 to murmur3. This change should be transparent unless you relied on very specific properties of djb2. This will help ensure a better balance of the document counts between shards.
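To make the routing step concrete, here is a minimal sketch of how a routing value could be mapped to a shard with murmur3. The `murmur3_x86_32` function is a plain implementation of the public MurmurHash3 (x86, 32-bit) algorithm. The rest is an assumption based on my reading of the 2.x source rather than a documented contract: that the routing string is hashed as UTF-16 little-endian bytes with seed 0, and that the signed result is reduced with a floor modulo over the number of primary shards. The names `es_routing_hash` and `shard_for` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def _mix_k(k: int) -> int:
    """Per-block mixing step of MurmurHash3 x86_32."""
    k = (k * 0xCC9E2D51) & MASK32
    k = ((k << 15) | (k >> 17)) & MASK32  # rotate left by 15
    return (k * 0x1B873593) & MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit); returns an unsigned 32-bit integer."""
    h = seed & MASK32
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        h ^= _mix_k(k)
        h = ((h << 13) | (h >> 19)) & MASK32  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[4 * n_blocks:]
    if tail:
        # The 1 to 3 leftover bytes are combined in little-endian order.
        h ^= _mix_k(int.from_bytes(tail, "little"))
    # Finalization: fold in the length, then avalanche the bits.
    h ^= len(data) & MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def es_routing_hash(routing: str) -> int:
    """Assumed ES 2.x behaviour: murmur3 over UTF-16LE bytes, seed 0, as a signed int."""
    h = murmur3_x86_32(routing.encode("utf-16-le"), seed=0)
    return h - (1 << 32) if h & 0x80000000 else h


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Floor modulo of the signed hash; Python's % already floors for a positive divisor."""
    return es_routing_hash(routing) % num_primary_shards


if __name__ == "__main__":
    for key in ("user-1", "user-2", "user-3"):
        print(key, shard_for(key, 5))
```

Before relying on this to pre-partition data, the computed shard should be checked against a real cluster, since any mismatch in encoding, seed, or modulo handling would send documents to a different shard than expected.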

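2. The Elasticsearch source code also contains two other hash functions:

a. simple hash function: the plainest possible string hash, using Java's default implementation

b. djb2 hash function: presumably the hash function that ES used all along before version 2.0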
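For comparison, the two older hashes can be sketched in a few lines. This is an illustration, not the Elasticsearch source: `simple_hash` mimics Java's `String.hashCode()` (multiply by 31 and add each UTF-16 code unit, wrapping at 32 bits), and `djb2_hash` is the classic djb2 recurrence (start at 5381, multiply by 33 and add each code unit). Truncating the djb2 result to a signed 32-bit value is an assumption about how the result is used, and the function names are made up for this example.

```python
from typing import List


def _utf16_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text, matching Java's char sequence."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def _to_int32(value: int) -> int:
    """Wrap an integer to a signed 32-bit value, like a Java int."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def simple_hash(routing: str) -> int:
    """Mimics Java's String.hashCode(): h = 31 * h + c for each code unit."""
    h = 0
    for unit in _utf16_units(routing):
        h = (31 * h + unit) & 0xFFFFFFFF
    return _to_int32(h)


def djb2_hash(routing: str) -> int:
    """Classic djb2: h = 33 * h + c, starting from 5381 (result truncated to 32 bits)."""
    h = 5381
    for unit in _utf16_units(routing):
        h = (33 * h + unit) & 0xFFFFFFFF
    return _to_int32(h)


if __name__ == "__main__":
    print(simple_hash("hello"))  # 99162322, same as "hello".hashCode() in Java
    print(djb2_hash("a"))        # 5381 * 33 + 97 = 177670
```

Both are simple multiplicative hashes, which fits the quoted remark that moving to murmur3 should give a better balance of document counts between shards.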

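3. Why does ES discourage changing the default hash function?

I can only guess. Taking my own goal as an example: if I really did customize the hash function so that every bulk request kept writing to a single shard, the load on that shard would inevitably become very high. It would then produce a large number of segments, and merging would become the bottleneck. In the end this might well be slower than simply distributing the data across all shards.

Of course, this still needs to be tested and verified in practice.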
