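An introduction to ES doc_values: essentially columnar storage of field values, used for aggregations and analytics, enabled by default in ES, at the cost of extra disk space (column-store compression tricks: divide by a common divisor or subtract the minimum value; strings are de-duplicated and then stored as numeric IDs)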
2017-02-15 15:00
doc_values
Doc values are the on-disk data structure, built at document index time, which makes this data access pattern (looking up a document and reading the values it has in a field) possible. They store the same values as the _source but in a column-oriented fashion that is far more efficient for sorting and aggregations. (This is the essence of doc values.) Doc values are supported on almost all field types, with the notable exception of analyzed string fields.
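To make "column-oriented" concrete, here is a minimal sketch in Python. It is only an in-memory illustration, not how Lucene lays doc values out on disk; the function name `to_columns` and the sample documents are assumptions made up for this example.

```python
def to_columns(docs):
    """Pivot row-oriented documents into one value list per field.

    A missing field is stored as None so that every column keeps
    exactly one slot per document, in document order.
    """
    fields = sorted({name for doc in docs for name in doc})
    columns = {name: [] for name in fields}
    for doc in docs:
        for name in fields:
            columns[name].append(doc.get(name))
    return columns


# Row-oriented view: each document carries all of its fields (like _source).
docs = [
    {"status_code": "200", "bytes": 512},
    {"status_code": "404", "bytes": 128},
    {"status_code": "200", "bytes": 2048},
]

# Column-oriented view: all values of one field sit next to each other.
columns = to_columns(docs)

# An aggregation only has to scan the single column it needs.
total_bytes = sum(columns["bytes"])
```

Summing `bytes` touches one contiguous list and never reads `status_code`, which is the property that makes sorting and aggregations cheap on a column layout.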
All fields which support doc values have them enabled by default. If you are sure that you don’t need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space:
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": {
          "type": "keyword"
        },
        "session_id": {
          "type": "keyword",
          "doc_values": false
        }
      }
    }
  }
}
The status_code field has doc_values enabled by default.
The session_id field has doc_values disabled, but can still be queried.
Column-store compression
At a high level, doc values are essentially a serialized column-store. As we discussed in the last section, column-stores excel at certain operations because the data is naturally laid out in a fashion that is amenable to those queries. But they also excel at compressing data, particularly numbers. This is important both for saving space on disk and for faster access. Modern CPUs are many orders of magnitude faster than disk drives (although the gap is narrowing quickly with upcoming NVMe drives). That means it is often advantageous to minimize the amount of data that must be read from disk, even if it requires extra CPU cycles to decompress.
To see how it can help compression, take this set of doc values for a numeric field:
Doc      Terms
-----------------
Doc_1 | 100
Doc_2 | 1000
Doc_3 | 1500
Doc_4 | 1200
Doc_5 | 300
Doc_6 | 1900
Doc_7 | 4200
-----------------
The column-stride layout means we have a contiguous block of numbers:
[100,1000,1500,1200,300,1900,4200].
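Because every value in this block is a multiple of 100, each one can be divided by that common divisor and stored as [1,10,15,12,3,19,42], which needs fewer bits per value.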
Doc values use several tricks like this. In order, the following compression schemes are checked:
- If all values are identical (or missing), set a flag and record the value
- If there are fewer than 256 distinct values, a simple table encoding is used
- If there are 256 or more distinct values, check to see if there is a common divisor
- If there is no common divisor, encode everything as an offset from the smallest value
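The order of these checks can be sketched in Python as follows. This is a toy model under stated assumptions: the function names `choose_encoding` and `bits_per_value`, the scheme labels, and the returned tuple layout are all invented for illustration and are not Lucene's API or on-disk format. Missing values are not modeled, and the 256 threshold is applied to the number of distinct values.

```python
from functools import reduce
from math import gcd


def choose_encoding(values):
    """Pick an encoding for a block of integers, checking schemes in order.

    Returns (scheme, metadata, encoded_values). Illustrative only.
    """
    distinct = sorted(set(values))
    if len(distinct) == 1:
        # Every document has the same value: store it once.
        return ("constant", distinct[0], [])
    if len(distinct) < 256:
        # Few distinct values: store a lookup table plus small indexes.
        index = {value: i for i, value in enumerate(distinct)}
        return ("table", distinct, [index[v] for v in values])
    divisor = reduce(gcd, values)
    if divisor > 1:
        # All values share a common divisor: store the quotients.
        return ("gcd", divisor, [v // divisor for v in values])
    # Fall back to storing each value as an offset from the smallest one.
    smallest = distinct[0]
    return ("offset", smallest, [v - smallest for v in values])


def bits_per_value(values):
    """Bits needed to store the largest (non-negative) value in the block."""
    return max(max(values).bit_length(), 1)


# The seven-document example has only 7 distinct values, so it is table-encoded.
small = choose_encoding([100, 1000, 1500, 1200, 300, 1900, 4200])

# 300 distinct multiples of 100 reach the common-divisor check.
multiples = list(range(0, 30000, 100))
scheme, divisor, quotients = choose_encoding(multiples)
bits_before = bits_per_value(multiples)
bits_after = bits_per_value(quotients)

# 300 consecutive integers share no divisor, so offsets are used instead.
consecutive = [1_000_003 + i for i in range(300)]
fallback = choose_encoding(consecutive)
```

In the common-divisor case the largest stored number drops from 29900 to 299, so `bits_after` is smaller than `bits_before`. Note that the seven-value example from the table never reaches the divisor check in this model, because the table encoding is tried first.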
You’ll note that these compression schemes are not "traditional" general-purpose compression like DEFLATE or LZ4. Because the structure of a column-store is rigid and well-defined, we can achieve higher compression by using specialized schemes rather than the more general compression algorithms like LZ4.
You may be thinking "Well, that’s great for numbers, but what about strings?" Strings are encoded similarly, with the help of an ordinal table. The strings are de-duplicated and sorted into a table, each is assigned an ID, and then those IDs are used as numeric doc values. This means strings enjoy many of the same compression benefits that numerics do.
The ordinal table itself has some compression tricks, such as using fixed, variable or prefix-encoded strings.
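A minimal Python sketch of the ordinal idea, plus one possible form of prefix encoding for the sorted table. The names `build_ordinals` and `prefix_encode` and the sample terms are assumptions for illustration; Lucene's actual term dictionary format is more involved.

```python
def build_ordinals(terms):
    """De-duplicate and sort the terms, then replace each term by its table ID."""
    table = sorted(set(terms))
    ordinal_of = {term: i for i, term in enumerate(table)}
    return table, [ordinal_of[term] for term in terms]


def prefix_encode(table):
    """Encode a sorted table as (shared_prefix_length, remaining_suffix) pairs.

    Each entry records how many leading characters it shares with the
    previous entry, so the shared characters are not stored again.
    """
    encoded = []
    previous = ""
    for term in table:
        shared = 0
        limit = min(len(previous), len(term))
        while shared < limit and previous[shared] == term[shared]:
            shared += 1
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


# One string value per document, in document order.
terms = ["http_get", "http_post", "http_get", "https_get"]

table, ordinals = build_ordinals(terms)
compact_table = prefix_encode(table)

# Reading a document's value back is a single table lookup.
decoded = [table[ordinal] for ordinal in ordinals]
```

Because the table is sorted, neighbouring entries tend to share long prefixes, which is why prefix encoding pays off, and the per-document ordinals are plain small integers that can go through the numeric schemes described earlier.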