Hive's random sampling is practical: it helps you quickly analyze data distribution and spot potential data skew.
Sampling Syntax
Sampling Bucketized Table
table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])
The TABLESAMPLE clause allows users to write queries for samples of the data instead of the whole table. The TABLESAMPLE clause can be added to any table in the FROM clause. The buckets are numbered starting from 1. colname indicates the column on which to sample each row in the table. colname can be one of the non-partition columns in the table, or rand(), indicating sampling on the entire row instead of an individual column. The rows of the table are 'bucketed' on the colname randomly into y buckets numbered 1 through y. Rows which belong to bucket x are returned.
In the following example, the 3rd bucket out of the 32 buckets of table source is chosen. 's' is the table alias.
SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
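The bucket predicate can be sketched in plain Python. This is a simplification for intuition only: Hive uses its own hash implementation (not Python's), but the masking and modulo mirror the predicate that shows up in EXPLAIN output, `((hash(col) & 2147483647) % y) = x - 1`.

```python
def in_sample(col_hash: int, x: int, y: int) -> bool:
    """Return True if a row whose sampling column hashes to col_hash
    falls into bucket x of y (buckets are numbered from 1)."""
    # Mask to a non-negative 31-bit value, then take the modulo,
    # mirroring: ((hash(col) & 2147483647) % y) = x - 1
    return (col_hash & 2147483647) % y == x - 1

# Rows hashing to 2, 34, 66, ... land in bucket 3 of 32:
print(in_sample(2, 3, 32))    # True
print(in_sample(34, 3, 32))   # True
print(in_sample(5, 3, 32))    # False
```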
Input pruning: Typically, TABLESAMPLE will scan the entire table and fetch the sample, which is not very efficient. Instead, the table can be created with a CLUSTERED BY clause, which indicates the set of columns on which the table is hash-partitioned/clustered. If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table.
Example:
So in the above example, if table 'source' was created with 'CLUSTERED BY id INTO 32 BUCKETS', then
TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)
would pick out the 3rd and 19th clusters, as each bucket would be composed of (32/16)=2 clusters.
On the other hand, the TABLESAMPLE clause
TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)
would pick out half of the 3rd cluster, as each bucket would be composed of (32/64)=1/2 of a cluster.
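The input-pruning arithmetic above can be sketched as a small hypothetical helper (the function name and the partial-bucket reporting are illustrative, not part of Hive): with a table clustered into num_buckets physical buckets, TABLESAMPLE(BUCKET x OUT OF y) reads buckets x, x+y, x+2y, ... (1-based), each sample bucket covering num_buckets/y physical buckets.

```python
def pruned_buckets(x: int, y: int, num_buckets: int) -> list[int]:
    """Which physical buckets (1-based) are scanned for
    TABLESAMPLE(BUCKET x OUT OF y) on a table with num_buckets buckets."""
    if y > num_buckets:
        # Each sample bucket covers num_buckets/y < 1 of a physical bucket,
        # so only a fraction of one bucket's data is read.
        return [((x - 1) % num_buckets) + 1]
    # Every y-th bucket starting at x: x, x+y, x+2y, ...
    return list(range(x, num_buckets + 1, y))

print(pruned_buckets(3, 16, 32))  # [3, 19] -> the 3rd and 19th clusters
print(pruned_buckets(3, 64, 32))  # [3]     -> half of the 3rd cluster
```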
For information about creating bucketed tables with the CLUSTERED BY clause, see Create Table (especially Bucketed Sorted Tables) and Bucketed Tables.
Block Sampling
Block sampling is available starting with Hive 0.8. Addressed under JIRA - https://issues.apache.org/jira/browse/HIVE-2121
block_sample: TABLESAMPLE (n PERCENT)
This allows Hive to pick up at least n% of the data size (note that this doesn't necessarily mean the number of rows) as input. Only CombineHiveInputFormat is supported, and some special compression formats are not handled. If sampling fails, the input of the MapReduce job will be the whole table/partition. Sampling is done at the HDFS block level, so the sampling granularity is the block size. For example, if the block size is 256MB, even if n% of the input size is only 100MB, you get 256MB of data.
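The block-size rounding described above can be sketched in a few lines of Python (a simplified model, assuming the sample is rounded up to whole HDFS blocks):

```python
import math

def sampled_bytes(requested_bytes: int, block_size: int) -> int:
    """Bytes actually read when sampling at HDFS block granularity:
    the request is rounded up to a whole number of blocks."""
    return max(1, math.ceil(requested_bytes / block_size)) * block_size

MB = 1024 * 1024
# Requesting 100MB from a table with 256MB blocks still reads a full block:
print(sampled_bytes(100 * MB, 256 * MB) // MB)  # 256
```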
In the following example, 0.1% or more of the input size will be used for the query.
SELECT * FROM source TABLESAMPLE(0.1 PERCENT) s;
Sometimes you may want to sample the same data with different blocks; in that case you can change this seed number:
set hive.sample.seednumber=<INTEGER>;
Alternatively, the user can specify the total length to be read, but this has the same limitations as PERCENT sampling. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)
block_sample: TABLESAMPLE (ByteLengthLiteral)
ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')
In the following example, an input size of 100M or more will be used for the query.
SELECT * FROM source TABLESAMPLE(100M) s;
Hive also supports limiting input on a row-count basis, but this acts differently from the above two. First, it does not need CombineHiveInputFormat, which means it can be used with non-native tables. Second, the row count given by the user is applied to each input split, so the total row count can vary with the number of input splits. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)
block_sample: TABLESAMPLE (n ROWS)
For example, the following query will take the first 10 rows from each input split.
SELECT * FROM source TABLESAMPLE(10 ROWS);
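Since the row limit is applied per split, the total sampled row count scales with the split count. A back-of-the-envelope sketch:

```python
def total_sampled_rows(n_rows: int, num_splits: int) -> int:
    """Upper bound on rows returned by TABLESAMPLE(n ROWS):
    the limit applies to each input split independently."""
    return n_rows * num_splits

# TABLESAMPLE(10 ROWS) over an input read as 4 splits yields up to 40 rows:
print(total_sampled_rows(10, 4))  # 40
```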
Running EXPLAIN on the bucket-sampling query shows how the hash of rand() is handled:
hive> explain select fl_date FROM 19q1_tbl TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s group by fl_date;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: s
            Statistics: Num rows: 2851705 Data size: 285170702 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (((hash(rand()) & 2147483647) % 32) = 2) (type: boolean)
              Statistics: Num rows: 1425852 Data size: 142585300 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                keys: fl_date (type: string)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1425852 Data size: 142585300 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 1425852 Data size: 142585300 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Group By Operator
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 712926 Data size: 71292650 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 712926 Data size: 71292650 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 5.459 seconds, Fetched: 44 row(s)