
Hive random sampling: practical for quickly analyzing data distribution and spotting potential data skew

2019-07-06 10:13

Sampling Syntax

Sampling Bucketized Table

table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])
The TABLESAMPLE clause allows the users to write queries for samples of the data instead of the whole table. The TABLESAMPLE clause can be added to any table in the FROM clause. The buckets are numbered starting from 1. colname indicates the column on which to sample each row in the table. colname can be one of the non-partition columns in the table or rand() indicating sampling on the entire row instead of an individual column. The rows of the table are ‘bucketed’ on the colname randomly into y buckets numbered 1 through y. Rows which belong to bucket x are returned.

In the following example, the 3rd bucket out of the 32 buckets of the table source is chosen; 's' is the table alias. Rows are randomly assigned to 32 buckets and those belonging to bucket 3 are returned.

SELECT *
FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;

Input pruning: Typically, TABLESAMPLE will scan the entire table and fetch the sample. But, that is not very efficient. Instead, the table can be created with a CLUSTERED BY clause which indicates the set of columns on which the table is hash-partitioned/clustered on. If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table.

Example:
So in the above example, if table ‘source’ was created with ‘CLUSTERED BY id INTO 32 BUCKETS’, then

TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)

would pick out the 3rd and 19th clusters, as each sampled bucket would be composed of (32/16)=2 clusters.

On the other hand, the tablesample clause

TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)

would pick out half of the 3rd cluster, as each sampled bucket would be composed of (32/64)=1/2 of a cluster.

For information about creating bucketed tables with the CLUSTERED BY clause, see Create Table (especially Bucketed Sorted Tables) and Bucketed Tables.
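To make the pruning behavior concrete, here is a minimal end-to-end sketch. Only 'CLUSTERED BY id INTO 32 BUCKETS' comes from the example above; the table name source and its column list are assumptions:

-- hypothetical bucketed table matching the example above
CREATE TABLE source (
  id   BIGINT,
  name STRING
)
CLUSTERED BY (id) INTO 32 BUCKETS;

-- sample column matches the CLUSTERED BY column: input pruning applies,
-- and only the files holding clusters 3 and 19 are scanned
SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 16 ON id) s;

-- sampling on rand() cannot be pruned: the whole table is scanned
SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;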

Block Sampling

Block sampling is available starting with Hive 0.8. Addressed under JIRA - https://issues.apache.org/jira/browse/HIVE-2121
block_sample: TABLESAMPLE (n PERCENT)
This will allow Hive to pick up at least n% data size (notice it doesn’t necessarily mean number of rows) as inputs. Only CombineHiveInputFormat is supported and some special compression formats are not handled. If we fail to sample it, the input of MapReduce job will be the whole table/partition. We do it in HDFS block level so that the sampling granularity is block size. For example, if block size is 256MB, even if n% of input size is only 100MB, you get 256MB of data.
At least n% of the data size (not necessarily n% of the rows) is taken as input.
Only CombineHiveInputFormat is supported; some special compression formats are not handled.
If sampling fails, the whole table/partition becomes the input of the MapReduce job.
Sampling is done at the HDFS block level, so the granularity is the block size.

In the following example, 0.1% or more of the input size will be used for the query.

SELECT *
FROM source TABLESAMPLE(0.1 PERCENT) s;

Sometimes you want to sample the same data with different blocks; for that, you can change this seed number:

set hive.sample.seednumber=<INTEGER>;
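For instance (a sketch: the seed value 7 is arbitrary, source is the hypothetical table from above, and the default seed is assumed to be 0):

-- change the seed, then re-run the same PERCENT sample to draw different blocks
set hive.sample.seednumber=7;
SELECT *
FROM source TABLESAMPLE(0.1 PERCENT) s;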
Or the user can specify the total length to be read, but it has the same limitations as PERCENT sampling. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)
block_sample: TABLESAMPLE (ByteLengthLiteral)
ByteLengthLiteral : (Digit)+ (‘b’ | ‘B’ | ‘k’ | ‘K’ | ‘m’ | ‘M’ | ‘g’ | ‘G’)
That is, a number followed by a byte-unit suffix.
In the following example, an input size of 100M or more will be used for the query.
SELECT *
FROM source TABLESAMPLE(100M) s;

Hive also supports limiting input on a row-count basis, but it acts differently from the two modes above. First, it does not need CombineHiveInputFormat, which means it can be used with non-native tables. Second, the row count given by the user is applied to each input split, so the total row count can vary with the number of input splits. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)

block_sample: TABLESAMPLE (n ROWS)

For example, the following query will take the first 10 rows from each input split.

SELECT * FROM source TABLESAMPLE(10 ROWS);
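Because the limit applies per split, the total returned is not a flat 10. A quick sanity check (a sketch, reusing the same hypothetical source table):

-- expect 10 * (number of input splits), not 10
SELECT count(*) FROM source TABLESAMPLE(10 ROWS);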

Here is an EXPLAIN of a bucket-sampled query; note how the hash of rand() is handled:

hive> explain select fl_date FROM 19q1_tbl TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s group by fl_date;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: s
            Statistics: Num rows: 2851705 Data size: 285170702 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (((hash(rand()) & 2147483647) % 32) = 2) (type: boolean)
              Statistics: Num rows: 1425852 Data size: 142585300 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                keys: fl_date (type: string)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1425852 Data size: 142585300 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 1425852 Data size: 142585300 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Group By Operator
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 712926 Data size: 71292650 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 712926 Data size: 71292650 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 5.459 seconds, Fetched: 44 row(s)
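The plan shows that the bucket sample compiles down to an ordinary filter, ((hash(rand()) & 2147483647) % 32) = 2, where the remainder 2 corresponds to BUCKET 3 because buckets are numbered from 1. A hand-written equivalent of the sampled query above (a sketch, for illustration only):

SELECT fl_date
FROM 19q1_tbl
WHERE ((hash(rand()) & 2147483647) % 32) = 2
GROUP BY fl_date;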