Hive Analytic Functions: Learning GROUPING SETS, CUBE, and ROLLUP

2017-11-15 19:31

DDL of the source table:

hive> show create table bi_all_access_log;
OK
CREATE TABLE `bi_all_access_log`(
`appsource` string,
`appkey` string,
`identifier` string,
`uid` string)
PARTITIONED BY (
`pt_month` string,
`pt_day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'line.delim'='\n',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://emr-cluster/user/hive/warehouse/bi_all_access_log'
TBLPROPERTIES (
'transient_lastDdlTime'='1481864860')
Time taken: 0.025 seconds, Fetched: 22 row(s)

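1. GROUPING SETS
GROUPING SETS is a clause of GROUP BY that lets you list several grouping combinations after the GROUP BY columns. Put simply, it behaves like several GROUP BY queries whose results are combined with UNION ALL. Columns that are not part of a given grouping set come back as NULL in the rows for that set.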

select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((pt_day))
;

select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((pt_day,appkey))
;

select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((pt_day),(pt_day,appkey))
;

select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((),(pt_day),(pt_day,appkey))
;
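
To make the "several GROUP BY queries plus UNION ALL" reading concrete, here is a rough sketch of what the last query above corresponds to when written by hand. The column aliases (identifier_cnt, uid_cnt) and the subquery alias t are made up for illustration, and the subquery wrapper is only there because older Hive versions are assumed not to accept a top-level UNION ALL. Treat it as an approximation, not as what Hive runs internally.

```sql
select *
from (
  -- grouping set (): one total over the whole month
  select cast(null as string) as pt_day,
         cast(null as string) as appsource,
         cast(null as string) as appkey,
         count(identifier)    as identifier_cnt,
         count(uid)           as uid_cnt
  from bi_all_access_log
  where pt_month='2017-11'
  union all
  -- grouping set (pt_day): one row per day
  select pt_day,
         cast(null as string) as appsource,
         cast(null as string) as appkey,
         count(identifier)    as identifier_cnt,
         count(uid)           as uid_cnt
  from bi_all_access_log
  where pt_month='2017-11'
  group by pt_day
  union all
  -- grouping set (pt_day,appkey): one row per day and appkey
  select pt_day,
         cast(null as string) as appsource,
         appkey,
         count(identifier)    as identifier_cnt,
         count(uid)           as uid_cnt
  from bi_all_access_log
  where pt_month='2017-11'
  group by pt_day,appkey
) t
;
```

Note that appsource is NULL in every branch because it never appears in any of the three grouping sets.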

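2. CUBE
CUBE, often called the data cube, aggregates over every combination of the listed dimensions. CUBE(a,b,c) first groups by (a,b,c), then by (a,b), (a,c), (a), (b,c), (b), (c), and finally aggregates over the whole table, so it returns the aggregates for all 2^3 = 8 subsets of the selected columns.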

select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with cube
;
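
Following the expansion described above, the WITH CUBE query on these three columns should be equivalent to listing all eight grouping sets explicitly. This is a sketch for comparison only; in practice WITH CUBE is the shorter way to write it.

```sql
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets(
  (pt_day,appsource,appkey),  -- all three dimensions
  (pt_day,appsource),
  (pt_day,appkey),
  (pt_day),
  (appsource,appkey),
  (appsource),
  (appkey),
  ()                          -- total over the whole month
)
;
```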

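3. ROLLUP
ROLLUP computes multi-level aggregates by dropping grouping columns one at a time from right to left, which gives the subtotals of a single hierarchy. For GROUP BY a,b,c WITH ROLLUP the grouping sets are (a,b,c), (a,b), (a), and the grand total ().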

select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with rollup
;
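
For comparison, the WITH ROLLUP query above can be spelled out as explicit grouping sets. This sketch simply drops the rightmost column at each level, treating pt_day > appsource > appkey as the hierarchy.

```sql
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets(
  (pt_day,appsource,appkey),  -- most detailed level
  (pt_day,appsource),         -- appkey rolled up
  (pt_day),                   -- appsource and appkey rolled up
  ()                          -- grand total
)
;
```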

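4. GROUPING__ID
GROUPING__ID (written with two underscores in Hive) identifies which grouping set produced a row. It is used to tell a NULL that actually exists in the data apart from a NULL generated by CUBE, ROLLUP, or GROUPING SETS.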

select pt_day,appsource,appkey,GROUPING__ID,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with rollup
;

select pt_day,appsource,appkey,GROUPING__ID,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with cube
;

select pt_day,appsource,appkey,GROUPING__ID,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((),(pt_day),(pt_day,appkey))
;
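
The numeric value of GROUPING__ID is a bit vector over the GROUP BY columns, and its encoding is documented as having changed in Hive 2.3.0, so it is worth checking the output on your own version before hard-coding any values. As a less version-sensitive sketch, assuming Hive 2.3.0 or later where the grouping() function is available, each column can be flagged individually. The *_rolled_up aliases are made up for this example.

```sql
-- Assumes Hive 2.3.0+; grouping(col) is 1 when col was rolled up
-- (not part of the row's grouping set) and 0 when it was grouped on.
select pt_day,appsource,appkey,
       GROUPING__ID,
       grouping(pt_day)    as pt_day_rolled_up,
       grouping(appsource) as appsource_rolled_up,
       grouping(appkey)    as appkey_rolled_up,
       count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with rollup
;
```

A row where appkey is NULL but appkey_rolled_up is 0 carries a NULL that came from the data itself rather than from the rollup.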

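5. Summary
CUBE produces the most complete set of groupings: every combination of the dimensions, with each dimension either grouped on or rolled up to NULL.

ROLLUP produces only the combinations in which, once a dimension is rolled up to NULL, every dimension after it is NULL as well; when a dimension is not rolled up, the next one can be either.

GROUPING SETS lets you define the groupings yourself, so you list only the ones you need.

PS: Using GROUPING SETS simplifies the SQL and typically performs better than running a separate single-grouping GROUP BY for each combination and merging the results with UNION.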
