
A detailed look at optimizing count(distinct) in Hive SQL

2019-04-30

1. count(distinct)

select count(distinct column_name)  from table_name  where  ...

Deduplicated counting on a column, e.g. counting users by counting distinct user IDs: count(distinct userId).
Why it needs optimizing: because of the DISTINCT, the map-side combiner cannot pre-deduplicate the output, so the amount of data pushed through the shuffle grows.
Wrong fix: explicitly raise the number of reduce tasks to increase reduce-stage parallelism with set mapred.reduce.tasks=n.
This turns out not to add any reduce tasks: when Hive computes a "full aggregate" such as count, it ignores the specified number of reduce tasks and forces a single reducer.
Correct fix: use a nested subquery so that the work is split across more MapReduce stages and reducers.
Limitation: this only suits deduplication on a single column.

select count(*) from (select distinct column_name from table_name where ...) t
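
The benchmarks below also test a second rewrite that deduplicates with group by instead of distinct (referred to as optimization 2). As a generic template, with column_name and table_name the same placeholders as in the query above, it looks like this:

-- optimization 2: group by spreads the deduplication across many reducers,
-- and the outer count then runs as a separate, cheap aggregation
select count(column_name)
from (select column_name from table_name where ... group by column_name) t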

The concrete verification follows. When the data volume is small (roughly under a million rows), or the query filters the data down to a fine granularity, using count(distinct) directly is actually the most efficient:

select count(tduserid) from(select distinct tduserid from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27') t;    1m1s
select count(*) from(select distinct tduserid from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27') t;    1m3s
select count(1) from(select distinct tduserid from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27') t;   1m13s    // the comparison above mainly verifies that counting a named column is the fastest of the count() variants    optimization 1

select count(tduserid) from(select tduserid from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27' group by tduserid) t;   1m    optimization 2

select count(distinct tduserid) from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27';   36s    unoptimized

When the data volume is large, or the filtering granularity is coarse, the optimized versions are clearly faster:

select count(tduserid) from(select distinct tduserid from tdanalytics.stg_td_launch_ex) t;   1m29s    optimization 1
select count(tduserid) from(select tduserid from tdanalytics.stg_td_launch_ex group by tduserid) t;    1m49s    optimization 2
select count(distinct tduserid) from tdanalytics.stg_td_launch_ex;      2m40s    unoptimized

Optimization 1 is clearly the fastest, but its drawback is that it only works when a single column is deduplicated. When several columns each need a distinct count, only optimization 2 applies, combining group by with union all as follows:

select count(tduserid),count(sessionid)
from (select sessionid,null tduserid from tdanalytics.stg_td_launch_ex
      group by sessionid
      union all
      select null sessionid,tduserid from tdanalytics.stg_td_launch_ex
      group by tduserid) tl;      2m38s    optimized

select count(distinct tduserid),count(distinct sessionid) from tdanalytics.stg_td_launch_ex;    no result after 5 minutes    unoptimized
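
The same rewrite as a generic sketch, where col_a, col_b and src are placeholder names rather than anything from the queries above: each branch deduplicates one column with group by and pads the other with null, the branches line up by position in the union all, and the outer count() skips the padded nulls, so a single pass yields both distinct counts.

-- generic sketch of the group by + union all rewrite for two distinct counts
select count(col_a),      -- non-null only in the first branch: number of distinct col_a values
       count(col_b)       -- non-null only in the second branch: number of distinct col_b values
from (select col_a, null as col_b from src group by col_a
      union all
      select null as col_a, col_b from src group by col_b) t;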

2. Qualify every column name with its table alias, even columns that exist in only one of the tables (don't be lazy: telling the engine which table a column comes from is always simpler than making it figure that out itself).

select
'2019-04-27' as the_date,"天" as type,'3006062',count(distinct tne.tduserid),count(distinct tl.tduserid),count(distinct sessionid),
cast(sum(session_duration)/count(distinct sessionid)/60000 as decimal(10,2))
from (select session_duration,sessionid,tduserid from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27') tl
left join (select tduserid from tdanalytics.stg_td_newuser_ex where productid='3006062' and l_date = '2019-04-27') tne on tl.tduserid = tne.tduserid;   1m40s    sessionid left unqualified

select
'2019-04-27' as the_date,"天" as type,'3006062',count(distinct tne.tduserid),count(distinct tl.tduserid),count(distinct tl.sessionid),
cast(sum(session_duration)/count(distinct tl.sessionid)/60000 as decimal(10,2))
from (select session_duration,sessionid,tduserid from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27') tl
left join (select tduserid from tdanalytics.stg_td_newuser_ex where productid='3006062' and l_date = '2019-04-27') tne on tl.tduserid = tne.tduserid;   55s    sessionid qualified as tl.sessionid

3. Map Join
Put the small table first (it is the one loaded into memory, which is exactly what a map join does) and the large table after it.
Joins in Hive fall into two kinds: Common Join, performed in the reduce stage, and Map Join, performed in the map stage.
A Map Join caches the small table in memory and joins it on the map side, so the shuffle and reduce phases are skipped entirely.
Before Hive 0.7 you had to add the hint /*+ MAPJOIN(table) */ for a Map Join to be used, otherwise a Common Join was executed; since 0.7 Hive converts eligible joins to Map Joins automatically, controlled by the parameter hive.auto.convert.join, which defaults to true.
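
A minimal sketch of both forms, reusing the tables from section 2 purely for illustration (it assumes stg_td_newuser_ex is the small side; the hint syntax and hive.auto.convert.join come from the text above, everything else is illustrative):

-- let Hive convert eligible joins to map joins automatically (the default since 0.7)
set hive.auto.convert.join=true;

-- pre-0.7 style: force a map join with a hint; the hinted table is loaded into memory
-- (small table first, large table after)
select /*+ MAPJOIN(tne) */ count(distinct tl.tduserid)
from tdanalytics.stg_td_newuser_ex tne
join tdanalytics.stg_td_launch_ex tl
  on tl.tduserid = tne.tduserid;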
