A detailed guide to optimizing count(distinct) in Hive SQL
1. count(distinct)
select count(distinct column_name) from table_name where ...
This counts the distinct values of a column. Example: counting users by deduplicating user IDs with count(distinct userId).
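Why it needs optimizing: once DISTINCT is involved, the map phase cannot use a combiner to deduplicate its output, so the amount of data that goes through the shuffle grows.
The wrong fix: explicitly raise the number of reduce tasks to increase parallelism in the reduce phase: set mapred.reduce.tasks=n
This does not actually increase the number of reduce tasks. When Hive processes a "full aggregate" such as a count with no GROUP BY, it ignores the user-specified number of reduce tasks and forces it to 1.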
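To make the failed attempt concrete, here is a minimal sketch of what it looks like in a session. The table user_events and its columns user_id and dt are hypothetical names used only for illustration; they are not from the measurements in this post.

```sql
-- Hypothetical table: user_events(user_id, dt). Names are illustrative only.

-- Ask for 20 reducers. (mapred.reduce.tasks is the older property name;
-- newer Hadoop releases call the same setting mapreduce.job.reduces.)
SET mapred.reduce.tasks=20;

-- A global aggregate with no GROUP BY: all values must meet in one place
-- to be deduplicated, so the plan still ends in a single reducer and the
-- setting above does not help this query.
SELECT COUNT(DISTINCT user_id)
FROM user_events
WHERE dt = '2019-04-27';
```

The setting is not useless in general. It can still apply to stages that have a grouping or deduplication key, which is what the nested rewrite relies on.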
The right fix: nest the query, which adds one more MapReduce job.
Limitation: this only suits deduplicating a single column.
select count(*) from (select distinct column_name from table_name where ...) t
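One way to check what the rewrite changes is to compare the two plans with EXPLAIN. The sketch below again assumes a hypothetical table user_events(user_id, dt). It also shows a detail worth watching in the rewrite: count(distinct col) ignores NULL, while select distinct keeps NULL as one row, so an outer count(*) can come out one higher than the original query when the column contains NULLs.

```sql
-- Hypothetical table: user_events(user_id, dt). Names are illustrative only.

-- Original form: a single aggregation stage.
EXPLAIN
SELECT COUNT(DISTINCT user_id)
FROM user_events
WHERE dt = '2019-04-27';

-- Nested form: the inner query deduplicates (this stage can use several
-- reducers), and the outer query only counts the rows that survive.
EXPLAIN
SELECT COUNT(user_id)          -- COUNT(user_id) skips the NULL row, matching
FROM (                         -- COUNT(DISTINCT user_id); COUNT(*) would count it
  SELECT DISTINCT user_id
  FROM user_events
  WHERE dt = '2019-04-27'
) t;
```

If the column is known to be non-null, count(*), count(1) and count(column) return the same number and the choice is only a matter of speed.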
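The measurements below back this up. When the data set is small (roughly under a million rows), or the query is restricted to a fine-grained slice of the data, plain count(distinct) is the fastest: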
Optimization 1 (select distinct subquery), with three ways of writing the outer count:
select count(tduserid) from (select distinct tduserid from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27') t;  -- 1m1
select count(*) from (select distinct tduserid from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27') t;  -- 1m3
select count(1) from (select distinct tduserid from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27') t;  -- 1m13
The main point of comparing these three is that counting the named column, count(tduserid), was the fastest.
Optimization 2 (group by subquery):
select count(tduserid) from (select tduserid from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27' group by tduserid) t;  -- 1m
Unoptimized:
select count(distinct tduserid) from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27';  -- 36s
When the data set is large, or the query covers a coarse-grained slice of the data, the optimized forms are clearly faster:
Optimization 1 (select distinct subquery):
select count(tduserid) from (select distinct tduserid from tdanalytics.stg_td_launch_ex) t;  -- 1m29
Optimization 2 (group by subquery):
select count(tduserid) from (select tduserid from tdanalytics.stg_td_launch_ex group by tduserid) t;  -- 1m49
Unoptimized:
select count(distinct tduserid) from tdanalytics.stg_td_launch_ex;  -- 2m40s
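Optimization 1 is clearly the fastest here, but its drawback is that it only works when deduplicating a single column. When several columns have to be deduplicated, only optimization 2 can be used, as in the following code: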
Optimized:
select count(tduserid), count(sessionid) from (select sessionid, null tduserid from tdanalytics.stg_td_launch_ex group by sessionid union all select null sessionid, tduserid from tdanalytics.stg_td_launch_ex group by tduserid) tl;  -- 2m38
Unoptimized:
select count(distinct tduserid), count(distinct sessionid) from tdanalytics.stg_td_launch_ex;  -- no result after 5 minutes
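2. Qualify every column name with its table name or alias, even when the column exists in only one table. (Don't be lazy: telling the engine which table a column belongs to is easier than making it guess.)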
With sessionid left unqualified:
select '2019-04-27' as the_date, "天" as type, '3006062', count(distinct tne.tduserid), count(distinct tl.tduserid), count(distinct sessionid), cast(sum(session_duration)/count(distinct sessionid)/60000 as decimal(10,2)) from (select session_duration, sessionid, tduserid from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27') tl left join (select tduserid from tdanalytics.stg_td_newuser_ex where productid='3006062' and l_date = '2019-04-27') tne on tl.tduserid = tne.tduserid;  -- 1m40s
With sessionid qualified as tl.sessionid:
select '2019-04-27' as the_date, "天" as type, '3006062', count(distinct tne.tduserid), count(distinct tl.tduserid), count(distinct tl.sessionid), cast(sum(session_duration)/count(distinct tl.sessionid)/60000 as decimal(10,2)) from (select session_duration, sessionid, tduserid from tdanalytics.stg_td_launch_ex where productid='3006062' and l_date = '2019-04-27') tl left join (select tduserid from tdanalytics.stg_td_newuser_ex where productid='3006062' and l_date = '2019-04-27') tne on tl.tduserid = tne.tduserid;  -- 55s
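3. Map Join
Put the small table first and the large table last, because the small table has to be loaded into memory (which is exactly what a map join does).
Joins in Hive fall into two kinds: Common Join (the join is done in the reduce phase) and Map Join (the join is done in the map phase).
A Map Join loads the small table into memory and performs the join in the map phase, which skips the shuffle and reduce phases.
Before Hive 0.7, a Map Join was only performed when the query carried the hint /*+ mapjoin(table) */; otherwise Hive ran a Common Join. From 0.7 onward Hive can convert a join to a Map Join automatically. This is controlled by the parameter hive.auto.convert.join, which defaults to true in current versions (in Hive 0.7 through 0.10 the default was false).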
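The two ways of getting a Map Join described above can be sketched as follows. The tables big_fact and small_dim and their columns are hypothetical and only serve the illustration. The size threshold shown is the usual default for hive.mapjoin.smalltable.filesize; treat the exact value as an assumption to check against your own cluster.

```sql
-- Hypothetical tables: big_fact(user_id, amount) and small_dim(user_id, name).

-- Automatic conversion: Hive turns the join into a Map Join when the
-- smaller side is below the size threshold (in bytes).
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;

SELECT f.user_id, d.name, f.amount
FROM small_dim d
JOIN big_fact f ON f.user_id = d.user_id;

-- Explicit hint: names the alias to load into memory. On versions where
-- hive.ignore.mapjoin.hint is true, the hint is ignored and only the
-- automatic conversion above applies.
SELECT /*+ MAPJOIN(d) */ f.user_id, d.name, f.amount
FROM small_dim d
JOIN big_fact f ON f.user_id = d.user_id;
```

Running EXPLAIN on either query shows whether the plan contains a map-side join operator or falls back to a Common Join.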