您的位置:首页 > 其它

Hive进阶-深入解析Hive底层实现 - Distinct 的底层实现

2018-01-10 16:40 1846 查看
转自Hive – Distinct 的实现 并稍作更改

http://ju.outofmemory.cn/entry/784

Hive版本为1.1.0。有空的话其实可以分析它在hive on spark 的底层实现是怎么样的

分析语句

准备数据

计算过程

Operator

Explain

分析语句

SELECT count, COUNT(DISTINCT uid) FROM logs GROUP BY count;


根据count分组,计算独立用户数。

业务解释:分析有多少个人有x个物品

准备数据

create table logs(uid string,name string,count string);


insert into table logs values('a','apple','3'),('a','orange','3'),('a','banana','1'),('b','banana','3');


select * from logs;
OK
a       apple   3
a       orange  3
a       banana  1
b       banana  3


计算过程



第一步先在mapper计算部分值,会以count和uid作为key,如果是distinct并且之前已经出现过,则忽略这条计算。第一步是以组合为key,第二步是以count为key.

ReduceSink是在mapper.close()时才执行的,在GroupByOperator.close()时,把结果输出。注意这里虽然key是count和uid,但是在reduce时分区是按count来的!

第一步的distinct计算的值没用,要留到reduce计算的才准确。这里只是减少了key组合相同的行。不过如果是普通的count,后面是会合并起来的。

distinct通过比较lastInvoke判断要不要+1(因为在reduce是排序过了的,所以判断distict的字段变了没有,如果没变,则不+1)

Operator



Explain

explain select count, count(distinct uid) from logs group by count;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1

STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: logs
Statistics: Num rows: 4 Data size: 39 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: count (type: string), uid (type: string)
outputColumnNames: count, uid
Statistics: Num rows: 4 Data size: 39 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(DISTINCT uid)
keys: count (type: string), uid (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 4 Data size: 39 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string)
sort order: ++
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 4 Data size: 39 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col1:0._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息