您的位置:首页 > 其它

hive 底层模块实现-distinct

2017-01-18 11:29 351 查看

准备数据

语句

SELECT COUNT, COUNT(DISTINCT uid) FROM logs GROUP BY COUNT;
hive> SELECT * FROM logs;
OK
a   苹果  3
a   橙子  3
a   烧鸡  1
b   烧鸡  3

hive> SELECT COUNT, COUNT(DISTINCT uid) FROM logs GROUP BY COUNT;


根据count分组,计算独立用户数。

计算过程



默认设置了hive.map.aggr=true,所以会在mapper端先group by一次,最后再把结果merge起来,为了减少reducer处理的数据量。注意看explain的mode是不一样的。mapper是hash,reducer是mergepartial。如果把hive.map.aggr=false,那将groupby放到reducer才做,他的mode是complete.

Operator



Explain

hive> explain SELECT uid, sum(count) FROM logs group by uid;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid))))

STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage

STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
logs
TableScan // 扫描表
alias: logs
Select Operator //选择字段
expressions:
expr: uid
type: string
expr: count
type: int
outputColumnNames: uid, count
Group By Operator //这里是因为默认设置了hive.map.aggr=true,会在mapper先做一次聚合,减少reduce需要处理的数据
aggregations:
expr: sum(count) //聚集函数
bucketGroup: false
keys: //键
expr: uid
type: string
mode: hash //hash方式,processHashAggr()
outputColumnNames: _col0, _col1
Reduce Output Operator //输出key,value给reducer
key expressions:
expr: _col0
type: string
sort order: +
Map-reduce partition columns:
expr: _col0
type: string
tag: -1
value expressions:
expr: _col1
type: bigint
Reduce Operator Tree:
Group By Operator

aggregations:
expr: sum(VALUE._col0)
//聚合
bucketGroup: false
keys:
expr: KEY._col0
type: string
mode: mergepartial //合并值
outputColumnNames: _col0, _col1
Select Operator //选择字段
expressions:
expr: _col0
type: string
expr: _col1
type: bigint
outputColumnNames: _col0, _col1
File Output Operator //输出到文件
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0
Fetch Operator
limit: -1


转载:http://ju.outofmemory.cn/entry/784
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签:  hive 数据