您的位置：首页 > 其它
hive：(group by, having;order by)的使用；group by+多个字段，以及wiki说的group by两种使用限制验证

2014-03-24 16:12 399 查看
hive> select * from app_data_stats_historical where os='1' group by dt limit 100;
出现结果如下：
2014-01-01
2014-01-06
......
2014-02-07
2014-02-10
2014-02-14
2014-02-17
2014-02-24
（只返回了一列日期。说明* 不起作用。不过这样，可以查看总共哪些日期，有效，存在数据）
加上having命令，having只作用于group by中的字段，非group的字段不行：
select * from app_data_stats_historical where os='1' group by dt having dt>' limit 1002014-02-01' limit 100;
2014-02-03
2014-02-04
2014-02-07
2014-02-10
2014-02-14
2014-02-17
2014-02-24
（可以用日期进行比较。）

hive> select * from app_data_stats_historical where os='1' order by dt desc limit 100;
字段是全部返回的。降序。

group by 多个字段：

2周内分联盟（5是多盟）分平台(分平台指的是分ios和android)，分时段的曝光及点击。
select substr(createtime,12,2)hour,logtype,os_id,count(distinct logtype)
from wizad_mdm_raw_hdfs
where ad_id in('19057','19058','812b4ba287f5ee0bc9d43bbf5bbe87fb') and day >= '2014-12-01'and adn=5
group by substr(createtime,12,2), logtype, os_id;
注意，时段hour提取函数substr和substring 是通用的！

或者分步做：将group的字段按取值加到where中去（os_id取值1是android，2是iOS）：
select substr(createtime,12,2)hour,logtype,count(*)
from wizad_mdm_raw_hdfs
where ad_id in('19057','19058','812b4ba287f5ee0bc9d43bbf5bbe87fb') and day >= '2014-12-01'and os_id='2' and adn=5
group by substr(createtime,12,2), logtype;
以及：
select substr(createtime,12,2),logtype,count(*)
from wizad_mdm_raw_hdfs
where ad_id in('19057','19058','812b4ba287f5ee0bc9d43bbf5bbe87fb') and day >= '2014-12-01'and os_id='1' and adn=5
group by substr(createtime,12,2), logtype;

substr(createtime,12,2)提取小时字段，起hour别名，group by 不支持。
group 里面不能起别名hour，直接用或引用都不行。
（1）Group by定义别名hour，报错，解析错误
selectsubstr(createtime,12,2),logtype,count(*)
from wizad_mdm_raw_hdfs
where day >= '2014-12-01'
group by substr(createtime,12,2) hour,logtype;

（2）引用定义的hour别名也不行： Line 4:9 Invalid table alias or column reference 'hour'
select substr(createtime,12,2)hour,logtype,count(*)
from wizad_mdm_raw_hdfs
where day >= '2014-12-01'
group by hour, logtype;

hive wiki上说group by有两种使用情况限制
（1）group by只有一列，则distinct只能作用一列（可以多次作用同一列）
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

像这样 SELECTpv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCTpv_users.userid)
只distinct一列，但出现多次是可以的。
下面的查询错误，不允许DISTINCT多个列
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCTpv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
但我测试发现是可以的。有大神可以给我解释么？

如
group by统计去重distinct个数
select substr(createtime,12,2)hour,logtype,os_id,count(distinct logtype), count(distinct os_id)
from wizad_mdm_raw_hdfs
where ad_id in('19057','19058','812b4ba287f5ee0bc9d43bbf5bbe87fb') and day >= '2014-12-01'and adn=5
group by substr(createtime,12,2), logtype, os_id;


或者一些测试脚本都说明了，我们
select logtype, count(distinct os_id),count(distinctip),count(distinct id)
from wizad_mdm_raw_hdfs
where ad_id in('19057','19058','812b4ba287f5ee0bc9d43bbf5bbe87fb') and day >= '2014-12-01'
group by logtype;
结果：
1      2       8493314 77579300
2      2       267685  211119

select substr(createtime,12,2) hour, count(distinctid),count(distinct ip)
from wizad_mdm_raw_hdfs
where ad_id in('19057','19058','812b4ba287f5ee0bc9d43bbf5bbe87fb') and day >= '2014-12-01'and adn=5
group by substr(createtime,12,2);

部分结果如下：
00     1598136 154053
04     989745  51201
……
18     1711493 201436
21     3644241 374243

（2）另一个wiki限制说明：group by后，除了作用列和聚合函数统计项，多余列不能存在。
这个我验证过确实是不行的。
SELECT a,sum(b)
FROM t1
GROUP BY a;
是正确的

下面是错误的。
SELECT a,b
FROM t1
GROUP BY a;
因为有多余列b，其不在group by的字段属性，(且不是聚合函数).
查询结果会是这样
a   b
------
100 1
100 2
100 3
gourp by a后，b不能成为集合{1,2,3}返回，你可以count，但不能直接返回b。b是多值的。hive摒弃了这种猜测无效的SQL（HQL，要准确）：有一列在select子句中，却不包含在GROUPBY子句中。
pig是可以构成集合返回的。
内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理
标签：
相关文章推荐
新的分享
章节导航