A Detailed Guide to Using collect_set in Hive
2017-08-09 17:51
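Suppose we have the following requirement: in Hive, find the users in a table whose first login happened on a given day. collect_set can handle this with the following SQL: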
select count(a.id)
from (
  select id, collect_set(time) as t
  from t_action_login
  where time <= '20150906'
  group by id
) as a
where size(a.t) = 1 and a.t[0] = '20150906';
The subquery in the statement above,
select id,collect_set(time) as t from t_action_login where time<='20150906' group by id
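groups the rows by id. An id may have logged in on a single day or on several days, so one id can map to several time values. collect_set gathers the time values of each id into a single array, dropping duplicates along the way (the result is an array, displayed with commas between its elements, not a comma-separated string). The subquery returns: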
| 123@163.com | ["20150620","20150619"] |
| abc@163.com | ["20150816"] |
| cde@qq.com | ["20150606","20150608","20150607","20150609","20150613","20150610","20150616","20150615"] |
| 789@sohu.com | ["20150827","20150623","20150627","20150820","20150823","20150612","20150717"] |
| 987@163.com | ["20150701","20150829","20150626","20150625","20150726","20150722","20150629","20150824","20150716","20150 |
| ddsf@163.com | ["20150804","20150803","20150801","20150809","20150807","20150806","20150905","20150904","20150730","20150 |
| 182@163.com | ["20150803","20150801","20150809","20150808","20150805","20150806","20150906","20150904","20150730","20150 |
| 22225@163.com | ["20150604","20150609","20150622","20150827","20150625","20150620","20150613","20150610","20150614","20150 |
| 18697@qq.com | ["20150902"] |
| 1905@qq.com | ["20150709"] |
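Note that the arrays in this output are not in chronological order (the first row lists 20150620 before 20150619); collect_set makes no promise about element order. If the order matters, one option is to wrap the result in sort_array. The following is a minimal sketch that reuses the table and column names from the query above and assumes time holds yyyyMMdd strings:

```sql
-- Sketch: same subquery, but each id's dates are sorted ascending.
-- sort_array orders elements by their natural order; for yyyyMMdd
-- strings that coincides with chronological order.
-- If your Hive version treats `time` as a reserved keyword,
-- the column name may need backquotes.
select id,
       sort_array(collect_set(time)) as t
from t_action_login
where time <= '20150906'
group by id;
```

With a sorted array, t[0] is the earliest login date of that id within the filtered range.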
We can therefore build the filter on this returned array, namely:
where size(a.t)=1 and a.t[0]='20150906';
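This keeps the ids whose array has length 1 and whose first (and only) element is 20150906. Because the subquery only looks at rows with time <= '20150906', such an id has no login on any earlier day, so 20150906 is the day it first logged in.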
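As a cross-check, the same count can also be obtained without building an array. The following is an alternative sketch, not the approach used in this article: it takes each id's earliest login date with min and keeps the ids whose earliest date is the target day. It assumes, as the original query does, that time holds yyyyMMdd strings, so string comparison matches date order.

```sql
-- Alternative sketch: count the ids whose earliest login is 20150906.
select count(a.id)
from (
  select id
  from t_action_login
  where time <= '20150906'
  group by id
  having min(time) = '20150906'   -- no login before the target day
) as a;
```

This form avoids materializing an array per id, which may matter for ids with a very long login history.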
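Summary:
In a grouped query, Hive does not allow selecting a column that is not in the group by clause directly.
Such non-group-by columns can be gathered with Hive's collect_set function, which returns an array with duplicates removed.
Elements of the array can be accessed directly with a numeric index, starting at 0.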
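These three points can be seen together in one query. This is an illustrative sketch on the same table; the column aliases (days, logins, first_listed, day_count) are made up for the example. It also shows collect_list (available since Hive 0.13), which keeps duplicates where collect_set removes them.

```sql
-- Rejected by Hive: time is neither grouped nor aggregated.
-- select id, time from t_action_login group by id;

-- Accepted: time is only reached through aggregate functions.
select id,
       collect_set(time)       as days,         -- distinct dates per id
       collect_list(time)      as logins,       -- every row, duplicates kept
       collect_set(time)[0]    as first_listed, -- index access; order is not guaranteed
       size(collect_set(time)) as day_count     -- number of distinct dates
from t_action_login
group by id;
```

Since collect_set does not guarantee element order, first_listed is simply whichever element happens to come first, not necessarily the earliest date.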