
Spark Window (Analytic) Functions

2016-11-30 19:18
With the release of Spark 1.4, window (analytic) functions are supported.

On our offline platform, more than 90% of the offline analysis jobs are implemented in Hive, and many of them inevitably rely on window functions. Now that Spark SQL supports window functions, the effort of migrating those Hive jobs to Spark SQL later on drops considerably. Usage is as follows:

1. Initialize the data

Create the table


create table window_test2 (url string, rate int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Prepare the test data


url1,12
url2,11
url1,23
url2,25
url1,58
url3,11
url2,25
url3,58
url2,11

Load the data:


load data local inpath '/opt/bin/short_opt/windows2.data' overwrite into table window_test2 ;
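
The HiveQL above can also be submitted from a Spark application. A minimal Scala sketch, assuming a Spark 1.4-era setup where window functions require a HiveContext (in Spark 2.x you would use SparkSession.builder().enableHiveSupport() instead); the application name and the temporary-table variant are only illustrations:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("window-functions-demo")   // app name is arbitrary
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)   // HiveContext, so HiveQL and window functions are available

// Run the same DDL / load statements through Spark SQL
sqlContext.sql("create table window_test2 (url string, rate int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
sqlContext.sql("load data local inpath '/opt/bin/short_opt/windows2.data' overwrite into table window_test2")

// Without a Hive table, the same test data can be registered as a temporary table instead
import sqlContext.implicits._
val df = sc.parallelize(Seq(
  ("url1", 12), ("url2", 11), ("url1", 23), ("url2", 25), ("url1", 58),
  ("url3", 11), ("url2", 25), ("url3", 58), ("url2", 11))).toDF("url", "rate")
df.registerTempTable("window_test2")   // createOrReplaceTempView in Spark 2.x+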

2. Window function tests

Query all the data


select * from window_test2;

+-------+-------+
| url   | rate  |
+-------+-------+
| url1  | 12    |
| url2  | 11    |
| url1  | 23    |
| url2  | 25    |
| url1  | 58    |
| url3  | 11    |
| url2  | 25    |
| url3  | 58    |
| url2  | 11    |
+-------+-------+

Rank within each group (row_number):


select url,rate,row_number() over(partition by url order by rate desc) as r from window_test2;

+-------+-------+----+
| url   | rate  | r  |
+-------+-------+----+
| url1  | 58    | 1  |
| url1  | 23    | 2  |
| url1  | 12    | 3  |
| url2  | 25    | 1  |
| url2  | 25    | 2  |
| url2  | 11    | 3  |
| url2  | 11    | 4  |
| url3  | 58    | 1  |
| url3  | 11    | 2  |
+-------+-------+----+
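
The same ranking can be written with the DataFrame API. A minimal sketch, assuming the sqlContext (HiveContext) from the setup sketch above; note that in Spark 1.4/1.5 the function is named rowNumber(), renamed to row_number() in later releases:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank rows within each url, highest rate first
val byUrlDesc = Window.partitionBy("url").orderBy(col("rate").desc)
sqlContext.table("window_test2")
  .withColumn("r", row_number().over(byUrlDesc))
  .show()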

Group aggregation: sum


select url,rate,sum(rate) over(partition by url ) as r from window_test2;

+-------+-------+-----+
| url   | rate  | r   |
+-------+-------+-----+
| url1  | 12    | 93  |
| url1  | 23    | 93  |
| url1  | 58    | 93  |
| url2  | 11    | 72  |
| url2  | 25    | 72  |
| url2  | 25    | 72  |
| url2  | 11    | 72  |
| url3  | 11    | 69  |
| url3  | 58    | 69  |
+-------+-------+-----+
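
The group total can be expressed the same way in the DataFrame API (again assuming the sqlContext from the setup sketch). Because the window has no ORDER BY, the frame defaults to the whole partition, so every row carries its group's total:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Total rate per url, attached to every row of that url
val byUrl = Window.partitionBy("url")
sqlContext.table("window_test2")
  .withColumn("r", sum("rate").over(byUrl))
  .show()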

Group aggregation: avg


select url,rate,avg(rate) over(partition by url ) as r from window_test2;

+-------+-------+-------+
| url   | rate  | r     |
+-------+-------+-------+
| url1  | 12    | 31.0  |
| url1  | 23    | 31.0  |
| url1  | 58    | 31.0  |
| url2  | 25    | 18.0  |
| url2  | 11    | 18.0  |
| url2  | 11    | 18.0  |
| url2  | 25    | 18.0  |
| url3  | 11    | 34.5  |
| url3  | 58    | 34.5  |
+-------+-------+-------+
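
avg follows exactly the same pattern; a minimal DataFrame sketch under the same assumptions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// Per-url average, repeated on every row of the group
val byUrl = Window.partitionBy("url")
sqlContext.table("window_test2")
  .withColumn("r", avg("rate").over(byUrl))
  .show()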

Group aggregation: count


select url,rate,count(rate) over(partition by url ) as r from window_test2;

+-------+-------+----+
| url   | rate  | r  |
+-------+-------+----+
| url1  | 12    | 3  |
| url1  | 23    | 3  |
| url1  | 58    | 3  |
| url2  | 11    | 4  |
| url2  | 25    | 4  |
| url2  | 25    | 4  |
| url2  | 11    | 4  |
| url3  | 11    | 2  |
| url3  | 58    | 2  |
+-------+-------+----+
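
And likewise for count; a minimal DataFrame sketch under the same assumptions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

// Number of rows per url, repeated on every row of the group
val byUrl = Window.partitionBy("url")
sqlContext.table("window_test2")
  .withColumn("r", count("rate").over(byUrl))
  .show()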

lag within each group


select url,rate,lag(rate) over(partition by url ) as r from window_test2;

+-------+-------+-------+
| url   | rate  | r     |
+-------+-------+-------+
| url1  | 12    | NULL  |
| url1  | 23    | 12    |
| url1  | 58    | 23    |
| url2  | 25    | NULL  |
| url2  | 11    | 25    |
| url2  | 11    | 11    |
| url2  | 25    | 11    |
| url3  | 11    | NULL  |
| url3  | 58    | 11    |
+-------+-------+-------+
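
lag is also available in the DataFrame API. One caveat: Spark generally requires an ordered window for lag/lead, whereas the HiveQL above omits ORDER BY and therefore depends on the physical row order, so the sketch below orders by rate and its output may differ from the table above. Same sqlContext assumption as before:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

// Previous row's rate within each url group, ordered by rate; the first row gets NULL
val byUrlOrdered = Window.partitionBy("url").orderBy("rate")
sqlContext.table("window_test2")
  .withColumn("r", lag("rate", 1).over(byUrlOrdered))
  .show()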

3. From Spark 1.4 onward, these window functions are all supported, which makes it much easier to convert Hive jobs to Spark SQL.