您的位置:首页 > 产品设计 > UI/UE

Hive分析窗体函数之LAG,LEAD,FIRST_VALUE和LAST_VALUE

2017-08-13 20:28 543 查看
环境信息:
Hive版本号为apache-hive-0.14.0-bin
Hadoop版本号为hadoop-2.6.0
Tez版本号为tez-0.7.0


创建表:

create table windows_ss

(

polno string,

eff_date string,

userno string

)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

stored as textfile;

数据:

P066666666666,2016-04-02 09:00:02,user01

P066666666666,2016-04-02 09:00:00,user02

P066666666666,2016-04-02 09:03:04,user11

P066666666666,2016-04-02 09:50:05,user03

P066666666666,2016-04-02 10:00:00,user51

P066666666666,2016-04-02 09:10:00,user09

P066666666666,2016-04-02 09:50:01,user32

P088888888888,2016-04-02 09:00:02,user41

P088888888888,2016-04-02 09:00:00,user55

P088888888888,2016-04-02 09:03:04,user23

P088888888888,2016-04-02 09:50:05,user80

P088888888888,2016-04-02 10:00:00,user08

P088888888888,2016-04-02 09:10:00,user22

P088888888888,2016-04-02 09:50:01,user31

将数据导入Hive表中:

LOAD DATA LOCAL INPATH '/home/hadoop/testhivedata/windows_ss.txt' OVERWRITE INTO TABLE windows_ss;



1. LAG

LAG(col,n,DEFAULT) 用于统计窗体内往上第n行值
第一个參数为列名。第二个參数为往上第n行(可选,默觉得1),第三个參数为默认值(当往上第n行为NULL时候。取默认值,如不指定,则为NULL)

SELECT

polno,

eff_date,

userno,

ROW_NUMBER() OVER(PARTITION BY polno ORDER BY eff_date) AS rn,

LAG(eff_date,1,'1970-01-01 00:00:00') OVER(PARTITION BY polno ORDER BY eff_date) AS last_1_time,

LAG(eff_date,2) OVER(PARTITION BY polno ORDER BY eff_date) AS last_2_time

FROM windows_ss;

结果:

polno eff_date userno rn last_1_time last_2_time

P066666666666 2016-04-02 09:00:00 user02 1 1970-01-01 00:00:00 NULL

P066666666666 2016-04-02 09:00:02 user01 2 2016-04-02 09:00:00 NULL

P066666666666 2016-04-02 09:03:04 user11 3 2016-04-02 09:00:02 2016-04-02 09:00:00

P066666666666 2016-04-02 09:10:00 user09 4 2016-04-02 09:03:04 2016-04-02 09:00:02

P066666666666 2016-04-02 09:50:01 user32 5 2016-04-02 09:10:00 2016-04-02 09:03:04

P066666666666 2016-04-02 09:50:05 user03 6 2016-04-02 09:50:01 2016-04-02 09:10:00

P066666666666 2016-04-02 10:00:00 user51 7 2016-04-02 09:50:05 2016-04-02 09:50:01

P088888888888 2016-04-02 09:00:00 user55 1 1970-01-01 00:00:00 NULL

P088888888888 2016-04-02 09:00:02 user41 2 2016-04-02 09:00:00 NULL

P088888888888 2016-04-02 09:03:04 user23 3 2016-04-02 09:00:02 2016-04-02 09:00:00

P088888888888 2016-04-02 09:10:00 user22 4 2016-04-02 09:03:04 2016-04-02 09:00:02

P088888888888 2016-04-02 09:50:01 user31 5 2016-04-02 09:10:00 2016-04-02 09:03:04

P088888888888 2016-04-02 09:50:05 user80 6 2016-04-02 09:50:01 2016-04-02 09:10:00

P088888888888 2016-04-02 10:00:00 user08 7 2016-04-02 09:50:05 2016-04-02 09:50:01

分析:

last_1_time: 指定了往上第1行的值。default为'1970-01-01 00:00:00'

P066666666666第一行。往上1行为NULL,因此取默认值 1970-01-01 00:00:00

P066666666666第三行,往上1行值为第二行值。2016-04-02 09:00:02

P066666666666第六行,往上1行值为第五行值。2016-04-02 09:50:01

last_2_time: 指定了往上第2行的值,为指定默认值

P088888888888第一行,往上2行为NULL

P088888888888第二行,往上2行为NULL

P088888888888第四行。往上2行为第二行值。2016-04-02 09:00:02

P088888888888第七行,往上2行为第五行值,2016-04-02 09:50:01

2. LEAD

与LAG相反

LEAD(col,n,DEFAULT) 用于统计窗体内往下第n行值

第一个參数为列名。第二个參数为往下第n行(可选,默觉得1),第三个參数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

SELECT

polno,

eff_date,

userno,

ROW_NUMBER() OVER(PARTITION BY polno ORDER BY eff_date) AS rn,

LEAD(eff_date,1,'1970-01-01 00:00:00') OVER(PARTITION BY polno ORDER BY eff_date) AS next_1_time,

LEAD(eff_date,2) OVER(PARTITION BY polno ORDER BY eff_date) AS next_2_time

FROM windows_ss;

结果:

polno eff_date userno rn next_1_time next_2_time

P066666666666 2016-04-02 09:00:00 user02 1 2016-04-02 09:00:02 2016-04-02 09:03:04

P066666666666 2016-04-02 09:00:02 user01 2 2016-04-02 09:03:04 2016-04-02 09:10:00

P066666666666 2016-04-02 09:03:04 user11 3 2016-04-02 09:10:00 2016-04-02 09:50:01

P066666666666 2016-04-02 09:10:00 user09 4 2016-04-02 09:50:01 2016-04-02 09:50:05

P066666666666 2016-04-02 09:50:01 user32 5 2016-04-02 09:50:05 2016-04-02 10:00:00

P066666666666 2016-04-02 09:50:05 user03 6 2016-04-02 10:00:00 NULL

P066666666666 2016-04-02 10:00:00 user51 7 1970-01-01 00:00:00 NULL

P088888888888 2016-04-02 09:00:00 user55 1 2016-04-02 09:00:02 2016-04-02 09:03:04

P088888888888 2016-04-02 09:00:02 user41 2 2016-04-02 09:03:04 2016-04-02 09:10:00

P088888888888 2016-04-02 09:03:04 user23 3 2016-04-02 09:10:00 2016-04-02 09:50:01

P088888888888 2016-04-02 09:10:00 user22 4 2016-04-02 09:50:01 2016-04-02 09:50:05

P088888888888 2016-04-02 09:50:01 user31 5 2016-04-02 09:50:05 2016-04-02 10:00:00

P088888888888 2016-04-02 09:50:05 user80 6 2016-04-02 10:00:00 NULL

P088888888888 2016-04-02 10:00:00 user08 7 1970-01-01 00:00:00 NULL

分析:

--逻辑与LAG一样,仅仅只是LAG是往上。LEAD是往下

3. FIRST_VALUE

取分组内排序后,截止到当前行。第一个值

SELECT

polno,

eff_date,

userno,

ROW_NUMBER() OVER(PARTITION BY polno ORDER BY eff_date) AS rn,

FIRST_VALUE(userno) OVER(PARTITION BY polno ORDER BY eff_date) AS first1

FROM windows_ss;

polno eff_date userno rn first1

P066666666666 2016-04-02 09:00:00 user02 1 user02

P066666666666 2016-04-02 09:00:02 user01 2 user02

P066666666666 2016-04-02 09:03:04 user11 3 user02

P066666666666 2016-04-02 09:10:00 user09 4 user02

P066666666666 2016-04-02 09:50:01 user32 5 user02

P066666666666 2016-04-02 09:50:05 user03 6 user02

P066666666666 2016-04-02 10:00:00 user51 7 user02

P088888888888 2016-04-02 09:00:00 user55 1 user55

P088888888888 2016-04-02 09:00:02 user41 2 user55

P088888888888 2016-04-02 09:03:04 user23 3 user55

P088888888888 2016-04-02 09:10:00 user22 4 user55

P088888888888 2016-04-02 09:50:01 user31 5 user55

P088888888888 2016-04-02 09:50:05 user80 6 user55

P088888888888 2016-04-02 10:00:00 user08 7 user55

4. LAST_VALUE

取分组内排序后,截止到当前行,最后一个值

SELECT

polno,

eff_date,

userno,

ROW_NUMBER() OVER(PARTITION BY polno ORDER BY eff_date) AS rn,

LAST_VALUE(userno) OVER(PARTITION BY polno ORDER BY eff_date) AS last1

FROM windows_ss;

结果:

polno eff_date userno rn last1

P066666666666 2016-04-02 09:00:00 user02 1 user02

P066666666666 2016-04-02 09:00:02 user01 2 user01

P066666666666 2016-04-02 09:03:04 user11 3 user11

P066666666666 2016-04-02 09:10:00 user09 4 user09

P066666666666 2016-04-02 09:50:01 user32 5 user32

P066666666666 2016-04-02 09:50:05 user03 6 user03

P066666666666 2016-04-02 10:00:00 user51 7 user51

P088888888888 2016-04-02 09:00:00 user55 1 user55

P088888888888 2016-04-02 09:00:02 user41 2 user41

P088888888888 2016-04-02 09:03:04 user23 3 user23

P088888888888 2016-04-02 09:10:00 user22 4 user22

P088888888888 2016-04-02 09:50:01 user31 5 user31

P088888888888 2016-04-02 09:50:05 user80 6 user80

P088888888888 2016-04-02 10:00:00 user08 7 user08

假设不指定ORDER BY,则默认依照记录在文件里的偏移量进行排序,会出现错误的结果

FIRST_VALUE没有排序:

SELECT

polno,

eff_date,

userno,

FIRST_VALUE(userno) OVER(PARTITION BY polno) AS first2

FROM windows_ss;

polno eff_date userno first2

P066666666666 2016-04-02 09:00:02 user01 user01

P066666666666 2016-04-02 09:00:00 user02 user01

P066666666666 2016-04-02 09:03:04 user11 user01

P066666666666 2016-04-02 09:50:05 user03 user01

P066666666666 2016-04-02 10:00:00 user51 user01

P066666666666 2016-04-02 09:10:00 user09 user01

P066666666666 2016-04-02 09:50:01 user32 user01

P088888888888 2016-04-02 09:00:02 user41 user41

P088888888888 2016-04-02 09:00:00 user55 user41

P088888888888 2016-04-02 09:03:04 user23 user41

P088888888888 2016-04-02 09:50:05 user80 user41

P088888888888 2016-04-02 10:00:00 user08 user41

P088888888888 2016-04-02 09:10:00 user22 user41

P088888888888 2016-04-02 09:50:01 user31 user41

LAST_VALUE没有排序:

SELECT

polno,

eff_date,

userno,

LAST_VALUE(userno) OVER(PARTITION BY polno) AS last2

FROM windows_ss;

结果:

polno eff_date userno last2

P066666666666 2016-04-02 09:00:02 user01 user32

P066666666666 2016-04-02 09:00:00 user02 user32

P066666666666 2016-04-02 09:03:04 user11 user32

P066666666666 2016-04-02 09:50:05 user03 user32

P066666666666 2016-04-02 10:00:00 user51 user32

P066666666666 2016-04-02 09:10:00 user09 user32

P066666666666 2016-04-02 09:50:01 user32 user32

P088888888888 2016-04-02 09:00:02 user41 user31

P088888888888 2016-04-02 09:00:00 user55 user31

P088888888888 2016-04-02 09:03:04 user23 user31

P088888888888 2016-04-02 09:50:05 user80 user31

P088888888888 2016-04-02 10:00:00 user08 user31

P088888888888 2016-04-02 09:10:00 user22 user31

P088888888888 2016-04-02 09:50:01 user31 user31

假设想要取分组内排序后最后一个值,则须要变通一下:

SELECT

polno,

eff_date,

userno,

ROW_NUMBER() OVER(PARTITION BY polno ORDER BY eff_date) AS rn,

LAST_VALUE(userno) OVER(PARTITION BY polno ORDER BY eff_date) AS last1,

FIRST_VALUE(userno) OVER(PARTITION BY polno ORDER BY eff_date DESC) AS last2

FROM windows_ss ORDER BY polno,eff_date;

polno eff_date userno rn last1 last2

P066666666666 2016-04-02 09:00:00 user02 1 user02 user51

P066666666666 2016-04-02 09:00:02 user01 2 user01 user51

P066666666666 2016-04-02 09:03:04 user11 3 user11 user51

P066666666666 2016-04-02 09:10:00 user09 4 user09 user51

P066666666666 2016-04-02 09:50:01 user32 5 user32 user51

P066666666666 2016-04-02 09:50:05 user03 6 user03 user51

P066666666666 2016-04-02 10:00:00 user51 7 user51 user51

P088888888888 2016-04-02 09:00:00 user55 1 user55 user08

P088888888888 2016-04-02 09:00:02 user41 2 user41 user08

P088888888888 2016-04-02 09:03:04 user23 3 user23 user08

P088888888888 2016-04-02 09:10:00 user22 4 user22 user08

P088888888888 2016-04-02 09:50:01 user31 5 user31 user08

P088888888888 2016-04-02 09:50:05 user80 6 user80 user08

P088888888888 2016-04-02 10:00:00 user08 7 user08 user08

注意:

在使用分析函数的过程中,要特别注意ORDERBY子句,用的不恰当。统计出的结果就不是你所期望的
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: