Verifying the Merge of Incremental (Delta) Data into a Base Table
2015-12-31 00:04
1. Create the base and delta test data
[root@node1 delta_merge]# pwd
/root/delta_merge
[root@node1 delta_merge]# cat base.txt
1001,gongshaocheng
1002,LIDACHAO
[root@node1 delta_merge]# cat delta.txt
1002,lidachao
1003,chenjianzhong
[root@node1 delta_merge]# hdfs dfs -mkdir /user/merge_delta
[root@node1 delta_merge]# hdfs dfs -mkdir /user/merge_delta/base
[root@node1 delta_merge]# hdfs dfs -mkdir /user/merge_delta/delta
[root@node1 delta_merge]# hdfs dfs -put base.txt /user/merge_delta/base
[root@node1 delta_merge]# hdfs dfs -put delta.txt /user/merge_delta/delta
2. Create the test tables
hive> create external table base(id string,name string)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > location "/user/merge_delta/base/";
OK
Time taken: 0.304 seconds
hive> select * from base;
OK
1001 gongshaocheng
1002 LIDACHAO
Time taken: 0.875 seconds, Fetched: 2 row(s)
hive> create external table delta(id string,name string)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > location "/user/merge_delta/delta/";
OK
Time taken: 0.134 seconds
hive> select * from delta;
OK
1002 lidachao
1003 chenjianzhong
Time taken: 0.321 seconds, Fetched: 2 row(s)
3. Tests:
a. FULL OUTER JOIN syntax:
hive> select base.*,delta.* from base full outer join delta on base.id = delta.id;
The result is:
1001 gongshaocheng NULL NULL
1002 LIDACHAO 1002 lidachao
NULL NULL 1003 chenjianzhong
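To make the join semantics concrete outside Hive, here is a minimal Python sketch that emulates the FULL OUTER JOIN on the two sample files. It is only an illustration, not how Hive executes the query: the helper name `full_outer_join` is hypothetical, `None` stands in for NULL, and the sketch assumes ids are unique and non-NULL on each side (true for the sample data).

```python
# Sample rows copied from base.txt and delta.txt above.
base = [("1001", "gongshaocheng"), ("1002", "LIDACHAO")]
delta = [("1002", "lidachao"), ("1003", "chenjianzhong")]


def full_outer_join(left, right):
    """Emulate `left FULL OUTER JOIN right ON left.id = right.id`.

    Assumes ids are unique and non-NULL on each side.
    Returns (left_id, left_name, right_id, right_name) tuples,
    with None standing in for NULL on the side that has no match.
    """
    left_by_id = dict(left)
    right_by_id = dict(right)
    rows = []
    # Every id that appears on either side yields exactly one output row.
    for key in sorted(left_by_id.keys() | right_by_id.keys()):
        left_id = key if key in left_by_id else None
        right_id = key if key in right_by_id else None
        rows.append((left_id, left_by_id.get(key), right_id, right_by_id.get(key)))
    return rows


joined = full_outer_join(base, delta)
for row in joined:
    print(row)
```

Each of the three rows corresponds to one case of the merge: present only in base, present in both, and present only in delta.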
The answer we ultimately want is:
1001 gongshaocheng -- a record that stays unchanged
1002 lidachao -- the latest version of a modified record
1003 chenjianzhong -- a newly added record
b. The coalesce function:
select coalesce(base.id, delta.id)
from base full outer join delta on base.id = delta.id
where (delta.id is NULL AND base.id is NOT NULL)
   OR (delta.id is NOT NULL AND base.id is NOT NULL)
   OR (delta.id is NOT NULL AND base.id is NULL);
Result:
1001
1002
1003
This verifies that for the primary-key column we can use the coalesce function, so that the key column in the result set always has a value.
c. The if function:
select if(delta.id is NULL, base.name, delta.name)
from base full outer join delta on base.id = delta.id
where (delta.id is NULL AND base.id is NOT NULL)
   OR (delta.id is NOT NULL AND base.id is NOT NULL)
   OR (delta.id is NOT NULL AND base.id is NULL);
Result:
gongshaocheng
lidachao
chenjianzhong
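This verifies that for an ordinary column, if the row is unmodified (delta.id is NULL), the value from the base table is used directly; otherwise the value from the delta table is used.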
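The per-row behaviour of the two functions can be sketched in Python as follows. The helpers `coalesce` and `hive_if` are hypothetical stand-ins that mirror the Hive functions of the same purpose, `None` stands in for NULL, and the `joined` rows are copied from the FULL OUTER JOIN result shown in test a.

```python
def coalesce(*values):
    """Return the first argument that is not None (NULL), or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


def hive_if(condition, when_true, when_false):
    """Mirror Hive's if(cond, a, b): a when cond is true, otherwise b."""
    return when_true if condition else when_false


# (base.id, base.name, delta.id, delta.name) rows from the join result above.
joined = [
    ("1001", "gongshaocheng", None, None),
    ("1002", "LIDACHAO", "1002", "lidachao"),
    (None, None, "1003", "chenjianzhong"),
]

# Key column: whichever side carries the id.
ids = [coalesce(base_id, delta_id) for base_id, _, delta_id, _ in joined]

# Ordinary column: base value only when the delta side has no row.
names = [
    hive_if(delta_id is None, base_name, delta_name)
    for _, base_name, delta_id, delta_name in joined
]

print(ids)
print(names)
```

The condition tests delta.id rather than delta.name on purpose: a delta row may legitimately carry a NULL name, while a NULL delta.id means that no delta row matched at all.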
Putting it all together, we get the HQL statement we want:
select coalesce(base.id, delta.id),
       if(delta.id is NULL, base.name, delta.name)
from base full outer join delta on base.id = delta.id
where (delta.id is NULL AND base.id is NOT NULL)
   OR (delta.id is NOT NULL AND base.id is NOT NULL)
   OR (delta.id is NOT NULL AND base.id is NULL);
The result is:
hive> select
    > coalesce(base.id, delta.id),
    > if(delta.id is NULL, base.name,delta.name)
    > from base full outer join delta on base.id = delta.id
    > where (delta.id is NULL AND base.id is NOT NUll) OR (delta.id is NOT NULL AND base.id is NOT NUll) OR (delta.id is NOT NULL AND base.id is NUll);
Query ID = root_20151230235050_befa6322-f78f-4166-8bbd-4fde04a1a9b1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1451024710809_0005, Tracking URL = http://node1.clouderachina.com:8088/proxy/application_1451024710809_0005/
Kill Command = /opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/bin/hadoop job -kill job_1451024710809_0005
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2015-12-30 23:51:04,904 Stage-1 map = 0%, reduce = 0%
2015-12-30 23:51:13,245 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.69 sec
2015-12-30 23:51:22,685 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.16 sec
MapReduce Total cumulative CPU time: 5 seconds 160 msec
Ended Job = job_1451024710809_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 5.16 sec HDFS Read: 12293 HDFS Write: 52 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 160 msec
OK
1001 gongshaocheng
1002 lidachao
1003 chenjianzhong
Time taken: 31.376 seconds, Fetched: 3 row(s)
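Note: each of the HQL statements above needs only a single MR job (the log shows "Total jobs = 1"). That is the essence of this solution!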
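Finally, the HQL can be optimized further. The WHERE clause was added earlier only to keep the logic clear, by treating the three cases of the FULL OUTER JOIN separately. In fact, the three conditions joined by OR together cover every row the join produces, so the WHERE clause is redundant. This holds as long as id is never NULL in either table: a source row with a NULL id matches nothing, comes out of the join with both id columns NULL, and would be dropped by the WHERE clause but kept without it. The final HQL is: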
select coalesce(base.id, delta.id),
       if(delta.id is NULL, base.name, delta.name)
from base full outer join delta on base.id = delta.id;
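The claim that the WHERE clause filters nothing can be checked with a small Python sketch. This is an illustration under stated assumptions, not Hive code: `where_clause` is a hypothetical helper that evaluates the three OR'ed conditions, `None` stands in for NULL, and the `joined` rows are copied from the FULL OUTER JOIN result in test a.

```python
def where_clause(base_id, delta_id):
    """Evaluate the three OR'ed conditions of the original WHERE clause."""
    unchanged = delta_id is None and base_id is not None
    updated = delta_id is not None and base_id is not None
    added = delta_id is not None and base_id is None
    return unchanged or updated or added


# (base.id, base.name, delta.id, delta.name) rows from the join result above.
joined = [
    ("1001", "gongshaocheng", None, None),
    ("1002", "LIDACHAO", "1002", "lidachao"),
    (None, None, "1003", "chenjianzhong"),
]

# With the WHERE clause: keep only rows that satisfy it.
kept = [row for row in joined if where_clause(row[0], row[2])]

# Final merge without the WHERE clause: coalesce on the key, if on the name.
merged = [
    (
        base_id if base_id is not None else delta_id,
        base_name if delta_id is None else delta_name,
    )
    for base_id, base_name, delta_id, delta_name in joined
]

print(kept == joined)  # the filter removes nothing for this data
print(merged)

# The only combination the WHERE clause rejects is both ids being NULL,
# which requires a NULL id in one of the source tables.
print(where_clause(None, None))
```

For the sample data the filtered and unfiltered row sets are identical, so dropping the WHERE clause leaves the merged result unchanged.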