05-Hive动态分区
2016-06-02 20:57
435 查看
大家好!砸门又见面了。我先作一个自我介绍吧。我是一个打算学习大数据一万小时的谢老师,目前学习了800多小时了,现在正在努力。
今天来玩的实验是:Hive分区表的动态分区
分区是在处理大型事实表时常用的方法。分区的好处在于缩小查询扫描范围,从而提高速度。分区分为两种:静态分区static partition和动态分区dynamic partition。静态分区和动态分区的区别在于导入数据时,是手动输入分区名称,还是通过数据来判断数据分区。对于大数据批量导入来说,显然采用动态分区更为简单方便。
动态分区不需要为不同的分区添加不同的插入语句。那什么时候使用动态分区呢?其实,当你不知道要怎么分区,这个分区要通过查询语句里得出来,通过查出来的数据建立分区。换句话说就是分区数据不确定的时候,就适合使用动态分区。
有几个参数,要认识一下:
1 接下来我们来实战一下怎么动态分区:
注意到什么?partitioned by后面没有指定具体的值了。
没有数据对不对?我们来插入一些数据,并且让他动态分区。
查看一下数据加载了没?
为什么会有三个partition呢?其实addr as value就说清楚了,addr的值就作为分区值,分别是46,78,89(两个89分到一起了).不信你看:
我们再看web端的情况:
搞啥子哟,图片居然上传不上来!!
(此处本该有图)
不生气不生气,待会补上咯!
2 接下来我们要创建两个分区:
好,停一下。是不是第二次见到
接着干嘛?插入数据:
各位看官,这里两个分区哦,value 分区下有子分区dt.两个分区都是根据数据的值来分的。
addr as value, name as dt。你知道d_part2表下面的数据是啥吗?select ‘test’ as name, 都是test。
来验证一下:
学了这两个实验基本就入门了。砸门还是应该多练习一下,对吧?熟能生巧,巧能生能。
插入数据:
好了,一遍练不会练两边。
这里给大家布置一个作业:什么情况下设置动态分区会产生灾难?欢迎留言呀!
有点累了,休息一下。如果你看到此文,想进一步学习或者和我沟通,加我微信公众号:五十年后
今天来玩的实验是:Hive分区表的动态分区
分区是在处理大型事实表时常用的方法。分区的好处在于缩小查询扫描范围,从而提高速度。分区分为两种:静态分区static partition和动态分区dynamic partition。静态分区和动态分区的区别在于导入数据时,是手动输入分区名称,还是通过数据来判断数据分区。对于大数据批量导入来说,显然采用动态分区更为简单方便。
动态分区不需要为不同的分区添加不同的插入语句。那什么时候使用动态分区呢?其实,当你不知道要怎么分区,这个分区要通过查询语句里得出来,通过查出来的数据建立分区。换句话说就是分区数据不确定的时候,就适合使用动态分区。
有几个参数,要认识一下:
set hive.exec.dynamic.partition=true// 使用动态分区 set hive.exec.dynamic.partition.mode=nonstrick//无限制模式 如果模式是stric,则必须有一个静态分区,且放在最前面 set hive.exec.max.dynamic.partitions.pernode=10000;//每个节点生成动态分区的最大个数 set hive.exec.max.created.files=150000;//一个任务最多可以创建的文件数目 set dfs.datanode.max.xcievers=8192;//限定一次最多打开的文件数
1 接下来我们来实战一下怎么动态分区:
hive> create table d_part( > name string > ) > partitioned by(value string) > row format delimited fields terminated by '\t' > lines terminated by '\n' > stored as textfile; OK Time taken: 0.243 seconds hive>
注意到什么?partitioned by后面没有指定具体的值了。
hive> show partitions d_part; OK Time taken: 0.304 seconds hive> select * from d_part; OK Time taken: 0.096 seconds hive>
没有数据对不对?我们来插入一些数据,并且让他动态分区。
hive> set hive.exec.dynamic.partition=true; hive> set hive.exec.dynamic.partition.mode=nonstrick; hive> insert overwrite table d_part partition(value) > select name, > addr as value > from testtext; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) Hadoop job information for null: number of mappers: 1; number of reducers: 0 2016-06-02 05:16:16,951 null map = 0%, reduce = 0% 2016-06-02 05:16:32,117 null map = 100%, reduce = 0%, Cumulative CPU 1.07 sec 2016-06-02 05:16:33,226 null map = 100%, reduce = 0%, Cumulative CPU 1.07 sec 2016-06-02 05:16:34,277 null map = 100%, reduce = 0%, Cumulative CPU 1.07 sec MapReduce Total cumulative CPU time: 1 seconds 70 msec Ended Job = job_1464828076391_0016 Execution completed successfully Mapred Local Task Succeeded . Convert the Join into MapJoin Loading data to table default.d_part partition (value=null) Loading partition {value=78} Loading partition {value=46} Loading partition {value=89} Partition default.d_part{value=46} stats: [num_files: 1, num_rows: 0, total_size: 4, raw_data_size: 0] Partition default.d_part{value=78} stats: [num_files: 1, num_rows: 0, total_size: 6, raw_data_size: 0] Partition default.d_part{value=89} stats: [num_files: 1, num_rows: 0, total_size: 7, raw_data_size: 0] Table default.d_part stats: [num_partitions: 3, num_files: 3, num_rows: 0, total_size: 17, raw_data_size: 0] OK Time taken: 41.974 seconds hive>
查看一下数据加载了没?
hive> select * from d_part; OK wer 46 weree 78 wer 89 rr 89 Time taken: 0.732 seconds hive> show partitions d_part; OK value=46 value=78 value=89 Time taken: 0.142 seconds hive>
为什么会有三个partition呢?其实addr as value就说清楚了,addr的值就作为分区值,分别是46,78,89(两个89分到一起了).不信你看:
hive> desc testtext; OK name string name value addr string addr value Time taken: 0.155 seconds hive> select * from testtext; OK wer 46 wer 89 weree 78 rr 89 Time taken: 0.162 seconds
我们再看web端的情况:
搞啥子哟,图片居然上传不上来!!
(此处本该有图)
不生气不生气,待会补上咯!
2 接下来我们要创建两个分区:
hive> create table d_part2( > name string > ) > partitioned by(value string,dt string) > row format delimited fields terminated by '\t' > lines terminated by '\n' > stored as textfile; OK Time taken: 0.092 seconds hive> set hive.exec.dynamic.partition=true; hive>
好,停一下。是不是第二次见到
set hive.exec.dynamic.partition=true;了?dynamic是动态的意思,明白了吧?
接着干嘛?插入数据:
hive> insert overwrite table d_part2 partition(value,dt) > select 'test' as name, > addr as value, > name as dt > from testtext; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) Hadoop job information for null: number of mappers: 1; number of reducers: 0 2016-06-02 05:45:36,828 null map = 0%, reduce = 0% 2016-06-02 05:45:52,339 null map = 100%, reduce = 0%, Cumulative CPU 1.3 sec 2016-06-02 05:45:53,443 null map = 100%, reduce = 0%, Cumulative CPU 1.3 sec 2016-06-02 05:45:54,493 null map = 100%, reduce = 0%, Cumulative CPU 1.3 sec MapReduce Total cumulative CPU time: 1 seconds 300 msec Ended Job = job_1464828076391_0017 Execution completed successfully Mapred Local Task Succeeded . Convert the Join into MapJoin Loading data to table default.d_part2 partition (value=null, dt=null) Loading partition {value=89, dt=wer} Loading partition {value=89, dt=rr} Loading partition {value=46, dt=wer} Loading partition {value=78, dt=weree} Partition default.d_part2{value=46, dt=wer} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0] Partition default.d_part2{value=78, dt=weree} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0] Partition default.d_part2{value=89, dt=rr} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0] Partition default.d_part2{value=89, dt=wer} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0] Table default.d_part2 stats: [num_partitions: 4, num_files: 4, num_rows: 0, total_size: 20, raw_data_size: 0] OK Time taken: 40.971 seconds hive>
各位看官,这里两个分区哦,value 分区下有子分区dt.两个分区都是根据数据的值来分的。
addr as value, name as dt。你知道d_part2表下面的数据是啥吗?select ‘test’ as name, 都是test。
来验证一下:
hive> show partitions d_part2; OK value=46/dt=wer value=78/dt=weree value=89/dt=rr value=89/dt=wer Time taken: 0.293 seconds hive> select * from d_part2; OK test 46 wer test 78 weree test 89 rr test 89 wer Time taken: 0.435 seconds hive>
学了这两个实验基本就入门了。砸门还是应该多练习一下,对吧?熟能生巧,巧能生能。
hive> create table d_part3( > name string > ) > partitioned by(value string,dt string) > row format delimited fields terminated by '\t' > lines terminated by '\n' > stored as textfile; OK Time taken: 0.121 seconds hive> set hive.exec.dynamic.partition=true;
插入数据:
hive> insert overwrite table d_part3 partition(value,dt) > select 'test' as name, > addr as value, > name as dt > from testtext; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) Hadoop job information for null: number of mappers: 1; number of reducers: 0 2016-06-02 05:54:27,509 null map = 0%, reduce = 0% 2016-06-02 05:54:37,411 null map = 100%, reduce = 0%, Cumulative CPU 0.99 sec 2016-06-02 05:54:38,501 null map = 100%, reduce = 0%, Cumulative CPU 0.99 sec 2016-06-02 05:54:39,555 null map = 100%, reduce = 0%, Cumulative CPU 0.99 sec MapReduce Total cumulative CPU time: 990 msec Ended Job = job_1464828076391_0018 Execution completed successfully Mapred Local Task Succeeded . Convert the Join into MapJoin Loading data to table default.d_part3 partition (value=null, dt=null) Loading partition {value=78, dt=weree} Loading partition {value=46, dt=wer} Loading partition {value=89, dt=wer} Loading partition {value=89, dt=rr} Partition default.d_part3{value=46, dt=wer} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0] Partition default.d_part3{value=78, dt=weree} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0] Partition default.d_part3{value=89, dt=rr} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0] Partition default.d_part3{value=89, dt=wer} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0] Table default.d_part3 stats: [num_partitions: 4, num_files: 4, num_rows: 0, total_size: 20, raw_data_size: 0] OK Time taken: 36.737 seconds hive> show partitions d_part3; OK value=46/dt=wer value=78/dt=weree value=89/dt=rr value=89/dt=wer Time taken: 0.264 seconds hive>
好了,一遍练不会练两边。
这里给大家布置一个作业:什么情况下设置动态分区会产生灾难?欢迎留言呀!
有点累了,休息一下。如果你看到此文,想进一步学习或者和我沟通,加我微信公众号:五十年后
相关文章推荐
- socket系列函数
- LRU高速缓存算法统计高速缓存请求失败的缺页次数
- CodeForce 597ADivisibility
- AlphaGo是怎样教棋手重新做人的
- AlphaGo是怎样教棋手重新做人的
- Mysql性能优化一
- Hybrid----优秀开源代码解读之JS与iOS Native Code互调的优雅实现方案-备
- Linux编程中的坑——C++中exit和return的区别
- AJAX运行环境
- Android-Async-Http(1)—上传文件
- 判断2个数是否能被除尽java(可以小数)
- 进程控制块PCB结构 task_struct 描述
- 第二阶段团队项目冲刺第四天
- leetcode-java-169. Majority Element
- hdu_1848_Fibonacci again and again(博弈sg函数)
- AJAX基础
- hdu_1848_Fibonacci again and again(博弈sg函数)
- Android Studio下NDK开发之"No such file or direct"和“no rule to make target”
- 菜鸟成长进阶之——fiddler使用总结
- HI3518E和CVBS和4G的网络摄像机