
Hive Bucketing Concepts: From Getting Started to the Grave (Part 10)

2020-07-29 18:52

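Hive Bucketing Concepts

10.1 Overview of bucketing

10.1.1 Why bucket

- Partitioning can leave some partitions with far too much data and others with very little. Bucketing is another technique for splitting a data set into several parts (data files).
- Partitioning and bucketing both manage data at a finer granularity. When the data in a single partition or table keeps growing and partitioning can no longer divide it finely enough, bucketing divides and manages it at a finer level.
- In the CREATE TABLE statement, bucketing is declared with the optional clause [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]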

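10.1.2 How bucketing works

The principle is the same as that of HashPartitioner in MapReduce:

- MapReduce: the hash of the key is taken modulo the number of reducers.
- Hive: the hash of the bucketing column is taken modulo the number of buckets. Bucketed storage is defined on a particular column; for every record, the hash of its bucketing column value modulo the bucket count determines which bucket the record goes into.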
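
The rule above can be sketched in a few lines of Python. This is an illustrative simulation, not Hive's source code. It assumes that an integer column hashes to its own value (as Java's Integer.hashCode does) and that the hash is masked to a non-negative number before the modulo, as HashPartitioner does. The function name bucket_for and the sample sno values are made up for the example.

```python
def bucket_for(key_hash: int, num_buckets: int) -> int:
    """Return the 0-based bucket index for a hash value."""
    # Mask to 31 bits so a negative hash still maps to a valid bucket,
    # mirroring (hash & Integer.MAX_VALUE) % n in HashPartitioner.
    return (key_hash & 0x7FFFFFFF) % num_buckets


# Hypothetical sno values; an int column is assumed to hash to itself.
for sno in (95001, 95002, 95003, 95004, 95005):
    print(sno, "-> bucket", bucket_for(sno, 4))
```

Rows whose bucketing values share the same remainder always end up in the same bucket, which is what sampling and bucketed joins later rely on.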

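10.1.3 Why bucketing matters

1. It preserves the bucket structure for later queries (the data has already been hash-distributed on the bucketing column).
2. Bucketed tables are well suited to data sampling.
   Sampling is more efficient. When working with large data sets, being able to test a query on a portion of the data is very convenient.
3. It can make joins executed as MR jobs more efficient.
   When two tables are bucketed on the same column, a join on that column can be done efficiently on the map side: Hive only needs to join the buckets that hold the same column values with each other instead of comparing every row of both tables.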
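
The join benefit in point 3 can be illustrated with a small Python simulation. It is a sketch under stated assumptions, not how Hive implements its map-side join: both tables are assumed to use the same bucket count and the same hash (an integer key hashing to itself), and the helper names and sample rows are hypothetical.

```python
from collections import defaultdict


def bucketize(rows, key_index, num_buckets):
    """Group rows into buckets by (key & 0x7FFFFFFF) % num_buckets."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row[key_index] & 0x7FFFFFFF) % num_buckets].append(row)
    return buckets


def bucketed_join(left, right, num_buckets):
    """Join two row lists on column 0, one pair of matching buckets at a time."""
    left_b = bucketize(left, 0, num_buckets)
    right_b = bucketize(right, 0, num_buckets)
    joined = []
    for i in range(num_buckets):
        # Only bucket i of the right side can hold keys found in bucket i of the left.
        lookup = defaultdict(list)
        for r in right_b.get(i, []):
            lookup[r[0]].append(r)
        for l in left_b.get(i, []):
            for r in lookup.get(l[0], []):
                joined.append(l + r[1:])
    return joined


# Hypothetical sample data: (sno, name) and (sno, score).
students = [(1, "amy"), (2, "bob"), (5, "cat"), (6, "dan")]
scores = [(1, 90), (5, 75), (6, 60), (7, 88)]
print(sorted(bucketed_join(students, scores, 4)))
```

Each bucket pair is small enough to be joined independently, so the work splits cleanly across tasks and no row is ever compared with a row from a different bucket.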

10.2 Creating a bucketed table

10.2.1 Example

Step 1: Create the table

drop table student;
create table student(
sno int,
name string,
sex string,
age int,
academy string
)
clustered by (sno) sorted by (age desc) into 4 buckets
row format delimited
fields terminated by ','
;

Note: the bucketing column and the sort column do not have to be the same.

Step 2: Prepare the data (create a staging table). The staging table is a plain table without buckets, because load only places the file in the table directory as is.

create table temp_student(
sno int,
name string,
sex string,
age int,
academy string
)
row format delimited
fields terminated by ','
;

load data local inpath './data/students.txt' into table temp_student;

Step 3: Query the staging table and insert the data

insert into table student
select * from temp_student
distribute by(sno)
sort by (age desc)
;
Or, to overwrite the existing contents:
insert overwrite table student
select * from temp_student
distribute by(sno)
sort by (age desc)
;

Note: never populate a bucketed table with load or by uploading files directly; the rows will not be bucketed.
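
To see why the insert ... select path produces real buckets while a plain file copy does not, the following Python sketch simulates distribute by sno followed by sort by age desc. It is only a model of the behavior described above: an integer sno is assumed to hash to itself, one output file per bucket is assumed, and the function name and sample rows are invented for illustration.

```python
def distribute_and_sort(rows, num_buckets):
    """Simulate `distribute by sno sort by age desc`, one output list per bucket."""
    files = [[] for _ in range(num_buckets)]
    for row in rows:
        sno = row[0]
        files[(sno & 0x7FFFFFFF) % num_buckets].append(row)
    for f in files:
        # sort by orders rows inside each reducer's output only, not globally.
        f.sort(key=lambda r: r[3], reverse=True)
    return files


# Hypothetical rows: (sno, name, sex, age, academy).
rows = [
    (95001, "li", "M", 20, "CS"),
    (95002, "liu", "F", 19, "IS"),
    (95005, "wang", "M", 22, "MA"),
    (95006, "sun", "F", 23, "CS"),
    (95009, "zhao", "F", 18, "IS"),
]
for i, f in enumerate(distribute_and_sort(rows, 4)):
    print("bucket file", i, [(r[0], r[3]) for r in f])
```

A load, by contrast, would leave all five rows in a single file in their original order, so nothing about the file layout would reflect the declared buckets.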

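10.2.2 Notes

Hive 2.1.1 always enforces bucketing, so changing the number of reducers by hand does not change the number of output files (the file count is determined by the bucket count).

On older versions, such as 1.2.1, the following properties can be set:

1. Set the number of reducers equal to the number of buckets:
set mapreduce.job.reduces=4;
2. If the data volume is small, MR local mode can be used:
set hive.exec.mode.local.auto=true;
3. Enforce bucketing (the default is false):
set hive.enforce.bucketing=true;
4. Enforce sorting:
set hive.enforce.sorting=true;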

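10.3 Querying a bucketed table

10.3.1 Syntax

Syntax: tablesample(bucket x out of y on sno)
x: the bucket to start from, counted from 1; x cannot be greater than y.

y in 2.1.1: the total number of buckets the rows are divided into for sampling; any value can be chosen.
y in older versions such as 1.2.1: must be a factor or a multiple of the table's bucket count.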

10.3.2 Querying everything

select * from student;
select * from student tablesample(bucket 1 out of 1);

10.3.3 Querying specific buckets

Query the first bucket:
select * from student tablesample(bucket 1 out of 4 on sno);
Query the first and third buckets:
select * from student tablesample(bucket 1 out of 2 on sno);
Query the second and fourth buckets:
select * from student tablesample(bucket 2 out of 2 on sno);
Query the rows that fall into the first bucket when hashing modulo 8:
select * from student tablesample(bucket 1 out of 8 on sno);
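
The mapping between x, y and the table's physical buckets in the queries above can be checked with a short Python sketch. It assumes that sampling uses the same column and hash as the table's bucketing, that a row is selected when its hash modulo y equals x - 1, and that hashes are non-negative; sampled_buckets is a hypothetical helper, not a Hive function.

```python
def sampled_buckets(x: int, y: int, table_buckets: int):
    """Return the 1-based table buckets touched by `bucket x out of y`."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    touched = set()
    # A row is sampled when hash % y == x - 1; walk one full period
    # of both moduli to see which table buckets such rows live in.
    for h in range(y * table_buckets):
        if h % y == x - 1:
            touched.add(h % table_buckets + 1)
    return sorted(touched)


for x, y in [(1, 4), (1, 2), (2, 2), (1, 8), (1, 1)]:
    print(f"bucket {x} out of {y} ->", sampled_buckets(x, y, 4))
```

When y is larger than the table's bucket count, as in the out of 8 case, only part of one physical bucket matches, so the query returns a subset of that bucket's rows.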

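10.3.4 Other queries

Query three rows:
select * from student limit 3;
select * from student tablesample(3 rows);
Query a percentage of the data (rows are returned up to the row at which that percentage of the data size is reached):
select * from student tablesample(13 percent);

Query a fixed size of data (rows are returned up to the row at which that size is reached; the unit suffix is one of b, k, m, g):
select * from student tablesample(68b);
Pick three rows at random:
select * from student order by rand() limit 3;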

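10.4 Summary

10.4.1 Definition

clustered by (id)          -- buckets the table on a column of the table
sorted by (id asc|desc)    -- declares the sort order that the data in each bucket is expected to follow

10.4.2 Importing data

cluster by (id)
-- chooses the column that getPartition hashes on; the same column is also the sort column, always in ascending order
-- equivalent to distribute by (id) sort by (id)

distribute by (id)        -- chooses the column that getPartition hashes on
sort by (name asc|desc)   -- chooses the sort column

-- Difference: with distribute by, the getPartition column and the sort column can be chosen separately

When importing data:
insert overwrite table buc3
select id,name,age from temp_buc1
distribute by (id) sort by (id asc)
;
has the same effect as the following statement:
insert overwrite table buc4
select id,name,age from temp_buc1
cluster by (id)
;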

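10.4.3 Notes

Partitioning uses columns outside the table's own column list (partition columns); bucketing uses columns that are part of the table.
Bucketing manages data at a finer granularity and is used mostly for sampling and joins.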
