
Hive Partitioned Tables: From Beginner to the Grave (Part 9)

2020-07-29 18:49

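Hive Partitioned Tables

9.1 Introduction to Partitions

9.1.1 Why Partition

A Hive select query normally scans the entire table. As a system runs longer and its tables grow, these full table scans take more and more time. Often, though, a query only needs part of the table. For this reason Hive lets a table be partitioned when it is created: the table's data is stored in separate subdirectories, one per partition. A query that names a partition reads only that subdirectory, which avoids a full table scan and improves query efficiency.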

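9.1.2 How to Partition

It depends on business requirements, but tables are commonly partitioned by year, month, day, hour, or region.

9.1.3 Partition Syntax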

create table tableName(
.......
.......
)
partitioned by (colName colType [comment '...'],...)

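9.1.4 Notes on Partitions

- Hive partition column names are case-insensitive. Chinese characters should be avoided in partition names.
- A partition column is a pseudo-column, but it can still be used in queries.
- A table can have one or more partitions, and each partition can in turn contain one or more sub-partitions.
- A partition appears as a column in the table schema and is listed by the describe command, but the column stores no actual data; it only identifies the partition.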
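The last point can be checked with a minimal sketch. The table name demo_part is hypothetical and used only for illustration; the partition column is not declared among the regular columns, yet describe still lists it.

```sql
-- demo_part is a hypothetical table used only for this illustration.
create table if not exists demo_part(
  id int,
  name string
)
partitioned by (dt string);

-- dt is listed together with id and name, and again under the
-- "# Partition Information" section of the output.
describe demo_part;
```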

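9.1.5 Purpose of Partitions

Partitions let users narrow the range of data scanned when computing statistics: a select statement can name the partition it needs.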
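As a rough illustration, the queries below filter on the partition column of the part1 table that section 9.2.1 creates. This is a sketch that assumes part1 has already been created and loaded as shown there.

```sql
-- Only the directory for dt='2020-05-05' is read, not the whole table.
select id, name, age
from part1
where dt = '2020-05-05';

-- The query plan can be inspected to see which input is scanned.
explain
select count(*) from part1 where dt = '2020-05-05';
```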

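9.1.6 What a Partition Really Is

A partition is a directory created under the table directory (or under a parent partition's directory). The directory is named column=value.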
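The directory naming can be seen by listing the table directory from the Hive CLI. This is only a sketch: the warehouse path is an assumption taken from the paths used later in this article and may differ on your cluster, and the layout in the comments is illustrative.

```sql
-- Run from the Hive CLI. The warehouse path is an assumption and
-- may differ on your cluster.
dfs -ls -R /user/hive/warehouse/mydb.db/part1;

-- Illustrative layout after the two loads in section 9.2.1:
--   .../part1/dt=2020-05-05/user1.txt
--   .../part1/dt=2020-05-06/user2.txt
```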

9.2 Partition Examples

9.2.1 Using Single-Level Partitions

1) Create the table

create table if not exists part1(
id int,
name string,
age int
)
partitioned by (dt string)
row format delimited
fields terminated by '\t'
lines terminated by '\n';

2) Load the data

load data local inpath './data/user1.txt' into table part1 partition(dt='2020-05-05');
load data local inpath './data/user2.txt' into table part1 partition(dt='2020-05-06');

9.2.2 Using Two-Level Partitions

1) Create the table

create table if not exists part2(
id int,
name string,
age int
)
partitioned by (year string,month string)
row format delimited fields terminated by '\t';

2) Load the data

-- Partition values are strings here, so quote them (single or double quotes both work).
load data local inpath './data/user1.txt' into table part2 partition(year='2020',month='03');
load data local inpath './data/user2.txt' into table part2 partition(year='2020',month='04');
load data local inpath './data/user2.txt' into table part2 partition(year='2020',month="05");

9.2.3 Using Three-Level Partitions

1) Create the table

create table if not exists part3(
id int,
name string,
age int
)
partitioned by (year string,month string,day string)
row format delimited
fields terminated by '\t';

2) Load the data

load data local inpath './data/user1.txt' into table part3 partition(year='2020',month='05',day='01');

load data local inpath './data/user2.txt' into table part3 partition(year='2019',month='12',day='31');
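With several partition levels, a filter on only the leading partition columns still narrows the scan. The queries below are a small sketch against part3 as loaded above.

```sql
-- Restricts the scan to every partition under year=2020.
select * from part3 where year = '2020';

-- Restricts the scan to one leaf directory: year=2019/month=12/day=31.
select * from part3
where year = '2019' and month = '12' and day = '31';

-- No partition filter: all partitions are scanned.
select count(*) from part3;
```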

9.2.4 Testing Case Sensitivity

In Hive, partition column names are case-insensitive, but partition values are case-sensitive. We can test this.

1) Create the table

create table if not exists part4(
id int,
name string
)
partitioned by (year string,month string,DAY string)
row format delimited fields terminated by ','
;

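2) Load the data

-- Test column-name case: DAY, DAy and day all refer to the same partition column.
load data local inpath './data/user1.txt' into table part4 partition(year='2018',month='03',DAy='21');

-- Test value case: partition values are case-sensitive, so 'AA' is not the same as 'aa'.
load data local inpath './data/user2.txt' into table part4 partition(year='2018',month='03',day='AA');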
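The same behaviour can be observed when querying. A short sketch, assuming part4 was loaded as above:

```sql
-- Column name case does not matter: both queries read the same partition.
select * from part4 where DAY = '21';
select * from part4 where day = '21';

-- Value case does matter: only 'AA' matches the partition loaded above.
select * from part4 where day = 'AA';
select * from part4 where day = 'aa';   -- expected to return no rows
```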

9.2.5 Listing Partitions

Syntax:
show partitions tableName
Example:
show partitions part4;
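For tables with many partitions, show partitions also accepts a partial partition spec. A brief sketch using the part3 table from section 9.2.3:

```sql
-- List only the partitions under year=2020 of the three-level table.
show partitions part3 partition(year='2020');

-- Narrow the listing further by giving more of the partition spec.
show partitions part3 partition(year='2020', month='05');
```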

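9.2.6 Modifying Partitions

Modifying a partition means changing the location that the partition's column values map to. Note that the HDFS path after location must be written as a full path, including the hdfs:// scheme and host.

-- Wrong: the location is not a full HDFS path
alter table part3 partition(year='2019',month='10',day='23') set location '/user/hive/warehouse/mydb1.db/part1/dt=2018-03-21';

-- Correct: full HDFS path
alter table part3 partition(year='2020',month='05',day='01') set location 'hdfs://qianfeng01:8020/user/hive/warehouse/mydb.db/part1/dt=2020-05-05';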
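One way to confirm the change is to inspect the partition's metadata. This is a sketch; the Location field in the output should show the path set above.

```sql
-- Shows detailed metadata for a single partition, including its Location.
describe formatted part3 partition(year='2020',month='05',day='01');
```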

9.2.7 Adding Partitions

1) Add empty partitions

alter table part3 add partition(year='2020',month='05',day='02');

alter table part3 add partition(year='2020',month='05',day='03') partition(year='2020',month='05',day='04');

2) Add a partition that already has data

alter table part3 add partition(year='2020',month='05',day='05') location '/user/hive/warehouse/mydb.db/part1/dt=2020-05-06';

3) Add several partitions with data

alter table part3 add
partition(year='2020',month='05',day='06') location '/user/hive/warehouse/mydb.db/part1/dt=2020-05-05'
partition(year='2020',month='05',day='07') location '/user/hive/warehouse/mydb.db/part1/dt=2020-05-06';
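Adding a partition that already exists fails, so scripts that may run more than once often add if not exists. A minimal sketch:

```sql
-- Safe to re-run: the partition is only added when it is missing.
alter table part3 add if not exists
partition(year='2020',month='05',day='02');

-- Check the result.
show partitions part3;
```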

9.2.8 Dropping Partitions

1) Drop a single partition

alter table part3 drop partition(year='2020',month='05',day='07');

2) Drop several partitions

alter table part3 drop partition(year='2020',month='05',day='05'),partition(year='2020',month='05',day='06');
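The drop statement also accepts if exists, which makes the intent explicit in scripts that may be re-run. Whether dropping a missing partition raises an error without it depends on the Hive configuration, so treat this as a sketch rather than a rule.

```sql
-- Drops the partition only when it is present.
alter table part3 drop if exists
partition(year='2020',month='05',day='06');

-- Check which partitions remain.
show partitions part3;
```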

Test: what happens when all partitions of a partitioned table are dropped

create table if not exists part10(
id int,
name string,
age int
)
partitioned by (year string,month string,day string)
row format delimited
fields terminated by '\t';

load data local inpath './data/user1.txt' overwrite into table part10
partition(year='2020',month='05',day='06');
load data local inpath './data/user2.txt' overwrite into table part10
partition(year='2020',month='05',day='07');

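Drop the partitions:
alter table part10 drop
partition(year='2020',month='05',day='06'),
partition(year='2020',month='05',day='07');

Note: for a partitioned table created with default settings, dropping all of its partitions does not delete the table directory.

Test 2: use the location keyword to specify where each partition is stored
alter table part10 add partition(year='2020',month='05',day='08') location '/test/a/b';
alter table part10 add partition(year='2020',month='05',day='09') location '/test/a/c';

alter table part10 drop
partition(year='2020',month='05',day='08'),
partition(year='2020',month='05',day='09');
Conclusion: when a partition is dropped, its own (innermost) directory is deleted. A parent directory is also deleted if it contains no files, and is kept if it does.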

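9.3 Partition Types in Detail

9.3.1 Kinds of Partitioning

1. Static partitioning: data files are loaded directly into an explicitly named partition.
2. Dynamic partitioning: the partitions are not known in advance; Hive creates them from the values in the data (the partition directories are assigned automatically rather than specified).
3. Mixed partitioning: some partition columns are static and others are dynamic.

9.3.2 Partition Properties

hive.exec.dynamic.partition=true: whether dynamic partitioning is enabled
hive.exec.dynamic.partition.mode=strict/nonstrict: strict mode or nonstrict mode
hive.exec.max.dynamic.partitions=1000: maximum total number of dynamic partitions that may be created
hive.exec.max.dynamic.partitions.pernode=100: maximum number of dynamic partitions that may be created on each mapper/reducer node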
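These properties can be set per session before running a dynamic partition insert. The values below simply repeat the ones listed above; adjust them to your own workload.

```sql
-- Enable dynamic partitioning for the current session.
set hive.exec.dynamic.partition=true;
-- nonstrict allows every partition column to be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;
-- Limits on how many partitions a single statement may create.
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;

-- Giving only the property name prints its current value.
set hive.exec.dynamic.partition.mode;
```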

9.3.3 Dynamic Partitioning Example

1) Create the dynamic partitioned table

create table dy_part1(
sid int,
name string,
gender string,
age int,
academy string
)
partitioned by (dt string)
row format delimited fields terminated by ','
;

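2) Load data with dynamic partitioning

Do not use the following approach; it is a static load, not a dynamic one:

load data local inpath '/hivedata/user.txt' into table dy_part1 partition(dt='2020-05-06');

Correct approach: load the data from another table.

**Step 1:** Create a staging table: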

create table temp_part1(
sid int,
name string,
gender string,
age int,
academy string,
dt string
)
row format delimited
fields terminated by ','
;

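Note: the staging table must include the partition column of the dynamic partitioned table as a regular column.

**Step 2:** Load the data into the staging table: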

95001,李勇,男,20,CS,2017-8-31
95002,刘晨,女,19,IS,2017-8-31
95003,王敏,女,22,MA,2017-8-31
95004,张立,男,19,IS,2017-8-31
95005,刘刚,男,18,MA,2018-8-31
95006,孙庆,男,23,CS,2018-8-31
95007,易思玲,女,19,MA,2018-8-31
95008,李娜,女,18,CS,2018-8-31
95009,梦圆圆,女,18,MA,2018-8-31
95010,孔小涛,男,19,CS,2017-8-31
95011,包小柏,男,18,MA,2019-8-31
95012,孙花,女,20,CS,2017-8-31
95013,冯伟,男,21,CS,2019-8-31
95014,王小丽,女,19,CS,2017-8-31
95015,王君,男,18,MA,2019-8-31
95016,钱国,男,21,MA,2019-8-31
95017,王风娟,女,18,IS,2019-8-31
95018,王一,女,19,IS,2019-8-31
95019,邢小丽,女,19,IS,2018-8-31
95020,赵钱,男,21,IS,2019-8-31
95021,周二,男,17,MA,2018-8-31
95022,郑明,男,20,MA,2018-8-31

load data local inpath './data/student2.txt' into table temp_part1;

**Step 3:** Load into the partitioned table dynamically

insert into dy_part1 partition(dt) select sid,name,gender,age,academy,dt from temp_part1;

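Note: in strict mode, an insert into a dynamic partitioned table must give a static value for at least one partition column. In nonstrict mode, no static value is required. The insert above uses only the dynamic column dt, so it runs only in nonstrict mode.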
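Putting the note into practice, here is a sketch of the session settings for step 3 and a quick check afterwards. The expected partitions assume student2.txt holds exactly the rows listed in step 2.

```sql
-- Set these before running the insert from step 3.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- After the insert, one partition per distinct dt value should exist:
--   dt=2017-8-31, dt=2018-8-31, dt=2019-8-31
show partitions dy_part1;

-- Row counts per partition, to compare against the staging table.
select dt, count(*) from dy_part1 group by dt;
select dt, count(*) from temp_part1 group by dt;
```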

9.3.4 Mixed Partitioning Example

1) Create a partitioned table:

create table dy_part2(
id int,
name string
)
partitioned by (year string,month string,day string)
row format delimited fields terminated by ','
;

2) Create a staging table

create table temp_part2(
id int,
name string,
year string,
month string,
day string
)
row format delimited fields terminated by ','
;

The data is as follows:
1,廉德枫,2019,06,25
2,刘浩(小),2019,06,25
3,王鑫,2019,06,25
5,张三,2019,06,26
6,张小三,2019,06,26
7,王小四,2019,06,27
8,夏流,2019,06,27

load data local inpath './data/temp_part2.txt' into table temp_part2;

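3) Insert the data into the partitioned table

- Wrong: select * also returns year, so the query produces five columns while the target expects four (id, name, plus the dynamic partition columns month and day).
insert into dy_part2 partition (year='2019',month,day)
select * from temp_part2;

- Correct: select the regular columns followed by the dynamic partition columns, in order.
insert into dy_part2 partition (year='2020',month,day)
select id,name,month,day from temp_part2;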
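Because year is given a static value, a mixed insert like the correct one above is also accepted in strict mode. The check below is a sketch; the partitions in the comments are what the sample data should produce, assuming temp_part2.txt holds exactly the rows listed earlier.

```sql
-- A static year plus dynamic month and day satisfies strict mode.
set hive.exec.dynamic.partition.mode=strict;

-- After the insert, the static year combines with each distinct
-- (month, day) pair found in the data:
--   year=2020/month=06/day=25
--   year=2020/month=06/day=26
--   year=2020/month=06/day=27
show partitions dy_part2;

-- Read back one of the dynamically created partitions.
select id, name from dy_part2
where year = '2020' and month = '06' and day = '25';
```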

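4) Notes on partitioned tables

1. Hive partitions on a column declared outside the table's regular columns. The partition column is a pseudo-column, but it can be used to filter queries.
2. Using Chinese in partition columns is not recommended.
3. Dynamic partitioning is generally not recommended, because it runs MapReduce jobs to query the data. With too many partitions it can become a performance bottleneck for the NameNode and ResourceManager, so try to estimate the number of partitions before using it.
4. Modifying partition properties can change both the metadata and the data on HDFS.

5) Difference between Hive partitions and MySQL partitions

MySQL partitions on a column that belongs to the table's own columns, while Hive partitions on a column declared outside the table's regular columns.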
