
Common Basic Hive SQL Syntax (DDL)

2018-01-04 22:09
For learning Hive, the official documentation is about as thorough as it gets; that is true not just of Hive but of most big-data components — if you can work through everything the official docs cover, you are well on your way to expert level. Below we walk through the SQL syntax the documentation lays out.

First, a look at how Hive organizes data in storage (the original post includes a diagram here):



1. Database: Hive holds multiple databases; the default one is named default. Each database maps to an HDFS directory under
/user/hive/warehouse
, configurable via the hive.metastore.warehouse.dir parameter (set in hive-site.xml).

2. Table: Hive tables are either managed (internal) or external. Each table maps to an HDFS directory: /user/hive/warehouse/[databasename.db]/table.

3. Partition: a table can have one or more partitions, which narrow queries and improve efficiency; each partition gets its own subdirectory on HDFS under:

/user/hive/warehouse/[databasename.db]/table
(covered in detail below)

4. Bucket: ... to be continued (not yet covered).

DDL Operations (Data Definition Language)

1 Show Databases

The syntax as listed in the official documentation:

SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];


Reading the symbols and keywords in this notation:

" | ": choose one of the alternatives;

"[ ]": an optional clause.

LIKE 'identifier_with_wildcards': match database names against a wildcard pattern.

hive> show databases;
OK
default
word
wordcount


hive> show databases like 'word';
OK
word
Time taken: 0.16 seconds, Fetched: 1 row(s)

hive> show databases like '*word*';
OK
word
wordcount


2 Create Database

The syntax as listed in the official documentation:

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];


IF NOT EXISTS: create the database only if it does not already exist; if it exists, do nothing instead of raising an error.

COMMENT: a description of the database.

LOCATION: the HDFS path at which to create the database; if omitted, it defaults to a directory under
/user/hive/warehouse/

WITH DBPROPERTIES: key-value properties attached to the database.

hive> CREATE DATABASE hive1;
OK
Time taken: 0.791 seconds

hive> show databases;
OK
default
hive1
word
wordcount


hive> CREATE DATABASE hive3 LOCATION "/db_hive3";
OK


hive>
> CREATE DATABASE IF NOT EXISTS hive2
> COMMENT "it is my database"
> WITH DBPROPERTIES ("creator"="zhangsan", "date"="2018-08-08")
> ;
OK


3 Describe Database

The syntax as listed in the official documentation:

DESCRIBE DATABASE [EXTENDED] db_name;


DESCRIBE DATABASE db_name: shows the database's description and its directory location;

EXTENDED: additionally shows the database's key-value properties.

hive> describe database hive2;
OK
hive2   it is my database       hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db        hadoop      USER
Time taken: 0.119 seconds, Fetched: 1 row(s)

hive> describe database hive3;
OK
hive3           hdfs://192.168.137.200:9000/db_hive3    hadoop  USER
Time taken: 0.165 seconds, Fetched: 1 row(s)

hive> describe database extended hive2;
OK
hive2   it is my database       hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db        hadoop      USER    {date=2018-08-08, creator=zhangsan}
Time taken: 0.135 seconds, Fetched: 1 row(s)


A side note: Hive stores its metadata in Derby by default, but in practice the metastore is usually switched to MySQL. So can we find the database information in MySQL?

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| basic01            |
| mysql              |
| performance_schema |
| test               |
+--------------------+


The basic01 database was created when we first started Hive (its name is configured in hive-site.xml).

mysql> use basic01;
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_basic01         |
+---------------------------+
| bucketing_cols            |
| cds                       |
| columns_v2                |
| database_params           |
| dbs                       |
| func_ru                   |
| funcs                     |
| global_privs              |
| part_col_stats            |
| partition_key_vals        |
| partition_keys            |
| partition_params          |
| partitions                |
| roles                     |
| sd_params                 |
| sds                       |
| sequence_table            |
| serde_params              |
| serdes                    |
| skewed_col_names          |
| skewed_col_value_loc_map  |
| skewed_string_list        |
| skewed_string_list_values |
| skewed_values             |
| sort_cols                 |
| tab_col_stats             |
| table_params              |
| tbls                      |
| version                   |
+---------------------------+
29 rows in set (0.07 sec)


Query the dbs table:

mysql> select * from dbs \G
*************************** 1. row ***************************
DB_ID: 1
DESC: Default Hive database
DB_LOCATION_URI: hdfs://192.168.137.200:9000/user/hive/warehouse
NAME: default
OWNER_NAME: public
OWNER_TYPE: ROLE
*************************** 2. row ***************************
DB_ID: 6
DESC: NULL
DB_LOCATION_URI: hdfs://192.168.137.200:9000/user/hive/warehouse/word.db
NAME: word
OWNER_NAME: hadoop
OWNER_TYPE: USER
*************************** 3. row ***************************
DB_ID: 7
DESC: NULL
DB_LOCATION_URI: hdfs://192.168.137.200:9000/user/hive/warehouse/wordcount.db
NAME: wordcount
OWNER_NAME: hadoop
OWNER_TYPE: USER
*************************** 4. row ***************************
DB_ID: 16
DESC: NULL
DB_LOCATION_URI: hdfs://192.168.137.200:9000/user/hive/warehouse/hive1.db
NAME: hive1
OWNER_NAME: hadoop
OWNER_TYPE: USER
*************************** 5. row ***************************
DB_ID: 21
DESC: it is my database
DB_LOCATION_URI: hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db
NAME: hive2
OWNER_NAME: hadoop
OWNER_TYPE: USER
*************************** 6. row ***************************
DB_ID: 26
DESC: NULL
DB_LOCATION_URI: hdfs://192.168.137.200:9000/db_hive3
NAME: hive3
OWNER_NAME: hadoop
OWNER_TYPE: USER
6 rows in set (0.00 sec)


mysql> select * from database_params \G
*************************** 1. row ***************************
DB_ID: 21
PARAM_KEY: creator
PARAM_VALUE: zhangsan
*************************** 2. row ***************************
DB_ID: 21
PARAM_KEY: date
PARAM_VALUE: 2018-08-08
2 rows in set (0.00 sec)


Whenever we create a database in Hive, the corresponding information can be queried from MySQL.

4 Drop Database

The syntax as listed in the official documentation:

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];


RESTRICT: the default; the drop fails if the database still contains tables;

CASCADE: cascading delete (if the database still contains tables, drop the tables first, then the database).

hive> drop database hive3;
OK

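When the database still holds tables, a plain DROP fails and CASCADE is needed. A brief sketch (the database name demo_db and its table t are invented for illustration):

```sql
-- Create a throwaway database with one table in it.
CREATE DATABASE demo_db;
CREATE TABLE demo_db.t (id int);

-- DROP DATABASE demo_db;        -- fails: database is not empty (RESTRICT)
DROP DATABASE demo_db CASCADE;   -- drops table t, then the database
```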

5 Alter Database

The syntax as listed in the official documentation:

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);   -- (Note: SCHEMA added in Hive 0.14.0)

ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;
-- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)

ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path;
-- (Note: Hive 2.2.1, 2.4.0 and later)


(Each Note records the Hive version in which that form was introduced.)

Modify the database's property key-value pairs:

hive> alter database hive2 set dbproperties ("update"="lisi");
OK


Set the owning user:

hive> alter database hive2 set owner user zy;
OK


Before the change:

hive> describe database extended hive2;
OK
hive2   it is my database       hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db        hadoop      USER    {date=2018-08-08, creator=zhangsan}
Time taken: 0.135 seconds, Fetched: 1 row(s)


After the change:

hive> describe database extended hive2;
OK
hive2   it is my database       hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db        zy USER     {date=2018-08-08, creator=zhangsan, update=lisi}
Time taken: 0.235 seconds, Fetched: 1 row(s)


6 Use Database

The syntax as listed in the official documentation:

USE database_name;


hive> use default;
OK
Time taken: 0.11 seconds


7 Create Table

The syntax as listed in the official documentation:

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
-- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
-- (Note: Available in Hive 0.6.0 and later)
]

[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
-- (Note: Available in Hive 0.6.0 and later)
[AS select_statement];
-- (Note: Available in Hive 0.5.0 and later; not supported for external tables)


The documentation lists a lot at once, which can look intimidating; let's work through it piece by piece:

7.1 TEMPORARY (temporary tables)

Hive 0.14.0 added temporary tables. A temporary table is visible only to the current session and is dropped automatically when the session exits.

Syntax:

CREATE TEMPORARY TABLE ...


Notes:

1. If a temporary table shares its name with an existing permanent table, references to that name within the session resolve to the temporary table; the original table becomes reachable again only after the temporary table is dropped or renamed;

2. Limitations: temporary tables support neither partition columns nor indexes.
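A minimal sketch of the session-scoped behavior (the table name session_tmp is invented for illustration):

```sql
-- Lives only for the current session; dropped automatically on exit.
CREATE TEMPORARY TABLE session_tmp (
  id   int,
  name string
);

INSERT INTO session_tmp VALUES (1, 'zhangsan');
SELECT * FROM session_tmp;

-- After quitting the Hive CLI and starting a new session,
-- session_tmp no longer exists.
```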

7.2 EXTERNAL (external tables)

Hive has two kinds of tables: Managed Tables (the default) and External Tables (created with the EXTERNAL keyword). The main difference shows up on DROP: dropping a Managed Table deletes both the data (stored on HDFS) and the metadata (stored in MySQL), while dropping an External Table deletes only the metadata.

A managed table (no EXTERNAL keyword):

hive> create table managed_table(
> id int,
> name string
> );


With the EXTERNAL keyword:

hive> create external table external_table(
> id int,
>  name string
> );


Check the data on HDFS:

[hadoop@zydatahadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 7 items
drwxr-xr-x   - hadoop supergroup          0 2017-12-23 00:21 /user/hive/warehouse/external_table
drwxr-xr-x   - hadoop supergroup          0 2017-12-22 14:42 /user/hive/warehouse/helloword
drwxr-xr-x   - hadoop supergroup          0 2017-12-22 17:28 /user/hive/warehouse/hive1.db
drwxr-xr-x   - hadoop supergroup          0 2017-12-22 17:40 /user/hive/warehouse/hive2.db
drwxr-xr-x   - hadoop supergroup          0 2017-12-23 00:06 /user/hive/warehouse/managed_table
drwxr-xr-x   - hadoop supergroup          0 2017-12-22 14:58 /user/hive/warehouse/word.db
drwxr-xr-x   - hadoop supergroup          0 2017-12-22 15:34 /user/hive/warehouse/wordcount.db

Both external_table and managed_table are present.


Check the metadata in MySQL:

mysql> select * from tbls \G

*************************** 4. row ***************************
TBL_ID: 11
CREATE_TIME: 1513958794
DB_ID: 1
LAST_ACCESS_TIME: 0
OWNER: hadoop
RETENTION: 0
SD_ID: 11
TBL_NAME: managed_table
TBL_TYPE: MANAGED_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
*************************** 5. row ***************************
TBL_ID: 13
CREATE_TIME: 1513959668
DB_ID: 1
LAST_ACCESS_TIME: 0
OWNER: hadoop
RETENTION: 0
SD_ID: 13
TBL_NAME: external_table
TBL_TYPE: EXTERNAL_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL

Both tables are there, with different types.


Drop both tables:

hive> drop table  managed_table;
OK
Time taken: 0.807 seconds
hive> drop table external_table;
OK


After the drop, check HDFS again:

[hadoop@zydatahadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 6 items
drwxr-xr-x   - hadoop supergroup          0 2017-12-23 00:21 /user/hive/warehouse/external_table
drwxr-xr-x   - hadoop supergroup          0 2017-12-22 14:42 /user/hive/warehouse/helloword
drwxr-xr-x   - hadoop supergroup          0 2017-12-22 17:28 /user/hive/warehouse/hive1.db
drwxr-xr-x   - hadoop supergroup          0 2017-12-22 17:40 /user/hive/warehouse/hive2.db
drwxr-xr-x   - hadoop supergroup          0 2017-12-22 14:58 /user/hive/warehouse/word.db
drwxr-xr-x   - hadoop supergroup          0 2017-12-22 15:34 /user/hive/warehouse/wordcount.db


managed_table is gone, while external_table's directory is still there.

After the drop, check MySQL:

mysql> select * from tbls \G
Both tables' rows are gone from the metadata.


7.3 [(col_name data_type [COMMENT col_comment], … [constraint_specification])] [COMMENT table_comment]

col_name: the column name;

data_type: the column type;

COMMENT col_comment: a comment on the column;

[COMMENT table_comment]: a comment on the table.

Let's create a table with column and table comments:

hive> create table student(
> id int comment 'student id',
> name string comment 'name'
> )
> comment 'this is student information table';
OK


7.4 PARTITIONED BY (partitioned tables)

Background: when a table holds a lot of data, full scans become slow and expensive. If we only need to query part of the data, partitioning lets us restrict the scan to that part.

Hive supports two kinds of partitioning: static partitions and dynamic partitions.

(1) Static partitions:

A partitioned table is created with
PARTITIONED BY
; a table can have one or more partitions, and each partition is stored as its own subdirectory under the table's directory;

A partition appears as a column in the table schema (describe table shows it), but that column does not store actual data in the files; it only labels the partition.

Partitioned tables come in two forms: single-level, where the table directory contains a single level of partition subdirectories, and multi-level, where the partition directories are nested.

A single-level partition:
hive> CREATE TABLE order_partition (
> order_number string,
> event_time string
> )
> PARTITIONED BY (event_month string);
OK


Load the data in order.txt into order_partition (LOAD DATA usage is covered in a separate post):

hive> load data local inpath '/home/hadoop/order.txt' overwrite into table order_partition partition (event_month='2014-05');

hive> select * from order_partition;

10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05


Note: data can also be loaded through the hadoop shell, demonstrated below:

- A partition is just a directory under the table's HDFS path, so can we create that directory on HDFS ourselves and put the data in directly?

[hadoop@zydatahadoop001 ~]$ hadoop fs -mkdir -p /user/hive/warehouse/order_partition/event_month=2014-06
[hadoop@zydatahadoop001 ~]$ hadoop fs -put /home/hadoop/order.txt /user/hive/warehouse/order_partition/event_month=2014-06

After the upload, check the table order_partition:
hive> select * from order_partition
> ;
OK
10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05
Time taken: 2.034 seconds, Fetched: 5 row(s)

The data we just uploaded does not show up. The file is on HDFS, but Hive's metastore does not yet know about the new partition. Run the following command to sync it:

msck repair table order_partition;

Check the data again:
hive> select * from order_partition;
OK
10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05
10703007267488  2014-05-01 06:01:12.334+01      2014-06
10101043505096  2014-05-01 07:28:12.342+01      2014-06
10103043509747  2014-05-01 07:50:12.33+01       2014-06
10103043501575  2014-05-01 09:27:12.33+01       2014-06
10104043514061  2014-05-01 09:03:12.324+01      2014-06


A multi-level partition:
hive> CREATE TABLE order_multi_partition (
> order_number string,
> event_time string
> )
>  PARTITIONED BY (event_month string, step string);
OK

Load data:
hive> load data local inpath '/home/hadoop/order.txt' overwrite into table order_multi_partition partition (event_month='2014-05',step=1);

Query:
hive> select * from order_multi_partition;
OK
10703007267488  2014-05-01 06:01:12.334+01      2014-05 1
10101043505096  2014-05-01 07:28:12.342+01      2014-05 1
10103043509747  2014-05-01 07:50:12.33+01       2014-05 1
10103043501575  2014-05-01 09:27:12.33+01       2014-05 1
10104043514061  2014-05-01 09:03:12.324+01      2014-05 1
Time taken: 0.228 seconds, Fetched: 5 row(s)


Check the file structure on HDFS:

[hadoop@zydatahadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse/order_multi_partition/event_month=2014-05
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-01-09 22:52 /user/hive/warehouse/order_multi_partition/event_month=2014-05/step=1


A single-level partitioned table has one level of partition directories on HDFS; a multi-level partitioned table has nested ones.

(2) Dynamic partitions

See the official documentation for the full description.

First, the terms the documentation uses:

Static Partition (SP) columns: static partition columns;

Dynamic Partition (DP) columns: dynamic partition columns.

DP columns are specified the same way as it is for SP columns – in the partition clause. The only difference is that DP columns do not have values, while SP columns do. In the partition clause, we need to specify all partitioning columns, even if all of them are DP columns.
In INSERT ... SELECT ... queries, the dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.


In plain terms:

DP columns are specified in the PARTITION clause just like SP columns; the only difference is that a DP column is listed without a value while an SP column carries one (after the PARTITION keyword, a DP column is just a key with no value);

In an INSERT ... SELECT ... query, the dynamic partition columns must come last in the SELECT list, in the same order in which they appear in the PARTITION() clause;

Making every partition column dynamic is allowed only in nonstrict mode; in strict mode it raises an error;

When dynamic and static partitions are mixed, the static partition columns must come before the dynamic ones in the PARTITION clause.

A few demonstrations follow.

Setup first: Hive defaults to static partitioning. To use dynamic partitions, set the parameters below, either per-session or permanently in hive-site.xml. The per-session settings:

set hive.exec.dynamic.partition=true;           -- enable dynamic partitioning (default false)
set hive.exec.dynamic.partition.mode=nonstrict; -- the default, strict, requires at least one static
                                                -- partition column; nonstrict allows all partition
                                                -- columns to be dynamic


Create a dynamically partitioned employee table, partitioned by deptno:

CREATE TABLE emp_dynamic_partition (
empno int,
ename string,
job string,
mgr int,
hiredate string,
salary double,
comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";


With static partitions, one statement per deptno value is needed:

CREATE TABLE emp_partition (
empno int,
ename string,
job string,
mgr int,
hiredate string,
salary double,
comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";

insert into table emp_partition partition(deptno=10)
select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=10;

insert into table emp_partition partition(deptno=20)
select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=20;

insert into table emp_partition partition(deptno=30)
select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=30;

Query the result:
hive> select * from emp_partition;
OK
7782    CLARK   MANAGER 7839    1981/6/9        2450.0  NULL    10
7839    KING    PRESIDENT       NULL    1981/11/17      5000.0  NULL    10
7934    MILLER  CLERK   7782    1982/1/23       1300.0  NULL    10
7369    SMITH   CLERK   7902    1980/12/17      800.0   NULL    20
7566    JONES   MANAGER 7839    1981/4/2        2975.0  NULL    20
7788    SCOTT   ANALYST 7566    1987/4/19       3000.0  NULL    20
7876    ADAMS   CLERK   7788    1987/5/23       1100.0  NULL    20
7902    FORD    ANALYST 7566    1981/12/3       3000.0  NULL    20
7499    ALLEN   SALESMAN        7698    1981/2/20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981/2/22       1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981/9/28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981/5/1        2850.0  NULL    30
7844    TURNER  SALESMAN        7698    1981/9/8        1500.0  0.0     30
7900    JAMES   CLERK   7698    1981/12/3       950.0   NULL    30


Now the dynamic-partition version; note how its syntax differs from the static case:

insert into table emp_dynamic_partition partition(deptno)
select empno,ename ,job ,mgr ,hiredate ,salary ,comm, deptno from emp;
One statement covers every department.
hive> select * from emp_dynamic_partition;
OK
7782    CLARK   MANAGER 7839    1981/6/9        2450.0  NULL    10
7839    KING    PRESIDENT       NULL    1981/11/17      5000.0  NULL    10
7934    MILLER  CLERK   7782    1982/1/23       1300.0  NULL    10
7369    SMITH   CLERK   7902    1980/12/17      800.0   NULL    20
7566    JONES   MANAGER 7839    1981/4/2        2975.0  NULL    20
7788    SCOTT   ANALYST 7566    1987/4/19       3000.0  NULL    20
7876    ADAMS   CLERK   7788    1987/5/23       1100.0  NULL    20
7902    FORD    ANALYST 7566    1981/12/3       3000.0  NULL    20
7499    ALLEN   SALESMAN        7698    1981/2/20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981/2/22       1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981/9/28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981/5/1        2850.0  NULL    30
7844    TURNER  SALESMAN        7698    1981/9/8        1500.0  0.0     30
7900    JAMES   CLERK   7698    1981/12/3       950.0   NULL    30


Check the HDFS directory structure:

[hadoop@zydatahadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 5 items
drwxr-xr-x   - hadoop supergroup          0 2018-01-09 20:30 /user/hive/warehouse/emp
drwxr-xr-x   - hadoop supergroup          0 2018-01-10 00:38 /user/hive/warehouse/emp_dynamic_partition
drwxr-xr-x   - hadoop supergroup          0 2018-01-10 00:34 /user/hive/warehouse/emp_partition
drwxr-xr-x   - hadoop supergroup          0 2018-01-09 22:52 /user/hive/warehouse/order_multi_partition
drwxr-xr-x   - hadoop supergroup          0 2018-01-09 22:42 /user/hive/warehouse/order_partition


Mixed SP & DP columns (static and dynamic partitions together)

create table student(
id int,
name string,
tel string,
age int
)
row format delimited fields terminated by '\t';

insert into student values(1,'zhangsan','18310982765',20),(2,'lisi','18282823434',30),(3,'wangwu','1575757668',40);


Create the partitioned table stu_age_partition:

create table stu_age_partition(
id int,
name string,
tel string
)
partitioned by (ds string,age int)
row format delimited fields terminated by '\t';

insert into stu_age_partition partition(ds='2010-03-03',age)
select id,name,tel,age from student;

Result:
hive> select * from stu_age_partition;
OK
1       zhangsan        18310982765     2010-03-03      20
2       lisi    18282823434     2010-03-03      30
3       wangwu  1575757668      2010-03-03      40
Time taken: 0.149 seconds, Fetched: 3 row(s)

Check the directory structure on HDFS:
[hadoop@zydatahadoop001 data]$ hdfs dfs -ls /user/hive/warehouse/stu_age_partition/ds=2010-03-03

drwxr-xr-x   - hadoop supergroup          0 2018-01-10 01:04 /user/hive/warehouse/stu_age_partition/ds=2010-03-03/age=20
drwxr-xr-x   - hadoop supergroup          0 2018-01-10 01:04 /user/hive/warehouse/stu_age_partition/ds=2010-03-03/age=30
drwxr-xr-x   - hadoop supergroup          0 2018-01-10 01:04 /user/hive/warehouse/stu_age_partition/ds=2010-03-03/age=40


7.5 ROW FORMAT

How does the official documentation describe ROW FORMAT?

: DELIMITED
[FIELDS TERMINATED BY char [ESCAPED BY char]]       [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
[NULL DEFINED AS char]
-- (Note: Available in Hive 0.13 and later)
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]


Per the documentation: when creating a table, you can specify a custom SerDe or use the built-in one. If no ROW FORMAT is specified, or ROW FORMAT DELIMITED is specified, the built-in SerDe is used. You declare the table's columns at creation time, and with a custom SerDe, Hive relies on the SerDe to determine the table's actual column data.

(If you are now wondering what exactly a SerDe is, the official documentation has a page on it for the curious.)

Breaking the clauses down:

DELIMITED: delimited format (the delimiters can be customized);

FIELDS TERMINATED BY char: the separator used between fields;

For example, FIELDS TERMINATED BY '\t' makes \t the separator between fields;

COLLECTION ITEMS TERMINATED BY char: the separator between elements of a collection type such as array (and between the entries of a map);

MAP KEYS TERMINATED BY char: the separator between a map entry's key and its value;

LINES TERMINATED BY char: the separator between rows (default \n).

In most cases the default \n line terminator in
LINES TERMINATED BY char
is fine, and only
FIELDS TERMINATED BY char
needs to be specified.

Create demo1, with \t between fields and the default \n line terminator:
hive> create table demo1(
> id int,
> name string
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK

Create demo2, specifying collection and map delimiters as well:
hive> create table demo2 (
> id int,
> name string,
> hobbies ARRAY <string>,
> address MAP <string, string>
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> COLLECTION ITEMS TERMINATED BY '-'
> MAP KEYS TERMINATED BY ':';
OK

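To see how demo2's delimiters line up with raw data, here is a hypothetical input line (the file path and the values are invented for illustration): fields are separated by \t, array elements and map entries by '-', and map keys from values by ':'.

```sql
-- /home/hadoop/demo2.txt would contain lines shaped like:
--   1<TAB>zhangsan<TAB>reading-swimming<TAB>home:beijing-work:shanghai
-- which parse as:
--   hobbies = ["reading", "swimming"]
--   address = {"home":"beijing", "work":"shanghai"}
load data local inpath '/home/hadoop/demo2.txt' into table demo2;
select name, hobbies[0], address['home'] from demo2;
```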

7.6 STORED AS (storage format)
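STORED AS chooses the file format for the table's data files; TEXTFILE is the default. A brief sketch (the table names are invented for illustration):

```sql
-- TEXTFILE is the default storage format; columnar formats such as
-- ORC and PARQUET usually compress better and scan faster.
CREATE TABLE logs_text (
  id  int,
  msg string
)
STORED AS TEXTFILE;

CREATE TABLE logs_orc (
  id  int,
  msg string
)
STORED AS ORC;
```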

7.7 Create Table As Select (CTAS)

CTAS creates a table from a query, copying both the structure and the data of the result (it runs a MapReduce job).

1. Copy an entire table:

hive> create table emp2 as select * from emp;
The MapReduce job it launches:
Query ID = hadoop_20171223093737_43cbeae2-654d-4832-ad5f-13a53732af34
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator

select * from emp2;
OK
7369    SMITH   CLERK   7902    1980/12/17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981/2/20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981/2/22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981/4/2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981/9/28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981/5/1        2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981/6/9        2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987/4/19       3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981/11/17      5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981/9/8        1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987/5/23       1100.0  NULL    20
7900    JAMES   CLERK   7698    1981/12/3       950.0   NULL    30
7902    FORD    ANALYST 7566    1981/12/3       3000.0  NULL    20
7934    MILLER  CLERK   7782    1982/1/23       1300.0  NULL    10
Time taken: 0.229 seconds, Fetched: 14 row(s)


2. Copy only some of the columns:

hive> create table emp3 as select empno,ename from emp;

hive> select * from emp3;
OK
7369    SMITH
7499    ALLEN
7521    WARD
7566    JONES
7654    MARTIN
7698    BLAKE
7782    CLARK
7788    SCOTT
7839    KING
7844    TURNER
7876    ADAMS
7900    JAMES
7902    FORD
7934    MILLER
Time taken: 0.295 seconds, Fetched: 14 row(s)


7.8 LIKE

Creating a table with LIKE copies only the structure of the source table, not its data.

hive> CREATE TABLE emp (
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> salary double,
> comm double,
> deptno int
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
OK
Time taken: 0.54 seconds

hive> create table emp1 like emp;
OK

hive> select * from emp1;
OK
No rows come back: the structure was copied, the data was not.


7.9 Skewed Tables
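SKEWED BY tells Hive which values of a column are heavily over-represented, so it can track them (and, with STORED AS DIRECTORIES, split them into separate directories) to optimize queries. A brief sketch (the table name and values are invented for illustration):

```sql
-- Rows with key = 1 or key = 5 are expected to dominate the data.
CREATE TABLE skewed_demo (
  key  int,
  val  string
)
SKEWED BY (key) ON (1, 5)
STORED AS DIRECTORIES;
```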

7.10 desc formatted table_name

Show a table's detailed information:

hive> desc formatted emp;
OK
# col_name              data_type               comment

empno                   int
ename                   string
job                     string
mgr                     int
hiredate                string
salary                  double
comm                    double
deptno                  int

# Detailed Table Information
Database:               default
Owner:                  hadoop
CreateTime:             Sat Dec 23 09:39:57 CST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://192.168.137.200:9000/user/hive/warehouse/emp
Table Type:             MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   true
numFiles                1
numRows                 0
rawDataSize             0
totalSize               671
transient_lastDdlTime   1513993210

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
field.delim             \t
serialization.format    \t
Time taken: 0.364 seconds, Fetched: 39 row(s)


This lists everything recorded when the table was created, and the same information (the metadata) can also be queried from MySQL:
select * from table_params;


7.11 List all tables in the database

hive> show tables;
OK
emp2
emp
emp1
emp3
helloword
Time taken: 0.221 seconds, Fetched: 5 row(s)


7.12 Show the statement that created a table

hive> show create table emp;
OK
CREATE TABLE `emp`(
`empno` int,
`ename` string,
`job` string,
`mgr` int,
`hiredate` string,
`salary` double,
`comm` double,
`deptno` int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://192.168.137.200:9000/user/hive/warehouse/emp'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='671',
'transient_lastDdlTime'='1513993210')
Time taken: 0.393 seconds, Fetched: 24 row(s)


8 Drop Table

The syntax as given in the official documentation:

DROP TABLE [IF EXISTS] table_name [PURGE];     -- (Note: PURGE available in Hive 0.14.0 and later)


With PURGE specified, the data skips the trash and is deleted outright.

DROP TABLE removes both the table's metadata and its data. If the trash is enabled (and PURGE is not specified), the data is actually moved to the .Trash/Current directory; the metadata, however, is gone for good.

Dropping an EXTERNAL table leaves its data in place on the filesystem.
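A sketch of the PURGE variant (the table name tmp_logs is invented; this assumes the HDFS trash is enabled):

```sql
-- Without PURGE the data would land in .Trash/Current and could be
-- restored; with PURGE it is unrecoverable.
DROP TABLE IF EXISTS tmp_logs PURGE;
```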

hive> show tables;
OK
demo1
demo2
demo3
order_multi_partition
order_partition
student
Time taken: 0.529 seconds, Fetched: 6 row(s)

Drop demo3:
hive> drop table demo3;
OK
Time taken: 2.081 seconds

hive> show tables;
OK
demo1
demo2
order_multi_partition
order_partition
student


9 Alter Table

9.1 Rename Table

ALTER TABLE table_name RENAME TO new_table_name;


Example:

hive> alter table demo2 rename to new_demo2;
OK


10 Alter Partition

10.1 Add Partitions

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];

partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)


ALTER TABLE ADD PARTITION adds one or more partitions to a table. Quote the partition value when it is a string.

Note: adding a partition can fail with
FAILED: SemanticException table is not partitioned but partition spec exists

The cause is that the table was created without any partitions; the table must be created with PARTITIONED BY before partitions can be added.

Create the dept table:

hive>  create table dept(
>  deptno int,
> dname string,
> loc string
> )
> PARTITIONED BY (dt string)
>  ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
OK
Time taken: 0.953 seconds


Load data:

hive> load data local inpath '/home/hadoop/dept.txt' into table dept partition (dt='2018-08-08');
Loading data to table default.dept partition (dt=2018-08-08)
Partition default.dept{dt=2018-08-08} stats: [numFiles=1, numRows=0, totalSize=84, rawDataSize=0]
OK
Time taken: 5.147 seconds

Query the result:
hive> select * from dept;
OK
10      ACCOUNTING      NEW YORK        2018-08-08
20      RESEARCH        DALLAS  2018-08-08
30      SALES   CHICAGO 2018-08-08
40      OPERATIONS      BOSTON  2018-08-08
Time taken: 0.481 seconds, Fetched: 4 row(s)


Add a partition:

hive> ALTER TABLE dept ADD PARTITION (dt='2018-09-09');
OK


10.2 Drop Partitions

The official syntax:

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]


hive> ALTER TABLE dept DROP PARTITION (dt='2018-09-09');


10.3 Querying by partition

hive> select * from dept where dt='2018-08-08';
OK
10      ACCOUNTING      NEW YORK        2018-08-08
20      RESEARCH        DALLAS  2018-08-08
30      SALES   CHICAGO 2018-08-08
40      OPERATIONS      BOSTON  2018-08-08
Time taken: 2.323 seconds, Fetched: 4 row(s)


10.4 Show Partitions

hive> show partitions dept;
OK
dt=2018-08-08
dt=2018-09-09
Time taken: 0.385 seconds, Fetched: 2 row(s)