
Importing CSV data into Hive and repartitioning it with dynamic partitions

2017-12-07 15:38

Create the source table

hive> create database cus;
hive> use cus;
hive> create table telno_md5(
> phone string,
> md5 string )
>  ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;

Load the data

hive> load data local inpath '/home/etluser/data/' into table telno_md5;
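
A quick sanity check after the load can catch delimiter or path problems early. The following is a minimal sketch that uses only the table defined above; the limit value is arbitrary.

```sql
-- Row count should match the total number of lines in the loaded CSV files.
select count(*) from telno_md5;

-- Spot-check a few rows; a NULL md5 usually means a line had no ',' delimiter.
select phone, md5 from telno_md5 limit 5;

-- Count rows where the second field could not be parsed.
select count(*) from telno_md5 where md5 is null;
```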

Create the repartitioned table

hive> create table telno_md5_prt(
> phone string,
> md5 string )
> partitioned by (prefix string);

Insert the data using dynamic partitioning

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.max.dynamic.partitions.pernode=100000;
hive> set hive.exec.max.dynamic.partitions=100000;
hive> set hive.exec.max.created.files=1000000000;

hive> insert into table telno_md5_prt
> partition (prefix)
> select phone,md5,substr(md5,1,2) as prefix
> from telno_md5;

For the meaning of these parameters, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts
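
Once the insert finishes, the partitions it created can be listed and a query can be restricted to a single partition. This is a sketch, assuming the md5 column holds lowercase hexadecimal digests, in which case a two-character prefix yields at most 256 partitions. The digest literal below is only a placeholder value.

```sql
-- List the partitions created by the dynamic-partition insert.
show partitions telno_md5_prt;

-- Row count per partition, to see how evenly the prefix spreads the data.
select prefix, count(*) as cnt
from telno_md5_prt
group by prefix;

-- Filtering on the partition column lets Hive read a single partition directory.
-- The digest and its prefix 'e1' are placeholders for illustration.
select phone
from telno_md5_prt
where prefix = 'e1'
  and md5 = 'e10adc3949ba59abbe56e057f20f883e';
```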

Comparison with a bucketed table

create table telno_md5_bucketed(
phone string,
md5 string )
clustered by(md5) into 1024 buckets;

insert overwrite table telno_md5_bucketed
select phone,md5 from telno_md5;
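
On Hive releases before 2.0, bucketing is not enforced on insert unless hive.enforce.bucketing is enabled; from 2.0 onward the property was removed and enforcement is always on. The source does not state which Hive version was used, so the sketch below assumes an older release. It sets the property and then inspects the result.

```sql
-- Needed only on Hive releases before 2.0; run it before the insert overwrite above.
set hive.enforce.bucketing=true;

-- Table metadata includes the declared bucket count and the bucketing column.
describe formatted telno_md5_bucketed;

-- Read a single bucket instead of scanning the whole table.
select phone, md5
from telno_md5_bucketed tablesample(bucket 1 out of 1024 on md5) t
limit 5;
```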

Comparison of execution results


Split method	Actual number of files	Execution time	Join query time
dynamic partitions	998	27m36s	16m23s
bucketed table	668	16m2s	6m5s
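
The source does not say how the file counts were obtained. One way to approximate them from inside Hive, shown here as a sketch and not as the author's method, is the INPUT__FILE__NAME virtual column. Note that it only sees files containing at least one row and that it scans the whole table.

```sql
-- Number of distinct data files that hold rows in the partitioned table.
select count(distinct INPUT__FILE__NAME) from telno_md5_prt;

-- The same count for the bucketed table.
select count(distinct INPUT__FILE__NAME) from telno_md5_bucketed;
```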


