Importing CSV data into Hive and repartitioning it with dynamic partitions

2017-12-07 15:38

Create the data table

hive> create database cus;
hive> use cus;
hive> create table telno_md5(
> phone string,
> md5 string )
>  ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;


Load the data

hive> load data local inpath '/home/etluser/data/' into table telno_md5;
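
A quick sanity check after the load can be useful. The following is only a sketch and was not part of the original session; it assumes nothing beyond the `telno_md5` table created above. Because the `inpath` here is a directory, every file in it is loaded, and `load data` copies the files as they are without parsing them, so a malformed line only becomes visible at query time (typically as a NULL column).

```sql
-- Row count after the load; compare it with the line count of the source files.
select count(*) from telno_md5;

-- Peek at a few rows to confirm the comma delimiter split the columns correctly.
select phone, md5 from telno_md5 limit 5;

-- Lines without a comma leave the second column empty, which shows up as NULL.
select count(*) from telno_md5 where md5 is null;
```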


Create the repartitioned table

hive> create table telno_md5_prt(
> phone string,
> md5 string )
> partitioned by (prefix string);
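
Before filling this table, it may help to check how many distinct values the intended partition key produces, since every distinct value becomes one partition directory. This is a sketch under the assumption that the `md5` column holds hexadecimal digests in a single letter case, in which case a two-character prefix yields at most 16 × 16 = 256 values.

```sql
-- Number of partitions a two-character prefix would create.
select count(distinct substr(md5, 1, 2)) as prefix_count
from telno_md5;

-- Rows per prefix, largest first, to spot skewed partitions.
select substr(md5, 1, 2) as prefix, count(*) as row_count
from telno_md5
group by substr(md5, 1, 2)
order by row_count desc
limit 20;
```

If `prefix_count` comes out far above the expected value, the source data probably contains malformed rows worth cleaning before the insert.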


Insert the data using dynamic partitioning

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.max.dynamic.partitions.pernode=100000;
hive> set hive.exec.max.dynamic.partitions=100000;
hive> set hive.exec.max.created.files=1000000000;

hive> insert into table telno_md5_prt
> partition (prefix)
> select phone,md5,substr(md5,1,2) as prefix
> from telno_md5;
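
To confirm what the insert produced and to see what the partitioning buys, something like the following could be run. It is a sketch, not part of the original session; the digest is only a placeholder value (the MD5 of the string 'hello'), not data from the original table.

```sql
-- List the partitions created by the dynamic-partition insert.
show partitions telno_md5_prt;

-- Look up one digest. Filtering on the partition column lets Hive read
-- only the matching partition directory instead of scanning the whole table.
select phone
from telno_md5_prt
where prefix = '5d'
  and md5 = '5d41402abc4b2a76b9719d911017c592';
```

The `prefix` predicate has to be supplied explicitly; a filter on `md5` alone would not restrict the query to a single partition.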


For the meaning of these parameters, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts
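
For convenience, here are the same settings again with a one-line summary of each, based on my reading of the Hive documentation; the linked page remains the authoritative description. The values are the ones used in this post, which are far above the documented defaults (100 partitions per node, 1000 partitions in total, 100000 created files).

```sql
-- Allow dynamic-partition inserts at all.
set hive.exec.dynamic.partition=true;
-- 'strict' requires at least one static partition column; 'nonstrict'
-- lets every partition column be dynamic, as in the insert above.
set hive.exec.dynamic.partition.mode=nonstrict;
-- Upper limit on dynamic partitions created by a single mapper or reducer.
set hive.exec.max.dynamic.partitions.pernode=100000;
-- Upper limit on dynamic partitions created by one statement in total.
set hive.exec.max.dynamic.partitions=100000;
-- Upper limit on files created by all mappers and reducers of one job.
set hive.exec.max.created.files=1000000000;
```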

Comparison with a bucketed table

create table telno_md5_bucketed(
phone string,
md5 string )
clustered by(md5) into 1024 buckets;

insert overwrite table telno_md5_bucketed
select phone,md5 from telno_md5;
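
One caveat: on Hive releases before 2.0, an insert follows the table's bucket definition only when `hive.enforce.bucketing` is enabled; from 2.0 onward that property was removed and bucketing is always enforced. The post does not say which Hive version was used or whether the setting was on, so the following is a sketch under that assumption rather than a record of what was actually run.

```sql
-- Only needed on Hive versions before 2.0; drop this line on later releases.
set hive.enforce.bucketing=true;

insert overwrite table telno_md5_bucketed
select phone, md5 from telno_md5;

-- Sample a single bucket out of the 1024 to check the layout.
select phone, md5
from telno_md5_bucketed tablesample(bucket 1 out of 1024 on md5)
limit 5;
```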


Comparison of results

Split method          Files actually created   Execution time   Join query time
dynamic partitions    998                      27m36s           16m23s
bucketed table        668                      16m2s            6m5s
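
The number of files written by a dynamic-partition insert depends on how many tasks write into each partition. A common way to reduce it, which was not part of the test above, is to add `distribute by` on the partition key so that all rows of one partition go to the same reducer. This is a sketch only, with no claim about how it would change the timings; note that it uses `insert overwrite`, so it replaces whatever the earlier insert wrote.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- distribute by sends all rows with the same prefix to the same reducer,
-- so each partition is written by one task rather than by every mapper.
insert overwrite table telno_md5_prt
partition (prefix)
select phone, md5, substr(md5, 1, 2) as prefix
from telno_md5
distribute by prefix;
```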