
01 - Creating Tables in Hive

2016-06-02 10:48
Note: the environment for these experiments is Apache hadoop-2.2.0, zookeeper-3.4.5, MySQL (Server version: 5.1.73) as the metastore database, and apache-hive-0.9.0-bin, all stock Apache, not CDH or any other distribution. The cluster has 3 machines: one master node (hadoop1) and three slave nodes (hadoop1, hadoop2, hadoop3), so hadoop1 doubles as both master and slave. ZooKeeper is installed on all three nodes. All the data in these experiments was designed and added by me.

Start ZooKeeper: zkServer.sh start

Start HDFS: start-dfs.sh

Start YARN: start-yarn.sh

Start MySQL: service mysqld start

Check the running processes with jps:

[root@hadoop1 host]# jps
4388 RunJar
3435 NameNode
3693 SecondaryNameNode
3550 DataNode
4576 Jps
3843 ResourceManager
2076 QuorumPeerMain
3943 NodeManager


6. Enter the Hive command line:

[root@hadoop1 host]# hive
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/usr/host/hive/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201606011756_819925906.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/host/hadoop/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/host/hive/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hive>


These warnings probably appear because this Hive version is quite old; they can safely be ignored since they are warnings, not errors.

7. Enter the default database

hive> show databases;
OK
default
Time taken: 4.34 seconds
hive> use default;
OK
Time taken: 0.057 seconds
hive> show tables;
OK
testtable
Time taken: 0.32 seconds
There is currently one table, testtable. Let's drop it first, and then see how to create it.


8. Drop the testtable table

hive> drop table testtable;
OK
Time taken: 4.373 seconds


9. Create the testtable table. Note that every statement must end with a semicolon ";", or it will fail.

hive> create table testtable(name string comment 'name value',addr string comment 'addr value');
OK
Time taken: 0.262 seconds


10. Look at the table's description:

hive> desc testtable;
OK
name    string  name value
addr    string  addr value
Time taken: 0.215 seconds
#The table has two fields, name and addr, both of type string; their comments are "name value" and "addr value".
#You can also view the extended description:
hive> desc formatted testtable;
OK
# col_name              data_type               comment

name                    string                  name value
addr                    string                  addr value

# Detailed Table Information
Database:               default
Owner:                  root
CreateTime:             Wed Jun 01 18:20:24 PDT 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://hadoop1:9000/user/hive/warehouse/testtable
Table Type:             MANAGED_TABLE
Table Parameters:
transient_lastDdlTime   1464830424

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
serialization.format    1
Time taken: 0.249 seconds


Note two things in the output above: Location: hdfs://hadoop1:9000/user/hive/warehouse/testtable

Table Type: MANAGED_TABLE

Location is where the table's data lives in HDFS, and Table Type shows this is a managed (internal) table. Besides managed tables, Hive also has external tables.

11. Let's create an external table and see what it looks like.

hive> create external table testtext(name string comment 'name value',addr string comment 'addr value');
OK
Time taken: 0.08 seconds
hive>


12. Run desc formatted testtext;

hive> desc formatted testtext;
OK
# col_name              data_type               comment

name                    string                  name value
addr                    string                  addr value

# Detailed Table Information
Database:               default
Owner:                  root
CreateTime:             Wed Jun 01 18:34:19 PDT 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://hadoop1:9000/user/hive/warehouse/testtext
Table Type:             EXTERNAL_TABLE
Table Parameters:
EXTERNAL                TRUE
transient_lastDdlTime   1464831259

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
serialization.format    1
Time taken: 0.2 seconds
hive>


With the external keyword added, Table Type changes to EXTERNAL_TABLE. The practical difference: dropping a managed table also deletes its data in HDFS, while dropping an external table removes only the metadata and leaves the data in place.
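To see that difference in action, here is a minimal sketch (the table names managed_demo and external_demo are hypothetical, not part of the steps above):

-- scratch tables, only to illustrate drop semantics
create table managed_demo(id int);
create external table external_demo(id int) location '/tmp/external_demo';
drop table managed_demo;    -- the table's warehouse directory is deleted along with it
drop table external_demo;   -- only the metadata goes; /tmp/external_demo stays in HDFS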

13. Next, let's drop the testtext table and recreate it, this time specifying the delimiter between fields and the delimiter between rows.

hive> drop table testtext;
OK
Time taken: 0.19 seconds
hive> create table if not exists testtext(
> name string comment 'name value',
> addr string comment 'addr value')
> row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;
OK
Time taken: 0.059 seconds


Compared with the earlier create statement, this one adds row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;. It means fields within a row are separated by tab characters and rows are separated by newlines. format delimited declares a delimited row format, covering both the field delimiter and the line delimiter, and terminated by names the terminator character for each.

Let's look at the detailed description again:

hive> desc formatted testtext;
OK
# col_name              data_type               comment

name                    string                  name value
addr                    string                  addr value

# Detailed Table Information
Database:               default
Owner:                  root
CreateTime:             Wed Jun 01 18:41:03 PDT 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://hadoop1:9000/user/hive/warehouse/testtext
Table Type:             MANAGED_TABLE
Table Parameters:
transient_lastDdlTime   1464831663

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
field.delim             \t
line.delim              \n
serialization.format    \t
Time taken: 0.314 seconds
hive>


The output says this is a managed table, since we didn't add the external keyword. Also look at Storage Desc Params at the end: field.delim is \t, line.delim is \n, and serialization.format is \t. Simple enough.

14. Now we have two tables, testtable and testtext. Let's load some data into them. I created a file named data in the local /usr/host directory with the content below. Note: the two columns in the file are separated by a tab:

[root@hadoop1 host]# vi data
test    table
test    table1


15. Load the local /usr/host/data file into testtable.

hive> load data local inpath '/usr/host/data' overwrite into table testtable;
Copying data from file:/usr/host/data
Copying file: file:/usr/host/data
Loading data to table default.testtable
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /user/hive/warehouse/testtable
OK
Time taken: 0.981 seconds
hive> select * from testtable;
OK
test    table   NULL
test    table1  NULL
Time taken: 0.328 seconds
hive>


Why does select * from testtable show a NULL at the end of each row? When testtable was created, no field delimiter was specified, so Hive falls back to its default delimiter \001 (Ctrl-A). Our file uses tabs, so Hive reads the whole line "test	table" as the name field, and the addr field gets no value at all, which Hive displays as NULL.
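If we wanted both columns to parse cleanly, we could declare a tab delimiter at create time. A minimal sketch, using a hypothetical testtable2 so the original table is kept for the next step:

create table testtable2(name string, addr string)
row format delimited fields terminated by '\t';
load data local inpath '/usr/host/data' overwrite into table testtable2;
-- select * from testtable2; would now show name and addr as two clean columns, no NULLs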

Now load the data once more and see what happens:

hive> load data local inpath '/usr/host/data'  into table testtable;
Copying data from file:/usr/host/data
Copying file: file:/usr/host/data
Loading data to table default.testtable
OK
Time taken: 0.215 seconds
hive> select * from testtable;
OK
test    table   NULL
test    table1  NULL
test    table   NULL
test    table1  NULL
Time taken: 0.159 seconds
hive>


See, the data was appended. Note the difference: overwrite replaces the table's existing contents, while plain into (without overwrite) appends.
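Side by side, the two variants we just ran:

load data local inpath '/usr/host/data' overwrite into table testtable;  -- replaces the table's contents
load data local inpath '/usr/host/data' into table testtable;            -- appends to them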

16. Now let's create another table, xielaoshi.

hive> drop table if exists xielaoshi;
OK
Time taken: 0.042 seconds
hive> create external table if not exists xielaoshi(
> name string,
> salary float,
> meinv array<string>,
> haoche map<string,float>,
> haoza struct<street:string,city:string,state:string, zip:int>)
> row format delimited fields terminated by '\t'
> collection items terminated by ','
> map keys terminated by ':'
> lines terminated by '\n'
> stored as textfile
> location '/data/';
OK
Time taken: 0.137 seconds
hive>


This table has five fields: name, salary, meinv, haoche, and haoza. meinv is of array type (array<string>), haoche is a map of key-value pairs (map<string,float>), and haoza is a struct.

Next question: what is stored as textfile? It simply specifies the file format. Hive (in this version) has three main file formats: TEXTFILE, SEQUENCEFILE (the serialized file format familiar from Hadoop), and RCFILE.
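As a sketch of the syntax (the table name testtext_seq is hypothetical, not used elsewhere in this article), the same two columns could be stored as a SequenceFile instead:

create table testtext_seq(name string, addr string)
row format delimited fields terminated by '\t'
stored as sequencefile;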

17. Now look at the description:

hive> desc formatted xielaoshi;
OK
# col_name              data_type               comment

name                    string                  None
salary                  float                   None
meinv                   array<string>           None
haoche                  map<string,float>       None
haoza                   struct<street:string,city:string,state:string,zip:int>  None

# Detailed Table Information
Database:               default
Owner:                  root
CreateTime:             Wed Jun 01 19:15:27 PDT 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://hadoop1:9000/data
Table Type:             EXTERNAL_TABLE
Table Parameters:
EXTERNAL                TRUE
transient_lastDdlTime   1464833727

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
colelction.delim        ,
field.delim             \t
line.delim              \n
mapkey.delim            :
serialization.format    \t
Time taken: 0.163 seconds
hive>


See that Location line? It points at hdfs://hadoop1:9000/data, the directory we specified with location '/data/'. Why specify a location at all? It is a characteristic of external tables: the table reads whatever data already sits in that directory, as we'll see shortly.

18. Look at the data in the xielaoshi table:

hive> select * from xielaoshi;
OK
wang    123.0   ["a1","a2","a3"]    {"k1":1.0,"k2":2.0,"k3":3.0}    {"street":"s1","city":"s2","state":"s3","zip":4}
liu 456.0   ["a4","a5","a6"]    {"k4":4.0,"k5":5.0,"k6":6.0}    {"street":"s4","city":"s5","state":"s6","zip":6}
zhang   789.0   ["a7","a8","a9"]    {"k7":7.0,"k8":8.0,"k9":9.0}    {"street":"s7","city":"s8","state":"s9","zip":9}
Time taken: 0.183 seconds
hive>




If you are new to this, you may wonder where that data came from.

It takes just two steps:

[root@hadoop1 host]# vi xielaoshi
wang    123     a1,a2,a3        k1:1,k2:2,k3:3  s1,s2,s3,4
liu     456     a4,a5,a6        k4:4,k5:5,k6:6  s4,s5,s6,6
zhang   789     a7,a8,a9        k7:7,k8:8,k9:9  s7,s8,s9,9

[root@hadoop1 host]# hadoop fs -put xielaoshi /data/
[root@hadoop1 host]# hadoop fs -text /data/xielaoshi
wang    123 a1,a2,a3    k1:1,k2:2,k3:3  s1,s2,s3,4
liu 456 a4,a5,a6    k4:4,k5:5,k6:6  s4,s5,s6,6
zhang   789 a7,a8,a9    k7:7,k8:8,k9:9  s7,s8,s9,9


Done. All the data in the rest of these experiments is produced the same way: edit a file on Linux, then upload it to HDFS with hadoop fs -put [local file] [target directory].

19. What is Hive for? Besides storing data, querying it, of course, so let's play with some queries.

Here's the code:

hive> select meinv[1] from xielaoshi;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
16/06/01 19:34:48 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/06/01 19:34:48 INFO Configuration.deprecation: mapred.system.dir is deprecated. Instead, use mapreduce.jobtracker.system.dir
16/06/01 19:34:48 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/root/root_20160601193434_f64326c5-c901-47b8-b0b6-163aa3ccd4ce.log
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/host/hadoop/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/host/hive/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-01 19:35:10,267 null map = 0%,  reduce = 0%
2016-06-01 19:35:24,814 null map = 100%,  reduce = 0%, Cumulative CPU 0.8 sec
2016-06-01 19:35:25,943 null map = 100%,  reduce = 0%, Cumulative CPU 0.8 sec
2016-06-01 19:35:27,205 null map = 100%,  reduce = 0%, Cumulative CPU 0.8 sec
MapReduce Total cumulative CPU time: 800 msec
Ended Job = job_1464828076391_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
a2
a5
a8
Time taken: 45.655 seconds


Whew, that long wall of output is mostly warning messages; what actually ran was a MapReduce job. Note Total MapReduce jobs = 1 and Hadoop job information for null: number of mappers: 1; number of reducers: 0. Zero reducers is nothing strange: a plain column projection needs only the map phase.
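For contrast, an aggregate query does need a reduce phase. A quick sketch (not run above):

select count(*) from xielaoshi;   -- launches a reduce stage to compute the final count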

Next:

hive> select haoche["k2"] from xielaoshi;
Total MapReduce jobs = 1
Launching Job 1 out of 1
(warnings omitted)
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-01 19:38:21,377 null map = 0%,  reduce = 0%
2016-06-01 19:38:32,483 null map = 100%,  reduce = 0%, Cumulative CPU 0.71 sec
2016-06-01 19:38:33,560 null map = 100%,  reduce = 0%, Cumulative CPU 0.71 sec
MapReduce Total cumulative CPU time: 710 msec
Ended Job = job_1464828076391_0002
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
2.0
NULL
NULL
Time taken: 31.13 seconds


Why the two NULLs at the end? Only the first row's haoche map contains the key "k2"; the other two rows only have keys k4 through k9, so the lookup returns NULL for them.

And again:

hive> select haoza.city from xielaoshi;
Total MapReduce jobs = 1
Launching Job 1 out of 1
(warnings omitted)
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-01 19:41:14,958 null map = 0%,  reduce = 0%
2016-06-01 19:41:25,138 null map = 100%,  reduce = 0%, Cumulative CPU 0.8 sec
2016-06-01 19:41:26,255 null map = 100%,  reduce = 0%, Cumulative CPU 0.8 sec
MapReduce Total cumulative CPU time: 800 msec
Ended Job = job_1464828076391_0003
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
s2
s5
s8
Time taken: 28.25 seconds
hive>


A bit tired now, time for a break. If you've read this far and want to learn more or get in touch, follow my WeChat public account: 五十年后.



Thanks for reading!