
[Big Data Series] Apache Hive Official Documentation (Translated)

2017-08-03 15:39

GettingStarted

Created by Confluence Administrator, last modified by Lefty Leverenz on Jun 15, 2017.

Table of Contents

Installation and Configuration
Running HiveServer2 and Beeline
Requirements
Installing Hive from a Stable Release
Building Hive from Source
Compile Hive on master
Compile Hive on branch-1
Compile Hive Prior to 0.13 on Hadoop 0.20
Compile Hive Prior to 0.13 on Hadoop 0.23
Running Hive
Running Hive CLI
Running HiveServer2 and Beeline
Running HCatalog
Running WebHCat (Templeton)
Configuration Management Overview
Runtime Configuration
Hive, Map-Reduce and Local-Mode
Hive Logging
HiveServer2 Logs
Audit Logs
Perf Logger
DDL Operations
Creating Hive Tables
Browsing through Tables
Altering and Dropping Tables
Metadata Store
DML Operations
SQL Operations
Example Queries
SELECTS and FILTERS
GROUP BY
JOIN
MULTITABLE INSERT
STREAMING
Simple Example Use Cases
MovieLens User Ratings
Apache Weblog Data

Installation and Configuration

You can install a stable release of Hive by downloading a tarball, or you can download the source code and build Hive from that.

Running HiveServer2 and Beeline

Requirements

Java 1.7
Note: Hive versions 1.2 onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see HIVE-8607).
Hadoop 2.x (preferred), 1.x (not supported by Hive 2.0.0 onward). Hive versions up to 0.13 also supported Hadoop 0.20.x and 0.23.x.
Hive is commonly used in production Linux and Windows environments. Mac is a commonly used development environment. The instructions in this document are applicable to Linux and Mac. Using it on Windows would require slightly different steps.
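Before installing, a quick way to confirm the Java and Hadoop versions already on the machine (assuming both are on the PATH) is:
$ java -version
$ hadoop version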

Installing Hive from a Stable Release

Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).

Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z (where x.y.z is the release number):
$ tar -xzvf hive-x.y.z.tar.gz
Set the environment variable HIVE_HOME to point to the installation directory:
$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)
Finally, add $HIVE_HOME/bin to your PATH:
$ export PATH=$HIVE_HOME/bin:$PATH

Building Hive from Source

The Hive GIT repository for the most recent Hive code is located here:
git clone https://git-wip-us.apache.org/repos/asf/hive.git
(the master branch).

All release versions are in branches named "branch-0.#" or "branch-1.#" or the upcoming "branch-2.#", with the exception of release 0.8.1 which is in "branch-0.8-r2". Any branches with other names are feature branches for works-in-progress. See Understanding Hive Branches for details.

As of 0.13, Hive is built using Apache Maven.
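As an illustration of the branch layout described above, a release branch such as branch-1 can be checked out after cloning (the branch name here is only an example):
$ cd hive
$ git checkout branch-1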

Compile Hive on master

To build the current Hive code from the master branch:
$ git clone https://git-wip-us.apache.org/repos/asf/hive.git
$ cd hive
$ mvn clean package -Pdist     # build the distribution package
$ cd packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin
$ ls
LICENSE
NOTICE
README.txt
RELEASE_NOTES.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)
hcatalog/ (hcatalog installation)
scripts/ (upgrade scripts for hive-metastore)
Here, {version} refers to the current Hive version.

If building Hive source using Maven (mvn), we will refer to the directory "/packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin" as <install-dir> for the rest of the page.

Compile Hive on branch-1

In branch-1, Hive supports both Hadoop 1.x and 2.x. You will need to specify which version of Hadoop to build against via a Maven profile. To build against Hadoop 1.x use the profile hadoop-1; for Hadoop 2.x use hadoop-2. For example, to build against Hadoop 1.x, the above mvn command becomes:
$ mvn clean package -Phadoop-1,dist
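Likewise, to build branch-1 against Hadoop 2.x, the same command is used with the hadoop-2 profile:
$ mvn clean package -Phadoop-2,dist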

Compile Hive Prior to 0.13 on Hadoop 0.20

Prior to Hive 0.13, Hive was built using Apache Ant. To build an older version of Hive on Hadoop 0.20:
$ svn co http://svn.apache.org/repos/asf/hive/branches/branch-{version} hive
$ cd hive
$ ant clean package
$ cd build/dist
# ls
LICENSE
NOTICE
README.txt
RELEASE_NOTES.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)
hcatalog/ (hcatalog installation)
scripts/ (upgrade scripts for hive-metastore)
If using Ant, we will refer to the directory "build/dist" as <install-dir>.

Compile Hive Prior to 0.13 on Hadoop 0.23

To build Hive in Ant against Hadoop 0.23, 2.0.0, or other versions, build with the appropriate flag; some examples are below:
$ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
$ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23

Running Hive

Hive uses Hadoop, so you must have Hadoop in your path, OR:
export HADOOP_HOME=<hadoop-install-dir>
In addition, you must use the HDFS commands below to create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w before you can create a table in Hive.
$ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
You may find it useful, though it's not necessary, to set HIVE_HOME:
$ export HIVE_HOME=<hive-install-dir>

Running Hive CLI

To use the Hive command line interface (CLI) from the shell:
$ $HIVE_HOME/bin/hive

Running HiveServer2 and Beeline

Starting from Hive 2.1, we need to run the schematool command below as an initialization step. For example, we can use "derby" as the db type.
$ $HIVE_HOME/bin/schematool -dbType <db type> -initSchema
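For example, with Derby as the metastore database type, the initialization command becomes:
$ $HIVE_HOME/bin/schematool -dbType derby -initSchema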
HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline. HiveCLI is now deprecated in favor of Beeline, as it lacks the multi-user, security, and other capabilities of HiveServer2. To run HiveServer2 and Beeline from the shell:
$ $HIVE_HOME/bin/hiveserver2

$ $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT
Beeline is started with the JDBC URL of the HiveServer2, which depends on the address and port where HiveServer2 was started. By default, it will be (localhost:10000), so the address will look like jdbc:hive2://localhost:10000.

Or, to start Beeline and HiveServer2 in the same process for testing purposes, for a similar user experience to HiveCLI:
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://

Running HCatalog

To run the HCatalog server from the shell in Hive release 0.11.0 and later:
$ $HIVE_HOME/hcatalog/sbin/hcat_server.sh
To use the HCatalog command line interface (CLI) in Hive release 0.11.0 and later:
$ $HIVE_HOME/hcatalog/bin/hcat
For more information, see HCatalog Installation from Tarball and HCatalog CLI in the HCatalog manual.

Running WebHCat (Templeton)

To run the WebHCat server from the shell in Hive release 0.11.0 and later:
$ $HIVE_HOME/hcatalog/sbin/webhcat_server.sh
For more information, see WebHCat Installation in the WebHCat manual.

Configuration Management Overview

Hive by default gets its configuration from <install-dir>/conf/hive-default.xml.

The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.

Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml.

Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties.

Hive configuration is an overlay on top of Hadoop – it inherits the Hadoop configuration variables by default.

Hive configuration can be manipulated by:

Editing hive-site.xml and defining any desired variables (including Hadoop variables) in it
Using the set command (see next section)
Invoking Hive (deprecated), Beeline or HiveServer2 using the syntax:
$ bin/hive --hiveconf x1=y1 --hiveconf x2=y2  //this sets the variables x1 and x2 to y1 and y2 respectively
$ bin/hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2  //this sets server-side variables x1 and x2 to y1 and y2 respectively
$ bin/beeline --hiveconf x1=y1 --hiveconf x2=y2  //this sets client-side variables x1 and x2 to y1 and y2 respectively

Setting the HIVE_OPTS environment variable to "--hiveconf x1=y1 --hiveconf x2=y2" does the same as above.

Runtime Configuration

Hive queries are executed using map-reduce queries and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables.

The HiveCLI (deprecated) and Beeline command 'SET' can be used to set any Hadoop (or Hive) configuration variable. For example:
beeline> SET mapred.job.tracker=myhost.mycompany.com:50030;
beeline> SET -v;
The latter shows all the current settings. Without the -v option, only the variables that differ from the base Hadoop configuration are displayed.

Hive, Map-Reduce and Local-Mode

The Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the Map-Reduce cluster indicated by the variable:

mapred.job.tracker

While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small data sets – in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow processing larger data sets.

Starting with release 0.7, Hive fully supports local mode execution. To enable this, the user can enable the following option:
hive> SET mapreduce.framework.name=local;
In addition, mapred.local.dir should point to a path that's valid on the local machine (for example /tmp/<username>/mapred/local). (Otherwise, the user will get an exception allocating local disk space.)

Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max, and hive.exec.mode.local.auto.tasks.max:
hive> SET hive.exec.mode.local.auto=false;
Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:

The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max (128MB by default)
The total number of map-tasks is less than hive.exec.mode.local.auto.tasks.max (4 by default)
The total number of reduce tasks required is 1 or 0

So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.

Note that there may be differences in the runtime environment of Hadoop server nodes and the machine running the Hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the Hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option hive.mapred.local.mem. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.
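For illustration, automatic local mode could be turned on and its thresholds set explicitly from the CLI/Beeline prompt; the values below simply restate the documented defaults (128 MB expressed in bytes, and 4 tasks):
hive> SET hive.exec.mode.local.auto=true;
hive> SET hive.exec.mode.local.auto.inputbytes.max=134217728;
hive> SET hive.exec.mode.local.auto.tasks.max=4;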

Hive Logging

Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The default logging level is WARN for Hive releases prior to 0.13.0. Starting with Hive 0.13.0, the default logging level is INFO.

The logs are stored in the directory /tmp/<user.name>:
/tmp/<user.name>/hive.log
Note: In local mode, prior to Hive 0.13.0 the log file name was ".log" instead of "hive.log". This bug was fixed in release 0.13.0 (see HIVE-5528 and HIVE-5676).

To configure a different log location, set hive.log.dir in $HIVE_HOME/conf/hive-log4j.properties. Make sure the directory has the sticky bit set (chmod 1777 <dir>).
hive.log.dir=<other_location>
If the user wishes, the logs can be emitted to the console by adding the arguments shown below:
bin/hive --hiveconf hive.root.logger=INFO,console  //for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,console
Alternatively, the user can change the logging level only by using:
bin/hive --hiveconf hive.root.logger=INFO,DRFA //for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,DRFA
Another option for logging is TimeBasedRollingPolicy (applicable for Hive 0.15.0 and above, HIVE-9001) by providing the DAILY option as shown below:
bin/hive --hiveconf hive.root.logger=INFO,DAILY //for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,DAILY
Note that setting hive.root.logger via the 'set' command does not change logging properties, since they are determined at initialization time.

Hive also stores query logs on a per Hive session basis in /tmp/<user.name>/, but this can be configured in hive-site.xml with the hive.querylog.location property.

Logging during Hive execution on a Hadoop cluster is controlled by Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.

When using local mode (using mapreduce.framework.name=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting with release 0.6, Hive uses hive-exec-log4j.properties (falling back to hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server for example). Execution logs are invaluable for debugging run-time errors.

For information about WebHCat errors and logging, see Error Codes and Responses and Log Files in the WebHCat manual.

Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to hive-dev@hadoop.apache.org.

From Hive 2.1.0 onwards (with HIVE-13027), Hive uses Log4j2's asynchronous logger by default. Setting hive.async.log.enabled to false will disable asynchronous logging and fall back to synchronous logging. Asynchronous logging can give a significant performance improvement, as logging is handled in a separate thread that uses the LMAX disruptor queue for buffering log messages. Refer to https://logging.apache.org/log4j/2.x/manual/async.html for benefits and drawbacks.
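A minimal sketch of how that switch might look in hive-site.xml (assuming the property is set there like any other Hive configuration variable):
<property>
  <name>hive.async.log.enabled</name>
  <value>false</value>
</property>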

HiveServer2 Logs

HiveServer2 operation logs are available to clients starting in Hive 0.14. See HiveServer2 Logging for configuration.

Audit Logs

Audit logs are logged from the Hive metastore server for every metastore API invocation.

An audit log has the function and some of the relevant function arguments logged in the metastore log file. It is logged at the INFO level of log4j, so you need to make sure that logging at the INFO level is enabled (see HIVE-3505). The name of the log entry is "HiveMetaStore.audit".

Audit logs were added in Hive 0.7 for secure client connections (HIVE-1948) and in Hive 0.10 for non-secure connections (HIVE-3277; also see HIVE-2797).

Perf Logger

In order to obtain the performance metrics via the PerfLogger, you need to set DEBUG level logging for the PerfLogger class (HIVE-12675). This can be achieved by setting the following in the log4j properties file:
log4j.logger.org.apache.hadoop.hive.ql.log.PerfLogger=DEBUG
If the logger level has already been set to DEBUG at root via hive.root.logger, the above setting is not required to see the performance logs.

DDL Operations

The Hive DDL operations are documented in Hive Data Definition Language.

Creating Hive Tables

  hive> CREATE TABLE pokes (foo INT, bar STRING);
creates a table called pokes with two columns, the first being an integer and the other a string.
  hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into.

By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a).

Browsing through Tables

  hive> SHOW TABLES;
lists all the tables.
  hive> SHOW TABLES '.*s';
lists all the tables that end with 's'. The pattern matching follows Java regular expressions. Check out this link for documentation: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html.
hive> DESCRIBE invites;
shows the list of columns.

Altering and Dropping Tables

Table names can be changed and columns can be added or replaced:
  hive> ALTER TABLE events RENAME TO 3koobecaf;
  hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
  hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
  hive> ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');
Note that REPLACE COLUMNS replaces all existing columns and only changes the table's schema, not the data. The table must use a native SerDe. REPLACE COLUMNS can also be used to drop columns from the table's schema:
  hive> ALTER TABLE invites REPLACE COLUMNS (foo INT COMMENT 'only keep the first column');
Dropping tables:
  hive> DROP TABLE pokes;

Metadata Store

Metadata is in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default this location is ./metastore_db (see conf/hive-default.xml).

Right now, in the default configuration, this metadata can only be seen by one user at a time.

Metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName. Refer to JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in the JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.

In the future, the metastore itself can be a standalone server.

If you want to run the metastore as a network server so it can be accessed from multiple nodes, see Hive Using Derby in Server Mode.
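As an illustration, those two variables could be overridden in hive-site.xml; the Derby URL and driver class below are assumed example values for an embedded database at a custom path, not settings prescribed by this guide:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/var/lib/hive/metastore_db;create=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>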

DML Operations

The Hive DML operations are documented in Hive Data Manipulation Language.

Loading data from flat files into Hive:
  hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loads a file that contains two columns separated by ctrl-a into the pokes table. 'LOCAL' signifies that the input file is on the local file system. If 'LOCAL' is omitted then it looks for the file in HDFS.

The keyword 'OVERWRITE' signifies that existing data in the table is deleted. If the 'OVERWRITE' keyword is omitted, data files are appended to existing data sets.

NOTES:
NO verification of data against the schema is performed by the load command.
If the file is in hdfs, it is moved into the Hive-controlled file system namespace. The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.
  hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
  hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.
  hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
The above command will load data from an HDFS file/directory to the table.

Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.

SQL Operations

The Hive query operations are documented in Select.

Example Queries

Some example queries are shown below. They are available in build/dist/examples/queries. More are available in the Hive sources at ql/src/test/queries/positive.

SELECTS and FILTERS

  hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
selects column 'foo' from all rows of partition ds=2008-08-15 of the invites table. The results are not stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a Hive table, local directory or HDFS directory) is optional.
  hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
selects all rows from partition ds=2008-08-15 of the invites table into an HDFS directory. The result data is in files (depending on the number of mappers) in that directory.

NOTE: partition columns, if any, are selected by the use of *. They can also be specified in the projection clauses. Partitioned tables must always have a partition selected in the WHERE clause of the statement.
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
selects all rows from the pokes table into a local directory.
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
selects the sum of a column. The avg, min, or max can also be used. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

GROUP BY

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

JOIN

  hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

MULTITABLE INSERT

  FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

STREAMING

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
This streams the data in the map phase through the script /bin/cat (like Hadoop streaming). Similarly – streaming can be used on the reduce side (please see the Hive Tutorial for examples).

Simple Example Use Cases

MovieLens User Ratings

First, create a table with tab-delimited text file format:
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Then, download the data files from MovieLens 100k on the GroupLens datasets page (which also has a README.txt file and an index of unzipped files):
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

or:
curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip

Note: If the link to GroupLens datasets does not work, please report it on HIVE-5341 or send a message to the user@hive.apache.org mailing list.

Unzip the data files:
unzip ml-100k.zip
And load u.data into the table that was just created:
LOAD DATA LOCAL INPATH '<path>/u.data'
OVERWRITE INTO TABLE u_data;
Count the number of rows in table u_data:
SELECT COUNT(*) FROM u_data;
Note that for older versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

Now we can do some complex data analysis on the table u_data.

Create weekday_mapper.py:
import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])
Use the mapper script:
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).

Apache Weblog Data

The format of Apache weblog is customizable, while most webmasters use the default. For the default Apache weblog, we can create a table with the following command. More about RegexSerDe can be found in HIVE-662 and HIVE-1719.
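The CREATE TABLE statement itself is missing from this post; what follows is a hedged reconstruction of the kind of RegexSerDe definition the upstream GettingStarted guide shows for the default (combined) log format. The table name, columns, and input.regex are illustrative and should be verified against the official page before use:
-- illustrative RegexSerDe table for the default Apache combined log format; verify the regex against the official guide
CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE;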