
RHive: Integrating R and Hive

2012-04-20 11:55

https://github.com/nexr/RHive/wiki/UserGuides

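RHive is an R package that integrates Hive into the R environment. With RHive you can write HQL (HiveQL) from within R, pass R objects to Hive, and run the computation in Hive. In RHive, small datasets are processed in R and large datasets are processed in Hive.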

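More and more enterprises collect massive amounts of detailed data. They need TB- or PB-scale storage and ways to discover knowledge in that data. Today, attention centers on R for statistical analysis, and people are increasingly familiar with R development. R, however, has drawbacks when processing massive data. Some people work around this with sampling, which is likely to lose information. Hadoop can process data at this scale, but data analysts often lack Hadoop development skills. Data analysts are, however, generally comfortable using SQL for data processing. RHive emerged from exactly this situation as a big-data solution that builds a bridge between R and Hive.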

RHive Architecture

API

RHive API for R

rhive.connect : connect to Hive

rhive.query : execute a Hive query from R

rhive.export : export only the R function to the cluster's Rserve

rhive.exportAll : export all R functions and R objects to the cluster's Rserve

rhive.close : close the Hive connection

rhive.list.table : get the list of Hive tables

rhive.desc.table : get information about a Hive table

rhive.load.table : retrieve table data from Hive into R

RHive UDF, UDAF for Hive

Processing data on HDFS with R functions

R : R is UDF. For every record, this function from Hive calls an exported R function.

RA : RA is UDAF. This function uses exported R and hive query to aggregate data.

unfold : unfold is UDTF. This function takes return data of R and unfolds them into several columns.

expand : expand is UDTF.
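The per-record call and the unfolding step described above can be imitated in plain R, without Hive, to see what shape of data each step produces. This is only a local analogy under stated assumptions: `emp`, `scoring_pair`, and the column names are hypothetical, and nothing here calls RHive or reflects how RHive implements `R` or `unfold` internally.

```r
# Local, plain-R analogy of a per-record UDF followed by unfolding.
# No Hive or RHive involved; all names below are hypothetical.
emp <- data.frame(empno = c(1, 2, 3), sal = c(1000, 2000, 3000))

# A function that is called once per record and returns several values.
scoring_pair <- function(sal) {
  c(sal * 1.1, sal * 1.2)
}

# One call per row, as a UDF would be invoked for every record.
per_record <- lapply(emp$sal, scoring_pair)

# "Unfold": spread each returned vector into separate columns.
unfolded <- do.call(rbind, per_record)
colnames(unfolded) <- c("score_low", "score_high")

result <- cbind(emp, unfolded)
print(result)
```

The point of the sketch is the shape change: one input row yields one vector, and unfolding turns that vector into several columns of the same row.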

Example

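# Connect to Hive; "hive-hostip" is a placeholder for the Hive server's host or IP.
rhive.connect("hive-hostip")

rhive.query("select * from emp")

# Export an R function and apply it to every record through the R UDF.
coefficient <- 1.1
scoring <- function(sal) {
  coefficient * sal
}
rhive.export('scoring')
rhive.query("select R('scoring',col_sal,0.0) from emp")

# Define the four functions of an aggregate for use through the RA UDAF.
hsum <- function(prev,sal) {
  c(prev[1] + sal[1])
}
hsum.partial <- function(agg_sal) {
  agg_sal
}
hsum.merge <- function(prev, agg_sal) {
  c(prev[1] + agg_sal[1])
}
hsum.terminate <- function(agg_sal) {
  agg_sal
}

# rserve-list is a placeholder kept from the original (not valid R as written).
rhive.exportAll('hsum',rserve-list)
rhive.query("select RA('hsum',col_sal) from emp group by empno")

# Describe the emp table and list its column names.
emp <- rhive.desc.table("emp")
colnames(emp)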
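To see how the four hsum functions fit together, the sketch below runs them locally in plain R, in the order Hive's UDAF model uses (iterate, partial, merge, terminate). It does not call RHive. The split of rows across two workers, the names `worker_a`, `worker_b`, and `iterate_all`, and the initial state of 0 are assumptions made for illustration; how RHive actually initializes `prev` is not stated in the text above.

```r
# The four functions from the example, unchanged in behavior.
hsum <- function(prev, sal) c(prev[1] + sal[1])
hsum.partial <- function(agg_sal) agg_sal
hsum.merge <- function(prev, agg_sal) c(prev[1] + agg_sal[1])
hsum.terminate <- function(agg_sal) agg_sal

# Hypothetical salaries of one group, split across two workers.
worker_a <- c(1000, 2000)
worker_b <- c(3000)

# Iterate: fold every row of one worker into a running state,
# then emit the partial result. Starting from 0 is an assumption.
iterate_all <- function(rows) {
  state <- 0
  for (sal in rows) state <- hsum(state, sal)
  hsum.partial(state)
}

partials <- list(iterate_all(worker_a), iterate_all(worker_b))

# Merge: combine the partial results, then finalize.
merged <- 0
for (p in partials) merged <- hsum.merge(merged, p)
total <- hsum.terminate(merged)
print(total)  # 6000
```

Because the merge step is a plain sum of partial sums, the result does not depend on how the rows are divided among workers, which is the property an aggregate needs to run in parallel.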
