您的位置：首页 > 运维架构

hadoop&nbsp;Streaming之aggregate

2013-02-25 18:42 330 查看

1. aggregate简介

aggregate是Hadoop提供的一个软件包，其用来做一些通用的计算和聚合。

Generally speaking, in order to implement an application using
Map/Reduce model, the developer needs to implement Map and Reduce
functions (and possibly Combine function). However, for a lot of applications related to counting
and statistics computing, these functions have very
similarcharacteristics. This provides a package implementing those
patterns. In particular,the
package provides a generic mapper class,a reducer class and a
combiner class, and a set of built-in value aggregators.It
also provides a generic utility class, ValueAggregatorJob, that
offers a static function that creates map/reduce jobs。

在Streaming中通常使用Aggregate包作为reducer来做聚合统计。

2. aggregate class
summary

DoubleValueSum	This class implements a value aggregator that sums up a sequence of double values. 可利用来统计Top K记录，类似LongValueSum
LongValueMax	This class implements a value aggregator that maintain the maximum of a sequence of long values.
LongValueMin	This class implements a value aggregator that maintain the minimum of a sequence of long values.
LongValueSum	This class implements a value aggregator that sums up a sequence of long values.
StringValueMax	This class implements a value aggregator that maintain the biggest of a sequence of strings.
StringValueMin	This class implements a value aggregator that maintain the smallest of a sequence of strings.
UniqValueCount	This class implements a value aggregator that dedupes a sequence of objects.
UserDefinedValueAggregatorDescriptor	This class implements a wrapper for a user defined value aggregator descriptor.
ValueAggregatorBaseDescriptor	This class implements the common functionalities of the subclasses of ValueAggregatorDescriptor class.
ValueAggregatorCombiner	This class implements the generic combiner of Aggregate.
ValueAggregatorJob	This is the main class for creating a map/reduce job using Aggregate framework.
ValueAggregatorJobBase	This abstract class implements some common functionalities of the the generic mapper, reducer and combiner classes of Aggregate.
ValueAggregatorMapper	This class implements the generic mapper of Aggregate.
ValueAggregatorReducer	This class implements the generic reducer of Aggregate.
ValueHistogram	This class implements a value aggregator that computes the histogram of a sequence of strings

3.
streaming中使用aggregate

在mapper任务的输出中添加控制，如下：

function：key\tvalue

eg：

LongValueSum：key\tvalue

此外，置-reducer =
aggregate。此时，Reducer使用aggregate中对应的function类对相同key的value进行操作，例如，设置function为LongValueSum则将对每个键值对应的value求和。

4. 实例1（value求和）

测试文件test.txt

a 15 1

a 17 1

a 18 1

a 19 1

a 19 1

a 19 1

a 19 1

b 20 1

c 15 1

c 15 1

d 16 1

a 16 1

mapper程序：

#include

#include

using namespace std;

int main(int argc, char** argv)

{

string a,b,c;

while(cin >> a >> b >> c)

{

cout << "LongValueSum:"<< a << "\t" << b << endl;

}

return 0;

}

运行：

$hadoop streaming -input /app/test.txt -output /app/test -mapper
./mapper -reducer aggregate -file
mapper -jobconf mapred.reduce.tasks=1 -jobconf
mapre.job.name="test"

输出：

a
142

b
20

c
30

d
16

5.
实例2（强大ValueHistogram）

ValueHistogram是aggregate package中最强大的类，基于每个键，对其value做以下统计

1）唯一值个数

2）最小值个数

3）中位置个数

4）最大值个数

5）平均值个数

6）标准方差

上述例子基础上修改mapper.cpp为：

#include

#include

using namespace std;

int main(int argc, char** argv)

{

string a,b,c;

while(cin >> a >> b >> c)

{

cout << "ValueHistogram:"<< a << "\t" << b << endl;

}

return 0;

}

运行命令同上

运行结果：

a
5
1
1
4
1.6
1.2

b
1
1
1
1
1.0
0.0

c
1
2
2
2
2.0
0.0

d
1
1
1
1
1.0
0.0

参考：

/docs/api/index.html?org/apache/hadoop/mapred/lib/aggregate/package-summary.htm

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签：

相关文章推荐

新的分享

章节导航

hadoop&amp;nbsp;Streaming之aggregate

hadoop Streaming之aggregate