Hadoop Streaming: aggregate
2013-02-25 18:42
1. Introduction to aggregate
aggregate is a package shipped with Hadoop for common counting and aggregation tasks. The package documentation describes it as follows:
Generally speaking, in order to implement an application using the Map/Reduce model, the developer needs to implement Map and Reduce functions (and possibly a Combine function). However, for a lot of applications related to counting and statistics computing, these functions have very similar characteristics. This package implements those patterns. In particular, the package provides a generic mapper class, a reducer class and a combiner class, and a set of built-in value aggregators. It also provides a generic utility class, ValueAggregatorJob, that offers a static function that creates map/reduce jobs.
In Streaming, the aggregate package is typically used as the reducer to perform aggregation and statistics.
2. aggregate class summary
DoubleValueSum | This class implements a value aggregator that sums up a sequence of double values. Can be used for Top-K statistics, similar to LongValueSum. |
LongValueMax | This class implements a value aggregator that maintains the maximum of a sequence of long values. |
LongValueMin | This class implements a value aggregator that maintains the minimum of a sequence of long values. |
LongValueSum | This class implements a value aggregator that sums up a sequence of long values. |
StringValueMax | This class implements a value aggregator that maintains the biggest of a sequence of strings. |
StringValueMin | This class implements a value aggregator that maintains the smallest of a sequence of strings. |
UniqValueCount | This class implements a value aggregator that dedupes a sequence of objects. |
UserDefinedValueAggregatorDescriptor | This class implements a wrapper for a user-defined value aggregator descriptor. |
ValueAggregatorBaseDescriptor | This class implements the common functionality of the subclasses of the ValueAggregatorDescriptor class. |
ValueAggregatorCombiner | This class implements the generic combiner of Aggregate. |
ValueAggregatorJob | This is the main class for creating a map/reduce job using the Aggregate framework. |
ValueAggregatorJobBase | This abstract class implements some common functionality of the generic mapper, reducer and combiner classes of Aggregate. |
ValueAggregatorMapper | This class implements the generic mapper of Aggregate. |
ValueAggregatorReducer | This class implements the generic reducer of Aggregate. |
ValueHistogram | This class implements a value aggregator that computes the histogram of a sequence of strings. |
3. Using aggregate with Streaming
Add a control prefix to each line of the mapper's output, in the form:
function:key\tvalue
e.g.:
LongValueSum:key\tvalue
In addition, set -reducer aggregate. The reducer then applies the named function class from the aggregate package to all values sharing the same key; for example, with the function set to LongValueSum, the values for each key are summed.
4. Example 1 (summing values)
Test file test.txt:
a 15 1
a 17 1
a 18 1
a 19 1
a 19 1
a 19 1
a 19 1
b 20 1
c 15 1
c 15 1
d 16 1
a 16 1
Mapper program (mapper.cpp):
#include <iostream>
#include <string>
using namespace std;

int main(int argc, char** argv)
{
    // Emit "LongValueSum:key\tvalue" so the aggregate reducer
    // sums the second column per key; the third column is ignored.
    string a, b, c;
    while (cin >> a >> b >> c)
    {
        cout << "LongValueSum:" << a << "\t" << b << endl;
    }
    return 0;
}
Run (the mapper binary can first be checked locally with: cat test.txt | ./mapper):
$ hadoop streaming -input /app/test.txt -output /app/test \
    -mapper ./mapper -reducer aggregate -file mapper \
    -jobconf mapred.reduce.tasks=1 -jobconf mapred.job.name="test"
Output:
a	142
b	20
c	30
d	16
5. Example 2 (the powerful ValueHistogram)
ValueHistogram is the most powerful class in the aggregate package. For each key, it reports the following statistics over that key's values:
1) the number of unique values
2) the minimum number of times any value appears
3) the median number of times a value appears
4) the maximum number of times any value appears
5) the average number of times a value appears
6) the standard deviation of the appearance counts
Building on the example above, modify mapper.cpp to:
#include <iostream>
#include <string>
using namespace std;

int main(int argc, char** argv)
{
    // Emit "ValueHistogram:key\tvalue" so the aggregate reducer
    // computes per-key histogram statistics over the second column.
    string a, b, c;
    while (cin >> a >> b >> c)
    {
        cout << "ValueHistogram:" << a << "\t" << b << endl;
    }
    return 0;
}
The run command is the same as above. Output:
a	5	1	1	4	1.6	1.2
b	1	1	1	1	1.0	0.0
c	1	2	2	2	2.0	0.0
d	1	1	1	1	1.0	0.0
Reference:
/docs/api/index.html?org/apache/hadoop/mapred/lib/aggregate/package-summary.htm