
Developing Hadoop Streaming programs with dumbo

2016-01-28 00:58

1. dumbo project page: https://github.com/klbostee/dumbo/

wiki: https://github.com/klbostee/dumbo/wiki

2. Installation and configuration

See the wiki for details: https://github.com/klbostee/dumbo/wiki/Building-and-installing

Unpack the source downloaded from the site into a dumbo folder, then run the following commands:

    wget -O ez_setup.py http://peak.telecommunity.com/dist/ez_setup.py
    python ez_setup.py dumbo

3. Example program

Use dumbo to write a program that finds the most frequent IP addresses in an Apache access log.

The Apache log format looks like this:

    [admin@search011 dumbotest]$ head -n3 access.log
    10.0.0.1 - - [02/Nov/2010:10:15:52 +0800] "GET /index/Back/task_updator.php?taskid=22870&sub_taskid=15&runhost=localhost&status=OK HTTP/1.0" 200 3 "-" "Wget/1.10.2 (Red Hat modified)"
    10.0.0.3 - - [02/Nov/2010:10:15:52 +0800] "GET /index/Back/task_updator.php?taskid=22870&sub_taskid=17&runhost=localhost&status=OK HTTP/1.0" 200 3 "-" "Wget/1.10.2 (Red Hat modified)"
    10.0.0.1 - - [02/Nov/2010:10:15:53 +0800] "GET /index/Back/task_updator.php?taskid=22870&sub_taskid=9&runhost=localhost&status=OK HTTP/1.0" 200 3 "-" "Wget/1.10.2 (Red Hat modified)"
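
To make the field layout of these log lines concrete, here is a minimal sketch that splits one line into named fields. The regular expression and the names `LOG_PATTERN` and `parse_line` are assumptions made for illustration, inferred only from the three sample lines above; a real access log may use a different format.

```python
import re

# Assumed layout, inferred from the sample lines:
# ip, identity, user, [timestamp], "request", status, size, "referer", "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match the pattern."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = ('10.0.0.1 - - [02/Nov/2010:10:15:52 +0800] '
          '"GET /index/Back/task_updator.php?taskid=22870&sub_taskid=15'
          '&runhost=localhost&status=OK HTTP/1.0" 200 3 "-" '
          '"Wget/1.10.2 (Red Hat modified)"')

fields = parse_line(sample)
print(fields["ip"], fields["status"], fields["size"])

# For counting IPs, only the first space-separated field is needed:
print(sample.split(" ")[0])
```

Since the client IP is always the first field, a plain split on a space is enough for the counting task, and the full pattern is only needed when other fields matter.
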

The ipcount.py program is as follows:

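    #!/usr/bin/env python

    def mapper(key, value):
        # value is one log line; the client IP is its first field
        yield value.split(" ")[0], 1

    def reducer(key, values):
        # key is an IP address; values are the counts emitted for it
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        # the reducer also serves as the combiner, since summing is associative
        dumbo.run(mapper, reducer, combiner=reducer)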
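
The following is a rough sketch of what happens to the mapper, combiner, and reducer when the job runs, written in plain Python so it can be tried without dumbo or Hadoop. It is not how dumbo is implemented. The helpers `run_phase` and `simulate`, the idea of passing a line index as the mapper key, and the shortened sample lines are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    yield value.split(" ")[0], 1

def reducer(key, values):
    yield key, sum(values)

def run_phase(func, pairs):
    """Sort (key, value) pairs by key, group them, and apply a reducer-style function."""
    out = []
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        out.extend(func(key, (value for _, value in group)))
    return out

def simulate(splits):
    """splits: list of lists of log lines, one inner list per hypothetical input split."""
    combined = []
    for split in splits:
        # The line index stands in for whatever key the framework supplies.
        mapped = [pair for index, line in enumerate(split)
                  for pair in mapper(index, line)]
        combined.extend(run_phase(reducer, mapped))  # combiner: partial sums per split
    return run_phase(reducer, combined)              # reducer: totals across splits

splits = [
    ["10.0.0.1 - - a", "10.0.0.3 - - b", "10.0.0.1 - - c"],
    ["10.0.0.1 - - d", "10.0.0.2 - - e"],
]
result = simulate(splits)
print(result)
```

Reusing the reducer as the combiner is safe here because adding partial sums gives the same total as adding the individual ones. A reducer that computed an average, for example, could not be reused this way.
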

Run:

    cut -d ' ' -f 1 access.log | sort | uniq -c | sort -nr | head -n 5
This shows the result of running on the local machine.
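
As a cross-check for the command above, the same top-five count can be sketched in plain Python. The function name `top_ips` and the shortened sample lines are assumptions for illustration. When several IPs have the same count, their order may differ from what `sort -nr` prints.

```python
from collections import Counter

def top_ips(lines, n=5):
    """Count the first space-separated field of each line; return the n most common."""
    counts = Counter(line.split(" ")[0] for line in lines if line.strip())
    return counts.most_common(n)

sample_lines = [
    '10.0.0.1 - - [02/Nov/2010:10:15:52 +0800] "GET /a HTTP/1.0" 200 3',
    '10.0.0.3 - - [02/Nov/2010:10:15:52 +0800] "GET /b HTTP/1.0" 200 3',
    '10.0.0.1 - - [02/Nov/2010:10:15:53 +0800] "GET /c HTTP/1.0" 200 3',
]

for ip, count in top_ips(sample_lines):
    print(count, ip)
```

To run it against the real log, pass an open file instead of the sample list, for example `top_ips(open("access.log"))`.
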

Run on Hadoop:

    dumbo start ipcount.py -hadoop /home/admin/hadoop -libjar /home/admin/hadoop/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar -input /admin/input -output /admin/output

This submits the job, after which you can see the results.