
[dumbo] [hadoop] Developing Hadoop Streaming programs with dumbo

2013-06-05 04:43
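Reposted from: http://www.cnblogs.com/flying5/archive/2011/09/07/2169574.html

1. dumbo official site: https://github.com/klbostee/dumbo/    wiki: https://github.com/klbostee/dumbo/wiki

2. Installation and configuration. See the wiki for details: https://github.com/klbostee/dumbo/wiki/Building-and-installing
Unpack the source downloaded from the site into a dumbo folder, then run the following commands: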
wget -O ez_setup.py http://peak.telecommunity.com/dist/ez_setup.py
python ez_setup.py dumbo
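3. Example program. Use dumbo to write a program that counts the IP addresses appearing most often in an Apache access log. The Apache log format is as follows: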
[admin@search011 dumbotest]$ head -n3 access.log
10.0.0.1 - - [02/Nov/2010:10:15:52 +0800] "GET /index/Back/task_updator.php?taskid=22870&sub_taskid=15&runhost=localhost&status=OK HTTP/1.0" 200 3 "-" "Wget/1.10.2 (Red Hat modified)"
10.0.0.3 - - [02/Nov/2010:10:15:52 +0800] "GET /index/Back/task_updator.php?taskid=22870&sub_taskid=17&runhost=localhost&status=OK HTTP/1.0" 200 3 "-" "Wget/1.10.2 (Red Hat modified)"
10.0.0.1 - - [02/Nov/2010:10:15:53 +0800] "GET /index/Back/task_updator.php?taskid=22870&sub_taskid=9&runhost=localhost&status=OK HTTP/1.0" 200 3 "-" "Wget/1.10.2 (Red Hat modified)"
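The example job only needs the first space-separated field (the client IP), but it can help to see how one of these lines breaks down. The following is a minimal sketch, assuming the log uses the standard Apache "combined" layout with the timestamp enclosed in square brackets; the pattern `LOG_PATTERN` and the helper `parse_line` are illustrative names, not part of dumbo or Hadoop.

```python
import re

# Assumed layout: Apache "combined" format, timestamp inside [ ... ].
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Hypothetical helper: return a dict of fields, or None if no match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = ('10.0.0.1 - - [02/Nov/2010:10:15:52 +0800] '
          '"GET /index/Back/task_updator.php?taskid=22870&sub_taskid=15'
          '&runhost=localhost&status=OK HTTP/1.0" 200 3 "-" '
          '"Wget/1.10.2 (Red Hat modified)"')

fields = parse_line(sample)
print(fields["ip"], fields["status"])  # prints: 10.0.0.1 200
```

For counting IPs alone, splitting on the first space is cheaper than a full regular expression; a parser like this is only worth it if later jobs need the status code or the request path.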
The ipcount.py program is as follows:
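#!/usr/bin/env python
# Equivalent shell pipeline:
# cut -d ' ' -f 1 access.log | sort | uniq -c | sort -nr | head -n 5

def mapper(key, value):
    # value is one log line; its first space-separated field is the client IP
    yield value.split(" ")[0], 1

def reducer(key, values):
    # key is an IP address; values are the partial counts emitted for it
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    # Summation is associative, so the reducer can also serve as the combiner
    dumbo.run(mapper, reducer, combiner=reducer)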
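To see what the job computes without installing dumbo or Hadoop, the map, group, and reduce phases can be imitated in memory. This is only a sketch: `run_local` is a hypothetical helper written for illustration (it is not a dumbo API), the input lines are shortened stand-ins for real log entries, and sorting the keys merely imitates the grouping that Hadoop performs between the two phases.

```python
from collections import defaultdict

def mapper(key, value):
    # Same logic as ipcount.py: emit (ip, 1) for each log line
    yield value.split(" ")[0], 1

def reducer(key, values):
    # Same logic as ipcount.py: add up the counts for one IP
    yield key, sum(values)

def run_local(lines, mapper, reducer):
    """Hypothetical helper: map, group by key, then reduce, all in memory."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for k, v in mapper(offset, line):
            groups[k].append(v)
    results = []
    for k in sorted(groups):
        results.extend(reducer(k, groups[k]))
    return results

# Shortened, made-up log lines in the same layout as access.log
lines = [
    '10.0.0.1 - - [02/Nov/2010:10:15:52 +0800] "GET /a HTTP/1.0" 200 3',
    '10.0.0.3 - - [02/Nov/2010:10:15:52 +0800] "GET /b HTTP/1.0" 200 3',
    '10.0.0.1 - - [02/Nov/2010:10:15:53 +0800] "GET /c HTTP/1.0" 200 3',
]
counts = run_local(lines, mapper, reducer)
print(counts)  # [('10.0.0.1', 2), ('10.0.0.3', 1)]

# Why the reducer is safe as a combiner: reducing partial sums
# gives the same total as reducing all the ones at once.
full = list(reducer("10.0.0.1", [1, 1, 1]))
partial = list(reducer("10.0.0.1", [1, 1])) + list(reducer("10.0.0.1", [1]))
merged = list(reducer("10.0.0.1", [v for _, v in partial]))
print(full == merged)  # True
```

Note that the job itself emits a count per IP and does not rank them; picking the top few addresses is a separate step on the output.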
Run:
cut -d ' ' -f 1 access.log | sort | uniq -c | sort -nr | head -n 5
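For comparison, the same local count can be written in plain Python. This is a rough equivalent of the shell pipeline, assuming the IP is always the first space-separated field; `top_ips` is an illustrative name, and the sample lines are shortened stand-ins for access.log.

```python
from collections import Counter

def top_ips(lines, n=5):
    """Hypothetical helper: return the n most frequent first fields."""
    counts = Counter(line.split(" ")[0] for line in lines if line.strip())
    return counts.most_common(n)

# Shortened, made-up log lines; with a real file, pass open("access.log")
sample_lines = [
    '10.0.0.1 - - [02/Nov/2010:10:15:52 +0800] "GET /a HTTP/1.0" 200 3',
    '10.0.0.3 - - [02/Nov/2010:10:15:52 +0800] "GET /b HTTP/1.0" 200 3',
    '10.0.0.1 - - [02/Nov/2010:10:15:53 +0800] "GET /c HTTP/1.0" 200 3',
]
top = top_ips(sample_lines)
print(top)  # [('10.0.0.1', 2), ('10.0.0.3', 1)]
```

When several addresses have the same count, `most_common` and `sort -nr` may order them differently, so only the counts should be compared, not the order of ties.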
Running the pipeline shows the result computed on the local machine. To run on Hadoop:
dumbo start ipcount.py -hadoop /home/admin/hadoop -libjar /home/admin/hadoop/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar -input /admin/input -output /admin/output