
[dumbo] [hadoop] Developing Hadoop Streaming programs with dumbo

2013-06-05 04:43, 375 views

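Reposted from: http://www.cnblogs.com/flying5/archive/2011/09/07/2169574.html

1. dumbo home page: https://github.com/klbostee/dumbo/ (wiki: https://github.com/klbostee/dumbo/wiki)

2. Installation and configuration

See the wiki for details: https://github.com/klbostee/dumbo/wiki/Building-and-installing

Extract the source code downloaded from the site into a dumbo directory, then run the following commands: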

    wget -O ez_setup.py http://peak.telecommunity.com/dist/ez_setup.py
    python ez_setup.py dumbo

3. Example program

This example uses dumbo to find the IP addresses that appear most often in an Apache access log. The Apache log looks like this:

    [admin@search011 dumbotest]$ head -n3 access.log
    10.0.0.1 - - [.../Nov/...] "GET /index/Back/task_updator.php?taskid=22870&sub_taskid=15&runhost=localhost&status=OK HTTP/1.0" ... "-" "Wget/1.10.2 (Red Hat modified)"
    10.0.0.3 - - [.../Nov/...] "GET /index/Back/task_updator.php?taskid=22870&sub_taskid=17&runhost=localhost&status=OK HTTP/1.0" ... "-" "Wget/1.10.2 (Red Hat modified)"
    10.0.0.1 - - [.../Nov/...] "GET /index/Back/task_updator.php?taskid=22870&sub_taskid=9&runhost=localhost&status=OK HTTP/1.0" ... "-" "Wget/1.10.2 (Red Hat modified)"
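Because the client IP is the first space-separated field of every line, a single split is enough to extract it. The following is a minimal sketch of that step only; `client_ip` is an illustrative helper name (not part of dumbo), and the sample line is hypothetical, with its timestamp, status, and size invented for the example.

```python
def client_ip(line):
    """Return the first space-separated field of an access-log line."""
    return line.split(" ")[0]


# Hypothetical log line: the timestamp, status code and size are made up.
sample = (
    '10.0.0.1 - - [01/Nov/2010:00:00:00 +0800] '
    '"GET /index.php HTTP/1.0" 200 4 "-" "Wget/1.10.2 (Red Hat modified)"'
)

print(client_ip(sample))
```

Splitting on a single space rather than parsing the whole line keeps the work per record small, which matters when the same function runs over every line of a large log.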

The ipcount.py program is as follows:

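    #!/usr/bin/env python
    # Local shell equivalent:
    # cut -d ' ' -f 1 access.log | sort | uniq -c | sort -nr | head -n 5

    def mapper(key, value):
        # The client IP is the first space-separated field of each log line.
        yield value.split(" ")[0], 1

    def reducer(key, values):
        # Sum the counts emitted for one IP.
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        # The reducer doubles as the combiner because summing partial sums
        # gives the same total.
        dumbo.run(mapper, reducer, combiner=reducer)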
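To see what the mapper and reducer do without installing dumbo or Hadoop, the map, group-by-key, and reduce steps can be imitated in memory. This is only a rough sketch under stated assumptions: `run_local` is a hypothetical helper written for this illustration, it is not how dumbo itself executes a job, and the line index passed as the mapper key is a stand-in for whatever key the framework supplies. The input lines are shortened, made-up records.

```python
from collections import defaultdict


def mapper(key, value):
    # Emit (ip, 1) for every log line.
    yield value.split(" ")[0], 1


def reducer(key, values):
    # Sum all counts seen for one IP.
    yield key, sum(values)


def run_local(lines, mapper, reducer):
    """Hypothetical in-memory imitation of map -> group by key -> reduce."""
    groups = defaultdict(list)
    for index, line in enumerate(lines):
        for k, v in mapper(index, line):
            groups[k].append(v)
    result = {}
    for k in sorted(groups):
        for out_key, out_value in reducer(k, groups[k]):
            result[out_key] = out_value
    return result


# Shortened, made-up records that only keep the leading IP field.
lines = ["10.0.0.1 - - a", "10.0.0.3 - - b", "10.0.0.1 - - c"]
counts = run_local(lines, mapper, reducer)
print(counts)
```

Passing the reducer as the combiner is safe here because addition is associative and commutative: pre-summing counts on each map node and summing those partial totals again yields the same result as summing everything once.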

Run:

    cut -d ' ' -f 1 access.log | sort | uniq -c | sort -nr | head -n 5
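For comparison, the same local count can be written in a few lines of Python with `collections.Counter`. This is a sketch rather than part of the original post: `top_ips` is an illustrative name, and unlike `cut` it skips blank lines instead of counting them as an empty field.

```python
from collections import Counter


def top_ips(lines, n=5):
    """Return the n most frequent first fields as (ip, count) pairs."""
    counts = Counter(line.split(" ")[0] for line in lines if line.strip())
    return counts.most_common(n)


# Shortened, made-up records that only keep the leading IP field.
sample_lines = ["10.0.0.1 - - a", "10.0.0.3 - - b", "10.0.0.1 - - c"]
print(top_ips(sample_lines))
```

To apply it to a real log, the lines could be read with `open("access.log")` and passed to `top_ips` directly, since a file object iterates line by line.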

This shows the result computed on the local machine.

To run the job on Hadoop:

    dumbo start ipcount.py \
        -hadoop /home/admin/hadoop \
        -libjar /home/admin/hadoop/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar \
        -input /admin/input \
        -output /admin/output
