您的位置：首页 > 编程语言 > Python开发

hadoop-python——Wordcount程序：python实现详解

2015-01-07 14:20 597 查看

从hadoop官网提供用python语言编写的wordcount程序如下：

mapper.py函数如下：

import sys

# 调用标准输入流
for line in sys.stdin:
# 读取文本内容
line = line.strip()
# 对文本内容分词，形成一个列表
words = line.split()
# 读取列表中每一个元素的值
for word in words:
# map函数输出，key为word，下一步将进行shuffle过程，将按照key排序，输出，这两步为map阶段工作为，在本地节点进行
print '%s\t%s' % (word, 1)

reducer.py函数如下：

#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
# remove leading and trailing whitespace
line = line.strip()

# parse the input we got from mapper.py
word, count = line.split('\t', 1)

# convert count (currently a string) to int
try:
count = int(count)
except ValueError:
# count was not a number, so silently
# ignore/discard this line
continue

# this IF-switch only works because Hadoop sorts map output
# by key (here: word) before it is passed to the reducer
if current_word == word:
current_count += count
else:
if current_word:
# write result to STDOUT
print '%s\t%s' % (current_word, current_count)
current_count = count
current_word = word

# do not forget to output the last word if needed!
if current_word == word:
print '%s\t%s' % (current_word, current_count)

预备知识：mapreduce运作框架

过程详解

调用接口说明：调用python中的标准输入流 sys.stdin ，MAP具体过程是，HadoopStream每次从input文件读取一行数据，然后传到sys.stdin中，运行payhon的map函数脚本，然后用print输出回HadoopStreeam.
REDUCE过程一样

所以M和R函数的输入格式为 for line in sys.stdin:

line=line.strip

Mapper 过程如下：

第一步，在每个节点上运行我们编写的map程序：（多节点同时运行）

import sys

# 调用标准输入流

for line in sys.stdin:

# 读取文本内容

line = line.strip()

# 对文本内容分词，形成一个列表

words = line.split()

# 读取列表中每一个元素的值

for word in words:

# map函数输出，key为word，下一步将进行shuffle过程，将按照key排序，输出，这两步为map阶段工作为，在本地节点进行

print '%s\t%s' % (word, 1)

第二步，hadoop框架，把我们运行的结果，进入shuffle过程，每个节点对key单独进行排序，然后输出.

Reducer 过程如下：（单节点运行）

第一步，merge过程，把所有节点汇总到一个节点，合并并且按照key排序.

第二步，运行reducer函数.

实例：

Txt1.txt内容：hello my name is pan

hello hadoop

hello world

123

Txt2.txt内容：hello

123

假如这两个文本分别运行在两个节点上假设节点1运行文本1，节点2运行文本2.

Map函数运行

节点1结果

hello 1

my 1

name 1

is 1

pan 1

hello 1

hadoop 1

hello 1

world 1

123 1

节点2结果

hello 1

123 1

Shuffle步结果

节点1结果

123 1

hadoop 1

hello 1

hello 1

hello 1

is 1

my 1

name 1

pan 1

world 1

节点2结果

123 1

hello 1

Merge步运行结果：

123 1

123 1

hadoop 1

hello 1

hello 1

hello 1

hello 1

is 1

my 1

name 1

pan 1

world 1

Reducer函数运行结果

123 2

hadoop 1

hello 4

is 1

my 1

name 1

pan 1

world 1

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签：

相关文章推荐

新的分享

章节导航