您的位置:首页 > 编程语言 > Python开发

用python编写mapreduce版的wordcount程序

2015-07-30 09:02 519 查看

python版的mapreduce版的wordcount程序网上有很多,但是都或多或少的有问题,运行时基本上都会出错,从而导致本人走了不少弯路。经过本人的探索和实践,整理出了能正常运行的代码,并且附上几点需要注意的地方。

1、代码整个编码阶段必须全过程在linux环境下编写,如果从windows拷贝过去,则会由于字符编码不一致,导致程序不能正常运行。

2、如果运行./mapper.py时报错,可以尝试使用python mapper.py

3、执行命令为:hadoop jar ~/hadoop-2.3.0

/hadoop/tools

b/hadoop-streaming-2.3.0.jar
-mapper mapper.py -reducer reducer.py -input /input/data.txt -output /output/o1 -file mapper.py -file reducer.py

4、map和reduce阶段的key/value输出,hadoop默认是以‘\t’分开的,所以在python程序里面也必须遵守这个原则,如果map过程不遵守这个原则,则会导致后面的reduce阶段的输入不一致,从而可能导致reduce没有输出,也就是说程序会不报错的运行完,但是最终的output为空。

mapper.py

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
for line in file:
# split the line into words
yield line.split()

def main(separator='\t'):
# input comes from STDIN (standard input)
data = read_input(sys.stdin)
for words in data:
# write the results to STDOUT (standard output);
# what we output here will be the input for the
# Reduce step, i.e. the input for reducer.py
#
# tab-delimited; the trivial word count is 1
for word in words:
print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
main()


reducer.py

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
for line in file:
yield line.rstrip().split(separator, 1)

def main(separator='\t'):
# input comes from STDIN (standard input)
data = read_mapper_output(sys.stdin, separator=separator)
# groupby groups multiple word-count pairs by word,
# and creates an iterator that returns consecutive keys and their group:
#   current_word - string containing a word (the key)
#   group - iterator yielding all ["<current_word>", "<count>"] items
for current_word, group in groupby(data, itemgetter(0)):
try:
total_count = sum(int(count) for current_word, count in group)
print "%s%s%d" % (current_word, separator, total_count)
except ValueError:
# count was not a number, so silently discard this item
pass

if __name__ == "__main__":
main()
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: