用python编写mapreduce版的wordcount程序
2015-07-30 09:02
519 查看
python版的mapreduce版的wordcount程序网上有很多,但是都或多或少的有问题,运行时基本上都会出错,从而导致本人走了不少弯路。经过本人的探索和实践,整理出了能正常运行的代码,并且附上几点需要注意的地方。
1、代码整个编码阶段必须全过程在linux环境下编写,如果从windows拷贝过去,则会由于字符编码不一致,导致程序不能正常运行。2、如果运行./mapper.py时报错,可以尝试使用python mapper.py
3、执行命令为:hadoop jar ~/hadoop-2.3.0
/hadoop/tools
b/hadoop-streaming-2.3.0.jar
-mapper mapper.py -reducer reducer.py -input /input/data.txt -output /output/o1 -file mapper.py -file reducer.py
4、map和reduce阶段的key/value输出,hadoop默认是以‘\t’分开的,所以在python程序里面也必须遵守这个原则,如果map过程不遵守这个原则,则会导致后面的reduce阶段的输入不一致,从而可能导致reduce没有输出,也就是说程序会不报错的运行完,但是最终的output为空。
mapper.py
#!/usr/bin/env python """A more advanced Mapper, using Python iterators and generators.""" import sys def read_input(file): for line in file: # split the line into words yield line.split() def main(separator='\t'): # input comes from STDIN (standard input) data = read_input(sys.stdin) for words in data: # write the results to STDOUT (standard output); # what we output here will be the input for the # Reduce step, i.e. the input for reducer.py # # tab-delimited; the trivial word count is 1 for word in words: print '%s%s%d' % (word, separator, 1) if __name__ == "__main__": main()
reducer.py
#!/usr/bin/env python """A more advanced Reducer, using Python iterators and generators.""" from itertools import groupby from operator import itemgetter import sys def read_mapper_output(file, separator='\t'): for line in file: yield line.rstrip().split(separator, 1) def main(separator='\t'): # input comes from STDIN (standard input) data = read_mapper_output(sys.stdin, separator=separator) # groupby groups multiple word-count pairs by word, # and creates an iterator that returns consecutive keys and their group: # current_word - string containing a word (the key) # group - iterator yielding all ["<current_word>", "<count>"] items for current_word, group in groupby(data, itemgetter(0)): try: total_count = sum(int(count) for current_word, count in group) print "%s%s%d" % (current_word, separator, total_count) except ValueError: # count was not a number, so silently discard this item pass if __name__ == "__main__": main()
相关文章推荐
- Python 类介绍
- Python的迭代器和生成器
- python for else primers
- python的正则表达式 re
- python的正则表达式 re
- Python type类具体的三大分类:metaclasses,classes,instance
- Python type类具体的三大分类:metaclasses,classes,instance
- python开发_大小写转换,首字母大写,去除特殊字符
- python开发_大小写转换,首字母大写,去除特殊字符
- python的str,unicode对象的encode和decode方法, Python中字符编码的总结和对比bytes和str
- python的str,unicode对象的encode和decode方法, Python中字符编码的总结和对比bytes和str
- Python图形界面开发包 PyGTK
- Python图形界面开发包 PyGTK
- python使用easygui写图形界面程序
- python使用easygui写图形界面程序
- python 可变对象与不可变对象
- opencv-python 学习笔记2:实现目光跟随(又叫人脸跟随)
- opencv-python 学习笔记2:实现目光跟随(又叫人脸跟随)
- opencv-python 学习笔记1:简单的图片处理
- opencv-python 学习笔记1:简单的图片处理