How to sort an arbitrarily large set of data using Hadoop?
2014-02-08 13:34
Question:
My question is related to the post "Sorting large data using MapReduce/Hadoop". My idea for sorting an arbitrarily large data set is:
We have a large file with records, say 10^9 records.
The file is split across M mappers. Each mapper sorts one split of, say, 10,000 records using quicksort and outputs the sorted subsequence. The output key is a number between 1 and R, where R is the number of reducer tasks (suppose R = 4). The value is the sorted subsequence.
Each Reducer reads K subsequences and merges them (taking the smallest element from subsequences iteratively until subsequences are empty). The output is written to a file.
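The two steps above can be sketched locally in a few lines. This is only an illustration of the idea, not Hadoop code: the function names `sort_split` and `merge_runs` are made up for the example, and plain Python lists stand in for splits and reducer inputs.

```python
import heapq

def sort_split(records):
    """Map-side step (hypothetical name): sort one split in memory."""
    return sorted(records)

def merge_runs(runs):
    """Reduce-side step (hypothetical name): k-way merge of sorted runs.

    heapq.merge repeatedly yields the smallest head element among the
    runs, which is the "take the smallest element until all
    subsequences are empty" procedure described above.
    """
    return list(heapq.merge(*runs))

# Three small splits stand in for the 10,000-record splits.
splits = [[9, 2, 7], [4, 1, 8], [6, 3, 5]]
runs = [sort_split(s) for s in splits]
merged = merge_runs(runs)
print(merged)
```

The merge only needs the current head of each run in memory, which is why the reducer can handle runs that are far larger than its RAM as long as they are streamed from disk.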
Then the following processing is done:
To take advantage of data locality, new reducer tasks could be scheduled to merge several output files produced by the previous reducer task. For example, if K = 5, the first reducer task would produce files of 50,000 records, and the next reducer task would work with 5 files of 50,000 sorted records each. New reducer jobs would be scheduled until only one file per reducer remains, in this case of 250,000,000 records (because R = 4). Finally, a new reducer job would be scheduled on another machine to merge those files into a single file of 10^9 records.
My Question: is it possible in Hadoop to schedule the execution of Reducer jobs in such a way that they merge the files in some directory until only 1 file remains? If yes, how?
Another scenario would be to schedule a MapReduce job after each merge step. For example, the files of 50,000 records would be merged in parallel by reduce tasks running on other machines, then the files of 250,000 records on yet other machines, and so on, but this would generate a lot of network traffic. In any case, the question applies here as well: how can several MapReduce jobs be chained so that the chaining stops once only one output file remains?
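To make the chaining question concrete, here is a rough local simulation of the control loop I have in mind. It is a sketch under assumptions, not a Hadoop driver: `merge_pass` and `merge_until_one` are hypothetical names, each in-memory list stands in for one sorted output file, and one call to `merge_pass` stands in for one chained job.

```python
import heapq

def merge_pass(runs, fan_in):
    """One round (stand-in for one job): merge groups of up to
    `fan_in` sorted runs into one sorted run per group."""
    return [
        list(heapq.merge(*runs[i:i + fan_in]))
        for i in range(0, len(runs), fan_in)
    ]

def merge_until_one(runs, fan_in):
    """Repeat merge rounds until a single sorted run remains.

    Returns the final run and the number of rounds that were needed.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 or the loop never shrinks")
    rounds = 0
    while len(runs) > 1:  # stop condition: only one "file" is left
        runs = merge_pass(runs, fan_in)
        rounds += 1
    return (runs[0] if runs else []), rounds

# Seven sorted runs, merged three at a time: 7 -> 3 -> 1.
runs = [[r, r + 7, r + 14] for r in range(7)]
result, rounds = merge_until_one(runs, fan_in=3)
print(rounds, result)
```

With a fan-in of K, the number of files shrinks by roughly a factor of K per round, so the loop ends after about log_K(number of initial runs) rounds. In a real cluster the loop condition would presumably be a count of the files in the previous round's output directory.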
Answer:
Sorting in Hadoop is done with a partitioner. See, for example, the source code for the TeraSort benchmark.
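The idea behind a partitioner-based sort is that each reducer receives a contiguous key range, so the reducer outputs, read in reducer order, are already globally sorted and no final merge is needed. Below is a minimal local sketch of that idea, assuming split points are chosen from a random sample of the keys. It is not the TeraSort code itself, and `choose_split_points`, `partition` and `range_partitioned_sort` are names invented for the example.

```python
import bisect
import random

def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 cut points from a sample of the keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]

def partition(key, split_points):
    """Reducer index for `key`: keys below the first cut point go to
    reducer 0, keys from the first up to the second go to reducer 1, ..."""
    return bisect.bisect_right(split_points, key)

def range_partitioned_sort(records, num_reducers, sample_size=100, seed=0):
    """Route records to reducers by key range, then sort each bucket."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    split_points = choose_split_points(sample, num_reducers)
    buckets = [[] for _ in range(num_reducers)]
    for rec in records:
        buckets[partition(rec, split_points)].append(rec)
    # Each "reducer" sorts only its own bucket, independently of the others.
    return [sorted(b) for b in buckets]

records = list(range(1000))
random.Random(42).shuffle(records)
parts = range_partitioned_sort(records, num_reducers=4)
flattened = [rec for part in parts for rec in part]
print([len(p) for p in parts])
```

Concatenating the buckets in reducer order gives the fully sorted data set. The sample only affects how evenly the work is balanced across reducers, not the correctness of the result.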