
How to sort an arbitrarily large set of data using Hadoop?

2014-02-08 13:34 · 671 views

Question:

My question is related to this post: Sorting large data using MapReduce/Hadoop. My idea for sorting an arbitrarily large data set is:

We have a large file of records, say 10^9 records.
The file is split across M mappers. Each mapper sorts its split of, say, 10,000 records with quicksort and outputs the sorted subsequence. The output key ranges between 1 and R, where R is the number of reducer tasks (suppose R = 4); the value is the sorted subsequence.
Each reducer reads K subsequences and merges them, repeatedly taking the smallest head element until all subsequences are empty. The output is written to a file.
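The K-way merge step described above is usually done with a min-heap rather than a linear scan of the K heads. A standalone Java sketch of just that merge logic (no Hadoop APIs, the class and method names are mine for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {
    // Merge K sorted runs by repeatedly extracting the smallest head element,
    // as the reducer in the question would merge its K subsequences.
    static List<Integer> merge(List<List<Integer>> runs) {
        // Heap entries are {value, runIndex, positionInRun}, ordered by value.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[]{runs.get(r).get(0), r, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(top[0]);
            int next = top[2] + 1;              // advance within that run
            List<Integer> run = runs.get(top[1]);
            if (next < run.size()) {
                heap.add(new int[]{run.get(next), top[1], next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = List.of(
            List.of(1, 4, 9), List.of(2, 3, 8), List.of(5, 6, 7));
        System.out.println(merge(runs)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

With a heap, merging K runs of n records total costs O(n log K) comparisons instead of O(nK); in a real reducer the runs would be streamed from disk rather than held in lists.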
Then the following processing is done:

To take advantage of data locality, new reducer tasks could be scheduled to merge several output files produced by a previous reducer task. For example, if K = 5, the first reducer task would produce files of 50,000 records, and the next reducer task would merge 5 such files of 50,000 sorted records each. New reducer jobs would be scheduled until only one file per reducer remains, in this case of 250,000,000 records (because R = 4). Finally, a new reducer job would be scheduled on another machine to merge those files into a single sorted file of 10^9 records.

My question: is it possible in Hadoop to schedule reducer jobs so that they keep merging the files in some directory until only one file remains? If yes, how?

Another scenario would be to schedule a new MapReduce job after each merge step: the files of 50,000 records would be merged in parallel by reduce tasks running on other machines, then the files of 250,000 records on yet other machines, and so on, but this would generate a lot of network traffic. In either case the question remains: how do you chain several MapReduce jobs so that the chaining stops once only one file is left?

Answer:

Hadoop sorting is done with a partitioner. See, for example, the source code of the TeraSort benchmark.
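The idea behind the partitioner approach can be shown without Hadoop: pick R-1 ascending split points (TeraSort samples the input to choose them) and route each key to the reducer whose range contains it. Since every key in partition i is smaller than every key in partition i+1, each reducer's sort-and-write produces outputs that are already in global order, and concatenating them gives the fully sorted file in one job. A standalone sketch with hard-coded split points (real TeraSort samples them; class and method names are mine):

```java
import java.util.Arrays;

public class RangePartition {
    // Route a key to one of R reducers using R-1 ascending split points,
    // mimicking what Hadoop's TotalOrderPartitioner does for TeraSort.
    static int partition(int key, int[] splitPoints) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns (-insertionPoint - 1) when the key is absent;
        // keys equal to a split point go to the partition on its right.
        return idx >= 0 ? idx + 1 : -idx - 1;
    }

    public static void main(String[] args) {
        int[] splits = {250, 500, 750};             // R = 4 reducers
        System.out.println(partition(100, splits)); // 0
        System.out.println(partition(600, splits)); // 2
        System.out.println(partition(900, splits)); // 3
    }
}
```

This is why the multi-pass merge tree in the question is unnecessary in practice: with a range partitioner, one MapReduce job sorts the whole data set, and the per-reducer output files only need to be concatenated, not merged.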