MapReduce
2010-04-30 11:07
MapReduce is a patented[1] software framework introduced by Google to support distributed computing on large data sets on clusters of computers.[2] The framework is inspired by the map and reduce functions commonly used in functional programming,[3] although their purpose in the MapReduce framework is not the same as in their original forms.[4] MapReduce libraries have been written in C++, C#, Erlang, Java, Python, Ruby, F#, R, and other programming languages.
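The functional-programming inspiration can be illustrated with Python's built-in `map` and `functools.reduce` (a small illustrative sketch, not part of any MapReduce library):

```python
from functools import reduce

# Classic functional map/reduce: square each number, then sum the squares.
nums = [1, 2, 3, 4]
squares = list(map(lambda x: x * x, nums))   # map step: [1, 4, 9, 16]
total = reduce(lambda a, b: a + b, squares)  # reduce step: 30
print(total)  # 30
```

In the MapReduce framework these ideas are generalized: the map function emits intermediate key/value pairs rather than transforming a list in place, and reduction happens per key rather than over one sequence.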
Overview
MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster. Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured).
"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them to form the output - the answer to the problem it was originally trying to solve.
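The two steps above can be sketched in a single process with the canonical word-count example (function names such as `map_step` and `shuffle` are illustrative, not part of any real framework API):

```python
from collections import defaultdict

def map_step(document):
    # Map: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Group intermediate values by key, as the framework does
    # before handing them to reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # Reduce: combine all values for one key into a final result.
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [p for doc in documents for p in map_step(doc)]
counts = dict(reduce_step(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 3
```

In a real deployment the master distributes `documents` across worker nodes and the shuffle moves data over the network; the data flow, however, is exactly this.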
The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operations that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.
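The parallelism of independent map operations can be demonstrated with Python's `multiprocessing.Pool` standing in for a cluster of worker nodes (a minimal sketch; the chunk data and process count are illustrative):

```python
from collections import Counter
from multiprocessing import Pool

def count_words(document):
    # Each mapper handles its chunk independently of the others,
    # so all chunks can be processed in parallel.
    return Counter(document.split())

if __name__ == "__main__":
    chunks = ["a b a", "b c", "a c c"]
    with Pool(processes=3) as pool:
        partials = pool.map(count_words, chunks)   # parallel map phase
    # Reduce phase: merge partial counts; Counter addition sums per key,
    # playing the role of routing same-key outputs to the same reducer.
    totals = sum(partials, Counter())
    print(totals["a"])  # 3
```

If one worker process dies, its chunk can simply be resubmitted to the pool, which mirrors the rescheduling behavior described above.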