MapReduce是Google提出的一个软件架构,用于大规模数据集(大于1TB)的并行运算。
2011-06-10 05:19
477 查看
MapReduce is a patented[1] software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers.[2]
The framework is inspired by the map and reduce functions commonly used in functional programming,[3] although their purpose in the MapReduce framework is not the same as their original forms.[4]
MapReduce libraries have been written in C++, C#, Erlang, Java, OCaml, Perl, Python, PHP, Ruby, F#, R and other programming languages.
"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output – the answer to the problem it was originally trying to solve.
The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel – though in practice it is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase – all that is required is that all outputs of the map operation that share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data are still available.
http://en.wikipedia.org/wiki/MapReduce
The framework is inspired by the map and reduce functions commonly used in functional programming,[3] although their purpose in the MapReduce framework is not the same as their original forms.[4]
MapReduce libraries have been written in C++, C#, Erlang, Java, OCaml, Perl, Python, PHP, Ruby, F#, R and other programming languages.
[]Overview
MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or as a grid (if the nodes use different hardware). Computational processing can occur on data stored either in afilesystem (unstructured) or within a database (structured)."Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output – the answer to the problem it was originally trying to solve.
The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel – though in practice it is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase – all that is required is that all outputs of the map operation that share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data are still available.
http://en.wikipedia.org/wiki/MapReduce
相关文章推荐
- Hadoop是Apache提出的一个软件框架(即:开放源码并行运算编程工具和分布式文件系统,与MapReduce和Google档案系统的概念类似)
- 处理海量数据的模式MapReduce,大规模数据集的并行运算
- 处理海量数据的模式MapReduce,大规模数据集的并行运算
- Google发布AVA:一个用于理解人类动作的精细标记视频数据集
- 一个QQ隐蔽聊天的软件,用于你在办公室QQ聊天又不想让其他人知道
- 比特币的唯一出路-可交易的大规模并行运算包
- Ted Mosby - 一个MVP框架的软件架构
- Python和Google AppEngine开发基于Google架构的应用软件
- Numeral.js 是一个用于格式化和数字四则运算的js 库
- 一个定期翻译国外Android优质的技术、开源库、软件架构设计、测试等文章的开源项目
- 设计一个计算器类Claculator,它只有一个用于计数的数据成员count。该计算器的有效计数范围是0~65535,实现计数器的前自增、后自增、前自减、后自减、两个计算器相加减运算
- Hadoop学习笔记—4.初识MapReduce 一、神马是高大上的MapReduce MapReduce是Google的一项重要技术,它首先是一个编程模型,用以进行大数据量的计算。对于大数据
- 一个QQ隐蔽聊天的软件,用于你在办公室QQ聊天又不想让其他人知道
- 基于DSP/BIOS的多信号并行处理软件架构设计
- 近期关于“软件架构”的一个看点
- 【软件架构】如何成为一个优秀的软件模型设计者
- Thrust 是一个开源的 C++ 库用于开发高性能并行应用程序
- 什么是MapReduce? Google的分布运算开发工具!
- 软件工程第一个程序:像阿超那样,花20分钟写一个能自动生成小学四则运算题目的 “软件”,要求:除了整数以外,还要支持真分数的四则运算。
- Google之大规模分布式系统的监控基础架构Dapper