MapReduce Patterns (Chapter 3)
2016-01-05 11:38
A single reducer receiving a lot of data is bad for a few reasons:

1. The sort can become an expensive operation when there are too many records and most of the sorting has to be done on local disk instead of in memory.

2. The host where the reducer is running will receive a lot of data over the network, which may create a network resource hot spot on that single host.

3. Naturally, scanning through the data in the reducer will take a long time if there are many records to look through.

4. Any sort of memory growth in the reducer risks blowing through the Java Virtual Machine's memory. For example, if you are collecting all of the values into an ArrayList to compute the median, that ArrayList can grow very big, since every value has to be loaded into memory. This is not a particular problem if you are only looking for the top ten items, but if you want to extract a very large number of them, you may run into memory limits.

5. Writes to the output file are not parallelized. When we are dealing with a lot of data, writing to the locally attached disk can be an expensive operation in the reduce phase. Since there is only one reducer, we are not taking advantage of the parallelism of writing data to several hosts, or even several disks on the same host. Again, this is not an issue for the top ten, but it becomes a factor when the data extracts are very large.
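The memory-growth problem in point 4 is usually avoided in the top-N pattern by keeping only a bounded structure in the reducer rather than collecting every value. A minimal sketch of that bounded top-N logic in plain Java (Hadoop types omitted for brevity; the class and method names here are illustrative, not from the book):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopN {
    // Bounded top-N: a min-heap holds only the n largest values seen so
    // far, so reducer memory stays O(n) no matter how many records
    // stream through — unlike buffering everything in an ArrayList.
    public static List<Integer> topN(Iterable<Integer> values, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // smallest on top
        for (int v : values) {
            heap.offer(v);
            if (heap.size() > n) {
                heap.poll(); // evict the current smallest
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder()); // largest first
        return result;
    }

    public static void main(String[] args) {
        System.out.println(topN(List.of(5, 1, 9, 3, 7, 2, 8), 3)); // prints [9, 8, 7]
    }
}
```

The same idea applies to a median only approximately (an exact median genuinely needs all the values, which is why the book singles it out as a memory hazard); for top-N, the heap keeps the single reducer's footprint constant.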