Improving Spark execution efficiency by setting spark.default.parallelism appropriately
2016-07-21 17:12
Spark has the concept of a partition (the same concept as a slice; the Spark 1.2 documentation states this explicitly), and in general each partition corresponds to one task. In my tests, when spark.default.parallelism was not set, Spark computed an enormous number of partitions, completely out of proportion to my cores. On two machines (2 × 8 cores, 2 × 6 GB), Spark produced around 28,000 partitions, i.e. around 28,000 tasks, each finishing in a few milliseconds or even fractions of a millisecond, so the job as a whole ran very slowly. After I set spark.default.parallelism, the task count dropped to 10 and one full computation went from minutes down to about 20 seconds.
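Before tuning, it is worth checking how many partitions Spark actually chose. A minimal spark-shell sketch (the input path is a placeholder, not from the original post):

```scala
// In spark-shell, `sc` is the pre-built SparkContext.
// The path is a placeholder; point it at your own data.
val rdd = sc.textFile("hdfs:///path/to/input")

// One task is launched per partition, so this is also the
// task count of the "map" stage.
println(rdd.partitions.length)

// Partition count after a shuffle; without spark.default.parallelism
// it defaults to the largest parent RDD's partition count.
val counts = rdd.map(line => (line, 1)).reduceByKey(_ + _)
println(counts.partitions.length)
```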
The parameter can be set in the $SPARK_HOME/conf/spark-defaults.conf configuration file, e.g.:
spark.master                   spark://master:7077
spark.default.parallelism      10
spark.driver.memory            2g
spark.serializer               org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions   50
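The same settings can also be applied per application rather than cluster-wide, either with `--conf spark.default.parallelism=10` on spark-submit or programmatically. A sketch of the programmatic equivalent (the app name is arbitrary; spark.driver.memory is omitted because it must be set before the driver JVM starts):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Programmatic equivalent of the spark-defaults.conf entries above;
// values set here apply only to this application and override the file.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("parallelism-demo") // arbitrary app name
  .set("spark.default.parallelism", "10")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.sql.shuffle.partitions", "50")

val sc = new SparkContext(conf)
```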
Below are the relevant descriptions from the official documentation:
from: http://spark.apache.org/docs/latest/configuration.html
from: http://spark.apache.org/docs/latest/tuning.html
Property Name | Default | Meaning
---|---|---
spark.default.parallelism | For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager: Local mode: number of cores on the local machine; Mesos fine grained mode: 8; Others: total number of cores on all executor nodes or 2, whichever is larger | Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
Level of Parallelism
Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of "map" tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc), and for distributed "reduce" operations, such as groupByKey and reduceByKey, it uses the largest parent RDD's number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.
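The per-operation override mentioned above looks like this; a minimal sketch:

```scala
// reduceByKey accepts an explicit numPartitions argument that overrides
// spark.default.parallelism for this one shuffle.
val pairs = sc.parallelize(1 to 100000).map(n => (n % 10, 1))
val summed = pairs.reduceByKey(_ + _, 10) // 10 reduce tasks
println(summed.partitions.length)         // prints 10
```

Following the docs' 2-3 tasks per core guideline, the 10 used here matches the author's 16-core cluster loosely; a higher value (32-48) would also be reasonable.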