
Spark Quick Start Guide

2016-04-12 14:00

Reposted from:

http://blog.csdn.net/macyang/article/details/7100523

What is Spark?

Spark is a MapReduce-like cluster computing framework designed to support low-latency iterative jobs and interactive use from an interpreter. It is written in Scala, a high-level language for the JVM, and exposes a clean language-integrated syntax that makes it easy to write parallel jobs. Spark runs on top of the Mesos cluster manager.

Where to download Spark?

git clone git://github.com/mesos/spark.git

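How to build and run Spark?

1) Scala 2.9 or later is required (add its bin directory to PATH, or set the SCALA_HOME environment variable)

2) Spark supports a local mode and a cluster mode; local mode does not require Mesos

3) To run Spark on a cluster, Mesos must be installed

4) Compile and package with the sbt bundled with Spark: sbt/sbt compile, sbt/sbt assembly

5) Run Spark programs with the run script bundled with Spark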

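How to verify that the Spark environment works?

Run the following from the Spark directory:

1) local, single thread: ./run spark.examples.SparkPi local

2) local, multiple cores: ./run spark.examples.SparkPi local[2]

3) Mesos master on localhost: ./run spark.examples.SparkPi master@localhost:5050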
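For orientation, SparkPi estimates π by Monte Carlo sampling: it draws random points in a square and counts how many land inside the inscribed circle. The sketch below is plain Python, not Spark code, and only illustrates the computation being verified. The names `count_hits` and `estimate_pi`, the seeds, and the sample counts are assumptions made for this example; the per-slice counting followed by a sum loosely mirrors a map step and a reduce step.

```python
import random


def count_hits(samples, seed):
    """Count random points in [-1, 1] x [-1, 1] that fall inside the unit circle."""
    rng = random.Random(seed)  # fixed seed per slice keeps the sketch reproducible
    hits = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y < 1.0:
            hits += 1
    return hits


def estimate_pi(slices=2, samples_per_slice=100_000):
    """Estimate pi from several independent slices of samples."""
    # "Map" step: each slice counts its own hits independently.
    partial_counts = [count_hits(samples_per_slice, seed) for seed in range(slices)]
    # "Reduce" step: add the partial counts together.
    total_samples = slices * samples_per_slice
    # Circle area / square area = pi / 4, so scale the hit ratio by 4.
    return 4.0 * sum(partial_counts) / total_samples


if __name__ == "__main__":
    print("Pi is roughly", estimate_pi())
```

If the real SparkPi run prints a value close to 3.14, the environment is most likely working.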

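What does the Spark Programming Guide cover?

1) Put the Spark jar (built with sbt/sbt assembly) on the CLASSPATH

2) A Spark application can run in local mode or on Mesos

3) Spark provides two kinds of RDD: Parallelized Collections and Hadoop Datasets. RDDs are fault-tolerant and can recover partitions that are lost when a node crashes

4) RDDs support two types of operations: transformations and actions. Transformations are lazy: they are only executed when an action is actually called

5) Spark provides two types of shared variables: Broadcast Variables and Accumulators. A broadcast variable ships one copy of the shared variable to each machine, rather than one copy per task as happens by default. An accumulator only supports additive operations such as counts and sums, and only the driver program can read its value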
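The lazy-transformation and accumulator ideas in points 4 and 5 can be sketched with a toy model. This is plain, single-process Python and not Spark's API: `ToyRDD`, `ToyAccumulator`, and `demo` are hypothetical names invented for this illustration. Transformations only record a step, actions replay the recorded steps, and the accumulator can only be added to.

```python
from functools import reduce


class ToyRDD:
    """A toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, data, steps=()):
        self._data = list(data)      # source data
        self._steps = tuple(steps)   # recorded transformations

    # Transformations: record the step and return a new object; compute nothing.
    def map(self, fn):
        return ToyRDD(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._data, self._steps + (("filter", pred),))

    def _compute(self):
        # Replay the recorded steps over the source data.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: these are what actually trigger the computation.
    def collect(self):
        return list(self._compute())

    def reduce(self, fn):
        return reduce(fn, self._compute())


class ToyAccumulator:
    """An add-only counter, loosely modelled on the accumulator idea."""

    def __init__(self, initial=0):
        self._value = initial

    def add(self, amount):
        self._value += amount

    @property
    def value(self):
        return self._value


def demo():
    calls = ToyAccumulator(0)

    def square(x):
        calls.add(1)  # count how often the transformation really runs
        return x * x

    odd_squares = ToyRDD(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
    calls_before = calls.value  # still 0: no action has been called yet
    total = odd_squares.reduce(lambda a, b: a + b)  # action: 1 + 9 + 25
    return calls_before, total, calls.value


if __name__ == "__main__":
    print(demo())
```

Because the toy object keeps the recorded steps rather than the results, it can recompute its output from the source data at any time. That is loosely analogous to how an RDD can rebuild a lost partition, although real Spark does this per partition across machines.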

How to write Spark applications?

Study the Spark examples, for instance: http://www.spark-project.org/examples.html

https://github.com/mesos/spark/tree/master/examples

What to do when you run into problems?

First search Google for the problem; if that does not solve it, ask the authors on the Spark Google group:

http://groups.google.com/group/spark-users?hl=en

How to understand Spark in depth?

Read the Spark paper: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf

Read the Spark source code: https://github.com/mesos/spark
