
Spark Documentation Study Notes 1: Spark Streaming Programming Guide

2016-10-17 15:57

一、 Overview

Definition: Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

How it works: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

Internally, a DStream (discretized stream) is represented as a sequence of RDDs.
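To make the batch model concrete, here is a minimal word-count sketch in Scala. The socket source on localhost:9999, the application name, and the 1-second batch interval are assumptions chosen for illustration, not anything prescribed by the guide.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // Use at least two local threads: one for the receiver, one for processing batches.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    // Each 1-second batch of input becomes one RDD in the resulting DStreams.
    val ssc = new StreamingContext(conf, Seconds(1))

    // DStream of text lines read from a TCP socket (host/port are placeholders).
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print() // Print the first counts of each batch to stdout.

    ssc.start()            // Start receiving and processing data.
    ssc.awaitTermination() // Block until the streaming computation is stopped.
  }
}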

三、Basic Concepts

3.1 Linking

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.0.1</version>
</dependency>

For ingesting data from sources like Kafka, Flume, and Kinesis that are not present in the Spark Streaming core API, you will have to add the corresponding artifact spark-streaming-xyz_2.11 to the dependencies. For example, some of the common ones are as follows.

Source    Artifact
Kafka     spark-streaming-kafka-0-8_2.11
Flume     spark-streaming-flume_2.11
Kinesis   spark-streaming-kinesis-asl_2.11 [Amazon Software License]
3.2 Input DStreams and Receivers


Points to remember


When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then that single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master). A sketch illustrating this follows these points.


 Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.
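To make the first point above concrete, the following hedged sketch wires up two socket receivers; host1, host2, and port 9999 are placeholders. With two receivers, the master is set to local[3] so that at least one thread remains free to process the received batches.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Two receiver-based input DStreams, so the local master needs more than two threads.
val conf = new SparkConf().setMaster("local[3]").setAppName("MultiReceiverExample")
val ssc = new StreamingContext(conf, Seconds(5))

// Each socketTextStream creates its own receiver (hostnames/ports are placeholders).
val stream1 = ssc.socketTextStream("host1", 9999)
val stream2 = ssc.socketTextStream("host2", 9999)

// Union the two input streams and process them as a single DStream.
val combined = stream1.union(stream2)
combined.count().print()

ssc.start()
ssc.awaitTermination()

On a cluster, the same rule means requesting more executor cores in total than the number of receivers, for example via spark-submit options such as --executor-cores and --num-executors.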


3.3 Basic Sources