
Spark Streaming Programming Guide (Partial)

2015-05-24 13:32
Reference link:

streaming-programming-guide

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark's machine learning and graph processing algorithms on data streams.

Translation:

Spark Streaming is an extension of the core Spark API that provides scalable, high-throughput, fault-tolerant stream processing of live data streams.

Data can be ingested from many sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets,

and can be processed with complex algorithms expressed using high-level functions such as map, reduce, join and window.

Finally, the processed data can be pushed out to filesystems, databases, and live dashboards.

In fact, you can apply Spark's machine learning and graph processing algorithms on data streams.
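To illustrate the window operations mentioned above, here is a minimal sketch that is not part of the original guide: a word count over a sliding window of a socket stream. The host, port, the 30-second window and the 10-second slide are assumed values for the example.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]) {
    // local[2]: a receiver-based source needs at least two threads
    val conf = new SparkConf().setAppName("WindowedWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Text received from a TCP socket (host and port are assumed values)
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))

    // Word counts over the last 30 seconds of data, recomputed every 10 seconds
    val windowedCounts = words
      .map(w => (w, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}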



Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

Translation:

Internally, it works as follows: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.



Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.

Translation:

Spark Streaming provides a high-level abstraction called discretized stream, or DStream, which represents a continuous stream of data.

DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis,

or by applying high-level operations on other DStreams.

Internally, a DStream is represented as a sequence of RDDs (Resilient Distributed Datasets).
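To make the "sequence of RDDs" point concrete, here is a minimal sketch that is not part of the original guide: foreachRDD exposes the ordinary RDD that backs each batch of a DStream. The socket source, host and port are assumed values.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamAsRDDs {
  def main(args: Array[String]) {
    // local[2]: the socket receiver occupies one of the threads
    val conf = new SparkConf().setAppName("DStreamAsRDDs").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(4))

    // One RDD is produced for every 4-second batch of this DStream
    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD gives access to the plain RDD behind each batch
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time has ${rdd.count()} lines")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}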

Below is an example from the AMP Camp training. It was previously run from the command line; here it has been changed to run directly as an IDEA project (this part will be covered later).

A small example to demonstrate Spark processing a file stream:

1. Create the input directory

mkdir -p /tmp/fileStream

2. Create a Scala project

Add the Scala and Spark jars to the project, for example scala-sdk-2.10.4 and spark-assembly-1.3.1-hadoop2.6.0.
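If you prefer to manage the dependency with sbt instead of adding the assembly jar by hand (this is not in the original post), a minimal build.sbt for roughly the same versions might look like the following sketch; the project name is an assumption.

// Minimal sbt build for the versions mentioned in this post (sketch)
name := "spark-streaming-file-demo"

version := "0.1"

scalaVersion := "2.10.4"

// spark-streaming pulls in spark-core transitively
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.1"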

3. Create a new Scala object

import org.apache.spark.streaming._
import org.apache.spark.SparkConf

object TestFileStreaming {
  def main(args: Array[String]) {
    // Run locally with a batch interval of 4 seconds
    val conf = new SparkConf().setAppName("TestFileStreaming").setMaster("local")
    val ssc = new StreamingContext(conf, Seconds(4))

    // Watch the directory for newly created files and count the words in each batch
    val lines = ssc.textFileStream("/tmp/fileStream")
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()

    // Start the streaming computation and block until it is stopped
    ssc.start()
    ssc.awaitTermination()
  }
}


Key points:

a. The StreamingContext processes the data once every 4 seconds (the batch interval);

b. RDDs are taken from the stream and processed;

c. After starting the streaming service, be sure to call awaitTermination; otherwise the program exits immediately (a variation on this is sketched below).
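As a variation on point c that is not in the original post, the wait can also be bounded with awaitTerminationOrTimeout and followed by a graceful stop. This is only a sketch; the 60-second timeout is an assumed value.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BoundedFileStreaming {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("BoundedFileStreaming").setMaster("local")
    val ssc = new StreamingContext(conf, Seconds(4))

    // Same word count as above, written as a single pipeline
    ssc.textFileStream("/tmp/fileStream")
      .flatMap(_.split(" "))
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    // Wait at most 60 seconds (assumed value); returns false if the timeout elapses
    if (!ssc.awaitTerminationOrTimeout(60 * 1000)) {
      // Stop the SparkContext too, and let in-flight batches finish first
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
  }
}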

4. Provide input

echo "hello world" > /tmp/fileStream/hello1

echo "hello world" > /tmp/fileStream/hello2

Then you can see the execution results in the console.
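For reference (not shown in the original post), the output of wordCounts.print() for a batch looks roughly like the block below. The timestamp is purely illustrative, and whether the counts are 1 or 2 depends on whether the two files land in the same 4-second batch.

-------------------------------------------
Time: 1432442161000 ms
-------------------------------------------
(hello,2)
(world,2)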