
Why do we need another Big Data processing engine like SPARK?

2014-03-07 14:17
The current ubiquitous standard for storing and processing very large data is Hadoop. It is an open-source Apache project, with storage provided by HDFS (the Hadoop Distributed File System) and processing by the MapReduce computing paradigm. Fault tolerance in Hadoop is achieved by replication, so even if a node in the cluster fails, the system keeps functioning without losing any data.

Computing over large data across a cluster of commodity machines inherently involves a good amount of network and disk I/O for each Map and Reduce stage; as much as 90% of the time is spent on I/O rather than actual computation, making Hadoop a high-latency system. Hadoop is a batch processing system for very large data, often on the order of several terabytes to petabytes. If the hard disks required to store 40 PB of data were stacked one on top of another, the stack would be twice the height of the Empire State Building!

Although MapReduce is a great computing paradigm for distributed programming, it is not easy to write programs directly in MapReduce. So a higher-level abstraction was needed, and that gave birth to Hive, a declarative language, and Pig (Pig Latin), a scripting framework, both of which run on top of MapReduce. Any job written in Hive or Pig essentially gets converted into MapReduce jobs. That still does not solve the high-latency problem, though. Imagine cases where you need an answer within a bounded period of time, otherwise the purpose of the analysis is lost: fraud detection, spam analysis, face detection, and so on. Many algorithms for this kind of work, for machine learning, or for computing PageRank are inherently multi-pass or iterative in nature. Iterative operations are yet another difficulty for Hadoop.

So we needed a low-latency distributed system where iterative algorithms can be run with ease. Welcome to SPARK, a UC Berkeley research project that essentially solves these problems. SPARK is part of the Berkeley Data Analytics Stack (BDAS) developed at AMPLab. Other parts of the stack are Mesos and Shark (BlinkDB and MLbase have not yet been released). Mesos is a cluster manager, and Shark is Hive on Spark.



SPARK can run in three modes: local, standalone, and on Mesos. It can also run on YARN.
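As a sketch of what mode selection looks like in code, here is how a SparkContext can be constructed with different master URLs, assuming the Scala API of the early 0.x releases where the package was simply `spark` (host names, ports, and the exact URL schemes are placeholders; check the docs for your version):

    import spark.SparkContext

    // Local mode: run on a single box with 2 worker threads
    val scLocal = new SparkContext("local[2]", "demo")

    // Standalone mode: connect to a Spark standalone master
    val scStandalone = new SparkContext("spark://master-host:7077", "demo")

    // Mesos mode: let the Mesos cluster manager schedule the work
    val scMesos = new SparkContext("mesos://master-host:5050", "demo")

The rest of the program stays the same; only the master URL changes.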



SPARK is built with Scala, and its code base is about 22 KLOC, roughly one tenth the size of Hadoop's.

SPARK has been built with generality in mind. It supports diverse workloads: not just the batch jobs that are run with MapReduce, but also iterative algorithms like machine learning and graph algorithms, from jobs that finish in sub-second time to jobs that run for hours. You can combine a wide array of operators and group them together, all in a fault-tolerant way. It is a self-healing kind of system, so the user need not worry about fault tolerance.
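To see how operators chain together, here is the classic word count, which takes only a few lines in Spark versus dozens in a raw MapReduce job (a sketch assuming an existing SparkContext `sc`; the paths are placeholders):

    import spark.SparkContext._   // brings in pair-RDD operations such as reduceByKey

    val counts = sc.textFile("/path/to/input")
                   .flatMap(_.split("\\s+"))       // split lines into words
                   .map(word => (word, 1))         // pair each word with a count of 1
                   .reduceByKey(_ + _)             // sum the counts per word, distributed
    counts.saveAsTextFile("/path/to/output")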

Some of the beautiful things in SPARK are:

Speed. Shared, in-memory, immutable datasets greatly reduce network and disk I/O.

Consistency comes for free, because the datasets are immutable.

Local mode: things can be tried out on a single box (even a Windows machine). Once you are comfortable, you can try the same code on a cluster. This shortens the learning curve.

REPL: the ability to try things interactively from the command line. This is a great way to start without writing a single standalone program.
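For example, a session in the Spark shell might look like this; `sc` is the SparkContext the shell provides, and the file path is a placeholder:

    scala> val lines = sc.textFile("/path/to/app.log")     // lazy: nothing is read yet
    scala> val errors = lines.filter(_.contains("ERROR"))  // a transformation, still lazy
    scala> errors.count()                                  // an action: the job runs here and returns the count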

The primary language is Scala, although Java and Python are also supported. Scala is a great language that combines functional programming and OOP. It is a JVM language and can use existing Java libraries.

Fault tolerance without replication (through RDDs and their lineage) and fast recovery time. Self-healing works under the hood in case of a failure.

Interoperability: it can talk to HDFS, S3, EC2, MPI, and even the local filesystem.

Not only map-reduce. Spark's programming model includes mutable accumulators and broadcast variables alongside immutable RDDs, together with two types of operations: lazy transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
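A minimal sketch of these pieces working together, assuming an existing SparkContext `sc` (the file path and lookup table are made up for illustration):

    // Broadcast variable: a read-only lookup table shipped once to each node
    val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    // Accumulator: a counter that workers can only add to, read back on the driver
    val badRecords = sc.accumulator(0)

    val records = sc.textFile("/path/to/records.csv")       // transformation: lazy
    val resolved = records.map { line =>
      val code = line.split(",")(0)
      if (!countryNames.value.contains(code)) badRecords += 1
      countryNames.value.getOrElse(code, "Unknown")
    }                                                        // still lazy, nothing has run

    resolved.take(10).foreach(println)                       // action: computation happens here
    println("bad records seen so far: " + badRecords.value)  // accumulator read on the driver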

Persisting or caching a dataset in memory across operations. This significantly increases the speed of subsequent operations on the same dataset.
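For instance, an iterative job can cache the dataset it revisits, so every pass after the first reads from memory rather than disk (a sketch assuming an existing SparkContext `sc` and a whitespace-separated numeric input file):

    val points = sc.textFile("/path/to/points.txt")
                   .map(_.trim.split("\\s+").map(_.toDouble))
                   .cache()                    // keep the parsed dataset in memory

    // Each iteration reuses the cached data instead of re-reading the file
    for (i <- 1 to 10) {
      val total = points.map(_.sum).reduce(_ + _)
      println("iteration " + i + ": " + total)
    }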

Easy deployability in the cloud, such as on Amazon Web Services.

Projects like these put great power in the hands of the masses. Storing, processing, analyzing: the entire Big Data stack is available for free as open-source projects.

I will discuss things in more detail in my next writeup: why Scala is great, and SPARK Streaming, which is slated to be released as alpha in the upcoming 0.7 version.

Ref: http://arindampaul.tumblr.com/post/42919937653/why-do-we-need-another-big-data-processing-engine-like