
Is Apache Spark the Next Big Thing in Big Data?

2014-03-29 16:05
In any article or blog post, any mention of Big Data usually includes something about Hadoop. When it comes to Big Data, Apache Hadoop has been the big elephant in the room, and the release of Hadoop 2.0 in 2013 made the environment easier and more stable. But even with the inclusion of Impala for querying stored information in real time, Hadoop is still a batch-based system that processes data in, well, batch mode.



Big Data processing is said to have three main characteristics, a.k.a. the 3Vs: volume, velocity and variety. Many data scientists and Big Data engineers will argue that Apache Hadoop processes data fast enough to meet the velocity criterion. Perhaps so. However, coming from a real-time and high-performance computing background, this argument about Hadoop and velocity reminds me of engineers and computer scientists in the ‘90s who would ask, “Why do you need 64 bits when 32 bits work perfectly fine?”

This ability to process Big Data really, really fast, at higher velocities than Hadoop, has left the door open for a myriad of proprietary and open source solutions. It has also led the Apache Foundation, the “grey beards” sponsoring Hadoop and others, to fast-track viable solutions and projects from incubator status to top-level projects.

Enter Apache Spark, a “lightning-fast cluster computing” solution for Big Data processing.

For those of you who run Apache Hadoop 2.0, how would you like to run programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk? How would you like to write applications quickly in Java, Python or Scala, in such a way that you could build parallel applications that take advantage of your distributed environment? How would you like to combine SQL, streaming and complex analytics in the same application? And finally, how would you like to still access and process all your data from your current Hadoop environment? This is what Apache Spark can do.
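
As a taste of the API, here is a minimal sketch of a standalone word count in Scala against the 0.9-era Spark API. The HDFS paths and master URL are placeholders of mine, not details from the article.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD operations such as reduceByKey

object WordCount {
  def main(args: Array[String]) {
    // "local[4]" runs Spark in-process with 4 worker threads;
    // point this at a cluster URL (e.g. spark://host:7077) for a real deployment
    val sc = new SparkContext("local[4]", "WordCount")

    // hypothetical input path -- any local or HDFS text file works
    val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")

    // each transformation runs in parallel across the RDD's partitions
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs://namenode:9000/data/counts")
    sc.stop()
  }
}
```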

So what is Apache Spark? Developed in Scala, Spark is an open source distributed computing framework for advanced analytics that can leverage much of the Hadoop storage environment (like HDFS). It was originally developed as a research project at UC Berkeley’s
AMPLab. In June 2013, Spark achieved incubator project status at the Apache Foundation. In February 2014, it was promoted to a top level project. As of this writing, Apache Spark 0.9.0 is available as open source.


Why Spark instead of Hadoop?

From the beginning, Spark was designed to support in-memory processing, so iterative algorithms can be developed without writing out a result set after each pass through the data. This ability to keep everything in memory is a high-performance computing technique applied to advanced analytics. It allows Spark to achieve processing speeds reported to be up to 100 times faster than comparable algorithms using MapReduce.
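
To make the iterative pattern concrete, here is a hedged sketch of a tiny gradient-descent loop over a cached RDD, loosely modeled on Spark’s own examples; the synthetic data and learning rate are mine, not the article’s.

```scala
import scala.util.Random
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // double-RDD operations such as mean

object IterativeFit {
  case class Point(x: Double, y: Double)

  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "IterativeFit")

    // synthetic data: y is roughly 3x plus noise
    val points = sc.parallelize(1 to 100000).map { _ =>
      val x = Random.nextDouble()
      Point(x, 3 * x + Random.nextGaussian() * 0.1)
    }.cache() // pin the dataset in memory across iterations

    var w = 0.0 // the single weight we are fitting
    for (i <- 1 to 20) {
      // one full pass over the cached data per iteration; nothing is
      // written back to disk between passes, unlike chained MapReduce jobs
      val gradient = points.map(p => (w * p.x - p.y) * p.x).mean()
      w -= 0.5 * gradient
    }
    println("fitted weight: " + w)
    sc.stop()
  }
}
```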

Spark has an integrated framework for performing advanced analytics, including the machine learning library MLlib, the graph engine GraphX, the streaming analytics engine Spark Streaming and the interactive real-time query tool Shark. This single platform gives users consistent results across different types of analysis.
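
As one example of those integrated libraries, here is a small sketch of clustering with MLlib, assuming the 0.9-era API in which feature vectors were plain Array[Double]; the input file name is a placeholder.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans

object ClusterDemo {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "ClusterDemo")

    // hypothetical file: one point per line, features separated by spaces
    val data = sc.textFile("kmeans_data.txt")
                 .map(_.split(' ').map(_.toDouble))
                 .cache()

    // find 2 clusters, capped at 20 iterations
    val model = KMeans.train(data, 2, 20)
    model.clusterCenters.foreach(c => println(c.mkString(", ")))
    sc.stop()
  }
}
```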

Spark has an abstraction layer called Resilient Distributed Datasets (RDDs): read-only partitioned collections of records created through “deterministic operations on stable data.” RDDs include “information about data lineage together with instructions for data transformation and instructions for persistence.” They’re a distributed collection of objects that can be cached in memory across multiple cluster nodes. RDDs are designed with failure in mind: if a partition is lost, it is automatically reconstructed from its lineage. For more information on RDDs, please see this white paper from UC Berkeley’s AMPLab.
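
A quick illustration of lineage, as it might be typed into spark-shell (where sc is predefined); the log path and record format here are hypothetical.

```scala
// base RDD backed by stable storage (hypothetical path)
val lines  = sc.textFile("hdfs://namenode:9000/logs/app.log")
// each transformation is deterministic, so this chain *is* the lineage
val errors = lines.filter(_.contains("ERROR"))
val fields = errors.map(_.split(" ")(1))
fields.cache()  // ask Spark to keep the result in cluster memory
fields.count()  // first action computes and caches the partitions
// if a node holding cached partitions is lost, Spark replays
// textFile -> filter -> map for just the missing partitions
```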

Spark works with any file stored in HDFS or any other storage system supported by Hadoop and supports text files, SequenceFiles and any other Hadoop InputFormat files.
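
A short sketch of what that looks like from spark-shell; the paths and key/value types are placeholders.

```scala
import org.apache.spark.SparkContext._ // Writable converters (auto-imported in spark-shell)

// plain text file, one record per line
val text = sc.textFile("hdfs://namenode:9000/data/plain.txt")

// Hadoop SequenceFile with String keys and Int values
val seq = sc.sequenceFile[String, Int]("hdfs://namenode:9000/data/seqfile")

// any other InputFormat via the generic Hadoop hook
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://namenode:9000/data/other")
```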

Spark Streaming provides an abstraction called “discretized streams,” or DStreams. A DStream is a continuous sequence of RDDs representing a stream of data, created either from live ingested data or by transforming other DStreams. Spark receives data, divides it into batches, then replicates the batches for fault tolerance and persists them in memory, where they are available for processing. Each batch becomes an RDD of its own, which is then processed at short, regular intervals using the usual set of Spark operations.
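
Here is a minimal sketch of that flow against the 0.9-era streaming API; the socket source and port are stand-ins for a real feed.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream operations

object StreamingCounts {
  def main(args: Array[String]) {
    // one-second batch interval: incoming data is chopped into
    // one-second micro-batches, each of which becomes an RDD
    val ssc = new StreamingContext("local[2]", "StreamingCounts", Seconds(1))

    // hypothetical source: text lines arriving on a local socket
    val lines = ssc.socketTextStream("localhost", 9999)

    // ordinary RDD-style transformations, applied to every batch
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```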

Spark supports programming interfaces for Scala, Java, Python and now R.

One of the really cool things about Spark is that you can actually download and run it on your laptop (Note: Don’t do this on a Chromebook). For those new to Spark,
this is an excellent opportunity to get familiar with this new technology and get ahead of the game.
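
If you do try it, the quick-start flow is roughly the following, a sketch assuming the 0.9 release layout:

```scala
// after unpacking a Spark release, start the interactive shell with:
//   ./bin/spark-shell
// it launches a local cluster and predefines a SparkContext named sc
val numbers = sc.parallelize(1 to 1000)
numbers.filter(_ % 2 == 0).count() // res0: Long = 500
```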

For those who are interested in learning more about Apache Spark, here’s a quick tutorial.

Who supports Spark commercially? Cloudera, Hortonworks and Databricks all appear to. For those interested in seeing how businesses are using Spark, the latest O’Reilly Strata 2014 Conference in Santa Clara had a great presentation by one of the core Spark team members. You can see the YouTube video here.


Conclusion

I’ve had an opportunity to use Apache Spark in real-world analytics since Sept. 2013. I found GraphX to be a bit buggy, but then I believe it’s still in beta. I used Scala, Python and R for developing applications on Spark and have found it to be blazingly fast. Was this on a laptop? No. It was on a high-performance computing cluster of six nodes with lots and lots of memory that also had an existing Hadoop infrastructure.

Would I recommend Spark to others? Yes. Spark fixes many “oversights” that I see in Hadoop. It’s fast and gets the job I need done very quickly.

Would I dump my Hadoop and/or Storm environments? No, not yet. Spark is very new and only recently are companies starting to adopt and support it. If this support continues, Spark could be the next evolutionary change in Big Data processing environments.

That said, while Spark looks great, how does it hold up at really large, terabyte-scale data volumes? So far I haven’t seen anyone process data at that scale with it.

Ref: http://news.dice.com/2014/03/12/apache-spark-next-big-thing-big-data/