Moving Hadoop Beyond Batch with Apache YARN

2013-06-16 21:07
Apache Hadoop 2.0 continues to make its way through the open source community process at the Apache Software Foundation and is getting closer to being declared “ready” from a community development perspective. Once ready, our team at Hortonworks will apply our usual enterprise rigor in providing a tested and integrated distribution that includes Hadoop 2.0 along with the other enterprise-focused services our customers and partners require.

In my roles both at Hortonworks and in the open-source Apache Hadoop community, I’m asked a lot of questions regarding the key aspects and motivations behind Hadoop 2.0. Here is some information to sate the curious mind.


First-generation success inspires second-generation focus

In the early days of Hadoop at Yahoo!, we had a very particular objective: store and process very large amounts of data to support our internet search efforts. And so the first generation of Hadoop was a purpose-built system for web-scale data processing that was embraced by Yahoo! as well as other technology-savvy early adopters such as Facebook.

As usage at Yahoo! began to expand, so did the number of ways that users wanted to interact with the data stored in Hadoop. As with any successful open-source project, the broader ecosystem of Hadoop users responded by contributing additional capabilities to the Hadoop community, with some of the most popular examples being Apache Hive for SQL-based querying, Apache Pig for scripted data processing and Apache HBase as a NoSQL database.

These additional open source projects opened the door for a much richer set of applications to be built on top of Hadoop – but they didn’t really address the design limitation inherent in Hadoop: it was designed as a single-application system with MapReduce at the core (i.e. batch-oriented data processing).


Do we need SQL ON Hadoop or SQL IN Hadoop?

Fast forward to today, and we see that Hadoop’s momentum has continued and many more enterprises (not just web-scale companies) want to store ALL incoming data in Hadoop, and then enable their users to interact with it in a host of different ways: batch, interactive, analyzing data streams as they arrive, and more. And most importantly, they need to be able to do all of this simultaneously, without any single application or query consuming all of the resources of the cluster.

Nothing illustrates this dynamic more clearly than the current industry noise around SQL on Hadoop. All kinds of vendors are clamoring to provide better SQL access to data stored in Hadoop – and so they should, since SQL is understood by many users. Since Apache Hive has been the de facto SQL interface to Hadoop data for many years, we’ve found most users would like to continue to leverage the power of Hive in support of these additional interactive SQL use cases.

But building SQL access on top of Hadoop just highlights the challenge of Hadoop being a single-application system. When I run a SQL query on that data, it could consume all the resources of the cluster and cause performance issues for the other applications and jobs running in the cluster – not a good outcome, to say the least.


YARN enables SQL IN Hadoop and many more applications

When we set out to build Hadoop 2.0, we wanted to fundamentally re-architect Hadoop to be able to run multiple applications against relevant data sets, and to do so in a way that lets multiple types of applications operate efficiently and predictably within the same cluster – this is really the reason behind Apache YARN, which is foundational to Hadoop 2.0. By managing the resource requests across a cluster, YARN turns Hadoop from a single-application system into a multi-application operating system.
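To make the idea of managing resource requests across a cluster a little more concrete, here is a minimal toy sketch in Python. It is not YARN's API and not how any YARN scheduler is actually implemented; the function name `fair_share`, the application names and the numbers are all made up for illustration. It simply shows one well-known way (a max-min fair split) that a shared pool of capacity could be divided so that one greedy request cannot starve the others.

```python
def fair_share(capacity, requests):
    """Toy max-min fair split of `capacity` across applications.

    `requests` maps a (hypothetical) application name to the amount of
    resource it asks for. Returns a dict of granted amounts.
    This is an illustration only, not YARN code.
    """
    grants = {app: 0.0 for app in requests}
    # Applications that still want more than they have been given.
    unsatisfied = {app: float(need) for app, need in requests.items() if need > 0}
    remaining = float(capacity)

    while unsatisfied and remaining > 1e-9:
        # Offer every still-hungry application an equal slice of what is left.
        share = remaining / len(unsatisfied)
        for app in list(unsatisfied):
            give = min(share, unsatisfied[app])
            grants[app] += give
            remaining -= give
            unsatisfied[app] -= give
            if unsatisfied[app] <= 1e-9:
                # Fully served; any unused part of its slice is
                # redistributed in the next round.
                del unsatisfied[app]
    return grants


if __name__ == "__main__":
    # A large SQL query asks for the whole cluster while two smaller
    # applications are also running (all values are invented).
    demo = fair_share(100, {"sql_query": 100, "batch_job": 30, "stream": 20})
    for name, amount in demo.items():
        print(f"{name}: {amount:.1f}")
```

In this made-up scenario the two smaller applications receive everything they asked for and the large query gets roughly the remaining half of the cluster, rather than all of it.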

Getting back to the SQL ON Hadoop point, with YARN we now have the ability to run SQL IN Hadoop. By being IN Hadoop (built on YARN), SQL becomes part of the platform itself and can be managed by YARN to ensure that multiple use cases can be addressed. Why stop at SQL? What about machine learning or modeling? What about processing events (data) as they arrive? Would it not be nice to manage all of these through a common system?

Enter YARN.

By turning Apache Hadoop 2.0 into a multi-application data system, YARN enables the Hadoop community to address a generation of new requirements IN Hadoop. YARN responds to these enterprise challenges by addressing the actual requirements at a foundational level rather than through commercial bolt-ons that complicate the environment for customers.

And so that is the trailer for the story for Hadoop 2.0: Unleashing the Power of YARN. Coming soon to a cluster near you, summer of 2013! Stay tuned!

Ref: http://hortonworks.com/blog/moving-hadoop-beyond-batch-with-apache-yarn/