Moving Hadoop Beyond Batch with Apache YARN
2013-06-16 21:07
Apache Hadoop 2.0 continues to make its way through the open source community process at the Apache Software Foundation and is getting closer to being declared “ready” from a community development perspective. Once ready, our team at Hortonworks will apply
our usual enterprise rigor in providing a tested and integrated distribution that includes Hadoop 2.0 along with the other enterprise-focused services our customers and partners require.
In my roles both at Hortonworks and in the open-source Apache Hadoop community, I’m asked a lot of questions regarding the key aspects and motivations behind Hadoop 2.0. Here is some information to sate the curious mind.
First-generation success inspires second-generation focus
In the early days of Hadoop at Yahoo!, we had a very particular objective: store and process very large amounts of data to support our internet search efforts. And so the first generation of Hadoop was a purpose-built system for web-scale data processing that was embraced by Yahoo! as well as other technology-savvy early adopters such as Facebook.
As usage at Yahoo! began to expand, so did the number of ways that users wanted to interact with the data stored in Hadoop. As with any successful open-source project, the broader ecosystem of Hadoop users responded by contributing additional capabilities to the Hadoop community, with some of the most popular examples being Apache Hive for SQL-based querying, Apache Pig for scripted data processing and Apache HBase as a NoSQL database.
These additional open-source projects opened the door for a much richer set of applications to be built on top of Hadoop, but they didn’t really address the design limitation inherent in Hadoop: it was designed as a single-application system with MapReduce at the core (i.e. batch-oriented data processing).
Do we need SQL ON Hadoop or SQL IN Hadoop?
Fast forward to today, and we see that Hadoop’s momentum has continued and many more enterprises (not just web-scale companies) want to store ALL incoming data in Hadoop, and then enable their users to interact with it in a host of different ways: batch, interactive, analyzing data streams as they arrive, and more. Most importantly, they need to be able to do all of this simultaneously, without any single application or query consuming all of the cluster’s resources.
Nothing illustrates this dynamic more clearly than the current industry noise around SQL on Hadoop. All kinds of vendors are clamoring to provide better SQL access to data stored in Hadoop, and so they should, since SQL is understood by many users. Since Apache Hive has been the de facto SQL interface to Hadoop data for many years, we’ve found most users would like to continue to leverage the power of Hive in support of these additional interactive SQL use cases.
But building SQL access on top of Hadoop just highlights the challenge of Hadoop being a single-application system. When I run a SQL query on that data, it could consume all the resources of the cluster and cause performance issues for the other applications and jobs running in the cluster, which is not a good outcome, to say the least.
YARN enables SQL IN Hadoop and many more applications
When we set out to build Hadoop 2.0, we wanted to fundamentally re-architect Hadoop to be able to run multiple applications against relevant data sets, and to do so in a way where multiple types of applications can operate efficiently and predictably within the same cluster. This is really the reason behind Apache YARN, which is foundational to Hadoop 2.0. By managing the resource requests across a cluster, YARN turns Hadoop from a single-application system into a multi-application operating system.
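To make "managing the resource requests across a cluster" a little more concrete, here is a toy sketch in Python. It is not YARN's API and does not reflect how YARN's schedulers are implemented; the class ToyResourceManager, the max_share_percent caps and the application names are all hypothetical, chosen only to illustrate the idea that a central arbiter can stop any one application from taking the whole cluster.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyResourceManager:
    """Toy model of cluster-wide arbitration. Hypothetical; NOT the YARN API."""

    total_containers: int
    # Assumed per-application caps, as a percentage of the whole cluster.
    max_share_percent: Dict[str, int]
    # Containers currently held by each application.
    allocated: Dict[str, int] = field(default_factory=dict)

    def request(self, app: str, wanted: int) -> int:
        """Grant up to `wanted` containers, limited by the app's cap and free capacity."""
        cap = self.total_containers * self.max_share_percent[app] // 100
        used = self.allocated.get(app, 0)
        free = self.total_containers - sum(self.allocated.values())
        granted = max(0, min(wanted, cap - used, free))
        self.allocated[app] = used + granted
        return granted

    def release(self, app: str, count: int) -> None:
        """Return containers to the cluster when an application finishes work."""
        used = self.allocated.get(app, 0)
        self.allocated[app] = max(0, used - count)


if __name__ == "__main__":
    rm = ToyResourceManager(
        total_containers=100,
        max_share_percent={"sql": 50, "batch": 30, "streaming": 20},
    )
    # A greedy SQL query asks for the whole cluster but is held to its cap.
    print("sql granted:", rm.request("sql", 100))
    # Batch and streaming work still get capacity at the same time.
    print("batch granted:", rm.request("batch", 30))
    print("streaming granted:", rm.request("streaming", 20))
```

In this sketch the SQL application asks for every container but is held to its assumed 50 percent cap, so the batch and streaming applications can still run alongside it. A real cluster scheduler deals with far more than this (memory and CPU dimensions, queues, locality, preemption), so treat it as a picture of the idea only.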
Getting back to the SQL ON Hadoop point, with YARN we now have the ability to run SQL IN Hadoop. By being IN Hadoop (built on YARN), SQL access becomes part of the platform itself and can be managed by YARN to ensure that multiple use cases can be addressed. Why stop at SQL? What about machine learning or modeling? What about processing events (data) as they arrive? Wouldn’t it be nice to manage all of these through a common system?
Enter YARN.
By turning Apache Hadoop 2.0 into a multi-application data system, YARN enables the Hadoop community to address a generation of new requirements IN Hadoop. YARN responds to these enterprise challenges by addressing the actual requirements at a foundational level, rather than through commercial bolt-ons that complicate the environment for customers.
And so that is the trailer for the story for Hadoop 2.0: Unleashing the Power of YARN. Coming soon to a cluster near you, summer of 2013! Stay tuned!
Ref: http://hortonworks.com/blog/moving-hadoop-beyond-batch-with-apache-yarn/