
Introduction to the Hadoop Projects

2013-08-29 11:09
A complete Hadoop cluster has a modular design. Its core projects are:

Hadoop Common: The common utilities that support the other Hadoop modules.
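Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. It stores data and offers functionality similar to that of an ordinary file system.
Hadoop YARN: A framework for job scheduling and cluster resource management. It is the resource management and scheduling system. For example, it covers how to detect that a new machine has joined the cluster and how to assign work to it, and how to detect that a machine has failed and left the cluster and how to move that machine's data and computation to other nodes.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. It is a distributed computing paradigm.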
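
To make the MapReduce paradigm mentioned above more concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop's API. The function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and the word-count task is only an assumed example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # On a real cluster the map calls would run in parallel on many nodes;
    # in this sketch they run one after another in a single process.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

Because each map call looks at only one line and each reduce call at only one key, the work can be spread across machines. That independence is what makes the paradigm suitable for a cluster.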

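Once you download Hadoop from the official website and install it, all four of these projects are installed. They work together to run a Hadoop cluster. HDFS and YARN can also be used by systems other than MapReduce. Spark, a rising star among distributed computing systems, is one example and is also modular. It can use HDFS as its underlying storage layer (or an ordinary local file system), and it can use YARN for resource management and scheduling (or an alternative such as Mesos).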
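
The pluggable storage and scheduling described above can be sketched as a small helper that assembles a spark-submit command line. This is only an illustration. The helper name build_spark_submit, the script name wordcount.py, the host namenode:8020, and both input paths are assumptions, not values from this article. Only the --master option and the "yarn" and "local[*]" master values come from Spark itself.

```python
from typing import List


def build_spark_submit(app: str, input_path: str, use_yarn: bool) -> List[str]:
    """Build a spark-submit argument list for the chosen resource manager."""
    # "yarn" hands scheduling to a YARN cluster; "local[*]" runs Spark in a
    # single process on all local cores, with no cluster manager involved.
    master = "yarn" if use_yarn else "local[*]"
    return ["spark-submit", "--master", master, app, input_path]


if __name__ == "__main__":
    # Hypothetical job that reads from HDFS and is scheduled by YARN.
    print(" ".join(build_spark_submit(
        "wordcount.py", "hdfs://namenode:8020/data/input.txt", use_yarn=True)))
    # The same hypothetical job reading a local file, without a cluster.
    print(" ".join(build_spark_submit(
        "wordcount.py", "file:///tmp/input.txt", use_yarn=False)))
```

The application itself stays the same in both cases. Only the master setting and the URI scheme of the input path change. When running on YARN, Spark also expects HADOOP_CONF_DIR or YARN_CONF_DIR to point at the cluster's configuration files.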

Other Projects in the Hadoop Ecosystem

Many other projects are not required but add features and open up a wider range of use cases. Each of them has to be installed separately.

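For example, Ambari is a web-based tool for managing Hadoop clusters. Hive is a data warehouse built on top of Hadoop that provides SQL-like queries. Its SQL-like query language, called HQL, can be used to query and analyze data stored on HDFS quickly.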

Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: A high-performance coordination service for distributed applications.

Hadoop Distributions

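Linux is an open-source operating system, which is why there are many modified and enhanced Linux distributions such as Ubuntu and CentOS. Hadoop is open source too, so besides the official release from the Hadoop website, commercial companies offer their own enhanced Hadoop distributions. The best known of these companies are Cloudera and Hortonworks. Both provide complete Hadoop solutions and offer consulting, support, and training to their partners.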

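Cloudera's Hadoop distribution is called CDH. It bundles Hadoop, Pig, HBase, Hive, and nearly every other Hadoop-related project, and it is open source. Hortonworks's distribution is called HDP. It likewise bundles many projects, including Hive, Pig, Ambari, and HCatalog, and it is also open source.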

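Overall, compared with the official release, both Cloudera and Hortonworks extend Hadoop's functionality and reduce the time and the level of staff expertise needed to set up and manage a Hadoop cluster.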