HDFS Users Guide

2017-08-27 14:54

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster consists primarily of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. The HDFS Architecture Guide describes HDFS in detail. This user guide deals primarily with how users and administrators interact with HDFS clusters. The HDFS architecture diagram depicts the basic interactions among the NameNode, the DataNodes, and the clients. Clients contact the NameNode for file metadata or file modifications and perform actual file I/O directly with the DataNodes.

The following are some of the salient features that could be of interest to many users.

Hadoop, including HDFS, is well suited for distributed storage and distributed processing using commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. MapReduce, well known for its simplicity and its applicability to a large set of distributed applications, is an integral part of Hadoop.

HDFS is highly configurable with a default configuration well suited for many installations. Most of the time, configuration needs to be tuned only for very large clusters.

Hadoop is written in Java and is supported on all major platforms.

Hadoop supports shell-like commands to interact with HDFS directly.

The NameNode and DataNodes have built-in web servers that make it easy to check the current status of the cluster.

New features and improvements are regularly implemented in HDFS. The following is a subset of useful features in HDFS:

File permissions and authentication.

Rack awareness: to take a node’s physical location into account while scheduling tasks and allocating storage.

Safemode: an administrative mode for maintenance.

fsck: a utility to diagnose the health of the file system and to find missing files or blocks.

fetchdt: a utility to fetch DelegationToken and store it in a file on the local system.

Balancer: a tool to balance the cluster when the data is unevenly distributed among DataNodes.

Upgrade and rollback: after a software upgrade, it is possible to roll back to the state HDFS was in before the upgrade in case of unexpected problems.

Secondary NameNode: performs periodic checkpoints of the namespace and helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode.

Checkpoint node: performs periodic checkpoints of the namespace and helps minimize the size of the log, stored at the NameNode, that contains changes to HDFS. It replaces the role previously filled by the Secondary NameNode, though it is not yet battle-hardened. The NameNode allows multiple Checkpoint nodes simultaneously, as long as there are no Backup nodes registered with the system.

Backup node: an extension to the Checkpoint node. In addition to checkpointing, it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state. Only one Backup node may be registered with the NameNode at once.
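
The checkpointing idea shared by the Secondary NameNode, Checkpoint node, and Backup node entries can be pictured with a toy model. This is only a sketch under loud assumptions: ToyNameNode, apply_edit, and checkpoint are hypothetical names, the namespace is reduced to a dict of path to replication factor, and nothing here reflects how HDFS actually stores or transfers its image and log. It only shows why folding the logged modifications into a namespace snapshot keeps the log small.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Hypothetical stand-in for a NameNode: a snapshot plus a log of changes."""
    image: dict = field(default_factory=dict)  # last checkpointed namespace
    edits: list = field(default_factory=list)  # modifications since that checkpoint

    def log(self, op, path, replication=None):
        # Every namespace change is appended to the log; the log grows until
        # the next checkpoint.
        self.edits.append((op, path, replication))


def apply_edit(namespace, edit):
    """Apply one logged modification to a namespace dict, in place."""
    op, path, replication = edit
    if op == "create":
        namespace[path] = replication
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(namenode):
    """Fold all logged edits into a new snapshot and start a fresh, empty log."""
    merged = dict(namenode.image)
    for edit in namenode.edits:
        apply_edit(merged, edit)
    namenode.image = merged
    namenode.edits = []
    return merged


nn = ToyNameNode()
nn.log("create", "/user/alice/a.txt", 3)
nn.log("create", "/user/alice/b.txt", 2)
nn.log("delete", "/user/alice/a.txt")
print("edits before checkpoint:", len(nn.edits))
checkpoint(nn)
print("edits after checkpoint:", len(nn.edits), "namespace:", nn.image)
```

In this picture, a Backup node would call apply_edit on its own in-memory namespace as each edit arrives, which is why its copy stays in sync without waiting for a periodic merge.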

Notes on the feature list above: the file containing the log of HDFS modifications, whose size the Secondary NameNode helps keep within limits, is the edit log (editLog). The stream of edits that the Backup node receives is that same editLog, and the in-memory copy of the namespace it maintains is the file system metadata.

Prerequisites

The following documents describe how to install and set up a Hadoop cluster:

Single Node Setup for first-time users.
Cluster Setup for large, distributed clusters.

The rest of this document assumes the user is able to set up and run an HDFS cluster with at least one DataNode. For the purposes of this document, the NameNode and DataNode may run on the same physical machine.

Web Interface

The NameNode and each DataNode run an internal web server that displays basic information about the current status of the cluster. With the default configuration, the NameNode front page is at http://namenode-name:9870/. It lists the DataNodes in the cluster and basic statistics of the cluster. The web interface can also be used to browse the file system (using the “Browse the file system” link on the NameNode front page).
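
As a small illustration, the front page address can be built and probed from a script. This is a minimal sketch, not part of Hadoop: namenode_front_page and is_reachable are hypothetical helper names, namenode-name is the placeholder host from the text above, and 9870 is the default port quoted there, so a cluster with a customized configuration may use a different one.

```python
from urllib.error import URLError
from urllib.request import urlopen


def namenode_front_page(host, port=9870):
    """Return the NameNode front page URL (9870 is the default quoted above)."""
    return f"http://{host}:{port}/"


def is_reachable(url, timeout=3.0):
    """Return True if an HTTP GET on `url` gets a non-error response in time."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (URLError, OSError):
        # Covers DNS failures, refused connections, timeouts, and HTTP errors.
        return False


# "namenode-name" is the placeholder from the text; use a real host name here.
print(namenode_front_page("namenode-name"))
```

Calling is_reachable(namenode_front_page("namenode-name")) only tells you that something answers HTTP at that address; the cluster statistics themselves still have to be read from the page.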

Shell Commands

Hadoop includes various shell-like commands that interact directly with HDFS and the other file systems that Hadoop supports. The command bin/hdfs dfs -help lists the commands supported by the Hadoop shell. Furthermore, the command bin/hdfs dfs -help command-name displays more detailed help for a command. These commands support most of the normal file system operations, such as copying files and changing file permissions. They also support a few HDFS-specific operations, such as changing the replication of files. For more information, see the File System Shell Guide.
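
The operations mentioned above can be sketched as command lines assembled from a script. This is a hedged example: dfs_command and run are hypothetical helpers, the /user/alice paths are made up, and bin/hdfs is assumed to be the launcher path relative to the Hadoop installation, as written in the text. By default the commands are only printed (a dry run), so the sketch can be tried without a cluster.

```python
import shlex
import subprocess

HDFS_BIN = "bin/hdfs"  # assumed launcher path; adjust to your installation


def dfs_command(subcommand, *args):
    """Build the argument list for `bin/hdfs dfs -<subcommand> <args...>`."""
    return [HDFS_BIN, "dfs", f"-{subcommand}", *map(str, args)]


def run(argv, dry_run=True):
    """Print the command in dry-run mode; otherwise execute it and return stdout."""
    if dry_run:
        print(shlex.join(argv))
        return None
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout


examples = [
    dfs_command("help"),                                          # list all commands
    dfs_command("help", "setrep"),                                # help for one command
    dfs_command("cp", "/user/alice/a.txt", "/user/alice/b.txt"),  # copy a file
    dfs_command("chmod", "640", "/user/alice/b.txt"),             # change permissions
    dfs_command("setrep", "-w", 2, "/user/alice/b.txt"),          # change replication
]
for argv in examples:
    run(argv)
```

Passing dry_run=False would execute the command for real, which requires a working Hadoop installation and a reachable cluster.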

The bin/hdfs dfsadmin command supports a few operations related to HDFS administration. The bin/hdfs dfsadmin -help command lists all the commands currently supported. For example:

-report: reports basic statistics of HDFS. Some of this information is also available on the NameNode front page.

-safemode: though usually not required, an administrator can manually enter or leave Safemode.

-finalizeUpgrade: removes the previous backup of the cluster made during the last upgrade.

-refreshNodes: updates the NameNode with the set of DataNodes allowed to connect to it. By default, the NameNode re-reads DataNode hostnames from the files defined by dfs.hosts and dfs.hosts.exclude. Hosts listed in dfs.hosts are the DataNodes that are part of the cluster; if dfs.hosts has any entries, only those hosts are allowed to register with the NameNode. Entries in dfs.hosts.exclude are DataNodes that need to be decommissioned. Alternatively, if dfs.namenode.hosts.provider.classname is set to org.apache.hadoop.hdfs.server.blockmanagement.CombinedHostFileManager, all include and exclude hosts are specified in the JSON file defined by dfs.hosts. A DataNode completes decommissioning once all of its replicas have been replicated to other DataNodes. Decommissioned nodes are not shut down automatically and are not chosen for writing new replicas.

-printTopology: prints the topology of the cluster, displaying a tree of racks and the DataNodes attached to those racks as viewed by the NameNode.

For command usage, see dfsadmin.
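
The include and exclude rules described for -refreshNodes can be read as a small decision function. The sketch below is a simplified model of that description, not the NameNode's implementation: classify_datanode is a hypothetical name, the host names are invented, and the order of the two checks (include list first, then exclude list) is an assumption the text above does not spell out.

```python
def classify_datanode(host, include, exclude):
    """Classify a DataNode host under the rules described for -refreshNodes.

    include: host names from the file named by dfs.hosts (may be empty)
    exclude: host names from the file named by dfs.hosts.exclude
    """
    include, exclude = set(include), set(exclude)
    # A non-empty include list acts as an allow list for registration.
    if include and host not in include:
        return "rejected"
    # Hosts on the exclude list are the ones to be decommissioned.
    if host in exclude:
        return "decommission"
    return "allowed"


include_hosts = ["dn1.example.com", "dn2.example.com"]  # hypothetical hosts
exclude_hosts = ["dn2.example.com"]
for name in ["dn1.example.com", "dn2.example.com", "dn3.example.com"]:
    print(name, "->", classify_datanode(name, include_hosts, exclude_hosts))
```

With an empty include list, every host that is not excluded is classified as allowed, which matches the statement that the restriction applies only when dfs.hosts has entries.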
