Removing and adding DataNodes in a Hadoop cluster
2013-08-06 10:12
You may want to remove DataNodes from your HDFS cluster, or add new ones, at some point. Removing or adding nodes in Hadoop is fairly straightforward: it takes only a few simple operations and does not affect jobs that are already running. To do it safely and efficiently, however, you need to pay attention to block replication and a few other points.
First, you should know how the properties map to daemons and configuration files:
dfs.xxx ==> datanode ==> hdfs-site.xml
mapred.xxx ==> tasktracker ==> mapred-site.xml
This means you can operate on the datanode and the tasktracker separately, because the two are managed independently and use different properties.
1. Removing DataNodes
Edit hdfs-site.xml on the namenode and add properties like these:
<property>
<name>dfs.hosts</name>
<value>/usr/hadoop/conf/datanode-allow-list</value>
</property>
<property>
<name>dfs.hosts.exclude</name>
<value>/usr/hadoop/conf/datanode-deny-list</value>
</property>
dfs.hosts names a file listing the datanodes that are allowed to contact the namenode. If it is unset or the file is empty, every datanode may connect; otherwise only the listed datanodes may join. dfs.hosts.exclude names a file listing the datanodes that must not remain connected to the namenode. A datanode that appears in both files is treated as excluded.
(1) #touch datanode-deny-list
Then put the IP address or hostname of each datanode to be removed in datanode-deny-list, one per line.
(2) #hadoop dfsadmin -refreshNodes
This reloads the configuration dynamically; the namenode does not need to be restarted.
(3) #hadoop dfsadmin -report
The report (or the web UI) should now show the datanode as "Decommissioning".
(4) Wait for a while until the datanode is shown as "dead".
(5) Remove the datanode's IP from /usr/hadoop/conf/datanode-allow-list.
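Steps (1) to (3) above can be wrapped in a small script. This is only a sketch: the function names exclude_host and decommission are hypothetical, the DENY_LIST path is taken from the dfs.hosts.exclude value shown earlier, and the host name in the usage note is made up.

```shell
#!/bin/sh
# Path of the exclude file; assumed to match dfs.hosts.exclude in hdfs-site.xml.
DENY_LIST=${DENY_LIST:-/usr/hadoop/conf/datanode-deny-list}

# Hypothetical helper: add a host to the deny list unless it is already there.
exclude_host() {
    touch "$DENY_LIST"
    grep -qxF "$1" "$DENY_LIST" || echo "$1" >> "$DENY_LIST"
}

# Hypothetical helper: start decommissioning a host and print the cluster report.
decommission() {
    exclude_host "$1"
    hadoop dfsadmin -refreshNodes   # make the namenode re-read the lists
    hadoop dfsadmin -report         # the host should appear as "Decommissioning"
}
```

On the namenode you would call something like `decommission dn3.example.com`. Because the helper skips hosts that are already listed, running it twice for the same host does not add a duplicate entry.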
2. Adding DataNodes
If dfs.hosts is not empty, add the new datanode's IP to that file, then run:
#hadoop dfsadmin -refreshNodes
This reloads the configuration dynamically; the namenode does not need to be restarted.
#bin/hadoop-daemon.sh start datanode
Run this on the machine you want to add; it starts the DataNode daemon there.
If dfs.hosts is empty or not set, you only need to start the datanode.
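The distinction between an empty and a non-empty dfs.hosts file matters when scripting this: appending one host to an empty allow list would turn "everyone may connect" into "only this host may connect". A minimal sketch that guards against that, assuming the ALLOW_LIST path from the dfs.hosts value above and a hypothetical function name allow_host:

```shell
#!/bin/sh
# Path of the include file; assumed to match dfs.hosts in hdfs-site.xml.
ALLOW_LIST=${ALLOW_LIST:-/usr/hadoop/conf/datanode-allow-list}

# Hypothetical helper: register a host only when the allow list is already in use.
# An empty or missing list is left untouched, because it already admits every host.
allow_host() {
    if [ -s "$ALLOW_LIST" ]; then
        grep -qxF "$1" "$ALLOW_LIST" || echo "$1" >> "$ALLOW_LIST"
    fi
}
```

After calling `allow_host` on the namenode, you would still run `hadoop dfsadmin -refreshNodes` there and `bin/hadoop-daemon.sh start datanode` on the new machine, as described above.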
Removing or adding a tasktracker works the same way as for a datanode; you only need to use the corresponding properties in mapred-site.xml:
<property>
<name>mapred.hosts</name>
<value>/usr/local/hadoop/conf/tasktracker-allow-list</value>
</property>
<property>
<name>mapred.hosts.exclude</name>
<value>/usr/hadoop/conf/tasktracker-deny-list</value>
</property>
#bin/hadoop-daemon.sh start tasktracker
3. Balancing the cluster
#bin/start-balancer.sh
A cluster is considered balanced when the utilization rates of all the DataNodes are within the range of the average utilization rate plus or minus a threshold.
This threshold is 10 percent by default. You can specify a different threshold when you start the balancer script. For example, to set the threshold to 5 percent for a more evenly distributed cluster, start the balancer with
#bin/start-balancer.sh -threshold 5
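To make the threshold rule concrete, here is a small sketch that checks a list of per-node utilization percentages against "average plus or minus threshold". The function is_balanced is hypothetical and is not part of Hadoop, and it assumes the average is the simple mean of the values passed in, which may differ from how the balancer itself computes cluster utilization.

```shell
#!/bin/sh
# Hypothetical helper: is_balanced THRESHOLD UTIL1 UTIL2 ...
# Succeeds (exit status 0) when every utilization value lies within
# [mean - THRESHOLD, mean + THRESHOLD]; fails otherwise.
is_balanced() {
    threshold=$1
    shift
    printf '%s\n' "$@" | awk -v t="$threshold" '
        { u[NR] = $1; sum += $1 }
        END {
            avg = sum / NR
            for (i = 1; i <= NR; i++)
                if (u[i] < avg - t || u[i] > avg + t) exit 1
            exit 0
        }'
}
```

With nodes at 40, 50 and 60 percent, `is_balanced 10 40 50 60` succeeds because the mean is 50 and every node is within 10 points of it, while `is_balanced 5 40 50 60` fails, which is the situation where a stricter threshold makes the balancer move more blocks.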
As balancing can be network intensive, we recommend doing it overnight or over a weekend when your cluster may be less busy. Alternatively, you can set the dfs.balance.bandwidthPerSec configuration parameter to limit the bandwidth devoted to balancing.
dfs.balance.bandwidthPerSec defaults to 1 MB/s, which is quite low. When no MapReduce jobs are running, you can raise the value so that balancing finishes faster. For example, modify hdfs-site.xml on your namenode:
<property>
<name>dfs.balance.bandwidthPerSec</name>
<value>xxxxxxxx</value>
<description>
Specifies the maximum bandwidth that each datanode can utilize for the balancing purpose in terms of the number of bytes per second.
</description>
</property>
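Since the value is expressed in bytes per second, it helps to compute it from a more readable figure. The sketch below uses a hypothetical helper, mb_to_bytes_per_sec, and the 10 MB/s in the example is purely illustrative, not a recommendation from the original text.

```shell
#!/bin/sh
# Hypothetical helper: convert megabytes per second into the
# bytes-per-second number that dfs.balance.bandwidthPerSec expects.
mb_to_bytes_per_sec() {
    echo $(( $1 * 1024 * 1024 ))
}

mb_to_bytes_per_sec 1    # prints 1048576, the 1 MB/s default
mb_to_bytes_per_sec 10   # prints 10485760
```

The printed number is what would go in the value element of the property.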