
Installing and Configuring hadoop-2.2.0 on a 64-bit Ubuntu 14.04 Cluster

2014-06-14

1 Prerequisites

Make sure all required software is installed on every node in the cluster: the Sun JDK, ssh, and Hadoop.

Java™ 1.7.x is required; the Sun (Oracle) release of Java is recommended.

ssh must be installed and sshd must be kept running so that the Hadoop scripts can manage the Hadoop daemons on remote nodes.


2 Setting Up the Test Environment




2.1 Preparation

Operating system: Ubuntu

Deployment: VMware

After installing one Ubuntu virtual machine in VMware, you can export or clone it to create the other two virtual machines.

Notes:

Make sure the virtual machines' IP addresses are in the same subnet as the host's IP, so the VMs and the host can reach each other.

To keep the VM IPs in the same subnet as the host, set the VM network connection to bridged mode.

Prepare the machines: one master and several slaves. Configure /etc/hosts on every machine so that the machines can reach each other by hostname, for example:

192.168.0.107 cloud001   (master)

192.168.0.108 cloud002   (slave1)

192.168.0.109 cloud003   (slave2)
To keep the environments consistent, install the JDK and ssh on every machine first.

Edit /etc/hosts:

$ sudo vi /etc/hosts
127.0.0.1 localhost

192.168.0.107 cloud001

#127.0.1.1 cloud001

192.168.0.108 cloud002

192.168.0.109 cloud003
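
As a quick sanity check (a sketch using the hostnames from the hosts file above), make sure every name resolves and the machines answer:

for h in cloud001 cloud002 cloud003; do
    ping -c 1 $h
done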


2.2 Install the JDK

Omitted...
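
As a minimal sketch only (assuming the Oracle JDK 7 tarball has been unpacked to /usr/java/jdk1.7.0_55, the path used in the Hadoop configuration below), set the environment variables and verify the installation:

# Assumed JDK location; adjust to wherever your JDK is actually unpacked
export JAVA_HOME=/usr/java/jdk1.7.0_55
export PATH=$JAVA_HOME/bin:$PATH
java -version    # should report 1.7.0_55

Add the two export lines to ~/.bashrc (or /etc/profile) on every node so they survive reboots.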


2.3 Create a User

$ sudo useradd -m -s /bin/bash hadoop

$ sudo passwd hadoop

$ cd /home/hadoop

Create the same directory on all machines, ideally by creating the same user on each, and preferably use that user's home directory as the Hadoop installation path.
For example, the installation path on every machine is /home/hadoop/hadoop-2.2.0; there is no need to mkdir it, because it is created automatically when the Hadoop package is unpacked under /home/hadoop/.
(It is best not to install as root, since root access between machines is not recommended.)


2.4 Install and Configure ssh

1) Install: sudo apt-get install ssh

Once this is installed, the ssh command is available.

Run $ netstat -nat to check whether port 22 is open.

Test: ssh localhost.

Enter the current user's password and press Enter. If this works, the installation succeeded; at this point an ssh login still requires a password.

(With this default installation, the configuration files live under /etc/ssh/; the sshd configuration file is /etc/ssh/sshd_config.)

Note: ssh must be installed on all machines.

2) Configuration

Once Hadoop is running, the NameNode starts and stops the daemons on the DataNodes over SSH (Secure Shell), which requires running commands between nodes without typing a password. We therefore configure SSH to use passwordless public-key authentication.

Taking the three machines in this article as an example, cloud001 is the master node and needs to connect to cloud002 and cloud003. Make sure ssh is installed on every machine and that the sshd service is running on the DataNode machines.

(Note: [hadoop@hadoop ~]$ ssh-keygen -t rsa

This command generates a key pair for the user hadoop. Press Enter to accept the default path when asked where to save the key, and press Enter again when prompted for a passphrase, i.e. leave the passphrase empty. The resulting key pair, id_rsa and id_rsa.pub, is stored in /home/hadoop/.ssh by default. Then append the contents of id_rsa.pub to the /home/hadoop/.ssh/authorized_keys file on every machine, including this one; if authorized_keys already exists, append id_rsa.pub to the end of it, and if it does not, just copy id_rsa.pub over as authorized_keys.)
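
A more concise alternative, sketched here on the assumption that ssh-copy-id is available (it ships with Ubuntu's openssh-client package), achieves the same result:

# run as the hadoop user on cloud001
ssh-keygen -t rsa                 # accept the default path, empty passphrase
for h in cloud001 cloud002 cloud003; do
    ssh-copy-id hadoop@$h         # appends id_rsa.pub to the remote authorized_keys
done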

3) First set up passwordless, automatic ssh login for the NameNode.

Switch to the hadoop user (make sure the hadoop user can log in without a password, because the Hadoop installation below is owned by the hadoop user).

$ su hadoop

cd /home/hadoop

$ ssh-keygen -t rsa

Then keep pressing Enter.

When it finishes, a hidden .ssh folder appears in the home directory.

$ cd .ssh

Then list the files with ls.

cp id_rsa.pub authorized_keys

Test:

$ ssh localhost

or:

$ ssh cloud001

The first ssh connection prints a prompt:

The authenticity of host 'cloud001 (192.168.0.107)' can't be established.

RSA key fingerprint is 03:e0:30:cb:6e:13:a8:70:c9:7e:cf:ff:33:2a:67:30.

Are you sure you want to continue connecting (yes/no)?

Type yes to continue. This adds the server to your list of known hosts.

The connection succeeds without asking for a password.

4) Copy authorized_keys to cloud002 and cloud003

So that cloud001 can log in to cloud002 and cloud003 automatically without a password, first run the following on cloud002 and cloud003:

$ su hadoop

cd /home/hadoop
$ ssh-keygen -t rsa

Press Enter at every prompt.

Then go back to cloud001 and copy authorized_keys to cloud002 and cloud003:

[hadoop@cloud001 .ssh]$ scp authorized_keys cloud002:/home/hadoop/.ssh/

[hadoop@cloud001 .ssh]$ scp authorized_keys cloud003:/home/hadoop/.ssh/

You will be prompted for a password here; the hadoop account's password is all that is needed.

Change the permissions on your authorized_keys file:

[hadoop@cloud001 .ssh]$ chmod 644 authorized_keys

Test: ssh cloud002 or ssh cloud003 (you need to type yes the first time).

If no password is required, the configuration succeeded; if a password is still required, recheck the configuration above.
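
A quick sketch to verify passwordless login from the master to every node (hostnames as used in this article):

for h in cloud001 cloud002 cloud003; do
    ssh hadoop@$h hostname    # should print each hostname without prompting for a password
done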


2.5 Install Hadoop

Since the configuration is essentially the same on every machine in the cluster, we first configure and deploy on the NameNode and then copy the result to the other nodes, so the steps below are effectively performed on every machine.

1. Unpack the files
Unpack the compiled hadoop-2.2.0.tar.gz under /home/hadoop (i.e., place the build produced on a 64-bit machine there). To save space, you can then delete the tarball or move it elsewhere as a backup.

Note: the installation path must be the same on every machine!
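
A minimal sketch of the unpack step, assuming the tarball is already in /home/hadoop:

cd /home/hadoop
tar -xzf hadoop-2.2.0.tar.gz    # creates /home/hadoop/hadoop-2.2.0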

2. Hadoop configuration

Before configuring, create the following folders on cloud001's local file system:

~/dfs/name

~/dfs/data

~/tmp
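
A one-line sketch of this step, run as the hadoop user:

mkdir -p ~/dfs/name ~/dfs/data ~/tmp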

Seven configuration files are involved:

~/hadoop-2.2.0/etc/hadoop/hadoop-env.sh

~/hadoop-2.2.0/etc/hadoop/yarn-env.sh

~/hadoop-2.2.0/etc/hadoop/slaves

~/hadoop-2.2.0/etc/hadoop/core-site.xml

~/hadoop-2.2.0/etc/hadoop/hdfs-site.xml

~/hadoop-2.2.0/etc/hadoop/mapred-site.xml

~/hadoop-2.2.0/etc/hadoop/yarn-site.xml

Some of these files do not exist by default and can be created by copying the corresponding template file (for example, mapred-site.xml from mapred-site.xml.template).

Config file 1: hadoop-env.sh

Change the JAVA_HOME value (export JAVA_HOME=/usr/java/jdk1.7.0_55)

Config file 2: yarn-env.sh

Change the JAVA_HOME value (export JAVA_HOME=/usr/java/jdk1.7.0_55)

Config file 3: slaves (this file lists all the slave nodes)

Write the following content:

cloud002

cloud003

Config file 4: core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://cloud001:9000</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.native.lib</name>
        <value>true</value>
        <description>Should native hadoop libraries, if present, be used.</description>
    </property>
</configuration>

Config file 5: hdfs-site.xml

<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>cloud001:9001</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.datanode.max.xcievers</name>
        <value>4096</value>
    </property>
</configuration>

Config file 6: mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>cloud001:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>cloud001:19888</value>
    </property>
</configuration>

Config file 7: yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>cloud001:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>cloud001:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>cloud001:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>cloud001:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>cloud001:8088</value>
    </property>
</configuration>
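
An optional sanity check: these files are plain XML, so (assuming the libxml2-utils package, which provides xmllint, is installed) you can catch syntax errors before starting the cluster:

xmllint --noout ~/hadoop-2.2.0/etc/hadoop/*-site.xml && echo "XML OK"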

3. Copy to the other nodes

You can write a shell script for this (convenient when there are many nodes):

cp2slave.sh

#!/bin/bash
scp -r /home/hadoop/hadoop-2.2.0 hadoop@cloud002:/home/hadoop/
scp -r /home/hadoop/hadoop-2.2.0 hadoop@cloud003:/home/hadoop/

Alternatively, you can copy just the relevant configuration files to replace the remote ones:

cp2slave2.sh

#!/bin/bash

scp /home/hadoop/hadoop-2.2.0/etc/hadoop/slaves hadoop@cloud002:~/hadoop-2.2.0/etc/hadoop/slaves

scp /home/hadoop/hadoop-2.2.0/etc/hadoop/slaves hadoop@cloud003:~/hadoop-2.2.0/etc/hadoop/slaves

scp /home/hadoop/hadoop-2.2.0/etc/hadoop/core-site.xml hadoop@cloud002:~/hadoop-2.2.0/etc/hadoop/core-site.xml

scp /home/hadoop/hadoop-2.2.0/etc/hadoop/core-site.xml hadoop@cloud003:~/hadoop-2.2.0/etc/hadoop/core-site.xml

scp /home/hadoop/hadoop-2.2.0/etc/hadoop/hdfs-site.xml hadoop@cloud002:~/hadoop-2.2.0/etc/hadoop/hdfs-site.xml

scp /home/hadoop/hadoop-2.2.0/etc/hadoop/hdfs-site.xml hadoop@cloud003:~/hadoop-2.2.0/etc/hadoop/hdfs-site.xml

scp /home/hadoop/hadoop-2.2.0/etc/hadoop/mapred-site.xml hadoop@cloud002:~/hadoop-2.2.0/etc/hadoop/mapred-site.xml

scp /home/hadoop/hadoop-2.2.0/etc/hadoop/mapred-site.xml hadoop@cloud003:~/hadoop-2.2.0/etc/hadoop/mapred-site.xml

scp /home/hadoop/hadoop-2.2.0/etc/hadoop/yarn-site.xml hadoop@cloud002:~/hadoop-2.2.0/etc/hadoop/yarn-site.xml

scp /home/hadoop/hadoop-2.2.0/etc/hadoop/yarn-site.xml hadoop@cloud003:~/hadoop-2.2.0/etc/hadoop/yarn-site.xml
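
The same copy can be written more compactly as a loop, sketched here with the hostnames and paths used above:

#!/bin/bash
# push the edited configuration files to every slave listed below
for host in cloud002 cloud003; do
    for f in slaves core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml; do
        scp /home/hadoop/hadoop-2.2.0/etc/hadoop/$f hadoop@$host:~/hadoop-2.2.0/etc/hadoop/$f
    done
done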

4. Start and verify

4.1 Start Hadoop

Enter the installation directory: cd ~/hadoop-2.2.0/

Format the NameNode: ./bin/hdfs namenode -format

Start HDFS: ./sbin/start-dfs.sh

At this point the processes running on cloud001 are: NameNode, SecondaryNameNode

The process running on cloud002 and cloud003 is: DataNode

Start YARN: ./sbin/start-yarn.sh

Now the processes running on cloud001 are: NameNode, SecondaryNameNode, ResourceManager

The processes running on cloud002 and cloud003 are: DataNode, NodeManager
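
A quick way to confirm this is to run jps on each node (a sketch, assuming jps from the JDK is on the hadoop user's PATH):

for h in cloud001 cloud002 cloud003; do
    echo "== $h =="
    ssh hadoop@$h jps    # should list the daemons named above for that node
done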

Check the cluster status: ./bin/hdfs dfsadmin -report

Check file and block layout: ./bin/hdfs fsck / -files -blocks

View HDFS: http://192.168.0.107:50070

View the ResourceManager: http://192.168.0.107:8088
4.2 Run an example program

After the Hadoop 2.2.0 cluster environment is set up, run an example program to verify that MapReduce on hadoop 2 works.

First create two files, file01.txt and file02.txt, in a local folder with the following contents (a sketch of creating them follows the listing):
file01.txt
hello
hehe

hey

haha

miaomiao

file02.txt
hello world
hehe

haha

miaomiao

heihei

h

hh

hey

houhou

hadoop

hbase

hawk

pengfei
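
A minimal sketch of creating these two files, placed here in the Hadoop installation directory so that the file*.txt path in the commands below works as written:

cd ~/hadoop-2.2.0
cat > file01.txt << 'EOF'
hello
hehe
hey
haha
miaomiao
EOF
cat > file02.txt << 'EOF'
hello world
hehe
haha
miaomiao
heihei
h
hh
hey
houhou
hadoop
hbase
hawk
pengfei
EOF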

Learning the Hadoop shell commands:

./bin/hadoop fs -ls /                      # list the HDFS root directory

./bin/hadoop fs -mkdir -p /input           # the -p flag must be added; hadoop 2 differs from earlier versions here

./bin/hadoop fs -put file*.txt /input      # put the two files just created into the Hadoop file system

./bin/hadoop fs -cat /input/file01.txt     # view the file contents

./bin/hadoop fs -rm -r /input/file02.txt   # delete a file


First create a folder on HDFS and import the files:

$ ./bin/hdfs dfs -mkdir /input
$ ./bin/hadoop fs -put file*.txt /input


$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /input /output

Excerpt of the run:

$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /input /output

14/06/14 17:08:41 INFO client.RMProxy: Connecting to ResourceManager at cloud001/192.168.0.107:8032

14/06/14 17:08:43 INFO input.FileInputFormat: Total input paths to process : 2

14/06/14 17:08:43 INFO mapreduce.JobSubmitter: number of splits:2

14/06/14 17:08:43 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name

14/06/14 17:08:43 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar

14/06/14 17:08:43 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class

14/06/14 17:08:43 INFO Configuration.deprecation: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class

14/06/14 17:08:43 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class

14/06/14 17:08:43 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name

14/06/14 17:08:43 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class

14/06/14 17:08:43 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir

14/06/14 17:08:43 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir

14/06/14 17:08:43 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps

14/06/14 17:08:43 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class

14/06/14 17:08:43 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir

14/06/14 17:08:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1402718202415_0002

14/06/14 17:08:44 INFO impl.YarnClientImpl: Submitted application application_1402718202415_0002 to ResourceManager at cloud001/192.168.0.107:8032

14/06/14 17:08:44 INFO mapreduce.Job: The url to track the job: http://cloud001:8088/proxy/application_1402718202415_0002/
14/06/14 17:08:44 INFO mapreduce.Job: Running job: job_1402718202415_0002

14/06/14 17:08:54 INFO mapreduce.Job: Job job_1402718202415_0002 running in uber mode : false

14/06/14 17:08:54 INFO mapreduce.Job: map 0% reduce 0%

14/06/14 17:09:12 INFO mapreduce.Job: map 50% reduce 0%

14/06/14 17:09:13 INFO mapreduce.Job: map 100% reduce 0%

14/06/14 17:10:00 INFO mapreduce.Job: map 100% reduce 100%

14/06/14 17:10:01 INFO mapreduce.Job: Job job_1402718202415_0002 completed successfully

14/06/14 17:10:02 INFO mapreduce.Job: Counters: 43

File System Counters

FILE: Number of bytes read=229

FILE: Number of bytes written=238142

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=313

HDFS: Number of bytes written=108

HDFS: Number of read operations=9

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=2

Launched reduce tasks=1

Data-local map tasks=2

Total time spent by all maps in occupied slots (ms)=30879

Total time spent by all reduces in occupied slots (ms)=43860

Map-Reduce Framework

Map input records=18

Map output records=19

Map output bytes=185

Map output materialized bytes=235

Input split bytes=204

Combine input records=19

Combine output records=19

Reduce input groups=14

Reduce shuffle bytes=235

Reduce input records=19

Reduce output records=14

Spilled Records=38

Shuffled Maps =2

Failed Shuffles=0

Merged Map outputs=2

GC time elapsed (ms)=666

CPU time spent (ms)=5390

Physical memory (bytes) snapshot=483500032

Virtual memory (bytes) snapshot=1987952640

Total committed heap usage (bytes)=257171456

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=109

File Output Format Counters

Bytes Written=108

View the results:
$ ./bin/hadoop fs -ls /output

Found 2 items

-rw-r--r-- 3 hadoop supergroup 0 2014-06-14 17:09 /output/_SUCCESS

-rw-r--r-- 3 hadoop supergroup 108 2014-06-14 17:09 /output/part-r-00000

$ ./bin/hadoop fs -cat /output/part-r-00000

h 1

hadoop 1

haha 2

hawk 1

hbase 1

hehe 2

heihei 1

hello 2

hey 2

hh 1

houhou 1

miaomiao 2

pengfei 1

world 1