Setting up a Hadoop Cluster with Docker
2016-02-23 11:46
Reference: http://blog.mymusise.com/?p=150

1. Preparation

1.1 Download the software

Download the following files:

jdk-8u60-linux-x64.tar.gz
hadoop-2.7.0.tar.gz

The Hadoop binaries published on the official site are built for 32-bit systems; to run on a 64-bit system you need to recompile. A pre-built 64-bit package of Hadoop 2.7.0 is available here: http://pan.baidu.com/s/1c0HD0Nu
1.2 Prepare the mount volume

Create a folder on the host to hold the files you just downloaded; here I created ~/dockerspace/hadoop-docker/. Copy jdk-8u60-linux-x64.tar.gz and hadoop-2.7.0.tar.gz into that directory. Also create a file named sources.list, which will be used to replace the container's APT sources; here I use the Aliyun mirror. sources.list:
deb http://mirrors.aliyun.com/ubuntu/ vivid main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ vivid-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ vivid-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ vivid-proposed main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ vivid-backports main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ vivid main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ vivid-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ vivid-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ vivid-proposed main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ vivid-backports main restricted universe multiverse
The NetEase (163) mirror, as an alternative:

deb http://mirrors.163.com/ubuntu/ vivid main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ vivid-security main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ vivid-updates main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ vivid-proposed main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ vivid-backports main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ vivid main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ vivid-security main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ vivid-updates main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ vivid-proposed main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ vivid-backports main restricted universe multiverse
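The preparation steps above can be scripted. Below is a minimal sketch, under the assumption that you keep this article's workspace path and file names; the script name and the parameterized work directory are my own additions:

```shell
#!/bin/sh
# Prepare the host-side workspace that will be bind-mounted into the container.
# Usage: prepare-workspace.sh [workdir]  (defaults to ~/dockerspace/hadoop-docker)
set -e

WORKDIR="${1:-$HOME/dockerspace/hadoop-docker}"
mkdir -p "$WORKDIR"

# Write the Aliyun sources.list that will later replace the container's
# /etc/apt/sources.list (deb lines only; add deb-src lines if you need them)
for pocket in vivid vivid-security vivid-updates vivid-proposed vivid-backports; do
    echo "deb http://mirrors.aliyun.com/ubuntu/ $pocket main restricted universe multiverse"
done > "$WORKDIR/sources.list"

# Warn about any expected archive that has not been copied in yet
for f in jdk-8u60-linux-x64.tar.gz hadoop-2.7.0.tar.gz; do
    [ -f "$WORKDIR/$f" ] || echo "missing: $WORKDIR/$f (copy it in before starting the container)"
done
```

Run it once, then drop the two downloaded archives into the directory it creates.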
1.3 Pull the base image

Pull an image to serve as the base; the Hadoop image will be built on top of it. Before pulling, it is a good idea to switch your Ubuntu system's sources to a domestic (Chinese) mirror.
docker pull ubuntu:14.04
2. Install the JDK

Enter the container:

xx@xx-desktop:~/dockerspace/hadoop-docker/config$ docker run -it -v ~/dockerspace/hadoop-docker/:/root/software ubuntu:14.04  # enter the ubuntu container
root@6ccfe3ce6d3b:/# cd /root/
root@6ccfe3ce6d3b:~# ls
software
The command docker run -it -v ~/dockerspace/hadoop-docker/:/root/software ubuntu:14.04 mounts the host folder ~/dockerspace/hadoop-docker/ at /root/software inside the container.
Switch the APT sources:

root@6ccfe3ce6d3b:~# cp /etc/apt/sources.list /etc/apt/sources.list.bak  # back up the original sources
root@6ccfe3ce6d3b:~# cp /root/software/sources.list /etc/apt/            # switch to the Aliyun mirror
root@6ccfe3ce6d3b:~# apt-get update                                      # refresh the package lists
Install vim:

apt-get install vim
Install the Java environment

Create the folder /root/jdk, extract jdk-8u60-linux-x64.tar.gz, move it under /root/jdk, and rename it:
root@fc3caf2b3183:~# ls
software
root@fc3caf2b3183:~# cd software/
root@fc3caf2b3183:~/software# ls
authorized_keys  hadoop-2.7.0.tar.gz  hosts  jdk-8u60-linux-x64.tar.gz  sources.list  zookeeper-3.4.6.tar.gz
root@fc3caf2b3183:~/software# tar -zxf jdk-8u60-linux-x64.tar.gz  # extract the JDK archive
root@fc3caf2b3183:~/software# ls
authorized_keys  hadoop-2.7.0.tar.gz  hosts  jdk1.8.0_60  jdk-8u60-linux-x64.tar.gz  sources.list  zookeeper-3.4.6.tar.gz
root@fc3caf2b3183:~/software# cd ..
root@fc3caf2b3183:~# ls
software
root@fc3caf2b3183:~# mkdir jdk
root@fc3caf2b3183:~# mv software/jdk1.8.0_60/ jdk/
root@fc3caf2b3183:~# cd jdk/
root@fc3caf2b3183:~/jdk# ls
jdk1.8.0_60
root@fc3caf2b3183:~/jdk# mv jdk1.8.0_60/ jdk-1.8  # rename
root@fc3caf2b3183:~/jdk# ls
jdk-1.8
Configure the Java environment variables: open /etc/profile with vim and append the following at the end:
export JAVA_HOME=/root/jdk/jdk-1.8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=$PATH:${JAVA_HOME}/bin
export HADOOP_HOME=/root/hadoop/hadoop-2.7.0
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
Make it take effect and verify that the JDK is installed:

root@fc3caf2b3183:~/jdk/jdk-1.8# source /etc/profile  # apply the environment variables
root@fc3caf2b3183:~/jdk/jdk-1.8# java -version        # check the JDK installation
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
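The profile additions above can also be appended non-interactively with a heredoc. A sketch follows; the target path is parameterized so it can be tried against a scratch file first (inside the container you would pass /etc/profile):

```shell
#!/bin/sh
# Append the JDK/Hadoop environment variables to a profile file.
# Usage: append-env.sh [profile-path]  (the article edits /etc/profile)
set -e

PROFILE="${1:-./profile.test}"

cat >> "$PROFILE" <<'EOF'
export JAVA_HOME=/root/jdk/jdk-1.8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=$PATH:${JAVA_HOME}/bin
export HADOOP_HOME=/root/hadoop/hadoop-2.7.0
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
EOF
```

The quoted heredoc delimiter ('EOF') keeps `${JAVA_HOME}` and friends literal, so they are written to the profile unexpanded and only resolved when the profile is sourced.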
3. Configure Hadoop

Next we create the $HADOOP_HOME directory referenced above:

root@fc3caf2b3183:~# ls
jdk  software
root@fc3caf2b3183:~# mkdir hadoop  # create the hadoop folder
root@fc3caf2b3183:~# cd software/
root@fc3caf2b3183:~/software# ls
authorized_keys  hadoop-2.7.0.tar.gz  hosts  jdk-8u60-linux-x64.tar.gz  sources.list  zookeeper-3.4.6.tar.gz
root@fc3caf2b3183:~/software# tar -zxf hadoop-2.7.0.tar.gz  # extract the Hadoop archive
root@fc3caf2b3183:~/software# ls
authorized_keys  hadoop-2.7.0  hadoop-2.7.0.tar.gz  hosts  jdk-8u60-linux-x64.tar.gz  sources.list  zookeeper-3.4.6.tar.gz
root@fc3caf2b3183:~/software# mv hadoop-2.7.0 ../hadoop/  # move Hadoop to the $HADOOP_HOME location
root@fc3caf2b3183:~/software# cd ../hadoop/
root@fc3caf2b3183:~/hadoop# ls
hadoop-2.7.0
root@fc3caf2b3183:~/hadoop# cd hadoop-2.7.0/
root@fc3caf2b3183:~/hadoop/hadoop-2.7.0# ls
LICENSE.txt  NOTICE.txt  README.txt  bin  etc  include  lib  libexec  sbin  share
Create the following folders under /root/hadoop/hadoop-2.7.0:

root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0# mkdir namenode
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0# mkdir datanode
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0# mkdir tmp
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0# cd $HADOOP_CONFIG_HOME
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0/etc/hadoop#
Edit hadoop-env.sh: find the JAVA_HOME setting and change it to:
export JAVA_HOME=/root/jdk/jdk-1.8
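If you prefer not to open an editor, the same change can be made with a sed one-liner. A sketch, demonstrated against a scratch copy of the file (in the container the real target would be $HADOOP_CONFIG_HOME/hadoop-env.sh):

```shell
#!/bin/sh
# Point hadoop-env.sh at the JDK without opening an editor.
# Demonstrated on a scratch file; substitute the real
# $HADOOP_CONFIG_HOME/hadoop-env.sh path inside the container.
set -e

env_file="./hadoop-env.sh.test"
echo 'export JAVA_HOME=${JAVA_HOME}' > "$env_file"   # stand-in for the stock line

# Rewrite whatever JAVA_HOME export is present to the concrete JDK path
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/root/jdk/jdk-1.8|' "$env_file"
```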
Configure core-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/root/hadoop/hadoop-2.7.0/tmp/</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
    <final>true</final>
    <description>The name of the default file system. A URI whose scheme
    and authority determine the FileSystem implementation. The uri's scheme
    determines the config property (fs.SCHEME.impl) naming the FileSystem
    implementation class. The uri's authority is used to determine the
    host, port, etc. for a filesystem.</description>
  </property>
</configuration>
Configure hdfs-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <final>true</final>
    <description>Default block replication. The actual number of
    replications can be specified when the file is created. The default
    is used if replication is not specified in create time.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/root/hadoop/hadoop-2.7.0/namenode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/root/hadoop/hadoop-2.7.0/datanode</value>
    <final>true</final>
  </property>
</configuration>
Configure mapred-site.xml. This file does not exist yet, so create it with cp mapred-site.xml.template mapred-site.xml, then edit it:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.</description>
  </property>
</configuration>
Then format the file system with hadoop namenode -format:

root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0/etc/hadoop# hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
16/02/22 08:45:13 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = 6ccfe3ce6d3b/172.17.0.6
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.7.0
....
....
16/02/22 08:45:14 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
16/02/22 08:45:14 INFO util.GSet: capacity      = 2^15 = 32768 entries
16/02/22 08:45:14 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1624198475-172.17.0.6-1456130714548
16/02/22 08:45:14 INFO common.Storage: Storage directory /root/hadoop/hadoop-2.7.0/namenode has been successfully formatted.
16/02/22 08:45:14 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
16/02/22 08:45:14 INFO util.ExitUtil: Exiting with status 0
16/02/22 08:45:14 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at 6ccfe3ce6d3b/172.17.0.6
************************************************************/
4. Configure SSH

Install and configure ssh:

apt-get install ssh  # install ssh

root@fc3caf2b3183:~/hadoop/hadoop-2.7.0/etc/hadoop# ssh-keygen -t rsa  # generate an RSA key pair
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
cd:68:e6:cc:42:13:68:55:50:33:5c:1e:da:c9:41:f0 root@fc3caf2b3183
The key's randomart image is:
+---[RSA 2048]----+
| o=*+=           |
| o .O +          |
| o . . E         |
| . . +           |
|  o S o          |
|   . B           |
|   . +           |
|    .            |
|                 |
+-----------------+
root@fc3caf2b3183:~/hadoop/hadoop-2.7.0/etc/hadoop# ssh-keygen -t dsa  # generate a DSA key pair
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
4e:49:2c:31:90:83:34:8a:6e:eb:32:c9:36:f4:b6:02 root@fc3caf2b3183
The key's randomart image is:
+---[DSA 1024]----+
| .o..oo          |
|....o +          |
|o .. o           |
|. o .            |
| o S             |
|E.. o            |
|+o. .            |
|== o             |
|oo+..            |
+-----------------+
root@fc3caf2b3183:~/hadoop/hadoop-2.7.0/etc/hadoop# cd ~/.ssh/
root@fc3caf2b3183:~/.ssh# ls
id_dsa  id_dsa.pub  id_rsa  id_rsa.pub
root@fc3caf2b3183:~/.ssh# cat id_rsa.pub >> authorized_keys  # enable passwordless login
root@fc3caf2b3183:~/.ssh# cat id_dsa.pub >> authorized_keys  # enable passwordless login
root@fc3caf2b3183:~/.ssh# /etc/init.d/ssh start  # start the ssh service
 * Starting OpenBSD Secure Shell server sshd            [ OK ]
root@fc3caf2b3183:~/.ssh#
Test it:

root@fc3caf2b3183:~/.ssh# ssh localhost
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-49-generic x86_64)
 * Documentation:  https://help.ubuntu.com/
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
root@fc3caf2b3183:~#

The login succeeds without a password.
5. Commit the container as an image

Type exit to leave the container, then commit it as a Docker image:

xx@xx-desktop:~$ docker ps -a  # find the container id
CONTAINER ID  IMAGE         COMMAND      CREATED         STATUS                     PORTS  NAMES
fc3caf2b3183  ubuntu:14.04  "/bin/bash"  41 minutes ago  Exited (0) 3 seconds ago          kickass_davinci
xx@xx-desktop:~$ docker commit -m "hadoop install" fc3c ubuntu:hadoop  # commit the container as an image
bfc32f70813f1a6f3ec68dd4b5514ec59c3dbcf1516114a57b5f8b9e933b8ded
xx@xx-desktop:~$ docker images  # list the new image
REPOSITORY  TAG     IMAGE ID      CREATED         VIRTUAL SIZE
ubuntu      hadoop  bfc32f70813f  15 seconds ago  941.8 MB
6. Start the cluster

Now we can actually build the distributed cluster. Open three terminals and start three containers (master, slave1, and slave2) from the ubuntu:hadoop image we just created:

docker run -it -h=master ubuntu:hadoop
docker run -it -h=slave1 ubuntu:hadoop
docker run -it -h=slave2 ubuntu:hadoop
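As an alternative to three interactive terminals, the three `docker run` invocations can be collected into one launch script. A sketch that generates such a helper; the `-d`/`--name` flags and the file name start-cluster.sh are my own additions, not from the walkthrough:

```shell
#!/bin/sh
# Generate a helper that launches the three cluster containers in
# detached mode rather than in three interactive terminals.
set -e

cat > ./start-cluster.sh <<'EOF'
#!/bin/bash
# Start master and the two slaves from the committed ubuntu:hadoop image.
for node in master slave1 slave2; do
    docker run -itd -h "$node" --name "$node" ubuntu:hadoop
done
EOF
chmod +x ./start-cluster.sh
```

With detached containers you would then attach as needed, e.g. `docker exec -it master bash`.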
Edit /etc/hosts in each of the three containers, adding the IPs of the other containers:

172.17.0.5 master
172.17.0.6 slave1
172.17.0.7 slave2
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
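Rather than editing each /etc/hosts by hand, the cluster entries can be appended with a small loop. A sketch, writing to a scratch file by default; the addresses are this walkthrough's example IPs, and on a real run you would confirm each container's address first (e.g. with `hostname -i` inside the container):

```shell
#!/bin/sh
# Append cluster name->IP entries to a hosts file, skipping duplicates.
# Usage: add-cluster-hosts.sh [hosts-path]  (the article edits /etc/hosts)
set -e

HOSTS_FILE="${1:-./hosts.test}"
touch "$HOSTS_FILE"

# name=ip pairs; these are the example addresses from this walkthrough
for entry in master=172.17.0.5 slave1=172.17.0.6 slave2=172.17.0.7; do
    name="${entry%%=*}"
    ip="${entry#*=}"
    # Only append if the hostname is not already present on some line
    grep -q "[[:space:]]$name\$" "$HOSTS_FILE" || echo "$ip $name" >> "$HOSTS_FILE"
done
```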
After a container starts, some services may not be running, ssh in this case, so you may need to start them manually. The environment variables also need to be re-applied with source /etc/profile:

root@master:/# /etc/init.d/ssh start
 * Starting OpenBSD Secure Shell server sshd            [ OK ]
root@master:/# source /etc/profile
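These per-boot steps can be collected into a small bootstrap script baked into the image, so each container only needs to run one command after starting. A sketch that generates such a script; the name bootstrap.sh is my own choice, not from the article:

```shell
#!/bin/sh
# Generate a bootstrap script that restarts sshd and reloads the profile
# each time a container comes up.
set -e

cat > ./bootstrap.sh <<'EOF'
#!/bin/bash
/etc/init.d/ssh start   # sshd does not start automatically in the container
source /etc/profile     # re-apply JAVA_HOME, HADOOP_HOME, PATH, ...
exec /bin/bash          # hand over to an interactive shell
EOF
chmod +x ./bootstrap.sh
```

Copied into the image before committing, it could serve as the container's entry command.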
Repeat the steps above in each container, then verify that the containers can ssh into one another without a password:

root@master:/# ssh slave1
The authenticity of host 'slave1 (172.17.0.6)' can't be established.
ECDSA key fingerprint is 74:d2:98:c8:dc:f2:ad:4b:48:80:b0:47:dc:37:ae:d5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave1,172.17.0.6' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-49-generic x86_64)
 * Documentation:  https://help.ubuntu.com/
Last login: Tue Feb 23 03:01:15 2016 from localhost
root@slave1:~#
Next, configure the slaves file on master:

root@master:~/hadoop/hadoop-2.7.0/etc/hadoop# vim slaves
root@master:~/hadoop/hadoop-2.7.0/etc/hadoop# cat slaves
slave1
slave2
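The slaves file above is just a list of worker hostnames, one per line, so it can also be written in a single command. A sketch against a scratch path (in the container the real target is $HADOOP_CONFIG_HOME/slaves):

```shell
#!/bin/sh
# Write the worker hostnames that start-all.sh will ssh into.
# Demonstrated on a scratch file; substitute $HADOOP_CONFIG_HOME/slaves
# inside the master container.
set -e

printf '%s\n' slave1 slave2 > ./slaves.test
```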
The whole environment is now set up; all that remains is to start Hadoop on master:
root@master:~/hadoop/hadoop-2.7.0/etc/hadoop# start-all.sh
Finally, run jps on the two slaves; you should see the following processes:

root@slave2:~/hadoop/hadoop-2.7.0/etc/hadoop# jps
146 DataNode
254 NodeManager
351 Jps
You can now open http://172.17.0.5:50070 in a browser on the host, where the IP is the master container's address.
7. Verify the cluster

Finally, test the cluster with one of Hadoop's bundled example jobs (the grep example, which counts matches of a regular expression):

hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/root
hdfs dfs -put etc/hadoop input
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
If it succeeds you will see output like this:

root@master:~/hadoop/hadoop-2.7.0# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
16/02/23 03:34:40 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/02/23 03:34:40 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/02/23 03:34:40 INFO input.FileInputFormat: Total input paths to process : 30
16/02/23 03:34:40 INFO mapreduce.JobSubmitter: number of splits:30
16/02/23 03:34:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1427717513_0001
16/02/23 03:34:40 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/02/23 03:34:40 INFO mapreduce.Job: Running job: job_local1427717513_0001
16/02/23 03:34:40 INFO mapred.LocalJobRunner: OutputCommitter set in config null
....
....
....
16/02/23 03:34:45 INFO mapreduce.Job: Counters: 35
	File System Counters
		FILE: Number of bytes read=1224838
		FILE: Number of bytes written=2240055
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=159752
		HDFS: Number of bytes written=1271
		HDFS: Number of read operations=155
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=16
	Map-Reduce Framework
		Map input records=13
		Map output records=13
		Map output bytes=323
		Map output materialized bytes=355
		Input split bytes=127
		Combine input records=0
		Combine output records=0
		Reduce input groups=5
		Reduce shuffle bytes=355
		Reduce input records=13
		Reduce output records=13
		Spilled Records=26
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=0
		Total committed heap usage (bytes)=1062207488
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=513
	File Output Format Counters
		Bytes Written=245
The final results:

root@master:~/hadoop/hadoop-2.7.0# hdfs dfs -get output output
16/02/23 03:40:47 WARN hdfs.DFSClient: DFSInputStream has been closed already
16/02/23 03:40:47 WARN hdfs.DFSClient: DFSInputStream has been closed already
root@master:~/hadoop/hadoop-2.7.0# cat output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.audit.log.maxbackupindex
2 dfs.period
2 dfs.audit.log.maxfilesize
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
1 dfs.datanode.data.dir
1 dfs.namenode.name.dir
root@master:~/hadoop/hadoop-2.7.0#