A Ramble on Hadoop
2013-01-14 21:16
Contents
I. Introduction to Hadoop
1. Overview
Hadoop is an open-source Apache project written in Java, offering high availability, high reliability, and distributed computing. Its foundation is HDFS, a distributed file system built on inexpensive commodity PCs, so for HDFS hardware unreliability is the norm: machine crashes and disk failures can both lose data. HDFS's data replication mechanism, however, ensures that data survives crashed machines and damaged disks. Hadoop's ability to process big data quickly rests entirely on this reliable distributed file system; MapReduce is built on top of HDFS. When Hadoop processes data, it moves the computation to the node that holds the data, avoiding the disk I/O cost of moving the data itself and thereby keeping MapReduce processing fast. Google's MapReduce is likewise a computation engine built on GFS: Google crawls a massive number of URLs every day for PageRank, and relies on MapReduce to process and analyze that volume of data. Any discussion of Hadoop therefore has to start with HDFS.
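The map/shuffle/reduce flow described above can be sketched in miniature with a toy word count (plain Python, no Hadoop involved; this illustrates only the programming model, not the framework itself):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word
    return {word: sum(counts) for word, counts in groups.items()}

records = ["hello hadoop", "hello hdfs"]
counts = reduce_phase(shuffle(map_phase(records)))
# counts == {"hello": 2, "hadoop": 1, "hdfs": 1}
```

In real Hadoop the map and reduce functions run on different machines and the shuffle moves data over the network; the point of the architecture below is that the map step runs where the data already sits.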
1.1 HDFS
1.1.1 Node roles: Namenode, Secondarynamenode, Datanode, Jobtracker, Tasktracker
The Namenode manages the entire namespace and all metadata; clients consult it to obtain the block locations of a file, and every file read and write goes through its metadata and namespace. It plays a critical role in HDFS: if the Namenode goes down it must be restored promptly, and in the current architecture it remains a single point of failure.
The Secondarynamenode communicates only with the Namenode. Its main job is to merge the Namenode's edits file into the fsimage file and keep the merged result locally, which both protects the metadata (and hence the files) if the Namenode crashes, and keeps the Namenode's edits file from growing without bound. The Secondarynamenode POSTs the merged file back over HTTP into the Namenode's fs.name.dir directory.
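The checkpoint idea — replaying the logged edits over the last saved fsimage to produce a fresh fsimage — can be illustrated with a toy model (the dict-based namespace and operation names here are hypothetical, not the real HDFS structures):

```python
def apply_edit(namespace, edit):
    # Replay one logged operation against the in-memory namespace
    op, path = edit
    if op == "create":
        namespace[path] = {}
    elif op == "delete":
        namespace.pop(path, None)
    return namespace

def checkpoint(fsimage, edits):
    # Merge: start from the last saved image and replay every edit.
    # The result becomes the new fsimage, and the edits log can be truncated.
    namespace = dict(fsimage)
    for edit in edits:
        apply_edit(namespace, edit)
    return namespace

old_image = {"/user/cc": {}}
edits = [("create", "/user/cc/test"), ("delete", "/user/cc"), ("create", "/tmp")]
new_image = checkpoint(old_image, edits)
# new_image == {"/user/cc/test": {}, "/tmp": {}}
```

This is why the edits file stays small: once its operations are folded into the image, they no longer need to be kept.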
The Datanode, as its name suggests, stores the data. Datanodes communicate with the Namenode over RPC, and the Namenode decides where data is placed.
The Jobtracker accepts MapReduce jobs and schedules their tasks.
The Tasktracker is the node that actually analyzes the data, and it runs on the same machine as a Datanode. Through its heartbeat it "volunteers", telling the Jobtracker it can accept tasks; if the input data lives on that Datanode, the data never has to move — the task reads and writes the local disk directly to process and analyze it. The concrete architecture is shown below.
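That data-local assignment decision can be sketched as follows (hypothetical names and structures; the real JobTracker scheduling logic is far more involved):

```python
def pick_task(tasks, heartbeat_node):
    # Prefer a task whose input block lives on the node that sent the
    # heartbeat, so the map reads local disk instead of pulling data
    # over the network.
    for task in tasks:
        if heartbeat_node in task["block_locations"]:
            return task
    # Otherwise fall back to any pending task (a non-local assignment).
    return tasks[0] if tasks else None

pending = [
    {"id": "m0", "block_locations": {"nodeA", "nodeB"}},
    {"id": "m1", "block_locations": {"nodeC"}},
]
local = pick_task(pending, "nodeC")      # data-local: picks m1
fallback = pick_task(pending, "nodeD")   # no local block: falls back to m0
```

The "Data-local map tasks=1" counter in the job output later in this article shows this preference working on a real run.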
1.1.2 Architecture
II. Installing Hadoop
2.1 Environment and prerequisites (pseudo-distributed installation)
Operating system: Ubuntu 11.10; hadoop-1.1.1; JDK 1.7
2.2 Installation steps
2.2.1 Install Java
Download the JDK and extract it to cc@ubuntu:~/software/jdk1.7.0_07. To configure the Java environment variables, edit /etc/profile and append the following:
JAVA_HOME=/home/cc/software/jdk1.7.0_07
CLASSPATH=$JAVA_HOME/lib:$JAVA_HOME/jre/lib
PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
export JAVA_HOME CLASSPATH PATH
Then run source /etc/profile so the variables take effect.
2.2.2 Install hadoop-1.1.1
Download hadoop-1.1.1 and extract it to cc@ubuntu:~/software/hadoop-1.1.1$.
The configuration files live under cc@ubuntu:~/software/hadoop-1.1.1/conf$. The files to modify are core-site.xml, mapred-site.xml, hdfs-site.xml, and hadoop-env.sh. (In pseudo-distributed mode the masters and slaves files need no changes; both stay as localhost.)
The content to add to each configuration file is as follows:
core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/cc/data/tmp</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/cc/data/namedir</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/cc/data/datadir</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9090</value>
  </property>
</configuration>
hadoop-env.sh
export JAVA_HOME=/home/cc/software/jdk1.7.0_07
export HADOOP_LOG_DIR=/home/cc/data/logs
export HADOOP_PID_DIR=/home/cc/data/pids
2.2.3 Configure passwordless SSH login
cc@ubuntu:~$ mkdir .ssh
cc@ubuntu:~$ cd .ssh/
cc@ubuntu:~/.ssh$ ssh-keygen -t rsa
cc@ubuntu:~/.ssh$ cat id_rsa.pub > authorized_keys
At this point ssh localhost should log in directly. If it fails with the error "agent admitted failure to sign using the key", running ssh-add resolves it.
2.3 Running Hadoop
2.3.1 Start Hadoop
Format the namenode: cc@ubuntu:~/software/hadoop-1.1.1$ bin/hadoop namenode -format
Start the cluster: cc@ubuntu:~/software/hadoop-1.1.1$ bin/start-all.sh
2.3.2 Verify that Hadoop works
1. Check that the processes are running
cc@ubuntu:~/software/hadoop-1.1.1$ jps
9275 NameNode
10058 TaskTracker
9518 DataNode
9739 SecondaryNameNode
9821 JobTracker
13483 Jps
2. Put a file into HDFS
cc@ubuntu:~/software/hadoop-1.1.1$ bin/hadoop fs -mkdir test
cc@ubuntu:~/software/hadoop-1.1.1$ bin/hadoop fs -put README.txt test
cc@ubuntu:~/software/hadoop-1.1.1$ bin/hadoop fs -ls test
Found 1 items
-rw-r--r--   1 cc supergroup       1366 2013-01-26 00:24 /user/cc/test/README.txt
cc@ubuntu:~/software/hadoop-1.1.1$
3. Run a MapReduce program
cc@ubuntu:~/software/hadoop-1.1.1$ bin/hadoop jar hadoop-examples-1.1.1.jar pi 1 4
Number of Maps  = 1
Samples per Map = 4
Wrote input for Map #0
Starting Job
13/01/26 00:26:04 INFO mapred.FileInputFormat: Total input paths to process : 1
13/01/26 00:26:05 INFO mapred.JobClient: Running job: job_201301252155_0001
13/01/26 00:26:06 INFO mapred.JobClient:  map 0% reduce 0%
13/01/26 00:26:13 INFO mapred.JobClient:  map 100% reduce 0%
13/01/26 00:26:21 INFO mapred.JobClient:  map 100% reduce 33%
13/01/26 00:26:23 INFO mapred.JobClient:  map 100% reduce 100%
13/01/26 00:26:24 INFO mapred.JobClient: Job complete: job_201301252155_0001
13/01/26 00:26:24 INFO mapred.JobClient: Counters: 30
13/01/26 00:26:24 INFO mapred.JobClient:   Job Counters
13/01/26 00:26:24 INFO mapred.JobClient:     Launched reduce tasks=1
13/01/26 00:26:24 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=6649
13/01/26 00:26:24 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/26 00:26:24 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/01/26 00:26:24 INFO mapred.JobClient:     Launched map tasks=1
13/01/26 00:26:24 INFO mapred.JobClient:     Data-local map tasks=1
13/01/26 00:26:24 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=9398
13/01/26 00:26:24 INFO mapred.JobClient:   File Input Format Counters
13/01/26 00:26:24 INFO mapred.JobClient:     Bytes Read=118
13/01/26 00:26:24 INFO mapred.JobClient:   File Output Format Counters
13/01/26 00:26:24 INFO mapred.JobClient:     Bytes Written=97
13/01/26 00:26:24 INFO mapred.JobClient:   FileSystemCounters
13/01/26 00:26:24 INFO mapred.JobClient:     FILE_BYTES_READ=28
13/01/26 00:26:24 INFO mapred.JobClient:     HDFS_BYTES_READ=237
13/01/26 00:26:24 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=47877
13/01/26 00:26:24 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
13/01/26 00:26:24 INFO mapred.JobClient:   Map-Reduce Framework
13/01/26 00:26:24 INFO mapred.JobClient:     Map output materialized bytes=28
13/01/26 00:26:24 INFO mapred.JobClient:     Map input records=1
13/01/26 00:26:24 INFO mapred.JobClient:     Reduce shuffle bytes=28
13/01/26 00:26:24 INFO mapred.JobClient:     Spilled Records=4
13/01/26 00:26:24 INFO mapred.JobClient:     Map output bytes=18
13/01/26 00:26:24 INFO mapred.JobClient:     Total committed heap usage (bytes)=196018176
13/01/26 00:26:24 INFO mapred.JobClient:     CPU time spent (ms)=1900
13/01/26 00:26:24 INFO mapred.JobClient:     Map input bytes=24
13/01/26 00:26:24 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
13/01/26 00:26:24 INFO mapred.JobClient:     Combine input records=0
13/01/26 00:26:24 INFO mapred.JobClient:     Reduce input records=2
13/01/26 00:26:24 INFO mapred.JobClient:     Reduce input groups=2
13/01/26 00:26:24 INFO mapred.JobClient:     Combine output records=0
13/01/26 00:26:24 INFO mapred.JobClient:     Physical memory (bytes) snapshot=167493632
13/01/26 00:26:24 INFO mapred.JobClient:     Reduce output records=0
13/01/26 00:26:24 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=780718080
13/01/26 00:26:24 INFO mapred.JobClient:     Map output records=2
Job Finished in 20.375 seconds
Estimated value of Pi is 4.00000000000000000000
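The estimate of 4.0 above is not a bug: the job was run with only 1 map and 4 samples, far too few for convergence. The example estimates π by sampling points in the unit square and counting how many fall inside the quarter circle. A toy single-process version of that idea (here with plain pseudo-random sampling as a simplifying assumption; Hadoop's pi example uses quasi-random Halton sequences):

```python
import random

def estimate_pi(num_samples, seed=0):
    # Sample points in the unit square; the fraction landing inside the
    # quarter circle of radius 1 approximates pi/4, so multiply by 4.
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

pi_est = estimate_pi(100000)  # converges toward 3.14159... as samples grow
```

With only 4 samples, all of them happening to land inside the quarter circle yields 4 * 4/4 = 4.0, exactly what the job printed; rerunning with, say, bin/hadoop jar hadoop-examples-1.1.1.jar pi 10 100000 gives a much closer estimate.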
Hadoop streaming data processing
Hadoop Java API data processing