
Week 3: Principles and Operations of the HDFS Distributed File System

2014-01-09 16:26


## A quick test of the freshly installed Hadoop cluster
[root@hadoop1 hadoop]# pwd
/nosql/hadoop
[root@hadoop1 hadoop]# mkdir input
[root@hadoop1 hadoop]# cd input/
[root@hadoop1 input]# echo "hello word" > test1.txt
[root@hadoop1 input]# echo "hello hadoop" > test2.txt
[root@hadoop1 input]# cat test1.txt
hello word
[root@hadoop1 input]# cat test2.txt
hello hadoop
[root@hadoop1 input]#
[root@hadoop1 input]#
[root@hadoop1 input]# cd /nosql/hadoop/hadoop-0.20.2/bin/
[root@hadoop1 bin]# ./hadoop dfs -put /nosql/hadoop/input/ in
[root@hadoop1 bin]# ./hadoop dfs -ls ./in/*
-rw-r--r-- 2 root supergroup 11 2014-01-09 05:55 /user/root/in/test1.txt
-rw-r--r-- 2 root supergroup 13 2014-01-09 05:55 /user/root/in/test2.txt
[root@hadoop1 bin]#
[root@hadoop1 bin]#
[root@hadoop1 bin]# ./hadoop jar ../hadoop-0.20.2-examples.jar wordcount in out
14/01/09 07:58:35 INFO input.FileInputFormat: Total input paths to process : 2
14/01/09 07:58:35 INFO mapred.JobClient: Running job: job_201401090755_0001
14/01/09 07:58:36 INFO mapred.JobClient: map 0% reduce 0%
14/01/09 07:58:45 INFO mapred.JobClient: map 50% reduce 0%
14/01/09 07:58:49 INFO mapred.JobClient: map 100% reduce 0%
14/01/09 07:58:58 INFO mapred.JobClient: map 100% reduce 100%
14/01/09 07:58:59 INFO mapred.JobClient: Job complete: job_201401090755_0001
14/01/09 07:58:59 INFO mapred.JobClient: Counters: 17
14/01/09 07:58:59 INFO mapred.JobClient: Job Counters
14/01/09 07:58:59 INFO mapred.JobClient: Launched reduce tasks=1
14/01/09 07:58:59 INFO mapred.JobClient: Launched map tasks=2
14/01/09 07:58:59 INFO mapred.JobClient: Data-local map tasks=2
14/01/09 07:58:59 INFO mapred.JobClient: FileSystemCounters
14/01/09 07:58:59 INFO mapred.JobClient: FILE_BYTES_READ=54
14/01/09 07:58:59 INFO mapred.JobClient: HDFS_BYTES_READ=24
14/01/09 07:58:59 INFO mapred.JobClient: FILE_BYTES_WRITTEN=178
14/01/09 07:58:59 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=24
14/01/09 07:58:59 INFO mapred.JobClient: Map-Reduce Framework
14/01/09 07:58:59 INFO mapred.JobClient: Reduce input groups=3
14/01/09 07:58:59 INFO mapred.JobClient: Combine output records=4
14/01/09 07:58:59 INFO mapred.JobClient: Map input records=2
14/01/09 07:58:59 INFO mapred.JobClient: Reduce shuffle bytes=60
14/01/09 07:58:59 INFO mapred.JobClient: Reduce output records=3
14/01/09 07:58:59 INFO mapred.JobClient: Spilled Records=8
14/01/09 07:58:59 INFO mapred.JobClient: Map output bytes=40
14/01/09 07:58:59 INFO mapred.JobClient: Combine input records=4
14/01/09 07:58:59 INFO mapred.JobClient: Map output records=4
14/01/09 07:58:59 INFO mapred.JobClient: Reduce input records=4
[root@hadoop1 bin]# ./hadoop dfs -ls
Found 2 items
drwxr-xr-x - root supergroup 0 2014-01-09 07:57 /user/root/in
drwxr-xr-x - root supergroup 0 2014-01-09 07:58 /user/root/out
[root@hadoop1 bin]# ./hadoop dfs -ls ./out
Found 2 items
drwxr-xr-x - root supergroup 0 2014-01-09 07:58 /user/root/out/_logs
-rw-r--r-- 2 root supergroup 24 2014-01-09 07:58 /user/root/out/part-r-00000
[root@hadoop1 bin]# ./hadoop dfs -cat ./out/*
hadoop 1
hello 2
word 1
cat: Source must be a file.
[root@hadoop1 bin]#
The trailing "cat: Source must be a file." appears because the ./out/* glob also matches the _logs directory, which cat cannot read; cat the part file directly (as shown later in this post) to avoid it.

I ran into many errors while working through this experiment; see:
/article/4370125.html
/content/3793962.html

Watching Hadoop activity through the web interface

Monitor the JobTracker by pointing a browser at port 50030 on the node where the JobTracker runs; monitor the cluster by browsing to port 50070 on the NameNode's node:

http://192.168.136.128:50030/jobtracker.jsp
http://192.168.136.128:50070/dfshealth.jsp
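To check the pages from the shell instead, curl works too (a minimal sketch, assuming curl is installed and reusing the cluster IP from this setup; each command should print 200 when the daemon is up):

# curl -s -o /dev/null -w "%{http_code}\n" http://192.168.136.128:50030/jobtracker.jsp
# curl -s -o /dev/null -w "%{http_code}\n" http://192.168.136.128:50070/dfshealth.jsp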

The HDFS distributed file system

HDFS design assumptions and goals
Hardware failure is the norm, not the exception, so redundancy is required.
Streaming data access: data is read in large batches rather than randomly; Hadoop is built for data analysis, not transaction processing.
Large-scale data sets.
A simple coherency model: to keep the system simple, files follow a write-once-read-many design; once a file has been written and closed, it can never be modified (a quick illustration follows this list).
Programs are assigned to nodes following the "move computation close to the data" principle.
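To see the write-once model from the shell, try re-uploading a file that already exists in HDFS: the put fails instead of overwriting. A minimal sketch using the in/test1.txt uploaded earlier in this session (the exact error wording may differ across Hadoop versions):

# ./hadoop dfs -put /nosql/hadoop/input/test1.txt in/test1.txt
put: Target in/test1.txt already exists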

HDFS file operations
Command-line interface (demonstrated below)
Programmatic API (for example, the Java org.apache.hadoop.fs.FileSystem class)

## List files in HDFS
## Note: HDFS has no concept of a current working directory and no cd command;
## relative paths such as ./in resolve under /user/<username>
# ./hadoop dfs -ls
# ./hadoop dfs -ls ./in
# ./hadoop dfs -ls ./out

## Upload files to HDFS
# ./hadoop dfs -put /nosql/hadoop/input/ in

## Copy files from HDFS to the local filesystem
# ./hadoop dfs -get in /tmp/abc

## Delete a file or directory in HDFS (-rmr deletes recursively)
# ./hadoop dfs -rmr in

## View the contents of a file in HDFS
# ./hadoop dfs -cat ./out/part-r-00000

## View basic HDFS statistics
[root@hadoop1 bin]# ./hadoop dfsadmin -report
Configured Capacity: 58035453952 (54.05 GB)
Present Capacity: 26640138240 (24.81 GB)
DFS Remaining: 26639933440 (24.81 GB)
DFS Used: 204800 (200 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)
Name: 192.168.136.130:50010
Decommission Status : Normal
Configured Capacity: 29017726976 (27.02 GB)
DFS Used: 102400 (100 KB)
Non DFS Used: 15697620992 (14.62 GB)
DFS Remaining: 13320003584(12.41 GB)
DFS Used%: 0%
DFS Remaining%: 45.9%
Last contact: Thu Jan 09 11:10:20 EST 2014
Name: 192.168.136.129:50010
Decommission Status : Normal
Configured Capacity: 29017726976 (27.02 GB)
DFS Used: 102400 (100 KB)
Non DFS Used: 15697694720 (14.62 GB)
DFS Remaining: 13319929856(12.41 GB)
DFS Used%: 0%
DFS Remaining%: 45.9%
Last contact: Thu Jan 09 11:10:21 EST 2014

## Entering and leaving safe mode
[root@hadoop1 bin]# ./hadoop dfsadmin -safemode enter
Safe mode is ON
[root@hadoop1 bin]# ./hadoop dfsadmin -safemode leave
Safe mode is OFF
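Besides enter and leave, -safemode also accepts get (query the current state) and wait (block until safe mode turns off), which is handy in scripts:

# ./hadoop dfsadmin -safemode get
Safe mode is OFF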

How do you add a node?

Install Hadoop on the new node.
Copy the relevant configuration files from the namenode to the new node.
Edit the masters and slaves files to add the node.
Set up passwordless SSH to and from the node.
Start the datanode and tasktracker on that node individually (hadoop-daemon.sh start datanode/tasktracker).
Run start-balancer.sh to rebalance the data (see the command sketch after this list).
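A rough command sketch of those steps. The new node's host name (hadoop3) is an example rather than something from the original post; the paths follow the layout used above:

## On the master (hadoop1): register the new node and ship the configuration
# echo "hadoop3" >> /nosql/hadoop/hadoop-0.20.2/conf/slaves
# scp /nosql/hadoop/hadoop-0.20.2/conf/* hadoop3:/nosql/hadoop/hadoop-0.20.2/conf/
## On the new node: start its daemons one by one
# /nosql/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh start datanode
# /nosql/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh start tasktracker
## Back on the master: spread existing blocks onto the new node
# /nosql/hadoop/hadoop-0.20.2/bin/start-balancer.sh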

Starting specific daemons rather than all of them

## Contents of start-all.sh
[root@hadoop1 bin]# cat start-all.sh
#!/usr/bin/env bash
# Start all hadoop daemons. Run this on master node.
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
. "$bin"/hadoop-config.sh
# start dfs daemons
"$bin"/start-dfs.sh --config $HADOOP_CONF_DIR
# start mapred daemons
"$bin"/start-mapred.sh --config $HADOOP_CONF_DIR

Load balancing

## Purpose: after a node fails or a new node joins, data blocks can end up unevenly
## distributed; the balancer redistributes blocks across the datanodes
[root@hadoop1 bin]# ./start-balancer.sh
starting balancer, logging to /nosql/hadoop/hadoop-0.20.2/bin/../logs/hadoop-root-balancer-hadoop1.out
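start-balancer.sh also accepts a -threshold option (a percentage): the balancer runs until every datanode's utilization is within that margin of the cluster average, and stop-balancer.sh halts a running balancer:

# ./start-balancer.sh -threshold 5
# ./stop-balancer.sh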