
A Small Hadoop Cluster Project in Practice, and How I Solved the Problems I Ran Into

2017-07-03 17:52

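What the project does:

Create an external partitioned table hmbbs, partitioned by the current timestamp $CURRENT, to hold the cleaned log records
Use flume to bring the access logs into HDFS (the script below uploads them into /flume/$CURRENT with hadoop fs -put)
Run a cleaning job on the data and save the cleaned output under /cleaned/$CURRENT
Add a partition to the hmbbs table so that the cleaned data becomes queryable
Query the records of interest and write them to a Hive table; the actual data is stored in HDFS
Export that data from HDFS to MySQL running on a Windows machine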

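Steps:

1. Enter hive and drop the existing tables, one drop statement per table (my machine is itcast03)
drop table <table_name>;

2. Initialize: create an external partitioned table with '\t' as the field delimiter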
create external table hmbbs (ip string, logtime string, url string) partitioned by (logdate string) row format delimited fields terminated by '\t' location '/cleaned';

3. Create a shell script; the idea is to use it to run the computation once a day
touch daily.sh
# add execute permission
chmod +x daily.sh

Open daily.sh and add the following script
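# Take the current time (%Y is the calendar year; %G, the ISO week-based year, can differ around New Year)
CURRENT=$(/bin/date +%Y%m%d%H%M%S)

# Append the timestamp to mylog.txt
echo "$CURRENT" >> /root/mylog.txt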

# Upload the data to the Hadoop cluster, into the HDFS working directory
/itcast/hadoop-2.2.0/bin/hadoop fs -put /root/access_data.log /flume/$CURRENT

# Clean the data to reduce the I/O load of the later steps
/itcast/hadoop-2.2.0/bin/hadoop jar /root/cleaner.jar /flume/$CURRENT /cleaned/$CURRENT

# Add the cleaned directory to hmbbs as a new partition
/itcast/apache-hive-0.13.0-bin/bin/hive -e "alter table hmbbs add partition (logdate=$CURRENT) location '/cleaned/$CURRENT'"

# For testing: check that data can be queried from the partition just added
#/itcast/apache-hive-0.13.0-bin/bin/hive -e "select count(*) from hmbbs where logdate = $CURRENT"

# For testing: count the distinct IPs in the partition just added
#/itcast/apache-hive-0.13.0-bin/bin/hive -e "select count(distinct ip) from hmbbs where logdate = $CURRENT"

# For testing: count the requests whose URL contains member.php?mod=register
#/itcast/apache-hive-0.13.0-bin/bin/hive -e "select count(*) from hmbbs where logdate = $CURRENT and instr(url, 'member.php?mod=register')>0;"

# Find the top 20 IPs with at least 50 records in this partition and write the result to table vip_$CURRENT
/itcast/apache-hive-0.13.0-bin/bin/hive -e "create table vip_$CURRENT row format delimited fields terminated by '\t' as select ip, count(*) as vtimes from hmbbs where logdate = $CURRENT  group by ip having vtimes >= 50 order by vtimes desc limit 20"

# Export the contents of vip_$CURRENT, i.e. its files in the Hive warehouse directory on HDFS, to the MySQL database on the remote host 192.168.8.100
/itcast/sqoop-1.4.4/bin/sqoop export --connect jdbc:mysql://192.168.8.100:3306/itcast --username root --password 1234 --export-dir "/user/hive/warehouse/vip_$CURRENT" --table vip --fields-terminated-by '\t'

# Report completion
echo "Finished!"
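
As written, the script carries on when a step fails, so a failed cleaning job would still be followed by the partition, query and export steps. Below is a minimal sketch of stopping at the first failure. `run_step` and `pipeline` are hypothetical helpers that are not part of the original script, and the `true`/`false` commands only stand in for the real hadoop, hive and sqoop calls.

```shell
#!/bin/bash
# run_step LABEL COMMAND [ARGS...]
# Hypothetical helper: logs the label, runs the command, and returns
# the command's exit status (reporting failures on stderr).
run_step() {
    local label=$1
    shift
    echo "[step] $label"
    "$@"
    local status=$?
    if [ "$status" -ne 0 ]; then
        echo "[fail] $label (exit $status)" >&2
    fi
    return "$status"
}

# Run the steps in order and stop at the first one that fails.
# true/false are placeholders for the real commands in daily.sh.
pipeline() {
    run_step "upload" true  || return 1
    run_step "clean"  false || return 1
    run_step "export" true  || return 1
    echo "Finished!"
}
```

Calling `pipeline` here prints the upload and clean steps and then returns 1 without reaching the export step. In the real script each placeholder would be replaced by the corresponding full command line.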

4. Use cron to run the script every 8 minutes (a MapReduce run takes quite a while)
# open the crontab editor
crontab -e
# add the script and its schedule
*/8 * * * * /root/daily.sh >> /root/runlog.txt
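
One detail about this schedule: `*/8` in the minute field matches every minute of the hour that is divisible by 8. It is not a strict 8-minute interval. The short sketch below lists the matching minutes; the variable name is only for illustration.

```shell
# Minutes of each hour matched by */8 in the cron minute field
cron_minutes=$(echo $(seq 0 8 59))
echo "$cron_minutes"
# prints: 0 8 16 24 32 40 48 56
```

The run at minute 56 is followed by the next one at minute 0, only 4 minutes later, which is worth keeping in mind if a single run can take longer than that.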

5. Query the exported data from a MySQL client on Windows
SELECT COUNT(*) FROM vip;
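
To sanity-check what ends up in the vip table, the same aggregation can be approximated locally on a copy of a cleaned, tab-separated file. This is only a rough sketch: `top_ips` is a hypothetical helper, and it assumes the IP is the first tab-separated column, as in the hmbbs table definition.

```shell
# top_ips FILE MIN_HITS LIMIT
# Print "ip<TAB>hits" for IPs with at least MIN_HITS rows in FILE,
# most frequent first, at most LIMIT lines.
top_ips() {
    cut -f1 "$1" | sort | uniq -c | sort -k1,1nr -k2,2 |
        awk -v min="$2" '$1 >= min { printf "%s\t%s\n", $2, $1 }' |
        head -n "$3"
}
```

For example, `top_ips cleaned_sample.txt 50 20` mirrors the thresholds used in the Hive query, where cleaned_sample.txt is a placeholder name for a file fetched from HDFS.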

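Running daily.sh by hand worked fine. Once it ran from crontab, however, it reported errors, even though the same script still worked when run by hand. Very odd, and repeated tests gave the same result every time.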

The errors in runlog.txt were:

Cannot find hadoop installation: $HADOOP_HOME or $HADOOP_PREFIX must be set or hadoop must be in the path
Cannot find hadoop installation: $HADOOP_HOME or $HADOOP_PREFIX must be set or hadoop must be in the path
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.

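The gist of the messages is that I had not set HADOOP_HOME or something similar, which puzzled me. I call hadoop by its absolute path and the script runs fine on its own, so I expected it to work under crontab as well. So I started checking.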

1. Check my environment file /etc/profile; HADOOP_HOME is exported at the end
[root@itcast03 ~]# more /etc/profile | tail -3
export JAVA_HOME=/usr/java/jdk1.7.0_60
export HADOOP_HOME=/itcast/hadoop-2.2.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:/itcast/hadoop-2.2.0/bin/hadoop
2. In case the configuration was still off, I ran echo $HADOOP_HOME to confirm that the variable had taken effect; the result was fine too
[root@itcast03 ~]# echo $HADOOP_HOME
/itcast/hadoop-2.2.0
3. So perhaps crontab does not load the variables from /etc/profile when it runs a job? To check further, I opened /etc/crontab, the system-wide cron configuration, and had a look
[root@itcast03 ~]# cat /etc/crontab
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/
# For details see man 4 crontabs
# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * user-name command to be executed
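4. Searching online, I found that this is a known quirk of cron: by default it does not read environment variables from the user's profile files, which is why a script often succeeds when run by hand but fails when cron runs it on a schedule. Calling hive by its absolute path does not help, because, as the message says, the hive launcher itself needs $HADOOP_HOME, $HADOOP_PREFIX, or hadoop on the PATH in order to find Hadoop.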
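
The difference can be reproduced from an interactive shell by running a command with an almost empty environment, similar to what cron provides. This is a rough sketch: `cron_like` is a hypothetical helper, and the PATH value is an assumption, since cron's defaults vary between distributions.

```shell
# cron_like COMMAND_STRING
# Run a command string with a minimal, cron-like environment.
# The PATH below is an assumed default, not taken from any specific system.
cron_like() {
    env -i SHELL=/bin/sh HOME="${HOME:-/}" PATH=/usr/bin:/bin /bin/sh -c "$1"
}

# HADOOP_HOME from the login shell is not visible inside the stripped environment
cron_like 'echo "HADOOP_HOME=${HADOOP_HOME:-<unset>}"'
# prints: HADOOP_HOME=<unset>
```

Running `cron_like /root/daily.sh` in the same way should show the same "Cannot find hadoop installation" messages without waiting for the next scheduled run.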

5. According to what I found, there are two fixes
1) In crontab -e, prefix the command with . /etc/profile;/bin/sh (note the space after the ".")
*/3 * * * * . /etc/profile;/bin/sh /root/daily.sh >> runlog.txt

2) Add this at the top of the script
source /etc/profile
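
A variant of the second fix is to load the profile only when the variable is missing, and to fail early with a clear message if it is still missing afterwards. The sketch below is one way to do that; `ensure_hadoop_env` and its optional profile-path argument are hypothetical names used for illustration.

```shell
# ensure_hadoop_env [PROFILE]
# Make sure HADOOP_HOME is set, sourcing PROFILE (default /etc/profile)
# if it is not. Returns 1 if the variable is still unset afterwards.
ensure_hadoop_env() {
    profile=${1:-/etc/profile}
    if [ -z "${HADOOP_HOME:-}" ] && [ -r "$profile" ]; then
        . "$profile"
    fi
    if [ -z "${HADOOP_HOME:-}" ]; then
        echo "HADOOP_HOME is not set after reading $profile" >&2
        return 1
    fi
}
```

Placed at the top of daily.sh as `ensure_hadoop_env || exit 1`, it would make a missing environment show up as one explicit line in runlog.txt instead of a failure deep inside hive or sqoop.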
Final test results:

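Both approaches work. This crontab problem cost me half a day of repeated testing; thanks to the people who shared their notes online, it is finally solved. The Hadoop cluster now keeps exporting its results to the remote MySQL instance on Windows.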
