Calling MapReduce from Python and using multiple input files (counting URL clicks)

2017-08-28 10:16
1. Script that counts clicks on a given link in the logs
The business case does not currently need a reduce phase, so there is only a mapper script.
/Users/nisj/PycharmProjects/BiDataProc/hitsCalc3/filter_mapperOnly.py
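#!/usr/bin/env python
# encoding: utf-8
import sys

# Substring that identifies a click on the target link
TARGET = '/event/apply/template/yhzrsolo.htm?s_=rmhd'

# Read the log lines from standard input (stdin)
for line in sys.stdin:
    # Strip leading and trailing whitespace, including the newline,
    # so that print does not emit a blank line after each match
    line = line.strip()
    if TARGET in line:
        print(line)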
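
Before submitting the job, the filter logic can be checked locally. The following is only a minimal sketch: the function name `filter_lines` and the sample log lines are invented for illustration and are not part of the original script; only the target substring comes from the mapper.

```python
import io

# The target substring is copied from the mapper script.
TARGET = '/event/apply/template/yhzrsolo.htm?s_=rmhd'


def filter_lines(stream, target=TARGET):
    """Yield each stripped line of `stream` that contains `target`."""
    for line in stream:
        line = line.strip()
        if target in line:
            yield line


# Hypothetical access-log lines, invented for this example only.
sample = io.StringIO(
    u'1.2.3.4 "GET /event/apply/template/yhzrsolo.htm?s_=rmhd HTTP/1.1" 200\n'
    u'1.2.3.5 "GET /index.htm HTTP/1.1" 200\n'
    u'1.2.3.6 "GET /event/apply/template/yhzrsolo.htm?s_=rmhd HTTP/1.1" 200\n'
)

matched = list(filter_lines(sample))
print(len(matched))  # two of the three sample lines contain the target
```

Passing `sys.stdin` instead of the sample stream would presumably give the same behaviour as the mapper when fed a real log file through a pipe.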

2. Calling MapReduce from Python
2.1 Per-day counts
Each run processes one day of log files and yields the number of clicks on the link for that day.
hadoop dfs -rm -r -skipTrash /nisj/mp_result;
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py -file /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py \
-input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.*.txt \
-output /nisj/mp_result

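2.2 Click counts for certain hours within a day
A glob pattern covers this need; a bracket expression such as [1-4] matches exactly one character. Note that the example below puts the bracket on the last digit of the day, so it selects the logs for 2017-08-21 through 2017-08-24; placing the bracket in the hour field instead selects specific hours.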
hadoop dfs -rm -r -skipTrash /nisj/mp_result;
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py -file /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py \
-input /tmp/oss_access/2017-08-2[1-4]/*_localhost_access_log.2017-08-2[1-4].*.txt \
-output /nisj/mp_result
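
To see which files a bracket pattern selects, the pattern can be tried against a few names locally. This is a rough sketch using Python's `fnmatch`, whose bracket handling matches the single-character behaviour described above; it is not Hadoop's own glob implementation (for instance, `*` in `fnmatch` also matches `/`). The `web01` host prefix and the candidate paths are assumptions made up for the example.

```python
from fnmatch import fnmatchcase

# Pattern copied from the -input argument of the command above.
PATTERN = ('/tmp/oss_access/2017-08-2[1-4]/'
           '*_localhost_access_log.2017-08-2[1-4].*.txt')

# Hypothetical paths; the "web01" prefix and hour suffixes are assumptions.
candidates = [
    '/tmp/oss_access/2017-08-20/web01_localhost_access_log.2017-08-20.10.txt',
    '/tmp/oss_access/2017-08-21/web01_localhost_access_log.2017-08-21.00.txt',
    '/tmp/oss_access/2017-08-24/web01_localhost_access_log.2017-08-24.23.txt',
    '/tmp/oss_access/2017-08-25/web01_localhost_access_log.2017-08-25.05.txt',
]

# Keep only the paths whose day digit falls inside the [1-4] bracket.
selected = [path for path in candidates if fnmatchcase(path, PATTERN)]

for path in selected:
    print(path)
```

Only the paths for the 21st and the 24th survive, because `[1-4]` stands for one character in the range 1 to 4.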

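2.3 Cross-day counts for certain hours, using glob patterns and multiple input files
Multiple input files can be passed in either of the two ways below: as repeated -input options, or as several space-separated paths after a single -input. In testing, both gave the same result.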
hadoop dfs -rm -r -skipTrash /nisj/mp_result;
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py -file /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py \
-input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.1[8-9].txt \
-input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.2[0-3].txt \
-input /tmp/oss_access/2017-08-22/*_localhost_access_log.2017-08-22.0[0-9].txt \
-input /tmp/oss_access/2017-08-22/*_localhost_access_log.2017-08-22.1[0-8].txt \
-output /nisj/mp_result
hadoop dfs -rm -r -skipTrash /nisj/mp_result;
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py -file /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py \
-input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.1[8-9].txt /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.2[0-3].txt /tmp/oss_access/2017-08-22/*_localhost_access_log.2017-08-22.0[0-9].txt /tmp/oss_access/2017-08-22/*_localhost_access_log.2017-08-22.1[0-8].txt \
-output /nisj/mp_result
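
It can help to confirm that the four hour patterns really cover one contiguous window. The sketch below enumerates the hours each bracket pattern matches; it assumes the last field of the file name is a two-digit hour (00 to 23), and the function name `covered_hours` is made up for illustration.

```python
from fnmatch import fnmatchcase

# (day, hour pattern) pairs copied from the four -input paths above.
INPUTS = [
    ('2017-08-21', '1[8-9]'),
    ('2017-08-21', '2[0-3]'),
    ('2017-08-22', '0[0-9]'),
    ('2017-08-22', '1[0-8]'),
]


def covered_hours(inputs):
    """Return the sorted (day, hour) slots matched by the hour patterns."""
    slots = set()
    for day, pattern in inputs:
        for hour in range(24):
            # Assumption: hours appear zero-padded to two digits.
            label = '%02d' % hour
            if fnmatchcase(label, pattern):
                slots.add((day, label))
    return sorted(slots)


slots = covered_hours(INPUTS)
print(slots[0], slots[-1], len(slots))
```

Under that assumption the patterns select 25 hourly files, from hour 18 on 2017-08-21 through hour 18 on 2017-08-22, with no gaps.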

Another test:
hadoop dfs -rm -r -skipTrash /nisj/mp_result;
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/nisj/hitsCalc3/xx.py -file /home/hadoop/nisj/hitsCalc3/xx.py \
-input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.1[8-9].txt /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.2[0-3].txt \
-output /nisj/mp_result

hadoop dfs -rm -r -skipTrash /nisj/mp_result;
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/nisj/hitsCalc3/xx.py -file /home/hadoop/nisj/hitsCalc3/xx.py \
-input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.1[8-9].txt \
-input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.2[0-3].txt \
-output /nisj/mp_result

3. Final tally of the results
# View the filtered results:
hadoop dfs -cat /nisj/mp_result/*
# Compute the click count:
hadoop dfs -cat /nisj/mp_result/* |wc -l
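
`wc -l` counts every line, including blank ones. If the job output may contain empty or whitespace-only lines (for example when a mapper prints lines that still carry their trailing newline), a small counter that skips them can serve as a cross-check. This is a sketch only; `count_hits` and the sample data are hypothetical.

```python
def count_hits(lines):
    """Count the lines that contain something other than whitespace."""
    return sum(1 for line in lines if line.strip())


# Hypothetical job output: two matched log lines separated by blank lines.
sample_output = [
    '1.2.3.4 "GET /event/apply/template/yhzrsolo.htm?s_=rmhd HTTP/1.1" 200\n',
    '\n',
    '1.2.3.6 "GET /event/apply/template/yhzrsolo.htm?s_=rmhd HTTP/1.1" 200\n',
    '\t\n',
]

hits = count_hits(sample_output)
print(hits)  # blank and tab-only lines are not counted
```

Calling `count_hits(sys.stdin)` in a script fed by `hadoop dfs -cat /nisj/mp_result/*` would apply the same count to the real output.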