
Python: Counting Website Visitor Source IPs

2018-01-14 17:20
Please credit the source when reposting: http://blog.csdn.net/l1028386804/article/details/79057671

1. Scenario Description

The data source preparation is covered in the post 《Python之——自动上传本地log文件到HDFS(基于Hadoop 2.5.2)》 (automatically uploading local log files to HDFS with Python, based on Hadoop 2.5.2).

Counting the source IPs of user visits gives a clearer picture of how users are distributed, and it also helps security staff trace the origin of attacks. The approach: define a regular expression that matches IP addresses, emit each matched string as a key with an initial value of 1, and sum the values per key in the reducer step.
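
To illustrate the idea without a cluster, here is a minimal pure-Python sketch of the same map/reduce logic; the sample log lines below are made up for illustration:

# -*- coding:UTF-8 -*-
# Minimal local sketch of the map/reduce principle; sample lines are hypothetical.
import re
from collections import Counter

IP_RE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

sample_lines = [
    '10.2.2.2 - - [14/Jan/2018:17:00:01 +0800] "GET / HTTP/1.1" 200 1024',
    '10.2.2.2 - - [14/Jan/2018:17:00:05 +0800] "GET /index HTTP/1.1" 200 512',
    '8.8.3.167 - - [14/Jan/2018:17:00:09 +0800] "GET /login HTTP/1.1" 404 0',
]

counts = Counter()
for line in sample_lines:
    for ip in IP_RE.findall(line):   # "map" step: emit each matched IP
        counts[ip] += 1              # "reduce" step: accumulate per IP

for ip, n in counts.items():
    print(ip, n)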

2. Implementing the MapReduce Job

【/usr/local/python/source/ipstat.py】

# -*- coding:UTF-8 -*-
'''
Created on 2018-01-14

@author: liuyazhuang
'''

from mrjob.job import MRJob
import re

# Regular expression that matches IPv4 addresses
IP_RE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

class MRCount(MRJob):

    def mapper(self, key, line):
        # For each IP matched in the line, emit a key:value pair where
        # the key is the IP address and the value is initialized to 1
        for ip in IP_RE.findall(line):
            yield ip, 1

    def reducer(self, ip, occurrences):
        # Sum the emitted 1s to get the total hit count per IP
        yield ip, sum(occurrences)

if __name__ == '__main__':
    MRCount.run()
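
Before submitting to the cluster, the job can be smoke-tested with mrjob's default inline runner, which runs the mapper and reducer in a single local process and prints the counts to stdout (access.log here stands for a hypothetical local sample file):

python ipstat.py access.log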


3. Running the MapReduce Job

Run the following command:

python ipstat.py -r hadoop --jobconf mapreduce.job.priority=VERY_HIGH --jobconf mapreduce.map.tasks=2 --jobconf mapreduce.reduce.tasks=1 -o hdfs://liuyazhuang121:9000/output/ipstat hdfs://liuyazhuang121:9000/user/root/website.com/20180114
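
Here, -r hadoop selects the Hadoop runner; the --jobconf options set the job priority and hint the numbers of map and reduce tasks; -o specifies the HDFS output directory; the final argument is the HDFS input path.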
The log output is as follows:

[root@liuyazhuang121 source]#python ipstat.py -r hadoop --jobconf mapreduce.job.priority=VERY_HIGH --jobconf mapreduce.map.tasks=2 --jobconf mapreduce.reduce.tasks=1 -o hdfs://liuyazhuang121:9000/output/ipstat hdfs://liuyazhuang121:9000/user/root/website.com/20180114
No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in $PATH...
Found hadoop binary: /usr/local/hadoop-2.5.2/bin/hadoop
Using Hadoop version 2.5.2
Looking for Hadoop streaming jar in /usr/local/hadoop-2.5.2...
Found Hadoop streaming jar: /usr/local/hadoop-2.5.2/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar
Creating temp directory /tmp/ipstat.root.20180114.091040.605990
Copying local files to hdfs:///user/root/tmp/mrjob/ipstat.root.20180114.091040.605990/files/...
Running step 1 of 1...
packageJobJar: [/usr/local/hadoop-2.5.2/tmp/hadoop-unjar4828642106994965791/] [] /tmp/streamjob4775985125407933464.jar tmpDir=null
Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
Total input paths to process : 1
number of splits:2
Submitting tokens for job: job_1515893542122_0010
Submitted application application_1515893542122_0010
The url to track the job: http://liuyazhuang121:8088/proxy/application_1515893542122_0010/
Running job: job_1515893542122_0010
Job job_1515893542122_0010 running in uber mode : false
map 0% reduce 0%
map 100% reduce 0%
map 100% reduce 100%
Job job_1515893542122_0010 completed successfully
Output directory: hdfs://liuyazhuang121:9000/output/ipstat
Counters: 49
File Input Format Counters
Bytes Read=2355499
File Output Format Counters
Bytes Written=303
File System Counters
FILE: Number of bytes read=176261
FILE: Number of bytes written=657303
FILE: Number of large read operations=0
FILE: Number of read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2355749
HDFS: Number of bytes written=303
HDFS: Number of large read operations=0
HDFS: Number of read operations=9
HDFS: Number of write operations=2
Job Counters
Data-local map tasks=2
Launched map tasks=2
Launched reduce tasks=1
Total megabyte-seconds taken by all map tasks=7339008
Total megabyte-seconds taken by all reduce tasks=3062784
Total time spent by all map tasks (ms)=7167
Total time spent by all maps in occupied slots (ms)=7167
Total time spent by all reduce tasks (ms)=2991
Total time spent by all reduces in occupied slots (ms)=2991
Total vcore-seconds taken by all map tasks=7167
Total vcore-seconds taken by all reduce tasks=2991
Map-Reduce Framework
CPU time spent (ms)=3780
Combine input records=0
Combine output records=0
Failed Shuffles=0
GC time elapsed (ms)=77
Input split bytes=250
Map input records=7555
Map output bytes=154577
Map output materialized bytes=176267
Map output records=10839
Merged Map outputs=2
Physical memory (bytes) snapshot=656932864
Reduce input groups=19
Reduce input records=10839
Reduce output records=19
Reduce shuffle bytes=176267
Shuffled Maps =2
Spilled Records=21678
Total committed heap usage (bytes)=468189184
Virtual memory (bytes) snapshot=2660089856
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
Streaming final output from hdfs://liuyazhuang121:9000/output/ipstat...
"10.2.2.105" 6
"10.2.2.113" 94
"10.2.2.116" 125
"10.2.2.144" 176
"10.2.2.186" 64
"10.2.2.190" 41
"10.2.2.2" 2925
"10.2.2.209" 921
"10.2.2.230" 424
"10.2.2.234" 1889
"10.2.2.24" 733
"10.2.2.250" 2018
"10.2.2.44" 40
"10.2.2.54" 1138
"10.2.2.86" 109
"10.2.2.95" 86
"10.2.2.97" 43
"8.8.3.167" 6
"9.0.6.0" 1
Removing HDFS temp directory hdfs:///user/root/tmp/mrjob/ipstat.root.20180114.091040.605990...
Removing temp directory /tmp/ipstat.root.20180114.091040.605990...
As shown above, the aggregated IP counts are printed at the end of the job log.

4. Verifying the Results

Run the command:

hadoop fs -ls /output/ipstat
The output files are listed as follows:

[root@liuyazhuang121 source]#hadoop fs -ls /output/ipstat
Found 2 items
-rw-r--r-- 1 root supergroup 0 2018-01-14 17:11 /output/ipstat/_SUCCESS
-rw-r--r-- 1 root supergroup 303 2018-01-14 17:11 /output/ipstat/part-00000
Then run:
hadoop fs -cat /output/ipstat/part-00000
The output is:

[root@liuyazhuang121 source]# hadoop fs -cat /output/ipstat/part-00000
"10.2.2.105"    6
"10.2.2.113"    94
"10.2.2.116"    125
"10.2.2.144"    176
"10.2.2.186"    64
"10.2.2.190"    41
"10.2.2.2"      2925
"10.2.2.209"    921
"10.2.2.230"    424
"10.2.2.234"    1889
"10.2.2.24"     733
"10.2.2.250"    2018
"10.2.2.44"     40
"10.2.2.54"     1138
"10.2.2.86"     109
"10.2.2.95"     86
"10.2.2.97"     43
"8.8.3.167"     6
"9.0.6.0"       1