Big Data Visualization: Nginx Log Analysis and Web Chart Display (HDFS + Flume + Spark + Nginx + Highcharts)
Project requirements:
Collect the nginx access.log (/var/log/nginx/access.log) from 1-3 machines and save it to HDFS in real time
Use Spark to aggregate and analyse the current day's logs
Display the results as charts on a web page, covering the following two tables:
1: Which URLs are accessed most, displayed in descending order of access count
2: Which IPs cause the most 404 errors, displayed in descending order
Advanced exercise:
Use Spark to aggregate and analyse all of the logs and display the results on the web page
==> Analyse the problem and draw the project architecture diagram
Determine the steps and the technical points needed to complete it
Carried out in three steps:
Step one:
Requirements:
Collect the nginx access.log (/var/log/nginx/access.log) from 1-3 machines
and save it to HDFS in real time
Technical points:
- Build a 3-node cluster (master1, master2 and slave1)
- Install and configure the Flume data-import tool, the Nginx server and Spark on all 3 machines
- Read the log data written to access.log (/var/log/nginx/access.log) in real time and save it to HDFS
- Flume collection and storage: Flume imports the log data; the access.log files of the 3 Nginx servers all go into the same HDFS directory, i.e. (/user/flume/)
- If there are not enough log entries, a bash script can be used to generate more
Implementation:
1. Build a 3-node cluster (master1, master2 and slave1) and configure the Hadoop cluster, YARN, etc.
Reference: https://blog.csdn.net/wugenqiang/article/details/81102294
2. Install and configure the Flume data-import tool, the Nginx server and Spark on all 3 machines
2.1 Configure Flume, install the Nginx server, and use Flume to collect the Nginx server logs into HDFS
2.1.1 Edit the configuration file /etc/flume/conf/flume.conf
[code][root@master1 ~]# cd /etc/flume/conf
[root@master1 conf]# ls
flume.conf  flume-env.sh.template  flume-conf.properties.template  log4j.properties  flume-env.ps1
[root@master1 conf]# vim flume.conf
Add the following:
[code]## Configure the agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.deserializer.outputCharset = UTF-8
# The log file to monitor
a1.sources.r1.command = tail -F /var/log/nginx/access.log
# Configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.path = hdfs://master1:8020/user/flume/nginx_logs/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.minBlockReplicas = 1
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 86400
a1.sinks.k1.hdfs.rollSize = 1000000
a1.sinks.k1.hdfs.rollCount = 10000
a1.sinks.k1.hdfs.idleTimeout = 0
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Copy the file to the other servers:
[code][root@master1 ~]# scp /etc/flume/conf/flume.conf root@master2:/etc/flume/conf/flume.conf
flume.conf                                   100% 2016     1.6MB/s   00:00
[root@master1 ~]# scp /etc/flume/conf/flume.conf root@slave1:/etc/flume/conf/flume.conf
flume.conf                                   100% 2016   134.1KB/s   00:00
2.1.2 Mount the installation media on each node in the cluster and install the nginx service
[code][root@master2 ~]# mount /dev/sr0 /mnt/cdrom
mount: mount point /mnt/cdrom does not exist
[root@master2 ~]# cd /mnt/
[root@master2 mnt]# ls
[root@master2 mnt]# mkdir cdrom
[root@master2 mnt]# mount /dev/sr0 /mnt/cdrom
mount: /dev/sr0 is write-protected, mounting read-only
[root@master2 mnt]# cd
[root@master2 ~]# yum install nginx
Loaded plugins: fastestmirror
hanwate_install                        | 3.7 kB  00:00
(1/2): hanwate_install/group_gz        | 2.1 kB  00:00
(2/2): hanwate_install/primary_db      | 917 kB  00:00
If the following error appears while installing the nginx service,
refer to: https://blog.csdn.net/wugenqiang/article/details/81075446
[code][root@master2 ~]# yum install nginx
Loaded plugins: fastestmirror
There are no enabled repos.
Run "yum repolist all" to see the repos you have.
To enable Red Hat Subscription Management repositories:
    subscription-manager repos --enable <repo>
To enable custom repositories:
    yum-config-manager --enable <repo>
2.1.3 Prepare the web server logs: start the nginx service on each node
[code][root@master2 ~]# systemctl start nginx
2.1.4 Once the service is started, the log can be viewed at /var/log/nginx/access.log
[code][root@master2 ~]# cd /var/log/nginx/
[root@master2 nginx]# ls
access.log  error.log
2.2 Create the HDFS directory the logs will be imported into
[code][root@master1 ~]# su - hdfs
Last login: Fri Jul 20 13:07:04 CST 2018 on pts/0
-bash-4.2$ hadoop fs -mkdir /user/flume
-bash-4.2$ hadoop fs -chmod 777 /user/flume
2.3 Start agent a1; do this on every node in the cluster
[code][root@master1 ~]# flume-ng agent --conf /etc/flume/conf/ --conf-file /etc/flume/conf/flume.conf --name a1
2.4 The logs as stored on HDFS
[code]-bash-4.2$ hadoop fs -ls -R /user/flume/nginx_logs
drwxr-xr-x   - root supergroup          0 2018-07-30 16:30 /user/flume/nginx_logs/20180730
-rw-r--r--   4 root supergroup       2068 2018-07-30 16:30 /user/flume/nginx_logs/20180730/2018-07-30-16.1532939308683.log
2.5 The bundled Spark can be used as-is; start pyspark
[code][root@master1 ~]# pyspark
Python 2.7.5 (default, Aug 4 2017, 00:39:18)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/07/30 16:35:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/07/30 16:36:04 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.3.0-235
      /_/

Using Python version 2.7.5 (default, Aug 4 2017 00:39:18)
SparkSession available as 'spark'.
>>>
3. If there are not enough log entries, a bash script can be used to generate more
Write a script that runs automatically, requesting web pages in a loop to generate log entries
[code]#!/bin/bash
step=1   # interval in seconds; must not exceed 60
host_list=("localhost" "192.168.75.137" "master1" "wugenqiang.master")
while [ 1 ]
do
    num=$(((RANDOM%7)+1))
    seq=$(((RANDOM%4)))
    url="http://"${host_list[$seq]}"/"$num".html";
    echo " `date +%Y-%m-%d\ %H:%M:%S` get $url"
    #curl http://192.168.75.137/1.html    # request the page
    curl -s $url > /dev/null
    sleep $step
done
Result:
[code][root@master1 conf]# vim test_log_records_add.sh
[root@master1 conf]# ./test_log_records_add.sh
 2018-07-31 20:40:33 get http://master1/7.html
 2018-07-31 20:40:34 get http://wugenqiang.master/6.html
 2018-07-31 20:40:35 get http://localhost/6.html
 2018-07-31 20:40:36 get http://master1/6.html
 2018-07-31 20:40:37 get http://wugenqiang.master/3.html
 2018-07-31 20:40:38 get http://192.168.75.137/2.html
 2018-07-31 20:40:39 get http://192.168.75.137/1.html
 2018-07-31 20:40:40 get http://wugenqiang.master/1.html
 2018-07-31 20:40:41 get http://192.168.75.137/3.html
 2018-07-31 20:40:42 get http://wugenqiang.master/1.html
 2018-07-31 20:40:43 get http://localhost/4.html
 2018-07-31 20:40:44 get http://192.168.75.137/7.html
 2018-07-31 20:40:45 get http://192.168.75.137/7.html
 2018-07-31 20:40:46 get http://master1/7.html
 2018-07-31 20:40:47 get http://master1/6.html
 2018-07-31 20:40:48 get http://wugenqiang.master/1.html
 2018-07-31 20:40:50 get http://192.168.75.137/2.html
 2018-07-31 20:40:51 get http://wugenqiang.master/6.html
 2018-07-31 20:40:52 get http://localhost/3.html
Step two:
Requirements:
Use Spark to aggregate and analyse the current day's logs
Technical points:
1. Use Spark to extract the relevant fields from the logs (url, ip-404) and compute the counts
Difficulty: how to carry this out?
- Installation and configuration
- Spark input path: use /user/flume/
- Test with pyspark
- Sorting
- Split each line and extract the relevant columns (see the parsing sketch after this list)
- Partition the logs by date
- Aggregate the accessed URLs
- Save the final files on the local Linux filesystem
- The nginx_log.py script
- Run it as a scheduled task, once every 5 minutes
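Before the implementation, it helps to see which columns the later commands extract. A minimal parsing sketch (an illustration added here, not code from the original; it assumes the default Nginx "combined" log format and uses a made-up sample line):
[code]# Parsing sketch (assumption: default Nginx "combined" log format; sample line is made up).
sample = '192.168.75.1 - - [31/Jul/2018:20:40:33 +0800] "GET /7.html HTTP/1.1" 404 3650 "-" "curl/7.29.0"'
fields = sample.split(" ")
client_ip   = fields[0]   # -> '192.168.75.1'  (used for the 404-by-IP report)
request_uri = fields[6]   # -> '/7.html'       (used for the URL ranking)
status_code = fields[8]   # -> '404'           (used to identify error lines)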
Implementation:
1. Use Spark to extract the relevant fields from the logs (url, ip-404) and compute the counts
1.1 Installation and configuration: see step one, section 2.5
1.2 Spark input path: /user/flume/nginx_logs/%Y%m%d
[code]>>> rdd = sc.textFile("hdfs://master1:8020/user/flume/nginx_logs/20180730/*")
1.3 Test pyspark
[code]>>> rdd.count()
31
1.4 Aggregate the accessed URLs and display them in descending order of access count
[code]>>> rdd=sc.textFile("hdfs://master1:9000/user/flume/nginx_logs/20180731/*")
>>> url=rdd.map(lambda line:line.split(" ")).map(lambda w:w[6])
>>> url_add=url.map(lambda w:(str(w),1))
>>> url_add_reduce=url_add.reduceByKey(lambda x,y:x+y)
>>> url_add_reduce.collect()
[('/5.html;', 27), ('/7.html;', 23), ('/1.html', 87), ('/7.html', 40), ('/1.html;', 25), ('/5.html', 60), ('/3.html', 54), ('/4.html;', 30), ('/6.html;', 30), ('/8.html', 11), ('/2.html;', 21), ('/nginx_log_need/ip404/20180731/part-00000', 2), ('/analysisOfData/url_AmountOfAccess.html', 2), ('/3.html;', 27), ('/favicon.ico', 3), ('/analysisOfData/ip404_AmountOfAccess.html', 2), ('/4.html', 45), ('/2.html', 37), ('/6.html', 64)]
>>> url_add_reduce_sort=url_add_reduce.sortBy(lambda x:-x[1])
>>> url_add_reduce_sort.collect()
[('/1.html', 87), ('/6.html', 64), ('/5.html', 60), ('/3.html', 54), ('/4.html', 45), ('/7.html', 40), ('/2.html', 37), ('/4.html;', 30), ('/6.html;', 30), ('/5.html;', 27), ('/3.html;', 27), ('/1.html;', 25), ('/7.html;', 23), ('/2.html;', 21), ('/8.html', 11), ('/favicon.ico', 3), ('/nginx_log_need/ip404/20180731/part-00000', 2), ('/analysisOfData/url_AmountOfAccess.html', 2), ('/analysisOfData/ip404_AmountOfAccess.html', 2)]
>>> url_add_reduce_sort.repartition(1).saveAsTextFile("file:///usr/share/nginx/html/nginx_log_need/url/20180731")
Result:
[code][root@master1 nginx_log_need]# ls
ip404  url
[root@master1 nginx_log_need]# cd url/
[root@master1 url]# ls
20180731
[root@master1 url]# cd 20180731/
[root@master1 20180731]# ls
part-00000  _SUCCESS
[root@master1 20180731]# cat part-00000
('/1.html', 87)
('/6.html', 64)
('/5.html', 60)
('/3.html', 54)
('/4.html', 45)
('/7.html', 40)
('/2.html', 37)
('/4.html;', 30)
('/6.html;', 30)
('/5.html;', 27)
('/3.html;', 27)
('/1.html;', 25)
('/7.html;', 23)
('/2.html;', 21)
('/8.html', 11)
('/favicon.ico', 3)
('/nginx_log_need/ip404/20180731/part-00000', 2)
('/analysisOfData/url_AmountOfAccess.html', 2)
('/analysisOfData/ip404_AmountOfAccess.html', 2)
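Note that '/1.html' and '/1.html;' are counted as separate URLs. Assuming the trailing semicolon is just noise from how the test requests were issued (an assumption, not something stated above), a minimal normalization sketch that continues the pyspark session from 1.4:
[code]# Hypothetical clean-up step: strip a trailing ';' before counting,
# so '/1.html' and '/1.html;' are merged (assumes the ';' is noise).
url_clean = url.map(lambda w: str(w).rstrip(";"))
url_clean_counts = url_clean.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y).sortBy(lambda x: -x[1])
url_clean_counts.collect()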
1.5 Aggregate the IPs causing 404 errors and display them in descending order
[code]>>> rdd=sc.textFile("hdfs://master1:9000/user/flume/nginx_logs/20180731/*")
>>> line_contain_404=rdd.filter(lambda line:"404" in line)
>>> line_404_ips=line_contain_404.map(lambda line:line.split(" ")).map(lambda w:w[0])
>>> line_404_ips_add=line_404_ips.map(lambda w:(str(w),1))
>>> line_404_ips_add_redu=line_404_ips_add.reduceByKey(lambda x,y:x+y)
>>> line_404_ips_add_redu.count()
5
>>> line_404_ips_add_redu.collect()
[('::1', 64), ('192.168.75.1', 11), ('192.168.75.139', 4), ('192.168.75.138', 9), ('192.168.75.137', 217)]
>>> line_404_ips_add_redu.sort=line_404_ips_add_redu.sortBy(lambda x:x[1])
>>> line_404_ips_add_redu.sort.collect()
[('192.168.75.139', 4), ('192.168.75.138', 9), ('192.168.75.1', 11), ('::1', 64), ('192.168.75.137', 217)]
>>> line_404_ips_add_redu.sort=line_404_ips_add_redu.sortBy(lambda x:-x[1])
>>> line_404_ips_add_redu.sort.collect()
[('192.168.75.137', 217), ('::1', 64), ('192.168.75.1', 11), ('192.168.75.138', 9), ('192.168.75.139', 4)]
>>> line_404_ips_add_redu.sort.repartition(1).saveAsTextFile("file:///usr/share/nginx/html/nginx_log_need/ip404/20180731")
Result:
[code][root@master1 ~]# cd /usr/share/nginx/html/nginx_log_need/ip404/20180731/
[root@master1 20180731]# ls
part-00000  _SUCCESS
[root@master1 20180731]# cat part-00000
('192.168.75.137', 217)
('::1', 64)
('192.168.75.1', 11)
('192.168.75.138', 9)
('192.168.75.139', 4)
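One caveat: filter(lambda line:"404" in line) also matches lines where 404 appears somewhere other than the status code (for example inside a URL or as a byte count). A stricter sketch, assuming the combined log format where the status code is the 9th space-separated field, again continuing the session above:
[code]# Stricter 404 filter (sketch): match only on the status-code field (index 8).
lines_404 = rdd.map(lambda line: line.split(" ")).filter(lambda f: len(f) > 8 and f[8] == "404")
ips_404 = lines_404.map(lambda f: (str(f[0]), 1)).reduceByKey(lambda x, y: x + y).sortBy(lambda x: -x[1])
ips_404.collect()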
The final files are saved on the local Linux filesystem
2. Write the nginx_log.py script
This script processes the log data with Spark and writes the results to HDFS, in preparation for running it as a scheduled task
[code][root@master nginx]# cat nginx_log.py
#!/usr/bin/python
#coding=utf-8
from pyspark import SparkContext
import os
os.environ['PYTHONPATH']='python2'
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
sc=SparkContext("yarn")
# Sort the URLs by access count in descending order and save to the corresponding HDFS path:
rdd_url = sc.textFile("hdfs://master:9000/user/flume/nginx_logs/20180803/*").map(lambda s:s.split(" ")).map(lambda w:[str(w[6])]).flatMap(lambda w:w).map(lambda w:[w,1]).reduceByKey(lambda x,y:x+y).sortBy(lambda x:-x[1]).map(lambda x:list(x)).coalesce(1).saveAsTextFile("/user/flume/nginx_log_output_need/url/20180803")
# Sort the IPs causing 404 errors in descending order and save to the corresponding HDFS path:
rdd_ip404 = sc.textFile("hdfs://master:9000/user/flume/nginx_logs/20180803/*").filter(lambda line:"404" in line).map(lambda s:s.split(" ")).map(lambda w:[str(w[0])]).flatMap(lambda w:w).map(lambda w:[w,1]).reduceByKey(lambda x,y:x+y).sortBy(lambda x:-x[1]).map(lambda x:list(x)).coalesce(1).saveAsTextFile("/user/flume/nginx_log_output_need/ip404/20180803")
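The script above hardcodes the date 20180803 in both the input and the output paths. Since the goal is to analyse the current day's logs from a scheduled job, here is a small sketch (an adjustment not present in the original script) of building the paths from today's date:
[code]# Sketch: derive today's paths instead of hardcoding 20180803.
import datetime
today = datetime.date.today().strftime("%Y%m%d")
input_path   = "hdfs://master:9000/user/flume/nginx_logs/%s/*" % today
url_output   = "/user/flume/nginx_log_output_need/url/%s" % today
ip404_output = "/user/flume/nginx_log_output_need/ip404/%s" % today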
3. Run it as a scheduled task, once every 5 minutes
Run the crontab -e command and add:
[code]*/5 * * * * /usr/share/nginx/spark_run_nginx_log.sh >> /usr/share/nginx/crontab_spark_run_nginx.log 2>&1
Step three:
Requirements:
Display the results as charts on a web page, covering the following two tables:
1: Which URLs are accessed most, displayed in descending order of access count
2: Which IPs cause the most 404 errors, displayed in descending order
3: Use Spark to aggregate and analyse all of the logs and display the results on the web page
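For requirement 3 (all logs rather than a single day), one straightforward option, assuming every daily directory sits directly under /user/flume/nginx_logs/, is to read them with a wildcard and reuse the same aggregation chain as in step two:
[code]# Sketch for the "all logs" case (assumption: one dated subdirectory per day on the default filesystem).
rdd_all = sc.textFile("/user/flume/nginx_logs/*/*")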
Technical points:
1. Read the analysis results and write them onto the Nginx server
-- How are they written?
- They are written as text files into a directory under the Nginx server (see the copy sketch after this list)
2. Use Highcharts as the charting control
3. Fetch the data from the Nginx server with Ajax
4. The web pages read the log analysis results from the Nginx server
- Scheduled task: refresh the results at a fixed interval
- This gives near-real-time statistics and display
5. Display the data as charts: bar chart or pie chart
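The session in step two saved its results straight to file:///usr/share/nginx/html/..., while nginx_log.py writes to HDFS, so the scheduled job still has to land the output in the Nginx document root as url.txt / ip404.txt for the Ajax code below. A minimal sketch of that copy step (an assumption about what the scheduled wrapper does; file names follow the Ajax code):
[code]# Sketch: pull the Spark output from HDFS into the Nginx document root
# (assumption: this is what the scheduled wrapper does; paths follow the article).
import subprocess, datetime
today = datetime.date.today().strftime("%Y%m%d")
for name in ("url", "ip404"):
    src = "/user/flume/nginx_log_output_need/%s/%s/part-00000" % (name, today)
    dst = "/usr/share/nginx/html/nginx_log_need/%s.txt" % name
    with open(dst, "w") as out:
        subprocess.check_call(["hdfs", "dfs", "-cat", src], stdout=out)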
===========================================
Actual implementation:
1. Refine the first two steps and rewrite them as automatically running scripts
2. Use Ajax to fetch the data processed by Spark and assemble it into the required JSON format
(1) url_Access.html, the Ajax part
[code]$.ajax({ // fetch the URLs
    type: "get",
    async: false,
    url: "/nginx_log_need/url.txt",
    success: function(data) {
        // split the data source on the ']' character, name it data_source
        var data_source = data.split(']');
        console.log("data source: " + data_source);
        var url_value = ""; // the URL values
        for(i = 0; i < data_source.length - 1; i++) {
            if(i == 0) {
                url_value = url_value + (data_source[i].split(',')[0]).substr(1) + ",";
            } else {
                url_value = url_value + (data_source[i].split(',')[0]).substr(2);
                if(i < data_source.length - 2) {
                    url_value = url_value + ",";
                }
            }
        }
        console.log("url: " + url_value);
        rst = eval('[' + url_value + ']');
        console.log(rst);
    }
});
$.ajax({ // fetch the access count of each URL
    type: "get",
    async: false,
    url: "/nginx_log_need/url.txt",
    success: function(data) {
        // the data source
        var data_source = data.split(']');
        var data_handle = ""; // processed data
        console.log(data_source);
        for(i = 0; i < data_source.length - 1; i++) {
            data_handle = data_handle.concat(data_source[i].split(',')[1]);
            if(i < data_source.length - 2) {
                data_handle = data_handle + ",";
            }
        }
        console.log("processed: " + data_handle);
        rst = eval('[' + data_handle + ']');
    }
});
(2) ip404_Access.html, the Ajax part
[code]$.ajax({
    type: "get",
    async: false,
    url: "/nginx_log_need/ip404.txt",
    success: function(data) {
        // read the data from the file and split it on newlines
        var data_source = data.split("\n"); // data source
        var data_handle = "";               // processed data
        var sum = 0;                        // running total, used to compute the proportions
        // debug output on the console
        console.log("data source: " + data_source); // inspect the data source
        for(i = 0; i < data_source.length - 2; i++) { // process the source rows into data_handle
            data_handle = data_handle + data_source[i].substr(0, data_source[i].length) + ",";
        }
        data_handle = data_handle + data_source[i].substr(0, data_source[i].length);
        console.log("processed result: " + data_handle); // print the processed data
        rst = eval('[' + data_handle + ']'); // evaluate the expression
        for(i = 0; i < rst.length; i++) {
            sum = sum + rst[i][1]; // sum the values
        }
        for(i = 0; i < rst.length; i++) {
            rst[i][1] = (parseFloat(rst[i][1]) / parseFloat(sum));
            // toFixed(2) can be used to control the precision
        }
    }
});
(3) Refine and rewrite the processing logic, and render the charts
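One possible direction for item (3) (an idea, not what the article implements): have the Spark job emit JSON instead of Python tuple text, so the pages can use JSON.parse or $.getJSON instead of the string splitting above. A minimal sketch, reusing the url_add_reduce_sort RDD from the step-two session:
[code]# Sketch: write the sorted (url, count) pairs as a single JSON file
# that Highcharts can consume directly (names reused from the step-two session).
import json
data = url_add_reduce_sort.collect()
with open("/usr/share/nginx/html/nginx_log_need/url.json", "w") as f:
    json.dump([[k, v] for k, v in data], f)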