Splitting an nginx access log by date, with performance optimization
2016-12-29 16:38
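First, the requirement: given a fairly large nginx access log, split it by access date and save the resulting files under /tmp.
The test machine is a Tencent Cloud instance with a single core and 1 GB of RAM. The test log is 80 MB.
Single-threaded version: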
    #!/usr/bin/env python
    # coding=utf-8
    import re
    import datetime

    if __name__ == '__main__':
        # Captures day, month abbreviation and year from e.g. [27/Dec/2016:
        date_pattern = re.compile(r'\[(\d+)\/(\w+)\/(\d+):')
        with open('./access_all.log-20161227') as f:
            for line in f:
                day, mon, year = re.search(date_pattern, line).groups()
                # Convert the month abbreviation (e.g. Dec) to its number
                mon = datetime.datetime.strptime(mon, '%b').month
                log_file = '/tmp/%s-%s-%s' % (year, mon, day)
                # Named out so the input file handle f is not shadowed
                with open(log_file, 'a+') as out:
                    out.write(line)
Elapsed time:
    [root@VM_255_164_centos data_parse]# time python3 log_cut.py
    real 0m41.152s
    user 0m32.578s
    sys 0m6.046s
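The single-threaded script reopens the output file and calls strptime for every line. Below is a minimal sketch of one way to avoid that, assuming the log uses nginx's usual `[dd/Mon/yyyy:` timestamp: keep one open handle per date and look the month up in a dict. The names `split_by_date`, `MONTHS` and `out_dir` are introduced here for illustration, and this variant was not timed on the test log.

```python
import os
import re
import tempfile

DATE_PATTERN = re.compile(r'\[(\d+)/(\w+)/(\d+):')
# Month abbreviation -> month number, so strptime is not called per line.
MONTHS = {name: i for i, name in enumerate(
    'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec'.split(), start=1)}


def split_by_date(lines, out_dir):
    """Append each line to out_dir/<year>-<month>-<day>, opening each file once."""
    handles = {}  # output path -> open file object
    try:
        for line in lines:
            match = DATE_PATTERN.search(line)
            if match is None:
                continue  # assumption: lines without a timestamp are skipped
            day, mon, year = match.groups()
            path = os.path.join(out_dir, '%s-%s-%s' % (year, MONTHS[mon], day))
            out = handles.get(path)
            if out is None:
                out = handles[path] = open(path, 'a')
            out.write(line)
    finally:
        # Close every handle even if a line raised an error.
        for out in handles.values():
            out.close()
    return sorted(os.path.basename(path) for path in handles)


# Small demonstration on two made-up log lines.
sample = [
    '1.1.1.1 - - [26/Dec/2016:23:59:59 +0800] "GET / HTTP/1.1" 200 1\n',
    '1.1.1.1 - - [27/Dec/2016:00:00:01 +0800] "GET / HTTP/1.1" 200 1\n',
]
with tempfile.TemporaryDirectory() as tmp:
    print(split_by_date(sample, tmp))
```

To run it on the real log, pass the open file object as `lines` and `/tmp` as `out_dir`.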
Multithreaded version:
    #!/usr/bin/env python
    # coding=utf-8
    import re
    import datetime
    import threading

    date_pattern = re.compile(r'\[(\d+)\/(\w+)\/(\d+):')


    def log_cut(line):
        day, mon, year = re.search(date_pattern, line).groups()
        mon = datetime.datetime.strptime(mon, '%b').month
        log_file = '/tmp/%s-%s-%s' % (year, mon, day)
        with open(log_file, 'a+') as f:
            f.write(line)


    if __name__ == '__main__':
        with open('./access_all.log-20161227') as f:
            for line in f:
                # One new thread per log line
                t = threading.Thread(target=log_cut, args=(line,))
                # Daemon threads are not waited for when the main thread exits,
                # so the last few lines may be lost
                t.daemon = True
                t.start()
Elapsed time:
    # time python3 log_cut.py
    real 1m35.905s
    user 1m10.292s
    sys 0m19.666s
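The multithreaded version turned out to be much slower than the single-threaded one. It starts one thread per log line, and for a CPU-bound task the resulting context switching is clearly expensive.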
Thread pool version:
Thread pool class
    #!/usr/bin/env python
    # coding=utf-8
    # threadingPool.py
    import queue
    import threading
    import contextlib

    # Sentinel that tells a worker thread to exit
    StopEvent = object()


    class ThreadPool(object):
        def __init__(self, max_num, max_task_num=None):
            if max_task_num:
                self.q = queue.Queue(max_task_num)
            else:
                self.q = queue.Queue()
            self.max_num = max_num
            self.cancel = False
            self.terminal = False
            self.generate_list = []  # worker threads created so far
            self.free_list = []      # workers currently waiting for a task

        def run(self, func, args, callback=None):
            if self.cancel:
                return
            # Start a new worker only if none is idle and the cap is not reached
            if len(self.free_list) == 0 and len(self.generate_list) < self.max_num:
                self.generate_thread()
            w = (func, args, callback,)
            self.q.put(w)

        def generate_thread(self):
            t = threading.Thread(target=self.call)
            t.start()

        def call(self):
            current_thread = threading.current_thread()
            self.generate_list.append(current_thread)
            event = self.q.get()
            while event != StopEvent:
                func, arguments, callback = event
                try:
                    result = func(*arguments)
                    success = True
                except Exception:
                    success = False
                    result = None
                if callback is not None:
                    try:
                        callback(success, result)
                    except Exception:
                        pass
                # Mark this worker as idle while it waits for the next task
                with self.worker_state(self.free_list, current_thread):
                    if self.terminal:
                        event = StopEvent
                    else:
                        event = self.q.get()
            else:
                self.generate_list.remove(current_thread)

        def close(self):
            # Stop accepting tasks and queue one stop signal per worker
            self.cancel = True
            full_size = len(self.generate_list)
            while full_size:
                self.q.put(StopEvent)
                full_size -= 1

        def terminate(self):
            # Stop workers as soon as they finish their current task
            self.terminal = True
            while self.generate_list:
                self.q.put(StopEvent)
            self.q.queue.clear()

        @contextlib.contextmanager
        def worker_state(self, state_list, worker_thread):
            state_list.append(worker_thread)
            try:
                yield
            finally:
                state_list.remove(worker_thread)
Code
    #!/usr/bin/env python
    # coding=utf-8
    import re
    import datetime
    from threadingPool import ThreadPool

    date_pattern = re.compile(r'\[(\d+)\/(\w+)\/(\d+)\:')


    def log_cut(line):
        day, mon, year = date_pattern.search(line).groups()
        mon = datetime.datetime.strptime(mon, '%b').month
        log_file = '/tmp/%s-%s-%s' % (year, mon, day)
        with open(log_file, 'a+') as f:
            f.write(line)


    def callback(status, result):
        pass


    # A pool with a single worker thread
    pool = ThreadPool(1)
    with open('./access_all.log-20161227') as f:
        for line in f:
            pool.run(log_cut, (line,), callback)
    pool.close()
Elapsed time:
    # time python3 log_cut2.py
    real 0m53.371s
    user 0m44.761s
    sys 0m5.600s
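The thread pool version is faster than the one-thread-per-line version, so the hand-written thread pool class does help: it reuses its worker thread instead of creating a new one for every line, which cuts the time spent on context switching. It is still slower than the single-threaded version (53s versus 41s).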
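Python's standard library also ships a thread pool, `concurrent.futures.ThreadPoolExecutor`. The sketch below shows how the same job might look with it instead of the hand-written class. `cut_with_pool`, `out_dir` and `workers` are names chosen for this example, and it was not benchmarked against the timings in this post.

```python
import datetime
import os
import re
import tempfile
from concurrent.futures import ThreadPoolExecutor

date_pattern = re.compile(r'\[(\d+)/(\w+)/(\d+):')


def log_cut(line, out_dir):
    """Append one log line to out_dir/<year>-<month>-<day>."""
    day, mon, year = date_pattern.search(line).groups()
    mon = datetime.datetime.strptime(mon, '%b').month
    with open(os.path.join(out_dir, '%s-%s-%s' % (year, mon, day)), 'a') as out:
        out.write(line)


def cut_with_pool(lines, out_dir, workers=1):
    """Submit one task per line to a pool of `workers` threads."""
    # Leaving the with block waits until every submitted task has finished.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(log_cut, line, out_dir) for line in lines]
    # result() re-raises any exception that happened inside a worker.
    for future in futures:
        future.result()
    return len(futures)


# Small demonstration on two made-up log lines.
sample = [
    '1.1.1.1 - - [27/Dec/2016:00:00:01 +0800] "GET /a HTTP/1.1" 200 1\n',
    '1.1.1.1 - - [27/Dec/2016:00:00:02 +0800] "GET /b HTTP/1.1" 200 1\n',
]
with tempfile.TemporaryDirectory() as tmp:
    print(cut_with_pool(sample, tmp), os.listdir(tmp))
```

Unlike the daemon-thread script, the executor waits for all tasks before returning, so no lines are dropped at exit. Note that the list of futures holds one entry per log line, which costs memory on a large file.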
Process pool version:
    #!/usr/bin/env python
    # coding=utf-8
    import re
    import datetime
    from multiprocessing import Pool

    date_pattern = re.compile(r'\[(\d+)\/(\w+)\/(\d+):')


    def log_cut(line):
        day, mon, year = re.search(date_pattern, line).groups()
        mon = datetime.datetime.strptime(mon, '%b').month
        log_file = '/tmp/%s-%s-%s' % (year, mon, day)
        with open(log_file, 'a+') as f:
            f.write(line)


    if __name__ == '__main__':
        pool = Pool(1)
        with open('./access_all.log-20161227') as f:
            for line in f:
                pool.apply_async(func=log_cut, args=(line,))
        pool.close()
        # Wait for the queued tasks; without join() the script can exit
        # before the workers have written every line
        pool.join()
Elapsed time with one process:
    # time python3 log_cut.py
    real 0m28.392s
    user 0m23.451s
    sys 0m1.888s
Elapsed time with two processes:
    # time python3 log_cut.py
    real 0m40.920s
    user 0m33.690s
    sys 0m3.206s
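So with a process pool on a single-core CPU, use only one process: running two processes on one core was noticeably slower. On a multi-core CPU, several processes should be faster, although that was not measured on this machine.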
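The advice above can be turned into code by sizing the pool from the number of cores. The sketch below does that with `os.cpu_count()`. It also makes two further changes that are not in the original script: it sends lines to the workers in batches rather than one at a time, and it leaves all file writes to the parent process so that workers never append to the same file concurrently. `batches`, `group_batch`, `cut_parallel` and `batch_size` are hypothetical names, and the code was not timed on the test log.

```python
import os
import re
from itertools import islice
from multiprocessing import Pool

DATE_PATTERN = re.compile(r'\[(\d+)/(\w+)/(\d+):')
MONTHS = {name: i for i, name in enumerate(
    'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec'.split(), start=1)}


def batches(lines, size):
    """Yield lists of up to `size` consecutive lines."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def group_batch(batch):
    """Worker task: map '<year>-<month>-<day>' to the batch's lines for that day."""
    grouped = {}
    for line in batch:
        match = DATE_PATTERN.search(line)
        if match is None:
            continue  # assumption: lines without a timestamp are skipped
        day, mon, year = match.groups()
        name = '%s-%s-%s' % (year, MONTHS[mon], day)
        grouped.setdefault(name, []).append(line)
    return grouped


def cut_parallel(log_path, out_dir, batch_size=10000):
    """Parse batches in worker processes; write the results in the parent."""
    workers = os.cpu_count() or 1  # one process per core, one on a single-core box
    with open(log_path) as src, Pool(workers) as pool:
        # imap returns results in submission order, so line order is kept.
        for grouped in pool.imap(group_batch, batches(src, batch_size)):
            for name, lines in grouped.items():
                with open(os.path.join(out_dir, name), 'a') as out:
                    out.writelines(lines)
```

A script would call `cut_parallel('./access_all.log-20161227', '/tmp')` from under an `if __name__ == '__main__':` guard, as the original process pool script does.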
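Shell version
    #!/bin/bash
    Usage(){
        echo "Usage: $0 Logfile"
    }

    if [ $# -eq 0 ]; then
        Usage
        exit 0
    else
        Log=$1
    fi

    # The output directory must exist before grep redirects into it
    mkdir -p /tmp/log
    date_log=$(mktemp)
    # Collect the distinct dd/Mon/yyyy values; which field holds the date ($5 here)
    # depends on the log format in use
    cat "$Log" | awk -F'[ :]' '{print $5}' | awk -F'[' '{print $2}' | uniq > "$date_log"
    for i in `cat "$date_log"`
    do
        # e.g. 27/Dec/2016 -> /tmp/log/2016-Dec-27.access
        grep "$i" "$Log" > /tmp/log/${i:7:10}-${i:3:3}-${i:0:2}.access
    done
    rm -f "$date_log"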
Elapsed time:
    # time sh log_cut.sh access_all.log-20161227
    real 0m2.435s
    user 0m2.042s
    sys 0m0.304s
The shell version performs remarkably well: it finishes in a little over 2 seconds.
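The shell script names its output files differently from the Python versions: it keeps the month abbreviation, adds an `.access` suffix, and writes under /tmp/log. The bash substring expansions `${i:7:10}`, `${i:3:3}` and `${i:0:2}` correspond to the Python slices below. `shell_output_name` is a hypothetical helper written only to illustrate the mapping.

```python
def shell_output_name(date_field):
    """Mirror /tmp/log/${i:7:10}-${i:3:3}-${i:0:2}.access for a dd/Mon/yyyy string."""
    year = date_field[7:17]   # offset 7, up to 10 characters (only 4 remain)
    month = date_field[3:6]   # offset 3, 3 characters
    day = date_field[0:2]     # offset 0, 2 characters
    return '/tmp/log/%s-%s-%s.access' % (year, month, day)


print(shell_output_name('27/Dec/2016'))  # /tmp/log/2016-Dec-27.access
```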