
Splitting nginx Access Logs by Date, and Performance Tuning

2016-12-29 16:38

First, the requirement: given a fairly large nginx access log, split it by access date and save the pieces under /tmp.

The test machine is a Tencent Cloud instance with a single core and 1 GB of RAM. The test log is 80 MB.

Single-threaded version:

#!/usr/bin/env python
# coding=utf-8

import re
import datetime

if __name__ == '__main__':
    date_pattern = re.compile(r'\[(\d+)/(\w+)/(\d+):')
    with open('./access_all.log-20161227') as f:
        for line in f:
            day, mon, year = date_pattern.search(line).groups()
            mon = datetime.datetime.strptime(mon, '%b').month
            log_file = '/tmp/%s-%s-%s' % (year, mon, day)
            with open(log_file, 'a+') as out:  # 'out', to avoid shadowing the input handle f
                out.write(line)

Elapsed time:

[root@VM_255_164_centos data_parse]# time python3 log_cut.py

real    0m41.152s
user    0m32.578s
sys     0m6.046s
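Most of the per-line cost in the version above is the strptime call and reopening the output file for every single line. A single-threaded sketch that caches both (the helper name cut_log and the MONTHS table are my own additions, not from the original script):

```python
import re
import datetime

date_pattern = re.compile(r'\[(\d+)/(\w+)/(\d+):')

# precompute 'Jan' -> 1 ... 'Dec' -> 12 once, instead of calling strptime per line
MONTHS = {datetime.date(2000, m, 1).strftime('%b'): m for m in range(1, 13)}

def cut_log(src, out_dir='/tmp'):
    handles = {}  # one open handle per date, instead of an open/close per line
    try:
        with open(src) as f:
            for line in f:
                day, mon, year = date_pattern.search(line).groups()
                path = '%s/%s-%s-%s' % (out_dir, year, MONTHS[mon], day)
                if path not in handles:
                    handles[path] = open(path, 'a')
                handles[path].write(line)
    finally:
        for h in handles.values():
            h.close()
```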

Multithreaded version:

#!/usr/bin/env python
# coding=utf-8

import re
import datetime
import threading

date_pattern = re.compile(r'\[(\d+)/(\w+)/(\d+):')

def log_cut(line):
    day, mon, year = date_pattern.search(line).groups()
    mon = datetime.datetime.strptime(mon, '%b').month
    log_file = '/tmp/%s-%s-%s' % (year, mon, day)
    with open(log_file, 'a+') as out:
        out.write(line)

if __name__ == '__main__':
    with open('./access_all.log-20161227') as f:
        for line in f:
            t = threading.Thread(target=log_cut, args=(line,))
            t.daemon = True  # caution: daemon threads may be killed at exit with lines unwritten
            t.start()

Elapsed time:

# time python3 log_cut.py

real    1m35.905s
user    1m10.292s
sys    0m19.666s

Surprisingly, the multithreaded version is much slower than the single-threaded one. For a CPU-bound task like this, the context switching really is expensive.
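Besides being slow, the thread-per-line version has a correctness hazard: with daemon threads and no join(), the interpreter can exit while lines are still unwritten. A sketch that uses a fixed number of threads over slices of the file and joins them all (the 4-thread default and function names are my own assumptions):

```python
import re
import datetime
import threading

date_pattern = re.compile(r'\[(\d+)/(\w+)/(\d+):')

def cut_lines(lines, out_dir):
    # each thread handles its own slice of the input
    for line in lines:
        day, mon, year = date_pattern.search(line).groups()
        mon = datetime.datetime.strptime(mon, '%b').month
        with open('%s/%s-%s-%s' % (out_dir, year, mon, day), 'a') as out:
            out.write(line)

def cut_file(src, n_threads=4, out_dir='/tmp'):
    with open(src) as f:
        lines = f.readlines()
    step = max(1, (len(lines) + n_threads - 1) // n_threads)
    threads = [threading.Thread(target=cut_lines, args=(lines[i:i + step], out_dir))
               for i in range(0, len(lines), step)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for all slices, so no line is lost at exit
```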

Thread pool version:

The thread pool class:

#!/usr/bin/env python
# coding=utf-8

import queue
import threading
import contextlib

StopEvent = object()

class ThreadPool(object):

    def __init__(self, max_num, max_task_num=None):
        if max_task_num:
            self.q = queue.Queue(max_task_num)
        else:
            self.q = queue.Queue()
        self.max_num = max_num
        self.cancel = False
        self.terminal = False
        self.generate_list = []
        self.free_list = []

    def run(self, func, args, callback=None):
        if self.cancel:
            return
        if len(self.free_list) == 0 and len(self.generate_list) < self.max_num:
            self.generate_thread()
        w = (func, args, callback,)
        self.q.put(w)

    def generate_thread(self):
        t = threading.Thread(target=self.call)
        t.start()

    def call(self):
        current_thread = threading.currentThread()
        self.generate_list.append(current_thread)

        event = self.q.get()
        while event != StopEvent:
            func, arguments, callback = event
            try:
                result = func(*arguments)
                success = True
            except Exception:
                success = False
                result = None

            if callback is not None:
                try:
                    callback(success, result)
                except Exception:
                    pass

            with self.worker_state(self.free_list, current_thread):
                if self.terminal:
                    event = StopEvent
                else:
                    event = self.q.get()
        else:
            self.generate_list.remove(current_thread)

    def close(self):
        self.cancel = True
        full_size = len(self.generate_list)
        while full_size:
            self.q.put(StopEvent)
            full_size -= 1

    def terminate(self):
        self.terminal = True

        while self.generate_list:
            self.q.put(StopEvent)

        self.q.queue.clear()

    @contextlib.contextmanager
    def worker_state(self, state_list, worker_thread):
        state_list.append(worker_thread)
        try:
            yield
        finally:
            state_list.remove(worker_thread)
threadingPool.py

The cutting script:

#!/usr/bin/env python
# coding=utf-8

import re
import datetime
from threadingPool import ThreadPool

date_pattern = re.compile(r'\[(\d+)/(\w+)/(\d+):')

def log_cut(line):
    day, mon, year = date_pattern.search(line).groups()
    mon = datetime.datetime.strptime(mon, '%b').month
    log_file = '/tmp/%s-%s-%s' % (year, mon, day)
    with open(log_file, 'a+') as out:
        out.write(line)

def callback(status, result):
    pass

pool = ThreadPool(1)

with open('./access_all.log-20161227') as f:
    for line in f:
        pool.run(log_cut, (line,), callback)

pool.close()

Elapsed time:

# time python3 log_cut2.py

real    0m53.371s
user    0m44.761s
sys    0m5.600s

The thread pool version is faster than the plain multithreaded one, so the hand-written pool class does pay off: reusing a fixed set of threads cuts down on thread creation and context-switching overhead.
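For reference, Python 3.2+ ships the same facility in the standard library as concurrent.futures.ThreadPoolExecutor. A minimal sketch doing the same job as the hand-rolled pool (worker count mirrors the script above; the cut_file wrapper is my own naming):

```python
import re
import datetime
from concurrent.futures import ThreadPoolExecutor

date_pattern = re.compile(r'\[(\d+)/(\w+)/(\d+):')

def log_cut(line, out_dir='/tmp'):
    day, mon, year = date_pattern.search(line).groups()
    mon = datetime.datetime.strptime(mon, '%b').month
    with open('%s/%s-%s-%s' % (out_dir, year, mon, day), 'a') as out:
        out.write(line)

def cut_file(src, out_dir='/tmp'):
    # leaving the with-block joins the workers, so no queued line is dropped
    with ThreadPoolExecutor(max_workers=1) as pool, open(src) as f:
        for line in f:
            pool.submit(log_cut, line, out_dir)
```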

Process pool version:

#!/usr/bin/env python
# coding=utf-8

import re
import datetime
from multiprocessing import Pool

date_pattern = re.compile(r'\[(\d+)/(\w+)/(\d+):')

def log_cut(line):
    day, mon, year = date_pattern.search(line).groups()
    mon = datetime.datetime.strptime(mon, '%b').month
    log_file = '/tmp/%s-%s-%s' % (year, mon, day)
    with open(log_file, 'a+') as out:
        out.write(line)

if __name__ == '__main__':
    pool = Pool(1)
    with open('./access_all.log-20161227') as f:
        for line in f:
            pool.apply_async(func=log_cut, args=(line,))
    pool.close()
    pool.join()  # wait for queued tasks; without this, pending lines may be lost

Elapsed time with a single process:

# time python3 log_cut.py

real    0m28.392s
user    0m23.451s
sys    0m1.888s

Elapsed time with two processes:

# time python3 log_cut.py

real    0m40.920s
user    0m33.690s
sys    0m3.206s

So with multiprocessing: on a single-core CPU, run just one process; on a multi-core CPU, more processes should be faster; running several processes on a single core is clearly slower.

Shell version:

#!/bin/bash

Usage(){
    echo "Usage: $0 Logfile"
}

if [ $# -eq 0 ]; then
    Usage
    exit 0
else
    Log=$1
fi

date_log=$(mktemp)

# collect the distinct dd/Mon/yyyy values present in the log
awk -F'[ :]' '{print $5}' "$Log" | awk -F'[' '{print $2}' | uniq > "$date_log"

for i in $(cat "$date_log")
do
    grep "$i" "$Log" > /tmp/log/${i:7:10}-${i:3:3}-${i:0:2}.access
done

rm -f "$date_log"

Elapsed time:

# time sh log_cut.sh access_all.log-20161227

real    0m2.435s
user    0m2.042s
sys    0m0.304s

The shell version performs remarkably well, finishing in just over 2 seconds.
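One caveat: the shell script still scans the whole log once per distinct date (one grep per date). For logs spanning many days, a single awk pass that appends each line to its date file as it goes should scale better. A sketch, assuming the same [dd/Mon/yyyy:...] timestamp format and /tmp/log output directory:

```shell
#!/bin/bash
# one pass over the log: pull the dd/Mon/yyyy fields out of the bracketed
# timestamp and append the whole line to /tmp/log/yyyy-Mon-dd.access
mkdir -p /tmp/log
awk -F'[][]' '{
    split($2, t, "[/:]")    # t[1]=dd  t[2]=Mon  t[3]=yyyy
    print >> ("/tmp/log/" t[3] "-" t[2] "-" t[1] ".access")
}' "$1"
```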
