
Using Nagios and SNMP to monitor and recover specific processes and ports on Linux, with email and SMS alerts

2011-08-01 14:54

For installing and configuring Nagios, see

http://993995.blog.51cto.com/983995/628483

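This post shows how to use Nagios's checks and event-handler mechanism to repair a failed service remotely. If the service is still unhealthy after the repair attempt, Nagios sends email and SMS alerts.

The code for python_action.sh and python_action.py is available at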

http://down.51cto.com/data/230111



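1. First, enable email alerts.

Download sendEmail, unpack it, and copy the sendEmail executable to /usr/local/bin, which is the path the command definitions in this section use.

I never figured out sendmail, so I downloaded sendEmail instead.

Edit /usr/local/nagios/etc/objects/commands.cfg

Replace the original /bin/mail -s part with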

tail /usr/local/nagios/var/nagios.log | /usr/local/bin/sendEmail -f username@163.com -t $CONTACTEMAIL$ -s smtp.163.com -u "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" -xu username -xp 123

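This uses the sendEmail client to send mail through the 163.com SMTP service. username is your 163 mailbox name and 123 is its password. $CONTACTEMAIL$ is the destination address, that is, the administrator's contact email in the Nagios configuration (objects/contacts.cfg in a default install). I send the last ten lines of nagios.log as the mail body.

Here is my configuration. Because tail is given a file name, it ignores its standard input, so the printf output is discarded and only the log lines become the mail body.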

# 'notify-host-by-email' command definition

define command{

command_name notify-host-by-email

command_line /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | tail /usr/local/nagios/var/nagios.log | /usr/local/bin/sendEmail -f username@163.com -t $CONTACTEMAIL$ -s smtp.163.com -u "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" -xu username -xp 123

}

# 'notify-service-by-email' command definition

define command{

command_name notify-service-by-email

command_line /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | tail /usr/local/nagios/var/nagios.log | /usr/local/bin/sendEmail -f username@163.com -t $CONTACTEMAIL$ -s smtp.163.com -u "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" -xu username -xp 123

}

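Once this is configured, emails containing the log excerpt are sent to the specified mailbox.

For SMS alerts, register a China Mobile 139 mailbox, which sends a text message to your phone whenever mail arrives. Set its SMS notification to long-SMS mode so the alert can be read directly on the phone. Remember to change the contact email address in the Nagios configuration to the 139 address.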
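As a rough illustration of what the sendEmail pipeline in this section produces, the following Python sketch builds the same kind of message: the last ten lines of a log as the body, under a Nagios-style subject. The function names, parameter names, and sample addresses are assumptions made for illustration and are not part of the original setup. Actually sending the message would additionally need smtplib and real credentials.

```python
from collections import deque
from email.message import EmailMessage


def tail_lines(lines, count=10):
    """Return the last `count` lines, like `tail` with its default of 10."""
    return list(deque(lines, maxlen=count))


def build_alert(log_lines, sender, recipient, host, state,
                notification_type="PROBLEM"):
    """Build an alert mail whose body is the tail of the Nagios log.

    The parameters are hypothetical; they mirror the macros $HOSTNAME$,
    $HOSTSTATE$ and $NOTIFICATIONTYPE$ used in commands.cfg.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "** %s Host Alert: %s is %s **" % (
        notification_type, host, state)
    msg.set_content("\n".join(tail_lines(log_lines)) + "\n")
    return msg


if __name__ == "__main__":
    # Made-up log lines standing in for nagios.log.
    sample_log = ["sample log line %d" % i for i in range(25)]
    alert = build_alert(sample_log, "username@163.com",
                        "admin@example.com", "RedHat-host", "DOWN")
    print(alert["Subject"])
    print(alert.get_content())
```

Building the message separately from sending it keeps the formatting logic easy to check without touching a mail server.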

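2. Use the Nagios event-handler mechanism to monitor specific processes on Linux.

Edit /usr/local/nagios/etc/objects/localhost.cfg

These are the two services I configured: one checks Django's port 8000 over TCP, and the other uses SNMP to check Django's manage.py runserver process.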

#check_django_tcp

define service{

use local-service ; Name of service template to use

host_name RedHat-host

service_description Django_TCP

check_command check_django_tcp!8000

notifications_enabled 1

event_handler_enabled 1

event_handler python_action


}

#check_django_snmp

define service{

use local-service ; Name of service template to use

host_name RedHat-host

service_description Django_SNMP

check_command check_django_snmp!2c!public!.1.3.6.1.4.1.2021.54.101.2!"manage.py runserver"

notifications_enabled 1

event_handler_enabled 1

event_handler python_action

}

Note these two settings

event_handler_enabled 1

event_handler python_action

They enable event handling for the service and set the handler to python_action.

I defined python_action in commands.cfg.

#'python_action'

define command{

command_name python_action

command_line $USER1$/python_action.sh "$HOSTNAME$,$SERVICEDESC$,$SERVICESTATE$,$SERVICESTATETYPE$,$SERVICEATTEMPT$"

}

#'check_django_tcp'

define command{

command_name check_django_tcp

command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p $ARG1$ $ARG2$

}

#'check_django_snmp'

define command{

command_name check_django_snmp

command_line $USER1$/check_snmp -H $HOSTADDRESS$ -P $ARG1$ -C $ARG2$ -o $ARG3$ -r $ARG4$

}
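The python_action command above hands the handler five macros as one comma-separated string. A minimal Python sketch of how such an argument can be parsed and turned into a restart decision follows. The names ServiceEvent, parse_event, and should_restart are hypothetical; the threshold (CRITICAL on the third attempt or later) is the one python_action.py in this post uses. The sketch assumes no field contains a comma.

```python
from collections import namedtuple

ServiceEvent = namedtuple(
    "ServiceEvent", "hostname service state state_type attempt")


def parse_event(arg):
    """Split the single comma-separated argument built in commands.cfg."""
    fields = arg.split(",")
    if len(fields) != 5:
        raise ValueError(
            "expected 5 comma-separated fields, got %d" % len(fields))
    hostname, service, state, state_type, attempt = fields
    return ServiceEvent(hostname, service, state, state_type, int(attempt))


def should_restart(event, min_attempt=3):
    """Act only on CRITICAL results that have reached min_attempt checks."""
    return event.state == "CRITICAL" and event.attempt >= min_attempt


if __name__ == "__main__":
    event = parse_event("RedHat-host,Django_TCP,CRITICAL,SOFT,3")
    print(event, should_restart(event))
```

Nagios runs the event handler on every state change and soft retry, so the handler itself has to decide when a restart is warranted.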

python_action.sh is a script I wrote; it calls python_action.py.

Copy python_action.sh and python_action.py to

/usr/local/nagios/libexec/

Change the ownership with chown -R nagios:nagios /usr/local/nagios/*

Code of python_action.sh

#!/bin/bash

# Event-handler wrapper: Nagios passes one comma-separated argument.
cd /usr/local/nagios/libexec

if [ $# -eq 1 ]; then
    service_info="$1"
    /usr/bin/python /usr/local/nagios/libexec/python_action.py "$service_info"
fi

Code of python_action.py

# -*- coding: utf-8 -*-
# Python 2 script; requires pexpect, which also provides pxssh.
import pxssh, time, os, sys, pexpect
from time import sleep, ctime

# Nagios host name -> [IP address, SSH user, SSH password]
machine_name_list = {"ubuntu-host":["192.168.15.67", "root", "123"],
                     "localhost":["172.172.10.100", "root", "123"],
                     "RedHat-host":["192.168.15.67", "root", "123"]
                     }

# Nagios service description -> command that restarts the service
server_command_list = {"Django_TCP":"/usr/bin/python /root/dmdu_manage/manage.py runserver &",
                       "SMTP":"/etc/init.d/sendmail restart",
                       "Django_SNMP":"/usr/bin/python /root/dmdu_manage/manage.py runserver &"
                       }

def write_opt_log(service_info='None',command='None'):
    # Append the event and the command about to be run to a local log file.
    try:
        f = open("service_opt_info.txt",'a')
        info=[]
        info.append(service_info)
        info.append(command)
        print info
        f.write("%s,%s\n" % (info[0],ctime()))
        f.write("%s\n" % (info[1]))
        f.write("\n")
        f.close()
    except Exception , e:
        print "Exception is ",e

def ssh_cmd(hostIP='172.172.10.101', username="root", password="kk",command=""):
    # Log in over SSH with pxssh and run the restart command.
    print "Now connecting %s" % (hostIP)
    print "Please Wait... ...\n"
    #import pdb;pdb.set_trace()
    s = pxssh.pxssh()
    # auto_prompt_reset is a boolean flag, so pass True rather than a string.
    s.login(hostIP, username, password, login_timeout=30, original_prompt="[$#>]", auto_prompt_reset=True)
    print "Start OS\n"
    s.sendline(command)
    s.prompt()
    print s.before
    s.sendline("exit")
    s.prompt()
    print s.before
    #s.logout()
    print "End OS \n"

def pexpect_cmd(hostIP='172.172.10.101', username="root", password="kk",command=""):
    # Alternative to ssh_cmd: run the command through a spawned ssh client.
    print "Start OS \n"
    print "Please Wait... ...\n"
    ssh = pexpect.spawn('ssh -l %s %s %s'%(username, hostIP, command))
    r = ''
    try:
        # Patterns are regular expressions, so the literal "(yes/no)?" is escaped.
        i = ssh.expect(['[Pp]assword: ', r'continue connecting \(yes/no\)\?', pexpect.EOF, pexpect.TIMEOUT])
        if i == 0 :
            ssh.sendline(password)
        elif i == 1:
            ssh.sendline('yes')
    except pexpect.EOF:
        ssh.close()
    else:
        r = ssh.read()
        ssh.expect(pexpect.EOF)
        ssh.close()
    print "End OS\n"
    return r

def restart_opt(service_info='None'):
    #import pdb;pdb.set_trace
    # service_info is "host,service,state,state type,attempt" from Nagios.
    info_detail=[]
    info_detail = service_info.split(',')
    hostname=info_detail[0]
    service_desc=info_detail[1]
    service_state=info_detail[2]
    service_state_type=info_detail[3]
    service_attempt=info_detail[4]
    hostIP = machine_name_list[hostname][0]
    username = machine_name_list[hostname][1]
    password = machine_name_list[hostname][2]
    command = server_command_list[service_desc]
    # Restart only once the service has been CRITICAL for three checks.
    if service_state == "CRITICAL" and int(service_attempt) >= 3 :
        try:
            write_opt_log(service_info,command)
            ssh_cmd(hostIP,username,password,command)
            #pexpect_cmd(hostIP,username,password,command)
            service_opt="up"
        except pxssh.ExceptionPxssh, e:
            print "ExceptionPxssh is", e

if __name__ == '__main__':
    service_info = sys.argv[1]
    restart_opt(service_info)

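The script relies on the pexpect library, so pexpect-2.3 must be installed on the monitoring machine; it can be downloaded from the web.

tar -zxvf pexpect-2.3.tar.gz

cd pexpect-2.3

python setup.py install

Then edit pxssh.py (with vim, for example) at whichever of these paths exists on your system:

/usr/local/lib/python2.6/dist-packages/pxssh.py

/usr/lib/python2.6/dist-packages/pxssh.py

Go to line 134. Before the first occurrence of

self.read_nonblocking(size=10000,timeout=1) # GAS: Clear out the cache before getting the prompt

insert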

self.sendline()

time.sleep(0.5)

so that it reads

self.sendline()

time.sleep(0.5)

self.read_nonblocking(size=10000,timeout=1) # GAS: Clear out the cache before getting the prompt

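Without this change, pxssh fails with a timeout error.

Once this is installed, Python scripts that use pxssh can run.

3. Configure SNMP on the monitored host

This is one way to monitor a specific process on a Linux server.

Edit /etc/snmp/snmpd.conf on the monitored host

Find this line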

exec .1.3.6.1.4.1.2021.54

Change it to

exec .1.3.6.1.4.1.2021.54 /bin/sh /root/test.sh

Create the file /root/test.sh

with the following content. In this example I want to monitor Django's manage.py runserver process.

#!/bin/bash

/bin/ps x | grep manage.py | awk '{print $6 " " $7;}'

Save and exit.

Restart the snmp service.
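One thing to keep in mind with a `ps x | grep manage.py` pipeline is that the grep process can appear in its own results. The Python sketch below shows the same filtering idea with the grep line excluded. It works on captured ps text rather than calling ps, the function name matching_commands is hypothetical, and the sample output is made up for illustration.

```python
def matching_commands(ps_output, needle="manage.py"):
    """Return the COMMAND column of `ps x` lines that mention `needle`.

    Lines that belong to grep itself are skipped, so a running
    `grep manage.py` is not mistaken for the monitored process.
    """
    commands = []
    for line in ps_output.splitlines():
        fields = line.split(None, 4)  # PID TTY STAT TIME COMMAND
        if len(fields) < 5 or fields[0] == "PID":
            continue  # skip the header and malformed lines
        command = fields[4]
        if needle in command and not command.startswith("grep "):
            commands.append(command)
    return commands


if __name__ == "__main__":
    # Made-up sample in the layout that `ps x` prints.
    sample = (
        "  PID TTY      STAT   TIME COMMAND\n"
        " 2301 pts/0    S      0:00 python manage.py runserver\n"
        " 2302 pts/0    Sl     0:03 python manage.py runserver\n"
        " 2410 pts/0    S+     0:00 grep manage.py\n"
    )
    for command in matching_commands(sample):
        print(command)
```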

On the monitoring machine, run snmpwalk -v 2c -c public 192.168.15.67 .1.3.6.1.4.1.2021.54

You should see output like this

root@sifksky:/usr/local/nagios/libexec# snmpwalk -v 2c -c public 192.168.15.67 .1.3.6.1.4.1.2021.54

UCD-SNMP-MIB::ucdavis.54.1.1 = INTEGER: 1

UCD-SNMP-MIB::ucdavis.54.2.1 = STRING: "/bin/sh"

UCD-SNMP-MIB::ucdavis.54.3.1 = STRING: "/root/test.sh"

UCD-SNMP-MIB::ucdavis.54.100.1 = INTEGER: 0

UCD-SNMP-MIB::ucdavis.54.101.1 = STRING: "manage.py runserver"

UCD-SNMP-MIB::ucdavis.54.101.2 = STRING: "manage.py runserver"

UCD-SNMP-MIB::ucdavis.54.102.1 = INTEGER: 0

UCD-SNMP-MIB::ucdavis.54.103.1 = ""

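If nothing comes back, check that the firewall on the monitored host is disabled or allows SNMP traffic (UDP port 161).

Once you see UCD-SNMP-MIB::ucdavis.54.101.1 = STRING: "manage.py runserver"

you can monitor the process with the Nagios check_snmp plugin:

check_snmp -H 192.168.15.67 -P 2c -C public -o .1.3.6.1.4.1.2021.54.101.1 -s "manage.py runserver"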
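To make the string comparison concrete, here is a small Python sketch that parses snmpwalk output of the form shown in this section and tests whether a given OID reports the expected process string. It only illustrates the idea and is not how check_snmp is implemented; the function names and the regular expression are assumptions.

```python
import re

# "<oid> = <TYPE>: <value>", where the "<TYPE>: " part may be absent.
LINE_RE = re.compile(
    r'^(?P<oid>\S+) = (?:(?P<type>[A-Za-z0-9-]+): )?(?P<value>.*)$')


def parse_snmpwalk(text):
    """Map each OID name to its value with surrounding quotes removed."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            result[match.group("oid")] = match.group("value").strip('"')
    return result


def process_reported(text, oid_suffix, expected):
    """True if an OID ending in oid_suffix holds exactly `expected`."""
    return any(oid.endswith(oid_suffix) and value == expected
               for oid, value in parse_snmpwalk(text).items())


if __name__ == "__main__":
    sample = (
        'UCD-SNMP-MIB::ucdavis.54.100.1 = INTEGER: 0\n'
        'UCD-SNMP-MIB::ucdavis.54.101.1 = STRING: "manage.py runserver"\n'
        'UCD-SNMP-MIB::ucdavis.54.103.1 = ""\n'
    )
    print(process_reported(sample, ".54.101.1", "manage.py runserver"))
```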

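If anything is unclear, please leave a comment.

PS

The three ways I currently know of for Nagios to monitor a service:

1) Check the service's port over TCP. Nothing needs to be installed on the monitored host.

2) Check the process name via SNMP. SNMP must be enabled and configured on the monitored host, with some scripts added.

3) Monitor through the Nagios NRPE plugin. This amounts to installing a small service on the monitored host that talks to Nagios on the monitoring machine over SSL; the service restart script also lives on the monitored host.

The first two require SSH on the monitored host so that, when a problem occurs, Nagios can restart the service through the Python script. An alert is sent only if the service is still unhealthy after the restart.

The third does not require SSH, but the plugin has to be installed on the monitored host.

Checking the TCP port raises the alarm slightly sooner than checking through SNMP.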
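The first approach, a plain TCP port check, can be sketched in a few lines of Python. This is a simplified stand-in for what a plugin such as check_tcp does, not its actual implementation; the function name and messages are assumptions, while exit codes 0 (OK) and 2 (CRITICAL) follow the usual Nagios plugin convention.

```python
import socket

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def check_tcp(host, port, timeout=5.0):
    """Return (exit_code, message) for a plain TCP connect test."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, "TCP OK - port %d reachable" % port
    except OSError as exc:  # refused, unreachable, or timed out
        return CRITICAL, "TCP CRITICAL - port %d: %s" % (port, exc)


if __name__ == "__main__":
    # 8000 is the Django development server port used in this post.
    code, message = check_tcp("127.0.0.1", 8000, timeout=2.0)
    print(message)
    raise SystemExit(code)
```

Because the check needs nothing but a reachable port, it works without any agent or configuration on the monitored host.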