Alert notifications are triggered, but no email arrives?
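From the checks above, sendmail is almost certainly fine, and DNS resolution is normal too (MX and A records were set up specifically for this monitoring server). The only remaining suspect is Nagios itself. I went into the directory that holds the Nagios configuration files and inspected them one by one. My Nagios configuration directory looks like this: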
[root@Nagios /usr/local/Nagios]# pwd
[root@Nagios /usr/local/Nagios/etc]# ls *.cfg
There are both host failure notifications and service failure notifications, and both should have sent email the way I defined them!
Nagios keeps its own log too
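I searched back and forth and found no lead. I checked the sendmail log /var/log/maillog again and found only the record of the mail I had sent by hand, with no other delivery records. There was just this one entry: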
Jul 27 14:27:48 Nagios sm-mta[37141]: m6RERkYR037139:
to=
I almost forgot that Nagios keeps a log of its own! I opened it right away and found quite a few Warning entries. Here is one of them:
[1217166816] HOST NOTIFICATION: sery;mail-server;DOWN;host-notify-by-email;CRITICAL - Plugin timed out after 10 seconds
Cause: the path to mail is wrong
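The other lines are similar. The most useful information, which I marked in red, says in essence that the binary or executable named in the entry could not be run. This entry involves only two executables: printf and mail. I ran each of them on its own, exactly as written: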
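(1) /usr/bin/printf "%b" "***** Nagios 2.9 *****\n"
This prints ***** Nagios 2.9 *****, which is the expected result.
(2) /bin/mail -s "Host DOWN alert for mail-server!" sery@163.com
This prints su: /bin/mail: No such file or directory, meaning the path does not exist. Yet I had sent a mail by hand earlier, so the mail client is clearly installed. The path itself is probably wrong: /bin/mail is where mail lives on Linux. To look up the location on FreeBSD I ran find / -name and found that mail on FreeBSD is at /usr/bin/mail.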
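The path lookup in step (2) can be sketched as a small shell function. This is only a minimal sketch and not part of the original setup: locate_binary is a hypothetical helper name, and the extra directories it checks after $PATH are an assumption about where a mail client might be installed.

```shell
# Hypothetical helper: print the first executable file called "$1" found in
# $PATH or in a few common directories (the extra directories are an assumption).
# The body runs in a subshell, so changing IFS does not leak to the caller.
locate_binary() (
    name=$1
    IFS=:
    for dir in $PATH /bin /usr/bin /usr/local/bin; do
        if [ -n "$dir" ] && [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            exit 0
        fi
    done
    exit 1
)
```

Calling locate_binary mail prints the full path if one is found and returns a non-zero status otherwise, which is quicker than walking the whole filesystem with find.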
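At this point we know the root cause of the missing mail. Next, I replaced "/bin/mail" with "/usr/bin/mail" in the host-notify-by-email and service-notify-by-email definitions in the Nagios configuration file commands.cfg. The complete definitions are: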
# 'host-notify-by-email' command definition
After modifying commands.cfg I restarted Nagios and checked the Nagios log again. The "Make sure the script or binary you are trying to execute actually exists..." error no longer appeared, and the log now showed alert emails being sent:
[root@Nagios /usr/local/Nagios/var]# tail -f Nagios.log
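I could hardly wait to check my mail, and there it was: my 163 mailbox had finally received the long-missing alert. Looking back at the mail log /var/log/maillog, the delivery was recorded there as well.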
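Lesson learned: logs are invaluable when troubleshooting. In day-to-day work we should routinely check the system logs and the logs of the relevant applications, look for clues in their entries, and work out the fix from there.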
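As the figure above shows, the service on port 443 of one server really had failed, yet after a long wait no alert email arrived. I logged in to the system running Nagios and checked everything related to sending mail. The basic steps were: check that sendmail was up (ps aux | grep sendmail), which looked normal; and send a test message by hand with the mail program to one of my mailboxes (mail -s "This is a mail test project" sery@163.com).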
/usr/local/Nagios/etc
cgi.cfg
contacts.cfg
localhost.cfg
services.cfg
commands.cfg
hostgroups.cfg
Nagios.cfg
timeperiods.cfg
contactgroups.cfg
hosts.cfg
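resource.cfg
I saw nothing abnormal. I changed a few settings, for example in cgi.cfg, and restarted Nagios, but alert mail still was not sent. Yet when I clicked through the web management interface, the notification activity was clearly there, as shown in the figure below: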
[1217166816] Warning: Attempting to execute the command "/usr/bin/printf "%b" "***** Nagios 2.9 *****\n\nNotification Type: PROBLEM\nHost: mail-server\nState: DOWN\nAddress: 211.155.115.66\nInfo: CRITICAL - Plugin timed out after 10 seconds\n\nDate/Time: Sun Jul 27 13:53:36 UTC 2008\n" | /bin/mail -s "Host DOWN alert for mail-server!" sery@163.com" resulted in a return code of 127. Make sure the script or binary you are trying to execute actually exists...
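The "return code of 127" in this warning is the exit status a POSIX shell reports when it cannot find the command it was asked to run. A minimal sketch that reproduces the same status outside Nagios follows; missing_binary_status is a hypothetical helper name, and /no/such/dir/mail and someone@example.com are placeholders chosen for illustration.

```shell
# Hypothetical helper: run a pipeline whose last command does not exist and
# print the exit status that the shell reports for it.
missing_binary_status() {
    rc=0
    # The pipeline's status is that of its last command, the missing binary.
    sh -c 'printf "test\n" | /no/such/dir/mail -s "subject" someone@example.com' 2>/dev/null || rc=$?
    printf '%s\n' "$rc"
}
```

Running missing_binary_status prints 127, the same code Nagios logged, which points at a wrong path rather than at a mail delivery problem.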
define command{
        command_name    host-notify-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios 2.9 *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/bin/mail -s "Host $HOSTSTATE$ alert for $HOSTNAME$!" $CONTACTEMAIL$
        }
# 'notify-by-email' command definition
define command{
        command_name    service-notify-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios 2.9 *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ alert - $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
        }
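Since the failure came from a hard-coded path in a command_line, a quick scan of commands.cfg for absolute paths that do not exist on the current system might catch this kind of problem before any alert is lost. The following is a rough sketch under stated assumptions: check_command_paths is a hypothetical helper, it looks only at the first word of each pipeline stage, and it splits naively on "|", so a "|" inside a quoted string would confuse it.

```shell
# Hypothetical helper: list absolute executables named in command_line entries
# of a Nagios config file that are missing or not executable on this system.
check_command_paths() {
    cfg=$1
    rc=0
    # For every command_line, split on "|" and keep the first word of each stage.
    bins=$(awk '$1 == "command_line" {
        sub(/^[ \t]*command_line[ \t]+/, "")
        n = split($0, parts, "|")
        for (i = 1; i <= n; i++) {
            split(parts[i], words, " ")
            print words[1]
        }
    }' "$cfg" | sort -u)
    for bin in $bins; do
        case $bin in
            /*)  # only absolute paths; macros such as $USER1$/... are skipped
                if [ ! -x "$bin" ]; then
                    printf 'missing: %s\n' "$bin"
                    rc=1
                fi
                ;;
        esac
    done
    return "$rc"
}
```

For example, check_command_paths /usr/local/Nagios/etc/commands.cfg would print one "missing:" line per path that cannot be executed and return a non-zero status if any were found.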
[1217170467] SERVICE ALERT: mail-server;check_tcp 995;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[1217170534] Auto-save of retention data completed successfully.
[1217170577] HOST ALERT: mail-server;DOWN;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
[1217170587] HOST ALERT: mail-server;DOWN;SOFT;2;CRITICAL - Plugin timed out after 10 seconds
[1217170597] HOST ALERT: mail-server;DOWN;SOFT;3;CRITICAL - Plugin timed out after 10 seconds
[1217170607] HOST ALERT: mail-server;DOWN;SOFT;4;CRITICAL - Plugin timed out after 10 seconds
[1217170607] HOST ALERT: mail-server;UP;SOFT;5;PING OK - Packet loss = 0%, RTA = 111.63 ms
[1217170607] SERVICE ALERT: mail-server;check_tcp 995;CRITICAL;SOFT;2;CRITICAL - Socket timeout after 10 seconds
[1217170687] SERVICE ALERT: mail-server;check_tcp 995;OK;SOFT;3;TCP OK - 3.137 second response time on port 995
[1217171057] SERVICE NOTIFICATION: sery;fav-0;check_tcp 443;CRITICAL;service-notify-by-email;CRITICAL - Socket timeout after 10 seconds
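When reading a log like the one above, it can help to filter out everything except notification entries and failed-command warnings, and to convert the bracketed epoch timestamps into readable times. The sketch below is an illustration only: notification_lines and epoch_to_utc are hypothetical helper names, and the date fallback assumes either a BSD-style date (-r seconds) or a GNU-style date (-d @seconds) is installed.

```shell
# Hypothetical helper: keep only notification entries and failed-command warnings.
notification_lines() {
    grep -E 'NOTIFICATION:|resulted in a return code of' "$1"
}

# Hypothetical helper: convert an epoch timestamp to a UTC date string.
# BSD date accepts "-r seconds"; if that fails, fall back to GNU "-d @seconds".
epoch_to_utc() {
    date -u -r "$1" '+%Y-%m-%d %H:%M:%S' 2>/dev/null ||
        date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}
```

For instance, epoch_to_utc 1217166816 gives 2008-07-27 13:53:36, which matches the Date/Time field embedded in the warning entry with that timestamp.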