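A server keeps rebooting, several times a day.
1. Hardware: the memory and the motherboard were swapped one by one, and eventually the whole machine except the hard disk was replaced. It still rebooted.
2. Power was ruled out: the power cable and the wall socket were changed. It still rebooted.
3. That leaves three possibilities:
A. A kernel problem: the kernel crashes. (Red Hat's stability is quite trustworthy, so this seems unlikely.)
B. A disk or filesystem failure. In essence, this would also end in a kernel crash.
C. A program reboots the machine on its own. (Either one of our own programs calls reboot, or an intruder got in and planted a reboot script. What a bored hacker...)
At the moment of a crash the kernel knows it is about to go down, so it leaves some information behind just before dying to tell the user what happened. The catch is that the filesystem is complex, and it may already be unusable by the time the kernel is dying, so those last messages never reach the disk.
Checking the logs after a reboot confirmed this: nothing useful had been left behind.
At this point there is another tool: netconsole. It sends kernel log messages out over UDP, building the IP packets itself instead of going through the normal protocol stack, and pushes them straight out of the network interface. The payload is the kernel log text, which a remote syslog daemon can record.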
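Since netconsole just emits UDP datagrams carrying log text, any UDP listener on the receiving machine can capture them. The following is a minimal sketch of such a listener, not the setup used in this post (here a syslog daemon on 192.168.10.214 does the receiving). The function names and the port 6666 on the receiving side are assumptions chosen for illustration.

```python
import socket


def make_listener(host="0.0.0.0", port=6666):
    """Open a UDP socket that netconsole datagrams can be sent to.

    The port is a hypothetical choice; it must match the target port
    configured on the crashing machine.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_one(sock, bufsize=4096):
    """Block until one datagram arrives; return (sender_ip, text)."""
    data, (sender_ip, _sender_port) = sock.recvfrom(bufsize)
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    return sender_ip, text


def serve_forever(sock):
    """Print every received message, prefixed with the sender's IP."""
    while True:
        sender_ip, text = receive_one(sock)
        print(sender_ip, text, flush=True)
```

To use it, call serve_forever(make_listener()) on the receiving machine and redirect the output to a file. A real syslog daemon is the sturdier choice for long-term capture, since it adds timestamps and log rotation.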
Using netconsole:
1. Edit the configuration file
[root@S205 ~]# cat /etc/sysconfig/netconsole
# This is the configuration file for the netconsole service. By starting
# this service you allow a remote syslog daemon to record console output
# from this system.
# The local port number that the netconsole module will use
LOCALPORT=6666
# The ethernet device to send console messages out of (only set this if it
# can't be automatically determined)
DEV=enp3s0
# The IP address of the remote syslog server to send messages to
SYSLOGADDR=192.168.10.214
# The listening port of the remote syslog daemon
SYSLOGPORT=514
# The MAC address of the remote syslog server (only set this if it can't
# be automatically determined)
SYSLOGMACADDR=40:8d:5c:22:53:18
[root@S205 ~]#
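The netconsole kernel module itself takes a single parameter of the form netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]. As a rough sketch of how the settings above map onto that string, the snippet below parses the KEY=VALUE file and assembles the parameter. The helper names are hypothetical, and whether the netconsole service script builds the string in exactly this way is an assumption not confirmed here.

```python
def parse_sysconfig(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def netconsole_param(settings, local_ip=""):
    """Build the module parameter string from the parsed settings.

    local_ip may stay empty, in which case the kernel uses the
    address of the sending interface.
    """
    return "netconsole={lp}@{lip}/{dev},{rp}@{rip}/{mac}".format(
        lp=settings.get("LOCALPORT", ""),
        lip=local_ip,
        dev=settings.get("DEV", ""),
        rp=settings.get("SYSLOGPORT", ""),
        rip=settings["SYSLOGADDR"],  # the target IP is mandatory
        mac=settings.get("SYSLOGMACADDR", ""),
    )
```

The resulting string could be passed to modprobe netconsole by hand when the service script is not available, which is handy for a quick test on a machine that is not set up as above.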
2. Start the service
[root@S205 ~]# systemctl start netconsole
[root@S205 ~]# systemctl enable netconsole
Current OS and kernel version:
[root@S205 ~]# cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)
[root@S205 ~]# uname -a
Linux S205 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@S205 ~]#
The kernel crash log was received successfully:
Jul 25 08:14:54 192.168.10.205 [20239.422386] NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
Jul 25 08:14:54 192.168.10.205
Jul 25 08:14:54 192.168.10.205 [20239.422529] Kernel panic - not syncing: Hard LOCKUP
Jul 25 08:14:54 192.168.10.205 [20239.422543] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 3.10.0-514.el7.x86_64 #1
Jul 25 08:14:54 192.168.10.205 [20239.422561] Hardware name: LENOVO 10C0A038CD/ , BIOS FCKT73AUS 08/28/2015
Jul 25 08:14:54 192.168.10.205 [20239.422579] ffffffff818d9784
Jul 25 08:14:54 192.168.10.205 90a5d9572fc8872b
Jul 25 08:14:54 192.168.10.205 ffff88041edc5b18
Jul 25 08:14:54 192.168.10.205 ffffffff81685fac
Jul 25 08:14:54 192.168.10.205
Jul 25 08:14:54 192.168.10.205 [20239.422603] ffff88041edc5b98
Jul 25 08:14:54 192.168.10.205 ffffffff8167f3b3
Jul 25 08:14:54 192.168.10.205 0000000000000010
Jul 25 08:14:54 192.168.10.205 ffff88041edc5ba8
Jul 25 08:14:54 192.168.10.205
Jul 25 08:14:54 192.168.10.205 [20239.422627] ffff88041edc5b48
Jul 25 08:14:54 192.168.10.205 90a5d9572fc8872b
Jul 25 08:14:54 192.168.10.205 ffff88041edc5ba8
Jul 25 08:14:54 192.168.10.205 ffffffff818d948a
Jul 25 08:14:54 192.168.10.205
Jul 25 08:14:54 192.168.10.205 [20239.422651] Call Trace:
Jul 25 08:14:54 192.168.10.205 [20239.422658]
Jul 25 08:14:54 192.168.10.205 [] dump_stack+0x19/0x1b
Jul 25 08:14:54 192.168.10.205 [20239.422678] [] panic+0xe3/0x1f2
Jul 25 08:14:54 192.168.10.205 [20239.422692] [] nmi_panic+0x3f/0x40
Jul 25 08:14:54 192.168.10.205 [20239.422706] [] watchdog_overflow_callback+0xf6/0x100
Jul 25 08:14:54 192.168.10.205 [20239.422725] [] __perf_event_overflow+0x8e/0x1f0
Jul 25 08:14:54 192.168.10.205 [20239.422741] [] perf_event_overflow+0x14/0x20
Jul 25 08:14:54 192.168.10.205 [20239.422759] [] intel_pmu_handle_irq+0x1f8/0x4e0
Jul 25 08:14:54 192.168.10.205 [20239.422776] [] perf_event_nmi_handler+0x2b/0x50
Jul 25 08:14:54 192.168.10.205 [20239.422793] [] nmi_handle.isra.0+0x69/0xb0
Jul 25 08:14:54 192.168.10.205 [20239.422808] [] do_nmi+0x133/0x410
Jul 25 08:14:54 192.168.10.205 [20239.422822] [] end_repeat_nmi+0x1e/0x2e
Jul 25 08:14:54 192.168.10.205 [20239.422838] [] ? _raw_spin_lock_irqsave+0x47/0x60
Jul 25 08:14:54 192.168.10.205 [20239.422855] [] ? _raw_spin_lock_irqsave+0x47/0x60
Jul 25 08:14:54 192.168.10.205 [20239.422871] [] ? _raw_spin_lock_irqsave+0x47/0x60
Jul 25 08:14:54 192.168.10.205 [20239.422887] <>
Jul 25 08:14:54 192.168.10.205
Jul 25 08:14:54 192.168.10.205 [] nvkm_fantog_update+0x43/0x110 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.422947] [] nvkm_fantog_set+0x38/0x40 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.422976] [] nvkm_fan_update+0xc8/0x210 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423005] [] nvkm_therm_fan_set+0x19/0x20 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423035] [] nvkm_therm_update+0x97/0x310 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423064] [] nvkm_therm_alarm+0x17/0x20 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423106] [] nvkm_timer_alarm_trigger+0x103/0x150 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423147] [] nvkm_timer_alarm+0x60/0xb0 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423176] [] alarm_timer_callback+0xd1/0xe0 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423207] [] nvkm_timer_alarm_trigger+0x103/0x150 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423238] [] nvkm_timer_alarm+0x60/0xb0 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423266] [] nvkm_fantog_update+0x10a/0x110 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423295] [] nvkm_fantog_alarm+0x1a/0x20 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423324] [] nvkm_timer_alarm_trigger+0x103/0x150 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423355] [] nv04_timer_intr+0x6b/0xb0 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423384] [] nvkm_timer_intr+0x14/0x20 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423419] [] nvkm_subdev_intr+0x17/0x20 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423458] [] nvkm_mc_intr+0x79/0x110 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423486] [] nvkm_pci_intr+0x55/0xa0 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423503] [] handle_irq_event_percpu+0x3e/0x1e0
Jul 25 08:14:54 192.168.10.205 [20239.423521] [] handle_irq_event+0x3d/0x60
Jul 25 08:14:54 192.168.10.205 [20239.423536] [] handle_edge_irq+0x77/0x130
Jul 25 08:14:54 192.168.10.205 [20239.424012] [] handle_irq+0xbf/0x150
Jul 25 08:14:54 192.168.10.205 [20239.424491] [] ? tick_check_idle+0x8a/0xd0
Jul 25 08:14:54 192.168.10.205 [20239.424967] [] ? atomic_notifier_call_chain+0x1a/0x20
Jul 25 08:14:54 192.168.10.205 [20239.425445] [] do_IRQ+0x4f/0xf0
Jul 25 08:14:54 192.168.10.205 [20239.425921] [] common_interrupt+0x6d/0x6d
Jul 25 08:14:54 192.168.10.205 [20239.426389]
Jul 25 08:14:54 192.168.10.205 [] ? cpuidle_enter_state+0x52/0xc0
Jul 25 08:14:54 192.168.10.205 [20239.426863] [] cpuidle_idle_call+0xd9/0x210
Jul 25 08:14:54 192.168.10.205 [20239.427314] [] arch_cpu_idle+0xe/0x30
Jul 25 08:14:54 192.168.10.205 [20239.427793] [] cpu_startup_entry+0x245/0x290
Jul 25 08:14:54 192.168.10.205 [20239.428222] [] start_secondary+0x1ba/0x230
Jul 25 08:18:03 192.168.10.205 [ 2.633081] nouveau 0000:01:00.0: priv: HUB0: 085014 ffffffff (1b70820b)
Jul 25 08:20:01 S214 systemd: Started Session 171 of user root.
Jul 25 08:20:01 S214 systemd: Starting Session 171 of user root.
Jul 25 08:30:01 S214 systemd: Started Session 172 of user root.
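A long call trace like this is easier to read once the frames are grouped by the module they belong to. The sketch below, under the assumption that each frame looks like name+0xOFF/0xLEN optionally followed by [module], pulls the function and module out of each line and counts frames per module. The function names and the "kernel" label for frames without a module tag are choices made for this example.

```python
import re
from collections import Counter

# A frame looks like "nvkm_fantog_set+0x38/0x40 [nouveau]";
# frames from the core kernel carry no trailing [module] tag.
FRAME_RE = re.compile(
    r"(?<![\w.])(?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frame(line):
    """Return (function, module) for a stack frame line, else None."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    return match.group("func"), match.group("module") or "kernel"


def frames_per_module(lines):
    """Count stack frames per module across all given log lines."""
    counts = Counter()
    for line in lines:
        frame = parse_frame(line)
        if frame is not None:
            counts[frame[1]] += 1
    return counts
```

Run over the log above, this makes it obvious that the frames between the interrupt entry and the spinlock the CPU was stuck on all belong to the nouveau module.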
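At this point the right move is neither to dig deep into the root cause nor to start hacking on the kernel, but one of the following:
1. Upgrade to the latest stable kernel.
2. Roll back to the previous stable kernel.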
[root@S205 ~]# yum upgrade
Installing:
kernel x86_64 3.10.0-514.26.2.el7 updates 37 M
Upgraded to the latest kernel; now observing:
[root@S205 ~]# cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)
[root@S205 ~]# uname -a
Linux S205 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
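Reference: https://stackoverflow.com/questions/44039958/kernel-panic-not-syncing-watchdog-detected-hard-lockup
It looks like a problem with the NVIDIA graphics card (the call trace above runs through the nouveau driver).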
[root@S205 ~]# cat /etc/default/grub |grep CMDLINE
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rd.driver.blacklist=nouveau nomodeset rhgb quiet"
[root@S205 ~]#
Kernel parameters added: rd.driver.blacklist=nouveau nomodeset
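Editing /etc/default/grub does not change the running kernel by itself: the GRUB configuration normally has to be regenerated (for example with grub2-mkconfig) and the machine rebooted. A small check against /proc/cmdline confirms whether the parameters are actually in effect. This is a sketch; the function names are made up for the example.

```python
def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()


def missing_params(cmdline, required):
    """List the required parameters that are absent from cmdline.

    Parameters are compared as whole whitespace-separated tokens, so
    "nomodeset" does not match inside some longer token.
    """
    tokens = cmdline.split()
    return [param for param in required if param not in tokens]
```

For example, missing_params(read_cmdline(), ["rd.driver.blacklist=nouveau", "nomodeset"]) returns an empty list once both parameters are active.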
Observing again.
No reboot for 24 consecutive hours.
Done.