原帖由 freas 于 2011-7-16 21:41 发表
如果可以,尽量提供案发时v$session(或者v$session_wait)中的event信息。
根据iostat记录的信息做出如下分析(以下只是部分信息,但是具有很大代表性)
在案发以前(19:00以前),系统IO信息如下:
Time: 18:34:22
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 13.50 0.00 5.80 0.00 164.80 0.00 82.40 28.41 0.15 25.38 6.14 3.56
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 4.60 88.50 147.20 718.50 73.60 359.25 9.30 0.03 0.29 0.28 2.60
sdd 0.00 0.00 10.60 13.20 169.60 302.40 84.80 151.20 19.83 0.12 5.08 4.57 10.88
sde 0.00 0.00 9.70 21.50 155.20 475.20 77.60 237.60 20.21 0.09 3.04 2.83 8.84
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
也就是说,在案发前,系统的IO比较空闲。
在案发时(19:00以后):
Time: 19:14:22
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 13.90 0.00 4.10 0.00 156.00 0.00 78.00 38.05 0.04 9.66 6.34 2.60
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 4.80 116.00 153.60 916.60 76.80 458.30 8.86 0.03 0.23 0.23 2.72
sdd 0.00 0.00 13.40 11.20 214.40 292.80 107.20 146.40 20.62 0.15 6.23 5.64 13.88
sde 0.00 0.00 153.30 27.50 31193.60 635.20 15596.80 317.60 176.04 1.12 6.21 5.33 96.44
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
这时候,发生了大量的读操作!
在案发前(同样是18:34),查看top信息,如下:
2011-07-15 18:34:42 [COMMAND]: top -b -d 1 -n 1
------------------------------------------------------------------------
top - 18:34:43 up 105 days, 17:34, 7 users, load average: 0.93, 1.44, 1.29
Tasks: 476 total, 3 running, 472 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.4%us, 2.3%sy, 0.0%ni, 91.9%id, 1.2%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 32902912k total, 21752292k used, 11150620k free, 249276k buffers
Swap: 33559744k total, 194900k used, 33364844k free, 17502728k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26042 root 21 0 41224 31m 1564 R 8 0.1 241:43.28 col_monitor.sh
25866 root 25 0 11792 2320 1520 S 8 0.0 0:00.04 col_monitor_sar
19745 oracle 15 0 16.2g 248m 245m S 4 0.8 0:01.28 oracle
19749 oracle 15 0 16.2g 184m 181m S 4 0.6 0:00.77 oracle
26048 root 25 0 11784 2304 1512 R 4 0.0 0:00.02 col_monitor_vms
9726 oracle 16 0 16.2g 45m 39m S 2 0.1 124:48.00 oracle
14319 oracle 15 0 16.2g 407m 404m S 2 1.3 0:02.99 oracle
19579 oracle 15 0 16.2g 339m 335m S 2 1.1 0:01.76 oracle
19622 oracle 15 0 16.2g 244m 241m S 2 0.8 0:01.31 oracle
19741 oracle 15 0 16.2g 264m 261m S 2 0.8 0:01.47 oracle
22156 oracle 15 0 16.2g 164m 161m S 2 0.5 0:00.78 oracle
22232 oracle 15 0 16.2g 104m 101m S 2 0.3 0:00.25 oracle
25022 oracle 16 0 16.2g 87m 84m S 2 0.3 0:00.14 oracle
25814 root 15 0 5780 1404 812 R 2 0.0 0:00.02 top
1 root 15 0 780 308 256 S 0 0.0 30:40.87 init
2 root RT 0 0 0 0 S 0 0.0 0:54.23 migration/0
3 root 34 19 0 0 0 S 0 0.0 26:43.82 ksoftirqd/0
4 root RT 0 0 0 0 S 0 0.0 1:01.75 migration/1
5 root 34 19 0 0 0 S 0 0.0 11:01.04 ksoftirqd/1
6 root RT 0 0 0 0 S 0 0.0 2:29.07 migration/2
7 root 34 19 0 0 0 S 0 0.0 24:35.80 ksoftirqd/2
8 root RT 0 0 0 0 S 0 0.0 3:37.25 migration/3
9 root 34 19 0 0 0 S 0 0.0 11:00.48 ksoftirqd/3
10 root 10 -5 0 0 0 S 0 0.0 0:00.08 events/0
11 root 10 -5 0 0 0 S 0 0.0 0:00.06 events/1
12 root 10 -5 0 0 0 S 0 0.0 0:00.45 events/2
13 root 10 -5 0 0 0 S 0 0.0 0:00.21 events/3
14 root 10 -5 0 0 0 S 0 0.0 4:34.46 khelper
15 root 14 -5 0 0 0 S 0 0.0 0:00.00 kthread
21 root 10 -5 0 0 0 S 0 0.0 0:00.00 kblockd/0
22 root 10 -5 0 0 0 S 0 0.0 0:00.00 kblockd/1
23 root 10 -5 0 0 0 S 0 0.0 0:00.02 kblockd/2
24 root 10 -5 0 0 0 S 0 0.0 0:00.01 kblockd/3
25 root 10 -5 0 0 0 S 0 0.0 0:00.00 kacpid
206 root 15 0 0 0 0 S 0 0.0 0:00.30 pdflush
208 root 15 0 0 0 0 S 0 0.0 0:47.94 kswapd0
209 root 10 -5 0 0 0 S 0 0.0 0:00.00 aio/0
210 root 11 -5 0 0 0 S 0 0.0 0:00.00 aio/1
211 root 10 -5 0 0 0 S 0 0.0 0:00.00 aio/2
212 root 11 -5 0 0 0 S 0 0.0 0:00.00 aio/3
432 root 10 -5 0 0 0 S 0 0.0 0:00.00 cqueue/0
433 root 10 -5 0 0 0 S 0 0.0 0:00.00 cqueue/1
434 root 10 -5 0 0 0 S 0 0.0 0:00.00 cqueue/2
435 root 11 -5 0 0 0 S 0 0.0 0:00.00 cqueue/3
436 root 10 -5 0 0 0 S 0 0.0 0:00.00 kseriod
467 root 10 -5 0 0 0 S 0 0.0 0:00.00 kpsmoused
876 root 11 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_0
877 root 0 -20 0 0 0 S 0 0.0 0:00.00 lpfc_worker_0
878 root 19 -5 0 0 0 S 0 0.0 0:00.60 scsi_wq_0
879 root 10 -5 0 0 0 S 0 0.0 0:00.00 fc_wq_0
880 root 11 -5 0 0 0 S 0 0.0 0:00.00 fc_dl_0
941 root 11 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_1
942 root 0 -20 0 0 0 S 0 0.0 0:00.00 lpfc_worker_1
943 root 10 -5 0 0 0 S 0 0.0 0:00.63 scsi_wq_1
944 root 10 -5 0 0 0 S 0 0.0 0:00.00 fc_wq_1
945 root 11 -5 0 0 0 S 0 0.0 0:00.00 fc_dl_1
1035 root 10 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_2
1068 root 10 -5 0 0 0 S 0 0.0 0:00.00 ata/0
1069 root 12 -5 0 0 0 S 0 0.0 0:00.00 ata/1
1070 root 12 -5 0 0 0 S 0 0.0 0:00.00 ata/2
1071 root 12 -5 0 0 0 S 0 0.0 0:00.00 ata/3
1072 root 12 -5 0 0 0 S 0 0.0 0:00.00 ata_aux
1131 root 10 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_3
1132 root 10 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_4
1208 root 10 -5 0 0 0 S 0 0.0 0:00.01 khubd
1327 root 0 -20 0 0 0 S 0 0.0 133:07.64 upDispatch
1328 root 0 -20 0 0 0 S 0 0.0 1:31.44 upFailback
1329 root 0 -20 0 0 0 S 0 0.0 0:00.00 upPathValidate
1330 root 0 -20 0 0 0 S 0 0.0 0:00.12 upWorker
1331 root 0 -20 0 0 0 S 0 0.0 0:00.03 HwLogHandle
1332 root 10 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_5
1472 root 10 -5 0 0 0 S 0 0.0 0:01.00 reiserfs/0
1473 root 10 -5 0 0 0 S 0 0.0 0:00.13 reiserfs/1
1474 root 10 -5 0 0 0 S 0 0.0 0:00.14 reiserfs/2
1475 root 10 -5 0 0 0 S 0 0.0 0:01.30 reiserfs/3
1531 root 13 -4 7240 736 336 S 0 0.0 0:00.43 udevd
2486 root 20 0 0 0 0 S 0 0.0 0:00.00 shpchpd_event
3112 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3113 root 15 0 0 0 0 S 0 0.0 0:01.92 dmp_daemon
3114 root 15 0 0 0 0 S 0 0.0 0:00.86 dmp_daemon
3115 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3116 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3117 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3118 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3119 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3120 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3121 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3169 root 15 0 0 0 0 S 0 0.0 0:00.01 voltuned
3170 root 15 0 0 0 0 S 0 0.0 0:00.00 vxiod
3171 root 15 0 0 0 0 S 0 0.0 0:00.00 vxiod
3172 root 15 0 0 0 0 S 0 0.0 0:00.00 vxiod
3173 root 15 0 0 0 0 S 0 0.0 0:00.00 vxiod
3174 root 15 0 0 0 0 S 0 0.0 0:00.00 vxiod
其中并没有CPU高消耗的Oracle进程,系统状况非常良好。
在案发后(19:15),系统出现了单Oracle进程高消耗CPU:
2011-07-15 19:15:42 [COMMAND]: top -b -d 1 -n 1
------------------------------------------------------------------------
top - 19:15:42 up 105 days, 18:15, 7 users, load average: 2.57, 2.95, 2.55
Tasks: 483 total, 4 running, 478 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.4%us, 2.3%sy, 0.0%ni, 91.9%id, 1.2%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 32902912k total, 22791788k used, 10111124k free, 249300k buffers
Swap: 33559744k total, 194620k used, 33365124k free, 18525748k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13907 oracle 25 0 16.2g 725m 722m R 96 2.3 4:00.88 oracle
26042 root 25 0 41224 31m 1564 S 10 0.1 241:54.70 col_monitor.sh
25724 oracle 17 0 16.3g 4.3g 4.3g D 6 13.8 2:16.91 oracle
9726 oracle 16 0 16.2g 45m 39m R 4 0.1 125:11.15 oracle
1418 oracle 15 0 16.2g 89m 86m S 4 0.3 0:00.16 oracle
1442 oracle 15 0 16.2g 109m 106m S 4 0.3 0:00.40 oracle
2461 root 16 0 5784 1412 812 R 4 0.0 0:00.03 top
2481 root 19 0 11800 2320 1520 S 4 0.0 0:00.02 col_monitor_sar
2641 oracle 16 0 16.2g 18m 15m S 4 0.1 0:00.02 oracle
23779 oracle 15 0 16.2g 228m 225m S 2 0.7 0:01.44 oracle
26585 oracle 15 0 16.2g 296m 293m S 2 0.9 0:01.79 oracle
28616 oracle 15 0 16.2g 210m 207m S 2 0.7 0:00.88 oracle
28863 oracle 15 0 16.2g 169m 166m S 2 0.5 0:00.94 oracle
28867 oracle 15 0 16.2g 244m 241m S 2 0.8 0:01.11 oracle
1028 oracle 16 0 16.2g 142m 139m S 2 0.4 0:00.62 oracle
1354 oracle 15 0 16.2g 175m 172m S 2 0.5 0:00.50 oracle
1370 oracle 15 0 16.2g 91m 88m S 2 0.3 0:00.31 oracle
1390 oracle 16 0 16.2g 130m 127m S 2 0.4 0:00.24 oracle
1430 oracle 15 0 16.2g 147m 144m S 2 0.5 0:00.54 oracle
2666 root 24 0 11796 2308 1512 R 2 0.0 0:00.01 col_monitor_vms
1 root 15 0 780 308 256 S 0 0.0 30:41.61 init
2 root RT 0 0 0 0 S 0 0.0 0:54.24 migration/0
3 root 34 19 0 0 0 S 0 0.0 26:46.74 ksoftirqd/0
4 root RT 0 0 0 0 S 0 0.0 1:01.76 migration/1
5 root 34 19 0 0 0 S 0 0.0 11:01.18 ksoftirqd/1
6 root RT 0 0 0 0 S 0 0.0 2:29.09 migration/2
7 root 34 19 0 0 0 S 0 0.0 24:38.48 ksoftirqd/2
8 root RT 0 0 0 0 S 0 0.0 3:37.34 migration/3
9 root 34 19 0 0 0 S 0 0.0 11:00.63 ksoftirqd/3
10 root 10 -5 0 0 0 S 0 0.0 0:00.08 events/0
11 root 10 -5 0 0 0 S 0 0.0 0:00.06 events/1
12 root 10 -5 0 0 0 S 0 0.0 0:00.45 events/2
13 root 10 -5 0 0 0 S 0 0.0 0:00.21 events/3
14 root 10 -5 0 0 0 S 0 0.0 4:34.54 khelper
15 root 14 -5 0 0 0 S 0 0.0 0:00.00 kthread
21 root 10 -5 0 0 0 S 0 0.0 0:00.00 kblockd/0
22 root 10 -5 0 0 0 S 0 0.0 0:00.00 kblockd/1
23 root 10 -5 0 0 0 S 0 0.0 0:00.02 kblockd/2
24 root 10 -5 0 0 0 S 0 0.0 0:00.01 kblockd/3
25 root 10 -5 0 0 0 S 0 0.0 0:00.00 kacpid
206 root 15 0 0 0 0 S 0 0.0 0:00.30 pdflush
208 root 15 0 0 0 0 S 0 0.0 0:47.94 kswapd0
209 root 10 -5 0 0 0 S 0 0.0 0:00.00 aio/0
210 root 11 -5 0 0 0 S 0 0.0 0:00.00 aio/1
211 root 10 -5 0 0 0 S 0 0.0 0:00.00 aio/2
212 root 11 -5 0 0 0 S 0 0.0 0:00.00 aio/3
432 root 10 -5 0 0 0 S 0 0.0 0:00.00 cqueue/0
433 root 10 -5 0 0 0 S 0 0.0 0:00.00 cqueue/1
434 root 10 -5 0 0 0 S 0 0.0 0:00.00 cqueue/2
435 root 11 -5 0 0 0 S 0 0.0 0:00.00 cqueue/3
436 root 10 -5 0 0 0 S 0 0.0 0:00.00 kseriod
467 root 10 -5 0 0 0 S 0 0.0 0:00.00 kpsmoused
876 root 11 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_0
877 root 0 -20 0 0 0 S 0 0.0 0:00.00 lpfc_worker_0
878 root 19 -5 0 0 0 S 0 0.0 0:00.60 scsi_wq_0
879 root 10 -5 0 0 0 S 0 0.0 0:00.00 fc_wq_0
880 root 11 -5 0 0 0 S 0 0.0 0:00.00 fc_dl_0
941 root 11 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_1
942 root 0 -20 0 0 0 S 0 0.0 0:00.00 lpfc_worker_1
943 root 10 -5 0 0 0 S 0 0.0 0:00.63 scsi_wq_1
944 root 10 -5 0 0 0 S 0 0.0 0:00.00 fc_wq_1
945 root 11 -5 0 0 0 S 0 0.0 0:00.00 fc_dl_1
1035 root 10 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_2
1068 root 10 -5 0 0 0 S 0 0.0 0:00.00 ata/0
1069 root 12 -5 0 0 0 S 0 0.0 0:00.00 ata/1
1070 root 12 -5 0 0 0 S 0 0.0 0:00.00 ata/2
1071 root 12 -5 0 0 0 S 0 0.0 0:00.00 ata/3
1072 root 12 -5 0 0 0 S 0 0.0 0:00.00 ata_aux
1131 root 10 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_3
1132 root 10 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_4
1208 root 10 -5 0 0 0 S 0 0.0 0:00.01 khubd
1327 root 0 -20 0 0 0 S 0 0.0 133:19.23 upDispatch
1328 root 0 -20 0 0 0 S 0 0.0 1:31.46 upFailback
1329 root 0 -20 0 0 0 S 0 0.0 0:00.00 upPathValidate
1330 root 0 -20 0 0 0 S 0 0.0 0:00.12 upWorker
1331 root 0 -20 0 0 0 S 0 0.0 0:00.03 HwLogHandle
1332 root 10 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_5
1472 root 10 -5 0 0 0 S 0 0.0 0:01.00 reiserfs/0
1473 root 10 -5 0 0 0 S 0 0.0 0:00.13 reiserfs/1
1474 root 10 -5 0 0 0 S 0 0.0 0:00.14 reiserfs/2
1475 root 10 -5 0 0 0 S 0 0.0 0:01.30 reiserfs/3
1531 root 13 -4 7240 736 336 S 0 0.0 0:00.43 udevd
2486 root 20 0 0 0 0 S 0 0.0 0:00.00 shpchpd_event
3112 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3113 root 15 0 0 0 0 S 0 0.0 0:01.92 dmp_daemon
3114 root 15 0 0 0 0 S 0 0.0 0:00.86 dmp_daemon
3115 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3116 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3117 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3118 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3119 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3120 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
3121 root 15 0 0 0 0 S 0 0.0 0:00.00 dmp_daemon
总结我的解决思路如下:
1、关注在案发时v$session_wait进程,看系统到底在等待什么(主要是为了弄清楚状况)。
2、弄清楚案发时高消耗的Oracle进程到底是什么进程(如果是server process,相对比较容易,通过连接v$session,v$sqltext,v$process可以把正在执行的bad sql抓取出来,做优化;如果是background process,请提供相应的进程信息,做进一步诊断)。
附录:
以下是盖国强老师的通过system的pid抓取sql的脚本:
SELECT /*+ ORDERED */
sql_text
FROM v$sqltext a
WHERE (a.hash_value, a.address) IN (
SELECT DECODE (sql_hash_value,
0, prev_hash_value,
sql_hash_value
),
DECODE (sql_hash_value, 0, prev_sql_addr, sql_address)
FROM v$session b
WHERE b.paddr = (SELECT addr
FROM v$process c
WHERE c.spid = '&pid'))
ORDER BY piece ASC
/