热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

追根溯源数据库deadlock重启

每周日必出故障系列,似乎是看不惯我周末休息,老美的圣诞节也

每周日必出故障系列,似乎是看不惯我周末休息,老美的圣诞节也没镇住老美的数据库。

数据库在圣诞节最后一刻2节点宕机了!!!!当然按照重启法则似乎能解决一切,但是这里并不适用,重启之后故障重现,给了我足够的取证和分析时间。

Oracle database version 19.5 no CDB

下面看一组alert

2021-12-25T23:28:05.264842+08:00
ARC0 (PID:12469): Archived Log entry 452730 added for T-2.S-25671 ID 0xce9c4479 LAD:1
2021-12-25T23:28:08.840050+08:00 <<<<<<<<<<<Global Enqueue Services Deadlock detected (DID = 12_1_1572). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.
2021-12-25T23:28:19.648620+08:00
Global Enqueue Services Deadlock detected (DID = 12_1_1573). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.
2021-12-25T23:28:50.740136+08:00
Global Enqueue Services Deadlock detected (DID = 12_1_1574). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.
2021-12-25T23:29:01.455790+08:00
Thread 2 advanced to log sequence 25673 (LGWR switch)
Current log# 8 seq# 25673 mem# 0: +DATAC1/xxxx/ONLINELOG/group_8.884.1090597299
2021-12-25T23:29:02.815247+08:00
ARC1 (PID:12477): Archived Log entry 452732 added for T-2.S-25672 ID 0xce9c4479 LAD:1
2021-12-25T23:29:04.722593+08:00
Global Enqueue Services Deadlock detected (DID = 12_1_1575). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.
…省略…
2021-12-25T23:46:50.130826+08:00
Global Enqueue Services Deadlock detected (DID = 12_1_1628). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.
2021-12-25T23:46:52.969550+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lms1_8986_8992.trc (incident=2633432): <<<<ORA-04031: unable to allocate 168 bytes of shared memory (“shared pool”,“unknown object”,“sga heap(4,0)”,“gcs dynamic shadows lms”)
Incident details in: /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2633432/xxxx2_lms1_8986_8992_i2633432.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2021-12-25T23:46:53.052028+08:00
ORA-04031 heap dump being written to trace file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2633432/xxxx2_lms1_8986_8992_i2633432.trc
2021-12-25T23:46:54.659917+08:00
Global Enqueue Services Deadlock detected (DID = 12_1_1629). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.
2021-12-25T23:47:04.677474+08:00
Global Enqueue Services Deadlock detected (DID = 12_1_1630). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.
2021-12-25T23:47:08.264118+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_j006_276744.trc (incident=2634840):
ORA-04031: unable to allocate 56 bytes of shared memory (“shared pool”,“unknown object”,“KKSSP^11005”,“kglseshtSegs”)
Incident details in: /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2634840/xxxx2_j006_276744_i2634840.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2021-12-25T23:47:08.980300+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_j006_276744.trc:
ORA-04031: unable to allocate 56 bytes of shared memory (“shared pool”,“unknown object”,“KKSSP^11005”,“kglseshtSegs”)
2021-12-25T23:47:09.696941+08:00
Process J006 died, see its trace file
2021-12-25T23:47:09.697025+08:00
kkjcre1p: unable to spawn jobq slave process
2021-12-25T23:47:09.697096+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_cjq0_12905.trc:
2021-12-25T23:47:11.501606+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_mmon_9046.trc (incident=2633640):
ORA-04031: unable to allocate 40 bytes of shared memory (“shared pool”,“select size_for_estimate, …”,“KGLH0^51522e0b”,“kglHeapInitialize:temp”)
Incident details in: /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2633640/xxxx2_mmon_9046_i2633640.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2021-12-25T23:48:03.057455+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_j009_279833.trc (incident=2635032):
ORA-04031: unable to allocate 56 bytes of shared memory (“shared pool”,“unknown object”,“KKSSP^9837”,“kglseshtSegs”)
Incident details in: /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2635032/xxxx2_j009_279833_i2635032.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2021-12-25T23:48:03.470720+08:00
LMS1 (ospid: 8986_8992) has not called a wait for 50 secs.
2021-12-25T23:48:04.135748+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_j009_279833.trc:
ORA-04031: unable to allocate 56 bytes of shared memory (“shared pool”,“unknown object”,“KKSSP^9837”,“kglseshtSegs”)
2021-12-25T23:48:04.481052+08:00
Process J009 died, see its trace file
2021-12-25T23:48:04.481136+08:00
kkjcre1p: unable to spawn jobq slave process
2021-12-25T23:48:04.481196+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_cjq0_12905.trc:
<<<<<2021-12-25T23:48:06.155760+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmhb_9009.trc (incident=2633480):
ORA-29770: global enqueue process LMS1 (OSID 8986_8992) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2633480/xxxx2_lmhb_9009_i2633480.trc
2021-12-25T23:48:08.689426+08:00
LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_xxxx by memno 1
2021-12-25T23:48:09.325166+08:00
Process termination requested for pid 8986_8992 [source = rdbms], [info = 0] [request issued by pid: 9009, uid: 1001]
2021-12-25T23:48:09.325191+08:00
Process termination requested for pid 8984_8991 [source = rdbms], [info = 0] [request issued by pid: 9009, uid: 1001]
2021-12-25T23:48:09.325281+08:00
Process termination requested for pid 9020 [source = rdbms], [info = 0] [request issued by pid: 9009, uid: 1001]
LMHB (ospid: 9009): terminating the instance due to ORA error 29770
2021-12-25T23:48:09.350502+08:00
Cause - ‘ERROR: Some process(s) is not making progress.
LMHB (ospid: 9009) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s)’
2021-12-25T23:48:09.393157+08:00
Dumping diagnostic data in directory=[cdmp_20211225234809], requested by (instance=2, osid=9009 (LMHB)), summary=[abnormal instance termination].
2021-12-25T23:48:09.662023+08:00
ORA-1092 : opitsk aborting process
2021-12-25T23:48:09.682359+08:00
ORA-1092 : opitsk aborting process
2021-12-25T23:48:10.708035+08:00
License high water mark = 211
2021-12-25T23:48:11.562372+08:00
Instance terminated by LMHB, pid = 9009
2021-12-25T23:48:11.663362+08:00
Warning: 2 processes are still attacheded to shmid 2135162902:
(size: 2097152 bytes, creator pid: 7506, last attach/detach pid: 280327)
2021-12-25T23:48:12.708585+08:00
USER(prelim) (ospid: 280324): terminating the instance
2021-12-25T23:48:12.710943+08:00
Instance terminated by USER(prelim), pid = 280324
2021-12-25T23:48:15.646596+08:00
Starting ORACLE instance (normal) (OS id: 280415)
2021-12-25T23:48:15.665292+08:00

2021-12-25T23:28:08.840050开始数据库出现Deadlock ,从LMDtrace可以看到是业务相关SQL,从LMDtrace可以分析得出是1节点该SQL阻塞了2节点,在这一瞬间,1节点持有latch,2节点想要获取,产生了Deadlock,继续分析。

Global Wait-For-Graph(WFG) for GES Deadlock ID=[12_1_1630]
------------------------------------------------------------------------
Victim : (instance=2, lock=0x5a206ff10)
Start (master) Instance : 2
Number of Locks involved : 4
Number of Sessions involved : 2
User session identified by:
{
User Name : fedc
User Machine : localhost.localdomaincn
OS Terminal Name : UNKNOWN
OS Process ID : 977078
OS Program Name : sqlldr@localhost.localdomain (TNS V1-V3)
Application Name : SQL Loader Conventional Path Load V1-V3)V3)
Action Name : 5/load.ctl1640446941377_182108eanup Slave
Current SQL : INSERT INTO xxxxxxx (ID,UDN,TIME) VALUES (:1,:2,CAST(TO_DATE(:TIME, 'yyyy-mm-dd hh24:mi:ss') AS DATE))
Session Number : 2918
Session Serial Number : 16119
Server Process ORAPID : 252
Server Process OSPID : 251878
Instance : 2
}
waiting for Lock 0x5a206ff10 (Transaction):
{
Lock Level : KJUSERPR
Resource Name : TX 0x2b8000e.0x7cee(ext 0x0,0x0)
GES Transaction ID : FC000-0002-00000B1C
}
which is blocked by Lock 0xaf1b8a50 (Transaction):
{
Lock Level : KJUSERPR
Resource Name : TX 0x2b8000e.0x7cee(ext 0x0,0x0)
GES Transaction ID : 21B000-0001-0000001E
}
owned by the
User session identified by:
{
User Name : fedc
User Machine : localhost.localdomaincn
OS Terminal Name : UNKNOWN
OS Process ID : 32302
OS Program Name : sqlldr@localhost.localdomain (TNS V1-V3)
Application Name : SQL Loader Conventional Path Load V1-V3)
Action Name : 5/load.ctl1640446930728_722208
Current SQL : INSERT INTO xxxxxxx (ID,UDN,TIME) VALUES (:1,:2,CAST(TO_DATE(:TIME, 'yyyy-mm-dd hh24:mi:ss') AS DATE))
Session Number : 11008
Session Serial Number : 1635
Server Process ORAPID : 539
Server Process OSPID : 88755
Instance : 1
}
waiting for Lock 0x2e5942bc8 (Transaction):
{
Lock Level : KJUSERPR
Resource Name : TX 0xe590011.0xfcec(ext 0x0,0x0)
GES Transaction ID : 21B000-0001-0000001E
}

2021-12-25T23:46:52.969550数据库出现内存不足 报ORA-4031,无法获取heap(4,0)即第四个subpool的0号持久内存。从trace中可以看到lms进程在等待gc message
以下是LMS trace

*** 2021-12-25T23:46:50.433900+08:00
2021-12-25 23:46:50.433 : GSIPC:LOOP: cycle time(us) 544177 active 544155 rl/rq/wq/mq 0/0/0/0
2021-12-25 23:46:50.433 : GSIPC:LOOP: wait/busy(us) 22/0 sndf 0 pmsg 0 sync 0 frez 0 fbufs 0 rcvm 544177 othr 0
===[ LM History ]===
KJM HIST LMS1 (most recent first):
7:0u 6:0u 10:0u 17:0u 16:0u 18:0u 12:544177u 7:0u 6:0u 10:0u
17:0u 16:0u 14:32:0u 1:0u 12:1:0u 7:0u 6:0u 10:0u 17:0u 16:0u
12:0u 7:0u 6:0u 10:0u 17:0u 16:0u 13:65521:0u 20:32:0u 20:32:0u 20:32:0u
1:0u 14:32:0u 1:0u 12:2:0u 7:0u 6:0u 10:0u 17:0u 16:0u 12:0u
7:0u 6:0u 10:0u 17:0u 16:0u 14:32:0u 1:0u 12:1:0u 7:0u 6:0u
10:0u 17:0u 16:0u 12:0u 7:0u 6:0u 10:0u 17:0u 16:0u 14:32:0u
1:0u 12:1:0u 7:0u 6:0u
===[ Process Comm State Object ]===
GES MSG BUFFERS: st=emp chunk=(nil) hdr=(nil) lnk=(nil) flags=0x1000 inc=12
outq=0 sndq=0 rmsq=0 opid=25
prmb=(nil) : (id 0 st 0x0 where 0 curq (nil))
smp[i]=(1 0x60dd02980) smp[b]=(1 0x60dd02e10) pmp=0x503aba6e0
mbg[i]=(1 2482) mbg[b]=(4 0) mbg[r]=(0 0)
fmq[i]=(30 72) fmq[b]=(20 33) fmq[r]=(0 0)
mop[s]=1068966156 mop[q]=769906694 pendq=0 zmbq=0 tmoq=0 nonksxp_recvs=0
SO dump_time_stamp=0
SO SendQ=0 start_tm=0 end_tm=0
SO FreeQ reg=72 start_tm=3deb6c end_tm=1edf46
SO FreeQ reg : 32:1 43:2 44:66 114:3
SO FreeQ bth=33 start_tm=3debe3 end_tm=3ce553
SO FreeQ bth : rsv:(tot 33 ib 0 pb 33)
SO FreeQ bth : embed: 32:20 34:1327 36:16
SO FreeQ rsv=0 start_tm=0 end_tm=0
===[ Session Wait History ]===
Current Wait Stack:
Not in wait; last wait ended 0.000393 sec ago
Wait State:
fixed_waits=0 flags=0x21 boundary=(nil)/-1
Session Wait History:
elapsed time of 0.000410 sec since last wait
0: waited for 'gcs remote message'
waittime=0x1e, poll=0x0, event=0x0
wait_id=307140422 seq_num=43764 snap_id=1
wait times: snap=0.000022 sec, exc=0.000022 sec, total=0.000022 sec
wait times: max=infinite
wait counts: calls=1 os=1
occurred after 0.000005 sec of elapsed time
1: waited for 'gcs remote message'
waittime=0x1e, poll=0x0, event=0x0
wait_id=307140421 seq_num=43763 snap_id=1
wait times: snap=0.000007 sec, exc=0.000007 sec, total=0.000007 sec
wait times: max=infinite
wait counts: calls=1 os=1
occurred after 0.000001 sec of elapsed time
.......省略.........
*** 2021-12-25T23:46:52.970095+08:00
2021-12-25T23:46:52.970082+08:00
Incident 2633432 created, dump file: /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2633432/xxxx2_lms1_8986_8992_i2633432.trc
ORA-04031: unable to allocate 168 bytes of shared memory ("shared pool","unknown object","sga heap(4,0)","gcs dynamic shadows lms")
TOC00000 - Table of contents
TOC00001 - Error Stack
TOC00002 - Dump for incident 2633432 (ORA 4031)
| TOC00003 - Beginning of Customized Incident Dump(s)
| | TOC00004 - Call Stack Trace
| | TOC00005 - Kernel Stack Trace
| | TOC00006 - Process Map Dump
| | TOC00007 - LIBRARY CACHE DUMP
End of TOC
*** 2021-12-25T23:48:06.100278+08:00
Received ORADEBUG command (#11) 'dump KSTDUMPCURPROC 1' from process '9009'
-------------------------------------------------------------------------------
Trace Bucket Dump Begin: default bucket for process 25 (osid: 8986_8992, LMS1)
TIME(*=approx):SEQ:COMPONENT:FILE@LINE:FUNCTION:SECT/DUMP:SID:SERIAL#: [EVENT#:PID] DATA
-------------------------------------------------------------------------------
2021-12-25 23:45:54.768 :A649BD31:db_trace:ksl2.c@3492:kslwtectx():14451:2949: [10005:25] KSL WAIT END [latch: gcs resource hash] 25702449152/0x5fbfc4000 253/0xfd 0/0x0 wait_id=306599374 seq_num=26996 snap_id=1 <<<<<<<<<<<<
2021-12-25 23:45:54.768 :A649BD32:db_trace:ksl2.c@3496:kslwtectx():14451:2949: [10005:25] KSL WAIT END wait times (usecs) - snap=316, exc=316, tot=316
.....省略......
2021-12-25 23:46:51.261 :A79A1459:db_trace:ksl2.c@3496:kslwtectx():14451:2949: [10005:25] KSL WAIT END wait times (usecs) - snap=103, exc=103, tot=103
-------------------------------------------------------------------------------
Trace Bucket Dump End: default bucket for process 25 (osid: 8986_8992, LMS1)
*** 2021-12-25T23:48:06.101690+08:00
Finished processing ORADEBUG command (#11) 'dump KSTDUMPCURPROC 1'

2021-12-25T23:48:06.155760 LMHD(LOCAL MANAGER HeartBeat)进程监控到LMS进程超时达到70s,且等待事件是sga相关的。之后数据库宕机。
以下是LMDB trace

*** 2021-12-25T23:48:03.470524+08:00
==============================
LMS1 (ospid: 8986_8992) has not moved for 71 sec (1640447283.1640447212)
: heartbeat check status 6 (no-heartbeat) (threshold 70 sec)
==================================================
=== LMS1 (ospid: 8986_8992) Heartbeat Report
==================================================
LMS1 (ospid: 8986_8992) has no heartbeats for 71 sec. (threshold 70) <<<<
: heartbeat state 0x2.ffff () pso-flag 0x100
: Not in wait; last wait ended 50 secs ago.
: last wait_id 307162779 at 'latch: shared pool'.
==============================
Dumping PROCESS LMS1 (ospid: 8986_8992) States
==============================
===[ System Load State ]===
CPU Total 25 Raw 25 Core 48 Socket 2
Load normal: Cur 2856 Highmark 122880 (11.15 480.00)
===[ Latch State ]===
Not in Latch Get
===[ Session State Object ]===
----------------------------------------
SO: 0x623e95c00, type: session (4), map: 0x61e1d07d0
state: LIVE (0x4532), flags: 0x5
owner: 0x623fe0458, proc: 0x623fe0458
link: 0x623e95c20[0x61bd43ab8, 0x54bed7c28]
child list count: 1, link: 0x623e95c70[0x603fb84c0, 0x603fb84c0]
pg: 1
SOC: 0x61e1d07d0, type: session (4), map: 0x623e95c00
state: LIVE (0x99fc), flags: INIT (0x1)
(session) sid: 14451 ser: 2949 trans: (nil), creator: 0x62000f978
flags: (0x80000011) USR/- flags2: (0x409) -/-/INC
flags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/-
DID: 0000-0019-000000030000-0000-00000000, short-term DID:
txn branch: (nil)
edition#: 0 user#/name: 0/SYS
oct: 0, prv: 0, sql: (nil), psql: (nil)
stats: 0x5fede9c98, PX stats: 0x129fa544
service name: SYS$BACKGROUND
Current Wait Stack:
Not in wait; last wait ended 51.130431 sec ago
Wait State:
fixed_waits=0 flags=0x20 boundary=(nil)/-1
Session Wait History:
elapsed time of 51.130459 sec since last wait
0: waited for 'latch: shared pool' <<<<
address=0x608c20c8, number=0x26b, why=0x60187f18
wait_id=307162779 seq_num=611 snap_id=1
wait times: snap=0.001530 sec, exc=0.001530 sec, total=0.001530 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 19.636973 sec of elapsed time
1: waited for 'PGA memory operation' <<<<
=0x130000, =0x1, =0x0
wait_id=307162778 seq_num=610 snap_id=1
wait times: snap=0.000006 sec, exc=0.000006 sec, total=0.000006 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.078517 sec of elapsed time
2: waited for 'SGA: allocation forcing component growth' <<<<
=0x0, =0x0, =0x0
wait_id=307162752 seq_num=609 snap_id=26
wait times: snap=0.000000 sec, exc=0.259441 sec, total=0.263625 sec
wait times: max=infinite
wait counts: calls=25 os=25
occurred after 0.000000 sec of elapsed time
.......省略......

分析到这里思路略有清晰,开始怀疑是不是内存问题导致的,DeadLock 和sga到底谁是因谁是果,陷入沉思!!!既然要确定因果那么就看下dba_hist_active_sess_history吧.
分析完日志似乎并不能得到准确答案,接下来进入数据库查一些问题,从下图看到在23:27分开始两个节点执行次数是开始增长的,对比alert日志(23:28:08开始抛出的死锁信息)

select to_char(sample_time,'YYYY-MM-DD hh24:mi:ss') mtime,instance_number,count(*)
from awr_1225 where
sample_time > to_date('2021-12-25 23:10:00','YYYY-MM-DD hh24:mi:ss')
and
sample_timegroup by to_char(sample_time,'YYYY-MM-DD hh24:mi:ss') ,instance_number
order by 1,2,3

image.png

继续分析,发现执行次数突增的SQL并没有一些规律,似乎所有SQL都慢。

select * from (
select to_char(sample_time,'YYYY-MM-DD hh24:mi:ss') mtime,instance_number,sql_id,count(*) coun
from awr_1225 where
sample_time > to_date('2021-12-25 23:10','YYYY-MM-DD hh24:mi:ss')
and sample_timegroup by to_char(sample_time,'YYYY-MM-DD hh24:mi:ss') ,instance_number,sql_id
order by 1,2,4 ) where coun >8

image.png
针对以上圈出的三个SQL我们查一下等待事件
image.png
image.png
image.png
等待事件过多的是gc freelist,看到这里一点一点确认内存可能真的有些问题了,但分析问题还是要论证清楚。
DeadLock报出的SQL我们也要看下。

select to_char(sample_time,'YYYY-MM-DD hh24:mi:ss') mtime,instance_number,sql_id, blocking_inst_id,
blocking_session,blocking_session_serial#,time_waited,event
from awr_1225 where sql_id=' c2n8va5bpatuu ' and
sample_time > to_date('2021-12-25 23:10','YYYY-MM-DD hh24:mi:ss')
and sample_timeorder by 1

image.png
image.png
而阻塞c2n8va5bpatuu的三个session又十分有意思,从sql_exec_id确认这是一次执行,执行了至少4分钟没有执行完,这个SQL也又必须要看下。
image.png
image.png
image.png
接下来看下这个SQL的历史执行情况,由于是sqlloader导入的,sql_plan_hash_value是0,一瞬间我看一个insert要需要数毫秒至数百毫秒(这是一小时内的平均值,上面那三个session执行数分钟),—这一定有问题。
可当我看到21号的执行情况时打消了我对这个SQL的怀疑,21号也有398毫秒的情况,这个跟一次导入的数据量有一定关系,但是这里看到的insert需要数毫秒至数百毫秒也不是根。
image.png
image.png
追根溯源,那么这个根在哪

2021-12-25T23:28:08.840050 数据库开始deadlock ,产生deadlock的SQL在这个时间段内增长并不典型,且执行效率慢并不能确定是根。
2021-12-25T23:46:52.969550 数据库开始出现内存不足,直接导致数据库宕机,产生的等待是sga相关内存等待。
在故障之前active session在300左右,主要的SQL等待事件23:26分开始是gc freelist,甚至在23:18分也有gc freelist等待。

问题开始浮出水面
image.png
image.png
24G的sga,shared pool居然占到了21G,buffer cache可用不足1G,难怪一直报free list。继续看占用sharedpool比较大的是 SQLA和KGLHO,SQLA存放的是SQL文本,KGLHO(Kernel Generic Library Handle)是用来管理library cache内存分配的。这两个占了9G,需要看下数据库硬解析了。
image.png
看到这里问题已经明确了,硬解析比率占到了4%,这已经很高了

select substr(a.SQL_TEXT,1,80),count(*) cont
from v$sqlarea a
group by substr(a.SQL_TEXT,1,80)
having count(*) > 10
order by cont desc ;

image.png

追根溯源,到底是硬解析占用了内存导致圣诞节也不平安。

温故而知新。

数据库硬解析较高,占用了大量内存(9G,sga 24G),致使ASMM管理下的sharedpool在极具膨胀,受罪的就是buffer(不足1G),一条SQL执行是需要buffer来缓存storage数据块的,那么buffer cache中的LRU list就会快速老化,不断频繁的从buffer中申请内存和等待申请内存,原本正常执行的SQL就增加了等待内存分配的时间,反过来 shared pool持有latch时间也随之边长,需要更多的shared pool,在不断地占地盘,直至4031,2节点无法分配内存来存储1节点返回来的gc message,lms等待71s,自动重启。


推荐阅读
  • 本文分享了一个关于在C#中使用异步代码的问题,作者在控制台中运行时代码正常工作,但在Windows窗体中却无法正常工作。作者尝试搜索局域网上的主机,但在窗体中计数器没有减少。文章提供了相关的代码和解决思路。 ... [详细]
  • 基于PgpoolII的PostgreSQL集群安装与配置教程
    本文介绍了基于PgpoolII的PostgreSQL集群的安装与配置教程。Pgpool-II是一个位于PostgreSQL服务器和PostgreSQL数据库客户端之间的中间件,提供了连接池、复制、负载均衡、缓存、看门狗、限制链接等功能,可以用于搭建高可用的PostgreSQL集群。文章详细介绍了通过yum安装Pgpool-II的步骤,并提供了相关的官方参考地址。 ... [详细]
  • 本文介绍了Oracle数据库中tnsnames.ora文件的作用和配置方法。tnsnames.ora文件在数据库启动过程中会被读取,用于解析LOCAL_LISTENER,并且与侦听无关。文章还提供了配置LOCAL_LISTENER和1522端口的示例,并展示了listener.ora文件的内容。 ... [详细]
  • Spring特性实现接口多类的动态调用详解
    本文详细介绍了如何使用Spring特性实现接口多类的动态调用。通过对Spring IoC容器的基础类BeanFactory和ApplicationContext的介绍,以及getBeansOfType方法的应用,解决了在实际工作中遇到的接口及多个实现类的问题。同时,文章还提到了SPI使用的不便之处,并介绍了借助ApplicationContext实现需求的方法。阅读本文,你将了解到Spring特性的实现原理和实际应用方式。 ... [详细]
  • 本文介绍了Hyperledger Fabric外部链码构建与运行的相关知识,包括在Hyperledger Fabric 2.0版本之前链码构建和运行的困难性,外部构建模式的实现原理以及外部构建和运行API的使用方法。通过本文的介绍,读者可以了解到如何利用外部构建和运行的方式来实现链码的构建和运行,并且不再受限于特定的语言和部署环境。 ... [详细]
  • ZSI.generate.Wsdl2PythonError: unsupported local simpleType restriction ... [详细]
  • 本文介绍了通过ABAP开发往外网发邮件的需求,并提供了配置和代码整理的资料。其中包括了配置SAP邮件服务器的步骤和ABAP写发送邮件代码的过程。通过RZ10配置参数和icm/server_port_1的设定,可以实现向Sap User和外部邮件发送邮件的功能。希望对需要的开发人员有帮助。摘要长度:184字。 ... [详细]
  • Linux如何安装Mongodb的详细步骤和注意事项
    本文介绍了Linux如何安装Mongodb的详细步骤和注意事项,同时介绍了Mongodb的特点和优势。Mongodb是一个开源的数据库,适用于各种规模的企业和各类应用程序。它具有灵活的数据模式和高性能的数据读写操作,能够提高企业的敏捷性和可扩展性。文章还提供了Mongodb的下载安装包地址。 ... [详细]
  • 本文讨论了在openwrt-17.01版本中,mt7628设备上初始化启动时eth0的mac地址总是随机生成的问题。每次随机生成的eth0的mac地址都会写到/sys/class/net/eth0/address目录下,而openwrt-17.01原版的SDK会根据随机生成的eth0的mac地址再生成eth0.1、eth0.2等,生成后的mac地址会保存在/etc/config/network下。 ... [详细]
  • Redis底层数据结构之压缩列表的介绍及实现原理
    本文介绍了Redis底层数据结构之压缩列表的概念、实现原理以及使用场景。压缩列表是Redis为了节约内存而开发的一种顺序数据结构,由特殊编码的连续内存块组成。文章详细解释了压缩列表的构成和各个属性的含义,以及如何通过指针来计算表尾节点的地址。压缩列表适用于列表键和哈希键中只包含少量小整数值和短字符串的情况。通过使用压缩列表,可以有效减少内存占用,提升Redis的性能。 ... [详细]
  • WhenIusepythontoapplythepymysqlmoduletoaddafieldtoatableinthemysqldatabase,itdo ... [详细]
  • 数字账号安全与数据资产问题的研究及解决方案
    本文研究了数字账号安全与数据资产问题,并提出了解决方案。近期,大量QQ账号被盗事件引起了广泛关注。欺诈者对数字账号的价值认识超过了账号主人,因此他们不断攻击和盗用账号。然而,平台和账号主人对账号安全问题的态度不正确,只有用户自身意识到问题的严重性并采取行动,才能推动平台优先解决这些问题。本文旨在提醒用户关注账号安全,并呼吁平台承担起更多的责任。令牌云团队对此进行了长期深入的研究,并提出了相应的解决方案。 ... [详细]
  • 本文介绍了将mysql从5.6.15升级到5.7.15的详细步骤,包括关闭访问、备份旧库、备份权限、配置文件备份、关闭旧数据库、安装二进制、替换配置文件以及启动新数据库等操作。 ... [详细]
  • Imtryingtofigureoutawaytogeneratetorrentfilesfromabucket,usingtheAWSSDKforGo.我正 ... [详细]
  • Spring学习(4):Spring管理对象之间的关联关系
    本文是关于Spring学习的第四篇文章,讲述了Spring框架中管理对象之间的关联关系。文章介绍了MessageService类和MessagePrinter类的实现,并解释了它们之间的关联关系。通过学习本文,读者可以了解Spring框架中对象之间的关联关系的概念和实现方式。 ... [详细]
author-avatar
峡谷人123_742
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有