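Recently I have received a run of cases in which ASM disks were dismounted with the error "Waited 15 secs for write IO to PST". This comes from an ASM-specific heartbeat timeout check: the ASM instance periodically checks whether each ASM disk responds normally. I decided to write a short summary of the issue.
The document ASM diskgroup dismount with "Waited 15 secs for write IO to PST" (Doc ID 1581684.1) contains the following description: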
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Generally this kind messages comes in ASM alertlog file on below situations,
Delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup,
thus the ASM instance dismount the diskgroup.By default, it is 15 seconds.
By the way the heart beat delays are sort of ignored for external redundancy diskgroup.
ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,
but the heart beat delays do not dismount external redundancy diskgroup directly.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
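The description above comes down to the following points:
1. The ASM instance periodically checks the disks of every diskgroup to confirm that they still respond;
2. The timeout only dismounts diskgroups with normal or high redundancy; for external redundancy the heartbeat delay does not directly dismount the diskgroup;
3. The default timeout is 15 seconds: if the disks of a diskgroup still have not responded to the ASM instance after 15 seconds, the diskgroup is dismounted.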
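The three points can be restated as a small decision function. This is only a toy model of my reading of the note, not Oracle's actual logic; the function name `would_dismount`, its parameters, and the choice to treat a delay equal to the timeout as a timeout are all assumptions made for illustration.

```python
# Toy model of the three points above; not Oracle's actual implementation.
DEFAULT_HBEAT_IO_WAIT_SECS = 15  # default quoted from Doc ID 1581684.1


def would_dismount(redundancy: str, heartbeat_delay_secs: float,
                   timeout_secs: float = DEFAULT_HBEAT_IO_WAIT_SECS) -> bool:
    """Return True if a delayed PST heartbeat would dismount the diskgroup."""
    redundancy = redundancy.lower()
    if redundancy not in {"external", "normal", "high"}:
        raise ValueError(f"unknown redundancy: {redundancy!r}")
    if redundancy == "external":
        # Point 2: delays do not directly dismount external redundancy diskgroups.
        return False
    # Point 3 (assumption: a delay that reaches the timeout counts as a timeout).
    return heartbeat_delay_secs >= timeout_secs


if __name__ == "__main__":
    for redundancy, delay in [("normal", 3), ("high", 20), ("external", 20)]:
        print(redundancy, delay, would_dismount(redundancy, delay))
```

The model ignores how many disks are affected, which in practice also matters for whether the diskgroup can stay mounted.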
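The customers who hit this problem all use Fibre Channel network storage, and the error is triggered when the storage network has trouble. In other words, when ASM issues its periodic check and a disk does not answer within 15 seconds, ASM considers the disk inaccessible.
I tried to reproduce the error in a test environment. Because the test environment is a VMware virtual machine, removing a disk at the physical level does not trigger it: when a disk on the same host is removed abnormally, ASM's read operation immediately returns an OS-level I/O error, so there is no wait for the "Waited 15 secs for write IO to PST" timeout.
My conclusion is that this error only occurs when the shared ASM disks are not local to the physical host but sit on a storage network, and the check that ASM sends out is not answered in time. The cause may be the storage host, the storage network, or even the storage disks. Anyway, from ASM's point of view: I did not receive the acknowledgement I need, so I consider you faulty; and if enough disks are faulty to threaten data integrity, I dismount the diskgroup.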
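When an incident like this comes in, a first step is to see how often the message shows up in the ASM alert log. Below is a minimal sketch that counts the occurrences. Only the quoted message text comes from this article; the helper name `count_pst_waits` and the sample lines (including everything after "PST") are made up for illustration.

```python
import re
from collections import Counter

# Matches the message quoted in this article and captures the number of seconds.
PST_WAIT = re.compile(r"Waited (\d+) secs for write IO to PST")


def count_pst_waits(lines):
    """Count PST write-IO wait messages, grouped by the reported wait in seconds."""
    counts = Counter()
    for line in lines:
        match = PST_WAIT.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Illustrative sample only; not copied from a real alert log.
SAMPLE_LOG = """\
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
NOTE: some unrelated alert log line
"""

if __name__ == "__main__":
    print(count_pst_waits(SAMPLE_LOG.splitlines()))
```

To scan a real log, pass an open file object instead of the sample lines; the location of the alert log depends on the installation.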
According to document 1581684.1, the "Waited 15 secs for write IO to PST" message was introduced in 11.2.0.3.0. The document also describes how to change this timeout manually through the parameter _asm_hbeatiowait:
alter system set "_asm_hbeatiowait"=
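<ASM/CRS must be restarted for the change to take effect.>
To confirm that this parameter first appeared in 11.2.0.3, I ran the query on each database version listed below: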
======================10.2=====================
SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - Prod
PL/SQL Release 10.2.0.5.0 - Production
CORE 10.2.0.5.0 Production
TNS for Linux: Version 10.2.0.5.0 - Production
NLSRTL Version 10.2.0.5.0 - Production

SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm%' order by ksppinm;

hidden parameter                 value
-------------------------------- ----------
_asm_acd_chunks                  1
_asm_allow_only_raw_disks        TRUE
_asm_allow_resilver_corruption   FALSE
_asm_ausize                      1048576
_asm_blksize                     4096
_asm_direct_con_expire_time      120
_asm_disk_repair_time            14400
_asm_droptimeout                 60
_asm_emulmax                     10000
_asm_emultimeout                 0
_asm_fob_tac_frequency           3
_asm_instlock_quota              0
_asm_kfdpevent                   0
_asm_libraries                   ufs
_asm_maxio                       1048576
_asm_skip_resize_check           FALSE
_asm_stripesize                  131072
_asm_stripewidth                 8
_asm_wait_time                   18
_asmlib_test                     0
_asmsid                          asm

21 rows selected.

======================11.2.0.1=====================
sqlplus / as sysdba
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;

hidden parameter                 value
-------------------------------- ----------
_asm_hbeatwaitquantum            2

======================11.2.0.2=====================
$ sqlplus / as sysdba
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining and Real Application Testing options

SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;

hidden parameter                 value
-------------------------------- ----------
_asm_hbeatwaitquantum            2

The parameter only appears from 11.2.0.3.0 onward, which means the ASM instance's disk timeout check was introduced in 11.2.0.3.

======================11.2.0.3=====================
sys@R11203> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production

SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;

hidden parameter                 value
-------------------------------- ----------
_asm_hbeatiowait                 15
_asm_hbeatwaitquantum            2

======================11.2.0.4=====================
SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - Production

SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;

hidden parameter                 value
-------------------------------- ----------
_asm_hbeatiowait                 15   <<<<<<<<<<<<<<<<<<<<
_asm_hbeatwaitquantum            2

======================12.1.0.1=====================
$ sqlplus / as sysdba
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;

hidden parameter                 value
-------------------------------- ----------
_asm_hbeatiowait                 15
_asm_hbeatwaitquantum            2

From 12.1.0.2 onward, the default value of this parameter is 120s.

======================12.1.0.2=====================
$ sqlplus / as sysdba
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;

hidden parameter                 value
-------------------------------- ----------
_asm_hbeatiowait                 120
_asm_hbeatwaitquantum            2
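For quick reference, the survey results can be folded into a small lookup. This is just a sketch that encodes what I observed; the function names `parse_version` and `default_hbeatiowait` are hypothetical, and the values for versions I did not query (anything between or beyond the releases shown) are extrapolated, not verified.

```python
def parse_version(text):
    """Turn '11.2.0.3' or '11.2.0.3.0' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def default_hbeatiowait(version):
    """Default _asm_hbeatiowait in seconds as seen in the survey, or None if absent."""
    v = parse_version(version)
    if v < (11, 2, 0, 3):
        # 10.2.0.5, 11.2.0.1 and 11.2.0.2 did not list the parameter.
        return None
    if v < (12, 1, 0, 2):
        # 11.2.0.3, 11.2.0.4 and 12.1.0.1 showed 15.
        return 15
    # 12.1.0.2 showed 120; later versions are an assumption.
    return 120


if __name__ == "__main__":
    for version in ["10.2.0.5.0", "11.2.0.2", "11.2.0.3.0", "12.1.0.1", "12.1.0.2"]:
        print(version, default_hbeatiowait(version))
```

On a real system the value should still be confirmed with the query above, since a hidden parameter may have been changed manually.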
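I hope this summary is useful to you. In day-to-day work I often catch myself thinking that a problem is simple and yet not being sure about it, so after testing I write it down for future reference.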