今天遇到个rac的异步IO的问题,环境是oracle 10201的双节点rac,存储使用ASM,机器是HP 的PA的机器,操作系统是11.23.
在udump下的日志中有大量的报错:
oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh1_ora_19552.trc
/oracle/oracle/admin/gzxnh/bdump/gzxnh1_ora_19552.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2
System name: HP-UX
Node name: rp8440a
Release: B.11.23
Version: U
Machine: 9000/800
Instance name: gzxnh1
Redo thread mounted by this instance: 1
Oracle process number: 0
Unix process pid: 19552, image: oracle@rp8440a
Ioctl ASYNC_CONFIG error, errno = 1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump>catgzxnh1_ora_19552.trc
/oracle/oracle/admin/gzxnh/bdump/gzxnh1_ora_19552.trc
OracleDatabase10gEnterpriseEditionRelease10.2.0.1.0-64bitProduction
WiththePartitioning,RealApplicationClusters,OLAPandDataMiningoptions
ORACLE_HOME=/oracle/oracle/product/10.2.0/db_2
Systemname:HP-UX
Nodename:rp8440a
Release:B.11.23
Version:U
Machine:9000/800
Instancename:gzxnh1
Redothreadmountedbythisinstance:1
Oracleprocessnumber:0
Unixprocesspid:19552,image:oracle@rp8440a
IoctlASYNC_CONFIGerror,errno=1
看到这个报错,觉得很轻松,之前也遇到过,是oracle用户的dba组在操作系统上没有MLOCK的权限,在这里的文章,我也写到过。由于hp的11.23的系统,默认是开asyncdsk的。于是就用root给oracle用户的dba组加MLOCK权限了。
可是,等加完之后,问题就出现了:登录的提示就是不完整的,登录之后select所有信息,都会报没有连接数据库:
oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump> sqlplus "/ as sysdba"
SQL*Plus: Release 10.2.0.1.0 - Production on Thu Dec 3 10:28:47 2009
Copyright (c) 1982, 2005, Oracle. All rights reserved.
Connected.
SQL> select * from dual;
select * from dual
*
ERROR at line 1:
ORA-01012: not logged on
SQL> conn /a as sysdba
Connected to an idle instance.
SQL>
SQL>
SQL> exit
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump>sqlplus"/ as sysdba"
SQL*Plus:Release10.2.0.1.0-ProductiononThuDec310:28:472009
Copyright(c)1982,2005,Oracle.Allrightsreserved.
Connected.
SQL>select *fromdual;
select *fromdual
*
ERRORatline1:
ORA-01012:notloggedon
SQL>conn/aassysdba
Connectedtoanidleinstance.
SQL>
SQL>
SQL>exit
检查环境变量的正常的,确实登录的是gzxnh1的实例,而且,oracle的进程都是在的:
oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump> ps -ef |grep ora _
oracle 24562 13823 0 10:28:41 pts/1 0:00 grep ora_
oracle 26691 1 0 00:37:55 ? 0:00 ora_q002_gzxnh1
oracle 18836 1 0 Nov 28 ? 4:42 ora_cjq0_gzxnh1
oracle 18865 1 0 Nov 28 ? 4:15 ora_mmon_gzxnh1
oracle 19044 1 0 Nov 28 ? 0:09 ora_rbal_gzxnh1
oracle 18614 1 0 Nov 28 ? 1:13 ora_pmon_gzxnh1
oracle 19598 1 0 Nov 28 ? 273:22 ora_j000_gzxnh1
oracle 18800 1 0 Nov 28 ? 2:03 ora_ckpt_gzxnh1
oracle 18927 1 0 Nov 28 ? 13:13 ora_lck0_gzxnh1
oracle 18643 1 0 Nov 28 ? 3:21 ora_diag_gzxnh1
oracle 18805 1 0 Nov 28 ? 1:42 ora_smon_gzxnh1
oracle 18672 1 0 Nov 28 ? 0:10 ora_psp0_gzxnh1
oracle 18674 1 0 Nov 28 ? 22:01 ora_lmon_gzxnh1
oracle 18771 1 0 Nov 28 ? 2:56 ora_lgwr_gzxnh1
oracle 19013 1 0 Nov 28 ? 0:11 ora_asmb_gzxnh1
oracle 18834 1 0 Nov 28 ? 0:00 ora_reco_gzxnh1
oracle 18867 1 0 Nov 28 ? 7:25 ora_mmnl_gzxnh1
oracle 19835 1 0 Nov 28 ? 0:14 ora_q000_gzxnh1
oracle 18898 1 0 Nov 28 ? 0:00 ora_s000_gzxnh1
oracle 18896 1 0 Nov 28 ? 0:00 ora_d000_gzxnh1
oracle 18707 1 0 Nov 28 ? 10:52 ora_lms0_gzxnh1
oracle 18703 1 0 Nov 28 ? 18:31 ora_lmd0_gzxnh1
oracle 18736 1 0 Nov 28 ? 8:54 ora_lms1_gzxnh1
oracle 19461 1 0 22:00:01 ? 0:05 ora_j001_gzxnh1
oracle 18738 1 0 Nov 28 ? 0:25 ora_mman_gzxnh1
oracle 18769 1 0 Nov 28 ? 3:22 ora_dbw0_gzxnh1
oracle 27708 1 0 10:54:35 ? 1:47 ora_pz99_gzxnh1
oracle 15114 1 0 Dec 1 ? 0:00 ora_o002_gzxnh1
oracle 19541 1 0 Nov 28 ? 0:14 ora_qmnc_gzxnh1
oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump>ps-ef|grepora_
oracle2456213823010:28:41pts/10:00grepora_
oracle266911000:37:55?0:00ora_q002_gzxnh1
oracle1883610Nov28?4:42ora_cjq0_gzxnh1
oracle1886510Nov28?4:15ora_mmon_gzxnh1
oracle1904410Nov28?0:09ora_rbal_gzxnh1
oracle1861410Nov28?1:13ora_pmon_gzxnh1
oracle1959810Nov28?273:22ora_j000_gzxnh1
oracle1880010Nov28?2:03ora_ckpt_gzxnh1
oracle1892710Nov28?13:13ora_lck0_gzxnh1
oracle1864310Nov28?3:21ora_diag_gzxnh1
oracle1880510Nov28?1:42ora_smon_gzxnh1
oracle1867210Nov28?0:10ora_psp0_gzxnh1
oracle1867410Nov28?22:01ora_lmon_gzxnh1
oracle1877110Nov28?2:56ora_lgwr_gzxnh1
oracle1901310Nov28?0:11ora_asmb_gzxnh1
oracle1883410Nov28?0:00ora_reco_gzxnh1
oracle1886710Nov28?7:25ora_mmnl_gzxnh1
oracle1983510Nov28?0:14ora_q000_gzxnh1
oracle1889810Nov28?0:00ora_s000_gzxnh1
oracle1889610Nov28?0:00ora_d000_gzxnh1
oracle1870710Nov28?10:52ora_lms0_gzxnh1
oracle1870310Nov28?18:31ora_lmd0_gzxnh1
oracle1873610Nov28?8:54ora_lms1_gzxnh1
oracle194611022:00:01?0:05ora_j001_gzxnh1
oracle1873810Nov28?0:25ora_mman_gzxnh1
oracle1876910Nov28?3:22ora_dbw0_gzxnh1
oracle277081010:54:35?1:47ora_pz99_gzxnh1
oracle1511410Dec1?0:00ora_o002_gzxnh1
oracle1954110Nov28?0:14ora_qmnc_gzxnh1
oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump>
怎么回事?为何数据库登录有问题?登录上去报错没有连接,再次登录就报错connect to an idle instance?
进一步检查oracle的alertlog:
Thread 1 advanced to log sequence 332
Current log# 6 seq# 332 mem# 0: +DATA1/gzxnh/onlinelog/group_6.592.704141411
Thu Dec 3 10:24:47 2009
Process PZ98 died, see its trace file
Thu Dec 3 10:32:55 2009
Process PZ99 died, see its trace file
Process PZ99 died, see its trace file
Thu Dec 3 10:33:57 2009
Process PZ99 died, see its trace file
Thu Dec 3 10:34:26 2009
Process PZ99 died, see its trace file
Process PZ99 died, see its trace file
oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump>
1
2
3
4
5
6
7
8
9
10
11
12
13
Thread1advancedtologsequence332
Currentlog# 6 seq# 332 mem# 0: +DATA1/gzxnh/onlinelog/group_6.592.704141411
ThuDec310:24:472009
ProcessPZ98died,seeitstracefile
ThuDec310:32:552009
ProcessPZ99died,seeitstracefile
ProcessPZ99died,seeitstracefile
ThuDec310:33:572009
ProcessPZ99died,seeitstracefile
ThuDec310:34:262009
ProcessPZ99died,seeitstracefile
ProcessPZ99died,seeitstracefile
oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump>
PZ99的进程死了?虽然在alertlog中没看到trace file指的是哪个,但是通过bdump下的最新一个trace文件可以看到:
oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh1 _ora _19598.trc
/oracle/oracle/admin/gzxnh/bdump/gzxnh1_ora_19598.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2
System name: HP-UX
Node name: rp8440a
Release: B.11.23
Version: U
Machine: 9000/800
Instance name: gzxnh1
Redo thread mounted by this instance: 1
Oracle process number: 0
Unix process pid: 19598, image: oracle@rp8440a
……
*** 2009-12-03 10:34:26.934
*** ACTION NAME:() 2009-12-03 10:34:26.934
*** MODULE NAME:() 2009-12-03 10:34:26.934
*** SERVICE NAME:(SYS$USERS) 2009-12-03 10:34:26.934
*** CLIENT ID:() 2009-12-03 10:34:26.934
*** SESSION ID:(3251.32206) 2009-12-03 10:34:26.934
kxfpg1srv
could not start local PZ99
kxfpg1sg
GV$ query failed to get slave on inst 1
kxfpg1srv
could not start PZ99, inst 2
kxfpg1sg
GV$ query failed to get slave on inst 2
kxfpg1srv
could not start local PZ99
kxfpg1sg
GV$ query failed to get slave on inst 1
kxfpg1srv
could not start PZ99, inst 2
kxfpg1sg
GV$ query failed to get slave on inst 2
oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump>catgzxnh1_ora_19598.trc
/oracle/oracle/admin/gzxnh/bdump/gzxnh1_ora_19598.trc
OracleDatabase10gEnterpriseEditionRelease10.2.0.1.0-64bitProduction
WiththePartitioning,RealApplicationClusters,OLAPandDataMiningoptions
ORACLE_HOME=/oracle/oracle/product/10.2.0/db_2
Systemname:HP-UX
Nodename:rp8440a
Release:B.11.23
Version:U
Machine:9000/800
Instancename:gzxnh1
Redothreadmountedbythisinstance:1
Oracleprocessnumber:0
Unixprocesspid:19598,image:oracle@rp8440a
……
***2009-12-0310:34:26.934
***ACTIONNAME:()2009-12-0310:34:26.934
***MODULENAME:()2009-12-0310:34:26.934
***SERVICENAME:(SYS$USERS)2009-12-0310:34:26.934
***CLIENTID:()2009-12-0310:34:26.934
***SESSIONID:(3251.32206)2009-12-0310:34:26.934
kxfpg1srv
couldnotstartlocalPZ99
kxfpg1sg
GV$queryfailedtogetslaveoninst1
kxfpg1srv
couldnotstartPZ99,inst2
kxfpg1sg
GV$queryfailedtogetslaveoninst2
kxfpg1srv
couldnotstartlocalPZ99
kxfpg1sg
GV$queryfailedtogetslaveoninst1
kxfpg1srv
couldnotstartPZ99,inst2
kxfpg1sg
GV$queryfailedtogetslaveoninst2
oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump>
在rac2上看:
oracle@rp8440b:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh2_ora_10642.trc
/oracle/oracle/admin/gzxnh/bdump/gzxnh2_ora_10642.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2
System name: HP-UX
Node name: rp8440b
Release: B.11.23
Version: U
Machine: 9000/800
Instance name: gzxnh2
Redo thread mounted by this instance: 2
Oracle process number: 0
Unix process pid: 10642, image: oracle@rp8440b
Died during process startup with error 27061 (seq=596)
OPIRIP: Uncaught error 27061. Error stack:
ORA-27061: waiting for async I/Os failed
oracle@rp8440b:/oracle/oracle/admin/gzxnh/bdump
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
oracle@rp8440b:/oracle/oracle/admin/gzxnh/bdump>catgzxnh2_ora_10642.trc
/oracle/oracle/admin/gzxnh/bdump/gzxnh2_ora_10642.trc
OracleDatabase10gEnterpriseEditionRelease10.2.0.1.0-64bitProduction
WiththePartitioning,RealApplicationClusters,OLAPandDataMiningoptions
ORACLE_HOME=/oracle/oracle/product/10.2.0/db_2
Systemname:HP-UX
Nodename:rp8440b
Release:B.11.23
Version:U
Machine:9000/800
Instancename:gzxnh2
Redothreadmountedbythisinstance:2
Oracleprocessnumber:0
Unixprocesspid:10642,image:oracle@rp8440b
Diedduringprocessstartupwitherror27061(seq=596)
OPIRIP:Uncaughterror27061.Errorstack:
ORA-27061:waitingforasyncI/Osfailed
oracle@rp8440b:/oracle/oracle/admin/gzxnh/bdump
看来确实和rac上刚刚开启的异步IO有关系了,既然这样,赶紧还是把异步IO关闭了吧:在操作系统级别用sam将asyncdsk改成unused,将/etc/privgroup文件删除,重启主机后,数据库启动,恢复正常。
PZ99是rac中并发进程的slave进程(可查看这里),查询Gv$视图需要该进程,目前我猜测是由于修改成了异步IO,导致原来正在执行的pz99进程挂死,sqlplus登录时无法查询gv$视图。如果此时kill掉PZ99的进程,说不定不用重启主机,关闭异步IO,就能解决故障了。
不过,还是想不通PZ99进程和异步IO间的关系,哪位如果也遇到这个情况,能说说当时你是怎么处理的么?