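Environment: AIX, 16 GB memory, single instance, Oracle 10.2.0.4.
The application team reported that the database could no longer be accessed.
sqlplus / as sysdba hung.
The alert.log kept printing: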
tue apr 6 10:01:41 2021
pmon failed to acquire latch, see pmon dump
tue apr 6 10:02:41 2021
pmon failed to acquire latch, see pmon dump
tue apr 6 10:02:55 2021
ksvcreate: process(m000) creation failed
tue apr 6 10:03:41 2021
pmon failed to acquire latch, see pmon dump
After running kill -9 on the pmon process, the alert log printed the following:
tue apr 6 10:46:30 2021
mmnl: terminating instance due to error 472
tue apr 6 10:46:30 2021
errors in file /home/oracle/admin/orcl/udump/orcl_ora_5865490.trc:
ora-00600: internal error code, arguments: [504], [0x700000010007608], [640], [5], [session allocation], [0], [0], [0x000000000]
tue apr 6 10:46:31 2021
errors in file /home/oracle/admin/orcl/udump/orcl_ora_5865490.trc:
ora-07445: exception encountered: core dump [kgscdump 01dc] [sigsegv] [address not mapped to object] [0x7c000fac7c017000] [] []
ora-00600: internal error code, arguments: [504], [0x700000010007608], [640], [5], [session allocation], [0], [0], [0x000000000]
tue apr 6 10:46:31 2021
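Oddly, the alert log says the mmnl process terminated the instance due to error 472. (ORA-00472 is "PMON process terminated with error", so this is most likely just the consequence of killing pmon.)
After a restart the instance was back to normal.
Checking the other trace files from that time, the mmnl trace file contains a lot of information.
The mmon trace file kept printing the following:
The pmon trace file shows the following:
...
...
It also contains the following: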
----------------------------------------
so: 70000006f6b84c0, type: 2, owner: 0, flag: init/-/-/0x00
(process) oracle pid=92, calls cur/top: 0/70000002ed1a930, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
last post received-location: no post
last process to post me: none
last post sent: 0 0 0
last post sent-location: no post
last process posted by me: none
(latch info) wait_event=0 bits=280
holding (efd=6) 70000006b9d8148 child library cache level=5 child#=2
location from where latch is held: kghfrunp: clatch: nowait:
context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
8 (2181, 1617676902, 0)
waiter count=1
holding (efd=6) 7000000100e7668 child shared pool level=7 child#=1
location from where latch is held: kghfrunp: alloc: session dur:
context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
89 (2450, 1617676902, 0)
94 (2450, 1617676902, 0)
78 (2445, 1617676902, 0)
28 (2442, 1617676902, 0)
84 (2442, 1617676902, 0)
(about 500 characters omitted)
85 (2070, 1617676902, 0)
130 (2061, 1617676902, 3)
138 (1992, 1617676902, 0)
146 (1923, 1617676902, 0)
waiter count=46
process group: default, pseudo proc: 70000006f783298
o/s info: user: oracle, term: unknown, ospid: 5865490
osd pid info: unix process pid: 5865490, image: oracle@p550-zjgl1
short stack dump:
ksdxfstk 002c<-ksdxcb 04e4<-sspuser 0074<-000044f0<-skgpwwait 00bc<-ksliwat 06c0<-kslwaitns_timed 0024<-kskthbwt 022c<-kslwait 00f4<-kkslockwait 01fc<-kgxwait 0168<-kgxexclusive 00bc<-kksfreeheapgetmutex 0158<-kkscursorfreecallback 0088<-kglobf0 0264<-kglhpd_internal 0228<-kglhpd 0010<-kghfrx 0028<-kghfrunp 0b44<-kghfnd 07e8<-kghalo 0a24<-ksp_param_handle_alloc 0168<-kspcrec 01bc<-ksucre 0408<-ksucresg 017c<-kpolna 02c4<-kpogsk 00c4<-opiodr 0ae0<-ttcpip 1020<-opitsk 1124<-opiino 0990<-opiodr 0ae0<-opidrv 0484<-sou2o 0090<-opimai_real 01bc<-main 0098<-__start 0098
dump of memory from 0x070000006f657380 to 0x070000006f657588
70000006f657380 00000004 00000000 07000000 6946e648 [............if.h]
70000006f657390 00000010 000313a7 07000000 2ed1a930 [...............0]
70000006f6573a0 00000003 000313a7 07000000 6fd593e0 [............o...]
70000006f6573b0 0000000b 000313a7 07000000 6f9cbea8 [............o...]
70000006f6573c0 00000004 0003129b 00000000 00000000 [................]
70000006f6573d0 00000000 00000000 00000000 00000000 [................]
repeat 26 times
70000006f657580 00000000 00000000 [........]
*** 2021-04-06 10:41:42.821
pmon unable to acquire latch 7000000100e7668 child shared pool level=7 child#=1
location from where latch is held: kghfrunp: alloc: session dur:
context saved from call: 0
state=busy, wlstate=free
Filtering the trace on the keyword "possible holder" suggests that
the two processes pid=92 and pid=130 may be holding the resources that blocked pmon.
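To make the holder/waiter picture easier to read, the latch section of a process state dump can be summarized with a small script. The following is only a rough sketch in Python: the function names and the regular expressions are my own assumptions, written against the exact lines pasted above, and other Oracle versions may format these lines differently. The dump time of 10:41:42 is also an assumption, taken from the `***` timestamp line in the pmon trace. (The "posted" value 1617676902, read as a Unix epoch, is 02:41:42 UTC on 2021-04-06, which would be 10:41:42 at UTC+8, so it seems consistent.)

```python
import re
from datetime import datetime, timedelta

# Patterns written against the trace lines quoted in this post (assumed format).
HOLD_RE = re.compile(r"^\s*holding\s+\(efd=\d+\)\s+(\S+)\s+(.*?)\s+level=(\d+)")
WAITER_RE = re.compile(r"^\s*(\d+)\s+\((\d+),\s*(\d+),\s*(\d+)\)")

def parse_held_latches(lines):
    """Return one dict per 'holding' entry, each with its list of waiters."""
    latches = []
    current = None
    for line in lines:
        m = HOLD_RE.match(line)
        if m:
            current = {"addr": m.group(1), "name": m.group(2),
                       "level": int(m.group(3)), "waiters": []}
            latches.append(current)
            continue
        if line.strip().startswith("waiter count="):
            current = None  # end of this latch's waiter list
            continue
        m = WAITER_RE.match(line)
        if m and current is not None:
            orapid, secs_on_list = int(m.group(1)), int(m.group(2))
            current["waiters"].append((orapid, secs_on_list))
    return latches

def earliest_wait_start(latch, dump_time):
    """Estimate when the longest waiter was put on the wait list."""
    longest = max(secs for _pid, secs in latch["waiters"])
    return dump_time - timedelta(seconds=longest)

# A few lines copied from the pmon trace above (waiter list shortened).
SAMPLE = """\
holding (efd=6) 70000006b9d8148 child library cache level=5 child#=2
location from where latch is held: kghfrunp: clatch: nowait:
waiters [orapid (seconds since: put on list, posted, alive check)]:
8 (2181, 1617676902, 0)
waiter count=1
holding (efd=6) 7000000100e7668 child shared pool level=7 child#=1
location from where latch is held: kghfrunp: alloc: session dur:
waiters [orapid (seconds since: put on list, posted, alive check)]:
89 (2450, 1617676902, 0)
94 (2450, 1617676902, 0)
78 (2445, 1617676902, 0)
waiter count=46
"""

if __name__ == "__main__":
    dump_time = datetime(2021, 4, 6, 10, 41, 42)  # assumed dump time
    for latch in parse_held_latches(SAMPLE.splitlines()):
        start = earliest_wait_start(latch, dump_time)
        print(latch["name"], "level", latch["level"],
              "waiters:", len(latch["waiters"]),
              "first waiter queued around", start.time())
```

With these sample lines, the longest waiter on the child shared pool latch has been queued for 2450 seconds, which puts the start of the wait at about 10:00:52, shortly before the first "pmon failed to acquire latch" message at 10:01:41 in the alert log.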
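Before shutting down the database I ran a hang analysis; in its output files I could only find information for pid 92.
As I recall, the database was not very busy at 10:00, so the problem was probably on the parsing side. This fits the pid=92 state dump above, where the process holds the child shared pool and child library cache latches in kghfrunp while its short stack is waiting in kgxexclusive (called from kksfreeheapgetmutex). Killing session 92 might have been a lighter-weight fix.
Next time a similar problem occurs, check the pmon trace file first.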
How to collect diagnostic information:
sqlplus -prelim / as sysdba
oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
oradebug dump systemstate 258
(wait 30 seconds)
oradebug hanganalyze 3
oradebug dump systemstate 258
oradebug tracefile_name
exit
Or:
alter session set events 'immediate trace name systemstate level 258';
alter session set events 'immediate trace name hanganalyze level 3';
(wait 1 minute)
alter session set events 'immediate trace name systemstate level 258';
alter session set events 'immediate trace name hanganalyze level 3';
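During an incident it is easy to mistype the oradebug sequence, so the first variant above could be generated as a script in advance. This is only a sketch: the function name `build_oradebug_script` and the file name `collect_hang.sql` are hypothetical, and using the SQL*Plus `host sleep` command for the pause is my assumption; I have not tested it in a `-prelim` session.

```python
def build_oradebug_script(hang_level=3, systemstate_level=258, wait_seconds=30):
    """Build the two-snapshot oradebug sequence from this post as script text."""
    snapshot = [
        f"oradebug hanganalyze {hang_level}",
        f"oradebug dump systemstate {systemstate_level}",
    ]
    lines = ["oradebug setmypid", "oradebug unlimit"]
    lines += snapshot
    # Assumption: 'host' runs an OS command from SQL*Plus, used here for the pause.
    lines.append(f"host sleep {wait_seconds}")
    lines += snapshot
    lines += ["oradebug tracefile_name", "exit"]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Hypothetical file name; run it with @collect_hang.sql from the prelim session.
    with open("collect_hang.sql", "w") as f:
        f.write(build_oradebug_script())
    print(build_oradebug_script(), end="")
```

Taking two snapshots with a pause in between makes it possible to tell processes that are truly stuck from ones that are merely slow.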
Reference:
patch for bug:5377099 is superseded by patch for bug:8426816 (doc id 817033.1)
Other references:
awk -f ass109.awk /u01/oracle/diag/rdbms/ora11g/ora11g/trace/ora11g_ora_9976.trc