PSU update for Q1 2015 (OJVM PSU update)

Unless marked as a repost, every article on this site is original. Reposted from love wife & love life, Roger's Oracle technical blog

Permalink: PSU update for Q1 2015 (OJVM PSU update)

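Oracle patching keeps getting more complicated. Starting with the patch sets released in October 2014, several new patch concepts were introduced:

JDBC Patch: for JDBC clients (Instant Client, Database and Grid ORACLE_HOMEs).
Oracle JavaVM Component Database PSU, OJVM PSU for short (applies only to ORACLE_HOMEs).
Combo Patch: can be thought of as the OJVM PSU together with the DB PSU/GI PSU in one package, except that it lets us install only some of the included patches, whereas the OJVM PSU is all or nothing. The Combo Patch is therefore more flexible.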

Starting with the January 2015 patch release, the OJVM PSU integrates the JDBC Patch (the October 2014 release did not).
The Q1 2015 PSU update mainly covers five versions: 12.1.0.2, 12.1.0.1, 11.2.0.4, 11.2.0.3 and 11.1.0.7.

++++10.2.0.4
Oracle Database PSU   Unix             Comments                 Includes CPU

10.2.0.4.1           8576156          Base 10.2.0.4.0          includes CPU Jul 2009
10.2.0.4.2           8833280          Base 10.2.0.4.0          includes CPU Oct 2009
10.2.0.4.3           9119284          Base 10.2.0.4.0          includes CPU Jan 2010
10.2.0.4.4           9352164          Base 10.2.0.4.0          includes CPU Apr 2010
10.2.0.4.5           9654991          Requires 10.2.0.4.4      includes CPU Jul 2010
10.2.0.4.6           9952234          Requires 10.2.0.4.4      includes CPU Oct 2010
10.2.0.4.7           10248636         Requires 10.2.0.4.4      includes CPU Jan 2011
10.2.0.4.8           11724977         Requires 10.2.0.4.4      includes CPU Apr 2011
10.2.0.4.9           12419397         Requires 10.2.0.4.4      includes CPU Jul 2011
10.2.0.4.10          12827778         Requires 10.2.0.4.4      includes CPU Oct 2011
10.2.0.4.11          12879929         Requires 10.2.0.4.4      includes CPU Jan 2012
10.2.0.4.12          12879933         Requires 10.2.0.4.4      includes CPU Apr 2012
10.2.0.4.13          13923851         Requires 10.2.0.4.4      includes CPU Jul 2012
10.2.0.4.14          14275630         Requires 10.2.0.4.4      includes CPU Oct 2012
10.2.0.4.15          14736542         Requires 10.2.0.4.4      includes CPU Jan 2013
10.2.0.4.16          16056269         Requires 10.2.0.4.4      includes CPU Apr 2013
10.2.0.4.17          16619897         Requires 10.2.0.4.4      includes CPU Jul 2013

+++++10.2.0.5
Oracle Database PSU   Unix             Comments                Includes CPU

10.2.0.5.1           9952230          Base 10.2.0.5.0         includes CPU Oct 2010
10.2.0.5.2           10248542         Base 10.2.0.5.0         includes CPU Jan 2011
10.2.0.5.3           11724962         Base 10.2.0.5.0         includes CPU Apr 2011
10.2.0.5.4           12419392         Base 10.2.0.5.0         includes CPU Jul 2011
10.2.0.5.5           12827745         Base 10.2.0.5.0         includes CPU Oct 2011
10.2.0.5.6           13343471         Base 10.2.0.5.0         includes CPU Jan 2012
10.2.0.5.7           13632743         Base 10.2.0.5.0         includes CPU Apr 2012
10.2.0.5.8           13923855         Base 10.2.0.5.0         includes CPU Jul 2012
10.2.0.5.9           14275629         Base 10.2.0.5.0         includes CPU Oct 2012
10.2.0.5.10          14727319         Base 10.2.0.5.0         includes CPU Jan 2013
10.2.0.5.11          16056270         Base 10.2.0.5.0         includes CPU Apr 2013
10.2.0.5.12          16619894         Base 10.2.0.5.0         includes CPU Jul 2013

+++++11.1.0.7
Oracle Database PSU   Database        CRS                   Comments          Includes CPU

11.1.0.7.1            8833297         bug: 8287931          Base 11.1.0.7.0   includes CPU Oct 2009
11.1.0.7.2            9209238         bug: 9207257          Base 11.1.0.7.0   includes CPU Jan 2010
11.1.0.7.3            9352179                               Base 11.1.0.7.0   includes CPU Apr 2010
11.1.0.7.4            9654987         bug: 9294495          Base 11.1.0.7.0   includes CPU Jul 2010
11.1.0.7.5            9952228         bug: 9952240          Base 11.1.0.7.0   includes CPU Oct 2010
11.1.0.7.6            10248531        bug: 10248535         Base 11.1.0.7.0   includes CPU Jan 2011
11.1.0.7.7            11724936        11724953              Base 11.1.0.7.0   includes CPU Apr 2011
11.1.0.7.8            12419384        11724953              Base 11.1.0.7.0   includes CPU Jul 2011
11.1.0.7.9            12827740        11724953              Base 11.1.0.7.0   includes CPU Oct 2011
11.1.0.7.10           13343461        11724953              Base 11.1.0.7.0   includes CPU Jan 2012
11.1.0.7.11           13621679        11724953              Base 11.1.0.7.0   includes CPU Apr 2012
11.1.0.7.12           13923474        11724953              Base 11.1.0.7.0   includes CPU Jul 2012
11.1.0.7.13           14275623        11724953              Base 11.1.0.7.0   includes CPU Oct 2012
11.1.0.7.14           14739378        11724953              Base 11.1.0.7.0   includes CPU Jan 2013
11.1.0.7.15           16056268        11724953              Base 11.1.0.7.0   includes CPU Apr 2013
11.1.0.7.16           16619896        11724953              Base 11.1.0.7.0   includes CPU Jul 2013
11.1.0.7.17           17082366        11724953              Base 11.1.0.7.0   includes CPU Oct 2013
11.1.0.7.18           17465583        11724953              Base 11.1.0.7.0   includes CPU Jan 2014
11.1.0.7.19           18031726        11724953              Base 11.1.0.7.0   includes CPU Apr 2014
11.1.0.7.20           18522513        11724953              Base 11.1.0.7.0   includes CPU Jul 2014
11.1.0.7.21           19152553        11724953              Base 11.1.0.7.0   includes CPU Oct 2014
11.1.0.7.22           19769499        11724953              Base 11.1.0.7.0   includes CPU Jan 2015

OJVM PSU:             Database              CRS        Comments                  Includes JDBC Patch
11.1.0.7.1            19282002(Unix)                   Base 11.1.0.7.0
                      19806118(Win)
                      19852363(JDBC Patch)
11.1.0.7.2            19877446                        Base 11.1.0.7.21           Jan 2015
                                                     or SPU 11.1.0.7.0(CPUOct2014)

++++++11.2.0.1
Oracle Database PSU  Database   Grid Infrastructure   Comments          Includes CPU

11.2.0.1.1           9352237    9343627               Base 11.2.0.1.0   includes CPU Apr 2010
11.2.0.1.2           9654983    9343627               Base 11.2.0.1.0   includes CPU Jul 2010
11.2.0.1.3           9952216    9655006               Base 11.2.0.1.0   includes CPU Oct 2010
11.2.0.1.4           10248516   9655006               Base 11.2.0.1.0   includes CPU Jan 2011
11.2.0.1.5           11724930   9655006               Base 11.2.0.1.0   includes CPU Apr 2011
11.2.0.1.6           12419378   9655006               Base 11.2.0.1.0   includes CPU Jul 2011

+++++++11.2.0.2
Oracle Database PSU  Database   Grid Infrastructure   Comments          Includes CPU
11.2.0.2.1           10248523   Bundle1 10157506      Base 11.2.0.2.0   no CPU fixes
11.2.0.2.2           11724916   Bundle2 10425672      Base 11.2.0.2.0   includes CPU Apr 2011
                                PSU2 12311357
11.2.0.2.3           12419331   12419353              Base 11.2.0.2.0   includes CPU Jul 2011
11.2.0.2.4           12827726   12827731              Base 11.2.0.2.0   includes CPU Oct 2011
11.2.0.2.5           13343424   13343447              Base 11.2.0.2.0   includes CPU Jan 2012
11.2.0.2.6           13696224   13696242              Base 11.2.0.2.0   includes CPU Apr 2012
11.2.0.2.7           13923804   14192201              Base 11.2.0.2.0   includes CPU Jul 2012
11.2.0.2.8           14275621   14390437              Base 11.2.0.2.0   includes CPU Oct 2012
11.2.0.2.9           14727315   14390437              Base 11.2.0.2.0   includes CPU Jan 2013
11.2.0.2.10          16056267   16166868              Base 11.2.0.2.0   includes CPU Apr 2013
11.2.0.2.11          16619893   16742320              Base 11.2.0.2.0   includes CPU Jul 2013
11.2.0.2.12          17082367   17272753              Base 11.2.0.2.0   includes CPU Oct 2013

+++++++11.2.0.3
Oracle Database PSU  Database               Grid Infrastructure       Comments           Includes CPU
11.2.0.3.0           10404530               (included in 10404530)
11.2.0.3.1           13343438               13348650                  Base 11.2.0.3.0     includes CPU Jan 2012
11.2.0.3.2           13696216               13696251                  Base 11.2.0.3.0     includes CPU Apr 2012
11.2.0.3.3           13923374               13919095                  Base 11.2.0.3.0     includes CPU Jul 2012
11.2.0.3.4           14275605               14275572                  Base 11.2.0.3.0     includes CPU Oct 2012
11.2.0.3.5           14727310               14727347                  Base 11.2.0.3.0     includes CPU Jan 2013
11.2.0.3.6           16056266               16083653                  Base 11.2.0.3.0     includes CPU Apr 2013
11.2.0.3.7           16619892               16742216                  Base 11.2.0.3.0     includes CPU Jul 2013
11.2.0.3.8           16902043               17272731                  Base 11.2.0.3.0     includes CPU Oct 2013
11.2.0.3.9           17540582               17735354                  Base 11.2.0.3.0     includes CPU Jan 2014
11.2.0.3.10          18031683               18139678                  Base 11.2.0.3.0     includes CPU Apr 2014
11.2.0.3.11          18522512               18706488                  Base 11.2.0.3.0     includes CPU Jul 2014
11.2.0.3.12          19121548               19440385                  Base 11.2.0.3.0     includes CPU Oct 2014
11.2.0.3.13          19769496               19971343                  Base 11.2.0.3.0     includes CPU Jan 2015

OJVM PSU:            Database               Grid Infrastructure       Comments            Includes JDBC Patch

11.2.0.3.1           19282015(Unix)                                   PSU 11.2.0.3.12
                     19806120(Win)                                 or SPU 11.2.0.3(CPUOct2014)
                     19852361(JDBC Patch)
11.2.0.3.2           19877443   19852361                              Base 11.2.0.3.12     Jan 2015

+++++++11.2.0.4
Oracle Database PSU  Database               Grid Infrastructure       Comments               Includes CPU
11.2.0.4.0           13390677               13390677
11.2.0.4.1           17478514                                          Base 11.2.0.4.0         includes CPU Jan 2014
11.2.0.4.2           18031668               18139609                   Base 11.2.0.4.0         includes CPU Apr 2014
11.2.0.4.3           18522509               18706472                   Base 11.2.0.4.0         includes CPU Jul 2014
11.2.0.4.4           19121551               19380115                   Base 11.2.0.4.0         includes CPU Oct 2014
11.2.0.4.5           19769489               19955028                   Base 11.2.0.4.0         includes CPU Jan 2015

OJVM PSU:            Database               Grid Infrastructure        Comments               Includes JDBC Patch

11.2.0.4.1           19282021(Unix)                                    Base 11.2.0.4.4
                     19799291(WIN)                                   or SPU 11.2.0.4(CPUOct2014)
                     19852360(JDBC Patch)
11.2.0.4.2           19877440   19852360                               Base 11.2.0.4.4        Jan 2015

+++++++12.1.0.1
Oracle Database PSU  Database           Grid Infrastructure       Comments           Includes CPU

12.1.0.1.1           17027533           17272829                  Base 12.1.0.1.0
12.1.0.1.2           17552800           17735306                  Base 12.1.0.1.0    includes CPU Jan 2014
12.1.0.1.3           18031528           18139660(AIX/HP/zLinux)   Base 12.1.0.1.0    includes CPU Apr 2014
                                        18413105(Linux/Solaris)
12.1.0.1.4           18522516           18705972(AIX/HP/zLinux)   Base 12.1.0.1.0    includes CPU Jul 2014
                                        18705901(Linux/Solaris)
12.1.0.1.5           19121550           19392451(AIX/HP/zLinux)   Base 12.1.0.1.0    includes CPU Oct 2014
                                        19392372(Linux/Solaris)
12.1.0.1.6           19769486           19971331(AIX/HP/zLinux)   Base 12.1.0.1.0    includes CPU Jan 2015
                                        19971324(Linux/Solaris)

OJVM PSU:            Database              Grid Infrastructure       Comments           Includes JDBC Patch

12.1.0.1.1           19282024(Unix)                                  Base 12.1.0.1.5
                     19801531(WIN)
                     19852357(JDBC Patch)
12.1.0.1.2           19877342             19852357                   Base 12.1.0.1.5    Jan 2015

++++++ 12.1.0.2
Oracle Database PSU  Database         Grid Infrastructure       Comments           Includes CPU

12.1.0.2.1           19303936          19392646                  Base 12.1.0.2.0    includes CPU Oct 2014
12.1.0.2.2           19769480          19954978                  Base 12.1.0.2.0    includes CPU Jan 2015

OJVM PSU:            Database         Grid Infrastructure       Comments           Includes JDBC Patch

12.1.0.2.1           19282028                                    Base 12.1.0.2.1
12.1.0.2.2           19877336          20132450                  Base 12.1.0.2.1    Jan 2015

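Notes:

1) For details on the JDBC Patch and the OJVM PSU, see the MOS doc:

Oracle Recommended Patches -- "Oracle JavaVM Component Database PSU" (OJVM PSU) Patches (1929745.1)
2) Installing an OJVM PSU also places a requirement on the database home: its patch level must not be lower than the patches released in October 2014, i.e.:
require the database home to be patched to at least October 2014 DB PSU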

In other words, to install the OJVM PSU on 11.1.0.7, the home must be at 11.1.0.7.21 (the October 2014 PSU) or later. See the Comments column in the tables above.

3) The patches formerly known as CPU (Critical Patch Update) are now called SPU (Security Patch Update).
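
The prerequisite in note 2 can be checked mechanically. The following is a minimal sketch, not an Oracle tool: the minimum levels are simply the rows marked "includes CPU Oct 2014" in the tables above, and the function names `parse_version` and `ojvm_psu_allowed` are made up for this illustration.

```python
# Minimum DB PSU level per release before an OJVM PSU can be applied.
# Assumption: taken from the "includes CPU Oct 2014" rows of the tables above.
MIN_PSU_FOR_OJVM = {
    "11.1.0.7": "11.1.0.7.21",
    "11.2.0.3": "11.2.0.3.12",
    "11.2.0.4": "11.2.0.4.4",
    "12.1.0.1": "12.1.0.1.5",
    "12.1.0.2": "12.1.0.2.1",
}


def parse_version(text):
    """Turn '11.2.0.4.4' into (11, 2, 0, 4, 4) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def ojvm_psu_allowed(current_psu):
    """Return True if current_psu is at or above the October 2014 PSU level."""
    release = ".".join(current_psu.split(".")[:4])
    minimum = MIN_PSU_FOR_OJVM.get(release)
    if minimum is None:
        return False  # release has no OJVM PSU listed in this article
    return parse_version(current_psu) >= parse_version(minimum)


if __name__ == "__main__":
    for level in ("11.1.0.7.20", "11.1.0.7.21", "11.2.0.4.10"):
        print(level, ojvm_psu_allowed(level))
```

Comparing integer tuples rather than strings matters here: as plain strings, "11.2.0.4.10" would sort below "11.2.0.4.4".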

A RAC node crash due to ora-00481

Permalink: A RAC node crash due to ORA-00481

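This is a case from one of our customers, shared here. At around 03:44 on January 13, 2015, node 1 of the XXXX cluster database crashed.
Analyzing the alert log of instance XXXX1, we found the following: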

Tue Jan 13 03:44:43 2015
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lmon_10682988.trc:
ORA-00481: LMON process terminated with error
Tue Jan 13 03:44:43 2015
USER: terminating instance due to error 481
Tue Jan 13 03:44:43 2015
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lms0_27525728.trc:
ORA-00481: LMON process terminated with error
....... (part of the output omitted)
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lms1_27001440.trc:
ORA-00481: LMON process terminated with error
Tue Jan 13 03:44:43 2015
System state dump is made for local instance
System State dumped to trace file /home/oracle/app/admin/XXXX/bdump/XXXX1_diag_28246956.trc
Tue Jan 13 03:44:43 2015
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lmd0_27198128.trc:
ORA-00481: LMON process terminated with error
Tue Jan 13 03:44:43 2015
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_mman_28378004.trc:
ORA-00481: LMON process terminated with error
Tue Jan 13 03:44:43 2015
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lck0_25952674.trc:
ORA-00481: LMON process terminated with error
Tue Jan 13 03:44:44 2015
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lgwr_33489026.trc:
ORA-00481: LMON process terminated with error
Tue Jan 13 03:44:44 2015
Doing block recovery for file 94 block 613368
Tue Jan 13 03:44:45 2015
Shutting down instance (abort)
License high water mark = 1023
Tue Jan 13 03:44:49 2015
Instance terminated by USER, pid = 19333184
Tue Jan 13 03:44:55 2015
Instance terminated by USER, pid = 33554510
Tue Jan 13 03:45:46 2015
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name

According to this log, at 03:44:43 the LMON process on node 1 failed and was terminated with ORA-00481. The database instance on node 1 was then forcibly terminated.
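
To see at a glance which background processes reported the error, the "Errors in file" lines can be paired with the ORA- line that follows each of them. This is a rough sketch: the sample text is copied from the alert log excerpt above, and the regular expression assumes the `<instance>_<process>_<ospid>.trc` naming seen in that excerpt. The name `failing_processes` is made up here.

```python
import re

# A few lines copied from the alert log excerpt above.
ALERT_LOG = """\
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lmon_10682988.trc:
ORA-00481: LMON process terminated with error
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lms0_27525728.trc:
ORA-00481: LMON process terminated with error
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lmd0_27198128.trc:
ORA-00481: LMON process terminated with error
"""

# Assumption: trace files are named <instance>_<process>_<ospid>.trc.
TRACE_RE = re.compile(r"Errors in file \S*/(\w+?)_([a-z0-9]+)_(\d+)\.trc:")


def failing_processes(log_text):
    """Return (process, ospid, error_code) tuples in log order."""
    results = []
    pending = None
    for line in log_text.splitlines():
        match = TRACE_RE.match(line)
        if match:
            # Remember the process until its ORA- line shows up.
            pending = (match.group(2), int(match.group(3)))
        elif pending and line.startswith("ORA-"):
            results.append(pending + (line.split(":")[0],))
            pending = None
    return results


if __name__ == "__main__":
    for process, ospid, error in failing_processes(ALERT_LOG):
        print(process, ospid, error)
```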

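The main job of the Oracle LMON process is to monitor the GES (Global Enqueue Service) in RAC, but its role is not limited to that. It also checks the health of every node in the cluster and, when a node fails, handles reconfiguration and the recovery of the GRD (Global Resource Directory). Consider how RAC deals with split-brain: when I/O fencing is done by Oracle itself, it is done by Clusterware. When LMON detects a split-brain at the instance level, it notifies Clusterware to resolve it, but it does not wait indefinitely for Clusterware's result. Once the wait exceeds a certain time, LMON automatically triggers IMR (instance membership recovery), which is what we usually call instance membership reconfiguration.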

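From the log analysis above, the instance on node 1 was forcibly terminated by LMON, and LMON did so because it had itself hit an error. So why did LMON on node 1 fail? The trace of the LMON process on node 1 shows the following: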

*** 2015-01-13 03:44:18.067
kjfcdrmrfg: SYNC TIMEOUT (1295766, 1294865, 900), step 31
Submitting asynchronized dump request [28]
KJC Communication Dump:
 state 0x5  flags 0x0  mode 0x0  inst 0  inc 68
 nrcv 17  nsp 17  nrcvbuf 1000
 reg_msg: sz 456  cur 1235 (s:0 i:1235) max 5251  ini 3750
 big_msg: sz 8240  cur 263 (s:0 i:263) max 1409  ini 1934
 rsv_msg: sz 8240  cur 0 (s:0 i:0) max 0  tot 1000
 rcvr: id 1  orapid 7  ospid 27525728
 rcvr: id 9  orapid 15  ospid 26149404
 .......
 .......
 rcvr: id 7  orapid 13  ospid 17105074
 rcvr: id 16  orapid 22  ospid 29033450
 send proxy: id 1  ndst 1 (1:1 )
 send proxy: id 9  ndst 1 (1:9 )
 .......
 .......
 send proxy: id 7  ndst 1 (1:7 )
 send proxy: id 16  ndst 1 (1:16 )
GES resource limits:
 ges resources: cur 0 max 0 ini 39515
 ges enqueues: cur 0 max 0 ini 59069
 ges cresources: cur 4235 max 7721
 gcs resources: cur 4405442 max 5727836 ini 7060267
 gcs shadows: cur 4934515 max 6358617 ini 7060267
KJCTS state: seq-check:no  timeout:yes  waitticks:0x3  highload no
GES destination context:
GES remote instance per receiver context:
GES destination context:
.......
kjctseventdump-end tail 238 heads 0 @ 0 238 @ -744124571
sync() timed out - lmon exiting
kjfsprn: sync status  inst 0  tmout 900 (sec)
kjfsprn: sync propose inc 68  level 85020
kjfsprn: sync inc 68  level 85020
kjfsprn: sync bitmap 0 1
kjfsprn: dmap ver 68 (step 0)
.......
DUMP state for lmd0 (ospid 27198128)
DUMP IPC context for lmd0 (ospid 27198128)
Dumping process 6.27198128 info:
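
The `SYNC TIMEOUT` line in the trace above can be read numerically. A small sketch follows; it assumes that the three numbers in parentheses are a current timestamp, a start timestamp and a limit, all in seconds. Oracle does not document this layout, but it agrees with the `tmout 900 (sec)` line later in the same trace. The function name `parse_sync_timeout` is made up here.

```python
import re

SYNC_RE = re.compile(r"SYNC TIMEOUT \((\d+), (\d+), (\d+)\), step (\d+)")


def parse_sync_timeout(line):
    """Extract elapsed time, limit and step from a kjfcdrmrfg SYNC TIMEOUT line.

    Assumption: the tuple is (now, start, limit) in seconds.
    """
    match = SYNC_RE.search(line)
    if match is None:
        return None
    now, start, limit, step = (int(group) for group in match.groups())
    return {"elapsed": now - start, "limit": limit, "step": step}


if __name__ == "__main__":
    line = "kjfcdrmrfg: SYNC TIMEOUT (1295766, 1294865, 900), step 31"
    print(parse_sync_timeout(line))
```

Under that reading, the sync at step 31 had been waiting 901 seconds against a 900-second limit when LMON gave up.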

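According to this LMON trace, LMON detected a timeout while DRM was performing a sync, and LMON finally forced itself to exit. So we need to work out why DRM timed out. We also know that the main process behind DRM is LMD, so let's look at the trace of the LMD process on node 1: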

*** 2015-01-13 03:44:43.666
lmd abort after exception 481
KJC Communication Dump:
 state 0x5  flags 0x0  mode 0x0  inst 0  inc 68
 nrcv 17  nsp 17  nrcvbuf 1000
 reg_msg: sz 456  cur 1189 (s:0 i:1189) max 5251  ini 3750
 big_msg: sz 8240  cur 261 (s:0 i:261) max 1409  ini 1934
 rsv_msg: sz 8240  cur 0 (s:0 i:0) max 0  tot 1000
 rcvr: id 1  orapid 7  ospid 27525728
 .......
 rcvr: id 7  orapid 13  ospid 17105074
 rcvr: id 16  orapid 22  ospid 29033450
 send proxy: id 1  ndst 1 (1:1 )
 send proxy: id 9  ndst 1 (1:9 )
 .......
 send proxy: id 7  ndst 1 (1:7 )
 send proxy: id 16  ndst 1 (1:16 )
GES resource limits:
 ges resources: cur 0 max 0 ini 39515
 ges enqueues: cur 0 max 0 ini 59069
 ges cresources: cur 4235 max 7721
 gcs resources: cur 4405442 max 5727836 ini 7060267
 gcs shadows: cur 4934515 max 6358617 ini 7060267
KJCTS state: seq-check:no  timeout:yes  waitticks:0x3  highload no
GES destination context:
GES remote instance per receiver context:
GES destination context:

We can see that once LMON hit ORA-00481, the LMD process was also forcibly aborted. Before LMON was terminated, it triggered a process dump, as follows:

*** 2015-01-13 03:44:18.114
Dump requested by process [orapid=5]
REQUEST:custom dump [2] with parameters [5][6][0][0]
Dumping process info of pid[6.27198128] requested by pid[5.10682988]
Dumping process 6.27198128 info:
*** 2015-01-13 03:44:18.115
Dumping diagnostic information for ospid 27198128:
OS pid = 27198128
loadavg : 1.71 1.75 2.33
swap info: free_mem = 13497.62M rsv = 96.00M
alloc = 342.91M avail = 24576.00M swap_free = 24233.09M
 F S      UID      PID     PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
 240001 A   oracle 19530440 10682988  10  65 20 16ae3ea590  1916          03:44:18      -  0:00 /usr/bin/procstack 27198128
 242001 T   oracle 27198128        1   1  60 20 7412f4590 108540            Dec 29      - 569:20 ora_lmd0_XXXX1
procstack: open(/proc/27198128/ctl): Device busy
*** 2015-01-13 03:44:18.420

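From the analysis so far, the key is the ORA-00481 error, and that error was raised by the LMON process.
According to Oracle MOS document 1950963.1, ORA-00481 usually has the following possible causes:

1) The instance cannot obtain LE (Lock Element) locks
2) The RAC flow-control mechanism runs short of tickets

Following that document, we searched the LMS process traces on both database nodes and did not find keywords like the following: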
Start affinity expansion for pkey 81885.0
Expand failed: pkey 81885.0, 229 shadows traversed, 153 replayed 1 retries

We can therefore rule out the first cause. Likewise, in the LMD process trace we can see information like the following:

GES destination context:
Dest 1  rcvr 0  inc 68  state 0x10041  tstate 0x0
 batch-type quick  bmsg 0x0  tmout 0x20f0dd31  msg_in_batch 0
tkt total 1000  avl 743 sp_rsv 242 max_sp_rsv 250
 seq wrp 0  lst 268971339  ack 0  snt 268971336
 sync seq 0.268971339  inc 0  sndq enq seq 0.268971339
 batch snds 546480  tot msgs 5070830  max sz 88  fullload 85  snd seq 546480
 pbatch snds 219682271  tot msgs 267610831
 sndq msg tot 225339578  tm (0 17706)
 sndq msg 0  maxlmt 7060267  maxlen 149  wqlen 225994573
 sndq msg 0  start_tm 0  end_tm 0

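The "tkt" line above (total 1000, avl 743) shows that tickets were sufficient, so we can also rule out the second cause. In other words, this ORA-00481 was not caused by the Oracle RAC configuration itself.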
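
The ticket check can be scripted when many destination contexts have to be reviewed. This is only a sketch: the field names come from the `tkt` line in the trace above, while the function name `ticket_summary` and the idea of reporting a percentage are choices made for this illustration, not an Oracle convention.

```python
import re

TKT_RE = re.compile(
    r"tkt total (\d+)\s+avl (\d+)\s+sp_rsv (\d+)\s+max_sp_rsv (\d+)"
)


def ticket_summary(line):
    """Parse a 'tkt total ... avl ...' line from a GES destination context."""
    match = TKT_RE.search(line)
    if match is None:
        return None
    total, available, sp_rsv, max_sp_rsv = (int(g) for g in match.groups())
    return {
        "total": total,
        "available": available,
        "sp_rsv": sp_rsv,
        "max_sp_rsv": max_sp_rsv,
        # Share of tickets still available, as a percentage of the total.
        "available_pct": 100.0 * available / total,
    }


if __name__ == "__main__":
    print(ticket_summary("tkt total 1000  avl 743 sp_rsv 242 max_sp_rsv 250"))
```

For the line in this case the summary shows 743 of 1000 tickets still available, which matches the conclusion that flow control was not starved.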

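When LMON detects a timeout in a DRM operation and the instance eventually crashes, the timeout usually has one of the following causes:

1) Extremely high OS load, for example a saturated CPU, so that processes cannot get CPU time
2) A fault in the process itself, for example a hung process
3) Network problems, for example abnormal communication between the database nodes
4) Weaknesses in the DRM mechanism itself
5) An Oracle DRM bug

The diagnostic information above (loadavg : 1.71 1.75 2.33) shows that the OS load was very low when the problem occurred, so we can rule out the first cause directly.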
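
The load check can be made explicit as well. In this sketch the `loadavg` line format is the one from the process dump above; the CPU count is not given anywhere in the source, so `cpu_count` is a hypothetical input, and "1-minute load above CPU count" is only a common rule of thumb for calling a host overloaded.

```python
def parse_loadavg(line):
    """Parse 'loadavg : 1.71 1.75 2.33' into a tuple of floats (1, 5, 15 min)."""
    label, _, values = line.partition(":")
    if label.strip() != "loadavg":
        return None
    return tuple(float(value) for value in values.split())


def is_overloaded(loadavg, cpu_count):
    """Rule of thumb (an assumption): overloaded if 1-minute load exceeds CPUs."""
    return loadavg[0] > cpu_count


if __name__ == "__main__":
    load = parse_loadavg("loadavg : 1.71 1.75 2.33")
    # cpu_count=8 is a made-up value; the real host's CPU count is not in the trace.
    print(load, is_overloaded(load, cpu_count=8))
```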

Our goal now is to work out what anomaly LMON detected and why it occurred. A dump was taken before the LMD process aborted, so we can look for clues in that dump, as follows:

PROCESS 5:
 ----------------------------------------
 SO: 700001406331850, type: 2, owner: 0, flag: INIT/-/-/0x00
 (process) Oracle pid=5, calls cur/top: 7000014054d75e0/7000014054d75e0, flag: (6) SYSTEM
 int error: 0, call error: 0, sess error: 0, txn error 0
 (post info) last post received: 0 0 24
 last post received-location: ksasnd
 last process to post me: 700001405330198 1 6
 last post sent: 0 0 24
 last post sent-location: ksasnd
 last process posted by me: 7000014023045d0 1 2
 (latch info) wait_event=0 bits=0
 Process Group: DEFAULT, pseudo proc: 70000140336dd88
 O/S info: user: oracle, term: UNKNOWN, ospid: 10682988
OSD pid info: Unix process pid: 10682988, image: oracle@tpihxdb1 (LMON)
Dump of memory from 0x07000014022E9320 to 0x07000014022E9528
......
7000014022E9520 00000000 00000000                    [........]
(FOB) flags=67 fib=7000013c30cedf0 incno=0 pending i/o cnt=0
 fname=/oradata2/XXXX/control03.ctl
 fno=2 lblksz=16384 fsiz=1626
 (FOB) flags=67 fib=7000013c30cea50 incno=0 pending i/o cnt=0
 fname=/oradata1/XXXX/control02.ctl
 fno=1 lblksz=16384 fsiz=1626
 (FOB) flags=67 fib=7000013c30ce6b0 incno=0 pending i/o cnt=0
 fname=/oradata1/XXXX/control01.ctl
 fno=0 lblksz=16384 fsiz=1626
 ----------------------------------------
 SO: 7000014036e36a8, type: 19, owner: 700001406331850, flag: INIT/-/-/0x00
 GES MSG BUFFERS: st=emp chunk=0x0 hdr=0x0 lnk=0x0 flags=0x0 inc=68
 outq=0 sndq=0 opid=5 prmb=0x0
mbg[i]=(158041 85020) mbg[b]=(3534178 0) mbg[r]=(0 0)
 fmq[i]=(20 1) fmq[b]=(5 0) fmq[r]=(0 0)
 mop[s]=1 mop[q]=3692218 pendq=0 zmbq=0
 nonksxp_recvs=0
 ------------process 0x7000014036e36a8--------------------
 proc version      : 0
 Local node        : 0
 pid               : 10682988
 lkp_node          : 0
 svr_mode          : 0
 proc state        : KJP_NORMAL
 Last drm hb acked : 0
 Total accesses    : 2
 Imm.  accesses    : 1
 Locks on ASTQ     : 0
 Locks Pending AST : 0
 Granted locks     : 0
 AST_Q:
PENDING_Q:
GRANTED_Q:
----------------------------------------
 SO: 7000014036efd68, type: 19, owner: 7000014036e36a8, flag: INIT/-/-/0x00
 ----------------------------------------
 SO: 70000136e79f658, type: 18, owner: 7000014036efd68, flag: INIT/-/-/0x00
 ----------enqueue 0x70000136e79f658------------------------
 lock version     : 1
 Owner node       : 1
 grant_level      : KJUSEREX
 req_level        : KJUSEREX
 bast_level       : KJUSEREX
 notify_func      : 0
 resp             : 70000140c6d9d40
 procp            : 7000014036efd68
 pid              : 0
 proc version     : 0
 oprocp           : 0
 opid             : 0
 group lock owner : 0
 xid              : 0000-0000-00000000
 dd_time          : 0.0 secs
 dd_count         : 0
 timeout          : 0.0 secs
 On_timer_q?      : N
 On_dd_q?         : N
 lock_state       : GRANTED
 Open Options     :
Convert options  :
History          : 0x9c8d
 Msg_Seq          : 0x1
 res_seq          : 2
 valblk           : 0x00000000000000000000000000000000 .
 ----------resource 0x70000140c6d9d40----------------------
 resname       : [0x19][0x2],[RS]
 Local node    : 0
 dir_node      : 0
 master_node   : 0
 hv idx        : 122
 hv last r.inc : 68
 current inc   : 68
 hv status     : 0
 hv master     : 0
 open options  :
grant_bits    : KJUSERNL KJUSEREX
grant mode    : KJUSERNL  KJUSERCR  KJUSERCW  KJUSERPR  KJUSERPW  KJUSEREX
 count         : 1         0         0         0         0         1
 val_state     : KJUSERVS_DUBVALUE
 valblk        : 0x00000000000000000000000000000000 .
 access_node   : 1
 vbreq_state   : 0
 state         : x0
 resp          : 70000140c6d9d40
 On Scan_q?    : N
 Total accesses: 132825
 Imm.  accesses: 123137
 Granted_locks : 1
Cvting_locks  : 1
value_block:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 GRANTED_Q :
 lp 70000136e79f658 gl KJUSEREX rp 70000140c6d9d40 [0x19][0x2],[RS]
 master 0 owner 1  bast 0 rseq 2 mseq 0x1 history 0x9c8d
 open opt
CONVERT_Q:
lp 700001403a3d0c0 gl KJUSERNL rl KJUSEREX rp 70000140c6d9d40 [0x19][0x2],[RS]
 master 0 pid 25428644 bast 0 rseq 3 mseq 0 history 0x9a
 convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
Rdomain number is 0
 ----------------------------------------
 SO: 7000014064d88c8, type: 4, owner: 700001406331850, flag: INIT/-/-/0x00
 (session) sid: 1652 trans: 0, creator: 700001406331850, flag: (51) USR/- BSY/-/-/-/-/-
 DID: 0001-0005-00000006, short-term DID: 0000-0000-00000000
 txn branch: 0
 oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
 service name: SYS$BACKGROUND
 last wait for 'ges generic event' blocking sess=0x0 seq=35081 wait_time=158 seconds since wait started=0
 =0, =0, =0
 Dumping Session Wait History
 for 'ges generic event' count=1 wait_time=158
 =0, =0, =0
 .......
 ----------------------------------------
 SO: 7000013c2f52018, type: 41, owner: 7000014064d88c8, flag: INIT/-/-/0x00
 (dummy) nxc=0, nlb=0
----------------------------------------
 SO: 7000014035cb1e8, type: 11, owner: 700001406331850, flag: INIT/-/-/0x00
 (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 700001406331850,
 event: 5, last message event: 70,
 last message waited event: 70,                        next message: 0(0), messages read: 1
 channel: (7000014074b6298) system events broadcast channel
 scope: 2, event: 129420, last mesage event: 70,
 publishers/subscribers: 0/915,
 messages published: 1
 ----------------------------------------
 SO: 7000014054d75e0, type: 3, owner: 700001406331850, flag: INIT/-/-/0x00
 (call) sess: cur 7000014064d88c8, rec 0, usr 7000014064d88c8; depth: 0
 ----------------------------------------
 SO: 7000014036e3060, type: 16, owner: 700001406331850, flag: INIT/-/-/0x00
 (osp req holder)

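According to LMON's own dump, the LMON process of the node 1 instance was in a normal state. The last process to post LMON was SO: 700001405330198. Searching for that SO shows that it is the LMD process, as follows: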

PROCESS 6:
 ----------------------------------------
 SO: 700001405330198, type: 2, owner: 0, flag: INIT/-/-/0x00
 (process) Oracle pid=6, calls cur/top: 7000014054d78a0/7000014054d78a0, flag: (6) SYSTEM
 int error: 0, call error: 0, sess error: 0, txn error 0
 (post info) last post received: 0 0 104
 last post received-location: kjmpost: post lmd
 last process to post me: 700001402307510 1 6
 last post sent: 0 0 24
 last post sent-location: ksasnd
 last process posted by me: 700001407305690 1 6
 (latch info) wait_event=0 bits=0
 Process Group: DEFAULT, pseudo proc: 70000140336dd88
 O/S info: user: oracle, term: UNKNOWN, ospid: 27198128 (DEAD)
 OSD pid info: Unix process pid: 27198128, image: oracle@tpihxdb1 (LMD0)
Dump of memory from 0x07000014072E8460 to 0x07000014072E8668
7000014072E8460 00000007 00000000 07000014 036E30E8  [.............n0.]
......
7000014072E8660 00000000 00000000                    [........]
----------------------------------------
 SO: 7000014008a8d00, type: 20, owner: 700001405330198, flag: -/-/-/0x00
namespace [KSXP] key   = [ 32 31 30 47 45 53 52 30 30 30 00 ]
 ----------------------------------------
 SO: 7000014074a73d8, type: 4, owner: 700001405330198, flag: INIT/-/-/0x00
 (session) sid: 1651 trans: 0, creator: 700001405330198, flag: (51) USR/- BSY/-/-/-/-/-
 DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
 txn branch: 0
 oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
 service name: SYS$BACKGROUND
 last wait for 'ges remote message' blocking sess=0x0 seq=25909 wait_time=163023 seconds since wait started=62
 waittime=40, loop=0, p3=0
 Dumping Session Wait History
 for 'ges remote message' count=1 wait_time=163023
 waittime=40, loop=0, p3=0
 .......
 ----------------------------------------
 SO: 7000013c2f51f28, type: 41, owner: 7000014074a73d8, flag: INIT/-/-/0x00
 (dummy) nxc=0, nlb=0
----------------------------------------
 SO: 7000014035cb2f8, type: 11, owner: 700001405330198, flag: INIT/-/-/0x00
 (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 700001405330198,
 event: 6, last message event: 70,
 last message waited event: 70,                        next message: 0(0), messages read: 1
 channel: (7000014074b6298) system events broadcast channel
 scope: 2, event: 129420, last mesage event: 70,
 publishers/subscribers: 0/915,
 messages published: 1
 ----------------------------------------
 SO: 7000014036e3ba0, type: 19, owner: 700001405330198, flag: INIT/-/-/0x00
 GES MSG BUFFERS: st=emp chunk=0x0 hdr=0x0 lnk=0x0 flags=0x0 inc=68
 outq=0 sndq=1 opid=6 prmb=0x0
mbg[i]=(0 55) mbg[b]=(11085 217185741) mbg[r]=(0 0)
 fmq[i]=(30 14) fmq[b]=(20 5) fmq[r]=(0 0)
 mop[s]=224759292 mop[q]=216634842 pendq=0 zmbq=0
 nonksxp_recvs=0
 ------------process 0x7000014036e3ba0--------------------
 proc version      : 0
 Local node        : 0
 pid               : 27198128
 lkp_node          : 0
 svr_mode          : 0
 proc state        : KJP_NORMAL
 Last drm hb acked : 0
 Total accesses    : 31515
 Imm.  accesses    : 31478
 Locks on ASTQ     : 0
 Locks Pending AST : 0
 Granted locks     : 2
 AST_Q:
PENDING_Q:
GRANTED_Q:
KJM HIST LMD0:
7:0 6:1 10:31:0 9:31:3 11:1 15:1 12:78181 7:0 6:0 10:31:1
9:31:2 11:1 15:1 12:78106 7:1 6:0 10:31:0 9:31:2 11:2 15:0
12:78805 7:0 6:0 10:31:1 9:31:2 11:1 15:1 12:78194 7:0 6:0
10:31:1 9:31:2 11:1 15:0 12:78177 7:0 6:0 10:31:1 9:31:2 11:1
15:0 12:78176 7:0 6:1 10:31:0 9:31:2 11:1 15:1 12:78890 7:1
6:0 10:31:0 9:31:2 11:2 15:0 12:78177 7:0 6:1 10:31:0 9:31:3
11:1 15:0 12:78180 7:0
DEFER MSG QUEUE ON LMD0 IS EMPTY
 SEQUENCES:
0:0.0  1:283096258.0
 ----------------------------------------
 SO: 7000014054d78a0, type: 3, owner: 700001405330198, flag: INIT/-/-/0x00
 (call) sess: cur 7000014074a73d8, rec 0, usr 7000014074a73d8; depth: 0
 ----------------------------------------
 SO: 7000014036e30e8, type: 16, owner: 700001405330198, flag: INIT/-/-/0x00
 (osp req holder)

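The LMD process's own dump does not appear to show anything abnormal either. Judging from the process dumps of both LMON and LMD, the processes themselves were in a normal state, so we can rule out a hung process as the cause of the timeout.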
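
Following a "last process to post me" address to its owner by hand is tedious in a large system state dump. The sketch below indexes the process state objects first. The sample text is a shortened copy of lines from the two dumps above; the regular expressions assume that a process state object is the `type: 2` SO and that its name appears in the `image:` line. Both assumptions hold for this dump but are not guaranteed for other versions, and the function names are made up here.

```python
import re

# Shortened copy of lines from the PROCESS 5 and PROCESS 6 dumps above.
DUMP = """\
PROCESS 5:
 SO: 700001406331850, type: 2, owner: 0, flag: INIT/-/-/0x00
 last process to post me: 700001405330198 1 6
OSD pid info: Unix process pid: 10682988, image: oracle@tpihxdb1 (LMON)
PROCESS 6:
 SO: 700001405330198, type: 2, owner: 0, flag: INIT/-/-/0x00
 last process to post me: 700001402307510 1 6
 OSD pid info: Unix process pid: 27198128, image: oracle@tpihxdb1 (LMD0)
"""

SO_RE = re.compile(r"SO: ([0-9a-f]+), type: 2,")
POSTER_RE = re.compile(r"last process to post me: ([0-9a-f]+)")
IMAGE_RE = re.compile(r"image: \S+ \((\w+)\)")


def index_processes(dump_text):
    """Map each process state object address to its name and last poster."""
    procs = {}
    current = None
    for line in dump_text.splitlines():
        so = SO_RE.search(line)
        if so:
            current = {"name": None, "posted_by": None}
            procs[so.group(1)] = current
            continue
        if current is None:
            continue
        poster = POSTER_RE.search(line)
        if poster:
            current["posted_by"] = poster.group(1)
        image = IMAGE_RE.search(line)
        if image:
            current["name"] = image.group(1)
    return procs


def last_poster_name(procs, name):
    """Return the name of the process that last posted `name`, if it is indexed."""
    for entry in procs.values():
        if entry["name"] == name:
            poster = procs.get(entry["posted_by"])
            return poster["name"] if poster else None
    return None


if __name__ == "__main__":
    print(last_poster_name(index_processes(DUMP), "LMON"))
```

On the sample this resolves LMON's last poster to LMD0, the same result the manual search gave. LMD0's own last poster is not part of the excerpt, so the lookup returns `None` for it.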

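We can see that the LMD process was waiting on "ges remote message" the whole time, which clearly means it was communicating with the other database node. To find the root cause, we therefore also need to analyze information from the database instance on node 2.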

First, let's look at the database alert log of the node 2 instance:

Tue Jan 13 02:26:13 2015
Thread 2 advanced to log sequence 47410 (LGWR switch)
 Current log# 7 seq# 47410 mem# 0: /redolog/XXXX/redo0701.log
 Current log# 7 seq# 47410 mem# 1: /newredolog/XXXX/redo0703.log
Tue Jan 13 03:39:14 2015
Timed out trying to start process PZ96.
Tue Jan 13 03:44:44 2015
Trace dumping is performing id=[cdmp_20150113034443]
Tue Jan 13 03:44:48 2015
Reconfiguration started (old inc 68, new inc 70)
List of nodes:
 1
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Tue Jan 13 03:44:48 2015
 LMS 0: 0 GCS shadows cancelled, 0 closed
......
 LMS 5: 0 GCS shadows cancelled, 0 closed
Tue Jan 13 03:44:48 2015
 LMS 9: 3 GCS shadows cancelled, 0 closed
 Set master node info
Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Tue Jan 13 03:44:49 2015
Instance recovery: looking for dead threads
Tue Jan 13 03:44:49 2015
Beginning instance recovery of 1 threads
Tue Jan 13 03:44:50 2015
 LMS 6: 282848 GCS shadows traversed, 0 replayed
Tue Jan 13 03:44:50 2015
 LMS 7: 284544 GCS shadows traversed, 0 replayed
.......
Tue Jan 13 03:44:51 2015
 LMS 10: 283658 GCS shadows traversed, 0 replayed
Tue Jan 13 03:44:51 2015
 LMS 11: 282777 GCS shadows traversed, 0 replayed
Tue Jan 13 03:44:51 2015
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Tue Jan 13 03:44:54 2015
 parallel recovery started with 16 processes
Tue Jan 13 03:44:55 2015
Started redo scan
Tue Jan 13 03:44:55 2015
Completed redo scan
 281591 redo blocks read, 4288 data blocks need recovery
Tue Jan 13 03:44:56 2015
Started redo application at
 Thread 1: logseq 47935, block 1974207
Tue Jan 13 03:44:56 2015
Recovery of Online Redo Log: Thread 1 Group 4 Seq 47935 Reading mem 0
 Mem# 0: /redolog/XXXX/redo0401.log
 Mem# 1: /newredolog/XXXX/redo0403.log
Tue Jan 13 03:44:56 2015
Recovery of Online Redo Log: Thread 1 Group 5 Seq 47936 Reading mem 0
 Mem# 0: /redolog/XXXX/redo0501.log
 Mem# 1: /newredolog/XXXX/redo0503.log
Tue Jan 13 03:44:57 2015
Completed redo application
Tue Jan 13 03:44:57 2015
Completed instance recovery at
 Thread 1: logseq 47936, block 270263, scn 6869096106270
 4253 data blocks read, 4901 data blocks written, 281591 redo blocks read
Tue Jan 13 03:44:57 2015
Thread 1 advanced to log sequence 47937 (thread recovery)
Tue Jan 13 03:44:57 2015
Redo thread 1 internally disabled at seq 47937 (SMON)
Tue Jan 13 03:44:58 2015
ARC1: Archiving disabled thread 1 sequence 47937
Tue Jan 13 03:44:59 2015
Thread 2 advanced to log sequence 47411 (LGWR switch)
 Current log# 8 seq# 47411 mem# 0: /redolog/XXXX/redo0801.log
 Current log# 8 seq# 47411 mem# 1: /newredolog/XXXX/redo0803.log
Tue Jan 13 03:45:09 2015
SMON: Parallel transaction recovery tried
Tue Jan 13 03:45:53 2015
Reconfiguration started (old inc 70, new inc 72)
List of nodes:
 0 1
 Global Resource Directory frozen
 Communication channels reestablished
 * domain 0 valid = 1 according to instance 0
Tue Jan 13 03:45:53 2015
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Tue Jan 13 03:45:53 2015
 LMS 0: 0 GCS shadows cancelled, 0 closed

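According to the node 2 alert log, the instance reconfiguration began at 03:44:48, which matches the time of the failure. The alert log itself does not say much more, so we move on to the LMON trace of the node 2 instance: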

*** 2015-01-13 03:18:53.006
Begin DRM(82933)
 sent syncr inc 68 lvl 84937 to 0 (68,0/31/0)
synca inc 68 lvl 84937 rcvd (68.0)
sent syncr inc 68 lvl 84938 to 0 (68,0/34/0)
......
sent syncr inc 68 lvl 84968 to 0 (68,0/38/0)
synca inc 68 lvl 84968 rcvd (68.0)
End DRM(82933)
*** 2015-01-13 03:23:55.896
Begin DRM(82934)
 sent syncr inc 68 lvl 84969 to 0 (68,0/31/0)
synca inc 68 lvl 84969 rcvd (68.0)
......
sent syncr inc 68 lvl 85000 to 0 (68,0/38/0)
synca inc 68 lvl 85000 rcvd (68.0)
End DRM(82934)
*** 2015-01-13 03:29:00.374
Begin DRM(82935)
 sent syncr inc 68 lvl 85001 to 0 (68,0/31/0)
synca inc 68 lvl 85001 rcvd (68.0)
......
 sent syncr inc 68 lvl 85011 to 0 (68,0/36/0)
synca inc 68 lvl 85011 rcvd (68.0)
*** 2015-01-13 03:29:10.511
 sent syncr inc 68 lvl 85012 to 0 (68,0/38/0)
synca inc 68 lvl 85012 rcvd (68.0)
......
sent syncr inc 68 lvl 85020 to 0 (68,0/38/0)
synca inc 68 lvl 85020 rcvd (68.0)
*** 2015-01-13 03:44:45.191
kjxgmpoll reconfig bitmap: 1
*** 2015-01-13 03:44:45.191
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 68 0.
*** 2015-01-13 03:44:45.222
 Name Service frozen
kjxgmcs: Setting state to 68 1.
kjxgfipccb: msg 0x110fffe78, mbo 0x110fffe70, type 22, ack 0, ref 0, stat 34
kjxgfipccb: Send cancelled, stat 34 inst 0, type 22, tkt (1416,80)
kjxgfipccb: msg 0x110ffa0b8, mbo 0x110ffa0b0, type 22, ack 0, ref 0, stat 34
kjxgfipccb: Send cancelled, stat 34 inst 0, type 22, tkt (944,80)
kjxgfipccb: msg 0x11113be68, mbo 0x11113be60, type 22, ack 0, ref 0, stat 34
kjxgfipccb: Send cancelled, stat 34 inst 0, type 22, tkt (472,80)
kjxgrssvote: reconfig bitmap chksum 0xd7682cca cnt 1 master 1 ret 0
kjxggpoll: change poll time to 50 ms
* kjfcln: DRM aborted due to CGS rcfg.
*** 2015-01-13 03:44:45.281

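The LMON trace shows that DRM operations were running constantly before the failure. Two lines in the trace are critical: "kjxgmrcfg: Reconfiguration started, reason 1" and "* kjfcln: DRM aborted due to CGS rcfg.". First, the reconfiguration was started with reason code 1. Oracle Metalink documents the reason values as follows: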

Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend

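Reason = 1 tells us that the forced termination and restart of the instance was not caused by a communications failure; in that case the reason value would normally be 3. Reason 1 means the reconfiguration was triggered by the node's own monitoring.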
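As a small illustration of the lookup described above, the following sketch scans LMON trace text for the "Reconfiguration started, reason N" line and maps the code to the description from the Metalink excerpt. The function name `reconfig_reasons` is made up for this example, and the regular expression assumes the line format shown in this trace.

```python
import re

# Reason codes exactly as listed in the Metalink excerpt quoted in the article.
RECONFIG_REASONS = {
    0: "No reconfiguration",
    1: "The Node Monitor generated the reconfiguration",
    2: "An instance death was detected",
    3: "Communications failure",
    4: "Reconfiguration after suspend",
}

# Assumes the "kjxgmrcfg: Reconfiguration started, reason N" format seen in the trace.
_REASON_RE = re.compile(r"kjxgmrcfg: Reconfiguration started, reason (\d+)")


def reconfig_reasons(trace_text):
    """Return (code, description) for every reconfiguration start found."""
    found = []
    for match in _REASON_RE.finditer(trace_text):
        code = int(match.group(1))
        found.append((code, RECONFIG_REASONS.get(code, "unknown reason code")))
    return found


sample = """\
*** 2015-01-13 03:44:45.191
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 68 0.
"""
print(reconfig_reasons(sample))
```

Run against a full LMON trace, this gives a quick list of every reconfiguration and why it started.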

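The key message "* kjfcln: DRM aborted due to CGS rcfg." further ties the CGS reconfiguration to the failed DRM operation. We can also see that Begin DRM(82935), which started at 03:29, had still not finished when the failure hit at 03:44 (had it finished, a line such as "End DRM(82935)" would appear).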
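The check for a DRM operation that never ended can be sketched as below. This is only an illustration: `unfinished_drm` is a hypothetical helper, and it assumes the "*** timestamp", "Begin DRM(n)" and "End DRM(n)" line formats seen in this LMON trace.

```python
import re
from datetime import datetime

_TS_RE = re.compile(r"^\*\*\* (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+")
_DRM_RE = re.compile(r"^(Begin|End) DRM\((\d+)\)")


def unfinished_drm(trace_lines):
    """Map each DRM id that has a Begin but no End to its start timestamp."""
    current_ts = None
    open_drm = {}
    for raw in trace_lines:
        line = raw.strip()
        ts = _TS_RE.match(line)
        if ts:
            # Remember the most recent trace timestamp (fractional seconds dropped).
            current_ts = datetime.strptime(ts.group(1), "%Y-%m-%d %H:%M:%S")
            continue
        drm = _DRM_RE.match(line)
        if drm:
            kind, drm_id = drm.group(1), int(drm.group(2))
            if kind == "Begin":
                open_drm[drm_id] = current_ts
            else:
                open_drm.pop(drm_id, None)
    return open_drm


sample = """\
*** 2015-01-13 03:23:55.896
Begin DRM(82934)
End DRM(82934)
*** 2015-01-13 03:29:00.374
Begin DRM(82935)
*** 2015-01-13 03:44:45.191
kjxgmrcfg: Reconfiguration started, reason 1
""".splitlines()

pending = unfinished_drm(sample)
reconfig_at = datetime(2015, 1, 13, 3, 44, 45)
for drm_id, started in pending.items():
    print(drm_id, "still open after", reconfig_at - started)
```

For the trace excerpt in this article, the only open operation is DRM 82935, which had been running for more than 15 minutes when the reconfiguration started.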

This suggests the cluster database was probably already in trouble from 03:29 onward. A brief note on how Oracle DRM works:

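In an Oracle RAC environment, when one node accesses a resource frequently and the master of that resource is not the local node, a DRM operation may be triggered. DRM stands for Dynamic Resource Mastering (also called dynamic remastering). Before this feature was introduced in Oracle 10gR1, changing the master node of a resource required restarting the database instance, so its introduction was a significant change. Starting with Oracle 10gR2, affinity at the object/undo level was added as well. Affinity is a concept borrowed from operating systems: the degree of attachment to an object, which for the database means how frequently an object is accessed.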

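In Oracle 10gR2, by default, when the access count for an object exceeds 50 and the object's master is another node, Oracle triggers a DRM operation. During DRM, Oracle temporarily freezes the information about the resource, unfreezes it on the other node, and then changes the resource's master node. Note that what is frozen here is the resource in the GRD (Global Resource Directory). Throughout the DRM operation, processes that access the resource are temporarily suspended, so DRM can easily cause system or process anomalies.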

According to Oracle's documentation, frequent DRM activity can well lead to a SYNC timeout:

Bug 6960699 - "latch: cache buffers chains" contention/ORA-481/kjfcdrmrfg: SYNC TIMEOUT/ OERI[kjbldrmrpst:!master] (ID 6960699.8)

Dynamic ReMastering (DRM) can be too aggressive at times causing any combination
of the following symptoms :

- "latch: cache buffers chains" contention.
- "latch: object queue header operation" contention
- a RAC node can crash with and ora-481 / kjfcdrmrfg: SYNC TIMEOUT ... step 31
- a RAC node can crash with OERI[kjbldrmrpst:!master]

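We therefore conclude that the root cause of this failure was abnormal Oracle DRM behaviour, which disrupted core RAC processes and ultimately led to the instance being forcibly terminated (which, of course, is done to protect data consistency). The customer has since disabled DRM; the system has run for a week with no further problems so far and remains under observation.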


ORA-15196: invalid ASM block header [kfc.c:26076] [hard_kfbh]


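This case comes from a reader's database, an 11g ASM environment. The ASM metadata was corrupted and the disk group could not be mounted. Fortunately the storage was mirrored. Even so, the storage engineers reportedly needed more than a day to recover it, which would be unacceptable for our business systems.
I am posting the details of this case for reference. (Note: we provide a full range of database solutions; for details see 云和恩墨.)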

WARNING: cache read  a corrupt block: group=3(DATAVG) dsk=27 blk=1 disk=27 (DATAVG_0018) incarn=4042368416 au=0 blk=1 count=1
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_2711.trc:
ORA-15196: invalid ASM block header [kfc.c:26076] [hard_kfbh] [2147483675] [1] [0 != 130]
NOTE: a corrupted block from group DATAVG was dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_2711.trc
WARNING: cache read (retry) a corrupt block: group=3(DATAVG) dsk=27 blk=1 disk=27 (DATAVG_0018) incarn=4042368416 au=0 blk=1 count=1
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_2711.trc:
ORA-15196: invalid ASM block header [kfc.c:26076] [hard_kfbh] [2147483675] [1] [0 != 130]
ORA-15196: invalid ASM block header [kfc.c:26076] [hard_kfbh] [2147483675] [1] [0 != 130]
ERROR: cache failed to read group=3(DATAVG) dsk=27 blk=1 from disk(s): 27(DATAVG_0018)
ORA-15196: invalid ASM block header [kfc.c:26076] [hard_kfbh] [2147483675] [1] [0 != 130]
ORA-15196: invalid ASM block header [kfc.c:26076] [hard_kfbh] [2147483675] [1] [0 != 130]
NOTE: cache initiating offline of disk 27 group DATAVG
NOTE: process _user2711_+asm1 (2711) initiating offline of disk 27.4042368416 (DATAVG_0018) with mask 0x7e in group 3
WARNING: Disk 27 (DATAVG_0018) in group 3 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 3, dsk = 27/0xf0f1a5a0, mask = 0x6a, op = clear
Wed Jan 28 10:41:11 2015
GMON updating disk modes for group 3 at 13 for pid 36, osid 2711
ERROR: Disk 27 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 3)
Wed Jan 28 10:41:11 2015
NOTE: cache dismounting (not clean) group 3/0xB80155A0 (DATAVG)
NOTE: messaging CKPT to quiesce pins Unix process pid: 3013, image: oracle@rsdb01 (B000)
Wed Jan 28 10:41:11 2015
NOTE: halting all I/Os to diskgroup 3 (DATAVG)
Wed Jan 28 10:41:11 2015
NOTE: LGWR doing non-clean dismount of group 3 (DATAVG)
NOTE: LGWR sync ABA=114.216 last written ABA 114.216
WARNING: Offline of disk 27 (DATAVG_0018) in group 3 and mode 0x7f failed on ASM inst 1
System State dumped to trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_2711.trc
Wed Jan 28 10:41:11 2015
kjbdomdet send to inst 2
detach from dom 3, sending detach message to inst 2
Wed Jan 28 10:41:11 2015
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 20)
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 3 invalid = TRUE
1152 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Wed Jan 28 10:41:11 2015
WARNING: dirty detached from domain 3
NOTE: cache dismounted group 3/0xB80155A0 (DATAVG)
SQL> alter diskgroup DATAVG dismount force /* ASM SERVER */
Wed Jan 28 10:41:12 2015
ERROR: ORA-15130 in COD recovery for diskgroup 3/0xb80155a0 (DATAVG)
ERROR: ORA-15130 thrown in RBAL for group number 3
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_2389.trc:
ORA-15130: diskgroup "DATAVG" is being dismounted
ERROR: ORA-15130 in COD recovery for diskgroup 3/0xb80155a0 (DATAVG)
ERROR: ORA-15130 thrown in RBAL for group number 3
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_2389.trc:
ORA-15130: diskgroup "DATAVG" is being dismounted
ERROR: ORA-15130 in COD recovery for diskgroup 3/0xb80155a0 (DATAVG)
ERROR: ORA-15130 thrown in RBAL for group number 3
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_2389.trc:
ORA-15130: diskgroup "DATAVG" is being dismounted
NOTE: AMDU dump of disk group DATAVG created at /u01/app/grid/diag/asm/+asm/+ASM1/trace
NOTE: cache deleting context for group DATAVG 3/0xb80155a0
GMON dismounting group 3 at 14 for pid 37, osid 3013
NOTE: Disk  in mode 0x8 marked for de-assignment
.......
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
SUCCESS: diskgroup DATAVG was dismounted
SUCCESS: alter diskgroup DATAVG dismount force /* ASM SERVER */
ERROR: PST-initiated MANDATORY DISMOUNT of group DATAVG
Wed Jan 28 10:41:20 2015
NOTE: diskgroup resource ora.DATAVG.dg is offline
Wed Jan 28 10:41:26 2015
NOTE: ASM client rsdb1:rsdb disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_2667.trc

From the errors above we can tell that the ASM disk group cannot be mounted. The cause is the following:

WARNING: cache read  a corrupt block: group=3(DATAVG) dsk=27 blk=1 disk=27 (DATAVG_0018) incarn=4042368416 au=0 blk=1 count=1
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_2711.trc:
ORA-15196: invalid ASM block header [kfc.c:26076] [hard_kfbh] [2147483675] [1] [0 != 130]

These lines show that block 1 of AU 0 on disk 27 of disk group DATAVG is corrupted. In effect, the disk header area is damaged.

A brief explanation of the ORA-15196 error:

ORA-15196: invalid ASM block header [kfc.c:26076] [hard_kfbh] [2147483675] [1] [0 != 130]

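[kfc.c:26076]: the error was raised at line 26076 of kfc.c
[hard_kfbh]:   the type of check that failed
[2147483675]:  the file number
[1]:           the block number
[0 != 130]:    the value found versus the value expected: the block holds 0, but the check expects 130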
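The field-by-field reading above can be sketched as a small parser. This is an illustration, not an Oracle tool: `decode_ora_15196` is a hypothetical helper. Treating a file number of 2^31 or more as carrying a disk number in its low bits is an assumption, made only because 2147483675 - 2^31 = 27 matches dsk=27 in the log.

```python
import re

_ORA15196_RE = re.compile(
    r"ORA-15196: invalid ASM block header "
    r"\[(?P<src>[^\]]+)\] \[(?P<check>[^\]]+)\] \[(?P<file>\d+)\] "
    r"\[(?P<block>\d+)\] \[(?P<found>\d+) != (?P<expected>\d+)\]"
)


def decode_ora_15196(line):
    """Split an ORA-15196 message into its bracketed arguments."""
    m = _ORA15196_RE.search(line)
    if m is None:
        return None
    file_no = int(m.group("file"))
    info = {
        "source": m.group("src"),            # source file and line that raised the error
        "check": m.group("check"),           # type of check that failed
        "file_number": file_no,
        "block_number": int(m.group("block")),
        "found": int(m.group("found")),
        "expected": int(m.group("expected")),
    }
    # Assumption (not confirmed by the article): the high bit flags disk-level
    # metadata and the remaining bits are the disk number.
    if file_no >= 2**31:
        info["low_bits"] = file_no - 2**31
    return info


msg = ("ORA-15196: invalid ASM block header [kfc.c:26076] [hard_kfbh] "
       "[2147483675] [1] [0 != 130]")
print(decode_ora_15196(msg))
```

For the message in this case the low bits come out as 27, the same disk number reported in the "cache read a corrupt block" warning.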

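In practice, once this error appears, more than just the disk header is damaged. In our assessment, at least the first 4 MB of ASM metadata was already corrupted. In that situation AMDU may not be able to extract the datafiles.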

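Generally speaking, for an external-redundancy disk group, as long as the first 42 MB of ASM metadata is not completely destroyed, the data in the disk group is fairly easy to get out.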

If the damage is very severe, the only option may be to scan the disks with a data extraction tool. Both DUL and ODU handle this situation well.

If you run into a similar database failure, contact us as soon as possible!


PMON crash the instance of RAC


This is a reader's problem: one node of their RAC was restarted. The logs show that one of the RAC instances crashed:

Errors in file /opt/oracle/diag/rdbms/cspora/cspora2/trace/cspora2_pmon_57410148.trc  (incident=584434):
ORA-00600: internal error code, arguments: [kjucvl:!busy], [8], [], [], [], [], [], [], [], [], [], []
Incident details in: /opt/oracle/diag/rdbms/cspora/cspora2/incident/incdir_584434/cspora2_pmon_57410148_i584434.trc
Thu Feb 05 10:24:04 2015
Trace dumping is performing id=[cdmp_20150205102404]
Errors in file /opt/oracle/diag/rdbms/cspora/cspora2/trace/cspora2_pmon_57410148.trc  (incident=584435):
ORA-00600: internal error code, arguments: [kjuscv], [6], [], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kjucvl:!busy], [8], [], [], [], [], [], [], [], [], [], []
Incident details in: /opt/oracle/diag/rdbms/cspora/cspora2/incident/incdir_584435/cspora2_pmon_57410148_i584435.trc
Thu Feb 05 10:24:06 2015
Sweep Incident[584434]: completed
Errors in file /opt/oracle/diag/rdbms/cspora/cspora2/trace/cspora2_pmon_57410148.trc:
ORA-00600: internal error code, arguments: [kjuscv], [6], [], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kjucvl:!busy], [8], [], [], [], [], [], [], [], [], [], []
PMON (ospid: 57410148): terminating the instance due to error 472
Thu Feb 05 10:24:18 2015
Termination issued to instance processes. Waiting for the processes to exit
Instance termination failed to kill one or more processes
Instance terminated by PMON, pid = 57410148
Thu Feb 05 10:24:58 2015
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 11.1.0.7.0.
Using parameter settings in server-side spfile /dev/rlv_spfile
.......
.......
ORA-1154 signalled during: alter database open...
Thu Feb 05 10:31:11 2015
Reconfiguration started (old inc 102, new inc 104)
List of nodes:
1
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Thu Feb 05 10:31:12 2015
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Feb 05 10:31:12 2015

The alert log shows that the instance was forcibly terminated by PMON, and that PMON did so because it had hit ORA-00600 errors just beforehand. Let's look at the PMON trace:

*** ACTION NAME:() 2015-02-05 10:24:01.185

Dump continued from file: /opt/oracle/diag/rdbms/cspora/cspora2/trace/cspora2_pmon_57410148.trc
ORA-00600: internal error code, arguments: [kjucvl:!busy], [8], [], [], [], [], [], [], [], [], [], []

========= Dump for incident 584434 (ORA 600 [kjucvl:!busy]) ========
----- Beginning of Customized Incident Dump(s) -----
----------enqueue 7000001cbbcdaf0------------------------
lock version     : 311947
Owner node       : 1
grant_level      : KJUSEREX
req_level        : KJUSEREX
bast_level       : KJUSERNL
notify_func      : 0
resp             : 70000018f124ac0
procp            : 7000001cb0e3960
pid              : 46924312
proc version     : 922
oprocp           : 0
opid             : 0
group lock owner : 0
xid              : 0000-0000-00000000
dd_time          : 0.0 secs
dd_count         : 0
timeout          : 0.0 secs
On_timer_q?      : N
On_dd_q?         : N
lock_state       : CONVERTING
Open Options     :  KJUSERPROCESS_OWNED
Convert options  : KJUSERGETVALUE KJUSERNOQUEUE KJUSEREXPRESS KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
History          : 0x955252b5
Msg_Seq          : 0x0
res_seq          : 41
valblk           : 0x514e5844970d5d33d156c8d80004ac1e QNXD]3V
----------resource 70000018f124ac0----------------------
resname       : [0x970d5d33][0x0],[DX]
Local node    : 1
dir_node      : 0
master_node   : 0
hv idx        : 18
hv last r.inc : 98
current inc   : 98
hv status     : 0
hv master     : 0
open options  :
Held mode     : KJUSEREX
Cvt mode      : KJUSERNL
Next Cvt mode : KJUSERNL
msg_seq       : 1
res_seq       : 41
grant_bits    : KJUSERNL KJUSEREX
grant mode    : KJUSERNL  KJUSERCR  KJUSERCW  KJUSERPR  KJUSERPW  KJUSEREX
count         : 1         0         0         0         0         1
val_state     : KJUSERVS_VALUE
valblk        : 0x514e5844970d5d33d156c8d80004ac1e QNXD]3V
access_node   : 1
vbreq_state   : 0
state         : x8
resp          : 70000018f124ac0
On Scan_q?    : N
Total accesses: 5475
Imm.  accesses: 5057
Granted_locks : 2
Cvting_locks  : 0
value_block:  51 4e 58 44 97 0d 5d 33 d1 56 c8 d8 00 04 ac 1e
GRANTED_Q :
lp 7000001cbbcdaf0 gl KJUSEREX rp 70000018f124ac0 [0x970d5d33][0x0],[DX]
 master 0 pid 46924312 bast 0 rseq 41 mseq 0 history 0x955252b5
 open opt  KJUSERPROCESS_OWNED
CONVERT_Q:
----- End of Customized Incident Dump(s) -----

*** 2015-02-05 10:24:01.286
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.

----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+002c      bl       105e6a628
ksedst1()+0064       bl       101f9c14c
ksedst()+0028        bl       ksedst1()            FFFFFFFFFFF4880 ? 000000000 ?
dbkedDefDump()+0874  bl       ksedst()             110EA2180 ?
ksedmp()+0048        bl       dbkedDefDump()       3FFFF51D0 ? 110000A90 ?
ksfdmp()+0058        bl       ksedmp()             000000020 ?
dbgexPhaseII()+0130  bl       _ptrgl()
dbgexExplicitEndInc  bl       dbgexPhaseII()       000000000 ? 000000000 ?
()+0210                                            000000000 ?
dbgeEndDDEInvocatio  bl       dbgexExplicitEndInc  110378410 ? 11037F370 ?
nImpl()+0224                  ()
dbgeEndDDEInvocatio  bl       dbgeEndDDEInvocatio  110378410 ? 11037F370 ?
n()+0038                      nImpl()
kjucvl()+0590        bl       101f9d824
kjuscv()+0c90        bl       kjucvl()             000000002 ? FFFFFFFFFFF9700 ?
 FFFFFFFFFFF9570 ?
 24244228CB0E3960 ?
 102196C80 ? FFFFFFFFFFF9220 ?
 000000000 ? 110323C28 ?
ksicon_v()+06cc      bl       kjuscv()             7000001CBBCDAF0 ?
 500000000000000 ?
 79FFFF9D00 ? 0105358E0 ?
 7000001BDF32FB4 ? 000000000 ?
 000000000 ? 000000000 ?
k2qsod()+0250        bl       ksicon_v()           1071DED20 ? 000000000 ?
 000000000 ? 084242240 ?
 70000017FEF01A0 ? 000000046 ?
 70000017FEF01C0 ?
kssxdl()+076c        bl       _ptrgl()
kssdel()+0048        bl       kssxdl()             7000001BDF32F38 ? 382242228 ?
kssdch()+0d30        bl       kssdel()             FFFFFFFFFFFA550 ?
 4224822800000003 ?
ksuxds()+08b8        bl       kssdch()             3CD1FEAE8 ? 7000001CD3ACCA0 ?
kssxdl()+076c        bl       _ptrgl()
kssdel()+0048        bl       kssxdl()             7000001CD3ACCA0 ? 3CE4C7B78 ?
kssdch()+0d30        bl       kssdel()             FFFFFFFFFFFF5D0 ?
 FFFFFFFFFFFF5E0 ?
ksudlp()+0180        bl       kssdch()             000000000 ? 000000000 ?
kssxdl()+076c        bl       _ptrgl()
kssdel()+0048        bl       kssxdl()             7000001CF9B5E48 ? 300000000 ?
ksuxdl()+029c        bl       kssdel()             033339530 ? 093339530 ?
ksuxda()+02ac        bl       ksuxdl()             7000001CF9B5E48 ? 0FFFFC1D0 ?
ksucln()+0938        bl       ksuxda()
ksbrdp()+075c        bl       _ptrgl()
opirip()+0444        bl       105e65300
opidrv()+0418        bl       opirip()             732F6373706F7261 ?
 410323C28 ? FFFFFFFFFFFF420 ?
sou2o()+0090         bl       opidrv()             3204C0756C ? 4A0071E60 ?
 FFFFFFFFFFFF420 ?
opimai_real()+0148   bl       101f9be30
main()+0090          bl       opimai_real()        000000000 ? 000000000 ?
__start()+0070       bl       main()               000000000 ? 000000000 ?

The dump shows a DX lock, which is a distributed transaction lock. Analysing the call stack pins the problem on an Oracle bug: 11868640.

Comparing against Bug 11868640 - ORA-600 [kjucvl:!busy] possible in PMON in RAC (Doc ID 11868640.8), the call stack in the PMON trace matches the bug description exactly. The problem can be fixed by installing the patch or a PSU.
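Comparing a call stack against a known signature can be partly automated. The sketch below pulls function names from the "Call Stack Trace" section and checks whether a given sequence of frames appears in order. The helper names are made up. The signature used is simply the frames visible in this PMON trace, not text copied from the bug note. Frames whose names wrap onto a second line are skipped for simplicity.

```python
import re

# A frame line starts with "name()+offset"; continuation lines do not.
_FRAME_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\(\)\+[0-9a-fA-F]+")


def stack_functions(trace_text):
    """Return the function names of the frames, innermost first."""
    names = []
    for line in trace_text.splitlines():
        m = _FRAME_RE.match(line)
        if m:
            names.append(m.group(1))
    return names


def contains_in_order(names, signature):
    """True if every frame in signature occurs in names, in the same order."""
    remaining = iter(names)
    # "frame in remaining" consumes the iterator up to the first match.
    return all(frame in remaining for frame in signature)


sample = """\
kjucvl()+0590        bl       101f9d824
kjuscv()+0c90        bl       kjucvl()             000000002 ? FFFFFFFFFFF9700 ?
 FFFFFFFFFFF9570 ?
ksicon_v()+06cc      bl       kjuscv()             7000001CBBCDAF0 ?
k2qsod()+0250        bl       ksicon_v()           1071DED20 ? 000000000 ?
"""

frames = stack_functions(sample)
# Hypothetical signature taken from the frames in this article's trace.
print(contains_in_order(frames, ["kjucvl", "kjuscv", "ksicon_v", "k2qsod"]))
```

The ordered-subsequence check tolerates extra frames in between, which is useful because diagnostic frames such as the dump routines vary between traces.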

A simple problem, shared for reference.


XTTS(Cross Platform Incremental Backup)的测试例子


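There are many familiar ways to migrate a database across platforms, for example traditional transportable tablespaces (TTS) or, on 10gR2 and later with matching endianness, RMAN CONVERT DATABASE. Other data replication software such as GoldenGate, DSG, DDS, or SharePlex can also be used.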

Each of these techniques has its own strengths. With a logical migration, however, the data verification afterwards is rather painful.

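For data migration I therefore personally prefer a physical approach. CONVERT DATABASE is too restrictive: source and target must have the same endianness. When the endianness differs, for example when migrating from AIX to Linux (x86), TTS is the only option.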

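With large data volumes, traditional TTS struggles to meet downtime requirements. Oracle therefore provides an enhanced XTTS capability that supports incremental operations, which minimises downtime. It was originally developed for Exadata only and was later made available for non-Exadata environments as well.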

 

Oracle offers two ways to perform the XTTS incremental procedure:
1) dbms_file_transfer
2) RMAN backup

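The first method requires the target database to be 11.2.0.4 or later. If the version is lower than 11.2.0.4, only the second method can be used. Even with the second method, a target below 11.2.0.4 still needs a temporary 11.2.0.4 (or later) environment installed, because the core XTTS incremental scripts depend on 11.2.0.4 or later.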
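The version rule above can be written down as a tiny decision helper. This is a sketch of the rule exactly as stated in the text; the function names and the returned dictionary layout are invented for this example.

```python
MIN_XTTS_VERSION = (11, 2, 0, 4)


def parse_version(text):
    """Turn '11.2.0.3' into (11, 2, 0, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def xtts_options(target_version):
    """Return the usable XTTS methods for a given target database version."""
    if parse_version(target_version) >= MIN_XTTS_VERSION:
        return {
            "methods": ["dbms_file_transfer", "rman"],
            "needs_temp_11204_home": False,
        }
    # Below 11.2.0.4 only the RMAN method applies, and a temporary
    # 11.2.0.4 (or later) home is still required on the target.
    return {"methods": ["rman"], "needs_temp_11204_home": True}


for version in ("11.2.0.3", "11.2.0.4", "12.1.0.2"):
    print(version, xtts_options(version))
```

Comparing tuples of integers avoids the trap of comparing version strings, where "11.2.0.10" would sort before "11.2.0.4".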

Below is a simple test of mine using the RMAN backup method, for reference.

1. Install the 11.2.0.4 software on the target (Grid Infrastructure is not needed if ASM is not used)

This step is omitted.

2. Prepare the convert instance on the target (and set the related environment variables)

[root@cszwbdb1 11204]# su - ora1124
[ora1124@cszwbdb1 ~]$ export ORACLE_HOME=/oracle/app/ora1124/product/11.2.0/dbhome_1
[ora1124@cszwbdb1 ~]$ export PATH=$PATH:$HOME/bin:$ORACLE_HOME/bin:$ORACLE_HOME/OPatch
[ora1124@cszwbdb1 ~]$ export ORACLE_SID=xtt
[ora1124@cszwbdb1 ~]$ cat << EOF > $ORACLE_HOME/dbs/init$ORACLE_SID.ora
> db_name=xtt
> compatible=11.2.0.4.0
> EOF
[ora1124@cszwbdb1 ~]$
[ora1124@cszwbdb1 ~]$ sqlplus "/as sysdba"

SQL*Plus: Release 11.2.0.4.0 Production on Mon Feb 9 11:12:41 2015

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup nomount
ORACLE instance started.

Total System Global Area 1177632768 bytes
Fixed Size                  2260848 bytes
Variable Size             935329936 bytes
Database Buffers          218103808 bytes
Redo Buffers               21938176 bytes

Note that the auxiliary instance only needs to be started in NOMOUNT state.

 

3. Unzip the RMAN convert scripts on the source

$ unzip *
Archive:  rman_xttconvert_1.4.2.1.zip
 inflating: xttcnvrtbkupdest.sql
inflating: xttdbopen.sql
inflating: xttdriver.pl
inflating: xttprep.tmpl
inflating: xtt.properties
inflating: xttstartupnomount.sql
$ pwd
/telephone_cdr/oracle11203/xtts

4. Edit xtt.properties on the source

$ cat xtt.properties
tablespaces=TEST_TAB
platformid=2
backupformat=/telephone_cdr/oracle11203/backup
backupondest=/telephone_cdr/oracle11203/backup
#srcdir=SOURCEDIR
#dstdir=DESTDIR
#srclink=ttslink
dfcopydir=/telephone_cdr/oracle11203/dfcopydir
stageondest=/ogg/11204/xtts
storageondest=/ogg/11204/xtts/test
cnvinst_home=/oracle/app/ora1124/product/11.2.0/dbhome_1
cnvinst_sid=xtts

Notes:
tablespaces: the names of the tablespaces to transport
platformid: the source platform ID, which can be obtained from v$transportable_platform
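Before running the driver script it can help to sanity-check the properties file. The sketch below reads the key=value format shown above and reports which expected keys are missing. The helper names are invented, and the list of expected keys is just the set of keys set in this article's sample file. Whether every one of them is mandatory is an assumption.

```python
def parse_properties(text):
    """Parse key=value lines, ignoring blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


# Assumption: the keys set in the sample file from this article.
EXPECTED_KEYS = (
    "tablespaces", "platformid", "backupformat", "backupondest",
    "dfcopydir", "stageondest", "storageondest",
    "cnvinst_home", "cnvinst_sid",
)


def missing_keys(props):
    """Return the expected keys that are absent or empty."""
    return [key for key in EXPECTED_KEYS if not props.get(key)]


sample = """\
tablespaces=TEST_TAB
platformid=2
backupformat=/telephone_cdr/oracle11203/backup
backupondest=/telephone_cdr/oracle11203/backup
#srcdir=SOURCEDIR
dfcopydir=/telephone_cdr/oracle11203/dfcopydir
stageondest=/ogg/11204/xtts
storageondest=/ogg/11204/xtts/test
cnvinst_home=/oracle/app/ora1124/product/11.2.0/dbhome_1
cnvinst_sid=xtts
"""

props = parse_properties(sample)
print(props["tablespaces"], missing_keys(props))
```

Commented-out entries such as srcdir are deliberately ignored, so they do not show up as configured values.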

5. Run the perl script on the source for the prepare phase

$ $ORACLE_HOME/perl/bin/perl  xttdriver.pl  -p

--------------------------------------------------------------------
Parsing properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Done parsing properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Checking properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Done checking properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Starting prepare phase
--------------------------------------------------------------------
Prepare source for Tablespaces:
 'TEST_TAB'  /ogg/11204/xtts
xttpreparesrc.sql for 'TEST_TAB' started at Tue Feb 10 09:32:16 2015
xttpreparesrc.sql for  ended at Tue Feb 10 09:32:18 2015
Prepare source for Tablespaces:
 ''  /ogg/11204/xtts
xttpreparesrc.sql for '' started at Tue Feb 10 09:34:55 2015
xttpreparesrc.sql for  ended at Tue Feb 10 09:35:05 2015
Prepare source for Tablespaces:
 ''  /ogg/11204/xtts
xttpreparesrc.sql for '' started at Tue Feb 10 09:35:14 2015
xttpreparesrc.sql for  ended at Tue Feb 10 09:35:14 2015
Prepare source for Tablespaces:
 ''  /ogg/11204/xtts
xttpreparesrc.sql for '' started at Tue Feb 10 09:35:20 2015
xttpreparesrc.sql for  ended at Tue Feb 10 09:35:21 2015
Prepare source for Tablespaces:
 ''  /ogg/11204/xtts
xttpreparesrc.sql for '' started at Tue Feb 10 09:35:27 2015
xttpreparesrc.sql for  ended at Tue Feb 10 09:35:27 2015
Prepare source for Tablespaces:
 ''  /ogg/11204/xtts
xttpreparesrc.sql for '' started at Tue Feb 10 09:35:33 2015
xttpreparesrc.sql for  ended at Tue Feb 10 09:35:33 2015
Prepare source for Tablespaces:
 ''  /ogg/11204/xtts
xttpreparesrc.sql for '' started at Tue Feb 10 09:35:39 2015
xttpreparesrc.sql for  ended at Tue Feb 10 09:35:40 2015
Prepare source for Tablespaces:
 ''  /ogg/11204/xtts
xttpreparesrc.sql for '' started at Tue Feb 10 09:35:45 2015
xttpreparesrc.sql for  ended at Tue Feb 10 09:35:46 2015

--------------------------------------------------------------------
Done with prepare phase
--------------------------------------------------------------------
$

When this completes, several files are created in the xtts directory. The contents of xttplan.txt are as follows:

$ cat  xttplan.txt
TEST_TAB::::1264229
5

The value in this file is a database SCN. It changes each time the script is run again for an incremental operation.
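To track how the SCN moves between incremental runs, xttplan.txt can be read with a few lines of code. The sketch below assumes the layout seen here: a "TABLESPACE::::SCN" line followed by one line per datafile number. The optional "number=path" handling is an extra assumption for other script versions, and `parse_xttplan` is a made-up name.

```python
def parse_xttplan(text):
    """Return {tablespace: {"scn": int, "files": [int, ...]}}."""
    plan = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "::::" in line:
            name, _, scn = line.partition("::::")
            current = {"scn": int(scn), "files": []}
            plan[name] = current
        elif current is not None:
            # Assumption: a datafile line is "5", or "5=/some/path" in other versions.
            current["files"].append(int(line.split("=", 1)[0]))
    return plan


sample = "TEST_TAB::::1264229\n5\n"
print(parse_xttplan(sample))
```

Saving the parsed SCN after each run makes it easy to confirm that the next incremental really starts from a newer SCN.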

$ cat rmanconvert.cmd
host 'echo ts::TEST_TAB';
convert from platform 'AIX-Based Systems (64-bit)'
datafile
'/ogg/11204/xtts/TEST_TAB_5.tf'
format '/ogg/11204/xtts/test/%N_%f.xtf'
parallelism 8;
$

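The file above is the RMAN convert script generated by the perl script; it must be copied to the target host. Pay attention to the format of this file. The parallelism shown is the default and can be adjusted.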
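Adjusting the parallelism by hand is easy for one file. For repeated runs, a small helper like the following sketch could rewrite the value in the generated command file. The function name is invented, and it assumes the "parallelism N;" line format shown above.

```python
import re


def set_parallelism(cmd_text, degree):
    """Return cmd_text with the 'parallelism N;' value replaced by degree."""
    if degree < 1:
        raise ValueError("degree must be >= 1")
    return re.sub(
        r"(?m)^(\s*parallelism\s+)\d+(\s*;)",
        lambda m: f"{m.group(1)}{degree}{m.group(2)}",
        cmd_text,
    )


sample = """\
host 'echo ts::TEST_TAB';
convert from platform 'AIX-Based Systems (64-bit)'
datafile
'/ogg/11204/xtts/TEST_TAB_5.tf'
format '/ogg/11204/xtts/test/%N_%f.xtf'
parallelism 8;
"""

print(set_parallelism(sample, 4))
```

Only the parallelism line is touched; the datafile and format clauses pass through unchanged.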

6. Transfer the datafile to the target

You can simply scp the file like this:
scp oracle11@133.37.253.3:/telephone_cdr/oracle11203/dfcopydir/TEST_TAB_5.tf /ogg/11204/xtts

I used ftp instead, because scp was not working here:

ftp> get TEST_TAB_5.tf
local: TEST_TAB_5.tf remote: test_tab.dbf
227 Entering Passive Mode (133,37,253,3,131,207)
150 Opening data connection for test_tab.dbf (1073750016 bytes).
226 Transfer complete.
1073750016 bytes received in 155 secs (6948.62 Kbytes/sec)
ftp> bye
421 Timeout (900 seconds): closing connection.
[root@cszwbdb1 xtts]# pwd
/ogg/11204/xtts

7. Copy the RMAN convert script from the source to the target

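At this point, copy all files in the source xtts directory to the target. If you instead unzip the rman_xttconvert package directly on the target, you also need to edit the configuration file there and copy over xttplan.txt and the other generated files from the source.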

I omit the steps for transferring the other files here:

ftp> cd /telephone_cdr/oracle11203/xtts
250 CWD command successful.
ftp> get rmanconvert.cmd
local: rmanconvert.cmd remote: rmanconvert.cmd
227 Entering Passive Mode (133,37,253,3,137,129)
150 Opening data connection for rmanconvert.cmd (189 bytes).
226 Transfer complete.
189 bytes received in 0.00881 secs (21.46 Kbytes/sec)
ftp> bye
221 Goodbye.

8. Convert the datafile on the target

[ora1124@cszwbdb1 xtts]$ perl xttdriver.pl -c

--------------------------------------------------------------------
Parsing properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Done parsing properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Checking properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Done checking properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Performing convert
--------------------------------------------------------------------

--------------------------------------------------------------------
Converted datafiles listed in: /ogg/11204/xtts/xttnewdatafiles.txt
--------------------------------------------------------------------

After the conversion:
[ora1124@cszwbdb1 xtts]$ cd test/
[ora1124@cszwbdb1 test]$ ls -ltr
total 1048588
-rw-r----- 1 ora1124 dba 1073750016 Feb 10 10:19 TEST_TAB_5.xtf
[ora1124@cszwbdb1 test]$

9. Create incremental data (source database)

SQL> conn /as sysdba
Connected.
SQL> create user roger identified by roger default tablespace test_tab;

User created.

SQL> grant connect,resource to roger;

Grant succeeded.

SQL> conn roger/roger
Connected.
SQL> create table killdb(a number);

Table created.

SQL> insert into killdb values(100);

1 row created.

SQL> commit;

Commit complete.

SQL> select * from killdb;

A
----------
 100

10. Create an incremental backup on the source database

$ pwd
/telephone_cdr/oracle11203/xtts
$ $ORACLE_HOME/perl/bin/perl xttdriver.pl -i

--------------------------------------------------------------------
Parsing properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Done parsing properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Checking properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Done checking properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Backup incremental
--------------------------------------------------------------------
Prepare newscn for Tablespaces: 'TEST_TAB'
Prepare newscn for Tablespaces: ''
Prepare newscn for Tablespaces: ''
Prepare newscn for Tablespaces: ''
Prepare newscn for Tablespaces: ''
Prepare newscn for Tablespaces: ''
Prepare newscn for Tablespaces: ''
Prepare newscn for Tablespaces: ''
rman target /  cmdfile /telephone_cdr/oracle11203/xtts/rmanincr.cmd

Recovery Manager: Release 11.2.0.3.0 - Production on Tue Feb 10 10:28:00 2015

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: TEST (DBID=2169100805)

RMAN> set nocfau;
2> host 'echo ts::TEST_TAB';
3> backup incremental from scn 1264229
4>   tag tts_incr_update tablespace 'TEST_TAB' format
5>  '/telephone_cdr/oracle11203/backup/%U';
6>
executing command: SET NOCFAU
using target database control file instead of recovery catalog

ts::TEST_TAB
host command complete

Starting backup at 10-FEB-15

allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=4 device type=DISK
backup will be obsolete on date 17-FEB-15
archived logs will not be kept or backed up
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00005 name=/telephone_cdr/oracle11203/oracle/oradata/test/test_tab.dbf
channel ORA_DISK_1: starting piece 1 at 10-FEB-15
channel ORA_DISK_1: finished piece 1 at 10-FEB-15
piece handle=/telephone_cdr/oracle11203/backup/0hputq9s_1_1 tag=TTS_INCR_UPDATE comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15

using channel ORA_DISK_1
backup will be obsolete on date 17-FEB-15
archived logs will not be kept or backed up
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
including current control file in backup set
channel ORA_DISK_1: starting piece 1 at 10-FEB-15
channel ORA_DISK_1: finished piece 1 at 10-FEB-15
piece handle=/telephone_cdr/oracle11203/backup/0iputqac_1_1 tag=TTS_INCR_UPDATE comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 10-FEB-15

Recovery Manager complete.

--------------------------------------------------------------------
Done backing up incrementals
--------------------------------------------------------------------

The incremental backup pieces from this step are recorded in the following text file.

$ cat incrbackups.txt
/telephone_cdr/oracle11203/backup/0hputq9s_1_1

11. Transfer the incremental backup to the target side

Copy /telephone_cdr/oracle11203/backup/0hputq9s_1_1 to the target host:

ftp> cd /telephone_cdr/oracle11203/backup
250 CWD command successful.
ftp> get 0hputq9s_1_1
local: 0hputq9s_1_1 remote: 0hputq9s_1_1
227 Entering Passive Mode (133,37,253,3,145,111)
150 Opening data connection for 0hputq9s_1_1 (122880 bytes).
226 Transfer complete.
122880 bytes received in 0.0147 secs (8334.24 Kbytes/sec)

ftp> cd /telephone_cdr/oracle11203/xtts
250 CWD command successful.
ftp> get tsbkupmap.txt
local: tsbkupmap.txt remote: tsbkupmap.txt
227 Entering Passive Mode (133,37,253,3,145,183)
150 Opening data connection for tsbkupmap.txt (29 bytes).
226 Transfer complete.
29 bytes received in 2.9e-05 secs (1000.00 Kbytes/sec)
ftp> get xttplan.txt
local: xttplan.txt remote: xttplan.txt
227 Entering Passive Mode (133,37,253,3,145,200)
150 Opening data connection for xttplan.txt (22 bytes).
226 Transfer complete.
22 bytes received in 0.000117 secs (188.03 Kbytes/sec)

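Note: along with the incremental backup piece, the xttplan.txt and tsbkupmap.txt files in the xtts directory on the source must also be copied to the target. The contents of both files change every time an incremental backup is taken, so after each incremental run they have to be copied again into the xtts directory on the target database host.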
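
The transfer rule above can be sketched as a small helper. This is only an illustration: the function name `files_to_transfer` is hypothetical, and it assumes that incrbackups.txt lists one backup piece path per line, as the `cat` output above suggests.

```python
from pathlib import PurePosixPath

# File names come from the walkthrough; both change on every incremental run.
CONTROL_FILES = ("xttplan.txt", "tsbkupmap.txt")


def files_to_transfer(incrbackups_text, xtts_dir):
    """List everything that must reach the target after one incremental run.

    incrbackups_text: contents of incrbackups.txt (assumed one path per line).
    xtts_dir: the xtts working directory on the source host.
    """
    pieces = [line.strip() for line in incrbackups_text.splitlines() if line.strip()]
    xtts = PurePosixPath(xtts_dir)
    # The backup pieces go first, then the two bookkeeping files.
    return pieces + [str(xtts / name) for name in CONTROL_FILES]
```

Running it after every incremental gives a checklist for the ftp session, so that the two small text files are not forgotten.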

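On a large system, the incremental step above can be repeated as many times as needed. After several rounds, once the downtime window begins, set the tablespaces to be transported to read only on the source database and take the last incremental, as follows: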

12. Final incremental backup on the source database

$ sqlplus "/as sysdba"

SQL*Plus: Release 11.2.0.3.0 Production on Tue Feb 10 12:05:17 2015

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> alter tablespace test_tab read only;

Tablespace altered.

SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
$ pwd
/telephone_cdr/oracle11203/xtts
$ $ORACLE_HOME/perl/bin/perl xttdriver.pl -i

--------------------------------------------------------------------
Parsing properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Done parsing properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Checking properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Done checking properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Backup incremental
--------------------------------------------------------------------
Prepare newscn for Tablespaces: 'TEST_TAB'
Prepare newscn for Tablespaces: ''
Prepare newscn for Tablespaces: ''
Prepare newscn for Tablespaces: ''
Prepare newscn for Tablespaces: ''
Prepare newscn for Tablespaces: ''
Prepare newscn for Tablespaces: ''
Prepare newscn for Tablespaces: ''
rman target /  cmdfile /telephone_cdr/oracle11203/xtts/rmanincr.cmd

Recovery Manager: Release 11.2.0.3.0 - Production on Tue Feb 10 12:05:48 2015

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: TEST (DBID=2169100805)

RMAN> set nocfau;
2> host 'echo ts::TEST_TAB';
3> backup incremental from scn 1264229
4>   tag tts_incr_update tablespace 'TEST_TAB' format
5>  '/telephone_cdr/oracle11203/backup/%U';
6>
executing command: SET NOCFAU
using target database control file instead of recovery catalog

ts::TEST_TAB
host command complete

Starting backup at 10-FEB-15

allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=50 device type=DISK
backup will be obsolete on date 17-FEB-15
archived logs will not be kept or backed up
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00005 name=/telephone_cdr/oracle11203/oracle/oradata/test/test_tab.dbf
channel ORA_DISK_1: starting piece 1 at 10-FEB-15
channel ORA_DISK_1: finished piece 1 at 10-FEB-15
piece handle=/telephone_cdr/oracle11203/backup/0jpuu017_1_1 tag=TTS_INCR_UPDATE comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:07

using channel ORA_DISK_1
backup will be obsolete on date 17-FEB-15
archived logs will not be kept or backed up
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
including current control file in backup set
channel ORA_DISK_1: starting piece 1 at 10-FEB-15
channel ORA_DISK_1: finished piece 1 at 10-FEB-15
piece handle=/telephone_cdr/oracle11203/backup/0kpuu01e_1_1 tag=TTS_INCR_UPDATE comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 10-FEB-15

Recovery Manager complete.

--------------------------------------------------------------------
Done backing up incrementals
--------------------------------------------------------------------

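13. Convert and apply the incremental on the target side

I ran into quite a few problems during testing and had to troubleshoot them. It turned out that the script has a built-in debug mode, enabled as follows: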

[ora1124@cszwbdb1 xtts]$ export XTTDEBUG=1   # enable debug output
[ora1124@cszwbdb1 xtts]$ perl xttdriver.pl  -r

--------------------------------------------------------------------
Parsing properties
--------------------------------------------------------------------
Key: backupondest
Values: /ogg/11204/xtts
Key: platformid
Values: 2
Key: backupformat
Values: /ogg/11204/xtts
Key: storageondest
Values: /ogg/11204/xtts
Key: dfcopydir
Values: /telephone_cdr/oracle11203/dfcopydir
Key: cnvinst_home
Values: /oracle/app/ora1124/product/11.2.0/dbhome_1
Key: cnvinst_sid
Values: xtt
Key: stageondest
Values: /ogg/11204/xtts
Key: tablespaces
Values: TEST_TAB

--------------------------------------------------------------------
Done parsing properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Checking properties
--------------------------------------------------------------------
ARGUMENT tablespaces
ARGUMENT platformid
ARGUMENT backupformat
ARGUMENT stageondest
ARGUMENT backupondest

--------------------------------------------------------------------
Done checking properties
--------------------------------------------------------------------
ORACLE_SID  : xtt
ORACLE_HOME : /oracle/app/ora1124/product/11.2.0/dbhome_1

--------------------------------------------------------------------
Start rollforward
--------------------------------------------------------------------
convert instance: /oracle/app/ora1124/product/11.2.0/dbhome_1

convert instance: xtt

ORACLE instance started.

Total System Global Area 1177632768 bytes
Fixed Size                  2260848 bytes
Variable Size             935329936 bytes
Database Buffers          218103808 bytes
Redo Buffers               21938176 bytes
rdfno 5

BEFORE ROLLPLAN

datafile number : 5

datafile name   : /ogg/11204/xtts/test/TEST_TAB_5.xtf

AFTER ROLLPLAN

CONVERTED BACKUP PIECE/ogg/11204/xtts/xib_0jpuu017_1_1_5

PL/SQL procedure successfully completed.
Entering RollForward
After applySetDataFile
Done: applyDataFileTo
Done: applyDataFileTo
Done: RestoreSetPiece
Done: RestoreBackupPiece

PL/SQL procedure successfully completed.
alter database mount
*
ERROR at line 1:
ORA-00205: error in identifying control file, check alert log for more info

alter database open
*
ERROR at line 1:
ORA-01507: database not mounted

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Error:
------
Error in executing xttdbopen.sql
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

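Note: the key operations have all completed. The ORA-00205 error at the end is raised because the temporary auxiliary instance XTT used for the conversion is only started NOMOUNT and has no control file, so this error can simply be ignored.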
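
That judgment can be written down as a quick check on the roll forward output. A minimal sketch, with assumptions: the helper name is hypothetical, "Done: RestoreBackupPiece" is taken from the log above as the sign that the apply finished, and ORA-01507 is treated as harmless only because it follows directly from the failed mount.

```python
import re

# Marker and error codes are copied from the log above; using them
# as a pass/fail rule is an assumption of this sketch.
COMPLETION_MARKER = "Done: RestoreBackupPiece"
IGNORABLE = {"ORA-00205", "ORA-01507"}


def rollforward_ok(log_text):
    """True if the apply finished and only the expected errors appear."""
    if COMPLETION_MARKER not in log_text:
        return False
    errors = set(re.findall(r"ORA-\d{5}", log_text))
    # Any ORA- code outside the ignorable set needs a human look.
    return errors <= IGNORABLE
```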

14. Plug the tablespace metadata into the target database

The Perl script can also generate the plug-in command for us, as follows:

[ora1124@cszwbdb1 xtts]$  perl xttdriver.pl -e

--------------------------------------------------------------------
Parsing properties
--------------------------------------------------------------------
Key: backupondest
Values: /ogg/11204/xtts
Key: platformid
Values: 2
Key: backupformat
Values: /ogg/11204/xtts
Key: storageondest
Values: /ogg/11204/xtts
Key: dfcopydir
Values: /telephone_cdr/oracle11203/dfcopydir
Key: cnvinst_home
Values: /oracle/app/ora1124/product/11.2.0/dbhome_1
Key: cnvinst_sid
Values: xtt
Key: stageondest
Values: /ogg/11204/xtts
Key: tablespaces
Values: TEST_TAB

--------------------------------------------------------------------
Done parsing properties
--------------------------------------------------------------------

--------------------------------------------------------------------
Checking properties
--------------------------------------------------------------------
ARGUMENT tablespaces
ARGUMENT platformid
ARGUMENT backupformat
ARGUMENT stageondest
ARGUMENT backupondest

--------------------------------------------------------------------
Done checking properties
--------------------------------------------------------------------
ORACLE_SID  : xtt
ORACLE_HOME : /oracle/app/ora1124/product/11.2.0/dbhome_1

--------------------------------------------------------------------
Generating plugin
--------------------------------------------------------------------

--------------------------------------------------------------------
Done generating plugin file /ogg/11204/xtts/xttplugin.txt
--------------------------------------------------------------------
[ora1124@cszwbdb1 xtts]$ cat /ogg/11204/xtts/xttplugin.txt
impdp directory=<DATA_PUMP_DIR> logfile=<tts_imp.log> \
network_link=<ttslink> transport_full_check=no \
transport_tablespaces=TEST_TAB \
transport_datafiles='/ogg/11204/xtts/test/TEST_TAB_5.xtf'

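The generated command is written to /ogg/11204/xtts/xttplugin.txt; all that remains is to create the directory object and the network_link it refers to.
In my case, however, impdp failed after I created the database link, so I moved the metadata with exp/imp instead, as follows: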

15. Export the metadata on the source database

$ exp \'/ as sysdba\' tablespaces=test_tab transport_tablespace=y file=/telephone_cdr/oracle11203/dfcopydir/test_xtts.dmp

Export: Release 11.2.0.3.0 - Production on Tue Feb 10 17:26:52 2015

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Export done in US7ASCII character set and AL16UTF16 NCHAR character set
server uses ZHS16GBK character set (possible charset conversion)
Note: table data (rows) will not be exported
About to export transportable tablespace metadata...
For tablespace TEST_TAB ...
. exporting cluster definitions
. exporting table definitions
. . exporting table                             T1
EXP-00091: Exporting questionable statistics.
EXP-00091: Exporting questionable statistics.
. . exporting table                         KILLDB
. exporting referential integrity constraints
. exporting triggers
. end transportable tablespace metadata export
Export terminated successfully with warnings.
$

16. Import the metadata into the target database

1) First create the required users (roger is the test user I created during the incremental phase)

[oracle@cszwbdb1 ~]$ sqlplus "/as sysdba"

SQL*Plus: Release 11.2.0.3.0 Production on Tue Feb 10 17:36:48 2015

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options

SQL> create user test identified by test ;

User created.

SQL> grant connect,resource to test;

Grant succeeded.

SQL> create user roger identified by roger;

User created.

SQL> grant connect,resource to roger;

Grant succeeded.

SQL> !

2) Import the metadata

[oracle@cszwbdb1 ~]$  imp \'/ as sysdba\' tablespaces=test_tab transport_tablespace=y file=/ogg/11204/xtts/test_xtts.dmp datafiles=/ogg/11204/xtts/test/TEST_TAB_5.xtf

Import: Release 11.2.0.3.0 - Production on Tue Feb 10 17:37:35 2015

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options

Export file created by EXPORT:V11.02.00 via conventional path
About to import transportable tablespace(s) metadata...
import done in ZHS16GBK character set and AL16UTF16 NCHAR character set
export client uses US7ASCII character set (possible charset conversion)
. importing SYS's objects into SYS
. importing SYS's objects into SYS
. importing TEST's objects into TEST
. . importing table                           "T1"
. importing ROGER's objects into ROGER
. . importing table                       "KILLDB"
. importing SYS's objects into SYS
Import terminated successfully without warnings.
[oracle@cszwbdb1 ~]$ exit
exit

17. Verify the data

SQL> select * from roger.killdb;

A
----------
 100

SQL>
SQL> select name,status,bytes from v$datafile where name like '/ogg%';

NAME                                                                   STATUS       BYTES
---------------------------------------------------------------------- ------- ----------
/ogg/11204/xtts/test/TEST_TAB_5.xtf                                    ONLINE  1073741824

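As we can see, the data written during the incremental phase can now be queried on the target.

Postscript: in a recent telecom operator project, I plan to use this method to migrate the customer's two 10 TB RAC databases (AIX to Linux).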

 


Notes on recovering a customer's 5 TB RAC database



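After a storage failure, a customer's core database was restarted and could not come back up. According to the customer, the first startup attempts failed with I/O errors on the control file: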

Sun Mar 15 11:59:37 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_arc1_630876.trc:
ORA-00204: error in reading (block 1, # blocks 1) of control file
ORA-00202: control file: '/xxx/xxxx/control01.ctl'
ORA-17500: ODM err:ODM ERROR V-41-4-2-43-6 No such device or address
Sun Mar 15 11:59:37 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_arc1_630876.trc:
ORA-00204: error in reading (block 1, # blocks 1) of control file
ORA-00202: control file: '/xxx/xxxx/control01.ctl'
ORA-17500: ODM err:ODM ERROR V-41-4-2-43-6 No such device or address
Sun Mar 15 11:59:37 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_arc1_630876.trc:
ORA-00204: error in reading (block 1, # blocks 1) of control file
ORA-00202: control file: '/xxx/xxxx/control01.ctl'
ORA-17500: ODM err:ODM ERROR V-41-4-2-43-6 No such device or address
Sun Mar 15 11:59:37 2015
Master background archival failure: 204
Sun Mar 15 11:59:49 2015
Termination issued to instance processes. Waiting for the processes to exit
Sun Mar 15 15:40:09 2015
Starting ORACLE instance (normal)

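All of the errors above relate to the control file, and replacing the damaged control file is enough to get past them. Once the control file had been replaced, opening the database failed with the following error: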

Sun Mar 15 16:10:48 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_626734.trc:
ORA-00600: internal error code, arguments: [kcrfr_update_nab_2], [0x70000038F8C94E0], [2], [], [], [], [], []
Abort recovery for domain 0
Sun Mar 15 16:10:49 2015
Aborting crash recovery due to error 600
Sun Mar 15 16:10:49 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_626734.trc:
ORA-00600: internal error code, arguments: [kcrfr_update_nab_2], [0x70000038F8C94E0], [2], [], [], [], [], []
ORA-600 signalled during: ALTER DATABASE OPEN...
Sun Mar 15 16:10:49 2015
Trace dumping is performing id=[cdmp_20150315161049]
Sun Mar 15 16:12:35 2015
Shutting down instance: further logons disabled
Sun Mar 15 16:12:35 2015

This error is essentially a redo problem: a redo log is damaged. A recover attempted from RMAN produced errors similar to the following:

Sun Mar 15 16:47:59 2015
Beginning crash recovery of 2 threads
 parallel recovery setup failed: using serial mode
Sun Mar 15 16:47:59 2015
Started redo scan
Sun Mar 15 16:47:59 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_299470.trc:
ORA-00313: open failed for members of log group 5 of thread 2
ORA-00312: online log 5 thread 2: '/xxx/xxxx/redo05a.log'
ORA-17503: ksfdopn:4 Failed to open file /xxx/xxxx/redo05a.log
ORA-17500: ODM err:File does not exist
Sun Mar 15 16:47:59 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_299470.trc:
ORA-00313: open failed for members of log group 4 of thread 1
ORA-00312: online log 4 thread 1: '/xxx/xxxx/redo04a.log'
ORA-17503: ksfdopn:4 Failed to open file /xxx/xxxx/redo04a.log
ORA-17500: ODM err:File does not exist
Sun Mar 15 17:03:03 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_299470.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 2009344 change 14160745159583 time 03/15/2015 11:56:29
ORA-00334: archived log: '/xxx/xxxx/redo04b.log'
Sun Mar 15 17:03:03 2015
Abort recovery for domain 0
Sun Mar 15 17:03:03 2015
Aborting crash recovery due to error 354
Sun Mar 15 17:03:03 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_299470.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 2009344 change 14160745159583 time 03/15/2015 11:56:29
ORA-00312: online log 4 thread 1: '/xxx/xxxx/redo04b.log'
ORA-354 signalled during: ALTER DATABASE OPEN...
Sun Mar 15 17:08:02 2015

The above is roughly what the customer had already tried. I stepped in at around 18:00. My first attempt was a recovery with RMAN, which failed:

RMAN> recover database;

Starting recover at 15-MAR-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=3268 instance=xxxx2 devtype=DISK

starting media recovery

media recovery failed
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 03/15/2015 18:39:26
ORA-00283: recovery session canceled due to errors
RMAN-11003: failure during parse/execution of SQL statement: alter database recover if needed
 start
ORA-00283: recovery session canceled due to errors
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 1788672 change 14160744248478 time 03/15/2015 11:54:46
ORA-00312: online log 4 thread 1: '/xxx/xxxx/redo04a.log'

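From this error we can tentatively conclude that redo04a.log is damaged, specifically around block 1788672. To verify that the block really is corrupt, I dumped the log with a command like the one below, and the dump itself failed with an error: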

alter system dump logfile 'xxx' scn min xxxx scn max xxxx;
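
One way to fill in the SCN range of such a dump is to take the change number from the ORA-00353 line and bracket it. The sketch below is only an illustration: the helper name and the default window of 1000 SCNs are hypothetical, not values used in this case.

```python
import re

ORA_00353 = re.compile(r"ORA-00353: log corruption near block (\d+) change (\d+)")


def dump_statement(error_text, logfile, window=1000):
    """Build a logfile dump statement around the change reported by ORA-00353.

    window is an arbitrary illustrative margin on each side of the change.
    """
    match = ORA_00353.search(error_text)
    if match is None:
        raise ValueError("no ORA-00353 line found")
    change = int(match.group(2))
    return (
        f"alter system dump logfile '{logfile}' "
        f"scn min {change - window} scn max {change + window};"
    )
```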

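So the block is definitely corrupt. The customer's priority was to get the database up as quickly as possible, and with damaged redo the only practical option is usually an incomplete recovery followed by a forced open. I used the following parameters: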

*._allow_resetlogs_corruption=TRUE
*._allow_error_simulation=TRUE

I had backed up the redo logs before running open resetlogs. The resetlogs open then failed with the following errors:

Sun Mar 15 19:43:36 2015
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Sun Mar 15 19:43:36 2015
SMON: enabling cache recovery
Sun Mar 15 19:43:37 2015
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Sun Mar 15 19:43:37 2015
ORA-01555 caused by SQL statement below (SQL ID: 5wc2915k44m38, Query Duration=0 sec, SCN: 0x0ce1.0e2d8971):
Sun Mar 15 19:43:37 2015
select user#,type# from user$ where name=:1
Sun Mar 15 19:43:37 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_667814.trc:
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 25 with name "_SYSSMU25$" too small
Error 704 happened during db open, shutting down database
USER: terminating instance due to error 704
Instance terminated by USER, pid = 667814
ORA-1092 signalled during: alter database open resetlogs...

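Judging from the log, the problem seemed to lie with rollback segment _SYSSMU25$, so I first tried to mask that segment with the following hidden parameters: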

_corrupted_rollback_segments=_SYSSMU25$
_offline_rollback_segments=_SYSSMU25$
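
The segment name does not have to be copied by hand; it can be read straight from the ORA-01555 line. A small sketch, where the helper names are hypothetical and the output format simply mirrors the two single-segment lines above:

```python
import re

SEGMENT = re.compile(r'rollback segment number \d+ with name "([^"]+)"')


def segment_names(alert_text):
    """Return the distinct rollback segment names named in ORA-01555 lines."""
    names = []
    for name in SEGMENT.findall(alert_text):
        if name not in names:
            names.append(name)
    return names


def mask_lines(name):
    """Format the two hidden parameters for one segment, as shown above."""
    return [
        f"_corrupted_rollback_segments={name}",
        f"_offline_rollback_segments={name}",
    ]
```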

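After masking the rollback segment I tried to open the database again and got the same error. A 10046 trace showed the recursive SQL failing on the following block: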

PARSING IN CURSOR #3 len=43 dep=1 uid=0 oct=3 lid=0 tim=37951056727245 hv=1682066536 ad='8cb74a90'
select user#,type# from user$ where name=:1
END OF STMT
PARSE #3:c=0,e=372,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=37951056727243
BINDS #3:
kkscoacd
 Bind#0
 oacdty=01 mxl=32(03) mxlc=00 mal=00 scl=00 pre=00
 oacflg=18 fl2=0001 frm=01 csi=852 siz=32 off=0
 kxsbbbfp=1126d4b70  bln=32  avl=03  flg=05
 value="XDB"
EXEC #3:c=0,e=465,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=37951056727780
WAIT #3: nam='db file sequential read' ela= 1582 file#=1 block#=282 blocks=1 obj#=44 tim=37951056729421
WAIT #3: nam='db file sequential read' ela= 6642 file#=1 block#=91 blocks=1 obj#=22 tim=37951056736126

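Dumping file 1 block 91 confirmed that the second ITL slot of that block holds an active transaction. My original plan was to commit the transaction directly with bbed, but after compiling bbed I found that the block is a cluster block.

Changing transaction state in a cluster block is somewhat more involved. I have described it in another post on my blog, so I will not go into it here. Since using bbed on a production database carries some risk, I decided not to use it.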

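Next I started the instance with the undo_management parameter adjusted and forced the database open again. This time the error changed to the following: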

Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_778430.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238525189], [3297], [238091117], [], [], []
Sun Mar 15 20:50:52 2015
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Sun Mar 15 20:50:53 2015
Redo thread 1 internally disabled at seq 1 (CKPT)
Sun Mar 15 20:50:53 2015
ARC1: Archiving disabled thread 1 sequence 1
Sun Mar 15 20:50:54 2015
Trace dumping is performing id=[cdmp_20150315205054]
Sun Mar 15 20:50:54 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_778430.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238525189], [3297], [238091117], [], [], []
Sun Mar 15 20:50:54 2015
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Instance terminated by USER, pid = 778430
ORA-1092 signalled during: alter database open  resetlogs...

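This error tells us that we are dealing with an SCN problem. To advance the SCN manually, the level has to be greater than 3297*4. Because 238091117/1024/1024/1024 is less than 1, a level of 3297*4+2 should be about enough. I ran another 10046 trace at this point and found the following: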

WAIT #5: nam='db file sequential read' ela= 1021 file#=1 block#=400 blocks=1 obj#=0 tim=37953716489772
EXEC #5:c=0,e=2969,p=1,cr=2,cu=3,mis=1,r=1,dep=1,og=4,tim=37953716490098
STAT #5 id=1 cnt=1 pid=0 pos=1 obj=0 op='UPDATE  UNDO$ (cr=2 pr=1 pw=0 time=1542 us)'
STAT #5 id=2 cnt=1 pid=1 pos=1 obj=34 op='INDEX UNIQUE SCAN I_UNDO1 (cr=2 pr=0 pw=0 time=11 us)'
WAIT #1: nam='row cache lock' ela= 71 cache id=3 mode=0 request=5 obj#=0 tim=37953716490369
WAIT #1: nam='db file sequential read' ela= 5783 file#=2 block#=25 blocks=1 obj#=0 tim=37953716496201
........
........
GLOBAL CACHE ELEMENT DUMP (address: 700000011f91498):
 id1: 0x19 id2: 0x20000 obj: US#2 block: (2/25)
 lock: SL rls: 0x0000 acq: 0x0000 latch: 3
 flags: 0x41 fair: 0 recovery: 0 fpin: 'ktuwh02: ktugus'
 bscn: 0x0.0 bctx: 0 write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
 lcp: 0 lnk: [NULL] lch: [70000023bf4bc20,70000023bf4bc20]
 seq: 3 hist: 143:0 208 352
 LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
 flg: 0x00080000 state: READING mode: EXCL
 pin: 'ktuwh02: ktugus'
 addr: 70000023bf4bb10 obj: INVALID cls: UNDO HEAD bscn: 0xce1.e379b05  ---the bscn here is the SCN value
 GCS SHADOW 700000011f91508,1 sq[70000035fb339f8,70000035fb339f8] resp[70000035fb339d0,0x19.20000] pkey 4294950914
 grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
 master 1 owner 1 sid 0 remote[0,0] hist 0x106
 history 0x6.0x4.0x0.0x0.0x0.0x0. cflag 0x0 sender 0 flags 0x0 replay# 0
 disk: 0x0000.00000000 write request: 0x0000.00000000
 pi scn: 0x0000.00000000
msgseq 0x0 updseq 0x0 reqids[1,0,0] infop 0x0
 GCS RESOURCE 70000035fb339d0 hashq [70000038cbc6658,70000038cbc6658] name[0x19.20000] pkey 4294950914
 grant 700000011f91508 cvt 0 send 0,0 write 0,0@65535
 flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL
 disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0
 xid 0x0000.000.00000000 sid 0 pkwait 59s
 pkey 4294950914
 hv 0 [stat 0x0, 1->1, wm 32767, RMno 0, remxxx 0, dom 0]
 kjga st 0x4, step 0.0.0, cxxx 2, rmno 0, flags 0x0
 lb 0, hb 0, myb 6147, drmb 6147, apifrz 0
 GCS SHADOW 700000011f91508,1 sq[70000035fb339f8,70000035fb339f8] resp[70000035fb339d0,0x19.20000] pkey 4294950914
 grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
 master 1 owner 1 sid 0 remote[0,0] hist 0x106
 history 0x6.0x4.0x0.0x0.0x0.0x0. cflag 0x0 sender 0 flags 0x0 replay# 0
 disk: 0x0000.00000000 write request: 0x0000.00000000
 pi scn: 0x0000.00000000
msgseq 0x0 updseq 0x0 reqids[1,0,0] infop 0x0
kjbmbassert [0x19.20000]
*** 2015-03-15 20:54:54.385
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238525189], [3297], [238091117], [], [], []
Current SQL statement for this session:
alter database open resetlogs

Converting the bscn 0xce1.e379b05 gives 3297 for 0xce1 and 238525189 for 0xe379b05, which matches the arguments of the error above. I also noticed that undo segment 2 is involved here:

id1: 0x19 id2: 0x20000 obj: US#2 block: (2/25)
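
Both readings can be checked mechanically. A minimal sketch with hypothetical helper names: the first function splits the wrap.base form of the bscn into decimals, the second pulls the undo segment number out of the US# field.

```python
import re


def split_bscn(bscn):
    """Convert a 'wrap.base' hex SCN such as '0xce1.e379b05' to decimals."""
    wrap_hex, base_hex = bscn.split(".")
    # int() with base 16 accepts the value with or without a 0x prefix.
    return int(wrap_hex, 16), int(base_hex, 16)


def undo_segment(element_line):
    """Return the undo segment number from an 'obj: US#n' field, or None."""
    match = re.search(r"obj: US#(\d+)", element_line)
    return int(match.group(1)) if match else None
```

For the values in the trace, `split_bscn("0xce1.e379b05")` gives (3297, 238525189), the first two arguments of the kclchkblk_4 error.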

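So I masked undo segment 2 with the hidden parameters as well and tried to open the database again, but the error persisted. It looked as though the SCN had to be adjusted after all: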

Sun Mar 15 21:23:20 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_774222.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238958669], [3297], [238091118], [], [], []
Sun Mar 15 21:23:20 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_774222.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238958669], [3297], [238091118], [], [], []
Sun Mar 15 21:23:21 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_774222.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238958669], [3297], [238091118], [], [], []
Sun Mar 15 21:23:21 2015
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Instance terminated by USER, pid = 774222

I first tried setting the event at session level:

alter session set events '10015 trace name adjust_scn level 13190';
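
The level 13190 follows from the arithmetic given earlier: four level steps per SCN wrap, plus the base expressed in units of 1024^3, plus a small safety margin. As a sketch (the function name and the `margin` parameter are hypothetical; the unit of 1024^3 per level is the one used in the text):

```python
GIGA = 1024 ** 3  # SCN range covered by one adjust_scn level step


def adjust_scn_level(wrap, base, margin=2):
    """Level needed to push the SCN past wrap.base.

    One wrap spans 2**32 SCNs, which is 4 steps of 1024**3, hence wrap * 4.
    margin is extra headroom; the text settles on 2.
    """
    return wrap * 4 + base // GIGA + margin
```

With the arguments of the ORA-00600 [kclchkblk_4] error, `adjust_scn_level(3297, 238091117)` returns 13190.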

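alter database open still failed, so I tried the *._minimum_giga_scn parameter, but at startup the instance reported that the parameter is not supported. From this I inferred that a fairly recent PSU was probably installed and that Oracle had retired the parameter, which would also mean that the 10015 event above had no effect at all. The only option left was to change the SCN by hand with oradebug in order to start the database: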

SQL> startup mount pfile='/tmp/gb.ora';
ORACLE instance started.

Total System Global Area 1.5032E+10 bytes
Fixed Size                  2110096 bytes
Variable Size            5704256880 bytes
Database Buffers         9311354880 bytes
Redo Buffers               14663680 bytes
Database mounted.
SQL> oradebug setmypid
Statement processed.
SQL> oradebug DUMPvar SGA kcsgscn_
kcslf kcsgscn_ [7000000100122A8, 7000000100122D8) = 00000000 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 07000000 ...
SQL>
SQL> oradebug poke 0x7000000100122A8 4 3300
BEFORE: [7000000100122A8, 7000000100122AC) = 00000000
AFTER:  [7000000100122A8, 7000000100122AC) = 00000CE4
SQL> oradebug DUMPvar SGA kcsgscn_
kcslf kcsgscn_ [7000000100122A8, 7000000100122D8) = 00000CE4 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 07000000 ...
SQL>
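
The value poked above is easy to double check before touching the instance. The sketch below only formats strings and runs nothing against a database; the helper name and the default margin of 3 wraps are hypothetical (chosen to reproduce 3300), and the address must always come from the dumpvar output of the instance at hand.

```python
def poke_wrap(address, required_wrap, margin=3):
    """Return the poke command and the hex word expected on the AFTER line.

    address: start of kcsgscn_ as printed by 'oradebug DUMPvar SGA kcsgscn_'.
    required_wrap: the wrap the SCN has to exceed (3297 in this case).
    """
    new_wrap = required_wrap + margin
    command = f"oradebug poke {address} 4 {new_wrap}"
    expected_after = f"{new_wrap:08X}"  # 4 bytes shown as 8 hex digits
    return command, expected_after
```

If the AFTER line does not show the expected word, the poke did not land where intended and the open should not be attempted.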

With the SCN changed the database opened, but it crashed almost immediately. The alert log shows:

Sun Mar 15 21:47:31 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
......
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:33 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], []
QMNC started with pid=32, OS id=520520
Sun Mar 15 21:47:35 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], []
Sun Mar 15 21:47:35 2015
ORACLE Instance xxxx2 (pid = 22) - Error 600 encountered while recovering transaction (44, 26) on object 47098.
Sun Mar 15 21:47:35 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], []
Sun Mar 15 21:47:36 2015
LOGSTDBY: Validating controlfile with logical metadata
Sun Mar 15 21:47:36 2015
LOGSTDBY: Validation complete
Sun Mar 15 21:47:36 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:36 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:36 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sun Mar 15 21:47:37 2015
ORACLE Instance xxxx2 (pid = 22) - Error 600 encountered while recovering transaction (48, 25).
Sun Mar 15 21:47:37 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sun Mar 15 21:47:39 2015
Completed: alter database open
Sun Mar 15 21:47:39 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:39 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:39 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sun Mar 15 21:47:40 2015
ORACLE Instance xxxx2 (pid = 22) - Error 600 encountered while recovering transaction (65, 7).
Sun Mar 15 21:47:40 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sun Mar 15 21:47:40 2015
Trace dumping is performing id=[cdmp_20150315214740]
Sun Mar 15 21:47:41 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
ORA-12012: error on auto execute of job 524
ORA-01552: cannot use system rollback segment for non-system tablespace 'xxx_ADMIN'
ORA-06512: at "SYS.xxx_LOGINHISTORY", line 3
ORA-06512: at line 1
Sun Mar 15 21:47:41 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
ORA-12012: error on auto execute of job 524
ORA-01552: cannot use system rollback segment for non-system tablespace 'xxx_ADMIN'
ORA-06512: at "SYS.xxx_LOGINHISTORY", line 3
ORA-06512: at line 1
Sun Mar 15 21:47:41 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j002_475534.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Sun Mar 15 21:47:41 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:41 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:41 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sun Mar 15 21:47:42 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j008_586068.trc:
ORA-12012: error on auto execute of job 526
ORA-01552: cannot use system rollback segment for non-system tablespace 'xxx_ADMIN'
ORA-06512: at "SYS.xxx_SEG_xxx", line 3
ORA-06512: at line 1
Sun Mar 15 21:47:42 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
ORA-12012: error on auto execute of job 524
ORA-01552: cannot use system rollback segment for non-system tablespace 'xxx_ADMIN'
ORA-06512: at "SYS.xxx_LOGINHISTORY", line 3
ORA-06512: at line 1
Sun Mar 15 21:47:42 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j002_475534.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Sun Mar 15 21:47:43 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
ORA-12012: error on auto execute of job 524
ORA-01552: cannot use system rollback segment for non-system tablespace 'XXXX_ADMIN'
ORA-06512: at "SYS.XXXX_LOGINHISTORY", line 3
ORA-06512: at line 1
Sun Mar 15 21:47:43 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-08102: index key not found, obj# 239, file 1, block 1674 (2)
Sun Mar 15 21:47:43 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j002_475534.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Sun Mar 15 21:47:43 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j002_475534.trc:
ORA-00600: internal error code, arguments: [], [], [], [], [], [], [], []
ORA-06512: at "xxxx.PKG_XXXXX", line 126
ORA-06512: at line 3
Sun Mar 15 21:47:43 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_pmon_565700.trc:
ORA-00474: SMON process terminated with error
Sun Mar 15 21:47:43 2015
PMON: terminating instance due to error 474
Sun Mar 15 21:47:47 2015
Dump system state for local instance only
System State dumped to trace file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_diag_377122.trc
Sun Mar 15 21:47:48 2015
Instance terminated by PMON, pid = 565700

Judging from the log above, the main errors are:

ORA-00600 [6006], ORA-00600 [4137], ORA-00600 [kdsgrp1]
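
A summary like this can be pulled from a long alert log mechanically. The following is a minimal sketch, not something from the original session: the function name `summarize_ora_errors` is made up for illustration, and it assumes the alert log text arrives on stdin. For ORA-00600 it keeps the first argument, since that is what distinguishes one internal error from another.

```shell
#!/bin/sh
# Tally ORA- codes found in alert-log text read from stdin.
# For ORA-00600 the first bracketed argument is kept as part of the key.
# (summarize_ora_errors is a hypothetical helper name.)
summarize_ora_errors() {
    grep -oE 'ORA-[0-9]{5}(: internal error code, arguments: \[[^]]*\])?' |
        sed -E 's/: internal error code, arguments: / /' |
        sort | uniq -c | sort -k1,1nr -k2
}

# Demo on three lines copied from the log above.
summarize_ora_errors <<'EOF'
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
ORA-00339: archived log does not contain any redo
ORA-00339: archived log does not contain any redo
EOF
```

Against a real file it would be run as `summarize_ora_errors < alert_xxxx2.log`, where the file name is whatever your bdump directory contains.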

For the first two errors, the cause is clearly that the SMON process failed while using the rollback segments to roll back transactions, as shown below:
ORACLE Instance xxxx2 (pid = 22) - Error 600 encountered while recovering transaction (44, 26) on object 47098.
ORACLE Instance xxxx2 (pid = 22) - Error 600 encountered while recovering transaction (48, 25).

So it is easy to see that some rollback segments in the database still hold active transactions.
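
To see which undo segments those failed transactions belong to, the SMON messages can be mapped to segment names. This is only a sketch under two assumptions that the post does not state: that the first number in "(44, 26)" is the undo segment number, and that the segments follow the 10g-style naming `_SYSSMU<n>$`. The helper name `failed_undo_segments` is hypothetical.

```shell
#!/bin/sh
# Read alert-log text on stdin and print one undo segment name per failed
# SMON transaction recovery.
# ASSUMPTIONS: first number in "(usn, slot)" is the undo segment number,
# and segment names look like _SYSSMU<n>$ (10g-style naming).
failed_undo_segments() {
    sed -nE 's/.*recovering transaction \(([0-9]+), *[0-9]+\).*/_SYSSMU\1$/p' |
        sort -u
}

# Demo on the two SMON messages quoted above.
failed_undo_segments <<'EOF'
ORACLE Instance xxxx2 (pid = 22) - Error 600 encountered while recovering transaction (44, 26) on object 47098.
ORACLE Instance xxxx2 (pid = 22) - Error 600 encountered while recovering transaction (48, 25).
EOF
```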

ORA-00600 [kdsgrp1] usually points at an index: the index data and the table data are inconsistent. In general, rebuilding the index resolves it.

Next, the later ORA-08102: index key not found, obj# 239, file 1, block 1674 (2) error is raised mainly by job calls, so
we can temporarily disable job scheduling.

I wrote about ORA-08102 on this blog a few years ago; in essence it also means that the relevant key does not exist in the index block.

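Rather than chase each error separately, I decided in the end to mask all of the database's rollback segments and recreate the undo tablespaces. The following command extracts the rollback segment names: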

strings system01.dbf | grep _SYSSMU | cut -d $ -f 1 | sort -u
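
The post does not show how the extracted names were then used. A common approach, and purely an assumption here, is to list them in the hidden parameter `_corrupted_rollback_segments` in the pfile. The sketch below takes the output of the pipeline above (names with the trailing "$" already cut off), discards anything that is not a clean `_SYSSMU<n>` name, and prints a single pfile line. The function name `build_rbs_param` is made up, and I have not checked how a line holding roughly 2,600 names behaves.

```shell
#!/bin/sh
# Reads candidate names on stdin, one per line, and prints one pfile line.
# ASSUMPTION: "masking" the rollback segments means listing them in the
# hidden parameter _corrupted_rollback_segments; the post does not name it.
# The "$" that cut removed is appended again to each name.
build_rbs_param() {
    grep -E '^_SYSSMU[0-9]+$' | sort -u | awk '
        { list = list (NR > 1 ? "," : "") $0 "$" }
        END { if (NR > 0) printf "_corrupted_rollback_segments=(%s)\n", list }
    '
}

# Demo with a few made-up names, including a duplicate and a junk line.
build_rbs_param <<'EOF'
_SYSSMU2
_SYSSMU1
junk_SYSSMU9x
_SYSSMU2
EOF
```

In practice it would sit at the end of the original pipeline, for example `strings system01.dbf | grep _SYSSMU | cut -d $ -f 1 | build_rbs_param`.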

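After tidying up the output, it turned out that this database had about 2,600 rollback segments. Good grief! Never mind that for now: after restarting the instance, recreate the undo tablespaces: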

SQL> conn /as sysdba
Connected to an idle instance.
SQL> startup upgrade pfile='/tmp/gb2.ora';
ORACLE instance started.

Total System Global Area 1.5032E+10 bytes
Fixed Size                  2110096 bytes
Variable Size            5704256880 bytes
Database Buffers         9311354880 bytes
Redo Buffers               14663680 bytes
Database mounted.
Database opened.
SQL> create undo tablespace undotbs11 datafile '/xxx/xxxx/undotbs11_01.dbf' size 100m;

Tablespace created.

SQL> create undo tablespace undotbs22 datafile '/xxx/xxxx/undotbs22_01.dbf' size 100m;

Tablespace created.

SQL> drop tablespace undotbs1 including contents and datafiles;

Tablespace dropped.

SQL> drop tablespace undotbs2 including contents and datafiles;

Tablespace dropped.

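Finally, restart the database instance and have the customer export the critical core configuration tables so that the business can be restored first; any other data that is needed can be extracted directly from this database.
One more point: this database is a little over 5 TB. A backup existed, but restoring it would have taken far too long. This is exactly where a Data Guard standby would have been invaluable!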

Hitting a Bug During Cross-Platform Migration with XTTS Incremental Backup

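At a customer site I was running an XTTS incremental test on roughly 10 TB of data. The scripts ran in the background, but only a little over 2 TB had completed, and the log contained many errors. This puzzled me: some files converted successfully while others failed with a message saying the database was in NOARCHIVELOG mode. The errors looked like this: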

[ora1124@cszwadb1 xtts_l_2]$ rman target / debug trace=/tmp/xtts_debug.log

Recovery Manager: Release 11.2.0.4.0 - Production on Wed Mar 18 10:40:07 2015

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

RMAN-06006: connected to target database: XTT (not mounted)

RMAN> convert from platform 'AIX-Based Systems (64-bit)' datafile  '/test/oracle/accta/oradata/vgacctdb02/lv_vg02_10g_011'
format '+DG_DATA01/accta/datafile/vgacctdb02/lv_vg02_10g_011.dbf' ;

RMAN-03090: Starting conversion at target at 18-MAR-15
RMAN-06009: using target database control file instead of recovery catalog
RMAN-08030: allocated channel: ORA_DISK_1
RMAN-08500: channel ORA_DISK_1: SID=5004 device type=DISK
RMAN-08589: channel ORA_DISK_1: starting datafile conversion
RMAN-08506: input file name=/test/oracle/accta/oradata/vgacctdb02/lv_vg02_10g_011
RMAN-06731: command backup:87.2% complete, time left 00:00:18
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of conversion at target command on ORA_DISK_1 channel at 03/18/2015 10:42:46
ORA-19602: cannot backup or copy active file in NOARCHIVELOG mode

With RMAN debug tracing enabled, I found that RMAN does indeed check whether the instance is in ARCHIVELOG mode, as shown below:

[ora1124@cszwadb1 xtts_l]$ tail -f /tmp/xtts_debug.log
DBGRPC:       ENTERED krmqgns
DBGRPC:        krmqgns: looking for work for channel default (krmqgns)
DBGRPC:        krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC:        CMD type=backup cmdid=1 status=STARTED
DBGRPC:              1 STEPstepid=1 cmdid=1 status=STARTED devtype=DISK
DBGRPC:        krmqgns: no work found for channel default (krmqgns)
DBGRPC:         (krmqgns)
......

DBGSQL:         TARGET> select decode(open_mode, 'MOUNTED', 0, 'READ WRITE', 1, 'READ ONLY', 1, 'READ ONLY WITH APPLY', 1, 0) into :isdbopen from v$database
DBGSQL:            sqlcode = 1507
DBGSQL:         error: ORA-01507: database not mounted (krmkosqlerr)
DBGSQL:          (krmkosqlerr)
DBGSQL:        EXITED krmkosqlerr
RMAN-06731: command backup:87.2% complete, time left 00:00:18
DBGSQL:       EXITED krmkrpr with status 0
DBGRPC:       ENTERED krmqgns
DBGRPC:        krmqgns: looking for work for channel default (krmqgns)
DBGRPC:        krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC:        CMD type=backup cmdid=1 status=STARTED
DBGRPC:              1 STEPstepid=1 cmdid=1 status=STARTED devtype=DISK
DBGRPC:        krmqgns: no work found for channel default (krmqgns)
DBGRPC:         (krmqgns)
DBGRPC:       EXITED krmqgns with status 1
DBGRPC:       krmxpoq - returning rpc_number: 17 with status: STARTED16 for channel ORA_DISK_1
DBGRPC:       ENTERED krmqgns
DBGRPC:        krmqgns: looking for work for channel default (krmqgns)
DBGRPC:        krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC:        CMD type=backup cmdid=1 status=STARTED
DBGRPC:              1 STEPstepid=1 cmdid=1 status=STARTED devtype=DISK
DBGRPC:        krmqgns: no work found for channel default (krmqgns)
DBGRPC:         (krmqgns)
DBGRPC:       EXITED krmqgns with status 1
DBGRPC:       krmxpoq - returning rpc_number: 17 with status: STARTED16 for channel ORA_DISK_1
DBGRPC:       krmxr - sleeping for 10 seconds
DBGRPC:       ENTERED krmqgns
DBGRPC:        krmqgns: looking for work for channel default (krmqgns)
DBGRPC:        krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC:        CMD type=backup cmdid=1 status=STARTED
DBGRPC:              1 STEPstepid=1 cmdid=1 status=STARTED devtype=DISK
DBGRPC:        krmqgns: no work found for channel default (krmqgns)
DBGRPC:         (krmqgns)
DBGRPC:       EXITED krmqgns with status 1
DBGRPC:       krmxpoq - returning rpc_number: 17 with status: STARTED16 for channel ORA_DISK_1
DBGRPC:       krmxr - sleeping for 10 seconds
DBGRPC:       ENTERED krmqgns
DBGRPC:        krmqgns: looking for work for channel default (krmqgns)
DBGRPC:        krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC:        CMD type=backup cmdid=1 status=STARTED
DBGRPC:              1 STEPstepid=1 cmdid=1 status=STARTED devtype=DISK
DBGRPC:        krmqgns: no work found for channel default (krmqgns)
DBGRPC:         (krmqgns)
DBGRPC:       EXITED krmqgns with status 1
DBGRPC:       krmxpoq - returning rpc_number: 17 with status: FINISHED16 for channel ORA_DISK_1
DBGRPC:       krmxr - channel ORA_DISK_1 calling peicnt
DBGRPC:       krmxrpc - channel ORA_DISK_1 kpurpc2 err=19583 db=target proc=SYS.DBMS_BACKUP_RESTORE.BACKUPPIECECREATE excl: 167
DBGRPC:       krmxrpc - caloing krmxtrim: with message of length 167: @@@ORA-19583: conversation terminated due to error
DBGRPC:       ORA-19602: cannot backup or copy active file in NOARCHIVELOG mode
DBGRPC:       ORA-06512: at "SYS.X$DBMS_BACKUP_RESTORE", line 1384
DBGRPC:       @@@
DBGMISC:      ENTERED krmzejob [10:42:46.639]
DBGMISC:       Input Args(failed=1),(errnum=-19583) [10:42:46.639] (krmzejob)
DBGMISC:       duration(stepid=1),endtime=874636966,jobtime=145s [10:42:46.639] (krmzejob)
DBGMISC:       duration(stepid=1), remaining(chn sec,bytes)=(0,10736369664) [10:42:46.639] (krmzejob)
DBGMISC:      EXITED krmzejob with status 0 (FALSE) [10:42:46.639] elapsed time [00:00:00:00.000]
DBGRPC:       krmxrpc - channel ORA_DISK_1 kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.GETLIMIT excl: 0
DBGRPC:       krmxr - channel ORA_DISK_1 returned from peicnt
DBGMISC:      ENTERED krmstrim [10:42:46.640]
DBGMISC:       Trimming message: ORA-19583: conversation terminated due to error [10:42:46.640] (krmstrim)
DBGMISC:       ORA-19602: cannot backup or copy active file in NOARCHIVELOG mode (krmstrim)
DBGMISC:       ORA-06512: at "SYS.X$DBMS_BACKUP_RESTORE", line 1384 (krmstrim)
DBGMISC:       ORA-06512: at line 554 (krmstrim)
DBGMISC:        (190) (krmstrim)
DBGMISC:      EXITED krmstrim with status 23 [10:42:46.640] elapsed time [00:00:00:00.000]
DBGRPC:       krmxr - channel ORA_DISK_1 got execution errors (step_60)
DBGRPC:       krmxr - exiting with 1
DBGMISC:      krmqexe: unhandled exception on channel ORA_DISK_1 [10:42:46.640]
DBGMISC:     EXITED krmiexe with status 1 [10:42:46.640] elapsed time [00:00:02:25.142]
DBGMISC:     ENTERED krmkmrsr [10:42:46.640]
DBGMISC:     EXITED krmkmrsr [10:42:46.640] elapsed time [00:00:00:00.000]
DBGMISC:     ENTERED krmkjcl [10:42:46.640]
DBGSQL:       ENTERED krmkosqlerr

DBGSQL:        TARGET> select decode(open_mode, 'MOUNTED', 0, 'READ WRITE', 1, 'READ ONLY', 1, 'READ ONLY WITH APPLY', 1, 0) into :isdbopen from v$database
DBGSQL:           sqlcode = 1507
DBGSQL:        error: ORA-01507: database not mounted (krmkosqlerr)
DBGSQL:         (krmkosqlerr)
DBGSQL:       EXITED krmkosqlerr

DBGSQL:       TARGET> select decode(status, 'OPEN', 1, 0), decode(archiver, 'FAILED', 1, 0), decode(database_status, 'SUSPENDED', 1, 0) into :status, :archstuck, :dbsuspended from v$instance
DBGSQL:          sqlcode = 0
DBGSQL:           D :status = 0
DBGSQL:           D :archstuck = 0
DBGSQL:           D :dbsuspended = 0

DBGSQL:       TARGET> select value into :vcomp_txt from  v$parameter where name = 'compatible'
DBGSQL:          sqlcode = 0
DBGSQL:           D :vcomp_txt = 11.2.0.4.0

DBGSQL:       TARGET> declare dot1st number; dot2nd number; dot3rd number; comptxt varchar2(255) := :vcomp_txt; begin comptxt := comptxt || '.0.0'; dot1st := instr(comptxt, '.', 1, 1); dot2nd := instr(comptxt, '.', 1, 2); dot3rd := instr(comptxt, '.', 1, 3); comptxt :=  lpad(substr(comptxt, 1, dot1st - 1), 2, '0') || lpad(substr(comptxt, dot1st + 1, dot2nd - dot1st - 1), 2, '0')  || lpad(substr(comptxt, dot2nd + 1, dot3rd - dot2nd - 1), 2, '0');:vcomp_ub4 := to_number(comptxt); end;
DBGSQL:          sqlcode = 0
DBGSQL:           B :vcomp_ub4 = 110200
DBGSQL:           B :vcomp_txt = 11.2.0.4.0
DBGMISC:      krmkpdbs(): vcomp_txt:11.2.0.4.0 vcomp_ub4:110200 flags:0 [10:42:46.646]
DBGSQL:       ENTERED krmkusl
DBGSQL:       EXITED krmkusl with status 0
DBGSQL:       ENTERED krmkusl
DBGSQL:       EXITED krmkusl with status 0
DBGMISC:     EXITED krmkjcl [10:42:46.647] elapsed time [00:00:00:00.006]
DBGMISC:     error recovery releasing channel resources [10:42:46.647]
DBGRPC:      krmxcr - channel ORA_DISK_1 resetted
DBGRPC:      krmxcr - channel default resetted
DBGMISC:     ENTERED krmice [10:42:46.647]
DBGMISC:      command to be compiled and executed is: cleanup  [10:42:46.647] (krmice)
DBGMISC:      command after this command is: NONE  [10:42:46.647] (krmice)
DBGMISC:      current incarnation does not matter for cleanup [10:42:46.647] (krmice)
DBGMISC:      ENTERED krmicomp [10:42:46.647]
DBGMISC:       ENTERED krmkomp [10:42:46.647]
DBGRCV:         ENTERED krmkucls
DBGRCV:         EXITED krmkucls with address 0
DBGMISC:        krmkcomp - Name translation defaults to catalog - 2 [10:42:46.647] (krmkomp)
DBGMISC:        ENTERED krmknmtr [10:42:46.647]
DBGMISC:        EXITED krmknmtr with status cleanup [10:42:46.647] elapsed time [00:00:00:00.000]
DBGMISC:        ENTERED krmkdps [10:42:46.648]
DBGMISC:        EXITED krmkdps [10:42:46.648] elapsed time [00:00:00:00.000]
DBGMISC:       EXITED krmkomp [10:42:46.648] elapsed time [00:00:00:00.000]
DBGPLSQL:      the compiled command tree is: [10:42:46.648] (krmicomp)
DBGPLSQL:        1 CMD type=cleanup cmdid=1 status=NOT STARTED
DBGPLSQL:            1 STEPstepid=1 cmdid=1 status=NOT STARTED
DBGPLSQL:                1 TEXTNOD = -- clean
DBGPLSQL:                2 TEXTNOD = declare
DBGPLSQL:                3 TEXTNOD =   /* device status variables */
DBGPLSQL:                4 TEXTNOD =   state       binary_integer;
DBGPLSQL:                5 TEXTNOD =   devtype     varchar2(512);
DBGPLSQL:                6 TEXTNOD =   name        varchar2(512);
DBGPLSQL:                7 TEXTNOD =   bufsz       binary_integer;
DBGPLSQL:                8 TEXTNOD =   bufcnt      binary_integer;
DBGPLSQL:                9 TEXTNOD =   kbytes      number;
DBGPLSQL:               10 TEXTNOD =   readrate    binary_integer;
DBGPLSQL:               11 TEXTNOD =   parallel    binary_integer;
DBGPLSQL:               12 TEXTNOD =   thread      number;
DBGPLSQL:               13 TEXTNOD =   kcrmx_recs  number;
DBGPLSQL:               14 TEXTNOD =   autochn     number := 0;
DBGPLSQL:               15 TEXTNOD =   mr_not_started exception;
DBGPLSQL:               16 TEXTNOD =   pragma exception_init(mr_not_started, -1112);
DBGPLSQL:               17 TEXTNOD =   db_not_mounted exception;
DBGPLSQL:               18 TEXTNOD =   pragma exception_init(db_not_mounted, -1507);
DBGPLSQL:               19 TEXTNOD = begin
DBGPLSQL:               20 TEXTNOD =
DBGPLSQL:               21 PRMVAL =  autochn := 1;
DBGPLSQL:               22 TEXTNOD =   begin
DBGPLSQL:               23 TEXTNOD =     krmicd.execSql('select count(*) from x$dual');
DBGPLSQL:               24 TEXTNOD =   exception
DBGPLSQL:               25 TEXTNOD =     when others then
DBGPLSQL:               26 TEXTNOD =       krmicd.clearErrors;
DBGPLSQL:               27 TEXTNOD =   end;
DBGPLSQL:               28 TEXTNOD =   sys.dbms_backup_restore.backupCancel;
DBGPLSQL:               29 TEXTNOD =   sys.dbms_backup_restore.restoreCancel(FALSE);
DBGPLSQL:               30 TEXTNOD =   begin
DBGPLSQL:               31 TEXTNOD =     sys.dbms_backup_restore.proxyCancel;
DBGPLSQL:               32 TEXTNOD =   exception
DBGPLSQL:               33 TEXTNOD =      when others then
DBGPLSQL:               34 TEXTNOD =         krmicd.clearErrors;
DBGPLSQL:               35 TEXTNOD =   end;
DBGPLSQL:               36 TEXTNOD =   sys.dbms_backup_restore.cfileUseCurrent;              -- release enqueue
DBGPLSQL:               37 TEXTNOD =   sys.dbms_backup_restore.deviceStatus(state, devtype, name, bufsz, bufcnt,
DBGPLSQL:               38 TEXTNOD =                                          kbytes, readrate, parallel);
DBGPLSQL:               39 TEXTNOD =   begin
DBGPLSQL:               40 TEXTNOD =      sys.dbms_backup_restore.bmrCancel;
DBGPLSQL:               41 TEXTNOD =   exception
DBGPLSQL:               42 TEXTNOD =      when others then
DBGPLSQL:               43 TEXTNOD =         krmicd.clearErrors;
DBGPLSQL:               44 TEXTNOD =   end;
DBGPLSQL:               45 TEXTNOD =   begin
DBGPLSQL:               46 TEXTNOD =      sys.dbms_backup_restore.flashbackCancel;
DBGPLSQL:               47 TEXTNOD =   exception
DBGPLSQL:               48 TEXTNOD =      when others then
DBGPLSQL:               49 TEXTNOD =         krmicd.clearErrors;
DBGPLSQL:               50 TEXTNOD =   end;
DBGPLSQL:               51 TEXTNOD =   begin
DBGPLSQL:               52 TEXTNOD =     if krmicd.mrCheck > 0 then
DBGPLSQL:               53 TEXTNOD =       krmicd.execSql('alter database recover cancel');
DBGPLSQL:               54 TEXTNOD =     end if;
DBGPLSQL:               55 TEXTNOD =   exception
DBGPLSQL:               56 TEXTNOD =     when others then
DBGPLSQL:               57 TEXTNOD =       krmicd.clearErrors;
DBGPLSQL:               58 TEXTNOD =   end;
DBGPLSQL:               59 TEXTNOD =   -- If autchn is set to 0, then it the channel is user allocated, hence can be
DBGPLSQL:               60 TEXTNOD =   -- deallocated. However, we will call dbms_backup_restore.deviceDeallocate
DBGPLSQL:               61 TEXTNOD =   -- only if server says that the device is actually allocated. On the
DBGPLSQL:               62 TEXTNOD =   -- other hand, we will call krmicd.clearChannelInfo even if server
DBGPLSQL:               63 TEXTNOD =   -- thinks that device is not allocated because it can be that
DBGPLSQL:               64 TEXTNOD =   -- deviceAllocate have failed.
DBGPLSQL:               65 TEXTNOD =   if (autochn = 0) then
DBGPLSQL:               66 TEXTNOD =     if (state > sys.dbms_backup_restore.NO_DEVICE) then
DBGPLSQL:               67 TEXTNOD =        sys.dbms_backup_restore.deviceDeallocate;
DBGPLSQL:               68 TEXTNOD =        krmicd.writeMsg(8031, krmicd.getChid);
DBGPLSQL:               69 TEXTNOD =        -- Clear the client_info field on channels which have no device
DBGPLSQL:               70 TEXTNOD =        -- allocated. This has the effect of leaving the client_info field
DBGPLSQL:               71 TEXTNOD =        -- present on the default channel.
DBGPLSQL:               72 TEXTNOD =        sys.dbms_backup_restore.set_client_info('');
DBGPLSQL:               73 TEXTNOD =     end if;
DBGPLSQL:               74 TEXTNOD =     krmicd.clearChannelInfo;                    -- tell krmq no device here now
DBGPLSQL:               75 TEXTNOD =   end if;
DBGPLSQL:               76 TEXTNOD =   sys.dbms_backup_restore.setRmanStatusRowId(rsid=>0, rsts=>0);
DBGPLSQL:               77 TEXTNOD = end;
DBGMISC:      EXITED krmicomp with address 36396808 [10:42:46.651] elapsed time [00:00:00:00.004]
DBGMISC:      ENTERED krmiexe [10:42:46.651]
DBGMISC:       Executing command cleanup [10:42:46.651] (krmiexe)
DBGRPC:        krmxr - entering
DBGRPC:        krmxpoq - returning rpc_number: 5 with status: FINISHED130 for channel default
DBGRPC:        krmxr - channel default has rpc_count: 5
DBGRPC:        krmxpoq - returning rpc_number: 18 with status: FINISHED7 for channel ORA_DISK_1
DBGRPC:        krmxr - channel ORA_DISK_1 has rpc_count: 18
DBGRPC:        ENTERED krmqgns
DBGRPC:         krmqgns: looking for work for channel default (krmqgns)
DBGRPC:         krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC:         CMD type=cleanup cmdid=1 status=NOT STARTED
DBGRPC:               1 STEPstepid=1 cmdid=1 status=NOT STARTED
DBGRPC:         krmqgns: channel default running cleanup (krmqgns)
DBGRPC:        EXITED krmqgns with status 0
DBGRPC:        ENTERED krmqgns
DBGRPC:         krmqgns: looking for work for channel ORA_DISK_1 (krmqgns)
DBGRPC:         krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC:         CMD type=cleanup cmdid=1 status=NOT STARTED
DBGRPC:               1 STEPstepid=1 cmdid=1 status=NOT STARTED
DBGRPC:         krmqgns: channel ORA_DISK_1 running cleanup (krmqgns)
DBGRPC:        EXITED krmqgns with status 0
DBGRPC:        krmxcis - channel default, calling pcicmp
DBGRPC:        krmxcis - channel ORA_DISK_1, calling pcicmp
DBGRPC:        krmxr - channel default calling peicnt
DBGRPC:        krmxrpc - channel default kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.BACKUPCANCEL excl: 0
DBGRPC:        krmxrpc - channel default kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.RESTORECANCEL excl: 0
DBGRPC:        krmxrpc - channel default kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.PROXYCANCEL excl: 0
DBGRPC:        krmxrpc - channel default kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.CFILEUSECURRENT excl: 0
DBGRPC:        krmxrpc - channel default kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.DEVICESTATUS excl: 0
DBGRPC:        krmxrpc - channel default kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.BMRCANCEL excl: 0
DBGRPC:        krmxrpc - channel default kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.FLASHBACKCANCEL excl: 0
DBGRPC:        krmxrpc - channel default kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.SETRMANSTATUSROWID excl: 0
DBGRPC:        krmxr - channel default returned from peicnt
DBGRPC:        krmxr - channel default finished step
DBGRPC:        ENTERED krmqgns
DBGRPC:         krmqgns: looking for work for channel default (krmqgns)
DBGRPC:         krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC:         CMD type=cleanup cmdid=1 status=NOT STARTED
DBGRPC:               1 STEPstepid=1 cmdid=1 status=NOT STARTED
DBGRPC:         krmqgns: no work found for channel default (krmqgns)
DBGRPC:          (krmqgns)
DBGRPC:        EXITED krmqgns with status 1
DBGRPC:        krmxr - channel ORA_DISK_1 calling peicnt
DBGRPC:        krmxrpc - channel ORA_DISK_1 kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.BACKUPCANCEL excl: 0
DBGRPC:        krmxrpc - channel ORA_DISK_1 kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.RESTORECANCEL excl: 0
DBGRPC:        krmxrpc - channel ORA_DISK_1 kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.PROXYCANCEL excl: 0
DBGRPC:        krmxrpc - channel ORA_DISK_1 kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.CFILEUSECURRENT excl: 0
DBGRPC:        krmxrpc - channel ORA_DISK_1 kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.DEVICESTATUS excl: 0
DBGRPC:        krmxrpc - channel ORA_DISK_1 kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.BMRCANCEL excl: 0
DBGRPC:        krmxrpc - channel ORA_DISK_1 kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.FLASHBACKCANCEL excl: 0
DBGRPC:        krmxrpc - channel ORA_DISK_1 kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.SETRMANSTATUSROWID excl: 0
DBGRPC:        krmxr - channel ORA_DISK_1 returned from peicnt
DBGRPC:        krmxr - channel ORA_DISK_1 finished step
DBGRPC:        ENTERED krmqgns
DBGRPC:         krmqgns: looking for work for channel default (krmqgns)
DBGRPC:         krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC:         CMD type=cleanup cmdid=1 status=NOT STARTED
DBGRPC:               1 STEPstepid=1 cmdid=1 status=NOT STARTED
DBGRPC:         krmqgns: no work found for channel default (krmqgns)
DBGRPC:          (krmqgns)
DBGRPC:        EXITED krmqgns with status 1
DBGRPC:        ENTERED krmqgns
DBGRPC:         krmqgns: looking for work for channel ORA_DISK_1 (krmqgns)
DBGRPC:         krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC:         CMD type=cleanup cmdid=1 status=NOT STARTED
DBGRPC:               1 STEPstepid=1 cmdid=1 status=NOT STARTED
DBGRPC:         krmqgns: no work found for channel ORA_DISK_1 (krmqgns)
DBGRPC:          (krmqgns)
DBGRPC:        EXITED krmqgns with status 1
DBGRPC:        ENTERED krmqgns
DBGRPC:         krmqgns: looking for work for channel default (krmqgns)
DBGRPC:         krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC:         CMD type=cleanup cmdid=1 status=NOT STARTED
DBGRPC:               1 STEPstepid=1 cmdid=1 status=NOT STARTED
DBGRPC:         krmqgns: no work found for channel default (krmqgns)
DBGRPC:          (krmqgns)
DBGRPC:        EXITED krmqgns with status 1
DBGRPC:        ENTERED krmqgns
DBGRPC:         krmqgns: looking for work for channel ORA_DISK_1 (krmqgns)
DBGRPC:         krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC:         CMD type=cleanup cmdid=1 status=NOT STARTED
DBGRPC:               1 STEPstepid=1 cmdid=1 status=NOT STARTED
DBGRPC:         krmqgns: no work found for channel ORA_DISK_1 (krmqgns)
DBGRPC:          (krmqgns)
DBGRPC:        EXITED krmqgns with status 1
DBGRPC:        krmxr - all done
DBGRPC:        krmxr - exiting with 0
DBGMISC:      EXITED krmiexe with status 0 [10:42:46.673] elapsed time [00:00:00:00.021]
DBGMISC:      Finished cleanup at 18-MAR-15 [10:42:46.673]
DBGMISC:      ENTERED krmkjcl [10:42:46.673]
DBGMISC:      EXITED krmkjcl [10:42:46.673] elapsed time [00:00:00:00.000]
DBGMISC:     EXITED krmice [10:42:46.673] elapsed time [00:00:00:00.026]
Calling krmmpem from krmmexe
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of conversion at target command on ORA_DISK_1 channel at 03/18/2015 10:42:46
RMAN-10032: unhandled exception during execution of job step 1:
ORA-06512: at line 554
RMAN-10035: exception raised in RPC:
ORA-19583: conversation terminated due to error
ORA-19602: cannot backup or copy active file in NOARCHIVELOG mode
ORA-06512: at "SYS.X$DBMS_BACKUP_RESTORE", line 1384
RMAN-10031: RPC Error: ORA-19583  occurred during call to DBMS_BACKUP_RESTORE.BACKUPPIECECREATE
DBGMISC:     ENTERED krmkursr [10:42:46.674]
DBGMISC:     EXITED krmkursr [10:42:46.674] elapsed time [00:00:00:00.000]
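
In a trace this long, the one RPC that actually failed is easy to miss among the successful ones. A minimal sketch for picking it out, assuming the `kpurpc2 err=... proc=...` line layout seen in the trace above; the function name `failed_rpcs` is made up for illustration.

```shell
#!/bin/sh
# Read an RMAN debug trace on stdin and print "channel err proc" for every
# RPC line whose err= value is not 0.
# (failed_rpcs is a hypothetical helper; the line layout is taken from the
# trace above and may differ in other RMAN versions.)
failed_rpcs() {
    awk '/kpurpc2 err=/ {
        ch = ""; err = ""; proc = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "channel")    ch = $(i + 1)
            else if ($i ~ /^err=/)  err = substr($i, 5)
            else if ($i ~ /^proc=/) proc = substr($i, 6)
        }
        if (err != "0") print ch, err, proc
    }'
}

# Demo on two lines copied from the trace: one failed call, one clean call.
failed_rpcs <<'EOF'
DBGRPC:       krmxrpc - channel ORA_DISK_1 kpurpc2 err=19583 db=target proc=SYS.DBMS_BACKUP_RESTORE.BACKUPPIECECREATE excl: 167
DBGRPC:       krmxrpc - channel ORA_DISK_1 kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.GETLIMIT excl: 0
EOF
```

For the trace in this post it would be run as `failed_rpcs < /tmp/xtts_debug.log`.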

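A search suggests that this is caused by Oracle bug 17565514. The workaround is to create the auxiliary instance xtt as a real test database, switch it to ARCHIVELOG mode, and run the conversion with it in MOUNT state. So I simply created a test database named xtt, as follows: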

SQL> show parameter instance

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
active_instance_count                integer
cluster_database_instances           integer     1
instance_groups                      string
instance_name                        string      xtt
instance_number                      integer     0
instance_type                        string      RDBMS
open_links_per_instance              integer     4
parallel_instance_group              string
parallel_server_instances            integer     1
SQL> select open_mode from v$database;

OPEN_MODE
--------------------
READ WRITE

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount
ORACLE instance started.

Total System Global Area 1068937216 bytes
Fixed Size                  2260088 bytes
Variable Size             335545224 bytes
Database Buffers          708837376 bytes
Redo Buffers               22294528 bytes
Database mounted.
SQL> alter database archivelog;

Database altered.

SQL> alter database open;

Database altered.

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount
ORACLE instance started.

Total System Global Area 1068937216 bytes
Fixed Size                  2260088 bytes
Variable Size             335545224 bytes
Database Buffers          708837376 bytes
Redo Buffers               22294528 bytes
Database mounted.
SQL>

With the database created and xtt started in MOUNT state, I ran the test again and everything worked.

[ora1124@cszwadb1 xtts_l_2]$ rman target /

Recovery Manager: Release 11.2.0.4.0 - Production on Wed Mar 18 11:07:20 2015

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: XTT (DBID=4264586957, not open)

RMAN> convert from platform 'AIX-Based Systems (64-bit)' datafile  '/test/oracle/accta/oradata/vgacctdb02/lv_vg02_10g_011'
format '+DG_DATA01/accta/datafile/vgacctdb02/%N_%f.dbf';

Starting conversion at target at 18-MAR-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=211 device type=DISK
channel ORA_DISK_1: starting datafile conversion
input file name=/test/oracle/accta/oradata/vgacctdb02/lv_vg02_10g_011
converted datafile=+DG_DATA01/accta/datafile/vgacctdb02/invoice_idx_01_216.dbf
channel ORA_DISK_1: datafile conversion complete, elapsed time: 00:02:25
Finished conversion at target at 18-MAR-15
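
With many datafiles to convert, the same command can be generated from a list of paths instead of being typed by hand. This is a sketch only, not the XTTS script the customer was running: `gen_convert_cmds` is a made-up name, and the platform string and destination directory are simply the values used in the session above.

```shell
#!/bin/sh
# Print one RMAN CONVERT command per datafile path read from stdin.
#   $1 = source platform name
#   $2 = destination directory (here an ASM path)
# (gen_convert_cmds is a hypothetical helper, not part of XTTS.)
gen_convert_cmds() {
    platform=$1
    dest=$2
    while IFS= read -r f; do
        [ -n "$f" ] || continue          # skip blank lines
        printf "convert from platform '%s' datafile '%s'\n" "$platform" "$f"
        printf "format '%s/%%N_%%f.dbf';\n" "$dest"
    done
}

# Demo with the datafile from the session above.
gen_convert_cmds 'AIX-Based Systems (64-bit)' '+DG_DATA01/accta/datafile/vgacctdb02' <<'EOF'
/test/oracle/accta/oradata/vgacctdb02/lv_vg02_10g_011
EOF
```

The output could be redirected to a file and passed to RMAN with its `cmdfile` argument. The `%N_%f` format keeps the generated names unique per tablespace and file number, as in the manual test.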

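Not only did my manual test succeed, all of the background scripts now run fine as well. About 3 TB has been converted so far; let it keep running!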

ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576   9437166  7045877                0         7045877              0             N  DG_DATA01/
ASMCMD>
ASMCMD>
ASMCMD>
ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576   9437166  6277828                0         6277828              0             N  DG_DATA01/
ASMCMD>
ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576   9437166  6093498                0         6093498              0             N  DG_DATA01/
ASMCMD>
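
The "about 3 TB" figure can be read straight off the last `lsdg` snapshot: Total_MB minus Free_MB is 9437166 - 6093498 = 3343668 MB, which is roughly 3.2 TB. A small sketch that does the subtraction from captured `lsdg` output, assuming the column headers shown above and that nothing else occupies the disk group; the function name `dg_used_mb` is made up.

```shell
#!/bin/sh
# Print Total_MB - Free_MB for one disk group, taken from the last matching
# row of lsdg output read from stdin.
#   $1 = disk group name without the trailing slash
# Columns are located by header name rather than by fixed position.
# (dg_used_mb is a hypothetical helper.)
dg_used_mb() {
    awk -v dg="$1/" '
        $1 == "State" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        ("Name" in col) && $(col["Name"]) == dg {
            used = $(col["Total_MB"]) - $(col["Free_MB"])
            found = 1
        }
        END { if (found) print used }
    '
}

# Demo on the last snapshot above.
dg_used_mb DG_DATA01 <<'EOF'
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576   9437166  6093498                0         6093498              0             N  DG_DATA01/
EOF
```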

Sharing this so that nobody else falls into the same pit!

2015 Q2 PSU Update (OJVM PSU Update)

Starting with the January 2015 patch release, the OJVM PSU includes the JDBC Patch (the October 2014 release did not).

The 2015 Q2 PSU update mainly covers five versions: 12.1.0.2, 12.1.0.1, 11.2.0.4, 11.2.0.3, and 11.1.0.7.

As we can also see, Oracle 10gR2 no longer receives patches, so upgrade as soon as you can!

++++10.2.0.4
Oracle Database PSU   Unix             Comments                 Includes CPU

10.2.0.4.1           8576156          Base 10.2.0.4.0          includes CPU Jul 2009
10.2.0.4.2           8833280          Base 10.2.0.4.0          includes CPU Oct 2009
10.2.0.4.3           9119284          Base 10.2.0.4.0          includes CPU Jan 2010
10.2.0.4.4           9352164          Base 10.2.0.4.0          includes CPU Apr 2010
10.2.0.4.5           9654991          Requires PSU 10.2.0.4.4  includes CPU Jul 2010
10.2.0.4.6           9952234          Requires PSU 10.2.0.4.4  includes CPU Oct 2010
10.2.0.4.7           10248636         Requires PSU 10.2.0.4.4  includes CPU Jan 2011
10.2.0.4.8           11724977         Requires PSU 10.2.0.4.4  includes CPU Apr 2011
10.2.0.4.9           12419397         Requires PSU 10.2.0.4.4  includes CPU Jul 2011
10.2.0.4.10          12827778         Requires PSU 10.2.0.4.4  includes CPU Oct 2011
10.2.0.4.11          12879929         Requires PSU 10.2.0.4.4  includes CPU Jan 2012
10.2.0.4.12          12879933         Requires PSU 10.2.0.4.4  includes CPU Apr 2012
10.2.0.4.13          13923851         Requires PSU 10.2.0.4.4  includes CPU Jul 2012
10.2.0.4.14          14275630         Requires PSU 10.2.0.4.4  includes CPU Oct 2012
10.2.0.4.15          14736542         Requires PSU 10.2.0.4.4  includes CPU Jan 2013
10.2.0.4.16          16056269         Requires PSU 10.2.0.4.4  includes CPU Apr 2013
10.2.0.4.17          16619897         Requires PSU 10.2.0.4.4  includes CPU Jul 2013

+++++10.2.0.5
Oracle Database PSU   Unix             Comments                Includes CPU

10.2.0.5.1           9952230          Base 10.2.0.5.0         includes CPU Oct 2010
10.2.0.5.2           10248542         Base 10.2.0.5.0         includes CPU Jan 2011
10.2.0.5.3           11724962         Base 10.2.0.5.0         includes CPU Apr 2011
10.2.0.5.4           12419392         Base 10.2.0.5.0         includes CPU Jul 2011
10.2.0.5.5           12827745         Base 10.2.0.5.0         includes CPU Oct 2011
10.2.0.5.6           13343471         Base 10.2.0.5.0         includes CPU Jan 2012
10.2.0.5.7           13632743         Base 10.2.0.5.0         includes CPU Apr 2012
10.2.0.5.8           13923855         Base 10.2.0.5.0         includes CPU Jul 2012
10.2.0.5.9           14275629         Base 10.2.0.5.0         includes CPU Oct 2012
10.2.0.5.10          14727319         Base 10.2.0.5.0         includes CPU Jan 2013
10.2.0.5.11          16056270         Base 10.2.0.5.0         includes CPU Apr 2013
10.2.0.5.12          16619894         Base 10.2.0.5.0         includes CPU Jul 2013

+++++11.1.0.7
Oracle Database PSU   Database        CRS                   Comments          Includes Cpu 

11.1.0.7.1            8833297         bug: 8287931          Base 11.1.0.7.0   includes CPU Oct 2009
11.1.0.7.2            9209238         bug: 9207257          Base 11.1.0.7.0   includes CPU Jan 2010
11.1.0.7.3            9352179                               Base 11.1.0.7.0   includes CPU Apr 2010
11.1.0.7.4            9654987         bug: 9294495          Base 11.1.0.7.0   includes CPU Jul 2010
11.1.0.7.5            9952228         bug: 9952240          Base 11.1.0.7.0   includes CPU Oct 2010
11.1.0.7.6            10248531        bug: 10248535         Base 11.1.0.7.0   includes CPU Jan 2011
11.1.0.7.7            11724936        11724953              Base 11.1.0.7.0   includes CPU Apr 2011
11.1.0.7.8            12419384        11724953              Base 11.1.0.7.0   includes CPU Jul 2011
11.1.0.7.9            12827740        11724953              Base 11.1.0.7.0   includes CPU Oct 2011
11.1.0.7.10           13343461        11724953              Base 11.1.0.7.0   includes CPU Jan 2012
11.1.0.7.11           13621679        11724953              Base 11.1.0.7.0   includes CPU Apr 2012
11.1.0.7.12           13923474        11724953              Base 11.1.0.7.0   includes CPU Jul 2012
11.1.0.7.13           14275623        11724953              Base 11.1.0.7.0   includes CPU Oct 2012
11.1.0.7.14           14739378        11724953              Base 11.1.0.7.0   includes CPU Jan 2013
11.1.0.7.15           16056268        11724953              Base 11.1.0.7.0   includes CPU Apr 2013
11.1.0.7.16           16619896        11724953              Base 11.1.0.7.0   includes CPU Jul 2013
11.1.0.7.17           17082366        11724953              Base 11.1.0.7.0   includes CPU Oct 2013
11.1.0.7.18           17465583        11724953              Base 11.1.0.7.0   includes CPU Jan 2014
11.1.0.7.19           18031726        11724953              Base 11.1.0.7.0   includes CPU Apr 2014
11.1.0.7.20           18522513        11724953              Base 11.1.0.7.0   includes CPU Jul 2014
11.1.0.7.21           19152553        11724953              Base 11.1.0.7.0   includes CPU Oct 2014
11.1.0.7.22           19769499        11724953              Base 11.1.0.7.0   includes CPU Jan 2015
11.1.0.7.23           20299012        11724953              Base 11.1.0.7.0   includes CPU Apr 2015

OJVM PSU:             Database              CRS        Comments                  Includes JDBC Patch
11.1.0.7.1            19282002(Unix)                   Base 11.1.0.7.0
                      19806118(Win)
                      19852363(JDBC Patch)
11.1.0.7.2            19877446                         Base 11.1.0.7.20           Jan 2015
                                                     or SPU 11.1.0.7.0 (CPUOct2014)
11.1.0.7.3            20834724
++++++11.2.0.1
Oracle Database PSU  Database   Grid Infrastructure   Comments          Includes Cpu 

11.2.0.1.1           9352237    9343627               Base 11.2.0.1.0   includes CPU Apr 2010
11.2.0.1.2           9654983    9343627               Base 11.2.0.1.0   includes CPU Jul 2010
11.2.0.1.3           9952216    9655006               Base 11.2.0.1.0   includes CPU Oct 2010
11.2.0.1.4           10248516   9655006               Base 11.2.0.1.0   includes CPU Jan 2011
11.2.0.1.5           11724930   9655006               Base 11.2.0.1.0   includes CPU Apr 2011
11.2.0.1.6           12419378   9655006               Base 11.2.0.1.0   includes CPU Jul 2011

+++++++11.2.0.2
Oracle Database PSU  Database   Grid Infrastructure   Comments          Includes CPU
11.2.0.2.1           10248523   Bundle1 10157506      Base 11.2.0.2.0   no CPU fixes
11.2.0.2.2           11724916   Bundle2 10425672      Base 11.2.0.2.0   includes CPU Apr 2011
                                PSU2 12311357
11.2.0.2.3           12419331   12419353              Base 11.2.0.2.0   includes CPU Jul 2011
11.2.0.2.4           12827726   12827731              Base 11.2.0.2.0   includes CPU Oct 2011
11.2.0.2.5           13343424   13343447              Base 11.2.0.2.0   includes CPU Jan 2012
11.2.0.2.6           13696224   13696242              Base 11.2.0.2.0   includes CPU Apr 2012
11.2.0.2.7           13923804   14192201              Base 11.2.0.2.0   includes CPU Jul 2012
11.2.0.2.8           14275621   14390437              Base 11.2.0.2.0   includes CPU Oct 2012
11.2.0.2.9           14727315   14390437              Base 11.2.0.2.0   includes CPU Jan 2013
11.2.0.2.10          16056267   16166868              Base 11.2.0.2.0   includes CPU Apr 2013
11.2.0.2.11          16619893   16742320              Base 11.2.0.2.0   includes CPU Jul 2013
11.2.0.2.12          17082367   17272753              Base 11.2.0.2.0   includes CPU Oct 2013

+++++++11.2.0.3
Oracle Database PSU  Database               Grid Infrastructure       Comments           Includes CPU
11.2.0.3.0           10404530               (included in 10404530)
11.2.0.3.1           13343438               13348650                  Base 11.2.0.3.0     includes CPU Jan 2012
11.2.0.3.2           13696216               13696251                  Base 11.2.0.3.0     includes CPU Apr 2012
11.2.0.3.3           13923374               13919095                  Base 11.2.0.3.0     includes CPU Jul 2012
11.2.0.3.4           14275605               14275572                  Base 11.2.0.3.0     includes CPU Oct 2012
11.2.0.3.5           14727310               14727347                  Base 11.2.0.3.0     includes CPU Jan 2013
11.2.0.3.6           16056266               16083653                  Base 11.2.0.3.0     includes CPU Apr 2013
11.2.0.3.7           16619892               16742216                  Base 11.2.0.3.0     includes CPU Jul 2013
11.2.0.3.8           16902043               17272731                  Base 11.2.0.3.0     includes CPU Oct 2013
11.2.0.3.9           17540582               17735354                  Base 11.2.0.3.0     includes CPU Jan 2014
11.2.0.3.10          18031683               18139678                  Base 11.2.0.3.0     includes CPU Apr 2014
11.2.0.3.11          18522512               18706488                  Base 11.2.0.3.0     includes CPU Jul 2014
11.2.0.3.12          19121548               19440385                  Base 11.2.0.3.0     includes CPU Oct 2014
11.2.0.3.13          19769496               19971343                  Base 11.2.0.3.0     includes CPU Jan 2015
11.2.0.3.14          20299017               20485830                  Base 11.2.0.3.0     includes CPU Apr 2015

OJVM PSU:            Database               Grid Infrastructure       Comments            Includes JDBC Patch

11.2.0.3.1           19282015(Unix)                                   PSU 11.2.0.3.0
                     19806120(Win)                                 or SPU 11.2.0.3 (CPUOct2014)
                     19852361(JDBC Patch)
11.2.0.3.2           19877443               19852361                  Base 11.2.0.3.0
11.2.0.3.3           20834670               20834686

+++++++11.2.0.4
Oracle Database PSU  Database               Grid Infrastructure       Comments               Includes CPU
11.2.0.4.0           13390677               13390677
11.2.0.4.1           17478514                                          Base 11.2.0.4.0         includes CPU Jan 2014
11.2.0.4.2           18031668               18139609                   Base 11.2.0.4.0         includes CPU Apr 2014
11.2.0.4.3           18522509               18706472                   Base 11.2.0.4.0         includes CPU Jul 2014
11.2.0.4.4           19121551               19380115                   Base 11.2.0.4.0         includes CPU Oct 2014
11.2.0.4.5           19769489               19955028                   Base 11.2.0.4.0         includes CPU Jan 2015
11.2.0.4.6           20299013               20485808                   Base 11.2.0.4.0         includes CPU Apr 2015

OJVM PSU:            Database               Grid Infrastructure        Comments               Includes JDBC Patch

11.2.0.4.1           19282021(Unix)                                    Base 11.2.0.4.0
                     19799291(Win)                                   or SPU 11.2.0.4 (CPUOct2014)
                     19852360(JDBC Patch)
11.2.0.4.2           19877440                19852360                  Base 11.2.0.4.0

11.2.0.4.3           20834611                20834621                  Base 11.2.0.4.0

+++++++12.1.0.1
Oracle Database PSU  Database           Grid Infrastructure       Comments           Includes CPU

12.1.0.1.1           17027533           17272829                  Base 12.1.0.1.0
12.1.0.1.2           17552800           17735306                  Base 12.1.0.1.0    includes CPU Jan 2014
12.1.0.1.3           18031528           18139660(AIX/HP/zLinux)   Base 12.1.0.1.0    includes CPU Apr 2014
                                        18413105(Linux/Solaris)
12.1.0.1.4           18522516           18705972(AIX/HP/zLinux)   Base 12.1.0.1.0    includes CPU Jul 2014
                                        18705901(Linux/Solaris)

12.1.0.1.5           19121550           19392451(AIX/HP/zLinux)   Base 12.1.0.1.0    includes CPU Oct 2014
                                        19392372(Linux/Solaris)

12.1.0.1.6           19769486           19971331(AIX/HP/zLinux)   Base 12.1.0.1.0    includes CPU Jan 2015
                                        19971324(Linux/Solaris)

12.1.0.1.7           20299016           20485774(AIX/HP/zLinux)   Base 12.1.0.1.0    includes CPU Apr 2015
                                        20485762(Linux/Solaris)

OJVM PSU:            Database              Grid Infrastructure       Comments             Includes JDBC Patch

12.1.0.1.1           19282024(Unix)                                   Base 12.1.0.1.0
                     19801531(Win)
                     19852357(JDBC Patch)
12.1.0.1.2           19877342              19852357                   Base 12.1.0.1.0     includes CPU Jan 2015

12.1.0.1.3           20834568              20834579                   Base 12.1.0.1.0     includes CPU Apr 2015

++++++ 12.1.0.2
Oracle Database PSU  Database         Grid Infrastructure       Comments           Includes CPU

12.1.0.2.1           19303936          19392646                  Base 12.1.0.2.0    includes CPU Oct 2014
12.1.0.2.2           19769480          19954978                  Base 12.1.0.2.0    includes CPU Jan 2015
12.1.0.2.3           20299023          20485724                  Base 12.1.0.2.0    includes CPU Apr 2015

OJVM PSU:            Database         Grid Infrastructure       Comments           Includes JDBC Patch

12.1.0.2.1           19282028                                    Base 12.1.0.2.0
12.1.0.2.2           19877336          20132450                  Base 12.1.0.2.0    Jan 2015
12.1.0.2.3           20834354          20834538                  Base 12.1.0.2.0    Apr 2015

Note: the OJVM DB PSU includes the preceding DB PSU; that is, 20299029 is also contained in patch 20834354.
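Remarks:
1) For details on the JDBC Patch and the OJVM PSU, see the MOS doc:
Oracle Recommended Patches -- "Oracle JavaVM Component Database PSU" (OJVM PSU) Patches (1929745.1)
2) Installing an OJVM PSU places a requirement on the patch level of the database home, which must not be lower than the patches released in October 2014, i.e. "require the database home to be patched to at least October 2014 DB PSU".
In other words, to install the OJVM PSU on 11.1.0.7, the home must be at 11.1.0.7.20 or later. See the Comments column above.
3) The security patch formerly called CPU (Critical Patch Update) has been renamed SPU (Security Patch Update).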
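
One way to check whether a database meets the prerequisite in remark 2 is to look at the patch history the database itself records. The queries below are a minimal sketch, assuming a session that can read the DBA_ views: on 10g/11g the PSU history written by catbundle.sql shows up in DBA_REGISTRY_HISTORY, while 12c records datapatch actions in DBA_REGISTRY_SQLPATCH. The inventory of the ORACLE_HOME itself is still best confirmed with `opatch lsinventory`.

```sql
-- 10g/11g: bundle/PSU history recorded when catbundle.sql was run
SELECT action_time, action, version, id, bundle_series, comments
FROM   dba_registry_history
ORDER  BY action_time;

-- 12c: SQL patches applied by datapatch
SELECT patch_id, action, status, action_time, description
FROM   dba_registry_sqlpatch
ORDER  BY action_time;

-- Is the JavaVM component installed at all? If not, the OJVM PSU
-- is usually of little interest for this database.
SELECT comp_id, version, status
FROM   dba_registry
WHERE  comp_id = 'JAVAVM';
```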


SQL performance problem caused by _optimizer_null_aware_antijoin


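A few days ago a customer contacted me: on a system whose storage we had migrated earlier, one SQL statement was running extremely slowly and never returned a result. I logged in over VPN and confirmed that it was indeed very slow. At first I was puzzled: we had only migrated the storage and had barely touched the database, so why would a SQL performance problem appear? Let's first look at the problem SQL: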

SYS@rptdb1> set autot traceonly exp
SYS@rptdb1> select a.*,b.rate TAX_RATE,round(a.charge*b.rate/(1+b.rate),0) tax,a.charge-round((a.charge*b.rate/(1+b.rate)),0) charge_flh,1 flag,b.tax_rule_id
  2               from statrpt.rpt_offer_rate b,statrpt.tmp_item_aggr_ex_691  a
  3               where  a.acct_item_type_id = b.acct_item_type_id
  4               and a.offer_cd =b.offer_ID
  5             union all
  6
SYS@rptdb1>              select a.*,b.rate TAX_RATE,round(a.charge*b.rate/(1+b.rate),0) tax,a.charge-ROUND((a.charge*b.rate/(1+b.rate)),0) charge_flh,2,b.tax_rule_id
  2               from  statrpt.rpt_product_rate b,statrpt.tmp_item_aggr_ex_691  a
  3               where   a.acct_item_type_id = b.acct_item_type_id
  4               and  a.product_id=b.product_id
  5               and (a.acct_item_type_id,a.offer_cd) not in(select acct_item_type_id,offer_id from statrpt.rpt_offer_rate)
  6              union all
  7
SYS@rptdb1>              select a.*,b.rate TAX_RATE,round(a.charge*b.rate/(1+b.rate),0) tax,a.charge-round((a.charge*b.rate/(1+b.rate)),0) charge_flh,3,b.tax_rule_id
  2               from  statrpt.rpt_zm_rate b,statrpt.tmp_item_aggr_ex_691  a
  3               where  a.acct_item_type_id = b.acct_item_type_id
  4               and (a.acct_item_type_id,a.offer_cd)not in(select acct_item_type_id,offer_id from statrpt.rpt_offer_rate )
  5               and (a.acct_item_type_id,a.product_id) not in(select acct_item_type_id,product_id from statrpt.rpt_product_rate )
  6               union all
  7               select a.*,b.rate TAX_RATE,round(a.charge*b.rate/(1+b.rate),0) tax,a.charge-round((a.charge*b.rate/(1+b.rate)),0) charge_flh,4,b.tax_rule_id
  8               from  statrpt.tmp_zm_only_rate b,statrpt.tmp_item_aggr_ex_691  a
  9               where  a.acct_item_type_id = b.acct_item_type_id
 10               and (a.acct_item_type_id,a.offer_cd)not in(select acct_item_type_id,offer_id from statrpt.rpt_offer_rate )
 11               and (a.acct_item_type_id,a.product_id) not in(select acct_item_type_id,product_id from statrpt.rpt_product_rate )
 12               and (a.acct_item_type_id) not in(select acct_item_type_id from statrpt.rpt_zm_rate )
 13  /
Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 1624413711

-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                      |  6983K|   765M|   563M (51)|999:59:59 |        |      |            |
|   1 |  UNION-ALL                |                      |       |       |            |          |        |      |            |
|*  2 |   FILTER                  |                      |       |       |            |          |        |      |            |
|   3 |    PX COORDINATOR         |                      |       |       |            |          |        |      |            |
|   4 |     PX SEND QC (RANDOM)   | :TQ60001             |  3494K|   383M|  1050   (1)| 00:00:13 |  Q6,01 | P->S | QC (RAND)  |
|*  5 |      HASH JOIN            |                      |  3494K|   383M|  1050   (1)| 00:00:13 |  Q6,01 | PCWP |            |
|   6 |       PX RECEIVE          |                      |  1034 | 14476 |     3   (0)| 00:00:01 |  Q6,01 | PCWP |            |
|   7 |        PX SEND BROADCAST  | :TQ60000             |  1034 | 14476 |     3   (0)| 00:00:01 |  Q6,00 | P->P | BROADCAST  |
|   8 |         PX BLOCK ITERATOR |                      |  1034 | 14476 |     3   (0)| 00:00:01 |  Q6,00 | PCWC |            |
|   9 |          TABLE ACCESS FULL| RPT_ZM_RATE          |  1034 | 14476 |     3   (0)| 00:00:01 |  Q6,00 | PCWP |            |
|  10 |       PX BLOCK ITERATOR   |                      |  3494K|   336M|  1046   (1)| 00:00:13 |  Q6,01 | PCWC |            |
|  11 |        TABLE ACCESS FULL  | TMP_ITEM_AGGR_EX_691 |  3494K|   336M|  1046   (1)| 00:00:13 |  Q6,01 | PCWP |            |
|  12 |    PX COORDINATOR         |                      |       |       |            |          |        |      |            |
|  13 |     PX SEND QC (RANDOM)   | :TQ10000             |     1 |     9 |     8   (0)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|  14 |      PX BLOCK ITERATOR    |                      |     1 |     9 |     8   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|* 15 |       TABLE ACCESS FULL   | RPT_OFFER_RATE       |     1 |     9 |     8   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  16 |    PX COORDINATOR         |                      |       |       |            |          |        |      |            |
|  17 |     PX SEND QC (RANDOM)   | :TQ20000             |     1 |     9 |    96   (2)| 00:00:02 |  Q2,00 | P->S | QC (RAND)  |
|  18 |      PX BLOCK ITERATOR    |                      |     1 |     9 |    96   (2)| 00:00:02 |  Q2,00 | PCWC |            |
|* 19 |       TABLE ACCESS FULL   | RPT_PRODUCT_RATE     |     1 |     9 |    96   (2)| 00:00:02 |  Q2,00 | PCWP |            |
|* 20 |   FILTER                  |                      |       |       |            |          |        |      |            |
|  21 |    PX COORDINATOR         |                      |       |       |            |          |        |      |            |
|  22 |     PX SEND QC (RANDOM)   | :TQ70001             |  3494K|   383M|  1050   (1)| 00:00:13 |  Q7,01 | P->S | QC (RAND)  |
|* 23 |      HASH JOIN            |                      |  3494K|   383M|  1050   (1)| 00:00:13 |  Q7,01 | PCWP |            |
|  24 |       PX RECEIVE          |                      |  6053 | 84742 |     3   (0)| 00:00:01 |  Q7,01 | PCWP |            |
|  25 |        PX SEND BROADCAST  | :TQ70000             |  6053 | 84742 |     3   (0)| 00:00:01 |  Q7,00 | P->P | BROADCAST  |
|  26 |         PX BLOCK ITERATOR |                      |  6053 | 84742 |     3   (0)| 00:00:01 |  Q7,00 | PCWC |            |
|  27 |          TABLE ACCESS FULL| TMP_ZM_ONLY_RATE     |  6053 | 84742 |     3   (0)| 00:00:01 |  Q7,00 | PCWP |            |
|  28 |       PX BLOCK ITERATOR   |                      |  3494K|   336M|  1046   (1)| 00:00:13 |  Q7,01 | PCWC |            |
|  29 |        TABLE ACCESS FULL  | TMP_ITEM_AGGR_EX_691 |  3494K|   336M|  1046   (1)| 00:00:13 |  Q7,01 | PCWP |            |
|  30 |    PX COORDINATOR         |                      |       |       |            |          |        |      |            |
|  31 |     PX SEND QC (RANDOM)   | :TQ30000             |     1 |     9 |     8   (0)| 00:00:01 |  Q3,00 | P->S | QC (RAND)  |
|  32 |      PX BLOCK ITERATOR    |                      |     1 |     9 |     8   (0)| 00:00:01 |  Q3,00 | PCWC |            |
|* 33 |       TABLE ACCESS FULL   | RPT_OFFER_RATE       |     1 |     9 |     8   (0)| 00:00:01 |  Q3,00 | PCWP |            |
|  34 |    PX COORDINATOR         |                      |       |       |            |          |        |      |            |
|  35 |     PX SEND QC (RANDOM)   | :TQ40000             |     1 |     9 |    96   (2)| 00:00:02 |  Q4,00 | P->S | QC (RAND)  |
|  36 |      PX BLOCK ITERATOR    |                      |     1 |     9 |    96   (2)| 00:00:02 |  Q4,00 | PCWC |            |
|* 37 |       TABLE ACCESS FULL   | RPT_PRODUCT_RATE     |     1 |     9 |    96   (2)| 00:00:02 |  Q4,00 | PCWP |            |
|  38 |    PX COORDINATOR         |                      |       |       |            |          |        |      |            |
|  39 |     PX SEND QC (RANDOM)   | :TQ50000             |     1 |     5 |     3   (0)| 00:00:01 |  Q5,00 | P->S | QC (RAND)  |
|  40 |      PX BLOCK ITERATOR    |                      |     1 |     5 |     3   (0)| 00:00:01 |  Q5,00 | PCWC |            |
|* 41 |       TABLE ACCESS FULL   | RPT_ZM_RATE          |     1 |     5 |     3   (0)| 00:00:01 |  Q5,00 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter( NOT EXISTS (SELECT 0 FROM "STATRPT"."RPT_OFFER_RATE" "RPT_OFFER_RATE" WHERE
              LNNVL("ACCT_ITEM_TYPE_ID"<>:B1) AND LNNVL("OFFER_ID"<>:B2)) AND  NOT EXISTS (SELECT 0 FROM
              "STATRPT"."RPT_PRODUCT_RATE" "RPT_PRODUCT_RATE" WHERE LNNVL("ACCT_ITEM_TYPE_ID"<>:B3) AND LNNVL("PRODUCT_ID"<>:B4)))
   5 - access("A"."ACCT_ITEM_TYPE_ID"="B"."ACCT_ITEM_TYPE_ID")
  15 - filter(LNNVL("ACCT_ITEM_TYPE_ID"<>:B1) AND LNNVL("OFFER_ID"<>:B2))
  19 - filter(LNNVL("ACCT_ITEM_TYPE_ID"<>:B1) AND LNNVL("PRODUCT_ID"<>:B2))
  20 - filter( NOT EXISTS (SELECT 0 FROM "STATRPT"."RPT_OFFER_RATE" "RPT_OFFER_RATE" WHERE
              LNNVL("ACCT_ITEM_TYPE_ID"<>:B1) AND LNNVL("OFFER_ID"<>:B2)) AND  NOT EXISTS (SELECT 0 FROM
              "STATRPT"."RPT_PRODUCT_RATE" "RPT_PRODUCT_RATE" WHERE LNNVL("ACCT_ITEM_TYPE_ID"<>:B3) AND LNNVL("PRODUCT_ID"<>:B4))
              AND  NOT EXISTS (SELECT 0 FROM "STATRPT"."RPT_ZM_RATE" "RPT_ZM_RATE" WHERE LNNVL("ACCT_ITEM_TYPE_ID"<>:B5)))
  23 - access("A"."ACCT_ITEM_TYPE_ID"="B"."ACCT_ITEM_TYPE_ID")
  33 - filter(LNNVL("ACCT_ITEM_TYPE_ID"<>:B1) AND LNNVL("OFFER_ID"<>:B2))
  37 - filter(LNNVL("ACCT_ITEM_TYPE_ID"<>:B1) AND LNNVL("PRODUCT_ID"<>:B2))
  41 - filter(LNNVL("ACCT_ITEM_TYPE_ID"<>:B1))

 

 

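As the execution plan shows, the COST is enormous, so this SQL clearly cannot finish. SQL tuning is not my strong suit, so I went straight to a comparison and ran the same SQL on the original database: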

Plan hash value: 2514835211

---------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name                 | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                      |       |       |       |  3493 (100)|          |        |      |            |
|   1 |  UNION-ALL                      |                      |       |       |       |            |          |        |      |            |
|   2 |   PX COORDINATOR                |                      |       |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)          | :TQ10004             |   557 | 71853 |       |  1745   (3)| 00:00:21 |  Q1,04 | P->S | QC (RAND)  |
|*  4 |     HASH JOIN BUFFERED          |                      |   557 | 71853 |       |  1745   (3)| 00:00:21 |  Q1,04 | PCWP |            |
|   5 |      PX RECEIVE                 |                      |   557 | 64055 |       |  1742   (3)| 00:00:21 |  Q1,04 | PCWP |            |
|   6 |       PX SEND HASH              | :TQ10002             |   557 | 64055 |       |  1742   (3)| 00:00:21 |  Q1,02 | P->P | HASH       |
|   7 |        MERGE JOIN ANTI NA       |                      |   557 | 64055 |       |  1742   (3)| 00:00:21 |  Q1,02 | PCWP |            |
|   8 |         SORT JOIN               |                      | 55738 |  5769K|    12M|  1733   (3)| 00:00:21 |  Q1,02 | PCWP |            |
|   9 |          MERGE JOIN ANTI NA     |                      | 55738 |  5769K|       |  1732   (3)| 00:00:21 |  Q1,02 | PCWP |            |
|  10 |           SORT JOIN             |                      |  5573K|   515M|  1643M|  1631   (2)| 00:00:20 |  Q1,02 | PCWP |            |
|  11 |            PX BLOCK ITERATOR    |                      |  5573K|   515M|       |  1614   (1)| 00:00:20 |  Q1,02 | PCWC |            |
|* 12 |             TABLE ACCESS FULL   | TMP_ITEM_AGGR_EX_691 |  5573K|   515M|       |  1614   (1)| 00:00:20 |  Q1,02 | PCWP |            |
|* 13 |           SORT UNIQUE           |                      |   421K|  3704K|    16M|   101   (5)| 00:00:02 |  Q1,02 | PCWP |            |
|  14 |            PX RECEIVE           |                      |   421K|  3704K|       |    96   (0)| 00:00:02 |  Q1,02 | PCWP |            |
|  15 |             PX SEND BROADCAST   | :TQ10000             |   421K|  3704K|       |    96   (0)| 00:00:02 |  Q1,00 | P->P | BROADCAST  |
|  16 |              PX BLOCK ITERATOR  |                      |   421K|  3704K|       |    96   (0)| 00:00:02 |  Q1,00 | PCWC |            |
|* 17 |               TABLE ACCESS FULL | RPT_PRODUCT_RATE     |   421K|  3704K|       |    96   (0)| 00:00:02 |  Q1,00 | PCWP |            |
|* 18 |         SORT UNIQUE             |                      | 22695 |   199K|       |     9  (12)| 00:00:01 |  Q1,02 | PCWP |            |
|  19 |          PX RECEIVE             |                      | 22695 |   199K|       |     8   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  20 |           PX SEND BROADCAST     | :TQ10001             | 22695 |   199K|       |     8   (0)| 00:00:01 |  Q1,01 | P->P | BROADCAST  |
|  21 |            PX BLOCK ITERATOR    |                      | 22695 |   199K|       |     8   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|* 22 |             TABLE ACCESS FULL   | RPT_OFFER_RATE       | 22695 |   199K|       |     8   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|  23 |      PX RECEIVE                 |                      |  1059 | 14826 |       |     3   (0)| 00:00:01 |  Q1,04 | PCWP |            |
|  24 |       PX SEND HASH              | :TQ10003             |  1059 | 14826 |       |     3   (0)| 00:00:01 |  Q1,03 | P->P | HASH       |
|  25 |        PX BLOCK ITERATOR        |                      |  1059 | 14826 |       |     3   (0)| 00:00:01 |  Q1,03 | PCWC |            |
|* 26 |         TABLE ACCESS FULL       | RPT_ZM_RATE          |  1059 | 14826 |       |     3   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  27 |   PX COORDINATOR                |                      |       |       |       |            |          |        |      |            |
|  28 |    PX SEND QC (RANDOM)          | :TQ20004             |     6 |   804 |       |  1748   (3)| 00:00:21 |  Q2,04 | P->S | QC (RAND)  |
|* 29 |     HASH JOIN                   |                      |     6 |   804 |       |  1748   (3)| 00:00:21 |  Q2,04 | PCWP |            |
|  30 |      PX RECEIVE                 |                      |     6 |   720 |       |  1745   (3)| 00:00:21 |  Q2,04 | PCWP |            |
|  31 |       PX SEND BROADCAST         | :TQ20003             |     6 |   720 |       |  1745   (3)| 00:00:21 |  Q2,03 | P->P | BROADCAST  |
|  32 |        MERGE JOIN ANTI NA       |                      |     6 |   720 |       |  1745   (3)| 00:00:21 |  Q2,03 | PCWP |            |
|  33 |         SORT JOIN               |                      |   557 | 61827 |       |  1736   (3)| 00:00:21 |  Q2,03 | PCWP |            |
|* 34 |          HASH JOIN RIGHT ANTI NA|                      |   557 | 61827 |       |  1735   (3)| 00:00:21 |  Q2,03 | PCWP |            |
|  35 |           PX RECEIVE            |                      |  1059 |  5295 |       |     3   (0)| 00:00:01 |  Q2,03 | PCWP |            |
|  36 |            PX SEND BROADCAST    | :TQ20000             |  1059 |  5295 |       |     3   (0)| 00:00:01 |  Q2,00 | P->P | BROADCAST  |
|  37 |             PX BLOCK ITERATOR   |                      |  1059 |  5295 |       |     3   (0)| 00:00:01 |  Q2,00 | PCWC |            |
|* 38 |              TABLE ACCESS FULL  | RPT_ZM_RATE          |  1059 |  5295 |       |     3   (0)| 00:00:01 |  Q2,00 | PCWP |            |
|  39 |           MERGE JOIN ANTI NA    |                      | 55738 |  5769K|       |  1732   (3)| 00:00:21 |  Q2,03 | PCWP |            |
|  40 |            SORT JOIN            |                      |  5573K|   515M|  1643M|  1631   (2)| 00:00:20 |  Q2,03 | PCWP |            |
|  41 |             PX BLOCK ITERATOR   |                      |  5573K|   515M|       |  1614   (1)| 00:00:20 |  Q2,03 | PCWC |            |
|* 42 |              TABLE ACCESS FULL  | TMP_ITEM_AGGR_EX_691 |  5573K|   515M|       |  1614   (1)| 00:00:20 |  Q2,03 | PCWP |            |
|* 43 |            SORT UNIQUE          |                      |   421K|  3704K|    16M|   101   (5)| 00:00:02 |  Q2,03 | PCWP |            |
|  44 |             PX RECEIVE          |                      |   421K|  3704K|       |    96   (0)| 00:00:02 |  Q2,03 | PCWP |            |
|  45 |              PX SEND BROADCAST  | :TQ20001             |   421K|  3704K|       |    96   (0)| 00:00:02 |  Q2,01 | P->P | BROADCAST  |
|  46 |               PX BLOCK ITERATOR |                      |   421K|  3704K|       |    96   (0)| 00:00:02 |  Q2,01 | PCWC |            |
|* 47 |                TABLE ACCESS FULL| RPT_PRODUCT_RATE     |   421K|  3704K|       |    96   (0)| 00:00:02 |  Q2,01 | PCWP |            |
|* 48 |         SORT UNIQUE             |                      | 22695 |   199K|       |     9  (12)| 00:00:01 |  Q2,03 | PCWP |            |
|  49 |          PX RECEIVE             |                      | 22695 |   199K|       |     8   (0)| 00:00:01 |  Q2,03 | PCWP |            |
|  50 |           PX SEND BROADCAST     | :TQ20002             | 22695 |   199K|       |     8   (0)| 00:00:01 |  Q2,02 | P->P | BROADCAST  |
|  51 |            PX BLOCK ITERATOR    |                      | 22695 |   199K|       |     8   (0)| 00:00:01 |  Q2,02 | PCWC |            |
|* 52 |             TABLE ACCESS FULL   | RPT_OFFER_RATE       | 22695 |   199K|       |     8   (0)| 00:00:01 |  Q2,02 | PCWP |            |
|  53 |      PX BLOCK ITERATOR          |                      |  6083 | 85162 |       |     3   (0)| 00:00:01 |  Q2,04 | PCWC |            |
|* 54 |       TABLE ACCESS FULL         | TMP_ZM_ONLY_RATE     |  6083 | 85162 |       |     3   (0)| 00:00:01 |  Q2,04 | PCWP |            |
---------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("A"."ACCT_ITEM_TYPE_ID"="B"."ACCT_ITEM_TYPE_ID")
  12 - access(:Z>=:Z AND :Z<=:Z)
  13 - access(INTERNAL_FUNCTION("A"."ACCT_ITEM_TYPE_ID")=INTERNAL_FUNCTION("ACCT_ITEM_TYPE_ID") AND
              INTERNAL_FUNCTION("A"."PRODUCT_ID")=INTERNAL_FUNCTION("PRODUCT_ID"))
       filter((INTERNAL_FUNCTION("A"."PRODUCT_ID")=INTERNAL_FUNCTION("PRODUCT_ID") AND
              INTERNAL_FUNCTION("A"."ACCT_ITEM_TYPE_ID")=INTERNAL_FUNCTION("ACCT_ITEM_TYPE_ID")))
  17 - access(:Z>=:Z AND :Z<=:Z)
  18 - access(INTERNAL_FUNCTION("A"."ACCT_ITEM_TYPE_ID")=INTERNAL_FUNCTION("ACCT_ITEM_TYPE_ID") AND
              INTERNAL_FUNCTION("A"."OFFER_CD")=INTERNAL_FUNCTION("OFFER_ID"))
       filter((INTERNAL_FUNCTION("A"."OFFER_CD")=INTERNAL_FUNCTION("OFFER_ID") AND
              INTERNAL_FUNCTION("A"."ACCT_ITEM_TYPE_ID")=INTERNAL_FUNCTION("ACCT_ITEM_TYPE_ID")))
  22 - access(:Z>=:Z AND :Z<=:Z)
  26 - access(:Z>=:Z AND :Z<=:Z)
  29 - access("A"."ACCT_ITEM_TYPE_ID"="B"."ACCT_ITEM_TYPE_ID")
  34 - access("A"."ACCT_ITEM_TYPE_ID"="ACCT_ITEM_TYPE_ID")
  38 - access(:Z>=:Z AND :Z<=:Z)
  42 - access(:Z>=:Z AND :Z<=:Z)
  43 - access(INTERNAL_FUNCTION("A"."ACCT_ITEM_TYPE_ID")=INTERNAL_FUNCTION("ACCT_ITEM_TYPE_ID") AND
              INTERNAL_FUNCTION("A"."PRODUCT_ID")=INTERNAL_FUNCTION("PRODUCT_ID"))
       filter((INTERNAL_FUNCTION("A"."PRODUCT_ID")=INTERNAL_FUNCTION("PRODUCT_ID") AND
              INTERNAL_FUNCTION("A"."ACCT_ITEM_TYPE_ID")=INTERNAL_FUNCTION("ACCT_ITEM_TYPE_ID")))
  47 - access(:Z>=:Z AND :Z<=:Z)
  48 - access(INTERNAL_FUNCTION("A"."ACCT_ITEM_TYPE_ID")=INTERNAL_FUNCTION("ACCT_ITEM_TYPE_ID") AND
              INTERNAL_FUNCTION("A"."OFFER_CD")=INTERNAL_FUNCTION("OFFER_ID"))
       filter((INTERNAL_FUNCTION("A"."OFFER_CD")=INTERNAL_FUNCTION("OFFER_ID") AND
              INTERNAL_FUNCTION("A"."ACCT_ITEM_TYPE_ID")=INTERNAL_FUNCTION("ACCT_ITEM_TYPE_ID")))
  52 - access(:Z>=:Z AND :Z<=:Z)
  54 - access(:Z>=:Z AND :Z<=:Z)

 

 

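Clearly the original database's execution plan is the better one. Comparing the two plans shows that in the slow plan the NOT IN was rewritten as NOT EXISTS and evaluated through FILTER operations, whereas the fast plan used an ANTI join.
So why was the statement fine before and a problem only after the storage migration? My first guess was that an optimizer parameter had been changed, and a check confirmed it: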

SYS@rptdb1> show parameter optimizer

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_optimizer_adaptive_cursor_sharing   boolean     FALSE
_optimizer_extended_cursor_sharing   string      NONE
_optimizer_extended_cursor_sharing_r string      NONE
el
_optimizer_null_aware_antijoin       boolean     FALSE
_optimizer_use_feedback              boolean     FALSE
optimizer_capture_sql_plan_baselines boolean     FALSE
optimizer_dynamic_sampling           integer     2
optimizer_features_enable            string      11.2.0.2
optimizer_index_caching              integer     0
optimizer_index_cost_adj             integer     100
optimizer_mode                       string      ALL_ROWS
optimizer_secure_view_merging        boolean     TRUE
optimizer_use_invisible_indexes      boolean     FALSE
optimizer_use_pending_statistics     boolean     FALSE
optimizer_use_sql_plan_baselines     boolean     TRUE
SYS@rptdb1>
SYS@rptdb1> alter session set "_optimizer_null_aware_antijoin"=true;

Session altered.

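After setting the parameter back to its default value, the test ran normally. In this case I fixed the SQL mainly through SQLT: first create a SQL profile, then edit the query block information (the hint entries) in the SQL profile, as follows: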

q'[OPT_PARAM('_optimizer_null_aware_antijoin' 'true')]',
q'[OPT_PARAM('_optimizer_extended_cursor_sharing' 'none')]',
q'[OPT_PARAM('_optimizer_extended_cursor_sharing_rel' 'none')]',
q'[OPT_PARAM('_optimizer_adaptive_cursor_sharing' 'false')]',
q'[OPT_PARAM('_optimizer_use_feedback' 'false')]',

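After this adjustment the SQL performance returned to normal. Although this is a very common problem, it was the first time I had run into it in production, so here is a simple test.
Note: the test script was found via Google.
-- For 10.2.0.5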

www.killdb.com> create table t1
  2  as select
  3  cast(rownum as int) a,
  4  cast(rownum+10 as int) b,
  5  cast(dbms_random.string('i',10) as varchar2(10)) c
  6  from dual connect by level<=10000;

Table created.

www.killdb.com> create table t2
  2  as select
  3  cast(rownum as int) a,
  4  cast(rownum+10 as int) b,
  5  cast(dbms_random.string('i',10) as varchar2(10)) c
  6  from dual connect by level<=9980;

Table created.

www.killdb.com>
www.killdb.com> set autot traceonly exp
www.killdb.com> analyze table t1 compute statistics;

Table analyzed.

www.killdb.com> analyze table t2 compute statistics;

Table analyzed.

www.killdb.com> select /*SQL_1*/ c from t1 where a not in (select a from t2) ;

Execution Plan
----------------------------------------------------------
Plan hash value: 895956251

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  9999 |   126K| 60407   (1)| 00:12:05 |
|*  1 |  FILTER            |      |       |       |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 10000 |   126K|    12   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| T2   |     1 |     3 |    12   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( NOT EXISTS (SELECT 0 FROM "T2" "T2" WHERE
              LNNVL("A"<>:B1)))
   3 - filter(LNNVL("A"<>:B1))

www.killdb.com> alter table t2 modify a not null ;

Table altered.

www.killdb.com> select /*SQL_2*/ c from t1 where a not in (select a from t2) ;

Execution Plan
----------------------------------------------------------
Plan hash value: 895956251

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  9999 |   126K| 60407   (1)| 00:12:05 |
|*  1 |  FILTER            |      |       |       |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 10000 |   126K|    12   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| T2   |     1 |     3 |    12   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( NOT EXISTS (SELECT 0 FROM "T2" "T2" WHERE
              LNNVL("A"<>:B1)))
   3 - filter(LNNVL("A"<>:B1))

www.killdb.com> create index idx_t2_a on t2(a);

Index created.

www.killdb.com> create index idx_t1_a on t1(a);

Index created.

www.killdb.com> select /*SQL_3*/ c from t1 where a not in (select a from t2) ;

Execution Plan
----------------------------------------------------------
Plan hash value: 377637984

----------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |  9999 |   126K| 35333   (1)| 00:07:04 |
|*  1 |  FILTER               |          |       |       |            |          |
|   2 |   TABLE ACCESS FULL   | T1       | 10000 |   126K|    12   (0)| 00:00:01 |
|*  3 |   INDEX FAST FULL SCAN| IDX_T2_A |     1 |     3 |     7   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( NOT EXISTS (SELECT 0 FROM "T2" "T2" WHERE LNNVL("A"<>:B1)))
   3 - filter(LNNVL("A"<>:B1))

www.killdb.com>

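As we can see, the plan still does not use an anti-join; it is still a FILTER. The FILTER here behaves much like a nested loop and is clearly inefficient, because every row of T1 has to be checked against the result set returned from T2. Is there a way to make this SQL use an anti-join? Certainly, as follows: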

www.killdb.com> alter table t1 modify a not null ;

Table altered.

www.killdb.com> select /*SQL_4*/ c from t1 where a not in (select a from t2) ;

Execution Plan
----------------------------------------------------------
Plan hash value: 1490751970

----------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |    20 |   320 |    20   (5)| 00:00:01 |
|*  1 |  HASH JOIN RIGHT ANTI |          |    20 |   320 |    20   (5)| 00:00:01 |
|   2 |   INDEX FAST FULL SCAN| IDX_T2_A |  9980 | 29940 |     7   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL   | T1       | 10000 |   126K|    12   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"="A")

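As we can see, the statement is far more efficient once it uses the anti-join. Of course, it can also be optimized without adding the NOT NULL constraint on t1.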
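To make the difference between the two plan shapes concrete, here is a minimal Python sketch. It is only a toy model and not how Oracle implements FILTER or HASH JOIN ANTI (a real FILTER can, for example, cache subquery results). The function names are made up for illustration, the keys contain no NULLs, and the row counts are scaled down from the 10000/9980 rows used in the test so that it runs quickly.

```python
# Toy model only: this is not Oracle's implementation of FILTER or HASH JOIN ANTI.
# Sizes are scaled down from the article's 10000/9980 rows to keep it quick.

def filter_not_exists(t1_keys, t2_keys):
    """FILTER-style evaluation: for each T1 row, scan T2 until a match is found."""
    kept, comparisons = [], 0
    for a in t1_keys:
        found = False
        for b in t2_keys:
            comparisons += 1
            if a == b:
                found = True
                break
        if not found:
            kept.append(a)
    return kept, comparisons


def hash_anti_join(t1_keys, t2_keys):
    """Hash anti-join: build a hash set on T2 once, then probe it once per T1 row."""
    build = set(t2_keys)
    kept, probes = [], 0
    for a in t1_keys:
        probes += 1
        if a not in build:
            kept.append(a)
    return kept, probes


t1 = list(range(1, 1001))   # stands in for t1.a (no NULLs)
t2 = list(range(1, 981))    # stands in for t2.a (no NULLs)

filter_rows, filter_work = filter_not_exists(t1, t2)
anti_rows, anti_work = hash_anti_join(t1, t2)

print(len(filter_rows), filter_work)   # same 20 rows, roughly half a million comparisons
print(len(anti_rows), anti_work)       # same 20 rows, one probe per T1 row
```

Both functions return the same 20 keys; only the amount of work differs, which is the gap the cost column of the two plans reflects.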
-- For 11.2.0.2

[ora11g@localhost ~]$ sqlplus "/as sysdba"

SQL*Plus: Release 11.2.0.2.0 Production on Sat Apr 18 22:59:32 2015

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options

www.killdb.com> conn roger/roger
Connected.
www.killdb.com>

SQL> create table t1
  2  as select
  3  cast(rownum as int) a,
  4  cast(rownum+10 as int) b,
  5  cast(dbms_random.string('i',10) as varchar2(10)) c
  6  from dual connect by level<=10000;

Table created.

SQL> create table t2
  2  as select
  3  cast(rownum as int) a,
  4  cast(rownum+10 as int) b,
  5  cast(dbms_random.string('i',10) as varchar2(10)) c
  6  from dual connect by level<=9980;

Table created.

SQL> analyze table t1 compute statistics ;

Table analyzed.

SQL> analyze table t2 compute statistics;

Table analyzed.

SQL> set autot traceonly exp
SQL> select /*SQL_1*/ c from t1 where a not in (select a from t2) ;

Execution Plan
----------------------------------------------------------
Plan hash value: 2739594415

--------------------------------------------------------------------------------
| Id  | Operation               | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |      |   100 |  1600 |    23   (5)| 00:00:01 |
|*  1 |  HASH JOIN RIGHT ANTI NA|      |   100 |  1600 |    23   (5)| 00:00:01 |
|   2 |   TABLE ACCESS FULL     | T2   |  9980 | 29940 |    11   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL     | T1   | 10000 |   126K|    11   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"="A")

SQL> alter session set "_optimizer_null_aware_antijoin"=false;

Session altered.

SQL> select /*SQL_2*/ c from t1 where a not in (select a from t2) ;

Execution Plan
----------------------------------------------------------
Plan hash value: 895956251

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  9999 |   126K| 55478   (1)| 00:11:06 |
|*  1 |  FILTER            |      |       |       |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 10000 |   126K|    11   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| T2   |     1 |     3 |    11   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( NOT EXISTS (SELECT 0 FROM "T2" "T2" WHERE
              LNNVL("A"<>:B1)))
   3 - filter(LNNVL("A"<>:B1))

SQL> alter table t2 modify a not null;

Table altered.

SQL> create index idx_t2_a on t2(a);

Index created.

SQL> create index idx_t1_a on t1(a);

Index created.

SQL>
SQL> alter session set "_optimizer_null_aware_antijoin"=true;

Session altered.

SQL>  select /*SQL_3*/ c from t1 where a not in (select a from t2) ;

Execution Plan
----------------------------------------------------------
Plan hash value: 2568882110

-------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |   100 |  1600 |    19   (6)| 00:00:01 |
|*  1 |  HASH JOIN RIGHT ANTI SNA|          |   100 |  1600 |    19   (6)| 00:00:01 |
|   2 |   INDEX FAST FULL SCAN   | IDX_T2_A |  9980 | 29940 |     7   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL      | T1       | 10000 |   126K|    11   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"="A")

SQL> alter session set "_optimizer_null_aware_antijoin"=false;

Session altered.

SQL> select /*SQL_3*/ c from t1 where a not in (select a from t2) ;

Execution Plan
----------------------------------------------------------
Plan hash value: 377637984

----------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |  9999 |   126K| 35396   (2)| 00:07:05 |
|*  1 |  FILTER               |          |       |       |            |          |
|   2 |   TABLE ACCESS FULL   | T1       | 10000 |   126K|    11   (0)| 00:00:01 |
|*  3 |   INDEX FAST FULL SCAN| IDX_T2_A |     1 |     3 |     7   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( NOT EXISTS (SELECT 0 FROM "T2" "T2" WHERE LNNVL("A"<>:B1)))
   3 - filter(LNNVL("A"<>:B1))

SQL>  alter table t1 modify a not null ;

Table altered.

SQL> select /*SQL_3*/ c from t1 where a not in (select a from t2) ;

Execution Plan
----------------------------------------------------------
Plan hash value: 1490751970

----------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |   100 |  1600 |    19   (6)| 00:00:01 |
|*  1 |  HASH JOIN RIGHT ANTI |          |   100 |  1600 |    19   (6)| 00:00:01 |
|   2 |   INDEX FAST FULL SCAN| IDX_T2_A |  9980 | 29940 |     7   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL   | T1       | 10000 |   126K|    11   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"="A")

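In fact, our tests show that the application SQL problem is essentially not caused by the parameter change but by non-standard SQL, or, put differently, by a flaw in the application's table design. For this SQL there is actually no need to touch the hidden parameter; adding NOT NULL constraints to the columns involved is enough.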
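Why do the NOT NULL constraints matter so much? A plain anti-join answers the same question as NOT EXISTS, but NOT IN has different semantics as soon as a NULL can show up, so without the constraints (or a null-aware anti-join) the optimizer cannot simply swap one for the other. The sketch below illustrates the NULL semantics only. It uses SQLite through Python's standard sqlite3 module as a stand-in, not Oracle, and the three-row tables are made up for the example.

```python
import sqlite3

# Illustration with SQLite (not Oracle): standard SQL NULL semantics for NOT IN.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER, c TEXT);
    CREATE TABLE t2 (a INTEGER);
    INSERT INTO t1 VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO t2 VALUES (1), (2);
""")

NOT_IN = "SELECT c FROM t1 WHERE a NOT IN (SELECT a FROM t2) ORDER BY a"
NOT_EXISTS = ("SELECT c FROM t1 WHERE NOT EXISTS "
              "(SELECT 1 FROM t2 WHERE t2.a = t1.a) ORDER BY a")

without_null = conn.execute(NOT_IN).fetchall()

# One NULL in the subquery column makes "a NOT IN (...)" unknown for every
# non-matching row, so NOT IN returns nothing while NOT EXISTS still finds 'three'.
conn.execute("INSERT INTO t2 VALUES (NULL)")
with_null = conn.execute(NOT_IN).fetchall()
not_exists_with_null = conn.execute(NOT_EXISTS).fetchall()

print(without_null, with_null, not_exists_with_null)
```

With NOT NULL declared on both columns the two forms are guaranteed to agree, which is what lets the optimizer pick the plain HASH JOIN RIGHT ANTI seen in the tests above.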


A case caused by library cache lock


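A colleague told me that a customer had a problem: the system showed a large number of library cache lock waits that seriously disrupted the business. Specifically, every SQL statement accessing a certain table hung. Let's first look at the hanganalyze output: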

PORADEBUG END ORIGINATING INST:1 SERIAL:0 PID:38076802
********************************************************************
Found 341 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <1/1513/11901/library cache lock>
Found 341 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <1/1148/42016/library cache lock>
Found 341 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <1/1395/45772/library cache lock>
Found 341 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <1/1574/16193/library cache lock>
Found 341 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <1/1488/64080/library cache lock>
Found 346 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <0/982/38928/0x435d270/46727232/library cache lock>

Cycle 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <1/1488/64080/library cache lock>
 -- <0/700/3738/0x4335f60/20840620/library cache lock>
Cycle 2 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/1012/44406/0xf35a3f8/17826228/library cache lock>
 -- <1/1513/11901/library cache lock>
Open chains found:
Chain 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/1516/42489/0x4338f00/34996438/single-task message>
 -- <0/982/38928/0x435d270/46727232/library cache lock>
 -- <0/627/7616/0x7319538/47710688/library cache lock>
 -- <1/1488/64080/library cache lock>
 -- <0/700/3738/0x4335f60/20840620/library cache lock>
 -- <1/1574/16193/library cache lock>
 -- <0/1162/22132/0x4360a00/31260934/library cache lock>
 -- <1/1395/45772/library cache lock>
 -- <0/1380/41831/0xf358c28/19202486/library cache lock>
 -- <1/1148/42016/library cache lock>
 -- <0/1012/44406/0xf35a3f8/17826228/library cache lock>
 -- <1/1513/11901/library cache lock>
 -- <0/609/6726/0x634af60/24903782/library cache lock>
Other chains found:
Chain 2 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/610/13426/0x23460c0/29163888/Streams AQ: qmn slave idle wait>
Chain 3 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/612/572/0xf359418/26214770/cursor: pin S wait on X>
Chain 4 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/615/402/0x73555a8/14221446/cursor: pin S wait on X>
........
........
    <0/1541/6064/0x4355370/52166804/cursor: pin S wait on X>
Chain 142 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/1553/56644/0x334f228/15991100/cursor: pin S wait on X>
Chain 143 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/1569/32611/0x23156e0/52756830/cursor: pin S wait on X>
Chain 144 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/1587/1/0x432e060/8323160/Streams AQ: waiting for time man>
Chain 145 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/1591/9/0x533f9e0/34406908/Streams AQ: waiting for messages>
Chain 146 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/1592/5/0x7312618/9502790/Streams AQ: qmn coordinator idle>
Chain 147 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/1654/1/0x230f7a0/21168480/No Wait>
Chain 148 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <1/1301/10502/cursor: pin S wait on X>
Chain 149 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <1/1468/31326/Streams AQ: qmn slave idle wait>
Chain 150 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <1/1471/39769/jobq slave wait>
Chain 151 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <1/1578/10/Streams AQ: waiting for time man>
Chain 152 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <1/1594/11/Streams AQ: qmn coordinator idle>
Chain 153 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <1/1654/1/No Wait>
Extra information that will be dumped at higher levels:
[level  3] : 346 node dumps -- [IN_HANG]
[level  4] :   1 node dumps -- [REMOTE_WT] [LEAF] [LEAF_NW]
[level  5] : 152 node dumps -- [SINGLE_NODE] [SINGLE_NODE_NW] [IGN_DMP]
[level  6] :   1 node dumps -- [NLEAF]
[level 10] : 704 node dumps -- [IGN] 

State of nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor):
[606]/0/607/27406/0x23f01f0/32702484/IGN/1/2//none
[608]/0/609/6726/0x641eeb0/24903782/IN_HANG/3/698/[3167][2802][3049][3228][3142][981]/none
[609]/0/610/13426/0x541b9b0/29163888/SINGLE_NODE/699/700//none
[610]/0/611/31/0xf3f9378/25231442/IN_HANG/37/38/[3167][2802][3049][3228][3142][981]/3142
[611]/0/612/572/0x440abd0/26214770/SINGLE_NODE/701/702//none
[614]/0/615/402/0x73ee5f0/14221446/SINGLE_NODE/703/704//none
[615]/0/616/223/0x6420428/13959648/SINGLE_NODE/705/706//none
[616]/0/617/10658/0x541cf28/19792010/IN_HANG/19/20/[3167][2802][3049][3228][3142][981]/3142
[617]/0/618/714/0xf3fa8f0/40632696/IN_HANG/43/44/[3167][2802][3049][3228][3142][981]/3142
[619]/0/620/1483/0x33f82f8/18219062/SINGLE_NODE/707/708//none
[620]/0/621/12164/0x23f2ce0/37552426/IGN/709/710//none
[621]/0/622/2718/0x73efb68/39911520/SINGLE_NODE/711/712//none
[622]/0/623/6949/0x64219a0/28049662/IGN/713/714//none
[623]/0/624/3802/0x541e4a0/6160704/SINGLE_NODE/715/716//none
[624]/0/625/883/0xf3fbe68/36438112/IGN/717/718//none
[626]/0/627/7616/0x33f9870/47710688/IN_HANG/13/18/[3167][2802][3049][3228][3142][981]/3142
[627]/0/628/4681/0x23f4258/14811164/SINGLE_NODE/719/720//none
[628]/0/629/833/0x73f10e0/5767552/IN_HANG/27/28/[3167][2802][3049][3228][3142][981]/3142
.......
.......
[973]/0/974/4600/0x5461610/12845484/IN_HANG/111/112/[3167][2802][3049][3228][3142][981]/3142
[974]/0/975/25532/0xf43efd8/53936616/IN_HANG/603/604/[3167][2802][3049][3228][3142][981]/3142
[975]/0/976/16608/0x4450830/20775322/SINGLE_NODE/993/994//none
[976]/0/977/4511/0x343c9e0/14287136/IGN/995/996//none
[977]/0/978/26807/0x24373c8/17760418/IGN/997/998//none
[978]/0/979/14805/0x7434250/42860784/IN_HANG/181/182/[3167][2802][3049][3228][3142][981]/3142
[980]/0/981/22918/0x5462b88/32374886/IN_HANG/429/430/[3167][2802][3049][3228][3142][981]/3142
[981]/0/982/38928/0xf440550/46727232/NLEAF/14/17/[1515]/626
[982]/0/983/19754/0x4451da8/10355146/IN_HANG/605/606/[3167][2802][3049][3228][3142][981]/3142
[983]/0/984/34596/0x343df58/50463108/IN_HANG/441/442/[3167][2802][3049][3228][3142][981]/3142
[984]/0/985/8594/0x2438940/21889486/SINGLE_NODE/999/1000//none

......
[1512]/0/1513/6509/0x54c8b28/21692662/IN_HANG/247/248/[3167][2802][3049][3228][3142][981]/3142
[1513]/0/1514/47903/0xf4a64f0/30933308/IN_HANG/351/352/[3167][2802][3049][3228][3142][981]/3142
[1514]/0/1515/52967/0x44b7d48/11207134/IGN/1795/1796//none
[1515]/0/1516/42489/0x34a3ef8/34996438/LEAF/15/16//981
[1516]/0/1517/28157/0x249e8e0/19464364/IGN/1797/1798//none
[1517]/0/1518/28318/0x749b768/25166076/IGN/1799/1800//none
......

Here is a brief explanation of parts of this output:

Found 341 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <1/1513/11901/library cache lock>
......
Found 346 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <0/982/38928/0x435d270/46727232/library cache lock>

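The information above shows that, at the time of the dump, session 1513 was blocking 341 database sessions. Blocking more than 300 sessions would have a serious impact on almost any system. At a slightly later point in time the dump found session 982 blocking 346 sessions.
Although there are several such blocking records, this is not hard to understand: they can be read as snapshots taken at different points in time (the dump may have taken several minutes to complete).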
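When a trace contains many of these "Found N objects waiting" records, it can help to pull them out mechanically. The following is a minimal Python sketch, assuming the two-line layout shown above; parse_blockers is a name made up for this example, and the sketch assumes the wait event name itself contains no "/" character.

```python
import re

SAMPLE = """\
Found 341 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <1/1513/11901/library cache lock>
Found 346 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <0/982/38928/0x435d270/46727232/library cache lock>
"""

def parse_blockers(text):
    """Return (cnode, sid, waiter_count, wait_event) for each 'Found N objects' pair."""
    results = []
    count = None
    for line in text.splitlines():
        m = re.match(r"Found (\d+) objects waiting for", line.strip())
        if m:
            count = int(m.group(1))
            continue
        m = re.match(r"<(.+)>$", line.strip())
        if m and count is not None:
            fields = m.group(1).split("/")
            # Remote sessions are printed without proc_ptr/ospid, so only the
            # first two fields and the last one sit at fixed positions.
            results.append((int(fields[0]), int(fields[1]), count, fields[-1]))
            count = None
    return results

blockers = parse_blockers(SAMPLE)
for cnode, sid, waiters, event in blockers:
    print(cnode, sid, waiters, event)
```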
The second part is the deadlock information, as follows:

Cycle 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <1/1488/64080/library cache lock>
 -- <0/700/3738/0x4335f60/20840620/library cache lock>
Cycle 2 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/1012/44406/0xf35a3f8/17826228/library cache lock>
 -- <1/1513/11901/library cache lock>

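This is a cycle, which is similar to a deadlock. Generally speaking, if a cycle appears in the trace and the processes in the cycle are also blocking a large number of sessions, the system has probably already hung. From the information above, session 1488 on node 2 and session 700 on node 1 are "deadlocked" with each other. The odd thing is that both sessions are waiting on library cache lock. Sessions 1012 and 1513 below are in the same situation, both also waiting on library cache lock.
Let's continue with the third part: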

State of nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor):
[606]/0/607/27406/0x23f01f0/32702484/IGN/1/2//none
[608]/0/609/6726/0x641eeb0/24903782/IN_HANG/3/698/[3167][2802][3049][3228][3142][981]/none
[609]/0/610/13426/0x541b9b0/29163888/SINGLE_NODE/699/700//none
[610]/0/611/31/0xf3f9378/25231442/IN_HANG/37/38/[3167][2802][3049][3228][3142][981]/3142
[611]/0/612/572/0x440abd0/26214770/SINGLE_NODE/701/702//none
.....

This part shows the state of every process and its blocking relationships. The process states fall mainly into the following types:

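IN_HANG : A very dangerous state; it usually means the session is stuck in a loop or hung.
          Generally, when this happens the node's adjacent nodes (adjlist) are in the same state. The adjlist entries are node numbers, each of which identifies a session (in this dump, node [981] is sid 982).
LEAF    : Usually regarded as the prime blocker candidate. The predecessor field that follows helps determine whether the session is a blocker or a waiter.
LEAF_NW : Similar to LEAF, but the session may be consuming CPU.
NLEAF   : Sessions in this state are usually regarded as "stuck" sessions, i.e. they are holding resources that other sessions need.
IGN     : Sessions in this state are usually idle, unless an adjlist is present, in which case the session is waiting for another session.
IGN_DMP : Similar to IGN.
SINGLE_NODE, SINGLE_NODE_NW : Can be regarded as similar to LEAF and LEAF_NW.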
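The "State of nodes" lines are regular enough to be split by script. Below is a minimal Python sketch, assuming the slash-separated layout given in the header line above and using a few lines copied from this dump; parse_node is a name made up for this example, and lines for sessions on the remote node may be laid out differently.

```python
import re
from collections import Counter

SAMPLE = """\
[606]/0/607/27406/0x23f01f0/32702484/IGN/1/2//none
[608]/0/609/6726/0x641eeb0/24903782/IN_HANG/3/698/[3167][2802][3049][3228][3142][981]/none
[609]/0/610/13426/0x541b9b0/29163888/SINGLE_NODE/699/700//none
[981]/0/982/38928/0xf440550/46727232/NLEAF/14/17/[1515]/626
[1515]/0/1516/42489/0x34a3ef8/34996438/LEAF/15/16//981
"""

def parse_node(line):
    """Split one 'State of nodes' line into a dict keyed by the header's field names."""
    f = line.strip().split("/")
    return {
        "nodenum": int(f[0].strip("[]")),
        "cnode": int(f[1]),
        "sid": int(f[2]),
        "state": f[6],
        "adjlist": [int(n) for n in re.findall(r"\[(\d+)\]", f[9])],
        "predecessor": None if f[10] == "none" else int(f[10]),
    }

nodes = [parse_node(l) for l in SAMPLE.splitlines() if l.startswith("[")]
by_nodenum = {n["nodenum"]: n for n in nodes}

# Count sessions per state, then show who waits on each LEAF node.
print(Counter(n["state"] for n in nodes))
for n in nodes:
    if n["state"] in ("LEAF", "LEAF_NW"):
        waiter = by_nodenum.get(n["predecessor"])
        print("blocker candidate sid", n["sid"],
              "<- waited on by sid", waiter["sid"] if waiter else None)
```

On these sample lines the LEAF node is sid 1516 and its predecessor, node 981, is sid 982, which matches the head of Chain 1.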

Now back to the main topic. From the following output we can see that the root of this incident should be session 1516:

Chain 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/1516/42489/0x4338f00/34996438/single-task message>
 -- <0/982/38928/0x435d270/46727232/library cache lock>
 -- <0/627/7616/0x7319538/47710688/library cache lock>
 -- <1/1488/64080/library cache lock>
 -- <0/700/3738/0x4335f60/20840620/library cache lock>
 -- <1/1574/16193/library cache lock>
 -- <0/1162/22132/0x4360a00/31260934/library cache lock>
 -- <1/1395/45772/library cache lock>
 -- <0/1380/41831/0xf358c28/19202486/library cache lock>
 -- <1/1148/42016/library cache lock>
 -- <0/1012/44406/0xf35a3f8/17826228/library cache lock>
 -- <1/1513/11901/library cache lock>
 -- <0/609/6726/0x634af60/24903782/library cache lock>

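Note what this part tells us: the 12 sessions that follow, such as 982, 627, 1488, 700, 1012 and 1513, are all blocked by session 1516.
What needs attention here is that although these sessions are blocked by 1516, this does not mean 1516 blocks each of them directly; the blocking may well be indirect. From the output above we can see that the four sessions in the cycles are in fact all blocked by 1516. This also shows that the cycles here are not deadlocks in the true sense.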
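The reading of Chain 1 can be checked with a few lines of Python. This is a minimal sketch that takes the first entry of the chain as the root and everything after it as blocked, directly or indirectly; parse_chain is a name made up for this example, and the chain text is copied from the trace above.

```python
CHAIN_1 = """\
    <0/1516/42489/0x4338f00/34996438/single-task message>
 -- <0/982/38928/0x435d270/46727232/library cache lock>
 -- <0/627/7616/0x7319538/47710688/library cache lock>
 -- <1/1488/64080/library cache lock>
 -- <0/700/3738/0x4335f60/20840620/library cache lock>
 -- <1/1574/16193/library cache lock>
 -- <0/1162/22132/0x4360a00/31260934/library cache lock>
 -- <1/1395/45772/library cache lock>
 -- <0/1380/41831/0xf358c28/19202486/library cache lock>
 -- <1/1148/42016/library cache lock>
 -- <0/1012/44406/0xf35a3f8/17826228/library cache lock>
 -- <1/1513/11901/library cache lock>
 -- <0/609/6726/0x634af60/24903782/library cache lock>
"""

def parse_chain(text):
    """Return (cnode, sid, wait_event) for every entry of one chain, in order."""
    entries = []
    for line in text.splitlines():
        body = line.strip().lstrip("- ").strip("<>")
        if not body:
            continue
        fields = body.split("/")
        entries.append((int(fields[0]), int(fields[1]), fields[-1]))
    return entries

entries = parse_chain(CHAIN_1)
root, blocked = entries[0], entries[1:]
blocked_ids = {(cnode, sid) for cnode, sid, _ in blocked}

# The four sessions reported under "Cycle 1" and "Cycle 2".
cycle_members = {(1, 1488), (0, 700), (0, 1012), (1, 1513)}

print("root sid:", root[1], "waiting on:", root[2])
print("sessions behind it:", len(blocked))
print("all cycle members in this chain:", cycle_members <= blocked_ids)
```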
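Of course, the customer's fix was simple: kill session 1516. What the customer did not understand was why the problem happened in the first place.
First, what exactly puzzled the customer? They could not see why every SQL statement accessing a certain table would hang, even a statement like the following:
SQL> select count(1) from GEOSTAR.ATT_PT_LINE;
At this point many people may wonder why even a SELECT hangs. What kind of lock can block a SELECT?
First, let's see what the root session 1516 was doing.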

*** 2015-05-09 06:16:26.707
  ----------------------------------------
  SO: 700000504338f00, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=204, calls cur/top: 70000043542b320/7000004b9323ec0, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 167
              last post received-location: kqrbtm
              last process to post me: 70000050533f1f0 1 6
              last post sent: 0 0 24
              last post sent-location: ksasnd
              last process posted by me: 70000050533f1f0 1 6
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 700000503379d38
    O/S info: user: oracle, term: UNKNOWN, ospid: 34996438
    OSD pid info: Unix process pid: 34996438, image: oracle@gisdata1
    ----------------------------------------
    SO: 7000005034a3ef8, type: 4, owner: 700000504338f00, flag: INIT/-/-/0x00
    (session) sid: 1516 trans: 0, creator: 700000504338f00, flag: (41) USR/- BSY/-/-/-/-/-
              DID: 0001-00CC-00000175, short-term DID: 0001-00CC-00000176
              txn branch: 0
              oct: 2, prv: 0, sql: 70000044eb24610, psql: 70000050d2dec80, user: 58/GEOSTAR
    service name: gissc
    O/S info: user: Administrator, term: XF-PC, ospid: 9396:8644, machine: WORKGROUP\XF-PC
              program: plsqldev.exe
    application name: PL/SQL Developer, hash value=1190136663
    action name: SQL Window - New, hash value=3399691616
    waiting for 'single-task message' wait_time=0, seconds since wait started=402550
                =0, =0, =0
                blocking sess=0x0 seq=77
    Dumping Session Wait History
     for 'row cache lock' count=1 wait_time=0.000413 sec
                cache id=f, mode=0, request=3
     ......
     for 'row cache lock' count=1 wait_time=0.000368 sec
                cache id=10, mode=0, request=3
  ......
  ...... (part of the output omitted)
        ----------------------------------------
      SO: 70000050f4db0e8, type: 5, owner: 7000004b9323ec0, flag: INIT/-/-/0x00
      (enqueue) CU-3F99D598-07000004	DID: 0001-00CC-00000175
      lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
      mode: X, lock_flag: 0x0, lock: 0x70000050f4db108, res: 0x7000005045152f0
      own: 0x7000005034a3ef8, sess: 0x7000005034a3ef8, proc: 0x700000504338f00, prv: 0x700000504515300
      ----------------------------------------
      SO: 7000005086945a0, type: 59, owner: 7000004b9323ec0, flag: INIT/-/-/0x00
      cursor enqueue
      child: 7000004c38219e0, flag: 53, number: 0
      parent: 7000004aa597e30
    ----------------------------------------
    SO: 70000043a244e30, type: 16, owner: 700000504338f00, flag: INIT/-/-/0x00
    (osp req holder)
PSO child state object changes :

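The dump of this process is quite long, several thousand lines, so we jump straight to the end. We can see that the session holds a CU enqueue in mode X; the so-called CU enqueue actually refers to the bind enqueue. The parent cursor address of this cursor is 7000004aa597e30.
Searching the trace directly for 7000004aa597e30 turns up the following: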

      KGX Atomic Operation Log 7000005086941d0
       Mutex 7000004c38219e0(1516, 0) idn fad52024 oper EXCL
       Cursor Pin uid 1516 efd 0 whr 1 slp 0
       opr=3 pso=700000437e79160 flg=0
        pcs=7000004c38219e0 nxt=0 flg=35 cld=0 hd=70000043f99d598 par=7000004aa597e30
       ct=0 hsh=0 unp=0 unn=0 hvl=643b5990 nhv=1 ses=7000005034a3ef8
       hep=7000004c3821a60 flg=80 ld=1 ob=700000497c38fa8 ptr=700000445b2e260 fex=700000445b2d570
      ----------------------------------------
      SO: 700000437e5d490, type: 53, owner: 7000005034a3ef8, flag: INIT/-/-/0x00
      LIBRARY OBJECT LOCK: lock=700000437e5d490 handle=70000044eb24610 mode=N
      call pin=0 session pin=0 hpc=0000 hlc=0000
      htl=700000437e5d510[700000437e13b68,700000437e791e0] htb=7000004368e0fd0 ssga=7000004368e0c78
      user=7000005034a3ef8 session=7000005034a3ef8 count=1 flags=[0000] savepoint=0x55471143
      LIBRARY OBJECT HANDLE: handle=70000044eb24610 mtx=70000044eb24740(0) lct=3 pct=1 cdp=1
      name=insert into zw_gis@cspmslink   (OBJ_id, global_id,bz1,bz2,bz3, type,bz4)
select id, globeid, name,unit,gridtype,'输电','输电线路' from att_pt_line

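We can see that the session is executing an INSERT statement that reads the very table att_pt_line and works through a dblink. We can also see that the session holds a library cache lock and a library cache pin on this table, as follows: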

     SO: 70000043a2a2d98, type: 54, owner: 70000043542b320, flag: INIT/-/-/0x00
        LIBRARY OBJECT PIN: pin=70000043a2a2d98 handle=70000050c2183d0 mode=S lock=700000437ec2790
        user=7000005034a3ef8 session=7000005034a3ef8 count=1 mask=0501 savepoint=0x92b flags=[00]
        LIBRARY OBJECT HANDLE: handle=70000050c2183d0 mtx=70000050c218500(0) lct=28657 pct=28658 cdp=0
        name=GEOSTAR.ATT_PT_LINE
        hash=451da03ad5fcef0b5788626298294e0f timestamp=03-28-2015 10:45:18
        namespace=TABL flags=KGHP/TIM/XLR/[00000020]
        kkkk-dddd-llll=0000-0541-0749 lock=S pin=S latch#=40 hpc=77da hlc=77d2
        lwt=70000050c218478[700000435ef9cc0,700000448ecdae0] ltm=70000050c218488[70000050c218488,70000050c218488]
        pwt=70000050c218440[70000050c218440,70000050c218440] ptm=70000050c218450[70000050c218450,70000050c218450]
        ref=70000050c2184a8[70000050c2184a8,70000050c2184a8] lnd=70000050c2184c0[70000044abad018,70000050c49edc8]
          LOCK INSTANCE LOCK: id=LB451da03ad5fcef0b
          PIN INSTANCE LOCK: id=NB451da03ad5fcef0b mode=S release=F flags=[00]
          INVALIDATION INSTANCE LOCK: id=IV0000cfed1c0b2e13 mode=S
          LIBRARY OBJECT: object=7000004a6d5e618
          type=TABL flags=EXS/LOC[0005] pflags=[0000] status=VALD load=0
          DATA BLOCKS:
          data#     heap  pointer    status pins change whr alloc(K)  size(K)
          ----- -------- -------- --------- ---- ------ --- -------- --------
              0 70000050c604d60 7000004a6d5e730 I/P/A/-/-    0 NONE   00      0.71     1.09
              3 700000495a40b38        0        I/-/-/-/-    0 NONE   0c      0.00     0.00
              8 7000004a6d5e910 700000476a42498 I/P/A/-/-    1 NONE   00      5.61     6.52
              9 7000004a6d5e300        0        I/-/-/-/-    0 NONE   0c      0.00     0.00
             10 7000004a6d5e388 70000047680edd8 I/P/A/-/-    1 NONE   00      6.77     7.77
        ----------------------------------------
        SO: 700000437ec2790, type: 53, owner: 70000043542b320, flag: INIT/-/-/0x00
        LIBRARY OBJECT LOCK: lock=700000437ec2790 handle=70000050c2183d0 mode=S
        call pin=70000043a2a2d98 session pin=0 hpc=0000 hlc=0000
        htl=700000437ec2810[7000004368e0e80,7000004368e0e80] htb=7000004368e0e80 ssga=7000004368e0c78
        user=7000005034a3ef8 session=7000005034a3ef8 count=1 flags=PNC/[0400] savepoint=0x92b
        LIBRARY OBJECT HANDLE: handle=70000050c2183d0 mtx=70000050c218500(0) lct=28657 pct=28658 cdp=0
        name=GEOSTAR.ATT_PT_LINE
        hash=451da03ad5fcef0b5788626298294e0f timestamp=03-28-2015 10:45:18
        namespace=TABL flags=KGHP/TIM/XLR/[00000020]
        kkkk-dddd-llll=0000-0541-0749 lock=S pin=S latch#=40 hpc=77da hlc=77d2
        lwt=70000050c218478[700000435ef9cc0,700000448ecdae0] ltm=70000050c218488[70000050c218488,70000050c218488]
        pwt=70000050c218440[70000050c218440,70000050c218440] ptm=70000050c218450[70000050c218450,70000050c218450]
        ref=70000050c2184a8[70000050c2184a8,70000050c2184a8] lnd=70000050c2184c0[70000044abad018,70000050c49edc8]
          LOCK INSTANCE LOCK: id=LB451da03ad5fcef0b
          PIN INSTANCE LOCK: id=NB451da03ad5fcef0b mode=S release=F flags=[00]
          INVALIDATION INSTANCE LOCK: id=IV0000cfed1c0b2e13 mode=S
          LIBRARY OBJECT: object=7000004a6d5e618
          type=TABL flags=EXS/LOC[0005] pflags=[0000] status=VALD load=0
          DATA BLOCKS:
          data#     heap  pointer    status pins change whr
          ----- -------- -------- --------- ---- ------ ---
              0 70000050c604d60 7000004a6d5e730 I/P/A/-/-    0 NONE   00
              3 700000495a40b38        0        I/-/-/-/-    0 NONE   0c
              8 7000004a6d5e910 700000476a42498 I/P/A/-/-    1 NONE   00
              9 7000004a6d5e300        0        I/-/-/-/-    0 NONE   0c
             10 7000004a6d5e388 70000047680edd8 I/P/A/-/-    1 NONE   00
      ----------------------------------------

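We know that library cache latches have to be acquired: in fact hard parses, soft parses and even soft-soft parses of Oracle SQL statements all need them. Whether a soft-soft parse needs the library cache latch changed in 11g, but the customer here is on version 10.2.0.5.
We now know what session 1516 was doing, but we still do not know why 1516 caused the subsequent session 982 to be blocked.
Next, let's look at what kind of process session 982 is and what it was doing.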

----------------------------------------
SO: 70000050f440550, type: 4, owner: 70000050435d270, flag: INIT/-/-/0x00
  (session) sid: 982 trans: 7000004d8e5f120, creator: 70000050435d270, flag: (48000041) USR/- BSY/-/-/-/-/-
            DID: 0001-02CB-00000005, short-term DID: 0001-02CB-00000004
            txn branch: 0
            oct: 3, prv: 0, sql: 700000509e93e50, psql: 700000509e93e50, user: 0/SYS
  O/S info: user: oracle, term: UNKNOWN, ospid: 46727232, machine: gisdata1
            program: oracle@gisdata1 (J001)
  application name: DBMS_SCHEDULER, hash value=2478762354
  action name: GATHER_STATS_JOB, hash value=930355498
  waiting for 'library cache lock' wait_time=0, seconds since wait started=148249
          handle address=70000050c2183d0, lock address=700000435eee5a8, 100*mode+namespace=1f5
          blocking sess=0x0 seq=7211
  Dumping Session Wait History
   for 'library cache lock' count=1 wait_time=0.488298 sec
          handle address=70000050c2183d0, lock address=700000435eee5a8, 100*mode+namespace=1f5
   for 'library cache lock' count=1 wait_time=0.488295 sec
......

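We can see that session 982 was started by an Oracle scheduler job, and the action name tells us that it is running GATHER_STATS_JOB.
This is clearly the database-wide statistics collection. As you probably know, gathering statistics behaves much like a DDL operation: by default it invalidates cursors, and it also takes library cache locks.
The day the problem occurred happened to be a Sunday. On weekends this scheduled job runs all day, and it is not forcibly terminated if the collection fails to complete. The dump of session 1516 was taken at 2015-05-09 06:16:26.707, which is shortly after the statistics collection started.
So we can reasonably guess that the statistics collection had not actually finished, which is why session 982 led to a large number of library cache lock waits.
Finally, why does a select against this table hang?
The answer is simple: statistics collection on this table was still in progress. Because it invalidates cursors, every SQL statement against the table has to be hard parsed, and a hard parse must acquire the library cache pin and lock. That is why the customer saw every session that accessed this table waiting on library cache lock.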

We can verify this with the following experiment:

First, adjust the cursor cache parameter:

www.killdb.com>show parameter session_cached_cursors

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
session_cached_cursors               integer     0
www.killdb.com>select count(1) from t_library_lock;

  COUNT(1)
----------
     50042
www.killdb.com>select sql_id,hash_value,sql_Text from v$sqlarea where sql_text like '%t_library_lock%';

SQL_ID        HASH_VALUE SQL_TEXT
------------- ---------- --------------------------------------------------------------------------------
c7sdcrtz4t3k3 2118946371 select sql_id,sql_Text from v$sqlarea where sql_text like '%t_library_lock%'
f8m4jqsaypc7m  367702259 select count(1) from t_library_lock

www.killdb.com>select to_char(367702259,'xxxxxxxxxx') from dual;

TO_CHAR(367
-----------
   15eab0f3

How do we choose the trace level? First, look at the documentation:

......
#define KGLTRCLCK  0x0010                  /* trace lock operations */
#define KGLTRCPIN  0x0020                  /* trace pin operations  */
#define KGLTRCOBF  0x0040                  /* trace object freeing  */
#define KGLTRCINV  0x0080                  /* trace invalidations   */
#define KGLDMPSTK  0x0100                  /* DUMP CALL STACK WITH TRACE */
#define KGLDMPOBJ  0x0200                  /* DUMP KGL OBJECT WITH TRACE */
#define KGLDMPENQ  0x0400                  /* DUMP KGL ENQUEUE WITH TRACE */
#define KGLTRCHSH  0x2000                  /* DUMP BY HASH VALUE */
......

Here we take the last two bytes of the hash value, b0f3. For how the level is built, see the diagram from the documentation:

                   +-------------------------------> Trace by hash turned on
                   |           +-------------------> Trace pin operations
                   |           |           +-------> Trace lock operations
                   |           |           |
     0xb0f30000 | KGLTRCHSH | KGLTRCPIN | KGLTRCLCK

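According to the Oracle documentation, we also need to enable tracing by hash value, tracing of pin operations, and tracing of lock operations. The corresponding level bits are:

0x00002000

0x00000020

0x00000010
So the final level is:
b0f30000+2000+20+10=b0f32030
Finally, convert it to decimal: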

www.killdb.com>select to_number('b0f32030','xxxxxxxxxxxxxxx') from dual;

TO_NUMBER('B0F32030','XXXXXXXXXXXXXXX')
---------------------------------------
                             2968723504
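
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation described in the text: the function name `event_10049_level` is hypothetical, and the flag constants are copied from the definitions quoted earlier.

```python
# Bit flags as listed in the KGL trace definitions quoted above.
KGLTRCLCK = 0x0010  # trace lock operations
KGLTRCPIN = 0x0020  # trace pin operations
KGLTRCHSH = 0x2000  # trace by hash value


def event_10049_level(hash_value: int, flags: int) -> int:
    """Hypothetical helper: build an event 10049 level from a SQL hash value."""
    low16 = hash_value & 0xFFFF      # last two bytes, e.g. 0x15eab0f3 -> 0xb0f3
    return (low16 << 16) | flags     # hash bytes in the upper half, flags below


level = event_10049_level(367702259, KGLTRCHSH | KGLTRCPIN | KGLTRCLCK)
print(hex(level), level)
```

For the hash value 367702259 used in this test, the result matches the level obtained with to_number above.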

So the commands to trace library cache pin/lock operations for this SQL statement are:

oradebug setospid  xxx

oradebug event 10049 trace name context forever,level 2968723504

oradebug event 10049 trace name context off;
Now let's start testing, beginning with the hard parse.

### Hard parse

--session 1
www.killdb.com>alter system flush shared_pool;

System altered.
www.killdb.com>select sid from v$mystat where rownum < 2;

       SID
----------
       159

www.killdb.com>select count(1) from t_library_lock;

  COUNT(1)
----------
     50042

--Session 2
www.killdb.com>select s.sid,s.serial#,s.username,p.spid from v$process p,v$session s where p.addr=s.paddr and s.sid=&sid;
Enter value for sid: 159
old   1: select s.sid,s.serial#,s.username,p.spid from v$process p,v$session s where p.addr=s.paddr and s.sid=&sid
new   1: select s.sid,s.serial#,s.username,p.spid from v$process p,v$session s where p.addr=s.paddr and s.sid=159

       SID    SERIAL# USERNAME                  SPID
---------- ---------- ------------------------- ------------
       159          5 ROGER                     10200

www.killdb.com>oradebug setospid 10200
Oracle pid: 15, Unix process pid: 10200, image: oracle@killdb.com (TNS V1-V3)
www.killdb.com> oradebug event 10049 trace name context forever,level 2968723504
Statement processed.
www.killdb.com>oradebug event 10049 trace name context off;
Statement processed.
www.killdb.com>oradebug tracefile_name
/home/ora10g/admin/test/udump/test_ora_10200.trc
www.killdb.com>

Let's look at the trace contents:

*** SESSION ID:(159.5) 2015-05-20 21:15:24.673
Received ORADEBUG command 'event 10049 trace name context forever,level 2968723504' from process Unix process pid: 10668, image:
*** 2015-05-20 21:15:35.496
KGLTRCLCK kgllkal    hd = 0x0x2997abdc  KGL Lock addr = 0x0x27b751a8 mode = N
KGLTRCLCK kglget     hd = 0x0x2997abdc  KGL Lock addr = 0x0x27b751a8 mode = N
KGLTRCPIN kglpin     hd = 0x0x2997abdc  KGL Pin  addr = 0x0x27b76010 mode = X
KGLTRCPIN kglpndl    hd = 0x0x2997abdc  KGL Pin  addr = 0x0x27b76010 mode = X
KGLTRCLCK kgllkal    hd = 0x0x298a1d10  KGL Lock addr = 0x0x27b4e038 mode = N
KGLTRCLCK kglget     hd = 0x0x298a1d10  KGL Lock addr = 0x0x27b4e038 mode = N
KGLTRCPIN kglpin     hd = 0x0x298a1d10  KGL Pin  addr = 0x0x27b1935c mode = X
KGLTRCPIN kglpndl    hd = 0x0x298a1d10  KGL Pin  addr = 0x0x27b1935c mode = X
Received ORADEBUG command 'event 10049 trace name context off' from process Unix process pid: 10668, image:

The trace clearly shows that a hard parse of the SQL statement has to acquire both a library cache lock and a library cache pin.
For a select statement, the library cache lock mode is NULL and the library cache pin mode is X.
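
When a 10049 trace gets long, counting the lock and pin lines by mode can be easier than reading them one by one. The sketch below is a minimal example that assumes the line layout seen in the trace above; the regular expression and the name `summarize_kgl_trace` are my own and are not part of any Oracle tool.

```python
import re
from collections import Counter

# Matches lines such as:
# KGLTRCPIN kglpin     hd = 0x0x2997abdc  KGL Pin  addr = 0x0x27b76010 mode = X
TRACE_LINE = re.compile(
    r"^(KGLTRC\w+)\s+(\w+)\s+hd = (\S+)\s+KGL (Lock|Pin)\s+addr = (\S+)\s+mode = (\S+)"
)


def summarize_kgl_trace(lines):
    """Count (Lock|Pin, mode) pairs; lines that do not match are ignored."""
    counts = Counter()
    for line in lines:
        match = TRACE_LINE.match(line.strip())
        if match:
            counts[(match.group(4), match.group(6))] += 1
    return counts


# Two lines copied from the hard-parse trace above.
hard_parse = [
    "KGLTRCLCK kgllkal    hd = 0x0x2997abdc  KGL Lock addr = 0x0x27b751a8 mode = N",
    "KGLTRCPIN kglpin     hd = 0x0x2997abdc  KGL Pin  addr = 0x0x27b76010 mode = X",
]
print(summarize_kgl_trace(hard_parse))
```

Feeding it the lines of a trace file gives a quick view of whether any pin operations were recorded at all.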

### Soft parse

--session 1
www.killdb.com>select count(1) from t_library_lock;

  COUNT(1)
----------
     50042

--session 2
www.killdb.com>oradebug event 10049 trace name context forever,level 2968723504
Statement processed.
www.killdb.com>oradebug tracefile_name
/home/ora10g/admin/test/udump/test_ora_10200.trc
www.killdb.com>

The trace now contains:

*** 2015-05-20 21:19:40.799
Received ORADEBUG command 'event 10049 trace name context forever,level 2968723504' from process Unix process pid: 10668, image:
KGLTRCLCK kgllkdl    hd = 0x0x298a1d10  KGL Lock addr = 0x0x27b4e038 mode = N
KGLTRCLCK kgllkdl2   hd = 0x0x298a1d10  KGL Lock addr = 0x0x27b4e038 mode = 0
KGLTRCLCK kgllkdl    hd = 0x0x2997abdc  KGL Lock addr = 0x0x27b751a8 mode = N
KGLTRCLCK kgllkdl2   hd = 0x0x2997abdc  KGL Lock addr = 0x0x27b751a8 mode = 0
KGLTRCLCK kgllkal    hd = 0x0x2997abdc  KGL Lock addr = 0x0x27b2b5c0 mode = N
KGLTRCLCK kglget     hd = 0x0x2997abdc  KGL Lock addr = 0x0x27b2b5c0 mode = N
KGLTRCLCK kgllkal    hd = 0x0x298a1d10  KGL Lock addr = 0x0x27b4e6b8 mode = N

We can see that a soft parse does not need to acquire a library cache pin; it only needs a library cache lock, in NULL mode.

### Soft-soft parse

--session 1

www.killdb.com>show parameter session_cached_cursors

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
session_cached_cursors               integer     20
www.killdb.com>conn roger/roger
Connected.
www.killdb.com>select count(1) from t_library_lock;

  COUNT(1)
----------
     50042

www.killdb.com>select count(1) from t_library_lock;

  COUNT(1)
----------
     50042

www.killdb.com>select count(1) from t_library_lock;

  COUNT(1)
----------
     50042

www.killdb.com>select sid from v$mystat where rownum < 2;

       SID
----------
       159

www.killdb.com>select count(1) from t_library_lock;

  COUNT(1)
----------
     50042
---session 2
www.killdb.com>select s.sid,s.serial#,s.username,p.spid from v$process p,v$session s where p.addr=s.paddr and s.sid=&sid;
Enter value for sid: 159
old   1: select s.sid,s.serial#,s.username,p.spid from v$process p,v$session s where p.addr=s.paddr and s.sid=&sid
new   1: select s.sid,s.serial#,s.username,p.spid from v$process p,v$session s where p.addr=s.paddr and s.sid=159

       SID    SERIAL# USERNAME                  SPID
---------- ---------- ------------------------- ------------
       159          5 ROGER                     11339

www.killdb.com>oradebug setospid 11339
Oracle pid: 15, Unix process pid: 11339, image: oracle@killdb.com (TNS V1-V3)
www.killdb.com>oradebug event 10049 trace name context forever,level 2968723504
Statement processed.
www.killdb.com>oradebug event 10049 trace name context off;
Statement processed.
www.killdb.com>oradebug tracefile_name
/home/ora10g/admin/test/udump/test_ora_11339.trc

Let's look at the trace for the soft-soft parse:

*** SESSION ID:(159.5) 2015-05-20 21:31:46.501
Received ORADEBUG command 'event 10049 trace name context forever,level 2968723504' from process Unix process pid: 11356, image:
*** 2015-05-20 21:32:04.535
Received ORADEBUG command 'event 10049 trace name context off' from process Unix process pid: 11356, image:
Received ORADEBUG command 'tracefile_name' from process Unix process pid: 11356, image:
Received ORADEBUG command 'tracefile_name' from process Unix process pid: 11356, image:

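We can see that the soft-soft parse acquired neither a library cache pin nor a library cache lock.
To summarize:
1. In 10.2.0.5, a hard parse acquires a library cache pin and a library cache lock, in modes X and NULL respectively.

2. In 10.2.0.5, a soft parse acquires a library cache lock in NULL mode.

3. In 10.2.0.5, a soft-soft parse acquires neither a library cache lock nor a pin.

4. Other versions may behave differently. In particular, the introduction of mutexes may have changed this considerably in 11g; this will be verified later.
For details, see: TRACING KGL CALLS IN A OCI PROGRAM USING THE EVENT 10049 NEW FEATURES (Doc ID 334636.1)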


GoldenGate study series 7: OGG 12c support for reading standby redo


GoldenGate 12c supports reading standby redo

1. source

1) Configure Active Data Guard (steps omitted)

Check the roles of the primary and the active standby:
[oracle@11g_adg ~]$ sqlplus roger/roger@roger

SQL*Plus: Release 11.2.0.4.0 Production on Tue Jun 9 16:25:07 2015

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> select database_role,open_mode from v$database;

DATABASE_ROLE    OPEN_MODE
---------------- --------------------
PRIMARY          READ WRITE

SQL> conn roger/roger@standby
Connected.
SQL> select  database_role,open_mode from v$database;

DATABASE_ROLE    OPEN_MODE
---------------- --------------------
PHYSICAL STANDBY READ ONLY WITH APPLY

2) Add schema-level trandata

GGSCI (11g_adg) 107> dblogin userid ggs@roger,password ggs
Successfully logged into database.

GGSCI (11g_adg as ggs@roger) 109> ADD SCHEMATRANDATA ROGER

2015-06-09 16:09:28  INFO    OGG-01788  SCHEMATRANDATA has been added on schema ROGER.

2015-06-09 16:09:28  INFO    OGG-01976  SCHEMATRANDATA for scheduling columns has been added on schema ROGER.

Note: add schematrandata still has to be run while connected to the primary database.

3) Extract and pump process configuration

GGSCI (11g_adg as ggs@standby) 122> view param ext1

extract ext1
userid ggs@standby,password ggs
discardfile  ./dirrpt/ext1.dsc, append, megabytes 50
warnlongtrans 2h, checkinterval 3m
TRANLOGOPTIONS MINEFROMACTIVEDG
EXTTRAIL ./dirdat/ex ,FORMAT RELEASE 11.2
NUMFILES 3000
ALLOCFILES 200
table roger.*;

GGSCI (11g_adg as ggs@standby) 123> view param dp1

EXTRACT dp1
RMTHOST 192.168.1.110, MGRPORT 7809 TCPBUFSIZE 5000000
PASSTHRU
RMTTRAIL  ./dirdat/rm , format release 11.2
NUMFILES 3000
TABLE roger.*;

GGSCI (11g_adg as ggs@standby) 124> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
EXTRACT     RUNNING     DP1         00:00:00      00:00:03
EXTRACT     RUNNING     EXT1        00:00:00      00:00:04

Note: TRANLOGOPTIONS MINEFROMACTIVEDG is a new parameter introduced in 12.1.
2. target

1) Replicat process configuration

GGSCI (killdb.com) 24> view param rep1124

replicat rep1124
userid  ggs@roger,password ggs
reperror default, discard
DISCARDROLLOVER AT 20:30
discardfile ./dirrpt/rep1124.dsc, append, megabytes 50

assumetargetdefs
allownoopupdates    

numfiles 3000

map roger.*, target roger.*;

GGSCI (killdb.com) 25> info rep1124

REPLICAT   REP1124   Last Started 2015-06-09 16:14   Status RUNNING
Checkpoint Lag       00:00:00 (updated 00:00:06 ago)
Log Read Checkpoint  File ./dirdat/rm000006
                     2015-06-09 16:10:38.060936  RBA 1609

3. Test DML

--11gR2 primary db
SQL> conn roger/roger
Connected.
SQL> insert into t_ogg values(1,'a');

1 row created.

SQL> insert into t_ogg values(1,'a');

1 row created.

SQL> insert into t_ogg values(2,'b');

1 row created.

SQL> insert into t_ogg values(3,'c');

1 row created.

SQL> commit;

Commit complete.

SQL>
SQL>
SQL> select  database_role,open_mode from v$database;

DATABASE_ROLE    OPEN_MODE
---------------- --------------------
PRIMARY          READ WRITE

---target db
www.killdb.com>conn roger/roger
Connected.
www.killdb.com>select * from t_ogg;

        ID NAME
---------- --------------------
         1 a
         1 a
         2 b
         3 c

www.killdb.com>select * from v$version where rownum < 2;    

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - Prod

 


Another leap second is coming this June 30


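What is a leap second?
A leap second is a one-second adjustment, added to or removed from Coordinated Universal Time (UTC), that keeps UTC close to mean solar time. UTC is the official basis for civil time as distributed by broadcast, and it is maintained with highly accurate atomic clocks. Keeping UTC consistent with mean solar time occasionally requires a one-second "jump", the so-called leap second (see ΔT). Leap seconds are now determined by the International Earth Rotation and Reference Systems Service (IERS); before January 1, 1988 this was the responsibility of the Bureau International de l'Heure (BIH).
When a positive leap second is added, it is inserted before 00:00:00 of the following day, which delays the start of the next UTC day. The second after 23:59:59 is recorded as 23:59:60, followed by 00:00:00 of the next day. With a negative leap second, 23:59:58 would be followed directly by 00:00:00 of the next day, but so far no negative leap second has been needed. One would be required only if the length of the day fell below the mean day length of 1750-1892 for long enough to accumulate a one-second difference. Apart from fluctuations of about 4 milliseconds, the length of the day has stayed the same since 1700 [1]. However, historical eclipse observations show that the length of the day has increased by about 1.7 milliseconds per century since 700 BC [2]. (From Wikipedia)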

When leap seconds have occurred

[Figure: table of past leap second dates]

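Impact of leap seconds

Based on the references reviewed so far, the impact is mainly on Linux and Solaris, and it mainly arises from the use of NTP.
On Linux, kernels later than 2.6.22 are affected (see http://www-01.ibm.com/support/docview.wss?uid=swg21602521). The main symptom is CPU usage reaching 100%; see Leap Second Hang – CPU Can Be Seen at 100% (Doc ID 1472421.1). According to that note, the releases mainly affected are those between Linux 4.4 and Linux 6.2.
On Solaris, versions 8, 9, and 10 are affected; the latest Solaris 11 is not.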

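Impact of leap seconds on Oracle
1) Oracle RAC

RAC on versions earlier than 11.2 may suffer node reboots (10.2.0.4 and 10.2.0.5 are not affected). See the MOS note:

NTP leap second event causing Oracle Clusterware node reboot (Doc ID 759143.1)

Note: environments that use third-party cluster management software are not affected; environments that use only Oracle Clusterware are exposed to the problem.
2) Non-RAC environments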

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production
PL/SQL Release 11.2.0.2.0 - Production
CORE    11.2.0.2.0      Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

SQL> create table t_630(a number,b timestamp(8));

Table created.

SQL> insert into t_630 values(1,to_timestamp('2015-06-10 12:10:10.1','yyyy-mm-dd hh24:mi:ss.ff')); 

1 row created.

SQL> select * from t_630;

         A B
---------- ---------------------------------------------------------------------------
         1 10-JUN-15 12.10.10.10000000 PM

SQL> insert into t_630 values(2,to_timestamp('2015-06-30 23:59:60','yyyy-mm-dd hh24:mi:ss'));
insert into t_630 values(2,to_timestamp('2015-06-30 23:59:60','yyyy-mm-dd hh24:mi:ss'))
                                        *
ERROR at line 1:
ORA-01852: seconds must be between 0 and 59

For this problem, Oracle recommends using varchar2 instead of timestamp.

SQL> drop table t_630 purge;

Table dropped.

SQL> create table t_630(a number,b varchar2(30));

Table created.

SQL> insert into t_630 values(1,'2015-06-10 12:10:10');

1 row created.

SQL> insert into t_630 values(2,'2015-06-30 23:59:60');

1 row created.

SQL> commit;

Commit complete.

SQL> select * from t_630;

         A B
---------- ------------------------------
         1 2015-06-10 12:10:10
         2 2015-06-30 23:59:60
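
If the value is stored as a string, the application has to deal with second 60 itself when it later needs a real date. The following is a minimal sketch of one way to do that, assuming the 'YYYY-MM-DD HH:MM:SS' layout used in the inserts above. The name `parse_stored_time` and the choice to clamp the leap second to 59 are assumptions for illustration, not an Oracle recommendation.

```python
import re
from datetime import datetime

STAMP = re.compile(r"^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})$")


def parse_stored_time(text):
    """Parse 'YYYY-MM-DD HH:MM:SS', tolerating a leap second (SS == 60).

    Returns (datetime, is_leap_second). A leap second is clamped to
    second 59 because datetime only accepts seconds 0..59.
    """
    match = STAMP.match(text)
    if not match:
        raise ValueError(f"unrecognized timestamp: {text!r}")
    year, month, day, hour, minute, second = map(int, match.groups())
    is_leap = second == 60
    if is_leap:
        second = 59
    return datetime(year, month, day, hour, minute, second), is_leap


print(parse_stored_time("2015-06-30 23:59:60"))
```

Returning the flag alongside the clamped value lets the caller decide whether the lost second matters for its ordering or reporting.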

References:
Leap seconds (extra second in a year) and impact on the Oracle database. (Doc ID 730795.1)

Leap Second on Oracle SuperCluster (Doc ID 1991954.1)

What Leap Second Affects Occur In Tuxedo? (Doc ID 1461363.1)

NTP leap second event causing Oracle Clusterware node reboot (Doc ID 759143.1)

http://www-01.ibm.com/support/docview.wss?uid=swg21602521

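Summary:
1. For traditional enterprises running Linux, Oracle 10.2.0.4 and later should not be affected (apart from the timestamp issue).

2. Review your applications: if they use timestamp, they may fail with ORA-01852.

3. Leap seconds mainly affect Linux and Solaris.

4. On June 30, 2015, the clock will tick as follows: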
23:59:57 -> 23:59:58 -> 23:59:59 -> 23:59:60 -> 00:00:00 -> 00:00:01


GoldenGate study series 8: when a primary key meets KEYCOLS


-- Source (primary) database
Notes:
Source database: 11.2.0.4, OGG version 12.1.2

Target database: 10.2.0.5, OGG version 11.2.1.0.1

SQL> create table s1 (a number primary key, b number, c char(32));

Table created.

SQL> create table s3 (a number, b number);

Table created.

SQL> insert into s1 values (1,1,1);

1 row created.

SQL> insert into s3 values(1,1);

1 row created.

SQL> commit;

Commit complete.
SQL> select a,b,c,rowid from s1;

         A          B C                                ROWID
---------- ---------- -------------------------------- ------------------
         1          1 1                                AAAVViAAEAAAAC1AAA

SQL> select a,b,rowid from s3;

         A          B ROWID
---------- ---------- ------------------
         1          1 AAAVVjAAEAAAADFAAA

-- Target database

www.killdb.com>create table s1 (a number primary key, b number, c char(32));

Table created.

www.killdb.com>create table s3 (a number, b number);

Table created.

www.killdb.com>select * from s1;

         A          B C
---------- ---------- --------------------------------
         1          1 1

www.killdb.com>select * from s3;

         A          B
---------- ----------
         1          1

www.killdb.com>insert into s1 values (2,1,1);

1 row created.

www.killdb.com>insert into s3 values(2,1);

1 row created.

www.killdb.com>commit;

Commit complete.

www.killdb.com>select a,b,c,rowid from s1;

         A          B C                                ROWID
---------- ---------- -------------------------------- ------------------
         1          1 1                                AAAObvAAEAAAADkAAA
         2          1 1                                AAAObvAAEAAAADlAAA

www.killdb.com>select a,b,rowid from s3;

         A          B ROWID
---------- ---------- ------------------
         1          1 AAAObwAAEAAAAD0AAA
         2          1 AAAObwAAEAAAAD1AAA

-- Delete on the source

SQL> delete from s1;

1 row deleted.

SQL> commit;

Commit complete.

SQL> delete from s3;

1 row deleted.

SQL> commit;

Commit complete.

-- Query the data on the target

www.killdb.com> select * from s1;

         A          B C
---------- ---------- --------------------------------
         1          1 1
         2          1 1

www.killdb.com>select * from s3;

         A          B
---------- ----------
         2          1

As we can see, by default the delete was not applied to table s1 on the target. Why?
Analyze the source redo with LogMiner

SQL> execute dbms_logmnr.add_logfile(options =>dbms_logmnr.new,logfilename =>'/home/oracle/oradata/roger/redo02.log');

PL/SQL procedure successfully completed.

SQL> EXEC SYS.DBMS_LOGMNR.START_LOGMNR(OPTIONS => SYS.DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);

PL/SQL procedure successfully completed.

SQL> col sql_redo for a80
SQL> select TIMESTAMP ,sql_redo FROM v$logmnr_contents WHERE sql_redo like '%S1%';

TIMESTAMP    SQL_REDO
------------ --------------------------------------------------------------------------------
09-JUN-15    insert into "ROGER"."S1"("A","B","C") values ('1','1','1');
09-JUN-15    delete from "ROGER"."S1" where "A" = '1' and "B" = '1' and "C" = '1
                               ' and ROWID = 'AAAVViAAEAAAAC1AAA';

SQL> select TIMESTAMP ,sql_redo FROM v$logmnr_contents WHERE sql_redo like '%S3%';

TIMESTAMP    SQL_REDO
------------ --------------------------------------------------------------------------------
09-JUN-15    insert into "ROGER"."S3"("A","B") values ('1','1');
09-JUN-15    delete from "ROGER"."S3" where "A" = '1' and "B" = '1' and ROWID = 'AAAVVjAAEAAA
             ADFAAA';

Analyze the target redo with LogMiner

www.killdb.com>execute dbms_logmnr.add_logfile(options =>dbms_logmnr.new,logfilename =>'/home/ora10g/oradata/roger/redo03.log');

PL/SQL procedure successfully completed.

www.killdb.com>EXEC SYS.DBMS_LOGMNR.START_LOGMNR(OPTIONS => SYS.DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);

PL/SQL procedure successfully completed.

www.killdb.com>set lines 120
www.killdb.com>col sql_redo for a90
www.killdb.com>select TIMESTAMP ,sql_redo FROM v$logmnr_contents WHERE sql_redo like '%S1%';

TIMESTAMP  SQL_REDO
---------- -----------------------------------------------------------
10-JUN-15  insert into "ROGER"."S1"("A","B","C") values ('1','1','1');
10-JUN-15  insert into "ROGER"."S1"("A","B","C") values ('2','1','1');

www.killdb.com>select TIMESTAMP ,sql_redo FROM v$logmnr_contents WHERE sql_redo like '%S3%';    

TIMESTAMP  SQL_REDO
---------- ---------------------------------------------------------------------------------------
10-JUN-15  insert into "ROGER"."S3"("A","B") values ('1','1');
10-JUN-15  insert into "ROGER"."S3"("A","B") values ('2','1');
10-JUN-15  delete from "ROGER"."S3" where "A" = '1' and "B" = '1' and ROWID = 'AAAObwAAEAAAAD0AAA';

The source redo clearly recorded the DML, so did OGG capture it?

Analyze the source trail file with logdump

Logdump 1 >open ./dirdat/ex000004
Current LogTrail is /opt/oracle/ggs/12.1.2.1/dirdat/ex000004
Logdump 2 >ghdr on
Logdump 3 >detail on
Logdump 4 >detail data
Logdump 5 >usertoken on
Logdump 6 >FILTER include filename ROGER.S1;
Logdump 7 >next
......
Logdump 21 >n
___________________________________________________________________
Hdr-Ind    :     E  (x45)     Partition  :     .  (x04)
UndoFlag   :     .  (x00)     BeforeAfter:     A  (x41)
RecLength  :    56  (x0038)   IO Time    : 2015/06/09 22:22:22.000.000
IOType     :     5  (x05)     OrigNode   :   255  (xff)
TransInd   :     .  (x00)     FormatType :     R  (x52)
SyskeyLen  :     0  (x00)     Incomplete :     .  (x00)
AuditRBA   :         32       AuditPos   : 22032
Continued  :     N  (x00)     RecCount   :     1  (x01) 

2015/06/09 22:22:22.000.000 Insert               Len    56 RBA 5095
Name: ROGER.S1
After  Image:                                             Partition 4   G  b
 0000 0005 0000 0001 3100 0100 0500 0000 0131 0002 | ........1........1..
 0022 0000 3120 2020 2020 2020 2020 2020 2020 2020 | ."..1
 2020 2020 2020 2020 2020 2020 2020 2020           |
Column     0 (x0000), Len     5 (x0005)
 0000 0001 31                                      | ....1
Column     1 (x0001), Len     5 (x0005)
 0000 0001 31                                      | ....1
Column     2 (x0002), Len    34 (x0022)
 0000 3120 2020 2020 2020 2020 2020 2020 2020 2020 | ..1
 2020 2020 2020 2020 2020 2020 2020                |                 

Filtering suppressed      1 records
Logdump 22 >n
___________________________________________________________________
Hdr-Ind    :     E  (x45)     Partition  :     .  (x04)
UndoFlag   :     .  (x00)     BeforeAfter:     B  (x42)
RecLength  :     9  (x0009)   IO Time    : 2015/06/09 22:27:18.000.000
IOType     :     3  (x03)     OrigNode   :   255  (xff)
TransInd   :     .  (x03)     FormatType :     R  (x52)
SyskeyLen  :     0  (x00)     Incomplete :     .  (x00)
AuditRBA   :         32       AuditPos   : 178704
Continued  :     N  (x00)     RecCount   :     1  (x01) 

2015/06/09 22:27:18.000.000 Delete               Len     9 RBA 5366
Name: ROGER.S1
Before Image:                                             Partition 4   G  s
 0000 0005 0000 0001 31                            | ........1
Column     0 (x0000), Len     5 (x0005)
 0000 0001 31                                      | ....1

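The dump of the trail file shows that the delete was indeed extracted. IOType 3 means delete and IOType 5 means insert, so both the insert and the delete against S1 were captured. Up to this point everything looks normal, so why is table s1 out of sync?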

When OGG hits an error, we can check the discard file of the process concerned. It contains:

Operation failed at seqno 7 rba 1907
Discarding record on action DISCARD on error 0
Problem replicating ROGER.S1 to ROGER.S1
Mapping problem with delete record (target format)...
*
A = 1
*

Clearly, GoldenGate failed to map the delete on table s1, so the delete was never executed against s1 on the target.

GGSCI (killdb.com) 2> view param rep1124
replicat rep1124
userid  ggs@roger,password ggs
reperror default, discard
DISCARDROLLOVER AT 20:30
discardfile ./dirrpt/rep1124.dsc, append, megabytes 50
handlecollisions
assumetargetdefs
allownoopupdates
numfiles 3000
map roger.t_ogg, target roger.t_ogg;
map roger.s1, target roger.s1, keycols (b);
map roger.s3, target roger.s3, keycols (b);
GGSCI (killdb.com) 3> stop rep1124  

Sending STOP request to REPLICAT REP1124 ...
Request processed.
GGSCI (killdb.com) 4> edit param rep1124
GGSCI (killdb.com) 5> view param rep1124

replicat rep1124
userid  ggs@roger,password ggs
reperror default, discard
DISCARDROLLOVER AT 20:30
discardfile ./dirrpt/rep1124.dsc, append, megabytes 50
handlecollisions
assumetargetdefs
allownoopupdates
numfiles 3000
map roger.t_ogg, target roger.t_ogg;
map roger.s1, target roger.s1;
map roger.s3, target roger.s3, keycols (b);

----modify rba
GGSCI (killdb.com) 6> alter rep rep1124,extrba 1907
REPLICAT altered.

GGSCI (killdb.com) 7> start rep1124

Sending START request to MANAGER ...
REPLICAT REP1124 starting

-- Check again

www.killdb.com>select a,b,c,rowid from s1;

         A          B C                                ROWID
---------- ---------- -------------------------------- ------------------
         2          1 1                                AAAObvAAEAAAADlAAA

www.killdb.com>select TIMESTAMP ,sql_redo FROM v$logmnr_contents WHERE sql_redo like '%S1%';

TIMESTAMP    SQL_REDO
------------ ----------------------------------------------------------------------
10-JUN-15    insert into "ROGER"."S1"("A","B","C") values ('1','1','1');
10-JUN-15    insert into "ROGER"."S1"("A","B","C") values ('2','1','1');
10-JUN-15    delete from "ROGER"."S1" where "A" = '1' and "B" = '1' and "C" = '1
                     ' and ROWID = 'AAAObvAAEAAAADkAAA';

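Strictly speaking, this was caused by a misconfigured keycols parameter. The parameter specifies one or more columns that uniquely identify a row, so that GoldenGate can apply operations such as delete and update.
Replication failed earlier because the target s1 table has two rows with b=1 while the source deleted only one row, so the record clearly could not be mapped. The trail dump above also shows that the delete record carries only column A (the primary key), which likely leaves a mapping keyed on column B with no B value to match on.
Next we change the keycols column of the replicat to a; the test then works, as follows: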

-- Source

SQL> insert into s1 values (1,1,1);

1 row created.

SQL> insert into s3 values(1,1);

1 row created.

SQL> commit;

Commit complete.

SQL> delete from s1;     

1 row deleted.

SQL> commit;

Commit complete.

-- Modify the replicat configuration on the target

GGSCI (killdb.com) 7> view param rep1124

replicat rep1124
userid  ggs@roger,password ggs
reperror default, discard
DISCARDROLLOVER AT 20:30
discardfile ./dirrpt/rep1124.dsc, append, megabytes 50
handlecollisions
assumetargetdefs
allownoopupdates
numfiles 3000
map roger.t_ogg, target roger.t_ogg;
map roger.s1, target roger.s1, keycols (a);
map roger.s3, target roger.s3, keycols (b);

-- Target

www.killdb.com>truncate table s1;

Table truncated.

www.killdb.com>insert into s1 values(3,1,1); 

1 row created.

www.killdb.com>commit;

Commit complete.

www.killdb.com>select * from s1;

         A          B C
---------- ---------- --------------------------------
         1          1 1
         3          1 1

www.killdb.com>
www.killdb.com>
www.killdb.com>select * from s1;

         A          B C
---------- ---------- --------------------------------
         3          1 1

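As we can see, everything works after changing the keycols column. This is because column a of the target s1 table is itself unique: there are currently only two rows, with values 1 and 3. When there is no primary key or unique index, could an update produce duplicate rows on the target? Many people, including the vendor's engineers, say that OGG 11.2 does not have this problem. I will test this later.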
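
The effect of the key column choice can be illustrated with a toy model. This is not how GoldenGate is implemented; it is a simplified sketch that assumes a delete is applied by matching the columns present in the captured record against the configured key columns. The names `apply_delete` and `MappingError` are hypothetical.

```python
class MappingError(Exception):
    """Raised when a delete record cannot be matched on the key columns."""


def apply_delete(target_rows, before_image, keycols):
    """Toy model: return target_rows without the rows matching before_image.

    target_rows  -- list of dicts, one per row in the target table
    before_image -- dict of the columns present in the captured delete record
    keycols      -- tuple of column names used to identify the row
    """
    missing = [col for col in keycols if col not in before_image]
    if missing:
        raise MappingError(f"key column(s) {missing} not in record")
    key = tuple(before_image[col] for col in keycols)
    return [row for row in target_rows
            if tuple(row[col] for col in keycols) != key]


target = [{"a": 1, "b": 1}, {"a": 3, "b": 1}]
record = {"a": 1}  # the delete record in the trail dump carried only column A

print(apply_delete(target, record, ("a",)))   # row a=1 is removed
try:
    apply_delete(target, record, ("b",))
except MappingError as exc:
    print("mapping problem:", exc)
```

In this model, keying on a removes exactly one row, while keying on b cannot be resolved from a record that contains only a.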


ocssd.log:clssgmpcBuildNodeList: nodename for node x is NULL


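In two recent customer environments where we installed the latest PSU, 11.2.0.4.6, we found unusual messages in ocssd.log. On the whole they have no functional impact, but they are written very quickly, once every 5 seconds, so the other information in ocssd.log may be buried under these unimportant messages, which makes later problem analysis harder. So far we have seen this on both Solaris SPARC and Linux. As shown below, the ocssd.log files roll over at short intervals: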

oracle@xxxxdb1:/u01/app/11.2.0/grid/log/xxxxdb1/cssd $ls -ltr
total 1100608
-rw-rw-r--   1 grid     oinstall  222656 Jun  7 00:38 cssdOUT.log
-rw-r--r--   1 grid     oinstall 52496701 Jun 11 06:15 ocssd.l10
-rw-r--r--   1 grid     oinstall 52501775 Jun 11 09:19 ocssd.l09
-rw-r--r--   1 grid     oinstall 52503902 Jun 11 12:20 ocssd.l08
-rw-r--r--   1 grid     oinstall 52505495 Jun 11 15:21 ocssd.l07
-rw-r--r--   1 grid     oinstall 52503384 Jun 11 18:22 ocssd.l06
-rw-r--r--   1 grid     oinstall 52501464 Jun 11 21:24 ocssd.l05
-rw-r--r--   1 grid     oinstall 52500024 Jun 12 00:28 ocssd.l04
-rw-r--r--   1 grid     oinstall 52497327 Jun 12 03:35 ocssd.l03
-rw-r--r--   1 grid     oinstall 52497206 Jun 12 06:42 ocssd.l02
-rw-r--r--   1 grid     oinstall 52501370 Jun 12 09:45 ocssd.l01
-rw-r--r--   1 grid     oinstall 37051612 Jun 12 11:53 ocssd.log

The ocssd.log output looks like this:

oracle@xxxxdb1:/u01/app/11.2.0/grid/log/xxxxdb1/cssd $more ocssd.log
Oracle Database 11g Clusterware Release 11.2.0.4.0 - Production Copyright 1996, 2011 Oracle. All rights reserved.
2015-06-12 09:45:13.898: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 177 is NULL
2015-06-12 09:45:13.898: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 178 is NULL
2015-06-12 09:45:13.898: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 179 is NULL
......
......
2015-06-12 09:45:13.898: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 250 is NULL
2015-06-12 09:45:13.898: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 251 is NULL
2015-06-12 09:45:13.898: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 252 is NULL
2015-06-12 09:45:13.898: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 253 is NULL
2015-06-12 09:45:13.898: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 254 is NULL
2015-06-12 09:45:13.898: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 255 is NULL
2015-06-12 09:45:15.391: [    CSSD][45]clssnmSendingThread: sending status msg to all nodes
2015-06-12 09:45:15.392: [    CSSD][45]clssnmSendingThread: sent 4 status msgs to all nodes
2015-06-12 09:45:18.900: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 0 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 3 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 4 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 5 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 6 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 7 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 8 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 9 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 10 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 11 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 12 is NULL
.....
.....
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 249 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 250 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 251 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 252 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 253 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 254 is NULL
2015-06-12 09:45:18.901: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 255 is NULL
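Because these messages repeat every few seconds for almost every node number, they bury everything else in ocssd.log. A minimal Python sketch for separating the noise from the rest follows; the function name `summarize` and the regular expression are my own, written against the line format shown above, and are not part of any Oracle tool.

```python
import re
from collections import Counter

# Matches the repeated message shown above, e.g.
# "2015-06-12 09:45:13.898: [    CSSD][5]clssgmpcBuildNodeList: nodename for node 177 is NULL"
NOISE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+: \[\s*CSSD\]\[\d+\]"
    r"clssgmpcBuildNodeList: nodename for node (\d+) is NULL"
)

def summarize(lines):
    """Return (noise_count_per_second, other_lines) for an iterable of log lines."""
    per_second = Counter()
    other = []
    for line in lines:
        m = NOISE.match(line)
        if m:
            # Count the noisy message once per line, keyed by second.
            per_second[m.group(1)] += 1
        else:
            other.append(line.rstrip("\n"))
    return per_second, other
```

Calling `summarize(open("ocssd.log"))` gives a per-second count of the NULL-nodename messages plus the remaining lines, which are the ones worth reading.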

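Oracle's explanation is that if your environment was upgraded from a pre-11.2 release, the node numbers seen here start from 0; otherwise they do not.

However, I checked one customer's environment, an 11.2.0.4.6 cluster that we installed from scratch, and its ocssd.log looks almost identical to the output above.
Overall, this problem has little impact and can simply be ignored. According to Oracle, it was introduced by the fix for bug 17046460, and it has been logged as a new bug: Bug# 21171934 - REPEATED MESSAGES "CLSSGMPCBUILDNODELIST: NODENAME FOR NODE XX IS NULL" IN OCSSD
Unfortunately, Oracle has not released a patch for it yet; I expect it to be fixed in the July PSU.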


Another one recover database case!


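A friend reported that the system hosting one of his customer's databases was damaged and Oracle crashed. The datafiles were eventually copied out in safe mode, but the database would not start. It runs in NOARCHIVELOG mode,

and the only backups are dmp (export) files. They had already tried restoring from the dmp backups, but some of the dmp files appear to be damaged, so some tables could not be restored. They then tried extracting data from the datafiles with ODU, and some tables could not be extracted that way either (probably because SYSTEM is badly damaged; dbv reports more than 1,000 corrupt blocks).

The alert log from the attempted open: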

Wed Jun 17 17:31:00 2015
Database Characterset is ZHS16GBK
Wed Jun 17 17:31:00 2015
Hex dump of (file 1, block 53319) in trace file /spacedb/oracle/app/admin/workflow/bdump/workflow_smon_10972.trc
Corrupt block relative dba: 0x0040d047 (file 1, block 53319)
Fractured block found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x0040d047
 last change scn: 0x0000.0007b0ab seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x00000000
 check value in block header: 0x89cd
 computed block checksum: 0xd2d1
Reread of rdba: 0x0040d047 (file 1, block 53319) found same corrupted data
Wed Jun 17 17:31:00 2015
Errors in file /spacedb/oracle/app/admin/workflow/bdump/workflow_smon_10972.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-08103: object no longer exists
Hex dump of (file 1, block 53319) in trace file /spacedb/oracle/app/admin/workflow/bdump/workflow_smon_10972.trc
Corrupt block relative dba: 0x0040d047 (file 1, block 53319)
Fractured block found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x0040d047
 last change scn: 0x0000.0007b0ab seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x00000000
 check value in block header: 0x89cd
 computed block checksum: 0xd2d1
Reread of rdba: 0x0040d047 (file 1, block 53319) found same corrupted data
Wed Jun 17 17:31:00 2015
Errors in file /spacedb/oracle/app/admin/workflow/bdump/workflow_smon_10972.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-08103: object no longer exists
Hex dump of (file 1, block 53319) in trace file /spacedb/oracle/app/admin/workflow/bdump/workflow_smon_10972.trc
Corrupt block relative dba: 0x0040d047 (file 1, block 53319)
Fractured block found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x0040d047
 last change scn: 0x0000.0007b0ab seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x00000000
 check value in block header: 0x89cd
 computed block checksum: 0xd2d1
Reread of rdba: 0x0040d047 (file 1, block 53319) found same corrupted data
Wed Jun 17 17:31:00 2015
Errors in file /spacedb/oracle/app/admin/workflow/bdump/workflow_smon_10972.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-08103: object no longer exists
Wed Jun 17 17:31:00 2015
Opening with internal Resource Manager plan
where NUMA PG = 1, CPUs = 8
Wed Jun 17 17:31:00 2015
Errors in file /spacedb/oracle/app/admin/workflow/udump/workflow_ora_11168.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-08103: object no longer exists
Error 604 happened during db open, shutting down database
USER: terminating instance due to error 604
Instance terminated by USER, pid = 11168
ORA-1092 signalled during: ALTER DATABASE OPEN...
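The numbers in the corrupt-block dump can be checked by hand. Below is a minimal Python sketch; it assumes the usual layout of a relative data block address for smallfile tablespaces (upper 10 bits for the relative file number, lower 22 bits for the block number), and it assumes the block tail is built from the low 16 bits of the SCN base, the block type and the sequence number. The function names are my own.

```python
def decode_rdba(rdba):
    """Split a 32-bit relative DBA into (relative file#, block#).

    Assumption: 10 bits of file number followed by 22 bits of block number.
    """
    return rdba >> 22, rdba & 0x3FFFFF

def expected_tail(scn_base, block_type, seq):
    """Tail value a consistent block should carry.

    Assumption: low 16 bits of the SCN base, then the block type byte,
    then the sequence byte.
    """
    return ((scn_base & 0xFFFF) << 16) | (block_type << 8) | seq

if __name__ == "__main__":
    # Values taken from the alert log above.
    print(decode_rdba(0x0040D047))
    print(hex(expected_tail(0x0007B0AB, 6, 1)))
```

Under these assumptions the rdba 0x0040d047 decodes to file 1, block 53319, matching the alert log, while the tail reported in the log is 0x00000000 instead of the value derived from the header. That mismatch between the head and the tail of the block is what "Fractured block" refers to.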

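Clearly, Oracle fails while executing recursive SQL, and it hits a corrupt block while doing so. Checking with dbv, I found a large number of corrupt blocks, some of them contiguous, so it is quite likely that entire extents are damaged. The dbv results: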

---dbv system01.dbf
[oracle@zxzx workflow]$ dbv file=system01.dbf blocksize=8192 logfile=check_system.log

DBVERIFY: Release 10.2.0.4.0 - Production on Wed Jun 17 23:59:40 2015

Copyright (c) 1982, 2007, Oracle.  All rights reserved.
......
......

DBVERIFY - Verification complete

Total Pages Examined         : 72960
Total Pages Processed (Data) : 44288
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 9414
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 1837
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 15709
Total Pages Marked Corrupt   : 1712
Total Pages Influx           : 380
Highest block SCN            : 105111246 (0.105111246)
[oracle@zxzx workflow]$ 

[oracle@zxzx workflow]$ cat check_system.log |grep 533
Page 50533 is marked corrupt
Corrupt block relative dba: 0x0040c565 (file 1, block 50533)
Page 53319 is influx - most likely media corrupt
Corrupt block relative dba: 0x0040d047 (file 1, block 53319)
Page 53320 is marked corrupt
Corrupt block relative dba: 0x0040d048 (file 1, block 53320)
Page 53321 is marked corrupt
Corrupt block relative dba: 0x0040d049 (file 1, block 53321)
Page 53322 is marked corrupt
Corrupt block relative dba: 0x0040d04a (file 1, block 53322)
Page 53323 is marked corrupt
Corrupt block relative dba: 0x0040d04b (file 1, block 53323)
Page 53324 is marked corrupt
Corrupt block relative dba: 0x0040d04c (file 1, block 53324)
Page 53325 is marked corrupt
Corrupt block relative dba: 0x0040d04d (file 1, block 53325)
Page 53326 is marked corrupt
Corrupt block relative dba: 0x0040d04e (file 1, block 53326)
Page 53327 is influx - most likely media corrupt
Corrupt block relative dba: 0x0040d04f (file 1, block 53327)
Page 53383 is influx - most likely media corrupt
Corrupt block relative dba: 0x0040d087 (file 1, block 53383)
Page 53384 is marked corrupt
Corrupt block relative dba: 0x0040d088 (file 1, block 53384)
Page 53385 is marked corrupt
Corrupt block relative dba: 0x0040d089 (file 1, block 53385)
Page 53386 is marked corrupt
Corrupt block relative dba: 0x0040d08a (file 1, block 53386)
Page 53387 is marked corrupt
Corrupt block relative dba: 0x0040d08b (file 1, block 53387)
Page 53388 is marked corrupt
Corrupt block relative dba: 0x0040d08c (file 1, block 53388)
Page 53389 is marked corrupt
Corrupt block relative dba: 0x0040d08d (file 1, block 53389)
Page 53390 is marked corrupt
Corrupt block relative dba: 0x0040d08e (file 1, block 53390)
Page 53391 is influx - most likely media corrupt
Corrupt block relative dba: 0x0040d08f (file 1, block 53391)
Page 55330 is marked corrupt
Corrupt block relative dba: 0x0040d822 (file 1, block 55330)
Page 55331 is marked corrupt
Corrupt block relative dba: 0x0040d823 (file 1, block 55331)
Page 55332 is marked corrupt
Corrupt block relative dba: 0x0040d824 (file 1, block 55332)
Page 55333 is marked corrupt
Corrupt block relative dba: 0x0040d825 (file 1, block 55333)
Page 55334 is marked corrupt
Corrupt block relative dba: 0x0040d826 (file 1, block 55334)
Page 55335 is influx - most likely media corrupt
Corrupt block relative dba: 0x0040d827 (file 1, block 55335)
Page 55533 is marked corrupt
Corrupt block relative dba: 0x0040d8ed (file 1, block 55533)
Page 60533 is marked corrupt
Corrupt block relative dba: 0x0040ec75 (file 1, block 60533)

[oracle@zxzx workflow]$ cat check_system.log |grep 538
Page 53839 is influx - most likely media corrupt
Corrupt block relative dba: 0x0040d24f (file 1, block 53839)
Page 53840 is marked corrupt
Corrupt block relative dba: 0x0040d250 (file 1, block 53840)
Page 53841 is marked corrupt
Corrupt block relative dba: 0x0040d251 (file 1, block 53841)
Page 53842 is marked corrupt
Corrupt block relative dba: 0x0040d252 (file 1, block 53842)
Page 53843 is marked corrupt
Corrupt block relative dba: 0x0040d253 (file 1, block 53843)
Page 53844 is marked corrupt
Corrupt block relative dba: 0x0040d254 (file 1, block 53844)
Page 53845 is marked corrupt
Corrupt block relative dba: 0x0040d255 (file 1, block 53845)
Page 53846 is marked corrupt
Corrupt block relative dba: 0x0040d256 (file 1, block 53846)
Page 53847 is influx - most likely media corrupt
Corrupt block relative dba: 0x0040d257 (file 1, block 53847)
Page 57538 is marked corrupt
Corrupt block relative dba: 0x0040e0c2 (file 1, block 57538)
[oracle@zxzx workflow]$
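Grepping for block-number fragments works, but it is easy to miss how the corrupt blocks cluster. A minimal Python sketch that groups the blocks reported in the dbv log file into contiguous runs follows; the function name `corrupt_runs` is my own, and the pattern is written against the "Corrupt block relative dba" lines shown above.

```python
import re

# Matches e.g. "Corrupt block relative dba: 0x0040d047 (file 1, block 53319)"
BLOCK = re.compile(
    r"Corrupt block relative dba: 0x[0-9a-fA-F]+ \(file (\d+), block (\d+)\)"
)

def corrupt_runs(lines):
    """Return a sorted list of (file#, first_block, last_block) runs."""
    blocks = set()
    for line in lines:
        m = BLOCK.search(line)
        if m:
            blocks.add((int(m.group(1)), int(m.group(2))))
    runs = []
    for file_no, block in sorted(blocks):
        # Extend the current run if this block directly follows it.
        if runs and runs[-1][0] == file_no and runs[-1][2] + 1 == block:
            runs[-1][2] = block
        else:
            runs.append([file_no, block, block])
    return [tuple(r) for r in runs]
```

Running `corrupt_runs(open("check_system.log"))` lists each damaged range once, which makes runs such as blocks 53319 to 53327 stand out immediately.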

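As we can see, some of the corrupt blocks are contiguous, which makes this much harder to deal with. I first tried a few other techniques to force the database open, without success. The 10046 trace shows the following: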

=====================
PARSING IN CURSOR #3 len=169 dep=1 uid=0 oct=3 lid=0 tim=1400932967838961 hv=1173719687 ad='ddb76e68'
select col#, grantee#, privilege#,max(mod(nvl(option$,0),2)) from objauth$ where obj#=:1 and col# is not null group by privilege#, col#, grantee# order by col#, grantee#
END OF STMT
EXEC #3:c=0,e=67,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=1400932967838959
FETCH #3:c=0,e=18,p=0,cr=2,cu=0,mis=0,r=0,dep=1,og=4,tim=1400932967839016
STAT #3 id=1 cnt=0 pid=0 pos=1 obj=0 op='SORT GROUP BY (cr=6 pr=0 pw=0 time=81 us)'
STAT #3 id=2 cnt=0 pid=1 pos=1 obj=57 op='TABLE ACCESS BY INDEX ROWID OBJAUTH$ (cr=6 pr=0 pw=0 time=52 us)'
STAT #3 id=3 cnt=0 pid=2 pos=1 obj=103 op='INDEX RANGE SCAN I_OBJAUTH1 (cr=6 pr=0 pw=0 time=49 us)'
BINDS #7:
kkscoacd
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=7f5b4afb40c0  bln=22  avl=02  flg=05
  value=16
=====================
PARSING IN CURSOR #7 len=151 dep=1 uid=0 oct=3 lid=0 tim=1400932967839161 hv=4139184264 ad='deedc608'
select grantee#,privilege#,nvl(col#,0),max(mod(nvl(option$,0),2))from objauth$ where obj#=:1 group by grantee#,privilege#,nvl(col#,0) order by grantee#
END OF STMT
EXEC #7:c=0,e=61,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=1400932967839158
FETCH #7:c=0,e=29,p=0,cr=3,cu=0,mis=0,r=1,dep=1,og=4,tim=1400932967839225
FETCH #7:c=0,e=3,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=1400932967839253
STAT #7 id=1 cnt=2 pid=0 pos=1 obj=0 op='SORT GROUP BY (cr=8 pr=1 pw=0 time=160 us)'
STAT #7 id=2 cnt=2 pid=1 pos=1 obj=57 op='TABLE ACCESS BY INDEX ROWID OBJAUTH$ (cr=8 pr=1 pw=0 time=113 us)'
STAT #7 id=3 cnt=2 pid=2 pos=1 obj=103 op='INDEX RANGE SCAN I_OBJAUTH1 (cr=6 pr=0 pw=0 time=43 us)'
BINDS #4:
kkscoacd
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=7f5b4afb6a80  bln=22  avl=02  flg=05
  value=18
=====================
PARSING IN CURSOR #4 len=169 dep=1 uid=0 oct=3 lid=0 tim=1400932967839460 hv=1173719687 ad='ddb76e68'
select col#, grantee#, privilege#,max(mod(nvl(option$,0),2)) from objauth$ where obj#=:1 and col# is not null group by privilege#, col#, grantee# order by col#, grantee#
END OF STMT
EXEC #4:c=0,e=68,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=1400932967839458
FETCH #4:c=0,e=21,p=0,cr=2,cu=0,mis=0,r=0,dep=1,og=4,tim=1400932967839518
STAT #4 id=1 cnt=0 pid=0 pos=1 obj=0 op='SORT GROUP BY (cr=8 pr=0 pw=0 time=110 us)'
STAT #4 id=2 cnt=0 pid=1 pos=1 obj=57 op='TABLE ACCESS BY INDEX ROWID OBJAUTH$ (cr=8 pr=0 pw=0 time=72 us)'
STAT #4 id=3 cnt=0 pid=2 pos=1 obj=103 op='INDEX RANGE SCAN I_OBJAUTH1 (cr=8 pr=0 pw=0 time=67 us)'
BINDS #3:
kkscoacd
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=7f5b4afb40c0  bln=22  avl=02  flg=05
  value=18
=====================
PARSING IN CURSOR #3 len=151 dep=1 uid=0 oct=3 lid=0 tim=1400932967839663 hv=4139184264 ad='deedc608'
select grantee#,privilege#,nvl(col#,0),max(mod(nvl(option$,0),2))from objauth$ where obj#=:1 group by grantee#,privilege#,nvl(col#,0) order by grantee#
END OF STMT
EXEC #3:c=0,e=62,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=1400932967839660
WAIT #3: nam='db file sequential read' ela= 20 file#=1 block#=53841 blocks=1 obj#=-1 tim=1400932967839771
FETCH #3:c=0,e=93,p=1,cr=3,cu=0,mis=0,r=0,dep=1,og=4,tim=1400932967839800
ORA-00604: error occurred at recursive SQL level 1
ORA-08103: object no longer exists
EXEC #1:c=471928,e=1720329,p=273,cr=3927,cu=157,mis=0,r=0,dep=0,og=1,tim=1400932968820893
ERROR #1:err=1092 tim=438031590
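In a long 10046 trace, the block that triggers the failure is the last single-block read before the first ORA- line. A minimal Python sketch for finding it is shown here; the function name `last_read_before_error` is my own, and the pattern is written against the WAIT line format shown above.

```python
import re

# Matches e.g.
# "WAIT #3: nam='db file sequential read' ela= 20 file#=1 block#=53841 blocks=1 ..."
WAIT = re.compile(
    r"^WAIT #\d+: nam='db file sequential read' ela= *\d+ file#=(\d+) block#=(\d+)"
)

def last_read_before_error(lines):
    """Return (file#, block#) of the last single-block read before the first ORA- line."""
    last = None
    for line in lines:
        m = WAIT.match(line)
        if m:
            last = (int(m.group(1)), int(m.group(2)))
        elif line.startswith("ORA-") and last is not None:
            return last
    return None
```

For the trace above this points at file 1, block 53841, which dbv also reports as corrupt.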

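We can see that the failure occurs when a particular block is read, and that the several blocks following it are corrupt as well. I tried repairing a few blocks with bbed copy, but that did not help. In the end I realized that this recursive SQL can actually be bypassed by modifying the Oracle binary. Unfortunately, the customer's machine is already running another database that cannot be stopped, so that approach had to be dropped. It might also be possible with gdb, but either way it is rather troublesome.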

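Since they do have dmp backups, the simplest route is to extract directly from the dmp with DUL and pull out only the tables that are needed.

The steps for extracting from the dump with DUL: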

1. scan dump file:
   scan dump file spacedb/orabak/wxzx20150604.dmp;

2. cat dul.log|grep YZ_CAR_APPLY

3. Extract the required table data (using the table's offset position found in step 2)

   unexp TABLE "YZ_CAR_APPLY" ("ID" NUMBER(20, 0) NOT NULL ENABLE, "APPLY_USERID" VARCHAR2(50), "APPLY_USERNAME" VARCHAR2(50), "DEPT_ID" VARCHAR2(50), "DEPT_NAME" VARCHAR2(50), "CAR_TYPE" VARCHAR2(50), "CAR_CODE" VARCHAR2(50), "RENSHU" NUMBER(3, 0), "SHIYOU" VARCHAR2(1000), "LEAVE_TIME" VARCHAR2(50), "LEAVE_ADDRESS" VARCHAR2(400), "BACK_TIME" VARCHAR2(50), "BACK_ADDRESS" VARCHAR2(400), "LINKMAN" VARCHAR2(50), "LINK_PHONE" VARCHAR2(40), "SUBFLAG" VARCHAR2(50), "USER_LIST" VARCHAR2(2000), "CAR_ID" VARCHAR2(50), "CRE_USERID" VARCHAR2(50), "CRE_DATE" VARCHAR2(50), "ORG_ID" VARCHAR2(50), "FLOW_TYPE" VARCHAR2(50), "FILE_TYPE" VARCHAR2(50), "TITLE" VARCHAR2(200), "FLOWCOURSE" VARCHAR2(4000), "SECRETARIAL_SIGN" VARCHAR2(4000), "SECRETARIAL_IDEA" VARCHAR2(4000), "OFFICELEADER_SIGN" VARCHAR2(4000), "OFFICELEADER_IDEA" VARCHAR2(4000), "SLEADER_SIGN" VARCHAR2(4000), "SLEADER_IDEA" VARCHAR2(4000), "OTHER_NOTION" VARCHAR2(4000), "DRIVER" VARCHAR2(50)) dump file /spacedb/orabak/wxzx20150604.dmp from 23934970472;

 


ORA-03137: TTC protocol internal error : [3113] in 11.2.0.4


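The Dragon Boat Festival had just ended, and in the afternoon I logged in over VPN to check whether a customer system we had recently migrated was running normally. I then noticed that on a system we had migrated earlier, after we had applied several patches, a new error had appeared in the alert log, although it is not a fatal one. The alert log: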

Fri Jun 19 16:19:01 2015
Errors in file /u01/app/oracle/diag/rdbms/acct/xxxx/trace/xxxx_ora_18010.trc  (incident=1239675):
ORA-03137: TTC protocol internal error : [3113] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/acct/xxxx/incident/incdir_1239675/xxxx_ora_18010_i1239675.trc
Fri Jun 19 16:19:04 2015
Dumping diagnostic data in directory=[cdmp_20150619161904], requested by (instance=1, osid=18010), summary=[incident=1239675].
Fri Jun 19 16:19:06 2015
Sweep [inc][1239675]: completed
Sweep [inc2][1239675]: completed
.....
.....
Sat Jun 20 04:32:59 2015
Archived Log entry 15345 added for thread 1 sequence 5089 ID 0xffffffffe3634d75 dest 1:
Sat Jun 20 04:34:50 2015
Errors in file /u01/app/oracle/diag/rdbms/acct/xxxx/trace/xxxx_ora_40578.trc  (incident=1233619):
ORA-03137: TTC protocol internal error : [3113] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/acct/xxxx/incident/incdir_1233619/xxxx_ora_40578_i1233619.trc
Sat Jun 20 04:34:53 2015
Dumping diagnostic data in directory=[cdmp_20150620043453], requested by (instance=1, osid=40578), summary=[incident=1233619].
Sat Jun 20 04:34:55 2015
Sweep [inc][1233619]: completed
Sweep [inc2][1233619]: completed
.....
.....
Sat Jun 20 22:58:56 2015
Errors in file /u01/app/oracle/diag/rdbms/acct/xxxx/trace/xxxx_ora_16040.trc  (incident=1238339):
ORA-03137: TTC protocol internal error : [3113] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/acct/xxxx/incident/incdir_1238339/xxxx_ora_16040_i1238339.trc
Sat Jun 20 22:59:00 2015
Dumping diagnostic data in directory=[cdmp_20150620225900], requested by (instance=1, osid=16040), summary=[incident=1238339].
Sat Jun 20 22:59:04 2015
Sweep [inc][1238339]: completed
Sweep [inc2][1238339]: completed
Sat Jun 20 22:59:41 2015
.....
.....
Sun Jun 21 06:02:35 2015
Archived Log entry 15548 added for thread 1 sequence 5213 ID 0xffffffffe3634d75 dest 1:
Sun Jun 21 06:05:11 2015
Errors in file /u01/app/oracle/diag/rdbms/acct/xxxx/trace/xxxx_ora_10611.trc  (incident=1249547):
ORA-03137: TTC protocol internal error : [3113] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/acct/xxxx/incident/incdir_1249547/xxxx_ora_10611_i1249547.trc
Sun Jun 21 06:05:14 2015
Dumping diagnostic data in directory=[cdmp_20150621060514], requested by (instance=1, osid=10611), summary=[incident=1249547].
Sun Jun 21 06:05:17 2015
Sweep [inc][1249547]: completed
Sweep [inc2][1249547]: completed
Sun Jun 21 06:19:26 2015
.....
.....
Archived Log entry 15700 added for thread 1 sequence 5316 ID 0xffffffffe3634d75 dest 1:
Mon Jun 22 02:35:23 2015
Errors in file /u01/app/oracle/diag/rdbms/acct/xxxx/trace/xxxx_ora_25443.trc  (incident=1249843):
ORA-03137: TTC protocol internal error : [3113] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/acct/xxxx/incident/incdir_1249843/xxxx_ora_25443_i1249843.trc
Mon Jun 22 02:35:26 2015
Dumping diagnostic data in directory=[cdmp_20150622023526], requested by (instance=1, osid=25443), summary=[incident=1249843].
Mon Jun 22 02:35:28 2015
Sweep [inc][1249843]: completed
Sweep [inc2][1249843]: completed
Mon Jun 22 02:42:18 2015
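To check how the incidents are distributed in time, it helps to pull the timestamp of each incident out of the alert log. A minimal Python sketch is shown here; the function names are my own, and it assumes the plain-text 11g alert log layout shown above (a timestamp line followed by the messages for that moment) and an English locale for parsing day and month names.

```python
import re
from datetime import datetime

TS_FORMAT = "%a %b %d %H:%M:%S %Y"   # e.g. "Fri Jun 19 16:19:01 2015"
INCIDENT = re.compile(r"\(incident=(\d+)\)")

def parse_ts(line):
    """Return a datetime if the whole line is an alert log timestamp, else None."""
    try:
        return datetime.strptime(line.strip(), TS_FORMAT)
    except ValueError:
        return None

def incidents(lines):
    """Return [(timestamp, incident_id)] for each 'Errors in file ... (incident=N)' line."""
    current = None
    found = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            current = ts
        elif line.startswith("Errors in file"):
            m = INCIDENT.search(line)
            if m:
                found.append((current, int(m.group(1))))
    return found
```

Subtracting consecutive timestamps in the returned list gives the interval between incidents, which can then be compared against any suspected schedule or time-zone offset.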

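Looking at the log timestamps, at first it seemed that they were off by 8 hours, but a closer check showed that this does not fully hold. The call stack from the trace: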

*** ACTION NAME:() 2015-06-22 02:35:23.697

Dump continued from file: /u01/app/oracle/diag/rdbms/acct/xxxx/trace/xxxx_ora_25443.trc
ORA-03137: TTC protocol internal error : [3113] [] [] [] [] [] [] []

========= Dump for incident 1249843 (ORA 3137 [3113]) ========

*** 2015-06-22 02:35:23.700
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.

----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst1()+124        CALL     skdstdst()           FFFFFFFF7FFF1AD0 ?
                                                   000000002 ? 10D670528 ?
                                                   000000000 ?
                                                   FFFFFFFF7FFC8E40 ?
                                                   000000000 ?
ksedst()+52          CALL     ksedst1()            00010D800 ? 00010D800 ?
                                                   10DB0F000 ? 00010DB17 ?
                                                   10DB0F0D0 ? 10DB1704C ?
dbkedDefDump()+1984  CALL     ksedst()             000000000 ? 10DB39000 ?
                                                   00010DB39 ? 10DB17000 ?
                                                   00010D800 ? 00010DB17 ?
dbgexPhaseII()+1340  PTR_CALL dbkedDefDump()       000000004 ? 00010DB39 ?
                                                   000000003 ? 000000000 ?
                                                   000000001 ? 00010D800 ?
dbgexProcessError()  CALL     dbgexPhaseII()       10DD07E40 ?
+2072                                              FFFFFFFF7CE6A4B0 ?
                                                   10DDDF9D0 ? 100A1B2E0 ?
                                                   10DB05168 ? 000000000 ?
dbgePostErrorKGE()+  CALL     dbgeExecuteForError  000000024 ? 10DB0F5E0 ?
2188                          ()                   000000000 ? 10DD07E40 ?
                                                   FFFFFFFF7CE6A4B0 ?
                                                   10DB05168 ?
dbkePostKGE_kgsf()+  CALL     dbgePostErrorKGE()   000000000 ? 10DD07E40 ?
48                                                 FFFFFFFF7CE664B0 ?
                                                   000001D48 ? 000000C41 ?
                                                   FFFFFFFF7CE79AA8 ?
kgeade()+548         PTR_CALL dbkePostKGE_kgsf()   10DB0F420 ?
                                                   FFFFFFFF7CE78E20 ?
                                                   000000C41 ? 000002878 ?
                                                   10BE7A000 ? 00010BE7A ?
kgerelv()+240        CALL     kgeade()             000000000 ?
                                                   FFFFFFFF7CE78E20 ?
                                                   000000000 ? 000000C41 ?
                                                   000000000 ? 000000000 ?
kgerev()+64          CALL     kgerelv()            10DB0F420 ?
                                                   FFFFFFFF7CE78E20 ?
                                                   000000C41 ? 10D664020 ?
                                                   FFFFFFFF7FFFAC48 ?
                                                   000000001 ?
opiierr()+584        CALL     kgerev()             10DB0F420 ?
                                                   FFFFFFFFFFB5EEB8 ?
                                                   000000C41 ? 000000001 ?
                                                   FFFFFFFF7FFFAC48 ?
                                                   0004A1000 ?
opiodr()+9528        CALL     opiierr()            00010DB0F ? 000000001 ?
                                                   000000001 ? 10DB0F000 ?
                                                   0000001B0 ? 10CFF6E00 ?
ttcpip()+932         PTR_CALL opiodr()             00010D800 ? 10DB0F5E0 ?
                                                   000000000 ? 000000074 ?
                                                   000000000 ? 00010DB35 ?
opitsk()+1728        CALL     ttcpip()             FFFFFFFF7FFFC340 ?
                                                   000000040 ? 000000001 ?
                                                   10DB0F420 ?
                                                   FFFFFFFF7FFFD928 ?
                                                   000000000 ?
opiino()+924         CALL     opitsk()             000000000 ? 10BE5D7E4 ?
                                                   000000000 ? 000000001 ?
                                                   00000000A ? 000001768 ?
opiodr()+1176        PTR_CALL opiino()             10DB31878 ?
                                                   FFFFFFFF7FFFECE0 ?
                                                   000000001 ? 000000000 ?
                                                   0000000D8 ? 10DD0B778 ?
opidrv()+1032        CALL     opiodr()             000010000 ? 10DB0F5E0 ?
                                                   000000000 ? 00000003C ?
                                                   000000000 ? 10C0F3120 ?
sou2o()+88           CALL     opidrv()             10DB13000 ? 000000000 ?
                                                   10DB31878 ? 00000003C ?
                                                   000000000 ?
                                                   FFFFFFFF7FFFECE0 ?
opimai_real()+316    CALL     sou2o()              FFFFFFFF7FFFECB8 ?
                                                   00000003C ? 000000004 ?
                                                   FFFFFFFF7FFFECE0 ?
                                                   10DB363E0 ? 00010D800 ?
ssthrdmain()+324     PTR_CALL opimai_real()        000000002 ?
                                                   FFFFFFFF7FFFEF68 ?
                                                   FFFFFFFF7F201340 ?
                                                   FFFFFFFF7F201340 ?
                                                   00537C944 ? 000000001 ?
main()+316           CALL     ssthrdmain()         00010D800 ? 00010DB41 ?
                                                   10DB41000 ? 000000002 ?
                                                   00010DB41 ? 10DCFBD60 ?
_start()+380         CALL     main()               000000002 ? 000000000 ?
                                                   000000000 ?
                                                   FFFFFFFF7FFFEF78 ?
                                                   FFFFFFFF7FFFF088 ?
                                                   000002800 ?
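When comparing a stack like this against the one quoted in a bug report, only the sequence of function names matters, not the argument columns. A minimal Python sketch for reducing the dump to that sequence is shown here; the function name `stack_functions` is my own, and the pattern assumes the layout shown above, where each frame starts at column 0 with `name()` and wrapped offsets or arguments appear on continuation lines.

```python
import re

# A frame line starts at column 0 with "name()", e.g. "opiierr()+584  CALL  kgerev() ...".
# Continuation lines start with spaces, a "+" offset, or digits and are skipped.
FRAME = re.compile(r"^([A-Za-z_]\w*)\(\)")

def stack_functions(lines):
    """Return the calling-location function names, innermost frame first."""
    names = []
    for line in lines:
        m = FRAME.match(line)
        if m:
            names.append(m.group(1))
    return names
```

Joining the result with `" <- "` gives a one-line signature that is easy to paste into a bug search or to diff against another trace.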

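A search on Oracle MetaLink confirmed that this is caused by Oracle bug 20309829. Unfortunately, no patch has been released for this bug yet.

Note: the call stack in that bug matches the trace above exactly, although the bug description is for the Linux platform. Searching the related SRs with an internal account also turned up two other related bugs, so I am confident that this is the bug in question.

Unfortunately, none of them has a patch. The good news is that this bug is not fatal and can largely be ignored. See: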

Bug 20309829 : ORA-3137: TTC PROTOCOL INTERNAL ERROR: [3113] [] [] [] [] [] [] []


Recovering a 12 TB test database


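This should have been simple: restore the files, recover with the archived logs to a point in time, then open the database with OPEN RESETLOGS. Instead, the open failed with ORA-600 [4097]. That is a very common error, but the odd part is that here it does not say directly which rollback segment is at fault. The trace: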

ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []
Current SQL statement for this session:
update undo$ set name=:2,file#=:3,block#=:4,status$=:5,user#=:6,undosqn=:7,xactsqn=:8,scnbas=:9,scnwrp=:10,inst#=:11,ts#=:12,spare1=:13 where us#=:1
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              000000000 ? 000000000 ?
ksedmp+0290          bl       ksedst               104A56548 ?
ksfdmp+0018          bl       03F563A4
kgeriv+0108          bl       _ptrgl
kgesiv+0080          bl       kgeriv               000000018 ? FFFFFFFFFFDB5C0 ?
                                                   000000000 ? 10564B600 ?
                                                   7000004F76CA698 ?
ksesic0+0060         bl       kgesiv               000000000 ? 700000350CD7B3C ?
                                                   FFFFFFFFFFDB050 ?
                                                   FFFFFFFFFFDB598 ? 000000000 ?
ktugti+07cc          bl       ksesic0              100100001001 ? 0000010E4 ?
                                                   000000000 ? 000000000 ?
                                                   1104B74D0 ? 000000080 ?
                                                   1100DFC10 ? 000000007 ?
ktcwit1+0684         bl       ktugti               700000506D0CC10 ? 410000FC8 ?
                                                   0056815A0 ? 147AE1410000FC8 ?
                                                   3B00000015 ? 4400000005 ?
                                                   4EB10018078 ? 21FE7BFCF8 ?
ktbgfi+1390          bl       ktcwit1              FFFFFFFFFFDB5C0 ?
                                                   FFFFFFFFFFDB598 ? 20010D18C ?
                                                   41022B190 ? 000000000 ?
                                                   147AE1411D203D0 ?
                                                   FFFFFFFFFFDB5B0 ? 111D20390 ?
kdddgb+08b0          bl       ktbgfi               011D16648 ? 111D44B58 ?
                                                   000000000 ? 111D16770 ?
                                                   FFFFFFFFFFDB8C0 ?
                                                   4844484304CD2968 ?
                                                   1020E0968 ?
kdusru+15d8          bl       kdddgb               000000000 ? 000000000 ?
                                                   000000000 ?
kauupd+0230          bl       kdusru               000000000 ? 000000000 ?
                                                   000000000 ? 000000000 ?
updrow+10fc          bl       kauupd               111D643F0 ? 7000004ECF8A480 ?
                                                   1104D87D0 ?
                                                   4004824000000000 ?
                                                   7000004ECF8C978 ?
                                                   E0004F60D27C8 ? FFFFE6D50 ?
                                                   1104D6E60 ?
qerupRowProcedure+0  bl       updrow               1100C82A8 ? 7FFF04E5836C ?
050
qerupFetch+053c      bl       03F52E00
updaul+0e0c          bl       01FC3DDC
updThreePhaseExe+0e  bl       updaul               7000004ECF7FF48 ?
ec                                                 FFFFFFFFFFE8328 ? 000000000 ?
updexe+02f8          bl       updThreePhaseExe     FFFFFFFFFFE8580 ? 100000000 ?
                                                   000000000 ? 1104DCEA0 ?
opiexe+2868          bl       updexe               111D7C110 ? 300000418 ?
opiodr+0ae0          bl       _ptrgl
rpidrus+01bc         bl       opiodr               400000000 ? 4104DCEA0 ?
                                                   FFFFFFFFFFEC030 ? 204E92D50 ?
skgmstack+00c8       bl       _ptrgl
rpidru+0088          bl       skgmstack            700000505CE041C ? 000000000 ?
                                                   000000002 ? 000000000 ?
                                                   FFFFFFFFFFEBBC8 ?
rpiswu2+034c         bl       _ptrgl
rpidrv+095c          bl       rpiswu2              700000505CE03E0 ?
                                                   FFFFFFFFFFEBB30 ?
                                                   FFFFFFFFFFEC210 ?
                                                   882244200030B558 ?
                                                   10107B9CC ? 000000000 ?
                                                   FFFFFFFFFFEBF30 ? 000000000 ?
rpiexe+005c          bl       rpidrv               200000000 ? 400000000 ?
                                                   FFFFFFFFFFEC030 ? 000000000 ?
ktuscu+0284          bl       01FC42B8
kqrcmt+0404          bl       _ptrgl
ktcrcm+052c          bl       kqrcmt               7000004F76CA698 ? 100000000 ?
                                                   000000000 ?
ktuiup+056c          bl       ktcrcm               7000004F76CA698 ? 000000000 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ? 100000001 ?
                                                   000000000 ? 000000000 ?
ktuini+0064          bl       ktuiup               000000000 ?
adbdrv+1984          bl       ktuini               010441180 ?
opiexe+2c98          bl       adbdrv
opiosq0+19f0         bl       opiexe               000000000 ? 000000000 ?
                                                   FFFFFFFFFFF8F20 ?
kpooprx+0168         bl       opiosq0              3F8FECB10 ? 700000010003520 ?
                                                   7000004F8FECA90 ?
                                                   A40001101960A8 ?
kpoal8+0400          bl       kpooprx              FFFFFFFFFFFB774 ?
                                                   FFFFFFFFFFFB518 ?
                                                   1D0000001D ? 100000001 ?
                                                   000000000 ? A40000000000A4 ?
                                                   000000000 ? 1103A5678 ?
opiodr+0ae0          bl       _ptrgl
ttcpip+1020          bl       _ptrgl
opitsk+1124          bl       ttcpip               1100CB4B0 ? 9001000A0080860 ?
                                                   FFFFFFFFFFFB750 ? 11044D010 ?
                                                   FFFFFFFFFFFB750 ? 11044D090 ?
                                                   FFFFFFFFFFFB750 ?
                                                   9001000A0080860 ?
opiino+0990          bl       opitsk               1E00000000 ? 000000000 ?
opiodr+0ae0          bl       _ptrgl
opidrv+0484          bl       01FC4CDC
sou2o+0090           bl       opidrv               3C02877CFC ? 4A006F398 ?
                                                   FFFFFFFFFFFF6B0 ?
opimai_real+01bc     bl       01FC306C
main+0098            bl       opimai_real          000000000 ? 000000000 ?
__start+0070         bl       main                 000000000 ? 000000000 ?

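In fact, we could try resetting the incarnation, then restoring the archived logs and running recover, but that seemed like too much trouble. This is only a test, so I carried on.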

First I traced the open with the 10046 event and found that the following SQL was failing:

PARSING IN CURSOR #2 len=148 dep=1 uid=0 oct=6 lid=0 tim=11525224803938 hv=3540833987 ad='9f8d140'
update undo$ set name=:2,file#=:3,block#=:4,status$=:5,user#=:6,undosqn=:7,xactsqn=:8,scnbas=:9,scnwrp=:10,inst#=:11,ts#=:12,spare1=:13 where us#=:1
END OF STMT
PARSE #2:c=0,e=10,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=11525224803936
BINDS #2:
kkscoacd
 Bind#0
  oacdty=01 mxl=32(10) mxlc=00 mal=00 scl=00 pre=00
  oacflg=18 fl2=0001 frm=01 csi=852 siz=32 off=0
  kxsbbbfp=70000050bfe3922  bln=32  avl=10  flg=09
  value="_SYSSMU29$"
 Bind#1

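When Oracle opens the database, it checks whether the rollback segments still hold transactions. If they do, it runs this update, so seeing the update means the rollback segment information has to be refreshed during open. I tried masking the rollback segments above with a parameter, but the database still would not open. Going back to the 10046 trace showed that another rollback segment was probably at fault: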

Cursor#2(1104c1ad8) state=BOUND curiob=11136c0d0
 curflg=5 fl2=0 par=1104c1a70 ses=700000505ce03e0
 sqltxt(70000050ebb91d8)=update undo$ set name=:2,file#=:3,block#=:4,status$=:5,user#=:6,undosqn=:7,xactsqn=:8,scnbas=:9,scnwrp=:10,inst#=:11,ts#=:12,spare1=:13 where us#=:1
  hash=9caba1288112094d5553173dd30cd6c3
  parent=7000004edfe6078 maxchild=01 plk=7000004f179cc38 ppn=n
cursor instantiation=11136c0d0 used=1435474913
 child#0(70000050ebb8fb0) pcs=7000004edfe5c88
  clk=7000004f1788cd0 ci=7000004edfe5370 pn=700000509fa4f00 ctx=7000004ecf7ff48
 kgsccflg=1 llk[11136c0d8,11136c0d8] idx=c4
 xscflg=e0100666 fl2=d100400 fl3=4022218c fl4=100
 Bind bytecodes
  Opcode = 5   Bind Rpi Scalar Sql In (not out) Nocopy
  Offsi = 48, Offsi = 0
  Opcode = 1   Unoptimized
  Offsi = 48, Offsi = 32
  Opcode = 1   Unoptimized
  Offsi = 48, Offsi = 64
  Opcode = 1   Unoptimized
  Offsi = 48, Offsi = 96
  Opcode = 1   Unoptimized
  Offsi = 48, Offsi = 128
  Opcode = 1   Unoptimized
  Offsi = 48, Offsi = 160
  Opcode = 1   Unoptimized
  Offsi = 48, Offsi = 192
  Opcode = 1   Unoptimized
  Offsi = 48, Offsi = 224
  Opcode = 1   Unoptimized
  Offsi = 48, Offsi = 256
  Opcode = 1   Unoptimized
  Offsi = 48, Offsi = 288
  Opcode = 1   Unoptimized
  Offsi = 48, Offsi = 320
  Opcode = 1   Unoptimized
  Offsi = 48, Offsi = 352
  Opcode = 1   Unoptimized
  Offsi = 48, Offsi = 384
kkscoacd
 Bind#0
  oacdty=01 mxl=32(10) mxlc=00 mal=00 scl=00 pre=00
  oacflg=18 fl2=0001 frm=01 csi=852 siz=32 off=0
  kxsbbbfp=70000050afdf68a  bln=32  avl=10  flg=09
  value="_SYSSMU61$"

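I masked this one as well and tried open resetlogs again, but it still failed. It turned out that this rollback segment could not be taken offline directly and the hidden parameter did not help, so I changed its status directly with BBED: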

BBED> p *kdbr[7]
rowdata[7302]
-------------
ub1 rowdata[7302]                           @7662     0x0c

BBED> x /1rncnnnnnnnnnnn
rowdata[7302]                               @7662
-------------
flag@7662: 0x0c (KDRHFL, KDRHFF)
lock@7663: 0x00
cols@7664:   17
hrid@7665:0x0040006a.3d

col    0[2] @7671: 61
col   1[10] @7674: _SYSSMU61$
col    2[2] @7685: 1
col    3[2] @7688: 200
col    4[4] @7691: 34489
col    5[6] @7696: 4196918701
col    6[3] @7703: 3364
col    7[5] @7707: 8202997
col    8[4] @7713: 23884
col    9[1] @7718: 0
col   10[2] @7720: 3
col   11[2] @7723: 1
col   12[0] @7726: *NULL*
col   13[0] @7727: *NULL*
col   14[0] @7728: *NULL*
col   15[0] @7729: *NULL*
col   16[2] @7730: 1
BBED> modify /x c103 offset 7721
 File: /crm/oradata02/rngc_system.dbf (1)
 Block: 110              Offsets: 7721 to 7726           Dba:0x0040006e
------------------------------------------------------------------------
 c10302c1 02ff 

 <32 bytes per line>

BBED> sum apply
Check value for File 1, Block 110:
current = 0x704c, required = 0x704c
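
To read the bytes written by `modify /x c103`, it helps to decode Oracle's on-disk NUMBER format. The following is a minimal sketch for small positive integers only. The function name `decode_positive_number` is hypothetical, and reading column 10 of this row as `status$` is an assumption that the article does not state.

```python
def decode_positive_number(raw: bytes) -> int:
    """Decode a positive integer stored in Oracle's NUMBER format.

    Byte 0 is the exponent byte: 0xC1 means one base-100 digit before the
    decimal point, 0xC2 means two, and so on. Every following byte holds
    one base-100 digit plus 1. Zero, negatives and fractions are rejected.
    """
    if not raw or raw[0] < 0xC1:
        raise ValueError("only positive integers are handled in this sketch")
    int_digits = raw[0] - 0xC0                 # base-100 digits before the point
    digits = [b - 1 for b in raw[1:]]
    if len(digits) > int_digits:
        raise ValueError("value has a fractional part")
    digits += [0] * (int_digits - len(digits))  # trailing zero digits are not stored
    value = 0
    for d in digits:
        value = value * 100 + d
    return value


before = decode_positive_number(bytes.fromhex("c104"))  # column 10 before the edit
after = decode_positive_number(bytes.fromhex("c103"))   # column 10 after the edit
```

Under these assumptions the edit changes column 10 from 3 to 2, and the `c102` that follows in the dump is column 11 with value 1, matching the row listing above.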

After the change, the database opened successfully.

not connected> alter database open resetlogs;

Database altered.

The database was open, but the strange part came next: after a shutdown it would not start again. The error was ORA-01555, a classic.

 ARC0: Becoming the 'no SRL' ARCH
Sun Jun 28 16:08:22 2015
ARC1: Becoming the heartbeat ARCH
Sun Jun 28 16:08:22 2015
SMON: enabling cache recovery
Sun Jun 28 16:08:22 2015
ORA-01555 caused by SQL statement below (SQL ID: 7bd391hat42zk, Query Duration=0 sec, SCN: 0x0d27.0a1ce29d):
Sun Jun 28 16:08:22 2015
select /*+ rule */ name,file#,block#,status$,user#,undosqn,xactsqn,scnbas,scnwrp,DECODE(inst#,0,NULL,inst#),ts#,spare1 from undo$ where us#=:1
Sun Jun 28 16:08:22 2015
Errors in file /oracle/app/oracle/admin/ibsscrm/udump/xxxx_ora_30212428.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 0 with name "SYSTEM" too small
Error 604 happened during db open, shutting down database
USER: terminating instance due to error 604
Instance terminated by USER, pid = 30212428
ORA-1092 signalled during: alter database open...

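The frustrating part is that the segment involved is the SYSTEM rollback segment. This is clearly related to the SCN as well. The platform is AIX, and trying to change the SCN with oradebug there proved rather laborious.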
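
The SCN in the ORA-01555 message is printed as `wrap.base` in hex, and BBED shows the same two parts as `kscnwrp` (ub2) and `kscnbas` (ub4). As a rough sketch, the two parts can be combined into one integer so that SCNs can be compared directly. The helper name `scn_from_hex` is hypothetical; the values are the query SCN from the alert log and the `ktbbhcsc` value from the block 106 header dump further down.

```python
def scn_from_hex(text: str) -> int:
    """Convert an SCN printed as 'wrap.base' in hex to a single integer.

    Assumes a 16-bit wrap and a 32-bit base, as the ub2/ub4 fields suggest.
    """
    wrap_hex, base_hex = text.split(".")
    return (int(wrap_hex, 16) << 32) | int(base_hex, 16)


query_scn = scn_from_hex("0x0d27.0a1ce29d")  # from the ORA-01555 line
block_scn = scn_from_hex("0x0d27.0a1ba8da")  # ktbbhcsc of block 106
query_is_later = query_scn > block_scn
```

This only makes the comparison explicit; it does not by itself explain why the consistent read failed.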

In the end I went back to BBED and modified the blocks again (the relevant blocks were once more located through the 10046 trace).

BBED> p ktbbh
struct ktbbh, 72 bytes                      @20
   ub1 ktbbhtyp                             @20       0x01 (KDDBTDATA)
   union ktbbhsid, 4 bytes                  @24
      ub4 ktbbhsg1                          @24       0x0000000f
      ub4 ktbbhod1                          @24       0x0000000f
   struct ktbbhcsc, 8 bytes                 @28
      ub4 kscnbas                           @28       0x0a1ba8da
      ub2 kscnwrp                           @32       0x0d27
   b2 ktbbhict                              @36       2
   ub1 ktbbhflg                             @38       0x02 (NONE)
   ub1 ktbbhfsl                             @39       0x00
   ub4 ktbbhfnx                             @40       0x00000000
   struct ktbbhitl[0], 24 bytes             @44
      struct ktbitxid, 8 bytes              @44
         ub2 kxidusn                        @44       0x0000
         ub2 kxidslt                        @46       0x002a
         ub4 kxidsqn                        @48       0x000004eb
      struct ktbituba, 8 bytes              @52
         ub4 kubadba                        @52       0x00400195
         ub2 kubaseq                        @56       0x0238
         ub1 kubarec                        @58       0x0b
      ub2 ktbitflg                          @60       0x0001 (NONE)
      union _ktbitun, 2 bytes               @62
         b2 _ktbitfsc                       @62       0
         ub2 _ktbitwrp                      @62       0x0000
      ub4 ktbitbas                          @64       0x00000000
   struct ktbbhitl[1], 24 bytes             @68
      struct ktbitxid, 8 bytes              @68
         ub2 kxidusn                        @68       0x0000
         ub2 kxidslt                        @70       0x0007
         ub4 kxidsqn                        @72       0x000004e5
      struct ktbituba, 8 bytes              @76
         ub4 kubadba                        @76       0x00400017
         ub2 kubaseq                        @80       0x0235
         ub1 kubarec                        @82       0x11
      ub2 ktbitflg                          @84       0x8000 (KTBFCOM)
      union _ktbitun, 2 bytes               @86
         b2 _ktbitfsc                       @86       3367
         ub2 _ktbitwrp                      @86       0x0d27
      ub4 ktbitbas                          @88       0x0a1ba8d9

BBED> d /v offset 60 count 2
 File: /crm/oradata02/rngc_system.dbf (1)
 Block: 106     Offsets:   60 to   61  Dba:0x0040006a
-------------------------------------------------------
 0001                                l ..

 <16 bytes per line>

BBED> modify /x 00 offset 61
 File: /crm/oradata02/rngc_system.dbf (1)
 Block: 106              Offsets:   61 to   62           Dba:0x0040006a
------------------------------------------------------------------------
 0000 

 <32 bytes per line>

BBED> sum apply
Check value for File 1, Block 106:
current = 0x3972, required = 0x3972

BBED> verify
DBVERIFY - Verification starting
FILE = /crm/oradata02/rngc_system.dbf
BLOCK = 106

Block Checking: DBA = 4194410, Block Type = KTB-managed data block
data header at 0x1101fb05c
kdbchk: row locked by non-existent transaction
        table=0   slot=124
        lockid=1   ktbbhitc=2
Block 106 failed with check code 6101

DBVERIFY - Verification complete

Total Blocks Examined         : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing   (Data) : 1
Total Blocks Processed (Index): 0
Total Blocks Failing   (Index): 0
Total Blocks Empty            : 0
Total Blocks Marked Corrupt   : 0
Total Blocks Influx           : 0
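
The verify failure is easier to follow once the edited `ktbitflg` value is split into its parts. The sketch below assumes the commonly described layout, which the article does not spell out: the top four bits are the C/B/U/T flags (BBED itself labels 0x8000 as KTBFCOM) and the low 12 bits are the number of rows locked by that ITL entry. The name `decode_ktbitflg` is hypothetical.

```python
from typing import Tuple

# Assumed flag bits of ktbitflg, in the order a block dump prints them.
ITL_FLAG_BITS = ((0x8000, "C"), (0x4000, "B"), (0x2000, "U"), (0x1000, "T"))


def decode_ktbitflg(flag: int) -> Tuple[str, int]:
    """Split ktbitflg into a 'CBUT'-style flag string and a lock count."""
    letters = "".join(ch if flag & bit else "-" for bit, ch in ITL_FLAG_BITS)
    return letters, flag & 0x0FFF


original = decode_ktbitflg(0x0001)   # ITL 1 of block 106 before the edit
edited = decode_ktbitflg(0x0000)     # after "modify /x 00 offset 61"
committed = decode_ktbitflg(0x8000)  # value written to block 110 later
```

With that reading, the edit leaves ITL 1 claiming no locked rows while row slot 124 still carries lock byte 1, which is consistent with the kdbchk message and with the next step, where the row's lock byte is zeroed as well.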

BBED> p *kdbr[124]
rowdata[65]
-----------
ub1 rowdata[65]                             @431      0x2c

BBED> x /1rncnnnnnnnnnnn
rowdata[65]                                 @431
-----------
flag@431:  0x2c (KDRHFL, KDRHFF, KDRHFH)
lock@432:  0x01
cols@433:    17

col    0[3] @434: 124
col   1[11] @438: _SYSSMU124$
col    2[2] @450: 1
col    3[3] @453: 208
col    4[3] @457: 1881
col    5[6] @461: 4246102093
col    6[3] @468: 3364
col    7[5] @472: 2167495
col    8[4] @478: 60563
col    9[1] @483: 0
col   10[2] @485: 3
col   11[2] @488: 1
col   12[0] @491: *NULL*
col   13[0] @492: *NULL*
col   14[0] @493: *NULL*
col   15[0] @494: *NULL*
col   16[2] @495: 1 

BBED> d /v offset 432 count 2
 File: /crm/oradata02/rngc_system.dbf (1)
 Block: 106     Offsets:  432 to  433  Dba:0x0040006a
-------------------------------------------------------
 0111                                l ..

 <16 bytes per line>

BBED> modify /x 00 offset 432
 File: /crm/oradata02/rngc_system.dbf (1)
 Block: 106              Offsets:  432 to  433           Dba:0x0040006a
------------------------------------------------------------------------
 0011 

 <32 bytes per line>

BBED> sum apply
Check value for File 1, Block 106:
current = 0x3872, required = 0x3872

BBED> verify
DBVERIFY - Verification starting
FILE = /crm/oradata02/rngc_system.dbf
BLOCK = 106

DBVERIFY - Verification complete

Total Blocks Examined         : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing   (Data) : 0
Total Blocks Processed (Index): 0
Total Blocks Failing   (Index): 0
Total Blocks Empty            : 0
Total Blocks Marked Corrupt   : 0
Total Blocks Influx           : 0
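
BBED prints the block address as `Dba:0x0040006a`, while DBVERIFY prints `DBA = 4194410`. Both are the same 32-bit data block address. A small sketch, assuming the usual split of 10 bits for the relative file number and 22 bits for the block number (the helper name `decode_dba` is hypothetical):

```python
from typing import Tuple


def decode_dba(dba: int) -> Tuple[int, int]:
    """Split a 32-bit data block address into (relative file#, block#)."""
    return dba >> 22, dba & 0x3FFFFF


block_106 = decode_dba(0x0040006A)  # Dba shown by BBED for block 106
same_block = decode_dba(4194410)    # decimal DBA shown by DBVERIFY
block_110 = decode_dba(0x0040006E)  # Dba shown by BBED for block 110
undo_block = decode_dba(0x00400195) # kubadba from the ITL entries
```

The same helper applied to `kubadba` shows which undo block an ITL entry points at.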

BBED> set file 1 block 110
        FILE#           1
        BLOCK#          110

BBED> p ktbbh
struct ktbbh, 48 bytes                      @20
   ub1 ktbbhtyp                             @20       0x01 (KDDBTDATA)
   union ktbbhsid, 4 bytes                  @24
      ub4 ktbbhsg1                          @24       0x0000000f
      ub4 ktbbhod1                          @24       0x0000000f
   struct ktbbhcsc, 8 bytes                 @28
      ub4 kscnbas                           @28       0x0a1ba9bf
      ub2 kscnwrp                           @32       0x0d27
   b2 ktbbhict                              @36       1
   ub1 ktbbhflg                             @38       0x02 (NONE)
   ub1 ktbbhfsl                             @39       0x00
   ub4 ktbbhfnx                             @40       0x00000000
   struct ktbbhitl[0], 24 bytes             @44
      struct ktbitxid, 8 bytes              @44
         ub2 kxidusn                        @44       0x0000
         ub2 kxidslt                        @46       0x0044
         ub4 kxidsqn                        @48       0x000004eb
      struct ktbituba, 8 bytes              @52
         ub4 kubadba                        @52       0x00400195
         ub2 kubaseq                        @56       0x0238
         ub1 kubarec                        @58       0x1d
      ub2 ktbitflg                          @60       0x0001 (NONE)
      union _ktbitun, 2 bytes               @62
         b2 _ktbitfsc                       @62       0
         ub2 _ktbitwrp                      @62       0x0000
      ub4 ktbitbas                          @64       0x00000000

BBED> d /v offset 60 count 2
 File: /crm/oradata02/rngc_system.dbf (1)
 Block: 110     Offsets:   60 to   61  Dba:0x0040006e
-------------------------------------------------------
 0001                                l ..

 <16 bytes per line>

BBED> modify /x 8000
 File: /crm/oradata02/rngc_system.dbf (1)
 Block: 110              Offsets:   60 to   61           Dba:0x0040006e
------------------------------------------------------------------------
 8000 

 <32 bytes per line>

BBED> sum apply
Check value for File 1, Block 110:
current = 0xefb1, required = 0xefb1

BBED> verify
DBVERIFY - Verification starting
FILE = /crm/oradata02/rngc_system.dbf
BLOCK = 110

Block Checking: DBA = 4194414, Block Type = KTB-managed data block
data header at 0x11021d044
kdbchk: row locked by non-existent transaction
        table=0   slot=8
        lockid=1   ktbbhitc=1
Block 110 failed with check code 6101

DBVERIFY - Verification complete

Total Blocks Examined         : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing   (Data) : 1
Total Blocks Processed (Index): 0
Total Blocks Failing   (Index): 0
Total Blocks Empty            : 0
Total Blocks Marked Corrupt   : 0
Total Blocks Influx           : 0

BBED> p *kdbr[8]
rowdata[7231]
-------------
ub1 rowdata[7231]                           @7591     0x0c

BBED> x /1rncnnnnnnnnnnn
rowdata[7231]                               @7591
-------------
flag@7591: 0x0c (KDRHFL, KDRHFF)
lock@7592: 0x01
cols@7593:   17
hrid@7594:0x0040006b.7

col    0[3] @7600: 132
col   1[11] @7604: _SYSSMU132$
col    2[2] @7616: 1
col    3[2] @7619: 9
col    4[2] @7622: 89
col    5[6] @7625: 4246102099
col    6[3] @7632: 3364
col    7[5] @7636: 2064336
col    8[4] @7642: 55781
col    9[1] @7647: 0
col   10[2] @7649: 3
col   11[2] @7652: 1
col   12[0] @7655: *NULL*
col   13[0] @7656: *NULL*
col   14[0] @7657: *NULL*
col   15[0] @7658: *NULL*
col   16[2] @7659: 1 

BBED> d /v offset 7592 count 2
 File: /crm/oradata02/rngc_system.dbf (1)
 Block: 110     Offsets: 7592 to 7593  Dba:0x0040006e
-------------------------------------------------------
 0111                                l ..

 <16 bytes per line>

BBED> modify /x 00 offset 7592
 File: /crm/oradata02/rngc_system.dbf (1)
 Block: 110              Offsets: 7592 to 7593           Dba:0x0040006e
------------------------------------------------------------------------
 0011 

 <32 bytes per line>

BBED> sum apply
Check value for File 1, Block 110:
current = 0xeeb1, required = 0xeeb1

BBED> verify
DBVERIFY - Verification starting
FILE = /crm/oradata02/rngc_system.dbf
BLOCK = 110

DBVERIFY - Verification complete

Total Blocks Examined         : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing   (Data) : 0
Total Blocks Processed (Index): 0
Total Blocks Failing   (Index): 0
Total Blocks Empty            : 0
Total Blocks Marked Corrupt   : 0
Total Blocks Influx           : 0

BBED>

Finally I opened the database again, and everything went smoothly.

not connected> startup mount
ORACLE instance started.

Total System Global Area 2.1475E+10 bytes
Fixed Size                  2122472 bytes
Variable Size            6425677080 bytes
Database Buffers         1.5032E+10 bytes
Redo Buffers               14651392 bytes
Database mounted.
not connected> alter database open;

Database altered.

This is a test environment, so I could experiment freely. Please do not do this on a production database.


Oracle Recover Case: 50TB ASM crash case


Unless marked as a repost, all articles on this site are original. Reposted from love wife & love life, Roger's Oracle technical blog

Permalink: Oracle Recover Case: 50TB ASM crash case

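A customer's 50 TB ASM disk group failed, and after a joint rescue effort it was brought back to normal. This is a brief record of the case. In the end it turned out to be much simpler than I had expected. The details of the failure follow.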

-- db alert log

Mon Jul  6 15:17:43 2015
Errors in file /oraclehome/admin/xxxx/udump/xxxx_ora_262606.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: -1
Additional information: 1048576
ORA-19502: write error on file "+DG_xxxx/xxxx/datafile/xxxx_load_1.679.847025081", blockno 3136352 (blocksize=32768)
ORA-27063: number of bytes read/written is incorrect
......
......
Errors in file /oraclehome/admin/xxxx/bdump/xxxx_lgwr_185246.trc:
ORA-00340: IO error processing online log 9 of thread 1
ORA-00345: redo log write error block 394389 count 2
ORA-00312: online log 9 thread 1: '+DG_xxxx/xxxx/onlinelog/group_9.433.789664647'
ORA-15078: ASM diskgroup was forcibly dismounted
Mon Jul  6 15:18:46 2015
LGWR: terminating instance due to error 340
Instance terminated by LGWR, pid = 185246

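The db alert log shows an I/O error that left the LGWR process unable to write the redo log, so LGWR terminated the database instance. The obvious question is why LGWR could not write the log. The ASM alert log shows the following: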

Thu Jul  3 08:24:26 2014
NOTE: ASMB process exiting due to lack of ASM file activity
Mon Jul  6 15:18:44 2015
Errors in file /oraclehome/product/10.2.0/admin/+ASM/udump/+asm_ora_353008.trc:
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 7
Additional information: -939091968
Additional information: -1
Mon Jul  6 15:18:46 2015
NOTE: cache initiating offline of disk 32  group 1
WARNING: offlining disk 32.1115675731 (DG_xxxx_0032) with mask 0x3
NOTE: PST update: grp = 1, dsk = 32, mode = 0x6
Mon Jul  6 15:18:46 2015
ERROR: too many offline disks in PST (grp 1)
Mon Jul  6 15:18:46 2015
ERROR: PST-initiated MANDATORY DISMOUNT of group DG_xxxx

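From this it is clear that disk 32 in the ASM disk group hit an I/O problem, which forced the disk group to dismount and in turn crashed the database instance. When the customer then tried to mount the disk group manually, it failed with errors like these: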

SQL> alter diskgroup datadg mount
Mon Jul  6 15:33:50 2015
Errors in file /oraclehome/product/10.2.0/admin/+ASM/bdump/+asm_dbw0_275092.trc:
ORA-15066: offlining disk "DG_xxxx_0032" may result in a data loss
ORA-15066: offlining disk "DG_xxxx_0032" may result in a data loss
ORA-15066: offlining disk "DG_xxxx_0032" may result in a data loss
ORA-15066: offlining disk "DG_xxxx_0032" may result in a data loss
......
......
ORA-15066: offlining disk "DG_xxxx_0032" may result in a data loss
OR
Mon Jul  6 15:33:51 2015
Errors in file /oraclehome/product/10.2.0/admin/+ASM/bdump/+asm_b000_360654.trc:
ORA-00600: internal error code, arguments: [kfcDismount02], [], [], [], [], [], [], []
Mon Jul  6 15:33:52 2015
NOTE: cache dismounting group 1/0xDDDF2CC7 (DG_xxxx)
NOTE: dbwr not being msg'd to dismount
Mon Jul  6 15:33:52 2015

This error is very likely a bug. After the corresponding patch was installed, the mount was tried again and still failed, although the error had changed:

SQL> alter diskgroup dg_xxxx mount
Tue Jul  7 05:49:29 2015
NOTE: cache registered group DG_xxxx number=1 incarn=0x72661a1f
Tue Jul  7 05:49:31 2015
NOTE: Hbeat: instance first (grp 1)
Tue Jul  7 05:49:36 2015
NOTE: start heartbeating (grp 1)
NOTE: cache opening disk 0 of grp 1: DG_xxxx_0000 path:/dev/rhdiskpower41
NOTE: cache opening disk 1 of grp 1: DG_xxxx_0001 path:/dev/rhdiskpower42
NOTE: cache opening disk 2 of grp 1: DG_xxxx_0002 path:/dev/rhdiskpower43
NOTE: cache opening disk 3 of grp 1: DG_xxxx_0003 path:/dev/rhdiskpower44
NOTE: cache opening disk 4 of grp 1: DG_xxxx_0004 path:/dev/rhdiskpower45
......
......
NOTE: cache opening disk 33 of grp 1: DG_xxxx_0033 path:/dev/rhdiskpower15
NOTE: cache opening disk 34 of grp 1: DG_xxxx_0034 path:/dev/rhdiskpower14
NOTE: cache opening disk 35 of grp 1: DG_xxxx_0035 path:/dev/rhdiskpower13
NOTE: cache mounting (first) group 1/0x72661A1F (DG_xxxx)
NOTE: starting recovery of thread=1 ckpt=6295.7329 group=1
NOTE: crash recovery signalled OER-15131
ERROR: ORA-15131 signalled during mount of diskgroup DG_xxxx
NOTE: cache dismounting group 1/0x72661A1F (DG_xxxx)
ERROR: diskgroup DG_xxxx was not mounted
Tue Jul  7 05:50:10 2015
Shutting down instance: further logons disabled

As shown, Oracle ASM has to perform crash recovery during mount, starting from checkpoint 6295.7329. The trace shows the location that is read for that checkpoint:

NOTE: starting recovery of thread=1 ckpt=6295.7329 group=1
CE: (0x7000000107c9640)  group=1 (DG_xxxx) obj=625  blk=256 (indirect)
    hashFlags=0x0100  lid=0x0002  lruFlags=0x0000  bastCount=1
    redundancy=0x11  fileExtent=0  AUindex=0 blockIndex=0
    copy #0:  disk=32  au=1638611
BH: (0x70000001079c360)  bnum=143 type=rcv reading state=rcvRead chgSt=not modifying
    flags=0x00000000  pinmode=excl  lockmode=null  bf=0x70000001048e000
    kfbh_kfcbh.fcn_kfbh = 0.0  lowAba=0.0  highAba=0.0
    last kfcbInitSlot return code=null cpkt lnk is null
*** 2015-07-07 05:26:12.382

Oracle needs to read AU 1638611 on disk 32. The default AU size in 10g is 1 MB, so this location is roughly 1.6 TB into the disk. The checkpoint position itself is easy to find; kfed can read it directly:

[xxxx:/oraclehome]$ kfed read /xxx/rhdiskpower13 aun=3 blkn=0|more
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            7 ; 0x002: KFBTYP_ACDC
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj:                       3 ; 0x008: TYPE=0x0 NUMB=0x3
kfbh.check:                  1350450563 ; 0x00c: 0x507e3d83
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfracdc.eyec[0]:                     65 ; 0x000: 0x41
kfracdc.eyec[1]:                     67 ; 0x001: 0x43
kfracdc.eyec[2]:                     68 ; 0x002: 0x44
kfracdc.eyec[3]:                     67 ; 0x003: 0x43
kfracdc.thread:                       1 ; 0x004: 0x00000001
kfracdc.lastAba.seq:         4294967295 ; 0x008: 0xffffffff
kfracdc.lastAba.blk:         4294967295 ; 0x00c: 0xffffffff
kfracdc.blk0:                         1 ; 0x010: 0x00000001
kfracdc.blks:                     10751 ; 0x014: 0x000029ff
kfracdc.ckpt.seq:                  6295 ; 0x018: 0x00001897    --- checkpoint value
kfracdc.ckpt.blk:                  7329 ; 0x01c: 0x00001ca1
kfracdc.fcn.base:             297751371 ; 0x020: 0x11bf534b
kfracdc.fcn.wrap:                     0 ; 0x024: 0x00000000
kfracdc.bufBlks:                     64 ; 0x028: 0x00000040
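
When many disks have to be checked, the `field: value ; offset: detail` lines that kfed prints can be parsed mechanically. The following is only a sketch: the regular expression is fitted to the output format shown above, and the names `KFED_LINE` and `parse_kfed` are hypothetical.

```python
import re
from typing import Dict

# Matches lines such as "kfracdc.ckpt.seq:   6295 ; 0x018: 0x00001897".
KFED_LINE = re.compile(
    r"^(?P<field>[\w.\[\]]+):\s+(?P<value>-?\d+)\s+;\s+0x[0-9a-fA-F]+:"
)


def parse_kfed(text: str) -> Dict[str, int]:
    """Return {field name: decimal value} for every recognizable kfed line."""
    fields = {}
    for line in text.splitlines():
        match = KFED_LINE.match(line.strip())
        if match:
            fields[match.group("field")] = int(match.group("value"))
    return fields


sample = """\
kfbh.type:                            7 ; 0x002: KFBTYP_ACDC
kfracdc.ckpt.seq:                  6295 ; 0x018: 0x00001897
kfracdc.ckpt.blk:                  7329 ; 0x01c: 0x00001ca1
"""
fields = parse_kfed(sample)
ckpt = f"{fields['kfracdc.ckpt.seq']}.{fields['kfracdc.ckpt.blk']}"
```

The resulting `ckpt` string has the same `seq.blk` form as the `ckpt=` value in the ASM alert log, so the two can be compared directly.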

After various further attempts by the customer, mounting the disk group still failed with the following error:

Tue Jul  7 18:03:03 2015
Errors in file /oraclehome/product/10.2.0/admin/+ASM/udump/+asm_ora_438636.trc:
ORA-00600: internal error code, arguments: [kfcChkAio01], [], [], [], [], [], [], []
ORA-15196: invalid ASM block header [kfc.c:5553] [blk_kfbl] [625] [2147483904] [2147483905 != 2147483904]
NOTE: crash recovery signalled OER-600
ERROR: ORA-600 signalled during mount of diskgroup DG_xxxx
NOTE: cache dismounting group 1/0xE70AB6D0 (DG_xxxx)
ERROR: diskgroup DG_xxxx was not mounted
Tue Jul  7 18:05:38 2015
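
The large block numbers in the ORA-15196 arguments become readable once the high bit is separated out. The sketch below assumes that bit 31 of the block field is the type flag that kfed prints as `T=` and that the remaining bits are the block number printed as `NUMB=`. This is an assumption, although it agrees with the `blk=256 (indirect)` line in the earlier trace. The name `decode_kfbh_blk` is hypothetical.

```python
from typing import Tuple

INDIRECT_BIT = 1 << 31  # assumed position of the T= flag


def decode_kfbh_blk(raw: int) -> Tuple[bool, int]:
    """Split a raw block field into (indirect flag, block number)."""
    return bool(raw & INDIRECT_BIT), raw & (INDIRECT_BIT - 1)


first = decode_kfbh_blk(2147483904)   # block argument of ORA-15196
second = decode_kfbh_blk(2147483905)  # the other value in the "!=" pair
```

Under that assumption the two values in the last bracket are blocks 256 and 257 of file 625, both flagged as indirect.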

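This matches the Oracle MOS note "Understanding and Fixing Errors ORA-600 [kfcChkAio01] and ORA-15196" (Doc ID 757529.1) exactly, so we advised the customer to follow that note. The fix itself is simple: the note provides a shell script, and only the block number in it needs to be changed. After that, the disk group mounted successfully: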

Tue Jul  7 18:05:38 2015
SQL> alter diskgroup dg_xxxx mount
Tue Jul  7 18:05:38 2015
NOTE: cache registered group DG_xxxx number=1 incarn=0xce0ab6d3
Tue Jul  7 18:05:38 2015
NOTE: Hbeat: instance first (grp 1)
Tue Jul  7 18:05:43 2015
NOTE: start heartbeating (grp 1)
......
......
NOTE: cache mounting (first) group 1/0xCE0AB6D3 (DG_xxxx)
NOTE: starting recovery of thread=1 ckpt=6295.7329 group=1
NOTE: advancing ckpt for thread=1 ckpt=6295.8649
NOTE: cache recovered group 1 to fcn 0.297779775
Tue Jul  7 18:05:43 2015
NOTE: opening chunk 1 at fcn 0.297779775 ABA
NOTE: seq=6296 blk=8650
Tue Jul  7 18:05:43 2015
NOTE: cache mounting group 1/0xCE0AB6D3 (DG_xxxx) succeeded
SUCCESS: diskgroup DG_xxxx was mounted
Tue Jul  7 18:05:45 2015
NOTE: recovering COD for group 1/0xce0ab6d3 (DG_xxxx)
SUCCESS: completed COD recovery for group 1/0xce0ab6d3 (DG_xxxx)

Finally, let us go back and explain why this happened. The root cause is that the customer did not follow proper procedure when adding disks earlier:

Tue May 20 15:43:26 2014
SQL> alter diskgroup dg_xxxx add disk '/xxx/rhdiskpower24' size 1677721M,......
.'/xxx/rhdiskpower16' size 1677721M, '/xxx/rhdiskpower15' size 167772
1M, '/xxx/rhdiskpower14' size 1677721M, '/xxx/rhdiskpower13' size 1677721M  rebalance power 8
Wed May 21 08:45:13 2014
Starting background process ASMB
ASMB started with pid=13, OS id=267028
Wed May 21 08:45:14 2014
NOTE: ASMB process exiting due to lack of ASM file activity
Wed May 21 12:24:34 2014
NOTE: stopping process ARB5
NOTE: stopping process ARB2
NOTE: stopping process ARB6
NOTE: stopping process ARB1
NOTE: stopping process ARB3
NOTE: stopping process ARB7
NOTE: stopping process ARB4
NOTE: stopping process ARB0
Wed May 21 12:24:38 2014
SUCCESS: rebalance completed for group 1/0x595ad46e (DG_xxxx)
Wed May 21 12:24:38 2014
SUCCESS: rebalance completed for group 1/0x595ad46e (DG_xxxx)

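The disk that failed is disk 32. It was added with a size of 1677721M, but our check showed that its physical size is only 1638400M.
In other words, a value larger than the real size was specified when the disk was added, so Oracle believed the disk was that large when it was not. This also shows that Oracle 10g does not validate ASM disk sizes strictly enough.
So the problem is clear: the failing AU number 1638611 is greater than 1638400, which is a location that does not exist on the disk, and so ASM crashed.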
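
The arithmetic behind this conclusion can be written down as a short check. This is a sketch that assumes the 1 MB AU size mentioned earlier, so that AU numbers and sizes in M are directly comparable; the helper name `au_within_disk` is hypothetical.

```python
AU_SIZE_MB = 1  # default AU size in 10g, as stated in the article


def au_within_disk(au: int, disk_size_mb: int, au_size_mb: int = AU_SIZE_MB) -> bool:
    """True if the whole AU lies inside a disk of the given size (AUs count from 0)."""
    return (au + 1) * au_size_mb <= disk_size_mb


failing_au = 1638611
declared_mb = 1677721   # size given in "alter diskgroup ... add disk"
physical_mb = 1638400   # size the disk really has

fits_declared = au_within_disk(failing_au, declared_mb)
fits_physical = au_within_disk(failing_au, physical_mb)
offset_tib = failing_au * AU_SIZE_MB * 1024 * 1024 / 2**40  # byte offset in TiB
```

The AU fits inside the declared size, which is why ASM allocated it, but not inside the physical size, which is why the I/O failed.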

 

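Note: the customer's ASM disk group uses 36 disks of 1.6 TB each, about 53 TB in total, and it was almost completely full. Fortunately a simple repair was enough; otherwise the difficulty and workload of the recovery would have been enormous. Undeniably, Enmotech (云和恩墨) remains the strongest recovery company in China, bar none!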

 


Tests and case analysis of enq: TX - row lock contention


Unless marked as a repost, all articles on this site are original. Reposted from love wife & love life, Roger's Oracle technical blog

Permalink: Tests and case analysis of enq: TX - row lock contention

1. Primary key or unique index

---session 1
SQL> select sid from v$mystat where rownum=1;

       SID
----------
       130
SQL> create table t1_tx(id number primary key,name varchar2(20));

Table created.

SQL> insert into t1_tx values(1,'roger');

1 row created.

SQL> commit;

Commit complete.

SQL> insert into t1_tx values(2,'xxoo');

1 row created.

---session 2
SQL> conn roger/roger
Connected.
SQL> select sid from v$mystat where rownum=1;

       SID
----------
        69

SQL> insert into t1_tx values(2,'xxoo'); --- waits indefinitely

---session 3

SQL> select sid,
  2         chr(bitand(p1, -16777216) / 16777215) ||
  3         chr(bitand(p1, 16711680) / 65535) "Name",
  4         (bitand(p1, 65535)) "Mode",event,sql_id,FINAL_BLOCKING_SESSION
  5           from v$session
  6   where event like 'enq%';

       SID Name  Mode EVENT                          SQL_ID        FINAL_BLOCKING_SESSION
---------- ---- ----- ------------------------------ ------------- ----------------------
        69 TX       4 enq: TX - row lock contention  b775wqk86zc6k                    130

SQL> select sid,serial#,username,sql_id from v$session where sid=130;

       SID    SERIAL# USERNAME                       SQL_ID
---------- ---------- ------------------------------ -------------
       130        185 ROGER

SQL> select * from v$Lock where block=1;

ADDR     KADDR           SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
-------- -------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
2A4F2AAC 2A4F2AEC        130 TX     196612        895          6          0        736          1

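As shown, when a table has a primary key or unique index and one session modifies a key value without committing, any other session that works on the same key value has to wait. The waiter requests the TX lock in mode 4, while the blocker holds it in mode 6.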
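
The "Name" and "Mode" columns in the session 3 query come from bit fields packed into P1 of the wait event, and the ID1 of a TX lock in v$lock is packed in a similar way. The sketch below decodes both with plain shifts and masks. The function names are hypothetical, the P1 value is constructed for illustration because the article does not print the raw P1, and the ID1 layout (undo segment number in the high 16 bits, slot in the low 16 bits) is the commonly documented one.

```python
from typing import Tuple


def decode_enqueue_p1(p1: int) -> Tuple[str, int]:
    """Return (lock name, requested mode) from P1 of an enqueue wait."""
    name = chr((p1 >> 24) & 0xFF) + chr((p1 >> 16) & 0xFF)
    return name, p1 & 0xFFFF


def decode_tx_id1(id1: int) -> Tuple[int, int]:
    """Return (undo segment number, transaction slot) from ID1 of a TX lock."""
    return id1 >> 16, id1 & 0xFFFF


# Illustrative P1 for a TX request in mode 4 (not taken from the article).
example_p1 = (ord("T") << 24) | (ord("X") << 16) | 4
waiter = decode_enqueue_p1(example_p1)

# ID1 of the blocker's TX lock as shown in the v$lock output above.
blocker_usn, blocker_slot = decode_tx_id1(196612)
```

The ID2 column (895 above) would then be the sequence number of the blocking transaction, which together with the two decoded values identifies it.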

2. Bitmap index

--session 1

SQL> select * from t1_tx;

        ID NAME
---------- --------------------
         1 roger
         2 roger
         3 aa
         4 aa

SQL> create bitmap index idx_bitmap_name on t1_tx(name);

Index created.

SQL> select sid from v$mystat where rownum=1;

       SID
----------
       130                                   

SQL> update t1_tx set name='tx' where id=3;  

1 row updated.                               

SQL>

---session 2
SQL> select sid from v$mystat where rownum=1;                                           

       SID
----------
        69                                     

SQL> update t1_tx set name='bitmap' where id=4;  --- waits indefinitely

---session 3

SQL>  select sid,
  2         chr(bitand(p1, -16777216) / 16777215) ||
  3         chr(bitand(p1, 16711680) / 65535) "Name",
  4         (bitand(p1, 65535)) "Mode",event,sql_id,FINAL_BLOCKING_SESSION
  5           from v$session
  6   where event like 'enq%';                                                                                                   

       SID Name       Mode EVENT                           SQL_ID        FINAL_BLOCKING_SESSION
---------- ---- ---------- ------------------------------- ------------- ----------------------
        69 TX            4 enq: TX - row lock contention   7wanaturqndn1                    130

SQL>
SQL> set lines 200 pagesize 200
SQL> select * from table(dbms_xplan.display_cursor('&sql_id', NULL, 'ALL'));
Enter value for sql_id: 7wanaturqndn1
old   1: select * from table(dbms_xplan.display_cursor('&sql_id', NULL, 'ALL'))
new   1: select * from table(dbms_xplan.display_cursor('7wanaturqndn1', NULL, 'ALL'))

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  7wanaturqndn1, child number 0
-------------------------------------
update t1_tx set name='bitmap' where id=4

Plan hash value: 1842098942

-----------------------------------------------------------------------------------
| Id  | Operation          | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT   |              |       |       |     1 (100)|          |
|   1 |  UPDATE            | T1_TX        |       |       |            |          |
|*  2 |   INDEX UNIQUE SCAN| SYS_C0010951 |     1 |    25 |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

SQL> select * from v$Lock where block=1;

ADDR     KADDR           SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
-------- -------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
2A4F3100 2A4F3140        130 TX     262144        563          6          0        209          1

SQL> l
  1* select * from v$Lock where block=1
SQL> /

ADDR     KADDR           SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
-------- -------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
2A4F3100 2A4F3140        130 TX     262144        563          6          0        215          1

SQL> select sid,serial#,username,sql_id,event from v$session where sid=130;

       SID    SERIAL# USERNAME                       SQL_ID        EVENT
---------- ---------- ------------------------------ ------------- -----------------------------------------------------------------
       130        185 ROGER                                        SQL*Net message from client

SQL> select owner,index_name,index_type from dba_indexes where table_name='T1_TX';

OWNER                          INDEX_NAME                     INDEX_TYPE
------------------------------ ------------------------------ ---------------------------
ROGER                          IDX_BITMAP_NAME                BITMAP
ROGER                          SYS_C0010951                   NORMAL

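As shown, when a table has a bitmap index, concurrent updates from several sessions run into TX waits whenever they touch the same bitmap index entry, as here, where both updated rows start with name='aa'.
In this case the waiter requests the TX lock in mode 4 and the blocker holds it in mode 6. Note also that the v$session view does not show a sql_id for the blocker session.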
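
Why two updates on different rows collide can be illustrated with a toy model. This is a deliberate simplification and not how Oracle stores bitmaps: it treats each distinct key value as a single lockable index entry, whereas a real bitmap index entry covers a range of rowids and one key can have several entries. All names are hypothetical.

```python
from typing import Dict, Set

# Table contents before the two updates, as listed in session 1.
rows: Dict[int, str] = {1: "roger", 2: "roger", 3: "aa", 4: "aa"}


def entries_touched(table: Dict[int, str], row_id: int, new_value: str) -> Set[str]:
    """Index entries an update must lock in this toy model: old key and new key."""
    return {table[row_id], new_value}


session1 = entries_touched(rows, 3, "tx")      # update ... where id=3
session2 = entries_touched(rows, 4, "bitmap")  # update ... where id=4
shared = session1 & session2                   # entries both sessions need
```

Both sessions need the entry for 'aa', so the second one waits until the first commits or rolls back. In the same model, an update on row 1 or 2 would not overlap with session 2.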

3. Rows in the same block

---session 1
SQL>  select dbms_rowid.rowid_object(rowid) obj#,
  2   dbms_rowid.rowid_relative_fno(rowid) rfile#,
  3   dbms_rowid.rowid_block_number(rowid) block#,
  4   dbms_rowid.rowid_row_number(rowid) row#
  5   from t1_tx order by 4;

      OBJ#     RFILE#     BLOCK#       ROW#
---------- ---------- ---------- ----------
     74762          4      30141          0
     74762          4      30141          1
     74762          4      30141          2
     74762          4      30141          3

SQL> update t1_tx set name='enmotech' where id=2; 

1 row updated.

SQL> commit;

Commit complete.

---session 2

SQL>  update t1_tx set name='zhenxu' where id=4;

1 row updated.

SQL> commit;

Commit complete.

SQL> 

Even when I run the updates 1,000,000 times in each of two sessions, no TX lock wait appears:

--session 1
SQL> declare
  2     c number;
  3   begin
  4     for i in 1 .. 1000000 loop
  5       update  t1_tx set name='shit1' where id=2;
  6     end loop;
  7   end;
  8   /

PL/SQL procedure successfully completed.

Elapsed: 00:00:26.58
SQL> 

---session 2
SQL>  declare
  2     c number;
  3   begin
  4     for i in 1 .. 1000000 loop
  5       update  t1_tx set name='t-shit' where id=3;
  6     end loop;
  7   end;
  8   /

PL/SQL procedure successfully completed.

--session 3
SQL> select inst_id,event,count(1) from gv$session where wait_class#<>6 group by inst_id,event order by 1,3; 

   INST_ID EVENT                                                               COUNT(1)
---------- ----------------------------------------------------------------- ----------
         1 asynch descriptor resize                                                   1
         1 Log archive I/O                                                            1
         1 buffer busy waits                                                          2

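As we can see, different sessions updating different rows in the same block do not wait on TX locks (the only contention in the output above is buffer busy waits). If they updated the same row without committing, a wait would of course occur; I will not go over that again here.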

4. Foreign keys

SQL> create table t1 (id number ,name varchar2(20),product_id number);

Table created.

SQL> create table t2 (id number primary key,name varchar2(20));

Table created.

SQL> alter table t1  add constraint FK_PRODUCTID foreign key (PRODUCT_id)  references t2 (ID);

Table altered.

SQL>
SQL> select index_name,table_name from user_indexes where table_name='T1';

no rows selected

SQL>
SQL> insert into t2 values(1,'aa');

1 row created.

SQL> insert into t2 values(2,'dd');

1 row created.

SQL> insert into t2 values(3,'cc');

1 row created.

SQL> commit;

Commit complete.

SQL> insert into t2 values(5,'cc');

1 row created.

SQL> 

---session 2
SQL> insert into t1 values(1,'xx',5);   -- the insert into the child table hangs

 

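In fact, this happens whether or not the child table has a primary key constraint, as long as the operation on the parent table has not been committed.

There is also a more unusual case that produces the same wait. In principle it is still a primary key/foreign key issue, as the following test shows: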

---session 1
SQL> conn roger/roger
Connected.
SQL> create table t3_ref (id number primary key,name varchar2(20),obj_id NUMBER);

Table created.

SQL> alter table  t3_ref  add constraint fk_id foreign key (obj_id)  references t3_ref (id);

Table altered.

SQL> insert into t3_ref values(1,'roger',1);

1 row created.

SQL> insert into t3_ref values(2,'roger',1);

1 row created.

---session 2

SQL> conn roger/roger
Connected.
SQL> insert into t3_ref values(3,'roger',2); --- waits indefinitely

---session 3
SQL> l
  1   select sid,
  2         chr(bitand(p1, -16777216) / 16777215) ||
  3         chr(bitand(p1, 16711680) / 65535) "Name",
  4         (bitand(p1, 65535)) "Mode",event,sql_id,blocking_session,FINAL_BLOCKING_SESSION
  5           from v$session
  6*  where event like 'enq%'
SQL> /

       SID Name       Mode EVENT                            SQL_ID        BLOCKING_SESSION FINAL_BLOCKING_SESSION
---------- ---- ---------- -------------------------------- ------------- ---------------- ----------------------
       199 TX            4 enq: TX - row lock contention    8cj5awv9djrby              139                    139
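The query above pulls the lock name and the requested mode out of the P1 column of the wait. As a minimal sketch of the same decoding outside the database, the Python function below uses byte shifts in place of the divisions in the SQL. The function name and the sample P1 values are illustrative assumptions, not taken from the original session.

```python
from typing import Tuple


def decode_enqueue_p1(p1: int) -> Tuple[str, int]:
    """Split the P1 value of an 'enq:' wait into (lock name, requested mode)."""
    # The two high-order bytes hold the ASCII codes of the lock name.
    name = chr((p1 >> 24) & 0xFF) + chr((p1 >> 16) & 0xFF)
    # The low-order 16 bits hold the requested lock mode.
    mode = p1 & 0xFFFF
    return name, mode


# Hypothetical P1 values: 'T' = 0x54, 'X' = 0x58, followed by the mode.
print(decode_enqueue_p1(0x54580004))  # TX requested in mode 4
print(decode_enqueue_p1(1415053318))  # TX requested in mode 6
```

Shifting avoids the fractional results that the SQL version relies on chr() to truncate.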

 

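So, for enq: TX - row lock contention we can draw the following simple summary:


1. The usual causes are:
1) The table has a primary key or unique constraint, and multiple sessions operate on the same record.

2) The tables are in a primary key/foreign key relationship: while the change to the parent table is uncommitted, the operation on the child table has to wait.

3) The table has a bitmap index. This works the same way as duplicate values arriving in a unique index: while one session is operating, the others must wait.

4) The table has a self-referencing foreign key: if the earlier transaction does not commit, later sessions keep waiting.


2. Some posts online say that enq: TX - row lock contention can also be a wait for an index block split. I have not tested this. In theory, if a session were waiting for an index block split, the enq: TX - index contention wait event should appear as well.


3. For enq: TX - row lock contention, when queried through the v$session view the waiting session's lock mode is 4 in all of the cases tested above (waiters in mode 6 also show up in the customer example below), the blocker session's lock mode is usually 6, and the blocker's sql_id is generally empty. This is normal: v$session shows the current state, not history.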

Below is the analysis of a real case from a customer:

SQL> select inst_id,event,count(1) from gv$session where wait_class#<>6 group by inst_id,event order by 1,3; 

   INST_ID EVENT                                                               COUNT(1)
---------- ----------------------------------------------------------------- ----------
         1 SQL*Net message to client                                                  1
         1 SQL*Net message from dblink                                                2
         1 db file sequential read                                                    4
         1 library cache: mutex X                                                     4
         1 enq: TX - row lock contention                                             18
         2 library cache: mutex X                                                     1
         2 db file sequential read                                                    1                      

7 rows selected.                 

SQL>  select sid,
  2         chr(bitand(p1, -16777216) / 16777215) ||
  3         chr(bitand(p1, 16711680) / 65535) "Name",
  4         (bitand(p1, 65535)) "Mode",event,sql_id,FINAL_BLOCKING_SESSION
  5           from v$session
  6   where event like 'enq%';                                                                                                                                                                          

       SID Name       Mode EVENT                               SQL_ID        FINAL_BLOCKING_SESSION
---------- ---- ---------- ----------------------------------- ------------- ----------------------
       207 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   2810
      1008 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   2810
      1168 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   2810
      1451 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   5286
      1652 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   5286
      2129 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   2810
      2207 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   5286
      2723 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   5286
      3095 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   2810
      4807 TX            6 enq: TX - row lock contention       djbvcr351s0mh                   1690
      5015 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   2810
      5047 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   2810
      5213 TX            6 enq: TX - row lock contention       djbvcr351s0mh                   1690
      5372 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   5286
      5374 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   2810
      5732 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   2810
      6721 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   2810
      7608 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   2810
      7609 TX            4 enq: TX - row lock contention       4fpb7rfm3fb3b                   2810                                                                       

19 rows selected.
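With this many waiters it helps to see how they are distributed across blockers before deciding which session to chase first. The following is only a sketch: the tuples are copied by hand from the output above, and the helper name is made up for illustration.

```python
from collections import Counter

# (sid, requested mode, final_blocking_session), copied from the output above.
waiters = [
    (207, 4, 2810), (1008, 4, 2810), (1168, 4, 2810), (1451, 4, 5286),
    (1652, 4, 5286), (2129, 4, 2810), (2207, 4, 5286), (2723, 4, 5286),
    (3095, 4, 2810), (4807, 6, 1690), (5015, 4, 2810), (5047, 4, 2810),
    (5213, 6, 1690), (5372, 4, 5286), (5374, 4, 2810), (5732, 4, 2810),
    (6721, 4, 2810), (7608, 4, 2810), (7609, 4, 2810),
]


def waiters_per_blocker(rows):
    """Count how many waiting sessions sit behind each final blocker."""
    return Counter(blocker for _sid, _mode, blocker in rows)


# Print the blockers, busiest first.
for blocker, count in waiters_per_blocker(waiters).most_common():
    print(blocker, count)
```

For this snapshot the tally is 12 waiters behind session 2810, 5 behind 5286 and 2 behind 1690; the two sessions behind 1690 are the ones requesting mode 6.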

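Dumping these processes and filtering for "insert into" did not turn up any insert against the party table.
So I tried a different approach and used LogMiner to analyze what the sessions had been doing, as follows: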

SQL> select member from v$logfile where group#=1;

MEMBER
--------------------------------------------------
/crm/oradata01/redo01a.log
/crm/oradata02/redo01b.log

SQL>  select sid, username,
  2         chr(bitand(p1, -16777216) / 16777215) ||
  3         chr(bitand(p1, 16711680) / 65535) "Name",
  4         (bitand(p1, 65535)) "Mode",event,sql_id,blocking_session,FINAL_BLOCKING_SESSION
  5           from v$session
  6   where event like 'enq%';

       SID USERNAME   Name       Mode EVENT                            SQL_ID        BLOCKING_SESSION FINAL_BLOCKING_SESSION
---------- ---------- ---- ---------- -------------------------------- ------------- ---------------- ----------------------
       123 CRM_APP    TX            4 enq: TX - row lock contention    4fpb7rfm3fb3b             2802                   2802
       689 CRM_APP    TX            4 enq: TX - row lock contention    4fpb7rfm3fb3b             2802                   2802
       934 CRM_APP    TX            4 enq: TX - row lock contention    4fpb7rfm3fb3b             3128                   3128
      4128 CRM_APP    TX            4 enq: TX - row lock contention    4fpb7rfm3fb3b             2802                   2802
      4449 CRM_APP    TX            4 enq: TX - row lock contention    4fpb7rfm3fb3b             2802                   2802
      6324 CRM_APP    TX            4 enq: TX - row lock contention    4fpb7rfm3fb3b             2802                   2802

6 rows selected.

Transaction information for the 6 current waiter sessions:

SQL> SELECT t.xidusn, t.xidslot, t.xidsqn, t.start_time, t.start_scn
  2  FROM v$transaction t JOIN v$session s ON t.addr = s.taddr
  3  WHERE s.sid  in (123,689,934,4128,4449,6324)
  4  /

    XIDUSN    XIDSLOT     XIDSQN START_TIME            START_SCN
---------- ---------- ---------- -------------------- ----------
       743         28      61461 07/11/15 22:09:12    1.4484E+13
       828         26      61559 07/11/15 22:07:22    1.4484E+13
       918          5      57068 07/11/15 22:08:12    1.4484E+13
       820          1      64176 07/11/15 22:07:11    1.4484E+13
      1060         16      54417 07/11/15 22:08:12    1.4484E+13
       816          3      62830 07/11/15 22:07:11    1.4484E+13

6 rows selected.

Transaction information for the 2 current blocker sessions:

SQL> SELECT t.xidusn, t.xidslot, t.xidsqn, t.start_time, t.start_scn
  2  FROM v$transaction t JOIN v$session s ON t.addr = s.taddr
  3  WHERE s.sid  in (2802,3128)
  4  /

    XIDUSN    XIDSLOT     XIDSQN START_TIME            START_SCN
---------- ---------- ---------- -------------------- ----------
       949         31      79938 07/11/15 22:06:22    1.4484E+13
      1061         20      57965 07/11/15 22:06:11    1.4484E+13

Use LogMiner to analyze the operations of the waiter and blocker sessions:

SQL> EXECUTE dbms_logmnr.add_logfile(logfilename=>'/crm/oradata01/redo01a.log');

PL/SQL procedure successfully completed.

SQL> EXECUTE dbms_logmnr.start_logmnr(options=>dbms_logmnr.dict_from_online_catalog);

PL/SQL procedure successfully completed.

SQL> create table tmp_logmnr_contents as select * from v$logmnr_contents;

Table created.

SQL> EXECUTE dbms_logmnr.end_logmnr ;

PL/SQL procedure successfully completed.

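The final query showed that the blocker and waiter sessions were executing similar SQL, which makes the problem easy to explain. The SQL captured by LogMiner contains customer information, so I cannot post it here. The point is simply to offer an approach: LogMiner can also be used to analyze TX locks.

In the end, the root cause was that successive sessions were operating on the same data, and because the table has a primary key, TX lock waits were inevitable.

 

PS: Some may ask why I did not simply query the v$ views to capture the SQL's bind variables. I did, and found nothing, and the processstate dump showed nothing either. That is why I turned to LogMiner to find the root cause.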




11.2.0.4 ASM RAC: a recovery example


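This database belongs to a customer of a friend. After a failure it could not be opened normally. Below are the errors reported when trying to open the database: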

ORA-279 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...
ALTER DATABASE RECOVER    CONTINUE DEFAULT
Media Recovery Log /space/sys_software/oracle/app/product/11.2.0/db_1/dbs/arch1_1_885005686.dbf
Errors with log /space/sys_software/oracle/app/product/11.2.0/db_1/dbs/arch1_1_885005686.dbf
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_pr00_43377.trc:
ORA-00308: cannot open archived log '/space/sys_software/oracle/app/product/11.2.0/db_1/dbs/arch1_1_885005686.dbf'
ORA-27037: unable to obtain file status
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
ORA-308 signalled during: ALTER DATABASE RECOVER    CONTINUE DEFAULT  ...
ALTER DATABASE RECOVER CANCEL
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_pr00_43377.trc:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+RDBDATADG/bexasmdb/datafile/system.dbf'
Slave exiting with ORA-1547 exception
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_pr00_43377.trc:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+RDBDATADG/bexasmdb/datafile/system.dbf'
ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...

As we can see, after the incomplete recovery, forcing the database open with hidden parameters still produced the following errors:

Thu Jul 16 07:21:58 2015
SMON: enabling cache recovery
ORA-01555 caused by SQL statement below (SQL ID: 4krwuz0ctqxdt, SCN: 0x002c.880f33fc):
select ctime, mtime, stime from obj$ where obj# = :1
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_40577.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 503 with name "_SYSSMU503_2368473065$" too small
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_40577.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 503 with name "_SYSSMU503_2368473065$" too small
Error 704 happened during db open, shutting down database
USER (ospid: 40577): terminating the instance due to error 704
Thu Jul 16 07:21:59 2015
opiodr aborting process unknown ospid (45403) as a result of ORA-1092
Instance terminated by USER, pid = 40577
ORA-1092 signalled during: alter database open resetlogs...
opiodr aborting process unknown ospid (40577) as a result of ORA-1092
Thu Jul 16 07:22:10 2015
ORA-1092 : opitsk aborting process

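According to my friend, repeated attempts kept hitting the same error. At my suggestion a 10046 trace was taken, which showed the following blocks to be problematic: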

WAIT #139668522497552: nam='db file sequential read' ela= 234 file#=1 block#=122911 blocks=1 obj#=36 tim=1436836317152403
WAIT #139668522497552: nam='db file sequential read' ela= 245 file#=1 block#=338 blocks=1 obj#=36 tim=1436836317152765
WAIT #139668522497552: nam='db file sequential read' ela= 160 file#=1 block#=241 blocks=1 obj#=18 tim=1436836317153036
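When the trace is long, the last single-block reads before the failure can be extracted mechanically. This is a rough sketch that assumes the WAIT line layout shown above; the regular expression and the function name are illustrative only.

```python
import re

# Matches the event name plus the file#, block# and obj# fields of a WAIT line.
WAIT_RE = re.compile(
    r"nam='(?P<event>[^']+)'.*?file#=(?P<file>\d+) block#=(?P<block>\d+)"
    r".*?obj#=(?P<obj>-?\d+)"
)


def blocks_read(trace_lines):
    """Return (file#, block#, obj#) for every 'db file sequential read' wait."""
    result = []
    for line in trace_lines:
        match = WAIT_RE.search(line)
        if match and match.group("event") == "db file sequential read":
            result.append(
                (int(match.group("file")), int(match.group("block")), int(match.group("obj")))
            )
    return result


# The three WAIT lines from the trace above.
trace = [
    "WAIT #139668522497552: nam='db file sequential read' ela= 234 file#=1 block#=122911 blocks=1 obj#=36 tim=1436836317152403",
    "WAIT #139668522497552: nam='db file sequential read' ela= 245 file#=1 block#=338 blocks=1 obj#=36 tim=1436836317152765",
    "WAIT #139668522497552: nam='db file sequential read' ela= 160 file#=1 block#=241 blocks=1 obj#=18 tim=1436836317153036",
]
print(blocks_read(trace))
```

The resulting (file#, block#) pairs are the blocks to inspect with bbed.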

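Checking these blocks with bbed confirmed that they did contain active transactions. After the transactions were committed manually with bbed, the next open attempt reported the following errors: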

Thu Jul 16 07:39:52 2015
Media Recovery failed with error 16433
Slave exiting with ORA-283 exception
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_pr00_48904.trc:
ORA-00283: recovery session canceled due to errors
ORA-16433: The database must be opened in read/write mode.
Recovery Slave PR00 previously exited with exception 283
ORA-283 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...
Thu Jul 16 07:40:08 2015
Shutting down instance (abort)

This error is actually simple to deal with: rebuild the control file and then try to open the database again. Unfortunately, the next open reported ORA-00600 [2662]:

Thu Jul 16 07:51:11 2015
SMON: enabling cache recovery
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_50314.trc  (incident=1796477):
ORA-00600: internal error code, arguments: [2662], [44], [2282697729], [44], [2503605680], [4194545], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_1796477/bexasmdb1_ora_50314_i1796477.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_50314.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [2662], [44], [2282697729], [44], [2503605680], [4194545], [], [], [], [], [], []
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_50314.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [2662], [44], [2282697729], [44], [2503605680], [4194545], [], [], [], [], [], []
Error 704 happened during db open, shutting down database
USER (ospid: 50314): terminating the instance due to error 704
Instance terminated by USER, pid = 50314
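The arguments of the ORA-00600 [2662] error can be read arithmetically. The sketch below assumes the commonly cited interpretation of the arguments (current SCN wrap and base, dependent SCN wrap and base, then the data block address of the offending block), which the post itself does not spell out. The function names are illustrative.

```python
def scn_value(wrap: int, base: int) -> int:
    """Combine an SCN wrap and base into a single number (wrap * 2**32 + base)."""
    return (wrap << 32) | base


def decode_dba(dba: int):
    """Split a data block address into (relative file#, block#): 10 bits and 22 bits."""
    return dba >> 22, dba & 0x3FFFFF


# Arguments copied from the ORA-00600 [2662] line above.
args = [44, 2282697729, 44, 2503605680, 4194545]

current = scn_value(args[0], args[1])    # assumed: current SCN of the database
dependent = scn_value(args[2], args[3])  # assumed: SCN found in the block
print(dependent - current)   # how far the block SCN is ahead: 220907951
print(decode_dba(args[4]))   # (1, 241)
```

Under that reading the block is about 220 million SCNs ahead of the database, and the fifth argument decodes to file 1, block 241, which is one of the blocks seen in the 10046 trace earlier.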

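Because this environment is 11.2.0.4, the old methods of advancing the SCN no longer work. I therefore suggested modifying the SCN directly with oradebug in order to open the database, as follows: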

oradebug poke 0x060019598 8 0x37881E7641
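To see which SCN the poked value corresponds to, it can be split back into wrap and base. This is a sketch that assumes the 8-byte value is laid out as wrap * 2^32 + base; the memory address in the command is specific to that instance and is not interpreted here.

```python
def split_scn(value: int):
    """Split a combined SCN value into (wrap, base)."""
    return value >> 32, value & 0xFFFFFFFF


# The value written by the oradebug poke command above.
wrap, base = split_scn(0x37881E7641)
print(f"{wrap}.{base}")  # 55.2283697729
```

A wrap of 55 is in line with the SCNs of the form 55.22836980xx that the alert log reports once the database opens.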

After making the change with the command above, the open was attempted again and the database opened successfully:

Thu Jul 16 08:35:26 2015
Setting recovery target incarnation to 2
Thu Jul 16 08:35:26 2015
Assigning activation ID 1153859453 (0x44c67f7d)
Thread 1 opened at log sequence 1
  Current log# 1 seq# 1 mem# 0: +RDBDATADG/bexasmdb/onlinelog/redo_g01t01.log
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu Jul 16 08:35:26 2015
SMON: enabling cache recovery
[59018] Successfully onlined Undo Tablespace 2.
Undo initialization finished serial:0 start:2407522882 end:2407523992 diff:1110 (11 seconds)
Dictionary check beginning
Tablespace 'TEMP' #3 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Dictionary check complete
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
*********************************************************************
WARNING: The following temporary tablespaces contain no files.
         This condition can occur when a backup controlfile has
         been restored.  It may be necessary to add files to these
         tablespaces.  That can be done using the SQL statement:

         ALTER TABLESPACE <tablespace_name> ADD TEMPFILE

         Alternatively, if these temporary tablespaces are no longer
         needed, then they can be dropped.
           Empty temporary tablespace: TEMP
*********************************************************************
Database Characterset is AL32UTF8
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_smon_58973.trc  (incident=2076389):
ORA-00600: internal error code, arguments: [4137], [474.1.368214], [0], [0], [], [], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_2076389/bexasmdb1_smon_58973_i2076389.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Not initializing the resource manager because _resource_manager_always_on=FALSE
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Thu Jul 16 08:35:29 2015
QMNC started with pid=36, OS id=65502
LOGSTDBY: Validating controlfile with logical metadata
LOGSTDBY: Validation complete
ORACLE Instance bexasmdb1 (pid = 22) - Error 600 encountered while recovering transaction (474, 1).
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_smon_58973.trc:
ORA-00600: internal error code, arguments: [4137], [474.1.368214], [0], [0], [], [], [], [], [], [], [], []
Thu Jul 16 08:35:29 2015
Dumping diagnostic data in directory=[cdmp_20150716083529], requested by (instance=1, osid=58973 (SMON)), summary=[incident=2076389].
Thu Jul 16 08:35:29 2015
Sweep [inc][2076389]: completed
Thu Jul 16 08:35:29 2015
Sweep [inc2][2076389]: completed
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x42E37DA4] [PC:0x932F97E, kgegpa()+40] [flags: 0x0, count: 1]
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x42E37DA4] [PC:0x932DF87, kgebse()+771] [flags: 0x2, count: 2]
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x42E37DA4] [PC:0x932DF87, kgebse()+771] [flags: 0x2, count: 2]
Thu Jul 16 08:35:30 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_mmon_58981.trc  (incident=2076421):
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_2076421/bexasmdb1_mmon_58981_i2076421.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Jul 16 08:35:30 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_65525.trc  (incident=2076549):
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
Thu Jul 16 08:35:30 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_65527.trc  (incident=2076557):
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_2076549/bexasmdb1_ora_65525_i2076549.trc
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_2076557/bexasmdb1_ora_65527_i2076557.trc
Thu Jul 16 08:35:30 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_65529.trc  (incident=2076533):
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Dumping diagnostic data in directory=[cdmp_20150716083530], requested by (instance=1, osid=58973 (SMON)), summary=[abnormal process termination].Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_2076533/bexasmdb1_ora_65529_i2076533.trc

Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Jul 16 08:35:31 2015
Block recovery from logseq 1, block 380 to scn 238506899377
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1 Reading mem 0
  Mem# 0: +RDBDATADG/bexasmdb/onlinelog/redo_g01t01.log
Block recovery completed at rba 1.416.16, scn 55.2283698098
Block recovery from logseq 1, block 380 to scn 238506899350
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1 Reading mem 0
  Mem# 0: +RDBDATADG/bexasmdb/onlinelog/redo_g01t01.log
Block recovery completed at rba 1.382.16, scn 55.2283698072
Thu Jul 16 08:35:31 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_65583.trc  (incident=2076677):
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_2076677/bexasmdb1_ora_65583_i2076677.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Dumping diagnostic data in directory=[cdmp_20150716083532], requested by (instance=1, osid=65529), summary=[incident=2076533].
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_65529.trc  (incident=2076534):
ORA-00600: internal error code, arguments: [504], [0x06000F0F0], [1], [0], [ksv instance latch], [0], [0], [0x2FC57D92F8], [], [], [], []
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
......
......
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_65525.trc:
ORA-00600: internal error code, arguments: [504], [0x06000F0F0], [1], [0], [ksv instance latch], [0], [0], [0x2FC57D9438], [], [], [], []
ORA-00600: internal error code, arguments: [504], [0x06000F0F0], [1], [0], [ksv instance latch], [0], [0], [0x2FC57D9438], [], [], [], []
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x3BAF488E] [PC:0x932F97E, kgegpa()+40] [flags: 0x0, count: 1]
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x3BAF488E] [PC:0x932DF87, kgebse()+771] [flags: 0x2, count: 2]
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_mmon_58981.trc  (incident=2076428):
ORA-00603: ORACLE server session terminated by fatal error
ORA-24557: error 600 encountered while handling error 600; exiting server process
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []

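Although the database could be opened, my friend reported that it would crash shortly afterwards. Judging from the log above, the errors reported after the open are undo-related, which makes them easier to handle: set the undo_management parameter to manual, open the database, and then rebuild the undo tablespace, as follows:

Thu Jul 16 08:40:43 2015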
QMNC started with pid=72, OS id=66934
Completed: ALTER DATABASE OPEN
Thu Jul 16 08:40:44 2015
minact-scn: got error during useg scan e:1555 usn:405
minact-scn: useg scan erroring out with error e:1555
ORACLE Instance bexasmdb1 (pid = 22) - Error 600 encountered while recovering transaction (405, 33).
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_smon_66678.trc:
ORA-00600: internal error code, arguments: [4137], [405.33.408826], [0], [0], [], [], [], [], [], [], [], []
Thu Jul 16 08:40:44 2015
Dumping diagnostic data in directory=[cdmp_20150716084044], requested by (instance=1, osid=66678 (SMON)), summary=[incident=2196402].
Thu Jul 16 08:40:44 2015
Starting background process CJQ0
Thu Jul 16 08:40:44 2015
CJQ0 started with pid=88, OS id=66995
Dumping diagnostic data in directory=[cdmp_20150716084045], requested by (instance=1, osid=66678 (SMON)), summary=[abnormal process termination].
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_smon_66678.trc  (incident=2196403):
ORA-00600: internal error code, arguments: [4137], [408.23.372933], [0], [0], [], [], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_2196403/bexasmdb1_smon_66678_i2196403.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Jul 16 08:40:45 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_67001.trc  (incident=2196954):
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_2196954/bexasmdb1_ora_67001_i2196954.trc
Thu Jul 16 08:40:45 2015
Sweep [inc][2196402]: completed
Thu Jul 16 08:40:45 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_m000_66993.trc:
ORA-25153: Temporary Tablespace is Empty
ORACLE Instance bexasmdb1 (pid = 22) - Error 600 encountered while recovering transaction (408, 23).
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_smon_66678.trc:
ORA-00600: internal error code, arguments: [4137], [408.23.372933], [0], [0], [], [], [], [], [], [], [], []
Thu Jul 16 08:40:46 2015
Sweep [inc][2196403]: completed
Dumping diagnostic data in directory=[cdmp_20150716084046], requested by (instance=1, osid=66678 (SMON)), summary=[incident=2196403].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_smon_66678.trc  (incident=2196404):
ORA-00600: internal error code, arguments: [4137], [411.5.413464], [0], [0], [], [], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_2196404/bexasmdb1_smon_66678_i2196404.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Jul 16 08:40:47 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_67019.trc  (incident=2196962):
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_2196962/bexasmdb1_ora_67019_i2196962.trc
Dumping diagnostic data in directory=[cdmp_20150716084047], requested by (instance=1, osid=66678 (SMON)), summary=[abnormal process termination].
ORACLE Instance bexasmdb1 (pid = 22) - Error 600 encountered while recovering transaction (411, 5).
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_smon_66678.trc:
ORA-00600: internal error code, arguments: [4137], [411.5.413464], [0], [0], [], [], [], [], [], [], [], []
Thu Jul 16 08:40:48 2015
Sweep [inc][2196404]: completed
Thu Jul 16 08:40:48 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_j002_67033.trc:
ORA-12012: error on auto execute of job 25
ORA-01552: cannot use system rollback segment for non-system tablespace 'SPC_SDB_SOD_DATA'
ORA-06512: at "USR_SOD.OPERDEL", line 3
ORA-06512: at line 1
Dumping diagnostic data in directory=[cdmp_20150716084048], requested by (instance=1, osid=66678 (SMON)), summary=[incident=2196404].
Thu Jul 16 08:40:48 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_j004_67043.trc:
ORA-12012: error on auto execute of job 103
ORA-01552: cannot use system rollback segment for non-system tablespace 'SPC_SDB_DPC_DATA'
ORA-06512: at "USR_DPC.DPC_PARTITION_DEL", line 95
ORA-01552: cannot use system rollback segment for non-system tablespace 'SPC_SDB_DPC_DATA'
ORA-06512: at line 1
Thu Jul 16 08:40:48 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_j005_67045.trc:
ORA-12012: error on auto execute of job 85
ORA-01552: cannot use system rollback segment for non-system tablespace 'SPC_SDB_DPC_DATA'
ORA-06512: at "USR_DPC.DPC_PARTITION_ADD", line 154
ORA-01552: cannot use system rollback segment for non-system tablespace 'SPC_SDB_DPC_DATA'
ORA-06512: at line 1
Thu Jul 16 08:40:48 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_j007_67049.trc:
ORA-12012: error on auto execute of job 24
ORA-01552: cannot use system rollback segment for non-system tablespace 'SPC_SDB_SOD_DATA'
ORA-06512: at "USR_SOD.OPERDEL", line 3
ORA-06512: at line 1
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_j002_67033.trc:
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Jul 16 08:40:49 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_67055.trc  (incident=2197066):
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_2197066/bexasmdb1_ora_67055_i2197066.trc
Thu Jul 16 08:40:49 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_j001_67030.trc:
ORA-12012: error on auto execute of job 36
ORA-12008: error in materialized view refresh path
ORA-01552: cannot use system rollback segment for non-system tablespace 'SPC_SDB_MCP_DATA'
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2563
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2776
ORA-06512: at "SYS.DBMS_IREFRESH", line 685
ORA-06512: at "SYS.DBMS_REFRESH", line 195
ORA-06512: at line 1
......
......
Thu Jul 16 08:41:34 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_67263.trc  (incident=2196986):
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Jul 16 08:41:35 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_67265.trc  (incident=2196995):
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
......
......
Thu Jul 16 08:42:36 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_67419.trc  (incident=2197091):
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_j002_67383.trc:
ORA-12012: error on auto execute of job 4002
ORA-08102: index key not found, obj# 290, file 1, block 2033 (2)
Thu Jul 16 08:42:36 2015
Sweep [inc][2197186]: completed
Sweep [inc][2197179]: completed
Sweep [inc][2197146]: completed
Sweep [inc][2197138]: completed
Sweep [inc][2197130]: completed
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_j000_67379.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-08102: index key not found, obj# 290, file 1, block 2033 (2)
ORA-12012: error on auto execute of job 3
ORA-08102: index key not found, obj# 290, file 1, block 2033 (2)
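
The ORA-08102 lines above locate the damaged index by obj#, file and block. As a minimal sketch (not part of the original troubleshooting), the Python below extracts those three numbers from the message and packs file and block into the 32-bit relative data block address seen in block dumps. It assumes a smallfile tablespace, where the top 10 bits hold the relative file number and the low 22 bits hold the block number. The function names `parse_ora_08102` and `to_rdba` are hypothetical.

```python
import re

# Matches the location part of an ORA-08102 message.
ORA_08102 = re.compile(
    r"ORA-08102: index key not found, obj# (\d+), file (\d+), block (\d+)"
)


def parse_ora_08102(line):
    """Return (obj_no, file_no, block_no) from an ORA-08102 line, or None."""
    m = ORA_08102.search(line)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())


def to_rdba(file_no, block_no):
    """Pack a relative file number and block number into a 32-bit rdba.

    Assumption: smallfile tablespace layout
    (10-bit relative file number, 22-bit block number).
    """
    if not (0 <= file_no < (1 << 10) and 0 <= block_no < (1 << 22)):
        raise ValueError("file or block number out of range")
    return (file_no << 22) | block_no


# Line copied from the alert log excerpt above.
line = "ORA-08102: index key not found, obj# 290, file 1, block 2033 (2)"
obj_no, file_no, block_no = parse_ora_08102(line)
rdba = to_rdba(file_no, block_no)
print(f"obj# {obj_no}, file {file_no}, block {block_no}, rdba 0x{rdba:08x}")
```

The obj# value can then be matched against OBJECT_ID in DBA_OBJECTS to see which index is affected, provided the dictionary is still readable enough to answer the query.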

After the database was finally opened, some problems remained: every attempt to rebuild an index failed with an error, as shown below:

SQL> CREATE INDEX "USR_MCP"."IDX_QRTZ_T_NEXT_FIRE_TIME" ON "USR_MCP"."QRTZ_TRIGGERS" ("NEXT_FIRE_TIME");
CREATE INDEX "USR_MCP"."IDX_QRTZ_T_NEXT_FIRE_TIME" ON "USR_MCP"."QRTZ_TRIGGERS" ("NEXT_FIRE_TIME")
                                                                *
ERROR at line 1:
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []

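This actually points to block corruption. Checking the related objects showed that the data dictionary tables themselves were damaged, and the alert log at that point also contained related errors, as shown below: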

Dumping diagnostic data in directory=[cdmp_20150717024616], requested by (instance=1, osid=76873), summary=[abnormal process termination].
Fri Jul 17 02:46:20 2015
alter tablespace temp add tempfile '+RDBDATADG' size 10G autoextend on
Completed: alter tablespace temp add tempfile '+RDBDATADG' size 10G autoextend on
Fri Jul 17 02:46:30 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_77293.trc  (incident=3236693):
ORA-00600: internal error code, arguments: [ktsplbfmb-dblfree], [0], [96608622], [96608439], [183], [0], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_3236693/bexasmdb1_ora_77293_i3236693.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Fri Jul 17 02:46:33 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_76873.trc:
Fri Jul 17 02:46:34 2015
Sweep [inc][3236693]: completed
Sweep [inc2][3236693]: completed
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_77293.trc  (incident=3236694):
ORA-00600: internal error code, arguments: [ktsplbfmb-dblfree], [0], [96608622], [96608439], [183], [0], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_3236694/bexasmdb1_ora_77293_i3236694.trc
Fri Jul 17 02:46:37 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_76814.trc  (incident=3236638):
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_3236638/bexasmdb1_ora_76814_i3236638.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_77293.trc  (incident=3236695):
ORA-00600: internal error code, arguments: [6002], [32], [32], [2], [0], [], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_3236695/bexasmdb1_ora_77293_i3236695.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Fri Jul 17 02:46:40 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_77293.trc  (incident=3236696):
ORA-00600: internal error code, arguments: [6002], [32], [32], [2], [0], [], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_3236696/bexasmdb1_ora_77293_i3236696.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_77293.trc  (incident=3236697):
ORA-00600: internal error code, arguments: [ktsplbfmb-dblfree], [0], [96608622], [96608439], [183], [0], [], [], [], [], [], []
Incident details in: /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/incident/incdir_3236697/bexasmdb1_ora_77293_i3236697.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Fri Jul 17 02:46:52 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_76873.trc:
Fri Jul 17 02:46:53 2015
Errors in file /space/sys_software/oracle/app/diag/rdbms/bexasmdb/bexasmdb1/trace/bexasmdb1_ora_77293.trc  (incident=3236698):
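
With this many incidents in the log, it helps to see which ORA-00600 variants dominate before deciding what to repair. The following is a rough sketch, not the author's procedure: it counts ORA-00600 lines by their first argument. The function name `tally_ora_600` is hypothetical, and the sample lines are copied from the excerpt above. In practice the lines would be read from the alert log file itself.

```python
import re
from collections import Counter

# Captures the first bracketed argument of an ORA-00600 line.
ORA_600 = re.compile(r"ORA-00600: internal error code, arguments: \[([^\]]*)\]")


def tally_ora_600(lines):
    """Count ORA-00600 occurrences, keyed by their first argument."""
    counts = Counter()
    for line in lines:
        m = ORA_600.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Sample lines copied from the alert log excerpt above.
sample = [
    "ORA-00600: internal error code, arguments: [ktsplbfmb-dblfree], [0], [96608622], [96608439], [183], [0], [], [], [], [], [], []",
    "ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []",
    "ORA-00600: internal error code, arguments: [6002], [32], [32], [2], [0], [], [], [], [], [], [], []",
    "ORA-00600: internal error code, arguments: [6002], [32], [32], [2], [0], [], [], [], [], [], [], []",
    "Use ADRCI or Support Workbench to package the incident.",
]

counts = tally_ora_600(sample)
for first_arg, n in counts.most_common():
    print(f"ORA-00600 [{first_arg}]: {n}")
```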

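By my analysis, this could have been fixed entirely by repairing the index on obj$ with bbed. However, my friend was not familiar with the tool, and given how complex index structures are, I ended up simply advising him to export the data with exp and rebuild the database.
My blog also has posts on repairing an index after ORA-08102 errors; please refer to them. For professional data recovery of this kind, contact us at 云和恩墨!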

Related posts:

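  1. Several small problems encountered in recent migration and recovery work
  2. One recover case!
  3. Database open fails with ORA-01555: snapshot too old
  4. Instance immediate crash after open
  5. An example of widespread block corruption in SYSAUX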