Guenadi N Jilevski's Oracle BLOG

Oracle RAC, DG, EBS, DR and HA DBA BLOG

RMAN TSPITR – Examining the driver script

RMAN TSPITR – Examining the driver script

In this article, we’ll take a look at what Oracle does when you perform a tablespace point-in-time recovery (TSPITR) – and also what it doesn’t do, depending on the version being used. One of the errors can be vexing, but once you know where and how to fix it, the point-in-time recovery process works much better. Working “much better” doesn’t necessarily mean perfectly, but more than likely, you’ll be at a place where finishing the recovery is fairly straightforward.

During an RMAN session, you’ll see output where RMAN is “printing” or using a stored memory script. The print part is correct as the output is printed to the terminal/session window, but from memory is not entirely accurate. One example of the output is shown below, and you can see how Oracle has taken your input (via script and init.ora parameters) and fed it into the RMAN session. In my example, I am recovering tablespace USERS until logseq 499.

RMAN> run {

2> allocate auxiliary channel dev1 type ‘sbt_tape’;

3> set newname for datafile ‘/u065/oradata/train/system01.dbf’ to

‘/u064/oradata/clone/system01.dbf’;

4> set newname for datafile ‘/u064/oradata/train/users01.dbf’ to

‘/u064/oradata/clone/users01.dbf’;

5> set newname for datafile ‘/u065/oradata/train/rbs01.dbf’ to

‘/u064/oradata/clone/rbs01.dbf’;

6> recover tablespace users until logseq 499 thread 1;

7> }

The RMAN output (partial) of what I provided is reflected below.

RMAN-03027: printing stored script: Memory Script

{

# set the until clause

set until logseq 499 thread 1;

# restore the controlfile

restore clone controlfile to clone_cf;

# replicate the controlfile

replicate clone controlfile from clone_cf;

# mount the controlfile

sql clone ‘alter database mount clone database’;

# archive current online log for tspitr to a resent until time

sql ‘alter system archive log current’;

# resync catalog after controlfile restore

resync catalog;

}

Note: The “resent” time is really a “recent” time. The same typo has existed for close to decade now.

The code to generate this output, and other RMAN output for that matter, resides in a file named recover.bsq. The file is located in ORACLE_HOME/rdbms/admin and is directly editable, although a note at the top says otherwise. Within the file, around line 8509 in the 10g version and line 4523 in 8i, you can see the start of what looks to be boilerplate text using variable resolution. The block of code looks as such:

#

# tspitr_0: pre_tspitr script. This member is used once.

#

define tspitr_0

<<<

# set the until clause

set until &1&;

# restore the controlfile

restore clone controlfile to clone_cf;

# replicate the controlfile

replicate clone controlfile from clone_cf;

# mount the controlfile

sql clone ‘alter database mount clone database’;

# archive current online log for tspitr to a resent until time

sql ‘alter system archive log current’;

>>>

#

# tspitr_sy_ct: resync catalog

#

define tspitr_sy_ct

<<<

# resync catalog after controlfile restore

resync catalog;

>>>

Note that the set until clause is required to resolve at least three different inputs (time, SCN, and log sequence number). Continuing on through the tspitr_X “defines” in the C-like code, the procedural steps to rename files, flip them, online/offline files, and alter database are easy to correlate to an RMAN session’s output.

You would think that the TSPITR code from Oracle would be robust, where user input or editing of the stored script wouldn’t be necessary. Interestingly enough, between sections 7 and 8 we see the following “oh by the way” instructions for a user.

# PLUG HERE the creation of a temporary tablespace if export fails due to lack

# of temporary space.

# For example in Unix these two lines would do that:

#sql clone “create temporary tablespace temp

# tempfile ”/tmp/aux_tspitr_tmp.dbf” size 500K”;

Depending upon your environment/situation, you may have to edit the recover.txt (although you can edit recover.bsq file and not have the changes overwritten) and uncomment the two lines associated with creating a temporary tablespace. Okay, that is one fix related to the export failing. The explanation for why a temporary tablespace is needed is that the sort area (in memory) is not large enough, and just like any other database (i.e., operating normally, not needing any recovery), if sorts cannot be done in memory, then Oracle goes to the temp tablespace.

Your mileage may vary on this, but uncomment the “create temporary tablespace” statement and use what is suggested. If you encounter another error, more than likely it is due to the fact the tempfile size is not large enough (the second fix). Instead of trying to tweak the size parameter so you just miss encountering the error, go big early and avoid the problem altogether. How big is big enough? That may be trial and error, but it is quite likely to be on the order of megabytes (e.g., 5MB). The point is this: don’t waste time setting it to 600K, then 700K, and so on.

Two points are in order here. First, this problem is supposedly fixed in 10gR2 and above, so you may never have to edit this file – at least for this reason. The second is that Oracle Corp. provides conflicting instructions on which file to edit: recover.bsq or recover.txt. As mentioned, the actual file says to edit the text version, but in a MetaLink note (“ORA-25153: Temporary Tablespace is Empty encountered on RMAN TSPITR,” Doc ID 263483.1), the instructions are:

Edit recover.bsq (RMAN library file) in $ORACLE_HOME/rdbms/admin (location is platform specific) and make the following changes. NOTE: It is recommended to take a backup of this file.

For sure, you can edit the recover.bsq file, but do make a copy of the original beforehand. If you want to experiment with the file, try adding your own comment lines here and there. Commented lines (those beginning with #) within the >>> to the <<< blocks are output to the RMAN session. For example, if seeing “PLUG HERE” grates on your sense of grammar, feel free to replace it with your own comment.

The instructional value of looking through all the tspitr_X sections is that you can follow the process at a higher level (as opposed to filtering out the RMAN output). If something goes wrong, you’ll have a better idea of where in the process the failure took place.

While on the subject of the recover.bsq file, a search on MetaLink (yes, I’m avoiding writing the clunky sounding “My Oracle Support”) for this file returns 29 entries (as of the publish date of this article). An interesting theme among the hits is that of the wrong version of the file being used. And like many other notes, check them for accuracy before making changes or running scripts.

One note related to upgrading the recovery catalog has you fix one error but only to introduce another via the changed script. The suggested change in Doc ID 833046.1 (“RMAN-600 [3040] TRYING TO UPGRADE THE RECOVERY CATALOG”) has an error in the SELECT clause (extra comma at the end of the last column selected). Not that you would be stressed for time during some hugely critical upgrade or recovery process or anything like that as there is always enough time to fix technical notes from Oracle.

Conclusion

The recover.bsq file is interesting in that it provides you with the nitty-gritty of what RMAN does during a recovery. All of the steps aren’t necessarily listed in the documentation, or in third-party books for that matter, and if you can read code at even a basic level, you can understand what is taking place when RMAN does its magic. Knowing about this file ahead of time can save you hours of searching for resolution of RMAN errors, and probably during a time when you really can’t afford those extra hours.

July 18, 2008 Posted by | oracle | Leave a comment

How To Recover From Corrupted OCR Disk

How To Recover From Corrupted OCR Disk

It is very common where a DBA is left with corrupted OCR disk without having any good backup.
The same situation was experienced by me few days back. One node of RAC database shows the following:

NODE1:

$ORA_CRS_HOME/bin/crs_stat -t

Name           Type           Target    State     Host

————————————————————

ora.orcl.db    application    ONLINE    ONLINE    raclinux1

ora….11.inst application    ONLINE    ONLINE    raclinux1

ora….12.inst application    ONLINE    OFFLINE

ora….vice.cs application    OFFLINE   OFFLINE

ora….l11.srv application    ONLINE    OFFLINE

ora….l12.srv application    ONLINE    OFFLINE

ora….SM1.asm application    ONLINE    ONLINE    raclinux1

ora….DC.lsnr application    ONLINE    ONLINE    raclinux1

ora….abc.gsd application    ONLINE    ONLINE    raclinux1

ora….abc.ons application    ONLINE    ONLINE    raclinux1

ora….abc.vip application    ONLINE    ONLINE    raclinux1

ora….SM2.asm application    ONLINE    ONLINE    raclinux2

ora….C2.lsnr application    ONLINE    ONLINE    raclinux2

ora….bc2.gsd application    ONLINE    ONLINE    raclinux2

ora….bc2.ons application    ONLINE    ONLINE    raclinux2

ora….bc2.vip application    ONLINE    ONLINE    raclinux2

The other node shows the following:
NODE2:

/crs_stat -t

HA Resource                                   Target     State

———–                                   ——     —–

ora.orcl.db                                   OFFLINE    OFFLINE

ora.orcl.racdb1.inst                          OFFLINE    OFFLINE

ora.orcl.racdb2.inst                          OFFLINE    OFFLINE

ora.orcl.test_service.cs                      ONLINE     OFFLINE

ora.orcl.test_service.racdb1.srv              OFFLINE    OFFLINE

ora.orcl.test_service.racdb2.srv              OFFLINE    OFFLINE

ora.raclinux1 .ASM1.asm                         OFFLINE    OFFLINE

ora.raclinux1 .LISTENER_RAC1 .lsnr           OFFLINE    OFFLINE

ora.raclinux1 .gsd                              OFFLINE    OFFLINE

ora.raclinux1 .ons                              OFFLINE    OFFLINE

ora.raclinux1 .vip                              OFFLINE    OFFLINE

ora.raclinux2.ASM2.asm                        OFFLINE    OFFLINE

ora.raclinux2.LISTENER_RAC2 2.lsnr         ONLINE     OFFLINE

ora.raclinux2.gsd                             ONLINE     OFFLINE

ora.raclinux2.ons                             ONLINE     OFFLINE

ora.raclinux2.vip                             ONLINE     OFFLINE

We can see the inconsistent data across two node RAC. Every command for srvctl, crsctl was hanging on NODE 2.
Now the option is to restore the OCR backup, but if there is no backup available for OCR then we can use the following procedure to recover from corrupted OCR disk
(There will be complete downtime needed to perform these operations)
1. Check the status of CRS from node 1:

# ps -eaf |grep d.bin
root 12873 1 0 Aug11 ? 00:11:07 /u01/app/crs/bin/crsd.bin reboot
oracle 13105 12846 0 Aug11 ? 00:00:45 /u01/app/crs/bin/evmd.bin
oracle 13226 13200 0 Aug11 ? 00:13:13 /u01/app/crs/bin/ocssd.bin
root 21458 19986 0 20:34 pts/4 00:00:00 grep d.bin

2. Shutdown Oracle ClusterWare on all nodes:

[root@raclinux1  bin]# ./crsctl stop crs

Stopping resources.

Successfully stopped CRS resources

Stopping CSSD.

Shutting down CSS daemon.

Shutdown request successfully issued.

Check the status again:

[root@raclinux1 bin]# ps -eaf |grep d.bin
root 21927 19986 0 20:34 pts/4 00:00:00 grep d.bin

It shows that the cluster is stopped.

3. Execute rootdelete.sh from all nodes.

It is under directory $ORA_CRS_HOME/install/rootdelete.sh

NODE1:

[root@raclinux1  install]# ./rootdelete.sh

Shutting down Oracle Cluster Ready Services (CRS):

Stopping resources.

Error while stopping resources. Possible cause: CRSD is down.

Stopping CSSD.

Unable to communicate with the CSS daemon.

Shutdown has begun. The daemons should exit soon.

Checking to see if Oracle CRS stack is down…

Oracle CRS stack is not running.

Oracle CRS stack is down now.

Removing script for Oracle Cluster Ready services

Updating ocr file for downgrade

Cleaning up SCR settings in ‘/etc/oracle/scls_scr’

NODE 2:

./rootdelete.sh

Shutting down Oracle Cluster Ready Services (CRS):

OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]

Shutdown has begun. The daemons should exit soon.

Checking to see if Oracle CRS stack is down…

Oracle CRS stack is not running.

Oracle CRS stack is down now.

Removing script for Oracle Cluster Ready services

Updating ocr file for downgrade

Cleaning up SCR settings in ‘/etc/oracle/scls_scr’

“OCR initialization failed accessing OCR device”, this error can occur due to folloing reasons:
1. ocrconfig_loc is not pointing to the correct ocr.
2. Problem of rights and owners on the ocr devices
3. Configuration problem on Oracle Cluster Synchronization Services

As the SCR entries are cleaned up so there is no need to worry about PROC-26 error.

If you have more than 2 nodes in a rac you need to run rootdelete.sh on all the other nodes also.

4. Run rootdeinstall.sh from the node where the RAC installation was done (usually it is the node1).
It will clear up the OCR disk contents.

./rootdeinstall.sh

Removing contents from OCR device

2560+0 records in

2560+0 records out

5. Run root.sh from the same node:

./root.sh

WARNING: directory ‘/u01’ is not owned by root

Checking to see if Oracle CRS stack is already configured

Setting the permissions on OCR backup directory

Setting up NS directories

Oracle Cluster Registry configuration upgraded successfully

WARNING: directory ‘/u01’ is not owned by root

assigning default hostname raclinux1  for node 1.

assigning default hostname raclinux2 2 for node 2.

Successfully accumulated necessary OCR keys.

Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.

node :

node 1: raclinux1  raclinux1-priv raclinux1

node 2: raclinux2  raclinux2-priv raclinux2

Creating OCR keys for user ‘root’, privgrp ‘root’..

Operation successful.

Now formatting voting device: /dev/raw/raw1

Format of 1 voting devices complete.

Startup will be queued to init within 90 seconds.

Adding daemons to inittab

Expecting the CRS daemons to be up within 600 seconds.

CSS is active on these nodes.

raclinux1

CSS is inactive on these nodes.

raclinux2 2

Local node checking complete.

Run root.sh on remaining nodes to start CRS daemons.

After its completion run root.sh on all remaining nodes.

./root.sh

Checking to see if Oracle CRS stack is already configured

Setting the permissions on OCR backup directory

Setting up NS directories

Oracle Cluster Registry configuration upgraded successfully

clscfg: EXISTING configuration version 3 detected.

clscfg: version 3 is 10G Release 2.

assigning default hostname raclinux1  for node 1.

assigning default hostname raclinux2  for node 2.

Successfully accumulated necessary OCR keys.

Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.

node :

node 1: raclinux1  raclinux1-priv raclinux1

node 2: raclinux2  raclinux2-priv raclinux2

clscfg: Arguments check out successfully.

NO KEYS WERE WRITTEN. Supply -force parameter to override.

-force is destructive and will destroy any previous cluster

configuration.

Oracle Cluster Registry for cluster has already been initialized

Startup will be queued to init within 90 seconds.

Adding daemons to inittab

Expecting the CRS daemons to be up within 600 seconds.

CSS is active on these nodes.

raclinux1

raclinux2

CSS is active on all nodes.

Oracle CRS stack installed and running under init(1M)

Running vipca(silent) for configuring nodeapps

The given interface(s), “eth0” is not public. Public interfaces should be used to configure virtual IPs.

The silent mode VIPCA configuration will fail because of BUG 4437727 in 10.2.0.1. To solve this run the
VIPCA manually from root user from last node where this error has occured and follow the instructions.
# $ORA_CRS_HOME/bin/vipca

6. Now final step is to add the resources back to OCR with srvctl command.

Adding DATABASE to OCR:

$srvctl add database -d db_unique_name -o oracle_home

[oracle@raclinux1 ~]$ $ORA_CRS_HOME/bin/srvctl add database -d orcl -o /u01/app/oracle/product/10.2.0/db_1

Adding INSTANCE to OCR:

srvctl add instance -d db_unique_name -i inst_name -n node_name

[oracle@raclinux1 ~]$ $ORA_CRS_HOME/bin/srvctl add instance -d orcl -i racdb1 -n raclinux1

[oracle@raclinux1 ~]$ $ORA_CRS_HOME/bin/srvctl add instance -d orcl -i racdb2 -n raclinux2 2

Adding SERVICES to OCR:

$srvctl add service -d db_unique_name -s service_name -r preferred_list

[oracle@raclinux1  ~]$ $ORA_CRS_HOME/bin/srvctl add service -d orcl -s test_service -r racdb1,racdb2

Adding NODEAPPS to OCR:

srvctl add nodeapps -n node_name -o oracle_home -A addr_str
Where addr_str= The node level VIP address
This command needs to be run from ROOT user otherwise you will get following error:

[oracle@raclinux1  ~]$  $ORA_CRS_HOME/bin/srvctl add nodeapps -n raclinux1  -o /u01/app/oracle/product/10.2.0/db_1 -A 10.167.21.89/255.255.255.0

PRKO-2117 : This command should be executed as the system privilege user.

[oracle@raclinux1  ~]$

[oracle@raclinux1  ~]$ su –

Password:

[root@raclinux1  ~]# cd /u01/app/crs/bin

[root@raclinux1  bin]# ./srvctl add nodeapps -n raclinux1  -o /u01/app/oracle/product/10.2.0/db_1 -A 10.167.21.87/255.255.255.0

[root@raclinux1  bin]#./srvctl add nodeapps -n raclinux2 2  -o /u01/app/oracle/product/10.2.0/db_1 -A 10.167.21.89/255.255.255.0

This will complete the OCR recreation, now you can test the status with cluvfy.

July 7, 2008 Posted by | oracle | Leave a comment