RMAN TSPITR – Examining the driver script
RMAN TSPITR – Examining the driver script
In this article, we’ll take a look at what Oracle does when you perform a tablespace point-in-time recovery (TSPITR) – and also what it doesn’t do, depending on the version being used. One of the errors can be vexing, but once you know where and how to fix it, the point-in-time recovery process works much better. Working “much better” doesn’t necessarily mean perfectly, but more than likely, you’ll be at a place where finishing the recovery is fairly straightforward.
During an RMAN session, you’ll see output where RMAN is “printing” or using a stored memory script. The print part is correct as the output is printed to the terminal/session window, but from memory is not entirely accurate. One example of the output is shown below, and you can see how Oracle has taken your input (via script and init.ora parameters) and fed it into the RMAN session. In my example, I am recovering tablespace USERS until logseq 499.
RMAN> run {
2> allocate auxiliary channel dev1 type ‘sbt_tape’;
3> set newname for datafile ‘/u065/oradata/train/system01.dbf’ to
‘/u064/oradata/clone/system01.dbf’;
4> set newname for datafile ‘/u064/oradata/train/users01.dbf’ to
‘/u064/oradata/clone/users01.dbf’;
5> set newname for datafile ‘/u065/oradata/train/rbs01.dbf’ to
‘/u064/oradata/clone/rbs01.dbf’;
6> recover tablespace users until logseq 499 thread 1;
7> }
The RMAN output (partial) of what I provided is reflected below.
RMAN-03027: printing stored script: Memory Script
{
# set the until clause
set until logseq 499 thread 1;
# restore the controlfile
restore clone controlfile to clone_cf;
# replicate the controlfile
replicate clone controlfile from clone_cf;
# mount the controlfile
sql clone ‘alter database mount clone database’;
# archive current online log for tspitr to a resent until time
sql ‘alter system archive log current’;
# resync catalog after controlfile restore
resync catalog;
}
Note: The “resent” time is really a “recent” time. The same typo has existed for close to decade now.
The code to generate this output, and other RMAN output for that matter, resides in a file named recover.bsq. The file is located in ORACLE_HOME/rdbms/admin and is directly editable, although a note at the top says otherwise. Within the file, around line 8509 in the 10g version and line 4523 in 8i, you can see the start of what looks to be boilerplate text using variable resolution. The block of code looks as such:
#
# tspitr_0: pre_tspitr script. This member is used once.
#
define tspitr_0
<<<
# set the until clause
set until &1&;
# restore the controlfile
restore clone controlfile to clone_cf;
# replicate the controlfile
replicate clone controlfile from clone_cf;
# mount the controlfile
sql clone ‘alter database mount clone database’;
# archive current online log for tspitr to a resent until time
sql ‘alter system archive log current’;
>>>
#
# tspitr_sy_ct: resync catalog
#
define tspitr_sy_ct
<<<
# resync catalog after controlfile restore
resync catalog;
>>>
Note that the set until clause is required to resolve at least three different inputs (time, SCN, and log sequence number). Continuing on through the tspitr_X “defines” in the C-like code, the procedural steps to rename files, flip them, online/offline files, and alter database are easy to correlate to an RMAN session’s output.
You would think that the TSPITR code from Oracle would be robust, where user input or editing of the stored script wouldn’t be necessary. Interestingly enough, between sections 7 and 8 we see the following “oh by the way” instructions for a user.
# PLUG HERE the creation of a temporary tablespace if export fails due to lack
# of temporary space.
# For example in Unix these two lines would do that:
#sql clone “create temporary tablespace temp
# tempfile ”/tmp/aux_tspitr_tmp.dbf” size 500K”;
Depending upon your environment/situation, you may have to edit the recover.txt (although you can edit recover.bsq file and not have the changes overwritten) and uncomment the two lines associated with creating a temporary tablespace. Okay, that is one fix related to the export failing. The explanation for why a temporary tablespace is needed is that the sort area (in memory) is not large enough, and just like any other database (i.e., operating normally, not needing any recovery), if sorts cannot be done in memory, then Oracle goes to the temp tablespace.
Your mileage may vary on this, but uncomment the “create temporary tablespace” statement and use what is suggested. If you encounter another error, more than likely it is due to the fact the tempfile size is not large enough (the second fix). Instead of trying to tweak the size parameter so you just miss encountering the error, go big early and avoid the problem altogether. How big is big enough? That may be trial and error, but it is quite likely to be on the order of megabytes (e.g., 5MB). The point is this: don’t waste time setting it to 600K, then 700K, and so on.
Two points are in order here. First, this problem is supposedly fixed in 10gR2 and above, so you may never have to edit this file – at least for this reason. The second is that Oracle Corp. provides conflicting instructions on which file to edit: recover.bsq or recover.txt. As mentioned, the actual file says to edit the text version, but in a MetaLink note (“ORA-25153: Temporary Tablespace is Empty encountered on RMAN TSPITR,” Doc ID 263483.1), the instructions are:
Edit recover.bsq (RMAN library file) in $ORACLE_HOME/rdbms/admin (location is platform specific) and make the following changes. NOTE: It is recommended to take a backup of this file.
For sure, you can edit the recover.bsq file, but do make a copy of the original beforehand. If you want to experiment with the file, try adding your own comment lines here and there. Commented lines (those beginning with #) within the >>> to the <<< blocks are output to the RMAN session. For example, if seeing “PLUG HERE” grates on your sense of grammar, feel free to replace it with your own comment.
The instructional value of looking through all the tspitr_X sections is that you can follow the process at a higher level (as opposed to filtering out the RMAN output). If something goes wrong, you’ll have a better idea of where in the process the failure took place.
While on the subject of the recover.bsq file, a search on MetaLink (yes, I’m avoiding writing the clunky sounding “My Oracle Support”) for this file returns 29 entries (as of the publish date of this article). An interesting theme among the hits is that of the wrong version of the file being used. And like many other notes, check them for accuracy before making changes or running scripts.
One note related to upgrading the recovery catalog has you fix one error but only to introduce another via the changed script. The suggested change in Doc ID 833046.1 (“RMAN-600 [3040] TRYING TO UPGRADE THE RECOVERY CATALOG”) has an error in the SELECT clause (extra comma at the end of the last column selected). Not that you would be stressed for time during some hugely critical upgrade or recovery process or anything like that as there is always enough time to fix technical notes from Oracle.
Conclusion
The recover.bsq file is interesting in that it provides you with the nitty-gritty of what RMAN does during a recovery. All of the steps aren’t necessarily listed in the documentation, or in third-party books for that matter, and if you can read code at even a basic level, you can understand what is taking place when RMAN does its magic. Knowing about this file ahead of time can save you hours of searching for resolution of RMAN errors, and probably during a time when you really can’t afford those extra hours.
How To Recover From Corrupted OCR Disk
How To Recover From Corrupted OCR Disk
It is very common where a DBA is left with corrupted OCR disk without having any good backup.
The same situation was experienced by me few days back. One node of RAC database shows the following:
NODE1:
$ORA_CRS_HOME/bin/crs_stat -t
Name Type Target State Host
————————————————————
ora.orcl.db application ONLINE ONLINE raclinux1
ora….11.inst application ONLINE ONLINE raclinux1
ora….12.inst application ONLINE OFFLINE
ora….vice.cs application OFFLINE OFFLINE
ora….l11.srv application ONLINE OFFLINE
ora….l12.srv application ONLINE OFFLINE
ora….SM1.asm application ONLINE ONLINE raclinux1
ora….DC.lsnr application ONLINE ONLINE raclinux1
ora….abc.gsd application ONLINE ONLINE raclinux1
ora….abc.ons application ONLINE ONLINE raclinux1
ora….abc.vip application ONLINE ONLINE raclinux1
ora….SM2.asm application ONLINE ONLINE raclinux2
ora….C2.lsnr application ONLINE ONLINE raclinux2
ora….bc2.gsd application ONLINE ONLINE raclinux2
ora….bc2.ons application ONLINE ONLINE raclinux2
ora….bc2.vip application ONLINE ONLINE raclinux2
The other node shows the following:
NODE2:
/crs_stat -t
HA Resource Target State
———– —— —–
ora.orcl.db OFFLINE OFFLINE
ora.orcl.racdb1.inst OFFLINE OFFLINE
ora.orcl.racdb2.inst OFFLINE OFFLINE
ora.orcl.test_service.cs ONLINE OFFLINE
ora.orcl.test_service.racdb1.srv OFFLINE OFFLINE
ora.orcl.test_service.racdb2.srv OFFLINE OFFLINE
ora.raclinux1 .ASM1.asm OFFLINE OFFLINE
ora.raclinux1 .LISTENER_RAC1 .lsnr OFFLINE OFFLINE
ora.raclinux1 .gsd OFFLINE OFFLINE
ora.raclinux1 .ons OFFLINE OFFLINE
ora.raclinux1 .vip OFFLINE OFFLINE
ora.raclinux2.ASM2.asm OFFLINE OFFLINE
ora.raclinux2.LISTENER_RAC2 2.lsnr ONLINE OFFLINE
ora.raclinux2.gsd ONLINE OFFLINE
ora.raclinux2.ons ONLINE OFFLINE
ora.raclinux2.vip ONLINE OFFLINE
We can see the inconsistent data across two node RAC. Every command for srvctl, crsctl was hanging on NODE 2.
Now the option is to restore the OCR backup, but if there is no backup available for OCR then we can use the following procedure to recover from corrupted OCR disk
(There will be complete downtime needed to perform these operations)
1. Check the status of CRS from node 1:
# ps -eaf |grep d.bin
root 12873 1 0 Aug11 ? 00:11:07 /u01/app/crs/bin/crsd.bin reboot
oracle 13105 12846 0 Aug11 ? 00:00:45 /u01/app/crs/bin/evmd.bin
oracle 13226 13200 0 Aug11 ? 00:13:13 /u01/app/crs/bin/ocssd.bin
root 21458 19986 0 20:34 pts/4 00:00:00 grep d.bin
2. Shutdown Oracle ClusterWare on all nodes:
[root@raclinux1 bin]# ./crsctl stop crs
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
Check the status again:
[root@raclinux1 bin]# ps -eaf |grep d.bin
root 21927 19986 0 20:34 pts/4 00:00:00 grep d.bin
It shows that the cluster is stopped.
3. Execute rootdelete.sh from all nodes.
It is under directory $ORA_CRS_HOME/install/rootdelete.sh
NODE1:
[root@raclinux1 install]# ./rootdelete.sh
Shutting down Oracle Cluster Ready Services (CRS):
Stopping resources.
Error while stopping resources. Possible cause: CRSD is down.
Stopping CSSD.
Unable to communicate with the CSS daemon.
Shutdown has begun. The daemons should exit soon.
Checking to see if Oracle CRS stack is down…
Oracle CRS stack is not running.
Oracle CRS stack is down now.
Removing script for Oracle Cluster Ready services
Updating ocr file for downgrade
Cleaning up SCR settings in ‘/etc/oracle/scls_scr’
NODE 2:
./rootdelete.sh
Shutting down Oracle Cluster Ready Services (CRS):
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
Shutdown has begun. The daemons should exit soon.
Checking to see if Oracle CRS stack is down…
Oracle CRS stack is not running.
Oracle CRS stack is down now.
Removing script for Oracle Cluster Ready services
Updating ocr file for downgrade
Cleaning up SCR settings in ‘/etc/oracle/scls_scr’
“OCR initialization failed accessing OCR device”, this error can occur due to folloing reasons:
1. ocrconfig_loc is not pointing to the correct ocr.
2. Problem of rights and owners on the ocr devices
3. Configuration problem on Oracle Cluster Synchronization Services
As the SCR entries are cleaned up so there is no need to worry about PROC-26 error.
If you have more than 2 nodes in a rac you need to run rootdelete.sh on all the other nodes also.
4. Run rootdeinstall.sh from the node where the RAC installation was done (usually it is the node1).
It will clear up the OCR disk contents.
./rootdeinstall.sh
Removing contents from OCR device
2560+0 records in
2560+0 records out
5. Run root.sh from the same node:
./root.sh
WARNING: directory ‘/u01’ is not owned by root
Checking to see if Oracle CRS stack is already configured
Setting the permissions on OCR backup directory
Setting up NS directories
Oracle Cluster Registry configuration upgraded successfully
WARNING: directory ‘/u01’ is not owned by root
assigning default hostname raclinux1 for node 1.
assigning default hostname raclinux2 2 for node 2.
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node :
node 1: raclinux1 raclinux1-priv raclinux1
node 2: raclinux2 raclinux2-priv raclinux2
Creating OCR keys for user ‘root’, privgrp ‘root’..
Operation successful.
Now formatting voting device: /dev/raw/raw1
Format of 1 voting devices complete.
Startup will be queued to init within 90 seconds.
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
CSS is active on these nodes.
raclinux1
CSS is inactive on these nodes.
raclinux2 2
Local node checking complete.
Run root.sh on remaining nodes to start CRS daemons.
After its completion run root.sh on all remaining nodes.
./root.sh
Checking to see if Oracle CRS stack is already configured
Setting the permissions on OCR backup directory
Setting up NS directories
Oracle Cluster Registry configuration upgraded successfully
clscfg: EXISTING configuration version 3 detected.
clscfg: version 3 is 10G Release 2.
assigning default hostname raclinux1 for node 1.
assigning default hostname raclinux2 for node 2.
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node :
node 1: raclinux1 raclinux1-priv raclinux1
node 2: raclinux2 raclinux2-priv raclinux2
clscfg: Arguments check out successfully.
NO KEYS WERE WRITTEN. Supply -force parameter to override.
-force is destructive and will destroy any previous cluster
configuration.
Oracle Cluster Registry for cluster has already been initialized
Startup will be queued to init within 90 seconds.
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
CSS is active on these nodes.
raclinux1
raclinux2
CSS is active on all nodes.
Oracle CRS stack installed and running under init(1M)
Running vipca(silent) for configuring nodeapps
The given interface(s), “eth0” is not public. Public interfaces should be used to configure virtual IPs.
The silent mode VIPCA configuration will fail because of BUG 4437727 in 10.2.0.1. To solve this run the
VIPCA manually from root user from last node where this error has occured and follow the instructions.
# $ORA_CRS_HOME/bin/vipca
6. Now final step is to add the resources back to OCR with srvctl command.
Adding DATABASE to OCR:
$srvctl add database -d db_unique_name -o oracle_home
[oracle@raclinux1 ~]$ $ORA_CRS_HOME/bin/srvctl add database -d orcl -o /u01/app/oracle/product/10.2.0/db_1
Adding INSTANCE to OCR:
srvctl add instance -d db_unique_name -i inst_name -n node_name
[oracle@raclinux1 ~]$ $ORA_CRS_HOME/bin/srvctl add instance -d orcl -i racdb1 -n raclinux1
[oracle@raclinux1 ~]$ $ORA_CRS_HOME/bin/srvctl add instance -d orcl -i racdb2 -n raclinux2 2
Adding SERVICES to OCR:
$srvctl add service -d db_unique_name -s service_name -r preferred_list
[oracle@raclinux1 ~]$ $ORA_CRS_HOME/bin/srvctl add service -d orcl -s test_service -r racdb1,racdb2
Adding NODEAPPS to OCR:
srvctl add nodeapps -n node_name -o oracle_home -A addr_str
Where addr_str= The node level VIP address
This command needs to be run from ROOT user otherwise you will get following error:
[oracle@raclinux1 ~]$ $ORA_CRS_HOME/bin/srvctl add nodeapps -n raclinux1 -o /u01/app/oracle/product/10.2.0/db_1 -A 10.167.21.89/255.255.255.0
PRKO-2117 : This command should be executed as the system privilege user.
[oracle@raclinux1 ~]$
[oracle@raclinux1 ~]$ su –
Password:
[root@raclinux1 ~]# cd /u01/app/crs/bin
[root@raclinux1 bin]# ./srvctl add nodeapps -n raclinux1 -o /u01/app/oracle/product/10.2.0/db_1 -A 10.167.21.87/255.255.255.0
[root@raclinux1 bin]#./srvctl add nodeapps -n raclinux2 2 -o /u01/app/oracle/product/10.2.0/db_1 -A 10.167.21.89/255.255.255.0
This will complete the OCR recreation, now you can test the status with cluvfy.
-
Archives
- February 2017 (1)
- November 2016 (1)
- October 2016 (1)
- May 2016 (2)
- March 2016 (3)
- December 2014 (2)
- July 2014 (1)
- June 2014 (6)
- May 2014 (5)
- February 2014 (1)
- December 2012 (2)
- November 2012 (8)
-
Categories
-
RSS
Entries RSS
Comments RSS