DSDS Backup and Recovery Procedures

Author: Sean McManus
Creation Date: 17 March 2003

This document contains an overview of DSDS backup and recovery procedures. This information is relevant to the most common failure scenarios for the DSDS.


Backup Procedures
1) Operating System & Other Files
2) Oracle Database Files

Common Failure Modes
1) Power Outages
2) Disk failure
3) Oracle runs out of disk space
4) Oracle hangs/stalls
5) Other non-fatal errors

Backup Log (updated Sep 24, 2002)

Backup Procedures
Each week, the DSDS runs four distinct backup jobs on the two available DLT drives. With current hardware, a complete system backup is not possible on a daily basis. The scheme below prioritizes nightly backups according to value: GONG data and software are considered to be of higher value than operating system files. UFSDUMP is the preferred utility for the weekly operating system backup, whereas TAR is preferred for data files.

1) Operating System Dump (1x per week)

   Directory Locations: '/', '/usr', '/d1', '/database'
   Backup Script: /d1/local/bin/cron_dump_backup.csh
   Backup Device: /dev/rmt/3n
   Backup Log: /d1/opsdata/backup_log0 & /etc/dumpdates
   Error Log: All output is directed to backup log

   Restore Command: ufsrestore if /dev/rmt/3n

2) Oracle Application, User files, and Oracle data files (nightly backup)

   Directory Locations: '/usr', '/d1', '/database'
   Backup Script: /d1/local/bin/cron_tar_backup.csh
   Backup Device: /dev/rmt/3n
   Backup Log: /d1/opsdata/backup_log1
   Error Log: /d1/opsdata/backup_log1.err

   Restore Command: tar xvbf 2048 /dev/rmt/3n

3) Oracle Instrument Database (2x per week backup)

   Directory Locations: '/instrument', '/instrument2'
   Backup Script: /d1/local/bin/cron_tar_backup2.csh
   Backup Device: /dev/rmt/4n
   Backup Log: /d1/opsdata/backup_log2
   Error Log: /d1/opsdata/backup_log2.err

   Restore Command: tar xvbf 2048 /dev/rmt/4n

4) DSDS archived online data (3x per week backup)

   Directory Locations: '/online'
   Backup Script: /d1/local/bin/cron_tar_backup3.csh
   Backup Device: /dev/rmt/4n
   Backup Log: /d1/opsdata/backup_log3
   Error Log: /d1/opsdata/backup_log3.err

   Restore Command: tar xvbf 2048 /dev/rmt/4n
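
Before running any of the tar restore commands above, it can help to list what is on the tape without extracting anything. The following is a minimal sketch, not part of the DSDS scripts: the function name list_backup is hypothetical, and it assumes the tape was written with the same blocking factor (2048) shown in the restore commands.

```shell
# Sketch: list the contents of a tar backup without extracting it.
# list_backup is a hypothetical helper name, not an existing DSDS script.
# The 't' option replaces the 'x' of the documented restore command, so
# nothing is written to disk; 'b 2048' matches the documented blocking factor.
list_backup() {
    device=$1
    tar tvbf 2048 "$device"
}
```

For example, "list_backup /dev/rmt/4n" would list the archive at the current tape position. Because the device name ends in 'n' (no rewind), the tape may need to be repositioned before the actual restore.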

All backup operations are performed as root to ensure complete capture of files.

DSDS operators are notified via email of successful or failed backup jobs.

One week's worth of backup tapes is stored adjacent to the tape drives of the DSDSOM1 workstation.

Every six months, a set of backup tapes will be archived permanently to the DSDS vault, with copies stored at Kitt Peak.

Common Failure Modes

1) Power Outages (also applicable to intentional power cycling)

In the event of a power cycle or reboot, DSDSOM1 automatically starts two critical applications for essential DMAC functions:

1) Oracle Database Application
startup script: /etc/rc2.d/S99dbora
2) Apache Web Server Applications
startup script: /etc/rc2.d/S95wwwstartup

The startup scripts run automatically at boot time. If for some reason the DSDS does not come online automatically, the applications may be started manually using the following commands:

	Start the Apache Server:

	Start the Oracle Database:
	Log in as the operating system user 'oracle'.

	dsdsom1% cd /usr/users/oracle/app/oracle/product/7.3.3/dbs/
	dsdsom1% svrmgrl

	SVRMGR> connect internal

	Shut down any existing instance of Oracle:

	SVRMGR> shutdown

	Restart Oracle:

	SVRMGR> startup open pfile = 'initDSDS.ora'
	SVRMGR> exit
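
The interactive session above can also be scripted. The sketch below only prints the same Server Manager commands in order, so that they can be piped into svrmgrl. It assumes svrmgrl reads commands from standard input (as is common in Oracle startup scripts); the function name oracle_restart_commands is hypothetical.

```shell
# Sketch: emit the documented Server Manager commands, one per line.
# oracle_restart_commands is a hypothetical helper name.
# The commands themselves are copied from the manual procedure above.
oracle_restart_commands() {
    cat <<'EOF'
connect internal
shutdown
startup open pfile = 'initDSDS.ora'
exit
EOF
}
```

A possible use, as the 'oracle' user: change to /usr/users/oracle/app/oracle/product/7.3.3/dbs/ and run "oracle_restart_commands | svrmgrl". Note that a plain "shutdown" waits for connected users to disconnect, so the restart may not be immediate.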

2) Disk failure

When a disk failure occurs, shut down Oracle to prevent any further database transactions while the problem is being analyzed. Make every reasonable effort to restore the existing disk media.

If the disk drive is unsalvageable:

1) Shut down Oracle.
2) Restore the appropriate partition(s) in the operating system from the most recent backup tape.
3) Restart Oracle.

Oracle will recognize the datafiles from the last successful backup.

3) Oracle runs out of disk space.*

Occasionally, an Oracle database table or tablespace will outgrow one of its assigned datafiles, causing an application to crash. In this situation, the database administrator should be able to isolate the problem by thoroughly reviewing the error log, then analyzing the database tables to identify the table that has filled.

*This section will be expanded with appropriate SQL scripts.

4) Oracle hangs/stalls

A disk may have filled up. Use 'df -k' to check partitions, and move non-critical files off the filled partition. Oracle will resume automatically.
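
The 'df -k' check can be narrowed to the partitions that are nearly full. This is a minimal sketch under stated assumptions: the function name full_partitions and the default threshold of 95 percent are hypothetical, and the capacity and mount point are assumed to be the last two columns of each 'df -k' line. On Solaris, nawk may be needed in place of awk.

```shell
# Sketch: read `df -k` output on stdin and print the mount point and
# capacity of every partition at or above a threshold (default 95).
# full_partitions and the default threshold are hypothetical choices.
full_partitions() {
    threshold=${1:-95}
    awk -v limit="$threshold" '
        NR > 1 && NF >= 5 {
            pct = $(NF - 1)          # capacity column, e.g. "99%"
            sub(/%/, "", pct)        # strip the percent sign
            if (pct + 0 >= limit)
                print $NF, $(NF - 1) # mount point, capacity
        }'
}
```

Usage: "df -k | full_partitions 95". Partitions that hold Oracle data files may legitimately appear in this list.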

Note that it is possible for disk partitions containing Oracle data files to run at 100% capacity. Do not move or delete Oracle data files (files with the .dat extension in /database, /instrument, and /instrument2).

Oracle creates temporary files located at '/d1/opsdata/ckp'. All but the 10 most recent files can be safely deleted to free up disk space.
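
One way to find the files that are candidates for deletion is sketched below. It only lists them; nothing is removed. The function name old_checkpoints is hypothetical, and the sketch assumes the directory contains only the temporary files in question, with no newlines in their names. On Solaris, nawk may be needed in place of awk.

```shell
# Sketch: list every file in a directory except the N most recently
# modified ones (defaults: /d1/opsdata/ckp, keep 10).
# old_checkpoints is a hypothetical helper name. It prints file names
# only and deletes nothing, so the list can be reviewed first.
old_checkpoints() {
    dir=${1:-/d1/opsdata/ckp}
    keep=${2:-10}
    # ls -t sorts newest first; skip the first $keep entries.
    ls -t "$dir" | awk -v keep="$keep" 'NR > keep'
}
```

Running "old_checkpoints" with no arguments lists the files in /d1/opsdata/ckp beyond the 10 newest. After reviewing the list, the files can be removed by hand.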

5) Other non-fatal errors*

The DSDS may encounter a variety of non-fatal errors on a daily basis. If an error is encountered in one of the DSDS applications:

1) Contact the database administrator via phone or email.
2) Review the error log located at: /d1/opsdata/database_msgs
3) Compare error codes with documented codes located at www.oracle.com
4) Proceed with resolution according to Oracle documentation.
5) Back up the system before making any changes.
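
To make steps 2 and 3 quicker, the distinct Oracle error codes in the log can be pulled out and counted. This is a sketch only: the function name ora_error_codes is hypothetical, it assumes /d1/opsdata/database_msgs is a plain text file, and it assumes error codes appear in the usual ORA-nnnnn form. On Solaris, nawk may be needed in place of awk.

```shell
# Sketch: print each distinct ORA-nnnnn code found in a log file,
# most frequent first, preceded by its count.
# ora_error_codes is a hypothetical helper name.
ora_error_codes() {
    logfile=${1:-/d1/opsdata/database_msgs}
    awk '{
        line = $0
        # Pull out every ORA-<digits> token on the line.
        while (match(line, /ORA-[0-9]+/)) {
            print substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$logfile" | sort | uniq -c | sort -rn
}
```

Each code in the output can then be looked up in the Oracle documentation, as described in step 3.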


* This section will be expanded to include several examples of DSDS failure and recovery scenarios.

BACKUP LOG
     VSN       COPY     CONTENTS                      DATE PRODUCED

     10480     810095   GONG WWW & NSO Dig. Lib.      13-SEP-02
     10481     810096   GONG /Online Backup           13-SEP-02
     10482     810097   GONG IDB Backup               13-SEP-02
     10483     810098   DSDSOM1 SYSTEM BACKUP         19-SEP-02



Last updated by Sean McManus on Tuesday, Feb 18, 2003, 11:40