Recovery Procedures When HSP is Present at Time of Failure

IBM-AUSTRIA - PC-HW-Support, 30 Aug 1999

The following instructions apply to the IBM SCSI-2 Fast/Wide PCI-Bus RAID Adapter and IBM Fast/Wide Streaming Adapter/A.

One DDD Drive, No OFL

Follow the steps below to bring the DDD drive back to ONL if the following items are true:

- Only one drive is marked DDD and the rest are ONL.
- The RAID logical drive status is OKY because an HSP is present in the system. Either the HSP drive is the hard drive that went DDD, or the HSP has already automatically taken over for the DDD drive and has been rebuilt successfully.
- There are no drives with an OFL status.

Once you verify the conditions above through either the RAID administration log or the RAID administration utility, perform the following steps to bring the DDD drive back to HSP status.

1. Physically replace the hard drive in the DDD bay with a new one of the same capacity or greater.
2. With a RAID-1 or RAID-5 array, the operating system is still functional at this point. Use either NetFinity or the RAID administration utility to bring the drive back to HSP status. With the RAID administration utility, open the options menu and select Replace Drive.
3. When you see the prompt to select the DDD drive, highlight the drive you just replaced and press Enter.
4. The RAID adapter issues a start unit command to the drive. Once the drive successfully spins up, the RAID adapter changes the drive's status to HSP and saves the new configuration.
5. If you see an 'Error in starting drive' message, reseat the cables, the hard drive, and so on to verify that they are connected properly, then go to step 2. If the error persists, go to step 1.
6. If the error still occurs with a known good hard drive, then troubleshoot to determine the defective part, which may be a cable, back plane, RAID adapter, etc. Once you have replaced the defective part so that there is a good connection between the RAID adapter and the hard drive, go to step 2.
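
The three conditions at the start of this section can be checked mechanically before any drive is touched. The following is a minimal Python sketch, not part of the IBM utilities. It assumes that the drive states have been copied by hand from the RAID administration utility or log into a dictionary; the function name and the bay labels are hypothetical.

```python
def one_ddd_no_ofl(drive_states, logical_status):
    """Return True when the 'One DDD Drive, No OFL' conditions all hold.

    drive_states   -- dict of bay label -> state string ("ONL", "DDD", ...),
                      transcribed by hand (assumption, not a utility export)
    logical_status -- status of the RAID logical drive, e.g. "OKY"
    """
    states = list(drive_states.values())
    one_defunct = states.count("DDD") == 1            # only one drive is DDD
    rest_online = states.count("ONL") == len(states) - 1  # the rest are ONL
    no_offline = "OFL" not in states                  # no drive is OFL
    return one_defunct and rest_online and no_offline and logical_status == "OKY"


# Hypothetical four-bay system with one defunct drive.
bays = {"bay1": "ONL", "bay2": "ONL", "bay3": "DDD", "bay4": "ONL"}
print(one_ddd_no_ofl(bays, "OKY"))
```

If the function returns False, one of the other sections of this document is likely to be the one that applies.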

Two DDD Drives, No OFL

If the system has two DDD drives, and a defined hot spare existed prior to the drive failures, then the system should still be up and running as long as the logical drives are configured as RAID-5 or RAID-1. If the system is still running, then one of the DDD drives becomes HSP when you replace it. Perform the following steps to bring the logical drive back to ONL status. (Because the operating system is functional, this procedure assumes you are using the RAID administration utility within the operating system to recover.)

1. Physically replace both drives that are marked DDD.
2. Once you replace both drives, select the options menu of the RAID administration utility. Choose Replace Drive, highlight the first DDD drive, and press Enter. You receive a message confirming that the drive is starting. After that, one of two things happens:
   - The drive starts the rebuild process. When the rebuild completes, the drive changes to ONL.
     -OR-
   - The drive becomes HSP. This happens if the actual hot-spare drive that was previously defined is defective, or a different drive was marked DDD and the hot spare successfully rebuilt the data before the second drive went down.
   You can check which one occurs by viewing the RAID log.
3. Repeat step 2 for the second DDD drive.
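
Because one replaced drive is rebuilt to ONL and the other becomes HSP, the end state of this procedure can be confirmed with a simple check. The sketch below is illustrative only. It assumes that the final drive states were read from the RAID log or utility and typed in by hand; the function name and bay labels are hypothetical.

```python
def two_ddd_outcome_ok(final_states, replaced):
    """Return True when the two replaced drives ended as one ONL and one HSP.

    final_states -- dict of bay label -> state after both Replace Drive runs
    replaced     -- the two bay labels that were physically replaced
    """
    ended = sorted(final_states[bay] for bay in replaced)
    # Sorted alphabetically, the expected pair is HSP followed by ONL.
    return ended == ["HSP", "ONL"]


# Hypothetical result: bay2 rebuilt to ONL, bay4 became the new hot spare.
after = {"bay1": "ONL", "bay2": "ONL", "bay3": "ONL", "bay4": "HSP"}
print(two_ddd_outcome_ok(after, ["bay2", "bay4"]))
```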

More than 2 DDD Drives, No OFL

In this scenario, the operating system is no longer functional. Therefore, you must boot to the RAID Option Diskette to recover the array. It is extremely important to confirm that either the RAID administration utility or NetFinity Manager has been running prior to the drives being marked defunct. If so, the utility or NetFinity Manager has logged the sequence of DDD events to a log file either on a diskette or on a local or network drive. You can view this log file on another machine to determine the 'inconsistent' drive. When you know which drive is 'inconsistent', you can attempt to recover data.

Note: The previous paragraph states 'attempt to recover' because once you lose more than one drive in a set of RAID-5 or RAID-1 logical drives, loss of data is definitely a possibility. The steps below guide you through a recovery, if at all possible.

1. View the RAID log on another machine and write down the order in which the drives went defunct.
2. Boot to the RAID configuration diskette, and select View Configuration. Make sure that the template contains the correct information for the status of all drives, not just those listed in the RAID log.
3. Using the RAID configuration utility, select Replace Drive and choose a DDD drive that is not listed in the RAID log. Repeat this step until the only DDD drives remaining are those indicated in the RAID log file.
   NOTE: The drives marked DDD that are not listed in the RAID log are the last ones to go defunct. You must recover these drives first so that the information from them can be used to rebuild the original drive that failed (the 'inconsistent' drive). If you do not replace the 'inconsistent' drive last, then the system uses it to rebuild the last drive that went defunct, resulting in corrupted data. Therefore, it is extremely important to perform step 3 carefully.
4. Select Replace Drive and then select the last drive to go defunct according to the log file. Repeat this step until you have replaced all drives in the correct order. One of the drives should appear as OFL and one should appear as HSP; the rest appear as ONL.
5. Select Rebuild and highlight the DDD drive.
6. If the rebuild completes successfully, reboot to the operating system. If it does not complete successfully, go to step 7.
   At this point, run non-destructive RAID diagnostics individually on each drive. Run these diagnostics individually to ensure that no more than one drive goes defunct at a time. If a drive does go DDD, physically replace that drive and run a replace/rebuild procedure. This verifies that you remove all defective drives from the system, if any exist.
7. If the rebuild process fails, then perform these steps:
   1. Exit to the RAID Main Menu.
   2. Select Drive Information and view the error counts for each of the hard drives to determine which drive has errors.
   3. If the errors occurred on the drive being rebuilt, then physically replace this drive. Select Replace. The status of the drive changes from DDD to OFL. Attempt the rebuild process again. If it completes successfully, go to Step 6.
      If the drive still fails the rebuild process, then verify that the drives being rebuilt from do not have any errors. If they have no errors, then you should be able to rebuild the data. Check cable connections to the drive being rebuilt. It is possible that you replaced a defective drive with another defective drive.
   - When errors occur on the drives that you are rebuilding from, the adapter continues to rebuild all information except that contained in the unrecoverable defective sector. If the unrecoverable sector was in the data area of the disk, then some data has been lost. There is no method at this time for determining whether the errors are in a data or non-data area of the disk. Users must inspect their personal files to determine this.
     To recover the portion of the data that was rebuilt, perform the following steps after the 'Rebuild Failure' message:
     - If a backup configuration is available, restore the backup configuration.
     - If a backup configuration is not available, write down the information you can retrieve by selecting the View Configuration option. Delete the array and manually create it to match this configuration information. Perform this step carefully, because if you deviate in any way from the original configuration, you will lose all data.
       NOTE: Do not Initialize this logical drive.
     - Have all users verify their personal files to ensure their data is good. Keep in mind that some files may be corrupt due to rebuild errors.
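
The order of the software Replace Drive operations is the part of this procedure where a mistake corrupts data, so it can help to write the order down before booting the diskette. The following Python sketch derives that order from the rule stated above: DDD drives missing from the RAID log go first, then the logged drives from the last to fail back to the first to fail, which leaves the 'inconsistent' drive at the end. It is an illustration under assumptions, not an IBM tool. The function name and bay labels are hypothetical, and the log order is assumed to have been transcribed by hand.

```python
def software_replace_order(ddd_drives, log_order):
    """Return the order in which to run Replace Drive on the DDD drives.

    ddd_drives -- every drive currently marked DDD
    log_order  -- drives named in the RAID log, first failure first
    """
    logged = [bay for bay in log_order if bay in ddd_drives]
    unlogged = [bay for bay in ddd_drives if bay not in log_order]
    # Drives absent from the log were the last to go defunct: replace first.
    # Logged drives follow in reverse failure order, so the first drive to
    # fail (the 'inconsistent' drive) is always replaced last.
    return unlogged + list(reversed(logged))


# Hypothetical case: the log shows bay3 failed first, then bay1;
# bay5 went defunct afterwards and never reached the log.
print(software_replace_order(["bay1", "bay3", "bay5"], ["bay3", "bay1"]))
# -> ['bay5', 'bay1', 'bay3']
```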

One or More DDD Drives, and One OFL Drive

Follow the same basic steps as those listed in the above section to recover your data. When a drive is marked OFL, that means that it is spinning but 'inconsistent' with the rest of the array. Usually when a drive is marked OFL, the data on it is being rebuilt from the remaining drives in the array. If the server loses power, or if another drive goes DDD during a rebuild, then the drive being rebuilt remains OFL. In this case, you have to boot the machine to the RAID Configuration Diskette and then follow the procedure in the previous section. Make sure that the OFL drive is the last drive to be software replaced. The offline drive is the 'inconsistent' drive, and it requires a rebuilding process.

NOTE: Data corruption occurs if the OFL drive is used to rebuild another drive.
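
A planned replacement order can be guarded against the mistake named in the note above. This is a minimal Python sketch that rejects any plan in which the OFL drive is not the final entry. The function name and bay labels are hypothetical, and the drive states are assumed to have been copied by hand from the View Configuration screen.

```python
def check_offline_last(replace_order, drive_states):
    """Return replace_order unchanged, or raise ValueError if it is unsafe.

    replace_order -- bay labels in the order Replace Drive will be run
    drive_states  -- dict of bay label -> state string ("DDD", "OFL", ...)
    """
    offline = [bay for bay in replace_order if drive_states.get(bay) == "OFL"]
    if len(offline) != 1:
        raise ValueError("expected exactly one OFL drive in the plan")
    if replace_order[-1] != offline[0]:
        # Replacing the OFL drive early would let it be used as a rebuild
        # source for another drive, which corrupts data.
        raise ValueError(f"{offline[0]} is OFL and must be software replaced last")
    return replace_order


# Hypothetical array: bay1 is defunct, bay2 was mid-rebuild and is OFL.
states = {"bay1": "DDD", "bay2": "OFL", "bay3": "ONL"}
print(check_offline_last(["bay1", "bay2"], states))
```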
