Reliability and Fault Tolerance of Windows NT Server


Reliability is a key component in today's network. One of the most powerful characteristics of the Microsoft® Windows NT™ Server operating system is its reliability. Built into every component of Windows NT Server, reliability provides maximum availability of information and services to users. The system ensures this high availability in three ways: by uniformly handling hardware and software system faults, protecting user programs from each other as well as the system, and providing data and system recovery mechanisms. By building reliability technology into Windows NT Server from the start, Microsoft has ensured the system's ability to tolerate faults while still maintaining the availability of the system, applications, network resources and data.

Windows NT Server includes the following reliability and fault tolerance capabilities:

Additionally, fault tolerance functionality in Windows NT Server can be enhanced through the use of third-party products from vendors such as Arcada, Cheyenne and Octopus.

Error Handling and Protected Subsystems

Software applications do not always operate as expected: they can fault. Windows NT Server is designed to tolerate these faults by ensuring that they do not affect other components of the operating system. For Windows NT Server, the first line of defense against software errors is its structured method of exception handling. When an abnormal event occurs, the event is captured and either the processor or the operating system issues an exception. This design ensures that no undetected error is allowed to influence the system or other user programs.

Windows NT Server also employs protected subsystems in its design. Protected subsystems are separate, unique memory locations that are assigned to different processes and applications. By isolating programs in this way, Windows NT Server ensures that a program fault will not affect the system's kernel and, as a result, crash the operating system. Similarly, programs are isolated from each other so that when a program faults, it does not adversely affect other programs running on the system. This architecture makes it safe to deploy new Windows NT Server-based applications. New applications can be run and tested on a Windows NT Server-based machine without concern that they will adversely affect the system or other production applications. As a result, deploying powerful, new server-based applications on Windows NT Server is less risky than it is with some other server operating systems.
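The isolation described above can be illustrated with a minimal sketch. It does not use any Windows NT Server interface; it only assumes that the host operating system gives each process its own address space. The two program snippets and the helper name run_isolated are hypothetical and exist only for this illustration.

```python
import subprocess
import sys

# Hypothetical "applications": one terminates abnormally, one behaves.
FAULTY_PROGRAM = "raise RuntimeError('simulated application fault')"
HEALTHY_PROGRAM = "print('healthy application still running')"


def run_isolated(source):
    """Run a snippet in its own process, and therefore its own address space."""
    return subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
    )


faulty = run_isolated(FAULTY_PROGRAM)
healthy = run_isolated(HEALTHY_PROGRAM)

# The fault stays inside the child process: the parent only observes a
# nonzero exit status, and the second program is unaffected.
print("faulty exit code:", faulty.returncode)
print("healthy output:", healthy.stdout.strip())
```

The parent keeps running after the faulty program terminates, which is the property that makes it reasonable to test a new application alongside production ones.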

Automatic Restart

The combination of structured exception handling and protected subsystems makes Windows NT Server system failures extremely rare. However, the operating system does include an automatic restart feature. In the event of a failure, the system can be configured to restart itself automatically, which helps maximize system uptime. To assist the administrator in determining the cause of the failure, Windows NT Server can be set to write its memory contents to a disk file before restarting, so that the file can be analyzed later.
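The restart-with-dump behavior can be pictured with a small sketch. This is an analogy at the application level, not the Windows NT Server mechanism: the function run_with_restart, the flaky_service workload, and the dump file name are all assumptions made for the example.

```python
import json
import os
import tempfile


def run_with_restart(task, max_restarts, dump_path):
    """Run task(state); on failure, dump the state to disk and restart."""
    restarts = 0
    while True:
        state = {"restarts": restarts}
        try:
            return task(state), restarts
        except Exception as exc:
            # Save what is known about the failure for later analysis.
            state["error"] = repr(exc)
            with open(dump_path, "w") as dump:
                json.dump(state, dump)
            if restarts >= max_restarts:
                raise
            restarts += 1


def flaky_service(state):
    # Hypothetical workload that fails on its first two runs.
    if state["restarts"] < 2:
        raise RuntimeError("simulated system failure")
    return "service running"


dump_file = os.path.join(tempfile.mkdtemp(), "memory.dmp")
result, restart_count = run_with_restart(
    flaky_service, max_restarts=5, dump_path=dump_file
)
with open(dump_file) as dump:
    last_dump = json.load(dump)
print(result, restart_count, last_dump["error"])
```

The dump is written before each restart, so the record of the most recent failure survives even though the service itself has recovered.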

Recoverable File System

While Windows NT Server is highly tolerant of software faults, it also excels in handling hardware faults such as disk and disk-related failures. Much of this disk fault tolerance is related to NTFS, the Windows NT file system. NTFS is a comprehensive, recoverable file system that provides virtually instant recovery from a disk failure. The file system logs each disk I/O operation as a unique transaction. When a user updates a file, the Log File Service logs redo and undo information for that transaction. Redo is the information that tells NTFS how to repeat the transaction; undo tells NTFS how to roll back the transaction. If a transaction completes successfully, the file update is committed. If the transaction is incomplete, NTFS ends or rolls back the transaction by following instructions in the undo information. If NTFS detects an error in the transaction, the transaction is also rolled back.

File system recovery is straightforward with NTFS. If the disk fails, NTFS performs three passes – an analysis pass, a redo pass, and an undo pass. During the analysis pass, NTFS appraises the damage and determines exactly which clusters must now be updated per the information in the log file. The redo pass performs all transaction steps logged from the last checkpoint. The undo pass backs out any incomplete (uncommitted) transactions.
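The three recovery passes can be sketched with a toy log. This is a simplified model and not the NTFS on-disk log format: the record layout, the cluster names, and the recover function are assumptions for illustration, and the whole log is treated as written since the last checkpoint.

```python
# Toy model: a "disk" of named clusters plus a log of update records.
# Each update record carries redo (new value) and undo (old value) data.

def recover(disk, log):
    """Replay a log after a crash using analysis, redo, and undo passes."""
    # Analysis pass: find the committed transactions and the clusters
    # that the log says must be brought up to date.
    committed = {rec["txn"] for rec in log if rec["op"] == "commit"}
    dirty = {rec["cluster"] for rec in log if rec["op"] == "update"}

    # Redo pass: reapply every logged update in order.
    for rec in log:
        if rec["op"] == "update":
            disk[rec["cluster"]] = rec["redo"]

    # Undo pass: back out updates of uncommitted transactions, newest first.
    for rec in reversed(log):
        if rec["op"] == "update" and rec["txn"] not in committed:
            disk[rec["cluster"]] = rec["undo"]

    return committed, dirty


log = [
    {"op": "update", "txn": 1, "cluster": "A", "undo": "a0", "redo": "a1"},
    {"op": "commit", "txn": 1},
    {"op": "update", "txn": 2, "cluster": "B", "undo": "b0", "redo": "b1"},
    # Crash happens here: transaction 2 never committed.
]

# Disk state at crash time: A's update had not reached disk, B's had.
disk = {"A": "a0", "B": "b1"}
committed, dirty = recover(disk, log)
print(disk)
```

After recovery, cluster A holds the committed value and cluster B is restored to its value from before the incomplete transaction.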

In addition to virtually instant recovery, NTFS supports hot-fixing. If an error occurs due to a bad sector, NTFS moves the information to a different sector and marks the original sector as bad. This process is completely transparent to an application performing disk I/O. Hot-fixing eliminates error messages such as the "Abort, Retry, or Fail?" error message that occurs when a file system such as FAT encounters a bad sector.
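A minimal sketch of the hot-fixing idea follows. It is a toy model only: the class HotFixDisk, its spare-sector pool, and the way a failing sector is simulated are assumptions for the example and do not reflect NTFS internals.

```python
class HotFixDisk:
    """Toy disk that remaps a failing sector to a spare one."""

    def __init__(self, sector_count, spare_count):
        total = sector_count + spare_count
        self.sectors = {n: None for n in range(total)}
        self.spares = list(range(sector_count, total))
        self.failing = set()  # sectors that will fail on the next write
        self.bad = set()      # sectors already marked bad
        self.remap = {}       # original sector -> replacement sector

    def write(self, sector, data):
        target = self.remap.get(sector, sector)
        if target in self.failing:
            # Hot fix: mark the sector bad, move the data to a spare
            # sector, and remember the new location.
            self.bad.add(target)
            target = self.spares.pop(0)
            self.remap[sector] = target
        self.sectors[target] = data

    def read(self, sector):
        return self.sectors[self.remap.get(sector, sector)]


hotfix_disk = HotFixDisk(sector_count=4, spare_count=2)
hotfix_disk.failing.add(2)          # simulate a bad sector
hotfix_disk.write(2, b"payroll")    # the caller sees no error
print(hotfix_disk.read(2), hotfix_disk.bad, hotfix_disk.remap)
```

The caller writes to and reads from sector 2 throughout; the relocation is invisible to it, which mirrors the transparency described above.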

Tape Backup Support

Regular tape backup is an important part of guaranteeing data availability. Windows NT Server includes a graphical tool called Backup that makes it easy to back up your Windows NT Server-based data to tape. Backup allows you to:

In addition to the built-in backup utility in Windows NT Server, third-party products (such as Arcada's Backup Exec for Windows NT) provide additional functionality such as the ability to create and script jobs, automated scheduling of jobs, remote administration of backup, and client-server backup to remote tape devices.

Uninterruptible Power Supply (UPS)

An uninterruptible power supply (UPS) is a battery-operated power supply connected to a computer to keep the system running during a power failure. The UPS service for Windows NT Server detects and warns users of power failures and manages a safe system shutdown when the backup power supply is about to fail. The Windows NT Server UPS service allows the user to set various options:

Understanding RAID

Fault-tolerant disk systems are standardized and categorized in six levels known as Redundant Arrays of Inexpensive Disks (RAID) level 0 through level 5. Each level offers a different mix of performance, reliability, and cost. The major difference between RAID and earlier, more expensive large-disk technologies is that RAID combines multiple disks with lower individual reliability ratings to reduce the total cost of storage. The lower reliability of each disk is offset by the redundancy. Windows NT Server supports disk striping (RAID level 0), disk mirroring (RAID level 1) and disk striping with parity (RAID level 5).

Disk Mirroring (RAID Level 1)
Disk mirroring is the creation and maintenance of an identical twin for a selected disk. Any file system, including FAT, HPFS, and NTFS, can take advantage of disk mirroring. Disk mirroring uses two partitions on different drives connected to the same disk controller. All data on the first (primary) partition is mirrored automatically onto the secondary partition. Thus, if the primary disk fails, no data is lost. Instead, the partition on the secondary disk is used.
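The mirroring behavior can be sketched as follows. The class MirrorSet and its methods are hypothetical names for a toy model in which each partition is a simple dictionary of blocks; it is not how Windows NT Server implements mirror sets.

```python
class MirrorSet:
    """Toy mirror set: every write goes to both partitions."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.primary_failed = False

    def write(self, block, data):
        if not self.primary_failed:
            self.primary[block] = data
        self.secondary[block] = data

    def read(self, block):
        # Fall back to the secondary partition if the primary has failed.
        source = self.secondary if self.primary_failed else self.primary
        return source[block]

    def fail_primary(self):
        # Simulate losing the primary disk and everything on it.
        self.primary_failed = True
        self.primary.clear()


mirror = MirrorSet()
mirror.write(0, "ledger")
mirror.fail_primary()
print(mirror.read(0))
```

Because the secondary partition already holds an identical copy, the read after the simulated failure still returns the data.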

Mirroring is not restricted to a partition identical to the primary partition in size, number of tracks and cylinders, and so on. This eliminates the problem of acquiring an identical model drive to replace a failed drive when an entire drive is being mirrored. For practical purposes though, the mirrored partitions will usually be created to be the same size as the primary partition.

Disk mirroring has better overall read and write performance than stripe sets with parity. Another advantage of mirroring over stripe sets with parity is that there is no loss in performance when a member of a mirror set fails. Disk mirroring, however, is more expensive in terms of dollars per megabyte because disk utilization is lower than with striping with parity. Disk mirroring is best suited for peer-to-peer and modest server-based LANs.
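The utilization difference can be made concrete with a short calculation. It assumes equal-size members and that a stripe set with parity gives up the equivalent of one member to parity and needs at least three members; the function names are hypothetical.

```python
def mirror_utilization():
    """Usable fraction of raw capacity in a two-way mirror set."""
    return 1 / 2


def parity_stripe_utilization(disk_count):
    """Usable fraction of raw capacity in a stripe set with parity.

    Assumes equal-size members, with one member's worth of space
    consumed by parity information.
    """
    if disk_count < 3:
        raise ValueError("a stripe set with parity needs at least 3 disks")
    return (disk_count - 1) / disk_count


for disks in (3, 5, 10):
    print(disks, "disks:", round(parity_stripe_utilization(disks), 3))
print("mirror:", mirror_utilization())
```

Under these assumptions a mirror set always yields half of its raw capacity, while a stripe set with parity yields a larger fraction as members are added, which is why its cost per stored megabyte is lower.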

Disk Duplexing
Disk duplexing is simply a mirrored pair with an additional adapter on the secondary drive. Duplexing provides fault tolerance for both disk and controller failure. In addition to providing fault tolerance, it can also improve performance. Like mirroring, duplexing is performed at the partition level. To the Windows NT Server operating system, there is no difference between mirroring and duplexing. It is simply a matter of where the other partition can be found.

Disk Striping with Parity (RAID Level 5)
Disk striping with parity is another popular method of protecting data against disk failure. With disk striping, data is divided into large blocks and spread in a fixed order among multiple disks in an array. In a stripe set with parity, parity information for the data is also written across the array with the condition that the parity information and data reside on different disks. If a member of the disk array fails, its data can be recovered from the remaining data and the parity information, since they are stored on different disks. One advantage of stripe sets with parity is that they have better read performance (although slower write performance) than mirror sets. Another advantage is that the cost per stored megabyte is typically lower with stripe sets with parity than with mirrored sets because disk utilization is much higher.
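Recovery from parity can be sketched for a single stripe. The example assumes XOR parity, the usual choice for RAID level 5; the text above does not name the parity function, and the helper xor_blocks and the sample data are made up for the illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(
        reduce(lambda a, b: a ^ b, column) for column in zip(*blocks)
    )


# One stripe: one block on each of three data disks, parity on a fourth.
data_blocks = [b"ACCT", b"0042", b"PAID"]
parity = xor_blocks(data_blocks)

# Simulate losing the second disk, then rebuild its block from the
# surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)
```

The rebuild has to read every surviving member of the stripe, which gives some intuition for why a stripe set with parity slows down while one of its members is failed.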

To fully understand Windows NT Server-based disk striping, it is valuable to compare it to hardware-based striping systems. Hardware implementations of a RAID level can offer performance advantages over software implementations. With some systems, it may even be possible to replace a failed drive without shutting down the system. However, hardware RAID implementations tend to be very expensive and may require an organization to lock in to a single-vendor solution. The RAID technology of Windows NT Server is powerful and cost-effective and provides a consistent implementation across numerous hardware platforms. Windows NT Server's RAID technology provides greater flexibility in mixing systems that provide optimum price and performance for customer needs.

Other Types of Fault Tolerance

System Fault Tolerance III
Another popular fault tolerance technology is Novell®'s System Fault Tolerance III (SFT III). SFT III uses a dedicated link to connect a NetWare®-based server to a local mirror site. It then creates a replica of the entire server on the mirror, including RAM contents. If the primary server experiences a hardware failure such as a processor fault, the mirror automatically assumes responsibility for network services.

Octopus 1.4
Octopus Technologies, Inc. offers a product for Windows NT Server that provides some distinct advantages over SFT III. Octopus 1.4 provides real-time fault tolerance for Windows NT Server by updating copies of selected files on a mirror server as the file system commits file updates on the primary server. As a result, data is mirrored in real time. Octopus also uses a standard LAN or WAN connection between the primary and mirror servers. No dedicated link is required and mirrors do not have to be local. Instead, they can be anywhere on the network, including remote locations. This is particularly valuable in highly active, mission-critical transaction processing environments. If a transaction processing application server went down due to extreme circumstances in one location (such as an earthquake), a server in a remote location would be ready to begin running that application with all its associated data.

Octopus is also a very cost-effective solution. First, it is a fraction of the cost of SFT III. In addition, it minimizes network bandwidth consumption by sending only file changes to the mirror, not the whole file. Octopus allows users to selectively mirror files rather than forcing mirroring of the entire server as SFT III does. As a result, Octopus provides more flexibility because it does not have to dedicate a machine strictly to mirroring a server. Also, multiple primary servers can be mirrored to a single mirror server or multiple mirror servers. The result is maximum disk utilization while still offering a high level of data redundancy. User data and services are thereby fully protected in the most cost-effective way.
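The bandwidth saving from sending only file changes can be illustrated with a toy block comparison. This is not a description of how Octopus detects or transmits changes; the functions changed_blocks and apply_changes, the block size, and the sample data are assumptions for the sketch.

```python
def changed_blocks(old, new, block_size=4):
    """Return {offset: bytes} for the blocks of `new` that differ from `old`."""
    changes = {}
    for offset in range(0, len(new), block_size):
        block = new[offset:offset + block_size]
        if old[offset:offset + block_size] != block:
            changes[offset] = block
    return changes


def apply_changes(mirror_copy, changes, new_length):
    """Bring the mirror's copy up to date using only the changed blocks."""
    data = bytearray(mirror_copy[:new_length].ljust(new_length, b"\0"))
    for offset, block in changes.items():
        data[offset:offset + len(block)] = block
    return bytes(data)


primary_old = b"AAAABBBBCCCC"
primary_new = b"AAAAXXXXCCCC"

delta = changed_blocks(primary_old, primary_new)
mirror_new = apply_changes(primary_old, delta, len(primary_new))

sent = sum(len(block) for block in delta.values())
print(delta, sent, "of", len(primary_new), "bytes sent")
```

In this example only one of the three blocks crosses the network, and the mirror still ends up with a copy identical to the primary's file.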



© 1995 Microsoft Corporation 
Microsoft is a registered trademark and Windows NT is a trademark of Microsoft Corporation. 
NetWare and Novell are registered trademarks of Novell, Inc.