Fault-tolerant technology is a capability of a computer system, electronic system or network to deliver uninterrupted service, despite one or more of its components failing. Fault tolerance also resolves potential service interruptions related to software or logic errors. The purpose is to prevent catastrophic failure that could result from a single point of failure.

VMware vSphere 6 Fault Tolerance is a branded, continuous data availability architecture that exactly replicates a VMware virtual machine on an alternate physical host if the main host server fails.

Fault-tolerant systems are designed to compensate for multiple failures. Such systems automatically detect a failure of the central processing unit (CPU), I/O subsystem, memory cards, motherboard, power supply or network components. The failure point is identified, and a backup component or procedure immediately takes its place with no loss of service.

To ensure fault tolerance, enterprises need to maintain an inventory of spare, preconfigured hardware and a secondary uninterruptible power supply (UPS). The goal is to prevent the crash of key systems and networks, with a focus on maximizing uptime and minimizing downtime.

Fault tolerance can be provided with software, embedded in hardware, or by some combination of the two.

In a software implementation, the operating system (OS) provides an interface that allows a programmer to checkpoint critical data at predetermined points within a transaction. In a hardware implementation (for example, with Stratus and its Virtual Operating System), the programmer does not need to be aware of the fault-tolerant capabilities of the machine.
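
To make the software approach concrete, here is a minimal Python sketch of application-level checkpointing. The file name and the unit-of-work loop are hypothetical; the point is that critical state is saved atomically at predetermined points, so a restart resumes from the last checkpoint rather than from the beginning.

```python
import json
import os

CHECKPOINT_FILE = "transfer.ckpt"  # hypothetical checkpoint location

def checkpoint(state):
    """Persist critical state atomically so a restart can resume here."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())          # force the write to stable storage
    os.replace(tmp, CHECKPOINT_FILE)  # atomic rename: old file or new, never half-written

def restore():
    """Reload the last checkpoint after a crash, if one exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return None

# The transaction checkpoints at predetermined points; after a failure,
# processing resumes from the last completed step instead of the start.
state = restore() or {"step": 0, "processed": []}
for step in range(state["step"], 5):
    state["processed"].append(step)   # ...one unit of work...
    state["step"] = step + 1
    checkpoint(state)
```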

At a hardware level, fault tolerance is achieved by duplexing each hardware component. Disks are mirrored. Multiple processors are lockstepped together and their outputs are compared for correctness. When an anomaly occurs, the faulty component is determined and taken out of service, but the machine continues to function as usual.
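
At the hardware level the comparison logic is built into the machine itself, but the lockstep-and-vote idea is easy to sketch. The toy Python example below assumes three redundant units (triple modular redundancy): majority voting both masks the fault and identifies the disagreeing unit so it can be taken out of service.

```python
from collections import Counter

def lockstep(units, *args):
    """Run the same operation on redundant units and compare the outputs.

    Majority voting (as in triple modular redundancy) both masks the fault
    and identifies the disagreeing unit so it can be taken out of service.
    """
    outputs = [(unit, unit(*args)) for unit in units]
    majority, _ = Counter(result for _, result in outputs).most_common(1)[0]
    for unit, result in outputs:
        if result != majority:
            units.remove(unit)   # retire the faulty component
    return majority              # the machine continues to function as usual

# Hypothetical example: one of three redundant "processors" develops a fault.
good_a = lambda x: x * 2
good_b = lambda x: x * 2
faulty = lambda x: x * 2 + 1     # corrupted output
units = [good_a, good_b, faulty]
print(lockstep(units, 21))       # -> 42, and `faulty` is removed from `units`
```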

Fault tolerance vs. high availability

Fault tolerance is closely associated with maintaining business continuity via highly available computer systems and networks. A fault-tolerant environment continues delivering service with no perceptible interruption when a component fails, whereas a high-availability environment accepts brief interruptions and strives for five nines (99.999%) of operational service.

In a high-availability cluster, sets of independent servers are loosely coupled together to guarantee system-wide sharing of critical data and resources. The clusters monitor each other's health and provide fault recovery to ensure applications remain available. Conversely, a fault-tolerant cluster consists of multiple physical systems that share a single copy of a computer's OS. Software commands issued by one system are also executed on the other system.
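
A heartbeat monitor is the usual mechanism behind the mutual health checks described above. The following is a simplified Python sketch (the Node class, the timeout value and the placement policy are all illustrative, and it assumes at least one surviving node): nodes that miss their heartbeat are declared failed and their services are restarted on a surviving peer.

```python
import time

HEARTBEAT_TIMEOUT = 3.0   # seconds of silence before a node is declared failed

class Node:
    def __init__(self, name):
        self.name = name
        self.last_heartbeat = time.monotonic()

    def beat(self):
        """Called whenever this node's heartbeat message arrives."""
        self.last_heartbeat = time.monotonic()

def monitor(nodes, services):
    """One monitoring pass: detect silent nodes and fail their services
    over to a surviving peer so applications remain available."""
    now = time.monotonic()
    alive = [n for n in nodes if now - n.last_heartbeat < HEARTBEAT_TIMEOUT]
    for node in [n for n in nodes if n not in alive]:
        for svc in services.pop(node.name, []):
            target = alive[0]   # simplest placement policy: first survivor
            services.setdefault(target.name, []).append(svc)
            print(f"{svc}: failed over {node.name} -> {target.name}")
    return alive
```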

The trade-off between fault tolerance and high availability is cost. Systems with integrated fault tolerance incur a higher cost due to the inclusion of additional hardware.

What is graceful degradation?

Fault tolerance is often used synonymously with graceful degradation, although the latter is more aligned with the more holistic discipline of fault management, which aims to detect, isolate and resolve problems pre-emptively. A fault-tolerant system swaps in backup componentry to maintain high levels of system availability and performance. Graceful degradation allows a system to continue operations, albeit in a reduced state of performance.

Matching data protection and fault tolerance

Fault tolerance hinges on redundancy. Typically, information is redundantly protected via data replication or synchronous mirroring of volumes to an off-site data center. For physical redundancy, extra hardware remains on standby for failover of operational systems.

Data backup is frequently combined with redundancy. Both strategies are intended as a safeguard against data loss, although backup tends to focus on point-in-time recovery, including granular recovery of a discrete data object. Redundant systems are engineered specifically for application workloads that tolerate very little downtime.

When implementing fault tolerance, enterprises should match data availability requirements to the appropriate level of data protection with a redundant array of independent disks (RAID). The RAID technique writes data across multiple hard disks to balance I/O operations and boost overall system performance, while mirroring or parity provides redundancy against disk failure.

Organizations that prioritize fault tolerance above speed and performance would be best served by RAID 1 disk mirroring or RAID 10, which combines disk mirroring and disk striping. If fault tolerance and system performance are equally important, an enterprise may find it worthwhile to spend a little extra money on RAID 6, or double-parity RAID, which tolerates two disk failures before data is lost. Aside from the higher cost, the other drawback is that data writes occur more slowly to the RAID set.
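
The redundancy behind parity-based RAID is just an XOR across the stripe, which can be verified in a few lines of Python. The sketch below shows single-parity reconstruction (RAID 5 style); RAID 6 adds a second, independently computed parity, which is why it survives two simultaneous disk failures.

```python
def parity(blocks):
    """XOR parity across equal-sized blocks, as in striped parity RAID."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Stripe three data blocks plus one parity block across four disks.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
p = parity([d0, d1, d2])

# If any single disk is lost (say d1), XOR of the survivors rebuilds it.
assert parity([d0, d2, p]) == d1

# RAID 6 adds a second, independently computed parity block, which is
# what allows it to survive two simultaneous disk failures.
```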

Aside from hardware, a fault-tolerant architecture should be coordinated with regularly scheduled backups of critical data, perhaps including a mirrored copy at a secondary or alternate location. Security needs to be part of the planning as well, both to prevent unauthorized access and to keep antivirus tools and the OS patched to current versions.

Which industries depend on system fault tolerance?

Fault tolerance refers not only to the consequence of having redundant equipment, but also to the ground-up methodology computer makers use to engineer and design their systems for reliability. Fault tolerance is a required design specification for computer equipment used in online transaction processing systems, such as airline flight control and reservations systems. Fault-tolerant systems are also widely used in sectors such as distribution and logistics, electric power plants, heavy manufacturing, industrial control systems and retailing.

What is fault tolerance?

Fault tolerance refers to the ability of a system (computer, network, cloud cluster, etc.) to continue operating without interruption when one or more of its components fail.

The objective of creating a fault-tolerant system is to prevent disruptions arising from a single point of failure, ensuring the high availability and business continuity of mission-critical applications or systems.

Fault-tolerant systems use backup components that automatically take the place of failed components, ensuring no loss of service. These include:

  • Hardware systems that are backed up by identical or equivalent systems. For example, a server can be made fault tolerant by using an identical server running in parallel, with all operations mirrored to the backup server.
  • Software systems that are backed up by other software instances. For example, a database with customer information can be continuously replicated to another machine. If the primary database goes down, operations can be automatically redirected to the second database (see the sketch after this list).
  • Power sources that are made fault tolerant using alternative sources. For example, many organizations have power generators that can take over in case main line electricity fails.
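
Here is a minimal Python sketch of the database failover pattern from the second item above. The Database class is a hypothetical stand-in for a real client connection, and the sketch assumes the failure happens on the primary: every write is applied to both copies, and when the primary stops responding the replica is promoted and operations continue.

```python
class Database:
    """Stand-in for a real database client connection."""
    def __init__(self, name):
        self.name, self.rows, self.up = name, [], True

    def write(self, row):
        if not self.up:
            raise ConnectionError(self.name)
        self.rows.append(row)

primary, replica = Database("primary"), Database("replica")

def write_customer(row):
    """Replicate every write; redirect to the replica if the primary fails."""
    global primary, replica
    try:
        primary.write(row)
        replica.write(row)    # keep the backup copy continuously in sync
    except ConnectionError:
        # Simplification: assumes the failure is on the primary.
        primary, replica = replica, primary   # promote the replica
        primary.write(row)    # operations continue on the backup copy

write_customer({"id": 1, "name": "Ada"})
primary.up = False                           # simulate the primary going down
write_customer({"id": 2, "name": "Grace"})   # transparently redirected
```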

In similar fashion, any system or component that is a single point of failure can be made fault tolerant using redundancy.

Fault tolerance can play a role in a disaster recovery strategy. For example, fault-tolerant systems with backup components in the cloud can restore mission-critical systems quickly, even if a natural or human-induced disaster destroys on-premises IT infrastructure.

Fault tolerance vs. high availability

High availability refers to a system’s ability to avoid loss of service by minimizing downtime. It’s expressed in terms of a system’s uptime, as a percentage of total running time. Five nines, or 99.999% uptime, is considered the “holy grail” of availability.

In most cases, a business continuity strategy will include both high availability and fault tolerance to ensure your organization maintains essential functions during minor failures, and in the event of a disaster.

While both fault tolerance and high availability refer to a system’s functionality over time, there are differences that highlight their individual importance in your business continuity planning.

Consider the following analogy to better understand the difference between fault tolerance and high availability. A twin-engine airplane is a fault tolerant system – if one engine fails, the other one kicks in, allowing the plane to continue flying. Conversely, a car with a spare tire is highly available. A flat tire will cause the car to stop, but downtime is minimal because the tire can be easily replaced.

Some important considerations when creating fault tolerant and high availability systems in an organizational setting include:

  • Downtime – A highly available system has a minimal allowed level of service interruption. For example, a system with “five nines” availability is down for approximately 5 minutes per year (see the worked example after this list). A fault-tolerant system is expected to work continuously, with no acceptable service interruption.
  • Scope – High availability builds on a shared set of resources that are used jointly to manage failures and minimize downtime. Fault tolerance relies on power supply backups, as well as hardware or software that can detect failures and instantly switch to redundant components.
  • Cost – A fault tolerant system can be costly, as it requires the continuous operation and maintenance of additional, redundant components. High availability typically comes as part of an overall package through a service provider (e.g., load balancer provider).
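
The downtime figures quoted in the first item above fall straight out of the availability percentage. A quick Python calculation of the allowed annual downtime:

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label}: ~{downtime:.1f} minutes of downtime per year")

# five nines -> ~5.3 minutes per year, the figure quoted above
```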

Some of your systems may require a fault-tolerant design, while high availability might suffice for others. You should weigh each system’s tolerance to service interruptions, the cost of such interruptions, existing SLAs with service providers and customers, and the cost and complexity of implementing full fault tolerance.

Load balancing and failover: fault tolerance for web applications

In the context of web application delivery, fault tolerance relates to the use of load balancing and failover solutions to ensure availability via redundancy and rapid disaster recovery.

Load balancing and failover are both integral aspects of fault tolerance.

Load balancing solutions allow an application to run on multiple network nodes, removing the concern about a single point of failure. Most load balancers also optimize workload distribution across multiple computing resources, making them individually more resilient to activity spikes that would otherwise cause slowdowns and other disruptions.

In addition, load balancing helps cope with partial network failures. For example, a system containing two production servers can use a load balancer to automatically shift workloads in the event of an individual server failure.
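
A health-aware round-robin distributor is enough to illustrate this behavior. In the Python sketch below (Server and its healthy flag are hypothetical stand-ins for real health checks), a failed server is simply skipped, so its share of the traffic shifts to the survivors:

```python
import itertools

class Server:
    def __init__(self, name):
        self.name, self.healthy = name, True

    def handle(self, request):
        return f"{self.name} served {request}"

class LoadBalancer:
    """Round-robin across servers, skipping any that fail health checks."""

    def __init__(self, servers):
        self.servers = servers
        self._cycle = itertools.cycle(servers)

    def route(self, request):
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server.healthy:
                return server.handle(request)
        raise RuntimeError("no healthy servers left: escalate to failover")

lb = LoadBalancer([Server("app-1"), Server("app-2")])
lb.servers[0].healthy = False          # simulate an individual server failure
print(lb.route("GET /"))               # workload shifts to app-2 automatically
```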

Failover solutions, on the other hand, are used during the most extreme scenarios that result in a complete network failure. When these occur, a failover system is charged with auto-activating a secondary (standby) platform to keep a web application running while the IT team brings the primary network back online.

For true fault tolerance with zero downtime, you need to implement “hot” failover, which transfers workloads instantly to a working backup system. If maintaining a constantly active standby system is not an option, you can use “warm” or “cold” failover, in which a backup system takes time to load and start running workloads.
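
The practical difference between hot, warm and cold failover is how much startup work stands between the failure and the standby serving traffic. A toy Python sketch, with sleeps standing in for boot and state-loading time:

```python
import time

class Standby:
    """Toy standby platform; sleeps stand in for boot and state-loading time."""
    def __init__(self, mode):
        self.mode = mode                 # "hot", "warm" or "cold"

    def activate(self):
        start = time.monotonic()
        if self.mode == "cold":
            time.sleep(2.0)              # must boot and start the application
        elif self.mode == "warm":
            time.sleep(0.5)              # booted, but must still load state
        # hot: already running with synchronized state; nothing to do
        return time.monotonic() - start  # downtime added by the switchover

for mode in ("hot", "warm", "cold"):
    print(mode, f"-> {Standby(mode).activate():.1f}s of switchover delay")
```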

Imperva load balancing and failover solutions

Imperva offers a complete suite of web application fault tolerance solutions. The first among these is our cloud-based application layer load balancer that can be used for both in-datacenter (local) and cross-datacenter (global) traffic distribution.

The solution is provided via a load balancing as a service (LBaaS) model and is delivered from a globally distributed network of data centers for rapid response and added redundancy.

Intelligent data-driven algorithms (e.g., least pending requests) are used to track server loads in real time for optimized traffic distribution.
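
“Least pending requests” is itself a standard load-balancing algorithm, and a generic version (a sketch of the general technique, not Imperva’s implementation) fits in a few lines: the next request goes to the server with the fewest requests currently in flight, a live proxy for actual load.

```python
class Backend:
    def __init__(self, name):
        self.name, self.pending = name, 0

def pick_backend(backends):
    """Least pending requests: choose the server with the fewest requests
    currently in flight, a live data-driven proxy for actual load."""
    return min(backends, key=lambda b: b.pending)

backends = [Backend("dc-us-east"), Backend("dc-eu-west")]
chosen = pick_backend(backends)
chosen.pending += 1    # incremented on dispatch, decremented on response
```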

The other side of the coin is our failover solution that uses automated health checks from multiple geolocations to monitor the responsiveness of your servers.

In the event of a server failure, site traffic is rerouted to a backup site within seconds, ensuring uninterrupted availability. Because the service is delivered from the cloud, even the execution of a remote failover doesn’t suffer from the TTL-related delays commonly found in other DNS-based solutions.

For peace of mind, all Imperva Incapsula enterprise customers are also offered a 99.999% uptime SLA that reflects our confidence in the resiliency of our solution and the quality of our services.