High Availability - System Design For High Availability

Paradoxically, adding more components to an overall system design can undermine efforts to achieve high availability, because complex systems inherently have more potential failure points and are more difficult to implement correctly. Some analysts argue that the most highly available systems adhere to a simple architecture: a single, high-quality, multi-purpose physical system with comprehensive internal hardware redundancy. However, this architecture requires the entire system to be brought down for patching and operating system upgrades. More advanced system designs allow systems to be patched and upgraded without compromising service availability (see load balancing and failover).

High availability implies that operation in complex systems is restored without human intervention. For example, an availability limit of 99.999% allows less than one second of downtime per day (about 0.86 seconds), which is impractical to meet using human labor: the need for human intervention for maintenance actions in a large system will exceed this limit. An availability limit of 99% allows an average of about 14.4 minutes of downtime per day, which is realistic for human intervention.
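The downtime figures above follow directly from the availability percentage. The following is a minimal sketch of that arithmetic; the function name `allowed_downtime_seconds` and the choice of a one-day period are illustrative assumptions, not part of any standard.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float = SECONDS_PER_DAY) -> float:
    """Return the downtime budget, in seconds, for an availability target.

    Hypothetical helper for illustration: the budget is simply the
    unavailable fraction (1 - availability) multiplied by the period.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_seconds * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Print the daily downtime budget for several common targets.
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = allowed_downtime_seconds(target)
        print(f"{target}% -> {budget:.2f} s/day ({budget / 60:.2f} min/day)")
```

For 99.999% the budget is roughly 0.86 seconds per day, and for 99% it is roughly 14.4 minutes per day, which is why only the latter leaves room for a human to respond.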

Redundancy is used to create systems with high levels of availability (e.g. aircraft flight computers). Such systems require high levels of failure detectability and the avoidance of common cause failures. The two kinds of redundancy are passive redundancy and active redundancy.

Passive redundancy is used to achieve high availability by including enough excess capacity in the design to accommodate a performance decline. The simplest example is a boat with two separate engines driving two separate propellers. The boat continues toward its destination despite failure of a single engine or propeller. A more complex example is multiple redundant power generation facilities within a large system involving electric power transmission. Malfunction of single components is not considered to be a failure unless the resulting performance decline exceeds the specification limits for the entire system.

Active redundancy is used in complex systems to achieve high availability with no performance decline. Multiple items of the same kind are incorporated into a design that includes a method to detect failure and automatically reconfigure the system to bypass failed items using a voting scheme. This approach is used with linked complex computing systems. Internet routing is derived from early work by Birman and Joseph in this area. Active redundancy may introduce more complex failure modes into a system, such as continuous system reconfiguration due to faulty voting logic.
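As a rough illustration of the voting scheme described above, the sketch below implements a simple strict-majority voter over the outputs of redundant units. It assumes the units produce directly comparable values; the function names `majority_vote` and `disagreeing_units` are hypothetical, and real voters in flight computers or distributed systems are considerably more involved.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of units, or None.

    None signals that no majority exists, so the voter cannot decide
    which units to trust.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def disagreeing_units(outputs: Sequence[Hashable],
                      agreed: Hashable) -> List[int]:
    """Return the indices of units whose output differs from the agreed value.

    These are the candidates to bypass when the system reconfigures.
    """
    return [index for index, out in enumerate(outputs) if out != agreed]


if __name__ == "__main__":
    readings = [42, 42, 7]            # three redundant units, one faulty
    agreed = majority_vote(readings)
    print(agreed)                     # value accepted by the voter
    print(disagreeing_units(readings, agreed))  # units to bypass
```

A bug in logic like this is exactly the kind of fault that could cause the continuous reconfiguration mentioned above, since healthy units might be bypassed repeatedly.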

Zero-downtime system design means that modeling and simulation indicate a mean time between failures that significantly exceeds the period of time between planned maintenance or upgrade events, or the system lifetime. Zero downtime involves massive redundancy, which is needed for some types of aircraft and for most kinds of communications satellites. The Global Positioning System is an example of a zero-downtime system.

Fault instrumentation can be used in systems with limited redundancy to achieve high availability. Maintenance actions occur during brief periods of downtime, and only after a fault indicator activates. A failure is significant only if it occurs during a mission-critical period.

Modeling and simulation are used to evaluate the theoretical reliability of large systems. The outcome of this kind of model is used to evaluate different design options. A model of the entire system is created, and the model is stressed by removing components. Redundancy simulation involves the N-x criterion, where N represents the total number of components in the system and x is the number of components faulted to stress the system. N-1 means the model is stressed by evaluating performance with all possible combinations in which one component is faulted. N-2 means the model is stressed by evaluating performance with all possible combinations in which two components are faulted simultaneously.
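The N-x procedure can be sketched as an enumeration of every combination of x faulted components. The example below assumes a deliberately simple model in which system performance is the sum of component capacities and the specification is a minimum total capacity; the function name `n_minus_x_failures` and the generator figures are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def n_minus_x_failures(capacities: Dict[str, float],
                       required_capacity: float,
                       x: int) -> List[Tuple[str, ...]]:
    """Return every set of x faulted components that violates the requirement.

    Assumed model: remaining capacity is the total capacity minus the
    capacity of the faulted components. An empty result means the design
    passes the N-x check under this model.
    """
    names = sorted(capacities)
    total = sum(capacities.values())
    failing = []
    for faulted in combinations(names, x):
        remaining = total - sum(capacities[name] for name in faulted)
        if remaining < required_capacity:
            failing.append(faulted)
    return failing


if __name__ == "__main__":
    # Hypothetical system: three generators, 100 units of demand.
    generators = {"gen_a": 50, "gen_b": 50, "gen_c": 50}
    print(n_minus_x_failures(generators, 100, 1))  # N-1 check
    print(n_minus_x_failures(generators, 100, 2))  # N-2 check
```

In this hypothetical system, any single generator can fail without dropping below the requirement, so the N-1 check reports no failing combinations, while every pair of simultaneous faults violates it. The number of combinations grows quickly with N and x, which is one reason such checks are usually run on a model rather than on the real system.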
