Design for safety is a method of design in which safety analysis is integral to the design process, rather than “added on” at the end. This approach has been made necessary by the ethical, economic, and legal demands of modern airplane system design. The evolution of aircraft systems and standards requires an evolution in the way we approach design. In the past decade, we have seen systems become more complex and more interdependent. As the demand for more complex systems has increased, so has the demand for safer systems.
Safety Through Integrity
As with most inventions, the first aerospace concept for providing safety through design was integrity. If a strong part is good, a stronger one is better. The idea is that if your design is good enough, the failure rate will be low enough that catastrophes will be virtually nonexistent.
To some extent, this attitude is still prevalent in many design circles. Unfortunately, as we will later see, huge improvements in the failure rates of system components will often make a negligible improvement in probability of system failure. At the time that the Wright brothers were taking to the air, integrity was the only design safety tool. The approach was to design components that were too good to fail.
In 1927 the Spirit of St. Louis made long flights that shocked the world. Its pilot, Charles Lindbergh, had a simple approach to design safety:
Give me one good engine and one good pilot.
Aircraft Expectations
Lindbergh’s “right stuff” attitude has endured in the field of aviation. Unfortunately, the 1920s and 1930s brought such a high rate of aircraft fatalities that the general public refused to fly.
There were a great number of single-failure incidents with deadly results. While changes in the way we approached design safety resulted from the need to reduce accident rates, our expectations of aircraft changed as well. Through the 1940s and 1950s, complexity of systems increased to meet increasing performance demands. But systems of this era were highly self-contained. The analog flight-stability computer was introduced in the late 1940s.
The 1960s and 1970s saw the introduction of systems controllers and, consequently, highly interconnected systems. Out of this arose a need for a more disciplined way to analyze the effects of failures during design. For example, the advent of autoland gave birth to the probabilistic safety requirements that we see in FAR 25.1309, which we will be reviewing shortly.
The 1980s introduced extensive reliance on automatic controls and inertial navigation systems. Category III B landings and two-person crews became more common, both increasing the importance of fault detection and the need for reliable monitoring. New transport aircraft are now fly-by-wire, which requires all system-design engineers to have a complete command of effective monitoring and of fault diagnosis and accommodation principles. The additional challenge will be to design for human factors, to avoid using the pilot as a monitor, and ultimately to cut airline inspection and maintenance costs. (These subjects are covered in great detail in Sections 7.8 and 7.9.)
A CAT III operation is a precision approach at lower than CAT II minima. Subcategories are as follows:
- • A category III A approach is a precision instrument approach and landing with no decision height or a decision height lower than 100 ft (30 m) and a runway visual range not less than 700 ft (200 m).
- • A category III B approach is a precision approach and landing with no decision height or a decision height lower than 50 ft (15 m) and a runway visual range less than 700 ft (200 m), but not less than 150 ft (50 m).
- • A category III C approach is a precision approach and landing with no decision height and no runway visual range limitation.
Design for Safety vs Cost-Effectiveness
The increased importance of cost-effectiveness might initially appear to be in conflict with a structured procedure for design safety. But the procedure for achieving safe designs does not alter the level of safety required by FAA regulations. Once the need to meet the requirements is accepted, the task becomes how to meet those requirements in an efficient and cost-effective manner. Correct design for safety provides lowest-cost systems to meet performance and safety goals. We can determine how much redundancy is really necessary to achieve safety goals and thereby avoid overdesign. The fact that unsafe aircraft are bad for business is of course understood by system designers. But an additional consideration, for a litigation-conscious society, is the issue of foreseeability for safe design. It is the designer’s responsibility to avoid injuries due to reasonably conceivable conditions. The matter of what is reasonably conceivable is a minor by-product of the failure mode analyses that are part of the procedures addressed in this course.
Unfortunately, most of the sophisticated safety analysis techniques developed in the past few decades have been used primarily as certification tools, thus creating a panic at certification time and, sometimes, the need to redesign a system in order to meet safety requirements.
Safety—An Integral Part of the Design Approach
Perhaps the most important single realization forced upon us by the demands of airplane design is the fact that safety must be an integral part of the design approach—not as a factor to be optimized, but as an absolute requirement, to be viewed in the same light as meeting the performance specifications of the system design. Traditionally, you have had a number of factors to consider when designing a component, including:
- • strength
- • weight
- • size
- • shape
- • wear
- • corrosion
- • safety
- • reliability
- • maintainability
- • human factors
Generally, these are factors to be optimized, and so in many ways, safety does not belong on this list.
Assessing Potential Failure
The optimum safety of a modern aircraft can only be achieved by assessing potential failures and errors—separately and in combination—and the degree of hazard resulting from those failures. The more complex the system, the more this must be done in a planned and organized fashion. Quantitative safety assessments are required by the FARs and JARs. If done early on, these safety assessments provide a method of designing a balanced set of systems with less waste, resulting in a minimum-cost, certifiable aircraft.
Resistance to Design for Safety
For many reasons, seat-of-the-pants design, retrofitted for safety, no longer works. There are too many failure conditions to juggle in your head. It is important to realize that the number of failure conditions is large compared to the number of components. Design must be done with safety analyses in hand. But despite the fact that designing without safety has been shown to be unworkable, we’ve all heard reasons why this is still the best way to go. At one time or another, you may have even “bought” some of these considerations yourself because, on the face of it, they sound so “rational.”
Famous Justifiers for Not Using Safety Analysis in Up-Front Design
- • “The system is known to be safe.”
This is an approach used by an OEM in discussions with the FAA regarding worn-brake certification. Contrary to intuition, achievement of 10 million flight hours does not support a claim of one-in-a-billion probability. Suppose the probability of flipping a coin and having it land on edge is known to be one in one thousand. You test this by flipping the coin a thousand times, but it does not land on edge once. This is not really surprising; there is a good chance that it will not land on edge once during the first thousand or even the second thousand flips, but will subsequently land on edge three times during the third thousand.
Just because it did not land on edge during the first thousand flips does not mean that the probability of landing on edge is less than one-in-a-thousand. (A short calculation at the end of this discussion makes the point concrete.)
- • “Schedule: not enough time.”
If you think the schedule is tight at the beginning of a program, just wait until the precertification panic sets in!
- • “I do not believe in probability.”
But the FAA surely does!
- • “We never needed it before.”
Despite the feeling that little has changed in aircraft design, a look at Lamm’s schematics shows a quantum leap in interaction of systems. Also, old designs would be uncertifiable under current FARs. Safety cannot be certified into a system; it must be designed in. As systems become more and more complex, you need to know more and more about what is expected of you for certification, so that the system you are designing is certifiable in the first place.
You cannot wait for the reliability engineer to do some analysis and ensure that your system is good. We have a lot of evidence that no matter how good a system designer you are, you cannot anticipate the results that a structured analysis would provide. Systems are too complex for you to intuitively know that a system is certifiable. Additionally, if you cannot intuitively know that it is certifiable, you cannot intuitively know that it is safe.
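To see why service history alone proves little, consider the arithmetic. The following is a minimal sketch I have added (assuming independent trials), not a calculation from the original text: even a system whose true catastrophic failure rate is a hundred times worse than the one-in-a-billion target has a good chance of logging 10 million clean flight hours, just as the 1-in-1000 coin has a good chance of never landing on edge in a thousand flips.

```python
# Sketch: a clean record is weak evidence of a very low failure probability.
def p_zero_events(rate_per_trial, trials):
    """Probability of observing zero events in `trials` independent trials."""
    return (1.0 - rate_per_trial) ** trials

# Coin that lands on edge once per thousand flips, flipped a thousand times:
print(p_zero_events(1e-3, 1_000))        # about 0.37 -- no edge landing is likely

# System whose true catastrophic rate is 1e-7 per hour (100x worse than the
# one-in-a-billion target), observed over 10 million flight hours:
print(p_zero_events(1e-7, 10_000_000))   # about 0.37 -- a clean record is still likely
```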
Probabilistic Aspects of Design Safety
Achievement of the objectives of design for safety mandates a probabilistic approach. Bad events must be rare, and terrible events rarer still. Probability is the only reasonable basis for system safety requirements.
In an article on probabilistic risk analysis, Dr. George Apostolakis of UCLA stated:
There are still controversies and misunderstandings regarding its use. Engineers . . . are asked to deal with methods that require considerable subjective judgment, and, because they are unaccustomed to such mixing of ‘objective’ facts with ‘subjective’ judgments, they are left with the feeling that the whole exercise lacks scientific rigor.
The few engineers who have taken courses on probability and statistics in their college days find that their notion of probability . . . is challenged by the requirements of a PSA for a real system and by the fact that major accidents are rare.
In spite of these limitations, probability is still the only rational way that is available to us for handling uncertainty.
The question that immediately arises is why probabilistic safety analysis is not universally accepted. A major reason must be that most engineers lack a strong statistical background.
One of the difficulties engineers and others have with probability theory is that it is often counterintuitive. A great example of this appeared in newspapers across the country.
The Reality of Probability Clashes With Intuition
Marilyn vos Savant, columnist for Parade, at one point discussed a simple probability problem that shows just how serious the general lack of understanding of probability can be. She proposed a game show situation often used in college classrooms:
Suppose that in a game show you are given a choice of three doors. Behind one door is the big prize, a million dollars, and behind the other two there is nothing. The game show host, who always knows where the money is, tells you to pick a door. Say you pick Door Number 1.
The host then opens another door—number two, for instance—showing that it has nothing behind it. The host then gives you the famous choice, ‘Well, now would you like to keep your Door Number 1, or trade me for what’s behind Door Number 3?’ Which would you do?
If you answered that you would keep Door Number 1, like most people, you would be wrong. How can this be, you ask. It is obviously a 50/50 chance between Number 1 and Number 3 so there is no particular reason to trade. Of course, in reality, the host’s Door Number 3 now has twice the probability (66%) that your Door Number 1 has (33%), and you just chose to cut your chances of winning a million bucks in half.
After giving readers a week to ponder this problem, Marilyn vos Savant then announced the correct answer—that you should trade doors—the following Sunday. She gave a simple explanation for the rationale.
The first door has a 1/3 chance of winning and the remaining closed door, after the choice is given, has a 2/3 chance. After your initial selection you had a one in three chance of getting the money. The host, of course, then had a two in three chance of having it behind either of his two doors. Remember that either you have it or you don’t (total probability equals one). Since you have a one in three chance of guessing right, you have a two in three chance of guessing wrong. So your two in three chance of guessing wrong is exactly equal to the chance that the host has the money behind one of his two remaining doors. Since he knows where the money is, he can always show you a door with nothing behind it, because he always has at least one. Opening such a door has no effect on the probability that your door is a winner.
Marilyn’s readers were outraged, the vast majority responding by telling her that she was in fact wrong and needed to print a retraction. She then published a second explanation with more detail, and additional examples.
Suppose that the game show host gave you a choice to pick one of a thousand doors and only one had a prize behind it. You chose one, leaving a long row of the host’s doors still closed. One by one, the host then opened nine hundred ninety eight of those remaining, leaving only one door closed, other than the one you selected. Does a trade seem more reasonable at this point?
Most readers still said no. Marilyn got letters from PhD after PhD. One, from the University of Florida, told her that there was enough mathematical illiteracy in the United States, and that we don’t need the world’s highest IQ propagating more. The deputy director of the Center for Defense Information and a research statistician from the National Institutes of Health demanded that she confess. One reader suggested that women just can’t do math. Thousands of others agreed, signing their names along with impressive credentials. A Georgetown professor asked how many irate mathematicians it would take to convince her she was wrong. And finally a PhD from the US Army Research Institute noted that if all those other PhDs were wrong, the world would be in very serious trouble. Very serious indeed.

In the following months, after computer simulations and actual tests with a host who always knew where the money was, and always then opened a door with no money, embarrassed readers began submitting their retractions.

The social aspects of the “Ask Marilyn” incident are as interesting as the problem itself, as irate readers were driven by their strong intuitions, not by analysis. And, unfortunately, our intuition seems to be poor where probability is concerned. The aircraft-design analogies to the “Ask Marilyn” problem are striking. Design engineers, whose probability skills are often rusty, are confronted with probabilistic safety calculations that just do not seem to make sense. Like Marilyn’s PhD readers, some noted engineers attempt to resolve the conflict by pointing to their long design experience, or by claiming technical superiority of one form or another. Like Marilyn, the safety analyst may be confronted by an army of consensus. As Marilyn stated in her column, math answers are not determined by authority or consensus. Answers come from understanding basic principles, and a structured analytical method.
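Those computer simulations are easy to reproduce. Here is a minimal Monte Carlo sketch I have added (not from the column), assuming a host who always knows where the prize is and always opens an empty door that you did not pick:

```python
import random

def play(switch, doors=3, trials=100_000):
    """Simulate the game; the host opens empty doors, leaving one other door closed."""
    wins = 0
    for _ in range(trials):
        prize = random.randrange(doors)
        pick = random.randrange(doors)
        # The one unopened door besides yours: the prize door if you missed it,
        # otherwise an arbitrary empty door.
        other = prize if pick != prize else next(d for d in range(doors) if d != pick)
        final = other if switch else pick
        wins += (final == prize)
    return wins / trials

print("stay:  ", play(switch=False))   # about 1/3
print("switch:", play(switch=True))    # about 2/3
```

Running the same code with doors=1000 reproduces the thousand-door variant: staying wins about 0.1 percent of the time, switching about 99.9 percent.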
Safety vs Reliability
Many companies have a department called Safety and Reliability. Safety and reliability are frequently thought to mean the same thing. They don’t.
For the purposes of this section we will use the following definitions:
- • Reliability: The inherent, designed-in attribute of any component, subsystem, or system that allows it to perform its intended function without failure.
- • Design safety: Another attribute of a component, subsystem, or system, deliberately designed in to minimize the probability of injury or death.
The primary concern of reliability is economic consequences. The primary concern of safety is loss of life.
Reliability is to be optimized. If some is good, more is better, but more reliability means more cost. Safety is not really optimized; an absolute minimum level must be met. If safety is achieved by a redundant, complex system capable of achieving high, functional reliability, this may come at the expense of degraded maintenance reliability. Since at least one goal of reliability is less maintenance, and since design for safety goals includes more system components needed for redundancy (requiring more maintenance), safety goals and reliability goals can sometimes conflict.
Hazards and Risk
The degree of risk for a given hazard is composed of (1) severity of effect, and (2) hazard probability.
High-risk occurrences are both highly probable and severe in effect. A severe hazard may be tolerable if it is rare enough; and a probable hazard may be tolerable if it is mild enough. This leads to a simple, perhaps intuitive concept that the probability of a hazard must be inversely proportional to its severity.
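That inverse relationship is usually expressed as a table of severity categories and maximum allowable probabilities. The sketch below is illustrative only; the category names and numeric targets are placeholders I have chosen, not the governing regulatory values:

```python
# Illustrative severity-to-probability screening (placeholder values only).
MAX_PROBABILITY_PER_FLIGHT_HOUR = {
    "minor":        1e-3,
    "major":        1e-5,
    "hazardous":    1e-7,
    "catastrophic": 1e-9,
}

def risk_acceptable(severity, predicted_probability_per_hour):
    """Higher severity demands a lower allowable probability."""
    return predicted_probability_per_hour <= MAX_PROBABILITY_PER_FLIGHT_HOUR[severity]

print(risk_acceptable("catastrophic", 2e-9))  # False: not rare enough for the severity
print(risk_acceptable("minor", 2e-9))         # True
```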
Safety integration in design should follow the following steps:
- 1. Identification of aircraft functions
- 2. Determination of failure consequences
- 3. Allocation of functions to functional systems
- 4. Allocation of safety requirements to system and people
- 5. Design of system architecture; refinement of baseline concept
- 6. Allocation of safety requirements to hardware and software
- 7. Hardware and software design
- 8. Hardware and software integration
- 9. System integration
FAR 25.1309(b) sets forth the probabilistic criteria that must be met to certify systems and components. Advisory Circular 25.1309-1B defines extremely improbable as one-in-a-billion, and improbable as one-in-a-million. The FAR requirement of one-in-a-billion means that, over a 20-year period, the probability of a fatal crash caused by failure of certified systems, for a line of 2000 planes, is about 0.7. Following is a sample calculation based on a one-in-ten-million probability of a fatal accident.
Say we have 2000 aircraft, each with 100 critical systems. If they meet the probabilistic criteria of the FAR, each aircraft (with its 100 critical systems on board) would have a catastrophic event probability of one in ten million per flight hour.
So, one aircraft flying an 8-hour flight would have a catastrophe probability of about 8 in 10 million. That aircraft’s probability of catastrophe in a year would be about 365 × 8 × 10⁻⁷, or about 3 × 10⁻⁴ (3 in 10,000).
Then, if all 2000 aircraft flew for 1 year, the probability of a catastrophe in the fleet would be about 2000 × 3 × 10⁻⁴, or 6 in 10 (0.6).
Does this then mean that the probability of catastrophe for the fleet in 10 years is 6 (10 years × 0.6)? Obviously, this is not the case, because a probability cannot exceed 1. It is actually about 0.7.
The defect in our math will be addressed later. But this example shows that the seemingly negligibly low probabilities required by the FAR would not seem ridiculously low to society.
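As a preview of that defect: probabilities of independent events are not simply added; the chance of at least one catastrophe is one minus the chance that every flight is clean. Below is a minimal sketch I have added using the assumptions stated above (8-hour flights, 365 flights per aircraft per year, 2000 aircraft, 10⁻⁷ per aircraft-hour); the exact figure it prints depends entirely on those utilization assumptions.

```python
p_per_flight = 8 * 1e-7            # catastrophic probability for one 8-hour flight
flights = 2000 * 365 * 10          # fleet exposure over 10 years, in flights

# Independent probabilities do not add; a probability can never exceed 1.
p_naive = flights * p_per_flight                # the broken calculation (exceeds 1)
p_fleet = 1 - (1 - p_per_flight) ** flights     # probability of at least one catastrophe

print(p_naive, p_fleet)
```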
Organizational Support
The following factors are necessary for effective safety efforts in an Integrated Product Team (IPT):
- • Program management: Provides money and schedule to allow integrated safety approach.
- • Design office: Recognizes top-down design with integral safety efforts and requires designers to employ this method.
- • Human factors: Ensures accurate representation of human interface.
- • Reliability: Provides historical data to support analysis (tells us how often a certain component may be expected to fail so we can design appropriately).
- • Software management: Provides verification and validation.
- • Maintainability: Ensures customer awareness of inherent hazards.
- • Product support: Feedback to design community of unexpected service problems.
- • Configuration management: Ensures that reality is equal to modeled systems.
- • Designers: Employ design for safety method, and challenge own designs.
Safer Systems and Products
Engineers make safe systems and products by:
- 1. their choice of failure modes of components,
- 2. their application of safety preferences, to include the following:
“System Safety Engineering Procedure,” which states company policy regarding design safety engineering;
“System Safety Procedure,” which states top-level requirements and responsibilities for design safety engineering;
“Design Safety Program Procedure,” which states the specific requirements that each design and development of a new airplane or major modification to an existing airplane shall have a design safety program and develop a DSPP (Design Safety Program Plan) to document the tasks that implement that program.
It is recommended that a Design Safety Manual (DSM) be developed, and the design team should follow that document as guidance for developing a Design Safety Program Plan in accordance with the company safety regulations. The DSM describes various safety assessment tools and sets forth guidance for their use. It also describes how design safety interfaces with other design and development activities. Managers and engineers have discretion in selecting which tools are appropriate for a particular design activity. The specific tools and tasks selected will depend on many factors, including the applicable contract (if military) or civil regulations (if commercial).
- 3. redundancy and its associated architecture
Design Order of Precedence
- • Design system to eliminate (preclude) hazard. Reduce associated hazard through design selection.
- • Control hazard through safety devices and features (reduce probability).
- • Provide warning devices.
- • Provide procedures and training; corrective action.
Examples of component selection in consideration of failure modes include:
- • solenoid vs motor-operated valve
- • preloaded springs
- • watchdog timers
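Of these, the watchdog timer is the one most often met in software. The sketch below is my own, purely illustrative addition (real watchdogs are usually a hardware or RTOS service): the failure mode being designed for is a hung process, and the designed response is a forced transition to a known-safe state.

```python
import threading

class Watchdog:
    """Toy software watchdog: if kick() is not called within timeout_s, run on_timeout."""
    def __init__(self, timeout_s, on_timeout):
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout    # the safe action, e.g. revert to a known-safe state
        self._timer = None
        self.kick()

    def kick(self):
        """Called periodically by the monitored task to prove it is still alive."""
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.timeout_s, self.on_timeout)
        self._timer.daemon = True
        self._timer.start()

# Usage: the control loop calls wd.kick() every cycle; if the loop hangs,
# the safe action runs half a second later.
wd = Watchdog(0.5, lambda: print("watchdog expired -> enter safe state"))
```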
Fail-Safe Design Concept
The following are 11 design principles or techniques that can be used in various combinations to provide a fail-safe design to ensure that major failure conditions are improbable and catastrophic failure conditions are extremely improbable:
- 1. Design integrity and quality, including life limits, to ensure intended function and prevent failures.
- 2. Redundant or backup systems to enable continued function after any single failure.
- 3. Isolation of systems, components, and elements so that the failure of one does not cause the failure of another. Isolation is also termed independence.
- 4. Proven reliability so that multiple, independent failures are unlikely to occur during the same flight.
- 5. Failure warning or indication to provide detection.
- 6. Flight crew procedures for use after failure detection, to enable continued safe flight and landing by specifying crew corrective action.
- 7. Checkability, the capability to check a component’s condition.
- 8. Designed failure effect limits, including the capability to sustain damage, to limit the safety impact or effects of a failure.
- 9. Designed failure path to control and direct the effects of a failure in a way that limits its safety impact.
- 10. Margins or factors of safety to allow for any undefined or unforeseeable adverse conditions.
- 11. Error tolerance that considers adverse effects of foreseeable error during the airplane’s design, test, manufacture, operation, and maintenance.
As a systems engineer, one should never lose sight of the fundamental importance of the fail-safe system design concept. We must resist the tendency to narrow our focus into an improper “forest for the trees” problem, where detailed focus on safety analyses supplants broad thinking and engineering judgment. The results of the safety analyses should always be reviewed for good sense and validated against the fail-safe system design concept.
Series or Parallel Architecture
One of the first choices a designer has, when considering functional redundancy, is the issue of series vs parallel architecture. Decisions regarding series and parallel architecture must take into account the criteria that have been set.
Suppose that a fire alarm system is installed on an aircraft that has a very low rate of fires, and that a false alarm in flight could result in an engine shutdown, emergency descent, emergency landing, and evacuation (which usually results in passenger injury).
AND logic (both first and second must agree) is best where unnecessary activation is a major concern. But this system has a significantly higher probability of not working when needed; if either one goes to sleep, then the good one can’t warn even if a fire is present.
Suppose you are designing an AIDS testing program to protect the public. Your first concern may be the number of real cases that are overlooked.
And what about false positive probabilities?
If the event you are looking for is very rare, and the probability of false positive is not made very low, you will find the majority of “positive” results are in fact false.
Example: Suppose the event probability is 10⁻⁵ and the probability of a false positive is 10⁻³. That means that out of 100,000 trials, you will get about 100 positive results, of which only one is correctly positive.
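The arithmetic behind that example, written out as a short sketch (assuming, as the example implicitly does, that a real event is always detected):

```python
trials = 100_000
p_event = 1e-5            # a real fire / real infection, per trial
p_false_positive = 1e-3   # the alarm or test fires when nothing is there

true_positives = trials * p_event                              # about 1
false_positives = trials * (1 - p_event) * p_false_positive    # about 100

share_real = true_positives / (true_positives + false_positives)
print(true_positives, false_positives, share_real)   # roughly 1, 100, 0.01
```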
If that happens, has the testing program (or fire-alarm system) actually benefited society? This is a complex subject and requires careful thinking. AND logic and OR logic each have their own strong attributes. But where one is best for being there when you need it, it’s also there more often when you don’t need it, and vice versa.
There is no free lunch. You must decide on the criteria and the design that will take into account your most critical concerns. It could go either way, depending upon the effects of real and false warnings as well as other factors. Actually, there are other alternatives besides these AND and OR systems.
More sophisticated schemes can be devised, which may meet system safety needs. Such schemes would include triple redundancy, where each has one “vote,” and the “majority” wins, or “2 out of 3,” or others. Most aircraft systems involve complex combinations of redundancy, with a number of monitors. While there is no free lunch, it is our job to find the minimum-cost lunch.
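To make the trade concrete, the sketch below (my addition, assuming identical and independent channels with illustrative numbers) compares OR logic, AND logic, and 2-out-of-3 voting on the two probabilities that matter: failing to warn when needed, and warning when not needed.

```python
from math import comb

def k_of_n(k, n, p):
    """Probability that at least k of n independent channels are in a given state,
    where each channel is in that state with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_asleep = 1e-3   # a channel has failed silently and cannot warn
p_false  = 1e-4   # a channel raises a false warning on a given flight

for name, k, n in [("OR (1 of 2)", 1, 2), ("AND (2 of 2)", 2, 2), ("2 of 3", 2, 3)]:
    p_missed = 1 - k_of_n(k, n, 1 - p_asleep)   # too few healthy channels to vote a warning
    p_false_alarm = k_of_n(k, n, p_false)       # enough channels falsely vote a warning
    print(f"{name:12s} missed: {p_missed:.1e}   false alarm: {p_false_alarm:.1e}")
```

With these numbers, OR logic misses least but false-alarms most, AND logic the reverse, and 2-out-of-3 voting recovers most of both, at the cost of a third channel.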
Monitors
Redundancy, along with correct architecture, is the best way to achieve highly safe systems. If one part fails, the other part takes over. But how does one know when the first has failed? If the other has taken over, everything may appear normal. But in fact, we have now lost our redundancy and disaster may lurk around the corner.
Monitoring is essential, then, to announce when one level of redundancy has been lost. Otherwise the system will just keep operating until the next one fails, which might result in catastrophe—exactly what you put redundancy into the system to prevent. Because the probability of failure of something depends on how long it is operating, every moment one operates without redundancy, the probability increases that the second failure will occur and result in catastrophic consequences. So we need a mechanism for finding latent losses of redundancy (monitor) so that they can be repaired, and for specifying minimum safe redundancy.
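A crude model, added here as a sketch, shows why the monitoring (check) interval matters. Assume two identical, independent channels and that a failed channel is found and repaired only at each check; the model and the numbers are illustrative only.

```python
from math import exp

lam = 1e-4        # failure rate of each redundant channel, per hour (illustrative)
mission = 10_000  # total operating hours considered

def p_dual_failure(check_interval_h):
    """Approximate probability that both channels fail within the same unchecked window."""
    intervals = round(mission / check_interval_h)
    p_single = 1 - exp(-lam * check_interval_h)    # one channel fails within a window
    return 1 - (1 - p_single**2) ** intervals

for tau in (1, 10, 100, 1000):
    print(f"check every {tau:5d} h -> dual-failure probability ~ {p_dual_failure(tau):.1e}")
```

The longer a first failure can sit latent, the greater the chance that the second failure arrives before the first is found, which is exactly the exposure the monitor is there to limit.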
Design for Safety Tools
- 1. Functional Hazard Analysis (FHA)
The FHA looks at what major failures of function can occur, the effects of those failures, the risk associated with them, and the safety criteria we must meet to make that risk acceptable.
- 2. Failure Mode and Effects Analysis (FMEA)
The FMEA looks at what happens when each component of the system fails in various ways.
- 3. Fault Tree Analysis (FTA)
The FTA looks at the effects of combinations of failures.
- 4. Zonal Analysis and Events Reviews
Zonal Analysis and Events Reviews look at physical placement of systems so that components that are supposed to be independent actually are (the same event won’t cause failure in both parts of a redundant system).
Design for safety is an iterative process using each of these tools. As architecture evolves, these tools become more refined.
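To make the fault-tree idea concrete, here is a toy evaluation I have added; the tree structure and the basic-event probabilities are invented for illustration, and real trees are built from the FHA and FMEA results.

```python
P_PRIMARY = 1e-4          # primary channel fails (per flight, assumed independent)
P_BACKUP = 1e-4           # backup channel fails
P_MONITOR_LATENT = 1e-3   # monitor has an undetected (latent) fault

def or_gate(*ps):
    """Probability that at least one input event occurs."""
    q = 1.0
    for p in ps:
        q *= (1 - p)
    return 1 - q

def and_gate(*ps):
    """Probability that all input events occur."""
    q = 1.0
    for p in ps:
        q *= p
    return q

# Top event: loss of function = primary fails AND (backup fails OR monitor latent).
p_top = and_gate(P_PRIMARY, or_gate(P_BACKUP, P_MONITOR_LATENT))
print(f"top-event probability per flight ~ {p_top:.1e}")   # about 1.1e-7
```

The computed top-event probability is then compared against the criterion the FHA assigned to that failure condition.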
Design for Safety Goals
Currently, our industry’s fatal accident rate is about one per million flights. The goal is no more than one fatal accident per 10 million flights. If we are to achieve that goal with aircraft having about 100 systems, each system can have no more than a one-in-a-billion probability of causing an accident. While we may not be able to eliminate accidents altogether, we do have a responsibility to work toward reducing the accident rate. Good system design—with strong consideration of probability, human factors, and history—can reduce exposure to incorrect operator action, and help us eliminate such occurrences as Bhopal, Chernobyl, and Sioux City.