This seminar session provides an overview of major aspects of reliability engineering, including general introduction of reliability engineering (definition of reliability, function of reliability engineering, a brief history of reliability, etc.), reliability basics (metrics used in reliability, commonly-used probability distributions in reliability, bathtub curve, reliability demonstration test planning, confidence intervals, Bayesian statistics application in reliability, strength-stress interference theory, etc.), accelerated life testing (ALT) (types of ALT, Arrhenius model, inverse power law model, Eyring model, temperature-humidity model, etc.), reliability growth (reliability-based growth models, MTBF-based growth model, etc.), systems reliability & availability (reliability block diagram, non-repairable or repairable systems, reliability modeling of series systems, parallel systems, standby systems, and complex systems, load sharing reliability, reliability allocation, system availability, Monte Carlo simulation, etc.), and degradation-based reliability (introduction of degradation-based reliability, difference between traditional reliability and degradation-based reliability, etc.).
2. ASQ Reliability Division
Chinese Webinar Series
One of the monthly webinars
on topics of interest to
reliability engineers.
To view recorded webinars (available to ASQ Reliability
Division members only), visit asq.org/reliability.
To sign up for the live webinars, which are free and open to anyone,
visit reliabilitycalendar.org and select English
Webinars to find links to register for upcoming events:
http://reliabilitycalendar.org/The_Reliability_Calendar/Webinars_-_Chinese/Webinars_-_Chinese.html
5. What Is Reliability?
From The Oxford Essential Dictionary of the U.S. Military, Oxford University
Press, Inc, 2002.
Reliability – The ability of an item to perform a required function under
stated conditions for a specified period of time.
From McGraw-Hill Dictionary of Scientific and Technical Terms, McGraw-Hill
Companies, Inc, 2003.
Reliability – The probability that a component part, equipment, or system
will satisfactorily perform its intended function under given circumstances,
such as environmental conditions, limitations as to operating time, and
frequency and thoroughness of maintenance for a specified period of time.
– Most commonly used in reliability engineering textbooks.
6. Function of Reliability Engineering
Ensure that designs meet product reliability requirements.
Verify that a product will function reliably over its mission
lifetime.
Identify and resolve design discrepancies.
Evaluate potential failure modes and their effects on mission.
Then, provide guidance on corrective actions.
Recommend design configurations for redundancy.
Establish a cost-effective test plan, based on the reliability goal, to
determine sample size and test duration.
Assess product failure probability at mission lifetime.
Predict systems reliability and availability.
7. Costs due to Unreliability
In April 1986, due to the failure of a safety control system, the
Chernobyl nuclear power plant in Ukraine released a huge amount of
radiation into the environment, causing the worst nuclear accident in history.
About 30 plant workers and firefighters died within the first few months, and
the long-term death toll from radiation exposure remains disputed.
In November 2001, due to separation of the tail fin from the plane body,
American Airlines Flight 587 crashed into a New York City neighborhood
and killed 265 people, including all passengers and crew members on
board and several people on the ground.
In August 2003, the Northeastern and Midwestern United States and
Ontario, Canada experienced a widespread power outage, due to a lack
of good reliability design in the power transmission grid, leaving an
estimated 45 million people in eight U.S. states and 10 million people in
Ontario without power.
8. A Brief History of Reliability Engineering*
In 1941, Robert Lusser, who led the German V-1 missile test program, first recognized the
need for Reliability Engineering as a separate discipline.
In 1950, the US Department of Defense (DoD) established the Ad Hoc Group on
Reliability. In 1951, the secretary of defense, General George C. Marshall, ordered all
DoD agencies to increase their emphasis on reliability of military electronic equipment.
In 1955, the Institute of Electrical and Electronics Engineers (IEEE) initiated the world's 1st
Reliability & Quality Control Society.
In 1960, the US Naval Post-Graduate School became the 1st institution to teach
reliability engineering courses in the US.
In 1962, the 1st Annual Reliability And Maintainability (RAM) Conference was held in
the US.
In 1963, the University of Arizona, with support from National Science Foundation,
became the 1st national research university to establish a Reliability Engineering
program in the U.S.
(* Source: Dimitri Kececioglu, Reliability Engineering Handbook, Vol. 1, PTR Prentice Hall, 1991.)
9. Difference between Reliability & Quality
Reliability deals with behavior of failure rate over a long period of operation,
while quality control deals with percent of defectives based on performance
specifications at a certain point of time.
Reliability deals with all periods of existence of a product, with prime
emphasis at the design stage, while quality control deals primarily with the
manufacturing stage.
Reliability and quality control use different statistical tools for evaluation.
[Figure: left, reliability (0% to 100%) versus time; right, defective % versus performance measurement, with LSL, Target, and USL marked.]
11. Metrics in Reliability Engineering
Reliability (R) or probability of success (Ps)
Failure probability (Pf = 1 - R), equal to the cumulative distribution function (cdf) of a lifetime distribution: $\mathrm{cdf}(t) = \int_0^t f(x)\,dx$ (here, f is the pdf)
Failure (or hazard) rate ($\lambda$): $\lambda(t) = f(t)/R(t)$
Mean time to failure (MTTF): $\mathrm{MTTF} = \int_0^\infty x\,f(x)\,dx = \int_0^\infty R(x)\,dx$
Mean time between failures (MTBF)
System availability (A)
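As a rough illustration of how these metrics relate to one another, the following Python sketch evaluates them for a Weibull lifetime. The parameter values (shape 1.5, scale 1000 hours) and the helper names are assumptions chosen for the example, not values from the seminar. It checks numerically that the two MTTF integrals agree with each other and with the known closed form for the Weibull mean.

```python
import math

# Assumed, illustrative Weibull parameters (not from the seminar).
BETA = 1.5      # shape parameter
ETA = 1000.0    # scale parameter, hours


def reliability(t):
    """R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / ETA) ** BETA))


def pdf(t):
    """f(t) = (beta/eta) * (t/eta)^(beta-1) * exp(-(t/eta)^beta)."""
    return (BETA / ETA) * (t / ETA) ** (BETA - 1) * reliability(t)


def hazard(t):
    """Failure (hazard) rate: f(t) / R(t)."""
    return pdf(t) / reliability(t)


def integrate(func, upper, steps=200_000):
    """Midpoint-rule integral of func from 0 to upper."""
    h = upper / steps
    return h * sum(func((i + 0.5) * h) for i in range(steps))


UPPER = 20 * ETA  # R(t) is negligible beyond this point
mttf_from_pdf = integrate(lambda x: x * pdf(x), UPPER)
mttf_from_reliability = integrate(reliability, UPPER)
mttf_closed_form = ETA * math.gamma(1 + 1 / BETA)

print(f"Pf(500 h)   = {1 - reliability(500.0):.4f}")
print(f"hazard(500) = {hazard(500.0):.6f} per hour")
print(f"MTTF        = {mttf_from_pdf:.1f}, {mttf_from_reliability:.1f}, {mttf_closed_form:.1f}")
```

The upper integration limit stands in for infinity; it is safe here only because the reliability has decayed to essentially zero by that point.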
12. Commonly Used Probability Distributions
Distribution | Variable | Application
Exponential | Continuous variable. Time-to-failure. | Commonly used for electronic parts/assemblies with constant failure rates.
Weibull | Continuous variable. Time-to-failure. | Versatile; applicable to a wide range of applications.
Lognormal | Continuous variable. Time-to-failure. | Mostly used for products subject to wear-out.
Chi-square (χ²) | Continuous variable. | Calculating confidence bounds of a constant failure rate estimate. Also used for two-sample comparison, goodness-of-fit tests, etc.
Binomial | Discrete variable with binary outcomes. | Estimating probability of success from repeated tests. Also used for sampling plans.
F | Continuous variable. | Calculating confidence bounds of a probability of success. Also used for two-sample comparison.
13. Bathtub Curve
The bathtub curve describes a particular form of a failure (hazard) rate
function which comprises three parts: early failure, random failure and wear out
failure.
Military specifications require that, for life-critical or system-critical
applications, the infant mortality section be eliminated through burn-in or otherwise removed, as this greatly
reduces the possibility of the system failing early in its life.
14. Exponential Distribution
Most commonly used for electronic parts or assemblies after
burn-in.
Failure rate is constant, so the distribution applies only to the random-failure period.
MIL-HDBK-217 provides failure rate data for electronic parts as
a function of electrical stresses and temperature.
The probability density function (pdf):
$f(t) = \lambda e^{-\lambda t}$, with $\mathrm{MTBF} = 1/\lambda$
[Figure: failure rate versus time.]
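One consequence of the constant failure rate is worth working out, as an added illustration: integrating the pdf gives the reliability function, and evaluating it at t = MTBF shows that only about 37% of units are expected to survive to the MTBF.

```latex
R(t) = 1 - \int_0^t \lambda e^{-\lambda x}\,dx = e^{-\lambda t},
\qquad
R(\mathrm{MTBF}) = e^{-\lambda \cdot (1/\lambda)} = e^{-1} \approx 0.368
```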
15. Weibull Distribution
Named after Swedish scientist Waloddi Weibull.
The most commonly used probability distribution for life data
analysis.
Depending on the shape parameter β, the failure rate can cover any region of the bathtub curve: decreasing (β < 1), constant (β = 1), or increasing (β > 1).
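The probability density function (pdf):
$f(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1} e^{-(t/\eta)^{\beta}}$
[Figure: failure rate versus time for Beta < 1.0, Beta = 1.0, and Beta > 1.0.]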
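To make the link between the Weibull shape parameter and the bathtub curve concrete, here is a minimal Python sketch that evaluates the Weibull failure rate for three shape values. The scale parameter of 1000 hours, the chosen time points, and the function name are assumptions made for illustration.

```python
import math


def weibull_hazard(t, beta, eta):
    """Weibull failure rate: (beta/eta) * (t/eta)^(beta-1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)


ETA = 1000.0  # assumed scale parameter, hours (illustrative)
times = [100.0, 500.0, 1000.0, 2000.0]

for beta, label in [(0.5, "early failure"), (1.0, "random failure"), (3.0, "wear-out")]:
    rates = [weibull_hazard(t, beta, ETA) for t in times]
    trend = ("decreasing" if rates[0] > rates[-1]
             else "increasing" if rates[0] < rates[-1]
             else "constant")
    print(f"beta={beta}: {label:14s} -> {trend}")
```

With a shape value of exactly 1 the expression reduces to the constant rate 1/eta, which is the exponential distribution.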
16. Lognormal Distribution
Initially introduced for mechanical fatigue data analysis. Also
used for long-term return rate on a stock investment.
Failure rate covers both early failure and wear out failure, but
not random failure.
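The probability density function (pdf):
$f(t) = \frac{1}{t\,\sigma_x\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln t - \mu_x}{\sigma_x}\right)^2\right]$, where $x = \ln t$
[Figure: failure rate versus time for Sigma < 1.0 and Sigma > 1.0.]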
17. Other Distributions
In addition to the three distributions described above,
there are other distributions occasionally used for life
data analysis:
Mixed Weibull – Competing failure modes
Normal
Extreme Value
Logistic
Gamma
Gumbel
18. How To Determine A Lifetime Distribution?
From industry standards or common practices
For example, the exponential distribution is usually used for
electronic parts due to wide acceptance in electronic
industries.
From experience or historic data
For example, a typical computer hard disc drive lifetime follows
a Weibull distribution with β < 1 based on long-term field data.
From reliability life testing
This is the common situation in reliability engineering. Test data can
be of many different types (e.g., complete, left censored, right
censored, interval, and grouped data).
19. Confidence Interval
A confidence interval (CI) is an interval estimate of a parameter,
used in statistics to indicate how reliable an estimate could be.
Since reliability models are often established on reliability life
test data, any estimated number needs a CI.
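As one hedged example of attaching a confidence bound to an estimate, the sketch below computes a one-sided lower confidence bound on reliability from pass/fail test results, using the exact binomial (Clopper-Pearson) relationship solved by bisection. The test outcome (22 units, 0 failures), the 90% confidence level, and the function names are hypothetical.

```python
import math


def prob_at_most_k_failures(n, k, r):
    """Binomial probability of k or fewer failures among n units with reliability r."""
    return sum(math.comb(n, i) * (1 - r) ** i * r ** (n - i) for i in range(k + 1))


def lower_reliability_bound(n, k, confidence):
    """One-sided lower confidence bound on reliability (Clopper-Pearson), by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if prob_at_most_k_failures(n, k, mid) < 1 - confidence:
            lo = mid   # reliability this low is rejected at the stated confidence
        else:
            hi = mid
    return (lo + hi) / 2


n_units, n_failures = 22, 0   # hypothetical test outcome
bound = lower_reliability_bound(n_units, n_failures, 0.90)
print(f"Point estimate: {1 - n_failures / n_units:.3f}")
print(f"90% lower confidence bound: {bound:.3f}")
```

The point estimate alone would suggest perfect reliability; the bound shows how much a small sample can actually demonstrate.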
21. What is Accelerated Life Testing (ALT)?
The concept of ALT was introduced in the 1960s. Dr. Wayne Nelson
played a key role in laying the foundation when he worked at GE
Corporate Research & Development.
The driving force behind accelerated life testing came from the
electronics industries, where product lifetimes are so long
that it would be difficult, if not impossible, to observe any failure in
an affordable period of life testing.
ALT aims to force the test units to fail more quickly than they
would under normal use conditions. In other words, ALT
accelerates the failure of the test units.
22. Qualitative ALT
Goal of qualitative ALT is to obtain failure information, such as
failure mode, failure effect, environmental stress limit, etc. Not
designed to yield life data.
A typical example of qualitative ALT is the so-called HALT
(highly accelerated life testing).
Sample size usually small.
Test units subjected to a single level or multiple levels of a
stress. Quite often, time-varying stresses (e.g., temperature
cycling from cold to hot to observe thermal fatigue).
Primarily used to reveal potential design flaws in product
reliability.
23. Quantitative ALT
Goal of quantitative ALT is to obtain life data.
Acceleration is achieved by overstress acceleration or usage
rate acceleration. In most cases, the term “Accelerated Life
Testing (ALT)” means quantitative ALT by overstress acceleration.
Sample size cannot be small; it is usually decided by a
sampling plan.
Each batch of test samples subjected to a single level of stress
or combined stresses.
Typical stresses include temperature, humidity, voltage, current,
pressure, vibration, etc.
24. ALT Data Analysis
The characteristic of a lifetime distribution (e.g., mean, median,
Weibull scale parameter, etc.) depends on the level of stress.
But researchers have found that the shape parameter (e.g., Weibull
shape parameter β, lognormal standard deviation σx, etc.) does not vary
from one stress level to another, unless the failure mode changes.
Typical life characteristics for the three most common lifetime
distributions (exponential, Weibull, and lognormal) are listed below.
Distribution | Distribution Parameter(s) | Life Characteristic
Exponential | λ | MTBF (= 1/λ)
Weibull | β, η | η
Lognormal | σx, μx | Median
25. Example of ALT Data Analysis
The following example concerns time-to-failure data for
insulation on electric motors. Tests were conducted at four elevated
temperature levels (110, 130, 150, and 170 °C) to speed up the
insulation deterioration. The use condition for the motors is 80 °C.
26. Arrhenius Model
The Arrhenius model is the best-known life-stress relationship in
ALT for thermal stress (i.e., temperature). It is derived from the
Arrhenius reaction rate equation proposed by Swedish scientist Svante August
Arrhenius in 1889.
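$AF\ (\text{Acceleration Factor}) = \frac{CL_u}{CL_a} = \exp\left[\frac{E_a}{k}\left(\frac{1}{T_u} - \frac{1}{T_a}\right)\right]$
where CLu and CLa are the life characteristics at the use and accelerated test
conditions, Ea is the activation energy, k is Boltzmann's constant, Tu is the
absolute temperature at use condition, and Ta is the absolute temperature at accelerated test
condition.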
(Note: The activation energy is the energy that a molecule must have in order to
participate in chemical reaction. So, in other words, the activation energy is a measure
of the effect that temperature has on the reaction.)
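The acceleration factor can be computed directly from the Arrhenius relationship. The sketch below does so for a use temperature of 80 °C and test temperatures of 110, 130, 150, and 170 °C. The activation energy of 0.8 eV is an assumed value for illustration only (in practice it is estimated from the test data), and the function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_af(ea_ev, use_temp_c, test_temp_c):
    """Acceleration factor exp[(Ea/k) * (1/Tu - 1/Ta)], temperatures converted to kelvin."""
    t_use = use_temp_c + 273.15
    t_test = test_temp_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1 / t_use - 1 / t_test))


EA = 0.8  # activation energy in eV: an assumed value, not given in the seminar

for test_temp in (110, 130, 150, 170):
    print(f"{test_temp} C vs 80 C use: AF = {arrhenius_af(EA, 80, test_temp):.1f}")
```

Temperatures must be converted to kelvin before use; plugging in Celsius values directly gives meaningless factors.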
27. Inverse Power Law Model
Developed from the Coffin-Manson equation for low-cycle thermal
fatigue lifetime analysis. It states that the cycles-to-failure are
inversely proportional to a power of the temperature range of the
cycling.
Also used for other non-thermal stresses (current, voltage, vibration,
etc).
$AF = \frac{CL_u}{CL_a} = \left(\frac{S_a}{S_u}\right)^{n}$
where n is the model exponent, to be determined, Su the stress at use
condition, and Sa the stress at accelerated test condition.
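A minimal sketch of the inverse power law in use, assuming a hypothetical thermal-cycling case (a 40 °C swing in use, a 100 °C swing in test) and an assumed exponent n = 2. The function names are illustrative.

```python
def inverse_power_af(stress_use, stress_test, n):
    """Acceleration factor (Sa/Su)^n for the inverse power law model."""
    return (stress_test / stress_use) ** n


def life_at_use(life_at_test, stress_use, stress_test, n):
    """Project the life characteristic from test stress back to use stress."""
    return life_at_test * inverse_power_af(stress_use, stress_test, n)


# Hypothetical thermal-cycling case: 40 C swing in use, 100 C swing in test,
# exponent n = 2 (assumed; in practice n is estimated from test data).
af = inverse_power_af(40.0, 100.0, 2.0)
print(f"AF = {af:.2f}")
print(f"1000 test cycles ~ {life_at_use(1000.0, 40.0, 100.0, 2.0):.0f} use cycles")
```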
28. Eyring Model
The Eyring model was originally developed for thermal stresses from
quantum mechanics. In general, for thermal stresses, both the Eyring
model and Arrhenius model yield very close results. But the Eyring
model could also be used for humidity stress.
$AF = \frac{CL_u}{CL_a} = \frac{S_a}{S_u} \exp\left[b\left(\frac{1}{S_u} - \frac{1}{S_a}\right)\right]$
where b is the model parameter, to be determined, Su the stress at use
condition, and Sa the stress at accelerated test condition.
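The sketch below evaluates the Eyring acceleration factor next to the purely exponential (Arrhenius-form) term, using the same parameter b in both so that the effect of the Sa/Su prefactor is visible. The temperatures (in kelvin), the value b = 9000 K, and the function names are assumptions for illustration. When each model is fitted to the same data separately, the fitted parameters differ and the predictions tend to be closer than this same-b comparison suggests.

```python
import math


def eyring_af(stress_use, stress_test, b):
    """Acceleration factor (Sa/Su) * exp[b * (1/Su - 1/Sa)] for the Eyring model."""
    return (stress_test / stress_use) * math.exp(b * (1 / stress_use - 1 / stress_test))


def arrhenius_like_af(stress_use, stress_test, b):
    """Same exponential term without the (Sa/Su) prefactor, for comparison."""
    return math.exp(b * (1 / stress_use - 1 / stress_test))


# Assumed values: temperatures in kelvin, b = 9000 K (illustrative only).
t_use, t_test, b = 353.15, 403.15, 9000.0
print(f"Eyring AF         = {eyring_af(t_use, t_test, b):.1f}")
print(f"Arrhenius-form AF = {arrhenius_like_af(t_use, t_test, b):.1f}")
```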
29. Temperature-Humidity Model
The temperature-humidity (T-H) model is a variation of the Eyring
model when both temperature and humidity stresses are involved.
$AF = \frac{CL_u}{CL_a} = \exp\left[a\left(\frac{1}{T_u} - \frac{1}{T_a}\right) + b\left(\frac{1}{RH_u} - \frac{1}{RH_a}\right)\right]$
where both a and b are the model parameters, to be determined, Tu and
RHu the temperature and relative humidity at use condition, and Ta and
RHa the temperature and relative humidity at accelerated test condition.
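A short sketch of the temperature-humidity acceleration factor follows. The parameter values a = 6000 and b = 0.5, the use condition (40 °C, 40% RH), the test condition (85 °C, 85% RH), and the function name are all assumed for illustration. Relative humidity is expressed here as a fraction; the fitted value of b depends on that choice of units.

```python
import math


def temperature_humidity_af(a, b, t_use_k, rh_use, t_test_k, rh_test):
    """AF = exp[a*(1/Tu - 1/Ta) + b*(1/RHu - 1/RHa)]."""
    thermal = a * (1 / t_use_k - 1 / t_test_k)
    humidity = b * (1 / rh_use - 1 / rh_test)
    return math.exp(thermal + humidity)


# Assumed parameters and conditions (illustrative only): a in kelvin,
# b in the same units as relative humidity (a fraction here).
A_PARAM, B_PARAM = 6000.0, 0.5
af = temperature_humidity_af(A_PARAM, B_PARAM,
                             t_use_k=313.15, rh_use=0.40,
                             t_test_k=358.15, rh_test=0.85)
print(f"AF at 85 C / 85% RH versus 40 C / 40% RH: {af:.1f}")
```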
31. What is Reliability Growth?
Reliability growth is a tool either to predict the reliability of a system or
equipment under development at some future development time
from information available now, or to monitor the reliability of the
system or equipment and establish the trend of increasing reliability
with research and engineering efforts, to make sure it achieves its
reliability goal.
Reliability growth studies are necessary to ensure that, from
information available at the beginning of a project, the reliability
goal is achievable by delivery time. In general, a growth model is
projected to the project completion date.
33. Reliability–Based Growth Models
Gompertz Model
$R(t) = a\,b^{c^{t}}$
where t is the development time, and 0 < a, b, c < 1.
Logistic Model
$R(t) = \frac{1}{1 + a\,e^{-b t}}$
where t is the development time, and a, b > 0.
Lloyd-Lipow Model
$R_k = R_\infty - \frac{\alpha}{k}$
where Rk is the reliability at the kth stage of development/testing, R∞
the ultimate reliability, and α the growth-rate parameter.
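The three reliability-based growth models can be evaluated side by side. In the sketch below, all parameter values are assumed for illustration, and the symbol alpha for the Lloyd-Lipow growth parameter follows the usual textbook form of that model rather than anything stated on the slide.

```python
import math


def gompertz(t, a, b, c):
    """R(t) = a * b**(c**t), with 0 < a, b, c < 1."""
    return a * b ** (c ** t)


def logistic(t, a, b):
    """R(t) = 1 / (1 + a * exp(-b * t)), with a, b > 0."""
    return 1.0 / (1.0 + a * math.exp(-b * t))


def lloyd_lipow(k, r_inf, alpha):
    """R_k = R_inf - alpha / k for development stage k = 1, 2, ..."""
    return r_inf - alpha / k


# Parameter values below are assumed for illustration only.
for t in (0, 5, 10, 20):
    print(f"t={t:2d}  Gompertz={gompertz(t, 0.95, 0.6, 0.8):.3f}  "
          f"Logistic={logistic(t, 1.0, 0.3):.3f}")
for k in (1, 2, 5, 10):
    print(f"stage {k:2d}  Lloyd-Lipow={lloyd_lipow(k, 0.95, 0.3):.3f}")
```

In the Gompertz model the reliability starts at a·b and approaches a as development time grows, so a plays the role of the upper limit on achievable reliability.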
34. MTBF–Based Growth Models
Duane Model
$\mathrm{MTBF}(t) = a\,t^{b}$
where t is the development time, a the MTBF at the beginning of
development (defined as t0 = 1), and 0 ≤ b ≤ 1.
AMSAA (U.S. Army Materiel Systems Analysis Activity) Model
$\mathrm{MTBF}(t) = \frac{1}{\lambda \beta\, t^{\beta - 1}}$
where t is the development time, and λ, β > 0.
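A brief sketch of the two MTBF-based models, with assumed parameter values. It also illustrates that the two expressions share the same power-law form: the AMSAA expression equals the Duane expression with a = 1/(λβ) and b = 1 - β.

```python
def duane_mtbf(t, a, b):
    """Duane model: MTBF(t) = a * t**b, with 0 <= b <= 1."""
    return a * t ** b


def amsaa_mtbf(t, lam, beta):
    """AMSAA model: MTBF(t) = 1 / (lam * beta * t**(beta - 1))."""
    return 1.0 / (lam * beta * t ** (beta - 1))


# Assumed parameters (illustrative only). beta < 1 means the MTBF is growing.
for t in (1.0, 100.0, 1000.0, 5000.0):
    print(f"t={t:7.0f} h  Duane={duane_mtbf(t, 20.0, 0.4):8.1f}  "
          f"AMSAA={amsaa_mtbf(t, 0.05, 0.6):8.1f}")
```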
36. Objective of System Reliability & Availability
To evaluate system reliability; i.e., probability that a system is
operating properly without a failure.
To evaluate system availability; i.e., probability that a system is
operating properly when it is requested for use.
To provide recommendation for any design change for
redundancy to achieve a specified system reliability or availability
goal.
37. Reliability Block Diagram (RBD)
A graphical representation of subsystems or components of a
system and reliability-wise connection among them.
An RBD should be created prior to doing system reliability
modeling.
An RBD might be different from its functional block diagram.
[Figure: a simplified RBD of a computer system, with blocks labeled Power Supply, Microprocessor, SDRM, Hard Drive, Peripheral Electronics, and two Fan blocks.]
38. Non-Repairable & Repairable Systems
A non-repairable system does not get repaired when it fails.
For a non-repairable system, system reliability is a sufficient
measure of the system performance.
A repairable system gets repaired when it fails.
In a repairable system, two types of distributions are
considered: life distribution and repair time distribution.
For a repairable system, system reliability itself is not a
sufficient measure of the system performance since it does not
account for repair. System availability also needs to be evaluated
and, in most cases, is even more important than system reliability.
39. Methods of RBD Analysis
RBD analysis can be performed with both analytical and simulation techniques.
Analytical approach is to develop a mathematical model to describe the reliability of a
system, based on reliability data of subsystems or components.
Advantage: A math model is developed. Using it, more analysis can be
performed, such as conditional reliability, warranty, etc.
Disadvantage: In general, it is difficult to get the model for a complex system or a
repairable system.
Simulation approach is based on random number generation, to get the time-to-failure
of each subsystem or component. The failure time is then analyzed to determine the
behavior of the system.
Advantage: It can be used for a highly complex system where no analytical
solution is expected.
Disadvantage: (1) It can be time-consuming.
(2) Result depends on the number of simulation runs.
(3) Lack of repeatability in result due to random nature of data
generation.
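The simulation approach can be sketched in a few lines of Python. The example system (unit A in series with two parallel units B and C), the exponential lifetimes, and the failure rates are all assumptions chosen so that an analytical answer exists for comparison. The output shows how the estimate depends on the number of runs, and a fixed random seed is used as one way to make a given run repeatable.

```python
import math
import random


def simulate_system_reliability(mission_time, runs, seed=12345):
    """Estimate R(mission_time) for unit A in series with parallel units B and C."""
    rng = random.Random(seed)          # fixed seed makes the run repeatable
    rate_a, rate_b, rate_c = 1e-4, 5e-4, 5e-4   # assumed failure rates per hour
    survived = 0
    for _ in range(runs):
        ttf_a = rng.expovariate(rate_a)
        ttf_b = rng.expovariate(rate_b)
        ttf_c = rng.expovariate(rate_c)
        # The system fails when A fails, or when both B and C have failed.
        system_ttf = min(ttf_a, max(ttf_b, ttf_c))
        if system_ttf > mission_time:
            survived += 1
    return survived / runs


def analytical_reliability(mission_time):
    """Closed-form result for the same system, for comparison."""
    r_a = math.exp(-1e-4 * mission_time)
    r_b = r_c = math.exp(-5e-4 * mission_time)
    return r_a * (1 - (1 - r_b) * (1 - r_c))


for runs in (100, 10_000, 100_000):
    estimate = simulate_system_reliability(1000.0, runs)
    print(f"{runs:7d} runs: R = {estimate:.4f} "
          f"(analytical {analytical_reliability(1000.0):.4f})")
```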
40. Reliability of Series Systems
Success of a series system requires every single subsystem or
unit to succeed.
[Figure: reliability block diagram of a series system, with blocks S1, S2, S3, ..., Sn connected in a chain.]
The system reliability equals the product of the reliabilities of
the individual subsystems or units:
$R_{sys}(t) = \prod_{i=1}^{n} R_i(t)$
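A minimal sketch of the series calculation, with hypothetical unit reliabilities at some fixed mission time. It also shows that a series system can never be more reliable than its weakest unit.

```python
import math


def series_reliability(unit_reliabilities):
    """System reliability of a series system: product of unit reliabilities."""
    return math.prod(unit_reliabilities)


# Hypothetical unit reliabilities at some mission time.
units = [0.99, 0.98, 0.97, 0.995]
print(f"Series system reliability: {series_reliability(units):.4f}")
print(f"Weakest unit:              {min(units):.4f}")
```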
41. Reliability of Parallel Systems – Active Redundancy
Failure of a parallel system means all subsystems or units fail.
[Figure: reliability block diagram of a parallel system, with blocks S1, S2, S3, ..., Sn side by side.]
The system reliability is expressed as:
$R_{sys}(t) = 1 - \prod_{i=1}^{n} \left[1 - R_i(t)\right]$
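A minimal sketch of the active-redundancy calculation, assuming identical units with a hypothetical reliability of 0.90 each. It shows the diminishing return from each additional parallel unit.

```python
import math


def parallel_reliability(unit_reliabilities):
    """Active-redundant system: 1 minus the product of unit unreliabilities."""
    return 1.0 - math.prod(1.0 - r for r in unit_reliabilities)


# Hypothetical case: identical units with R = 0.90 each.
for n in (1, 2, 3, 4):
    print(f"{n} unit(s) in parallel: R = {parallel_reliability([0.90] * n):.4f}")
```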
42. Difference between Function & Reliability
A functional parallel system does not have to be reliability-wise
parallel.
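[Figure: two circuit diagrams of capacitors connected in parallel across the + and - terminals, one for each failure mode.]
For the failure mode of open circuit, the functional parallel capacitors are reliability-wise parallel.
For the failure mode of short circuit, the functional parallel capacitors are reliability-wise series.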
43. System Reliability in Standby – Inactive Redundancy
Standby subsystem remains inactive until the active one fails.
[Figure: reliability block diagram of a 2-for-1 standby system, with active unit SA and standby unit SS.]
For the above 2-for-1 standby system, the system reliability is
expressed as:
$R(t) = R_A(t) + \int_0^t f_A(x)\,R_S(x)\,\frac{R_A(t_e + t - x)}{R_A(t_e)}\,dx$
where te is an equivalent time such that RS(x) = RA(te).
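The standby expression can be evaluated numerically. The sketch below does so under simplifying assumptions that are not in the original: both units are exponential with the same active failure rate, switching is perfect, and the spare has its own (lower) failure rate while waiting. For exponential units the ratio involving te reduces to R_A(t - x), so te drops out. All rates and the mission time are illustrative.

```python
import math


def standby_reliability(t, rate_active, rate_standby, steps=20_000):
    """Numerically evaluate the 2-for-1 standby integral for exponential units.

    rate_active:  failure rate while operating (assumed equal for both units)
    rate_standby: failure rate of the spare while it waits in standby
    """
    def r_active(x):
        return math.exp(-rate_active * x)

    def f_active(x):
        return rate_active * math.exp(-rate_active * x)

    def r_standby(x):
        return math.exp(-rate_standby * x)

    def integrand(x):
        # For exponential units R_A(te + t - x) / R_A(te) = R_A(t - x),
        # because the exponential distribution is memoryless.
        return f_active(x) * r_standby(x) * r_active(t - x)

    h = t / steps
    integral = h * sum(integrand((i + 0.5) * h) for i in range(steps))
    return r_active(t) + integral


LAM = 1e-3   # assumed active failure rate, per hour
T = 1000.0   # assumed mission time, hours

cold = standby_reliability(T, LAM, 0.0)      # spare cannot fail while waiting
warm = standby_reliability(T, LAM, 2e-4)     # spare can fail while waiting
active_parallel = 1 - (1 - math.exp(-LAM * T)) ** 2

print(f"Cold standby:    {cold:.4f}")
print(f"Warm standby:    {warm:.4f}")
print(f"Active parallel: {active_parallel:.4f}")
```

Under these assumptions the standby arrangements come out ahead of two units in active parallel, because the spare accumulates little or no wear while it waits.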
44. Example of Complex Systems
[Figure: RBD of a complex system made up of Unit A through Unit G.]
In this RBD, assume all units are in active redundancy.
It would be difficult to recognize which units are in series and which
ones are in parallel, because Unit C has two paths leading
away from it, while Units B and D have only one.
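One way to handle such a system without reducing it to series and parallel groups is to enumerate every combination of working and failed units and add up the probabilities of the combinations that leave at least one working path. The sketch below does this for an assumed topology, since the original figure is not available: A feeds B, C, and D; B leads to E; C leads to both E and F; D leads to F; and E and F feed G. The unit reliabilities of 0.9 are hypothetical.

```python
from itertools import product

# Assumed topology (the original figure is not available): A feeds B, C, D;
# B leads to E, C leads to E and F, D leads to F; E and F feed G.
PATHS = [("A", "B", "E", "G"), ("A", "C", "E", "G"),
         ("A", "C", "F", "G"), ("A", "D", "F", "G")]
UNITS = "ABCDEFG"


def system_reliability(unit_reliability):
    """Sum the probability of every unit-state combination with a working path."""
    total = 0.0
    for states in product((True, False), repeat=len(UNITS)):
        up = dict(zip(UNITS, states))
        if any(all(up[u] for u in path) for path in PATHS):
            prob = 1.0
            for unit, works in up.items():
                r = unit_reliability[unit]
                prob *= r if works else 1.0 - r
            total += prob
    return total


# Hypothetical unit reliabilities, all equal to 0.9.
reliabilities = {u: 0.9 for u in UNITS}
print(f"System reliability: {system_reliability(reliabilities):.4f}")
```

Full enumeration grows as 2 to the power of the number of units, so it is practical only for small diagrams; decomposition or path-set methods scale better.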
45. System Availability
Availability is a probability that a system is operating properly
when it is requested for use.
It is a performance characteristic for repairable systems that
accounts for both reliability and maintainability properties of a
subsystem or unit.
For example, a lamp with 99.90% availability means that, on
average, one time out of every thousand that
someone needs to use the lamp, it will not be
operational, either because the lamp is burned out or because it is in
the process of being replaced.
46. Repairable Systems vs. Renewal Process
For a repairable system, the operation time is not continuous. The life cycle
contains a sequence of up and down states. Once the system fails, it is repaired
and restored to its original operating state. The repeated process of failure and
repair is classified as an alternating renewal process, and the associated random
variables are the times-to-failure and the times-to-repair.
47. Definition of Availability
Instantaneous (or Point) Availability – A(t)
$A(t) = R(t) + \int_0^t R(t - x)\,m(x)\,dx$
where m(x) is the renewal density function of the system.
Average Uptime (or Mean) Availability – Ā(t)
$\bar{A}(t) = \frac{1}{t}\int_0^t A(x)\,dx$
Steady State Availability – A(∞)
$A(\infty) = \lim_{t \to \infty} A(t)$
Inherent Availability (Steady State Availability for Exponential) – AI
$A_I = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}$
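For the special case of exponential failure and repair times, the point availability has a well-known closed form, which makes it easy to see the instantaneous value settle toward the inherent availability. The sketch below assumes MTBF = 1000 hours and MTTR = 10 hours (illustrative values) and a system that starts in the up state.

```python
import math


def inherent_availability(mtbf, mttr):
    """A_I = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)


def point_availability(t, mtbf, mttr):
    """A(t) for exponential failure and repair times (a standard closed form)."""
    lam, mu = 1.0 / mtbf, 1.0 / mttr
    return mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)


MTBF, MTTR = 1000.0, 10.0   # assumed values, hours

print(f"Inherent availability: {inherent_availability(MTBF, MTTR):.4f}")
for t in (0.0, 10.0, 50.0, 200.0):
    print(f"A({t:5.0f} h) = {point_availability(t, MTBF, MTTR):.4f}")
```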
49. What & Why Degradation-Based Reliability?
Degradation-Based Reliability is a new technique to evaluate product
reliability based on its performance degradation measurements, rather
than its time-to-failure data.
Many failure mechanisms are directly linked to degradation of some critical
performance characteristics, such as brake failure due to pad wear, solder joint
failure due to fatigue crack propagation, etc.
Reliability of today’s products has been greatly improved, such that fewer
failures could be observed from reliability testing.
Reliability evaluation based on degradation provides a bridge between
reliability and physics-of-failure.
Degradation testing could be much shorter because it does not need to
witness any “hard failure”.
It makes it possible to predict products’ residual life from critical
performance measurements.
50. Graphic Showing Degradation-Based Reliability
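The following plot illustrates three units being tested for performance
degradation. The failure criterion is determined based on the
performance design specification.
[Figure: performance degradation y(t) versus time t for three units, y1(t), y2(t), and y3(t); the times at which the curves reach the failure criterion are marked TTF1, TTF2, and TTF3.]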
51. Approaches for Degradation-Based Reliability
Determine failure criterion of a performance characteristic,
which defines the maximum allowable degradation level and
would constitute a failure once being reached.
Measure performance degradation from multiple test units over
time, either continuously or at predetermined intervals.
Analyze the performance degradation data to establish
statistical models for the performance degradation.
Evaluate the product reliability based on its failure criterion.
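The steps above can be sketched for the simplest case, assuming each unit degrades along a straight line. The code fits a least-squares line to each unit's measurements and extrapolates it to the failure criterion to obtain a projected time to failure. The wear data, the criterion of 1.0 mm, and the function names are hypothetical, and a linear path is only one of several possible degradation models.

```python
def fit_line(times, values):
    """Ordinary least-squares fit y = intercept + slope * t."""
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, values))
    slope = sxy / sxx
    return mean_y - slope * mean_t, slope


def pseudo_failure_time(times, values, criterion):
    """Time at which the fitted degradation line reaches the failure criterion."""
    intercept, slope = fit_line(times, values)
    return (criterion - intercept) / slope


# Hypothetical wear measurements (mm) for three units at common inspection times (h).
HOURS = [0, 100, 200, 300, 400]
WEAR = {
    "unit 1": [0.00, 0.11, 0.19, 0.31, 0.40],
    "unit 2": [0.00, 0.08, 0.17, 0.24, 0.33],
    "unit 3": [0.00, 0.14, 0.27, 0.41, 0.55],
}
CRITERION = 1.0   # assumed maximum allowable wear, mm

for unit, wear in WEAR.items():
    print(f"{unit}: projected time to failure = "
          f"{pseudo_failure_time(HOURS, wear, CRITERION):.0f} h")
```

The projected times can then be treated like ordinary time-to-failure data and fitted with a lifetime distribution, even though no unit actually failed during the test.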
52. Difference in Reliability Modeling
In Traditional Failure-Based Reliability Modeling,
(1) The goal is to establish a distribution function for the variable of time-to-
failure.
(2) Distribution parameters are usually time independent.
(3) Reliability evaluation is performed directly based on the established
time-to-failure distribution function. That is, R(t) = Pr{T > t }.
In Degradation-Based Reliability Modeling,
(1) The goal is to establish a distribution function for the variable of
performance characteristic.
(2) Distribution parameters are usually time dependent.
(3) Reliability evaluation is performed indirectly based on determined failure
criterion and the established performance degradation distribution function.
That is, R(t) = Pr{Y(t) < Ycr}.
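The degradation-based expression R(t) = Pr{Y(t) < Ycr} can be illustrated with a deliberately simple model, assumed here for the example: each unit wears linearly, Y(t) = rate × t, and the wear rate varies from unit to unit following a normal distribution. The performance characteristic Y(t) is then normal with a mean and standard deviation that both grow with time, which is the time dependence referred to in item (2). All numerical values are illustrative.

```python
from statistics import NormalDist


def degradation_reliability(t, criterion, mean_rate, sd_rate):
    """R(t) = Pr{Y(t) < Ycr} when Y(t) = rate * t and rate ~ Normal(mean_rate, sd_rate)."""
    if t == 0:
        return 1.0
    # Y(t) is normal with time-dependent mean and standard deviation.
    y_dist = NormalDist(mu=mean_rate * t, sigma=sd_rate * t)
    return y_dist.cdf(criterion)


# Assumed values: wear rate in mm per hour, failure criterion in mm.
MEAN_RATE, SD_RATE, Y_CR = 1.0e-3, 2.0e-4, 1.0

for hours in (0, 500, 800, 1000, 1200, 1500):
    r = degradation_reliability(hours, Y_CR, MEAN_RATE, SD_RATE)
    print(f"R({hours:4d} h) = {r:.4f}")
```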