Define "Reliable" – Let's talk about failure – Serial reliability



Define "Reliable" – Let's talk about failure – Serial reliability

4 24


the-math-of-reliability


On Github avishai-ish-shalom / the-math-of-reliability

t-test: introduced in 1908 by William Gosset, working for Guinness. Published under the pen name "Student" because Claude Guinness treated his use of math in the brewery as a trade secret

The Math of Reliability

Avishai Ish-Shalom (@nukemberg)

Repo: github.com/avishai-ish-shalom/the-math-of-reliability
Companion IPython notebook: reliability.ipynb

  • Reliability is intimately linked to culture, but that's a different talk
  • You can't "bolt on" reliability
  • Purpose of this talk: get people to think about reliability analytically
  • Who's using math daily?

Math!?

  • People are scared of math, but they shouldn't be
  • It's not about the formulas or numbers; it's about models, formalizing, and proving
  • Story of the professor - "it's trivial that..."

Example: Nagios-like alerts

Nagios service with max_check_attempts=4, check_interval=15sec

Service experiencing 40% error rate

Chance of hard CRITICAL: 2.6%

Chance of NOT GETTING ANY ALERT:

0.5 hour     45.9%
1 hour       21.1%
1.5 hours     9.9%
  • Nagios was not designed for statistical failures (false negatives)
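
A quick way to check these numbers (a sketch matching the slide's setup, assuming each 4-check window is an independent trial):

```python
# Nagios example: max_check_attempts=4, check_interval=15s, 40% per-check error rate.
# A hard CRITICAL needs 4 consecutive failed checks; here each 1-minute window of
# 4 checks is treated as an independent trial (an approximation).
error_rate = 0.4
max_check_attempts = 4
check_interval = 15  # seconds

p_hard_critical = error_rate ** max_check_attempts  # ~2.6%
print(f"Chance of hard CRITICAL per window: {p_hard_critical:.1%}")

for minutes in (30, 60, 90):
    windows = minutes * 60 // (max_check_attempts * check_interval)
    p_no_alert = (1 - p_hard_critical) ** windows
    print(f"No alert after {minutes} min: {p_no_alert:.1%}")  # roughly 46%, 21%, 10%
```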

Define "Reliable"

  • "4 nines"
  • MTBF
  • Failures per Year
  • QoS
  • SLA
  • Lots of terms, all insufficient
  • "uptime" means "no failure"
  • Need to define what failure is

Define "Failure"

System operating outside specified parameters

In reality: users are complaining!

"Failure" is subjective!

We have to understand the business to define failure

Possible states

  • Working OK
  • Failure
  • Unknown
  • Fuzzy
  • Failure = operating outside parameters
  • Failure isn't always obvious
  • Which clock is "correct"? Which value?
  • We don't always know what "correct" is
  • We don't always know what the system state is, e.g. our telemetry can be wrong

The absence of evidence is not the evidence of absence

The absence of alerts is not evidence of proper operation

Let's talk about failure

Reliability measures

  • MTBF = mean time between failures (years per failure)
  • λ = failures per year
  • F = failure rate or probability of failure in one year
  • R = reliability rate (probability of working in one year)

$$\lambda = T / MTBF$$

$$F = \lambda / T = 1 / MTBF$$

$$R = 1 - F$$

Typical HDD MTBF: 0.3-1M hours (about 35-120 years); MTBF is computed in a lab and extrapolated over time
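
A minimal sketch of these relations, using 1M hours (the high end of the slide's HDD range) as the assumed MTBF:

```python
# MTBF, lambda, F and R for an assumed HDD MTBF of 1M hours.
HOURS_PER_YEAR = 24 * 365
mtbf_years = 1_000_000 / HOURS_PER_YEAR  # ~114 years
lam = 1 / mtbf_years                     # failures per year
F = lam                                  # P(failure in one year); good approximation while lambda is small
R = 1 - F
print(f"MTBF ~{mtbf_years:.0f} years, lambda={lam:.4f}/year, F={F:.2%}, R={R:.2%}")
# The exact one-year failure probability is 1 - exp(-lam), which is ~F for small lambda.
```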

Statistical independence

  • Dominant mode in hardware
  • Also applies to some software failures

The Hot Hand fallacy

The Gambler's fallacy

Past performance does not predict future performance*

  • This is what statistical independence means: future events are independent of past events
  • In general, we can only make statistical predictions over a large number of similar systems

Serial reliability

$$R_{total} = \prod_{i=1}^{n} R_{i}$$

Serial reliability

R1       R2      R3      R_system   Improvement (MTBF)
0.995    0.99    0.95    0.936      -
0.9995   0.99    0.95    0.94       x 1.07
0.995    0.999   0.95    0.944      x 1.15
0.995    0.99    0.995   0.98       x 3.21

$$R_{total} \lt \min(R_{i})$$

Best ROI - improve the worst component

Improvement is expensive

Total reliability is always lower than the worst component, so there's no point using disproportionately reliable components
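
The table above falls out directly from the product formula; a small sketch reproducing it:

```python
# Serial reliability: the system works only if every component works,
# so R_total is the product of the component reliabilities.
from math import prod

rows = [
    (0.995,  0.99,  0.95),
    (0.9995, 0.99,  0.95),
    (0.995,  0.999, 0.95),
    (0.995,  0.99,  0.995),
]
baseline_F = 1 - prod(rows[0])
for r1, r2, r3 in rows:
    r_total = prod((r1, r2, r3))
    mtbf_gain = baseline_F / (1 - r_total)  # MTBF improvement vs. the first row
    print(f"{r1} * {r2} * {r3} = {r_total:.3f}  (MTBF x{mtbf_gain:.2f})")
```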

Cheaper way: clusters!

Parallel reliability (redundancy)

Reliability of redundant system, up to $k$ failures

$$R_{total}(n, k) = \sum_{i=0}^{k} {n \choose i} F^{i} R^{n-i}$$

Is this an argument for using enterprise-grade hardware?

Redundant system, R=0.95

n - cluster size; k - failures tolerated

n     k    Overhead   R_total
10    1    10%        0.914
10    2    20%        0.989
100   5    5%         0.616
100   9    9%         0.972
100   11   11%        0.996
  • Not enough redundancy will REDUCE your reliability
  • The N+1 rule is only true for small clusters
  • Large clusters are more cost-effective

Use many small/cheap identical components
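
The cluster numbers above come from the binomial formula; a sketch reproducing them (assuming independent failures):

```python
# Reliability of a redundant cluster of n identical components (R=0.95 each)
# that keeps working as long as at most k of them fail.
from math import comb

def cluster_reliability(n, k, r=0.95):
    f = 1 - r
    return sum(comb(n, i) * f**i * r**(n - i) for i in range(k + 1))

for n, k in [(10, 1), (10, 2), (100, 5), (100, 9), (100, 11)]:
    print(f"n={n:3d} k={k:2d} overhead={k/n:4.0%}  R_total={cluster_reliability(n, k):.3f}")
```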

Statistically dependent / Correlated failures

  • Shared workload
  • Shared code
  • Shared infrastructure
  • Dominant failure mode in software
  • Still gain added reliability from redundancy, but not as much
  • How do you deal with it?
  • Segmentation, failure domains
  • Intentional variation

Backup and operational sub-systems should avoid coupling with primaries

  • Seems obvious but people get it wrong
  • Especially when it comes to monitoring and ops tools
  • 1 out of 1000 drivers is drunk
  • Breathalyzer detects all drunks but has 5% false positives
  • Drivers stopped at random

A driver was stopped and the breathalyzer shows he's drunk. What's the probability he's really drunk?

If you answered 0.95, you have fallen for...

The Base rate fallacy

Correct answer: ~ 0.02

Explanation

In a sample of 1000 drivers, 1 would be drunk and 49.95 (999 × 0.05) would falsely test as drunk

Base rate of being detected as drunk (P(D)=50.95/1000) >> rate of drunk drivers (P(drunk)=1/1000)

Bayes theorem: $P(drunk|D) = P(D|drunk) P(drunk)/P(D)$

$P(D|drunk) = 1, P(drunk)=1/1000, P(D) = 50.95/1000$
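
The same calculation, spelled out:

```python
# Base rate fallacy: Bayes' theorem with the slide's numbers.
p_drunk = 1 / 1000          # prior: 1 in 1000 drivers is drunk
p_pos_given_drunk = 1.0     # breathalyzer detects every drunk driver
p_pos_given_sober = 0.05    # 5% false positive rate

p_pos = p_pos_given_drunk * p_drunk + p_pos_given_sober * (1 - p_drunk)  # ~50.95/1000
p_drunk_given_pos = p_pos_given_drunk * p_drunk / p_pos
print(f"P(drunk | positive test) = {p_drunk_given_pos:.3f}")  # ~0.02
```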

Active/Standby failover

  • Failed master always detected
  • 2% probability of false positive (working master detected as failed)
  • ~ 95% of failovers are erroneous
  • Erroneous failovers can cause severe issues

Disable auto-failover, greatly reduce false positives, or use active/active

  • Database failover dilemma: GitHub 2012 outage
  • You may be tempted to say quorum decision can solve this, but..
  • Either reduce false positives drastically or reduce failover issues
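
The ~95% figure follows from the same Bayes argument. The slide gives the 2% false-positive rate but not the base rate of real master failures, so the per-check failure probability below is an assumed illustrative value:

```python
# Failover as a detection problem (sketch; the base rate is an assumption, not from the slide).
p_master_down = 0.001        # assumed probability the master is really down in a given check window
p_detect_given_down = 1.0    # a failed master is always detected
p_detect_given_up = 0.02     # 2% false positives

p_detect = p_detect_given_down * p_master_down + p_detect_given_up * (1 - p_master_down)
p_down_given_detect = p_detect_given_down * p_master_down / p_detect
print(f"P(master really down | failover triggered) = {p_down_given_detect:.2f}")  # ~0.05
# i.e. roughly 95% of automatic failovers would be erroneous under these assumptions.
```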

Multiple dependencies

Circuit breakers!!

microservices FTW
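
For illustration, a minimal circuit-breaker sketch (not any particular library's API): after a run of consecutive errors the breaker opens and calls fail fast, then a trial call is allowed through once a reset timeout elapses.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after max_failures consecutive errors, fail fast while open."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```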

Queuing delay

$delay \propto \frac {\rho} {1 - \rho}$

ρ - system utilization

Throttle your system!

  • If you go over ~80% utilization, latency starts rising fast
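
A quick look at how ρ/(1−ρ) behaves:

```python
# Relative queuing delay as a function of utilization (M/M/1-style scaling).
for rho in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(f"utilization {rho:.0%}: relative delay {rho / (1 - rho):6.1f}")
# 80% -> 4.0, 90% -> 9.0, 99% -> 99.0: past ~80% the curve turns sharply upward.
```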

Backpressure

  • backend server utilization too high
  • load will queue inside your system
  • limit internal queues and apply backpressure

Little's Law

$L = \lambda W$

L - clients in the system, λ - arrival rate, W - wait time (latency)

$L_i = L_j \rightarrow \frac {\lambda_i} {\lambda_j} = \frac {W_j} {W_i}$

  • What happens when 1 process fails and returns errors at 1/100 of the normal latency?
  • How do you deal with this?
  • Throttle according to "normal" throughput
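
A sketch of why the fast-failing process attracts traffic, assuming the load balancer keeps a fixed number of requests in flight per backend (the pool size below is an illustrative assumption):

```python
# Little's Law: L = lambda * W. With equal in-flight requests L per backend,
# lambda_i / lambda_j = W_j / W_i, so the backend that fails fastest gets the most traffic.
L = 100                # assumed concurrent requests kept in flight per backend
w_healthy = 0.200      # healthy backend latency, seconds
w_failing = 0.002      # failing backend errors out at 1/100 of the latency

print(f"healthy:      {L / w_healthy:8.0f} req/s")
print(f"fast-failing: {L / w_failing:8.0f} req/s")
# The broken backend absorbs 100x the traffic unless throughput is throttled to "normal" levels.
```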

Feedback loops

$\frac {df} {dt} = \alpha f \rightarrow f(t) = A e^{\alpha t}$

Backoffs, cooldowns
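
One common way to damp such a feedback loop is capped exponential backoff with jitter; a minimal sketch:

```python
import random

def backoff_delay(attempt, base=0.1, cap=30.0):
    """Delay before retry number `attempt` (0-based): capped exponential backoff with full jitter."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

for attempt in range(6):
    print(f"attempt {attempt}: sleep up to {min(30.0, 0.1 * 2 ** attempt):.1f}s, "
          f"sampled {backoff_delay(attempt):.2f}s")
```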

Reliability is everyone's responsibility

Thank you

Complex Adaptive Systems

Phase changes

Chain Reaction

System memory

Transient -> permanent