Define "Reliable" – Let's talk about failure – Serial reliability



Define "Reliable" – Let's talk about failure – Serial reliability

4 24


the-math-of-reliability


On Github avishai-ish-shalom / the-math-of-reliability

t-test: introduced in 1908 by William Gosset, working for Guinness. Published under the pen name "Student" because Claude Guinness treated his use of math in the brewery as a trade secret

The Math of Reliability

Avishai Ish-Shalom (@nukemberg)

Repo: github.com/avishai-ish-shalom/the-math-of-reliability
Companion IPython notebook: reliability.ipynb

  • Reliability is intimately linked to culture, but that's a different talk
  • You can't "bolt on" reliability
  • Purpose of this talk: get people to think about reliability analytically
  • Who's using math daily?

Math!?

  • People are scared of math, but they shouldn't be
  • It's not about the formulas or numbers; it's about models, formalizing, and proving
  • Story of the professor - "it's trivial that..."

Example: Nagios-like alerts

Nagios service with max_check_attempts=4, check_interval=15sec

Service experiencing 40% error rate

Chance of hard CRITICAL: 2.6%

Chance of NOT GETTING ANY ALERT:

0.5 hour     45.9%
1 hour       21.1%
1.5 hours     9.9%
  • Nagios was not designed for statistical failures (false negatives)
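
A quick way to check these numbers (a sketch matching the slide's setup, assuming each 4-check window is an independent trial):

```python
# Nagios example: max_check_attempts=4, check_interval=15s, 40% per-check error rate.
# A hard CRITICAL needs 4 consecutive failed checks; here each 1-minute window of
# 4 checks is treated as an independent trial (an approximation).
error_rate = 0.4
max_check_attempts = 4
check_interval = 15  # seconds

p_hard_critical = error_rate ** max_check_attempts  # ~2.6%
print(f"Chance of hard CRITICAL per window: {p_hard_critical:.1%}")

for minutes in (30, 60, 90):
    windows = minutes * 60 // (max_check_attempts * check_interval)
    p_no_alert = (1 - p_hard_critical) ** windows
    print(f"No alert after {minutes} min: {p_no_alert:.1%}")  # roughly 46%, 21%, 10%
```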

Define "Reliable"

  • "4 nines"
  • MTBF
  • Failures per Year
  • QoS
  • SLA
  • Lots of terms, all insufficient
  • "uptime" means "no failure"
  • Need to define what failure is

Define "Failure"

System operating outside specified parameters

In reality: users are complaining!

"Failure" is subjective!

We have to understand the business to define failure

Possible states

  • Working OK
  • Failure
  • Unknown
  • Fuzzy
  • Failure = operating outside parameters
  • Failure isn't always obvious
  • Which clock is "correct"? Which value?
  • We don't always know what "correct" is
  • We don't always know what the system state is, e.g. our telemetry can be wrong

The absence of evidence is not the evidence of absence

The absence of alerts is not evidence of proper operation

Let's talk about failure

Reliability measures

  • MTBF = mean time between failures (years per failure)
  • λ = failures per year
  • F = failure rate or probability of failure in one year
  • R = reliability rate (probability of working in one year)

$$\lambda = T / MTBF$$

$$F = \lambda / T = 1 / MTBF$$

$$R = 1 - F$$

Typical HDD MTBF: 0.3-1M hours (about 35-120 years); MTBF is computed in a lab and extrapolated over time
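
A minimal sketch of these relations, using 1M hours (the high end of the slide's HDD range) as the assumed MTBF:

```python
# MTBF, lambda, F and R for an assumed HDD MTBF of 1M hours.
HOURS_PER_YEAR = 24 * 365
mtbf_years = 1_000_000 / HOURS_PER_YEAR  # ~114 years
lam = 1 / mtbf_years                     # failures per year
F = lam                                  # P(failure in one year); good approximation while lambda is small
R = 1 - F
print(f"MTBF ~{mtbf_years:.0f} years, lambda={lam:.4f}/year, F={F:.2%}, R={R:.2%}")
# The exact one-year failure probability is 1 - exp(-lam), which is ~F for small lambda.
```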

Statistical independence

  • Dominant mode in hardware
  • Also applies to some software failures

The Hot Hand fallacy

The Gambler's fallacy

Past performance does not predict future performance*

  • This is what statistical independence means: future events are independent of past events
  • In general, we can only make statistical predictions over a large number of similar systems

Serial reliability

$$R_{total} = \prod_{i=1}^{n} R_{i}$$

Serial reliability

R1       R2      R3      R_system   Improvement (MTBF)
0.995    0.99    0.95    0.936      -
0.9995   0.99    0.95    0.94       x 1.07
0.995    0.999   0.95    0.944      x 1.15
0.995    0.99    0.995   0.98       x 3.21

$$R_{total} \lt \min(R_{i})$$

Best ROI - improve the worst component

Improvement is expensive

Total reliability is always lower than the worst component, so there's no point using disproportionately reliable components
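
The table above falls out directly from the product formula; a small sketch reproducing it:

```python
# Serial reliability: the system works only if every component works,
# so R_total is the product of the component reliabilities.
from math import prod

rows = [
    (0.995,  0.99,  0.95),
    (0.9995, 0.99,  0.95),
    (0.995,  0.999, 0.95),
    (0.995,  0.99,  0.995),
]
baseline_F = 1 - prod(rows[0])
for r1, r2, r3 in rows:
    r_total = prod((r1, r2, r3))
    mtbf_gain = baseline_F / (1 - r_total)  # MTBF improvement vs. the first row
    print(f"{r1} * {r2} * {r3} = {r_total:.3f}  (MTBF x{mtbf_gain:.2f})")
```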

Cheaper way: clusters!

Parallel reliability (redundancy)

Reliability of redundant system, up to $k$ failures

$$R_{total}(n, k) = \sum_{i=0}^{k} {n \choose i} F^{i} R^{n-i}$$

Is this an argument for using enterprise-grade hardware?

Redundant system, R=0.95

n - cluster size; k - failures tolerated

n     k    Overhead   R_total
10    1    10%        0.914
10    2    20%        0.989
100   5    5%         0.616
100   9    9%         0.972
100   11   11%        0.996
  • Not enough redundancy will REDUCE your reliability
  • The N+1 rule is only true for small clusters
  • Large clusters are more cost-effective

Use many small/cheap identical components
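
The cluster numbers above come from the binomial formula; a sketch reproducing them (assuming independent failures):

```python
# Reliability of a redundant cluster of n identical components (R=0.95 each)
# that keeps working as long as at most k of them fail.
from math import comb

def cluster_reliability(n, k, r=0.95):
    f = 1 - r
    return sum(comb(n, i) * f**i * r**(n - i) for i in range(k + 1))

for n, k in [(10, 1), (10, 2), (100, 5), (100, 9), (100, 11)]:
    print(f"n={n:3d} k={k:2d} overhead={k/n:4.0%}  R_total={cluster_reliability(n, k):.3f}")
```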

Statistically dependent / Correlated failures

  • Shared workload
  • Shared code
  • Shared infrastructure
  • Dominant failure mode in software
  • Still gain added reliability from redundancy, but not as much
  • How do you deal with it?
  • Segmentation, failure domains
  • Intentional variation

Backup and operational sub-systems should avoid coupling with primaries

  • Seems obvious but people get it wrong
  • Especially when it comes to monitoring and ops tools
  • 1 out of 1000 drivers is drunk
  • Breathalyzer detects all drunks but has 5% false positives
  • Drivers stopped at random

A driver was stopped and the breathalyzer shows he's drunk. What's the probability he's really drunk?

If you answered 0.95, you have fallen for...

The Base rate fallacy

Correct answer: ~ 0.02

Explanation

In a sample of 1000 drivers, 1 would be drunk and 49.95 (999 × 0.05) would falsely test as drunk

Base rate of being detected as drunk (P(D)=50.95/1000) >> rate of drunk drivers (P(drunk)=1/1000)

Bayes theorem: $P(drunk|D) = P(D|drunk) P(drunk)/P(D)$

$P(D|drunk) = 1, P(drunk)=1/1000, P(D) = 50.95/1000$
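
The same calculation, spelled out:

```python
# Base rate fallacy: Bayes' theorem with the slide's numbers.
p_drunk = 1 / 1000          # prior: 1 in 1000 drivers is drunk
p_pos_given_drunk = 1.0     # breathalyzer detects every drunk driver
p_pos_given_sober = 0.05    # 5% false positive rate

p_pos = p_pos_given_drunk * p_drunk + p_pos_given_sober * (1 - p_drunk)  # ~50.95/1000
p_drunk_given_pos = p_pos_given_drunk * p_drunk / p_pos
print(f"P(drunk | positive test) = {p_drunk_given_pos:.3f}")  # ~0.02
```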

Active/Standby failover

  • Failed master always detected
  • 2% probability of false positive (working master detected as failed)
  • ~ 95% of failovers are erroneous
  • Erroneous failovers can cause severe issues

Disable auto-failover, greatly reduce false positives, or use active/active

  • Database failover dilemma: GitHub 2012 outage
  • You may be tempted to say quorum decision can solve this, but..
  • Either reduce false positives drastically or reduce failover issues
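
The ~95% figure follows from the same Bayes argument. The slide gives the 2% false-positive rate but not the base rate of real master failures, so the per-check failure probability below is an assumed illustrative value:

```python
# Failover as a detection problem (sketch; the base rate is an assumption, not from the slide).
p_master_down = 0.001        # assumed probability the master is really down in a given check window
p_detect_given_down = 1.0    # a failed master is always detected
p_detect_given_up = 0.02     # 2% false positives

p_detect = p_detect_given_down * p_master_down + p_detect_given_up * (1 - p_master_down)
p_down_given_detect = p_detect_given_down * p_master_down / p_detect
print(f"P(master really down | failover triggered) = {p_down_given_detect:.2f}")  # ~0.05
# i.e. roughly 95% of automatic failovers would be erroneous under these assumptions.
```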

Multiple dependencies

Circuit breakers!!

microservices FTW
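
For illustration, a minimal circuit-breaker sketch (not any particular library's API): after a run of consecutive errors the breaker opens and calls fail fast, then a trial call is allowed through once a reset timeout elapses.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after max_failures consecutive errors, fail fast while open."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```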

Queuing delay

$delay \propto \frac {\rho} {1 - \rho}$

ρ - system utilization

Throttle your system!

  • If you go over ~80% utilization, latency starts rising fast
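
A quick look at how ρ/(1−ρ) behaves:

```python
# Relative queuing delay as a function of utilization (M/M/1-style scaling).
for rho in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(f"utilization {rho:.0%}: relative delay {rho / (1 - rho):6.1f}")
# 80% -> 4.0, 90% -> 9.0, 99% -> 99.0: past ~80% the curve turns sharply upward.
```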

Backpressure

  • backend server utilization too high
  • load will queue inside your system
  • limit internal queues and apply backpressure

Little's Law

$L = \lambda W$

L - clients in the system, λ - arrival rate, W - wait time (latency)

$L_i = L_j \rightarrow \frac {\lambda_i} {\lambda_j} = \frac {W_j} {W_i}$

  • What happens when 1 process fails and returns errors at 1/100 of the normal latency?
  • How do you deal with this?
  • Throttle according to "normal" throughput
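
A sketch of why the fast-failing process attracts traffic, assuming the load balancer keeps a fixed number of requests in flight per backend (the pool size below is an illustrative assumption):

```python
# Little's Law: L = lambda * W. With equal in-flight requests L per backend,
# lambda_i / lambda_j = W_j / W_i, so the backend that fails fastest gets the most traffic.
L = 100                # assumed concurrent requests kept in flight per backend
w_healthy = 0.200      # healthy backend latency, seconds
w_failing = 0.002      # failing backend errors out at 1/100 of the latency

print(f"healthy:      {L / w_healthy:8.0f} req/s")
print(f"fast-failing: {L / w_failing:8.0f} req/s")
# The broken backend absorbs 100x the traffic unless throughput is throttled to "normal" levels.
```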

Feedback loops

$\frac {df} {dt} = \alpha f \rightarrow f(t) = A e^{\alpha t}$

Backoffs, cooldowns
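
One common way to damp such a feedback loop is capped exponential backoff with jitter; a minimal sketch:

```python
import random

def backoff_delay(attempt, base=0.1, cap=30.0):
    """Delay before retry number `attempt` (0-based): capped exponential backoff with full jitter."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

for attempt in range(6):
    print(f"attempt {attempt}: sleep up to {min(30.0, 0.1 * 2 ** attempt):.1f}s, "
          f"sampled {backoff_delay(attempt):.2f}s")
```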

Reliability is everyone's responsibility

Thank you

Complex Adaptive Systems

Phase changes

Chain Reaction

System memory

Transient -> permanent