The t-test was introduced in 1908 by William Gosset, who worked for Guinness. He published under the pen name "Student" because Claude Guinness treated his use of math in the brewery as a trade secret.
Avishai Ish-Shalom (@nukemberg)
repo: github.com/avishai-ish-shalom/the-math-of-reliability Companion IPython notebook: reliability.ipynb
Nagios service with max_check_attempts=4, check_interval=15sec
Service experiencing 40% error rate
Chance of hard CRITICAL: 2.6%
Chance of NOT GETTING ANY ALERT:
| Duration | Chance of no alert |
|---|---|
| 0.5 hour | 45.9% |
| 1 hour | 21.1% |
| 1.5 hours | 9.9% |

We have to understand the business to define failure
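The Nagios numbers above can be roughly reproduced with a few lines of arithmetic. This is a minimal sketch under simplifying assumptions that are not in the slides: every check fails independently with probability 0.4, and time is split into non-overlapping blocks of `max_check_attempts` checks. Real Nagios scheduling is more involved, so treat the output as an approximation.

```python
# Assumption: checks fail independently, and a hard CRITICAL needs
# max_check_attempts consecutive failed checks.
error_rate = 0.40
max_check_attempts = 4
check_interval_sec = 15

# Probability that one block of 4 checks fails every time.
p_hard_critical = error_rate ** max_check_attempts


def p_no_alert(hours):
    """Probability that no block of checks goes hard CRITICAL within `hours`."""
    block_sec = max_check_attempts * check_interval_sec
    blocks = int(hours * 3600 // block_sec)
    return (1 - p_hard_critical) ** blocks


print(f"hard CRITICAL per block: {p_hard_critical:.1%}")
for hours in (0.5, 1.0, 1.5):
    print(f"no alert after {hours} h: {p_no_alert(hours):.1%}")
```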
$$\lambda = T / MTBF$$
$$F = \lambda / T = 1 / MTBF$$
$$R = 1 - F$$
Typical HDD MTBF: 0.3-1M hours (roughly 34-114 years). MTBF is computed in a lab and extrapolated over time.
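To get a feel for these numbers, here is a small sketch that converts the MTBF range above to years and applies $\lambda = T / MTBF$ over one year. The 8760-hour year and the 1000-drive fleet are illustrative assumptions, not figures from the talk.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years (assumption)


def mtbf_years(mtbf_hours):
    """Convert an MTBF given in hours to years."""
    return mtbf_hours / HOURS_PER_YEAR


def expected_failures(period_hours, mtbf_hours):
    """lambda = T / MTBF: expected failures of one unit over a period T."""
    return period_hours / mtbf_hours


fleet_size = 1000  # hypothetical fleet, for illustration only
for mtbf in (300_000, 1_000_000):
    lam = expected_failures(HOURS_PER_YEAR, mtbf)
    print(f"MTBF {mtbf:>9,} h = {mtbf_years(mtbf):6.1f} years, "
          f"~{lam * fleet_size:.1f} failures/year in a {fleet_size}-drive fleet")
```

Even with a "100-year" MTBF, a large enough fleet should expect several failures every year.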
$$R_{total} \lt min(R_{i})$$
Best ROI - improve the worst component
Improvement is expensive
Total reliability is always lower than that of the worst component. No point using disproportionately reliable components.
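A minimal sketch of the serial case, assuming independent components that must all work, so the total is the product of the $R_i$. The three reliabilities are made-up values; the helper `gain_from_halving_failures` is hypothetical and only there to compare where an improvement pays off most.

```python
from math import prod


def serial_reliability(components):
    """Reliability of a chain where every component must work: product of R_i."""
    return prod(components)


# Hypothetical component reliabilities, for illustration only.
components = [0.99, 0.999, 0.95]
baseline = serial_reliability(components)


def gain_from_halving_failures(index):
    """Total reliability gained by halving one component's failure probability."""
    improved = list(components)
    improved[index] = 1 - (1 - improved[index]) / 2
    return serial_reliability(improved) - baseline


gains = [gain_from_halving_failures(i) for i in range(len(components))]
best = max(range(len(components)), key=gains.__getitem__)

print(f"total: {baseline:.4f}, worst component: {min(components)}")
for r, gain in zip(components, gains):
    print(f"halving failures of R={r}: +{gain:.4f}")
print(f"best target: R={components[best]}")
```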
Cheaper way: clusters!
Reliability of redundant system, up to $k$ failures
$$R_{total}(n, k) = \sum_{i=0}^{k} {n \choose i} F^{i} R^{n-i}$$
Is this an argument for using enterprise-grade hardware?
Redundant system, R=0.95
n - cluster size; k - failures tolerated
| n | k | Overhead | R total |
|---|---|---|---|
| 10 | 1 | 10% | 0.914 |
| 10 | 2 | 20% | 0.989 |
| 100 | 5 | 5% | 0.616 |
| 100 | 9 | 9% | 0.972 |
| 100 | 11 | 11% | 0.996 |

Use many small/cheap identical components
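The table above can be recomputed directly from the $R_{total}(n, k)$ formula. A short sketch, assuming independent, identical components with R = 0.95; the function name is illustrative.

```python
from math import comb


def redundant_reliability(n, k, r):
    """Probability that at most k of n independent components fail.

    Implements R_total(n, k) = sum_{i=0..k} C(n, i) * F^i * R^(n-i), F = 1 - R.
    """
    f = 1 - r
    return sum(comb(n, i) * f**i * r**(n - i) for i in range(k + 1))


R = 0.95
print(" n    k  overhead  R_total")
for n, k in [(10, 1), (10, 2), (100, 5), (100, 9), (100, 11)]:
    total = redundant_reliability(n, k, R)
    print(f"{n:3d}  {k:3d}  {k / n:7.0%}  {total:.3f}")
```

Note how the large cluster reaches higher reliability at a lower overhead than the small one, which is the point of the slide.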
If you answered 0.95, you have fallen for the base rate fallacy
Correct answer: ~ 0.02
In a sample of 1000 drivers, 1 would be drunk and 49.95 (999 x 0.05) would falsely test as drunk
Base rate of being detected as drunk (P(D)=50.95/1000) >> rate of drunk drivers (P(drunk)=1/1000)
Bayes theorem: $P(drunk|D) = P(D|drunk) P(drunk)/P(D)$
$P(D|drunk) = 1, P(drunk)=1/1000, P(D) = 50.95/1000$
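The same calculation as a short sketch. The inputs are the ones implied by the numbers above: 1 in 1000 drivers is drunk, the test always detects a drunk driver, and it gives a false positive for 5% of sober drivers. Variable names are illustrative.

```python
# Inputs implied by the slide's numbers (1/1000 drunk, 5% false positives).
p_drunk = 1 / 1000
p_pos_given_drunk = 1.0
p_pos_given_sober = 0.05

# Total probability of a positive test, P(D).
p_pos = p_pos_given_drunk * p_drunk + p_pos_given_sober * (1 - p_drunk)

# Bayes' theorem: P(drunk | D) = P(D | drunk) * P(drunk) / P(D).
p_drunk_given_pos = p_pos_given_drunk * p_drunk / p_pos

print(f"P(D) = {p_pos:.5f}")
print(f"P(drunk | D) = {p_drunk_given_pos:.4f}")
```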
microservices FTW
$delay \propto \frac {\rho} {1 - \rho}$
ρ - system utilization
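To see how sharply delay grows near saturation, here is a small sketch that evaluates the factor $\rho / (1 - \rho)$ at a few utilization levels. It only shows the proportionality factor; actual delay depends on the service time, which is not modeled here.

```python
def relative_delay(rho):
    """Delay factor rho / (1 - rho) for a utilization 0 <= rho < 1."""
    if not 0 <= rho < 1:
        raise ValueError("utilization must be in [0, 1)")
    return rho / (1 - rho)


for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"utilization {rho:.0%}: delay factor {relative_delay(rho):6.1f}")
```

Going from 50% to 90% utilization multiplies the factor by about nine, and the last few percent are far more expensive still.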
$L = \lambda W$
L - clients in the system, λ - arrival rate, W - wait time (latency)
$L_i = L_j \rightarrow \frac {\lambda_i} {\lambda_j} = \frac {W_j} {W_i}$
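A quick numeric illustration of Little's law and the ratio above. The request rate, latencies, and function names are hypothetical values chosen for the example.

```python
def clients_in_system(arrival_rate, wait_time):
    """Little's law: L = lambda * W."""
    return arrival_rate * wait_time


def sustainable_arrival_rate(clients, wait_time):
    """For a fixed number of clients in the system, lambda = L / W."""
    return clients / wait_time


# Hypothetical service: 200 requests/s at 50 ms latency.
L = clients_in_system(arrival_rate=200, wait_time=0.05)
print(f"clients in the system: {L:.0f}")

# Same L (e.g. a fixed pool of workers) with latency doubled to 100 ms:
# the arrival rate that can be sustained is halved.
rate = sustainable_arrival_rate(L, wait_time=0.10)
print(f"sustainable arrival rate at 100 ms: {rate:.0f} requests/s")
```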
$\frac {df} {dt} = \alpha f \rightarrow f(t) = A e^{\alpha t}$