
Building Reliable Websites

Load and Performance Edition

Stephen Kuenzli

@author skuenzli

breaking systems for fun and profit since 2000

https://github.com/skuenzli

The Process

  • determine expected site load
  • validate site handles expected load
  • stay operational when load exceeds expectations
  • profit!

determine expected site load

Key Metrics

  • Throughput: requests per second
  • Performance: response time

What percentage of your customers do you care about?

  • 50%
  • 95%
  • 99%
  • ... ?

In reality, rendering a page requires multiple requests, so the probability that the whole page meets its SLA is a joint probability across all of those requests. Good designs make services as independent and parallelized as possible.
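In the simplest model, each of a page's n requests independently meets its latency target with probability p, so the page-level probability is the product (the numbers below are illustrative, not from the deck):

$$P(\text{page meets SLA}) = p^{\,n}, \qquad \text{e.g. } 0.99^{10} \approx 0.904$$

A page built from 10 requests, each individually meeting a 99th-percentile target, meets the page-level target only about 90% of the time.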

Define a Service Level Agreement

  • Throughput: 42 requests per second
  • Performance: 99% of response times <= 100ms

Don't Forget!

  • network latency and bandwidth
  • client processing power

HOWTO: measure historical throughput

# total number of GETs to /myservice for a given day
# (grep -c would print one count per file, so pipe through wc -l instead)
grep 'GET /myservice' logs/app*/access.log.2012-11-16 | wc -l

# estimate the service's peak hour from a sample of servers:
# extract the hour from timestamps like [16/Nov/2012:17:30:00 ...],
# then count requests per hour
grep 'GET /myservice' logs/app??5/access.log.2012-11-16 | \
  perl -nle 'print m|/201\d:(\d\d):|' | sort -n | uniq -c

# total number of GETs to /myservice during the peak hour (17:00)
grep '/2012:17:.*GET /myservice' logs/app*/access.log.2012-11-16 | wc -l

HOWTO: measure response time

# processing times recorded by the server in the access log;
# the 7th quote-delimited field holds the response time in this log format
grep "GET /myservice" logs/app*/access.log.2012-11-16 | \
  cut -d\" -f7 | sort -n > service.access_times.2012-11-16
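With the times sorted, SLA percentiles can be read straight out of the file. A minimal sketch (the ceiling-index arithmetic is an addition; the filename matches the command above):

# 99th-percentile response time: the value at index ceil(0.99 * n)
f=service.access_times.2012-11-16
n=$(wc -l < "$f")
sed -n "$(( (n * 99 + 99) / 100 ))p" "$f"

Compare the result directly against the SLA's "99% of response times <= 100ms".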

what about network latency and bandwidth?

do the requests fit in the client's resource budget at the 50th/95th/99th percentile?

all models are wrong; some models are useful

model load to within +/- 20%

Three ways to estimate load:

  • count: actual requests in server logs
  • compute: expected request patterns from Firebug/Chrome waterfall charts
  • judge: defer to the judgement of 'experts' - use with extreme caution

Adjust the estimate for:

  • growth
  • seasonal loading
  • margin of error
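Putting the adjustments together is simple arithmetic. For example (every factor below is an illustrative, made-up number):

# measured peak 42 req/s, assumed +30% growth, 1.5x seasonal peak,
# and a 20% margin of error (all illustrative values)
echo '42 * 1.30 * 1.5 * 1.20' | bc
# => 98.28 req/s design target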

validate site handles expected load

validation process

  • select a tool
  • build a simulation
  • run the simulation multiple times / periodically
  • analyze trends

Gatling is an Open Source Stress Tool with:

  • A DSL to describe scenarios
  • High performance
  • HTTP support
  • Meaningful reports
  • Executable from command-line or maven
  • A scenario recorder

gatling-tool.org

build simulation

run simulation multiple times / periodically

  • gather statistically significant results
  • establish a baseline
  • verify the site does (or does not) meet its SLAs
  • detect changes over time
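One low-tech way to do this is a cron job. A sketch, assuming the standard Gatling bundle layout (bin/gatling.sh with -s to select a simulation class; the paths and class name are placeholders):

# crontab entry: run the load simulation every night at 02:00
0 2 * * * /opt/gatling/bin/gatling.sh -s MyServiceSimulation >> /var/log/gatling-nightly.log 2>&1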

analyze trends

detect changes in trend with control charts

is the process changing?
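A minimal sketch of the control-chart arithmetic, assuming a hypothetical p99-history.txt with one 99th-percentile measurement per simulation run; points outside mean +/- 3 standard deviations signal that the process has changed:

# mean and 3-sigma control limits over per-run p99 response times
awk '{ sum += $1; sumsq += $1 * $1; n++ }
     END {
       mean = sum / n
       sd = sqrt(sumsq / n - mean * mean)
       printf "mean=%.1f  UCL=%.1f  LCL=%.1f\n", mean, mean + 3 * sd, mean - 3 * sd
     }' p99-history.txt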

stay operational when load exceeds expectations

the needs of the many outweigh the needs of the few or the one

a dead site is no good to anyone

know the site's limits and stay within them

implement a series of circuit breakers that can be tripped to reduce load in a managed way

  • manual breakers, tripped by operations staff
  • automatic breakers, tripped by software

This is classic triage. It is much better to shut down or limit less-critical, nice-to-have features than to let the whole site become unavailable. Especially when starting out, there is no shame in tripping breakers manually: having an experienced engineer make the decision is rarely a bad thing when managing a non-trivial distributed system. In particular, people are good at avoiding 'flapping', where a service oscillates between available and unavailable.
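As a sketch of the manual variety (the flag-file convention, paths, and render_image handler are invented for illustration):

# ops trips breaker #3 by hand: stop rendering images
touch /var/run/breakers/render-images.tripped

# ...and resets it once load subsides
rm /var/run/breakers/render-images.tripped

# application-side guard in a request-handling wrapper
if [ -e /var/run/breakers/render-images.tripped ]; then
    exit 1                # shed load: fail fast instead of rendering
fi
render_image "$@"         # normal path (hypothetical handler)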

Example services ranked by criticality

  • send marketing email - run off-line / off-hours
  • update customer's dashboard - breaker #1
  • upload images - breaker #2
  • render images - breaker #3
  • sign-in
  • checkout
  • save customer's work

there's usually a trade-off available

Of course, the definition of critical and non-critical totally depends on the business. However, if you try hard enough, you can always rank services in order of criticality.

Resources

  • This presentation: https://github.com/skuenzli/building-reliable-websites
  • Concurrency Limiting Filter: https://github.com/skuenzli/simplyreliable
  • Web Operations: http://shop.oreilly.com/product/0636920000136.do
  • Circuit Breaker Pattern: http://doc.akka.io/docs/akka/2.1.0-RC1/common/circuitbreaker.html
  • Gatling: http://gatling-tool.org
  • Universal Scalability Law (USL), by Neil Gunther

fin

https://github.com/skuenzli