Scaling Websites

Some Lessons Learned at Mozilla

Who are you guys?

Brandon Burton <@solarce>
Chris Turra <@cturra>

we both work in web operations @ mozilla

Not just firefox?

mozilla.org (Firefox downloads)
input.mozilla.org (happy/sad face)
crash-stats.mozilla.org (crash reporting)
support.mozilla.org (user support community)
... and hundreds more

summary

architecture
load balancers
databases
async jobs
caching
self service
paas
cloud

architecture

clusters
admin node
web nodes
database nodes
some clusters shared

Load Balancers

Zeus (now Stingray) software solution

Platform

RHEL6
HP DL360
Myricom Myri-10G

Details

in front of nearly everything

apache, mysql, elasticsearch, hadoop
200k packets per second for SSL

Databases

/dev/null is web scale.

MySQL Multi-Master

Only ONE is active for writes in the Load Balancer.
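
A minimal sketch of the "only one active writer" idea, assuming the load balancer walks the masters in priority order and sends writes to the first healthy one. The host names and the health check are hypothetical; this is not how Zeus is configured.

```python
from typing import Callable, Optional, Sequence


def pick_write_master(
    masters: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy master in priority order, or None."""
    for host in masters:
        if is_healthy(host):
            return host
    return None


# Hypothetical host names; the lambda stands in for a real health probe.
masters = ["db-master-a", "db-master-b"]
down = {"db-master-a"}
active = pick_write_master(masters, lambda host: host not in down)
```

Every write goes to `active`; the other master keeps replicating and only takes over when the first one fails its check.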

Read Slaves

Write to master, but only read from slaves.

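DATABASES = {
    # ...

    # Read-only connection; the host name suggests a load balancer address.
    'slave': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mozillians_org',
        'USER': 'mozillians',
        'PASSWORD': 'YoUt#!nk+h1$izR3@l?',
        'HOST': 'generic-ro-zeus',
        'PORT': '3306',
    },
}

# Aliases from DATABASES that are safe to read from.
SLAVE_DATABASES = ['slave']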
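
One way to act on a setting like this is a Django database router. The sketch below is an illustration, not the router the sites actually run; the class name is hypothetical and the `'default'` alias for the write master is an assumption, since the settings above elide it.

```python
import random


class ReadReplicaRouter:
    """Send reads to a replica alias and writes to the primary alias."""

    def __init__(self, replicas=("slave",), primary="default"):
        self.replicas = list(replicas)
        self.primary = primary  # assumed alias for the write master

    def db_for_read(self, model, **hints):
        # Spread reads across whatever replica aliases are configured.
        return random.choice(self.replicas)

    def db_for_write(self, model, **hints):
        return self.primary

    def allow_relation(self, obj1, obj2, **hints):
        # Primary and replicas hold the same data, so relations are fine.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Schema changes run only against the primary.
        return db == self.primary


router = ReadReplicaRouter()
```

Django would pick such a class up through the `DATABASE_ROUTERS` setting. Replication lag means a read right after a write may not see the new row.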

Hardware

No virtualization in production.

HP blades
Fusion-IO
HP and Kingston SSDs

DBAs

AWESOME DBAs are AWESOME! + query optimization, like code reviews.

A'SYNC Jobs

webscale boy band

Celery

don't block the web app
written in python & we use django
supervisord for celeryd

Rabbit MQ

message queue between web app & celery service
cluster per datacenter
puppet module to horizontally scale
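
The "don't block the web app" pattern, sketched with the standard library only. The in-process queue stands in for RabbitMQ and the thread for a Celery worker; the task and handler names are hypothetical.

```python
import queue
import threading

jobs = queue.Queue()   # stand-in for the RabbitMQ queue
results = []           # where the worker records finished work


def worker() -> None:
    """Stand-in for a Celery worker: run jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        func, args = job
        results.append(func(*args))
        jobs.task_done()


def resize_image(name: str) -> str:
    """Hypothetical slow task that should not run inside a request."""
    return f"resized:{name}"


def handle_request(name: str) -> str:
    """Web handler: enqueue the work and answer immediately."""
    jobs.put((resize_image, (name,)))
    return "202 Accepted"


thread = threading.Thread(target=worker, daemon=True)
thread.start()
response = handle_request("avatar.png")
jobs.put(None)   # tell the worker to stop
jobs.join()      # wait here only so the example finishes deterministically
```

The handler returns before the task runs. In production the queue and the workers live in separate processes, so either side can be scaled on its own.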

Cache

rules everything around me

Memcache

we use the vanilla memcached

memcache::data

ephemeral data (sessions/rss feeds/etc)
short lived and can be lost without impact

memcache::databases

django-cachemachine
object manager, looks in cache first for data
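
"Looks in cache first" is a read-through cache. A minimal sketch, assuming a dict in place of memcached and a callable in place of the database query; this is not django-cachemachine's API.

```python
class CacheFirstStore:
    """Check the cache first; fall back to the database on a miss."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db
        self._cache = {}   # stand-in for memcached
        self.db_hits = 0   # counts how often the database was needed

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.db_hits += 1
        value = self._load_from_db(key)
        self._cache[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached entry, e.g. after the underlying row changes."""
        self._cache.pop(key, None)


fake_db = {"user:1": "ada"}   # hypothetical data
store = CacheFirstStore(fake_db.__getitem__)
first = store.get("user:1")    # miss: goes to the database
second = store.get("user:1")   # hit: served from the cache
```

The hard part in practice is invalidation: every write path has to drop or refresh the cached objects it touches.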

Local HTTP caching

We use Zeus. You can also use: Varnish, Squid

Global HTTP Caching: CDN

~450 million Firefox users (6 wk updates)
vendors: Akamai/EdgeCast (65%/35%)
balance traffic with DynECT based on response
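
A rough illustration of a 65%/35% split, assuming a fixed weighted random choice per lookup. The real balancing is done by DynECT and reacts to response, which this sketch does not model.

```python
import random
from collections import Counter

# Vendor weights taken from the 65%/35% split above.
WEIGHTS = {"Akamai": 65, "EdgeCast": 35}


def pick_vendor(rng: random.Random) -> str:
    """Pick one CDN vendor with probability proportional to its weight."""
    vendors = list(WEIGHTS)
    weights = [WEIGHTS[v] for v in vendors]
    return rng.choices(vendors, weights=weights, k=1)[0]


rng = random.Random(0)   # fixed seed so the example is repeatable
counts = Counter(pick_vendor(rng) for _ in range(10_000))
```

Over many lookups the share of answers per vendor settles close to the configured weights.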

Akamai::FF18 HPS

Jan 10, 2013 -> Jan 13, 2013 inclusive.

Total hits: 5.5 billion
Peak HPS: 58,379.7 hits/sec

Akamai::FF18 Bandwidth

Jan 10, 2013 -> Jan 13, 2013 inclusive.

Total volume: 2.1PB
Peak traffic: 163.177 GBit/sec
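
A quick back-of-the-envelope check on the Akamai numbers, assuming four full days (Jan 10 through Jan 13 inclusive) and 1 PB = 10**15 bytes. It compares the peak against the average implied by the totals.

```python
SECONDS = 4 * 24 * 60 * 60   # four full days

total_hits = 5.5e9
peak_hps = 58_379.7
avg_hps = total_hits / SECONDS

total_bytes = 2.1e15         # assumes 1 PB = 10**15 bytes
peak_gbit = 163.177
avg_gbit = total_bytes * 8 / SECONDS / 1e9

print(f"average hits/sec: {avg_hps:,.0f} (peak is {peak_hps / avg_hps:.1f}x)")
print(f"average Gbit/sec: {avg_gbit:.1f} (peak is {peak_gbit / avg_gbit:.1f}x)")
```

Under these assumptions the average is roughly 15,900 hits/sec and 48.6 Gbit/sec, so the release peak runs at about 3.7x and 3.4x the average. Capacity has to be sized for the peak, not the average.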

Scale Out

or you fail out

Config Management

We chose Puppet, but there are other great options like: Chef & CFEngine

Disposable Web Heads

nothing is shared
SeaMicro Xeon
common files (uploads/css/js) in NetApp NFS
S3 to replace NFS for upload storage (amo/marketplace)

AMD SeaMicro

deployed for increased compute efficiency
saves up to 75% in space/power
enables 192 vs. 64 hosts per 45U rack

The Future

where we're going, we don't need roads

DevOps culture

blameless postmortems
all invested in the same mission
continuous improvement (always try to make the process better)
hire the best f$*!ing people

Self Service::Goal

to become platform engineers!

Self Service::Continuous Deployment

django-waffle
dark launching / feature flags
sumo, amo, input, mdn
if flag_is_active, checks, cookies, superuser, group, "dice roll"
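
A simplified sketch of how a feature flag check like this can work: a sticky cookie first, then superuser, then group, then a percentage "dice roll". This is not django-waffle's implementation; the `User` and `Flag` classes, the `is_flag_active` helper, and the cookie naming are all hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class User:
    name: str
    is_superuser: bool = False
    groups: Set[str] = field(default_factory=set)


@dataclass
class Flag:
    name: str
    superusers: bool = True                        # on for superusers
    groups: Set[str] = field(default_factory=set)  # on for these groups
    percent: float = 0.0                           # share of users, 0-100


def is_flag_active(flag: Flag, user: User, cookies: Dict[str, str],
                   rng=random) -> bool:
    cookie_name = f"flag_{flag.name}"   # hypothetical cookie naming
    if cookie_name in cookies:
        # An earlier dice roll is sticky for this visitor.
        return cookies[cookie_name] == "True"
    if flag.superusers and user.is_superuser:
        return True
    if flag.groups & user.groups:
        return True
    if flag.percent > 0:
        active = rng.uniform(0, 100) <= flag.percent
        cookies[cookie_name] = str(active)   # remember the roll
        return active
    return False


flag = Flag("new_search", groups={"beta"}, percent=0.0)
admin_sees_it = is_flag_active(flag, User("root", is_superuser=True), {})
visitor_sees_it = is_flag_active(flag, User("guest"), {})
```

Dark launching means the code ships to production switched off, then `percent` is raised gradually while the graphs are watched.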

Self Service::Chief

90% of site pushes to prod by end of 2013Q1

Self Service::Jenkins

socorro - tarballs
stage autodeploys

Self Service::Graphite

everyone has access to the graphs, real time.
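
Getting a number into Graphite takes very little code. The sketch below uses Carbon's plaintext protocol (one `path value timestamp` line per metric, port 2003 by default); the metric path and the helper names are hypothetical.

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float,
                  timestamp: Optional[int] = None) -> bytes:
    """Build one line of Carbon's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n".encode("ascii")


def send_metric(line: bytes, host: str, port: int = 2003) -> None:
    """Send a formatted line to a Carbon listener over TCP."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line)


# Formatting only; send_metric needs a reachable Carbon host.
line = format_metric("web.example.requests", 42, 1357776000)
```

Because emitting a metric is this cheap, developers can add their own graphs without filing a request with ops.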

Self Service::Logstash

With Kibana, everyone has access to the logs. yup, real time.

Self Service::Sentry

everyone has access to exception tracking. you guessed it, real time!

PaaS

we chose Stackato by ActiveState (built on CloudFoundry)
evaluated CloudFoundry, OpenShift & various hosted
chose most product focused

Cloud

dynamically scale in cloud, base footprint in datacenter
PaaS -> add DEA instances for scaling extra capacity.

keep on rockin' the free web
