How to be good at Operations

in 40 minutes

Created by Adam Jacob / @adamhjk

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Fork This Presentation on Github

sur·vey

noun, sərˈvā/

1. a general view, examination, or description of someone or something. "the author provides a survey of the relevant literature"
  • Each topic worthy of 40 minutes itself
  • Goal is to teach the reason for things
  • So you make better implementation choices

What is Operations?

By which we mean technical operations

The work of building and maintaining computer systems, networks, and applications.
Original Image
  • The definition covers everyone
  • This is why "devops" is obvious and the new normal

How to be good at Operations

Design to improve the safety, contentment, knowledge and freedom of your colleagues and users.

Focus on improving availability through reducing MTTD and MTTR.

Improve the organization's efficiency through improvements in People, Process, and Technology.

Done well, Operations enhances the safety, contentment, knowledge and freedom of both the authors and users of the system.

  • Design is fundamental
  • Each choice you make needs to make life better for the humans involved
  • That also leads to better business outcomes, as we'll learn later
  • Ultimately, the most scalable, fastest systems are also the ones that are best for the humans involved, most of the time

Safety

  • Human safety
  • Information safety
  • Availability of the system as a possible link to both
  • The ability for individuals to act without fear of unintended consequences

Safety is a slider – different systems have different thresholds

Original Photo

  • Imagine you were at Twitter in the early days
  • The system wasn't human safety critical, in your mind
  • Until it became a source of human safety and communication during countless revolutions

Contentment

Contentment is about being satisfied with what you have.

The state of our systems is often a source of deep discontent :)

It may not make you happier – but it won’t hurt

Happiness is not a goal – it's a by-product of a life well lived. - Eleanor Roosevelt
Original Image
  • Happiness is fleeting
  • If you are in trouble, contentment helps you make better decisions
  • Think about a brutal on call week - if the systems that support you are good, you survive

Knowledge

Access to knowledge is a leading indicator of social progress.

We should be making it easier to understand what the system is for, why we need it, and what good outcomes are.

The goal isn’t to minimize needed knowledge – it’s to provide access to the wealth of it, when we need it.

Original Image
  • The right knowledge, at the right time
  • Think about PaaS - it's awesome, you just git push
  • Until you are Rap Genius and Heroku changes the router and everything sucks and you don't know why
  • Which doesn't make PaaS awful - at a different level of criticality, who cares?
Freedom

The power or right to act, speak, or think as one wants without hindrance or restraint. – The Internet

We should be empowering ourselves and others to act, speak, and think as they need to with less hindrance.

Original Image
  • The Big Web got this right
  • Empower individuals to work as they see fit
  • Trust them to do the right things
  • Build systems that increase the trust needed to allow more freedom

Safety

Contentment

Knowledge

Freedom

  • We will come back to these throughout

Being good at Operations

Means being good at two things

Availability

Efficiency

  • Availability: Is the system down? Bring it back up.
  • Efficiency: Make the effort required to do work easier.
  • The work here is building and maintaining computers, networks, and applications
  • So efficiently doing that covers damn near everything

Focus on Availability

Efficiency Follows

  • Availability shows where you need to be most efficient now
  • It's a virtuous cycle

Availability

$$Availability = \frac{Uptime}{Uptime + Downtime}$$

Much thanks to Theo Schlossnagle, John Allspaw, Patrick Debois, and others for informing much of this section. Mistakes are mine.

Availability is everybody's problem

  • There is no team that owns availability - other than the company itself
  • The problems are too big

The 9's

Availability           Downtime per month
90% (one nine)         72 hours
99% (two nines)        7.2 hours
99.9% (three nines)    43.8 minutes
99.99% (four nines)    4.32 minutes
99.999% (five nines)   25.9 seconds

Original Image
  • The difference in magnitude matters - days, hours, half hours, minutes, seconds
  • To achieve higher levels, everything has to get more precise
  • Know your target, and communicate it
  • It probably isn't five nines
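
To make the formula and the table concrete, here is a minimal Python sketch (assuming a flat 30-day month, so the minutes differ slightly from the table above) that turns an availability target into a monthly downtime budget:

    # A minimal sketch, assuming a 30-day month: turn an availability
    # target into a monthly downtime budget.
    MONTH_SECONDS = 30 * 24 * 60 * 60

    def downtime_budget_seconds(availability):
        """Seconds of allowed downtime per month for a given availability."""
        return MONTH_SECONDS * (1 - availability)

    for nines, target in [("one", 0.90), ("two", 0.99), ("three", 0.999),
                          ("four", 0.9999), ("five", 0.99999)]:
        minutes = downtime_budget_seconds(target) / 60
        print(f"{target:.5%} ({nines} nines): {minutes:,.1f} minutes/month")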

The M's

  • Mean Time To Failure (MTTF) ↑ The average time there is correct behavior
  • Mean Time To Diagnose (MTTD) ↓ The average time it takes to diagnose the problem
  • Mean Time To Repair (MTTR) ↓ The average time it takes to fix a problem
  • Mean Time Between Failures (MTBF) ↑ The average time between failures
  • We want to decrease MTTD and MTTR
  • And increase MTTF and MTBF
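
A hedged sketch of how you might compute these from an incident log (the data is hypothetical, times are minutes, and MTTR here is measured from diagnosis to repair):

    # Hypothetical incident log: when each incident started, was diagnosed,
    # and was repaired (all in minutes since some arbitrary epoch).
    incidents = [
        {"start": 0,    "diagnosed": 20,   "repaired": 35},
        {"start": 1440, "diagnosed": 1450, "repaired": 1500},
        {"start": 4000, "diagnosed": 4005, "repaired": 4020},
    ]

    mttd = sum(i["diagnosed"] - i["start"] for i in incidents) / len(incidents)
    mttr = sum(i["repaired"] - i["diagnosed"] for i in incidents) / len(incidents)
    # MTBF: average gap from the end of one incident to the start of the next.
    gaps = [b["start"] - a["repaired"] for a, b in zip(incidents, incidents[1:])]
    mtbf = sum(gaps) / len(gaps)

    print(f"MTTD {mttd:.0f} min, MTTR {mttr:.0f} min, MTBF {mtbf:.0f} min")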

Focus your efforts

On reducing Mean Time to Diagnose and Mean Time to Repair.

Failure is inevitable - it's how you detect and react that matter most to availability.

Original Image
  • All systems fail
  • Fear of failure is the greatest killer of availability

Slow and ponderous

Fast and nimble

  • Online banking is a huge thing for consumer banks
  • I met with one that has 5 9's of availability
  • They achieved this through changing the website once every 6 months
  • After a torture chamber of hate and pain
  • They were not better at diagnosing and repairing - they were good at MTBF, and lucky
  • Contrast that with a more nimble org, who might have more frequent outages (say scheduled maintenance once a week)
  • But the system improves week over week
  • Raise your hand which one you want!
  • It's safer, increases human contentment, is easier to reason about, and frees people up

Diagnose

Metrics Collection

Collect metrics from the operating system, network, and applications.

High resolution matters!

As few systems as possible.

Original Image
  • You can't fix what you can't see
  • Metrics resolution has direct impact on MTTD
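
As a toy example of what "high resolution" means in practice, here is a standard-library-only sketch (Unix-only; the metric name and file format are made up) that samples the load average once per second:

    # Sample the 1-minute load average every second and append timestamped
    # lines that a graphing system could ingest later. Unix-only (getloadavg).
    import os
    import time

    def collect(path="loadavg.metrics", interval=1.0, samples=10):
        with open(path, "a") as out:
            for _ in range(samples):
                one_minute, _, _ = os.getloadavg()
                out.write(f"{time.time():.0f} loadavg.1m {one_minute:.2f}\n")
                out.flush()
                time.sleep(interval)

    collect()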

Diagnose

Two Critical Metrics

Is it up - from a user's perspective
Is it making money

Original Image

  • One binary metric - can your users use your stuff
  • Money is often a trailing indicator of deeper systemic problems that are hard to see
  • I helped run an ad network back in the day, and the hour-by-hour money graph was the fastest way to see if we were letting people run over cap
  • Money graph also helps you justify other activity!
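
A minimal sketch of the binary "is it up, from a user's perspective" check, using only the standard library (the URL is a placeholder; a real check would hit whatever page your users actually depend on):

    # Check that the front page loads, the way a user would see it.
    import urllib.request

    def is_it_up(url="https://www.example.com/", timeout=5):
        """Return True if the page responds with a 2xx within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return 200 <= response.status < 300
        except Exception:
            return False

    print("up" if is_it_up() else "DOWN - page someone")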

Diagnose

Graphing, Trends and Analysis

Use graphs to understand normal behavior.

Graphs taken from Theo Schlossnagle and OmniTI

  • Let's say this is puppy.com - the prime source for puppy news
  • Nice, easy content day - 70% utilization, smooth peaks and valleys
  • The Doge dog beats up the Taco Bell chihuahua outside the most posh dog park in LA
  • Puppy.com has the exclusive video

Diagnose

Graphing, Trends and Analysis

Use graphs to understand abnormal behavior.

Graphs taken from Theo Schlossnagle and OmniTI

  • The New York Times picks it up, and adds long exposure traffic
  • Digg shows up, and it goes to 11
  • Happens in 60 seconds!

Auto-Scaling Will Not Save You

  • Either you design for this load, or you fail to meet the expectations
  • The right answer here is serve puppy.com from behind Fastly :)

Capacity Planning

Identify key metrics
Put them on a graph
Set a limit
Plot a trend line
Expand your time horizon

Original Image

Capacity Planning

  • Do this on a regular cadence - monthly, etc.
  • Show your R-squared - think of it as a confidence number
  • This could be any metric that matters for your system
  • This is the number one source of trivially preventable outages
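
A hedged sketch of that loop, assuming numpy and twelve months of a made-up metric: fit a trend line, report R-squared as the confidence number, and estimate when the trend crosses the limit you set:

    import numpy as np

    months = np.arange(12)                       # last 12 monthly samples
    disk_used_pct = np.array([41, 43, 44, 47, 49, 52, 54, 55, 58, 61, 63, 66])
    limit = 85.0                                 # the limit you set

    slope, intercept = np.polyfit(months, disk_used_pct, 1)
    predicted = slope * months + intercept
    ss_res = float(np.sum((disk_used_pct - predicted) ** 2))
    ss_tot = float(np.sum((disk_used_pct - disk_used_pct.mean()) ** 2))
    r_squared = 1 - ss_res / ss_tot              # the confidence number

    months_until_limit = (limit - disk_used_pct[-1]) / slope
    print(f"trend {slope:.1f}%/month, R^2 {r_squared:.2f}, "
          f"~{months_until_limit:.0f} months until the {limit:.0f}% limit")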

Diagnose

Alerts

Get the attention of the right humans.

  • As few alerts as possible
  • Routed to the people who can take action
  • Start with the is it up alert
  • Never create an alert that isn't actionable!
  • There is nothing more disrespectful than waking someone up for shit they can't fix
  • It's happening... it's happening... again

Repair

Incident Response

Original Image
  • Observe: what's going on
  • Orient: put what's going on in the context of what you know about the system, people, and dynamics
  • Decide: what to do next
  • Act: take action
  • Originally for fighter pilots to get inside the heads of the enemy
  • A faster loop means success in combat
  • This is the same pattern for responding to operations availability issues

Repair

Orient

Orient is the step we often fail at.

Thinking is the best tool we have in incident response.

Understanding more about the system, and how each piece behaves, is what separates the good from the great.

What Rob Pike learned from Ken Thompson

  • In fighter jets, knowing the enemy's typical behavior, their jets, and their culture was crucial
  • Rob Pike and Ken Thompson were working on a visual language
  • Rob typed faster, so he was at the keyboard
  • Rob attacked bugs, Ken thought about them
  • Ken was orienting better
  • Unlike a fighter jet, he had time :)

Repair

Incident Command

The First Responder is the default Incident Commander

Decides what to do next
Coordinates resources
Can hand off command
Communicates status
Not about rank

There is only ONE Incident Commander.

This isn't always true in real Incident Command, but go with it.

  • When it gets bigger than one person can handle, we flip to this
  • Knowing we have a process and a command structure makes it easier to OODA
  • And faster loops mean faster resolution

Learn

Post Mortem

Incident Commander schedules a post mortem within 24 hours of incident resolution.

Purpose is to learn from the incident, and identify the work needed to:

  • Prevent recurrence (if necessary)
  • Improve Mean Time To Diagnose
  • Improve Mean Time To Repair

Original Image

  • The Incident Commander should schedule this at the end of the incident
Progress on safety coincides with learning from failure. This makes punishment and learning two mutually exclusive activities: Organizations can either learn from an accident or punish the individuals involved in it, but hardly do both at the same time. The reason is that punishment of individuals can protect false beliefs about basically safe systems, where humans are the least reliable components. Learning challenges and potentially changes the belief about what creates safety. Moreover, punishment emphasizes that failures are deviant, that they do not naturally belong in the organization... Sidney W.A. Dekker, Ten Questions about Human Error: A New View of Human Factors and System Safety (Human Factors in Transportation)

Learn

How to run a Post Mortem

Invoke the space: we are here to learn, not to blame
Describe the incident
Establish the timeline
Identify contributing factors
Describe customer impact
Describe remediation tasks for the root cause
Describe improvement tasks for response process
  • We hold post mortems to learn and improve, not to blame and punish
  • Puppy.com went down when Digg linked to the Doge/Chihuahua story
  • Story gets posted at 8am PST, NYT picks it up at 8:15am PST, Digg posts at 8:30am PST
  • Site goes down at 8:30am, alert at 8:31am, diagnosed at 8:50am, more capacity launched on EC2 at 8:55am, online and resolved at 9:00am PST
  • The traffic load overwhelmed the Apache mpm_worker configuration and exhausted capacity
  • People could not watch the Doge dog crush the chihuahua, and click ads
  • Launched more capacity. Long term remediation is to move static content to a CDN
  • We investigated a denial of service and backend database issues before we looked at traffic graphs. Add passive alert on traffic.

Prioritize the outcomes

  • The process works because you prioritize the outcomes
  • Our remediation steps are the efficiency improvements you want
  • If you fail to act, or do other stuff, you're wasting the opportunity

Availability Roundup

  • Understand your Availability Targets
  • Track and understand your M*'s
  • Reduce time to detect and repair
  • Use capacity planning to avoid obvious incidents
  • Have an incident response and command process
  • Perform and publish post-mortems for every incident
  • Prioritize the outcomes

Efficiency

$$Efficiency = \frac{Output}{Effort}$$

Make the effort required to do work easier.

Original Image

People

Process

Technology

  • 3 areas for efficiency, in order of most potential for gains
  • Think about puppy.com - if we didn't have the right people, if we didn't have a process for incidents, if we didn't have post mortems, the technology fixes wouldn't make a dent long term

Know the mission

  • What is the mission?
  • How does your organization intend to fulfill it?
  • How do you contribute?
  • What are the stakes?
  • Knowing your purpose enables you to put decisions in context
  • The more context you have, the better your decision will be
  • Like a very long OODA loop

Know the people

  • Software Developers
  • Business Decision Makers
  • Systems and Network Administrators
  • Marketing and PR
  • Sales
  • Legal

Original Image

  • Trust is crucial to effective operations
  • Knowing people is crucial to trusting them
  • Set up lunch dates
  • Talk about your lives
  • THIS IS WHERE DEVOPS COMES FROM
  • John Allspaw and Paul Hammond are friends
When they create electronic devices, they can reflect on whether that new product will take people away from themselves, their family and nature. Instead they can create the kind of devices and software that can help them go back to themselves, to take care of their feelings. By doing that, they will feel good because they’re doing something good for society. - Thich Nhat Hanh at Google
  • The way we do our work informs our lives
  • Having good lives improves the quality of our work in every dimension
  • We are blessed to be the architects of our environment
  • Let's back Thay up with data

People

Engaged Workers Rule

Stats in this section come from asking 25 million employees the same 12 questions in Gallup's state of the American Workplace with causality evidence from Causal Impact of Employee Work Perceptions on the Bottom Line of Organizations.
  • Gallup has been running this study since the 90s
  • They have proven the impact engaged workers have is causal
  • What other single thing could you possibly do that has a 22% impact on profitability?
  • 21% impact on productivity!
  • 65% less turnover!
  • Or a 41% impact on defects! Happy people care about their work more
  • It's the most critical operations efficiency task

Sources of Engagement

Clear expectations
Opportunity to shine
Praise
Having people care about you
Having your opinions count
A mission that makes you feel important
Commitment to quality

Original Image

  • Repetition, Repetition, Repetition
  • Training people is like training cats - you gotta be on that

Assholes

How to know an Asshole

After encountering them, people feel oppressed, humiliated, or otherwise worse about themselves
They target people less powerful than them
Chronic assholes are the problem.

Sections on Assholes taken from The No Asshole Rule.
  • Not talking about a bad day - these people are out to undo all the good engaged people do

Assholes are inefficient

Positive interactions must outnumber negative ones 5:1

Bad interactions have stronger, more pervasive, and longer lasting effects

Findings from How, when, and why bad apples spoil the barrel: Negative group members and dysfunctional groups.
  • Pick someone out, insult them gently, then compliment them
  • Point out this is what they will remember from this talk, forever

What you can do

  • Don't be an Asshole, and fire or shun those who are
  • Set clear expectations for others
  • Praise people
  • Make friends with, and care about your co-workers
  • Listen to each other
  • Take pride in your work

Process

The way we work is critical to our outcomes

Original Image

Kaizen

改善

Change for the better

Continuous Improvement

A few lean/improvement resources: Lean Thinking, The Goal - there are so many more.

Kaizen

Small improvements

Evaluate a process, make it better.

Try using the scientific method:

Ask a question
Do research
Construct a hypothesis
Test your hypothesis
Analyze data and draw a conclusion
Communicate your results

Kaizen

Anyone can do it

Kaikaku

Radical Change

Recognize when desired results are beyond incremental improvement.

Start fresh, incorporate a new process, then do Kaizen

  • Continuous Delivery is a good example
  • If you are a big, waterfall org with manual testing
  • Incrementally moving to CD is going to fail
  • You need to blow up the way you work, learn how that feels, and kaizen your way to happiness
  • A house built on sand and all that

Original Image

Technology

Systems Design

Understand the requirements

Do not mistake existing implementations for hard requirements
  • A big retailer's web division wanted to automate; I wanted to sell software
  • Asked how they felt about CD, they said they weren't CD people
  • I was like: Me neither! ;)
  • They told me their design, said "then we come together and make it work"
  • We rebuilt it in that room, much better - those weren't real requirements

Scalable Systems Design

Identify autonomous actors, and have them keep their promises

Rolling Upgrade

  • Traditional web servers behind a load balancer
  • Upgrade servers one at a time

Naive way

Take App1 from Load Balancer Pool
Update Software on App1
Verify update worked
Put App1 back into Load Balancer Pool

What happens if a server is down?
What happens to traffic in transit?
What if we die in the middle?
  • This is what you would do if you wrote the steps down!
  • And it's what's going to happen in any case
  • But linearly implementing these as a script - whoa doggies
  • 600 configuration changes to the load balancer!
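
Written out as the linear script the notes describe, the naive version looks roughly like this (the load balancer and upgrade calls are stubbed placeholders, not a real API):

    # Stubbed placeholders so the shape of the process is visible; in real
    # life every one of these is a load balancer or package-manager call.
    def remove_from_pool(server): print(f"LB: remove {server}")
    def add_to_pool(server):      print(f"LB: add {server}")
    def update_software(server):  print(f"{server}: upgrade package")
    def verify(server):           return True  # pretend the check passed

    def naive_rolling_upgrade(servers):
        for server in servers:
            remove_from_pool(server)   # a load balancer change per server
            update_software(server)
            if not verify(server):     # what if this fails halfway through?
                raise RuntimeError(f"{server} failed verification")
            add_to_pool(server)        # and another change to put it back
        # Still unanswered: servers that were already down, traffic in
        # flight, and what happens if this script dies in the middle.

    naive_rolling_upgrade([f"app{n}" for n in range(1, 4)])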

Autonomous Actors

Each component responsible for itself

Promises

Each Autonomous Actor promises to behave a certain way.

Other Actors can verify those promises.

Identify Autonomous Actors

Load Balancers

Promises to route traffic to working app servers

Application Servers

Promises to serve application traffic and publish status

Better way

Update software on App1
  • Add a service that is smart about the app's status to each server
  • Monitor that service with the load balancer
  • Upgrade process manages that services response
  • Load balancer just blindly routes traffic
  • All the questions from the naive implementation can be answered by improvements to the status endpoint - sketched below
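
A hedged sketch of the autonomous-actor version: each app server publishes its own /status endpoint, the load balancer only polls it, and the upgrade process flips the status instead of reconfiguring the load balancer (the port and status values are made up; a real health check would exercise the application itself):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    STATE = {"status": "ok"}   # the promise this app server publishes

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/status":
                self.send_error(404)
                return
            body = STATE["status"].encode()
            self.send_response(200 if STATE["status"] == "ok" else 503)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    # The upgrade process on this server sets STATE["status"] = "maintenance",
    # waits for the load balancer's health checks to drain traffic, upgrades,
    # verifies, then sets it back to "ok". The load balancer is never touched.
    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), StatusHandler).serve_forever()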

The better solution has fewer interactions.

But it has more pieces.

  • We reduced the degree of difficulty in the process
  • Increased the number of moving parts
  • Safety: Resilient against many more failure modes
  • Knowledge: Far easier to reason about during Orient in the OODA loop
  • Freedom: Pattern adapts to different values of "available" based on service needs
  • Contentment: Safer, easier to reason about, and more flexible - that makes everyone content

Efficiency Roundup

  • Greatest gains are in improving People
  • Continually improve process, be willing to redesign in the face of new challenges
  • Use Scalable Systems Design to improve your technology and automation

How to be good at Operations

Design to improve the safety, contentment, knowledge and freedom of your colleagues and users.

Focus on improving availability through reducing MTTD and MTTR.

Improve the organization's efficiency through improvements in People, Process, and Technology.
