Auto-scaling, Queues and CloudFormations to Slash Costs at Neat

Maurício Linhares / @mauriciojr / neat.com

Who?

Technical Lead at Neat.com
Brazilian from João Pessoa, not from Rio de Janeiro or São Paulo
Moved here 8 weeks ago
I can't speak Spanish, sorry

Where were we 2.5 years ago?

Why were we burning money?

No auto-scaling means...

Unexpected traffic spike?

Just go there and provision a bunch of new machines. And please remember to take them down once the spike is over!

We were running on an Elastic Compute Cloud

But our systems were not making use of this elasticity...

What did we need?

Quickly provisionable machines
Auto-scaling groups
Queues and queue metrics
All organized together with CloudFormations

First step?

Revamp the provisioning process.

Original process

Use Knife and Chef to build an instance for a service out of a bare bones AMI
All steps, from installing software to setting up config, happen at this point
Slow: many minutes from nothing to a running instance
Not reproducible - machines provisioned at different points in time will have different versions of their libraries

Doesn't work for auto-scaling

When you're auto-scaling to meet real-time customer demand, you can't waste any time.

Pre-made AMIs/images enter the scene

k mint spi -b SPI-Bundle-RELEASE-1.5.370-NSDK-4.0.0.242

The Golden AMI

A specific version of the software required gets installed
No environment-specific configuration exists yet
Once booted in an actual environment, the machine uses user-data to figure out where to pull config from and starts its work
Fast and reproducible, all instances are the same

User data as JSON

{
  "environment" : "production",
  "role" : "thumbnailer"
}

Chef kicks off and figures out what to do

A separate script reads the user data and calls Chef using the given role and environment. The instance gets configured and its services are ready for action.
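The hand-off from user data to Chef might look roughly like the sketch below. It is an illustration under assumptions, not Neat's actual script: the function name `build_chef_command` is hypothetical, and it assumes `chef-client` is invoked with its `-E` (environment) and `-r` (run-list) options.

```python
import json

def build_chef_command(user_data):
    """Turn the instance's JSON user data into a chef-client command line."""
    data = json.loads(user_data)
    # Fail early if the user data is missing a key we depend on.
    for key in ("environment", "role"):
        if key not in data:
            raise ValueError(f"user data is missing '{key}'")
    return [
        "chef-client",
        "-E", data["environment"],      # Chef environment to converge against
        "-r", f"role[{data['role']}]",  # run-list made of the single role
    ]

if __name__ == "__main__":
    # On EC2 the raw string would typically be fetched from
    # http://169.254.169.254/latest/user-data; a literal is used here instead.
    sample = '{"environment": "production", "role": "thumbnailer"}'
    print(" ".join(build_chef_command(sample)))
```

The command is only built, not executed, so the sketch can be tried on any machine.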

Auto-scaling groups arrive

Collection of machines instantiated out of a specific configuration
Register machines with an ELB if you need it
Simple (and mostly useless) health check process
Really, that's all

Alarms, metrics and scaling policies

This is where it gets interesting.

Pick a metric

It has to be in CloudWatch but you can push anything there. Using a metric that is already provided by AWS is always simpler.
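For a metric AWS does not already provide, an application could publish its own. The sketch below only builds the arguments for CloudWatch's `PutMetricData` call; the namespace `Neat/Workers` and the metric name `JobsWaiting` are made up for illustration, and the actual boto3 call is left in a comment so the block runs without AWS credentials.

```python
from datetime import datetime, timezone

def queue_backlog_metric(queue_name, backlog):
    """Build keyword arguments for CloudWatch PutMetricData."""
    return {
        "Namespace": "Neat/Workers",  # hypothetical custom namespace
        "MetricData": [
            {
                "MetricName": "JobsWaiting",  # hypothetical metric name
                "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(backlog),
                "Unit": "Count",
            }
        ],
    }

if __name__ == "__main__":
    kwargs = queue_backlog_metric("MyQueue", 42)
    # With boto3 installed and credentials configured, this could be sent with:
    #   boto3.client("cloudwatch").put_metric_data(**kwargs)
    datum = kwargs["MetricData"][0]
    print(datum["MetricName"], datum["Value"])
```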

Set up alarms and scaling policies

Alarms trigger actions when their threshold is met.

"ScaleUpWorkerAlarm": {
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    "AlarmDescription": "Scale-Up if queue depth exceeds our limit",
    "Namespace": "AWS/SQS",
    "MetricName": "ApproximateNumberOfMessagesVisible",
    "Dimensions": [
      {
        "Name": "QueueName",
        "Value": "MyQueue"
      }
    ],
    "Statistic": "Average",
    "Period": "60",
    "EvaluationPeriods": "3",
    "Threshold": 100,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": [
      {
        "Ref": "WorkerScaleUpPolicy"
      }
    ]
  }
}

"WorkerScaleUpPolicy": {
  "Type": "AWS::AutoScaling::ScalingPolicy",
  "Properties": {
    "AdjustmentType": "ChangeInCapacity",
    "AutoScalingGroupName": {
      "Ref": "WorkerAutoScalingGroup"
    },
    "Cooldown": 300,
    "ScalingAdjustment": 1
  }
}

Working together

Metrics, alarms and policies work together to make your auto-scaling group grow or shrink as needed. You can have as many alarms, metrics and policies as you want; just make sure they actually represent how you want your app to grow.
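To get a feel for how the pieces interact, here is a deliberately simplified simulation using the numbers from the alarm and policy shown earlier (threshold 100, three 60-second evaluation periods, a 300-second cooldown, one instance per adjustment). It is a toy model for reasoning about the settings, not a faithful reproduction of how CloudWatch and Auto Scaling evaluate alarms.

```python
def simulate(depths, threshold=100, evaluation_periods=3,
             period=60, cooldown=300, adjustment=1, capacity=1):
    """Return the group capacity after each period of queue-depth samples."""
    history = []
    last_scaled_at = None
    for i in range(len(depths)):
        now = i * period
        # The alarm looks at the most recent `evaluation_periods` samples.
        window = depths[max(0, i - evaluation_periods + 1): i + 1]
        in_alarm = (len(window) == evaluation_periods
                    and all(d > threshold for d in window))
        # While cooling down, the policy ignores the alarm.
        cooling = last_scaled_at is not None and now - last_scaled_at < cooldown
        if in_alarm and not cooling:
            capacity += adjustment
            last_scaled_at = now
        history.append(capacity)
    return history

if __name__ == "__main__":
    depths = [50, 150, 150, 150, 150, 150, 150, 150, 150, 50]
    print(simulate(depths))
```

With a sustained backlog the group grows by one instance, then waits out the cooldown before it can grow again, which is why a small `ScalingAdjustment` combined with a long `Cooldown` reacts slowly to big spikes.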

Ok, lots of different parts, how do we tie them together?

What are CloudFormations?

Templated (JSON) AWS resources
Supports declaring most of the existing services and config options
Removes the need to perform manual steps to set up the services your app needs
Creates whole, isolated, environments
Configurable with external parameters and mappings inside the template
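As a rough illustration of parameters and mappings, the sketch below builds a very small template as a Python dictionary and prints it as JSON. The resource name `WorkerLaunchConfig`, the mapping `InstanceTypes` and the instance sizes are assumptions chosen for the example, not values from Neat's templates.

```python
import json

def worker_template():
    """Build a tiny template that uses a parameter and a mapping."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            # Supplied from outside when the stack is created or updated.
            "Environment": {
                "Type": "String",
                "AllowedValues": ["qa", "production"],
            },
            "ImageId": {"Type": "String"},  # the golden AMI to boot
        },
        "Mappings": {
            # Hypothetical sizing: smaller machines outside production.
            "InstanceTypes": {
                "qa": {"Worker": "t2.small"},
                "production": {"Worker": "m3.large"},
            }
        },
        "Resources": {
            "WorkerLaunchConfig": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                "Properties": {
                    "ImageId": {"Ref": "ImageId"},
                    # Look up the instance type for the chosen environment.
                    "InstanceType": {
                        "Fn::FindInMap": [
                            "InstanceTypes", {"Ref": "Environment"}, "Worker"
                        ]
                    },
                },
            }
        },
    }

if __name__ == "__main__":
    print(json.dumps(worker_template(), indent=2))
```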

Unified resource creation

Resource creation was all over the place, now only CloudFormations do it.

Answers the "What does this app need?" question

Now you just open the CloudFormation template associated with it and it should be there.

No more all-access keys for apps

Templates must include their own security policies and allow access only to resources they themselves create, using IAM (Identity and Access Management) profiles.
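A template fragment along these lines could express that idea. This is a sketch under assumptions: the logical names `WorkerQueue`, `WorkerRole` and `WorkerInstanceProfile`, the policy name and the chosen SQS actions are all hypothetical. The point is that the policy's `Resource` is the ARN of a queue declared in the same template rather than `*`.

```python
import json

def worker_iam_resources(queue_logical_id="WorkerQueue"):
    """IAM role and instance profile limited to one queue from the same template."""
    queue_arn = {"Fn::GetAtt": [queue_logical_id, "Arn"]}
    return {
        "WorkerRole": {
            "Type": "AWS::IAM::Role",
            "Properties": {
                # Lets EC2 instances assume the role.
                "AssumeRolePolicyDocument": {
                    "Version": "2012-10-17",
                    "Statement": [{
                        "Effect": "Allow",
                        "Principal": {"Service": ["ec2.amazonaws.com"]},
                        "Action": ["sts:AssumeRole"],
                    }],
                },
                "Policies": [{
                    "PolicyName": "worker-queue-access",  # hypothetical name
                    "PolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [{
                            "Effect": "Allow",
                            "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage"],
                            "Resource": queue_arn,  # only this stack's queue
                        }],
                    },
                }],
            },
        },
        "WorkerInstanceProfile": {
            "Type": "AWS::IAM::InstanceProfile",
            "Properties": {"Path": "/", "Roles": [{"Ref": "WorkerRole"}]},
        },
    }

if __name__ == "__main__":
    print(json.dumps(worker_iam_resources(), indent=2))
```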

And what happened to our story?

The service went from being manually provisioned and scaled to a full-fledged auto-scaling solution. It now runs at half of the original cost and serves as an example for all new services being created.

Is this the end?

We're still learning how CloudFormations work

and being bitten every once in a while.

What did we learn so far at Neat?

Do not name stuff

If AWS can generate a name for it, do not name it. Use CloudFormation outputs to get their names.
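For instance, a queue can be declared without a `QueueName` and its generated name exposed through outputs. The sketch below assumes a hypothetical logical name `WorkerQueue`; other resources in the same template (such as an alarm dimension) could use the same `Fn::GetAtt` expression instead of a hard-coded name.

```python
import json

def queue_with_outputs():
    """Declare an unnamed SQS queue and export its generated name and URL."""
    return {
        "Resources": {
            # No "QueueName" property: CloudFormation generates a unique name.
            "WorkerQueue": {"Type": "AWS::SQS::Queue"},
        },
        "Outputs": {
            "WorkerQueueName": {
                "Value": {"Fn::GetAtt": ["WorkerQueue", "QueueName"]}
            },
            # Ref on an SQS queue resolves to the queue URL.
            "WorkerQueueUrl": {"Value": {"Ref": "WorkerQueue"}},
        },
    }

if __name__ == "__main__":
    print(json.dumps(queue_with_outputs(), indent=2))
```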

Avoid nesting or cross-CF dependencies

If you really need to do it, make sure the dependency tree is shallow or you will have trouble.

Separate stuff that changes frequently from stuff that does not

Don't place your RDS database in the same template as your webapp auto-scaling group.

Do not upload templates directly, build tools to do it

And make sure these tools understand how to name stacks and validate parameters.

k cfn id2 server update -e qa -c neat
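What such a tool does internally is not shown in the talk, so the following is only a guess at the two responsibilities mentioned: deriving a stack name and checking parameters before anything reaches AWS. The naming convention and both function names are hypothetical, as is the way the command's `-e` and `-c` flags map onto the arguments.

```python
def stack_name(app, environment, customer):
    """Derive a predictable stack name so nobody types one by hand."""
    # Hypothetical convention: <customer>-<environment>-<app>
    return "-".join([customer, environment, app])

def validate_parameters(template, params):
    """Return a list of problems found before anything is sent to AWS."""
    declared = template.get("Parameters", {})
    problems = []
    for name, spec in declared.items():
        if name not in params:
            # Parameters without a default must be supplied.
            if "Default" not in spec:
                problems.append(f"missing parameter: {name}")
            continue
        allowed = spec.get("AllowedValues")
        if allowed is not None and params[name] not in allowed:
            problems.append(f"invalid value for {name}: {params[name]!r}")
    for name in params:
        if name not in declared:
            problems.append(f"unknown parameter: {name}")
    return problems

if __name__ == "__main__":
    template = {"Parameters": {
        "Environment": {"Type": "String", "AllowedValues": ["qa", "production"]},
        "ImageId": {"Type": "String"},
    }}
    print(stack_name("id2-server", "qa", "neat"))
    print(validate_parameters(template, {"Environment": "staging"}))
```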

Create two auto-scaling groups to simplify zero-downtime deployments

Whenever you want to deploy something, scale up the group that is not currently scaled and then scale down the one that was.
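The swap can be written down as a short plan. This sketch assumes exactly one group is live and the other sits at zero instances; the group names and the `wait_until_healthy` step are additions for illustration, since the slide only mentions scaling one group up and the other down.

```python
def swap_plan(capacities):
    """Given {'group name': desired capacity} for two groups, return deploy steps."""
    live = [name for name, size in capacities.items() if size > 0]
    idle = [name for name, size in capacities.items() if size == 0]
    if len(live) != 1 or len(idle) != 1:
        raise ValueError("expected exactly one live group and one idle group")
    size = capacities[live[0]]
    return [
        ("scale_up", idle[0], size),            # boot the new version alongside the old
        ("wait_until_healthy", idle[0], size),  # assumed safety step
        ("scale_down", live[0], 0),             # retire the old version
    ]

if __name__ == "__main__":
    for step in swap_plan({"WorkersA": 4, "WorkersB": 0}):
        print(step)
```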

Use IAM profiles for everything

Yes, I'm repeating this.

Create and hook up MANY alarms to your monitoring service

We're all human, so send notifications for more than one threshold to make sure they won't be snoozed into oblivion.

Make sure your logs are going somewhere

Because all machines die.

What about problems?

There's no diff

Want to figure out what will change between the current template and the one deployed? Run it. If Justin Campbell were here, he would say Terraform has diffs.
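A partial workaround is to compare the two templates locally before running anything. The helper below is a minimal sketch with a hypothetical name, `diff_resources`; it only compares the `Resources` sections of two template dictionaries and says nothing about what CloudFormation will actually do with a changed resource (update it in place or replace it).

```python
def diff_resources(deployed, candidate):
    """List logical resource IDs that were added, removed or changed."""
    old = deployed.get("Resources", {})
    new = candidate.get("Resources", {})
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }

if __name__ == "__main__":
    deployed = {"Resources": {
        "WorkerQueue": {"Type": "AWS::SQS::Queue"},
        "WorkerScaleUpPolicy": {"Properties": {"Cooldown": 300}},
    }}
    candidate = {"Resources": {
        "WorkerScaleUpPolicy": {"Properties": {"Cooldown": 120}},
        "WorkerTopic": {"Type": "AWS::SNS::Topic"},
    }}
    print(diff_resources(deployed, candidate))
```

Anything listed under `removed` deserves a second look before the update is run.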

If a resource is deleted out of the CF...

You'll be in for a lot of trouble.

JSON is verbose and doesn't take comments or documentation

But there are tools that let you declare templates in other languages like Python or Ruby.

Not all features are there yet

S3 notifications still don't have all the options available at the console/API.

Vendor lock-in

You're investing and you're stuck.

It's a black box

Problems? Open a ticket and wait.

Custom resources are painful to write and test

Check what we did at https://github.com/TheNeatCompany/cfn-bridge

What's next for us?

Preemptive scaling

We already have the numbers and the usage patterns are quite consistent. Scaling up based on time means customers wait even less when they actually start to use the app.
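One way to express time-based scaling in a template is an `AWS::AutoScaling::ScheduledAction` resource. The sketch below assumes the `WorkerAutoScalingGroup` group from the earlier policy example; the capacity and the schedule are invented for illustration.

```python
import json

def scheduled_scale_up(group_logical_id, desired, cron_utc):
    """Build a scheduled action that sets the group's desired capacity."""
    return {
        "Type": "AWS::AutoScaling::ScheduledAction",
        "Properties": {
            "AutoScalingGroupName": {"Ref": group_logical_id},
            "DesiredCapacity": desired,
            "Recurrence": cron_utc,  # cron expression, evaluated in UTC
        },
    }

if __name__ == "__main__":
    # Hypothetical schedule: six workers at 12:00 UTC, Monday to Friday.
    action = scheduled_scale_up("WorkerAutoScalingGroup", 6, "0 12 * * 1-5")
    print(json.dumps({"WorkerMorningScaleUp": action}, indent=2))
```

A matching action with a lower capacity later in the day would bring the group back down.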

Move the monolith

While all new apps have moved to CF-based setups, our monolith is still a work in progress, but we will get there.

Better custom health-check for instances

Right now the health checks are rudimentary and not very effective at spotting instances that are misbehaving.

This was a team effort

Bruce Willke Jr.
Kevin Lee
Richard Henning
Sarah Gray
Shairon Toledo
Todd Davenport
Travis Truman

Questions?

Thanks!

phillydevops-cf-talk