High CPU Steal On AWS Burstable Instances

April 12, 2017

Seeing High CPU Steal on AWS Burstable Instance Types?

At Tiller, we were, too. We have some backend systems that process data offline, using a job processing system that we’ve set up in one layer of an AWS OpsWorks stack: Node-based Agenda job processors running on t2.small instances.

We’d been having some subtle problems for a while, and they finally reached a point where we could no longer ignore them, so we dug deeper. The symptoms were that, after a while, an instance in that layer would become busy enough to start missing deadlines and generating significant numbers of errors. We noticed a high amount of CPU Steal on those instances at those times, and initially thought we might be suffering from the ‘noisy neighbor’ problem.

Turns out it wasn’t a noisy neighbor: it was us.

Spoiler alert: our choice of instance type and job processor algorithm weren’t really a good match. This outstanding blog post by Leonid Mamchenkov was a great help in figuring this out, as was my co-worker Brasten, who said something to me weeks ago along the lines of “Tim, I don’t think this is a noisy neighbor. I think this is us.”

More specifically, Agenda’s multi-node emergent scheduling algorithm isn’t the best fit for AWS Burstable Instance Types. In fact, it’s pretty much the worst fit. What kept me from seeing this from a whole-system perspective was three very valid, independent benefits:

Simplicity

When Agenda figures out it’s got something to do, it essentially grabs as much work as it can right away (up to some tunable limits). This approach is common, and usually yields an implementation that is easy to read and understand. Simplicity is always good, right?

Computational Efficiency

In addition to being simple, algorithms of this nature are typically very computationally efficient. In these days of inexpensive cloud-based computing horsepower, that’s probably not a reason to CHOOSE a particular implementation, but it’s always a nice-to-have, and it made it easy to further convince myself that the approach was valid, despite its turning out to be a poor fit for our system as a whole...

Economic Efficiency

AWS Burstable Instance Types are economically efficient instances that guarantee a minimum amount of available CPU, and use a ‘credit’ system to provide burstable performance for short periods of time where more is needed.
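To make the credit system concrete, here is a rough sketch of the arithmetic. It assumes the figures AWS has published for t2.small (12 credits earned per hour, at most 288 banked credits, one vCPU, and one credit being one vCPU running at 100% for one minute). The function names are made up for illustration, and the numbers should be checked against the current AWS documentation before relying on them.

```python
# Assumed t2.small figures from AWS's published T2 tables; verify before use.
CREDITS_PER_HOUR = 12   # credits earned per hour
MAX_BALANCE = 288       # cap on banked credits (24 hours of earning)


def baseline_utilization(credits_per_hour, vcpus=1):
    """Fraction of CPU that can be sustained forever.

    One credit is one vCPU-minute at 100%, so earning 12 credits per
    60 minutes sustains 12/60 of a single vCPU.
    """
    return credits_per_hour / 60 / vcpus


def minutes_until_empty(balance, utilization, credits_per_hour):
    """Minutes a credit balance lasts at a constant utilization (0.0 to 1.0)
    on a single vCPU, while credits keep being earned.

    Returns None when the load is at or below baseline (never depleted).
    """
    net_spend_per_minute = utilization - credits_per_hour / 60
    if net_spend_per_minute <= 0:
        return None
    return balance / net_spend_per_minute


print(baseline_utilization(CREDITS_PER_HOUR))                     # the 20% baseline
print(minutes_until_empty(MAX_BALANCE, 1.0, CREDITS_PER_HOUR))    # roughly 360 minutes
print(minutes_until_empty(MAX_BALANCE, 0.1, CREDITS_PER_HOUR))    # None: below baseline
```

Under these assumptions, a full credit bank buys about six hours of flat-out CPU, after which the instance is held to its baseline.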

All of those, independently, are GREAT ideas! But if you put the component pieces together in a horizontally scaling environment: You’re Gonna Have a Bad Time. Or, more professionally stated, the simplicity gains, computational efficiency gains, and economic efficiency gains were more than offset by all the time we spent beating our heads against the wall.

Let me tell you what I did wrong, so at least you can learn from it.

We ran the agendaJobs job processor on a number of burstable instances in a layer of one of our systems. Each agendaJobs instance then grabs as much of the work as we’ve tuned it to handle, as soon as it can. Effectively, ‘agendaJobs1’ will grab a bunch of work before ‘agendaJobs2’ will even pick up anything at all.

‘agendaJobs1’ will very efficiently burn through as much WORK as it can - that’s GREAT, right? Well, computationally, yes - except WORK also happens to equate to those CPU credits I mentioned, and CPU credits aren’t shared across instances. So a more relevant way of describing what’s going on is that ‘agendaJobs1’ will burn through all of its accrued CPU Credits -- and keep right on chugging -- while ‘agendaJobs2’ sits there like a bump on a log, with a full bank of accrued CPU credits.
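A toy simulation shows the imbalance. This is only a sketch under simplifying assumptions: hourly time steps, the t2.small credit figures AWS publishes (12 credits per hour, 288 maximum), a steady workload of 30 vCPU-minutes arriving per hour, and a "greedy" policy that hands all work to the first instance. It does not model Agenda's actual locking behavior, and every name in it is hypothetical.

```python
EARN = 12       # credits earned per hour (assumed t2.small figure)
MAX_BAL = 288   # maximum banked credits (assumed t2.small figure)
FULL = 60       # vCPU-minutes one vCPU can deliver per hour at 100%


def simulate(hours, demand, assign, start=MAX_BAL, n=2):
    """Run `hours` hourly steps with `demand` vCPU-minutes arriving each hour.

    `assign(demand, n)` returns each instance's share of the new work.
    Returns (credit balances, unfinished backlog) per instance.
    """
    balances = [start] * n
    backlog = [0] * n
    for _ in range(hours):
        shares = assign(demand, n)
        for i in range(n):
            wanted = backlog[i] + shares[i]
            # An instance can spend what it has banked plus what it earns
            # this hour, but never more than 100% of its vCPU.
            can_do = min(FULL, balances[i] + EARN)
            done = min(wanted, can_do)
            backlog[i] = wanted - done
            balances[i] = min(MAX_BAL, balances[i] + EARN - done)
    return balances, backlog


def greedy(demand, n):
    """The first instance grabs everything; the rest get nothing."""
    return [demand] + [0] * (n - 1)


def even(demand, n):
    """Work is split evenly (demand assumed divisible by n)."""
    return [demand // n] * n


print(simulate(24, 30, greedy))  # ([0, 288], [144, 0])
print(simulate(24, 30, even))    # ([216, 216], [0, 0])
```

In the greedy run, the first instance is out of credits after 16 hours and falls steadily behind, while the second still has a full bank. With an even split, both are still comfortably in credit after a day.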

One very tempting “solution” is to cap how much work Agenda takes on, tuned so that a single instance never exhausts its CPU credits. In our case, t2.small instances are guaranteed only 20% of the CPU, so our initial patch was to configure Agenda to take only enough work to use up to about 20% of the CPU. This actually worked well for us, since the system in question isn’t really loaded all that much. We were over 20% on one instance regularly, but the second instance almost never had anything to do at all. In this system, we’re currently using the multiple servers primarily for redundancy rather than horizontal scaling; the scaling becomes important later, when we need to, well -- scale it.
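One way to pick such a cap is simple division. The sketch below assumes you have measured roughly how much CPU a typical job uses while it runs. That per-job figure is hypothetical, as is the function name. The result would then feed whichever concurrency setting your job processor offers. Agenda has options such as maxConcurrency and lockLimit, though this post doesn't record which setting the patch actually changed.

```python
def max_concurrent_jobs(baseline_pct, job_cpu_pct):
    """Largest number of simultaneous jobs whose combined average CPU
    stays at or below the instance's baseline.

    Both arguments are whole-number percentages of one vCPU. At least one
    job is always allowed, even if a single job exceeds the baseline,
    because otherwise nothing would ever run.
    """
    if job_cpu_pct <= 0:
        raise ValueError("job_cpu_pct must be positive")
    return max(1, baseline_pct // job_cpu_pct)


# t2.small baseline of 20%; the per-job figures are made-up measurements.
print(max_concurrent_jobs(20, 5))   # 4
print(max_concurrent_jobs(20, 8))   # 2
print(max_concurrent_jobs(20, 30))  # 1
```

The obvious weakness is that this tunes for the average job, so a run of heavier-than-usual jobs will still dip into the credit bank.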

A better, more general, more scalable solution is to switch to one of the ‘m’ instance types, or to a job scheduling system that uses round-robin scheduling. Switching to a round-robin scheduling system is a lot of coding work, however, and theoretically also introduces a scary ‘cliff’ where everything system-wide bogs down as soon as ALL instances exhaust their CPU credits, which would happen all at once.
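The ‘cliff’ falls out of the same credit arithmetic. As a sketch, assume the published t2.small figures (12 credits earned per hour, 288 banked at most), a constant load per instance, and a full bank to start with. The function names are illustrative only. The question is when each instance first hits zero credits.

```python
EARN = 12       # credits earned per hour (assumed t2.small figure)
MAX_BAL = 288   # starting (full) credit balance (assumed t2.small figure)


def hour_of_exhaustion(load_per_hour, start=MAX_BAL):
    """Hour at which an instance under a constant load (in vCPU-minutes
    per hour) first reaches zero credits, or None if it never does."""
    net_spend = load_per_hour - EARN
    if net_spend <= 0:
        return None
    return -(-start // net_spend)  # ceiling division on integers


def exhaustion_profile(loads):
    """Exhaustion hour for each instance, given its hourly load."""
    return [hour_of_exhaustion(load) for load in loads]


# The same 30 vCPU-minutes per hour of total work, distributed two ways.
print(exhaustion_profile([30, 0]))    # [16, None]: one instance fails early
print(exhaustion_profile([15, 15]))   # [96, 96]: both fail in the same hour
```

Round-robin buys a lot more time, but when total load exceeds the combined baseline, every instance runs dry together and there is no idle neighbor left to pick up the slack.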

The solution we’re most likely to embrace going forward is to offload the scheduling and tuning problems to someone else -- probably Amazon. We could, for instance, run the work on Lambda instead of our own job processing system. Also, and I’ve said this to myself and others 100 times: listen to Brasten.

Tim Johns' Blog