06 December 2008

Regenerative Braking for Services

George Reese has a missive over on O’Reilly’s site about why auto-scaling your “cloud” application is a bad idea. He starts from the naïve case where scaling your computing without bounds leads to your expenses scaling without bounds as well. Okay, that makes sense. Then he goes on to explain that setting those bounds to do the right thing is too hard, and involves humans doing capacity planning, so you should just do better capacity planning with humans and leave the automation out.

Now I’m a big fan of robots and of having machines do tedious work for me, so I have little truck with this argument. Frankly, as I read them, the words “too hard” translated to “you could have a strategic advantage over your competitors if you do this well.” Unsurprisingly, I’m not the only one who feels this way; several people chimed in with refutations and examples of how they’re already doing this today, to great advantage.

A response by Sam Curren, Really Bad Reasons Not To Auto-scale, refuted most of the “it’s too hard to get it right” arguments. Adam Jacob had a good comment as well: if you’re monitoring the wrong things, it is indeed easy to get wrong. For examples of what measuring the right things can look like, see Don MacAskill’s post about SmugMug on EC2. Breaking the system apart into pieces that are easier to measure is implicit in Don’s discussion, and probably warrants a post of its own another time.
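
To make “measuring the right things” a little more concrete, here’s a minimal sketch in Python. The metric names and numbers are mine, not anything from Don’s post: it sizes a worker tier off its actual backlog rather than a proxy like raw CPU load.

    import math

    # Hypothetical example: size a worker tier off how far behind it is,
    # not off a proxy metric like CPU load.

    def workers_needed(queue_depth, drain_rate_per_worker, target_drain_seconds=300):
        # Number of workers required to clear the current backlog within
        # target_drain_seconds. All inputs come from your own monitoring;
        # the names are illustrative, not a real API.
        required_rate = queue_depth / float(target_drain_seconds)
        return max(1, int(math.ceil(required_rate / drain_rate_per_worker)))

    # Example: 12,000 queued jobs, each worker drains 10 jobs/sec, and we
    # want the backlog gone within five minutes -> 4 workers.
    print(workers_needed(12000, 10))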

One thing that hasn’t been mentioned yet in this conversation is that if you don’t degrade gracefully under pressure, in any of these models, you’ve already lost. If your service is starting to degrade (or you know it’s about to), the only hard part is knowing whether to grin and degrade gracefully under the temporary pressure, or to bring in more capacity. Thing is, humans are quite capable of making the wrong call here, and even when they make the right call, they’ll make it much more slowly, and they won’t make it in the middle of the night when your service suddenly gets an unanticipated spike in popularity in Japan.
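
As a rough illustration of that call, and only an illustration (the thresholds are invented, and get_p95_latency_ms and launch_instances are hypothetical hooks into your own monitoring and provisioning, not any real API), one simple policy is to tolerate a short spike but add capacity once degradation persists:

    import time

    # Hypothetical policy: tolerate a short spike, but if the service stays
    # degraded for several consecutive checks, add capacity automatically.
    # get_p95_latency_ms() and launch_instances() stand in for whatever your
    # monitoring and provisioning layers actually expose.

    DEGRADED_MS = 500          # p95 latency we consider "degraded"
    BAD_CHECKS_TO_SCALE = 5    # consecutive bad checks we tolerate first
    CHECK_INTERVAL_S = 60

    def watch(get_p95_latency_ms, launch_instances):
        bad_checks = 0
        while True:
            if get_p95_latency_ms() > DEGRADED_MS:
                bad_checks += 1
            else:
                bad_checks = 0
            if bad_checks >= BAD_CHECKS_TO_SCALE:
                launch_instances(2)   # bring in more capacity
                bad_checks = 0        # give the new machines time to help
            time.sleep(CHECK_INTERVAL_S)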

Back to that mental translation: if you can develop good algorithms (or even very simple ones) to better predict when to scale up and down, you save a lot of money that is traditionally blown on idle resources in slack times. Those idle resources can be turned off, pressed into use for non-time-critical batch work, or even sublet to someone else to do processing on. And in fact this last option is quite probably the business that EC2 and App Engine represent: “Here are some spare resources; let’s sell usage on them rather than make $0 on machines that continue to cost money to run.” (That other large cluster players aren’t in this market yet suggests they either don’t have enough capacity as it is, or they aren’t in a position where that idle cost matters to them yet, or they just don’t get it. That’s another interesting conversation in and of itself.)
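
The scaling-down half works the same way. Here’s a hedged sketch, again with made-up numbers and hypothetical hooks rather than anything EC2-specific, of deciding how many idle machines could be shut off or handed over to the batch queue:

    import math

    # Hypothetical scale-down policy: keep a small reserve for headroom, but
    # release (or hand to the batch queue) anything beyond what current
    # traffic needs. The utilization number would come from your own
    # monitoring; nothing here is a real EC2 call.

    MIN_INSTANCES = 2        # never go below this floor
    TARGET_UTILIZATION = 0.5 # fleet-wide average we aim to run at

    def instances_to_release(instance_count, avg_utilization):
        # How many instances we could drop while staying near the target.
        needed = int(math.ceil(instance_count * avg_utilization / TARGET_UTILIZATION))
        needed = max(MIN_INSTANCES, needed)
        return max(0, instance_count - needed)

    # Example: 10 instances averaging 25% utilization only need 5 to run at
    # 50%, so 5 can be turned off or pressed into batch work.
    print(instances_to_release(10, 0.25))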

In any case, being more efficient about resource usage represents a competitive advantage that can make a big difference. It’s like regenerative braking on hybrid cars: many people simply accept the cost of wasting that energy as heat, perhaps not even knowing that there is a better way. With some initial investment and know-how, though, you can capture some of it and realize greater efficiency, and cost savings to boot.