Surge was, as expected, an awesome conference. I’m so glad I got to go in only its second year, as I will now forever be able to say “I went to one of the first Surge conferences!” It has quickly become a must-go devops/scalability conference. I’ve sat on this post for a little too long, but better late than never.
“Also, cloud” – Hindsight
The talk I delivered was about the challenges of building a hosted service on a cloud infrastructure. The abstract is here, the slides are here or here if you prefer a PDF. Apart from some poor clock management early on, the talk went fairly well – we’ll see what the audience thought of the content when the ratings come out.
As always, there were things I realized I missed after the talk was over. Here are some follow-up thoughts on topics I did not address to my satisfaction – that damn hindsight!
During the talk, I was a bit dismissive about the idea of a multi-provider cloud strategy. Jay Janssen rightfully called me out on it in the Q&A (small aside: I had seen Jay’s name all over various internal Yahoo! mailing lists during my tenure there – conferences are awesome for putting names with faces like that). Jay pointed out that my distaste for multi-provider strategies was in direct conflict with the theme of the talk – decoupling. If you’re trying to decouple from everything, shouldn’t you also decouple yourself from your provider?
Jay is correct – philosophically, I should be all for multi-provider. However, as with many efforts at decoupling, going multi-provider introduces a substantial amount of complexity; an unreasonable amount in my opinion. Let’s think about what this entails.
First and foremost, any redundancy strategy would have to be hot-hot. I do not believe in hot-cold or “spares” because I’ve seen them fail every single time. It’s just too easy to let the spare deteriorate and find it in a state of utter disrepair when you need it most.
So now you’ve got essentially two copies of your infrastructure. Suddenly you have to manage:
- Network connectivity between the datacenters
- Data replication between the datacenters (related to the above; admittedly, you’d have this problem going “multi-region” on EC2 alone)
- Different versions of base OS’s
- Different kernel builds
- Different underlying performance characteristics
- You have to come up with a formula for deploying equivalent capacity across providers
- You now have two capacity plans
- Multiple provisioning pipelines (yes, you can abstract most of this)
- Multiple sources of Heisenbugs (different virtualization hosts)
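To make the "equivalent capacity" item concrete, here is a minimal sketch of what such a formula might look like. Everything in it is an assumption for illustration: the function name `nodes_needed`, the per-node throughput figures, and the 25% headroom are hypothetical, and real numbers would have to come from benchmarking each provider yourself.

```python
import math


def nodes_needed(target_rps: float, per_node_rps: float, headroom: float = 0.25) -> int:
    """Nodes required on one provider to serve target_rps.

    per_node_rps is the measured throughput of a single node on that
    provider (hypothetical here); headroom is the fraction of each node
    deliberately left idle.
    """
    if per_node_rps <= 0 or not 0 <= headroom < 1:
        raise ValueError("per_node_rps must be > 0 and headroom in [0, 1)")
    usable_rps = per_node_rps * (1.0 - headroom)
    return math.ceil(target_rps / usable_rps)


# Hypothetical benchmark results: the same workload on two providers.
provider_a = nodes_needed(target_rps=10_000, per_node_rps=500)
provider_b = nodes_needed(target_rps=10_000, per_node_rps=800)
print(provider_a, provider_b)
```

Even in this toy form, the two providers need different node counts for the same load, which is exactly why you end up maintaining two capacity plans.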
There are likely other things that will come up. Where do your load balancers go? How do you compensate for different providers’ divergent feature sets? It could be that you have a hefty budget and availability is your highest concern. In this case, the additional complexity introduced by the additional provider could well be worth it to your organization.
However, at that point you might take a step back and re-evaluate – should you even be using the cloud? People use cloud infrastructures for all sorts of reasons, but one of the big ones is the lowered barrier to entry for building up a distributed infrastructure – no need to deal with datacenter leases, uplinks, hiring a site-ops team, and all sorts of related unpleasantness. Having to manage multiple cloud providers increases the complexity of your deployment to the point that I start to wonder if it’s worthwhile.
Additional Examples of Simplified Datacenters
As part of my emphasis on the importance of reducing complexity, I made references to a talk I had attended by Yahoo!’s Mike Christian at geekSessions 2.2, in which he touted things like datacenters built in colder climates with no HVAC system at all – one less component to potentially fail (I’m pretty sure similar strategies have been in deployment since at least 2008). Since the talk, a few folks have pointed me to additional impressive examples:
- AOL has been showing off its “human-free datacenters” – The Register article with additional links within here
- Google has also been doing clever things with datacenters. A good, albeit dated, example can be found here
(not to be confused with amortized computational complexity)
There’s an awesome term found in Sidney Dekker’s “Ten Questions about Human Error” – “amortizing complexity” (it is used as part of a discussion of G. Ross’s “Flight strip survey report” (1995), but I can’t find that paper). The term refers to techniques that allow the operator to interface with a complex system using only a handful of parameters. I really wish I had thought to use this term in my talk. Those who saw it might recall me pulling on an imaginary lever to symbolize an operator removing an entire datacenter from operation in the event of an unexpected failure. “Amortizing” describes this action perfectly – instead of having to figure out an unexpected, possibly baffling interaction in a complex system under production pressure, the operator is able to just disable the entire piece of infrastructure and investigate at leisure. Instead of having N possible courses of action in the event of a failure, the operator has one attractive option that enables the problem to be sorted out (amortized) over time.
Of course, amortizing complexity almost inevitably leads to some corner cases where some variable that is normally unimportant is ignored. However, this shouldn’t stop us from trying to improve the normal case and relying on monitoring techniques such as confidence bands to alert us when more obscure metrics are not what we should expect.
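For readers who have not run into confidence bands, here is a minimal sketch of the idea, assuming the simplest possible band: the mean of recent samples plus or minus a few standard deviations. The function names, the sample values, and the choice of three standard deviations are all mine and purely illustrative; production monitoring systems typically use something more robust, such as seasonal forecasting.

```python
from statistics import mean, stdev


def confidence_band(history, k=3.0):
    """Return (low, high) as mean +/- k sample standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate spread")
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(history, value, k=3.0):
    """True when value falls outside the band built from history."""
    low, high = confidence_band(history, k)
    return value < low or value > high


# Hypothetical recent readings of some obscure, normally ignored metric.
recent = [100, 102, 98, 101, 99]
print(is_anomalous(recent, 103))  # inside the band
print(is_anomalous(recent, 120))  # well outside the band
```

The point is that nobody has to watch the obscure metric day to day; the band does the watching and only speaks up when the value leaves its usual range.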
Thoughts about the conference
I’m going to write down some general themes from the conference; the themes are in no particular order.
With Joyent heavily represented and the likes of Bryan Cantrill in the crowd, there was lots of talk of Node.js. I was encouraged by the fact that most of the applications mentioned were for use cases that I actually believe to be appropriate for the platform – small, near-stateless services. Time will tell if Node.js fulfills its destiny of being the next Rails, but for now, it’s making for some hilarious internet shit talking to be sure.
Theme: AWS Bashing
One thing that soured the event for me was the constant bashing of AWS by just about everyone in attendance. Being the most popular girl is hard, especially when you don’t come to the party.
Theme: Disaster Porn, aka Generalist Porn
The part about engineers and ops folks loving to talk about the gnarly shit they have lived through is not new. The thing that I found refreshing (possibly because I wasn’t at last year’s conference) is the emphasis on being able to go up and down the stack. From Ben Fried’s keynote to Artur Bergman’s “full stack” talk to Theo’s closing notes, the theme of being able to follow a bug through the entrails of your platform kept showing up, accompanied by a bewildering story of the descent. The recurrence of “baffling” failures set up my talk and the larger theme of humans interacting with complex systems nicely. One thing I’ll say is, listening to guys like Artur and Theo describe the debugging process for some of these low-level bugs inspires me to dig deeper and get to know the computer better.
Overall, the conference was what a conference should be – a thought-provoking experience where I got to meet tons of awesome people. Definitely looking forward to going again next year.