Blog Posts - August 2011

Slides – Cassandra for Sysadmins

At outbrain, we like things that are awesome.

Cassandra is awesome.

Ergo, we like Cassandra.

We’ve had it in production for a few years now.

I won’t delve into why the developers like it, but as a Sysadmin on-call in the evenings, I can tell you straight out I’m glad it has my back.

We have MySQL deployed pretty heavily, and it is fantastic at what it does.  However, MySQL has a bit of an administrative overhead compared to a lot of the new alternative data stores out there, especially when making MySQL work in a large geographically distributed environment.

If you can model your data in Cassandra, are educated about the trade-offs, and have an undying wish not to have to worry too deeply about managing replication and sharding, it is a no-brainer.

I did a presentation on Cassandra (with Jake Luciani from Datastax) to the NYC Chapter of the League of Professional System Administrators  (LOPSA) from the standpoint of an Admin.

Us Sysadmins fear change, because it is our butt on the line if there is an outage.  With executives anxiously pacing behind us and revenue flushing down the drain, we’re the last line of defense if there is an issue and we’re the ones who will be torn away from families in the evenings to handle an outage.

So, yeah… we’re a conservative lot 🙂

That being said, change and progress can be good, especially when it frees you up.  Cassandra is resilient, fault-graceful and elastic. Once you understand how so, you’ll be slightly less surly. Your developers might not even recognize you!

These slides are for the Sys Admin, noble fellow, to assuage his fears and get him started with Cassandra.

 

Visualizing Our Deployment Pipeline

(This is a cross post from Ran’s blog)

When large numbers start piling up, in order to make sense of them,  they need to be visualized.
I still work as a consultant at Outbrain about one day a week, and most of the time I’m in charge of the deployment system last described here. The challenges that are encountered when we develop the system are good challenges, and every day we have too many deployments to be easily followed, so I decided to visualize them.
On an average day, we usually have  a dozen or two deployment (to production, not including test clusters) so I figured why don’t I use my google-visualization-fo0 and draw some nice graphs. Here are the results and explanations follow.
Before I begin, just to put things in context, Outbrain had been practicing  Continuous Deployment for a while (6 months or so) and although there are a few systems that helped us get there, one of the main pillars was a relatively new tool written by the fine folks at LinkedIn (and in particular Yan— Thanks Yan!), so just wanted to give a fair shout out to them and thank Yan for the nice tool, API and ongoing awesome support. If you’re looking for a deployment tool do give glu a try, it’s pretty awesome! Without glu and it’s API all the nice graphs and the rest of the system would not have seen the light of day.

 

The Annotated Timeline
This graph may seem intimidating at first, so don’t be scared and let’s dive right into it… BTW, you may click on the image to enlarge it.

First, let’s zoom in to the right hand side of the graph. This graph uses Google’s annotated timeline graph which is really cool for showing how things change over time and correlate them to events, which is what I do here — the events are the deployments and the x axis is the time while the y is the version of the deployed module.
On the right hand side you see a list of deployment events —  for example, the one at the top has “ERROR www @tom…” and the one next is “BehavioralEngine @yatirb…” etc. This list can be filtered so if you type a name of one of the developers such as @tom or @yatirb you see only the deployments made by him (of course all deployments are made by devs, not by ops, hey, we’re devopsy, remember?).
If you type into the filter box only www you see all the deployments for the www component, which by no surprise is just our website.
If you type ERROR you see all deployments that had errors (and yes, this happens too, not a big deal).
The nice thing about this graph from is first that while you filter the elements on the graph that are filtered out dissapear, so for example let’s see only deployments to www (click on the image to enlarge):
You’d notice that not only the right hand side list is shrunk and contains only deployments to www, but also the left hand side graph now only has the appropriate markers. The rest of the lines are still there but only the markers for the www line are on the graph right now.
Now let’s have a look at the graph. One of the coolest things is that you can zoom in to a specific timespan using the controls at the lower part of the graph. (click to enlarge)

In this graph the x axis shows the time (date and time of day) and the y axis shows the svn revision number. Each colored line represents a single module (so we have one line for www and one line for the BehavioralEngine etc).

What you would usually see is for each line (representing a module) a monotonically increasing value over time, a line from the bottom left corner towards the top right corner, however, in relatively rare cases where a developer wants to deploy an older version of his module, then you clearly see it by the line suddenly dropping down a bit instead of climbing up; this is really nice, helps find unusual events.

 

The Histogram
In the next graph you see an overview of deployments per day. (click to enlarge)

This is more of a holistic view of how things went the last couple of days, it just shows how many deployments took place each day (counts production clusters only) and colors the successful ones in green and the failed ones in red.

This graph is like an executive summary that can tell the story of – in case there are too many reds (or there are reds at all), then someone needs to take that seriously and figure out what needs to be fixed (usually that someone is me…) – or in case the bars aren’t high enough, then someone needs to kick developer’s buts and get them deploying somethin already…

Like many other graphs from Google’s library (this one’s a Stacked Column Chart, BTW), it shows nice tooltips when hovering over any of the columns with their x values (the date) and their y value (number of successful/failed deployments)

 

Versions DNA Mapping
The following graph shows the current variety of versions that we have in our production systems for each and every module. It was attributed as a DNA mapping by one of our developers b/c of the similarity in how they look but that’s how far this similarity goes…

The x axis lists the different modules that we have (names were intentionally left out, but you can imaging having www and other folks there). The y axis shows the svn versions of them in production. It uses glu’s live model as reported by glu’s agents to zookeeper.

Let’s zoom in a bit:

What this diagram tells us is that the module www has versions starting from 41268 up to 41463 in production. This is normal as we don’t necessarily deploy everything to all servers at once, but this graph helps us easily find hosts that are left behind for too long, so for example if one of the modules had not been deployed in a while then you’d see it falling behind low on the graph. Similarly, if a module has a large variability in versions in production, chances are that you want to close that gap pretty soon. The following graph illustrates both cases:

To implement this graph I used a crippled version of the Candle Stick Chart, which is normally used for showing stock values; it’s not ideal for this use case but it’s the closest I could find.

That’s all, three charts is enough for now and there are other news regarding our evolving deployment system, but they are not as visual; if you have any questions or suggestions for other types of graphs that could be useful don’t be shy to comment or tweet (@rantav).