Category: Dev Methods

Devops Shmevops… we call it Ownership.

Yes, it’s been a long time since we last updated this blog – shame on us!

A lot has happened since our last blog post, which came while we were dealing with the effects of Hurricane Sandy. In the end, our team handled it bravely and effectively, with no downtime and no business impact. However, a storm is still a storm, and we did have to perform an emergency evacuation from our old New York data center and move to a new one.

More has happened since, and today I want to focus on one major aspect of our life over the last year. We made some cultural decisions that changed the way we treat our work. Yes, the Devops movement has its influence here. When we faced the decision of “NOC or NOT”, we adopted the theme of “You build it, you run it!”.

Instead of hiring 10 students and attempting to train them on the “moving target” of a continuously changing production setup, we decided to hire 2 engineers and concentrate our effort on building a strong monitoring system that allows engineers to take ownership of monitoring their own systems.

Now, Outbrain is indeed a high-scale system. Building a monitoring system that enables more than 1,000 machines and more than 100 services to report metrics every minute is quite a challenge. We chose the stack of Logstash, RabbitMQ and Graphite for that mission. In addition, we developed an open source project called Graphitus, which enables us to build dashboards from Graphite metrics. Since adopting it we have more than 100 dashboards that the teams use daily. We also developed Dashanty, which enables each team to build an operational dashboard for itself.
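To give a feel for how cheap it is for a service to report a metric, Graphite’s plaintext protocol accepts simple “path value timestamp” lines over TCP (port 2003 by default). The helper below is an illustration of that protocol, not our actual collector code:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Minimal sketch of pushing a datapoint to Graphite's plaintext listener.
// Hypothetical helper for illustration only.
class GraphiteSketch {

    // Graphite's plaintext protocol: "metric.path value unix-timestamp\n"
    static String formatMetric(String path, double value, long epochSeconds) {
        return path + " " + value + " " + epochSeconds + "\n";
    }

    // Send one datapoint to a carbon listener (plaintext port, 2003 by default).
    static void send(String host, int port, String path, double value) throws IOException {
        try (Socket socket = new Socket(host, port);
             OutputStream out = socket.getOutputStream()) {
            long now = System.currentTimeMillis() / 1000L;
            out.write(formatMetric(path, value, now).getBytes(StandardCharsets.UTF_8));
        }
    }
}
```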

On the alerting front we stayed with Nagios but improved its data sources. Instead of Nagios polling metrics by itself, we developed a Nagios/Graphite plugin: Nagios queries Graphite for the latest metrics and, according to thresholds, sends appropriate alerts to the relevant people. On top of that, the team developed an application called RedAlert that enables each team/engineer to configure alerts on the services they own, define when alerts are critical, and decide when such alerts should be pushed to them. This data goes into Nagios, which starts monitoring the metric in Graphite and fires an alert if something goes wrong. “Push” alerts are configured to go to PagerDuty, which locates the relevant engineer and emails, texts or calls him as needed.
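To illustrate the plugin’s core idea (a sketch under assumed thresholds, not the actual plugin code): once the latest datapoint is fetched from Graphite, the check reduces to comparing it against the configured thresholds and returning a Nagios status code:

```java
// Sketch of the threshold logic such a Nagios/Graphite check might use.
// Nagios plugin exit-code convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
class GraphiteThresholdCheck {

    static int status(double latestValue, double warnAbove, double critAbove) {
        if (latestValue > critAbove) {
            return 2; // CRITICAL -> "push" alert, routed via PagerDuty
        }
        if (latestValue > warnAbove) {
            return 1; // WARNING -> non-push alert, handled during working hours
        }
        return 0;     // OK
    }
}
```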

That’s the technical part. What is more important in making it happen is the cultural side that this technology supports:

We truly believe in “End to End Ownership”; “You build it, you run it!” is one way to say that. In an environment where everybody can (and should) change production at any moment, putting someone else in charge of watching the systems is impossible. We were also very keen on MTTR (Mean Time To Recover). We don’t promise our business people a 100% fault-free environment, but we do promise fast recovery times. With these two themes in front of us, we concluded that alerts are best directed to the owning engineers as fast as possible, with as few mediators as possible along the way. So, we came up with the following:

  • We put a baseline of monitoring systems to support the procedure – and we continuously improve it.
  • Engineers/teams own services (a very SOA-style architecture). Let’s use the term “Owner”. We try to eliminate services without clear owners.
  • Owners push metrics into Graphite using calls in code or other collectors.
  • Owners define alerts on these metrics using the RedAlert system.
  • Each team defines an “on call” schedule on PagerDuty. The “on call” engineer is the point of contact for any alerting service under the team’s ownership.
  • Ops own the infrastructure (servers/network/software infra) – they also have an “Ops on shift”, awake 24/7 (we use the team distribution between NY and IL for that).
  • Non-push alerts that do not require immediate action are gathered during non-working hours and handled during working hours.
  • Push alerts are routed via PagerDuty the following way: the Ops on shift gets them first, and if he can address them or correlate them with an infrastructure issue, he acknowledges them. If the Ops on shift doesn’t know what to do with an alert, PagerDuty continues and routes it to the engineer on call.
  • Usually the next thing that happens is that both of them jump on HipChat and start tackling the issue together to shorten MTTR and resolve it.

The biggest benefit of this method is an increased sense of “ownership” for everyone on the team. The virtual wall between Ops and Dev (which was fairly low at Outbrain to begin with) was completely removed. Everybody is more “production sensitive”.

Few things that helped us through it:

  1. Our team. As management we encouraged and formalized it, but the motivation came from the team. It is very rare to see engineers who want (not to say push hard) to take more ownership of their products and to really “own” them. I feel lucky that we have such a team. It made our decisions much simpler.
  2. Being so tech-ish and pushing our monitoring capabilities to the edge instead of settling for the easy, labor-intensive, half-baked solution (AKA a NOC).
  3. A 2-week “Quality Time” for all of engineering that was devoted to improving MTTR and building everything necessary to support this procedure – all credits to Erez Mazor for running it.
This post will be followed by more specific posts about the systems we developed, written by the actual people who built them.

How to “Outbrain” Selenium Tests with Ext framework

Many of our internal applications were developed using the ExtJS framework.

ExtJS is a very powerful JavaScript framework and one of the most popular open source JavaScript user interface frameworks. However, when it comes to automated testing with Selenium, the real challenge begins.

It is very difficult to write automated tests for an Ext application with Selenium because Ext generates many <div> and <span> tags with automatically-generated IDs (something like “ext-comp-11xx”). Accessing these tags through Selenium is the big challenge we are trying to solve: we wanted a way to obtain these automatically-generated IDs automatically.
How do we approach this?

Ext has a component manager where all of the developers’ components are registered. We can “ask” the component manager for a component’s ID by sending it a descriptor of the component. To simplify: we (the Selenium server) tell the component manager, “I need the ID of the currently visible window which, btw, is labeled ‘campaign editor’”.

This will look something like:

ComponentLocatorFactory extjsCmpLoc = new ComponentLocatorFactory(selenium);

Window testWin = new Window(extjsCmpLoc.createLocator("campaign editor", ExtjsUtils.Xtype.WINDOW));

Then we can use Ext Window methods, e.g. testWin.close();

Another example:

ComponentLocatorFactory extjsCmpLoc = new ComponentLocatorFactory(selenium);

Button newButton = new Button(extjsCmpLoc.createLocator("Add Campaign", ExtjsUtils.Xtype.BUTTON));


You can ask for all of the visible components by type, by label or both:


TextField flyFromDate = new TextField(extjsCmpLoc.createLocator(ExtjsUtils.Xtype.DATEFIELD, 0));

TextField flyToDate = new TextField(extjsCmpLoc.createLocator(ExtjsUtils.Xtype.DATEFIELD, 1));
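Under the hood, a locator like this boils down to evaluating JavaScript in the browser that asks Ext’s ComponentMgr for the matching component’s generated ID. The helper below is a guess at what such a lookup could look like (the actual ExtjsUtils implementation may differ):

```java
// Hypothetical sketch: build a JavaScript snippet that scans Ext's component
// manager for a visible component matching an xtype and title, and returns
// its auto-generated id. The string is meant for Selenium RC's getEval().
class ExtLocatorSketch {

    static String lookupScript(String xtype, String title) {
        return "var id = null;"
             + "this.browserbot.getUserWindow().Ext.ComponentMgr.all.each(function(c){"
             + "  if (c.getXType() === '" + xtype + "'"
             + "      && c.title === '" + title + "'"
             + "      && c.isVisible()) { id = c.id; }"
             + "});"
             + "id;";
    }
}
```

With Selenium RC this would be evaluated as selenium.getEval(ExtLocatorSketch.lookupScript("window", "campaign editor")), returning the component’s generated ID.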



Here’s a simple diagram of our solution:


Link to the project on GitHub:

This solution is Open Source. In the meantime, if you have any questions, feel free to contact me directly. Asaf at outbrain dot com.


Asaf Levy

Visualizing Our Deployment Pipeline

(This is a cross post from Ran’s blog)

When large numbers start piling up, in order to make sense of them,  they need to be visualized.
I still work as a consultant at Outbrain about one day a week, and most of the time I’m in charge of the deployment system last described here. The challenges we encounter while developing the system are good challenges, but every day we have too many deployments to follow easily, so I decided to visualize them.
On an average day we usually have a dozen or two deployments (to production, not including test clusters), so I figured: why don’t I use my google-visualization-foo and draw some nice graphs? Here are the results; explanations follow.
Before I begin, just to put things in context: Outbrain has been practicing Continuous Deployment for a while (6 months or so), and although there are a few systems that helped us get there, one of the main pillars is glu, a relatively new tool written by the fine folks at LinkedIn (and in particular Yan – thanks Yan!), so I just wanted to give a fair shout-out to them and thank Yan for the nice tool, API and ongoing awesome support. If you’re looking for a deployment tool, do give glu a try; it’s pretty awesome! Without glu and its API, all the nice graphs and the rest of the system would not have seen the light of day.


The Annotated Timeline
This graph may seem intimidating at first, so don’t be scared and let’s dive right into it… BTW, you may click on the image to enlarge it.

First, let’s zoom in to the right hand side of the graph. This graph uses Google’s annotated timeline graph which is really cool for showing how things change over time and correlate them to events, which is what I do here — the events are the deployments and the x axis is the time while the y is the version of the deployed module.
On the right hand side you see a list of deployment events – for example, the one at the top reads “ERROR www @tom…” and the next one “BehavioralEngine @yatirb…”, etc. This list can be filtered, so if you type the name of one of the developers, such as @tom or @yatirb, you see only the deployments made by him (of course all deployments are made by devs, not by ops – hey, we’re devopsy, remember?).
If you type into the filter box only www you see all the deployments for the www component, which by no surprise is just our website.
If you type ERROR you see all deployments that had errors (and yes, this happens too, not a big deal).
The nice thing about this graph is that while you filter, the elements that are filtered out disappear from the graph, so for example let’s see only deployments to www (click on the image to enlarge):
You’d notice that not only the right hand side list is shrunk and contains only deployments to www, but also the left hand side graph now only has the appropriate markers. The rest of the lines are still there but only the markers for the www line are on the graph right now.
Now let’s have a look at the graph. One of the coolest things is that you can zoom in to a specific timespan using the controls at the lower part of the graph. (click to enlarge)

In this graph the x axis shows the time (date and time of day) and the y axis shows the svn revision number. Each colored line represents a single module (so we have one line for www and one line for the BehavioralEngine etc).

What you would usually see for each line (representing a module) is a monotonically increasing value over time – a line from the bottom-left corner towards the top-right corner. However, in the relatively rare case where a developer wants to deploy an older version of his module, you clearly see the line suddenly drop down a bit instead of climbing up; this is really nice and helps find unusual events.


The Histogram
In the next graph you see an overview of deployments per day. (click to enlarge)

This is more of a holistic view of how things went the last couple of days, it just shows how many deployments took place each day (counts production clusters only) and colors the successful ones in green and the failed ones in red.

This graph is like an executive summary that can tell the story: if there are too many reds (or any reds at all), someone needs to take that seriously and figure out what needs to be fixed (usually that someone is me…), and if the bars aren’t high enough, someone needs to kick developers’ butts and get them deploying something already…

Like many other graphs from Google’s library (this one’s a Stacked Column Chart, BTW), it shows nice tooltips when hovering over any of the columns, with their x values (the date) and y values (number of successful/failed deployments).


Versions DNA Mapping
The following graph shows the current variety of versions that we have in our production systems for each and every module. It was dubbed a DNA mapping by one of our developers because of the similarity in how they look, but that’s as far as the similarity goes…

The x axis lists the different modules that we have (names were intentionally left out, but you can imagine having www and other folks there). The y axis shows their svn versions in production. It uses glu’s live model as reported by glu’s agents to ZooKeeper.

Let’s zoom in a bit:

What this diagram tells us is that the module www has versions starting from 41268 up to 41463 in production. This is normal as we don’t necessarily deploy everything to all servers at once, but this graph helps us easily find hosts that are left behind for too long, so for example if one of the modules had not been deployed in a while then you’d see it falling behind low on the graph. Similarly, if a module has a large variability in versions in production, chances are that you want to close that gap pretty soon. The following graph illustrates both cases:

To implement this graph I used a crippled version of the Candle Stick Chart, which is normally used for showing stock values; it’s not ideal for this use case but it’s the closest I could find.

That’s all, three charts is enough for now. There is other news regarding our evolving deployment system, but it is not as visual; if you have any questions or suggestions for other types of graphs that could be useful, don’t be shy to comment or tweet (@rantav).

Leader Election with Zookeeper


Recently we had to implement active-passive redundancy for a singleton service in our production environment, where the general rule is to always have “more than one of anything”. The main motivation is to alleviate the need to manually monitor and manage these services, whose presence is crucial to the overall health of the site.

This means that we sometimes have a service installed on several machines for redundancy, but only one of them is active at any given moment. If the active service goes down for some reason, another instance rises to do its work. This is called leader election. One of the most prominent open source implementations facilitating the process of leader election is Zookeeper. So what is Zookeeper?
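For reference, the standard Zookeeper election recipe (a sketch of the well-known pattern, not necessarily our exact implementation) goes like this: every candidate creates an EPHEMERAL_SEQUENTIAL znode under a shared election path; the candidate whose znode has the lowest sequence number is the leader, and every other candidate watches the znode immediately preceding its own, so that when the leader dies (and its ephemeral znode vanishes) exactly one successor is notified. The decision step can be expressed without any Zookeeper machinery:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Decision step of the ephemeral-sequential leader-election recipe.
// Each candidate created a znode such as "n_0000000007" under the election
// path; the lowest sequence number leads, everyone else watches the znode
// immediately before theirs.
class LeaderElectionSketch {

    // Returns null if myNode is the leader, otherwise the znode to watch.
    static String nodeToWatch(String myNode, List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted); // sequential suffixes sort lexicographically
        int myIndex = sorted.indexOf(myNode);
        if (myIndex == 0) {
            return null; // lowest sequence number -> this candidate leads
        }
        return sorted.get(myIndex - 1); // watch the immediate predecessor
    }
}
```

Watching only the immediate predecessor (rather than the leader itself) avoids a “herd effect” where every candidate wakes up on a single node’s death.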

Originally developed by Yahoo! Research, Zookeeper acts as a service providing reliable distributed coordination. It is highly concurrent, very fast and suitable mainly for read-heavy access patterns. Reads can be served by any node of a Zookeeper cluster, while writes are quorum-based. To reach a quorum, Zookeeper utilizes an atomic broadcast protocol. So how does it work?

Read more >

Feature Flags Made Easy

I recently participated in the ILTechTalk week. Most of the talks discussed issues like Scalability, Software Quality, Company Culture, and Continuous Deployment (CD). Since the talks were hosted at Outbrain, we got many direct questions about our concrete implementations. Some of the questions and statements claimed that Feature Flags complicate your code. What bothered most participants was that committing code directly to trunk requires adding feature flags in some cases, which may make their code base more complex.

While in some cases, feature flags may make the code slightly more complicated, it shouldn’t be so in most cases. The main idea I’m presenting here is that conditional logic can be easily replaced with polymorphic code. In fact, conditional logic can always be replaced by polymorphism.

Enough with the abstract talk…

Suppose we have an application that contains some imaginary feature, and we want to introduce a feature flag. Below is a code snippet that developers normally come up with:

public void runApplication() {
    // ...
    if (useNewImplementation) {
        executeNewImaginaryFeatureImplementation();
    } else {
        executeOldImaginaryFeatureImplementation();
    }
    // ...
}

While this is a legitimate implementation in some cases, it does complicate your code base by increasing its cyclomatic complexity. In some cases, the test for activation of the feature may recur in many places in the code, so this approach can quickly turn into a maintenance nightmare.

Luckily, implementing a feature flag using polymorphism is pretty easy. First, let’s define an interface for the imaginary feature and two implementations (old and new):

public interface ImaginaryFeature {
    public void executeFeature();
}

class OldImaginaryFeature implements ImaginaryFeature {
    @Override
    public void executeFeature() {
        System.out.println("old feature implementation");
    }
}

class NewImaginaryFeature implements ImaginaryFeature {
    @Override
    public void executeFeature() {
        System.out.println("new feature implementation");
    }
}

Now, let’s use the feature in our application, selecting the implementation at runtime:

public class PolymorphicApplication {
    private final ImaginaryFeature imaginaryFeature;

    public PolymorphicApplication() {
        this.imaginaryFeature = createImaginaryFeature();
    }

    private ImaginaryFeature createImaginaryFeature() {
        final String featureClass = System.getProperty("PolymorphicApplication.imaginaryFeature.class");
        try {
            return (ImaginaryFeature) Class.forName(featureClass).newInstance();
        } catch (final Exception e) {
            throw new IllegalStateException("Failed to create ImaginaryFeature of class " + featureClass, e);
        }
    }

    public void runApplication() {
        // ...
        imaginaryFeature.executeFeature();
        // ...
    }
}

Here, we initialize the imaginary feature member by reflection, using a class name specified as a system property. The createImaginaryFeature() method above would usually be abstracted into a factory, but is kept as is here for brevity. But we’re still not done. Most readers would probably say that the introduction of a factory and reflection makes the code less readable and less maintainable. I have to agree – and apart from that, adding dependencies to the concrete implementations would complicate the code even more. Luckily, I have a secret weapon at my disposal. It is called IoC (or DI). When using an IoC container such as Spring or Guice, your code can be made extremely flexible, and implementing feature flags becomes a walk in the park.

Below is a rewrite of the PolymorphicApplication using Spring dependency injection:

public class SpringPolymorphicApplication {
    private final ImaginaryFeature imaginaryFeature;

    public SpringPolymorphicApplication(final ImaginaryFeature imaginaryFeature) {
        this.imaginaryFeature = imaginaryFeature;
    }

    public void runApplication() {
        // ...
        imaginaryFeature.executeFeature();
        // ...
    }
}
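The original XML wiring isn’t shown here, but with classic Spring XML it might look roughly like the following (a sketch: the bean names and the placeholder-configurer setup are assumptions, matching the -DimaginaryFeature.implementation.bean system property):

```xml
<!-- Sketch only: resolve which implementation backs the application bean
     from a system property, defaulting to the old implementation. -->
<bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
  <property name="systemPropertiesModeName" value="SYSTEM_PROPERTIES_MODE_OVERRIDE"/>
  <property name="properties">
    <props>
      <prop key="imaginaryFeature.implementation.bean">oldImaginaryFeature</prop>
    </props>
  </property>
</bean>

<!-- lazy-init so only the selected implementation is actually instantiated -->
<bean id="oldImaginaryFeature" class="OldImaginaryFeature" lazy-init="true"/>
<bean id="newImaginaryFeature" class="NewImaginaryFeature" lazy-init="true"/>

<bean id="application" class="SpringPolymorphicApplication">
  <constructor-arg ref="${imaginaryFeature.implementation.bean}"/>
</bean>
```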

The Spring configuration defines the application and the two imaginary feature implementations. By default, the application is initialized with the oldImaginaryFeature, but this behavior can be overridden by specifying a -DimaginaryFeature.implementation.bean=newImaginaryFeature command line argument. Only a single feature implementation is initialized by Spring, and the implementations may have dependencies of their own.

The bottom line is: with a bit of extra preparation and the right design decisions, feature flags shouldn’t be a burden on your code base. By extra preparation, I mean extracting interfaces for your domain objects, using an IoC container, etc. – things we should be doing in most cases anyway.


Eran Harel is a Senior Software Developer at Outbrain.