Goodbye static CNAMEs, hello Consul

Nearly every large scale system becomes distributed at some point: a collection of many instances and services that compose the solution you provide. And as you scale horizontally to provide high availability, better load distribution, etc…, you find yourself spinning up multiple instances of services, or using systems that function in a clustered architecture. That’s all cool in theory, but soon after you ask yourself, “how do I manage all of this? How should these services communicate with each other? And how do they even know what instances (or machines) exist?”

Those are excellent questions!

What methods are in use today?

The naive approach, which we’d followed at Outbrain for many years, is to route all inter-service traffic via load balancers (HAProxy in our case). Every call to another system, such as a MySQL slave, is made to the load balancer (one in a pool of many), via an agreed upon name, such as a DNS CNAME. The load balancer, which holds a static configuration of all the different services and their instances, directs the call to one of those instances, based on the predefined policy.

backend be_onering_es   ## backend name
  balance leastconn     ## how to distribute load
  option httpchk GET /  ## service health check method
  option httpclose      ## add “Connection: close” header if missing
  option forwardfor     ## send client IP through XFF header
  server ringdb-20001 ringdb-20001:9200 check slowstart 10s weight 100   ## backend node 1
  server ringdb-20002 ringdb-20002:9200 check slowstart 10s weight 100   ## backend node 2

The load balancer is also responsible for checking service health, to make sure requests are routed only to live services, as dead ones are “kicked out of the pool”, and revived ones are brought back in.

An alternative to the load balancer method, used in high throughput systems such as Cassandra, is configuring CNAMEs that point to specific nodes in the cluster. We then use those CNAMEs in the consuming application’s configuration. The client is then responsible for applying a policy of balancing between those nodes, both for load and availability.
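A minimal sketch of what such a client-side policy might look like, in JavaScript: round-robin over a configured node list, skipping nodes the client has marked as down. The `createBalancer` helper and the node names are hypothetical and are not taken from any real driver.

```javascript
// Hypothetical client-side balancer: round-robin over a static list of
// nodes, skipping any node currently marked as down.
function createBalancer(nodes) {
  const down = new Set();
  let cursor = 0;
  return {
    markDown(node) { down.add(node); },
    markUp(node) { down.delete(node); },
    // Returns the next live node, or null when every node is down.
    next() {
      for (let i = 0; i < nodes.length; i++) {
        const candidate = nodes[(cursor + i) % nodes.length];
        if (!down.has(candidate)) {
          cursor = (cursor + i + 1) % nodes.length;
          return candidate;
        }
      }
      return null;
    },
  };
}

const balancer = createBalancer(["cass-1", "cass-2", "cass-3"]);
balancer.markDown("cass-2");
console.log(balancer.next(), balancer.next(), balancer.next());
// prints "cass-1 cass-3 cass-1"
```

Note that the node list itself is still a static piece of configuration that someone has to keep up to date.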

OK, so what’s the problem here?

There are a few, actually:

  1. The mediator (Load balancer), as quick as it may be in processing requests (and HAProxy is really fast!), is another hop on the network. With many services talking to each other, this could prove a choke point in some network topologies. It’s also a shared resource between multiple services and if one service misbehaves, everyone pays the price. This is especially painful with big payloads.
  2. The world becomes very static! Moving services between hosts, scaling them out/in, adding new services – it all involves changing the mediator’s config, and in many cases this is done manually. Manual work requires expertise and is error prone. When the changes become frequent… it simply does not scale.
  3. When moving ahead to infrastructure that is based on containers and resource management, where instances of services and resources are allocated dynamically, the whole notion of HOSTNAME goes away and you cannot count on it in ANY configuration.

What this all adds up to is “the end of the static configuration era”. Goodbye static configs, hello Dynamic Service Discovery! And cue Consul.

What is Consul?

In a nutshell, Consul is a Service Discovery System, with a few interesting features:

  1. It’s a distributed system, made out of an agent in each node. Nodes talk to each other via a gossip protocol, making node discovery simple, robust, and dynamic. There’s no configuration file describing all members of a Consul cluster.
  2. It’s fault tolerant by design, and using concepts such as Anti Entropy, gracefully handles nodes disappearing and reappearing – a common scenario in VM/container based infrastructure.
  3. It has first-class treatment of datacenters, as self-contained, interconnected entities. This means that DC failure / disconnection would be self-contained. It also means that a node in one DC can query for information in another DC with as little knowledge as the remote DC’s name.
  4. It holds the location (URI) and health of every service on every host, and makes this data available via multiple channels, such as a REST API and GUI. The API also lets you make complex queries and get the service data segment you’re interested in. For example: Get me all the addresses of all instances of service ‘X’ from Datacenter ‘Y’ in ‘staging env’ (tag).
  5. There is a very simple way to get access to “Healthy” service instances by leveraging the Consul DNS interface. Perfect for those pesky 3rd party services whose code you can’t or don’t want to modify, or just to get up and running quickly without modifying any client code (disclaimer: doesn’t fit all scenarios).
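To make the example query from item 4 concrete, here is a small sketch that builds the corresponding request URL for Consul’s HTTP health endpoint (`/v1/health/service/<service>`, with the `dc`, `tag` and `passing` query parameters). The `healthyInstancesUrl` helper is hypothetical, and the agent address is an assumption (a local agent on the default HTTP port 8500).

```javascript
// Builds the URL that asks a Consul agent for the instances of a service.
// The default agent address is an assumption: a local agent on port 8500.
function healthyInstancesUrl(service, { dc, tag, agent = "http://localhost:8500" } = {}) {
  const url = new URL(`/v1/health/service/${encodeURIComponent(service)}`, agent);
  if (dc) url.searchParams.set("dc", dc);    // query another datacenter by name
  if (tag) url.searchParams.set("tag", tag); // e.g. an environment tag
  url.searchParams.set("passing", "true");   // only instances whose health checks pass
  return url.toString();
}

console.log(healthyInstancesUrl("X", { dc: "Y", tag: "staging" }));
// prints "http://localhost:8500/v1/health/service/X?dc=Y&tag=staging&passing=true"
```

Fetching that URL returns a JSON list describing each matching instance, which the client can then filter or balance across as it sees fit.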

How does Consul work?

You can read all about it here, but let me take you through a quick tour of the architecture:


As you can see, Consul has multi datacenter awareness built right in (you can read more about it here). But for our case, let’s keep it simple, and look at the case of a single datacenter (Datacenter 1 in the diagram).

What the diagram tags as “Clients” are actually “Consul agents”, running locally on every participating host. Those talk to each other, as well as the Consul servers (which are “agents” configured as Servers), through a “Gossip protocol”. If you’re familiar with Cassandra, and that rings a bell, then you’re right, it’s the same concept used by Cassandra nodes to find out which ones are up or down in a cluster. A Gossip protocol essentially makes sure “Everybody knows Everything about Everyone”. So within reasonable delay, all agents know (and propagate) state information about other agents. And you just so happen to have an agent running locally on your node, ready to share everything it knows via API, DNS or whatnot. How convenient!
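To get a feel for why “within reasonable delay” holds, here is a toy simulation of the general gossip idea. It is only an illustration under simplifying assumptions (every node contacts one random peer per round and the two merge everything they know); it is not Consul’s actual protocol, and `simulateGossip` is a hypothetical name.

```javascript
// Toy push-pull gossip: every round, each node picks one random peer and the
// two merge what they know. An illustration only, not Consul's protocol.
function simulateGossip(nodeCount, seed = 42, maxRounds = 100) {
  // Small deterministic PRNG (linear congruential) so runs are repeatable.
  let state = seed >>> 0;
  const random = () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };

  // knowledge[i] holds the ids of the nodes whose state node i has heard about.
  let knowledge = Array.from({ length: nodeCount }, (_, id) => new Set([id]));
  const everyoneKnowsEverything = () =>
    knowledge.every((known) => known.size === nodeCount);

  let rounds = 0;
  while (!everyoneKnowsEverything() && rounds < maxRounds) {
    for (let id = 0; id < nodeCount; id++) {
      // Pick a random peer other than ourselves.
      let peer = Math.floor(random() * (nodeCount - 1));
      if (peer >= id) peer += 1;
      const merged = new Set([...knowledge[id], ...knowledge[peer]]);
      knowledge[id] = merged;
      knowledge[peer] = new Set(merged);
    }
    rounds += 1;
  }
  return { rounds, converged: everyoneKnowsEverything() };
}

console.log(simulateGossip(16));
```

In this model the number of rounds grows slowly with the number of nodes, which is the property that makes gossip attractive for large clusters.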

Agents are also the ones performing health checks to the services on the hosts they run on, and gossiping any health state changes. To make this work, every service must expose a means to query its health status, and when registered with its local Consul agent, also register its health check information. At Outbrain we use an HTTP based “SelfTest” endpoint that every one of our homegrown services exposes (through our OB1K container, practically for free!).

Consul servers are also part of the gossip pool and thus propagate state in the cluster. However, they also maintain a quorum and elect a leader, who receives all updates (via RPC calls forwarded from the other servers) and registers them in its database. From here on, the data is replicated to the other servers and propagated to all the agents via Gossip. This method is a bit different from other Gossip based systems that have no servers and leaders, but it allows the system to support stronger consistency models.
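The quorum mentioned above is a simple majority of the servers. As a quick worked example (the function names are ours, chosen for illustration), this is how the quorum size and the number of tolerated server failures relate to the number of servers:

```javascript
// Majority quorum for a group of `servers` voting members.
function quorumSize(servers) {
  return Math.floor(servers / 2) + 1;
}

// How many servers can fail while a quorum is still reachable.
function failureTolerance(servers) {
  return servers - quorumSize(servers);
}

for (const n of [1, 3, 4, 5]) {
  console.log(`${n} servers: quorum ${quorumSize(n)}, tolerates ${failureTolerance(n)} failure(s)`);
}
// 1 servers: quorum 1, tolerates 0 failure(s)
// 3 servers: quorum 2, tolerates 1 failure(s)
// 4 servers: quorum 3, tolerates 1 failure(s)
// 5 servers: quorum 3, tolerates 2 failure(s)
```

Note that going from 3 to 4 servers does not buy any extra fault tolerance, which is why odd server counts are the usual choice.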

There’s also a distributed key-value store we haven’t mentioned, rich ACLs, and a whole ecosystem of supporting and derived tools… but we said we’d keep it simple for now.

Where does that help with service discovery?

First, what we’ve done is taken all of our systems already organized in clusters and registered them with Consul. Systems such as Kafka, Zookeeper, Cassandra and others. This allows us to select a live service node from a cluster, simply by calling a hostname through the Consul DNS interface. For example, take Graphite: Outbrain’s systems are currently generating ~4M metrics per minute. Getting all of these metrics through a load balancer, or even a cluster of LBs, would be suboptimal, to say the least. Consul allows us to have each host send metrics to a hostname, such as “graphite.service.consul”, which returns a random IP of a live graphite relay node. Want to add a couple more nodes to share the load? No problem, just register them with Consul and they automagically appear in the list the next time a client resolves that hostname. Which, as we mentioned, happens quite a few times a minute. No load balancers in the way to serve as choke points, no editing of static config files. Just simple, fast, out-of-band communication.

How do these 3rd party services register?

We’re heavy users of Chef, and have thus created a chef cookbook to help us get the job done. Here’s a (simplified) code sample we use to register Graphite servers:

ob_consul 'graphite' do
  owner 'ops-vis'         ## add ‘owner’ tag to identify owning group
  port 1231               ## port the service is running on
  check_cmd "echo '' | nc localhost 1231 || exit 2"    ## health check shell command
  check_interval '60s'    ## health check execution interval
  template false          ## whether the health check command is a Chef template (for scripts)
  tags ['prod']           ## more tags
end

How do clients consume services?

Clients simply resolve the DNS record they’re interested in… and that’s it. Consul takes care of all the rest, including randomizing the results.

$ host graphite is an alias for relayng.service.consul.
relayng.service.consul has address
relayng.service.consul has address
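The names clients resolve follow Consul’s DNS naming scheme for services, `[tag.]<service>.service[.datacenter].<domain>`. As a small sketch, a helper like the hypothetical `consulDnsName` below composes such names; the `dc1` datacenter name is made up for the example.

```javascript
// Composes a Consul service DNS name: [tag.]<service>.service[.datacenter].<domain>
function consulDnsName(service, { tag, datacenter, domain = "consul" } = {}) {
  const labels = [];
  if (tag) labels.push(tag);               // optional: only instances with this tag
  labels.push(service, "service");
  if (datacenter) labels.push(datacenter); // optional: resolve in another datacenter
  labels.push(domain);
  return labels.join(".");
}

console.log(consulDnsName("graphite"));
// prints "graphite.service.consul"
console.log(consulDnsName("graphite", { tag: "prod", datacenter: "dc1" }));
// prints "prod.graphite.service.dc1.consul"
```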

How does this data reach the DNS?

We’ve chosen to place Consul “behind” our internal DNS servers, and forward all requests for the “consul” domain name to a consul agent running on the DNS servers.

zone "consul" IN {
    type forward;
    forward only;
    forwarders { port 8600; };
};

Note that there are other ways to go about this, such as routing all DNS requests to the local Consul agent running on each node, and having it forward everything “non-Consul” to your DNS servers. There are advantages and disadvantages to each approach. For our current needs, having an agent sit behind the DNS servers works quite well.

Where does the Consul implementation at Outbrain stand now?

At Outbrain we’re already using Consul for:

  • Graphite servers.
  • Hive Thrift servers that are Hive interfaces to the Hadoop cluster they’re running on. Here the Consul CNAME represents the actual Hadoop cluster you want your query to run on. We’ve also added a layer that enables accessing these clusters from different datacenters using Consul’s multi-DC support.
  • Kafka servers.
  • Elasticsearch servers.

And our roadmap for the near future:

  • MySQL slaves – so we can eliminate the use of HAProxy in that path.
  • Cassandra servers where maintaining a list of active nodes in the app configuration becomes stale over time.
  • Prometheus – our new monitoring and alerting system.
  • Zookeeper clusters.


But that’s not all! Stay tuned for more on Consul, client-side load balancing, and making your environment more dynamic.

New Hire LaunchPad

The first few days or weeks of a new hire in the company can be critical. What they say about first impressions surely applies in this case as well – this initial time period, where an engineer needs to understand ‘how things are done here’, is important.

As Outbrain scales its engineering R&D group, it has become evident that the current new hire training program is lagging behind. A new hire usually went through an extensive series of frontal onboarding lectures, where each subject was led by a veteran employee with the proper knowledge. It would usually take a month or so to go through all the lectures, not because there were so many of them, but because they were scattered and opened only when there were enough people (otherwise, they would put unacceptable pressure on the employees giving them).

Hands-on stuff, such as how to write a standard service, how to deploy it, what the standard way to add metrics to your application is, etc. (AKA the important stuff), was usually taught informally, with each team leader doing the best they could under the usual tight time constraints. While this method served us well for quite some time, the recruitment scale – and, in particular, the large number of new hires in a short time – forced us to rethink this process.

What we envisioned was a 2-week bootcamp that each engineering R&D new hire would have to undergo. We considered 3 basic approaches:

  • A frontal lectures class, opened once a month, with people from various teams. This would not be efficient, as the rate of new hires is not consistent, and the learning is not by experience. The load on the instructors is huge, and they don’t always have the time or the teaching skills.

  • By-subject self-learning, consisting of a given list of things to study and learn. While this scales better than the first approach, it still lacks the hands-on experience, which is so important for understanding such a diversified, dynamic and multilayered environment.

  • Tasks-based bootcamp, consisting of a list of incremental tasks (where each new task relies on the actual implementation of the previous task). The bootcampee ends up creating a fully functional, deployed service. The service is a real service, in the sense that it is actually deployed to production, uses key infrastructure systems like any other service etc.

Our choice was the tasks-based bootcamp. In the course of a few weeks, we created a plan (documented in the company’s wiki), consisting of 9 separate units. Each unit has its own page, consisting of 4 sections: the unit’s overview, some general notes, the steps for the unit and a section with tools and applications for this unit. We added a big feedback link to each page, so we can get the users’ feedback and improve accordingly.

The feedback we got is very promising – people can work independently, team leaders can easily scale initial training, and the hands-on experience gained has proved to be extremely important. The bootcamp itself is also used as a knowledge source that people can go back to in order to understand how to do things.

Our next steps, other than constantly improving the current bootcamp, are identifying other subjects that need to be taught (naturally, more advanced and specific ones) and building bootcamp programs along the same lines – hands-on, progress on your own, real-world training.

Angular DRY mocking – Leonardo


This post was written by Sagiv Frenkel.

As developers one of the first and most basic things we learn is “Don’t repeat yourself!”.
That means trying to avoid writing the same code twice – in other words, no copy paste!
While we still sin with the occasional copy paste, it’s something we’re mindful of and is easy to notice. We just have to remember to refactor later on.

But do we treat our mocking the same way?

Let’s look at a typical development flow:

1) Create your UI/UX, services and controller.
2) Create your server API calls.
3) Test your application, manually/automated with self generated data in different scenarios.

What’s wrong with this approach?

We aren’t repeating code, but we are repeating work:

1) Documenting – there’s no good way to tell which user/data to use for which scenario.
2) Running – you need to log in/out to change users or manually change code to fit changes.
3) Testing – error scenarios, edge cases, and request delays/throttling are very hard. Using override scripts or using comments to switch data are the only tools at our disposal.

Can we do better?

Introducing Leonardo

Leonardo is an open source AngularJS module created by Outbrain. It can be installed from npm or Bower, and easily integrates into existing AngularJS applications (more details on Leonardo’s GitHub repo).

Leonardo has a fancy UI where you can easily toggle different states/scenarios.

It enables you to:

1) Centralize your mocking and scenario configuration.
2) Persist the configuration into an external file.
3) Create manual QA or automated tests.

We use Leonardo extensively with protractor. More on this in another post.

Want to get started with Leonardo?

Check this Example to see how you can move from a regular image gallery to a mocked one.

How does Leonardo work?

Leonardo has two important concepts – states and scenarios.


We add states to declare what and how to mock.
There are two types:

Ajax States – This is what we will typically use. We declare the url and verb we wish to mock and what response data we wish to return – including a delay and a status.
[javascript]
leoConfiguration.addStates([
  {
    name: 'flicker-images',
    verb: "jsonp",
    url: '',
    options: [
      {
        name: 'get ninja turtles',
        status: 200,
        data: {
          "items": [
            { "id": "20054214406", "farm": 1, "title": "leo1" },
            { "id": "19896041068", "farm": 1, "title": "017580" }
          ]
        }
      },
      {
        name: 'get ninja enemies',
        status: 200,
        data: {
          "items": [
            { "id": "20058148116", "title": "the_shredder" },
            { "id": "20102720711", "title": "the_ninjas" }
          ]
        }
      }
    ]
  }
]);
[/javascript]

Non Ajax States – This requires more work on the part of the developers. Basically, this allows you to declare a state and (optionally) its underlying data, and you can later check if it’s on or off.
[javascript]
leoConfiguration.addState({
  name: 'Set Mission',
  options: [
    { name: 'turtles', data: "Protect April o'neil" },
    { name: 'shredder', data: 'Destroy the ninja turtles' }
  ]
});
[/javascript]

You can query Leonardo for the value of a certain state.
[javascript]var mission = leoConfiguration.getState('Set Mission'); $rootScope.mission = mission ? : "";[/javascript]

Leonardo triggers an event whenever a state changes.
[javascript]
$rootScope.$on('leonardo:setStates', function(){
  var debug = leoConfiguration.getState('debug');
  $rootScope.debug = !!debug;
});
[/javascript]


Scenarios simply enable you to set a specific set of states as active.
[javascript]leoConfiguration.addState({ name: 'Set Mission', options: [ { name: 'turtles', data: "Protect April o'neil" }, { name: 'shredder', data: 'Destroy the ninja turtles' } ] });[/javascript]
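To illustrate how the two concepts relate, here is a framework-free sketch of a registry in which a scenario is just a named selection of one option per state. This is a conceptual model only: `createMockRegistry` and its methods are hypothetical names and do not describe Leonardo’s actual implementation or API.

```javascript
// Hypothetical, framework-free model of states and scenarios.
// This is not Leonardo's implementation.
function createMockRegistry() {
  const states = new Map();    // state name -> { options, active }
  const scenarios = new Map(); // scenario name -> [{ name, option }]
  return {
    addState({ name, options }) {
      states.set(name, {
        options: new Map(options.map((option) => [option.name, option])),
        active: null, // every state starts switched off
      });
    },
    addScenario({ name, states: selections }) {
      scenarios.set(name, selections);
    },
    // Switches every state off, then activates the options the scenario selects.
    activateScenario(name) {
      for (const state of states.values()) state.active = null;
      for (const { name: stateName, option } of scenarios.get(name)) {
        states.get(stateName).active = option;
      }
    },
    // Returns the active option of a state, or undefined when it is off.
    getState(name) {
      const state = states.get(name);
      if (!state || state.active === null) return undefined;
      return state.options.get(state.active);
    },
  };
}

const registry = createMockRegistry();
registry.addState({
  name: 'Set Mission',
  options: [
    { name: 'turtles', data: "Protect April o'neil" },
    { name: 'shredder', data: 'Destroy the ninja turtles' },
  ],
});
registry.addScenario({
  name: 'bad guys',
  states: [{ name: 'Set Mission', option: 'shredder' }],
});
registry.activateScenario('bad guys');
console.log(registry.getState('Set Mission').data);
// prints "Destroy the ninja turtles"
```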


– We currently only support Angular applications. That is what we initially developed on, and it was easy to implement. If the tool gains traction and popularity, it should be easy to migrate to a more vanilla approach.

Use Leonardo to start mocking http or anything you like! We’d love to get your feedback!

Announcing orchestrator-agent


This post was written by Shlomi Noach

orchestrator-agent is a sidekick project, complementary to orchestrator. It implements a daemon service on one’s MySQL hosts which communicates with, and accepts commands from, orchestrator. It was built with the original purpose of providing an automated solution for provisioning new or corrupted slaves.

It was built by Outbrain, with Outbrain’s specific use case in mind. While we release it as open source, only a small part of its functionality will appeal to the public (this is why it’s not strictly part of the orchestrator project, which is a general purpose, wide-audience solution). Nevertheless, it is a simple implementation of a daemon, one that can be easily extended by the community. The project is open for pull requests!

A quick breakdown of orchestrator-agent is as follows:

  • Executes as a daemon on Linux hosts
  • Interacts with the OS and invokes OS commands (via bash)
  • Does not directly interact with a MySQL server running on that host (does not connect via mysql credentials)
  • Expects a single MySQL service on host
  • Can control the MySQL service (e.g. stop, start)
  • Is familiar with LVM layer on host
  • Can take LVM snapshots, mount snapshots, remove snapshots
  • Is familiar with the MySQL data directory, disk usage, file system
  • Can send snapshot data from a mounted snapshot on a running MySQL host
  • Can prepare data directory and receive snapshot data from another host
  • Recognizes local/remote datacenters
  • Controlled by orchestrator, two orchestrator-agents implement an automated and audited solution for seeding a new/corrupted MySQL host based on a running server.


Announcing Aletheia – A streaming data delivery framework

This post was written by Stas Levin

Outbrain is proud to announce Aletheia, our solution for uniform data delivery and flow monitoring across data producing and consuming subsystems. At Outbrain we have great amounts of data being constantly moved and processed by various real time and batch oriented mechanisms. To allow fast recovery and high SLA, we need to be able to detect problems in our data crunching mechanisms as fast as we can, preferably in near real time. The later problems are detected, the harder it is to investigate them (and thus fix them), and chances of business impact grow rapidly.

To address these issues, we’ve built Aletheia, a framework providing a uniform way to deliver and consume data, with built in monitoring capabilities allowing both producing and consuming sides to report statistics, which can be used to monitor the pipeline state in a timely fashion.


A/B testing @ Outbrain – Wabbit


What Is A/B Testing

A/B testing is a method widely used to validate assumptions about web site optimizations. With A/B tests we can test two configurations, configuration A and configuration B, of a web page design and compare them according to some metrics that define what a successful result is. In other words, you test your new design against the current design and measure which one produces better results. To decide which design is better, you split the traffic to your web page between these two configurations, measure which configuration performed better, and apply that configuration as the default configuration of your site.
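One common way to split traffic is to hash a stable user identifier into a bucket, so that the same user always lands in the same configuration. The sketch below illustrates that general idea only; it is an assumption made for the example, not a description of how Outbrain assigns variants, and the `assignVariant` helper, its parameters and the identifiers passed to it are hypothetical.

```javascript
// FNV-1a style 32-bit hash over the string's UTF-16 code units.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Deterministically maps a user to "A" or "B" for a given test.
// shareOfA is the fraction of traffic (0..1) that should see configuration A.
function assignVariant(userId, testName, shareOfA = 0.5) {
  const bucket = fnv1a(`${testName}:${userId}`) / 0x100000000; // in [0, 1)
  return bucket < shareOfA ? "A" : "B";
}

// The same user always gets the same answer for the same test.
console.log(assignVariant("user-123", "widget-structure") ===
            assignVariant("user-123", "widget-structure"));
// prints "true"
```

Including the test name in the hashed key means that a user’s bucket in one test is not tied to their bucket in another.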


What To Test?

The choice of what to test depends on your goals. At Outbrain each configuration is called an A/B test variant. The idea of Outbrain’s A/B testing is to allow publishers to test two different designs of their widgets, and measure which design had better Click Through Rate (CTR) and Revenue Per 1,000 Impressions (RPM) performance.

In the core of the system there are more than 450 settings that define the configuration of each widget, which is installed on a blog or a group of sites.

There are more than two hundred online settings that directly affect the widget. Each of these settings can be tested within A/B test variants. For example, one of these online settings is called “Widget Structure”. This setting configures the look and feel of the widget.



Widget structure – look and feel of the widget

If your goal is to test an addition of a new widget structure, you can configure the variant A with the new widget structure addition, against variant B that uses the original design of the widget structure and serves as the control group.


When the test comes to an end many questions may come up. How did it affect the customers? Did the new design of the widget structure deliver better CTR and RPM performance? Maybe if we changed the title of the new widget structure it would have resulted in better performance? Maybe if we changed the image size of the old widget structure, it would have resulted in better performance? All of these questions can be answered one by one if we set appropriate A/B test variants.

Even though each A/B test in our system is unique, there are certain widget settings that are usually tested for every variant:

  • Number of paid recommendations
  • Number of organic recommendations
  • Image size in the widget
  • The number of recommendations on the widget unit
  • Widget structure


A/B Tests in Outbrain

Once you have decided that you want to create a new A/B test, you can do so using an internal tool named Wabbit – Widget A/B testing tool. The tool gives you the ability to create a new A/B test, edit an existing one, or pull internal reports with Key Performance Indicator (KPI) performance for the test.

The A/B test can be defined on a specific widget on one site or it can be done on a group of sites that use the same widget.

When the test ends, we pull the A/B test report to measure which configuration had better performance. If the data indicates one of the configurations is an improvement according to our KPIs and the test has experienced enough traffic to be considered significant, we give the option to apply the new configuration as the default for the widget.



  • At Outbrain we recommend running experiments for at least two weeks and no more than a month. The main reason for the minimum is to eliminate the “day of the week” effect, because users who visit the site on the weekend might represent a different segment than those who visit the site during the week.
  • On the other hand, running an A/B test for more than a month leads to unreliable results: for example, cookies expire, so users start seeing different configurations, which compromises the consistency of the test.
  • At Outbrain, we also recommend allocating at least 5% of traffic to an A/B test to increase the probability of ending the test with results that have more than a 90% confidence level based on statistical analysis. Here’s a calculator from KissMetrics that will allow you to easily figure out if your A/B test results are significant.
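As a rough illustration of what such a significance calculation involves, here is a sketch of a standard two-proportion z-test applied to click-through rates. This is a textbook method chosen for the example, not necessarily the analysis Wabbit or the linked calculator performs, and the click and impression counts are made up.

```javascript
// z statistic for the difference between two click-through rates,
// using the pooled estimate of the click probability.
function twoProportionZ(clicksA, impressionsA, clicksB, impressionsB) {
  const rateA = clicksA / impressionsA;
  const rateB = clicksB / impressionsB;
  const pooled = (clicksA + clicksB) / (impressionsA + impressionsB);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / impressionsA + 1 / impressionsB)
  );
  return (rateA - rateB) / standardError;
}

// 1.645 is the two-sided critical value of the standard normal
// distribution at a 90% confidence level.
function isSignificantAt90(z) {
  return Math.abs(z) > 1.645;
}

// Made-up numbers: 10,000 impressions per variant.
console.log(isSignificantAt90(twoProportionZ(120, 10000, 100, 10000))); // prints "false"
console.log(isSignificantAt90(twoProportionZ(150, 10000, 100, 10000))); // prints "true"
```

The first pair of variants differs by 20 clicks, which is within what chance alone could produce at this traffic level; the second differs by 50, which is not.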

The power of promises for file downloading

In this blog post I will implement a file download with a progress indicator using cookies, AngularJS and promises.

Promises are a powerful concept with a number of advantages. In the following implementation, pay attention to these points (you’re more than welcome to comment):

  1. Clarity and readability of code
  2. Error handling
  3. Separation of concerns

I thought of showing the same implementation without promises, but I think anyone who has tried to handle more than one callback and handle the error cases properly will easily see the difference.

The Module

A download button that changes its text at set intervals.
At the end it should be in a success state or an error state.
To complicate things a little and show the power of promises I added another step called “validateBeforeDownload”, this step will call the server to validate the download and fail it if necessary.

See It Live!

Downloading a file

The standard way of downloading a file is with a simple “a” tag with an href.
In order to be able to add the “validateBeforeDownload” step and avoid passing “dom” to a service, I am using an iframe which a service creates and destroys. This will trigger the download, and if the server headers are appropriate the download will begin.

Service Code

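[javascript]
// shared with the progress handling code, which removes the iframe when the download ends
var iframe;

var generateIframeDownload = function(){
  iframe = document.createElement('iframe');
  $cookieStore.put('download_file', 'true');
  iframe.src = '/myserver/dowload';
  iframe.style.display = "none";
  document.body.appendChild(iframe);
}
[/javascript]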

Adding in the progress

Easier said than done! Downloading a file can’t be done with a simple ajax call, so you can’t tell when the download is complete.
The solution I’m using is setting a cookie, let’s call it “download_file” with a timer that checks for a cookie every 500ms.

  • While the cookie exists the loading state is preserved.
  • Once the request completes, the server deletes the cookie and the timer is stopped.

This isn’t the best solution but is simple and doesn’t require sockets or external plugins.

Service Code

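[javascript]
var manageIframeProgress = function(){
  var defer = $q.defer();
  // notify that the download is in progress every half a second / do this for a maximum of 50 intervals
  var promise = $interval(function () {
    if (!$cookieStore.get('download_file')){
      $interval.cancel(promise);
    }
  }, 500, 50);
  // the $interval promise is resolved when all 50 intervals have passed (a timeout)
  // and rejected when it is cancelled (the cookie is gone), hence the swapped callbacks
  promise.then(defer.reject, defer.resolve, defer.notify);
  promise.finally(function () {
    $cookieStore.remove('download_file');
    // "iframe" is the element created by generateIframeDownload
    document.body.removeChild(iframe);
  });
  return defer.promise;
}
[/javascript]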

Java Server

Just to get the full stack of implementation here is the code for handling the response data and the clearing of the cookie.
[java]
public String exportExcel() throws Exception {
    final byte[] bytesToOutput = createExcelReport().toByteArray();
    output = new ByteArrayInputStream(bytesToOutput);
    fileSize = bytesToOutput.length;
    HttpServletResponse response = getResponse();
    Cookie cookie = new Cookie("download_file", "true");
    cookie.setPath("/");
    cookie.setMaxAge(0);
    cookie.setSecure(true);
    response.addCookie(cookie);
    return "exportCsv";
}
[/java]

Wrapping everything together with promises

Pay attention to the comments in the code: some of the code simulates the server requests and responses and is only there for the full picture.


Each visual state of the button is determined by its text ($scope.downloadFileText).


Notice that $timeout mocks an asynchronous call to a server and its response.
This would normally be done with $http.
[javascript]
angular.module("fileDownload").factory("downloadService", function($interval, $timeout, $q, $cookieStore){
  // shared between the two helpers below: created by one, removed by the other
  var iframe;

  var generateIframeDownload = function(){
    iframe = document.createElement('iframe');
    $cookieStore.put('download_file', 'true');
    iframe.src = '/myserver/dowload';
    iframe.style.display = "none";
    document.body.appendChild(iframe);
  }

  var manageIframeProgress = function(){
    var defer = $q.defer();
    // notify that the download is in progress every half a second / do this for a maximum of 50 intervals
    var promise = $interval(function () {
      if (!$cookieStore.get('download_file')){
        $interval.cancel(promise);
      }
    }, 500, 50);
    // the $interval promise is resolved when all 50 intervals have passed (a timeout)
    // and rejected when it is cancelled (the cookie is gone), hence the swapped callbacks
    promise.then(defer.reject, defer.resolve, defer.notify);
    promise.finally(function () {
      $cookieStore.remove('download_file');
      document.body.removeChild(iframe);
    });
    return defer.promise;
  }

  return {
    validateBeforeDownload: function (config) {
      var defer = $q.defer();
      // notify that the download is in progress every half a second
      var notifier = $interval(function (i) {
        defer.notify(i);
      }, 500);
      // mock response from server - this would typically be a $http request and response
      $timeout(function () {
        // stop notifying once the (mocked) server has answered
        $interval.cancel(notifier);
        // in case of error:
        // defer.reject("this file can not be downloaded");
        defer.resolve(config);
      }, 2000);
      return defer.promise;
    },
    downloadFile: function (config) {
      generateIframeDownload();
      var promise = manageIframeProgress();
      // mock response from server - this would be automatically triggered by the file download completion
      $timeout(function(){
        $cookieStore.remove('download_file');
      }, 3000);
      return promise;
    }
  }
});
[/javascript]


This is where our hard work pays off and promises start to shine.

Let’s step into the promise mechanism –
Chaining “downloadService.validateBeforeDownload” to “downloadService.downloadFile” with the “then” method creates a third promise, to which we attach shared callbacks for success, failure and notifications (for the progress).
There is also a finally callback attached to this promise that we use for sharing code between the success and failure.
But the really nice thing here is it also enables handling errors just from the “validateBeforeDownload”, and bubbling them up if needed with $q.reject or by simply throwing the error.

Notice that each step towards completion of the promise reads as if it were handled synchronously, while the actual asynchronicity is handled by the promise mechanism and the service. Magic!
[javascript]
angular.module("fileDownload").controller("downloadCtrl", function($scope, $timeout, $q, downloadService){
  $scope.downloadFile = function(){
    var params = {};
    var loadingText = 'Loading Data';
    var options = ['.', '..', '...'];
    $scope.downloadFileText = loadingText + options[0];

    var promise = downloadService.validateBeforeDownload(params).then(null, function (reason) {
      alert(reason);
      // you can also throw the error
      // throw reason;
      return $q.reject(reason);
    }).then(downloadService.downloadFile).then(function(){
      $scope.downloadFileText = 'Loaded';
    }, function(){
      $scope.downloadFileText = 'Failed';
    }, function(i){
      i = (i+1)%3;
      $scope.downloadFileText = loadingText + options[i];
    });

    promise.finally(function(){
      $timeout(function(){
        delete $scope.downloadFileText;
      }, 2000);
    });
  };
});
[/javascript]
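The error bubbling used in the controller is not specific to $q. As a stripped-down sketch, the same pattern can be written with native JavaScript promises; the `validate`, `download` and `run` functions below are hypothetical stand-ins for the service calls, not part of the module above.

```javascript
// Stand-in for validateBeforeDownload: resolves with a config or rejects.
function validate(ok) {
  return ok
    ? Promise.resolve("config")
    : Promise.reject(new Error("this file cannot be downloaded"));
}

// Stand-in for downloadFile.
function download(config) {
  return Promise.resolve(`downloaded with ${config}`);
}

function run(ok) {
  return validate(ok)
    .then(null, (error) => {
      // Handle the validation error locally (log, alert...), then rethrow
      // so the download step is skipped and the failure bubbles up.
      throw error;
    })
    .then(download)
    .then(
      (result) => `Loaded: ${result}`,
      (error) => `Failed: ${error.message}`
    );
}

run(true).then(console.log);  // prints "Loaded: downloaded with config"
run(false).then(console.log); // prints "Failed: this file cannot be downloaded"
```

Had the first rejection handler returned normally instead of rethrowing, the chain would have treated the error as handled and gone on to call `download`.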

Introducing Orchestrator: manage and visualize your MySQL replication topologies and get home for dinner



This post was written by Shlomi Noach

We’re happy to announce the availability of Outbrain’s Orchestrator, a MySQL replication management & visualization tool.


  • Orchestrator reads your replication topologies (give it one server – be it master or slave – in each topology, and it will reveal the rest).
  • It keeps a state of this topology.
  • It can continuously poll your servers to get an up to date topology map.
  • It visualizes the topology in a clear and slick D3 tree.
  • It allows you to modify your topology; move slaves around. You can use the command line variation, the JSON API, or you can use the web interface.

Quick links: Orchestrator Manual, FAQ, Downloads

Nothing like nice screenshots

To move slaves around the topology (repoint a slave to a different master) through orchestrator’s web interface, we use drag and drop.








Orchestrator keeps you safe. It does so by:

  • Correctly calculating the binary log files & positions (aka coordinates) of the slave you’re moving, its current master and its new master; it properly stops, starts and stalls your replication until everything is in sync.
  • Helping you avoid shooting yourself in the foot. It will not allow moving a slave that uses STATEMENT based replication under a ROW based replication server. Or a 5.5 under a 5.6. Or anything under a server that doesn’t have binary logs, or doesn’t have log_slave_updates. Or if one of the servers involved lags too much. And more…
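
The safety rules in the list above can be pictured as a pre-flight check that runs before any move. The following is only a rough sketch of that idea and not orchestrator's actual code (orchestrator itself is written in Go); the function name, the field names (logBin, logSlaveUpdates, binlogFormat, lagSeconds) and the lag threshold are all assumptions made for illustration.

```javascript
// Hypothetical pre-flight checks before placing `slave` under `newMaster`.
// All names here are illustrative; this is not orchestrator's real data model.
function majorMinor(version) {
  var parts = String(version).split(".");
  return [parseInt(parts[0], 10), parseInt(parts[1], 10)];
}

// true when version a has a lower major.minor than version b (e.g. 5.5 vs 5.6)
function isOlderVersion(a, b) {
  var va = majorMinor(a), vb = majorMinor(b);
  return va[0] < vb[0] || (va[0] === vb[0] && va[1] < vb[1]);
}

// Returns the list of reasons to refuse the move; an empty list means "go ahead".
function reasonsToRefuseMove(slave, newMaster, maxLagSeconds) {
  var reasons = [];
  if (!newMaster.logBin) {
    reasons.push("new master has no binary logs");
  }
  if (!newMaster.logSlaveUpdates) {
    reasons.push("new master does not have log_slave_updates");
  }
  if (slave.binlogFormat === "STATEMENT" && newMaster.binlogFormat === "ROW") {
    reasons.push("STATEMENT based slave under ROW based master");
  }
  if (isOlderVersion(slave.version, newMaster.version)) {
    reasons.push("slave version is older than new master version");
  }
  [slave, newMaster].forEach(function (server) {
    if (server.lagSeconds > maxLagSeconds) {
      reasons.push(server.host + " lags too much");
    }
  });
  return reasons;
}
```

Returning every reason, rather than stopping at the first failed check, lets a web interface or command line tool explain in one go why a drag and drop was rejected.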


It also points out a few problems, visually. While it is not – and will not be – a monitoring tool, it requires some replication status info for its own purposes. Too much lag? Replication not working? Server cannot be accessed? Server under maintenance? This all shows up in your topology. We use it a lot to get a holistic view over our current replication topologies status.


Orchestrator keeps the state of your topologies. Unlike other tools that will drill down from the master and just pick up on whatever’s connected right now, orchestrator remembers what used to be connected, too. If a slave is not replicating at this very moment, that does not mean it’s not part of the topology. Same for a MySQL service that has been temporarily stopped. And this includes all their slaves, if any. Until told otherwise (or until too much time passes and a server is assumed dead), orchestrator keeps the map intact.


Orchestrator supports a maintenance-mode state; it’s a flag saying “this server is in maintenance mode right now”. But this flag includes an owner and a reason for audit purposes. And while a server is under maintenance, orchestrator will disallow replication topology changes that include this server.
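
As a way to picture the maintenance flag described above, here is a minimal in-memory sketch. It assumes a hypothetical MaintenanceRegistry object keyed by a server name; orchestrator's real implementation keeps its state in its MySQL backend and its API is not shown in this post, so every name below is illustrative.

```javascript
// Hypothetical in-memory maintenance registry (illustration only).
function MaintenanceRegistry() {
  this.flags = {};   // serverKey -> { owner, reason, since }
}

// Flag a server as under maintenance; owner and reason are mandatory for auditing.
MaintenanceRegistry.prototype.begin = function (serverKey, owner, reason) {
  if (!owner || !reason) {
    throw new Error("maintenance requires an owner and a reason");
  }
  if (this.flags.hasOwnProperty(serverKey)) {
    throw new Error(serverKey + " is already under maintenance");
  }
  this.flags[serverKey] = { owner: owner, reason: reason, since: new Date() };
};

MaintenanceRegistry.prototype.end = function (serverKey) {
  delete this.flags[serverKey];
};

// Returns the servers (among those involved in a topology change) that block it.
MaintenanceRegistry.prototype.blockers = function (serverKeys) {
  var self = this;
  return serverKeys.filter(function (key) {
    return self.flags.hasOwnProperty(key);
  });
};
```

A topology change would then be refused whenever blockers() returns a non-empty list for the servers it touches, and the stored owner and reason tell the operator whom to ask.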


Operations performed via orchestrator are audited (well, almost all). You have a complete history on what slave has been moved from where to where; what server has been taken under maintenance and when, etc.


The most important thing is of course automating error-prone human sequences of actions. Repointing slaves is a mess (when you don’t have GTIDs). Automation saves time and greatly reduces the chance that something goes wrong (though of course it never eliminates it). We happen to use orchestrator at Outbrain in production, and twice in the past month we had major events where orchestrator saved us many hours and much worry.


Orchestrator supports “standard” replication: log file:pos kind of replication. Non GTID, non-parallel. Good (?) old replication.

Why not GTID? We’re using MySQL 5.5. We’ve had issues while evaluating 5.6; and besides, migrating to GTID is a mess (several solutions or proposed solutions seem to exist). At this time the majority of MySQL users seem to run 5.5, and only a minority of those running 5.6 use GTID (this is according to an unofficial “raise your hands” survey during the last Percona Live event). “Standard” replication still applies to the majority of users. Support for GTID may be added in the future.

Read the FAQ for further questions on supported replication technologies.

How do you like it?

Orchestrator can run as a command line tool (no need for Web). It can serve an HTTP JSON API (no need for visualization), or it can serve as an HTTP web interface (no need to use command line options). Have it your way.

The technology stack

Orchestrator is written in Go, with Martini as web framework; MySQL as backend database; D3, jQuery & bootstrap for frontend.


Orchestrator is released as open source under the Apache 2.0 license and is available at:


Read the Manual


Get the bundled binary+web files tarball, RPM or DEB packages. Or just clone the project. It’s free.


So Long Spring XMLs

Like many Java projects these days, we use Spring at Outbrain for configuring the wiring of our Java dependencies. Spring is a technology that started in order to solve a common, yet not so simple, issue: wiring all the dependencies in a Java project. This was done by utilizing the IoC (Inversion of Control) principles. Today Spring does a lot more than just wiring and bootstrapping, but in this post I will focus mainly on that.

When Spring just started, the only way to configure the wiring of an application was to use XMLs which defined the dependencies between different beans. As Spring continued to develop, 2 more methods were added to configure dependencies: the annotation method and the @Configuration method. At Outbrain we use XML configuration. I found that this method has a lot of pain points, and I found a remedy for them in Spring’s @Configuration.

What is this @Configuration class?

You can think of a @Configuration class just like XML definitions, only defined by code. Using code instead of XMLs allows some advantages over XMLs which made me switch to this method:

  1. No typos – A mistyped class or bean name in an XML file only surfaces at runtime. In code, the same typo just won’t compile
  2. Compile time check (fail fast) – With XMLs it’s possible to add an argument to a bean’s constructor but to forget to inject this argument when defining the bean in the XML. Again, this can’t happen with code. The code just won’t compile
  3. IDE features come for free – Using code allows you to find usages of the bean’s constructor to find out easily the contexts that use it; It allows you to jump back and forth between beans definitions and basically everything you can do with code, you get for free.
  4. Feature flags – At Outbrain we use feature flags a lot. Due to the continuous-deployment culture of the company, code that is pushed to the trunk can find itself in production in a matter of minutes. Sometimes, when developing features, we use feature flags to enable/disable certain features. This is pretty easy to do by defining 2 different implementations of the same interface and deciding which one to load according to the flag. When using XMLs we had to use the alias feature, which makes creating feature flags less than intuitive. With @Configuration, we can use a simple if clause to choose the right implementation.


Introducing Propagator: multi-everything deployment made easy


This post was written by Shlomi Noach.

Outbrain is happy to release its own Propagator as open source. Propagator is a schema & data deployment tool which makes it easy to deploy, review, audit & fix deployments to your database servers.

What does multi-everything mean? It is:

  • Multi-server: push your schema & data changes to multiple instances in parallel
  • Multi-role: different servers have different schemas
  • Multi-environment: recognizes the differences between development, QA, build & production servers
  • Multi-technology: supports MySQL, Hive (Cassandra on the TODO list)
  • Multi-user: gives users authenticated and audited access
  • Multi-planetary: TODO

With dozens of database servers in our company (and these are master database servers), from development machines to testing machines, through build machines to production servers, and with a growing team of over 70 engineers, we faced the growing problem of controlling our database schema evolution. Controlling the creation of tables, columns, keys and foreign keys, and controlling the creation of data that must be consistent across all servers, became an infeasible task. Some changes were lost, some servers were forgotten along the way, and inconsistencies blocked our development & deployments again and again.