Category: Dev Methods

Failure Testing for your private cloud – Introducing GomJabbar


TL;DR Chaos drills can contribute a lot to your services’ resilience, and they’re actually quite a fun activity. We’ve built a tool called GomJabbar to help you run those drills.


Here at Outbrain we manage quite a large-scale deployment of hundreds of services/modules and thousands of hosts. We practice CI/CD, and have implemented quite a sound infrastructure, which we believe is scalable, performant, and resilient. We do, however, experience many production issues on a daily basis, just like any other large-scale organization. You simply can’t ensure a 100% fault-free system. Servers will crash, run out of disk space, and lose connectivity to the network. Software will experience bugs and erroneous conditions. Our job as software engineers is to anticipate these conditions and design our code to handle them gracefully.

For quite a long time we had been looking into ways of improving our resilience and validating our assumptions, using a tool like Netflix’s Chaos Monkey. We also wanted to make sure our alerting system actually triggers when things go wrong. The main problem we faced is that Chaos Monkey was designed to work with public cloud infrastructure, while we maintain our own private cloud.

The main motivation for developing such a tool is that failures have a tendency to occur when you’re least prepared, and at the least desirable time, e.g. Friday nights, when you’re out having a pint with your buddies. Now, to be honest with ourselves, when things fail at inconvenient times, we don’t always roll up our sleeves and dive in to look for the root cause. Many times the incident will end after a service restart, and once the alerts clear we forget about it.

Wouldn’t it be great if we could have “chaos drills”, where we could practice handling failures, test and validate our assumptions, and learn how to improve our infrastructure?

Chaos Drills at Outbrain

We built GomJabbar exactly for the reasons specified above. Once a week, at a well-known time, midday, we randomly select a few targets where we trigger failures. At this point, the system should either auto-detect the failures and auto-heal, or bypass them. In some cases alerts should be triggered to let teams know that manual intervention is required.

After each chaos drill we conduct a quick take-in session for each of the triggered failures, and ask ourselves the following questions:

  1. Did the system handle the failure case correctly?
  2. Was our alerting strategy effective?
  3. Did the team have the knowledge to handle, and troubleshoot the failure?
  4. Was the issue investigated thoroughly?

These take-ins yield extremely valuable insights, which we probably wouldn’t collect any other way.

How did we kick this off?

Before we started running the chaos drills, there were a lot of concerns about the value of such drills and the time they would require. Well, since eliminating our fear of production is one of the key goals of this activity, we had to take care of that first.

"I must not fear.
 Fear is the mind-killer.
 Fear is the little-death that brings total obliteration.
 I will face my fear.
 I will permit it to pass over me and through me.
 And when it has gone past I will turn the inner eye to see its path.
 Where the fear has gone there will be nothing. Only I will remain."

(Litany Against Fear - Frank Herbert - Dune)

We started a series of chats with the teams, in order to understand what was bothering them, and found ways to mitigate those concerns. So here goes:

  • There’s an obvious need to avoid unnecessary damage.
    • We’ve created filters to ensure only approved targets get to participate in the drills.
      This has a side effect of pre-marking areas in the code we need to take care of.
    • We currently schedule drills via statuspage.io, so teams know when to be ready, and if the time is inappropriate,
      we reschedule.
  • When we introduce a new kind of fault, we let everybody know, and explain what they should prepare for in advance.
  • We started out with minor faults like graceful shutdowns, continued to graceless shutdowns,
    and moved on to more interesting tests like faulty network emulation.
  • We’ve measured the time teams spent on these drills, and it turned out to be negligible.
    Most of the time was spent on preparations, for example ensuring we have proper alerting
    and correct resilience features in the clients.
    This is actually something you need to do anyway. At the end of the day, we’ve heard no complaints about interruptions or wasted time.
  • We’ve made sure teams, and engineers on call were not left on their own. We wanted everybody to learn
    from this drill, and when they weren’t sure how to proceed, we jumped in to help. It’s important
    to make everyone feel safe about this drill, and remind everybody that we only want to learn and improve.

All that said, it’s important to remember that we are basically simulating failures that occur on a daily basis. It’s just that when we do it in a controlled manner, it’s easier to observe where our blind spots are, what knowledge we are lacking, and what we need to improve.

Our roadmap – What next?

  • Up until now, this drill has been executed as a semi-automatic procedure. The next level is to let the teams run this drill on a fixed interval, at a well-known time.
  • Add new kinds of failures, like disk space issues, power failures, etc.
  • So far, we were only brave enough to run this on applicative nodes, and there’s no reason to stop there. Data-stores, load-balancers, network switches, and the like are also on our radar in the near future.
  • Multi-target failure injection. For example, inject a failure to a percentage of the instances of some module in a random cluster. Yes, even a full cluster outage should be tested at some point, in case you were asking yourself.

The GomJabbar Internals

GomJabbar is basically an integration between a discovery system, a (fault) command execution scheduler, and your desired configuration. The configuration contains mostly the target filtering rules, and fault commands.

The fault commands are completely up to you. Out of the box we provide the following example commands (but you can really write your own scripts to do whatever suits your platform, needs, and architecture):

  • Graceful shutdowns of service instances.
  • Graceless shutdowns of service instances.
  • Faulty Network Emulation (high latency, and packet-loss).

Upon startup, GomJabbar drills down via the discovery system, fetches the clusters, modules, and their instances, and passes each through the filters provided in the configuration files. This process is also performed periodically. We currently support discovery via Consul, but adding other methods of discovery is quite trivial.

When a user wishes to trigger faults, GomJabbar selects a random target and returns it to the user, along with a token that identifies this target. The user can then trigger one of the configured fault commands, or scripts, on the random target. At this point GomJabbar uses the configured CommandExecutor in order to execute the remote commands on the target hosts.
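
To make this flow concrete, here is a minimal, illustrative Java sketch of the discover, filter, pick-a-random-target, execute-command loop. It is not GomJabbar’s actual API; all type and method names here are hypothetical.

import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

public class ChaosDrillSketch {

    // Hypothetical collaborators; in GomJabbar terms they map roughly to the
    // discovery system, the configured target filters, and the CommandExecutor.
    interface Discovery { List<Target> fetchTargets(); }            // e.g. backed by Consul
    interface TargetFilter { boolean isApproved(Target target); }   // configured filtering rules
    interface CommandExecutor { void execute(Target target, String faultCommand); }

    static final class Target {
        final String cluster, module, host;
        Target(String cluster, String module, String host) {
            this.cluster = cluster; this.module = module; this.host = host;
        }
    }

    private final Discovery discovery;
    private final TargetFilter filter;
    private final CommandExecutor executor;
    private final Random random = new Random();

    ChaosDrillSketch(Discovery discovery, TargetFilter filter, CommandExecutor executor) {
        this.discovery = discovery;
        this.filter = filter;
        this.executor = executor;
    }

    // Pick a random approved target and run the given fault command on it.
    public Target triggerFault(String faultCommand) {
        List<Target> approved = discovery.fetchTargets().stream()
                .filter(filter::isApproved)
                .collect(Collectors.toList());
        if (approved.isEmpty()) {
            throw new IllegalStateException("no approved targets to run the drill on");
        }
        Target target = approved.get(random.nextInt(approved.size()));
        executor.execute(target, faultCommand);   // e.g. run a graceless-shutdown script remotely
        return target;
    }
}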

GomJabbar also maintains an audit log of all executions, which allows you to revert quickly in the face of a real production issue, or an unexpected catastrophe caused by this tool.

What have we learned so far?

If you’ve read this far, you may be asking yourself: what’s in it for me? What kind of lessons can I learn from these drills?

We’ve actually found and fixed many issues by running these drills, and here’s what we can share:

  1. We had broken monitoring and alerting around the detection of the integrity of our production environment. We wanted to make sure that everything that runs in our data centers is managed and in a well-known state (version, health, etc.). We found that we didn’t compute the difference between the desired state and the actual state properly, due to reliance on bogus data sources. This sort of bug attacked us from two sides: once when we triggered graceful shutdowns, and once with graceless shutdowns.
  2. We’ve found services that had no owner, became obsolete, and were basically running unattended in production. The horror.
  3. During the faulty network emulations, we’ve found that we had clients that didn’t implement proper resilience features, and caused cascading failures in the consumers several layers up our service stack. We’ve also noticed that in some cases, the high latency also cascaded. This was fixed by adding proper timeouts, double-dispatch, and circuit-breakers.
  4. We’ve also found that these drills motivated developers to improve their knowledge about the metrics we expose, logs, and the troubleshooting tools we provide.

Conclusion

We’ve found the chaos drills to be an incredibly useful technique, which helps us improve our resilience and integrity, while helping everybody learn about how things work. We’re by no means anywhere near perfection. We’re actually pretty sure we’ll find many, many more issues we need to take care of. We’re hoping this exciting new tool will help us move to the next level, and we hope you find it useful too 😉

Effective Testing with Loan Pattern in Scala

Tests are crucial in systems that rely on CI/CD as part of their release cycle. One of the challenges is to write stable tests that work for you without spending a lot of time on maintaining bad tests.

Tests are Hard

They’re hard to write, hard to maintain, and it’s even harder to stabilize a flaky test. At Outbrain, we take special pride in our ability (for the most part) to deliver new features to production with the confidence that only reliable tests can give you. These tests play a crucial role in our ability to deliver fast, good and stable code, making sure no regression bugs are introduced in the process. It is crucial, then, to not only maintain good test suites (unit, integration, and e2e) but also to fix any test that misbehaves (flaky tests).

We have a special environment to facilitate integration and e2e tests called the simulation environment (it is only one of a set of tools we have for that purpose). This is a dedicated set of servers which we use to simulate our production environment. We deploy every new version of our services to that environment before we deploy to production, and run tests that check new flows of code, regression, and interoperability with other services.

In order to write an effective test for a new feature, we sometimes need to set up the environment with entities that are required by the feature we’re testing. If, for example, our new feature is to register a car to an owner (a Person entity), then before running the tests we need the required entities, a Car and a Person, in our database. We’re not trying to test a flow for creating a new car, or a new person, in this scenario. Therefore there is no need to create the car and/or the person entities explicitly in the test before the actual test scenario happens. And in order to make our tests as clear and succinct as possible — we don’t want to be creating this data explicitly in each and every test.

 

Bad Practices

So, it was a common practice (albeit a bad one) to have pre-existing data on which we would rely to run tests (for the whole simulation environment!). This led to two big (interconnected) problems:

  1. No test isolation – a test mistakenly deleting some or all of the pre-existing data, for example, would do so for all the tests that run in that environment
  2. Flaky tests – tests running concurrently are creating, deleting and generally changing data that affects other tests, which in turn fail for no good reason — making it really hard to analyze and fix a failing test

 

We tackled this problem by creating the needed data before the tests in a test class and deleting it after the test run. This mitigated the problem somewhat — but the tests in the same class were still interconnected, and it added boilerplate to the test class. Now, a test class looked something like this (assuming these are entities autogenerated by Scalike for the relevant tables):

ScalaTest:

class MyTestClass extends WordSpec with BeforeAndAfterAll {

  val person = Person(carId = None).save() // creating data for test
  val car = Car(color = "green", ownerId = None).save()

  override def afterAll(): Unit = {
    // Clean db after test.
    person.destroy()
    car.destroy()
  }

  "My service" should {
    "set the owner of the car" in {
      val service = new Service

      service.setCarOwner(carId = car.id, ownerId = person.id)

      service.getPerson(person.id).carId shouldEqual Some(car.id)
    }
  }
}

Specs2:

class MyTestClass extends SpecificationWithJUnit with AfterAll {
  val person = Person(carId = None).save() // creating data for test
  val car = Car(color = "green", ownerId = None).save()

  override def afterAll(): Unit = {
    // Clean db after test.
    person.destroy()
    car.destroy()
  }

  "My service" should {
    "set the owner of the car" in {
      val service = new Service

      service.setCarOwner(carId = car.id, ownerId = person.id)

      service.getPerson(person.id).carId must be equalTo Some(car.id) 
    }
  }
}

Looking at this, we were presented with a challenge. First, the data is created once for all the tests that run in a class, and must be deleted only after all tests have finished running — this means that the tests are not isolated from one another and may potentially become flaky. Second, we wanted an elegant way of creating and deleting the needed entities seamlessly, in order to minimize the boilerplate for each test class.

Note: It is possible, however, in Specs2, to build a better solution by using the ‘Scope’ trait, like so:

trait Context extends Scope with After {
  val person = Person(carId = None).save() // creating data for test
  val car = Car(color = "green", ownerId = None).save()

  override def after: Any = {
    // Clean db after test.
    person.destroy()
    car.destroy()
  }
}

And using it in a test like so:

class MyTestClass extends SpecificationWithJUnit {
  "My service" should {
    "set the owner of the car" in new Context {
      val service = new Service

      service.setCarOwner(carId = car.id, ownerId = person.id)

      service.getPerson(person.id).carId must be equalTo Some(car.id)
    }
  }
}


It’s a good solution, but for a simpler problem than the one we faced. We needed the tests to run in a single transaction, with a supplied session and a configurable DB name (indicating a set of Scalike connection parameters).

Enter Loan Pattern

We first encountered this pattern when using ScalaTest and quickly moved to using it also in Specs2 (as most of our tests are written in Specs2). From ScalaTest documentation for Sharing fixtures:

“A test fixture is composed of the objects and other artifacts (files, sockets, database connections, etc.) tests use to do their work. When multiple tests need to work with the same fixtures, it is important to try and avoid duplicating the fixture code across those tests.”
“If you need to both pass a fixture object into a test and perform cleanup at the end of the test, you’ll need to use the loan pattern”

This means we can use fixtures to set up ‘artifacts’ for the tests to use, promoting the DRY principle by minimizing code duplication. It is also a good way to reduce boilerplate when writing tests. So, we wrote this one simple trait:

ScalaTest:

import scala.util.Random

trait TestDataSupport extends DefaultGenerator {

  def withTestData(testCode: DefaultObjects => Any): Unit = {
    val testData = createTestData()
    try {
      testCode(testData) // "loan" the fixture to the test
    }
    finally clearTestData(testData) // clean up the fixture
  }

  private def createTestData(): DefaultObjects = {
    DefaultObjects.create(name = Random.alphanumeric.take(10).mkString)
  }

  private def clearTestData(testData: DefaultObjects): Unit = {
    testData.cleanup
  }
}

Let’s go over what’s happening in this trait. We mix in a custom trait called ‘DefaultGenerator’, which gives us ‘DefaultObjects’ — the entities we need pre-created for our tests to run. We have two private methods: one that calls ‘create’ on ‘DefaultObjects’ with a custom name to generate the needed entities, and another that calls ‘cleanup’ on the test data to clean the environment after the test has finished running. And the star of this trait, the method (or fixture, if you will) ‘withTestData’, gets the test function as a parameter, calls the private method ‘createTestData’, calls the test, passing it the data we just generated, and finally cleans up the generated data after the test finishes.

When mixing this trait in our test class, we get the following code:

class MyTestClass extends WordSpec with TestDataSupport {

  "My service" should {
    "set the owner of the car" in withTestData { testData =>
      val service = new Service

      service.setCarOwner(carId = testData.car.id, ownerId = testData.person.id)

      service.getPerson(testData.person.id).carId shouldEqual Some(testData.car.id) 
    }
  }
}


‘testData’ is the data generated in our ‘withTestData’ method (a car and a person in our case).

The Specs2 version of the Loan Pattern is a bit more complex, as we’ve added some more bells and whistles to make it easier to create those entities in our domain. We’re using Scalike to create the entities in a MySQL database, and we need somewhat more refined control over the session we’re using, the DB name, etc.

Specs2:

trait DataContextName {
  def className: String
}
trait DataContextDbName {
  val dbName: Symbol = 'default
}
import scalikejdbc.{DBSession, NamedDB}

package object testdata {
  private[testdata] implicit class NamedDbSession(namedDB: NamedDB) {
    def withSession[A](session: DBSession)(execution: DBSession => A): A =
      execution(session)
  }
}
import org.specs2.execute.{AsResult, Result}
import org.specs2.specification.ForEach
import scalikejdbc.{DB, DBSession, NamedDB}

import scala.util.Try

trait DefaultDataContext extends ForEach[DefaultObjects] 
  with DataContextDbName with DataContextName with DefaultGenerator {
  
  implicit lazy val session: DBSession = DB.autoCommitSession()

  override def foreach[R](f: (DefaultObjects) => R)(implicit evidence$3: AsResult[R]): Result =
    NamedDB(dbName).withSession(session) { implicit session: DBSession =>

      val testData = DefaultObjects.create(name = randomName(className))

      val result = Try {
        AsResult(f(testData))
      }

      testData.cleanup(session)
      result.get
    }
}

It’s very similar to the ScalaTest flavor, but with several changes we needed to make to better facilitate our needs in the Specs2 tests. We have a mechanism to initialize a named DB connection, with a named connection pool and an explicit session. Besides these additions, it’s pretty similar to ScalaTest — generate the test data, run the test and clean the generated data.

The test class now looks like this:

trait TestClassDataContext extends DefaultDataContext {
  override val dbName: Symbol = TestClassConnectionPoolName
  def className: String = "Test class name"
}
class MyTestClass extends SpecificationWithJUnit with TestClassDataContext {
  "My service" should {
    "set the owner of the car" >> { testData: DefaultObjects =>
      val service = new Service

      service.setCarOwner(carId = testData.car.id, ownerId = testData.person.id)

      service.getPerson(testData.person.id).carId shouldEqual Some(testData.car.id) 
    }
  }
}

Summary

We tackled several issues our team faced on a day to day basis, which made our simulation environment unstable, hard to maintain and generally very frustrating to work on. By extracting data generation and cleanup to an external trait and using a clever mechanism to reduce boilerplate, we managed to clean and simplify the test class, reduce code duplication and generally made our lives easier. Tests are still hard, but a bit easier to write and nicer to read. What do you think?

Automating your workflow

During development, there are many occasions where we have to do things that are not directly related to the feature we are working on, or things that are repetitive and recurring.
Over the time span of a feature’s development this can often take as much time as the actual development.

For instance, updating your local dev microservices environment before testing your code. This task on its own, which usually includes updating your local repo version, building and starting several services, and many times debugging and fixing issues caused by others, can take hours, often just to test a simple procedure.

We are developers; we spend every day automating and improving other people’s workflows, yet we often spend so many hours doing the same time-consuming tasks over and over again.
So why not build the tools we need to automate our own workflows?

In our team we decided to build a few tools to help out with some extra irritating tasks we were constantly complaining about to each other.

First one was simple, creating a slush sub-generator. For those of you who don’t know, slush is a scaffolding tool, like yeoman but for gulp. We used this to create our Angular components.
Each time we needed to make a new component we had to create a new folder, with three files:


  • Comp.component.ts
  • Comp.jade
  • Comp.less

Each file of course has its own predefined internal structure, and each component had to be registered in the app module and the main less file.

This was obviously extremely annoying to redo each time, so we automated it. Now each time you run “ob-genie” from the terminal, you are asked the name of your component and what module to register it with, and the rest happens on its own. We did this for services and directives too.

Other than saving a lot of time and frustration, this had an interesting side effect – people on the team were creating more components than before! This was good because it resulted in better separation of code and better readability. It seems that many times developers were simply too lazy to create a new component and just chucked it all in together. By the way, Angular CLI has since added a similar capability – guess great minds think alike.

Another case we took on in our team was to rid ourselves of the painstaking task of setting up the local environment. This I must say was a real pain point. Updating the repo, building and running the services we needed each time could take hours, assuming everything went well.
There have been times where I spent days on this just to test the simplest of procedures.
Often I admit, I simply pushed my code to a test environment and debugged it there.
So we decided to build a proxy server to channel all local requests to the test environment.

For this we used node-proxy, a very easy to configure proxy. However, this was still not an easy task, since each company has very specific configuration issues we had to work with.
One thing that was missing was proper routing capabilities. Since you want some requests to go to the local environment and some to the remote one, we added this check before each request:

https.createServer(credentials, function (req, res) {
  // Find the first routing-table entry whose regex matches the request URL,
  // and proxy the request to that entry's target.
  Object.keys(options.routingTable).some(function (key) {
    const regX = new RegExp(key);
    if (regX.test(req.url)) {
      printMe(req.url + ' => ' + (options.routingTable[key].targetName || options.routingTable[key].target));
      proxy.web(req, res, options.routingTable[key]);
      return true;
    }
    return false;
  });
}).listen(options.home_port);

We passed as an option the routing table with a regex for each path, making it easy to configure which requests to proxy out, and which in.

routingTable = {
  'site': local,
  '^/static': local,
  '/*/': remote
};

Another hurdle was working with HTTPS, since our remote environments work over HTTPS.
In order to adhere to this we needed to create an SSL certificate for our proxy and set the requestCert parameter in our proxy server to false, so that the client certificate doesn’t get validated.

The end configuration should look something like this:

const local = {
    targetName: 'local',
    target: 'https://localhost:4141',
    changeOrigin: true,
    secure: false
  },
  remote = {
    targetName: 'remote',
    requestCert: false,
    rejectUnauthorized: false,
    target: 'https://test.outbrain.com:8181',
    secure: false,
    changeOrigin: true,
    autoRewrite: true
  },
  routingTable = {
    'site': local,
    '^/static': local,
    '/*/': remote
  };

const options = {
  routingTable: routingTable,
  home_port: 2109,
  debug: true,
  startPath: 'amplify/site/'
};

With this you should be able to run locally and route all needed calls to the test environment when working on localhost:2109.

So to conclude, be lazy, make your work easier, and use the skills you have to automate your workflows as much as possible.

Kibana for Funnel Analysis

How we use Kibana (4) for user-acquisition funnel analysis

Outbrain has recently launched a direct-to-consumer (D2C) initiative. Our first product is a chatbot. As with every D2C product, acquiring users is important. Therefore, optimizing the acquisition channel is also important. The basis of our optimization is analysis.

kbfunnel-image01

Our Solution (General Architecture)

Our acquisition funnel spans 2 platforms (2 web pages and a chatbot). Passing many parameters between platforms can be a challenge, so we chose a more stateful, server-based model. The client requests a new session Id, together with basic data like IP and User-Agent. The server stores a session (we use Cassandra in this case) with processed fields like Platform, OS, Country, Referral, and User Id. At a later stage the client reports a funnel event for a session Id. The server then writes all known fields for the session into 2 storages:

  • ElasticSearch for quick & recent analytics (Using the standard ELK stack)
  • Hadoop for long term storage and offline reports

A few example fields stored per event (a rough sketch of such an event record follows the list):

  • User Id – A unique & anonymous identifier for a user
  • Session Id – The session Id is the only parameter passed between funnel steps
  • Event Type – The specific step in the funnel – serve, view, click
  • User Agent – Broken down to Platform and OS
  • Location – Based on IP
  • Referral fields – Information on the context in which the funnel is exercised
  • A/B Test variants – The A/B Test variant Ids that are included in the session
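
As a rough illustration, the event record written per funnel step might look like the following Java sketch; the type and field names are hypothetical, not the actual schema.

import java.util.List;

// A hypothetical sketch of the funnel event record written to ElasticSearch/Hadoop.
public class FunnelEvent {
    public String userId;                 // unique & anonymous user identifier
    public String sessionId;              // the only parameter passed between funnel steps
    public String eventType;              // the funnel step: serve, view, click
    public String platform;               // broken down from the User-Agent
    public String os;                     // broken down from the User-Agent
    public String country;                // resolved from the request IP
    public String referral;               // context in which the funnel is exercised
    public List<String> abTestVariantIds; // A/B test variant Ids included in the session
    public long timestampMillis;          // event time
}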

Goal of the Analysis: Display most important metrics quickly

Kibana plugin #1: Displaying percent metric

Kibana has several ways of displaying a fraction, but none excels at displaying small numbers (a pie chart can be used to visualize fractions, but small slices are hard to read). We developed a Kibana plugin for displaying a single metric in percent format.

kbfunnel-image00

We use this visualization for displaying the conversion rate of the most interesting part of our funnel.

Kibana plugin #2: Displaying the funnel

We couldn’t find a good way to display a funnel, so we developed a visualization plugin ourselves (honestly, we were eager to develop this, so we did not scan the entire internet…)

Based on the great D3 Funnel by Jake Zatecky, this is a Kibana plugin that displays buckets of events in funnel format. It’s customizable and open source. Feel free to use it…

kbfunnel-image02

Putting it all together

Displaying your most important metrics and the full funnel is nice. Comparing variant A with variant B is very nice. We’ve set up our dashboard to show similar key metrics for 2 versions of the funnel. We always try to run at least 1 A/B test, and this dashboard shows us real-time results of our tests.

kbfunnel-image04

Cherry on top

Timelion is awesome. If you’re not using it, I suggest trying it.

Viewing your most important metrics over time is very useful, especially when you’re making changes fast. Here’s an example:

kbfunnel-image03

Summary

We track a user’s activity by sending events to the server. The server writes these events to ES and Hadoop. We developed 2 Kibana plugins to visualize the most important metrics of our user-acquisition funnel. We can filter the funnel by Platform, Country, OS, Time, Referral, or any other fields we bothered to save. In addition, we always filter by A/B Test variants and compare 2 specific variants.

Micro Service Split

image07

In this post I will describe a technical methodology we used to remove a piece of functionality from a Monolith/Service and turn it into a Micro-Service. I will try to reason about some of the decisions we made and the path we took, as well as give a more detailed description of the internal tools, libraries and frameworks we use at Outbrain and in our team, to shed some light on the way we work. And as a bonus you might learn from our mistakes!

Let’s start with a description of the original service and what it does.

Outbrain runs the largest content discovery platform. From the web surfer’s perspective it means serving a recommended content list that might interest her, in the form of ‘You might also like’ links. Some of those impression links are sponsored, i.e. when she clicks on a link, someone is paying for that click, and the revenue is shared between Outbrain and the owner of the page with the link on it. That is how Outbrain makes its revenue.

My team, among other things, is responsible for routing the user to the requested page after pressing the link, and for the bookkeeping and accounting that is required in order to calculate the cost of the click, who should be charged, etc.

In our case the service we are trying to split is the ‘Bookkeeper’. Its primary role is to manage the paid impression links budget. After a budget is spent, the ‘Bookkeeper’ should notify Outbrain’s backend servers to refrain from showing the impression link again. And this has to be done as fast as possible. If not, people will click on links we cannot charge for because the budget has already been spent. Technically, this is done by an update to a database record. However, there are other cases where we might want to stop the exposure of impression links. One such example is a request from the customer paying for the future click to disable the impression link exposure. For such cases we have an API endpoint that does exactly the same thing with the same code. That endpoint is actually part of the ‘Bookkeeper’, enabled by a feature toggle on specific machines. This ‘Activate-Impressionable’ endpoint, as we call it, is what was decided to split out of the ‘Bookkeeper’ into a designated Micro-Service.

In order to execute the split, we chose a step-by-step strategy that allowed us to reduce the risk during execution and keep it as controlled and reversible as possible. From a bird’s eye view I would describe it as a three-step process: Plan, Up and Running as fast as possible, and Refactor. The rest of the post describes these steps.

Plan (The who’s and why’s)

In my opinion this is the most important step. You don’t want to split a service just for the sake of splitting. Each Micro Service introduces maintenance and management overhead, with its own set of challenges[1]. On the other hand, Microservices architecture is known for its benefits, such as code maintainability (for each Micro Service), the ability to scale out, and improved resilience[2].

Luckily for me, someone had already done that part and taken the decision that ‘Activate-Impressionable’ should be split from the ‘Bookkeeper’. But still, let’s name some of the key factors of our planning step.

Basically I would say that a good separation is a logical separation, with its own non-overlapping RESTful endpoints and an isolated code base. The logical separation should be clear. You should think about what the functionality of the new service is, and how isolated it is. It is possible to analyze the code for inter-dependencies among classes and packages using tools such as Lattix. The bottom line is that it is important to have a clear definition of the responsibility of the new Micro Service.

In our case, the ‘Bookkeeper’ was eventually split so that it remained the bigger component, ‘Activate-Impressionable’ was smaller, and the common library was smaller than both. The exact lines of code can be seen in the table below.

[Screenshot: lines-of-code breakdown of the ‘Bookkeeper’, ‘Activate-Impressionable’ and the common library]

Unfortunately I assessed this only after the split, and not in the planning step. Looking at those numbers, we might say that there is too much code in common. It is something worth considering when deciding what to split: a lot of common code implies a low isolation level.

Of course part of the planning is time estimation. Although I am not a big fan of “guesstimates”, I can tell that the task was planned for a couple of weeks and took about that long.

Now that we have a plan, let’s get to work.

Up and Running – A Step by Step guide

As in every good refactor, we want to work in small baby steps and remain ‘green’ all the time[3]. With continuous deployment that means we can, and do, deploy to production as often as possible to make sure it is ‘business as usual’. In this step we want to get to a point where the new service is working in parallel with the original service. At the end of this step we will point our load balancers to the new service endpoints. In addition, the code remains shared in this step, which means we can always deploy the original, fully functioning ‘Bookkeeper’. We actually do that if we feel the latest changes carry any risk.

So let’s break it down into the actual phases:

  1. Starting phase.
  2. Create the new empty Micro-Service ‘Activate-Impressionable’. At Outbrain we do it using the scaffolding of the ob1k framework. Ob1k is an open-source Micro Services framework that was developed in-house.
  3. Create a new empty library, depended on by both the new ‘Activate-Impressionable’ service and the ‘Bookkeeper’. Ideally, if there is a full logic separation with no mutual code between the services, that library will be deleted in the cleanup phase.
  4. Move the relevant source code to the library. Luckily in our case, there was one directory that was clearly what we had to split out. Unluckily, that code also pulled in some more code it was dependent on, and this had to be done carefully so as not to pull too much nor too little. The good news is that this phase is pretty safe for statically typed languages such as Java, in which our service is written. The compiler protects us here with compilation errors, so the feedback loop is very short. Tip: don’t forget to move unit tests as well.
  5. Move common resources to the library, such as Spring beans defined in XML files and our feature-flag files defined in YAML files. This is the dangerous part. We don’t have the compiler here to help, so we actually test it in production. And when I say production I mean using staging/canary/any environment with production configuration but without real impact. Luckily again, both YAML and Spring beans are configured to fail fast, so if we did something wrong it will just blow up in our face and the service will refuse to come up. For this step I even ended up developing a one-liner bash script to assist with those wicked YAML files.
  6. Copy and edit web resources (web.xml) to define the service endpoints. In our case web.xml can’t reside in a library, so it had to be copied. Remember, we still want the endpoints active in the ‘Bookkeeper’ at this phase. Lesson learned: inspect all files closely. In our case log4j.xml, which seems like an innocent file by its name, contains designated appenders that are consumed by other production services. I didn’t notice that and didn’t move the required appender, and it was found only later in production.
  7. Deploy – deploy the new service to production. What we did is deploy ‘Activate-Impressionable’ side by side on the same machines as the ‘Bookkeeper’, just with different ports and context path. Definitely makes you sleep better at night.
  8. Up and running – now is a good time to test once again that both the ‘Bookkeeper’ and ‘Activate-Impressionable’ are working as expected. Generally we are now up and running, with only a few more things to do.
  9. Clients redirect – point clients of the service to the new endpoints (port + context path). A step that might take some time, depending on the number of clients and the ability to redeploy them. At Outbrain we use HAProxy, so reconfiguring it did most of the work, but some clients did require code modifications.
  10. (More) validation – move/copy simulator tests and monitors. In our team, we rely heavily on tests we call simulator tests. These are actually black-box tests written in JUnit that run against the service installed on a designated machine. These tests see the service as a black box and call its endpoints while mocking/emulating other services and data in the database for the test run. So a test run usually looks like: put something in the database, trigger the endpoint, and see the result in the database or in the HTTP response. There is also a question here of whether to test ‘Activate-Impressionable’ or the ‘Bookkeeper’. Ideally you will test them both (tests are duplicated for that phase), and that is what we did.

 

Refactor, Disconnect & Cleanup

When we got here the new service is working and we should expect no more behaviour changes from the endpoints point of view. But we still want the code to be fully split and the services to be independent from each other. In the previous step we performed the phases in a way that everything remains reversible with a simple feature toggle & deploy.

In this step we move to a state where the ‘Bookkeeper’ no longer hosts the ‘Activate-Impressionable’ functionality. Sometimes it is a good idea to leave a gap after the previous step, to make sure that there are no problems or backfires that we didn’t catch in our tests and monitoring.

The first thing, if it was not done up until now, is deploying the ‘Bookkeeper’ without the service functionality and making sure everything is still working. And waiting a little bit more…

Now we just have to push the sources and the resources from the library into the ‘Activate-Impressionable’ service. In the ideal case, where there is no common code, we can also delete the library. This was not so in our case; we still have a lot of common code we can’t separate for the time being.

Now is also the time to do resource cleanup, web.xml edits, etc.

And for the bold and OCD among us – package renames and refactoring of the code to the new service’s naming conventions.

Conclusion

image02
The entire process in our case took a couple of weeks. Part of the fun and advantage of such a process is the opportunity to get to know old code better, its structure and functionality, without the need to modify it for a new feature and its constraints. Especially when someone else wrote it originally.

In order to perform such a process well, it is important to plan and remain organized and on track. In case of a context switch it is very important to keep a bookmark of where you need to return to in order to continue. In our team we even did that with a handoff of the task between developers. Extreme Programming, it is.

It is interesting to see the surprising results in terms of lines of code. Originally we thought of it as splitting a micro-service off a monolith. In retrospect, it looks to me more like splitting a service into two services. ‘Micro’ in this case is in the eye of the beholder.

References

[1] http://highscalability.com/blog/2014/4/8/microservices-not-a-free-lunch.html
[2] http://eugenedvorkin.com/seven-micro-services-architecture-advantages/
[3] http://blog.cleancoder.com/uncle-bob/2014/12/17/TheCyclesOfTDD.html

http://martinfowler.com/articles/microservices.html
https://github.com/outbrain/ob1k
http://www.yaml.org/
http://lattix.com/

DevOps – The Outbrain Way

Like many other fast-moving companies, at Outbrain we have tried several iterations in the attempt to find the most effective “DevOps” model for us. As expected with any such effort, the road has been bumpy and there have been many “lessons learned” along the way. As of today, we feel that we have had some major successes in refining this model, and would like to share some of our insights from our journey.

 

Why get Dev and Ops together in the first place?

A lot has been written on this topic, and the motivations and benefits of adding the operational perspective into the development cycle have been thoroughly discussed in the industry – so we will not repeat those.

I would just say that we look at these efforts as preventive medicine, like eating well and exercising – life is better when you stay healthy. It’s not as good when you get sick and seek medical treatment to get healthy again.

 

What’s in a name?

We do not believe in the term “DevOps” and what it represents. We try hard to avoid it – why is that?

Because we expect every Operations engineer to have development understanding and skills, and every developer to have an operational understanding of how the service he/she develops works, and we help them achieve and improve those skills – so everyone is DevOps.

We do believe there is a need to get more system and production skills and expertise closer to the development cycles – so we call it Production Engineers.

 

First try – Failed!

We started by assigning Operations Engineers to work with dedicated development groups – the big problem was that this was done on top of their previous responsibility of building the overall infrastructure (config management, monitoring infrastructure, network architecture, etc.), which was already a full-time job as it was.

This mainly led to frustration on both sides – the Operations Engineers, who felt they had no time to do anything properly, just scratching the surface all the time and spread too thin; and the developers, who felt they were not getting enough bandwidth from Operations and were being held back.

Conclusion – in order to succeed we need to go all in – have dedicated resources!

 

Round 2 – Dedicated Production Eng.

Not giving up on the concept, and learning from round 1, we decided to create a new role – “Production Engineer” (or PE for short) – dedicated to specific development groups.

This dedication manifests itself at different levels. Some of them are semi-trivial aspects, like seating arrangements – having the PE sit with the development team and share the day-to-day experience with them; and some of them are focus-oriented, like joining the development team’s goals and actually becoming an integral part of the development team.

On the other hand, the PE needs to keep a very close relationship with the Infrastructure Operations team, which continues to develop the infrastructure and tools to be used by the PEs, and supports the PEs with technical expertise on more complex issues that require subject-matter experts.

 

What & How model:

So how do we prevent a split-brain situation for the PE? Is the PE part of the development team or the Operations team? When you have several PEs supporting different development groups – are they all independent, or can we gain from knowledge transfer between them?

In order to have a lighthouse to help us answer all those questions, and more that would evidently come up, we came up with the “What & How” model:

“What” – stands for the goals, priorities and what needs to be achieved. “The what” is set by the development team management (as they know best what they need to deliver).

“How” – stands for which methods, technologies and processes should be used to achieve those goals most efficiently from an operational perspective. This technical, subject-matter guidance is provided by the operations side of the house.

 

So what is a PE @ Outbrain?

At the first stage, an Operations Engineer goes through an onboarding period, during which the engineer gains an understanding of Outbrain’s operational infrastructure. Once this engineer has gained enough mileage, he/she can become a PE, joining a development group and working with them to achieve the development goals, set the “right” way from an operational perspective, properly leveraging the Outbrain infrastructure and tools.

The PE enjoys both worlds – keeping a presence in the Operations group and his/her technical expertise on one hand, and being an integral part of the development team on the other.

From a higher-level perspective – we have eliminated the frustration points experienced in our first round of “DevOps” implementation, and are gaining the benefits of a close relationship and a better understanding of needs and tools between the different development groups and the general Operations group. By the way, we have also gained a new career development path for our Operations Engineers and Production Engineers, who can move between those roles and enjoy different types of challenges and lifestyles.

 


Real Time Performance Monitoring @ Outbrain

Outbrain serves millions of requests per minute, based on a micro-service architecture. Consequently, as you might expect, visibility and performance monitoring are crucial.

Serving millions of requests per minute, across multiple data centers, in a micro services environment, is not an easy task. Every request is routed to many applications, and may potentially stall or fail at every step in the flow. Identifying bottlenecks, troubleshooting failures and knowing our capacity limits are all difficult tasks. Yet, these are not things you can just give up on “because they’re hard”, but are rather tasks that every engineer must be able to tackle without too much overhead. It is clear that we have to aim for all engineers to be able to understand how their applications are doing at any given time.

Since we face all of these challenges every day, we reached a point where a paradigm shift was required. For example, moving from the old, familiar “investigate the past” to the new, unfamiliar “investigate the present”. That’s only one of the requirements we came up with. Here are a few more:

 

Real time visibility

Sounds pretty straightforward, right? However, when using a persistent monitoring system, there is always at least a few minutes of delay. These few minutes might contain millions of errors that potentially affect your business. Aiming for a low MTTR means cutting delays where possible, thus moving from minute-based granularity to second-based.

 

Throughput, Latency and error rate are linked

Some components might suffer from high latency, but maybe the amount of traffic they receive is negligible. Others might have low latency under high load, but that’s only because they fail fast for almost every request (we are reactive!). We wanted to view these metrics together, and rank them by importance.

 

Mathematical correctness at any given aggregation (Don’t lie!)

When dealing with latency, one should look at percentiles, not averages, as averages can be deceiving and might not tell the whole story. But what if we want to view latency per host, and then view it per data center? If we store only percentiles per host (which is highly common in our industry), it is not mathematically correct to average them! On the other hand, we have so much traffic that we can’t just store every measurement with its latency, and we definitely can’t view them all in real time.

 

Latency resolution matters

JVM-based systems tend to display crazy numbers when looking at the high percentiles (how crazy? With heavy GC storms and lock contention there is no limit to how bad these values can get). It’s crucial for us to differentiate between latency in the 99.5 and 99.9 percentiles, while values at the 5th or 10th percentile don’t really matter.

Summing up all of the requirements above, we reached a conclusion that our fancy persistent monitoring system, with its minute-based resolution, supporting millions of metrics per minute, doesn’t cut it anymore. We like it that every host can write thousands of metric values every minute, and we like being able to view historical data over long periods of time, but moving forward, it’s just not good enough. So, as we often do, we sat down to rethink our application-level metric collection and came up with a new, improved solution.

 

Our Monitoring Unit

First, consider metric collection from the application perspective. Logically, it is an application’s point-of-view of some component: a call to another application, to a backend or plain CPU bound computation. Therefore, for every component, we measure its number of requests, failures, timeouts and push backs along with a latency histogram over a short period of time.

In addition, we want to see the worst-performing hosts in terms of any such metric (be it mean latency, number of errors, etc.).

mu

To achieve this display for each measured component we decided to use these great technologies:

 

HDR Histograms

http://hdrhistogram.github.com/HdrHistogram/

HdrHistogram supports the recording and analysis of sampled data value counts, across a configurable value range, with configurable value precision within the range. It is designed for recording histograms of latency measurements in performance-sensitive applications.

Why is this important? Because when using such histograms to measure the latency of some component, you get good accuracy for the values in the high percentiles at the expense of the low percentiles.

So, we decided to store in memory instances of histograms (as well as counters for requests, errors, timeouts, push backs, etc) for each measured component. We then replace them each second and expose these histograms in the form of rx.Observable using our own OB1K application server capabilities.
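
A minimal sketch of such a per-component monitoring unit, assuming HdrHistogram’s interval Recorder and hypothetical counter and class names (the real implementation details may differ):

import java.util.concurrent.atomic.LongAdder;

import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

// A hypothetical per-component monitoring unit: counters plus an interval latency histogram.
public class MonitoringUnit {

    private final Recorder latencyRecorder = new Recorder(3); // 3 significant digits
    private final LongAdder requests = new LongAdder();
    private final LongAdder errors = new LongAdder();
    private final LongAdder timeouts = new LongAdder();

    public void recordSuccess(long latencyNanos) {
        requests.increment();
        latencyRecorder.recordValue(latencyNanos);
    }

    public void recordError()   { requests.increment(); errors.increment(); }
    public void recordTimeout() { requests.increment(); timeouts.increment(); }

    // Called roughly once per second: swaps out and returns the histogram for the last
    // interval, which can then be exposed to the aggregation layer (e.g. as an rx.Observable).
    public Histogram latencySnapshot() {
        return latencyRecorder.getIntervalHistogram();
    }
}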

All that is left is to aggregate and display.

Java Reactive extensions

https://github.com/ReactiveX/RxJava

rx is a great tool to merge and aggregate streams of data in memory. In our case, we built a service to merge raw streams of measured components; group them by the measured type, and aggregate them in a window of a few seconds. But here’s the trick – we do that on demand. This allows us to let the users view results grouped by any dimension they desire without losing the mathematical correctness of latency histograms aggregation.

Some examples on the operators we use to aggregate the multiple monitoring units:

 

merge

rx merge operator enables treating multiple streams as a single stream

 

window

rx window operator enables sliding window abstraction

 

scan

rx scan operator enables aggregation over each window

 

To simplify things, we can say that for each component we want to display, we connect to each machine to fetch the monitored stream endpoint, perform ‘merge’ to get a single stream abstraction, ‘window’ to get a result per time unit, and ‘scan’ to perform the aggregation
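
Assuming a hypothetical Measurement type and RxJava 1.x, the merge / window / scan pipeline described above can be sketched roughly like this:

import java.util.List;
import java.util.concurrent.TimeUnit;

import rx.Observable;

// Rough sketch only: merge per-host measurement streams, window them by time,
// and aggregate each window into a single result.
public class AggregationSketch {

    static class Measurement {
        final long count;
        final long errors;
        Measurement(long count, long errors) { this.count = count; this.errors = errors; }
    }

    // perHostStreams: one Observable of measurements per machine, for one component.
    static Observable<Measurement> aggregate(List<Observable<Measurement>> perHostStreams) {
        return Observable.merge(perHostStreams)              // many streams -> a single stream
                .window(5, TimeUnit.SECONDS)                 // cut the stream into 5-second windows
                .flatMap(window -> window
                        .scan(new Measurement(0, 0),         // running aggregation inside the window
                              (acc, m) -> new Measurement(acc.count + m.count, acc.errors + m.errors))
                        .last());                            // emit only the final aggregate per window
    }
}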

 

Hystrix Dashboard

https://github.com/Netflix/Hystrix

The guys at Netflix found a great formula for displaying serving components’ status in a way that links between volume, error percentage and latency in a single view. We really liked that, so we adopted this UI to show our aggregated results.

The hystrix dashboard view of a single measured component shows counters of successes, failures, timeouts and push backs, along with a latency histogram, information on the number of hosts, and more. In addition, it provides a balloon view, which grows/shrinks with traffic volume per component, and is color-coded by the error rate.

See below how this looks in the breakdown view of all request components. The user gets a view of all measured components, sorted by volume, with a listing of the worst performing hosts.

view1

Another example shows the view of one application, with nothing but its entry points, grouped by data center. Our Operations guys find this extremely useful when needing to re-balance traffic across data centers.

REBALANCE

 

OK, so far so good. Now let’s talk about what we actually do with it.

Troubleshooting

Sometimes an application doesn’t meet its SLA, be it in latency or error rate. The simple case is due to a broken internal component (for example, some backend went down and all calls to it result in failures). At this point we can view the application dashboard and easily locate the failing call. A more complex use case is an increase in the amount of calls to a high latency component at the expense of a low latency one (for example, cache hit rate drop). Here our drill down will need to focus on the relative amount of traffic each component receives – we might be expecting a 1:2 ratio, while in reality we might observe a 1:3 ratio.

With enough alerting in place, this could be caught by an alert. Having the real time view will allow us to locate the root cause quickly even when the alert is a general one.

troubleshoot

Performance comparison

In many cases we want to compare the performance of two groups of hosts doing the same operation, such as version upgrades or topology changes. We use tags to differentiate groups of machines (each datacenter is a tag, each environment, and even each hostname). We then can ask for a specific metric, grouped by tags, to get the following view:

compare

 

Load testing

We conduct several types of load tests. One is where we shift as much traffic as possible to one data center, trying to hit the first system-wide bottleneck. Another is performed on specific applications. In both cases we use the application dashboard to view the bottlenecks, just like we would when troubleshooting unexpected events.

One thing to keep in mind is that when an application is under load, sometimes the CPU is saturated and measurements are misleading because threads just don’t get CPU time. Another case where this happens is during GC. In such cases we must also measure the effects of this phenomenon.

The measured unit in this case is ‘jvm hiccup’, which basically means taking one thread, letting it sleep for a while and measuring “measurement time on top of the sleep time”. Low hiccups means we can rely on the numbers presented by other metrics.
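
A toy version of such a hiccup meter, just to illustrate the idea (a real jvm-hiccup measurement is more careful than this sketch):

// A toy "jvm hiccup" meter: sleep for a fixed interval and record how much extra
// time the thread actually spent asleep. Low values mean the other metrics can be trusted.
public class HiccupMeter implements Runnable {

    private static final long INTERVAL_MS = 10;

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(INTERVAL_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            long hiccupMs = Math.max(0, elapsedMs - INTERVAL_MS);
            // In a real setup this value would be recorded into a histogram/metric;
            // printing is just for illustration.
            System.out.println("hiccup: " + hiccupMs + " ms");
        }
    }
}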

hiccup

 

What’s next?

Real time monitoring holds a critical role in our everyday work, and we have many plans to leverage these measurements. From smarter metric driven load balancing in the client to canary deployments based on real application behavior – there is no limit to what you can achieve when you measure stuff in a fast, reliable manner.

Monitoring APIs with ELK

The Basics

One of the main challenges we’ve dealt with during the last couple of years was opening our platform and recommendation engine to the developer community. With the amount of data that Outbrain processes, direct relations with hundreds of thousands of sites, and a reach of more than 600M users a month, we can drive the next wave of content innovation. One of Outbrain’s main drivers for enabling an automated, large-scale recommendation system is to provide application developers with the option to interact with our system via an API.

Developers build applications, and those applications are used by users in different locations and at different times. When exposing an API for external usage, you can rarely predict how people will actually use it.

These variations can stem from different reasons:

  1. Unpredictable scenarios
  2. Unintentional misuse of the API. Either for lack of proper documentation, a bug, or simply because a developer didn’t RTFM.
  3. Intentional misuse of the API. Yeah, you should expect people will abuse your API or use it for fraudulent activity.

In all those cases, we need to know how the developer community is using the APIs, how the end users (applications) are using them, and also be able to take proactive measures.

Hello ELK.

The Stack

image01

ElasticSearch, Logstash and Kibana (AKA ELK) are great tools for collecting, filtering, processing, indexing and searching through logs. The setup is simple: our service writes logs (using Log4J), the logs are picked up by a Logstash agent that sends them to an ElasticSearch index, and Kibana is set up to visualize the data of the ES index.

The Data

Web server logs are usually too generic, and application debug logs are usually too noisy. In our case, we have added a dedicated log with a single line for every API request. Since we’re in application code, we can enrich the log with interesting fields, like the country of the request’s origin (translated from the IP), etc.

Here’s a list of useful fields (a sketch of writing such a log line follows the list):

  • Request IP – don’t forget about the X-Forwarded-For (XFF) header
  • Country / City – we use a 3rd-party database to translate IPs to countries
  • Request User-Agent
  • Request Device Type – resolved from the User-Agent
  • Request HTTP Method – GET, POST, etc.
  • Request Query Parameters
  • Request URL
  • Response HTTP Status Code – 200, 204, etc.
  • Response Error Message – the API service can fill in extra details on errors
  • Developer Identifier / API Key – if you can identify the developer, application or user, add these fields
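For illustration, a single request log record with these fields might look like the following (all values and field names are made up for this example; in production it is serialized as one line per request):

{
  "timestamp": "2015-06-01T12:34:56Z",
  "ip": "203.0.113.7",
  "country": "US",
  "city": "New York",
  "userAgent": "Mozilla/5.0 (iPhone)",
  "deviceType": "SMARTPHONE",
  "method": "GET",
  "url": "/v1/recommendations",
  "queryParams": "limit=5",
  "status": 200,
  "errorMessage": null,
  "apiKey": "demo-key-123"
}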

What can you get out of this?

So we’ve got the data in ES, now what?

Obvious – Events over time


This is pretty trivial. You want to see how many requests are made. With Kibana’s slice ‘n dice capabilities, you can easily break it down per Application, Country, or any other field that you’ve bothered to add. If an application is abusing your API and calling it a lot, you can see whose request volume has jumped over time and handle it.

Request Origin


If you’re able to resolve the request IP (or XFF header IP) to a country, you’ll get a cool-looking map / table showing where requests are coming from. This way you can detect anomalies such as fraud.

 

Http Status Breakdown


By itself, this is nice to have. Combined with Kibana’s slice n’ dice capabilities, this lets you see an overview for any breakdown. In many cases you can see that an application/developer is issuing the wrong API call. Be proactive and lend some assistance in near real time. Trust us, they’ll be impressed.

IP Diversity


Why would you care about this? Consider the following: a developer creates an application using your API, but all requests are made from a limited number of IPs. This could be intentional, for example if all requests are made through some cloud service. But it could also hint at a bug in the integration of the API. Now you can investigate.

Save the Best for Last

The data lives in ElasticSearch, and Kibana is just one way of consuming it. Here are a few other great ways to use the data.

Automated Validations (or Anomaly detection)

Once we identified key anomalies in API usage, we set up automated tests that search for them on a daily basis. Automatic anomaly detection in API usage proved to be incredibly useful when scaling the product. These tests can be run on demand or on a schedule, and a daily report is produced.
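As a sketch of what one such automated check might look like, the snippet below counts server errors in the last 24 hours via ElasticSearch’s _count API and flags an anomaly above a threshold (the host, index name, field names and threshold are all illustrative; it assumes a runtime with the global fetch API, e.g. a modern Node.js):

// Illustrative daily check: count 5xx responses in the last 24h and flag
// an anomaly if the count crosses a (made-up) threshold.
var THRESHOLD = 1000;

fetch('http://elasticsearch:9200/api-requests-*/_count', { // hypothetical host and index
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    query: {
      bool: {
        filter: [
          { range: { timestamp: { gte: 'now-24h' } } },
          { range: { status: { gte: 500 } } }
        ]
      }
    }
  })
})
  .then(function (res) { return res.json(); })
  .then(function (body) {
    if (body.count > THRESHOLD) {
      console.log('Anomaly: ' + body.count + ' server errors in the last 24h');
    }
  });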


Abuse Detection

ElasticSearch is (as the name suggests) very elastic. It enables querying and aggregating the data in a variety of ways. Security experts can (relatively) easily slice & dice the data to find abuse patterns. For example, we detect when the same user-id is used in two different locations and trigger an alert.
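For example, a query body along the following lines surfaces user-ids that were seen in more than one country during the last hour. The terms and cardinality aggregations are standard ElasticSearch features, while the field names are illustrative:

{
  "size": 0,
  "query": {
    "range": { "timestamp": { "gte": "now-1h" } }
  },
  "aggs": {
    "by_user": {
      "terms": { "field": "userId", "size": 1000 },
      "aggs": {
        "distinct_countries": {
          "cardinality": { "field": "country" }
        }
      }
    }
  }
}

Any by_user bucket whose distinct_countries value is greater than 1 is a candidate for an alert.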

Key Takeaways

  • Use ELK for analyzing your API usage
  • Have the application write the events (not a generic web-server).
  • Provide application-level information, e.g. additional error details and resolved geo-location.
  • Share the love

Angular DRY mocking – Leonardo


This post was written by Sagiv Frenkel.

As developers, one of the first and most basic things we learn is “Don’t repeat yourself!”.
That means trying to avoid writing the same code twice – in other words, no copy-paste!
While we still sin with the occasional copy-paste, it’s something we’re mindful of and easy to notice. We just have to remember to refactor later on.

But do we treat our mocking the same way?

Let’s look at a typical development flow:

1) Create your UI/UX, services and controller.
2) Create your server API calls.
3) Test your application, manually or with automated tests, using self-generated data in different scenarios.

What’s wrong with this approach?

We aren’t repeating code, but we are repeating work:

1) Documenting – there’s no good way to tell which user/data to use for which scenario.
2) Running – you need to log in/out to change users, or manually change code to fit changes.
3) Testing – error scenarios, edge cases, and request delays/throttling are very hard to simulate. Override scripts or comments used to switch data are the only tools at our disposal.

Can we do better?

Introducing Leonardo

Leonardo is an open-source AngularJS module created by Outbrain. It can be installed from npm or Bower, and easily integrates into existing AngularJS applications (more details on Leonardo’s GitHub repo).


Leonardo has a fancy UI where you can easily toggle different states/scenarios.

It enables you to:

1) Centralize your mocking and scenario configuration.
2) Persist the configuration into an external file.
3) Create manual QA or automated tests.

We use Leonardo extensively with protractor. More on this in another post.

Want to get started with Leonardo?

Check this Example to see how you can move from a regular image gallery to a mocked one.

How does Leonardo work?

Leonardo has two important concepts – states and scenarios.

States:

We add states to declare what and how to mock.
There are two types:

Ajax States – This is what we will typically use. We declare the URL and verb we wish to mock and what response data we wish to return – including a delay and a status.

leoConfiguration.addStates([
  {
    name: 'flicker-images',
    verb: 'jsonp',
    url: 'http://api.flickr.com/services/feeds/photos_public.gne',
    options: [
      {
        name: 'get ninja turtles',
        status: 200,
        data: {
          items: [
            { id: '20054214406', farm: 1, title: 'leo1' },
            { id: '19896041068', farm: 1, title: '017580' }
          ]
        }
      },
      {
        name: 'get ninja enemies',
        status: 200,
        data: {
          items: [
            { id: '20058148116', title: 'the_shredder' },
            { id: '20102720711', title: 'the_ninjas' }
          ]
        }
      }
    ]
  }
]);

Non-Ajax States – These require more work on the part of the developer. Basically, this allows you to declare a state and, optionally, its underlying data, and later check whether it’s on or off.

leoConfiguration.addState({
  name: 'Set Mission',
  options: [
    { name: 'turtles', data: "Protect April o'neil" },
    { name: 'shredder', data: 'Destroy the ninja turtles' }
  ]
});

You can query Leonardo for the value of a certain state.

var mission = leoConfiguration.getState('Set Mission');
$rootScope.mission = mission ? mission.data : "";

Leonardo triggers an event whenever a state changes.

$rootScope.$on('leonardo:setStates', function(){
  var debug = leoConfiguration.getState('debug');
  $rootScope.debug = !!debug;
});

Scenarios:

Scenarios simply enable you to set a specific set of states as active.
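For example, activating the ‘turtles’ options defined above could be captured as a scenario roughly like this (a hedged sketch of Leonardo’s scenario API; check the GitHub repo for the exact signature):

leoConfiguration.addScenario({
  name: 'Ninja turtles happy flow',
  states: [
    // each entry picks one option of a previously declared state
    { name: 'Set Mission', option: 'turtles' },
    { name: 'flicker-images', option: 'get ninja turtles' }
  ]
});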


Note:

– We currently only support Angular applications. That is what we initially developed on, and it was easy to implement. If the tool gains traction and popularity, it should be easy to migrate to a more vanilla approach.

Use Leonardo to start mocking HTTP calls or anything else you like! We’d love to get your feedback!

A/B testing @ Outbrain – Wabbit

 

What Is A/B Testing

A/B testing is a method widely used to validate assumptions about website optimizations. With an A/B test we compare two configurations of a web page design, configuration A and configuration B, according to metrics that define what a successful result is. In other words, you test your new design against the current design and measure which one produces better results: you split the traffic to your web page between the two configurations, measure which one performs better, and apply the winning configuration as the default for your site.

 

What To Test?

The choice of what to test depends on your goals. At Outbrain, each configuration is called an A/B test variant. The idea of Outbrain’s A/B testing is to allow publishers to test two different designs of their widgets and measure which design delivers better Click-Through Rate (CTR) and Revenue Per 1,000 Impressions (RPM) performance.

At the core of the system there are more than 450 settings that define the configuration of each widget, which is installed on a blog or a group of sites.

There are more than two hundred online settings that directly affect the widget. Each of these settings can be tested within A/B test variants. For example, one of these online settings is called “Widget Structure”. This setting configures the look and feel of the widget.

 


Widget structure – look and feel of the widget

If your goal is to test the addition of a new widget structure, you can configure variant A with the new widget structure, against variant B, which uses the original widget structure and serves as the control group.


When the test comes to an end, many questions may come up. How did it affect the customers? Did the new widget structure deliver better CTR and RPM performance? Would changing the title of the new widget structure have resulted in better performance? Would changing the image size of the old widget structure have resulted in better performance? All of these questions can be answered, one by one, by setting up the appropriate A/B test variants.

Even though each A/B test in our system is unique, there are certain widget settings that are usually tested for every variant:

  • Number of paid recommendations
  • Number of organic recommendations
  • Image size in the widget
  • The number of recommendations on the widget unit
  • Widget structure

 

A/B Tests in Outbrain

Once you’ve decided to create a new A/B test, you can do it using an internal tool named Wabbit – the Widget A/B testing tool. The tool lets you create a new A/B test, edit an existing one, or pull internal reports with Key Performance Indicator (KPI) data for the test.

The A/B test can be defined on a specific widget on one site or it can be done on a group of sites that use the same widget.

When the test ends, we pull the A/B test report to measure which configuration had better performance. If the data indicates one of the configurations is an improvement according to our KPIs and the test has experienced enough traffic to be considered significant, we give the option to apply the new configuration as the default for the widget.

 

Tips!!

  • At Outbrain we recommend running experiments for at least two weeks and no more than a month. The main reason is to eliminate the “day of the week” effect: users who visit the site on the weekend might represent a different segment than those who visit it during the week.
  • On the other hand, running an A/B test for more than a month leads to unreliable results; for example, cookie expiration causes users to start seeing different configurations, which compromises the consistency of the test.
  • At Outbrain, we also recommend allocating at least 5% of traffic toward an A/B test, to increase the probability of ending the test with results that have more than a 90% confidence level based on statistical analysis. Here’s a calculator from KissMetrics that will let you easily figure out whether your A/B test results are significant; a minimal sketch of the underlying math follows below.
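The check behind such calculators is a standard two-proportion z-test. Here is a minimal sketch in JavaScript (the click and impression counts are made up for illustration):

// Two-proportion z-test: is variant B's CTR significantly different from variant A's?
function zTestProportions(clicksA, impressionsA, clicksB, impressionsB) {
  var pA = clicksA / impressionsA;
  var pB = clicksB / impressionsB;
  var pooled = (clicksA + clicksB) / (impressionsA + impressionsB);
  var stdErr = Math.sqrt(pooled * (1 - pooled) * (1 / impressionsA + 1 / impressionsB));
  return (pB - pA) / stdErr; // z-score
}

// Hypothetical example: A gets 1,200 clicks out of 100,000 impressions,
// B gets 1,320 clicks out of 100,000 impressions.
var z = zTestProportions(1200, 100000, 1320, 100000);
// |z| > 1.645 roughly corresponds to a 90% confidence level (two-sided),
// |z| > 1.96 to 95%.
console.log('z =', z.toFixed(2), 'significant at 90%?', Math.abs(z) > 1.645);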