Blog Posts - May 2017

I WANT IT ALL – Go Hybrid

When I was a kid, my parents used to tell me that I can’t have my cake and eat it too. Actually, that’s a lie; they never said that. Still, it is something I hear parents say quite often. And not just parents. I hear the same phrase everywhere I go. People constantly take a firm, almost religious stance about choosing one thing over another: Mac vs PC, Android vs iOS, Chocolate vs Vanilla (obviously Chocolate!).

So I’d like to take a moment to take a different, more inclusive approach.

Forget Mac vs PC. Forget Chocolate vs Vanilla.

I don’t want to choose. I Want it all!

At Outbrain, the core of our compute infrastructure is based on bare metal servers. With a fleet of over 6000 physical nodes spread across 3 datacenters, we’ve learned over the years how to manage an efficient, tailored environment that caters to our unique needs, one of which is processing and serving over 250 Billion personalized recommendations a month to over 550 Million unique users.

Still, we cannot deny that the Cloud brings forth advantages that are hard to achieve in bare metal environments. And in the spirit of inclusiveness (and maximising value), we want to leverage these advantages to complement and extend what we’ve already built, whether for workloads that require a high level of elasticity, such as ad-hoc research projects involving large amounts of data, or simply for external services that can increase our productivity. We’ve come to view Cloud Solutions as supplemental to our tailored infrastructure rather than a replacement.


Over recent months, we’ve been experimenting with 3 different vectors involving the Cloud:


Elasticity

Our world revolves around publications, especially news. As such, whenever a major news event occurs, we feel an immediate, potentially high impact. Users rush to publisher sites, where we are installed. They want their news, they want their recommendations, and they want them all now.

For example, when Carrie Fisher, AKA Princess Leia, passed away last December, we saw a 30% traffic increase on top of our usual peak traffic. That’s quite a spike.

Since we usually do not know when a breaking news event will occur, we are required to keep enough extra capacity on hand to support such surges.

By leveraging the cloud, we can keep that extra capacity to a bare minimum, relying instead on the inherent elasticity of the cloud and provisioning only what we need, when we need it.

Doing this can improve the efficiency of our environment and cost model.

Ad-hoc Projects

A couple of months back, one of our researchers came up with an interesting behavioral hypothesis. For the discussion at hand, let’s say that it was “people who like chocolate are more likely to raise pet gerbils.” (Drop a comment with the word “gerbils” to let me know that you’ve read this far.) That sounded interesting, but raised a challenge: to validate or disprove this, we needed to analyze over 600 Terabytes of data.

We could have run it on our internal Hadoop environment, but that came with a not-so-trivial price tag. Not only would we have to provision additional capacity in our Hadoop cluster to support the workload, we also anticipated that the analysis would impact existing workloads running in the cluster. And all this before getting into operational aspects such as labor and lead time.

Instead, we chose to upload the data into Google’s BigQuery. This gave us both shorter lead times for the setup and very nice performance. In addition, 3 months into the project, when the analysis was completed, we simply shut down the environment and were done with it. As simple as that!

Productivity

We use Fastly for dynamic content acceleration. Given the scale we mentioned, this has the side-effect of generating about 15 Terabytes of Fastly access logs each month. For us, there’s a lot of interesting information in those logs. And so, we had 3 alternatives when deciding how to analyse them:

  • SaaS-based log analysis vendors
  • An internal solution, based on the ELK stack
  • A cloud-based solution, based on BigQuery and Data Studio

After performing a PoC and running the numbers, we found that the BigQuery option – if done right – was the most effective for us, both in terms of cost and the amount of effort required.

There are challenges when designing and running a hybrid environment. For example, you have to make sure you have consolidated tools to manage both on-prem and Cloud resources. The predictability of your monthly cost isn’t as trivial as before (no one likes surprises there!), and controls around data can demand substantial investments… but none of that makes falling back to “all Vanilla” or “all Chocolate” a good option. It just means that you need to be mindful, and prepared to invest in tooling, education and processes.


In summary, I’d like to revisit my parents’ advice, and try to improve on it a bit (which I’m sure they won’t mind!):

Be curious. Check out what is out there. If you like what you see – try it out. At worst, you’ll learn something new. At best, you’ll have your cake… and eat it too.


X tips [x>5] for building a bulletproof deployment pipeline with Jenkins

Continuous delivery is a methodology where each commit can potentially get into production in a timely manner.

Jenkins Pipeline is one of the tools out there that automates the delivery process to make it short, robust, and as free of human intervention as possible.

We have recently done such an integration on our team at Outbrain, so here are some tips and advice from our humble experience.
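All the code samples in this post come from a scripted Jenkinsfile. For context, here is a minimal skeleton of such a pipeline (a sketch only: stage names and shell commands are illustrative, not our actual setup):

node {
    stage("Checkout") {
        checkout scm // check out the revision that triggered the build
    }
    stage("Build") {
        sh './build.sh' // illustrative build step
    }
    stage("Deploy") {
        sh './deploy.sh' // illustrative deploy step
    }
}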

X. You should have done this ages ago (so do it today)

Don’t wait till you have all the building blocks in place. Start with a partial pipeline and add all the automated steps you already have in place. It will give you the motivation to add more automation and improve the visibility of the process.

The Pipeline set of plugins in Jenkins is about a year old in its current form, so it is mature and well documented. Definitely ready to use.

X. Validate artifacts and source code consistency across pipeline

I read this tip in a TeamCity pipeline post, but it is relevant for Jenkins as well. Make sure that the same version of sources and artifacts is used across all stages. Otherwise, a commit might be pushed while the pipeline is executing, and you might end up deploying an untested version.
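One way to achieve this in a Jenkinsfile is to resolve the commit hash once, at the start of the run, and refer to that exact revision in every later stage. A sketch (the deploy script and its flag are hypothetical):

gitSha = sh(returnStdout: true, script: 'git rev-parse HEAD').trim()

// later stages deploy the exact revision that was tested, not a moving branch head
sh "./deploy.sh --revision ${gitSha}" // hypothetical deploy script and flag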

X. Use commit hook with message regexp

Well, if I were to generalise this tip, I would say: try to ask for as little human intervention as possible (when it is not required). A good place to start is a commit hook. It works in such a way that when a developer pushes code with a specific commit message — in our case #d2p (deploy to production) — the pipeline is automatically triggered.

Here is a code sample from Jenkinsfile (the pipeline configuration file):

gitCommitMessage = sh(returnStdout: true, script: 'git log -1 --pretty=%B').trim()
deployToProd = (gitCommitMessage =~ /#d2p/ || params.DEPLOY_TAG == "#d2p") //we also allow '#d2p' when triggering manually
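Later in the pipeline, that flag can gate the production stage. For example (the deploy step is illustrative):

if (deployToProd) {
    stage("Deploy to production") {
        sh './deploy_production.sh' // illustrative deploy step
    }
}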

X. Try the Blue Ocean view

The Blue Ocean set of plugins was in release-candidate stage at the time of writing (it has since gone GA). It is stable enough and has a very good UI, especially for parallel stages, so I would recommend using it. In addition, it works side by side with the old UI.

[Screenshot: the Blue Ocean view when all is green]
[Screenshot: the Blue Ocean view when something goes wrong]

X. Ask for user authorization on sensitive operations

If you are still not sure that your monitoring system is robust enough, start by automating the pipeline, and ask for developer authorization before the actual deploy to production.

Here is a code sample from Jenkinsfile:

timeout(time:5, unit:'HOURS') {
  input message: 'Deploy to production?', ok: 'Deploy!'
}
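If you also want to restrict who is allowed to approve, the input step accepts a submitter parameter (the name here is illustrative):

input message: 'Deploy to production?', ok: 'Deploy!', submitter: 'deployers'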

X. Integrate Slack or other notifications

Slack is awesome and has a very well-documented API. Sending notifications on pipeline triggering and progress helps to communicate the work between team members. We currently send start, completion, and failure notifications. We plan to integrate the approval input above with a Slack bot so we can approve deployments directly from Slack.
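One convenient option for sending these is the Jenkins Slack plugin, which adds a slackSend step. A minimal sketch (the channel name is illustrative):

slackSend channel: '#deployments', color: 'good',
        message: "Deploy pipeline started: ${env.JOB_NAME} #${env.BUILD_NUMBER}"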

[Screenshot: a Slack notification]

X. Make the pipeline fast (parallelize it)

Making the pipeline turnaround time short helps to keep work efficient and fun. Set a target for the total turnaround time; our target is less than 10 minutes. One of the easiest ways to keep it fast is to run independent stages in parallel. For example, we run the deployment to a test machine in parallel with the integration tests, and the deployment to a canary machine in parallel with our black-box tests.

Here is a code sample from Jenkinsfile:

stage("Testing: phase a") {
    parallel 'JUnit': {
        stage("junit") {
            sh '...'
        }
    }, 'Deploy to simulator': {
        stage("Deploy to simulator") {
            sh '...'
        }
    }
}
stage("Testing: phase b") {
    parallel 'Simulator tests': {
        stage("Simulator tests") {
            sh '...'
            }
        }
    }, 'Canary server': {
        stage("Deploy to canary") {
            sh '...'
        }
        stage("Tests on canary") {
            sh '...'
        }
    }
}

X more tips in the great post below:

Enjoy Piping!


P.S. – The original post was published on my personal blog:

https://medium.com/@OhadShai/x-tips-x-5-for-building-a-bulletproof-deployment-pipeline-with-jenkins-9079de9a1082

I am going to give a talk at the Jenkins User Conference that is based on this blog post:

https://www.eventbrite.com/e/jenkins-user-conference-2017-israel-tlv-david-inter-continental-tickets-32226522396

You are welcome to join us there for more details!

Effective Testing with Loan Pattern in Scala

Tests are crucial in systems that rely on CI/CD as part of their release cycle. One of the challenges is to write stable tests that work for you, without spending a lot of time maintaining bad tests.

Tests are Hard

They’re hard to write, hard to maintain, and it’s even harder to stabilize a flaky test. At Outbrain, we take special pride in our ability (for the most part) to deliver new features to production, and to do so with the confidence that only reliable tests can give you. These tests play a crucial role in our ability to deliver fast, good, and stable code, making sure no regression bugs are introduced in the process. It is crucial, then, to not only maintain good test suites (unit, integration, and e2e) but also to fix any test that misbehaves (flaky tests).

We have a special environment to facilitate integration and e2e tests, called the simulation environment (it is only one of the set of tools we have for that purpose). This is a dedicated set of servers which we use to simulate our production environment. We deploy every new version of our services to that environment before we deploy to production, and run tests that check new code flows, regression, and interoperability with other services.

In order to write an effective test for a new feature, we sometimes need to set up the environment with entities that are required by the feature we’re testing. If, for example, our new feature is to register a car to an owner (a Person entity), then before running the tests we need the required entities, a Car and a Person, in our database. We’re not trying to test a flow for creating a new car or a new person in this scenario, so there is no need to create the car and/or the person entities explicitly in the test before the actual test scenario happens. And in order to make our tests as clear and succinct as possible, we don’t want to create this data explicitly in each and every test.


Bad Practices

So, it was a common practice (albeit a bad one) to have pre-existing data on which we would rely to run tests (for the whole simulation environment!). This led to two big (interconnected) problems:

  1. No test isolation – a test that mistakenly deleted some or all of the pre-existing data, for example, would do so for all the tests running in that environment
  2. Flaky tests – tests running concurrently create, delete, and generally change data that affects others, which in turn fails tests for no good reason — and that makes it really hard to analyze and fix a failing test


We’ve tackled this problem by creating the needed data before the tests in a test class and deleting it after the test run. This mitigated the problem somewhat, but not only were the tests in the same class still interconnected, it also added boilerplate to the test class. Now, a test class looked something like this (assuming these are entities autogenerated by Scalike for the relevant tables):

ScalaTest:

class MyTestClass extends WordSpec with Matchers with BeforeAndAfterAll {

  val person = Person(carId = None).save() // creating data for test
  val car = Car(color = "green", ownerId = None).save()

  override def afterAll(): Unit = {
    // Clean db after test.
    person.destroy()
    car.destroy()
  }

  "My service" should {
    "set the owner of the car" in {
      val service = new Service

      service.setCarOwner(carId = car.id, ownerId = person.id)

      service.getPerson(person.id).carId shouldEqual Some(car.id)
    }
  }
}

Specs2:

class MyTestClass extends SpecificationWithJUnit with AfterAll {
  val person = Person(carId = None).save() // creating data for test
  val car = Car(color = "green", ownerId = None).save()

  override def afterAll(): Unit = {
    // Clean db after test.
    person.destroy()
    car.destroy()
  }

  "My service" should {
    "set the owner of the car" in {
      val service = new Service

      service.setCarOwner(carId = car.id, ownerId = person.id)

      service.getPerson(person.id).carId must be equalTo Some(car.id) 
    }
  }
}

Looking at this, we were presented with a challenge. First, the data is created once for all the tests that run in a class and must be deleted only after all tests have finished running, which means the tests are not isolated from one another and may become flaky. Second, we wanted an elegant way of creating and deleting the needed entities seamlessly, in order to minimize the boilerplate for each test class.

Note: It is possible, however, in Specs2, to achieve a better solution by using the ‘Scope’ trait, like so:

trait Context extends Scope with After {
  val person = Person(carId = None).save() // creating data for test
  val car = Car(color = "green", ownerId = None).save()

  override def after: Any = {
    // Clean db after test.
    person.destroy()
    car.destroy()
  }
}

And using it in a test like so:

class MyTestClass extends SpecificationWithJUnit {
  "My service" should {
    "set the owner of the car" in new Context {
      val service = new Service

      service.setCarOwner(carId = car.id, ownerId = person.id)

      service.getPerson(person.id).carId must be equalTo Some(car.id)
    }
  }
}


It’s a good solution, but for a simpler problem than the one we faced. We needed the tests to run in a single transaction, with a supplied session and a configurable DB name (indicating a set of Scalike connection parameters).

Enter Loan Pattern

We first encountered this pattern when using ScalaTest, and quickly moved to using it in Specs2 as well (as most of our tests are written in Specs2). From the ScalaTest documentation on sharing fixtures:

“A test fixture is composed of the objects and other artifacts (files, sockets, database connections, etc.) tests use to do their work. When multiple tests need to work with the same fixtures, it is important to try and avoid duplicating the fixture code across those tests.”
“If you need to both pass a fixture object into a test and perform cleanup at the end of the test, you’ll need to use the loan pattern”
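Stripped of any test framework, the loan pattern is simply a method that acquires a resource, “loans” it to a function, and releases it when that function returns. A minimal sketch in plain Scala (the in-memory H2 connection is illustrative, and assumes the H2 driver on the classpath):

import java.sql.{Connection, DriverManager}

def withConnection[A](use: Connection => A): A = {
  val conn = DriverManager.getConnection("jdbc:h2:mem:test") // acquire (illustrative)
  try use(conn) // “loan” the resource to the caller
  finally conn.close() // always release, even if the caller throws
}

// usage: the caller never manages the connection itself
val ok = withConnection { conn => conn.isValid(1) }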

This means we can use fixtures to set up ‘artifacts’ for the tests to use, promoting the DRY principle by minimizing code duplication. It is also a good way to reduce boilerplate when writing tests. So, we wrote this one simple trait:

ScalaTest:

trait TestDataSupport extends DefaultGenerator {

  def withTestData(testCode: DefaultObjects => Any): Unit = {
    val testData = createTestData()
    try {
      testCode(testData) // "loan" the fixture to the test
    }
    finally clearTestData(testData) // clean up the fixture
  }

  private def createTestData(): DefaultObjects = {
    DefaultObjects.create(name = Random.alphanumeric.take(10).mkString)
  }

  private def clearTestData(testData: DefaultObjects): Unit = {
    testData.cleanup
  }
}

Let’s go over what’s happening in this trait. We’re mixing in a custom trait called ‘DefaultGenerator’ which gives us ‘DefaultObjects’, the entities we need pre-created for our tests to run. We have two private methods: one calls ‘create’ on ‘DefaultObjects’ with a custom name to generate the needed entities, and the other calls ‘cleanup’ on the test data to clean the environment after the test has finished running. And the star of this trait, the method (or fixture, if you will) ‘withTestData’, takes the test function as a parameter, calls the private method ‘createTestData’, calls the test, passing it the data we just generated, and finally cleans up the generated data after the test finishes.

When mixing this trait in our test class, we get the following code:

class MyTestClass extends WordSpec with Matchers with TestDataSupport {

  "My service" should {
    "set the owner of the car" in withTestData { testData =>
      val service = new Service

      service.setCarOwner(carId = testData.car.id, ownerId = testData.person.id)

      service.getPerson(testData.person.id).carId shouldEqual Some(testData.car.id) 
    }
  }
}


‘testData’ is the data generated by our ‘withTestData’ method (a car and a person in our case).

The Specs2 version of the Loan Pattern is a bit more complex, as we’ve added some more bells and whistles to make it easier to create those entities in our domain. We’re using Scalike to create the entities in a MySQL database, and we need somewhat more refined control over the session we’re using, the DB name, etc.

Specs2:

trait DataContextName {
  def className: String
}

trait DataContextDbName {
  val dbName: Symbol = 'default
}

import scalikejdbc.{DBSession, NamedDB}

// allow running a block against a named DB with an explicitly supplied session
package object testdata {
  private[testdata] implicit class NamedDbSession(namedDB: NamedDB) {
    def withSession[A](session: DBSession)(execution: DBSession => A): A =
      execution(session)
  }
}
import org.specs2.execute.{AsResult, Result}
import org.specs2.specification.ForEach
import scalikejdbc.{DB, DBSession, NamedDB}

import scala.util.Try

trait DefaultDataContext extends ForEach[DefaultObjects]
  with DataContextDbName with DataContextName with DefaultGenerator {

  implicit lazy val session: DBSession = DB.autoCommitSession()

  override def foreach[R](f: DefaultObjects => R)(implicit asResult: AsResult[R]): Result =
    NamedDB(dbName).withSession(session) { implicit session: DBSession =>

      // generate fresh, uniquely named entities for this single test
      val testData = DefaultObjects.create(name = randomName(className))

      // run the test, capturing any failure so that cleanup still happens
      val result = Try {
        AsResult(f(testData))
      }

      testData.cleanup(session)
      result.get // rethrow the test failure, if any
    }
}

It’s very similar to the ScalaTest flavor, but with several changes to better facilitate our needs in the Specs2 tests. We have a mechanism to initialize a named DB connection, with a named connection pool and an explicit session. Apart from these additions, it’s pretty similar to ScalaTest: generate the test data, run the test, and clean up the generated data.

The test class now looks like this:

trait TestClassDataContext extends DefaultDataContext {
  override val dbName: Symbol = TestClassConnectionPoolName
  def className: String = "Test class name"
}

class MyTestClass extends SpecificationWithJUnit with TestClassDataContext {
  "My service" should {
    "set the owner of the car" >> { testData: DefaultObjects =>
      val service = new Service

      service.setCarOwner(carId = testData.car.id, ownerId = testData.person.id)

      service.getPerson(testData.person.id).carId shouldEqual Some(testData.car.id) 
    }
  }
}

Summary

We tackled several issues our team faced on a day-to-day basis, which made our simulation environment unstable, hard to maintain, and generally very frustrating to work with. By extracting data generation and cleanup to an external trait and using a clever mechanism to reduce boilerplate, we managed to clean up and simplify the test class, reduce code duplication, and generally make our lives easier. Tests are still hard, but a bit easier to write and nicer to read. What do you think?