Reducing risks by taking risks

You have a hypothesis you wish to prove or refute regarding your system's behavior in an extreme scenario. For example: after restarting the DB, all services are fully functioning within X seconds. There are at least 3 approaches:

  1. Wait && See — if you wait long enough this scenario will probably happen by itself in production.
  2. Deep Learning  — put the DBA, OPS and service owner in a big room, and ask them to analyze what will happen to the services after restarting the DB. They will need to go over network configuration, analyze the connection pool and DB settings, and so on.
  3. Just Do It — deliberately create this scenario within a controlled environment:

Timing — Based on your knowledge, pick the right quarter, month, week, day and hour – so that the impact on the business will be minimal.

Monitoring — Make sure you have all monitoring and alerts in place. If needed, do “manual monitoring”.

Scale — If applicable, do it only on a portion of your production: one data center, part of the service, your own laptop, etc.

The Wait && See approach will give you the right answer, but at the wrong time. Knowing how your system behaves only after a catastrophe occurs is missing the point and could be expensive, recovery-wise.

The Deep Learning approach seems to be the safer one, but it requires a lot of effort and resources. The answer will not be accurate as it is not an easy task to analyze and predict the behaviour of a complex system. So, in practice, you haven’t resolved your initial hypothesis.

The Just Do It approach is super effective – you will get an accurate answer in a very short time, and you’re able to manage the cost of recovery. That’s why we picked this strategy to resolve the hypothesis. A few tips if you’re going down this path:

  • Internal and External Communication — because you set the date, you can send a heads-up to all stakeholders and customers. We use statuspage.io for that.
  • Collaboration — set a HipChat / Slack / Hangouts room so all relevant people will be on the same page. Post when the experiment starts, finishes, and ask service owners to update their status.
  • Document the process and publish the results within your knowledge management system (wiki / whatever).
  • This approach is applicable for organisations that practice proactive risk management, have a healthy culture of learning from mistakes, and maintain a blameless atmosphere.

 

People who pick Deep Learning often do so because it is considered a “safer”, more conservative approach, but that is only an illusion. There is a high probability that in a real-life DB restart event the system would behave differently from your analysis. This effect of surprise and its outcomes are exactly what we wanted to avoid by predicting the system’s behaviour.

 

The Just Do It approach actually reduces this risk and keeps it manageable. It’s a good example of when the bold approach is safer than the conservative one. So don’t hesitate to take risks in order to reduce risk – and Just Do It.

Photo credit: http://www.jinxiboo.com/blog/tag/risks

ScyllaDB POC – live blogging #2

Hi again.

We are now a few weeks into this POC, and we gave you a first glimpse of it in the first post. We would like to continue and update you on the latest developments.

However, before we get to the update, there are two things we need to tell you about the setup that did not make it into the first post, purely for length reasons.

The first thing to tell you about is the data structure we use, as Doron explains:

The Data

Our data is held using a single column as the partition key, plus 2 additional columns as clustering keys, in a key/value-like structure:

  • A – partition key (number).
  • B – Clustering key 1 (text). This is the value type (i.e. “name”, “age” etc…).
  • C – Clustering key 2 (text).
  • D – The data (text). This is the value itself (i.e. “David”, “19”, etc…).

When storing the data we store either all data for a partition key, or partial data.

When reading we always read all data for the partition, meaning:

select * from cf_name where A=XXX;

When we need multiple partitions, we just fire the above query in parallel for each key and wait for the slowest result. We have come to understand that this approach is the fastest one.
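
To make that read pattern concrete, here is a minimal Java sketch of the “fire in parallel, wait for the slowest” approach. It is not our actual DAO code – readPartition is a hypothetical stand-in for whatever async single-partition query the driver exposes – but it shows why the overall latency is dictated by the slowest partition: the combined future completes only when the last read returns.

import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch: fire one query per partition key in parallel and wait for all of them.
public class ParallelPartitionReader<T> {

    private final Function<Long, CompletableFuture<T>> readPartition;

    public ParallelPartitionReader(Function<Long, CompletableFuture<T>> readPartition) {
        this.readPartition = readPartition;
    }

    public CompletableFuture<Map<Long, T>> readAll(List<Long> partitionKeys) {
        // Fire all single-partition reads at once ("select * from cf_name where A=?" per key).
        Map<Long, CompletableFuture<T>> pending = partitionKeys.stream()
                .distinct()
                .collect(Collectors.toMap(key -> key, readPartition));

        // allOf() completes only when the slowest read completes, so the whole
        // multi-partition read is as slow as the worst-behaving node.
        return CompletableFuture
                .allOf(pending.values().toArray(new CompletableFuture[0]))
                .thenApply(done -> pending.entrySet().stream()
                        .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().join())));
    }
}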

While this is meant to be the fastest approach, we need to understand that the latency of such reads is always determined by the slowest key to read. If you fire tens of reads into the cluster and, by chance, you bump into a slow response from one of the nodes (GC is a good example of such a case), your whole read attempt is delayed.

This is where we thought Scylla could help us improve the latency of the system.

 

The second thing we wanted to tell you about is what made this POC so easy to run at Outbrain: our dual DAO mechanism, which Doron explains below.

Dual DAO implementation for Data Store migration/test

We have come to see that in many cases we need a dual DAO implementation to replace the regular DAO implementation.

The main use-case is data migration from one cluster to the other:

  1. Start dual writes to both clusters.
  2. Migrate the data – by streaming it into the cluster (for example).
  3. Start dual read – read from the new cluster and fallback to the old.
  4. Remove the old cluster once there are no fallbacks (if the migration was smooth, there should not be any).

The dual DAO we have written holds instances of both the old and the new DAO, plus a phase – a number between 1 and 5. In addition, the dual DAO implements the same interface as the regular DAO, so it is pretty easy to swap it in and out when done (a minimal sketch follows the phase list below).

The phases:

  1. Write to old cluster and read from old cluster only.
  2. Write to both clusters and read from old cluster only.
  3. Write to both clusters and read from new cluster, fallback to old cluster if missing.
  4. Write to new cluster and read from new cluster, fallback to old cluster if missing.
  5. Write to new cluster and read from new cluster only.
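
As a rough illustration of how such a phase-driven dual DAO can look, here is a minimal Java sketch. DocumentDao, Document and the method names are hypothetical – the real interface is whatever the regular DAO already exposes – but the phase logic follows the five phases listed above.

// Hypothetical DAO interface and entity, for the sake of a self-contained example.
interface DocumentDao {
    void write(Document doc);
    Document read(String id);
}

class Document { /* fields omitted */ }

// Phase-driven dual DAO: same interface as the regular DAO, so it can be swapped in and out.
public class DualDocumentDao implements DocumentDao {

    private final DocumentDao oldDao;
    private final DocumentDao newDao;
    private volatile int phase;   // 1..5, controlled by configuration

    public DualDocumentDao(DocumentDao oldDao, DocumentDao newDao, int phase) {
        this.oldDao = oldDao;
        this.newDao = newDao;
        this.phase = phase;
    }

    @Override
    public void write(Document doc) {
        if (phase <= 3) {
            oldDao.write(doc);            // phases 1-3: keep writing to the old cluster
        }
        if (phase >= 2) {
            newDao.write(doc);            // phases 2-5: write to the new cluster as well (or only)
        }
    }

    @Override
    public Document read(String id) {
        if (phase <= 2) {
            return oldDao.read(id);       // phases 1-2: old cluster only
        }
        if (phase == 5) {
            return newDao.read(id);       // phase 5: new cluster only
        }
        Document doc = newDao.read(id);   // phases 3-4: new cluster first...
        return doc != null ? doc : oldDao.read(id);   // ...with fallback to the old cluster
    }
}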

The idea here is to support a gradual and smooth transition between the clusters. We have come to understand that in some cases a transition driven only by the dual DAO can be very slow (we can move to phase 5 only after no fallback is done), so we had 2 ways to accelerate it:

  1. Read-repair – when in phase 3 or 4 and we fall back to the old cluster, write the data we fetch to the new cluster. This approach usually fits read-heavy use cases.
  2. Stream the data in manually (by the Data guys) – this has proven to be the quicker and more effective way, and this is what we have done in this case.

Dual DAO for data store test:

The ScyllaDB POC introduces a slightly different use-case. We do not want to migrate to Scylla (at least not right now) – but add it to our system in order to test it.

In order to meet those demands, we have added a new flavor to our dual DAO (sketched below):

  1. The write to the new DAO is done in the background – logging failures – but it does not fail the write process. When working with the dual DAO for migration we want to fail if either the new or the old DAO fails; when working in test-mode we do not. At first we did not implement it like this, and we had a production issue when a ScyllaDB upgrade (to a pre-GA version, from 0.17 to 0.19) caused the save queries to fail (on 9/3/16). Due to those issues we changed the dual write to not fail upon a new-DAO failure.
  2. The read from the new DAO is never returned to the caller. In phases 3-5 we do read from the new DAO, and place a handler at the end to time the fetch and compare the results to the old DAO. However, the actual response comes back from the old DAO when it is done. This is not bullet-proof, but it reduces the chances of production issues due to issues in the test cluster.
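
To illustrate the test flavor, here is a hedged sketch reusing the hypothetical DocumentDao/Document types from the earlier sketch: the new DAO is written to in the background (failures are only logged), and reads against it are shadow reads that are timed and compared but never returned to the caller.

import java.util.Objects;
import java.util.concurrent.CompletableFuture;
import java.util.logging.Logger;

// Test-mode dual DAO: production traffic is served by the old DAO only.
public class TestModeDualDao implements DocumentDao {

    private static final Logger log = Logger.getLogger(TestModeDualDao.class.getName());

    private final DocumentDao oldDao;   // serves production
    private final DocumentDao newDao;   // cluster under test (ScyllaDB)

    public TestModeDualDao(DocumentDao oldDao, DocumentDao newDao) {
        this.oldDao = oldDao;
        this.newDao = newDao;
    }

    @Override
    public void write(Document doc) {
        oldDao.write(doc);                                  // only the old DAO may fail the write
        CompletableFuture.runAsync(() -> newDao.write(doc))
                .exceptionally(t -> { log.warning("test-cluster write failed: " + t); return null; });
    }

    @Override
    public Document read(String id) {
        Document answer = oldDao.read(id);                  // the response always comes from the old cluster
        long start = System.nanoTime();
        CompletableFuture.supplyAsync(() -> newDao.read(id))
                .whenComplete((shadow, t) -> {
                    long micros = (System.nanoTime() - start) / 1_000;
                    if (t != null) {
                        log.warning("test-cluster read failed: " + t);
                    } else if (!Objects.equals(shadow, answer)) {
                        log.info("test-cluster mismatch for " + id + " (shadow read took " + micros + "us)");
                    }
                });
        return answer;
    }
}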

 

Mid April Update:

When we last reported, ScyllaDB was loaded with partial, ongoing data, while the C* cluster was loaded with the full historic data. Scylla was performing much better, as we showed.

The next step was to load the full data into the ScyllaDB cluster. This was done by a Scylla streaming process that read the data from the C* cluster and wrote it to the ScyllaDB cluster. This process uncovered a bug: the streaming failed to extract the data from the newest version of C* that we use.

Shlomi’s explanation is as follows:

Sstables upgraded from Cassandra 2.0 to Cassandra 2.1 can contain range tombstones with a different format from tombstones written in Cassandra 2.1 – we had to add support for the additional format.

The Scylla team figured it out and fixed it pretty quickly. More info about this issue can be found in the project’s GitHub.

After all the data was loaded correctly into the ScyllaDB cluster, we saw a degradation in the ScyllaDB cluster that made it perform slightly worse than C* (surprise!). On top of that, a faulty configuration we found and fixed in the C* “speculative retries” mechanism actually made the C* latency about 50% better than ScyllaDB’s. We need to remember that we are measuring latency at its 99th percentile – in our case, as Doron mentioned above, while using multiple parallel reads, which is even harder.

The ScyllaDB guys were as surprised as we were. They took a look into the system and found a bug in their cross-DC read-repair; they fixed the issue and installed a new version.

Here is the description of the issue as explained by Tzach from Scylla (reference: https://github.com/scylladb/scylla/issues/1165):

After investigation, we found out that the Scylla latency issue was caused by a subtle bug in the read-repair, causing unnecessary, synchronous, cross-DC queries.

When creating a table with a read-repair chance (10% in this case) in Scylla or Cassandra, 10% of the reads send background queries from the coordinator to ALL replicas, including cross-DC ones. Normally, the coordinator waits for responses only from a subset of the replicas, based on the Consistency Level – Local Quorum in our case – before responding to the client.

However, when the first replica responses do not match, the coordinator will wait for ALL responses before sending a response to the client. This is where the Scylla bug was hidden.

It turns out the response digest was wrongly calculated in some cases, which caused the coordinator to believe the local data was not in sync and to wait for the remote DC nodes to respond.

This issue is now fixed and backported to Scylla 1.0.1.

 

So, we upgraded to Scylla 1.0.1 and things look better. We can now say that C* and Scylla are in the same range of latency, but C* is still better. Or, as Doron phrased it in this project’s Google group:

Performance definitely improved starting 13/4 19:00 (IL time). It is clear.

I would say C* and Scylla are pretty much the same now… Scylla spikes a bit lower…

 

That did not satisfy the Scylla guys. They went back to the lab and came back with the following resolution, as described by Shlomi:

In order to test the root cause for the latency spikes we decided to try poll-mode:

Scylla supports two modes of operation – the default, event-triggered mode, and another, extreme poll-mode. In the latter, Scylla constantly polls the event loop with no idle time at all. The CPU busy-waits for work, always consuming 100% of each core. Scylla was initially designed with this mode only, and later switched to a saner mode of operation that calls epoll_wait. The latter requires complex synchronization before blocking on the OS syscall, to prevent races between cores going in and out of sleep.

We recently discovered that the default mode suffers from increased latency and fixed it upstream. This fix has not propagated to the 1.0.x branch yet, so we recommend using poll-mode here.

The journey didn’t stop here since another test execution in our lab revealed that hardware Hyper Threads (HT) may increase latencies during poll mode.

Yesterday, following our tests with and without HT, I have updated scylla configuration to use poll-mode and running only on 12 physical cores  (allowing the kernel to schedule the other processes on the HT).

A near-future investigation will look into why poll-mode with HT is an issue (Avi suspects L1 cache eviction). Once the default, event-based code is backported to the release, we’ll switch to it as well.

 

This is how the graphs look today:

Screen Shot 2016-04-21 at 12.30.26 PM

 

As you can see in the graphs above (upper graph is C*, lower graph is Scylla):

  • Data was loaded into Scylla on 4/1 (lower graph).
  • On 4/4 Doron fixed the C* configuration and its latency improved dramatically.
  • On 4/13 Scylla 1.0.1 was installed, bringing an improvement.
  • On 4/20 Shlomi made the Hyper-Threading / poll-mode config change.
  • The current status is that Scylla’s baseline sits nicely around 50ms, spiking to above 90ms, whereas C* is at 60-65ms, spiking to 120ms in some cases. It is worth remembering that those measurements are the 99th percentile, taken from within Outbrain’s client service, which is Java-based and by itself suffers from unstable performance.

There is a lot to learn from such a POC. The guys from Scylla are super responsive and don’t hesitate to take learnings from every unexpected result.

Our next steps are:

  1. We still see some level of result inconsistency between the 2 systems; we want to find out where it comes from and fix it.
  2. Move to throughput test by decreasing the number of machines in the Scylla cluster.

 

That’s all for now. More to come.   

We are testing ScyllaDB – live blogging #1

The background

In the last month we have started, at Outbrain, to test ScyllaDB. I will tell you in a minute what ScyllaDB is and how we came to test it, but I think what is most important is that ScyllaDB is a new database in its early stages, still before its first GA (coming soon). It is not an easy decision to be among the first to try such a young project that not many have used before (up until now there are about 2 other production installations), but as they say, someone has to be the first one… Both ScyllaDB and Outbrain are very happy to openly share how the test goes, what the hurdles are, and what works and what doesn’t.

 

How it all began:

I have known the guys from Scylla for quite some time; we met through the first iteration of the company (Cloudius Systems), and we met again at the early stages of ScyllaDB. Dor and Avi, the founders of ScyllaDB, wanted to ask whether, as heavy users of Cassandra, we would be happy with the solution they were going to write. I said, “Yes, definitely”, and I remember saying, “If you give me Cassandra functionality and operability at the speed and throughput of Redis, you’ve got me.”

Time went by, and about 6 months ago they came back and said they were ready to start integrating with live production environments.

 

This is the time to tell you what ScyllaDB is.

The easiest description is “Cassandra on steroids”. That’s right, but in order to achieve it, the guys at Scylla basically had to rewrite the whole Cassandra server from scratch, meaning:

  • Keep all Cassandra interfaces exactly the same, so client applications will not have to change.
  • Write it all over in C++, and by that overcome the issues the JVM brings with it – mostly the GC pauses that were hurting the high latency percentiles.
  • Write it all in an asynchronous programming model that enables the server to run at very high throughput.
  • Shard-per-core approach – on top of the cluster sharding, Scylla uses shard-per-core, which allows it to run lockless and scale up with the number of cores.
  • Scylla uses its own cache and does not rely on the operating system cache. It saves a data copy and does not slow down due to page faults.

I must say that this intrigued me: if you look at the open-source NoSQL data systems that picked up, there is one camp of C++, high-performance but simple functionality (memcached or Redis), and the heavy-functionality but JVM-based camp (Spark, Hadoop, Cassandra). If you can combine the good of both worlds – it sounds great.

 

Where does that meet Outbrain?

Outbrain is a heavy user of Cassandra. We have a few hundred Cassandra machines running in 20 clusters over 3 datacenters, storing 1-2 terabytes of data each. Some of the clusters are hit at user query time, and unexpected latency is an issue. As data, traffic and complexity grew with Outbrain, it became more and more complex to maintain the Cassandra clusters and keep their performance reasonable. It always required more and more hardware to support the growth as well as the performance.

The promise of getting stable latency and 5-10x more throughput (far fewer machines), without the cost of rewriting our code, made a lot of sense, and we decided to give it a shot.

One thing we deeply needed that was not yet in the product was cross-DC clusters. The Cassandra feature of eventual consistency across clusters in different data centers is key to how Outbrain operates, and it was very important for us. It took the guys from ScyllaDB a couple of months to finish that feature and test and verify that everything works, and then we were ready to go.

ScyllaDB team is located in Herzliya which is very close to our office in Netanya and they were very happy to come and start the test.

 

The team working on this test is:

Doron Friedland – Backend engineer at Outbrain’s App Services team.

Evgeny Rachlenko – from Outbrain’s Data Operations team.

Tzach Liyatan – ScyllaDB Product manager.

Shlomi Livne – ScyllaDB VP of R&D.

 

The first step was to choose the right cluster and functionality to run the test on. After a short consideration we chose to run this comparison test on the cluster that holds our entire Documents store. It holds the information about all active documents in Outbrain’s system. We are talking about a few million documents, each of which has hundreds of different features represented as Cassandra columns. This store is updated all the time and accessed on every user request (a few million requests every minute). Cassandra started struggling with this load, and we applied many solutions and optimizations in order to keep up with it. We also enlarged the cluster so we could keep it up.

One more thing that we did in order to overcome the Cassandra performance issues was to add a level of application cache, which consumes a few more machines by itself.

One can say, that’s why you chose a scalable solution like Cassandra so you can grow it as you wish. But when the number of servers start to rise and have significant cost, you want to look at other solutions. This is where ScyllaDB came into play.

 

The next step was to install a cluster, similar in size to the production cluster.

 

Evgeny describes below the process of installing the cluster:

Well, the installation impressed me in two aspects.

The configuration part was pretty much the same as Cassandra, with a few changes in parameters.

Scylla simply ignores the GC and HEAP_SIZE parameters and uses its configuration as an extension of the cassandra.yaml file.

Our Cassandra clusters run with many components integrated into the Outbrain ecosystem. Shlomi and Tzach properly defined the most important graphs and alerts. Services such as Consul, collectd and Prometheus with Grafana have also been integrated as part of the POC. Most integration tests passed without my intervention, except for light changes in the Scylla Chef cookbook.

 

Tzach is describing what it looked like from their side:

The Scylla installation, done by Evgeny, used a clone of the Cassandra Chef recipes, with a few minor changes. Nodetool and cqlsh were used for a sanity test of the new cluster.

As part of this process, Scylla metrics were directed to Outbrain’s existing Prometheus/Grafana monitoring system. Once traffic was directed to the system, the application and ScyllaDB metrics were all in one dashboard, for easy comparison.

 

Doron is describing the application level steps of the test:

    1. Create dual DAO to work with ScyllaDB in parallel to our Cassandra main storage (see elaboration on the dual DAO implementation below).
    2. Start dual writes to both clusters (in production).
    3. Start dual read (in production) to read from ScyllaDB in addition to the Cassandra store (see the test-dual DAO elaboration below).
    4. Not done yet: migrate the entire data from Cassandra to ScyllaDB by streaming the data into the ScyllaDB cluster (similar to migration between Cassandra clusters).
    5. Not done yet: Measure the test-reads from ScyllaDB and compare both the latency and the data itself – to the data taken from Cassandra.
    6. Not done yet: In case the latency from ScyllaDB is better, try to reduce the number of nodes to test the throughput.

 

 

 

Performance metrics:

Here are some very initial measurement results:

You can clearly see below that ScyllaDB is performing much better, with much more stable performance.

One current disclaimer: Scylla still does not have all the historic data and is only using data from the last week.

It’s not visible from the graph, but ScyllaDB is not loaded and thus spends most of its time idling; the more loaded it becomes, the lower the latency will be (up to a limit, of course).

We need to wait and see the coming weeks’ measurements. Follow our next posts.

 

Read latency (99 percentile) of single entry: *First week – data not migrated (2.3-7.3):

Scylla:

Screen Shot 2016-03-15 at 12.35.43 AM

Cassandra:

Screen Shot 2016-03-15 at 12.38.29 AM

 

Read latency (99 percentile) of multi entries (* see comment below): *First week – data not migrated (2.3-7.3):

Scylla

Screen Shot 2016-03-15 at 12.40.11 AM

Cassandra

Screen Shot 2016-03-15 at 12.41.37 AM

 

* The read of multiple partition keys is done by firing single-partition-key requests in parallel and waiting for the slowest one. We have learned that this use case exacerbates every latency issue we have in the high percentiles.

That’s where we are now. The test is moving on and we will update with new findings as we progress.

In the next post Doron will describe the Data model and our Dual DAO which is the way we run such tests. Shlomi and Tzach will describe the Data transfer and upgrade events we had while doing it.

Stay tuned.

 

Micro Service Split

image07

In this post I will describe a technical methodology we used to remove a piece of functionality from a Monolith/Service and turn it into a Micro-Service. I will try to reason about some of the decisions we have made and the path we took, as well as a more detailed description of internal tools, libraries and frameworks we use at Outbrain and in our team, to shed some light on the way we work in the team. And as a bonus you might learn from our mistakes!
Let’s start with the description of the original service and what it does.
Outbrain runs the largest content discovery platform. From the web surfer’s perspective it means serving a recommended content list that might interest her, in the form of ‘You might also like’ links. Some of those impression links are sponsored. ie: when she clicks on a link, someone is paying for that click, and the revenue is shared between Outbrain and the owner of the page with the link on it. That is how Outbrain makes its revenue.

My team, among other things, is responsible for the routing of the user to the requested page after pressing the link, and for the bookkeeping and accounting that is required in order to calculate the cost of the click, who should be charged, etc.
In our case the service we are trying to split is the ‘Bookkeeper’. Its primary role is to manage the paid impression links budget. After a budget is spent, the ‘Bookkeeper’ should notify Outbrain’s backend servers to refrain from showing the impression link again. And this has to be done as fast as possible; if not, people will click on links we cannot charge for because the budget was already spent. Technically, this is done by an update to a database record. However, there are other cases where we might want to stop the exposure of impression links. One such example is a request from the customer paying for the future click to disable the impression link’s exposure. For such cases we have an API endpoint that does exactly the same thing, with the same code. That endpoint is actually part of the ‘Bookkeeper’ and is enabled by a feature toggle on specific machines. This ‘Activate-Impressionable’ endpoint, as we call it, is what we decided to split out of the ‘Bookkeeper’ into a designated Micro-Service.
In order to execute the split, we chose a step-by-step strategy that would allow us to reduce the risk during execution and keep it as controlled and reversible as possible. From a bird’s eye view I would describe it as a three-step process: Plan, Up and Running as fast as possible, and Refactor. The rest of the post describes these steps.

Plan (The who’s and why’s)

In my opinion this is the most important step. You don’t want to split a service just in order to split. Each Micro Service introduces maintenance and management overhead, with its own set of challenges[1]. On the other hand, Microservices architecture is known for its benefits such as code maintainability (for each Micro Service), the ability to scale out and improved resilience[2].
Luckily for me, someone already did that part for me and took the decision that ‘Activate-Impressionable’ should be split from the ‘Bookkeeper’. But still, let’s name some of the key factors of our planning step.
Basically I would say that a good separation is a logical separation with its own non-overlapping RESTful endpoints and an isolated code base. The logical separation should be clear. You should think about what the functionality of the new service is, and how isolated it is. It is possible to analyze the code for inter-dependencies among classes and packages using tools such as Lattix. At the bottom line, it is important to have a clear definition of the responsibility of the new Micro Service.
In our case, the ‘Bookkeeper’ was eventually split so that it remained the bigger component, ‘Activate-Impressionable’ was smaller, and the common library was smaller than both. The exact lines of code can be seen in the table below.

Screen Shot 2016-02-21 at 11.21.56

Unfortunately I assessed this only after the split and not in the planning step. Looking at those numbers, we might say there is too much code in common. It is something worth considering when deciding what to split – a lot of common code implies a low isolation level.

Of course, part of the planning is time estimation. Although I am not a big fan of “guesstimates”, I can tell that the task was planned for a couple of weeks and took about that long.
Now that we have a plan, let’s get to work.

Up and Running – A Step by Step guide

As in every good refactor, we want to do it in small baby steps and remain ‘green’ all the time[3]. In continuous deployment that means we can and do deploy to production as often as possible to make sure it is ‘business as usual’. In this step we want to get to a point where the new service is working in parallel to the original service. At the end of this step we will point our load balancers to the new service endpoints. In addition, the code remains shared in this step, which means we can always deploy the original, fully functioning ‘Bookkeeper’. We actually do that if we feel the latest changes carry any risk.
So let’s break it down into the actual phases:

Here is a breakdown of the phases (the ‘micro service split N’ images referenced below are the overview diagrams for each step):

  • Starting phase (diagram: micro service split 0).
  • Create the new empty Micro-Service ‘Activate-Impressionable’ (diagram: micro service split 1). At Outbrain we do this using the scaffolding of the ob1k framework. Ob1k is an open-source Micro-Services framework that was developed in-house.
  • Create a new empty library depended on by both the new ‘Activate-Impressionable’ service and the ‘Bookkeeper’ (diagram: micro service split 2). Ideally, if there is a full logic separation with no mutual code between the services, this library will be deleted in the cleanup phase.
  • Move the relevant source code to the library (diagram: micro service split 3). Luckily, in our case there was one directory that was clearly what we had to split out. Unluckily, that code also pulled in some more code it depended on, and this had to be done carefully so as not to pull too much nor too little. The good news is that this phase is pretty safe for statically typed languages such as Java, in which our service is written. The compiler protects us here with compilation errors, so the feedback loop is very short. Tip: don’t forget to move unit tests as well.
  • Move common resources to the library (diagram: micro service split 4), such as Spring beans defined in XML files and our feature-flag files defined in YAML files. This is the dangerous part. We don’t have the compiler to help here, so we actually test it in production. And when I say production I mean using staging/canary/any environment with production configuration but without real impact. Luckily again, both YAML and Spring beans are configured to fail fast, so if we did something wrong it will just blow up in our face and the service will refuse to come up. For this step I even ended up developing a one-liner bash script to assist with those wicked YAML files.
  • Copy and edit web resources (web.xml) to define the service endpoints (diagram: micro service split 5). In our case web.xml can’t reside in a library, so it had to be copied. Remember, we still want the endpoints active in the ‘Bookkeeper’ at that phase. Lesson learned: inspect all files closely. In our case log4j.xml, which seems like an innocent file by its name, contains designated appenders that are consumed by other production services. I didn’t notice that and didn’t move the required appender, and it was found only later in production.
  • Deploy – deploy the new service to production. What we did is deploy ‘Activate-Impressionable’ side by side on the same machines as the ‘Bookkeeper’, just with a different port and context path. Definitely makes you sleep better at night.
  • Up and Running – now is a good time to test once again that both ‘Bookkeeper’ and ‘Activate-Impressionable’ are working as expected. Generally we are now up and running, with only a few more things to do.
  • Clients Redirect – point the clients of the service to the new endpoints (port + context path). A step that might take some time, depending on the number of clients and the ability to redeploy them. At Outbrain we use HAProxy, so reconfiguring it did most of the work, but some clients did require code modifications.
  • (More) Validation – move/copy simulator tests and monitors. In our team, we rely heavily on tests we call simulator tests. These are actually black-box tests written in JUnit that run against the service installed on a designated machine. These tests see the service as a black box and call its endpoints while mocking/emulating other services and data in the database for the test run. So a typical test run can look like: put something in the database, trigger the endpoint, and check the result in the database or in the HTTP response. There is also a question here of whether to test ‘Activate-Impressionable’ or the ‘Bookkeeper’. Ideally you will test them both (tests are duplicated for that phase), and that is what we did.

 

Refactor, Disconnect & Cleanup

When we got here, the new service was working and we expected no more behaviour changes from the endpoints’ point of view. But we still want the code to be fully split and the services to be independent of each other. In the previous step we performed the phases in a way that kept everything reversible with a simple feature toggle & deploy.

In this step we move to a state where the ‘Bookkeeper’ no longer hosts the ‘Activate-Impressionable’ functionality. Sometimes it is a good idea to leave a gap after the previous step, to make sure that there are no rejections and backfires that we didn’t trace in our tests and monitoring.
The first thing, if it was not done up until now, is deploying the ‘Bookkeeper’ without the service functionality and making sure everything still works. And then wait a little bit more…
And now we just have to push the sources and the resources from the library into the ‘Activate-Impressionable’ service. In the ideal case, where there is no common code, we can also delete the library. This was not our case – we still have a lot of common code that we can’t separate for the time being.
Now is also the time to do resources cleanup, web.xml edit etc’.
And for the bold and OCD among us – packages rename and refactor of code with the new service naming conventions.

Conclusion

image02
The entire process in our case took a couple of weeks. Part of the fun and advantage in such a process is the opportunity to get to know an old code base – its structure and functionality – without the need to modify something for a new feature with its constraints. Especially when someone else wrote it originally.
In order to perform such a process well, it is important to plan and remain organized and on track. In case of a context switch it is very important to keep a bookmark of where you need to return to in order to continue. In our team we even did that with a handoff of the task between developers. Extreme Programming, it is.
It is interesting to see the surprising results in terms of lines of code. Originally we thought of it as splitting a micro-service from a monolith. In retrospect, it looks to me more like splitting a service into two services. ‘Micro’ in this case is in the eye of the beholder.

References

[1] http://highscalability.com/blog/2014/4/8/microservices-not-a-free-lunch.html
[2] http://eugenedvorkin.com/seven-micro-services-architecture-advantages/
[3] http://blog.cleancoder.com/uncle-bob/2014/12/17/TheCyclesOfTDD.html

http://martinfowler.com/articles/microservices.html
https://github.com/outbrain/ob1k
http://www.yaml.org/
http://lattix.com/

DevOps – The Outbrain Way

Like many other fast moving companies, at Outbrain we have tried several iterations in the  attempt to find the most effective “DevOps” model for us. As expected with any such effort, the road has been bumpy and there have been many “lessons learned” along the way. As of today, we feel that we have had some major successes in refining this model, and would like to share some of our insights from our journey.

 

Why to get Dev and Ops together in the first place?

A lot has been written on this topic, and the motivations and benefits of adding the operational perspective into the development cycles has been thoroughly discussed in the industry – so we will not repeat those.

I would just say that we look at these efforts as preventive medicine, like eating well and exercising – life is better when you stay healthy. It’s not as good when you get sick and seek medical treatment to get healthy again.

 

What’s in a name?

We do not believe in the term “DevOps”, and what it represents.  We try hard to avoid it –  why is that?

Because we expect every Operations engineer to have development understanding and skills, and every developer to have an operational understanding of how the service he/she develops works – and we help them achieve and improve those skills – so everyone is DevOps.

We do believe there is a need to get more system and production skills and expertise closer to the development cycles – so we call it Production Engineers.

 

First try – Failed!

We started by assigning Operations Engineers to work with dedicated development groups – the big problem was that this was done on top of their previous responsibility of building the overall infrastructure (config management, monitoring infrastructure, network architecture, etc.), which was already a full-time job as it was.

This mainly led to frustration on both sides – the operations engineers, who felt they had no time to do anything properly, were just touching the surface all the time and were spread too thin, and the developers, who felt they were not getting enough bandwidth from operations and were being held back.

Conclusion – in order to succeed we need to go all in – have dedicated resources!

 

Round 2 – Dedicated Production Eng.

Not giving up on the concept, and learning from round 1, we decided to create a new role – “Production Engineer” (or PE for short) – dedicated to specific development groups.

This dedication manifests at different levels. Some of them are semi-trivial aspects, like seating arrangements – having the PE sit with the development team and share with them the day-to-day experience; and some of them are focus-oriented, like joining the development team’s goals and actually becoming an integral part of the development team.

On the other hand, the PE needs to keep a very close relationship with the Infrastructure Operations team, which continues to develop the infrastructure and tools used by the PEs, and supports the PEs with technical expertise on more complex issues that require subject matter experts.

 

What & How model:

So how do we prevent a split-brain situation for the PE? Is the PE part of the development team or the Operations team? When you have several PEs supporting different development groups – are they all independent, or can we gain from knowledge transfer between them?

In order to have a lighthouse to help us answer all those questions, and more that would inevitably come up, we came up with the “What & How” model:

“What” – stands for the goals, priorities and what needs to be achieved. “The what” is set by the development team management (as they know best what they need to deliver).

“How” – stands for which methods, technologies and processes should be used to achieve those goals most efficiently from operational perspective. This technical, subject matter guidance is provided by the operations side of the house.

 

So what is a PE @ Outbrain?

At the first stage, an Operations Engineer goes through an onboarding period, during which the engineer gains an understanding of Outbrain’s operational infrastructure. Once this engineer has gained enough mileage, he/she can become a PE, joining a development group and working with them to achieve the development goals, set the “right” way from an operational perspective, properly leveraging the Outbrain infrastructure and tools.

The PE enjoys both worlds – keeping a presence in the Operations group and keeping his/her technical expertise on one hand, and on the other hand being an integral part of the development team.

From a higher-level perspective – we have eliminated the frustration points experienced in our first round of “DevOps” implementation, and we are gaining the benefits of a close relationship and a better understanding of needs and tools between the different development groups and the general Operations group. By the way, we have also gained a new career development path for our Operations Engineers and Production Engineers, who can move between those roles and enjoy different types of challenges and lifestyles.

 


Real Time Performance Monitoring @ Outbrain

Outbrain serves millions of requests per minute, based on a microservice architecture. Consequently, as you might expect, visibility and performance monitoring are crucial.

Serving millions of requests per minute, across multiple data centers, in a micro services environment, is not an easy task. Every request is routed to many applications, and may potentially stall or fail at every step in the flow. Identifying bottlenecks, troubleshooting failures and knowing our capacity limits are all difficult tasks. Yet, these are not things you can just give up on “because they’re hard”, but are rather tasks that every engineer must be able to tackle without too much overhead. It is clear that we have to aim for all engineers to be able to understand how their applications are doing at any given time.

Since we face all of these challenges every day, we reached the point where a paradigm shift was required – for example, moving from the old, familiar “investigate the past” to the new, unfamiliar “investigate the present”. That’s only one of the requirements we came up with. Here are a few more:

 

Real time visibility

Sounds pretty straightforward, right? However, a persistent monitoring system always has at least a few minutes of delay. These few minutes might contain millions of errors that potentially affect your business. Aiming for low MTTR means cutting delays where possible, thus moving from minute-based granularity to second-based.

 

Throughput, Latency and error rate are linked

Some components might suffer from high latency, but maybe the amount of traffic they receive is negligible. Others might have low latency under high load, but that’s only because they fail fast for almost every request (we are reactive!). We wanted to view these metrics together, and rank them by importance.

 

Mathematical correctness at any given aggregation (Don’t lie!)

When dealing with latency, one should look at percentiles, not averages, as averages can be deceiving and might not tell the whole story. But what if we want to view latency per host, and then view it per data center? If we store only percentiles per host (which is highly common in our industry), it is not mathematically correct to average them! On the other hand, we have so much traffic that we can’t just store every measurement with its latency, and we definitely can’t view them all in real time.

 

Latency resolution matters

JVM-based systems tend to display crazy numbers when looking at the high percentiles (how crazy? With heavy GC storms and lock contention there is no limit to how bad these values can get). It’s crucial for us to differentiate between latency at the 99.5 and 99.9 percentiles, while values at the 5th or 10th percentile don’t really matter.

Summing up all of the requirements above, we reached a conclusion that our fancy persistent monitoring system, with its minute-based resolution, supporting millions of metrics per minute, doesn’t cut it anymore. We like it that every host can write thousands of metric values every minute, and we like being able to view historical data over long periods of time, but moving forward, it’s just not good enough. So, as we often do, we sat down to rethink our application-level metric collection and came up with a new, improved solution.

 

Our Monitoring Unit

First, consider metric collection from the application perspective. Logically, it is an application’s point-of-view of some component: a call to another application, to a backend or plain CPU bound computation. Therefore, for every component, we measure its number of requests, failures, timeouts and push backs along with a latency histogram over a short period of time.

In addition, we want to see the worst performing hosts in terms of any such metric (can be mean latency, num errors, etc)

mu

To achieve this display for each measured component we decided to use these great technologies:

 

HDR Histograms

http://hdrhistogram.github.com/HdrHistogram/

HdrHistogram supports the recording and analysis of sampled data value counts, across a configurable value range, with configurable value precision within the range. It is designed for recording histograms of latency measurements in performance-sensitive applications.

Why is this important? Because when using such histograms to measure the latency of some component, it allows you to have good accuracy of the values in the high percentiles at the expense of the low percentiles

So, we decided to store in memory instances of histograms (as well as counters for requests, errors, timeouts, push backs, etc) for each measured component. We then replace them each second and expose these histograms in the form of rx.Observable using our own OB1K application server capabilities.
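
As a rough sketch of what such a monitoring unit can look like (assuming the HdrHistogram library; the class and method names are ours for illustration, not OB1K’s API), latency samples go into an HdrHistogram Recorder, simple counters track outcomes, and once a second a snapshot is taken and handed to whatever publishes it:

import java.util.concurrent.atomic.AtomicLong;
import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

// One monitoring unit per measured component.
public class MonitoringUnit {

    private final Recorder latencyRecorder = new Recorder(3);   // 3 significant digits
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong errors = new AtomicLong();
    private final AtomicLong timeouts = new AtomicLong();

    public void recordSuccess(long latencyMicros) {
        requests.incrementAndGet();
        latencyRecorder.recordValue(latencyMicros);
    }

    public void recordError()   { requests.incrementAndGet(); errors.incrementAndGet(); }
    public void recordTimeout() { requests.incrementAndGet(); timeouts.incrementAndGet(); }

    // Called once a second: swaps out the interval histogram and resets the counters.
    // The returned snapshot is what would get pushed into the rx.Observable stream.
    public Snapshot snapshot() {
        Histogram interval = latencyRecorder.getIntervalHistogram();
        return new Snapshot(requests.getAndSet(0), errors.getAndSet(0), timeouts.getAndSet(0), interval);
    }

    public static class Snapshot {
        public final long requests, errors, timeouts;
        public final Histogram latency;   // e.g. latency.getValueAtPercentile(99.9)

        Snapshot(long requests, long errors, long timeouts, Histogram latency) {
            this.requests = requests;
            this.errors = errors;
            this.timeouts = timeouts;
            this.latency = latency;
        }
    }
}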

All that is left is to aggregate and display.

Java Reactive extensions

https://github.com/ReactiveX/RxJava

rx is a great tool to merge and aggregate streams of data in memory. In our case, we built a service to merge raw streams of measured components; group them by the measured type, and aggregate them in a window of a few seconds. But here’s the trick – we do that on demand. This allows us to let the users view results grouped by any dimension they desire without losing the mathematical correctness of latency histograms aggregation.

Some examples on the operators we use to aggregate the multiple monitoring units:

 

merge

rx merge operator enables treating multiple streams as a single stream

 

window

rx window operator enables sliding window abstraction

 

scan

rx scan operator enables aggregation over each window

 

To simplify things, we can say that for each component we want to display, we connect to each machine to fetch the monitored stream endpoint, perform ‘merge’ to get a single stream abstraction, ‘window’ to get a result per time unit, and ‘scan’ to perform the aggregation
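
Here is a rough sketch of that operator chain in RxJava 1.x style, shown for a single counter metric (e.g. request count per host per second) to keep it short; our real units carry full histograms, but the merge/window/scan structure is the same:

import java.util.List;
import java.util.concurrent.TimeUnit;
import rx.Observable;

public class MetricAggregator {

    // perHostCounts: one stream per machine, each emitting its count once a second.
    public static Observable<Long> totalPerWindow(List<Observable<Long>> perHostCounts) {
        return Observable.merge(perHostCounts)                    // many host streams -> one stream
                .window(5, TimeUnit.SECONDS)                      // slice it into 5-second windows
                .flatMap(window -> window
                        .scan(0L, (acc, next) -> acc + next)      // running total inside the window
                        .lastOrDefault(0L));                      // emit only the final total per window
    }
}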

 

Hystrix Dashboard

https://github.com/Netflix/Hystrix

The guys at Netflix found a great formula for displaying serving components’ status in a way that links between volume, error percentage and latency in a single view. We really liked that, so we adopted this UI to show our aggregated results.

The hystrix dashboard view of a single measured component shows counters of successes, failures, timeouts and push backs, along with a latency histogram, information on the number of hosts, and more. In addition, it provides a balloon view, which grows/shrinks with traffic volume per component, and is color-coded by the error rate.

See below how this looks in the breakdown view of all request components. The user gets a view of all measured components, sorted by volume, with a listing of the worst performing hosts.

view1

Another example shows the view of one application, with nothing but its entry points, grouped by data center. Our Operations guys find this extremely useful when needing to re-balance traffic across data centers.

REBALANCE

 

OK, so far so good. Now let’s talk about what we actually do with it.

Troubleshooting

Sometimes an application doesn’t meet its SLA, be it in latency or error rate. The simple case is due to a broken internal component (for example, some backend went down and all calls to it result in failures). At this point we can view the application dashboard and easily locate the failing call. A more complex use case is an increase in the amount of calls to a high latency component at the expense of a low latency one (for example, cache hit rate drop). Here our drill down will need to focus on the relative amount of traffic each component receives – we might be expecting a 1:2 ratio, while in reality we might observe a 1:3 ratio.

With enough alerting in place, this could be caught by an alert. Having the real time view will allow us to locate the root cause quickly even when the alert is a general one.

troubleshoot

Performance comparison

In many cases we want to compare the performance of two groups of hosts doing the same operation, such as version upgrades or topology changes. We use tags to differentiate groups of machines (each datacenter is a tag, each environment, and even each hostname). We then can ask for a specific metric, grouped by tags, to get the following view:

compare

 

Load testing

We conduct several types of load tests. One is where we shift as much traffic as possible to one data center, trying to hit the first system-wide bottleneck. Another is performed on specific applications. In both cases we use the application dashboard to view the bottlenecks, just like we would when troubleshooting unexpected events.

One thing to keep in mind is that when an application is loaded, sometimes it is CPU-bound and the measurements are skewed because threads just don’t get CPU time. Another case where this happens is during GC. In such cases we must also measure the effect of this phenomenon.

The measured unit in this case is the ‘JVM hiccup’, which basically means taking one thread, letting it sleep for a while, and measuring how much time it took on top of the sleep time. Low hiccup values mean we can rely on the numbers presented by the other metrics.
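
A minimal sketch of such a hiccup meter (the same idea the jHiccup tool implements) looks roughly like this – the names and thresholds here are illustrative:

// Sleep for a fixed interval and measure how much longer the wake-up actually took.
// Large extras mean GC pauses or CPU starvation, so other metrics from that period are suspect.
public class HiccupMeter implements Runnable {

    private static final long INTERVAL_MS = 1;

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(INTERVAL_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long hiccupMicros = (System.nanoTime() - start) / 1_000 - INTERVAL_MS * 1_000;
            // In the real setup this value would be recorded into a histogram like any other metric.
            if (hiccupMicros > 10_000) {   // report hiccups above 10ms
                System.out.println("hiccup: " + hiccupMicros + "us");
            }
        }
    }

    public static void main(String[] args) {
        new Thread(new HiccupMeter(), "hiccup-meter").start();
    }
}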

hiccup

 

What’s next?

Real time monitoring holds a critical role in our everyday work, and we have many plans to leverage these measurements. From smarter metric driven load balancing in the client to canary deployments based on real application behavior – there is no limit to what you can achieve when you measure stuff in a fast, reliable manner.

Monitoring APIs with ELK

The Basics

One of the main challenges we’ve dealt with during the last couple of years was opening our platform and recommendation engine to the developer community. With the amount of data that Outbrain processes, direct relationships with hundreds of thousands of sites, and a reach of more than 600M users a month, we can drive the next wave of content innovation. One of Outbrain’s main drivers for enabling an automated, large-scale recommendations system is to give application developers the option to interact with our system via API.

Developers build applications, and those applications are used by users, in different locations and at different times. When exposing an API to external usage, you can rarely predict how people will actually use it.

These variations can come from different reasons:

  1. Unpredictable scenarios
  2. Unintentional misuse of the API, whether due to a lack of proper documentation, a bug, or simply because a developer didn’t RTFM.
  3. Intentional misuse of the API. Yes, you should expect people to abuse your API or use it for fraudulent activity.

In all those cases, we need to know how the developer community is using the APIs and how the end users (applications) are using them, and we also need to be able to take proactive measures.

Hello ELK.

The Stack

image01

ElasticSearch, Logstash and Kibana (AKA ELK) are great tools for collecting, filtering, processing, indexing and searching through logs. The setup is simple: our service writes logs (using Log4J), the logs are picked up by a Logstash agent that sends them to an ElasticSearch index, and Kibana is set up to visualize the data of the ES index.

The Data

Web server logs are usually too generic, and application debug logs are usually too noisy. In our case, we have added a dedicated log with a single line for every API request. Since we’re in application code, we can enrich the log with interesting fields, like the country of the request’s origin (translating the IP to a country), etc.

Here’s a list of useful fields:

  • Request IP  – Don’t forget about XFF header
  • Country / City – We use a 3rd party database for translating IPs to country.
  • Request User-Agent
  • Request Device Type – Resolved from the User-Agent
  • Request Http Method – GET, POST, etc.
  • Request Query Parameters
  • Request URL
  • Response HTTP Status code – 200, 204, etc.
  • Response Error Message – The API service can fill in extra details on errors.
  • Developer Identifier / API Key – If you can identify the Developer, Application or User, add these fields.
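
As a minimal sketch of what such a one-line-per-request log can look like (the field names follow the list above; the method signature is illustrative, not our actual code), using Log4J:

import org.apache.log4j.Logger;

// One line per API request, in key=value form that Logstash can parse into ES fields.
public class ApiAccessLog {

    private static final Logger API_LOG = Logger.getLogger("api-access");

    public static void log(String clientIp, String country, String deviceType,
                           String httpMethod, String url, int httpStatus,
                           String errorMessage, String apiKey) {
        API_LOG.info(String.format(
                "ip=%s country=%s device=%s method=%s url=%s status=%d error=%s apiKey=%s",
                clientIp, country, deviceType, httpMethod, url, httpStatus, errorMessage, apiKey));
    }
}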

What can you get out of this?

So we’ve got the data in ES, now what?

Obvious – Events over time

image03

This is pretty trivial. You want to see how many requests are made. With Kibana’s slice ‘n dice capabilities, you can easily break it down per application, country, or any other field that you’ve bothered to add. If an application is abusing your API and calling it a lot, you can see whose request count just jumped and handle it.

Request Origin

image04

If you’re able to resolve the request IP (or the XFF header IP) to a country, you’ll get a cool-looking map/table and see where requests are coming from. This way you can detect anomalies like fraud.

 

Http Status Breakdown

image02

By itself, this is nice to have. Combined with Kibana’s slice ‘n dice capabilities, it lets you see an overview for any breakdown. In many cases you can see that an application/developer is issuing the wrong API call. Be proactive and lend some assistance in near real time. Trust us, they’ll be impressed.

IP Diversity

image00

Why would you care about this? Consider the following: a developer creates an application using your API, but all requests are made from a limited number of IPs. This could be intentional, for example if all requests are made through some cloud service. It could also hint at a bug in the integration of the API. Now you can investigate.

Save the Best for Last

The data exists in ElasticSearch. Using Kibana is just one way of using it. Here are a few awesome ways to use the data.

Automated Validations (or Anomaly detection)

Once we identified key anomalies in API usage, we set up automated tests to search for these anomalies on a daily basis. Automatic anomaly detection in API usage proved to be incredibly useful when scaling a product. These tests can be run on demand or scheduled, and a daily report is produced.

image05

Abuse Detection

ElasticSearch is (as the name suggests) very elastic. It enables querying and aggregating the data in a variety of ways. Security experts can (relatively) easily slice & dice the data to find abuse patterns. For example, we detect when the same user-id is used in two different locations and trigger an alert.

Key Takeaways

  • Use ELK for analyzing your API usage
  • Have the application write the events (not a generic web-server).
  • Provide application-level information. E.g. Additional error information, Resolved geo location.
  • Share the love

How to substitute dynamic URL parameters: a template engine use case

Background

Outbrain is one of the world’s largest content discovery platforms, serving more than 200B recommendations monthly and reaching over 561 million unique visitors from across the globe.

As Outbrain becomes an important source of traffic and revenues for publishers, partners are looking for more tracking capabilities that will help them determine the amount of traffic generated by Outbrain, and how this traffic is distributed throughout their site. In addition, publishers want to see this information on their analytic tool of choice (Omniture, GA, Webtrends etc.).

Outbrain now provides them with tracking parameters that can be appended to the URL of recommendations.

Motivation

A URL is constructed of three parts: a prefix, the actual URL and a suffix, where the prefix and suffix serve as means for our partners to add tracking capabilities in the form of dynamic parameters (for instance: the URL’s title, publish date and id). Eventually, the dynamic parameters are added as a query string (field1=value1&field2=value2&field3=value3) to the recommendation URL.

The legacy code would concatenate the three URL parts and use String replace in order to add the dynamic parameters. This implementation was hard to maintain (in situations where we wanted to support more dynamic parameters) as well as difficult to use, since there was a need to import a lot of code into a project and to depend on many modules.

When revisiting the problem, we understood that it would be appropriate to use a Separation of Concerns approach, separating the template from the model, and from the transformation logic itself – a template engine sounded like the right choice! In addition, since more and more dynamic parameters were being added, we used a builder pattern in order to achieve an easier and cleaner usage for clients, and easier maintenance in the future.

The problem – using a template engine doesn’t guarantee better performance

We decided to use StringTemplate as our template engine since we were familiar with it and had some good experience using it. The result of the refactor was a cleaner, shorter and maintainable API.

Unfortunately, when we deployed the new changes to production, we noticed a significant increase in serving time that was unacceptable in terms of user experience. After investigating the root cause, we found out that the usage of StringTemplate was pretty expensive. Even though templates can be reused, we couldn’t reuse them all; for instance, we created a new template for each request, since it was constructed with different dynamic parameters (the form of the URL, though, was the same: prefix, actual URL and suffix).

So at that moment we had a clean and elegant solution that wasn’t performing well. We then looked for some alternative solutions for StringTemplate that could save us the expensive cost of constructing a new template for each request.

The solution – right tool for the job

Eventually, we found a lightweight template engine that allowed us to keep using the Separation of Concerns approach and still achieve good performance. We ended up using Apache Commons Lang 3.0’s StrSubstitutor – a simpler alternative to StringTemplate. This time, we made sure it outperformed our last implementation by doing some micro-benchmarking, and indeed the results were much better: the new implementation executed more than 4 times as many operations per second.
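
For illustration, here is a minimal sketch of the StrSubstitutor-based approach. The template is a plain string with ${...} placeholders for the dynamic parameters; the parameter names and the example URL below are made up, not our actual tracking fields:

import java.util.HashMap;
import java.util.Map;
import org.apache.commons.lang3.text.StrSubstitutor;

public class TrackingUrlBuilder {

    // The template (prefix + actual URL + suffix) is reusable; only the map changes per request.
    public static String build(String urlTemplate, Map<String, String> params) {
        return new StrSubstitutor(params).replace(urlTemplate);
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("title", "my-article");
        params.put("publishDate", "2016-01-01");

        String template = "http://example-publisher.com/article"
                + "?utm_source=outbrain&ob_title=${title}&ob_pubdate=${publishDate}";

        // -> http://example-publisher.com/article?utm_source=outbrain&ob_title=my-article&ob_pubdate=2016-01-01
        System.out.println(build(template, params));
    }
}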

MicroBenchmarking Results

We used Java Microbenchmark Harness in order to perform our performance measurements.

[Image: microbenchmark results]

Raw data:

[Image: raw benchmark data]

For more information see: http://openjdk.java.net/projects/code-tools/jmh/
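For reference, a JMH comparison of the two approaches could look roughly like the sketch below. The class, templates and values are illustrative, not our actual benchmark code.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.apache.commons.lang3.text.StrSubstitutor;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.stringtemplate.v4.ST;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class TemplateEngineBenchmark {

    private final Map<String, String> model = new HashMap<>();

    {
        model.put("url", "http://example.com/article/123");
        model.put("title", "some-article-title");
    }

    @Benchmark
    public String stringTemplate() {
        // A new ST instance per invocation, mirroring the per-request template
        // construction described above (ST uses <...> as its default delimiters).
        ST st = new ST("<url>?src=outbrain&title=<title>");
        st.add("url", model.get("url"));
        st.add("title", model.get("title"));
        return st.render();
    }

    @Benchmark
    public String strSubstitutor() {
        // StrSubstitutor works directly off the model map with ${...} tokens.
        return new StrSubstitutor(model).replace("${url}?src=outbrain&title=${title}");
    }
}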

Goodbye static CNAMEs, hello Consul

Nearly every large scale system becomes distributed at some point: a collection of many instances and services that compose the solution you provide. And as you scale horizontally to provide high availability, better load distribution, etc…, you find yourself spinning up multiple instances of services, or using systems that function in a clustered architecture. That’s all cool in theory, but soon after you ask yourself, “how do I manage all of this? How should these services communicate with each other? And how do they even know what instances (or machines) exist?”

Those are excellent questions!

What methods are in use today?

The naive approach, which we’d followed in Outbrain for many years, is to route all inter-service traffic via load balancers (HAProxy in our case). Every call to another system, such as a MySql slave, is done to the load balancer (one in a pool of many), via an agreed upon name, such as a DNS CNAME. The load balancer, which holds a static configuration of all the different services and their instances, directs the call to one of those instances, based on the predefined policy.

backend be_onering_es   ## backend name
  balance leastconn     ## how to distribute load
  option httpchk GET /  ## service health check method
  option httpclose      ## add "Connection: close" header if missing
  option forwardfor     ## send client IP through XFF header
  server ringdb-20001 ringdb-20001:9200 check slowstart 10s weight 100   ## backend node 1
  server ringdb-20002 ringdb-20002:9200 check slowstart 10s weight 100   ## backend node 2

The load balancer is also responsible for checking service health, to make sure requests are routed only to live services, as dead ones are “kicked out of the pool”, and revived ones are brought back in.

An alternative to the load balancer method, used in high-throughput systems such as Cassandra, is configuring CNAMEs that point to specific nodes in the cluster. We then use those CNAMEs in the consuming application's configuration. The client is then responsible for applying a policy of balancing between those nodes, both for load and for availability.
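To make the client-side option concrete, here is a naive round-robin sketch. The node names are placeholders, and real clients (the Cassandra driver, for example) ship far richer load-balancing and failover policies.

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinNodeSelector {

    private final List<String> nodes;
    private final AtomicInteger counter = new AtomicInteger();

    public RoundRobinNodeSelector(List<String> nodes) {
        this.nodes = List.copyOf(nodes);
    }

    // Returns the next node in a simple round-robin rotation.
    public String next() {
        int index = Math.floorMod(counter.getAndIncrement(), nodes.size());
        return nodes.get(index);
    }

    public static void main(String[] args) {
        // Placeholder CNAMEs standing in for the per-node records described above.
        RoundRobinNodeSelector selector = new RoundRobinNodeSelector(List.of(
                "cassandra-node-1.mydc.example.com",
                "cassandra-node-2.mydc.example.com",
                "cassandra-node-3.mydc.example.com"));

        System.out.println(selector.next());
        System.out.println(selector.next());
    }
}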

OK, so what’s the problem here?

There’s a few actually:

  1. The mediator (Load balancer), as quick as it may be in processing requests (and HAProxy is really fast!), is another hop on the network. With many services talking to each other, this could prove a choke point in some network topologies. It’s also a shared resource between multiple services and if one service misbehaves, everyone pays the price. This is especially painful with big payloads.
  2. The world becomes very static! Moving services between hosts, scaling them out/in, adding new services – it all involves changing the mediator's config, and in many cases this is done manually. Manual work requires expertise and is error prone. When the changes become frequent… it simply does not scale.
  3. When moving ahead to infrastructure that is based on containers and resource management, where instances of services and resources are allocated dynamically, the whole notion of HOSTNAME goes away and you cannot count on it in ANY configuration.

What this all adds up to is “the end of the static configuration era”. Goodbye static configs, hello Dynamic Service Discovery! And cue Consul.

What is Consul?

In a nutshell, Consul is a Service Discovery System, with a few interesting features:

  1. It’s a distributed system, made out of an agent in each node. Nodes talk to each other via a gossip protocol, making node discovery simple, robust, and dynamic. There’s no configuration file describing all members of a Consul cluster.
  2. It’s fault tolerant by design, and using concepts such as Anti Entropy, gracefully handles nodes disappearing and reappearing – a common scenario in VM/container based infrastructure.
  3. It has first-class treatment of datacenters, as self-contained, interconnected entities. This means that DC failure / disconnection would be self-contained. It also means that a node in one DC can query for information in another DC with as little knowledge as the remote DC’s name.
  4. It holds the location (URI) and health of every service on every host, and makes this data available via multiple channels, such as a REST API and GUI. The API also lets you make complex queries and get the service data segment you're interested in. For example: get me all the addresses of all instances of service ‘X’ from Datacenter ‘Y’ in ‘staging env’ (tag). A minimal sketch of such a query appears right after this list.
  5. There is a very simple way to get access to “Healthy” service instances by leveraging the Consul DNS interface. Perfect for those pesky 3rd party services whose code you can’t or don’t want to modify, or just to get up and running quickly without modifying any client code (disclaimer: doesn’t fit all scenarios).
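As an example of the kind of query mentioned in item 4, the sketch below asks a local Consul agent's HTTP API for the healthy instances of a service. The service name, datacenter and tag are placeholders; the endpoint (/v1/health/service/<name>) is part of Consul's standard API.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConsulHealthQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder service name, datacenter and tag; "passing" filters out unhealthy instances.
        String url = "http://localhost:8500/v1/health/service/graphite?passing&dc=dc1&tag=prod";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The response is a JSON array of healthy instances, each carrying the node
        // address, the service port and the associated check results.
        System.out.println(response.body());
    }
}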

How does Consul work?

You can read all about it here, but let me take you through a quick tour of the architecture:

[Diagram: Consul architecture, showing agents (“clients”) and servers across multiple datacenters]

As you can see, Consul has multi datacenter awareness built right in (you can read more about it here). But for our case, let’s keep it simple, and look at the case of a single datacenter (Datacenter 1 in the diagram).

What the diagram tags as “Clients” are actually “Consul agents”, running locally on every participating host. Those talk to each other, as well as the Consul servers (which are “agents” configured as Servers), through a “Gossip protocol”. If you’re familiar with Cassandra, and that rings a bell, then you’re right, it’s the same concept used by Cassandra nodes to find out which ones are up or down in a cluster. A Gossip protocol essentially makes sure “Everybody knows Everything about Everyone”. So within reasonable delay, all agents know (and propagate) state information about other agents. And you just so happen to have an agent running locally on your node, ready to share everything it knows via API, DNS or whatnot. How convenient!

Agents are also the ones performing health checks to the services on the hosts they run on, and gossiping any health state changes. To make this work, every service must expose a means to query its health status, and when registered with its local Consul agent, also register its health check information. At Outbrain we use an HTTP based “SelfTest” endpoint that every one of our homegrown services exposes (through our OB1K container, practically for free!).
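For illustration, this is roughly what registering a service and its health check with the local agent looks like through Consul's HTTP API. The service name, port and check path below are hypothetical; the register endpoint itself is Consul's standard agent API.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConsulServiceRegistration {
    public static void main(String[] args) throws Exception {
        // Hypothetical service definition with an HTTP health check, as a JSON payload.
        String body = "{"
                + "\"Name\": \"my-service\","
                + "\"Port\": 8080,"
                + "\"Check\": {"
                + "  \"HTTP\": \"http://localhost:8080/SelfTest\","
                + "  \"Interval\": \"60s\""
                + "}"
                + "}";

        // PUT the definition to the Consul agent running locally on the host.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8500/v1/agent/service/register"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Registration returned HTTP " + response.statusCode());
    }
}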

Consul servers are also part of the gossip pool and thus propagate state in the cluster. However, they also maintain a quorum and elect a leader, who receives all updates (via RPC calls forwarded from the other servers) and registers them in its database. From here on, the data is replicated to the other servers and propagated to all the agents via Gossip. This method is a bit different from other Gossip based systems that have no servers and leaders, but it allows the system to support stronger consistency models.

There’s also a distributed key-value store we haven’t mentioned, rich ACLs, and a whole ecosystem of supporting and derived tools… but we said we’d keep it simple for now.

Where does that help with service discovery?

First, what we've done is taken all of our systems already organized in clusters and registered them with Consul. Systems such as Kafka, Zookeeper, Cassandra and others. This allows us to select a live service node from a cluster, simply by calling a hostname through the Consul DNS interface. For example, take Graphite: Outbrain's systems are currently generating ~4M metrics per minute. Getting all of these metrics through a load balancer, or even a cluster of LBs, would be suboptimal, to say the least. Consul allows us to have each host send metrics to a hostname, such as “graphite.service.consul”, which returns a random IP of a live graphite relay node. Want to add a couple more nodes to share the load? No problem: just register them with Consul and they automagically appear in the list the next time a client resolves that hostname. Which, as we mentioned, happens quite a few times a minute. No load balancers in the way to serve as choke points, no editing of static config files. Just simple, fast, out-of-band communication.
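Because the consuming side is just doing a DNS lookup, resolving a Consul service name needs no special client library. A minimal sketch, using the relayng service name that also appears in the example below and assuming the DNS forwarding setup described later in this post:

import java.net.InetAddress;

public class ConsulDnsLookup {
    public static void main(String[] args) throws Exception {
        // Resolves to the currently healthy instances registered in Consul.
        InetAddress[] nodes = InetAddress.getAllByName("relayng.service.consul");
        for (InetAddress node : nodes) {
            System.out.println(node.getHostAddress());
        }
        // Note: the JVM caches DNS results (networkaddress.cache.ttl), which may need
        // tuning if you want lookups to reflect registration changes quickly.
    }
}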

How do these 3rd party services register?

We’re heavy users of Chef, and have thus created a chef cookbook to help us get the job done. Here’s a (simplified) code sample we use to register Graphite servers:

ob_consul 'graphite' do
  owner 'ops-vis'         ## add 'owner' tag to identify owning group
  port 1231               ## port the service is running on
  check_cmd "echo '' | nc localhost 1231 || exit 2"    ## health check shell command
  check_interval '60s'    ## health check execution interval
  template false          ## whether the health check command is a Chef template (for scripts)
  tags ['prod']           ## more tags
end

How do clients consume services?

Clients simply resolve the DNS record they’re interested in… and that’s it. Consul takes care of all the rest, including randomizing the results.

$ host graphite
graphite.dc_name.outbrain.com is an alias for relayng.service.consul.
relayng.service.consul has address 10.10.10.11
relayng.service.consul has address 10.10.10.12

How does this data reach the DNS?

We’ve chosen to place Consul “behind” our internal DNS servers, and forward all requests for the “consul” domain name to a consul agent running on the DNS servers.

zone "consul" IN {
    type forward;
    forward only;
    forwarders { 127.0.0.1 port 8600; };
};

Note that there are other ways to go about this, such as routing all DNS requests to the local Consul agent running on each node and having it forward everything “non-Consul” to your DNS servers. There are advantages and disadvantages to each approach. For our current needs, having an agent sit behind the DNS servers works quite well.

Where does the Consul implementation at Outbrain stand now?

At Outbrain we’re already using Consul for:

  • Graphite servers.
  • Hive Thrift servers that are Hive interfaces to the Hadoop cluster they’re running on. Here the Consul CNAME represents the actual Hadoop cluster you want your query to run on. We’ve also added a layer that enables accessing these clusters from different datacenters using Consul’s multi-DC support.
  • Kafka servers.
  • Elasticsearch servers.

And our roadmap for the near future:

  • MySql Slaves – so we can eliminate the use of HAProxy in that path.
  • Cassandra servers, where the list of active nodes maintained in the app configuration becomes stale over time.
  • Prometheus – our new monitoring and alerting system.
  • Zookeeper clusters.

 

But that's not all! Stay tuned for more on Consul, client-side load balancing, and making your environment more dynamic.

New Hire LaunchPad

The first few days or weeks of a new hire in the company can be critical. What they say about first impressions surely applies in this case as well: this initial time period, in which an engineer needs to understand ‘how things are done here’, is important.

As Outbrain is scaling its engineering R&D group, it became evident that the existing new hire training program was lagging behind. A new hire usually went through an extensive set of frontal onboarding lectures, where each subject was led by a more veteran employee with the proper knowledge. It would usually take a month or so to go through all the lectures, not because there were so many of them, but because they were scattered and opened only when there were enough participants (otherwise, it would put unacceptable pressure on the employees giving them).

Hands-on stuff, such as how to write a standard service, how to deploy it, the standard way to add metrics to your application and so on (AKA the important stuff), was usually taught in an informal way, where each team leader did the best they could under the usual tight time constraints. While this method served us well for quite some time, the recruitment scale – and, in particular, the large number of new hires in a short time – forced us to rethink this process.

What we envisioned was a 2-week bootcamp that each engineering R&D new hire would have to undergo. We considered 3 basic approaches:

  • A frontal lecture class, opened once a month, with people from various teams. This would not be efficient, as the rate of new hires is not consistent and the learning is not by experience. The load on the instructors would be huge, and they don't always have the time or the teaching skills.

  • By-subject self-learning, consisting of a given list of things to study and learn. While this scales better than the first approach, it still lacks the hands-on experience, which is so important for understanding such a diversified, dynamic and multilayered environment.

  • A task-based bootcamp, consisting of a list of incremental tasks (where each new task relies on the actual implementation of the previous one). The bootcampee ends up creating a fully functional, deployed service. The service is a real service, in the sense that it is actually deployed to production, uses key infrastructure systems like any other service, and so on.

Our choice was the task-based bootcamp. In the course of a few weeks, we created a plan (documented in the company's wiki), consisting of 9 separate units. Each unit has its own page, consisting of 4 sections: the unit's overview, some general notes, the steps for the unit and a section with tools and applications for the unit. We added a big feedback link to each page, so we can get the users' feedback and improve accordingly.

The feedback we got is very promising – people can work independently, team leaders can easily scale initial training, and the hands-on experience gained is proving to be extremely important. The bootcamp itself is also used as a knowledge source, which people can go back to in order to understand how to do things.

Our next steps, other than constantly improving the current bootcamp, are identifying other subjects that need to be taught (naturally, more advanced and specific ones) and building bootcamp programs along the same lines – hands-on, progress at your own pace, real-world training.