Category: Uncategorized

Code Retreat @ Outbrain

c0de-retr3at logo-06

Some people say writing code is kind of an art.
I don’t think so.
Well, maybe it is true if you are writing an ASCII-Art script or you are a Brainfuck programmer. I think that in most cases writing code is an engineering. Writing a program that will do something is like a car taking you from A to B that someone engineered. When a programmer writes code it should do something: data crunching, tasks automation or driving a car. Something. For me, an art is a non-productive effort of some manner. Writing a program is not such case. Or maybe it is more like a martial-art (or marshaling-art :-)) where you fight your code to do something.

So — what’s the problem?

Most of the programs I know that evolve over time, needs to have a quality which is not an art, but an ability to be maintainable. That’s usually reflect in a high level of readability.

Readability is difficult.

First, when you write some piece code, there is a mental mode you are in, and a mental model you have about the code and how it should be. When you or someone else read the code, they don’t have that model in mind, and usually, they read only a fragment of the entire code base.
In addition, there are various languages and styles of coding. When you read something that was written by someone else, with a different style or in a different language it is like reading a novel someone wrote in a different dialect or language.
That is why, when you write code you should be thoughtful to the “future you” reading the code, by making the code more readable. I am not going to talk here about how to do it, design patterns or best practices to writing your code, but I would just say that practicing and experience are important aspects in relate to code readability and maintainability.
As a programmer, when I retrospect what I did last week at work I can estimate about 50% of the time or less was coding. Among other things were writing this blog post, meetings (which I try to eliminate as much as possible) and all sort of work and personal stuff that happens during the day. When writing code, among the main goals are the quality, answering the requirements and do both in a timely manner.
From a personal perspective, one of my main goals is improving my skill-set as a programmer. Many times, I find that goal in a conflict with the goals above that were dictated by business needs.

Practice and more practice

There are few techniques that come to solve that by practicing on classroom tasks. Among them are TDD Kata’s and code-retreat days. Mainly their agenda says: “let’s take a ‘classroom’ problem, and try to solve it over and over again, in various techniques, constraints, languages and methodologies in order to improve our skill-set and increase our set of tools, rather than answering business needs”.

Code Retreat @ Outbrain — What do we do there?

So, in Outbrain we are doing code-retreat sessions. Well, we call it code-retreat because we write code and it is a classroom tasks (and a buzzy name), but it is not exactly the religious Corey-Haines-full-Saturday Code-Retreat. It’s an hour and a half sessions, every two weeks, that we practice writing code. Anyone who wants to code is invited — not only the experts — and the goals are: improve your skills, have fun, meet developers from other teams in Outbrain that usually you don’t work with (mixing with others) and learn new stuff.
We are doing it for a couple of months now. Up until now, all sessions were about fifteen minutes of presentation/introduction to the topic, and the rest was coding.
In all sessions, the task was Conway’s game of life. The topics that we covered were:

  • Cowboy programming — this was the first session we did. The game of life was presented and each coder could choose how to implement it upon her own wish. The main goal was an introduction to the game of life so in the next sessions we can concentrate on the style itself. I believe an essential part of improving the skills is the fact that we solve the same problem repeatedly.
  • The next session was about Test-Driven-Development. We watched uncle-bob short example of TDD, and had a fertile discussion while coding about some of the principles, such as: don’t write code if you don’t have a failing test.

After that, we did a couple of pair programming sessions. In those sessions, one of the challenges was matching pairs. Sometimes we do it in a lottery and sometimes people could group together by their own selection, but it was dictated by the popular programming languages that the developers choose to use: Java, Kotlin, Python or JavaScript. We plan to do language introduction sessions in the future.
In addition, we talked about other styles we might practice in the future: mob-programming and pairs switches.
These days we are having functional programming sessions, the first one was with the constraint of “all-immutable” (no loops, no mutable variables, etc’) and it will be followed by more advanced constructs of functional programming.
All-in-all I think we are having a lot of fun, a lot of retreat and little coding. Among the main challenges are keeping the people on board as they have a lot of other tasks to do, keeping it interesting and setting the right format (pairs/single/language). The sessions are internal for Outbrain employees right now, but tweet me (@ohadshai) in case you are around and would like to join.
And we also have cool stickers:
c0de-retr3at logo-07c0de-retr3at logo-05c0de-retr3at logo-04
P.S. — all the materials and examples are here:

The original post was published in my personal blog:

From complex monolith to scalable workflow





One of the core functionalities in Outbrain’s solution is our crawling system.

The crawler fetches web pages (e.g. articles), and index them in our database.

The crawlers can be divided into several high-level steps:

  1. Fetch – download the page
  2. Resolve – identify the page context (e.g. what is the domain?)
  3. Extract – try to extract features out of the HTML – like title, image, description, content etc.
  4. NLP – run NLP algorithms using the extracted features, in order to classify and categorize the page.
  5. Save – store to the DB.



The old implementation

The crawler module was one of the first modules that the first Outbrainers had to implement, back in Outbrain’s small-start-up days 6-7 years ago.


In January 2015 we have decided that it was time to sunset the old crawlers, and rewrite everything from scratch. The main motivations for this decision were:

  1. Crawlers were implemented as a long monolith, without clear steps (OOP as opposed to FP).
  2. Crawlers were implemented as a library that was used in many different services, according to the use-case. Each change forced us to build and check all services.
  3. Technologies are not up-to-date anymore (Jetty vs. Netty, sync vs. async development etc.).



The new implementation

When designing the new architecture of our crawlers, we tried to follow the following ideas:

  1. Simple step-by-step (workflow) architecture, simplifying the flow as much as possible.
  2. Split complex logic to micro-services for easier scale and debug.
  3. Use async flows with queues when possible, to control bursts better.
  4. Use newer technologies like Netty and Kafka.


Our first decision was to split the main flow into 3 services. Since the crawler flow, like many others, is basically an ETL (Extract, Transform & Load) – we have decided to split the main flow into 3 different services:

  1. Extract – fetch the HTML and resolve the domain.
  2. Transform – take features out of the HTML (title, images, content…) + run NLP algorithms.
  3. Load – save to the DB.


The implementation of those services is based on “workflow” ideas. We created interface to implement a step, and each service contains several steps, each step doing a single and simple calculation. For example, some of the steps in the “Transform” service are:







In addition we have implemented a class called Router – that is injected with all the steps it needs to run, and is in charge of running them one after the other, reporting errors and skipping unnecessary steps (for example, no need to run categorization when content extraction failed).


Furthermore, every logic that was a bit complex was extracted out of those services to a dedicated micro-service. For example, the fetch part (download the page from the web) was extracted to a different micro-service. This helped us encapsulate fallback logic (between different http clients) and some other related logics we had outside of the main flow. This is also very helpful when we want to debug – we just make an API call to that service to get the same result the main flow gets.


We modeled each piece of data we extracted out of the page into features, so each page would eventually translated into a list of features:






Publish Date



The data flow in those services was very simple. Each step got all the features that were created up to its run, and added (if needed) one or more features to its output. That way the features list (starting with only URL) got “inflated” going over all the steps, reaching the “Load” part with all the features we need to save.

Screen Shot 2017-01-19 at 12.38.45 PM



The migration


One of the most painful parts of such rewrites is the migration. Since this is a very important core-functionality in Outbrain, we could not just change it and cross fingers that everything is OK. In addition, it took several months to build this new flow, and we wanted to test as we go – in production – and not wait until we are done.


The main concept for the migration was to create this new flow side by side with the old flow, having them both run in the same time in production, allowing us to test the new flow without harming production.


The main steps of the migration were

  1. Create the new services and start implementing some of the functionality. Do not save anything in the end.
  2. Start calls from the old-flow to the new one, a-synchronically, passing all features that the old flow calculated.
  3. Each step in the new flow that we implement can compare its results to the old-flow results, and report when the results are different.
  4. Implement the “save” part – but do it only for a small part of the pages – control it by a setting.
  5. Evaluate the new-flow using the comparison done between the old and new flows results.
  6. Gradually enable the new-flow for more and more pages – monitoring the effect in production.
  7. Once feeling comfortable enough, remove the old-flow and run everything only in the new-flow.


The approach can be described as “TDD” in production. We have created a skeleton for the new-flow and started streaming the crawls into it, while actually it does almost nothing. We have started writing the functionality – each one tested, in production, compared to the old-flow. Once all steps were done and tested, we have replaced the old-flow by the new one.



Where we are now

As of December 20th 2016 we are running only the new flow for 100% of the traffic.

The changes we already see:

  1. Throughput/Scale: we have increased the throughput (#crawls per minute) from 2K to ~15K.
  2. Simplicity: time to add new features or solve bugs decreased dramatically. There is no good KPI here but most bugs are solved in 1-2 hours, including production integration.
  3. Less production issues: it is easier for the QA to understand and even debug the flow (using calls to the micro services) – so some of the issues are eliminated even be getting to developers.
  4. Bursts handling: due to the queues architecture we endure bursts much better. It also allows simple recovery after one of the services is down (maintenance for example).

5. Better auditing: thanks to the workflow architecture, it was very easy to add audit message using our ELK infrastructure (Elastic search, Log-stash, Kibana). The crawling flow today reports the outcome (and issues) of every step it does, allowing us, the QA, and even the field guys to understand the crawl in details without the need of a developer.

Automating your workflow

During development, there are many occasions where we have to do things that are not directly related to the feature we are working on, or things that are repetitive and recurring.
In the time span of a feature development this can often take as much time to do as the actual development.

For instance, updating your local dev micro services environment before testing your code. This task on its own, which usually includes updating your local repo version, building and starting several services and many times debugging and fixing issues caused by others, can take hours, many times just to test a simple procedure.

We are developers, we spend every day automating and improving other people’s workflows, yet we often spend so many hours doing the same time consuming tasks over and over again.
So why not build the tools we need to automate our own workflows?

In our team we decided to build a few tools to help out with some extra irritating tasks we were constantly complaining about to each other.

First one was simple, creating a slush sub-generator. For those of you who don’t know, slush is a scaffolding tool, like yeoman but for gulp. We used this to create our Angular components.
Each time we needed to make a new component we had to create a new folder, with three files:


Each file of course has its own internal structure of predisposed code, and each component had to be registered in the app module and the main less file.

This was obviously extremely annoying to redo each time, so we automated it. Now each time you run “ob-genie” from the terminal, you are asked the name of your component and what module to register it with, and the rest happens on its own. We did this for services and directives too.

Other than saving a lot of time and frustration, this had an interesting side effect – people on the team were creating more components than before! This was good because it resulted in better separation of code and better readability. Seems that many tim the developers were simply too lazy to create a new component and just chucked it all in together. Btw, Angular-CLI have added a similar capability, guess great minds think alike.

Another case we took on in our team was to rid ourselves of the painstaking task of setting up the local environment. This I must say was a real pain point. Updating the repo, building and running the services we needed each time could take hours, assuming everything went well.
There have been times where I spent days on this just to test the simplest of procedures.
Often I admit, I simply pushed my code to a test environment and debugged it there.
So we decided to build a proxy server to channel all local requests to the test environment.

For this we used node-proxy, a very easy to configure proxy. However, this was still not an easy task since each company has very specific configurations issues we had to work with.
One thing that was missing was proper routing capabilities. Since you want some requests to go local and some remote we added this before each request.

https.createServer(credentials, function (req, res) {
 Object.keys(options.routingTable).some(function (key) {
   const regX = new RegExp(key);
   if (regX.test(req.url)) {
     printMe(req.url + ' => ' + (options.routingTable[key].targetName ||;
     proxy.web(req, res, options.routingTable[key]);
     curTarget = options.routingTable[key];
     return true;

We passed as an option the routing table with a regex for each path, making it easy to configure which requests to proxy out, and which in.

routingTable = {
  'site': local,
  '^/static': local,
  '/*/': remote

Another hurdle was working with HTTPS, since our remote environments work on HTTPS.
In order to adhere to this we needed to create SSL certificate for our proxy and the requestCert parameter in our proxy server to false, so that the it doesn’t get validated.

The end configuration should look something like this.

const local = {
   targetName: 'local',
   target: 'https://localhost:4141,
   changeOrigin: true, 
   secure: false
 remote = {
   targetName: 'remote',
   requestCert: false,
   rejectUnauthorized: false,
   target: ',
   secure: false,
   changeOrigin: true,
   autoRewrite: true

 routingTable = {
   'site': local,
   '^/static': local,
   '/*/': remote

const options = {
 routingTable: routingTable,
 home_port: 2109,
 debug: true,
 startPath: 'amplify/site/'

With this you should be able to run locally and route all needed calls to the test environment when working on localhost:2109.

So to conclude, be lazy, make your work easier, and use the skills you have to automate your workflows as much as possible.

How to take innovation into production


Outbrain’s Hackathon

The Outbrain Hackathon which is held twice a year, is a 24-hours event in which employees and friends are invited to build and present an original product or innovation.

The Hackathon is a mini festival held at all Outbrain’s offices around the globe where the offices are open for 24 hours and meals and beers are served all day long.

The winning team is rewarded with a worthy present and the opportunity to turn the idea into a working feature/product.

Few weeks before the event, people start to raise ideas and to team up.

In the beginning of the event, a representative from each team has 5 minutes to present his/her idea in front of the whole company.

In the end of the event, each representative presents a demo of the software his/her team developed.

Immediately afterwards, a vote is conducted and the winners are declared.

In this post I will share my experience from the Hackathon which my team and I won.


About Our hack: Let’s do some innovation

One of our legacy services is called the “Editorial Reviewer”.

It is a user interface for approving/rejecting newly created promoted content, based on Outbrain’s content guidelines (See:

This was an old service from Outbrain’s early days. It was slow and was using old technologies and frameworks. We decided to rewrite it with a fresh breeze of technologies and make it function and look awesome.


Let’s get to work…

Prior to the Hackathon we did some research about what we could remove or improve about the current service and if there are new features or demands from the users we could add during the makeover.

We then divided the work among the team members and chose the technologies and frameworks based on our needs and desires.

One of our main goal was to improve the performance of the old tool.

Switching from multi page web application architecture to a single page web application made a real change but wasn’t enough.

The real challenge was to speed up the database access calls that the service makes.

We analyze the current queries, and found out they can be dramatically improved.

After a few hours, we already had a working demo. It looked a bit childish but it was already performing much better!!

We decided to get some advice from our masters and took it up with our UX designer who came up with a really cool sketch which we were excited to implement.

After a long night of hacking and tons of coffee we finally had an impressive working demo that we could present to the company.

The feedbacks we got were awesome! The teams from Outbrain that were using the old tool were super excited and couldn’t wait to start using our new hack.

As part of the prize, we got the time to develop it into a full-blown product.


2 months later, we got the chance to invest more time in our idea.

We added more tests, monitors and dashboards, did some fine-tuning and at the end, came up with a really cool and sexy single page application that was much faster, comfortable and reliable than the old tool.



The atmosphere was great. All participants worked around the clock and did their best to kick ass!

The challenge of working on a project that we chose and the fact we were striving to make it happen regardless of the tight time frame was amazing and so was the final outcome.

Is Cassandra really visible? Meet Cassibility…

You love Cassandra, but do you really know what’s going on inside your clusters?

Cassandra-CassibilityThis blog post describes how we managed to shed some light on our Cassandra clusters, add visibility and share it with the open source community.

Outbrain has been using Cassandra for several years now. As all companies, we started small, but during the time our usage grew, complexity increased, and we ended up having over 15 Cassandra clusters with hundreds of nodes, spread across multiple data centers.

We are working in the popular micro services model, where each group has its own Cassandra cluster to ensure resources and problems are isolated. But one thing remains common – we need to have good visibility in order to understand what is going on.

We already use Prometheus for metrics collection, Grafana for visualising its data and Pagerduty as our alerting system. However, we still needed to have detailed enough visibility on the Cassandra internals to ensure we could react to any issues encountered before they became a problem and make appropriate and informed performance tunings. I have to admit that when you don’t encounter a lot of problems, you tend to believe that what you have is actually sufficient, but when you suddenly have some nasty production issues, and we had our fair share, it becomes very challenging to debug it efficiently, in realtime, sometimes in the middle of the night.


Let’s say, as a simple example, that you realized that the application is behaving slower because the latency in Cassandra increased. You would like to understand what happened, and you start thinking that it can be due to a variety of causes – maybe it’s a system issue, like a hardware problem, or a long GC. Maybe it’s an applicative issue, like an increase in the number of requests due to a new feature or an application bug, and if so you would like to point the developer to a specific scenario which caused it. If so it would be good if you could tell him that this is happening in a specific keyspace or column family. In this case, if you’re also using row cache for example, you would wonder if maybe the application is not using the cache well enough, for example the new feature is using a new table which is not in the cache, so the hit rate will be low. And Maybe it’s not related to any of the above and it is actually happening due to a repair or read repair process, or massive amount of compactions that accumulated.  It would be great if you could see all of this in just a few dashboards, where all you had to do in order to dig into these speculation of your could be done in just a few clicks, right? Well, that’s what Cassibility gives you.

Take a look at the following screenshots, and see how you can see an overview of the situation and pinpoint the latency issue to number of requests or connections change, then quickly move to a system dashboard to isolate the loaded resource:

* Please note, the dashboards don’t correspond to the specific problem described, this is just an example of the different graphs



Then if you’d like to see if it’s related to specific column families, to cache or to repairs, there are dedicated dashboards for this as well




Here is the story of how we created Cassibility.

We decided to invest time in creating better and deeper visibility, and had some interesting iterations in this project.

At first, we tried to look for an open-source package that we could use, but as surprising as it may be, even with the wide usage of Cassandra around the world, we couldn’t find one that was sufficient and detailed enough for our requirements. So we started thinking how to do it ourselves.


Iteration I

We began to dig into what Cassandra can show us. We found out that Cassandra itself exposes quite a lot of metrics, could reach dozens of thousands of metrics per node, and they can be captured easily via JMX. Since we were already using the Prometheus JMX exporter ( in our Prometheus implementation, it seemed like the right choice to use it and easy enough to accomplish.

Initially we thought we should just write a script that exposes all of our metrics and automatically create JSON files that represent each metric in a graph. We exposed additional dimensions for each metric in order to specify the name of the cluster, the data center and some other information that we could use to monitor our Cassandra nodes more accurately. We also thought of automatically adding Grafana templates to all the graphs, from which one could choose and filter which cluster he wants to see, which datacenter, which Keyspace or Column Family / Table and even how to see the result (as a sum, average, etc.).

This sounded very good in theory, but after thinking about it a bit more, such abstraction was very hard to create. For instance there are some metrics that are counters, (e.g number of requests) and some that are gauge (e.g latency percentile). This means that with counters you may want to calculate the rate on top of the metric itself, like when you would want to take the number of requests and use it to calculate  a throughput. With a gauge you don’t need to do anything on top of the metric.    

Another example is how you would like to see the results when looking at the whole cluster. There are some metrics, which we would like to see in the node resolution and some in the datacenter or cluster resolution. If we take the throughput, it will be interesting to see what is the overall load on the cluster, so you can sum up the throughput of all nodes to see that. The same calculation is interesting at the keyspace or column family level. But if you look at latency, and you look at a specific percentile, then summing, averaging or finding maximum across all nodes actually has no meaning. Think about what it means if you take the number that represents the request latency which 99% of the requests on a specific node are lower than, and then do the maximum over all nodes in the cluster. You don’t really get the 99’th percentile of latency over the whole cluster, you get a lot of points, each representing the value of the node with the highest 99’th percentile latency in every moment. There is not much you can do with this information.

There are lot of different examples of this problem with other metrics but I will skip them as they require a more in depth explanation.


The next issue was how to arrange the dashboards. This is also something that is hard to do automatically. We thought to just take the structure of the Mbeans, and arrange dashboards accordingly, but this is also not so good. The best example is, of course, that anyone would like to see an overview dashboard that contains different pieces from different Mbeans, or a view of the load on your system resources, but there are many other examples.


Iteration II

We realized that we need to better understand every metric in order to create a clear dashboard suite, that will be structured in a way that is intuitive to use while debugging problems.

When reviewing the various sources of documentation on the many metrics, we found that although there was some documentation out there, it was often basic and incomplete – typically lacking important detail such as the units in which the metric is calculated, or is written in a way that doesn’t explain much on top of the metric name itself. For example, there is an Mbean called ClientRequest, which includes different metrics on the external requests sent to Cassandra’s coordinator nodes. It contains metrics about latency and throughput. On the Cassandra Wiki page the description is as follows:

      Latency : Latency statistics.

TotalLatency : Total latency in micro seconds

That doesn’t say much. Which statistics exactly? What does the total mean in comparison to  just latency? The throughput, by the way, is actually an attribute called counter within the latency MBean of a certain scope (Read, Write, etc.), but there are no details about this in the documentation and it’s not that intuitive to understand. I’m not saying you can’t get to it with some digging and common sense, but it certainly takes time when you’re starting.

Since we couldn’t find one place with good and full set of documentation we started digging ourselves, comparing values to see if they made sense and used a consultant named Johnny Miller from who has worked a lot with Cassandra and was very familiar with its internals and  metrics.

We improved our overall understanding at the same time as building and structuring the dashboards and graphs.

Before we actually started, we figured out two things:

  1. What we are doing must be something that many other companies working with Cassandra need, so our project just might as well be an open-source one, and help others too.
  2. There were a lot of different sections inside Cassandra, from overview to cache, entropy, keyspace/column family granularities and more, each of which we may want to look at separately in case we get some clues that something may be going on there. So each such section could actually be represented as a dashboard, and could be worked on in parallel.

We dedicated the first day to focus on classifying the features into logical groupings with a common theme and deciding what information was required in each one.

Once we had defined that, we then started to think about the best and fastest way to implement the project and decided to have a 1 day Hackathon in Outbrain. Many people from our Operations and R&D teams joined this effort, and since we could parallelize the work to at most 10 people, that’s the number of people who participated in the end.

This day focused both on creating the dashboards as well as finding solutions to all places where we used Outbrain specific tools to gather information (for example, we use Consul for service discovery and are able to pull information from it). We ended the day with having produced 10 dashboards,with some documentation, and we were extremely happy with the result.


Iteration III


To be sure that we are actually releasing something that is usable, intuitive to use and clear, we wanted to review the dashboards, documentation and installation process. During this process, like most of us engineers know, we found out that the remaining 20% will take quite a bit to complete.

Since in the Hackathon people with different knowledge of Cassandra participated, some of the graphs were not completely accurate. Additional work was therefore needed to work out exactly which graphs should go together and what level of detail is actually helpful to look at while debugging, how the graphs will look when there are a lot of nodes, column families/tables and check various other edge cases. We spent several hours a week over the next few weeks on different days to finalize it.

We are already using Cassibility in our own production environment, and it has already helped us to expose anomalies, debug problems quickly and optimize performance.

I think that there is a big difference between having some visibility and some graphs and having a full, well organized and understandable list of dashboards that gives you the right views at the right granularity, with clear documentation. The latter is what will really save you time and effort and even help people that are relatively new to Cassandra to understand it better.

I invite you to take a look, download and easily start using Cassibility: . We will be happy to hear your feedback!

Building the Culture to build Systems

Part 1 – The Badges

If there’s one thing we could name that differentiates a good team from a great one, it’s culture. It’s not something you can buy, it’s something you need to build, grow and nurture over time.

So how do you build the right culture?

It is very much like asking “how do you keep in shape?” First you need to set the goals you want to achieve and then you need to start exercising, but remember it is an effort that never ends – once you stop investing in it and not using it, you will lose it.

First – setting the goals, i.e. defining the values

You need to set goals, so you will know what to focus on, and as we are dealing with culture – the goals are actually the values you want to adopt and enforce.

For us some of the values and behaviours we want to emphasize are: collaboration, learning, fun, getting out of your comfort zone, initiative, excellence.

Once you have that defined you can start working out! In this post, and others to come, we will share some of the exercises we use to make sure we stay in shape.

Exercise 1 – The Badges

We created several sets of badges to enforce selected behaviours and celebrate occasions. Those badges are been handed out during weekly group sync meetings, and can also be added to mail signatures if one desires so. They also makes a nice collection, and naturally there are common and rare variants. We’ve found that this drives conversation and directly affects culture.

A few examples of the badges sets:

The Tech Collection:

We wanted to encourage using different tech tools, like Vagrant, improve Chef quality, killing tech debt etc. – so we made sure we have appropriate badges:



The Production Collection:

At the end of the day we are all dealing with production – and we want to celebrate that.

We created several badges for all kind of production related happenings, whether you have broken production, or spend a sleepless night – you will get mentioned. Seeing a senior engineer being granted a “broken prod” badge, drives our “blameless” culture, but still carries the weight of responsibility. No one wants a whole set of black badges!

sheep breakprod


The Celebration Collection:

We like to celebrate – whether it is the first on-call shift, the fact that one presented in a conference, arranged a meetup, or just helped a colleague out. By celebrating behaviors we value, we drive people to adopt them. Here are a few examples:

solo present ThanksBro


The “Jewish Mother” Collection:

Adding a good laugh is always nice, and at the end of the day we all have our quirky behaviours. By celebrating them we keep smiling… and strengthen tolerance.

holdon Alon eat


It is a living exercise and we keep on adding badges as we move along. In fact, we have our own engineers suggest and even create them. Our experience so far shows that gamifying culture-building activities has a positive effect on team atmosphere, and direct effect on the specific behaviors we address.
Stay tuned for additional exercises in How to Keep Your Culture in Shape.

Reducing risks by taking risks

You have a hypothesis you wish to prove or refute regarding your behavior in an extreme scenario.  For example:  after restarting the DB, all services are fully functioning within X seconds. There are at least 3 approaches:

  1. Wait && See — if you wait long enough this scenario will probably happen by itself in production.
  2. Deep Learning  — put the DBA, OPS and service owner in a big room, and ask them to analyze what will happen to the services after restarting the DB. They will need to go over network configuration, analyze the connection pool and DB settings, and so on.
  3. Just Do It — deliberately create this scenario within a controlled environment:

Timing — Bases on your knowledge pick the right quarter, month, week, day and hour – so the impact on the business will be minimal.

Monitoring — make sure you have all monitoring and alerts in place. If needed, do “manual monitoring”.

Scale — If applicable do it only on portion of your production: one data center, part of the service, your own laptop, etc.

The Wait && See approach will give you the right answer, but at the wrong time. Knowing how your system behaves only after a catastrophe occurs is missing the point and could be expensive, recovery-wise.

The Deep Learning approach seems to be the safer one, but it requires a lot of effort and resources. The answer will not be accurate as it is not an easy task to analyze and predict the behaviour of a complex system. So, in practice, you haven’t resolved your initial hypothesis.

The Just Do It approach is super effective – you will get an accurate answer in a very short time, and you’re able to manage the cost of recovery. That why we picked this strategy in order to resolve the hypothesis. A few tips if you’re going down this path:

  • Internal and External Communication — because you set the date, you can send a heads up to all stakeholder and customers. We use for that.
  • Collaboration — set a HipChat / Slack / Hangouts room so all relevant people will be on the same page. Post when the experiment starts, finishes, and ask service owners to update their status.
  • Document the process and publish the results within your knowledge management system (wiki / whatever).
  • This approach is applicable for organisation that practice proactive risk management, have a healthy culture of learning from mistakes, and maintain a blameless atmosphere.


People that pick Deep Learning are often doing so because it is considered a “safer”, more  conservative approach, but that is only an illusion. There is high probability that in a real life DB restart event the system would behave differently from your analysis. This effect of surprise and its outcomes are exactly what we wanted to solve by predicting the system’s behaviour.


The Just Do It approach actually reduces this risk and enables you to keep it manageable. It’s a good example of when the bold approach is safer than the conservative one. So don’t hesitate to take risks in order to reduce risk, and Just Do It

Photo credit:

ScyllaDB POC – live blogging #2

8508669031_4851697b0f_mHi again.

As we are now few weeks into this POC and we gave you first glimpse into it in the first post. We would like to continue and update on what are the latest developments.

However before we come to the update. There are 2 things that we needed tell you about the setup that did not get into the first post just for length reasons.




The first thing to tell is the data structure we use as Doron explains it:

The Data

Our data is held using a single key as partition key, and additional 2 columns as clustering keys, in a manner of key/value structure:

  • A – partition key (number).
  • B – Clustering key 1 (text). This is the value type (i.e. “name”, “age” etc…).
  • C – Clustering key 2 (text).
  • D – The data (text). This is the value itself (i.e. “David”, “19”, etc…).

When storing the data we store either all data for a partition key, or partial data.

When reading we always read all data for the partition, meaning:

select * from cf_name where A=XXX;

When we need multiple partitions we just fire the above query in parallel and wait for the slowest result. We have come to understand that this approach is the fastest one.

As this meant to be the fastest, we need to understand that such reads latency is measured always by the slowest key to read. If you fire tens of reads into the cluster and If, by the chance, you bump into a slow response of one of the nodes (GC can be a good example for such case) your all read attempt is delayed.

This where we thought Scylla can help us improve the latency of the system.


The second thing we wanted to tell you about is what made this POC so easily done in Outbrain and this is our Dual DAO mechanism that Doron explains below.

Dual DAO implementation for Data Store migration/test

We have come to see that in many cases we need a dual DAO implementation to replace the regular DAO implementation.

The main use-case is data migration from one cluster to the other:

  1. Start dual writes to both clusters.
  2. Migrate the data – by streaming it into the cluster (for example).
  3. Start dual read – read from the new cluster and fallback to the old.
  4. Remove the old cluster once there are no fallbacks (if the migration was smooth, there should not be any).

The dual DAO we have written holds both instances of the old and new DAO, and also a phase – a number between 1 and 5. In addition, the dual DAO implements the same interface like the regular DAO, so it is pretty easy to inject it in and out when done.

The phases:

  1. Write to old cluster and read from old cluster only.
  2. Write to both cluster and read from old cluster only.
  3. Write to both cluster and read from new cluster, fallback to old cluster if missing.
  4. Write to new cluster and read from new cluster, fallback to old cluster if missing.
  5. Write to new cluster and read from new cluster only.

The idea here is to support gradual and smooth transition between the clusters. We have come to understand that in some cases transition done only by the dual DAO can be very slow (we can move to 5 only after no fallback is done), so we had 2 ways to accelerate it:

  1. Read-repair – when in phase 3 or 4, and we fallback to the old cluster – write the data we fetch to the new cluster. This approach usually fits to use cases of heavy reads.
  2. Stream the data in manually by Data guys – this has proven to be the more effective and quick way – and this is what we have done in this case.

Dual DAO for data store test:

The ScyllaDB POC introduces a slightly different use-case. We do not want to migrate to Scylla (at least not right now) – but add it to our system in order to test it.

In order to match those demands, we have added a new flavor to our dual DAO:

  1. The write to the new DAO is done in the background – logging its results in case of failures – but does not fail the write process. When working in dual DAO for migration we want to fail if either the new or old DAO fail. When working in test-mode we do not. At first we did not implement it like this – and we had a production issue when ScyllaDB upgrade (to a pre GA version, from 0.17 to 0.19)  caused the save queries to fail (on 9/3/16). Due to those issues we have changed the dual write to not fail upon the new DAO failure.
  2. The read from the new DAO is never used. On phases 3-5 we do read from the new DAO, and place a handler in the end to time the fetch and compare the results to the old DAO. However, the actual response gets back from the old DAO, when it is done. This is not bullet-proof but reduces the chances of production issues due to issues in the test cluster.


Mid April Update:

Last time we have reported was when ScyllaBD was loaded with partial and ongoing data while the C* cluster was loaded with the full historic data. Scylla was performing much better as we showed.

The next step we had to do was to load the full data into ScyllaDB cluster. This was done by Scylla streaming process that read the data from the C* cluster and wrote it to the ScyllaDB cluster. This process uncovered a bug with it that failed to extract the data from the newest version of C* that we’ve used.

Shlomi’s explanation is as follows:

Sstables upgraded from cassandra 2.0 to cassandra 2.1 can contain range tombstone with a different format from tombstones written in cassandra 2.1 – we had to add support for the additional format

The Scylla team figured it out and fixed it pretty quickly.More info about this issue can be found in the project GitHub.

After all the data was filled correctly into the ScyllaDB cluster, We’ve seen degrade in the ScyllaDB cluster which made it perform slightly worse than the C* (Surprise!!!!). A faulty configuration we’ve found in the C* “Speculative retries” mechanism and fixed it, actually made the C* latency about 50% better than the ScyllaDB. We need to remember we are measuring latency at its 99th percentile. In our case – as Doron mentioned above it’s while using multiple reads which is even harder.

ScyllaBD guys were as surprised as we are, They took a look into the system and found a bug with their cross DC read-repair, they fixed the issue and installed a new version.

Here is the description of the issue as explained by Tzach from Scylla: (reference to issue –

After investigation, we found out the Scylla latency issue was caused by a subtle bug in the read-repair, causing unnecessary, synchronize, cross DC query.

When creating a table with read-repair chance (10% in this case) in Scylla or Cassandra, 10% of the reads send background queries from coordinator to ALL replications, including cross DC. Normally, the coordinator waits for responses only from a subset of the replicas, based on Consistency Level, Local Quorum in our case, before responding to the client.

However, when the first replica responses, does not match, coordinator will wait for ALL responses before sending response to the client. This is where Scylla bug was hidden.

Turnout, response digest was wrongly calculated on some cases, which cause the coordinator to believe local data is not in sync, and wait for remote DC nodes to response.

This issue is now fix and backport to Scylla 1.0.1


So, we upgraded to Scylla 1.0.1 and things look better. We can now say that C* and Scylla are in same range of latency but still C* is better. Or as Doron phrased it in this project google group:

Performance definitely improved starting 13/4 19:00 (IL time). It is clear.

I would say C* and Scylla are pretty much the same now… Scylla spikes a bit lower…


That did not satisfy the Scylla guys they went back to the lab and came back with the following resolution as described by Shlomi:

In order to test the root cause for the latency spikes we decided to try poll-mode:

Scylla supports two modes of operations – the default, event triggered mode and another extreme poll-mode. In the later, scylla constantly polls the event loop without no idle time at all. The cpu waits for work, always consuming 100% per core. Scylla was initially designed with this mode only and later switched to a more sane mode of operation with calls epoll_wait. The later requires complex synchronization before blocking on the OS syscall to prevent races between the cores going in/out sleep.

We recently discovered that the default mode suffers from increased latency and thus fixed it upstream, this fix did not propagate to 1.0.x branch yet and thus we recommend to use poll-mode here.

The journey didn’t stop here since another test execution in our lab revealed that hardware Hyper Threads (HT) may increase latencies during poll mode.

Yesterday, following our tests with and without HT, I have updated scylla configuration to use poll-mode and running only on 12 physical cores  (allowing the kernel to schedule the other processes on the HT).

Near future investigation will look into why poll-mode with HT is an issue (Avi suspects it’s L1 cache eviction). Once the default, event-based code will be back ported to the release, we’ll switch to it as well.


This is how the graphs look today:

Screen Shot 2016-04-21 at 12.30.26 PM


As you can see in the graphs above: (Upper graph is C*/ Lower graph is Scylla)

  • Data was loaded into Scylla in 4/1 (lower graph).
  • On 4/4 Doron did the fix in C* configuration and its latency improved dramatically.
  • On 4/13 Scylla 1.0.1 was installed and did an improvement.
  • On 4/20 Shlomi did the HyperThreading poll-mode config change.
  • The current status is that Scylla base is nicely on the 50ms spiking to above 90ms whereas C* is at 60-65 spiking to 120ms in some cases. It worth to remember that those measurements are 99th percentile taken from within Outbrain’s client service which is Java and by itself suffers from unstable performance.

There is a lot to learn from such POC. The guys from Scylla are super responsive and don’t hesitate to take learnings from every unexpected results.

Our next steps are:

  1. We still see some level of results inconsistencies between the 2 systems that we want to verify where they come from and fix.
  2. Move to throughput test by decreasing the number of machines in the Scylla cluster.


That’s all for now. More to come.   

Next update is here.

We are testing ScyllaDB – live blogging #1

The background

Screen Shot 2016-03-15 at 12.42.47 AMIn the last month we have started, in Outbrain, to test ScyllaDB. I will tell you in a minute what ScyllaDb is and how we came to test it but I think what is most important is that ScyllaDB is a new database at its early stages and still before its first GA (coming soon). It is not an easy decision to be among the firsts to try such a young project that not many have used before (up until now there are about 2 other production installations) but as they say, someone have to be the first one… Both ScyllaDB and Outbrain are very happy to openly share how the test goes, what are the hurdles what works and what not.


How it all began:

I know the guys from Scylla for quite some time, we have met through the first iteration of the company (Cloudius-systems) and we’ve met at the early stages of ScyllaDB too. Dor and Avi, the founders of ScyllaDB, wanted to consult if as heavy users of Cassandra, we will be happy for the solution they are going to write. I said, “Yes,  definitely”  and I remember saying, “If you will give me Cassandra functionality and operability at the speed and throughput of Redis, You got me.”

Time went by and about 6 months ago they came back and said they are ready to start integrations with live production environments.


This is the time to tell you what ScyllaDB is.

The easiest description is “Cassandra on steroids”. That’s right but in order to do that, the guys in Scylla basically had to write all Cassandra server from scratch, meaning:

  • Keep all Cassandra interface perfectly the same so client applications will not have to change.
  • Write it all over in C++, and by that overcome the issues that JVM brings with it, mostly no GC that was hurting the high percentiles of latency.
  • Write it all in Asynchronous programming model that enable the server to run in very high throughput.
  • Shard per core approach – on top of the cluster sharding, Scylla uses shard-per-core which allows it to run lockless and scale up with the number of cores
  • Scylla uses its own cache and does not rely on the operating system cache. It saves data copy and does not slow down due to page faults

I must say that was intriguing my mind as if you are looking at OpenSource NoSQL data systems that picked up, there is one camp of  C++, High performance but, yet,  simple functionality (memcached or redis) and the heavy functionality but JVM based camp (Spark, Hadoop, Cassandra). However if you can combine the good of both worlds – it sounds great.


Where does that meet Outbrain?

Outbrain is a heavy user of Cassandra. We have few hundreds of Cassandra machines running in 20 clusters over 3 datacenters. They store 1-2 terabytes of data each. Some of the clusters are being hit on user’s query time and unexpected latency is an issue. As data, traffic and complexity grew up with outbrain it became more and more complex to maintain the cassandra clusters and keep them up to reasonable performance. It always required more and more hardware to support the growth as well as the performance.

The promise of getting stable latency, 5-10x more throughput (much less machines)without the cost of re-writing our code made a lot of sense and we decide to give it a shot.

One thing was not yet in the product that we needed deeply was Cross DC clusters. The Cassandra feature of eventual consistency across different clusters in different Data Center is key to how Outbrain operates and it was very important for us. It took the guys from ScyllaDB a couple of months to finish that feature, test and verify all works and we were ready to go.

ScyllaDB team is located in Herzliya which is very close to our office in Netanya and they were very happy to come and start the test.


The team working on this test is:

Doron Friedland – Backend engineer at Outbrain’s App Services team.

Evgeny Rachlenko – from Outbrain’s Data Operations team.

Tzach Liyatan – ScyllaDB Product manager.

Shlomi Livne – ScyllaDB VP of R&D.


The first step was to allocate the right cluster and functionality we want to run the test on. After a short consideration we chose to run this comparison test on the cluster that holds all our Documents store. It holds all information about all active documents in Outbrain’s system. We are talking about few millions of documents where each one of them have hundreds of different features represented as Cassandra columns. This store is being updated all the time and being accessed in every user request (few million requests every minute). Cassandra started struggling with this load and we started applying many solutions and optimizations in order to keep the load. We also enlarged the cluster so we can keep it up.

One more thing that we did in order to overcome the Cassandra performance issues was to add a level of application cache that consumes few more machines

by itself.

One can say, that’s why you chose a scalable solution like Cassandra so you can grow it as you wish. But when the number of servers start to rise and have significant cost, you want to look at other solutions. This is where ScyllaDB came into play.


The next step was to install a cluster, similar in size to the production cluster.


Evgeny describes below the process of installing the cluster:

Well, the  installation impressed me in the two aspects.

Configuration part was pretty same to Cassandra with few changes in parameters.

Scylla simply ignoring GC, or HEAP_SIZE parameters  and use configuration as extension of cassandra.yaml file.

Our Cassandra’s clusters  running with many components integrated into outbrain ecosystem.  Shlomi with Tzach has defined properly  the most important graphs and alerts. Services such as consul, collectd, prometheus with graphana  also has been integrated as part of POC. Most integration test passed without my intervention except light changes in the Scylla chef’s cookbook.


Tzach is describing what it looked like from their side:

Scylla installation, done by Evgeny, was using a clone of Cassandra Chef recipes, with a few minor changes. Nodetool and cqlsh was used for sanity test of the new cluster.

As part of this process, Scylla metric was directed to OutBrain existing Prometheus/ Grafana monitoring system. Once traffic was directed to the system, the application and ScylladDB metrics was all in one dashboard, for easy comparison.


Doron is describing the application level steps of the test:

    1. Create dual DAO to work with ScyllaDB in parallel to our Cassandra main storage (see elaboration on the dual DAO implementation below).
    2. Start dual writes to both clusters (in production).
    3. Start dual read (in production) to read from ScyllaDB in addition to the Cassandra store (see the test-dual DAO elaboration below).
    4. Not done yet: migrate the entire data from Cassandra to ScyllaDB by streaming the data into the ScyllaDB cluster (similar to migration between Cassandra clusters).
    5. Not done yet: Measure the test-reads from ScyllaDB and compare both the latency and the data itself – to the data taken from Cassandra.
    6. Not done yet: In case the latency from ScyllaDB is better, try to reduce the number of nodes to test the throughput.




Performance metrics:

Here are some very initial measurement results:

You can clearly see below that ScyllaDB is performing much better and in a much more stable performance.

One current disclaimer here is that Scylla still does not have all the historic data and just using data of the last week.

It’s not visible from the graph but ScyllaDB is not loaded and thus spends most of the time idling, the more loaded it will become, latency will reduce (until a limit of course).

We need to wait and see the following weeks measurements. Follow our next posts.


Read latency (99 percentile) of single entry: *First week – data not migrated (2.3-7.3):


Screen Shot 2016-03-15 at 12.35.43 AM


Screen Shot 2016-03-15 at 12.38.29 AM


Read latency (99 percentile) of multi entries (* see comment below): *First week – data not migrated (2.3-7.3):


Screen Shot 2016-03-15 at 12.40.11 AM


Screen Shot 2016-03-15 at 12.41.37 AM


* The read of multiple partitions keys is done by firing single partitions key requests in parallel, and waiting for the slowest one. We have learned that this use-case extremes evert latency issues we have in the high percentiles.

That’s where we are now. The test is moving on and we will update with new findings as we progress.

In the next post Doron will describe the Data model and our Dual DAO which is the way we run such tests. Shlomi and Tzach will describe the Data transfer and upgrade events we had while doing it.

Stay tuned.

Next update is here.


How to substitute dynamic URL parameters: a template engine use case


Outbrain is one of the world’s largest content discovery platforms, serving more than 200B recommendations monthly and reaching over 561 million unique visitors from across the globe.

As Outbrain becomes an important source of traffic and revenues for publishers, partners are looking for more tracking capabilities that will help them determine the amount of traffic generated by Outbrain, and how this traffic is distributed throughout their site. In addition, publishers want to see this information on their analytic tool of choice (Omniture, GA, Webtrends etc.).

Outbrain now provides them with tracking parameters that can be appended to the URL of recommendations.


A url is constructed of three parts: prefix, actual url and suffix, where the prefix and suffix serve as means to let our partners add tracking capabilities in the form of dynamic parameters (for instance: the url’s title, publish date and id). Eventually, the dynamic parameters are added as a query string (field1=value1&field2=value2&field3=value3) to the recommendation url.

The legacy code would contract the three url parts and to use String replace in order to add the dynamic parameters. This implementation was hard to maintain (in situations where we wanted to support more dynamic parameters) as well as difficult to use since there was a need to import a lot of code to a project, and to depend on many modules.

When revisiting the problem, we understood that it would be appropriate to use a Separation of Concerns approach, separating the template from the model, and from the transformation logic itself – a template engine sounded like the right choice! In addition, since more and more dynamic parameters were being added, we used a builder pattern in order to achieve an easier and cleaner usage for clients, and easier maintenance in the future.

The problem – using a template engine doesn’t guarantee better performance

We decided to use StringTemplate as our template engine since we were familiar with it and had some good experience using it. The result of the refactor was a cleaner, shorter and maintainable API.

Unfortunately, when we deployed the new changes to production, we noticed a significant increase in serving time that was unacceptable in terms of user experience. After investigating the root cause, we found out that the usage of StringTemplate was pretty expensive. Even though the templates could be reused, we couldn’t reuse them all. For instance, we created a new template for each request, since it was constructed with different dynamic parameters. (The form of the url though, was the same: prefix, actual url and suffix).

So at that moment we had a clean and elegant solution that wasn’t performing well. We then looked for some alternative solutions for StringTemplate that could save us the expensive cost of constructing a new template for each request.

The solution – right tool for the job

Eventually, we found a light-weight template engine, that allowed us to keep using the Separation of Concerns approach and still achieve good performance. We ended up using Apache Commons Lang 3.0 StrSubstitutor – a simpler alternative to StringTemplate. This time, we made sure that it outperformed our last implementation by doing some micro benchmarking, and indeed the results were much better. The new implementation executed more than 4 times operation per second.

MicroBenchmarking Results

We used Java Microbenchmark Harness in order to perform our performance measurements.


Raw data:


For more information see: