Category: Dev Methods

Automating your workflow

During development, there are many occasions where we have to do things that are not directly related to the feature we are working on, or things that are repetitive and recurring.
In the time span of a feature development this can often take as much time to do as the actual development.

Take, for instance, updating your local dev microservices environment before testing your code. This task on its own, which usually includes updating your local repo, building and starting several services, and often debugging and fixing issues caused by others, can take hours, often just to test a simple procedure.

We are developers. We spend every day automating and improving other people’s workflows, yet we often spend hours doing the same time-consuming tasks over and over again.
So why not build the tools we need to automate our own workflows?

In our team we decided to build a few tools to help out with some extra irritating tasks we were constantly complaining about to each other.

The first one was simple: creating a Slush sub-generator. For those of you who don’t know, Slush is a scaffolding tool, like Yeoman but for gulp. We used this to create our Angular components.
Each time we needed to make a new component we had to create a new folder, with three files:


  Comp.component.ts
  Comp.jade
  Comp.less

Each file of course has its own internal structure of predisposed code, and each component had to be registered in the app module and the main less file.

This was obviously extremely annoying to redo each time, so we automated it. Now each time you run “ob-genie” from the terminal, you are asked the name of your component and what module to register it with, and the rest happens on its own. We did this for services and directives too.
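The generation step can be pictured roughly as follows. This is a minimal sketch and not the actual ob-genie code: the function name `scaffoldComponent`, the folder layout and the file contents are made up for illustration, and only the three file names come from the list above.

```javascript
// Hypothetical sketch of what a component generator produces.
// It only builds an in-memory description of the files; a real generator
// would write them to disk and register the component in the app module
// and the main less file.
function scaffoldComponent(name) {
  const dir = name + '/';
  const cssClass = name.toLowerCase();
  return {
    [dir + name + '.component.ts']:
      '// ' + name + ' component\nexport class ' + name + 'Component {}\n',
    [dir + name + '.jade']: 'div.' + cssClass + '\n',
    [dir + name + '.less']: '.' + cssClass + ' {\n}\n',
  };
}

// List the files that would be created for a component called "Comp".
console.log(Object.keys(scaffoldComponent('Comp')));
```

Keeping the templates in one function like this makes it cheap to add the equivalent generators for services and directives.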

Other than saving a lot of time and frustration, this had an interesting side effect – people on the team were creating more components than before! This was good because it resulted in better separation of code and better readability. It seems that many times developers were simply too lazy to create a new component and just chucked everything in together. By the way, Angular-CLI has added a similar capability; I guess great minds think alike.

Another case we took on in our team was to rid ourselves of the painstaking task of setting up the local environment. This I must say was a real pain point. Updating the repo, building and running the services we needed each time could take hours, assuming everything went well.
There have been times where I spent days on this just to test the simplest of procedures.
Often I admit, I simply pushed my code to a test environment and debugged it there.
So we decided to build a proxy server to channel all local requests to the test environment.

For this we used node-proxy, a proxy that is very easy to configure. However, this was still not an easy task, since each company has very specific configuration issues that have to be worked around.
One thing that was missing was proper routing capabilities. Since you want some requests to go local and some remote we added this before each request.

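// Assumes `https` (Node's built-in module), `credentials`, `proxy`, `options` and `printMe` are defined elsewhere.
https.createServer(credentials, function (req, res) {
  // Test the URL against each pattern in table order; `some` stops at the first match.
  Object.keys(options.routingTable).some(function (key) {
    const regX = new RegExp(key);
    if (regX.test(req.url)) {
      const route = options.routingTable[key];
      // Log which target this request is routed to.
      printMe(req.url + ' => ' + (route.targetName || route.target));
      proxy.web(req, res, route);
      curTarget = route; // remember the last target used (declared elsewhere)
      return true;
    }
    return false;
  });
}).listen(options.home_port);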

We passed as an option the routing table with a regex for each path, making it easy to configure which requests to proxy out, and which in.

routingTable = {
  'site': local,
  '^/static': local,
  '/*/': remote
};
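The lookup itself can be isolated into a small pure function, which makes the "first matching pattern wins" rule easy to reason about. This is only a sketch: `resolveRoute` is a hypothetical helper name, the targets are reduced to their names, and the catch-all is spelled `'.*'` here. (Read as a regular expression, the `'/*/'` key above matches any URL that contains a slash, so it plays the same catch-all role.)

```javascript
// Hypothetical targets; only the names matter for this sketch.
const local = { targetName: 'local' };
const remote = { targetName: 'remote' };

// Keys are regular-expression sources. Insertion order defines priority,
// so the catch-all must come last.
const routingTable = {
  'site': local,
  '^/static': local,
  '.*': remote,
};

// Return the target of the first pattern that matches the URL, or null.
function resolveRoute(url, table) {
  for (const [pattern, target] of Object.entries(table)) {
    if (new RegExp(pattern).test(url)) {
      return target;
    }
  }
  return null;
}

console.log(resolveRoute('/static/app.js', routingTable).targetName);
console.log(resolveRoute('/api/users', routingTable).targetName);
```

Because the order of the keys decides which rule wins, the more specific patterns should always be listed before the general ones.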

Another hurdle was working with HTTPS, since our remote environments work on HTTPS.
To support this we needed to create an SSL certificate for our proxy and set the requestCert parameter in our proxy server to false, so that the certificate doesn’t get validated.

The end configuration should look something like this.

const local = {
    targetName: 'local',
    target: 'https://localhost:4141',
    changeOrigin: true,
    secure: false
  },
  remote = {
    targetName: 'remote',
    requestCert: false,
    rejectUnauthorized: false,
    target: 'https://test.outbrain.com:8181',
    secure: false,
    changeOrigin: true,
    autoRewrite: true
  },
  routingTable = {
    'site': local,
    '^/static': local,
    '/*/': remote
  };

const options = {
  routingTable: routingTable,
  home_port: 2109,
  debug: true,
  startPath: 'amplify/site/'
};

With this you should be able to run locally and route all needed calls to the test environment when working on localhost:2109.

So to conclude, be lazy, make your work easier, and use the skills you have to automate your workflows as much as possible.

Kibana for Funnel Analysis

How we use Kibana (4) for user-acquisition funnel analysis

Outbrain has recently launched a direct-to-consumer (D2C) initiative. Our first product is a chatbot. As with every D2C product, acquiring users is important. Therefore, optimizing the acquisition channel is also important. The basis of our optimization is analysis.

kbfunnel-image01

Our Solution (General Architecture)

Our acquisition funnel spans 2 platforms (2 web pages and a chatbot). Passing many parameters between platforms can be a challenge, so we chose a more stateful, server-based model. The client requests a new session Id, sending basic data such as IP and User agent. The server stores a session (we use Cassandra in this case) with processed fields such as Platform, OS, Country, Referral and User Id. At a later stage the client reports a funnel event for a session Id. The server writes all known fields for the session into 2 storages:

  • ElasticSearch for quick & recent analytics (Using the standard ELK stack)
  • Hadoop for long term storage and offline reports
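The flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not our server code: the in-memory `Map` and arrays stand in for Cassandra, ElasticSearch and Hadoop, the function names `createSession` and `reportEvent` are hypothetical, and the user-agent parsing is deliberately naive.

```javascript
// In-memory stand-ins for the real stores.
const sessions = new Map();   // plays the role of the session store
const elasticSink = [];       // plays the role of ElasticSearch
const hadoopSink = [];        // plays the role of Hadoop
let nextId = 1;

// Step 1: the client asks for a session; the server stores processed fields.
function createSession(ip, userAgent) {
  const sessionId = 's' + nextId++;
  sessions.set(sessionId, {
    sessionId,
    ip,
    // Naive parsing, for illustration only.
    platform: /Mobile/.test(userAgent) ? 'mobile' : 'desktop',
  });
  return sessionId;
}

// Step 2: the client reports an event, passing nothing but the session Id.
// The server enriches the event with everything it knows about the session.
function reportEvent(sessionId, eventType) {
  const session = sessions.get(sessionId);
  if (!session) {
    throw new Error('unknown session: ' + sessionId);
  }
  const event = { ...session, eventType };
  elasticSink.push(event);
  hadoopSink.push(event);
  return event;
}
```

The point of the design is that the client only ever carries the session Id between funnel steps, while every stored event still contains the full set of fields.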

A few example fields stored per event

  • User Id – A unique & anonymous identifier for a user
  • Session Id – The session Id is the only parameter passed between funnel steps
  • Event Type – The specific step in the funnel – serve, view, click
  • User Agent – Broken down to Platform and OS
  • Location – based on IP
  • Referral fields – Information on the context in which the funnel is exercised
  • A/B Tests variants – The A/B Test variant Ids that are included in the session

Goal of the Analysis: Display most important metrics quickly

Kibana plugin #1: Displaying percent metric

Kibana has several ways of displaying a fraction, but none excels at displaying small numbers. (A pie chart can be used to visualize fractions, but not small ones.) We developed a Kibana plugin for displaying a single metric, in percent format.

kbfunnel-image00

We use this visualization for displaying the conversion rate of the most interesting part of our funnel.
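The arithmetic behind such a metric is simple; the care goes into formatting small values so they do not collapse to "0%". The helper below is a hedged sketch of that idea only. `formatPercent` and its precision thresholds are assumptions made for this example and are not taken from the plugin.

```javascript
// Format a fraction as a percentage, using more decimals for small values
// so that a 0.3% conversion rate is not displayed as "0%".
function formatPercent(numerator, denominator) {
  if (denominator === 0) {
    return 'n/a';
  }
  const percent = (numerator / denominator) * 100;
  const decimals = percent >= 10 ? 0 : percent >= 1 ? 1 : 2;
  return percent.toFixed(decimals) + '%';
}

console.log(formatPercent(3, 1000));   // small conversion rate
console.log(formatPercent(250, 1000)); // large conversion rate
```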

Kibana plugin #2: Displaying the funnel

We couldn’t find a good way to display a funnel, so we developed a visualization plugin (honestly, we were eager to develop this, so we did not scan the entire internet).

Based on the great D3 Funnel by Jake Zatecky, this is a Kibana plugin that displays buckets of events in funnel format. It’s customizable and open-source. Feel free to use it.

kbfunnel-image02
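The data behind a funnel view is just a count of events per step, kept in funnel order. As a rough sketch (the function name `buildFunnel` and the `shareOfTop` field are invented for this example; the step names serve, view and click are the event types listed earlier):

```javascript
// Count events per funnel step, in funnel order, and compute each step's
// share relative to the first (widest) step.
function buildFunnel(events, steps) {
  const counts = new Map(steps.map((step) => [step, 0]));
  for (const event of events) {
    if (counts.has(event.eventType)) {
      counts.set(event.eventType, counts.get(event.eventType) + 1);
    }
  }
  const top = counts.get(steps[0]);
  return steps.map((step) => ({
    step,
    count: counts.get(step),
    shareOfTop: top === 0 ? 0 : counts.get(step) / top,
  }));
}

const sampleEvents = [
  { eventType: 'serve' }, { eventType: 'serve' }, { eventType: 'serve' }, { eventType: 'serve' },
  { eventType: 'view' }, { eventType: 'view' },
  { eventType: 'click' },
];
console.log(buildFunnel(sampleEvents, ['serve', 'view', 'click']));
```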

Putting it all together

Displaying your most important metrics and the full funnel is nice. Comparing variant A with variant B is very nice. We’ve set up our dashboard to show similar key metrics on 2 versions of the funnel. We always try to run at least 1 A/B test, and this dashboard shows us real-time results of our tests.

kbfunnel-image04

Cherry on top

Timelion is awesome. If you’re not using it, I suggest trying it.

Viewing your most important metrics over time is very useful, especially when you’re making changes fast. Here’s an example:

kbfunnel-image03

Summary

We track a user’s activity by sending events to the server. The server writes these events to ES and Hadoop. We developed 2 Kibana plugins to visualize the most important metrics of our user-acquisition funnel. We can filter the funnel by Platform, Country, OS, Time, Referral, or any other fields we bothered to save. In addition, we always filter by A/B Test variants and compare 2 specific variants.

Micro Service Split

image07

In this post I will describe a technical methodology we used to remove a piece of functionality from a Monolith/Service and turn it into a Micro-Service. I will try to explain the reasoning behind some of the decisions we made and the path we took, and describe in more detail the internal tools, libraries and frameworks we use at Outbrain and in our team, to shed some light on the way we work. And as a bonus you might learn from our mistakes!
Let’s start with the description of the original service and what it does.
Outbrain runs the largest content discovery platform. From the web surfer’s perspective it means serving a list of recommended content that might interest her, in the form of ‘You might also like’ links. Some of those impression links are sponsored, i.e., when she clicks on a link, someone is paying for that click, and the revenue is shared between Outbrain and the owner of the page with the link on it. That is how Outbrain makes its revenue.

My team, among other things, is responsible for the routing of the user to the requested page after pressing the link, and for the bookkeeping and accounting that is required in order to calculate the cost of the click, who should be charged, etc.
In our case the service we are trying to split is the ‘Bookkeeper’. Its primary role is to manage the budget of the paid impression links. After a budget is spent, the ‘Bookkeeper’ should notify Outbrain’s backend servers to refrain from showing the impression link again. And this has to be done as fast as possible. If not, people will click on links we cannot charge for because the budget was already spent. Technically, this is done by an update to a database record. However, there are other cases in which we might want to stop the exposure of impression links. One such example is a request from the customer paying for the future click to disable the impression link exposure. For such cases we have an API endpoint that does exactly the same thing with the same code. That endpoint is actually part of the ‘Bookkeeper’ and is enabled by a feature toggle on specific machines. This ‘Activate-Impressionable’ endpoint, as we call it, is what was decided to split out of the ‘Bookkeeper’ into a designated Micro-Service.
In order to execute the split, we chose a step-by-step strategy that allows us to reduce the risk during execution and keep it as controlled and reversible as possible. From a bird’s eye view I would describe it as a three-step process: Plan, Up and Running as fast as possible, and Refactor. The rest of the post describes these steps.

Plan (The who’s and why’s)

In my opinion this is the most important step. You don’t want to split a service just in order to split. Each Micro Service introduces maintenance and management overhead, with its own set of challenges[1]. On the other hand, Microservices architecture is known for its benefits such as code maintainability (for each Micro Service), the ability to scale out and improved resilience[2].
Luckily for me, someone already did that part for me and took the decision that ‘Activate-Impressionable’ should be split from the ‘Bookkeeper’. But still, let’s name some of the key factors of our planning step.
Basically I would say that a good separation is a logical separation with its own non-overlapping RESTful endpoints and an isolated code base. The logical separation should be clear. You should think about what the functionality of the new service is, and how isolated it is. It is possible to analyze the code for inter-dependencies among classes and packages using tools such as lattix. The bottom line is that it is important to have a clear definition of the responsibility of the new Micro Service.
In our case, the ‘Bookkeeper’ was eventually split so that it remained the bigger component, ‘Activate-Impressionable’ was smaller and the common library was smaller than both. The exact lines of code can be seen in the table below.

Screen Shot 2016-02-21 at 11.21.56

Unfortunately I assessed it only after the split and not in the plan step. We might say that there is too much code in common when looking at those numbers. It is something worth considering when deciding what to split. A lot of common code implies low isolation level.

Of course, part of the planning is time estimation. Although I am not a big fan of “guesstimates”, I can tell that the task was planned for a couple of weeks and took about that long.
Now that we have a plan, let’s get to work.

Up and Running – A Step by Step guide

As in every good refactor, we want to do it in small baby steps, and remain ‘green’ all the time[3]. In continuous deployment that means we can and do deploy to production as often as possible to make sure it is ‘business as usual’. In this step we want to get to a point where the new service is working in parallel with the original service. At the end of this step we will point our load-balancers to the new service endpoints. In addition, the code remains shared in this step, which means we can always deploy the original, fully functioning ‘Bookkeeper’. We actually do that if we feel the latest changes carry any risk.
So let’s break it down into the actual phases:

Step: Details
Phase 0: Starting phase.
Phase 1: Create the new empty Micro-Service ‘Activate-Impressionable’. At Outbrain we do it using the scaffolding of the ob1k framework. Ob1k is an open source Micro Services framework that was developed in-house.
Phase 2: Create a new empty library that both the new ‘Activate-Impressionable’ service and the ‘Bookkeeper’ depend on. Ideally, if there is a full logical separation with no shared code between the services, that library will be deleted in the cleanup phase.
Phase 3: Move the relevant source code to the library. Luckily in our case, there was one directory that was clearly what we had to split out. Unluckily, that code also pulled in some more code it depended on, and this had to be done carefully so as to pull neither too much nor too little. The good news is that this phase is pretty safe for statically typed languages such as Java, in which our service is written. The compiler protects us here with compilation errors, so the feedback loop is very short. Tip: don’t forget to move unit tests as well.
Phase 4: Move common resources to the library, such as spring beans defined in xml files and our feature flags, which are defined in yaml files. This is the dangerous part. We don’t have the compiler here to help, so we actually test it in production. And when I say production I mean using staging/canary/any environment with production configuration but without real impact. Luckily again, both yaml and spring beans are configured to fail fast, so if we did something wrong it will just blow up in our faces and the service will refuse to go up. For this step I even ended up developing a one-liner bash script to assist with those wicked yaml files.
Phase 5: Copy and edit web resources (web.xml) to define the service endpoints. In our case web.xml can’t reside in a library, so it has to be copied. Remember we still want the endpoints active in the ‘Bookkeeper’ at that phase. Lesson learned: inspect all files closely. In our case log4j.xml, which seems like an innocent file by its name, contains designated appenders that are consumed by other production services. I didn’t notice that and didn’t move the required appender, and it was found only later in production.
Deploy: Deploy the new service to production. What we did was deploy ‘Activate-Impressionable’ side-by-side on the same machines as the ‘Bookkeeper’, just with a different port and context path. It definitely makes you sleep better at night.
Up-And-Running: Now is a good time to test once again whether both ‘Bookkeeper’ and ‘Activate-Impressionable’ are working as expected. Generally, we are now up and running with only a few more things to do here.
Clients Redirect: Point clients of the service to the new endpoints (port + context path). This step might take some time, depending on the number of clients and the ability to redeploy them. At Outbrain we use HA-Proxy, so reconfiguring it did most of the work, but some clients did require code modifications.
(More) Validation: Move/copy simulator tests and monitors. In our team, we heavily rely on tests we call simulator tests. These are actually black-box tests written in JUnit that run against the service installed on a designated machine. These tests see the service as a black box and call its endpoints, while mocking/emulating other services and data in the database for the test run. So usually a test run can look like: put something in the database, trigger the endpoint, and see the result in the database or in the http response. There is also a question here of whether to test ‘Activate-Impressionable’ or the ‘Bookkeeper’. Ideally you will test them both (tests are duplicated for that phase), and that is what we did.
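The shape of such a simulator test (arrange data, trigger the endpoint, inspect the outcome) can be illustrated with a tiny sketch. Our real tests are written in JUnit against a deployed service; the JavaScript below only mirrors the three steps, and everything in it is hypothetical: the `activateImpressionableEndpoint` function, the record layout and the in-memory `Map` standing in for the database.

```javascript
// Hypothetical stand-in for the service under test: it flips a flag on a
// database record and returns an HTTP-like response.
function activateImpressionableEndpoint(db, request) {
  const record = db.get(request.linkId);
  if (!record) {
    return { status: 404 };
  }
  record.active = request.active;
  return { status: 200 };
}

// The "simulator test": the service is treated as a black box.
function simulatorTest() {
  // 1. Put something in the database.
  const db = new Map();
  db.set('link-1', { active: true });

  // 2. Trigger the endpoint.
  const response = activateImpressionableEndpoint(db, { linkId: 'link-1', active: false });

  // 3. Observe the result in the response and in the database.
  return { status: response.status, activeAfter: db.get('link-1').active };
}

console.log(simulatorTest());
```

Because the test only looks at inputs and observable outputs, the same test can be pointed at either service during the phase in which both host the endpoint.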

 

Refactor, Disconnect & Cleanup

When we get here, the new service is working and we should expect no more behaviour changes from the endpoints’ point of view. But we still want the code to be fully split and the services to be independent of each other. In the previous step we performed the phases in a way that everything remains reversible with a simple feature toggle & deploy.
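To make the "reversible with a feature toggle" idea concrete, here is a minimal sketch of endpoint registration driven by a flag. It is an assumption-laden illustration: the flag name `hostActivateImpressionable`, the function `buildRoutes` and the route paths are all invented, and our actual service configures this through yaml feature flags and web.xml rather than JavaScript.

```javascript
// Hypothetical route registration driven by a feature flag. While the flag
// is on, the original service still hosts the endpoint, so rolling back is
// a matter of flipping the flag and redeploying.
function buildRoutes(flags) {
  const routes = ['/bookkeeper/budget'];
  if (flags.hostActivateImpressionable) {
    routes.push('/bookkeeper/activate-impressionable');
  }
  return routes;
}

console.log(buildRoutes({ hostActivateImpressionable: true }));
console.log(buildRoutes({ hostActivateImpressionable: false }));
```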

In this step we move to a state where the ‘Bookkeeper’ will no longer host the ‘Activate-Impressionable’ functionality. Sometimes it is a good idea to have a gap from the previous step to make sure that there are no rejections and backfires that we didn’t trace in our tests and monitoring.
The first thing, if it was not done up until now, is to deploy the ‘Bookkeeper’ without the service functionality and make sure everything is still working. And wait a little bit more…
And now we just have to push the sources and the resources from the library to the ‘Activate-Impressionable’ service. In the ideal case, where there is no common code, we can also delete the library. This was not how it was in our case. We still have a lot of common code we can’t separate for the time being.
Now is also the time to do resource cleanup, web.xml edits, etc.
And for the bold and OCD among us – renaming packages and refactoring the code to the new service’s naming conventions.

Conclusion

image02
The entire process in our case took a couple of weeks. Part of the fun and advantage of such a process is the opportunity to get to know old code, its structure and its functionality better, without the need to modify something for a new feature with its constraints. Especially when someone else wrote it originally.
To perform such a process well, it is important to plan and to remain organized and on track. In case of a context switch it is very important to keep a bookmark of where you need to return to in order to continue. In our team we even did that with a handoff of the task between developers. Extreme Programming, it is.
It is interesting to see the surprising results in terms of lines of code. Originally we thought of it as splitting a micro-service from a monolith. In retrospect, it looks to me more like splitting a service into two services. ‘Micro’ in this case is in the eye of the beholder.

References

[1] http://highscalability.com/blog/2014/4/8/microservices-not-a-free-lunch.html
[2] http://eugenedvorkin.com/seven-micro-services-architecture-advantages/
[3] http://blog.cleancoder.com/uncle-bob/2014/12/17/TheCyclesOfTDD.html

http://martinfowler.com/articles/microservices.html
https://github.com/outbrain/ob1k
http://www.yaml.org/
http://lattix.com/

DevOps – The Outbrain Way

Like many other fast-moving companies, at Outbrain we have tried several iterations in the attempt to find the most effective “DevOps” model for us. As expected with any such effort, the road has been bumpy and there have been many “lessons learned” along the way. As of today, we feel that we have had some major successes in refining this model, and would like to share some of the insights from our journey.

 

Why get Dev and Ops together in the first place?

A lot has been written on this topic, and the motivations and benefits of adding the operational perspective into the development cycles has been thoroughly discussed in the industry – so we will not repeat those.

I would just say that we look at these efforts as preventive medicine, like eating well and exercising – life is better when you stay healthy. It’s not as good when you get sick and seek medical treatment to get healthy again.

 

What’s in a name?

We do not believe in the term “DevOps”, and what it represents.  We try hard to avoid it –  why is that?

Because we expect every Operations engineer to have development understanding and skills, and every Developer to have an operational understanding of how the service he/she develops works, and we help them achieve and improve those skills – so everyone is DevOps.

We do believe there is a need to get more system and production skills and expertise closer to the development cycles – so we call it Production Engineers.

 

First try – Failed!

We started by assigning Operations Engineers to work with dedicated development groups – the big problem was that it was done on top of their previous responsibility in building the overall infrastructure (config management, monitoring infrastructure, network architecture etc.), which was already a full time job as it was.  

This mainly led to frustration on both sides – the operations engineers, who felt they were spread too thin and had no time to do anything properly, just touching the surface all the time, and the developers, who felt they were not getting enough bandwidth from operations and were being held back.

Conclusion – in order to succeed we need to go all in – have dedicated resources!

 

Round 2 – Dedicated Production Eng.

Not giving up on the concept and learning from round 1 – we decided to create a new role – “Production Engineers” (or PE for short), who are dedicated to specific development groups.

This dedication manifests at different levels. Some of them are semi-trivial aspects, like seating arrangements – having the PE sit with the development team and share the day-to-day experience with them; and some of them are focus oriented, like joining the development team’s goals and actually becoming an integral part of the development team.

On the other hand, the PE needs to keep a very close relationship with the Infrastructure Operations team, who continue to develop the infrastructure and tools used by the PEs, and who support the PEs with technical expertise on more complex issues that require subject matter experts.

 

What & How model:

So how do we prevent a split-brain situation for the PE? Is the PE part of the development team or the Operations team? When you have several PEs supporting different development groups – are they all independent or can we gain from knowledge transfer between them?

In order to have a lighthouse to help us answer all those questions, and more that would evidently come up, we came up with the “What & How” model:

“What” – stands for the goals, priorities and what needs to be achieved. “The what” is set by the development team management (as they know best what they need to deliver).

“How” – stands for which methods, technologies and processes should be used to achieve those goals most efficiently from operational perspective. This technical, subject matter guidance is provided by the operations side of the house.

 

So what is a PE @ Outbrain?

In the first stage, an Operations Engineer goes through an on-boarding period, during which the engineer gains an understanding of Outbrain’s operational infrastructure. Once this engineer has gained enough mileage, he/she can become a PE, joining a development group and working with them to achieve the development goals the “right” way from an operational perspective, properly leveraging the Outbrain infrastructure and tools.

The PE enjoys both worlds – keeping a presence in the Operations group and maintaining his/her technical expertise on one hand, and being an integral part of the development team on the other.

From a higher level perspective – we have eliminated the frustration points experienced in our first round of “DevOps” implementation, and are gaining the benefit of a close relationship and a better understanding of needs and tools between the different development groups and the general Operations group. By the way, we have also gained a new career development path for our Operations Engineers and Production Engineers, who can move between those roles and enjoy different types of challenges and lifestyles.

 


Real Time Performance Monitoring @ Outbrain

Outbrain serves millions of requests per minute, based on a micro service architecture. Consequently, as you might expect, visibility and performance monitoring are crucial.

Serving millions of requests per minute, across multiple data centers, in a micro services environment, is not an easy task. Every request is routed to many applications, and may potentially stall or fail at every step in the flow. Identifying bottlenecks, troubleshooting failures and knowing our capacity limits are all difficult tasks. Yet, these are not things you can just give up on “because they’re hard”, but are rather tasks that every engineer must be able to tackle without too much overhead. It is clear that we have to aim for all engineers to be able to understand how their applications are doing at any given time.

Since we face all of these challenges every day, we’ve reached the point where a paradigm shift was required. For example, moving from the old, familiar “investigate the past” to the new, unfamiliar “investigate the present”. That’s only one of the requirements we came up with. Here are a few more:

 

Real time visibility

Sounds pretty straightforward, right? However, when using a persistent monitoring system, there is always a delay of at least a few minutes. These few minutes might contain millions of errors that potentially affect your business. Aiming for low MTTR means cutting delays where possible, thus moving from minute-based granularity to second-based.

 

Throughput, Latency and error rate are linked

Some components might suffer from high latency, but maybe the amount of traffic they receive is negligible. Others might have low latency under high load, but that’s only because they fail fast for almost every request (we are reactive!). We wanted to view these metrics together, and rank them by importance.

 

Mathematical correctness at any given aggregation (Don’t lie!)

When dealing with latency, one should look at percentiles, not averages, as averages can be deceiving and might not tell the whole story. But what if we want to view latency per host, and then view it per data center? If we store only percentiles per host (which is highly common in our industry), it is not mathematically correct to average them! On the other hand, we have so much traffic that we can’t just store every measurement with its latency, and we definitely can’t view them all in real time.
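A small worked example shows why averaging per-host percentiles goes wrong. The numbers are made up for illustration, and `percentile` is a plain nearest-rank helper written for this sketch: a busy, fast host and a quiet, slow host are combined in two ways.

```javascript
// Nearest-rank percentile of a list of latencies (milliseconds).
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up data: host A is fast and busy, host B is slow and almost idle.
const hostA = Array(18).fill(10); // 18 requests at 10 ms
const hostB = [500, 1000];        // 2 requests, both slow

// Wrong: average the per-host 90th percentiles.
const averagedP90 = (percentile(hostA, 90) + percentile(hostB, 90)) / 2;

// Right: compute the percentile over all measurements together.
const mergedP90 = percentile([...hostA, ...hostB], 90);

console.log({ averagedP90, mergedP90 });
```

Averaging reports 505 ms, while the 90th percentile of the combined traffic is 10 ms, because 18 of the 20 requests were fast. Keeping mergeable histograms instead of precomputed percentiles is what makes the second calculation possible at any aggregation level.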

 

Latency resolution matters

JVM based systems tend to display crazy numbers when looking at the high percentiles (how crazy? With heavy GC storms and lock contention there is no limit to how bad these values can get). It’s crucial for us to differentiate between latency in the 99.5 and 99.9 percentiles, while values at the 5 or 10 percentiles don’t really matter.

Summing up all of the requirements above, we reached a conclusion that our fancy persistent monitoring system, with its minute-based resolution, supporting millions of metrics per minute, doesn’t cut it anymore. We like it that every host can write thousands of metric values every minute, and we like being able to view historical data over long periods of time, but moving forward, it’s just not good enough. So, as we often do, we sat down to rethink our application-level metric collection and came up with a new, improved solution.

 

Our Monitoring Unit

First, consider metric collection from the application perspective. Logically, it is an application’s point-of-view of some component: a call to another application, to a backend or plain CPU bound computation. Therefore, for every component, we measure its number of requests, failures, timeouts and push backs along with a latency histogram over a short period of time.

In addition, we want to see the worst performing hosts in terms of any such metric (mean latency, number of errors, etc.).

mu

To achieve this display for each measured component we decided to use these great technologies:

 

HDR Histograms

http://hdrhistogram.github.com/HdrHistogram/

HdrHistogram supports the recording and analysis of sampled data value counts, across a configurable value range, with configurable value precision within the range. It is designed for recording histograms of latency measurements in performance-sensitive applications.

Why is this important? Because when using such histograms to measure the latency of some component, it allows you to have good accuracy of the values in the high percentiles at the expense of the low percentiles.

So, we decided to store in memory instances of histograms (as well as counters for requests, errors, timeouts, push backs, etc) for each measured component. We then replace them each second and expose these histograms in the form of rx.Observable using our own OB1K application server capabilities.
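The record-then-swap pattern can be sketched as follows. This is deliberately not HdrHistogram and not our OB1K code: the class name `IntervalRecorder`, the power-of-two buckets and the explicit `rotate()` call (which a timer would invoke once per second) are simplifications chosen for this example, and it is written in JavaScript rather than Java.

```javascript
// A deliberately simple histogram with power-of-two buckets. It only
// illustrates keeping one in-memory snapshot per interval and swapping it.
class IntervalRecorder {
  constructor() {
    this.current = IntervalRecorder.emptySnapshot();
  }

  static emptySnapshot() {
    return { requests: 0, errors: 0, buckets: new Map() };
  }

  // Record one call: its latency in milliseconds and whether it failed.
  record(latencyMs, failed) {
    // Round the latency up to the next power of two to pick a bucket.
    const bucket = 2 ** Math.ceil(Math.log2(Math.max(latencyMs, 1)));
    const snapshot = this.current;
    snapshot.requests += 1;
    if (failed) {
      snapshot.errors += 1;
    }
    snapshot.buckets.set(bucket, (snapshot.buckets.get(bucket) || 0) + 1);
  }

  // Called once per interval: hand out the finished snapshot and start a
  // fresh one, so readers never see a snapshot that is still being written.
  rotate() {
    const finished = this.current;
    this.current = IntervalRecorder.emptySnapshot();
    return finished;
  }
}

const recorder = new IntervalRecorder();
recorder.record(3, false);
recorder.record(100, true);
console.log(recorder.rotate());
```

Each rotated snapshot is a self-contained unit that can be streamed out and later merged with snapshots from other hosts.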

All that is left is to aggregate and display.

Java Reactive extensions

https://github.com/ReactiveX/RxJava

rx is a great tool to merge and aggregate streams of data in memory. In our case, we built a service to merge raw streams of measured components; group them by the measured type, and aggregate them in a window of a few seconds. But here’s the trick – we do that on demand. This allows us to let the users view results grouped by any dimension they desire without losing the mathematical correctness of latency histograms aggregation.

Some examples of the operators we use to aggregate the multiple monitoring units:

 

merge

rx merge operator enables treating multiple streams as a single stream

 

window

rx window operator enables sliding window abstraction

 

scan

rx scan operator enables aggregation over each window

 

To simplify things, we can say that for each component we want to display, we connect to each machine to fetch the monitored stream endpoint, perform ‘merge’ to get a single stream abstraction, ‘window’ to get a result per time unit, and ‘scan’ to perform the aggregation.
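The three steps can be mimicked on plain arrays to show how the data flows. To be clear about the assumptions: this is not the RxJava API. The functions `mergeStreams`, `windowBy` and `scanWindow` are stand-ins written for this sketch, the windows are fixed (tumbling) rather than sliding, and `scanWindow` returns only the final value of the running aggregate.

```javascript
// Each sample has the shape { second, requests, errors }.

// merge: treat the per-host streams as one stream, ordered by time.
function mergeStreams(streams) {
  return streams.flat().sort((a, b) => a.second - b.second);
}

// window: split the merged stream into fixed-size time windows.
function windowBy(samples, sizeSeconds) {
  const windows = new Map();
  for (const sample of samples) {
    const start = Math.floor(sample.second / sizeSeconds) * sizeSeconds;
    if (!windows.has(start)) {
      windows.set(start, []);
    }
    windows.get(start).push(sample);
  }
  return [...windows.entries()].map(([start, items]) => ({ start, items }));
}

// scan: fold the samples of one window into an aggregate.
function scanWindow(items) {
  return items.reduce(
    (acc, s) => ({ requests: acc.requests + s.requests, errors: acc.errors + s.errors }),
    { requests: 0, errors: 0 }
  );
}

function aggregate(streams, sizeSeconds) {
  return windowBy(mergeStreams(streams), sizeSeconds)
    .map((w) => ({ start: w.start, ...scanWindow(w.items) }));
}

const hostOne = [{ second: 0, requests: 10, errors: 1 }, { second: 2, requests: 5, errors: 0 }];
const hostTwo = [{ second: 1, requests: 20, errors: 2 }, { second: 3, requests: 1, errors: 1 }];
console.log(aggregate([hostOne, hostTwo], 2));
```

Since the aggregation runs over raw per-host samples, the same pipeline can group by any dimension that is requested at view time.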

 

Hystrix Dashboard

https://github.com/Netflix/Hystrix

The guys at Netflix found a great formula for displaying serving components’ status in a way that links between volume, error percentage and latency in a single view. We really liked that, so we adopted this UI to show our aggregated results.

The hystrix dashboard view of a single measured component shows counters of successes, failures, timeouts and push backs, along with a latency histogram, information on the number of hosts, and more. In addition, it provides a balloon view, which grows/shrinks with traffic volume per component, and is color-coded by the error rate.

See below how this looks in the breakdown view of all request components. The user gets a view of all measured components, sorted by volume, with a listing of the worst performing hosts.

view1

Another example shows the view of one application, with nothing but its entry points, grouped by data center. Our Operations guys find this extremely useful when needing to re-balance traffic across data centers.

REBALANCE

 

OK, so far so good. Now let’s talk about what we actually do with it.

Troubleshooting

Sometimes an application doesn’t meet its SLA, be it in latency or error rate. The simple case is a broken internal component (for example, some backend went down and all calls to it fail). At this point we can view the application dashboard and easily locate the failing call. A more complex case is an increase in the number of calls to a high latency component at the expense of a low latency one (for example, a drop in cache hit rate). Here our drill down needs to focus on the relative amount of traffic each component receives: we might expect a 1:2 ratio while in reality we observe 1:3.

With enough alerting in place, this could be caught by an alert. Having the real time view will allow us to locate the root cause quickly even when the alert is a general one.

troubleshoot

Performance comparison

In many cases we want to compare the performance of two groups of hosts doing the same operation, for example after a version upgrade or a topology change. We use tags to differentiate groups of machines (each data center, each environment, and even each hostname is a tag). We can then ask for a specific metric, grouped by tags, to get the following view:

compare
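Grouping a metric by a tag can be sketched as follows. The sample shape (`tags`, `latencyMs`) and the function name are hypothetical, and the sketch uses a simple mean rather than the histogram aggregation our service performs.

```javascript
// Hypothetical samples: one latency measurement per host, with that host's tags.
function averageByTag(samples, tagName) {
  const groups = new Map();
  for (const s of samples) {
    const key = s.tags[tagName];
    const g = groups.get(key) || { sum: 0, count: 0 };
    g.sum += s.latencyMs;
    g.count += 1;
    groups.set(key, g);
  }
  // Turn the running sums into one mean per tag value.
  const result = {};
  for (const [key, g] of groups) result[key] = g.sum / g.count;
  return result;
}
```

Grouping by a `version` tag compares an upgraded group against the old one, while grouping by a `datacenter` tag compares the same metric across data centers.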

 

Load testing

We conduct several types of load tests. One is where we shift as much traffic as possible to one data center, trying to hit the first system-wide bottleneck. Another is performed on specific applications. In both cases we use the application dashboard to view the bottlenecks, just like we would when troubleshooting unexpected events.

One thing to keep in mind is that when an application is under load it sometimes becomes CPU bound, and measurements are skewed because threads just don’t get CPU time. The same happens during GC. In such cases we must also measure the effect of this phenomenon.

The measured unit in this case is the ‘jvm hiccup’, which basically means taking one thread, letting it sleep for a while and measuring how much longer than the requested sleep time it took to wake up. Low hiccups mean we can rely on the numbers presented by other metrics.

hiccup
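The hiccup idea can be illustrated with a small sketch. Note the assumption: this is JavaScript, so it measures timer delay on a Node or browser event loop, which is only an analogue of putting a JVM thread to sleep. The function names are made up for illustration.

```javascript
// Pure part: the hiccup is whatever the sleep took beyond what was requested.
function hiccupMs(requestedSleepMs, observedElapsedMs) {
  return Math.max(0, observedElapsedMs - requestedSleepMs);
}

// Measuring part: sleep once and report the hiccup through a callback.
function measureHiccup(sleepMs, onResult) {
  const startedAt = Date.now();
  setTimeout(() => {
    const elapsedMs = Date.now() - startedAt;
    onResult(hiccupMs(sleepMs, elapsedMs));
  }, sleepMs);
}
```

Running `measureHiccup(10, console.log)` repeatedly while the process is under load gives a rough picture of how much the other measurements can be trusted.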

 

What’s next?

Real time monitoring holds a critical role in our everyday work, and we have many plans to leverage these measurements. From smarter metric driven load balancing in the client to canary deployments based on real application behavior – there is no limit to what you can achieve when you measure stuff in a fast, reliable manner.

Monitoring APIs with ELK

The Basics

One of the main challenges we’ve dealt with during the last couple of years was opening our platform and recommendation engine to the developer community. With the amount of data that Outbrain processes, direct relations with hundreds of thousands of sites and a reach of more than 600M users a month, we can drive the next wave of content innovation. One of Outbrain’s main drivers for enabling an automated, large scale recommendation system is giving application developers the option to interact with our system via an API.

Developers build applications, and those applications are used by users in different locations and at different times. When exposing an API for external use, you can rarely predict how people will actually use it.

These variations can arise for different reasons:

  1. Unpredictable scenarios
  2. Unintentional misuse of the API. Either for lack of proper documentation, a bug, or simply because a developer didn’t RTFM.
  3. Intentional misuse of the API. Yeah, you should expect people will abuse your API or use it for fraudulent activity.

In all those cases, we need to know how the developer community is using the APIs and how the end users (applications) are using them, so that we can take proactive measures.

Hello ELK.

The Stack

image01

ElasticSearch, Logstash and Kibana (AKA ELK) are great tools for collecting, filtering, processing, indexing and searching through logs. The setup is simple: our service writes logs (using Log4J), and the logs are picked up by a Logstash agent that sends them to an ElasticSearch index. Kibana is set up to visualize the data in the ES index.

The Data

Web server logs are usually too generic. Application debug logs are usually too noisy. In our case, we have added a dedicated log with a single line for every API request. Since we’re in application code, we can enrich the log with interesting fields, such as the country of request origin (by translating the IP to a country).

Here’s a list of useful fields:

  • Request IP – Don’t forget about the XFF (X-Forwarded-For) header.
  • Country / City – We use a 3rd party database for translating IPs to country.
  • Request User-Agent
  • Request Device Type – Resolved from the User-Agent.
  • Request HTTP Method – GET, POST, etc.
  • Request Query Parameters
  • Request URL
  • Response HTTP Status Code – 200, 204, etc.
  • Response Error Message – The API service can fill in extra details on errors.
  • Developer Identifier / API Key – If you can identify the Developer, Application or User, add these fields.
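A single enriched log line per request could look roughly like the sketch below. Our service writes its log with Log4J; this JavaScript version only illustrates the shape of the data. The field names, the `req`/`res` objects and the `lookupCountry` function (standing in for the 3rd party IP database) are all hypothetical.

```javascript
// Hypothetical helper: pick the client IP, preferring the first X-Forwarded-For entry.
function clientIp(headers, socketIp) {
  const xff = headers["x-forwarded-for"];
  return xff ? xff.split(",")[0].trim() : socketIp;
}

// Build one JSON log line per API request. Field names are illustrative.
function apiLogLine(req, res, lookupCountry) {
  const ip = clientIp(req.headers, req.socketIp);
  return JSON.stringify({
    timestamp: new Date(req.receivedAt).toISOString(),
    ip: ip,
    country: lookupCountry(ip),
    userAgent: req.headers["user-agent"],
    method: req.method,
    url: req.url,
    query: req.query,
    status: res.status,
    error: res.errorMessage || null,
    apiKey: req.apiKey,
  });
}
```

Writing the line as JSON keeps every field separately indexable once it reaches the ES index.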

What can you get out of this?

So we’ve got the data in ES, now what?

Obvious – Events over time

image03

This is pretty trivial. You want to see how many requests are made. With Kibana’s slice ‘n dice capabilities, you can easily break it down per Application, Country, or any other field that you’ve bothered to add. If an application is abusing your API and calling it a lot, you can see whose request volume just jumped and handle it.

Request Origin

image04

If you’re able to resolve the request IP (or XFF header IP) to country, you’ll get a cool looking map / table and see where requests are coming from. This way you can detect anomalies such as fraud.

 

Http Status Breakdown

image02

By itself, this is nice to have. When combined with Kibana’s slice ‘n dice capabilities, this lets you see an overview for any breakdown. In many cases you can see that an application/developer is making the wrong API call. Be proactive and lend some assistance in near real time. Trust us, they’ll be impressed.

IP Diversity

image00

Why would you care about this? Consider the following: A developer creates an application using your API, but all requests are made from a limited number of IPs. This could be intentional, for example if all requests are made through some cloud service. This could also hint at a bug in the integration of the API. Now you can investigate.
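As a rough illustration, IP diversity boils down to counting distinct IPs per developer next to the request count. The sketch below works on log events that were already fetched. The event shape (`apiKey`, `ip`) and the function name are assumptions, and in practice the counting would happen in the ES index rather than in application code.

```javascript
// Count requests and distinct source IPs per API key.
function ipDiversity(events) {
  const ipsByKey = new Map();
  const requestsByKey = new Map();
  for (const e of events) {
    if (!ipsByKey.has(e.apiKey)) ipsByKey.set(e.apiKey, new Set());
    ipsByKey.get(e.apiKey).add(e.ip);
    requestsByKey.set(e.apiKey, (requestsByKey.get(e.apiKey) || 0) + 1);
  }
  const result = {};
  for (const [apiKey, ips] of ipsByKey) {
    result[apiKey] = { requests: requestsByKey.get(apiKey), distinctIps: ips.size };
  }
  return result;
}
```

An API key with many requests and very few distinct IPs is the case worth investigating.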

Save the Best for Last

The data exists in ElasticSearch. Using Kibana is just one way of using it. Here are a few awesome ways to use the data.

Automated Validations (or Anomaly detection)

Once we had identified the key anomalies in API usage, we set up automated tests to search for these anomalies on a daily basis. Automatic anomaly detection in API usage proved to be incredibly useful when scaling a product. These tests can be run on demand or on a schedule, and a daily report is produced.

image05

Abuse Detection

ElasticSearch is (as the name suggests) very elastic. It enables querying and aggregating the data in a variety of ways. Security experts can (relatively) easily slice & dice the data to find abuse patterns. For example, we detect when the same user-id is used in two different locations and trigger an alert.
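The “same user-id in two different locations” check can be sketched like this. It is a minimal illustration over events that were already fetched, not our actual alerting code. The event shape (`userId`, `country`, `t` in epoch millis), the function name and the time window are all assumptions.

```javascript
// Flag user ids seen in two different countries within windowMs of each other.
function usersSeenInMultipleCountries(events, windowMs) {
  const sorted = [...events].sort((a, b) => a.t - b.t);
  const lastSeen = new Map(); // userId -> most recent event
  const flagged = new Set();
  for (const e of sorted) {
    const prev = lastSeen.get(e.userId);
    if (prev && prev.country !== e.country && e.t - prev.t <= windowMs) {
      flagged.add(e.userId);
    }
    lastSeen.set(e.userId, e);
  }
  return [...flagged];
}
```

The time window matters: a user who shows up in another country a week later has probably just travelled, while a jump within the hour deserves an alert.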

Key Takeaways

  • Use ELK for analyzing your API usage
  • Have the application write the events (not a generic web-server).
  • Provide application-level information, e.g. additional error information or resolved geo location.
  • Share the love

Angular DRY mocking – Leonardo

leonardo-logo

This post was written by Sagiv Frenkel.

As developers, one of the first and most basic things we learn is “Don’t repeat yourself!”.
That means trying to avoid writing the same code twice – in other words, no copy paste!
While we still sin with the occasional copy paste, it’s something we’re mindful of and is easy to notice. We just have to remember to refactor later on.

But do we treat our mocking the same?

Let’s look at a typical development flow:

1) Create your UI/UX, services and controller.
2) Create your server API calls.
3) Test your application, manually/automated with self generated data in different scenarios.

What’s wrong with this approach?

We aren’t repeating code, but we are repeating work:

1) Documenting – there’s no good way to tell which user/data to use for which scenario.
2) Running – you need to log in/out to change users or manually change code to fit changes.
3) Testing – error scenarios, edge cases, and request delays/throttling are very hard. Using override scripts or using comments to switch data are the only tools at our disposal.

Can we do better?

Introducing Leonardo

Leonardo is an open source AngularJS module created by Outbrain. It can be installed from npm or Bower, and integrates easily into existing AngularJS applications (more details on Leonardo’s GitHub repo).


Leonardo has a fancy UI where you can easily toggle different states/scenarios.

It enables you to:

1) Centralize your mocking and scenario configuration.
2) Persist the configuration into an external file.
3) Create manual QA or automated tests.

We use Leonardo extensively with Protractor. More on this in another post.

Want to get started with Leonardo?

Check this Example to see how you can move from a regular image gallery to a mocked one.

How does Leonardo work?

Leonardo has two important concepts – states and scenarios.

state

We add states to declare what and how to mock.
There are two types:

Ajax States – This is what we will typically use. We declare the URL and verb we wish to mock and the response data we wish to return, including a delay and a status.

leoConfiguration.addStates([
  {
    name: 'flicker-images',
    verb: "jsonp",
    url: 'http://api.flickr.com/services/feeds/photos_public.gne',
    options: [
      {
        name: 'get ninja turtles', status: 200,
        data: {
          "items": [
            { "id": "20054214406", "farm": 1, "title": "leo1" },
            { "id": "19896041068", "farm": 1, "title": "017580" }
          ]
        }
      },
      {
        name: 'get ninja enemies', status: 200,
        data: {
          "items": [
            { "id": "20058148116", "title": "the_shredder" },
            { "id": "20102720711", "title": "the_ninjas" }
          ]
        }
      }
    ]
  }
]);

Non Ajax States – This requires more work on the part of the developer. Basically, this allows you to declare a state and (optionally) its underlying data, and later check whether it’s on or off.

leoConfiguration.addState({
  name: 'Set Mission',
  options: [
    { name: 'turtles', data: "Protect April o'neil" },
    { name: 'shredder', data: 'Destroy the ninja turtles' }
  ]
});

You can query Leonardo for the value of a certain state.

var mission = leoConfiguration.getState('Set Mission');
$rootScope.mission = mission ? mission.data : "";

Leonardo triggers an event whenever a state changes.

$rootScope.$on('leonardo:setStates', function(){
  var debug = leoConfiguration.getState('debug');
  $rootScope.debug = !!debug;
});

Scenarios:

Scenarios simply enable you to set a specific set of states as active.

// a scenario lists, for each state it activates, which option to turn on
leoConfiguration.addScenario({
  name: 'turtles scenario',
  states: [
    { name: 'flicker-images', option: 'get ninja turtles' },
    { name: 'Set Mission', option: 'turtles' }
  ]
});

Note:

– We currently only support Angular applications. That is what we initially developed on, and it was easy to implement. If the tool gains traction and popularity, it should be easy to migrate to a more vanilla approach.

Use Leonardo to start mocking http or anything you like! We’d love to get your feedback!

A/B testing @ Outbrain – Wabbit

 

What Is A/B Testing

A/B testing is a method widely used to validate assumptions about web site optimizations. With an A/B test we can take two configurations of a web page design, configuration A and configuration B, and compare them according to metrics that define what a successful result is. In other words, you test your new design against the current design and measure which one produces better results. To decide which design is better, you split the traffic to your web page between the two configurations, measure which one performed better, and apply that configuration as the default for your site.
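One common way to split traffic is to hash a stable user identifier into a bucket, so the same user always lands in the same configuration. The sketch below shows the idea. It is not how our system assigns variants, and the function names, the identifier and the bucket count are assumptions made for illustration.

```javascript
// Small deterministic string hash (FNV-1a, 32-bit); any stable hash would do.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Assign a user to 'A' or 'B'; shareOfA is the fraction of traffic sent to A (0..1).
function assignVariant(userId, testName, shareOfA) {
  const bucket = fnv1a(testName + ":" + userId) % 10000; // 0..9999
  return bucket < shareOfA * 10000 ? "A" : "B";
}
```

Including the test name in the hashed text keeps assignments independent between tests, so a user who got A in one test is not automatically in A for the next one.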

 

What To Test?

The choice of what to test depends on your goals. At Outbrain each configuration is called an A/B test variant. The idea of Outbrain’s A/B testing is to allow publishers to test two different designs of their widgets, and measure which design had better Click Through Rate (CTR) and Revenue Per 1,000 Impressions (RPM) performance.
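For reference, the two metrics follow directly from their names: CTR is clicks divided by impressions, and RPM is revenue divided by impressions, multiplied by 1,000. A minimal sketch, with hypothetical function names:

```javascript
// Click Through Rate: fraction of impressions that were clicked.
function ctr(clicks, impressions) {
  return impressions === 0 ? 0 : clicks / impressions;
}

// Revenue Per 1,000 Impressions.
function rpm(revenue, impressions) {
  return impressions === 0 ? 0 : (revenue / impressions) * 1000;
}
```

A variant with 50 clicks and 20 units of revenue over 10,000 impressions has a CTR of 0.5% and an RPM of 2.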

At the core of the system there are more than 450 settings that define the configuration of each widget, whether it is installed on a single blog or on a group of sites.

There are more than two hundred online settings that directly affect the widget. Each of these settings can be tested within A/B test variants. For example, one of these online settings is called “Widget Structure”. This setting configures the look and feel of the widget.

 


Widget structure – look and feel of the widget

If your goal is to test the addition of a new widget structure, you can configure variant A with the new widget structure and run it against variant B, which uses the original design of the widget structure and serves as the control group.


When the test comes to an end, many questions may come up. How did it affect the customers? Did the new design of the widget structure deliver better CTR and RPM performance? Would changing the title of the new widget structure have resulted in better performance? Would changing the image size of the old widget structure have resulted in better performance? All of these questions can be answered one by one if we set up appropriate A/B test variants.

Even though each A/B test in our system is unique, there are certain widget settings that are usually tested for every variant:

  • Number of paid recommendations
  • Number of organic recommendations
  • Image size in the widget
  • The number of recommendations on the widget unit
  • Widget structure

 

A/B Tests in Outbrain

Once you have decided to create a new A/B test, you can do it using an internal tool named Wabbit – Widget A/B testing tool. The tool lets you create a new A/B test, edit an existing one, or pull internal reports with Key Performance Indicator (KPI) performance for the test.

The A/B test can be defined on a specific widget on one site or it can be done on a group of sites that use the same widget.

When the test ends, we pull the A/B test report to measure which configuration had better performance. If the data indicates one of the configurations is an improvement according to our KPIs and the test has experienced enough traffic to be considered significant, we give the option to apply the new configuration as the default for the widget.

 

Tips!!

  • At Outbrain we recommend running experiments for at least two weeks and no more than a month. The main reason for the minimum is to eliminate the “day of the week” effect, because users who visit the site on the weekend might represent a different segment than those who visit during the week.
  • On the other hand, running an A/B test for more than a month leads to unreliable results, for example because cookies expire and users start seeing a different configuration, which compromises the consistency of the test.
  • At Outbrain, we also recommend allocating at least 5% of traffic to an A/B test to increase the probability of ending the test with results that have more than a 90% confidence level based on statistical analysis. Here’s a calculator from KissMetrics that will allow you to easily figure out if your A/B test results are significant.
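To make the “90% confidence level” concrete, here is a minimal sketch of one standard way to check it for CTR: a two-proportion z-test. This is a textbook method chosen for illustration, not necessarily what Wabbit or the calculator use, and the function names are hypothetical. The constant 1.645 is the two-sided critical value of the normal distribution at 90% confidence.

```javascript
// z statistic for the difference between two click-through rates.
function twoProportionZ(clicksA, impressionsA, clicksB, impressionsB) {
  const pA = clicksA / impressionsA;
  const pB = clicksB / impressionsB;
  const pooled = (clicksA + clicksB) / (impressionsA + impressionsB);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / impressionsA + 1 / impressionsB)
  );
  return (pA - pB) / standardError;
}

// True when the CTR difference is significant at the 90% level (two-sided).
function isSignificantAt90(clicksA, impressionsA, clicksB, impressionsB) {
  const z = twoProportionZ(clicksA, impressionsA, clicksB, impressionsB);
  return Math.abs(z) > 1.645;
}
```

With 10,000 impressions per variant, 120 clicks against 100 is not yet significant at this level, while 150 against 100 is. This is why a variant needs enough traffic before a winner can be declared.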

The power of promises for file downloading

In this blog post I will implement a file download with a progress indicator using cookies, AngularJS and promises.

Promises are a powerful concept with a number of advantages. In the following implementation, pay attention to these points (you’re more than welcome to comment):

  1. Clarity and readability of code
  2. Error handling
  3. Separation of concerns

I thought of showing the same implementation without promises, but I think anyone who has tried to handle more than one callback and handle the error cases properly will easily see the difference.

The Module

A download button that changes its text at set intervals.
At the end it should be in a success state or an error state.
To complicate things a little and show the power of promises I added another step called “validateBeforeDownload”. This step calls the server to validate the download and fails it if necessary.

See It Live!

Downloading a file

The standard way of downloading a file is with a simple “a” tag with an href.
To be able to add the “validateBeforeDownload” step and avoid passing DOM elements to a service, I am using an iframe which a service creates and destroys. This triggers the request, and if the server response headers are appropriate the download will begin.

Service Code

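var generateIframeDownload = function(){
  var iframe = document.createElement('iframe');
  // mark the download as in progress; the server deletes this cookie when it responds
  $cookieStore.put('download_file', 'true');

  iframe.src = '/myserver/dowload';
  iframe.style.display = "none";
  document.body.appendChild(iframe);
  // return the iframe so it can be removed once the download is done
  return iframe;
}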

Adding in the progress

Easier said than done! Downloading a file can’t be done with a simple ajax call, so you can’t tell when the download is complete.
The solution I’m using is to set a cookie, let’s call it “download_file”, and run a timer that checks for the cookie every 500ms.

  • While the cookie exists the loading state is preserved.
  • Once the request completes, the server deletes the cookie and the timer is stopped.

This isn’t the best solution but is simple and doesn’t require sockets or external plugins.

Service Code

var manageIframeProgress = function(iframe){
  var defer = $q.defer();
  // check the cookie every half a second, for a maximum of 50 intervals
  var promise = $interval(function () {
      if (!$cookieStore.get('download_file')){
        // the server deleted the cookie: the download has started
        $interval.cancel(promise);
      }
  }, 500, 50);

  // $interval rejects its promise when cancelled (success for us) and resolves it
  // after all 50 intervals (a timeout for us); every tick arrives as a notification
  promise.then(defer.reject, defer.resolve, defer.notify);

  promise.finally(function () {
    $cookieStore.remove('download_file');
    document.body.removeChild(iframe);
  });

  return defer.promise;
}

Java Server

To show the full stack, here is the server code that writes the response data and clears the cookie.

public String exportExcel() throws Exception {
 final byte[] bytesToOutput = createExcelReport().toByteArray();
 output = new ByteArrayInputStream(bytesToOutput);
 fileSize = bytesToOutput.length;
 HttpServletResponse response  = getResponse();
 // same name as the cookie set by the client; a max age of 0 tells the browser to delete it
 Cookie cookie = new Cookie("download_file", "true");
 cookie.setPath("/");
 cookie.setMaxAge(0);
 cookie.setSecure(true);
 response.addCookie(cookie);
 return "exportCsv";
}

Wrapping everything together with promises

Pay attention to the comments in the code. Some of the code only simulates the server requests and responses, and is there just to give the full picture.

HTML

Each visual state of the button is determined by its text ($scope.downloadFileText).

Service

Notice that $timeout mocks an asynchronous call to a server and its response.
This would normally be done with $http.

angular.module("fileDownload").factory("downloadService", function($interval, $timeout, $q, $cookieStore){

  var generateIframeDownload = function(){
    var iframe = document.createElement('iframe');
    $cookieStore.put('download_file', 'true');

    iframe.src = '/myserver/dowload';
    iframe.style.display = "none";
    document.body.appendChild(iframe);
    // return the iframe so manageIframeProgress can remove it later
    return iframe;
  }

  var manageIframeProgress = function(iframe){
      var defer = $q.defer();

      // check the cookie every half a second, for a maximum of 50 intervals
      var promise = $interval(function () {
        if (!$cookieStore.get('download_file')){
          $interval.cancel(promise);
        }
      }, 500, 50);

      // cancelled interval (cookie gone) = success, 50 intervals elapsed = failure
      promise.then(defer.reject, defer.resolve, defer.notify);

      promise.finally(function () {
        $cookieStore.remove('download_file');
        document.body.removeChild(iframe);
      });

      return defer.promise;
  }

  return {
    validateBeforeDownload: function (config) {
      var defer = $q.defer();

      // notify that the validation is in progress every half a second
      var progress = $interval(function (i) {
        defer.notify(i);
      }, 500);

      //mock response from server - this would typically be a $http request and response
      $timeout(function () {
        // in case of error:
         //defer.reject("this file can not be downloaded");
         defer.resolve(config);
      }, 2000);

      // stop notifying once the validation has settled
      defer.promise.finally(function () {
        $interval.cancel(progress);
      });

      return defer.promise;
    },
    downloadFile: function (config) {

      var iframe = generateIframeDownload();
      var promise = manageIframeProgress(iframe);

      //mock response from server - this would be automatically triggered by the file download completion
      $timeout(function(){
        $cookieStore.remove('download_file');
      }, 3000);

      return promise;
    }
  }
});

Controller

This is where our hard work pays off and promises start to shine.

Let’s step into the promise mechanism –
Chaining “downloadService.validateBeforeDownload” before “downloadService.downloadFile” with the “then” method creates a third promise which shares callbacks for success, failure and notifications (for the progress).
There is also a finally callback attached to this promise that we use for sharing code between the success and failure paths.
But the really nice thing here is that it also enables handling errors from “validateBeforeDownload” alone, and bubbling them up if needed with $q.reject or by simply throwing the error.

Pay attention that each step towards completion of the promise reads as if it were handled synchronously, while the actual asynchronicity is handled by the promise mechanism and the service. Magic!

angular.module("fileDownload").controller("downloadCtrl", function($scope, $timeout, $q, downloadService){
  $scope.downloadFile = function(){
    var params = {};
    var loadingText = 'Loading Data';
    var options = ['.', '..', '...'];
 
    $scope.downloadFileText = loadingText + options[0];
    var promise = downloadService.validateBeforeDownload(params).then(null, function (reason) {
      alert(reason);
      // you can also throw the error
      // throw reason;
      return $q.reject(reason);
    }).then(downloadService.downloadFile).then(function(){
      $scope.downloadFileText = 'Loaded';
    }, function(){
      $scope.downloadFileText = 'Failed';
    }, function(i){
      i = (i+1)%3;
      $scope.downloadFileText = loadingText + options[i];
    });
    
    promise.finally(function(){
      $timeout(function(){
        delete $scope.downloadFileText;  
      }, 2000);
    });
  };
});

So Long Spring XMLs

Like many Java projects these days, we use Spring at Outbrain to configure the wiring of our Java dependencies. Spring is a technology that started in order to solve a common, yet not so simple, issue – wiring all the dependencies in a Java project. This is done by utilizing IoC (Inversion of Control) principles. Today Spring does a lot more than just wiring and bootstrapping, but in this post I will focus mainly on that.

When Spring just started, the only way to configure the wiring of an application was to use XMLs which defined the dependencies between different beans. As Spring continued to develop, two more methods were added to configure dependencies – the annotation method and the @Configuration method. At Outbrain we use XML configuration. I found that this method has a lot of pain points, which I found a remedy for in Spring’s @Configuration.

What is this @Configuration class?

You can think of a @Configuration class just like XML definitions, only defined by code. Using code instead of XMLs has some advantages which made me switch to this method:

  1. No typos – A typo in a class or bean name can’t go unnoticed in code. The code just won’t compile.
  2. Compile time check (fail fast) – With XMLs it’s possible to add an argument to a bean’s constructor but to forget to inject this argument when defining the bean in the XML. Again, this can’t happen with code. The code just won’t compile.
  3. IDE features come for free – Using code allows you to find usages of the bean’s constructor to easily find the contexts that use it. It allows you to jump back and forth between bean definitions, and basically everything you can do with code, you get for free.
  4. Feature flags – At Outbrain we use feature flags a lot. Due to the continuous-deployment culture of the company, code that is pushed to the trunk can find itself in production in a matter of minutes. Sometimes, when developing features, we use feature flags to enable/disable certain features. This is pretty easy to do by defining two different implementations of the same interface and deciding which one to load according to the flag. When using XMLs we had to use the alias feature, which makes creating feature flags not intuitive enough. With @Configuration, we can use a simple if clause to choose the right implementation.
