Blog Posts - January 2017

From complex monolith to scalable workflow


Introduction

One of the core functionalities in Outbrain’s solution is our crawling system.

The crawler fetches web pages (e.g. articles) and indexes them in our database.

The crawl flow can be divided into several high-level steps:

  1. Fetch – download the page
  2. Resolve – identify the page context (e.g. what is the domain?)
  3. Extract – try to extract features out of the HTML – like title, image, description, content etc.
  4. NLP – run NLP algorithms using the extracted features, in order to classify and categorize the page.
  5. Save – store to the DB.

 

 

The old implementation

The crawler module was one of the first modules that the first Outbrainers had to implement, back in Outbrain’s small-start-up days 6-7 years ago.

 

In January 2015 we decided that it was time to sunset the old crawlers and rewrite everything from scratch. The main motivations for this decision were:

  1. The crawlers were implemented as one long monolith, without clear steps (OOP as opposed to FP).
  2. The crawlers were implemented as a library that was used in many different services, according to the use-case. Each change forced us to rebuild and check all of those services.
  3. The technologies were no longer up to date (Jetty vs. Netty, sync vs. async development, etc.).

 

 

The new implementation

When designing the new architecture of our crawlers, we tried to follow these ideas:

  1. A simple step-by-step (workflow) architecture, simplifying the flow as much as possible.
  2. Split complex logic into micro-services for easier scaling and debugging.
  3. Use async flows with queues when possible, to better control bursts.
  4. Use newer technologies like Netty and Kafka.

 

Since the crawler flow, like many others, is basically an ETL (Extract, Transform & Load) process, our first decision was to split the main flow into 3 different services:

  1. Extract – fetch the HTML and resolve the domain.
  2. Transform – take features out of the HTML (title, images, content…) + run NLP algorithms.
  3. Load – save to the DB.

 

The implementation of those services is based on “workflow” ideas. We created an interface for implementing a step, and each service contains several steps, each doing a single, simple calculation. For example, some of the steps in the “Transform” service are:

  • TitleExtraction
  • DescriptionExtraction
  • ImageExtraction
  • ContentExtraction
  • Categorization

 

In addition, we implemented a class called Router. It is injected with all the steps it needs to run, and it is in charge of running them one after the other, reporting errors, and skipping unnecessary steps (for example, there is no need to run categorization when content extraction has failed).
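To make this concrete, here is a minimal sketch of the step and Router ideas – illustrative JavaScript only, not our actual implementation (the step class and its skip condition are assumptions for the example):

// Minimal sketch of the step/Router idea – illustrative, not Outbrain's actual code.
// Each step receives the features collected so far and may add new ones.
class TitleExtraction {
  shouldRun(features) {                  // skip conditions live on the step
    return features.some((f) => f.type === 'HTML');
  }
  run(features) {                        // returns newly extracted features
    return [{ type: 'Title', value: 'an extracted title' }];
  }
}

class Router {
  constructor(steps) { this.steps = steps; }   // the steps are injected

  process(features) {
    for (const step of this.steps) {
      if (!step.shouldRun(features)) continue;  // skip unnecessary steps
      try {
        features = features.concat(step.run(features));
      } catch (err) {                           // report errors and keep going
        console.error(step.constructor.name + ' failed: ' + err.message);
      }
    }
    return features;
  }
}

Each service is then just a Router injected with that service’s steps, which keeps the flow a thin loop over simple, testable units.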

 

Furthermore, any logic that was a bit complex was extracted out of those services into a dedicated micro-service. For example, the fetch part (downloading the page from the web) was extracted into its own micro-service. This helped us encapsulate fallback logic (between different HTTP clients) and other related logic outside of the main flow. It is also very helpful when we want to debug: we just make an API call to that service to get the same result the main flow gets.

 

We modeled each piece of data we extracted from the page as a feature, so each page is eventually translated into a list of features:

  • URL
  • Title
  • Description
  • Image
  • Author
  • Publish Date
  • Categories

 

The data flow in those services is very simple. Each step receives all the features created before it ran, and adds (if needed) one or more features to the output. That way the feature list (starting with only the URL) gets “inflated” as it goes through the steps, reaching the “Load” part with all the features we need to save.
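For illustration (all values are made up), the feature list might inflate like this on its way through the services:

// Made-up illustration of the feature list inflating along the flow
const extract = (features) => features.concat([
  { type: 'HTML', value: '<html>...</html>' },
  { type: 'Domain', value: 'example.com' }
]);
const transform = (features) => features.concat([
  { type: 'Title', value: 'Some headline' },
  { type: 'Categories', value: ['news'] }
]);

let features = [{ type: 'URL', value: 'http://example.com/article' }];
features = transform(extract(features));  // 1 feature in, 5 features out
// The "Load" service would now persist all five features to the DB.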


 

 

The migration

 

One of the most painful parts of such a rewrite is the migration. Since this is very important core functionality at Outbrain, we could not just change it and cross our fingers that everything would be OK. In addition, it took several months to build the new flow, and we wanted to test as we went – in production – rather than wait until we were done.

 

The main concept of the migration was to create the new flow side by side with the old one, having both run at the same time in production, allowing us to test the new flow without harming production.

 

The main steps of the migration were:

  1. Create the new services and start implementing some of the functionality. Do not save anything at the end.
  2. Start calls from the old flow to the new one, asynchronously, passing all the features that the old flow calculated.
  3. Have each step implemented in the new flow compare its results to the old flow’s results, and report when they differ.
  4. Implement the “save” part – but only for a small portion of the pages, controlled by a setting.
  5. Evaluate the new flow using the comparison between the old and new flows’ results.
  6. Gradually enable the new flow for more and more pages, monitoring the effect in production.
  7. Once comfortable enough, remove the old flow and run everything through the new flow only.

 

The approach can be described as “TDD” in production. We created a skeleton for the new flow and started streaming crawls into it while it still did almost nothing. Then we wrote the functionality step by step, each step tested in production against the old flow. Once all steps were done and tested, we replaced the old flow with the new one.
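A sketch of the comparison idea (the function and its reporting are illustrative):

// Illustrative sketch: compare a new-flow step's output to the old flow's
function compareToOldFlow(stepName, newFeatures, oldFeatures) {
  for (const feature of newFeatures) {
    const old = oldFeatures.find((f) => f.type === feature.type);
    if (old && old.value !== feature.value) {
      // report only – the old flow's results are still the ones being served
      console.warn('mismatch in ' + stepName + ' for ' + feature.type +
                   ': old="' + old.value + '" new="' + feature.value + '"');
    }
  }
}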

 

 

Where we are now

As of December 20th, 2016, we are running only the new flow, for 100% of the traffic.

The changes we already see:

  1. Throughput/scale: we increased the throughput (crawls per minute) from 2K to ~15K.
  2. Simplicity: the time to add new features or fix bugs has decreased dramatically. There is no good KPI here, but most bugs are now solved within 1-2 hours, including production integration.
  3. Fewer production issues: it is easier for QA to understand and even debug the flow (using calls to the micro-services), so some issues are eliminated before even getting to developers.
  4. Bursts handling: thanks to the queue architecture we endure bursts much better. It also allows simple recovery after one of the services has been down (for maintenance, for example).

  5. Better auditing: thanks to the workflow architecture, it was very easy to add audit messages using our ELK infrastructure (Elasticsearch, Logstash, Kibana). The crawling flow today reports the outcome (and issues) of every step it performs, allowing us, QA, and even the field guys to understand a crawl in detail without needing a developer (see the illustrative message below).
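For illustration, a single step’s audit message might look like this (the field names are hypothetical, not our actual schema):

// Hypothetical audit message for one step, as indexed into ELK
const auditMessage = {
  url: 'http://example.com/article',
  service: 'Transform',
  step: 'Categorization',
  outcome: 'SKIPPED',                 // e.g. SUCCESS / FAILED / SKIPPED
  reason: 'content extraction failed',
  durationMs: 12,
  timestamp: new Date().toISOString()
};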

Automating your workflow

During development, there are many occasions where we have to do things that are not directly related to the feature we are working on, or that are repetitive and recurring.
Over the span of a feature’s development, these tasks can often take as much time as the actual development.

For instance, updating your local dev micro-services environment before testing your code. This task on its own – which usually includes updating your local repo, building and starting several services, and often debugging and fixing issues caused by others – can take hours, many times just to test a simple procedure.

We are developers; we spend every day automating and improving other people’s workflows, yet we often spend so many hours doing the same time-consuming tasks over and over again.
So why not build the tools we need to automate our own workflows?

In our team we decided to build a few tools to help out with some extra irritating tasks we were constantly complaining about to each other.

The first one was simple: creating a slush sub-generator. For those of you who don’t know, slush is a scaffolding tool, like yeoman but for gulp. We used it to create our Angular components.
Each time we needed to make a new component, we had to create a new folder with three files:


  • Comp.component.ts
  • Comp.jade
  • Comp.less

Each file, of course, has its own internal structure of predefined code, and each component had to be registered in the app module and in the main less file.

This was obviously extremely annoying to redo each time, so we automated it. Now, each time you run “ob-genie” from the terminal, you are asked for the name of your component and the module to register it with, and the rest happens on its own. We did this for services and directives too.
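For reference, the skeleton of such a slush sub-generator looks roughly like this (the prompts and template paths are illustrative, not ob-genie’s actual code):

// slushfile.js – rough skeleton of a component sub-generator
const gulp = require('gulp'),
      template = require('gulp-template'),
      rename = require('gulp-rename'),
      conflict = require('gulp-conflict'),
      inquirer = require('inquirer');

gulp.task('component', function (done) {
  inquirer.prompt([
    { type: 'input', name: 'name', message: 'Component name?' },
    { type: 'input', name: 'module', message: 'Module to register it with?' }
  ]).then(function (answers) {
    gulp.src(__dirname + '/templates/component/**')  // Comp.component.ts, Comp.jade, Comp.less
      .pipe(template(answers))                       // fill in <%= name %>, <%= module %>
      .pipe(rename(function (path) {
        path.basename = path.basename.replace('Comp', answers.name);
      }))
      .pipe(conflict('./src/components/' + answers.name))
      .pipe(gulp.dest('./src/components/' + answers.name))
      .on('end', done);
  });
});

The real generator also registers the new component in the app module and the main less file – just a couple of extra string-replacement steps.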

Other than saving a lot of time and frustration, this had an interesting side effect – people on the team were creating more components than before! This was good, because it resulted in better separation of code and better readability. It seems that many times developers were simply too lazy to create a new component and just chucked everything in together. By the way, Angular-CLI has since added a similar capability – great minds think alike, I guess.

Another case we took on in our team was to rid ourselves of the painstaking task of setting up the local environment. This, I must say, was a real pain point. Updating the repo and building and running the services we needed could take hours each time, assuming everything went well.
There have been times when I spent days on this, just to test the simplest of procedures.
Often, I admit, I simply pushed my code to a test environment and debugged it there.
So we decided to build a proxy server to channel all local requests to the test environment.

For this we used node-http-proxy, a very easy-to-configure proxy library. However, this was still not a trivial task, since every company has its own specific configuration issues to work around.
One thing that was missing was proper routing capabilities: since you want some requests to go to the local server and others to the remote one, we added routing logic that runs before each request is proxied.

const https = require('https');
const fs = require('fs');
const httpProxy = require('http-proxy');
const proxy = httpProxy.createProxyServer({});
// Self-signed key/cert for the local proxy (illustrative paths)
const credentials = {
  key: fs.readFileSync('server.key'),
  cert: fs.readFileSync('server.crt')
};
const printMe = (msg) => options.debug && console.log(msg);

https.createServer(credentials, function (req, res) {
  Object.keys(options.routingTable).some(function (key) {
    const regX = new RegExp(key);
    if (regX.test(req.url)) {
      printMe(req.url + ' => ' + options.routingTable[key].targetName);
      proxy.web(req, res, options.routingTable[key]);
      return true; // stop at the first matching route
    }
  });
}).listen(options.home_port);

We passed as an option the routing table with a regex for each path, making it easy to configure which requests to proxy out, and which in.

routingTable = {
  'site': local,
  '^/static': local,
  '/*/': remote
};

Another hurdle was working with HTTPS, since our remote environments run on HTTPS.
In order to adhere to this, we needed to create an SSL certificate for our proxy and set the requestCert parameter in our proxy server to false, so that the certificate doesn’t get validated.

The end configuration should look something like this:

const local = {
    targetName: 'local',
    target: 'https://localhost:4141',
    changeOrigin: true,
    secure: false
  },
  remote = {
    targetName: 'remote',
    requestCert: false,
    rejectUnauthorized: false,
    target: 'https://test.outbrain.com:8181',
    secure: false,
    changeOrigin: true,
    autoRewrite: true
  },
  routingTable = {
    'site': local,
    '^/static': local,
    '/*/': remote
  };

const options = {
  routingTable: routingTable,
  home_port: 2109,
  debug: true,
  startPath: 'amplify/site/'
};

With this you should be able to run locally and route all needed calls to the test environment when working on localhost:2109.

So to conclude, be lazy, make your work easier, and use the skills you have to automate your workflows as much as possible.

Kibana for Funnel Analysis

How we use Kibana (4) for user-acquisition funnel analysis

Outbrain has recently launched a direct-to-consumer (D2C) initiative. Our first product is a chatbot. As with every D2C product, acquiring users is important. Therefore, optimizing the acquisition channel is also important. The basis of our optimization is analysis.


Our Solution (General Architecture)

Our acquisition funnel spans 2 platforms (2 web pages and a chatbot). Passing many parameters between platforms can be a challenge, so we chose a more stateful, server-based model. The client requests a new session Id, sending along basic data like IP and user agent. The server stores a session (we use Cassandra in this case) with processed fields like Platform, OS, Country, Referral, and User Id. At a later stage, the client reports a funnel event for a session Id. The server then writes all known fields for the session into 2 storages:

  • ElasticSearch for quick & recent analytics (using the standard ELK stack)
  • Hadoop for long-term storage and offline reports

A few example fields stored per event (a sample document follows the list):

  • User Id – a unique & anonymous identifier for a user
  • Session Id – the only parameter passed between funnel steps
  • Event Type – the specific step in the funnel – serve, view, click
  • User Agent – broken down into Platform and OS
  • Location – based on IP
  • Referral fields – information on the context in which the funnel is exercised
  • A/B Test variants – the A/B test variant Ids that are included in the session
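Concretely, a single event document written to ElasticSearch might look something like this (all values are made up):

// Made-up example of one funnel event as stored in ElasticSearch
const event = {
  userId: 'a81b3c',                 // unique & anonymous
  sessionId: '0f9e2d',              // the only parameter passed between steps
  eventType: 'click',               // serve / view / click
  platform: 'web',
  os: 'Android',
  country: 'US',
  referral: 'facebook-campaign',
  abTestVariants: ['variant-B'],
  timestamp: '2017-01-19T12:38:45Z'
};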

Goal of the Analysis: Display most important metrics quickly

Kibana plugin #1: Displaying percent metric

Kibana has several ways of displaying a fraction, but none excels at displaying small numbers (a pie chart can be used to visualize fractions, but it doesn’t handle small ones well). We developed a Kibana plugin for displaying a single metric in percent format.

[Screenshot: the percent-metric visualization]

We use this visualization for displaying the conversion rate of the most interesting part of our funnel.

Kibana plugin #2: Displaying the funnel

We couldn’t find a good way of displaying a funnel, so we developed a visualization plugin ourselves (honestly, we were eager to develop this, so we did not scan the entire internet…)

Based on the great D3 Funnel by Jake Zatecky, this is a Kibana plugin that displays buckets of events in funnel format. It’s customizable and open-source. Feel free to use it…

[Screenshot: the funnel visualization]

Putting it all together

Displaying your most important metrics and the full funnel is nice. Comparing variant A with variant B is very nice. We’ve set up our dashboard to show similar key metrics on 2 versions of the funnel. We always try to run at least 1 A/B test, and this dashboard shows us the real-time results of our tests.

[Screenshot: dashboard comparing key metrics for two funnel variants]

Cherry on top

Timelion is awesome. If you’re not using it, I suggest trying it.

Viewing your most important metrics over time is very useful, especially when you’re making changes fast. Here’s an example:

[Screenshot: key metrics over time in Timelion]
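For reference, the expression behind such a chart is a one-liner; a view-to-click conversion rate over time might look like this (the field names are illustrative):

.es(q='eventType:click').divide(.es(q='eventType:view')).multiply(100).label('conversion %')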

Summary

We track a user’s activity by sending events to the server. The server writes these events to ES and Hadoop. We developed 2 Kibana plugins to visualize the most important metrics of our user-acquisition funnel. We can filter the funnel by Platform, Country, OS, Time, Referral, or any other fields we bothered to save. In addition, we always filter by A/B Test variants and compare 2 specific variants.