Orit Yaron

I WANT IT ALL – Go Hybrid

When I was a kid, my parents used to tell me that I can’t have my cake and eat it too. Actually, that’s a lie – they never said that. Still, it is something I hear parents say quite often. And not just parents. I come across the same phrase everywhere I go. People constantly take a firm, almost religious stance about choosing one thing over another: Mac vs PC, Android vs iOS, Chocolate vs Vanilla (obviously Chocolate!).

So I’d like to take a moment to take a different, more inclusive approach.

Forget Mac vs PC. Forget Chocolate vs Vanilla.

I don’t want to choose. I Want it all!

At Outbrain, the core of our compute infrastructure is based on bare metal servers. With a fleet of over 6,000 physical nodes spread across 3 datacenters, we’ve learned over the years how to manage an efficient, tailored environment that caters to our unique needs – one of which is processing and serving over 250 billion personalized recommendations a month to over 550 million unique users.

Still, we cannot deny that the Cloud brings advantages that are hard to achieve in bare metal environments. And in the spirit of inclusiveness (and maximising value), we want to leverage these advantages to complement and extend what we’ve already built – whether for workloads that require a high level of elasticity, such as ad-hoc research projects involving large amounts of data, or simply for external services that can increase our productivity. We’ve come to view Cloud solutions as supplemental to our tailored infrastructure rather than a replacement.

 

Over recent months, we’ve been experimenting with 3 different vectors involving the Cloud:

 

Elasticity

Our world revolves around publications, especially news. As such, whenever a major news event occurs, we feel an immediate, potentially high impact. Users rush to publisher sites where we are installed. They want their news, they want their recommendations, and they want them all now.

For example, when Carrie Fisher, AKA Princess Leia, passed away last December, we saw a 30% traffic increase on top of our usual peak traffic. That’s quite a spike.

Since we usually cannot know in advance when breaking news will hit, we are required to keep enough extra capacity on hand to support such surges.

By leveraging the cloud, we can keep that additional capacity to a bare minimum, relying instead on the inherent elasticity of the cloud and provisioning only what we need, when we need it.

Doing this can improve the efficiency of our environment and cost model.
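
To make that concrete, here is a minimal, purely illustrative sketch of the kind of burst-capacity sizing this enables. The numbers and names are hypothetical – not our actual serving figures or provisioning logic:

    # Purely illustrative burst-capacity sizing; all numbers are hypothetical.
    BASELINE_INSTANCES = 100        # capacity kept on bare metal for a normal peak
    REQUESTS_PER_INSTANCE = 2_000   # sustainable requests/sec per serving instance
    MAX_BURST_INSTANCES = 50        # ceiling on cloud instances we allow ourselves

    def cloud_instances_needed(current_rps: float) -> int:
        """Extra cloud instances to provision for the current request rate."""
        required = -(-int(current_rps) // REQUESTS_PER_INSTANCE)   # ceiling division
        extra = max(0, required - BASELINE_INSTANCES)
        return min(extra, MAX_BURST_INSTANCES)

    # A 30% spike on top of a (hypothetical) 200k req/sec peak, as in the example above:
    print(cloud_instances_needed(200_000 * 1.3))   # -> 30 extra instances

In practice, the provisioning itself would be wired to whatever auto-scaling mechanism your cloud provider offers; the point is that the extra capacity only exists while the spike does.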

Ad-hoc Projects

A couple of months back, one of our researchers came up with an interesting behavioral hypothesis. For the discussion at hand, let’s say it was “people who like chocolate are more likely to raise pet gerbils” (drop a comment with the word “gerbils” to let me know that you’ve read this far). That sounded interesting, but it raised a challenge: to validate or disprove it, we needed to analyze over 600 Terabytes of data.

We could have run it on our internal Hadoop environment, but that came with a not-so-trivial price tag. Not only would we have had to provision additional capacity in our Hadoop cluster to support the workload, we also anticipated that the analysis would impact existing workloads running in the cluster. And all this before getting into operational aspects such as labor and lead time.

Instead, we chose to upload the data into Google’s BigQuery. This gave us both a shorter setup lead time and very nice performance. In addition, 3 months into the project, when the analysis was completed, we simply shut down the environment and were done with it. As simple as that!
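
For reference, the whole flow can be as small as a bulk load followed by plain SQL. Here is a rough sketch using the google-cloud-bigquery Python client; the bucket, dataset, table and column names are made up for the gerbil example and are not our actual schema:

    # Rough sketch with the google-cloud-bigquery client; all names are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()

    # One-off bulk load of the research data set from Cloud Storage.
    load_job = client.load_table_from_uri(
        "gs://research-exports/events/*.parquet",      # hypothetical bucket
        "my-project.adhoc_research.user_events",       # hypothetical table
        job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET),
    )
    load_job.result()  # wait for the load to finish

    # The "chocolate vs. gerbils" hypothesis, expressed as plain SQL.
    query = """
        SELECT likes_chocolate, AVG(CAST(owns_gerbil AS INT64)) AS gerbil_rate
        FROM `my-project.adhoc_research.user_events`
        GROUP BY likes_chocolate
    """
    for row in client.query(query).result():
        print(row.likes_chocolate, row.gerbil_rate)

When the project is over, dropping the dataset is the entire teardown – which is exactly the appeal.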

Productivity

We use Fastly for dynamic content acceleration. Given the scale mentioned above, this has the side effect of generating about 15 Terabytes of Fastly access logs each month. For us, there’s a lot of interesting information in those logs, so we had 3 alternatives when deciding how to analyze them:

  • SaaS-based log analysis vendors
  • An internal solution based on the ELK stack
  • A cloud-based solution based on BigQuery and Data Studio

After performing a PoC and running the numbers, we found that the BigQuery option – if done right – was the most effective for us, both in terms of cost and the amount of effort required.
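
To give a flavor of what “done right” looks like here, most of the value comes from plain SQL over the raw logs, with Data Studio dashboards sitting on top of the resulting views. A rough sketch follows; the table and field names are assumptions, since the actual Fastly log format is whatever you configure Fastly to ship:

    # Illustrative aggregation over Fastly access logs landed in BigQuery.
    # Table and field names are assumptions, not our real schema.
    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
        SELECT
          host,
          COUNTIF(cache_status = 'HIT') / COUNT(*) AS cache_hit_ratio,
          APPROX_QUANTILES(time_elapsed_ms, 100)[OFFSET(95)] AS p95_latency_ms,
          COUNT(*) AS requests
        FROM `my-project.cdn_logs.fastly_access`
        WHERE DATE(timestamp) = CURRENT_DATE()
        GROUP BY host
        ORDER BY requests DESC
    """
    for row in client.query(query).result():
        print(f"{row.host}: hit ratio {row.cache_hit_ratio:.2%}, p95 {row.p95_latency_ms} ms")

Queries like this can be scheduled, and the dashboards simply point at the resulting tables or views.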

There are challenges in designing and running a hybrid environment. For example, you have to make sure you have consolidated tools to manage both on-prem and Cloud resources, the predictability of your monthly cost isn’t as trivial as it used to be (no one likes surprises there!), and controls around data can demand substantial investment… but none of that makes falling back to “all Vanilla” or “all Chocolate” a good option. It just means that you need to be mindful and prepared to invest in tooling, education and processes.

 

In summary, I’d like to revisit my parents’ advice and try to improve on it a bit (which I’m sure they won’t mind!):

Be curious. Check out what is out there. If you like what you see – try it out. At worst, you’ll learn something new. At best, you’ll have your cake… and eat it too.

 

Building the Culture to Build Systems

Part 1 – The Badges

If there’s one thing we could name that differentiates a good team from a great one, it’s culture. It’s not something you can buy, it’s something you need to build, grow and nurture over time.

So how do you build the right culture?

It is very much like asking “how do you keep in shape?” First you need to set the goals you want to achieve, and then you need to start exercising. But remember, it is an effort that never ends – once you stop investing in it, you will lose it.

First – setting the goals, i.e. defining the values

You need to set goals so you know what to focus on, and since we are dealing with culture, the goals are actually the values you want to adopt and enforce.

For us, some of the values and behaviours we want to emphasize are: collaboration, learning, fun, getting out of your comfort zone, initiative and excellence.

Once you have that defined you can start working out! In this post, and others to come, we will share some of the exercises we use to make sure we stay in shape.

Exercise 1 – The Badges

We created several sets of badges to reinforce selected behaviours and celebrate occasions. The badges are handed out during weekly group sync meetings, and can also be added to mail signatures if one so desires. They also make a nice collection, and naturally there are common and rare variants. We’ve found that this drives conversation and directly affects culture.

A few examples of the badge sets:

The Tech Collection:

We wanted to encourage things like using different tech tools (Vagrant, for example), improving Chef quality, killing tech debt and so on – so we made sure we have appropriate badges:

[Badge images]

 

The Production Collection:

At the end of the day we are all dealing with production – and we want to celebrate that.

We created several badges for all kinds of production-related happenings: whether you have broken production or spent a sleepless night, you will get mentioned. Seeing a senior engineer being granted a “broken prod” badge drives our “blameless” culture, while still carrying the weight of responsibility – no one wants a whole set of black badges!

[Badge images]

 

The Celebration Collection:

We like to celebrate – whether it is someone’s first on-call shift, presenting at a conference, arranging a meetup, or just helping a colleague out. By celebrating behaviors we value, we drive people to adopt them. Here are a few examples:

[Badge images]

 

The “Jewish Mother” Collection:

Adding a good laugh is always nice, and at the end of the day we all have our quirky behaviours. By celebrating them we keep smiling… and strengthen tolerance.

[Badge images]

 

It is a living exercise, and we keep adding badges as we move along. In fact, we have our own engineers suggest and even create them. Our experience so far shows that gamifying culture-building activities has a positive effect on team atmosphere, and a direct effect on the specific behaviors we address.
Stay tuned for additional exercises in How to Keep Your Culture in Shape.

DevOps – The Outbrain Way

Like many other fast-moving companies, at Outbrain we have gone through several iterations in our attempt to find the most effective “DevOps” model for us. As expected with any such effort, the road has been bumpy and there have been many lessons learned along the way. As of today, we feel that we have had some major successes in refining this model, and we would like to share some of the insights from our journey.

 

Why get Dev and Ops together in the first place?

A lot has been written on this topic, and the motivations and benefits of adding the operational perspective into the development cycle have been thoroughly discussed in the industry – so we will not repeat those.

I would just say that we look at these efforts as preventive medicine, like eating well and exercising – life is better when you stay healthy. It’s not as good when you get sick and have to seek medical treatment to get healthy again.

 

What’s in a name?

We do not believe in the term “DevOps” and what it represents, and we try hard to avoid it. Why is that?

Because we expect every Operations engineer to have development understanding and skills, and every Developer to have an operational understanding of how the service he/she develops works – and we help them achieve and improve those skills. So everyone is DevOps.

We do believe there is a need to bring more system and production skills and expertise closer to the development cycle – so we call them Production Engineers.

 

First try – Failed!

We started by assigning Operations Engineers to work with dedicated development groups. The big problem was that this came on top of their existing responsibility for building the overall infrastructure (config management, monitoring infrastructure, network architecture, etc.), which was already a full-time job as it was.

This mainly led to frustration on both sides: the operations engineers felt they had no time to do anything properly – just scratching the surface while being spread too thin – and the developers felt they were not getting enough bandwidth from operations and were being held back.

Conclusion – in order to succeed we need to go all in – have dedicated resources!

 

Round 2 – Dedicated Production Eng.

Not giving up on the concept, and learning from round 1, we decided to create a new role – “Production Engineers” (or PEs for short) – who are dedicated to specific development groups.

This dedication manifests itself on several levels. Some of them are semi-trivial aspects, like seating arrangements – having the PE sit with the development team and share their day-to-day experience; and some of them are focus oriented, like adopting the development team’s goals and actually becoming an integral part of the development team.

On the other hand, the PE needs to keep a very close relationship with the Infrastructure Operations team, which continues to develop the infrastructure and tools used by the PEs, and supports the PEs with technical expertise on more complex issues that require subject matter experts.

 

What & How model:

So how do we prevent a split-brain situation for the PE? Is the PE part of the development team or the Operations team? When you have several PEs supporting different development groups, are they all independent, or can we gain from knowledge transfer between them?

To give ourselves a lighthouse to help answer all of these questions, and more that would inevitably come up, we came up with the “What & How” model:

“What” – stands for the goals and priorities: what needs to be achieved. The “what” is set by the development team management (as they know best what they need to deliver).

“How” – stands for which methods, technologies and processes should be used to achieve those goals most efficiently from an operational perspective. This technical, subject-matter guidance is provided by the operations side of the house.

 

So what is a PE @ Outbrain?

In the first stage, an Operations Engineer goes through an on-boarding period, during which he or she gains an understanding of the Outbrain operational infrastructure. Once the engineer has gained enough mileage, he/she can become a PE – joining a development group and working with them to achieve the development goals, set the “right” way from an operational perspective and properly leveraging the Outbrain infrastructure and tools.

The PE enjoys both worlds: keeping a presence in the Operations group and maintaining his/her technical expertise on one hand, while being an integral part of the development team on the other.

From a higher level perspective, we have eliminated the frustration points experienced in our first round of “DevOps” implementation, and we are gaining the benefits of a close relationship and a better understanding of needs and tools between the different development groups and the general Operations group. By the way, we have also gained a new career development path for our Operations and Production Engineers, who can move between those roles and enjoy different types of challenges and lifestyles.

 
