Blog Posts - July 2016

Building the Culture to build Systems

Part 1 – The Badges

If there’s one thing we could name that differentiates a good team from a great one, it’s culture. It’s not something you can buy, it’s something you need to build, grow and nurture over time.

So how do you build the right culture?

It is very much like asking “how do you keep in shape?” First you need to set the goals you want to achieve and then you need to start exercising, but remember it is an effort that never ends – once you stop investing in it and not using it, you will lose it.

First – setting the goals, i.e. defining the values

You need to set goals, so you will know what to focus on, and as we are dealing with culture – the goals are actually the values you want to adopt and enforce.

For us some of the values and behaviours we want to emphasize are: collaboration, learning, fun, getting out of your comfort zone, initiative, excellence.

Once you have that defined you can start working out! In this post, and others to come, we will share some of the exercises we use to make sure we stay in shape.

Exercise 1 – The Badges

We created several sets of badges to enforce selected behaviours and celebrate occasions. Those badges are been handed out during weekly group sync meetings, and can also be added to mail signatures if one desires so. They also makes a nice collection, and naturally there are common and rare variants. We’ve found that this drives conversation and directly affects culture.

A few examples of the badges sets:

The Tech Collection:

We wanted to encourage using different tech tools, like Vagrant, improve Chef quality, killing tech debt etc. – so we made sure we have appropriate badges:

vagrantchefShaved

 

The Production Collection:

At the end of the day we are all dealing with production – and we want to celebrate that.

We created several badges for all kind of production related happenings, whether you have broken production, or spend a sleepless night – you will get mentioned. Seeing a senior engineer being granted a “broken prod” badge, drives our “blameless” culture, but still carries the weight of responsibility. No one wants a whole set of black badges!

sheep breakprod

 

The Celebration Collection:

We like to celebrate – whether it is the first on-call shift, the fact that one presented in a conference, arranged a meetup, or just helped a colleague out. By celebrating behaviors we value, we drive people to adopt them. Here are a few examples:

solo present ThanksBro

 

The “Jewish Mother” Collection:

Adding a good laugh is always nice, and at the end of the day we all have our quirky behaviours. By celebrating them we keep smiling… and strengthen tolerance.

holdon Alon eat

 

It is a living exercise and we keep on adding badges as we move along. In fact, we have our own engineers suggest and even create them. Our experience so far shows that gamifying culture-building activities has a positive effect on team atmosphere, and direct effect on the specific behaviors we address.
Stay tuned for additional exercises in How to Keep Your Culture in Shape.

Reducing risks by taking risks

You have a hypothesis you wish to prove or refute regarding your behavior in an extreme scenario.  For example:  after restarting the DB, all services are fully functioning within X seconds. There are at least 3 approaches:

  1. Wait && See — if you wait long enough this scenario will probably happen by itself in production.
  2. Deep Learning  — put the DBA, OPS and service owner in a big room, and ask them to analyze what will happen to the services after restarting the DB. They will need to go over network configuration, analyze the connection pool and DB settings, and so on.
  3. Just Do It — deliberately create this scenario within a controlled environment:

Timing — Bases on your knowledge pick the right quarter, month, week, day and hour – so the impact on the business will be minimal.

Monitoring — make sure you have all monitoring and alerts in place. If needed, do “manual monitoring”.

Scale — If applicable do it only on portion of your production: one data center, part of the service, your own laptop, etc.

The Wait && See approach will give you the right answer, but at the wrong time. Knowing how your system behaves only after a catastrophe occurs is missing the point and could be expensive, recovery-wise.

The Deep Learning approach seems to be the safer one, but it requires a lot of effort and resources. The answer will not be accurate as it is not an easy task to analyze and predict the behaviour of a complex system. So, in practice, you haven’t resolved your initial hypothesis.

The Just Do It approach is super effective – you will get an accurate answer in a very short time, and you’re able to manage the cost of recovery. That why we picked this strategy in order to resolve the hypothesis. A few tips if you’re going down this path:

  • Internal and External Communication — because you set the date, you can send a heads up to all stakeholder and customers. We use statuspage.io for that.
  • Collaboration — set a HipChat / Slack / Hangouts room so all relevant people will be on the same page. Post when the experiment starts, finishes, and ask service owners to update their status.
  • Document the process and publish the results within your knowledge management system (wiki / whatever).
  • This approach is applicable for organisation that practice proactive risk management, have a healthy culture of learning from mistakes, and maintain a blameless atmosphere.

 

People that pick Deep Learning are often doing so because it is considered a “safer”, more  conservative approach, but that is only an illusion. There is high probability that in a real life DB restart event the system would behave differently from your analysis. This effect of surprise and its outcomes are exactly what we wanted to solve by predicting the system’s behaviour.

 

The Just Do It approach actually reduces this risk and enables you to keep it manageable. It’s a good example of when the bold approach is safer than the conservative one. So don’t hesitate to take risks in order to reduce risk, and Just Do It

Photo credit: http://www.jinxiboo.com/blog/tag/risks