Some of you might ask, “why is he telling us about datacenter architecture? Don’t Cloud Services solve this already?” and some of you that already know me and what my opinions are on the subject will not be surprised. Yes, I’m not a fan of the Cloud Services and that is another discussion, however, there are some advantages for using Cloud Services that giving them up by establishing a datacenter felt somehow wrong for us.
Here are 2 of them:
1. Grow As You Go – When you build a datacenter you take on commitments for space (racks or cages) and high profile network gear that are investments you have to pay for in advance or before you really need to use them. This is not an issue for a Cloud-based setup because as you grow you spin up more instances.
2. Disaster Recovery Headroom – With a datacenter-based setup, in order to properly handle disaster recovery you need to double your setup so you can always move all your traffic to the other datacenter in case of disaster, which means doubling the hardware you buy. In the Cloud, this is also a non-issue.
These 2 arguments are very much correct, however even taking those into consideration, our setup is much more efficient in cost then any Cloud offering. The logic behind it is what I want to share here.
Traditionally, when a company’s business grows, a single rack or maybe 2 are not sufficient and you have the need to allocate adjacent racks space in a co-located datacenter. This makes your recurring expenses grow since you actually pay for reserved space that you don’t really use. It’s a big waste of your $$$. Once we managed to set more than one location for our service we found out that it will be much cheaper to build multiple small datacenters with a small space footprint than committing to a large space that we will not use most of the time. Adjacent space of at least 4 racks is much easier to find in most co-location facilities. More than that, our co-location provider agreed to give us 2 active racks with first right of refusal for the adjacent 2 racks so we actually pay for those we use.
This architecture also simplified much of our network gear requirements. Assuming each “LEGO Brick” is small, it needs to handle only a portion of the traffic and not all of it. This does not require high profile network gear and very cheap Linux machines are sufficient for handling most of the network roles including load balancing, etc.
We continued this approach for choosing the intra-LEGO Brick switching gear. Here we decided to use Brocade stackable switching technology. In general, it means that you can put a switch per cabinet and wire all the machines to it. When you add another cabinet you simply connect them in a chain that looks and acts like a single switch. You can grow such a stack up to 8 switches. At Outbrain, we try to eliminate single points of failure, so we have 2 stacks and machines are connected to both of them. Again, the stacking technology gave us the ability to not pay for network gear before we actually need it.
But what about Disaster Recovery (DR) headroom? (We decided to implement more than one location for disaster recovery as soon as we started generating revenue for our partners.) As I said, this is a valid argument. When we had 2 datacenters, 50% of our computing power was dedicated to DR and not used in normal time. This was not ideal and we needed to improve that. Actually, the LEGO bricks helped here once again. This week we opened our 3rd datacenter in Chicago. The math is simple, by adding another location our headroom dropped to only 33% which is a lot of $$$ savings when your business grows. When we add the 4th it will drop to 25%, etc.
I guess now you understand the logic and we can mention some fun info about the DC implementation itself:
- Datacenters communicate via a dedicated link, powered by our co-location vendor.
- We use a Global DNS service to balance traffic between the datacenters.
- In our newer datacenters, the power billing is a pay-per-use — no flat fees which again enable us to not pay for power we don’t use. It also motivates us to power off unneeded hardware and save power costs while saving the planet
- Power is 208V which is more efficient than the regular 110v.
- All servers are connected to a KVM to enable remote access to BIOS config if needed — much easier to manage from Israel and in general.
- We have a lot of Dell C6100s in our datacenters so each node there is also connected to an IPMI network in order to remotely restart each node without rebooting all 4 nodes in that chassis.
- You can read more about assembling these C6100s in Nathan’s detailed post.
I guess your question is “what does it take to manage this in terms of labor?” That answer is… not too much.
The Outbrain Operations team is a group of 4 Ops engineers. Most of the time they are not doing much related to the physical infrastructure, but like other ops teams, most of the time they handle the regular tasks of configuring infrastructure softwares (we use all of them from open source like MySQL, Cassandra, Hadoop, Hive, ActiveMQ, etc…), monitoring, code and system deployment (we heavily use Chef) etc.
In general, Operations’ role in the company is to keep the serving fast, reliable and (very important) cost-efficient. This is the main reason why we invest time, knowledge and innovation in architecting our datacenters wisely.
I guess one of the next posts will be about our new Chicago datacenter and the concept of the “Dataless Datacenter.”