Hi all, As Hurricane Sandy is about to hit the US east coast, and as Outbrain's main datacenter is located in downtown Manhattan, we are taking measures to keep service interruption as small as possible for our partners and customers. Outbrain normally serves from 3 datacenters, and in case of the loss of the NY datacenter, we will supply the service from one of the other datacenters. On this page, below, we will post updates on any service interruption and ETAs for resolving problems. We assume all will go well and we will not have to update, but… just in case.
[UPDATE - Nov 3rd 3:45pm EST] - At this time Utility power is back to all our datacenters and HQ office. It is now time to restore the service from NY and get the office back to work. This will take some time but systems will gradually be put back up over the next week or so. There should be no effect on users, publishers or clients.
Our HQ will also start working gradually depending on the availability of public transportation.
We are now closing this reporting post. If you see any issues, please report to email@example.com or to your rep.
I hope the storm of the century will be the last one for the next century (at least).
[UPDATE - Nov 1st 9:30am EST] - Our HQ, located on 13th between 5th and 6th in downtown New York City is still without power and therefore closed. Thankfully, our NY-based team is safe and in dry locations, and will continue to try and work as best they can. We highly appreciate the concern and best wishes we received from our partners and clients across the globe; thank you!
We are doing our best to continue to provide the best in class service, one we hope you’ve come to expect from us. As an update, our datacenter in NY is still without power and we expect it to be down for a few more days. We will continue to serve from our other datacenters, located in Chicago and Los Angeles. To reiterate, our service did not go down, and we are currently still serving across our clients’ sites. As of this morning, we recovered and updated all our reporting capabilities, so we should be back to 100%.
If you are experiencing any difficulties or seeing anything different, please reach out to your respective contacts. We’ll also continue to operate under emergency mode until Monday; you can reach us 24/7 at am-emergency-support@outbrain.
[UPDATE - Oct 31st 6:46am EST] - Serving still holds strong from our LA and Chicago datacenters and we are not aware of any disruption to our service. We are working hard to recover our dashboard reporting capabilities, but it will probably take a couple more days before we’re able to get back to normal mode. Sorry for any inconvenience this causes. Send us a note at am-emergency-support@outbrain.
[UPDATE - 6:51pm EST] - Again, not much to update: all is stable with both LA and Chicago datacenters. It’s the end of the day here in Israel and we are trying to get some rest. Our teammates in the US are keeping an eye on the system and will alert us if anything goes wrong. Good night.
[UPDATE - 3:35am EST] - Actually, not much to update about the service. All is pretty much stable. We are safely serving from LA and Chicago. Most back-end services are running in the LA datacenter, and our tech teams in Israel and NY are monitoring and handling issues as they arise. Our datacenter vendors in NY are working with the FDNY to pump the water out of the flooded generator room, so it will take a while to recover this datacenter.
[UPDATE - 10:50am EST] - The clients’ dashboard is back up.
[UPDATE - 10am EST] – The clients’ dashboard on our site is periodically down. We are handling the issues and will update soon.
[UPDATE - 5am EST] Our NY Data center went down. Our service is fully operational and we are serving through our Chicago and LA Data centers. If you’re accessing your Outbrain dashboard you may experience some delays in data freshness. We are working to resolve this issue and will continue to update.
[UPDATE - 2am EST] – Our NY datacenter went completely offline. We are fully serving from our Chicago and LA datacenters. External reports on our site are still down, but we are working to fail over all services to the LA datacenter. We will follow up with updates.
[UPDATE - 12:50am EST] – Power went out completely in our NY datacenter and the provider has evacuated the facility. We are taking measures to move all functionality to other datacenters.
[UPDATE - 9pm EST] - Commercial power went down in our NY datacenter. The provider failed over to generators and we continue to serve smoothly from this datacenter. We continue to monitor the service closely and are ready to take action if needed.
Many of our internal applications were developed using the Extjs framework.
It is very difficult to write automated tests for Ext applications with Selenium, because Ext generates many <div> and <span> tags with automatically-generated IDs (something like “ext-comp-11xx”). Accessing these tags through Selenium is the big challenge we are trying to solve: we wanted a way to get these automatically-generated IDs automatically.
How do we approach this?
Ext has a component manager, where all of the developers’ components are registered. We can “ask” the component manager for a component’s ID by sending it a descriptor of the component. To simplify: we (the Selenium server) tell the component manager, “I need the ID of the currently visible window which, btw, is labeled ‘campaign editor’”.
This will look something like:
ComponentLocatorFactory extjsCmpLoc = new ComponentLocatorFactory(selenium);
Window testWin = new Window(extjsCmpLoc.createLocator("campaign editor", ExtjsUtils.Xtype.WINDOW));
Then we can use Ext Window methods, for example close: testWin.close();
Another example:
ComponentLocatorFactory extjsCmpLoc = new ComponentLocatorFactory(selenium);
Button newButton = new Button(extjsCmpLoc.createLocator("Add Campaign", ExtjsUtils.Xtype.BUTTON));
You can ask for all of the visible components by type, by label or both:
TextField flyfromdate = new TextField(extjsCmpLoc.createLocator(ExtjsUtils.Xtype.DATEFIELD, 0));
TextField flytodate = new TextField(extjsCmpLoc.createLocator(ExtjsUtils.Xtype.DATEFIELD, 1));
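For the curious, the trick is that the component manager can be queried from JavaScript running in the browser. Here is a minimal sketch of the idea, assuming Selenium RC's getEval() and ExtJS 3's Ext.ComponentMgr; API names differ between Ext versions, and this illustrates the concept rather than the library's exact internals:

// Hypothetical sketch: fetch the generated ID of the visible window
// titled "campaign editor". Assumes 'selenium' is the
// com.thoughtworks.selenium.Selenium instance used above.
String generatedId = selenium.getEval(
    "var w = this.browserbot.getUserWindow();" +  // the application window
    "var id = null;" +
    "w.Ext.ComponentMgr.all.each(function(c) {" + // iterate all registered components
    "  if (c.getXType && c.getXType() === 'window'" +
    "      && c.isVisible() && c.title === 'campaign editor') {" +
    "    id = c.getId();" +                       // the auto-generated ext-comp-11xx ID
    "  }" +
    "});" +
    "id;");

Once we have the generated ID, regular Selenium locators can be used against it.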
Here’s a simple diagram of our solution:
Link to the project on GitHub: https://github.com/simbal/SelenuimExtend
This solution is open source. In the meantime, if you have any questions, feel free to contact me directly: Asaf at outbrain dot com.
At outbrain, we like things that are awesome.
Cassandra is awesome.
Ergo, we like Cassandra.
We’ve had it in production for a few years now.
I won’t delve into why the developers like it, but as a Sysadmin on-call in the evenings, I can tell you straight out I’m glad it has my back.
We have MySQL deployed pretty heavily, and it is fantastic at what it does. However, MySQL has a bit of an administrative overhead compared to a lot of the new alternative data stores out there, especially when making MySQL work in a large geographically distributed environment.
If you can model your data in Cassandra, are educated about the trade-offs, and have an undying wish not to have to worry too deeply about managing replication and sharding, it is a no-brainer.
We sysadmins fear change, because it is our butts on the line if there is an outage. With executives anxiously pacing behind us and revenue flushing down the drain, we’re the last line of defense when there is an issue, and we’re the ones who will be torn away from our families in the evenings to handle an outage.
So, yeah… we’re a conservative lot.
That being said, change and progress can be good, especially when it frees you up. Cassandra is resilient, fault-graceful and elastic. Once you understand how so, you’ll be slightly less surly. Your developers might not even recognize you!
These slides are for the Sys Admin, noble fellow, to assuage his fears and get him started with Cassandra.
The Annotated Timeline
This graph may seem intimidating at first, but don’t be scared; let’s dive right into it.
In this graph the x axis shows time (date and time of day) and the y axis shows the SVN revision number. Each colored line represents a single module (so we have one line for www, one line for the BehavioralEngine, etc.).
What you would usually see for each line (representing a module) is a monotonically increasing value over time: a line climbing from the bottom left corner towards the top right. In the relatively rare cases where a developer deploys an older version of his module, you clearly see the line suddenly drop a bit instead of climbing. This is really nice, and it helps find unusual events.
In the next graph you see an overview of deployments per day.
This is more of a holistic view of how things went over the last couple of days. It simply shows how many deployments took place each day (counting production clusters only), coloring the successful ones green and the failed ones red.
This graph is like an executive summary. If there are too many reds (or any reds at all), someone needs to take that seriously and figure out what needs to be fixed (usually that someone is me…); and if the bars aren’t high enough, someone needs to kick developers’ butts and get them deploying something already…
Like many other graphs from Google’s library (this one is a Stacked Column Chart, BTW), it shows nice tooltips when hovering over any of the columns, with their x values (the date) and y values (number of successful/failed deployments).
Versions DNA Mapping
The following graph shows the current variety of versions that we have in our production systems for each and every module. One of our developers dubbed it a DNA mapping because of how it looks, but that’s about as far as the similarity goes…
The x axis lists the different modules that we have (names were intentionally left out, but you can imagine having www and other folks there). The y axis shows their SVN versions in production. It uses glu’s live model as reported by glu’s agents to ZooKeeper.
Let’s zoom in a bit:
What this diagram tells us is that the module www has versions from 41268 up to 41463 in production. This is normal, as we don’t necessarily deploy everything to all servers at once, but the graph helps us easily find hosts that have been left behind for too long: if one of the modules had not been deployed in a while, you’d see it falling behind low on the graph. Similarly, if a module has large variability in versions in production, chances are you want to close that gap pretty soon. The following graph illustrates both cases:
To implement this graph I used a crippled version of the Candlestick Chart, which is normally used for showing stock values; it’s not ideal for this use case, but it’s the closest I could find.
That’s all; three charts are enough for now. There is other news regarding our evolving deployment system, but it is not as visual. If you have any questions or suggestions for other types of graphs that could be useful, don’t be shy to comment or tweet (@rantav).
Recently we had to implement active-passive redundancy for a singleton service in our production environment, where the general rule is to always have “more than one of anything”. The main motivation is to alleviate the need to manually monitor and manage these services, whose presence is crucial to the overall health of the site.
This means that we sometimes have a service installed on several machines for redundancy, but only one of them is active at any given moment. If the active service goes down for some reason, another instance rises to do its work. This is called leader election. One of the most prominent open-source implementations facilitating the process of leader election is ZooKeeper. So what is ZooKeeper?
Originally developed by Yahoo! Research, ZooKeeper acts as a service providing reliable distributed coordination. It is highly concurrent, very fast and suitable mainly for read-heavy access patterns. Reads can be served by any node of a ZooKeeper cluster, while writes are quorum-based. To reach a quorum, ZooKeeper utilizes an atomic broadcast protocol. So how does it work?
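To make this concrete, below is a minimal sketch of the classic leader-election recipe over ZooKeeper's Java API. This is the textbook pattern rather than our exact production code, and the /election path and node names are illustrative:

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElection {
    // Hypothetical election root; must exist as a persistent znode.
    private static final String ELECTION_PATH = "/election";

    // Each candidate announces itself with an ephemeral sequential znode.
    // ZooKeeper appends a monotonically increasing suffix to the name, and
    // deletes the znode automatically when the candidate's session dies.
    public static String volunteer(ZooKeeper zk)
            throws KeeperException, InterruptedException {
        return zk.create(ELECTION_PATH + "/candidate-", new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    }

    // The candidate holding the lowest sequence number is the leader.
    // A passive instance would also set a watch on the children and
    // re-check whenever they change.
    public static boolean isLeader(ZooKeeper zk, String myZnodePath)
            throws KeeperException, InterruptedException {
        List<String> candidates = zk.getChildren(ELECTION_PATH, false);
        Collections.sort(candidates);
        return myZnodePath.endsWith(candidates.get(0));
    }
}

If the active service dies, its ephemeral znode disappears, the next-lowest candidate becomes the leader, and the passive instance rises to do the work, which is exactly the behavior described above.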
I recently participated in the ILTechTalk week. Most of the talks discussed issues like scalability, software quality, company culture, and Continuous Deployment (CD). Since the talks were hosted at Outbrain, we got many direct questions about our concrete implementations. Some of the questions and statements claimed that feature flags complicate your code. What bothered most participants was that committing code directly to trunk requires adding feature flags in some cases, which may make their code base more complex.
While in some cases, feature flags may make the code slightly more complicated, it shouldn’t be so in most cases. The main idea I’m presenting here is that conditional logic can be easily replaced with polymorphic code. In fact, conditional logic can always be replaced by polymorphism.
Enough with the abstract talk…
Suppose we have an application that contains some imaginary feature, and we want to introduce a feature flag. Below is a code snippet that developers normally come up with:
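What follows is a minimal sketch of such a snippet; the flag and class names are illustrative:

public class ImaginaryFeatureClient {
    public void doBusinessLogic() {
        // The feature test tends to spread to every call site.
        // Boolean.getBoolean() is true iff the system property equals "true".
        if (Boolean.getBoolean("imaginaryFeature.enabled")) {
            // ... new feature behavior ...
        } else {
            // ... old behavior ...
        }
    }
}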
While this is a legitimate implementation in some cases, it does complicate your code base by increasing its cyclomatic complexity. In some cases, the test for activation of the feature may recur in many places in the code, so this approach can quickly turn into a maintenance nightmare.
Luckily, implementing a feature flag using polymorphism is pretty easy. First, let’s define an interface for the imaginary feature and two implementations (old and new):
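A minimal sketch; the names are illustrative:

public interface ImaginaryFeature {
    void perform();
}

public class OldImaginaryFeature implements ImaginaryFeature {
    @Override
    public void perform() {
        // old, battle-tested behavior
    }
}

public class NewImaginaryFeature implements ImaginaryFeature {
    @Override
    public void perform() {
        // new behavior, currently behind the flag
    }
}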
Now, let’s use the feature in our application, selecting the implementation at runtime:
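Again a minimal sketch; the system property name is illustrative:

public class PolymorphicApplication {

    private final ImaginaryFeature imaginaryFeature = createImaginaryFeature();

    // Selects the implementation by reflection, based on a system property,
    // e.g. -DimaginaryFeature.implementation=com.example.NewImaginaryFeature
    private ImaginaryFeature createImaginaryFeature() {
        String className = System.getProperty("imaginaryFeature.implementation",
                OldImaginaryFeature.class.getName());
        try {
            return (ImaginaryFeature) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new IllegalStateException(
                    "Cannot instantiate imaginary feature: " + className, e);
        }
    }

    public void run() {
        imaginaryFeature.perform();
    }
}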
Here, we initialized the imaginary feature member by reflection, using a class name specified as a system property. The createImaginaryFeature() method above would usually be abstracted into a factory, but is kept as is here for brevity. But we’re still not done. Most readers would probably say that the introduction of a factory and reflection makes the code less readable and less maintainable. I have to agree, and apart from that, adding dependencies to the concrete implementations will complicate the code even more. Luckily, I have a secret weapon at my disposal. It is called IoC (or DI). When using an IoC container such as Spring or Guice, your code can be made extremely flexible, and implementing feature flags becomes a walk in the park.
Below is a rewrite of the PolymorphicApplication using Spring dependency injection:
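First, a sketch of the application class, now a plain constructor-injected bean:

public class PolymorphicApplication {

    private final ImaginaryFeature imaginaryFeature;

    // Spring injects whichever implementation the context wires in;
    // the application no longer knows about flags or concrete classes.
    public PolymorphicApplication(ImaginaryFeature imaginaryFeature) {
        this.imaginaryFeature = imaginaryFeature;
    }

    public void run() {
        imaginaryFeature.perform();
    }
}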
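And a sketch of the matching Spring XML wiring; the bean names follow the description below, while package names are illustrative:

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/context
           http://www.springframework.org/schema/context/spring-context.xsd">

    <!-- Resolves ${...} placeholders, falling back to system properties -->
    <context:property-placeholder/>

    <!-- lazy-init ensures only the referenced implementation is created -->
    <bean id="oldImaginaryFeature" class="com.example.OldImaginaryFeature" lazy-init="true"/>
    <bean id="newImaginaryFeature" class="com.example.NewImaginaryFeature" lazy-init="true"/>

    <bean id="application" class="com.example.PolymorphicApplication">
        <!-- Defaults to oldImaginaryFeature; override with
             -DimaginaryFeature.implementation.bean=newImaginaryFeature -->
        <constructor-arg ref="${imaginaryFeature.implementation.bean:oldImaginaryFeature}"/>
    </bean>
</beans>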
The Spring code above defines an application and 2 imaginary feature implementations. By default, the application is initialized with the oldImaginaryFeature, but this behavior can be overridden by specifying a -DimaginaryFeature.implementation.bean=newImaginaryFeature command line argument. Only a single feature implementation will be initialized by Spring, and the implementations may have dependencies of their own.
The bottom line is: with a bit of extra preparation and correct design decisions, feature flags shouldn’t be a burden on your code base. By extra preparation I mean extracting interfaces for your domain objects, using an IoC container, etc., which is something we should be doing in most cases anyway.
Eran Harel is a Senior Software Developer at Outbrain.
Yeah, I know: monitoring is a “must have” tool for every web application/operation. If you have clients or partners that are dependent on your system, you don’t want to hurt their business (or your business), and you need to react in time to issues. At Outbrain, we acknowledge that we are running a tech system, and tech systems are bound to fail. All you need is to catch the failure soon enough, understand the reason, react and fix. In DevOps terminology these are called TTD (time to detect) and TTR (time to recover). To accomplish that, you need a good system that will tell the story and wake you up if something is wrong, long before it affects the business.
This is the main reason why we invested a lot in a highly capable monitoring system. We are doing Continuous Deployment, and a superb monitoring system is an integral part of the Immune System that allows us to react fast to flaws in the continuous stream of system changes.
Our main goal is to serve good content recommendations to readers on the Internet. The typical situation is a user reading a content page; we want to recommend content for further reading, and that is what we mean by a “good” recommendation.
Itai, our head of R&D, gave a presentation this week about Continuous Deployment and how we actually do it.
Here it is:
Some of you might ask, “Why is he telling us about datacenter architecture? Don’t Cloud Services solve this already?”, and some of you who already know me and my opinions on the subject will not be surprised. Indeed, I’m not a fan of Cloud Services, but that is another discussion. However, there are some advantages to using Cloud Services, and giving them up by establishing a datacenter felt somehow wrong for us.
Here are 2 of them:
1. Grow As You Go – When you build a datacenter, you take on commitments for space (racks or cages) and high-profile network gear; these are investments you have to pay for in advance, before you really need them. This is not an issue for a Cloud-based setup, because as you grow you simply spin up more instances.
2. Disaster Recovery Headroom – With a datacenter-based setup, in order to properly handle disaster recovery you need to double your setup so you can always move all your traffic to the other datacenter in case of disaster, which means doubling the hardware you buy. In the Cloud, this is also a non-issue.
These 2 arguments are very much correct. However, even taking them into consideration, our setup is much more cost-efficient than any Cloud offering. The logic behind it is what I want to share here.
Traditionally, when a company’s business grows, a single rack (or maybe 2) is not sufficient, and you need to allocate adjacent rack space in a co-located datacenter. This makes your recurring expenses grow, since you actually pay for reserved space that you don’t really use. It’s a big waste of your $$$. Once we managed to serve from more than one location, we found out that it is much cheaper to build multiple small datacenters with a small space footprint than to commit to a large space that we will not use most of the time. Adjacent space of at least 4 racks is much easier to find in most co-location facilities. More than that, our co-location provider agreed to give us 2 active racks with first right of refusal for the adjacent 2 racks, so we actually pay only for those we use.
This architecture also simplified our network gear requirements. Since each “LEGO Brick” is small, it needs to handle only a portion of the traffic and not all of it. This does not require high-profile network gear, and very cheap Linux machines are sufficient for handling most of the network roles, including load balancing, etc.
We continued this approach when choosing the intra-LEGO-Brick switching gear. Here we decided to use Brocade stackable switching technology. In general, it means that you can put a switch per cabinet and wire all the machines to it; when you add another cabinet, you simply connect the switches in a chain that looks and acts like a single switch. You can grow such a stack up to 8 switches. At Outbrain, we try to eliminate single points of failure, so we have 2 stacks, and machines are connected to both of them. Again, the stacking technology gave us the ability to not pay for network gear before we actually need it.
But what about Disaster Recovery (DR) headroom? (We decided to implement more than one location for disaster recovery as soon as we started generating revenue for our partners.) As I said, this is a valid argument. When we had 2 datacenters, 50% of our computing power was dedicated to DR and unused in normal times. This was not ideal and we needed to improve on it. Actually, the LEGO Bricks helped here once again. This week we opened our 3rd datacenter in Chicago. The math is simple: with N datacenters, only 1/N of the total capacity needs to sit idle as headroom, so by adding a third location our headroom dropped to only 33%, which is a lot of $$$ saved as your business grows. When we add the 4th it will drop to 25%, and so on.
I guess now you understand the logic, so let’s mention some fun facts about the DC implementation itself:
- Datacenters communicate via a dedicated link, powered by our co-location vendor.
- We use a Global DNS service to balance traffic between the datacenters.
- In our newer datacenters, power billing is pay-per-use with no flat fees, which again enables us to not pay for power we don’t use. It also motivates us to power off unneeded hardware, saving power costs while saving the planet.
- Power is 208V, which is more efficient than the regular 110V.
- All servers are connected to a KVM to enable remote access to BIOS config if needed — much easier to manage from Israel and in general.
- We have a lot of Dell C6100s in our datacenters, so each node is also connected to an IPMI network, in order to remotely restart individual nodes without rebooting all 4 nodes in the chassis.
- You can read more about assembling these C6100s in Nathan’s detailed post.
I guess your question is, “What does it take to manage this in terms of labor?” The answer is… not too much.
The Outbrain Operations team is a group of 4 Ops engineers. Most of the time they are not doing much related to the physical infrastructure; like other ops teams, they mostly handle the regular tasks of configuring infrastructure software (all open source: MySQL, Cassandra, Hadoop, Hive, ActiveMQ, etc.), monitoring, and code and system deployment (we heavily use Chef).
In general, Operations’ role in the company is to keep the serving fast, reliable and (very important) cost-efficient. This is the main reason why we invest time, knowledge and innovation in architecting our datacenters wisely.
I guess one of the next posts will be about our new Chicago datacenter and the concept of the “Dataless Datacenter.”