Introducing Orchestrator: manage and visualize your MySQL replication topologies and get home for dinner
- Orchestrator reads your replication topologies (give it one server – be it master or slave – in each topology, and it will reveal the rest).
- It keeps a state of this topology.
- It can continuously poll your servers to get an up to date topology map.
- It visualizes the topology in a clear and slick D3 tree.
- It allows you to modify your topology; move slaves around. You can use the command line variation, the JSON API, or you can use the web interface.
Nothing like nice screenshots
To move slaves around the topology (repoint a slave to a different master) through orchestrator’s web interface, we use drag and drop.
Orchestrator keeps you safe. It does so by:
- Correctly calculating the binary log files & positions (aka coordinates) of the slave you’re moving, its current master, its new master; it properly stops, starts and stalls your replication till everything is in sync.
- Helping you avoid shooting yourself in the foot. It will not allow moving a slave that uses STATEMENT based replication under a ROW based replication server. Or a 5.5 under a 5.6. Or anything under a server that doesn’t have binary logs. Or log_slave_updates. Or if one of the servers involved lags too much. Or more…
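To make the safety rules above a bit more concrete, here is a minimal sketch of how such pre-move checks could be expressed. It is not orchestrator's actual code: the `InstanceInfo` and `MoveChecks` classes, their field names, and the 60 second lag threshold are all assumptions made for illustration, and only the rules named in the list are covered.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical snapshot of one MySQL server; all field names are illustrative. */
final class InstanceInfo {
    final int major;
    final int minor;
    final String binlogFormat;      // "STATEMENT", "ROW" or "MIXED"
    final boolean logBin;           // binary logging enabled?
    final boolean logSlaveUpdates;  // log_slave_updates enabled?
    final long lagSeconds;          // replication lag as last seen

    InstanceInfo(int major, int minor, String binlogFormat,
                 boolean logBin, boolean logSlaveUpdates, long lagSeconds) {
        this.major = major;
        this.minor = minor;
        this.binlogFormat = binlogFormat;
        this.logBin = logBin;
        this.logSlaveUpdates = logSlaveUpdates;
        this.lagSeconds = lagSeconds;
    }

    /** True when this server runs an older major.minor version than the other one. */
    boolean isOlderThan(InstanceInfo other) {
        return major < other.major || (major == other.major && minor < other.minor);
    }
}

final class MoveChecks {
    /** Assumed threshold; a real tool would make this configurable. */
    static final long MAX_LAG_SECONDS = 60;

    /** Returns the reasons to refuse the move; an empty list means the move looks safe. */
    static List<String> reasonsToRefuse(InstanceInfo slave, InstanceInfo newMaster) {
        List<String> reasons = new ArrayList<>();
        if (!newMaster.logBin) {
            reasons.add("new master has no binary logs");
        }
        if (!newMaster.logSlaveUpdates) {
            reasons.add("new master does not have log_slave_updates");
        }
        if ("STATEMENT".equals(slave.binlogFormat) && "ROW".equals(newMaster.binlogFormat)) {
            reasons.add("STATEMENT based slave under a ROW based master");
        }
        if (slave.isOlderThan(newMaster)) {
            reasons.add("slave version is older than new master version");
        }
        if (slave.lagSeconds > MAX_LAG_SECONDS || newMaster.lagSeconds > MAX_LAG_SECONDS) {
            reasons.add("one of the servers lags too much");
        }
        return reasons;
    }
}
```

Returning a list of reasons rather than a plain boolean is a design choice of this sketch: it lets a command line tool or web interface tell the user exactly why a move was refused.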
It also points out a few problems, visually. While it is not – and will not be – a monitoring tool, it requires some replication status info for its own purposes. Too much lag? Replication not working? Server cannot be accessed? Server under maintenance? This all shows up in your topology. We use it a lot to get a holistic view over our current replication topologies status.
Orchestrator keeps the state of your topologies. Unlike other tools that will drill down from the master and just pick up on whatever’s connected right now, orchestrator remembers what used to be connected, too. If a slave is not replicating at this very moment, that does not mean it’s not part of the topology. Same for a MySQL service that has been temporarily stopped. And this includes all their slaves, if any. Until told otherwise (or until too much time passes and a server is assumed dead), orchestrator keeps the map intact.
Orchestrator supports a maintenance-mode state; it’s a flag saying “this server is in maintenance mode right now”. But this flag includes an owner and a reason for audit purposes. And while a server is under maintenance, orchestrator will disallow replication topology changes that include this server.
Operations performed via orchestrator are audited (well, almost all). You have a complete history on what slave has been moved from where to where; what server has been taken under maintenance and when, etc.
The most important thing is of course automating error-prone human sequences of actions. Repointing slaves is a mess (when you don’t have GTIDs). Automation saves time and greatly reduces the possibility that something goes wrong (of course never eliminates). We happen to use orchestrator at Outbrain on production, and twice in the past month had major events where orchestrator saved us many hours and worry.
Orchestrator supports “standard” replication: log file:pos kind of replication. Non GTID, non-parallel. Good (?) old replication.
Why not GTID? We’re using MySQL 5.5. We’ve had issues while evaluating 5.6; and besides, migrating to GTID is a mess (several solutions or proposed solutions seem to exist). At this time the majority of MySQL users seem to run 5.5, and a minority of those running 5.6 uses GTID (this is according to an unofficial “raise your hands” survey during last Percona Live event). “Standard” replication still applies to the majority of users. Support for GTID may be added in the future.
Read the FAQ for further questions on supported replication technologies.
How do you like it?
Orchestrator can run as a command line tool (no need for Web). It can serve an HTTP JSON API (no need for visualization) or it can serve as an HTTP web interface (no need to use command line options). Have it your way.
The technology stack
Orchestrator is released as open source under the Apache 2.0 license and is available at: https://github.com/outbrain/orchestrator
Read the Manual
Get the bundled binary+web files tarball, RPM or DEB packages. Or just clone the project. It’s free.
Like many java projects these days, we use Spring in Outbrain for configuring our java dependencies wiring. Spring is a technology that started in order to solve a common, yet not so simple, issue – wiring all the dependencies in a java project. This was done by utilizing the IoC (Inversion of Control) principles. Today Spring does a lot more than just wiring and bootstrapping, but in this post I will focus mainly on that.
When Spring first started, the only way to configure the wiring of an application was to use XML files that defined the dependencies between different beans. As Spring continued to develop, two more methods were added to configure dependencies: the annotation method and the @Configuration method. At Outbrain we use XML configuration. I found that this method has a lot of pain points, which I found a remedy for in Spring @Configuration.
What is this @Configuration class?
You can think of a @Configuration class just like XML definitions, only defined by code. Using code instead of XMLs allows some advantages over XMLs which made me switch to this method:
- No typos - You can’t have a typo in code. The code just won’t compile
- Compile time check (fail fast) – With XMLs it’s possible to add an argument to a bean’s constructor but to forget to inject this argument when defining the bean in the XML. Again, this can’t happen with code. The code just won’t compile
- IDE features come for free – Using code allows you to find usages of the bean’s constructor to find out easily the contexts that use it; It allows you to jump back and forth between beans definitions and basically everything you can do with code, you get for free.
- Feature flags - In Outbrain we use feature-flags a lot. Due to the continuous-deployment culture of the company, a code that is pushed to the trunk can find itself in production in a matter of minutes. Sometimes, when developing features, we use feature flags to enable/disable certain features. This is pretty easy to do by defining 2 different implementations to the same interface and decide which one to load according to the flag. When using XMLs we had to use the alias feature which makes it not intuitive enough to create feature-flags. With @Configuration, we can create a simple if clause for choosing the right implementation.
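The feature-flag point above can be sketched in plain Java. This is only an illustration under assumptions: the `Recommender` interface, both implementations, and the `RecommenderWiring` class are hypothetical names, and the Spring annotations are left out so that the snippet stands on its own. In a real @Configuration class the factory method would carry a @Bean annotation and the flag would come from your own configuration source.

```java
/** Hypothetical service interface with two interchangeable implementations. */
interface Recommender {
    String recommend(String userId);
}

final class LegacyRecommender implements Recommender {
    @Override
    public String recommend(String userId) {
        return "legacy:" + userId;
    }
}

final class ExperimentalRecommender implements Recommender {
    @Override
    public String recommend(String userId) {
        return "experimental:" + userId;
    }
}

/** Stands in for a @Configuration class; the method below would be a @Bean method. */
final class RecommenderWiring {
    static Recommender recommender(boolean useExperimental) {
        // A simple if clause picks the implementation, instead of an XML alias.
        if (useExperimental) {
            return new ExperimentalRecommender();
        }
        return new LegacyRecommender();
    }
}
```

Callers depend only on the `Recommender` interface, so flipping the flag changes the wiring without touching any code that uses the bean.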
Our example case
Step 1: Migrate <beans> to @Configuration
Step 2: Create a method for each Bean
A few things to notice:
- Each method is defined to return an interface type. In the method body we create the concrete class.
- The name that’s defined in the @Bean annotation is the same as the id that is defined in the XML for the beans.
- The bean anotherBean is injected with someBean in the XML. In the scenario here, we just call the getSomeClass() method. This doesn’t create another bean, this just uses the bean someBean (the same as it was in the XML).
We notice that we’re missing the property someInterestingProperty and the bean beanFromSomewhereElse.
Step 3: Import other XMLs or other @Configuration classes
If this bean resides in another @Configuration class you can use a different annotation @Import to import it:
In order to complete the picture, here’s how you can import a @Configuration class from an XML configuration file:
<context:annotation-config/> <bean class="some.package.ByeXmlApplicationContext"/>
The <context:annotation-config/> element needs to be defined once in the context in order to make Spring aware of @Configuration classes.
Step 4: Import beans from other XMLs (or @Configuration class, or @Component etc… classes)
I usually prefer the first method as it is less verbose. But of course, that’s just a matter of taste.
Just remember – the beans you import must be loaded to the application context – either by @Import or @ImportResource from this class, or using any other method from anywhere else (XML, @Configuration or annotations).
Step 5: Import properties
Step 6: Import @Configuration from web.xml
- Split to different @Configuration classes and don’t put all of your beans in one class
- Give meaningful names and even decide on a naming convention
- Avoid any logic inside the @Configuration classes. Aside maybe for things like feature-flags.
This post is based on a tech-talk I gave in Outbrain. You can find the slides here.
You can find it also in my personal blog
Find me on Twitter: @AviEtzioni
Introducing Propagator: multi-everything deployment made easy
Outbrain is happy to release its own Propagator as open source. Propagator is a schema & data deployment tool which makes it easy to deploy, review, audit & fix deployments to your database servers.
What does multi-everything mean? It is:
- Multi-server: push your schema & data changes to multiple instances in parallel
- Multi-role: different servers have different schemas
- Multi-environment: recognizes the differences between development, QA, build & production servers
- Multi-technology: supports MySQL, Hive (Cassandra on the TODO list)
- Multi-user: allows users authenticated and audited access
- Multi-planetary: TODO
With dozens of database servers in our company (and these are master database servers), from development machines to testing machines, through build machines to production servers, and with a growing team of over 70 engineers, we faced the growing problem of controlling our database schema evolution. Controlling creation of tables, columns, keys, foreign keys; controlling creation of data that must be consistent across all servers became an infeasible task. Some changes were lost; some servers forgotten along the way, and inconsistencies blocked our development & deployments again and again. (more…)
Using Storm for real time distributed computations has become a widely adopted approach, and today one can easily find more than a few posts on Storm’s architecture, internals, and what have you (e.g., Storm wiki, Understanding the parallelism of a storm topology, Understanding storm internal message buffers, etc).
So you read all these posts and got yourself a running Storm cluster. You even wrote a topology that does something you need, and managed to get it deployed. “How cool is this?”, you think to yourself. “Extremely cool”, you reply to yourself sipping the morning coffee. The next step would probably be writing some sort of a validation procedure, to make sure your distributed Storm computation does what you think it does, and does it well. Here at Outbrain we have these validation processes running hourly, making sure our realtime layer data is consistent with our batch layer data – which we consider to be the source of truth.
It was when the validation of a newly written computation started failing that we embarked on a great journey to the land of “How does one go about debugging a distributed Storm computation?”. True story. The validation process was reporting intermittent inconsistencies, intermittent being the operative word here: it was not like the new topology was completely and utterly messed up; rather, it was failing to produce correct results for some of the input, all the time (by correct results I mean such that match our source of truth).
Earlier today, Outbrain was the victim of a hacking attack by the Syrian Electronic Army. Below is a description of how the attack unfolded to help others protect against similar attempts. Updates will continue to be posted to this blog.
On the evening of August 14th, a phishing email was sent to all employees at Outbrain purporting to be from Outbrain’s CEO. It led to a page asking Outbrain employees to input their credentials to see the information. Once an employee had revealed their information, the hackers were able to infiltrate our email systems and identify other credentials for accessing some of our internal systems.
At 10:23am EST, SEA took responsibility for the hack of CNN.com, changing a setting through Outbrain’s admin console to label Outbrain recommendations as “Hacked by SEA.”
At 10:34am Outbrain internal staff became aware of the breach.
By 10:40am Outbrain network operations began investigating and decided to shut down all serving systems, degrade gracefully and block all external access to the system.
By 11:03am Outbrain finished turning off its service from all sites where we operate.
We are continuing to review all systems before re-initiating service.
We are aware that Outbrain was hacked earlier today and we took down service as soon as it was apparent. The breach now seems to be secured and the hackers blocked out, but we are keeping the service down for a little longer until we can be sure it’s safe to turn it back on securely. Please stay tuned here or to our Twitter feed for updates.
Yes it’s been a long time since we last updated this blog – Shame on us!
A lot has happened since our last blog post, which came while we were dealing with the effects of hurricane Sandy. In the end, our team handled it bravely and effectively, with no downtime and no business impact. However, a storm is still a storm, and we did have to do an emergency evacuation from our old New York data center and move to a new one.
More things have happened since and today I want to focus on one major aspect of our life in the last year. We have made some cultural decisions that somewhat changed the way we treat our work. Yes, the Devops movement has its influence here. When we faced the decision of “NOC or NOT”, we basically adopted the theme of “You build it, You run it!”.
Instead of hiring 10 students and attempting to train them on the “moving target” of a continuously changing production setup, we decided to hire 2 engineers and concentrate our effort on building a strong monitoring system that allows engineers to take ownership of monitoring their systems.
Now, Outbrain is indeed a high scale system. Building a monitoring system that enables more than 1000 machines and more than 100 services to report metrics every minute is quite a challenge. We chose the stack of Logstash, RabbitMQ and Graphite for that mission. In addition we developed an open source project called Graphitus which enables us to build dashboards from Graphite metrics. Since adopting it we have more than 100 dashboards that the teams are using daily. We also developed Dashanty which enables each team to develop an operational dashboard for itself.
On the alerting front we stayed with Nagios but improved its data sources. Instead of Nagios polling metrics by itself, we developed a Nagios/Graphite plugin where Nagios queries Graphite for the latest metrics and, according to thresholds, sends appropriate alerts to the relevant people. On top of that, the team developed an application called RedAlert that enables each and every team/engineer to configure alerts on the services they own, and to decide when alerts are critical and when such alerts should be pushed to them. This data goes into Nagios, which starts monitoring the metric in Graphite and fires an alert if something goes wrong. “Push” alerts are configured to go to PagerDuty, which locates the relevant engineer and emails, texts or calls them as needed.
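As a rough illustration of the threshold step in such a Nagios/Graphite check, here is a minimal sketch. It is not the plugin we actually run: the `GraphiteThresholdCheck` class and its method names are hypothetical, and the HTTP call to Graphite is left out. It assumes the datapoints have already been fetched and that gaps are represented as null values. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```java
/** Hypothetical threshold evaluation for a Graphite-backed Nagios check. */
final class GraphiteThresholdCheck {
    // Standard Nagios plugin exit codes.
    static final int OK = 0;
    static final int WARNING = 1;
    static final int CRITICAL = 2;
    static final int UNKNOWN = 3;

    /** Returns the most recent non-null datapoint, or null if there is none. */
    static Double latestValue(Double[] datapoints) {
        for (int i = datapoints.length - 1; i >= 0; i--) {
            if (datapoints[i] != null) {
                return datapoints[i];
            }
        }
        return null;
    }

    /** Maps the latest value to a Nagios status; higher values are assumed to be worse. */
    static int status(Double[] datapoints, double warnThreshold, double critThreshold) {
        Double latest = latestValue(datapoints);
        if (latest == null) {
            return UNKNOWN; // no data at all: report it rather than guess
        }
        if (latest >= critThreshold) {
            return CRITICAL;
        }
        if (latest >= warnThreshold) {
            return WARNING;
        }
        return OK;
    }
}
```

Skipping trailing nulls matters because the most recent time slot is often still empty when the check runs. Returning UNKNOWN when there is no data at all keeps a silent metric from looking healthy.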
Now that’s on the technical part. What is more important to make it happen is the cultural side that this technology supports:
We truly believe in “End to End Ownership”. “You build it, You run it!” is one way to say that. In an environment where everybody can (and should) change production at any moment, putting someone else to watch the systems makes it impossible. We were also very keen about MTTR (Mean Time To Recover). We don’t promise our business people a 100% fault free environment, but we do promise fast recovery time. When we put these two themes in front of us, we came to the conclusion that it is best for alerts to be directed to the owning engineers as fast as we can, with fewer mediators on the way. So, we came up with the following:
- We put a baseline of monitoring systems to support the procedure – and we continuously improve it.
- Engineers/teams are owners of services (very SOA architecture). Let’s use the term “Owner”. We try to eliminate services without clear owners.
- Owners push metrics into graphite using calls on code or other collectors.
- Owners define alerts on these metrics using RedAlert system.
- Each team defined “on call schedule” on PagerDuty. “On call” engineer is the point of contact for any alerting service under the team ownership.
- Ops are owners for the infrastructure (Servers/Network/software infra) – they also have “Ops on shift” – awake 24/7 (we use the team distribution between NY and IL for that).
- Non-push alerts that do not require immediate action are gathered during non-working hours and treated during working hours.
- Push alerts are routed via PagerDuty in the following way: Ops on shift gets them, and if he can address them or correlate them with an infrastructure issue, he acknowledges them. In case Ops on shift doesn’t know what to do with an alert, PagerDuty continues and routes it to the engineer on call.
- Usually the next thing that happens is that both of them jump on HipChat and start tackling the issue to shorten MTTR and resolve it.
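For the step in the list above where owners push metrics into Graphite from code, here is a minimal sketch of what such a call can look like. It relies on Graphite's plaintext protocol, where each line is a metric path, a value and a Unix timestamp in seconds, sent to the carbon listener (port 2003 by default). The `GraphiteLine` class, its method names, and the metric path in the usage note are assumptions for illustration, not our actual client code.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for Graphite's plaintext protocol. */
final class GraphiteLine {

    /** Builds one plaintext line: path, value, timestamp in seconds, newline. */
    static String format(String metricPath, double value, long epochSeconds) {
        if (metricPath == null || metricPath.isEmpty() || metricPath.matches(".*\\s.*")) {
            throw new IllegalArgumentException("metric path must be non-empty and contain no whitespace");
        }
        return metricPath + " " + value + " " + epochSeconds + "\n";
    }

    /** Sends one already formatted line to a carbon plaintext listener. */
    static void send(String host, int port, String line) throws IOException {
        try (Socket socket = new Socket(host, port);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            out.write(line);
            out.flush();
        }
    }
}
```

A call such as `GraphiteLine.send("graphite.example.internal", 2003, GraphiteLine.format("services.demo.requests", 42.0, System.currentTimeMillis() / 1000))` would report one datapoint; the host name here is a placeholder. Opening a socket per datapoint is only reasonable for a sketch, and a real client would batch lines or reuse the connection.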
The biggest benefit of this method is increased sense of “ownership” for everyone in the team. The virtual wall between Ops and Dev (which was initially somehow low in Outbrain) was completely removed. Everybody is more “Production sensitive”.
Few things that helped us through it:
- Our team. As management we encouraged it and formalized it but the motivation came from the team. It is very rare to see engineers that want (not to say hardly push) to take more ownership on their products and to really “Own them”. I feel lucky that we have such team. It made our decisions much simpler.
- Being so tech-ish and pushing our monitoring capabilities to such edges instead of going to the easy, labor intensive, half ass solution (AKA NOC).
- A 2 week “Quality Time” of all engineering that was devoted to improving MTTR and building all necessary to support this procedure. – All Credits to Erez Mazor for running this week.
Hi all. As Hurricane Sandy is about to hit the east coast of the US, and as Outbrain’s main datacenter is located in downtown Manhattan, we are taking measures to keep service interruption to a minimum for our partners and customers. Outbrain normally serves from 3 data centers, and in case we lose the NY data center, we will supply the service from one of the other data centers. On this page, below, we will update on any service interruption and ETAs for problem solving. We assume all will go well and we will not have to update but… just in case.
[UPDATE - Nov 3rd 3:45pm EST] - At this time Utility power is back to all our datacenters and HQ office. It is now time to restore the service from NY and get the office back to work. This will take some time but systems will gradually be put back up over the next week or so. There should be no effect on users, publishers or clients.
Our HQ will also start working gradually depending on the availability of public transportation.
We are here closing this reporting post – if you see any issues, please report to firstname.lastname@example.org or your rep.
I hope the storm of the century will be the last one for the next century (at least).
[UPDATE - Nov 1st 9:30am EST] - Our HQ, located on 13th between 5th and 6th in downtown New York City is still without power and therefore closed. Thankfully, our NY-based team is safe and in dry locations, and will continue to try and work as best they can. We highly appreciate the concern and best wishes we received from our partners and clients across the globe; thank you!
We are doing our best to continue to provide the best in class service, one we hope you’ve come to expect from us. As an update, our datacenter in NY is still without power and we expect it to be down for a few more days. We will continue to serve from our other datacenters located in Chicago and Los Angeles. To reiterate, our service did not go down, and we are currently still serving across our client’s sites. As of this morning, we recovered and updated all our reporting capabilities, so we should be back to 100%.
If you are experiencing any difficulties or seeing different, please reach out to your respective contacts. We’ll also continue to operate under emergency mode until Monday, you can reach us 24/7 at am-emergency-support@outbrain.
[UPDATE - Oct 31st 6:46am EST] - Serving still holds strong from our LA and Chicago data centers and we are not aware of any disruption to our service. We are working hard to recover our dashboard reporting capabilities, but it will probably take a couple more days before we’re able to get back to normal mode. Sorry for any inconvenience caused by this. Send us a note to am-emergency-support@outbrain.
[UPDATE - 6:51pm EST] - Again, not much to update – All is stable with both LA and Chicago datacenters. It’s the end of the day here in Israel and we are trying to get some rest. Our team mates in the US are keeping an eye on the system and will alert us if there is anything wrong. Good night.
[UPDATE - 3:35am EST] - Actually not much to update about the service. All is pretty much stable. We are safely serving from LA and Chicago. Most back-end services are running in the LA Datacenter, and our tech teams in Israel and NY are monitoring and handling issues as they arise. Our Datacenter vendors in NY are working with FDNY to pump the water from the flooded generator room, so it will take a while to recover this datacenter.
[UPDATE - 10:50am EST] - The clients dashboard is back up.
[UPDATE - 10am EST] – The clients dashboard on our site is periodically down – we are handling the issues there and will update soon.
[UPDATE - 5am EST] Our NY Data center went down. Our service is fully operational and we are serving through our Chicago and LA Data centers. If you’re accessing your Outbrain dashboard you may experience some delays in data freshness. We are working to resolve this issue and will continue to update.
[UPDATE - 2am EST] – Our NY Data center went completely off – We are fully serving from our Chicago and LA Data centers. External reports on our site are still down but we are working to fail over all services from the LA Datacenter. – we will follow with updates.
[Update - 12:50am EST] – power just went all off in our NY Datacenter and provider has evacuated the facility – we are taking our measures to move all functionality to other datacenters.
[UPDATE - 9pm EST] - Commercial power went down in our NY Datacenter. The provider failed over to generator and we continue to serve smoothly from this Datacenter. We continue to monitor the service closely and are ready to take action if needed.
Many of our internal applications were developed using the Extjs framework.
It is very difficult to write automated tests for an Ext application with Selenium, because Ext generates many <div> and <span> tags with automatically generated IDs (something like “ext-comp-11xx”). Accessing these tags through Selenium is the big challenge we are trying to solve. We wanted to find a way to get these automatically generated IDs automatically.
How do we approach this?
Ext has a component manager, where all of the developers’ components are being saved. We can “ask” the component manager for the component ID by sending it a descriptor of the component. To simplify – we (the selenium server) tell the component manager “I need the ID of the current visible window which, btw, is labeled as ‘campaign editor’”.
This will look something like:
ComponentLocatorFactory extjsCmpLoc = new ComponentLocatorFactory(selenuim);
Window testWin = new Window(extjsCmpLoc.createLocator("campaign editor", Xtype.WINDOW));
Then we can use an Ext window method such as close: testWin.close();
Another example:
ComponentLocatorFactory extjsCmpLoc = new ComponentLocatorFactory(selenuim);
Button newButton = new Button(extjsCmpLoc.createLocator("Add Campaign", ExtjsUtils.Xtype.BUTTON));
You can ask for all of the visible components by type, by label or both:
TextField flyfromdate = new TextField( extjsCmpLoc.createLocator(ExtjsUtils.Xtype.DATEFIELD, 0));
TextField flytodate = new TextField(extjsCmpLoc.createLocator(ExtjsUtils.Xtype.DATEFIELD, 1));
Here’s a simple diagram of our solution:
Link to the project on GitHub: https://github.com/simbal/SelenuimExtend
This solution is Open Source. In the meantime, if you have any questions, feel free to contact me directly. Asaf at outbrain dot com.
At Outbrain, we like things that are awesome.
Cassandra is awesome.
Ergo, we like Cassandra.
We’ve had it in production for a few years now.
I won’t delve into why the developers like it, but as a Sysadmin on-call in the evenings, I can tell you straight out I’m glad it has my back.
We have MySQL deployed pretty heavily, and it is fantastic at what it does. However, MySQL has a bit of an administrative overhead compared to a lot of the new alternative data stores out there, especially when making MySQL work in a large geographically distributed environment.
If you can model your data in Cassandra, are educated about the trade-offs, and have an undying wish not to have to worry too deeply about managing replication and sharding, it is a no-brainer.
Us Sysadmins fear change, because it is our butt on the line if there is an outage. With executives anxiously pacing behind us and revenue flushing down the drain, we’re the last line of defense if there is an issue and we’re the ones who will be torn away from families in the evenings to handle an outage.
So, yeah… we’re a conservative lot
That being said, change and progress can be good, especially when it frees you up. Cassandra is resilient, fault-graceful and elastic. Once you understand how so, you’ll be slightly less surly. Your developers might not even recognize you!
These slides are for the Sys Admin, noble fellow, to assuage his fears and get him started with Cassandra.