orchestrator-agent is a side-kick, complementary project of orchestrator, implementing a daemon service on one’s MySQL hosts which communicates with and accepts commands from orchestrator, built with the original purpose of providing an automated solution for provisioning new or corrupted slaves.
It was built by Outbrain, with Outbrain’s specific use case in mind. While we release it as open source, only a small part of its functionality will appeal to the public (this is why it’s not strictly part of the orchestrator project, which is a general purpose, wide-audience solution). Nevertheless, it is a simple implementation of a daemon, such that can be easily extended by the community. The project is open for pull-requests!
A quick breakdown of orchestrator-agent is as follows:
- Executes as a daemon on linux hosts
- Interacts and invokes OS commands (via bash)
- Does not directly interact with a MySQL server running on that host (does not connect via mysql credentials)
- Expects a single MySQL service on host
- Can control the MySQL service (e.g. stop, start)
- Is familiar with LVM layer on host
- Can take LVM snapshots, mount snapshots, remove snapshots
- Is familiar with the MySQL data directory, disk usage, file system
- Can send snapshot data from a mounted snapshot on a running MySQL host
- Can prepare data directory and receive snapshot data from another host
- Recognizes local/remote datacenters
- Controlled by orchestrator, two orchestrator-agents implement an automated and audited solution for seeding a new/corrupted MySQL host based on a running server.
Outbrain is proud to announce Aletheia, our solution for a uniform data delivery and flow monitoring across data producing and consuming subsystems. At Outbrain we have great amounts of data being constantly moved and processed by various real time and batch oriented mechanisms. To allow fast recovery and high SLA, we need to be able to detect problems in our data crunching mechanisms as fast as we can, preferably at near real time. The later problems are detected, the harder it is to investigate them (and thus fix them), and chances of business impact grow rapidly.
To address these issues, we’ve built Aletheia, a framework providing a uniform way to deliver and consume data, with built in monitoring capabilities allowing both producing and consuming sides to report statistics, which can be used to monitor the pipeline state in a timely fashion.
What Is A/B Testing
A/B testing is a method widely used to validate assumptions about web site optimizations. With A/B tests we can test two configurations, configuration A and configuration B, of a web page design and compare them according to some metrics that define what a success result is. In other words, you test your new design against the current design and measure which one produces better results. To decide which design is better than the other, you split the traffic to your web page between these two configurations and then you can measure which configuration had better performance and apply this configurations as the default configuration of your site.
What To Test?
The choice of what to test depends on your goals. In Outbrain each configuration is called an A/B test variant. The idea of Outbrain’s A/B testing is to allow publishers to test two different designs of their widgets, and measure which design had better Click Through Rate (CTR) and Revenue Per 1,000 Impressions (RPM) performance.
There are more than two hundred online settings that directly affect the widget. Each of these settings can be tested within A/B test variants. For example, one of these online settings is called “Widget Structure”. This setting configures the look and feel of the widget.
Widget structure – look and feel of the widget
If your goal is to test an addition of a new widget structure, you can configure the variant A with the new widget structure addition, against variant B that uses the original design of the widget structure and serves as the control group.
When the test comes to an end many questions may come up. How did it affect the customers? Did the new design of the widget structure deliver better CTR and RPM performance? Maybe if we changed the title of the new widget structure it would have resulted in better performance? Maybe if we changed the images size of the old widget structure, it would have resulted in better performance? All of these questions can be answered one by one if we set appropriate A/B test variants.
Even though each A/B test in our system is unique, there are certain widget settings that are usually tested for every variant:
- Number of paid recommendations
- Number of organic recommendations
- Image size in the widget
- The number of recommendations on the widget unit
- Widget structure
A/B Tests in Outbrain
Once you decided that you want to create a new A/B test, you can do it using an internal tool named Wabbit – Widget A/B testing tool. The tool gives you the ability to create/edit an existing A/B test or to pull internal reports with Key Performance Indicator (KPI) performance for the test.
The A/B test can be defined on a specific widget on one site or it can be done on a group of sites that use the same widget.
When the test ends, we pull the A/B test report to measure which configuration had better performance. If the data indicates one of the configurations is an improvement according to our KPIs and the test has experienced enough traffic to be considered significant, we give the option to apply the new configuration as the default for the widget.
- In Outbrain we recommend running experiments for at least two weeks and no more than a month. The main reason for that is to eliminate the “day of the week” effect because users who visit the site on the weekend might represent different segment than those who visit the site during the week.
- On the other hand, running an A/B test more than a month leads to unreliable test results, such as cookie expiration that causes the users to start see different configurations which compromises the consistency of the test.
- At Outbrain, we also recommend allocating at least 5% of traffic toward an AB test to increase the probability of ending the test with results that have more than a 90% confidence level based on statistical analysis. Here’s a calculator from KissMetrics that will allow you to easily figure out if you’re A/B test results are significant.
In this blog post I will be implementing a file download with a progress indicator using cookies, AngularJS and the promises.
Promises are a powerful concept with a number of advantages, in the following implementation pay attention to these points (your more then welcome to comment):
- Clarity and readability of code
- Error handling
- Separation of concerns
I thought of showing the same implementation without promises, but I think anyone who has tried to handle more than one callback and handle the error cases properly will easily see the difference.
A download button that changes it’s text with set intervals.
At the end it should be in a success state or an error state.
To complicate things a little and show the power of promises I added another step called “validateBeforeDownload”, this step will call the server to validate the download and fail it if necessary.
Downloading a file
The standard way of downloading a file is with a simple “a” tag with an href.
In order to do be able to add the “validateBeforeDownload” step and avoid passing “dom” to a service – I am using an Iframe which a service creates and destroys. This will trigger the download and if the server headers are appropriate the download will begin.
Adding in the progress
Easier said then done! Downloading a file can’t be done with an simple ajax call, so you can’t tell when the download is complete.
The solution I’m using is setting a cookie, let’s call it “download_file” with a timer that checks for a cookie every 500ms.
- While the cookie exists the loading state is preserved.
- Once the request completes, the server deletes the cookie and the timer is stopped.
This isn’t the best solution but is simple and doesn’t require sockets or external plugins.
Just to get the full stack of implementation here is the code for handling the response data and the clearing of the cookie.
Wrapping everything together with promises
Pay attention to the comments in the code, some of the code is there to simulate the server requests and response and are only there for the full picture.
Each visual state of the button is determined by it’s text (scope.downloadExcelText).
Notice $timeout mocks an asynchronous call and it’s response to a server.
this would normally be done with $http.
This is were our hard work pays off and promises start to shine.
Lets step into the promise mechanism -
Prepending the “downloadService.validateBeforeDownload” to the “downloadService.downloadExcel” with the “then” method creates a third promise which shares callbacks for: success, failure and notifications (for the progress).
There is also a finally callback attached to this promise that we use for sharing code between the success and failure.
But the really nice thing here is it also enables handling errors just from the “validateBeforeDownload”, and bubbling them up if needed with $q.reject or by simply throwing the error.
Pay attention that each step towards completion of the promise seems to be handled in an async manner and the actual asynchronicity is handled by the promise mechanism and the service. Magic!
Introducing Orchestrator: manage and visualize your MySQL replication topologies and get home for dinner
Introducing Orchestrator: manage and visualize your MySQL replication topologies and get home for dinner
- Orchestrator reads your replication topologies (give it one server – be it master or slave – in each topology, and it will reveal the rest).
- It keeps a state of this topology.
- It can continuously poll your servers to get an up to date topology map.
- It visualizes the topology in a clear and slick D3 tree.
- It allows you to modify your topology; move slaves around. You can use the command line variation, the JSON API, or you can use the web interface.
Nothing like nice screenshots
To move slaves around the topology (repoint a slave to a different master) through orchestrator‘s web interface, we use Drag and Drop,
Orchestrator keeps you safe. It does so by:
- Correctly calculating the binary log files & positions (aka coordinates) of the slave you’re moving, its current master, its new master; it properly stops, starts and stalls your replication till everything is in sync.
- Helping you to avoid shooting yourself in the leg. It will not allow moving a slave that uses STATEMENT based replication under a ROW based replication server. Or a 5.5 under a 5.6. Or anything under a server that doesn’t have binary logs. Or log_slave_updates. Or if one of the servers involed lags too much. Or more…
It also points out a few problems, visually. While it is not – and will not be – a monitoring tool, it requires some replication status info for its own purposes. Too much lag? Replication not working? Server cannot be accessed? Server under maintenance? This all shows up in your topology. We use it a lot to get a holistic view over our current replication topologies status.
Orchestrator keeps the state of your topologies. Unlike other tools that will drill down from the master and just pick up on whatever’s connected right now, orchestrator remembers what used to be connected, too. If a slave is not replicating at this very moment, that does not mean it’s not part of the topology. Same for a MySQL service that has been temporarily stopped. And this includes all their slaves, if any. Until told otherwise (or until too much time passes and a server is assumed dead), orchestrator keeps the map intact.
Orchestrator supports a maintenance-mode state; it’s a flag saying “this server is in maintenance mode right now”. But this flag includes an owner and a reason for audit purposes. And while a server is under maintenance, orchestrator will disallow replication topology changes that include this server.
Operations performed via orchestrator are audited (well, almost all). You have a complete history on what slave has been moved from where to where; what server has been taken under maintenance and when, etc.
The most important thing is of course automating error-prone human sequences of actions. Repointing slaves is a mess (when you don’t have GTIDs). Automation saves time and greatly reduces the possibility that something goes wrong (of course never eliminates). We happen to use orchestrator at Outbrain on production, and twice in the past month had major events where orchestrator saved us many hours and worry.
Orchestrator supports “standard” replication: log file:pos kind of replication. Non GTID, non-parallel. Good (?) old replication.
Why not GTID? We’re using MySQL 5.5. We’ve had issues while evaluating 5.6; and besides, migrating to GTID is a mess (several solutions or proposed solutions seem to exist). At this time the majority of MySQL users seem to run 5.5, and a minority of those running 5.6 uses GTID (this is according to an unofficial “raise your hands” survey during last Percona Live event). “Standard” replication still applies to the majority of users. Support for GTID may be added in the future.
Read the FAQ for further questions on supported replication technologies.
How do you like it?
Orchestrator can run as a command line tool (no need for Web). It can server HTTP JSON API (no need for visualization) or it can server as HTTP web interface (no need to use command line options). Have it your way.
The technology stack
Orchestrator is released as open source under the Apache 2.0 license and is available at: https://github.com/outbrain/orchestrator
Read the Manual
Get the bundled binary+web files tarball, RPM or DEB packages. Or just clone the project. It’s free.
Like many java projects these days, we use Spring in Outbrain for configuring our java dependencies wiring. Spring is a technology that started in order to solve a common, yet not so simple, issue – wiring all the dependencies in a java project. This was done by utilizing the IoC (Inversion of Control) principles. Today Spring does a lot more than just wiring and bootstrapping, but in this post I will focus mainly on that.
When Spring just started, the only way to configure the wirings of an application, was to use XMLs which defined the dependencies between different beans. As Spring had continued to develop, 2 more methods were added to configure dependencies – the annotation method and the @Configuration method. In Outbrain we use XML configuration. I found this method has a lot of pain points which I found remedy to using spring @Configuration
What is this @Configuration class?
You can think of a @Configuration class just like XML definitions, only defined by code. Using code instead of XMLs allows some advantages over XMLs which made me switch to this method:
- No typos - You can’t have a typo in code. The code just won’t compile
- Compile time check (fail fast) – With XMLs it’s possible to add an argument to a bean’s constructor but to forget to inject this argument when defining the bean in the XML. Again, this can’t happen with code. The code just won’t compile
- IDE features come for free – Using code allows you to find usages of the bean’s constructor to find out easily the contexts that use it; It allows you to jump back and forth between beans definitions and basically everything you can do with code, you get for free.
- Feature flags - In Outbrain we use feature-flags a lot. Due to the continuous-deployment culture of the company, a code that is pushed to the trunk can find itself in production in a matter of minutes. Sometimes, when developing features, we use feature flags to enable/disable certain features. This is pretty easy to do by defining 2 different implementations to the same interface and decide which one to load according to the flag. When using XMLs we had to use the alias feature which makes it not intuitive enough to create feature-flags. With @Configuration, we can create a simple if clause for choosing the right implementation.
Our example case
Step 1: Migrate <beans> to @Configuration
Step 2: Create a method for each Bean
A few things to notice:
- Each method is defined to return an interface type. In the method body we create the concrete class.
- The name that’s defined in the @Bean annotation is the same as the id that is defined in the XML for the beans.
- The bean anotherBean is injected with someBean in the XML. In the scenario here, we just call the getSomeClass() method. This doesn’t create another bean, this just uses the bean someBean (the same as it was in the XML).
We notice that we’re missing the property someInterestingProperty and the bean beanFromSomewhereElse.
Step 3: Import other XMLs or other @Configuration classes
If this bean resides in another @Configuration class you can use a different annotation @Import to import it:
In order to complete the picture, here’s how you can import a @Configuration class from an XML configuration file:
<context:annotation-config/> <bean class="some.package.ByeXmlApplicationContext"/>
The <context:annotation-config/> needs to be defined once in the context in order to make spring aware to @Configuration classes
Step 4: Import beans from other XMLs (or @Configuration class, or @Component etc… classes)
I usually prefer the first method as it is less verbose. But of course, that’s just a matter of taste.
Just remember – the beans you import must be loaded to the application context – either by @Import or @ImportResource from this class, or using any other method from anywhere else (XML, @Configuration or annotations).
Step 5: Import properties
Step 6: Import @Configuration from web.xml
- Split to different @Configuration classes and don’t put all of your beans in one class
- Give meaningful names and even decide on a naming convention
- Avoid any logic inside the @Configuration classes. Aside maybe for things like feature-flags.
This post is based on a tech-talk I gave in Outbrain. You can find the slides here.
You can find it also in my personal blog
Find me on Twitter: @AviEtzioni
Introducing Propagator: multi-everything deployment made easy
Outbrain is happy to release its own Propagator as open source. Propagator is a schema & data deployment tool which makes it easy to deploy, review, audit & fix deployments to your database servers.
What does multi-everything mean? It is:
- Multi-server: push your schema & data changes to multiple instances in parallel
- Multi-role: different servers have different schemas
- Multi-environment: recognizes the differences between development, QA, build & production servers
- Multi-technology: supports MySQL, Hive (Cassandra on the TODO list)
- Multi-user: allows users authenticated and audited access
- Multi-planetary: TODO
With dozens of database servers in our company (and these are master database servers), from development machines to testing machines, through build machines to production servers, and with a growing team of over 70 engineers, we faced the growing problem of controlling our database schema evolution. Controlling creation of tables, columns, keys, foreign keys; controlling creation of data that must be consistent across all servers became an infeasible task. Some changes were lost; some servers forgotten along the way, and inconsistencies blocked our development & deployments again and again. (more…)
Using Storm for real time distributed computations has become a widely adopted approach, and today one can easily find more than a few posts on Storm’s architecture, internals, and what have you (e.g., Storm wiki, Understanding the parallelism of a storm topology, Understanding storm internal message buffers, etc).
So you read all these posts and and got yourself a running Storm cluster. You even wrote a topology that does something you need, and managed to get it deployed. “How cool is this?”, you think to yourself. “Extremely cool”, you reply to yourself sipping the morning coffee. The next step would probably be writing some sort of a validation procedure, to make sure your distributed Storm computation does what you think it does, and does it well. Here at Outbrain we have these validation processes running hourly, making sure our realtime layer data is consistent with our batch layer data – which we consider to be the source of truth.
It was when the validation of a newly written computation started failing, that we embarked on a great journey to the land of “How does one go about debugging a distributed Storm computation?”, true story. The validation process was reporting intermittent inconsistencies when, intermittent being the operative word here, since it was not like the new topology was completely and utterly messed up, rather, it was failing to produce correct results for some of the input, all the time (by correct results I mean such that match our source of truth).
Earlier today, Outbrain was the victim of a hacking attack by the Syrian Electronic Army. Below is a description of how the attack unfolded to help others protect against similar attempts. Updates will continue to be posted to this blog.
On the evening of August 14th, a phishing email was sent to all employees at Outbrain purporting to be from Outbrain’s CEO. It led to a page asking Outbrain employees to input their credentials to see the information. Once an employee had revealed their information, the hackers were able to infiltrate our email systems and identify other credentials for accessing some of our internal systems.
At 10:23am EST SEA took responsibility for hack of CNN.com, changing a setting through Outbrain’s admin console to label Outbrain recommendations as “Hacked by SEA.”
At 10:34am Outbrain internal staff became aware of the breach.
By 10:40am Outbrain network operations began investigating and decided to shut down all serving systems, degrade gracefully and block all external access to the system.
By 11:03am Outbrain finished turning off its service from all sites where we operate.
We are continuing to review all systems before re-initiating service.
We are aware that Outbrain was hacked earlier today and we took down service as soon as it was apparent. The breach now seems to be secured and the hackers blocked out, but we are keeping the service down for a little longer until we can be sure it’s safe to turn it back on securely. Please stayed tuned here or to our Twitter feed for updates.