Announcing Aletheia – A streaming data delivery framework

This post was written by Stas Levin

Outbrain is proud to announce Aletheia, our solution for uniform data delivery and flow monitoring across data-producing and data-consuming subsystems. At Outbrain, large amounts of data are constantly being moved and processed by various real-time and batch-oriented mechanisms. To allow fast recovery and maintain high SLAs, we need to detect problems in our data-crunching mechanisms as quickly as possible, preferably in near real time. The later problems are detected, the harder they are to investigate (and thus fix), and the likelihood of business impact grows rapidly.

To address these issues, we’ve built Aletheia, a framework providing a uniform way to deliver and consume data, with built-in monitoring capabilities that allow both the producing and consuming sides to report statistics, which can be used to monitor the pipeline’s state in a timely fashion.

Overview of Aletheia


Aletheia makes it easy to deliver your domain entities (represented as classes) to, and from, what we call EndPoints. Aletheia consists of two main components: a DatumProducer and a DatumConsumer. As their names imply, each is responsible for either delivering data to or consuming it from some endpoint. Both the DatumProducer and the DatumConsumer report their ongoing statistics using what we call “breadcrumbs”, which are essentially messages containing metadata about the produced or consumed data. By comparing breadcrumbs reported by DatumProducers and DatumConsumers, one can get a picture of what portion of the data is available for consumption and what portion has already been consumed. This is even more useful when a pipeline has multiple producer and consumer components, forming a more complex graph structure with breadcrumbs reported at each stage.
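To make the breadcrumb comparison concrete, here is a minimal sketch (hypothetical code, not Aletheia’s actual API; the class and method names are ours) of estimating the unconsumed backlog by diffing producer-side and consumer-side counts per time bucket:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Aletheia's actual API: compare breadcrumb
// counts reported by a producer and a consumer for the same time buckets
// to estimate how much produced data has not yet been consumed.
public class BreadcrumbComparison {

    // Keys are time buckets (e.g. "12:01"), values are datum counts
    // reported for that bucket. Returns only the buckets with a backlog.
    public static Map<String, Long> lag(Map<String, Long> produced,
                                        Map<String, Long> consumed) {
        Map<String, Long> lag = new HashMap<>();
        for (Map.Entry<String, Long> e : produced.entrySet()) {
            long behind = e.getValue() - consumed.getOrDefault(e.getKey(), 0L);
            if (behind > 0) {
                lag.put(e.getKey(), behind);
            }
        }
        return lag;
    }
}
```

In a multi-stage pipeline the same diff can be computed between any two adjacent stages, since each stage reports its own breadcrumbs.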

Architecture

(Figure: Aletheia architecture diagram)

Aletheia consists of two main components:

  1. DatumProducer – the component responsible for producing (delivering) and auditing data to a certain EndPoint (a Kafka topic, a file, etc.)

  2. DatumConsumer – the component responsible for consuming and auditing data from a certain EndPoint (a Kafka topic, a file, etc.)

In addition, there are also senders and receivers, responsible for communicating with the particular endpoint types, be it Kafka, log files, or another custom endpoint type.
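As a rough illustration of what a custom sender might look like (hypothetical code, not Aletheia’s actual API; the class name is ours), here an in-memory list stands in for a real destination such as a Kafka topic or a log file:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a sender for a custom endpoint type.
// A real sender would write the serialized envelope to Kafka, a file,
// or some other destination; here we just collect it in memory.
public class InMemorySender {

    public final List<byte[]> endpoint = new ArrayList<>();

    public void send(byte[] serializedEnvelope) {
        endpoint.add(serializedEnvelope);
    }
}
```

A matching receiver would read the same serialized envelopes back out of the endpoint on the consuming side.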

Aletheia is all about getting a datum from one place to another, where a datum is a single unit of information, typically a client’s domain entity, say, a click or an impression event. A datum is packed into a “DatumEnvelope”, which consists of some metadata and the actual serialized datum. Aletheia comes with native support for Kafka and text log file endpoints, but was built with extensibility in mind, making the introduction of new endpoint types easy.
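The envelope idea can be sketched as follows (hypothetical code; the field and method names are illustrative, not Aletheia’s actual DatumEnvelope schema):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the DatumEnvelope concept: some metadata plus
// the serialized datum bytes. Field names are ours, for illustration.
public class DatumEnvelopeSketch {

    public final String datumTypeId;     // which domain entity this is, e.g. "click"
    public final long creationTime;      // when the datum was produced
    public final byte[] serializedDatum; // the datum itself, serialized

    public DatumEnvelopeSketch(String datumTypeId, long creationTime,
                               byte[] serializedDatum) {
        this.datumTypeId = datumTypeId;
        this.creationTime = creationTime;
        this.serializedDatum = serializedDatum;
    }

    // Producer side: wrap a serialized datum; a consumer-side
    // deserialization hook would reverse this step.
    public static DatumEnvelopeSketch wrap(String typeId, String datumJson) {
        return new DatumEnvelopeSketch(typeId, System.currentTimeMillis(),
                datumJson.getBytes(StandardCharsets.UTF_8));
    }
}
```

Because only the envelope format is shared, senders and receivers for new endpoint types need not know anything about the client’s domain entities.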

Both the DatumProducer and the DatumConsumer keep a record of their progress and periodically send out their aggregated monitoring information in the form of “breadcrumbs”, which can be used to form a comprehensive real-time picture of your pipeline’s state.
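A minimal sketch of such periodic aggregation (hypothetical code, not Aletheia’s actual breadcrumb mechanism; the class name and bucketing scheme are ours) counts datums per fixed time bucket and flushes one summary per bucket:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of breadcrumb aggregation: count handled datums
// per fixed time bucket, emitting one {bucketStart, count} summary
// whenever a datum falls into a new bucket.
public class BreadcrumbAggregator {

    private final long bucketMillis;
    private long currentBucketStart = -1;
    private long count = 0;

    // Each emitted entry is {bucketStartMillis, datumCount}.
    public final List<long[]> emitted = new ArrayList<>();

    public BreadcrumbAggregator(long bucketMillis) {
        this.bucketMillis = bucketMillis;
    }

    public void record(long timestampMillis) {
        long bucket = (timestampMillis / bucketMillis) * bucketMillis;
        if (currentBucketStart != -1 && bucket != currentBucketStart) {
            emitted.add(new long[]{currentBucketStart, count}); // flush summary
            count = 0;
        }
        currentBucketStart = bucket;
        count++;
    }
}
```

Reporting one aggregate per interval, rather than one message per datum, keeps the monitoring overhead small even for high-volume streams.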

Use cases at Outbrain


  • Producing data to Kafka clusters in multiple data centers

  • Consuming data from Kafka as part of our Storm topologies

  • Producing distributed log files generated by frontend servers

  • (Work in progress) Loading data files into Hadoop

Conclusion


Aletheia has consolidated the way we manage data production and consumption here at Outbrain. Its uniform API for delivering and consuming data from different sources, along with the built-in breadcrumb emission, makes for what we have found to be a convenient abstraction layer.

Aletheia can be found at https://github.com/outbrain/Aletheia


2 Comments
  1. Stas Levin

    While both address challenges involved in designing and implementing data pipelines, their scope is quite different.
    Apache Camel targets the enterprise, employs Enterprise Integration Patterns (EIPs), and has a rather rich feature set, including a visual IDE, an XML DSL, and other enterprise-oriented features. I also believe that real-time, high-volume streams were not its main focus (though Kafka is reportedly supported via the camel-kafka connector).
    Aletheia, on the other hand, strives to provide a simple and coherent set of features, aimed at getting you started with delivering high volume data streams easily and quickly.

    Aletheia is more like Netflix’s Suro, while Apache Camel is more like Spring Integration.

    Hope this answers your question, at least to some extent 🙂
