Data Flow

Note: Kafka is no longer used in Openmesh's data collection pipeline due to its centralized architecture; see Openmesh Core.

We use Apache Kafka to transport messages through our pipeline. Kafka is inherently distributed, scalable, and fault-tolerant – essential properties for our high-throughput, low-latency, critical data. The Openmesh infrastructure has been built from the ground up for horizontal scalability, leveraging Kafka’s parallelism and distributed architecture to ensure we can always keep up with intense throughput.

CEX data, once collected, is all produced into a raw topic of semi-structured JSON, partitioned by exchange-symbol tuples. Blockchain data is already structured, so it is produced to separate topics in Apache Avro format. Avro is a JSON-like data format that preserves schema information while minimizing message size.
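Openmesh's actual keying scheme isn't shown here, but the idea can be sketched as follows: keying each raw message by its exchange-symbol tuple means a key-hash partitioner always routes events for the same pair to the same partition. The partition count and record layout below are assumptions for illustration.

```python
import hashlib
import json

NUM_PARTITIONS = 12  # hypothetical partition count for the raw topic


def partition_for(exchange: str, symbol: str,
                  num_partitions: int = NUM_PARTITIONS) -> int:
    """Map an (exchange, symbol) tuple to a partition. Kafka's default
    partitioner uses murmur2, but any stable hash of the key gives the
    same guarantee: one pair always lands on one partition."""
    key = f"{exchange}-{symbol}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % num_partitions


def make_record(exchange: str, symbol: str, payload: dict) -> tuple[str, bytes]:
    """Build the (key, value) pair for the raw topic: the value is the
    semi-structured JSON event, the key carries the exchange-symbol tuple."""
    return f"{exchange}-{symbol}", json.dumps(payload).encode()
```

Because the key is stable, every producer in the fleet agrees on where a given pair's events go, which is what makes the downstream ordering guarantee possible.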

At this point, a network of stream processors consumes the data: a large consumer group that ingests and processes all raw market data. Thanks to partitioning, these processes run in parallel, yet events for a given exchange and symbol are guaranteed to be processed in order. All raw data is also archived in object storage as compressed JSON. Blockchain data is processed as well, particularly smart contract events, whose encoded data points must be decoded before they can be used.
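A minimal sketch of what one stream processor might do with a raw event: parse the semi-structured JSON and emit a standardized record. The field names and symbol convention here are assumptions, not Openmesh's actual schema.

```python
import json


def normalize(raw: bytes) -> dict:
    """Turn a semi-structured raw JSON event into a standardized record.
    Field names and the BASE-QUOTE symbol convention are illustrative."""
    ev = json.loads(raw)
    return {
        "exchange": ev["exchange"],
        "symbol": ev["symbol"].replace("/", "-").upper(),
        "price": float(ev["price"]),          # raw feeds often send strings
        "size": float(ev.get("size", 0.0)),   # default when the field is absent
        "timestamp": int(ev["timestamp"]),
    }
```

Because each partition is consumed by exactly one member of the group, running many copies of this function in parallel never reorders events within a pair.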

Once processed, the data is sorted and produced to a collection of standardized topics, each with a consistent message schema. At this point all of the data is structured, so it is encoded as Avro.

Processed data is archived into object storage by a process that converts the Avro data into Apache Parquet, a column-oriented format with compact file sizes, well suited to structured data destined for object storage. This process periodically collects chunks of data, merges them into a single file, and stores it under a datetime partition. The file structure organises events by year, month, day, and hour for efficient indexing and retrieval.
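The year/month/day/hour layout can be sketched as a small path builder. The exact prefix format (Hive-style `key=value` segments) is an assumption; the document only specifies the datetime hierarchy.

```python
from datetime import datetime, timezone


def partition_path(topic: str, ts: datetime) -> str:
    """Build the object-storage key prefix for an hourly Parquet chunk,
    following the year/month/day/hour layout described above. The
    Hive-style key=value segments are an illustrative choice."""
    ts = ts.astimezone(timezone.utc)  # partition on UTC wall-clock time
    return (f"{topic}/year={ts.year}/month={ts.month:02d}/"
            f"day={ts.day:02d}/hour={ts.hour:02d}/")
```

A query for one hour of one topic then only has to list a single prefix instead of scanning the whole archive, which is what makes retrieval efficient.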

Another process places the data into a Postgres database, where it is indexed and queryable. PowerQuery connects to this database directly, letting users write arbitrary SQL and perform advanced analytics. This consumer inserts up to 1,000 rows at a time, placing each event topic into its own table.
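A hedged sketch of that loader, using the standard library's sqlite3 as a stand-in for Postgres so the example is self-contained: one table per topic, rows inserted in batches of at most 1,000. The table layout and column names are illustrative assumptions.

```python
import sqlite3

BATCH_SIZE = 1000  # matches the per-batch row limit described above


def load_batch(conn: sqlite3.Connection, topic: str,
               rows: list[tuple]) -> None:
    """Insert a chunk of consumed events into the topic's own table,
    at most BATCH_SIZE rows per statement. sqlite3 stands in for
    Postgres here; the schema is illustrative."""
    table = topic.replace("-", "_")  # one table per event topic
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(exchange TEXT, symbol TEXT, price REAL, ts INTEGER)"
    )
    for start in range(0, len(rows), BATCH_SIZE):
        conn.executemany(
            f"INSERT INTO {table} VALUES (?, ?, ?, ?)",
            rows[start:start + BATCH_SIZE],
        )
    conn.commit()
```

Batching keeps the round-trip count low without holding an unbounded amount of data in memory, and the per-topic tables give each event type its own indexes.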

Finally, the data is consumed by our websocket broadcaster consumer group, which streams market events out to our users. These are typical Kafka consumers that micro-batch messages to increase throughput. The consumer group is highly scalable, and rebalances occur as the group scales up and down to meet throughput demands and handle any drops.
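The micro-batching idea can be sketched independently of any Kafka client: accumulate consumed messages and emit them in groups, so the broadcaster sends one websocket frame per batch rather than one per event. The batch size is a hypothetical tuning parameter.

```python
from typing import Iterable, Iterator


def micro_batches(messages: Iterable, max_batch: int = 500) -> Iterator[list]:
    """Group a stream of consumed messages into micro-batches. Sending
    one frame per batch trades a little latency for much higher
    throughput; max_batch is an illustrative setting."""
    batch = []
    for msg in messages:
        batch.append(msg)
        if len(batch) >= max_batch:
            yield batch
            batch = []
    if batch:  # flush the partial batch at the end of the poll
        yield batch
```

A production broadcaster would also flush on a timer so quiet periods don't delay delivery, but the size-based flush above is the core of the technique.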
