Streaming Data Platform

Pranav Kohli
2 min read · Dec 4, 2022


Most companies today are racing to put their data systems in place. Gone are the days when the business was happy with reports and dashboards built on stale data. Real-time data is the need of the hour.
Let’s see how we can enable analytics on real-time data.
Most companies use an RDBMS as their transactional system to store data. It works great for inserts, updates, deletes and point queries (on indexed tables), but it performs poorly for analytics. Below is an architecture that can help you build a real-time analytics platform.

Change Data Capture System:

To build a data platform, you first need to ingest data into it. Here we use Debezium Server Iceberg as the CDC tool. It hooks into the RDBMS, captures change data by reading the WAL files, and writes it to an Iceberg sink on the cloud storage of your choice. This sink supports two modes: Append and Upsert.
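To make the difference between the two modes concrete, here is a minimal Python sketch. It is not the actual consumer code: the events are hypothetical and only loosely modelled on Debezium's change event envelope (`op`, `before`, `after`), the `id` key is an assumption, and delete handling is simplified.

```python
# Hypothetical, simplified CDC events loosely modelled on Debezium's envelope:
# "op" is c (create), u (update) or d (delete); "before"/"after" hold row images.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]


def apply_append(events):
    """Append mode: every change event becomes a new row, so history is kept."""
    rows = []
    for event in events:
        # Deletes carry no "after" image, so fall back to the "before" image.
        image = event["after"] if event["after"] is not None else event["before"]
        rows.append({**image, "op": event["op"]})
    return rows


def apply_upsert(events, key="id"):
    """Upsert mode: keep only the latest state per key (assumed to be "id")."""
    table = {}
    for event in events:
        if event["op"] == "d":
            table.pop(event["before"][key], None)
        else:
            table[event["after"][key]] = event["after"]
    return list(table.values())


appended = apply_append(events)
upserted = apply_upsert(events)
print(len(appended), "rows in append mode")
print(len(upserted), "rows in upsert mode:", upserted)
```

Append mode suits auditing and replaying history, while upsert mode gives you a table that mirrors the current state of the source.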

Datalake:
We use the Iceberg format on top of S3 as the datalake. Iceberg is an open table format for massive analytic datasets. It has the following capabilities:

  • Snapshot-based read-write separation and backfill
  • Unified stream and batch reads and writes
  • No forced binding between compute and storage engines
  • Multi-version data with ACID semantics
  • Table, schema, and partition evolution
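The first capability, snapshot-based read-write separation, is easier to see in a toy model. The sketch below is not Iceberg's API; `SnapshotTable` and its methods are hypothetical names that only illustrate the idea that each commit publishes a new immutable snapshot while readers keep scanning the one they pinned.

```python
class SnapshotTable:
    """Toy table where each commit creates a new immutable snapshot."""

    def __init__(self):
        self._snapshots = [()]  # snapshot 0 is the empty table

    def commit(self, new_rows):
        """Writer: publish a snapshot made of the previous rows plus new ones."""
        self._snapshots.append(self._snapshots[-1] + tuple(new_rows))
        return len(self._snapshots) - 1  # id of the new snapshot

    def current_snapshot_id(self):
        return len(self._snapshots) - 1

    def scan(self, snapshot_id=None):
        """Reader: scan a pinned snapshot (defaults to the latest one)."""
        if snapshot_id is None:
            snapshot_id = self.current_snapshot_id()
        return list(self._snapshots[snapshot_id])


table = SnapshotTable()
s1 = table.commit([{"id": 1}])
pinned = table.current_snapshot_id()  # a reader pins this snapshot
s2 = table.commit([{"id": 2}])        # a writer commits in the meantime

rows_pinned = table.scan(pinned)      # the reader still sees only id 1
rows_latest = table.scan()            # new readers see both rows
rows_old = table.scan(s1)             # reading an earlier snapshot again
print(rows_pinned, rows_latest, rows_old)
```

Because old snapshots are never modified, writers do not block readers, and earlier versions stay available for backfills.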

Debezium Server provides a service provider interface that is used to implement the Apache Iceberg consumer. The Iceberg consumer replicates database CDC events to the destination Iceberg tables.

Query Engine:

In the lakehouse concept, a query engine is used to query data at scale on a datalake. Here you can use Dremio or Presto to query Iceberg tables directly on top of S3. Both share a similar architecture: a coordinator node with multiple executor/worker nodes working in parallel.
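The coordinator/worker pattern can be sketched in a few lines of Python. This is only an illustration of the scatter-gather idea, not how Dremio or Presto are implemented; the data files, the `worker_partial_sum` function and the thread pool standing in for worker nodes are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data files, each a list of (region, amount) rows.
data_files = [
    [("eu", 10), ("us", 5)],
    [("eu", 7), ("apac", 3)],
    [("us", 20)],
]


def worker_partial_sum(rows):
    """Worker: aggregate one file into partial sums per region."""
    partial = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0) + amount
    return partial


def coordinator(files, workers=3):
    """Coordinator: hand files to workers, then merge their partial results."""
    totals = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(worker_partial_sum, files):
            for region, amount in partial.items():
                totals[region] = totals.get(region, 0) + amount
    return totals


totals = coordinator(data_files)
print(totals)
```

The coordinator never touches raw rows; it only plans the work and merges small partial results, which is what lets such engines scale out by adding workers.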

Bonus:

Having different databases exposed to users can be a pain point. It’s better to have a unified access point for querying different sources. This is where Redash shines. It provides an abstraction on top of different data sources: you can configure endpoints for different databases/data sources and make them available in one place. You can also query and join results from different databases through it. A pretty nifty tool to have in your arsenal.
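The idea of joining results from different databases can be sketched as follows. This does not use Redash's API; it is a small illustration with Python's built-in `sqlite3`, where two in-memory databases stand in for separate sources and all table names and rows are made up.

```python
import sqlite3

# Two independent "sources", standing in for separate production databases.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)")
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 99.0), (2, 11, 25.0), (3, 10, 1.0)],
)

crm_db = sqlite3.connect(":memory:")
crm_db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm_db.executemany("INSERT INTO customers VALUES (?, ?)", [(10, "Asha"), (11, "Ben")])

# Step 1: run one query per source and collect the result sets.
order_rows = orders_db.execute("SELECT customer_id, amount FROM orders").fetchall()
customer_rows = crm_db.execute("SELECT id, name FROM customers").fetchall()

# Step 2: load both result sets into a scratch database and join them there.
scratch = sqlite3.connect(":memory:")
scratch.execute("CREATE TABLE q_orders (customer_id INTEGER, amount REAL)")
scratch.execute("CREATE TABLE q_customers (id INTEGER, name TEXT)")
scratch.executemany("INSERT INTO q_orders VALUES (?, ?)", order_rows)
scratch.executemany("INSERT INTO q_customers VALUES (?, ?)", customer_rows)

joined = scratch.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM q_orders AS o
    JOIN q_customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(joined)
```

The join happens on query results rather than on the source tables, so it works best when each per-source query already filters down to a manageable result set.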
