Data analytics platform
Data has become the new oil of the 21st century. Companies are scrambling to pull data from multiple sources, then clean it, transform it, and make it available for real-time analytics.
There was a time when Hadoop was the de facto choice for a data platform, but in 2022 things have changed a lot. The modern data stack gives us a plethora of new technologies and tools that are better suited for real-time analytics.
Currently at my company, we use Postgres as the main database to capture all the data from our app. It shines as a transactional database, but when it comes to analytics (large merge and join queries), it fails miserably. Some analytical queries take ~5 hours just to run. To cut this lead time and enable real-time decisions, we built a platform centered on Snowflake as the data warehouse.
Here’s the architecture we currently use to ingest and transform hundreds of gigabytes of data daily. CDC data from Postgres is captured by DMS and dumped into an S3 bucket. From there, an Airflow job scheduled to run every hour picks up this CDC data and converts it to Hudi format. The data then gets loaded into Snowflake, and any post-load transformations are done in Snowflake using dbt.
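To make the hourly orchestration concrete, here is a minimal Airflow sketch. The DAG id, task names, and script paths are hypothetical placeholders, not our production code:

```python
# Minimal Airflow DAG sketch: hourly CDC -> Hudi -> Snowflake -> dbt.
# All ids, paths, and commands below are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cdc_to_snowflake",          # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",        # matches the 1-hour cadence above
    catchup=False,
) as dag:
    # Spark job that reads raw DMS CDC files from S3 and upserts them
    # into a Hudi table (see the PySpark sketch later in this post).
    convert_to_hudi = BashOperator(
        task_id="convert_cdc_to_hudi",
        bash_command="spark-submit jobs/cdc_to_hudi.py",  # placeholder path
    )

    # Incremental copy of the transformed data into Snowflake.
    load_snowflake = BashOperator(
        task_id="load_to_snowflake",
        bash_command="python jobs/copy_to_snowflake.py",  # placeholder path
    )

    # Post-load transformations in Snowflake via dbt.
    run_dbt = BashOperator(
        task_id="run_dbt_models",
        bash_command="dbt run --project-dir /opt/dbt",    # placeholder dir
    )

    convert_to_hudi >> load_snowflake >> run_dbt
```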
This is built on top of the data lakehouse concept. Using an open format like Hudi as the transformation layer gives two benefits (a write sketch follows the list):
- Cost is lower, since the transformation runs in Spark and the storage layer is S3.
- The final data becomes available at the S3 layer itself.
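As a rough sketch of the Hudi conversion step, here is what upserting the DMS CDC files into a Hudi table on S3 might look like in PySpark. The bucket paths, table name, and key columns are assumptions for illustration:

```python
# PySpark sketch: upsert DMS CDC output (Parquet on S3) into a Hudi table.
# Bucket names, table name, and key columns are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cdc_to_hudi")
    # Hudi ships as a Spark bundle jar; the exact artifact depends on
    # your Spark/Hudi versions and must be on the classpath.
    .getOrCreate()
)

# Raw CDC files that DMS dumped into S3 over the last hour.
cdc_df = spark.read.parquet("s3://raw-cdc-bucket/orders/")  # placeholder path

hudi_options = {
    "hoodie.table.name": "orders",                             # placeholder
    "hoodie.datasource.write.recordkey.field": "order_id",     # primary key
    "hoodie.datasource.write.precombine.field": "updated_at",  # dedupe field
    "hoodie.datasource.write.operation": "upsert",             # merge CDC rows
}

(
    cdc_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://lake-bucket/hudi/orders/")                     # placeholder
)
```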
Once the transformation is done, the final data is copied over to Snowflake incrementally at 1-hour intervals.
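One way to do this incremental copy, assuming an external stage pointed at the table's S3 path, is a scheduled COPY INTO. The connection parameters, stage, and table names here are made up for the example; COPY INTO skips files it has already loaded, which is what makes the hourly run incremental:

```python
# Sketch: hourly incremental load into Snowflake via COPY INTO.
# Connection parameters, stage, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder credentials
    user="loader",
    password="...",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)

# COPY INTO only loads files it has not seen before, so re-running this
# every hour picks up just the data written since the last run.
conn.cursor().execute(
    """
    COPY INTO raw.orders
    FROM @lake_stage/orders/           -- external stage over the S3 path
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """
)
conn.close()
```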
Data availability:
- S3 layer. For ad hoc analysis or ML purposes, users can consume data directly from S3 (see the read sketch after this list).
- Snowflake. Recommended for BI-heavy use cases where the user wants to run complex queries, aggregates, and joins.
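For the S3 path, a consumer can query the Hudi table directly with Spark. The path is the same hypothetical one used in the write sketch above, and the column names in the query are illustrative:

```python
# Sketch: ad hoc / ML read directly from the Hudi table on S3.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc_read").getOrCreate()

# Load the Hudi table; Spark resolves the latest snapshot of each record.
orders = spark.read.format("hudi").load("s3://lake-bucket/hudi/orders/")

# Example ad hoc aggregation ("status" and "country" are made-up columns).
orders.filter("status = 'shipped'").groupBy("country").count().show()
```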
With this architecture, our hourly jobs take ~15 mins each, and we get near real-time data for analysis. Hudi also supports streaming ingestion via its DeltaStreamer, in case we ever want data at minute-level latency.
At the end of the day, data begets data, so keep it in good shape.