Every data-driven company starts out managing its data in SQL and NoSQL databases, but as the volume grows, performing complex analysis on those underlying stores becomes increasingly time-consuming. In the current era, where technology is consumed at a fast pace, data is growing by leaps and bounds. This forces a shift toward handling and analysing Big Data at a scale that legacy SQL and NoSQL DBs can't handle easily.
It goes without saying that such was our case too! Hence we started building our own Data Lake based on streaming data, a Python-Spark ETL process, and AWS Athena, which is built on the Presto engine. We thought we'd write this blog to explain our approach and way of thinking to the world, so hold on to your mouse and keep scrolling.
“In God we trust. All others must bring data” — No, just kidding!
Idea / Motivation
We at slice started building our tools (Mobile app and backend infra) back in 2015/16. We were small back then, trying to figure out the ways to engage our target group of young professionals and students with our idea of providing them a credit line based on their creditworthiness.
Initially, we built our systems on a NoSQL DB because of its ease of use and the flexibility a schemaless store offers for modifying data and documents. With our consumer base being small, it was easy enough to use for analytics at that scale to build our credit policies and risk models.
Over time, as we grew by leaps and bounds, our customer base grew with us. The NoSQL DB was still good enough for the mobile-app-related functionality, but it couldn't easily serve analytical use cases. So we decided to start using Redshift, the AWS analytical DB engine. While this appeared to be the right choice at the time, our growth and expansion plans, coupled with an influx of new customers, meant the data relevant to analytical use cases started growing exponentially. Just to give you an idea: the data we received in all of 2016-2017 is nothing compared to the volume we now receive in a single month.
AWS Redshift is nice for analytical use cases, but it comes with a lot of technical expertise requirement and restrictions, namely:
- Sort and distribution keys must be defined for every new table by anticipating all the query patterns the table will be used in, including patterns that may only emerge in the future.
- Queries are fast only if every best practice is followed, which again requires expertise in DB design, clustering, and so on.
- Given current trends in data growth and usage, nothing stays constant forever: the table schema that looks best today may not be good enough for tomorrow's querying and analytical needs.
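To illustrate the first point: in Redshift, the distribution and sort strategy is baked into the DDL at `CREATE TABLE` time, so it has to anticipate future query patterns. A minimal sketch (the table and column names here are hypothetical, not our actual schema):

```python
# Hypothetical Redshift DDL: DISTKEY and SORTKEY are fixed when the table
# is created, so they must anticipate every future query pattern.
DDL = """
CREATE TABLE user_events (
    user_uuid   VARCHAR(36),
    event_type  VARCHAR(64),
    event_time  TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (user_uuid)   -- co-locate each user's rows on one slice for joins
SORTKEY (event_time); -- lets Redshift skip blocks on time-range filters
"""

def uses_explicit_keys(ddl: str) -> bool:
    """Check that a DDL statement pins down both keys up front."""
    return "DISTKEY" in ddl and "SORTKEY" in ddl
```

If the dominant query pattern later shifts (say, filtering by `event_type` instead of time), changing these keys means redistributing or rebuilding the table.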
Similar was the case for us! The overhead of maintaining table and data-distribution sanity, the increase in query times, and the data's exponential growth led us to think of creating our own Data Lake.
We process a significant amount of data to make sure our policies are in line with the current market conditions, RBI guidelines, and special situations like the coronavirus outbreak, the moratorium on interest, etc.
So, for our Data Lake, we chose S3 + Athena, which is comparatively faster and cheaper in terms of storage and accessibility for analytical processes. It can query TBs of data within seconds, and all you pay for is the data scanned per query, with zero cost for running and managing infrastructure, unlike traditional DB engines.
This architecture was proposed to achieve the following key results:
- Prepare a Data Lake: a source of data that can be used for analytical purposes and is easily accessible and queryable.
- Reduce the data stored in MongoDB to only what is relevant for real-time processing.
- Reduce the overall cost being incurred on MongoDB for storing, indexing and accessing the data.
- Set up a process to transform the data into a ready-to-digest form, so that information and facts are readily available.
- Perform real-time aggregation on the streaming data so that the backend APIs can consume ready-made data from RDS, reducing backend API response times as a result.
The backend server is responsible for dumping the data into both Mongo and S3.
AWS Kinesis dumps the data into an S3 bucket that we use as an unprocessed datastore. The data is dumped periodically in near real time and remains in the raw format received from the devices. In addition, AWS Kinesis Firehose uses an AWS Lambda function to transform the real-time stream data and dump it into AWS RDS for backend API consumption.
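A Firehose transformation Lambda follows a fixed contract: each incoming record carries base64-encoded data, and the function must return a record with the same `recordId`, a `result` status, and the transformed payload re-encoded. A minimal sketch of that contract (the reshaping logic and field names are hypothetical, not our actual transform):

```python
import base64
import json

def handler(event, context):
    """Kinesis Firehose transformation Lambda: decode, reshape, re-encode."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical reshaping: keep only the fields the backend APIs need.
        transformed = {
            "user_uuid": payload.get("user_uuid"),
            "event_time": payload.get("event_time"),
            "value": payload.get("value"),
        }
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # "Dropped"/"ProcessingFailed" are also valid
            "data": base64.b64encode(
                json.dumps(transformed).encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```

Firehose matches returned records to incoming ones by `recordId`, so the function must echo every id back even for records it drops.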
The data from the unprocessed datastore goes through an ETL process via AWS EMR or AWS Glue, so that it can be processed and partitioned as required for analysis. This step also reduces unwanted clutter and duplicates by comparing the unprocessed data with what is already in the data lake.
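The dedup step can be sketched in plain Python (field names are hypothetical; in practice this runs as a Spark job over the raw S3 data rather than in-memory):

```python
def dedupe_against_lake(raw_records, lake_keys):
    """Keep only raw records whose (user_uuid, event_time) key is not
    already in the data lake, and drop duplicates within the batch too."""
    seen = set(lake_keys)
    fresh = []
    for rec in raw_records:
        key = (rec["user_uuid"], rec["event_time"])
        if key not in seen:
            seen.add(key)
            fresh.append(rec)
    return fresh
```

The same idea in Spark is a join (or `dropDuplicates`) against the existing Parquet partitions before appending new files.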
The data in MongoDB is kept with a TTL of 45 or 60 days depending on the use case, which helps us keep MongoDB's growth in size in check.
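MongoDB implements this kind of expiry with a TTL index: documents are deleted by a background monitor once the indexed date field is older than `expireAfterSeconds`. A sketch of the index options (the collection and field names are hypothetical):

```python
SECONDS_PER_DAY = 24 * 60 * 60

def ttl_index_spec(field: str, days: int) -> dict:
    """Options for a MongoDB TTL index on `field`: documents expire once
    the field's date is older than expireAfterSeconds."""
    return {"keys": field, "expireAfterSeconds": days * SECONDS_PER_DAY}

# Hypothetical usage with pymongo against a live MongoDB:
#   coll.create_index("created_at", expireAfterSeconds=45 * SECONDS_PER_DAY)
spec = ttl_index_spec("created_at", 45)
```

Since raw data already lands in S3 via Kinesis, letting Mongo expire old documents loses nothing: the data lake remains the long-term store.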
Our use cases mainly revolve around a user's UUID and the time at which the data was captured. Accordingly, we finalised the partitions for the different datasets as per the use cases of the different collections, as of now.
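In S3 terms, partitioning by UUID and date means writing objects under Hive-style key prefixes that Athena can prune at query time; a sketch (the bucket layout and field names are hypothetical):

```python
from datetime import date

def partitioned_key(user_uuid: str, event_date: date, filename: str) -> str:
    """Build a Hive-style S3 key so Athena scans only matching partitions
    when a query filters on user_uuid or dt."""
    return (
        "datalake/events/"
        f"user_uuid={user_uuid}/"
        f"dt={event_date.isoformat()}/"
        f"{filename}"
    )
```

A query like `WHERE dt = '2020-05-01'` then touches only the objects under that `dt=` prefix instead of the whole dataset.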
The data is converted to Parquet format with Snappy compression, giving us up to 8x size reduction in S3, which lowers both the storage cost and the cost of accessing data, since that is billed by size scanned.
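To see why this matters under scan-based pricing, a rough back-of-the-envelope sketch (the $5-per-TB figure is Athena's published on-demand rate at the time of writing; the data sizes and column fraction are made-up illustrations):

```python
PRICE_PER_TB_SCANNED = 5.0  # USD; Athena on-demand rate (illustrative)

def athena_query_cost(raw_tb: float, compression_ratio: float,
                      columns_fraction: float) -> float:
    """Cost of one query: Parquet shrinks the data on disk, and its
    columnar layout means only the queried columns get scanned."""
    scanned_tb = (raw_tb / compression_ratio) * columns_fraction
    return scanned_tb * PRICE_PER_TB_SCANNED

# Hypothetical numbers: 8 TB of raw JSON, 8x Snappy/Parquet compression,
# and a query touching a quarter of the columns.
cost = athena_query_cost(8.0, 8.0, 0.25)  # 0.25 TB scanned -> 1.25 USD
```

The compression and the columnar pruning multiply, which is why the combination of Parquet + Snappy + partitioning dominates the cost picture.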
Besides, with the data in Parquet, querying is much faster and easier because of its columnar layout. This effectively gives you the power to query on any particular field as if it were a primary or sort key in itself.
- Moving things out of Mongo to our data lake resulted in 65-70% lower data storage and transfer costs, thereby also helping us reduce the instance size requirement.
- The current data storage, transfer, and backup costs for Mongo are reduced by 70-75%.
- With less data stored and less dependency on MongoDB for analytics, we no longer need a heavy Atlas AWS instance configuration either, which cut infrastructure cost by a further ~50%.
- Previously, we had to download the data from Mongo and convert it into a consumable format for reporting or analysis. The Data Lake has automated this process and reduced the time involved, resulting in faster analysis.
- Complete removal of the internal product API servers' dependency on MongoDB data, by transforming the incoming sync data in real time and making it available in AWS RDS to be used directly by the APIs, without the API servers having to perform complex calculations.
- Creation of a highly scalable data sync engine, a project that caters to the data-sync and data-lake use cases, allowing the Slice Apps backend servers to serve user requests better, without sync processes hogging our current backend server infrastructure.
So, that was all about the thought process behind the Data Lake we built. Here at slice, we are continually building, developing, and optimizing our systems, so stay tuned for more such technical scoops!