S3 to Redshift / RDS data propagator

The one thing that is almost always taken for granted by firms is data. Be it customer information, sensitive information from the industry, trade secrets exclusive to the company or even basic employee information — data should be like treasure and its protection the utmost priority. However, this is easier said than done. Data is of no use if it’s always hidden and padlocked from additions and modifications. So, what do you do in such a situation? You monitor and filter!

The RDS data propagator does exactly that, minus any extra workforce. It is an automated tool that creates required tables and loads data into them using the S3 file upload operations. This, in turn, performs insert and update operations on Redshift tables, using the best practices for creating/ inserting/ updating tables, for the analytics.

So, now you must be thinking, what problems exactly did this automation solve and how did this help slice in streamlining its data? Well, here’s how:

How RDS data propagator streamlined data handling at Slice 

Here at Slice, we regularly use data to analyse trends and assess risk factors. In general, we can categorise data-related users in 2 distinct segments. They are:

  1. The Producers – (Softwares, SDEs, ETL developers)  
  2. The Consumers – (Data Analysts, Risk Analysts)

So long, the bridge between the producers and consumers were the Database Administrators (DBAs).

But that’s old school now. In the current IT Startup culture, maintaining the above structure is not always possible due to cost crunches, fast-paced developments, small team sizes, etc. In such a scenario, eventually, both the producers (developers) and consumers (analysts) have to perform certain tasks exclusive to the DBA in order to solve their topical data needs. While doing so, people often fail to make proper decisions regarding the underlying schema of their tables. This could be due to lack of knowledge or even sometimes by not being able to pre-calculate the future needs of the data. The biggest downside? Well, this, in the long run, leads to bad query performance as the data size increases.

Our case, you see, was no different. Let me take you through our story for a better viewpoint!

Reading data from a given table

Creating a table

What Keys?? 

Primary Keys, Foreign Keys, Sort Keys, Distribution Keys

Solution???

Research

Ask for help

Insert into table/ Update table

The part where we read the data from the tables was fine. 

The catch, however, lies in dealing with creating tables or inserting/ updating data into those tables for Analytical use cases. The Analysts dreaded upon performing such operations. Also, from a data security perspective, permission to run DML queries can’t be provided to all. In such a scenario the DBA used to be the go-to person for their help. On the other hand, with a very small data team in the firm, it was not feasible to provide a better turnaround time, while these small but important tasks needed to be prioritized, deprioritizing other tasks at hand. With an increasing Analyst team size and various data stores being added, these types of requests became more and more frequent, and as the old saying in IT goes : 

“If you have to do it more than twice, AUTOMATE IT.”

Hence, this tool was built.

So now that you know why it is required, let’s look into the technicalities. 

Here are the tactical details!

Technology Stack

  • AWS S3
  • AWS Lambda
  • AWS Redshift
  • Slack (for notifications)

Architecture

Features

  • Creating Redshift tables involving a check that no table is created without sort and distribution keys.
  • Updating table schemas.
  • Inserting and updating data into tables
  • Success and Failure notifications
  • Scaled up performance through Lambda

Usage Guide

The tool makes use of S3 to process files and an AWS lambda function listens to all S3 object creation events. Based on certain rules specified in the sections below, the lambda takes action and runs required Redshift queries.

The S3 to Redshift data propagator follows a strict folder and file structure, as shown in the architecture diagram

  • <destination_name> : Refers to the type of DB being used like: redshift, postgres, etc
  • <database_name>: DB name (postgres) or Schema name (redshift) in which the table is present
  • <table_name>: Name of the respective table which will contain the data
  • schema:
    • create: contains a json file specifying the column names and datatypes along with the primary, foreign, sort and distribution keys.
    • update: contains a json file specifying new column(s) to be added to the existing table
  • input:
    • load: csv files to be inserted directly into the tables are to place in this folder
    • update: csv files to be used for upserting into the tables are placed here.
      • Note: update functionality can happen if the table has any primary key defined.
  • succeeded: csv files that are loaded/ uploaded into the tables successfully are moved to this folder
  • failed: csv files that fail to get loaded into the tables are moved to this folder.

Eg:-

Creating Redshift tables

Schema definition:

The schema of a table should be provided in the format displayed below:

{
"column_specification" : {
<column_name} : <redshift_data_type}
},
"primary_key" : [column1, column2, ….], # optional
"dist_key" : <column_name}, #required
"sort_key" : [column1, column2, ….] #required
}

Eg:

{
"column_specification": {
"id": "INTEGER",
"name": "VARCHAR",
"phone_no": "INTEGER",
"time_stamp": "VARCHAR",
"uuid": "VARCHAR"
},
"primary_key": [
"uuid"
],
"dist_key": "uuid",
"sort_key": [
"uuid"
]
}

A json file is to be made using the above specification given.

Path to load the file : s3://****-data-propagator/destination=<destination_name>/database=<database_name>/table_name=<table_name>/schema/create/<file_name>.json

Once the file is loaded into the appropriate path, the lambda triggers and fires a redshift command to create a table as per the schema details provided in the JSON file.

Updating table schemas

There are two types of operations supported under this:

  1. Adding column(s)
  2. Deleting column(s)

Adding columns:

Table columns to be added should be provided in the format displayed below:

{
"column_specification" : {
"<column_name>" : "<redshift_data_type>"
}, #required
"operation_type" : "add" #required
}

eg:

{
"column_specification": {
"city": "VARCHAR",
"state": "VARCHAR"
},
"operation_type": "add"
}

A json file is to be made using the above specification given.

Deleting column(s):

Table columns to be deleted should be provided in the format displayed below:

{
column_specification : [column1, column2, ….], #required
operation_type : "drop" #required
}

eg:

{
"column_specification": ["city", "state"],
"operation_type": "drop"
}

A json file is to be made using the above specification given.

Path to load the file : s3://*****-data-propagator/destination=<destination_name>/database=<database_name>/table_name=<table_name>/schema/update/<file_name>.json

Once the file is loaded into the appropriate path, the columns in the specified redshift table are added or removed (dropped).

Note: Redshift doesn’t support adding or dropping multiple columns using a single query. However, this feature adds the functionality to support adding and dropping multiple columns by providing the info in the JSON file as specified above.

 Inserting and updating data into tables

A CSV file containing the data to be uploaded into the redshift table should be uploaded into the S3 bucket.

Path to load the file for inserting : s3://*****-data-propagator/destination=<destination_name>/database=<database_name>/table_name=<table_name>/input/load/<file_name>.csv

Path to load the file for updating (upserting) : s3://*****-data-propagator/destination=<destination_name>/database=<database_name>/table_name=<table_name>/input/update/<file_name>.csv

Success and Failure notifications

The success and failure notification of any process started using this architecture is sent on the Slack channel, in our case we send it to #data-loader-noitification

Success Notification

Failure Notification

Other usage

Apart from dealing with manual file upload tasks, this tool is also used for managing data insertion and updation flow into Redshift by our ETL cron jobs. This is because, it follows the best practices for inserting and updating data in redshift and is easy to plug and play with the manual and automated data flow requirements

Future scope

  1. Allowing the feature to update required columns. Currently, the tool supports updating an entire row.
  2. Adding a feature to support different file types like JSON, Parquet.
  3. Adding more destinations to the list, like Postgres, MySQL, etc.