Cloud Custodian at slice

Regulation is an interesting topic: it is directly proportional to how fair and clean a process is, and inversely proportional to the freedom with which that process can be performed. Take Indian markets, for instance: when the 1991 reforms removed regulation, the productivity of the market increased, but lawmakers still have to come back every year to amend the rules to keep the market fair and clean.

At slice, our infrastructure is growing, and so is our team. We want developers to stay productive: ideally, a developer should be able to get a bucket up and ready in seconds. But we also want to make sure every bucket is launched with the right configuration and is compliant with our organisation’s policy. It’s extremely tricky to convey every required configuration to every developer in an ever-growing team. Would you rather have them just create the bucket, or first wade through a document on how to launch a bucket in the org? See, security and productivity are inversely proportional, just like regulation and markets are.

Enter Cloud Custodian.

Cloud Custodian runs a set of policies against your cloud resources. It checks whether a resource is compliant with the filters defined in the policy, and can also perform an action if you want, say, making private the bucket you accidentally made public.

A typical Cloud Custodian policy for the above use case looks like this.
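We can’t share our exact production policy here, so here is a minimal hedged sketch for the public-bucket example (the filter and action shown are standard Cloud Custodian S3 options, but the policy itself is illustrative):

```yaml
policies:
  - name: s3-deny-global-access
    resource: aws.s3            # run against every S3 bucket
    filters:
      - type: global-grants     # buckets whose ACL grants access to AllUsers
    actions:
      - type: delete-global-grants  # strip those public grants back off
```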

It has two major components:

  1. Filters: The set of filters to run against your resources. You can apply operators and combine filters together. Cloud Custodian comes with predefined standard filters for each resource, or you can create custom ones by filtering on values returned by the describe call for that AWS resource, e.g. aws ec2 describe-instances.
  2. Actions: The set of actions you want to perform on the resources selected by the filters. Again, Cloud Custodian comes with predefined standard actions for each resource, or you can attach a Lambda handler to perform any custom action.

Our infrastructure team defines the right configuration for launching an S3 bucket and commits it into a git repo. Our custodian, running on the hercules platform, picks these policies up from an S3 bucket, runs them against all resources across our multiple AWS accounts, sends alerts, and aggregates findings into AWS Security Hub. Lastly, it performs a remediation action if there is a critical configuration mismatch.

Drawing on the market analogy, the custodian runs like the constant regulator in a market, keeping it fair and clean, while the developers are the entrepreneurs who stay productive without ever having a second thought about security.

So, this was all about the CloudCustodian at slice. To know more about the amazing things that slice’s engineering team does, keep an eye on this space! 

Slice Data Lake

Every data-driven company starts out managing data with SQL and NoSQL databases, and as the amount of data grows, complex analysis on those underlying stores becomes time-consuming. In the current era, where technology is consumed at a fast pace, data is growing by leaps and bounds. This forces a shift towards handling and analysing big data, which legacy SQL and NoSQL databases can’t do easily.

It goes without saying that such was our case too! So we started building our own data lake based on streaming data, a Python/Spark ETL process, and the Presto-based AWS Athena. We thought of writing this blog to explain our approach and way of thinking to the world, so hold on to your mouse and keep scrolling.

“In God we trust. All others must bring data” — No, just kidding! 

Idea / Motivation

We at slice started building our tools (mobile app and backend infra) back in 2015-16. We were small back then, trying to figure out ways to engage our target group of young professionals and students with our idea of providing them a credit line based on their creditworthiness.

Initially, we built our systems on a NoSQL DB because of its ease of use and the ease of modifying data and documents, thanks to the fluid schema that NoSQL allows. With a small consumer base, this worked for analytics up to a certain scale as we built our credit policies and risk models.

Over time, as we grew by leaps and bounds, our customer base grew with us. The NoSQL DB was still good enough for the mobile-app functionality, but it couldn’t easily serve analytical use cases. So we decided to start using Redshift, the AWS analytical DB engine. While this appeared the right choice at the time, with our growth and expansion plans coupled with an influx of new customers, the data for analytical use cases started growing exponentially. Just to give you an idea: the data we received across 2016-2017 is nothing, in terms of volume, compared to the data we now receive in a single month.

AWS Redshift is nice for analytical use cases, but it demands a lot of technical expertise and comes with restrictions, namely:

  1. Sort and distribution keys need to be defined for every new table, keeping in mind all the query patterns the table will be used in, including ones that may only emerge in the future.
  2. Queries are fast if every best practice is followed, but that again requires expertise in DB design, clustering, etc.
  3. Given current trends in data growth and usage, nothing remains constant forever. The table schema that looks best now may not be good enough for future querying and analytical needs.

Similar was the case for us! The overhead of maintaining table and data-distribution sanity, the increase in query timings, and the data’s exponential growth led us to build our own data lake.

We process a significant amount of data to make sure our policies are in line with the current market conditions, RBI guidelines, and special situations like the coronavirus outbreak, the moratorium on interest, etc.

So, for our data lake, we chose S3 + Athena, which is comparatively faster and cheaper in terms of storage and access for analytical workloads: it queries TBs of data within seconds, you pay only for the data scanned per query, and there is zero cost for running and managing infra, unlike with DB engines.
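To illustrate how simple querying the lake is, here is a hedged boto3 sketch for firing a query at Athena; the database and result-bucket names are placeholders, not our real resources:

```python
def build_athena_request(query: str, database: str, output_s3: str) -> dict:
    """Build the keyword arguments for athena.start_query_execution."""
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }


def run_lake_query(query: str) -> str:
    """Fire the query on Athena and return its execution id (needs AWS creds)."""
    import boto3  # imported here so the builder above stays dependency-free

    athena = boto3.client("athena", region_name="ap-south-1")
    resp = athena.start_query_execution(
        **build_athena_request(query, "datalake", "s3://example-athena-results/")
    )
    return resp["QueryExecutionId"]
```

Athena runs the query asynchronously: you poll get_query_execution with the returned id until it finishes, then read the results from the output location.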

Architecture

Explanation

This architecture was proposed to achieve the following key results:

  1. Prepare a data lake: a source of data for analytical purposes that is easily accessible and queryable.
  2. Reduce the data stored in MongoDB to only what is relevant for real-time processing.
  3. Reduce the overall cost incurred on MongoDB for storing, indexing and accessing the data.
  4. Set up a process to transform the data into a ready-to-digest form, so that information and facts are readily available.
  5. Do real-time aggregation on the streaming data so that the backend APIs can consume ready data from RDS, reducing backend API response times as a result.

The backend server dumps the data into both MongoDB and S3.

AWS Kinesis dumps the data into an S3 bucket that we use as an unprocessed data store. The data is dumped periodically, in near real time, and remains in the raw format received from the devices. In parallel, AWS Kinesis Firehose uses AWS Lambda to transform the real-time stream and dump it into AWS RDS for backend API consumption.

The data from the unprocessed store then goes through an ETL process on AWS EMR or AWS Glue, where it is processed and partitioned as required for analysis. This step also reduces unwanted clutter and duplicates by comparing the unprocessed data with what is already in the data lake.
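The dedup step can be sketched in plain Python; the (uuid, ts) key is illustrative, not our actual schema:

```python
def dedupe_records(raw_records, existing_keys):
    """Return only the records not already present in the data lake.

    raw_records: iterable of dicts carrying 'uuid' and 'ts' fields.
    existing_keys: set of (uuid, ts) pairs already in the lake.
    """
    fresh = []
    seen = set(existing_keys)   # copy so the caller's set is untouched
    for rec in raw_records:
        key = (rec["uuid"], rec["ts"])
        if key not in seen:     # also drops duplicates within the batch
            seen.add(key)
            fresh.append(rec)
    return fresh
```

In the real pipeline the same idea runs at Spark scale, comparing each raw batch against the partitions it would land in.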

The data in MongoDB is kept with a TTL of 45-60 days depending on the use case, which helps us keep MongoDB’s growth in check.

Data partitions

Our use cases mainly revolve around a user’s UUID and the time at which the data was captured. Accordingly, we finalised partitions for each dataset based on the use cases of the different collections.
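For a concrete picture, here is a hedged helper that builds a Hive-style partitioned S3 key from a UUID and a timestamp; the prefix and field names are illustrative, not our actual layout:

```python
from datetime import datetime


def partition_key(uuid: str, ts: datetime, filename: str) -> str:
    """Build an S3 key partitioned by uuid and event date (Hive-style)."""
    return (
        "uuid={}/".format(uuid)
        + "year={}/month={:02d}/day={:02d}/".format(ts.year, ts.month, ts.day)
        + filename
    )
```

Keys shaped like this let Athena prune partitions, so a query filtered on uuid and date scans only the matching prefixes instead of the whole bucket.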

Data storage

The data is converted to Parquet with Snappy compression, giving us up to 8x size reduction in S3, which lowers both storage cost and the per-size cost of accessing data.

Besides, with the data in Parquet, querying is much faster and easier thanks to its columnar layout, which effectively lets you query on any particular field as if it were a primary or sort key.

Outcomes

  • Converting the data to Parquet with Snappy compression gives us up to 8x size reduction in S3, resulting in low storage cost and low data-access cost.
  • Moving things out of Mongo to our data lake resulted in 65-70% less data storage and transfer costs, thereby helping us reduce the instance size requirement as well.
  • Mongo’s current data storage, transfer, and backup costs are reduced by 70-75%.
  • With less data stored and less analytics running against MongoDB, we no longer need a heavy Atlas AWS instance configuration either, which cut infrastructure costs by a further ~50%.
  • Previously, data had to be downloaded from Mongo and converted into a consumable format for reporting or analysis. The data lake has automated this process and reduced the time involved, resulting in faster analysis.
  • Complete removal of the internal product API servers’ dependency on MongoDB data, by transforming the incoming sync data in real time and making it available in AWS RDS for direct use by the APIs, with no complex calculations on the API servers.

Future Scope

  • Creating a highly scalable data sync engine, a project which caters to the data sync and data lake use cases, allowing the slice apps’ backend servers to serve user requests better, without sync processes hogging our current backend server infrastructure.

So, that was all about our thought process behind the Data Lake we incorporated. Here at slice, we are continually building, developing, optimizing our systems. So, stay tuned for more such technical scoops! 

Deployment of static websites with AWS Amplify

Every brand needs a website. And guess what? We have one and two too.

Once we were done building our JAM stack site, inevitably we had to change a few things here and there. That is what marked the beginning of the deployment drama.

We’ve done it. Here’s how you could do it too!

Upload your code on AWS S3, route it via AWS CloudFront for the HTTPS, connect it to AWS Route53 and you’re good to go.

Now, you must be thinking you’d only need to set these up the first time and later, you’ll just upload to s3. Right? Well, not so fast.

Every time we pushed something to the s3 bucket, we’d have to do all of this:

  1. Check for correct permissions for the S3 objects which were uploaded.
  2. Is it working? No, Change didn’t reflect 🤨
  3. Realize that the CloudFront cache wasn’t flushed 🤦
  4. Goto CloudFront (CF) dashboard,
  5. Search for that particular CF distribution.
  6. Switch to the cache invalidations tab.
  7. What was that format to flush all objects again? /* or *. Ah, It’s /*.
  8. Keep refreshing the CF deployment status page, till it says it’s done.
  9. Verify the change again.
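Steps 4-8 above beg to be scripted. Here is a hedged boto3 sketch (the helper names and the distribution ID are placeholders, not our actual tooling):

```python
import time


def build_invalidation_batch(paths, caller_reference=None):
    """Build the InvalidationBatch payload for CloudFront's create_invalidation."""
    return {
        "Paths": {"Quantity": len(paths), "Items": list(paths)},
        "CallerReference": caller_reference or str(int(time.time())),
    }


def flush_cache(distribution_id, paths=("/*",)):
    """Invalidate the given paths on a distribution (needs AWS creds)."""
    import boto3  # imported here so the builder above stays dependency-free

    cf = boto3.client("cloudfront")
    return cf.create_invalidation(
        DistributionId=distribution_id,  # e.g. "E2EXAMPLE" (placeholder)
        InvalidationBatch=build_invalidation_batch(list(paths)),
    )
```

Pair this with aws s3 sync in a pipeline and the manual dance disappears, which is essentially what Amplify automates for you.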

There’s always some inherent frustration going on when a change this small takes so many steps and so much effort.

What are some of the alternatives then?

  • Github Actions with Github Pages
  • Netlify
  • Heroku
  • Firebase
  • Digital Ocean
  • AWS CodePipeline
  • AWS Amplify

Since we use AWS, we tried AWS CodePipeline. It was an eye-opening ride, seeing all the service roles, stages, notification configuration on SNS for slack. 

But an eye-opening ride shouldn’t feel like you are greying early 👨‍🦳.

So, we switched to AWS Amplify for deployment. The setup looks super easy, all in one dashboard.

  1. Create a frontend app.
  2. Configure git source.
  3. Specify branches.
  4. Configure Build step.
  5. Configure custom domain configuration of Route53, while selecting a domain listed in our account.
  6. Deploy.

What’s to like in AWS Amplify?

  • Single provider – if something goes wrong, tech-support is at one place.
  • A single dashboard to set up, manage, and get notified about the whole deployment.
  • Auto configuring of domains with certificates.
  • Auto cache flushes on each deployment.
  • New endpoints for each PR created with AWS Amplify Preview.
  • Build step for static site generators.
  • Independently use AWS Amplify to host only front-end applications.
  • An option to build your backend with it too.

Here are some Gotchas!

  • If you already have an AWS CloudFront distribution for your domain, make sure you have a temporary site configured with a different provider. There is a chance your site will be down while it is being moved to AWS Amplify, because Amplify internally uses CloudFront and may conflict with the existing distribution.
  • The nifty AWS Amplify Pull Previews feature doesn’t support custom domains as of yet.
  • No SNS support for notifications, only email notifications as of now.
  • No notifications when a deployment is done in AWS Amplify Previews as of now.

Sayonara.

Session Manager: Driving operational excellence at slice!

Goodbye SSH and bastion hosts. Hello SSM!

As much as we’d like to run our servers like cattle (pets vs cattle mantra), there are times that call for interactive shell access to instances. 

This translates to audited secure access to cloud resources either through bastion hosts or through SSH keys, which in turn opens up a Pandora’s box of bastion management and tight SSH security. 

Surely this age-old problem of remote server access was waiting for a cloud-native solution. Enter Session Manager, the future of remote server management.

So how does Session Manager improve upon traditional remote access technologies? Here is a list of its features:

  • No inbound security rules required to access instances. This means zero ports have to be opened to allow remote access.
  • All user sessions and commands are logged with optional encryption via KMS.
  • Integration with existing IAM policies to allow robust access control.
  • SSH tunneling over session manager. 

The architecture diagram below provides a high level overview of how session manager works.

session manager architecture

Let’s look at how to setup and enable session manager for AWS instances.

Configuring session manager

1. IAM permissions for instance: The easiest way to get started is to attach the AmazonSSMRoleForInstancesQuickSetup role to your instance.

IAM role for session manager

If your instance already has a role attached to it, the AmazonSSMManagedInstanceCore policy can be attached to the existing role.

IAM policy for session manager

2. IAM permissions for users: You need to create policies to allow access to an EC2 instance for specific IAM users and roles. The policy below grants access to EC2 instances with the name tag of API:

{
   "Version": "2012-10-17",
   "Statement": [
     {
       "Effect": "Allow",
       "Action": [
         "ssm:StartSession"
       ],
       "Resource": "arn:aws:ec2:::instance/*",
       "Condition": {
         "StringEquals": {
           "ssm:resourceTag/name": "API"
         }
       }
     },
     {
       "Effect": "Allow",
       "Action": [
         "ssm:TerminateSession"
       ],
       "Resource": [
         "arn:aws:ssm:::session/${aws:username}-*"
       ]
     },
     {
       "Effect": "Allow",
       "Action": [
         "ssm:GetConnectionStatus",
         "ssm:DescribeSessions",
         "ssm:DescribeInstanceProperties",
         "ec2:DescribeInstances"
       ],
       "Resource": "*"
     }
   ]
 }

More info on configuring policy can be found here

3. SSM agent installation: You need to make sure your Amazon Machine Images (AMIs) have the SSM Agent installed. The SSM Agent comes preinstalled on popular AMIs like Amazon Linux and Ubuntu Server. If not, the agent can be installed manually with the command:

sudo yum install -y https://s3.region.amazonaws.com/amazon-ssm-region/latest/linux_amd64/amazon-ssm-agent.rpm

More info on installing and enabling agent can be found here

4. Audit logs: Session Manager can store audit logs in a CloudWatch log group or an S3 bucket. However, the option has to be enabled in Session Manager -> Preferences.

S3 logging for session manager

Using session manager

A session can be started by an authenticated user either from the AWS management console or through CLI. 

  1. Starting a session (console): Either the EC2 console or the Systems Manager console can be used to start a session.
Connect through the EC2 console

2. Starting a session (AWS CLI): Using session manager through the CLI calls for an additional requirement of installing the SSM plugin:

  • Prerequisites: 
    1. AWS CLI version 1.16.12 or higher
    2. Session manager plugin – Install instructions for different systems here
  • Starting a session: 
aws ssm start-session --target "<instance_id>"

3. Using SSH and SCP with session manager: One of the major limitations of Session Manager at launch was its inability to copy files without going through S3.

Now the AWS-StartSSHSession document supports tunnelling SSH traffic through session manager.

Note: Using this functionality requires the use of a key that is associated with the instance. Logging is unavailable for sessions that connect through SSH as SSH encrypts all transit data.

Steps to use SSH/SCP with session manager: 

  1. Verify that prerequisites mentioned above are met.
  2. Add the below lines to SSH config to allow session manager tunneling. The SSH configuration file is typically located at ~/.ssh/config.
# SSH over Session Manager

host i-* mi-*
ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"

SSH into instance with Session Manager: SSH can be performed as normal using the instance-id as the hostname. Example:

% ssh ec2-user@<instance_id>
Last login: Wed Oct 28 10:53:22 2020 from ip-<instance_ip>.ap-south-1.compute.internal
[ec2-user@ip-<instance_ip> ~]$

SCP to copy files with Session Manager: SCP can be performed as normal using the instance-id as the hostname. Example:

% scp test ec2-user@<instance_id>:test
test           100%    0     0.0KB/s   00:00

Wrapping up

Session Manager defies the saying that convenience is the enemy of security, by being both convenient and secure.

The ease of using session manager along with its ability to tunnel SSH traffic allows us to phase out SSH and switch completely to session manager. No more open SSH ports!

Combining session manager with the extended capabilities systems manager provides like patching, automation documents and run command makes for a powerful ops workflow.

If you are invested in AWS cloud, leveraging session manager is a no brainer!

Here at slice, we are constantly working towards creating new tools, every day, to streamline our workflow. So, stay tuned for more!

Jarvis: Slack notification framework

Do you run crons regularly and worry about their progress? Are you simply curious whether they’ll run perfectly? Maybe you store logs on your server and want to send part of a log, or the entire log file, to yourself when an exception occurs, without wading through a long list of logs in Elasticsearch or CloudWatch. Or are you looking for a convenient way to send your server’s stats (as a JPEG or CSV) to your team on Slack? Either way, Jarvis is the perfect tool for you!

Be it sending notifications to your dedicated slack channel, tagging the entire channel or yourself, or just sending a message with or without an attachment, Jarvis does the deed.

It is a dedicated tool, integrated with your server or cron, that sends a slack notification to a given channel name as per your requirement. All you need to do is add Jarvis to your channel and invoke a simple lambda with the required parameters, and you will receive your notifications.

How Jarvis helped to smoothen things out at slice!

We, as a data team, run several crons for migration and data updates in our databases. So, quite naturally, we wanted to be kept informed when a cron execution completed or when any cron hit an exception. Hence, Jarvis was created for smooth delivery of notifications.
In case of an exception, we wanted to inform everyone in the team immediately, so that they could act upon it at the earliest.

Finally, the ultimate requirement that motivated us to create Jarvis was also that we wanted to send an attachment in our notification like CSV or Log file.

So, in short, the main aim behind creating Jarvis was:


“To create a centralized tool that can be used by different services and codebases to send a notification to slack for different actions with or without an attachment”

Architecture

The basic architecture of Jarvis is pretty simple. Here’s a run-through:

The whole codebase of Jarvis runs in an AWS Lambda function. If any server/cron wants to send a notification, the flow is as follows.

  1. If the user wants to send a notification without an attachment
    1. Call the lambda function with the required payload; a slack notification is sent to the given channel name, provided that the Jarvis app is added to your slack channel.
  2. If the user wants to send a notification with an attachment
    1. Upload the required document from your machine to the predefined S3 bucket (in the preferred format, i.e. the key should be /service_name/file).
    2. Next, call the lambda function with the required payload; a slack notification is sent to the given channel name, provided that the Jarvis app is added to your slack channel.

How to add Jarvis to your channel:

  1. Go to your channel.
  2. Click on add app

3. Select Jarvis

Without adding the Jarvis app to your channel, you cannot send a notification to that channel.

How to invoke lambda function with custom payload

Lambda can be invoked with custom payload with or without an attachment. Here’s how:

  • Without attachment

    To send a notification, you need to invoke the lambda with the following payload:
{
    "heading": "Heading",
    "message": "Message",
    "slack_channel": "slice-lambda-notifications"
}

Here, heading is optional, while message and slack_channel are compulsory parameters. slack_channel is the channel name where you want to send the notification.

  • With attachment

To send an attachment (CSV, JPEG) using Jarvis, follow these steps:

  1. Your code or product should upload the attachment in S3, in bucket slack-notification-framework-attachments before triggering the notification.
  2. As a good practice please upload your file at
    s3://slack-notification-framework-attachments/<channel name>/<attachment file>.<file format>
  3. Invoke the lambda using the following payload
{
    "heading": "Heading",
    "message": "Message",
    "slack_channel": "slice-lambda-notifications",
    "attachment": {
        "bucket": "slack-notification-framework-attachments",
        "key": "testing/test_file.csv"
    }
}
Here, heading is optional, while attachment, message, and slack_channel are compulsory. slack_channel is the channel name where you want to send the notification.

How to involve the whole channel/user in the slack notification

If you want to send a notification in which the whole channel is tagged, just prepend <!channel> to the message key in the payload.

Example:

If your message key is:  “Failure in deleting file”

And you want to tag the whole channel, send the message key as: <!channel>  Failure in deleting file

If you want to send a notification in which a particular user is tagged, just prepend <@UserId> to the message key in the payload.

Example:

If your message key is:  “Failure in deleting file”

And you want to tag the person,  send the message key as: <@UNKDY50FH>  Failure in deleting file

How to invoke lambda from code

NodeJS

var AWS = require('aws-sdk');

// you shouldn't hardcode your keys in production! See http://docs.aws.amazon.com/AWSJavaScriptSDK/guide/node-configuring.html
AWS.config.update({accessKeyId: 'akid', secretAccessKey: 'secret'});

var lambda = new AWS.Lambda();
var params = {
  FunctionName: 'slice-slack-notifier', /* required */
  Payload: JSON.stringify({
    "heading": "Heading",
    "message": "Message",
    "slack_channel": "slice-lambda-notifications",
    "attachment": {
      "bucket": "slack-notification-framework-attachments",
      "key": "testing/test_file.csv"
    }
  }) /* payload as defined above */
};
lambda.invoke(params, function(err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else     console.log(data);           // successful response
});

Python

import json

import boto3

boto3_session = boto3.Session(region_name="ap-south-1")
lambda_client = boto3_session.client("lambda")
lambda_client.invoke(
    FunctionName='slice-slack-notifier',
    InvocationType='RequestResponse',
    Payload=json.dumps(
        {
            "heading": "Heading",
            "message": "Message",
            "slack_channel": "slice-lambda-notifications",
            "attachment": {
                "bucket": "slack-notification-framework-attachments",
                "key": "testing/test_file.csv"
            }
        }
    ).encode("utf-8")
)



Error in payload / Notification message:

In case an error occurs while sending the notification because of the payload or message format, a notification with the relevant error message is sent to the slack channel #jarvis-error-notifications.

That was all about Jarvis and how it has simplified transmission of notifications into our slack channels with or without an attachment. Here at slice, we are constantly working towards creating new tools, every day, to streamline our workflow. So, stay tuned for more!

S3 to Redshift / RDS data propagator

The one thing that firms almost always take for granted is data. Be it customer information, sensitive industry information, trade secrets exclusive to the company, or even basic employee information: data should be treated like treasure, and its protection should be the utmost priority. However, this is easier said than done. Data is of no use if it’s always hidden and padlocked against additions and modifications. So, what do you do in such a situation? You monitor and filter!

The RDS data propagator does exactly that, minus any extra workforce. It is an automated tool that creates the required tables and loads data into them from S3 file uploads. It performs insert and update operations on Redshift tables for analytics, following the best practices for creating, inserting into, and updating tables.

So, now you must be thinking, what problems exactly did this automation solve and how did this help slice in streamlining its data? Well, here’s how:

How RDS data propagator streamlined data handling at Slice 

Here at Slice, we regularly use data to analyse trends and assess risk factors. In general, we can categorise data-related users into 2 distinct segments:

  1. The Producers – (software, SDEs, ETL developers)
  2. The Consumers – (Data Analysts, Risk Analysts)

For the longest time, the bridge between the producers and the consumers was the Database Administrator (DBA).

But that’s old school now. In the current IT startup culture, maintaining this structure isn’t always possible due to cost crunches, fast-paced development, small team sizes, etc. Eventually, both producers (developers) and consumers (analysts) end up performing tasks once exclusive to the DBA in order to meet their immediate data needs. While doing so, people often make poor decisions about the underlying schema of their tables, whether from lack of knowledge or from being unable to anticipate the future needs of the data. The biggest downside? In the long run, this leads to bad query performance as the data size increases.

Our case, you see, was no different. Let me take you through our story for a better viewpoint!

Reading data from a given table -> Creating a table -> What keys?? (Primary keys, foreign keys, sort keys, distribution keys) -> Solution??? -> Research / Ask for help -> Insert into table / Update table

The part where we read the data from the tables was fine. 

The catch, however, lay in creating tables or inserting/updating data into those tables for analytical use cases. The analysts dreaded performing such operations. Also, from a data-security perspective, permission to run DML queries can’t be given to everyone. In such a scenario the DBA used to be the go-to person for help. On the other hand, with a very small data team in the firm, a quick turnaround wasn’t feasible: these small but important tasks had to be prioritized by deprioritizing other tasks at hand. With an increasing analyst team size and various data stores being added, these requests became more and more frequent, and as the old saying in IT goes:

“If you have to do it more than twice, AUTOMATE IT.”

Hence, this tool was built.

So now that you know why it is required, let’s look into the technicalities. 

Here are the tactical details!

Technology Stack

  • AWS S3
  • AWS Lambda
  • AWS Redshift
  • Slack (for notifications)

Architecture

Features

  • Creating Redshift tables, with a check that no table is created without sort and distribution keys.
  • Updating table schemas.
  • Inserting and updating data into tables.
  • Success and failure notifications.
  • Scaled-up performance through Lambda.

Usage Guide

The tool makes use of S3 to process files: an AWS Lambda function listens to all S3 object-creation events and, based on the rules specified in the sections below, takes action and runs the required Redshift queries.

The S3 to Redshift data propagator follows a strict folder and file structure, as shown in the architecture diagram

  • <destination_name> : Refers to the type of DB being used like: redshift, postgres, etc
  • <database_name>: DB name (postgres) or Schema name (redshift) in which the table is present
  • <table_name>: Name of the respective table which will contain the data
  • schema:
    • create: contains a json file specifying the column names and datatypes along with the primary, foreign, sort and distribution keys.
    • update: contains a json file specifying new column(s) to be added to the existing table
  • input:
    • load: csv files to be inserted directly into the tables are to be placed in this folder
    • update: csv files to be used for upserting into the tables are placed here.
      • Note: update functionality can happen if the table has any primary key defined.
  • succeeded: csv files that are loaded or upserted into the tables successfully are moved to this folder
  • failed: csv files that fail to get loaded into the tables are moved to this folder.
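The listening lambda’s first job is turning an incoming object key into routing information. A hedged sketch of that parsing, mirroring the folder structure above (the function name is illustrative, not our production code):

```python
def parse_propagator_key(key: str) -> dict:
    """Split a propagator S3 key into destination, database, table and action."""
    parts = key.strip("/").split("/")
    # the first three segments are key=value pairs per the folder structure
    meta = dict(p.split("=", 1) for p in parts[:3])
    return {
        "destination": meta["destination"],
        "database": meta["database"],
        "table": meta["table_name"],
        "action": "/".join(parts[3:-1]),  # e.g. "input/load" or "schema/create"
        "filename": parts[-1],
    }
```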


Creating Redshift tables

Schema definition:

The schema of a table should be provided in the format displayed below:

{
  "column_specification": {
    "<column_name>": "<redshift_data_type>"
  },
  "primary_key": ["column1", "column2", ...],  # optional
  "dist_key": "<column_name>",  # required
  "sort_key": ["column1", "column2", ...]  # required
}

Eg:

{
  "column_specification": {
    "id": "INTEGER",
    "name": "VARCHAR",
    "phone_no": "INTEGER",
    "time_stamp": "VARCHAR",
    "uuid": "VARCHAR"
  },
  "primary_key": ["uuid"],
  "dist_key": "uuid",
  "sort_key": ["uuid"]
}

A JSON file should be created following the above specification.

Path to load the file : s3://****-data-propagator/destination=<destination_name>/database=<database_name>/table_name=<table_name>/schema/create/<file_name>.json

Once the file is loaded into the appropriate path, the Lambda triggers and fires a Redshift command to create a table as per the schema details provided in the JSON file.
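As a rough illustration of what that Redshift command could look like, here is a hedged sketch that renders a CREATE TABLE statement from the schema JSON (the function name and details are ours, not the tool’s exact code):

```python
# Illustrative DDL builder for the schema JSON format described above.
def build_create_table(schema: dict, database: str, table: str) -> str:
    cols = ", ".join(f"{c} {t}" for c, t in schema["column_specification"].items())
    if schema.get("primary_key"):
        cols += f", PRIMARY KEY ({', '.join(schema['primary_key'])})"
    return (
        f"CREATE TABLE {database}.{table} ({cols}) "
        f"DISTKEY({schema['dist_key']}) "
        f"SORTKEY({', '.join(schema['sort_key'])});"
    )

schema = {
    "column_specification": {"id": "INTEGER", "uuid": "VARCHAR"},
    "primary_key": ["uuid"],
    "dist_key": "uuid",
    "sort_key": ["uuid"],
}
ddl = build_create_table(schema, "analytics", "users")
```

Requiring `dist_key` and `sort_key` in the JSON is what enforces the “no table without sort and distribution keys” check.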

Updating table schemas

There are two types of operations supported under this:

  1. Adding column(s)
  2. Deleting column(s)

Adding columns:

Table columns to be added should be provided in the format displayed below:

{
  "column_specification": {
    "<column_name>": "<redshift_data_type>"
  },  # required
  "operation_type": "add"  # required
}

eg:

{
  "column_specification": {
    "city": "VARCHAR",
    "state": "VARCHAR"
  },
  "operation_type": "add"
}

A JSON file should be created following the above specification.

Deleting column(s):

Table columns to be deleted should be provided in the format displayed below:

{
  "column_specification": ["column1", "column2", ...],  # required
  "operation_type": "drop"  # required
}

eg:

{
  "column_specification": ["city", "state"],
  "operation_type": "drop"
}

A JSON file should be created following the above specification.

Path to load the file : s3://*****-data-propagator/destination=<destination_name>/database=<database_name>/table_name=<table_name>/schema/update/<file_name>.json

Once the file is loaded into the appropriate path, the columns in the specified Redshift table are added or removed (dropped).

Note: Redshift doesn’t support adding or dropping multiple columns in a single query. This tool works around that by accepting multiple columns in the JSON file as specified above and issuing one query per column.
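To illustrate that workaround: one update JSON can be expanded into one ALTER TABLE statement per column. A sketch, with illustrative names:

```python
# Expand one update spec into several ALTER TABLE statements, since
# Redshift allows only one ADD/DROP COLUMN per query (illustrative code).
def build_alter_statements(spec: dict, table: str) -> list:
    if spec["operation_type"] == "add":
        return [
            f"ALTER TABLE {table} ADD COLUMN {col} {dtype};"
            for col, dtype in spec["column_specification"].items()
        ]
    if spec["operation_type"] == "drop":
        return [
            f"ALTER TABLE {table} DROP COLUMN {col};"
            for col in spec["column_specification"]
        ]
    raise ValueError(f"unknown operation_type: {spec['operation_type']}")

stmts = build_alter_statements(
    {"column_specification": {"city": "VARCHAR", "state": "VARCHAR"},
     "operation_type": "add"},
    "analytics.users",
)
```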

Inserting and updating data into tables

A CSV file containing the data to be uploaded into the Redshift table should be uploaded into the S3 bucket.

Path to load the file for inserting : s3://*****-data-propagator/destination=<destination_name>/database=<database_name>/table_name=<table_name>/input/load/<file_name>.csv

Path to load the file for updating (upserting) : s3://*****-data-propagator/destination=<destination_name>/database=<database_name>/table_name=<table_name>/input/update/<file_name>.csv
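Redshift has no native UPSERT, so upserts are typically done with a staging table: COPY the CSV into staging, delete the matching primary keys from the target, and insert from staging. A generic sketch of that statement sequence (all names are placeholders; this is not necessarily the tool’s exact SQL):

```python
# Common Redshift upsert pattern via a staging table (generic sketch).
def build_upsert_statements(table: str, primary_key: str,
                            s3_path: str, iam_role: str) -> list:
    staging = f"{table}_staging"
    return [
        f"CREATE TEMP TABLE {staging} (LIKE {table});",
        f"COPY {staging} FROM '{s3_path}' IAM_ROLE '{iam_role}' CSV;",
        "BEGIN;",
        # Remove rows that will be replaced, then insert the new versions.
        f"DELETE FROM {table} USING {staging} "
        f"WHERE {table}.{primary_key} = {staging}.{primary_key};",
        f"INSERT INTO {table} SELECT * FROM {staging};",
        "COMMIT;",
    ]

stmts = build_upsert_statements(
    "users", "uuid", "s3://bucket/data.csv", "arn:aws:iam::123:role/redshift"
)
```

This is also why the update path requires the table to have a primary key defined.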

Success and Failure notifications

The success and failure notifications of any process started through this architecture are sent to a Slack channel; in our case, #data-loader-notification.

Success Notification

Failure Notification

Other usage

Apart from handling manual file uploads, this tool also manages the data insertion and update flow into Redshift for our ETL cron jobs, since it follows the best practices for inserting and updating data in Redshift and is easy to plug into both manual and automated data flows.

Future scope

  1. Supporting updates to specific columns. Currently, the tool updates an entire row.
  2. Adding a feature to support different file types like JSON, Parquet.
  3. Adding more destinations to the list, like Postgres, MySQL, etc.

Deployments using Immutable Infrastructure

Don’t you feel like pulling out your hair when your code works in testing but not in production? Don’t you just hate it when you face downtime due to faulty deployments that are out of your control? We do too. Read on to find out how we resolved this nerve-wracking problem.

What is a mutable and immutable infrastructure?

As the name suggests, mutable infrastructure is the infrastructure that will change (mutate) over time in an incremental manner. Suppose you have a server (a bare-metal machine or a virtual machine) with your web application deployed. As you add new exciting features to your app, you continually SSH into your machine, pull the latest code, install new dependencies, and restart your application. Now, you can do this either manually or automate it, but it is mutable because you are modifying the existing machine.

On the other hand, the immutable infrastructure is unchanging. If you want to deploy a new version of your app, you tear down the old infrastructure and create a new one. As simple as that.

Why bother with immutable infrastructure?

Consider all the steps involved in updating a server using the mutable approach: multiple network calls to GitHub (that surely can’t fa– 😞 ), downloading dependencies, installing them, etc. If you’ve ever tried doing this manually, then you need not be told that things can go wrong. And if you remember the good old Murphy’s Law, what can go wrong will go wrong.

You might be thinking, “Sure, it can go wrong but it rarely does”. It does work 99% of the time. You are right, but that makes it all the more difficult to debug and fix things when they do go wrong. As you scale, you increase the number of machines you deploy to and thus the magnitude of the problem. Things can go wrong on a larger number of machines in a variety of ways: an intermediate step can fail, and you might end up with a half-cooked update. Instead of going from version 1.0 to version 2.0, you might end up at version 1.5 or 1.8.

With immutable infrastructure, you always deploy what you test. There is no intermediate state your server might end up in. It’s either the previous version or the tested new version. If you use a template to launch your machines, such as an AWS AMI, you can replicate a tried and tested version thousands of times with a guarantee. What’s more, it gives you peace of mind.

What tools did we use for our immutable infrastructure deployments?

Finally, a short section on how we did it. We use TypeScript with Node.js for our backend application. We generate an AMI with the new version of our software using AWS CodeBuild integrated with GitHub and HashiCorp Packer. After we test this AMI and verify it works, we update the launch template for our AWS AutoScalingGroup with this AMI. In the end, we trigger an Instance Refresh for our auto-scaling group. Easy and Safe!

  • AWS CodeBuild
  • HashiCorp Packer
  • AWS AMI
  • AWS Launch Template
  • AWS Auto Scaling Group

How to do it?

The following instructions are specific to AWS. However, the concepts can be translated to other cloud providers as well.

Step 1: Building an AMI using HashiCorp Packer

Packer is a tool provided by HashiCorp for free. You can use it to build an AMI by providing it a JSON template that contains instructions on how to build your AMI. Then you can execute it using the following commands:

packer validate template.json # check if the template is well formed
packer build template.json

This will start the execution of Packer which performs the following steps:

  1. It will provision a new EC2 instance based on a source_ami provided by you and wait for the instance to become available.
    Packer gives you options to specify the AWS region, VPC, and SSH key pair for this new instance. It will use sensible defaults if not provided.
  2. It will perform the steps as detailed by you to convert that new instance into your desired server configuration.
    You have many options here. You can choose to have Packer SSH into that machine and execute certain shell commands, or you could have Packer SCP some files into that machine. For all options, click here.
  3. It will save the state of that instance as an AMI and wait for the AMI to become available and then exit.
    You have to provide the name for the AMI and you can choose to have Packer add certain tags to that AMI to easily identify the AMI later.

Here is a sample Packer template file:

{
	"variables": {
		"aws_region": "{{env `AWS_REGION`}}",
		"aws_ami_name": "<unique-ami-name>",
		"source_ami": "{{env `SOURCE_AMI`}}"
	},

	"builders": [
		{
			"type": "amazon-ebs",
			"region": "{{user `aws_region`}}",
			"instance_type": "t3.medium",
			"ssh_username": "ubuntu",
			"ami_name": "{{user `aws_ami_name`}}",
			"ami_description": "project build ami for production",
			"associate_public_ip_address": "true",
			"source_ami": "{{user `source_ami`}}",
			"tags": {
				"branch": "master",
				"timestamp": "{{timestamp}}"
			}
		}
	],

	"provisioners": [
		{
			"type": "shell",
			"inline": [
				"git clone <github-repo>",
				"cd <project-dir>",
				"git checkout master",
				"npm run build",
				"echo built project",
				"sudo systemctl enable service"
			]
		}
	]
}

You can use this template as a starting point. For more details, visit this page

Step 2: Create an AWS CodeBuild pipeline to run Packer

You can find great detailed instructions here on how to create an AWS CodeBuild project which will run the packer template you created in the previous step.

Note: When you build a new AMI for a long-running branch, you will end up with an ever-growing store of AMIs. To bound this, you can keep only the latest 5 or 10 images, which means deleting an old AMI when you create a new one. Below is a shell script to do that:

#!/bin/bash
set -e
ami_list=$( /usr/local/bin/aws ec2 describe-images --filters "Name=tag:branch,Values=$1" --query 'Images[*].[CreationDate,ImageId]' --output json | jq '. | sort_by(.[0])' )
numImages=$( echo "$ami_list" | jq '. | length' )
maxImages=$2
toDelete="$((numImages - maxImages))"
echo "Images to delete: $toDelete"
while [ "$toDelete" -gt 0 ]
do
	index="$((toDelete - 1))"
	ami_id=$( echo "$ami_list" | jq -r ".[$index][1]" )
	echo "deleting $ami_id"
	bash -e ./delete_ami.sh "$ami_id"
	toDelete="$((toDelete - 1))"
	echo "deleted $ami_id"
done

The delete_ami.sh script can be found below. Remember, if you are using an EBS-backed AMI, only deregistering the AMI isn’t enough; you also need to delete the EBS snapshot.

#!/bin/bash
set -e
ami_id=$1
temp_snapshot_id=""
# shellcheck disable=SC2039
ebs_array=( $(/usr/local/bin/aws ec2 describe-images --image-ids $ami_id --output text --query "Images[*].BlockDeviceMappings[*].Ebs.SnapshotId") )
ebs_array_length=${#ebs_array[@]}
echo "Deregistering AMI: $ami_id"
/usr/local/bin/aws ec2 deregister-image --image-id "$ami_id" > /dev/null
echo "Removing Snapshot"
for (( i=0; i<$ebs_array_length; i++ ))
do
	temp_snapshot_id=${ebs_array[$i]}
	echo "Deleting Snapshot: $temp_snapshot_id"
	/usr/local/bin/aws ec2 delete-snapshot --snapshot-id "$temp_snapshot_id" > /dev/null
done

Final step: Use the newly created AMI to update your auto-scaling group

This part assumes that you already have an auto-scaling group running which uses a launch template. If not, there is a good resource to get started here.

  1. Get the latest AMI id for the branch you need to deploy.
#!/bin/bash
ami_list=$( /usr/local/bin/aws ec2 describe-images --filters "Name=tag:branch,Values=$1" --query 'Images[*].[CreationDate,ImageId]' --output json | jq '. | sort_by(.[0])' )
numImages=$( echo "$ami_list" | jq '. | length' )
latest_ami_id=$( echo "$ami_list" | jq -r ".[$((numImages - 1))][1]" )
echo "$latest_ami_id" >> latest_ami_id

The above shell script fetches the AMI list for the branch you specified by filtering on the tags you specified in your packer template. Then it sorts the list by creation time and chooses the latest. Finally, it writes the AMI id to a file.

2. Create a new version for your launch template.

Note: there is a limit on how many launch template versions you can have. So, it is a good idea to bound this number too. A good limit for this would be the same limit you put on your AMI as launch template versions will have a one-to-one mapping with AMIs.

#!/bin/bash
set -e
launch_template_id=$1
ami_id=$2
versions_to_keep=$3
if [ $versions_to_keep -lt 2 ]
	then
		echo "Must keep at least 2 versions of launch template"
		exit 1
	else
		echo "keeping $versions_to_keep of launch template"
fi
new_version=$( /usr/local/bin/aws ec2 create-launch-template-version --launch-template-id "$launch_template_id" --source-version 1 --launch-template-data "{\"ImageId\":\"$ami_id\"}" )
version_number=$( echo "$new_version" | jq '.LaunchTemplateVersion.VersionNumber' )
if [ $version_number -gt $(($versions_to_keep + 1)) ]
	then
		version_to_delete=$(($version_number - $versions_to_keep))
		out=$( /usr/local/bin/aws ec2 delete-launch-template-versions --launch-template-id "$launch_template_id" --versions "$version_to_delete" )
	else
		echo "Not deleting any launch template version"
fi

3. Start ASG instance refresh.

#!/bin/bash
set -e
config_file=$1
asg_name=$2
out=$( /usr/local/bin/aws autoscaling start-instance-refresh --cli-input-json "file://$config_file" )
interval=15
((end_time=${SECONDS}+3600+${interval}))
while ((${SECONDS} < ${end_time}))
do
  status=$( /usr/local/bin/aws autoscaling describe-instance-refreshes --max-records 1 --auto-scaling-group-name "$asg_name" | jq -r '.InstanceRefreshes[0].Status' )
  echo "Instance refresh $status"
  if [ "$status" = "Successful" ]
  	then
  		echo "ASG updated"
  		exit 0
  fi
  if [ "$status" = "Failed" ]
  	then
  		echo "ASG update failed"
  		exit 1
  fi
  if [ "$status" = "Cancelled" ]
  	then
  		echo "ASG update cancelled"
  		exit 1
  fi
  sleep ${interval}
done

echo "ASG update exceeded timeout"
exit 1

This is a sample config file for ASG instance refresh:

{
	"AutoScalingGroupName": "my_asg",
	"Preferences": {
		"InstanceWarmup": 5,
		"MinHealthyPercentage": 50
	}
}

Voila! We now have an immutable deployment pipeline.

Dynamic feature modules

Things to note before jumping in to implement dynamic feature modules in the project.

Google Play’s Dynamic Delivery, one of the App Bundle features, helps developers reduce the app size at install time and thus increase the user onboarding rate. Then, based on the stage or level a user is at, they can be exposed to certain features that are downloaded on demand.

This is certainly a major driving factor for implementing DFMs, and the internet will of course help you with blogs and videos on how to do that. But before jumping straight to the implementation, let’s note down certain caveats of the process and the problems that will come up.

1. Modularise first, dynamise later

If you are dealing with legacy code in your organization that is one big monolithic codebase, plan to break it into modules first, feature-wise and utility-wise. Then you can convert the relevant feature modules into DFMs. However, be careful in choosing which modules to convert. As a tip, prefer features whose download can be deferred until a user has completed a stage or level in your app.

2. DFM is different from the normal module

Normal modules are independent of the app module and can thus be plugged into any app as they are. DFMs, however, depend on the app module to work, i.e. the dependency direction is reversed. This means you can’t access methods from the DFMs in the app module at compile time. One obvious use case is launching a DFM activity from the app module: you have to refer to the class by its name as a string, since you can’t use the .class notation:

val intent = Intent()
intent.setClassName(applicationId,"com.myapp.DfmActivity")
startActivity(intent)

Another consequence is that code from the app module is accessible to the DFM, so you might end up using app-module methods in the DFM, destroying the plug-and-play behavior.

3. Another app like behaviour

When you are done converting the module to a DFM and run the app from Android Studio, you will get a ClassNotFoundException when launching DfmActivity (the main activity of the DFM). This is because you may have selected a run configuration that installs only the app, so the DFM is not installed. Thankfully, this is easy to solve: edit the run configuration and select your DFM to be installed along with the app.
If you check the logs, you will see that the DFM is installed as another APK, so it behaves like a different app. This means you have to take care of possible UX issues that may come up while interacting.

One example: you are on DfmActivity and a call comes in, or you switch to another app for some reason, and then resume DfmActivity; everything looks normal. But when you press the back button, you are thrown out of your app instead of landing on the activity that launched DfmActivity. Handling this scenario requires you to relaunch the calling activity, and yes, be careful about launch modes.

4. Giving out builds for QA testing

After building the APK from Android Studio (via ‘Build APK’) and installing it, you will find that launching DfmActivity results in a ClassNotFoundException. Why? Because the DFM APK is not generated this way, and there is no direct way to install a DFM over the main APK.


There are two ways to solve this :
1. Internal app sharing: instead of an APK, build the app bundle, upload it to internal app sharing, and share the link with the QA team. One problem with this approach is that if you have different applicationIds for the dev and pre-prod environments, you will have to create separate applications on the Google Play Console for each environment.
2. Bundletool: generate the app bundle and extract a universal APK with the help of bundletool. The universal APK is meant to work on all devices irrespective of configuration. If you already have a continuous delivery (CD) setup, it is recommended to install bundletool on the CD server as well and make generating the universal APK part of the build process, publishing it as a build artifact.

5. Testing on universal APK is not sufficient

The universal APK is just a workaround for the above issue, to avoid blocking the testing of the app’s business logic. Since the universal APK is loaded with all configurations, it is not equivalent to what a user downloads from the Play Store. Sometimes, because of conflicting names, app resources are stripped from DFMs, resulting in a ResourceNotFound exception; this doesn’t show up in the universal APK.
Debugging such issues is also a pain, as it may involve uploading builds to internal app sharing again and again.

Testing via internal app sharing, though tedious, is better in that case.
Another thing to note: if your DFM has a WebView anywhere in the flow, it may also crash. This is pretty weird and is acknowledged as a bug on the issue tracker.

So, do we hate DFMs?

No, dynamic delivery is a good offering from the Android team, despite its challenges, and it is definitely the future of the app ecosystem. But since most articles only cover the benefits and implementation tutorials, being prepared for the above-mentioned issues will help developers.

Pulse—keeping a check on our services

Developers take pride in their services’ uptime—they want to know when services go down or become lethargic in their response.

At Slice, we use Elastic Heartbeat to monitor the uptime of internal services and alert us when they go down. Heartbeat has a nifty dashboard (through Kibana) that displays the uptime of all the services it is monitoring.

Heartbeat in a nutshell

You install Heartbeat on a server and configure the services you want Heartbeat to monitor. We have a configuration file for each of our services.

A sample Heartbeat configuration file in YAML:

- type: http
  id: foo
  name: bar
  enabled: true
  urls: ["https://foo.sliceit.com/ping"]
  schedule: '@every 30s'

With the above, Heartbeat will ping foo.sliceit.com/ping every 30 seconds.

Every service that we want Heartbeat to monitor maps to one configuration file of the above format. 
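Because each service maps to one small file of this shape, the configs are easy to generate programmatically. A hypothetical helper (not part of Heartbeat itself; the function is ours):

```python
# Render a Heartbeat HTTP monitor config of the format shown above
# for a given service (illustrative helper, not Heartbeat code).
def render_monitor(service_id: str, name: str, url: str, every: str = "30s") -> str:
    return (
        "- type: http\n"
        f"  id: {service_id}\n"
        f"  name: {name}\n"
        "  enabled: true\n"
        f"  urls: [\"{url}\"]\n"
        f"  schedule: '@every {every}'\n"
    )

config = render_monitor("foo", "bar", "https://foo.sliceit.com/ping")
```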

Whenever we design systems at Slice, one of our guiding principles is to make it easy for everyone to use. 

In line with this principle, we wanted to give our developers the least resistance way to monitor their services’ uptime. We asked ourselves, how can we make it easy for Slice developers to add their services to Heartbeat monitoring? 

Pulse is the framework that we came up with.

Pulse

Developers write Heartbeat configuration files for their services and commit them to a GitHub repo. The check-in triggers a CI/CD workflow (through AWS CodeBuild) that syncs these files to an AWS S3 bucket. On the Heartbeat server, we have a cron that periodically syncs the configuration files from the S3 bucket to a local directory. We have configured Heartbeat to look for configuration files in this directory.

With Pulse in place, to monitor a new service, all the developer has to do is check in the Heartbeat configuration file to the Pulse GitHub repo. After a minute or so, the service starts appearing in the Heartbeat monitoring dashboard on Kibana. We have integrated alerting with Heartbeat using Elastic Watcher to notify us of service downtimes.

Pulse has brought visibility into the uptime of our services, making Slice snappier and more reliable.


Heartbeat dashboard image from Elastic Heartbeat website.

Hercules—the job scheduler

In this post, we will walk you through how we replaced our cron jobs with Hercules—a job scheduling framework that we developed internally.


If working on such problems excites you, our infrastructure engineering team is hiring. If you care about developer productivity and tooling—check out the job description and apply here.


Hassles of cron jobs

Deploying a recurring job through cron is a pain. You ssh into a server, figure out the cron expression, and the permissions. You worry about how you will get to know if your job fails. You fret about your job’s log management and log shipping. You think about the cost: why am I paying for this server when it is in use only for a couple of hours a day and sits idle the rest of the time? You get anxious about some other cron job on the server consuming all the CPU and memory, thus starving your job of resources and killing it.

As a developer, you want to concentrate on writing the code and let someone else manage all the mundane housekeeping stuff for you. That someone for us is Hercules—our job scheduling framework.

We used to use the good old cron to schedule recurring workloads. As Slice grew, so did the number of job workloads. After a particular stage, scheduling jobs through cron becomes a hassle.

Capacity planning

When you have a team of developers scheduling jobs through cron, after a particular scale, the jobs start trampling on each other. A job might be running, consuming a significant portion of the CPU and memory. While this job is still executing, the cron scheduler might kick off another job. Due to this, you end up with erratic job failures.

Capacity planning and efficient utilization of hardware is a problematic area with cron jobs. When the jobs are not running, the servers are dormant, but you end up paying for them. At the same time, if you have multiple jobs running simultaneously, you need to provision the hardware for peak usage.

Observability

Observability is another challenge with cron jobs.

You do not have a single view of all the jobs (spanning machines) and their execution history. Cron does not give you the ability to monitor job execution and alert on job failures.

Creating and maintaining a consistent execution environment for cron jobs is problematic—if the jobs depend on a specific directory structure, files, or other dependencies.

Hercules

To sidestep all the above challenges, we developed Hercules—a container job scheduling framework. With Hercules, we wanted to keep operational overhead to a minimum. Hence, we developed Hercules over AWS Fargate—a serverless compute engine for containers.

Developers package their jobs as Docker container images and push these images to AWS ECR; we have automated this as part of our build pipeline.

They create a schedule.json file with:

  • Docker image URL
  • CPU and memory requirements
  • Docker run directive
  • Schedule (cron or rate expression)

An example schedule.json file below:

{
  "image": "56789076.dkr.ecr.ap-south-1.amazonaws.com/foo",
  "name": "foo",
  "cpu": 512,
  "memory": 1024,
  "command": ["./foo"],
  "scheduleExpression": "rate(5 minutes)"
}

We have integrated Hercules with our build pipeline. During a build, Hercules scans for schedule.json files in the build artifact. Hercules creates a graph of all the jobs specified in the build, queries Fargate for the existing jobs, creates a diff between the two, and schedules jobs accordingly—adding new jobs, modifying existing jobs, and removing jobs not needed anymore.
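The reconcile step can be pictured as a simple diff between the desired and existing job sets. An illustrative sketch (not Hercules’ actual code):

```python
# Diff desired jobs (from schedule.json files in the build) against
# existing scheduled jobs; both args map job name -> job spec dict.
def diff_jobs(desired: dict, existing: dict):
    to_add = [n for n in desired if n not in existing]
    to_remove = [n for n in existing if n not in desired]
    to_modify = [n for n in desired if n in existing and desired[n] != existing[n]]
    return to_add, to_modify, to_remove

desired = {"foo": {"cpu": 512}, "bar": {"cpu": 256}}
existing = {"foo": {"cpu": 1024}, "baz": {"cpu": 256}}
add, mod, rem = diff_jobs(desired, existing)
```

Jobs in `add` get scheduled, jobs in `mod` get updated in place, and jobs in `rem` get deregistered.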

Notifications are first-class citizens in Hercules. If a job fails prematurely, Hercules triggers an alert to a Slack channel.

Hercules relays job execution logs to Cloudwatch and our internal Elasticsearch cluster for easy search and analysis.

Conclusion

With Hercules bearing the burden of scheduling jobs and taking care of all the mundane stuff, developers can concentrate on writing their code and let Hercules do the rest. Developers do not have to bother about sshing into servers, deploying their jobs, and getting anxious about CPU, memory requirements, and job failures.

Hercules has made scheduling jobs at Slice a declarative and productive process—enjoyable and hassle-free.

