Pulse—keeping a check on our services

Developers take pride in their services’ uptime—they want to know when services go down or become lethargic in their response.

At Slice, we use Elastic Heartbeat to monitor the uptime of internal services and alert us when they go down. Heartbeat has a nifty dashboard (through Kibana) that displays the uptime of all the services it is monitoring.

Heartbeat in a nutshell

You install Heartbeat on a server and configure the services you want Heartbeat to monitor. We have a configuration file for each of our services.

A sample Heartbeat configuration file in YAML:

- type: http
  id: foo
  name: bar
  enabled: true
  urls: ["https://foo.sliceit.com/ping"]
  schedule: '@every 30s'

With the above, Heartbeat will ping foo.sliceit.com/ping every 30 seconds.

Every service that we want Heartbeat to monitor maps to one configuration file of the above format. 

Whenever we design systems at Slice, one of our guiding principles is to make them easy for everyone to use.

In line with this principle, we wanted to give our developers the path of least resistance to monitoring their services’ uptime. We asked ourselves: how can we make it easy for Slice developers to add their services to Heartbeat monitoring?

Pulse is the framework that we came up with.

Pulse

Developers write Heartbeat configuration files for their services and commit them to a GitHub repo. The check-in triggers a CI/CD workflow (through AWS CodeBuild) that syncs these files to an AWS S3 bucket. On the Heartbeat server, we have a cron that periodically syncs the configuration files from the S3 bucket to a local directory. We have configured Heartbeat to look for configuration files in this directory.
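
The sync on the Heartbeat server is deliberately simple. As an illustration, it can be as little as the following script run from cron (a minimal sketch; the bucket name and monitors directory are placeholders, and an aws s3 sync command in the crontab works just as well):

import boto3
from pathlib import Path

# Placeholder names; the actual bucket and Heartbeat monitors directory differ.
BUCKET = "pulse-heartbeat-configs"
MONITORS_DIR = Path("/etc/heartbeat/monitors.d")

def sync_monitor_configs():
    """Pull every monitor configuration file from the Pulse S3 bucket into
    the local directory that Heartbeat reads monitor definitions from."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith(".yml"):
                continue
            s3.download_file(BUCKET, key, str(MONITORS_DIR / Path(key).name))

if __name__ == "__main__":
    sync_monitor_configs()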

With Pulse in place, to monitor a new service, all the developer has to do is check in the Heartbeat configuration file to the Pulse GitHub repo. After a minute or so, the service starts appearing in the Heartbeat monitoring dashboard on Kibana. We have integrated alerting with Heartbeat using Elastic Watcher to notify us of service downtimes.
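
Watcher watches are JSON documents stored in Elasticsearch. Conceptually, a downtime watch looks something like the body below, written here as a Python dict to keep the sketches in one language (the index pattern, interval, and Slack channel are illustrative, and the slack action assumes a Slack account configured in Elasticsearch; this is not our exact watch):

# Illustrative Watcher watch body: fire when any monitor reported "down"
# in the last minute and notify a Slack channel. Values are placeholders.
downtime_watch = {
    "trigger": {"schedule": {"interval": "1m"}},
    "input": {
        "search": {
            "request": {
                "indices": ["heartbeat-*"],
                "body": {
                    "query": {
                        "bool": {
                            "filter": [
                                {"term": {"monitor.status": "down"}},
                                {"range": {"@timestamp": {"gte": "now-1m"}}},
                            ]
                        }
                    }
                },
            }
        }
    },
    "condition": {"compare": {"ctx.payload.hits.total": {"gt": 0}}},
    "actions": {
        "notify_slack": {
            "slack": {
                "message": {"to": ["#alerts"], "text": "A monitored service is down"}
            }
        }
    },
}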

Pulse has brought visibility into the uptime of our services, making Slice snappier and more reliable.


Heartbeat dashboard image from the Elastic Heartbeat website.

Hercules—the job scheduler

In this post, we will walk you through how we replaced our cron jobs with Hercules—a job scheduling framework that we developed internally.


If working on such problems excites you, our infrastructure engineering team is hiring. If you care about developer productivity and tooling—check out the job description and apply here.


Hassles of cron jobs

Deploying a recurring job through cron is a pain. You SSH into a server, figure out the cron expression and the permissions. You worry about how you will find out if your job fails. You fret about your job’s log management and log shipping. You think about the cost—why am I paying for this server when it is in use only for a couple of hours a day and sits idle the rest of the time? You get anxious about some other cron job on the server consuming all the CPU and memory, starving your job of resources and killing it.

As a developer, you want to concentrate on writing the code and let someone else manage all the mundane housekeeping stuff for you. That someone for us is Hercules—our job scheduling framework.

We used to use good old cron to schedule recurring workloads. As Slice grew, so did the number of job workloads. Beyond a certain point, scheduling jobs through cron became a hassle.

Capacity planning

When you have a team of developers scheduling jobs through cron, beyond a certain scale, the jobs start trampling on each other. A job might be running, consuming a significant portion of the CPU and memory. While this job is still executing, the cron scheduler might kick off another job. Due to this, you end up with erratic job failures.

Capacity planning and efficient utilization of hardware is a problematic area with cron jobs. When the jobs are not running, the servers are dormant, but you end up paying for them. At the same time, if you have multiple jobs running simultaneously, you need to provision the hardware for peak usage.

Observability

Observability is another challenge with cron jobs.

You do not have a single view of all the jobs (spanning machines) and their execution history. Cron does not give you the ability to monitor job execution and alert on job failures.

Creating and maintaining a consistent execution environment for cron jobs is also problematic, especially if the jobs depend on a specific directory structure, files, or other dependencies.

Hercules

To sidestep all the above challenges, we developed Hercules—a container job scheduling framework. With Hercules, we wanted to keep operational overhead to a minimum. Hence, we built Hercules on top of AWS Fargate—a serverless compute engine for containers.

Developers package their jobs as Docker container images and push these images to AWS ECR; we have automated this as part of our build pipeline.

They create a schedule.json file with:

  • Docker image URL
  • CPU and memory requirements
  • Docker run directive
  • Schedule (cron or rate expression)

An example schedule.json file:

{
  "image": "56789076.dkr.ecr.ap-south-1.amazonaws.com/foo",
  "name": "foo",
  "cpu": 512,
  "memory": 1024,
  "command": ["./foo"],
  "scheduleExpression": "rate(5 minutes)"
}

We have integrated Hercules with our build pipeline. During a build, Hercules scans for schedule.json files in the build artifact. Hercules creates a graph of all the jobs specified in the build, queries Fargate for the existing jobs, creates a diff between the two, and schedules jobs accordingly—adding new jobs, modifying existing jobs, and removing jobs not needed anymore.
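
The "schedules jobs accordingly" step boils down to managing scheduled Fargate tasks through Amazon EventBridge rules. The sketch below shows what creating and removing a single job could look like; the cluster ARN, IAM role, subnet, and rule-naming convention are placeholders rather than the actual Hercules implementation:

import boto3

events = boto3.client("events")

# Placeholder ARNs; the real cluster, IAM role, and network settings are
# specific to our AWS account.
CLUSTER_ARN = "arn:aws:ecs:ap-south-1:123456789012:cluster/hercules"
EVENTS_ROLE_ARN = "arn:aws:iam::123456789012:role/hercules-events-role"

def schedule_job(job, task_definition_arn):
    """Create or update the EventBridge rule that launches the job's Fargate
    task on the schedule declared in its schedule.json."""
    rule_name = f"hercules-{job['name']}"
    events.put_rule(
        Name=rule_name,
        ScheduleExpression=job["scheduleExpression"],
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": job["name"],
            "Arn": CLUSTER_ARN,
            "RoleArn": EVENTS_ROLE_ARN,
            "EcsParameters": {
                "TaskDefinitionArn": task_definition_arn,
                "TaskCount": 1,
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],  # placeholder
                        "AssignPublicIp": "DISABLED",
                    }
                },
            },
        }],
    )

def remove_job(name):
    """Delete the rule for a job that no longer appears in the build."""
    rule_name = f"hercules-{name}"
    events.remove_targets(Rule=rule_name, Ids=[name])
    events.delete_rule(Name=rule_name)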

Notifications are first-class citizens in Hercules. If a job fails prematurely, Hercules triggers an alert to a Slack channel.
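
One way such an alert can be wired up, sketched below, is a small handler subscribed to ECS "Task State Change" events through EventBridge that posts to a Slack incoming webhook when a stopped task reports a non-zero exit code (the handler and the webhook environment variable are illustrative, not necessarily how Hercules does it):

import json
import os
import urllib.request

# Placeholder: the Slack incoming webhook URL is injected via the environment.
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def handler(event, context):
    """Lambda-style handler for ECS 'Task State Change' events; alerts Slack
    when a stopped task reports a non-zero exit code."""
    detail = event["detail"]
    if detail.get("lastStatus") != "STOPPED":
        return
    # A missing exit code is treated as a failure in this sketch.
    failed = any(c.get("exitCode", 1) != 0 for c in detail.get("containers", []))
    if not failed:
        return
    message = {
        "text": "Hercules job failed: {} (reason: {})".format(
            detail.get("taskDefinitionArn"), detail.get("stoppedReason", "unknown")
        )
    }
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)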

Hercules relays job execution logs to CloudWatch and our internal Elasticsearch cluster for easy search and analysis.
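
Getting job output into CloudWatch typically comes down to the awslogs log driver in each job's container definition; shipping the logs onward to Elasticsearch is a separate pipeline. A sketch of the relevant fragment (the log group name, region, and prefix are placeholders):

# Illustrative logConfiguration fragment for a job's container definition.
log_configuration = {
    "logDriver": "awslogs",
    "options": {
        "awslogs-group": "/hercules/foo",      # placeholder log group
        "awslogs-region": "ap-south-1",
        "awslogs-stream-prefix": "foo",
    },
}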

Conclusion

With Hercules bearing the burden of scheduling jobs and taking care of all the mundane stuff, developers can concentrate on writing their code and let Hercules do the rest. Developers do not have to bother with SSHing into servers, deploying their jobs, or getting anxious about CPU and memory requirements and job failures.

Hercules has made scheduling jobs at Slice a declarative and productive process—enjoyable and hassle-free.


If working on such problems excites you, our infrastructure engineering team is hiring. If you care about developer productivity and tooling—check out the job description and apply here.