
Heartbeat monitoring: know when your jobs stop running

Uptime monitoring tells you when your server stops responding. But some of the most painful outages happen the other way around: the server is fine, the cron scheduler fired, and nothing visibly broke — the job just quietly stopped doing anything useful.

A nightly data sync that hasn't run in four days. A report generation job that started throwing an unhandled exception three weeks ago. A backup that "completed" successfully but wrote zero bytes. These failures are invisible to a traditional HTTP monitor because the endpoint never went down.

This is the problem heartbeat monitoring solves.

The dead-man's switch pattern

A heartbeat monitor works the opposite way from uptime monitoring. Instead of Tickstem calling your endpoint, your job calls Tickstem — at the end of every successful run. If the ping stops arriving, something has gone wrong: the job crashed, was never scheduled, or completed without doing its actual work.

You configure two things: an interval (how often you expect a ping) and a grace window (how long to wait past the deadline before alerting). Miss two consecutive intervals, and Tickstem sends an email.

The ping URL requires no API key — the token embedded in the URL is the credential. This means you can ping from a shell script, a curl one-liner, or any HTTP client, without managing secrets in the job runtime.

Wiring it up in Go

Create the heartbeat once and save the token:

import (
    "context"
    "log"
    "os"

    "github.com/tickstem/heartbeat"
)

ctx := context.Background()
client := heartbeat.New(os.Getenv("TICKSTEM_API_KEY"))

hb, err := client.Create(ctx, heartbeat.CreateParams{
    Name:         "nightly-sync",
    IntervalSecs: 86400, // expect a ping every 24 hours
    GraceSecs:    3600,  // allow 1 hour buffer before alerting
})
if err != nil {
    log.Fatal(err)
}
// save hb.Token somewhere permanent — it's your ping credential

Then at the end of your job, after all the real work is done:

// client and token come from the setup step above
func runNightlySync(ctx context.Context) error {
    // ... do the actual work ...
    if err := syncData(ctx); err != nil {
        return err
    }

    // only ping on success — silence means failure
    if err := client.Ping(ctx, token); err != nil {
        log.Println("heartbeat ping failed:", err) // non-fatal
    }
    return nil
}

The ping is non-fatal. A transient network error shouldn't block your job from returning — it's far worse to abort a successful sync than to miss one ping. Two consecutive missed intervals trigger the alert, not one.

Wiring it up in Node.js

import { HeartbeatClient } from "@tickstem/heartbeat"

const hb = new HeartbeatClient(process.env.TICKSTEM_API_KEY)

const heartbeat = await hb.create({
  name: "nightly-sync",
  interval_secs: 86400,
  grace_secs: 3600,
})
// save heartbeat.token

At the end of every successful run:

async function runNightlySync() {
  await syncData()

  try {
    await hb.ping(token) // token saved from the create step above
  } catch (err) {
    console.error("heartbeat ping failed:", err) // non-fatal
  }
}

Or just use curl

No SDK needed. If your job is a shell script or a language without an official SDK, add one line at the end:

#!/bin/bash
set -e

# ... do the work ...
rsync -avz /data/ backup@server:/backup/

# ping on success; "|| true" stops a failed curl from aborting the script under set -e
curl -s -X POST "https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_TOKEN/ping" || true

The token goes in the URL, not a header. And if curl fails, the script still exits cleanly: never let a monitoring call block your actual job.

Grace windows and why they matter

A 24-hour job doesn't always finish at exactly the same time. A database backup that usually takes 10 minutes might take 45 on a large data day. A grace window absorbs that variance without generating false alerts.

For a job with a 1-hour interval and a 10-minute grace window, Tickstem waits until 1h 10m has passed since the last ping before considering the interval missed. For a daily job with variable runtime, a 1-2 hour grace window is usually right.

Pausing during deployments

Deployments are the most common source of false heartbeat alerts. If your job is down for 20 minutes while a new version rolls out, and your interval is 15 minutes, you'd get an alert for a non-problem. Pause the heartbeat before the deploy:

# in your deploy script
curl -s -X POST https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_ID/pause \
  -H "Authorization: Bearer $TICKSTEM_API_KEY"

# ... deploy ...

curl -s -X POST https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_ID/resume \
  -H "Authorization: Bearer $TICKSTEM_API_KEY"

While paused, missed pings are ignored completely. Alerting resumes the moment you call resume — the next missed interval after that will trigger the alert.

What to monitor

Any job that runs on a schedule and produces output that matters is a candidate: nightly data syncs, report generation, backups.

Uptime monitoring and heartbeat monitoring are complementary. Uptime tells you the server is alive. Heartbeat tells you the job actually did what it was supposed to. A complete setup has both.

Heartbeat monitoring is available now in the Tickstem dashboard. Go SDK: go get github.com/tickstem/heartbeat. Node.js SDK: npm install @tickstem/heartbeat.