Heartbeat monitoring: know when your jobs stop running
Uptime monitoring tells you when your server stops responding. But some of the most painful outages happen the other way around: the server is fine, the cron scheduler fired, and nothing visibly broke — the job just quietly stopped doing anything useful.
A nightly data sync that hasn't run in four days. A report generation job that started throwing an unhandled exception three weeks ago. A backup that "completed" successfully but wrote zero bytes. These failures are invisible to a traditional HTTP monitor because the endpoint never went down.
This is the problem heartbeat monitoring solves.
The dead-man's switch pattern
A heartbeat monitor works the opposite way from uptime monitoring. Instead of Tickstem calling your endpoint, your job calls Tickstem — at the end of every successful run. If the ping stops arriving, something has gone wrong: the job crashed, was never scheduled, or completed without doing its actual work.
You configure two things: an interval (how often you expect a ping) and a grace window (how long to wait past the deadline before counting the interval as missed). Miss two consecutive intervals, and Tickstem sends an email.
The ping URL requires no API key — the token embedded in the URL is the credential. This means you can ping from a shell script, a curl one-liner, or any HTTP client, without managing secrets in the job runtime.
Wiring it up in Go
Create the heartbeat once and save the token:
import (
	"log"
	"os"

	"github.com/tickstem/heartbeat"
)
client := heartbeat.New(os.Getenv("TICKSTEM_API_KEY"))
hb, err := client.Create(ctx, heartbeat.CreateParams{
	Name:         "nightly-sync",
	IntervalSecs: 86400, // expect a ping every 24 hours
	GraceSecs:    3600,  // allow a 1-hour buffer before alerting
})
if err != nil {
	log.Fatalf("create heartbeat: %v", err)
}
// save hb.Token somewhere permanent; it is your ping credential
Then at the end of your job, after all the real work is done:
func runNightlySync(ctx context.Context) error {
	// ... do the actual work ...
	if err := syncData(ctx); err != nil {
		return err
	}
	// only ping on success; silence means failure
	if err := client.Ping(ctx, token); err != nil {
		log.Println("heartbeat ping failed:", err) // non-fatal
	}
	return nil
}
The ping is non-fatal. A transient network error shouldn't block your job from returning: it's far worse to abort a successful sync than to miss one ping. A single lost ping won't trigger an alert either, because it takes two consecutive missed intervals, not one.
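If you want the ping to survive a brief network blip without ever affecting the job's result, one option is a bounded retry under an overall timeout. The following is a minimal sketch, not part of the SDK: pingFunc, bestEffortPing, and simulate are hypothetical names, and the retry count and backoff are arbitrary. In a real job the pingFunc would wrap whatever call sends your heartbeat.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// pingFunc stands in for whatever sends the heartbeat (an SDK call or a plain
// HTTP POST). It is a hypothetical seam for this sketch, not an SDK type.
type pingFunc func(ctx context.Context) error

// bestEffortPing tries the ping up to attempts times within an overall timeout.
// Failures are logged, never returned, so the job's own result is unaffected.
// It reports whether any attempt succeeded.
func bestEffortPing(ctx context.Context, ping pingFunc, attempts int, backoff, timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	for i := 1; i <= attempts; i++ {
		err := ping(ctx)
		if err == nil {
			return true
		}
		log.Printf("heartbeat ping attempt %d/%d failed: %v", i, attempts, err)
		if i == attempts {
			break
		}
		select {
		case <-time.After(backoff): // wait before the next attempt
		case <-ctx.Done(): // out of time; give up quietly
			return false
		}
	}
	return false
}

// simulate runs bestEffortPing against a fake ping that fails the first
// `failures` times, and reports how many calls were made.
func simulate(failures int) (calls int, delivered bool) {
	flaky := func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("simulated network error")
		}
		return nil
	}
	delivered = bestEffortPing(context.Background(), flaky, 5, time.Millisecond, time.Second)
	return calls, delivered
}

func main() {
	calls, delivered := simulate(2)
	fmt.Printf("calls=%d delivered=%v\n", calls, delivered)
}
```

Keep the overall timeout short relative to the job's runtime. The goal is to avoid losing a ping to a momentary failure, not to keep the job alive waiting on a monitoring call.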
Wiring it up in Node.js
import { HeartbeatClient } from "@tickstem/heartbeat"
const hb = new HeartbeatClient(process.env.TICKSTEM_API_KEY)
const heartbeat = await hb.create({
  name: "nightly-sync",
  interval_secs: 86400,
  grace_secs: 3600,
})
// save heartbeat.token
At the end of every successful run:
async function runNightlySync() {
  await syncData()
  try {
    await hb.ping(token)
  } catch (err) {
    console.error("heartbeat ping failed:", err) // non-fatal
  }
}
Or just use curl
No SDK needed. If your job is a shell script or a language without an official SDK, add one line at the end:
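#!/bin/bash
set -e
# ... do the work ...
rsync -avz /data/ backup@server:/backup/
# ping on success; "|| true" keeps a failed ping from failing the script under set -e
curl -fsS -m 10 -X POST "https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_TOKEN/ping" || true
The token goes in the URL, not a header. Because the script runs under set -e, a curl that cannot reach the API would otherwise make the whole script exit non-zero; the trailing || true is what lets the script exit cleanly. Don't let a monitoring call block your actual job.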
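The same one-request approach works from any language with an HTTP client. As an illustration, here is a sketch in Go using only the standard library. It assumes the URL shape from the curl example above and treats any 2xx status as success, which is an assumption about the API rather than documented behaviour. pingURL and ping are hypothetical helper names, and main points at a local test server so the sketch runs without network access.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

// pingURL builds the ping endpoint for a token, following the URL shape in the
// curl example: <base>/v1/heartbeats/<token>/ping.
func pingURL(base, token string) string {
	return base + "/v1/heartbeats/" + url.PathEscape(token) + "/ping"
}

// ping sends one POST with no auth header; the token in the path is the credential.
// Assumption: any 2xx response means the ping was accepted.
func ping(ctx context.Context, c *http.Client, base, token string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, pingURL(base, token), nil)
	if err != nil {
		return err
	}
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("heartbeat ping: unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	// Local stand-in for the API so the sketch runs offline.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Println(r.Method, r.URL.Path)
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	c := &http.Client{Timeout: 5 * time.Second}
	if err := ping(context.Background(), c, srv.URL, "tok_example"); err != nil {
		fmt.Println("heartbeat ping failed:", err) // non-fatal
	}
}
```

In a real job you would pass https://api.tickstem.dev as the base and read the token from configuration. The client timeout matters: without one, a hung connection could hold the job open indefinitely.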
Grace windows and why they matter
A 24-hour job doesn't always finish at exactly the same time. A database backup that usually takes 10 minutes might take 45 on a large data day. A grace window absorbs that variance without generating false alerts.
For a job with a 1-hour interval and a 10-minute grace window, Tickstem waits until 1h 10m has passed since the last ping before considering the interval missed. For a daily job with variable runtime, a 1-2 hour grace window is usually right.
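To make the arithmetic concrete, here is a small Go sketch of how missed intervals could be counted from the time elapsed since the last ping. It is an illustration under stated assumptions, not Tickstem's actual implementation: it assumes the first miss is counted once interval plus grace has elapsed, that each further full interval counts as one more miss, and that two misses trigger the alert. missedIntervals and shouldAlert are hypothetical names.

```go
package main

import "fmt"

// missedIntervals reports how many expected pings are overdue, given the
// seconds elapsed since the last ping.
// Assumption: the first miss is counted at intervalSecs+graceSecs, and every
// further full interval after that counts as one more miss.
func missedIntervals(elapsedSecs, intervalSecs, graceSecs int64) int64 {
	if intervalSecs <= 0 {
		return 0 // nothing is expected without a positive interval
	}
	deadline := intervalSecs + graceSecs
	if elapsedSecs < deadline {
		return 0
	}
	return 1 + (elapsedSecs-deadline)/intervalSecs
}

// shouldAlert applies the two-consecutive-misses rule described earlier.
func shouldAlert(missed int64) bool {
	return missed >= 2
}

func main() {
	// The 1-hour interval with a 10-minute grace window from the text.
	const interval, grace = 3600, 600

	for _, elapsed := range []int64{3900, 4200, 7800} {
		m := missedIntervals(elapsed, interval, grace)
		fmt.Printf("%d min since last ping: missed=%d alert=%v\n", elapsed/60, m, shouldAlert(m))
	}
}
```

At 65 minutes nothing is missed yet, at 70 minutes the first interval counts as missed, and at 130 minutes the second miss would trigger the alert under these assumptions.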
Pausing during deployments
Deployments are the most common source of false heartbeat alerts. If your job is down for 20 minutes while a new version rolls out, and your interval is 15 minutes, you'd get an alert for a non-problem. Pause the heartbeat before the deploy:
# in your deploy script
curl -s -X POST "https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_ID/pause" \
  -H "Authorization: Bearer $TICKSTEM_API_KEY"
# ... deploy ...
curl -s -X POST "https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_ID/resume" \
  -H "Authorization: Bearer $TICKSTEM_API_KEY"
While paused, missed pings are ignored completely. Alerting resumes the moment you call resume, and intervals missed after that point count toward an alert again.
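One risk with the pause and resume pair is a deploy that fails halfway and never reaches the resume call, leaving the heartbeat silently paused. If your deploy tooling is written in Go, a wrapper with defer guards against that. This is a sketch only: pauser, withPaused, and recorder are hypothetical names, and a real pauser would POST to the pause and resume endpoints shown above with the API key.

```go
package main

import (
	"errors"
	"fmt"
)

// pauser is a hypothetical interface over the pause and resume endpoints.
type pauser interface {
	Pause() error
	Resume() error
}

// withPaused pauses the heartbeat, runs deploy, and always attempts to resume,
// even when deploy fails. If deploy and resume both fail, both errors are returned.
func withPaused(p pauser, deploy func() error) (err error) {
	if perr := p.Pause(); perr != nil {
		return fmt.Errorf("pause heartbeat: %w", perr)
	}
	defer func() {
		if rerr := p.Resume(); rerr != nil {
			err = errors.Join(err, fmt.Errorf("resume heartbeat: %w", rerr))
		}
	}()
	return deploy()
}

// recorder is a fake pauser that records the order of calls.
type recorder struct{ calls []string }

func (r *recorder) Pause() error  { r.calls = append(r.calls, "pause"); return nil }
func (r *recorder) Resume() error { r.calls = append(r.calls, "resume"); return nil }

// errRollout simulates a failed deployment.
var errRollout = errors.New("rollout failed")

func main() {
	r := &recorder{}
	err := withPaused(r, func() error {
		r.calls = append(r.calls, "deploy")
		return errRollout
	})
	fmt.Println(r.calls, err)
}
```

The sketch aborts the deploy if the pause call fails. Depending on how much you mind a false alert, you may prefer to log that failure and deploy anyway.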
What to monitor
Any job that runs on a schedule and produces output that matters is a candidate:
- Database backups
- Data sync and ETL pipelines
- Report generation
- Invoice processing
- Cache warming
- Any job where "it ran" and "it did something useful" are different things
Uptime monitoring and heartbeat monitoring are complementary. Uptime tells you the server is alive. Heartbeat tells you the job actually did what it was supposed to. A complete setup has both.
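For the ping to mean "the job did what it was supposed to" rather than just "the job ran", it helps to check the job's output before pinging. Below is a minimal Go sketch of that idea, using the zero-byte backup from the introduction as the failure case. runResult, errEmptyOutput, and finish are hypothetical names, and the ping is passed in as a plain function so the sketch does not depend on any SDK.

```go
package main

import (
	"errors"
	"fmt"
)

// runResult is a hypothetical summary of what a job produced.
type runResult struct {
	BytesWritten int64
}

// errEmptyOutput marks a run that finished without producing anything.
var errEmptyOutput = errors.New("job finished but wrote zero bytes")

// finish pings only when the job returned no error and produced output.
// The ping is best-effort: a ping failure is printed but does not fail the job.
func finish(res runResult, runErr error, ping func() error) error {
	if runErr != nil {
		return runErr
	}
	if res.BytesWritten == 0 {
		return errEmptyOutput // stay silent so the heartbeat goes overdue
	}
	if err := ping(); err != nil {
		fmt.Println("heartbeat ping failed:", err) // non-fatal
	}
	return nil
}

func main() {
	pings := 0
	ping := func() error { pings++; return nil }

	err := finish(runResult{BytesWritten: 0}, nil, ping)
	fmt.Println("empty backup:", err, "pings sent:", pings)

	err = finish(runResult{BytesWritten: 4096}, nil, ping)
	fmt.Println("real backup:", err, "pings sent:", pings)
}
```

What counts as useful output depends on the job: bytes written for a backup, rows synced for an ETL run, invoices processed for a billing job. The point is that the check sits between the work and the ping.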
Heartbeat monitoring is available now in the Tickstem dashboard.
Go SDK: go get github.com/tickstem/heartbeat.
Node.js SDK: npm install @tickstem/heartbeat.