
Heartbeat monitoring for AI agent pipelines

You deploy an AI agent to run nightly. It summarises data, writes a report, sends a Slack message. You set up uptime monitoring on the endpoint. The monitor stays green. Three days later you notice the Slack messages stopped. The agent hasn't run since Tuesday — and nothing alerted you.

This is the failure mode heartbeat monitoring is designed to catch. Here's how it works and why it's particularly important for AI agent pipelines.

The dead man's switch pattern

A dead man's switch alerts when something stops happening. Traditional monitoring alerts when something starts happening — a server goes down, an error rate spikes, a response time increases.

For AI agents, the dangerous failure is silence. The agent stops running. No error is thrown. No endpoint goes down. The work just quietly ceases. A dead man's switch catches this by expecting a regular signal — if the signal stops, something is wrong.

The implementation is straightforward: at the end of every successful agent run, after the real work is done, send a ping to a heartbeat URL. If the ping stops arriving within the expected window, you get an alert.
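To make the "expected window" concrete, here is a minimal sketch of the check a heartbeat service might run on its side. The function name and record shape are assumptions for illustration, not Tickstem's actual implementation, and it ignores any alerting policy layered on top (such as waiting for more than one missed interval).

```javascript
// Hypothetical server-side check. The record shape and names are assumptions.
// A heartbeat is overdue once the current time passes
// last ping + interval + grace.
function heartbeatStatus({ lastPingAtMs, intervalSecs, graceSecs }, nowMs) {
  const deadlineMs = lastPingAtMs + (intervalSecs + graceSecs) * 1000
  return {
    deadlineMs,
    overdue: nowMs > deadlineMs,
  }
}
```

With a 24-hour interval and a 1-hour grace window, a heartbeat last pinged at time zero is still healthy at 24h30m and overdue at 25h01m.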

Why AI agents need this more than traditional jobs

Traditional cron jobs usually fail loudly: a non-zero exit code, an exception in the logs, a failed database write. You usually know something went wrong.

AI agents fail quietly. The model might hit a rate limit and return a graceful fallback response. A tool call might silently fail and the agent continues without it. The task might complete but produce empty or corrupted output — and your application code never raises an error because it got a valid HTTP response.

In all these cases, the endpoint is up, the job "ran," and traditional monitoring sees nothing. A heartbeat catches them because the agent itself decides whether to send the ping, and it pings only on genuine success.
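One way to make "genuine success" concrete is a small check that runs before any ping. This is a sketch under assumptions: the run object and its fields (usedFallback, toolCalls, output) are hypothetical, so map them to whatever your agent framework actually records.

```javascript
// Hypothetical shape of an agent run result; adapt the fields to your agent.
// Returns every quiet failure found, so the caller can log them and skip the ping.
function checkAgentRun(run) {
  const problems = []
  if (run.usedFallback) {
    problems.push("model returned a fallback response")
  }
  for (const call of run.toolCalls ?? []) {
    if (!call.ok) problems.push(`tool call failed: ${call.name}`)
  }
  if (typeof run.output !== "string" || run.output.trim().length === 0) {
    problems.push("output is empty")
  }
  return { ok: problems.length === 0, problems }
}
```

The agent would ping only when ok is true, and throw or log the problems otherwise.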

Wiring it up

Create the heartbeat once and save the token:

curl -X POST https://api.tickstem.dev/v1/heartbeats \
  -H "Authorization: Bearer $TICKSTEM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "nightly-summary-agent",
    "interval_secs": 86400,
    "grace_secs": 3600
  }'
# → {"token": "your-64-char-token", ...}

Then in your agent's task handler, ping only after all the real work is verified complete:

async function runNightlySummaryAgent() {
  const summary = await generateSummary()

  // verify the output is valid before pinging
  if (!summary || summary.length < 100) {
    throw new Error("summary generation failed or returned empty output")
  }

  await postToSlack(summary)
  await writeToDatabase(summary)

  // only ping after everything succeeded
  await fetch(`https://api.tickstem.dev/v1/heartbeats/${HEARTBEAT_TOKEN}/ping`, {
    method: "POST"
  }).catch(err => console.error("heartbeat ping failed:", err)) // non-fatal
}

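The ping is non-fatal: the code awaits it, but the .catch means a network error on the ping never fails a run that otherwise succeeded. It's far worse to abort a successful run than to miss one ping. A single lost ping is also harmless, because two consecutive missed intervals trigger the alert, not one. Note that fetch rejects only on network errors, so a ping that comes back with an HTTP error status (for example, from a mistyped token) passes silently unless you also check the response status.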
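If you want the ping to stay non-fatal while still noticing rejected or hanging requests, a small helper can wrap it. This is a sketch, not an official client: the 5-second timeout and the injectable fetchImpl parameter are assumptions added to make it easy to test.

```javascript
// Sketch of a ping helper that never throws.
// Assumptions: a 5s default timeout, and fetchImpl is injectable for testing.
async function pingHeartbeat(url, { fetchImpl = fetch, timeoutMs = 5000 } = {}) {
  try {
    const res = await fetchImpl(url, {
      method: "POST",
      // Abort the request if the heartbeat service does not answer in time.
      signal: AbortSignal.timeout(timeoutMs),
    })
    if (!res.ok) {
      // fetch resolves on 4xx/5xx, so the status has to be checked explicitly.
      console.error(`heartbeat ping rejected: HTTP ${res.status}`)
      return false
    }
    return true
  } catch (err) {
    // Network error or timeout: log it, but never fail the agent run.
    console.error("heartbeat ping failed:", err)
    return false
  }
}
```

The boolean return value lets the caller log or count failed pings without changing the outcome of the run.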

Setting the right interval and grace window

The interval is how often you expect the agent to run. The grace window absorbs variance — an agent that usually completes in 2 minutes but occasionally takes 20 on large inputs needs a grace window that covers that variance without generating false alerts.

A practical starting point for common agent schedules:

After a week of runs, check your actual execution durations and tighten the grace window to 2-3x your p95 runtime.
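As a rough illustration of that tightening step, the sketch below computes a nearest-rank p95 from recorded run durations and multiplies it. The function names and the sample durations are made up for the example; other percentile definitions would give slightly different numbers on small samples.

```javascript
// Nearest-rank percentile over run durations (in seconds).
function percentile(values, p) {
  if (values.length === 0) throw new Error("no durations recorded")
  const sorted = [...values].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(rank, 1) - 1]
}

// Suggest a grace window as a multiple of p95 runtime (2 to 3 is the range
// mentioned above; 3 is used as the default here).
function suggestGraceSecs(durationsSecs, multiplier = 3) {
  return Math.ceil(percentile(durationsSecs, 95) * multiplier)
}

// Hypothetical week of nightly runs: mostly about 2 minutes, one 20-minute outlier.
const weekOfRuns = [110, 120, 125, 130, 140, 150, 1200]
console.log(suggestGraceSecs(weekOfRuns)) // 3600
```

With only seven samples the p95 is simply the slowest run, so one outlier dominates. That is usually what you want for a grace window, but it is worth re-running the numbers once you have more history.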

Multi-step pipelines

For agents that run a pipeline — fetch data, process it, write results, notify downstream — consider a heartbeat per stage if any stage can fail silently. One heartbeat at the end of the full pipeline tells you the pipeline completed. Individual stage heartbeats tell you exactly where it stopped.

A useful rule: the heartbeat ping should only fire after your agent has verified its own output. If the agent checks that the database write succeeded, the Slack message was delivered, and the output passes a sanity check — then it pings. Not before.
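A per-stage version of that rule might look like the sketch below. The stage shape ({ name, heartbeatToken, run, verify }) and the ping callback are hypothetical names chosen for the example. The point is the ordering: run, verify, then ping, stopping at the first stage whose output fails verification.

```javascript
// Sketch only: the stage shape and the ping callback are hypothetical,
// not part of any Tickstem SDK.
async function runPipeline(stages, ping) {
  let input
  for (const stage of stages) {
    const output = await stage.run(input)
    // Verify before pinging: an unverified stage must never report success.
    if (!stage.verify(output)) {
      throw new Error(`stage "${stage.name}" produced invalid output`)
    }
    try {
      await ping(stage.heartbeatToken)
    } catch (err) {
      // A failed ping is logged but does not stop the pipeline.
      console.error(`heartbeat ping failed for stage "${stage.name}":`, err)
    }
    input = output
  }
  return input
}
```

If the second of three stages returns empty output, only the first stage's heartbeat receives a ping, so the missing pings show where the pipeline stopped.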

Pausing during deployments

Deployments are the most common source of false heartbeat alerts. If your agent is down for 10 minutes during a rolling deploy and its interval is 15 minutes, you can get an alert for a non-problem. Pause the heartbeat before deploying:

# before deploy
curl -s -X POST https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_ID/pause \
  -H "Authorization: Bearer $TICKSTEM_API_KEY"

# after deploy completes
curl -s -X POST https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_ID/resume \
  -H "Authorization: Bearer $TICKSTEM_API_KEY"
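If your deploy is driven from a script rather than the shell, the same pause and resume calls can be wrapped so the resume always runs. This is a sketch: withPausedHeartbeat and its parameters are names invented here, the endpoints are the two shown in the curl commands, and fetchImpl is injectable only to make the function testable.

```javascript
const HEARTBEATS_URL = "https://api.tickstem.dev/v1/heartbeats"

// Sketch: pause the heartbeat, run the deploy, and always resume afterwards.
// The function name and parameters are assumptions for illustration.
async function withPausedHeartbeat(heartbeatId, apiKey, deploy, fetchImpl = fetch) {
  const call = action =>
    fetchImpl(`${HEARTBEATS_URL}/${heartbeatId}/${action}`, {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
    })

  await call("pause")
  try {
    return await deploy()
  } finally {
    // Runs whether the deploy succeeded or threw, so a failed deploy
    // does not leave the heartbeat paused.
    await call("resume")
  }
}
```

The try/finally matters more than it looks: a heartbeat left paused after a failed deploy is itself a silent failure.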

Via MCP

If you're using Claude Code or another MCP-compatible client, the Tickstem MCP server exposes create_heartbeat and ping_heartbeat as native tools. The agent can set up its own dead man's switch during the initial scaffolding step — no separate dashboard visit needed.
