# Monitoring and Runs
Every cron worker keeps a 30-day history of its runs, integrates with the project topology, emits OpenTelemetry traces, and writes audit log entries for every mutation. This page explains how to use those surfaces to monitor and debug.
## The worker detail page
Open a worker from the project’s Cron Workers tab. The detail page gives you everything in one place:
- A status badge: `active`, `suspended`, or `deleted`.
- The next scheduled fire with a live countdown, computed from the cron expression and timezone.
- A last-run summary with timestamp and outcome.
- The run history table (newest first, paginated).
- Action buttons: Manual trigger, Pause / Resume, Edit destination, Delete.
While a run is in progress, the page polls run history at a short interval and updates in place — you don’t need to refresh manually.
## Run history
Each row in the table is a single run with these columns:
| Column | What it shows |
|---|---|
| Status | `running`, `success`, or `failed`. |
| Triggered by | `schedule` (the cron expression fired it) or `manual` (you clicked the button). |
| Attempt | Which attempt this is (1 of N). Retries appear as separate attempts. |
| Scheduled for | The cron slot’s timestamp (only set for scheduled runs). Useful to spot drift between slot and start. |
| Started | When the attempt actually ran. |
| Finished | When the attempt finished (or `—` if still running). |
| Duration | Elapsed time of the attempt. |
| Failure reason | Structured code (e.g. `timeout`, `http_5xx`, `ssrf_blocked`); only on failed runs. |
Clicking a row expands to show the full error message (when failed) and the resolved payload that was actually sent.
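If you prefer inspecting the same fields from a shell, a run record can be summarized with one jq expression. A minimal sketch over an illustrative record; the field names (`status`, `failure_reason`, `payload`) are assumptions based on the columns above, not a documented schema:

```shell
# Illustrative run record mirroring the run-history columns above;
# field names are assumptions based on this page.
run='{"status":"failed","triggered_by":"schedule","attempt":1,
      "failure_reason":"timeout","payload":{"job":"rotate-uploads"}}'

# One-line summary: status, failure reason, and the resolved payload.
echo "$run" | jq -r '"\(.status) (\(.failure_reason)) payload=\(.payload | tojson)"'
```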
### Drift between scheduled and started
If the Started time is noticeably later than Scheduled for, the cron slot was queued for some time before it actually ran. Common causes:
- A short orchestrator leadership transition (usually sub-second, at most a few seconds).
- A previous run still in flight when concurrency is capped at 1.
- The destination was slow to accept the connection.
Drift of a few seconds is normal. Drift of more than a minute typically indicates a destination problem — check the failed runs around the same time.
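To measure drift without eyeballing the table, subtract the two timestamps. A sketch over an illustrative record (`scheduled_for` and `started_at` are the field names this page uses, but the exact JSON shape is an assumption):

```shell
# Illustrative run record; only the two timestamp fields matter here.
run='{"scheduled_for":"2026-05-02T13:00:00.000Z","started_at":"2026-05-02T13:00:04.214Z"}'

# Drift = started_at - scheduled_for, in whole seconds.
# ts strips fractional seconds so fromdateiso8601 can parse the value.
echo "$run" | jq -r '
  def ts: sub("\\.[0-9]+Z$"; "Z") | fromdateiso8601;
  "drift: \((.started_at | ts) - (.scheduled_for | ts))s"'
```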
## Manual triggers
The Manual trigger button fires the worker on demand, outside its schedule. Use it to:
- Test a new worker without waiting for the next slot.
- Backfill a missed run after fixing a destination issue.
- Verify connectivity after an environment-variable change.
Manual triggers go through the same retry, timeout, and audit pipeline as scheduled runs. They appear in run history with Triggered by: `manual` and `scheduled_for` empty.
The button is rate-limited per worker to prevent accidental floods. If you hit the limit, the dashboard shows a friendly message and you can try again in a few seconds.
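If you script manual triggers (in CI, for instance), wrap the call in a small backoff loop so a rate-limit rejection simply waits and retries. A minimal sketch; `guara crons trigger hourly` at the end is a hypothetical subcommand, shown only to illustrate where the helper plugs in:

```shell
# Retry a command up to $1 times with linear backoff (1s, 2s, ...).
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    sleep "$i"
    i=$((i + 1))
  done
  return 1
}

# Hypothetical usage; substitute the real trigger command your CLI exposes:
# retry 5 guara crons trigger hourly
```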
## Pause, resume, and delete
- Pause stops future scheduled fires. The worker keeps its full configuration and history; runs already in flight finish normally.
- Resume restarts the schedule from the next slot. No catch-up — runs missed while paused are not replayed.
- Delete removes the worker permanently. Run history is also deleted. There is a confirmation dialog.
## On the project topology
Cron workers appear as nodes on the project topology map. They use a clock icon and a dashed border to distinguish them from services. A `cron_trigger` edge connects the worker to its destination.
Cron edges are drawn in the platform’s secondary brand color with a dashed line and without the animation used for live request traffic — cron is scheduled, not continuous. Cron edges are also excluded from the request-rate and error-rate aggregations on the topology, so a chatty cron worker does not skew your service-traffic metrics.
Click a cron node on the map to jump to its detail page.
## In your traces
Every cron fire opens an OpenTelemetry span and propagates a W3C `traceparent` header to the destination. That means a trigger that POSTs to your service shows up in Traces as a single trace spanning:
- The cron worker’s `cron.trigger.fire` span.
- The HTTP request span on the receiving service.
- Anything that service does as a result.
If your job fans out further (downstream HTTP calls, queue publishes), those land in the same trace as long as you propagate the headers in your code.
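Propagating the header by hand is mechanical: keep the incoming trace id and flags, and mint a fresh span id for each outbound hop. A sketch of the string handling (the incoming value below is the W3C specification's example value, not real data):

```shell
# A W3C traceparent is four dash-joined lowercase-hex fields:
# version, 16-byte trace id, 8-byte parent/span id, flags.
incoming="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

# Keep the trace id and flags from the incoming header.
trace_id=$(echo "$incoming" | cut -d- -f2)
flags=$(echo "$incoming" | cut -d- -f4)

# Mint a new 8-byte span id for the outbound call.
new_span_id=$(head -c8 /dev/urandom | od -An -tx1 | tr -d ' \n')

outgoing="00-${trace_id}-${new_span_id}-${flags}"
echo "$outgoing"
```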
## Audit logging
Every mutation on a cron worker is written to your Audit Logs with the user, timestamp, IP, the action, and the diff of changed fields. The actions are:
- `cron_worker.create`
- `cron_worker.update`
- `cron_worker.suspend`
- `cron_worker.resume`
- `cron_worker.delete`
- `cron_worker.manual_trigger`
You can filter the audit log by `action_type` to surface just cron worker activity, and by outcome (success / failed) to find failed mutations.
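If you export audit entries as JSON, the same filter is one jq expression. The entries below are illustrative and the export shape is an assumption; `action_type` and `outcome` are the field names mentioned above:

```shell
# Two illustrative audit entries: one cron-worker mutation, one unrelated.
entries='[{"action_type":"cron_worker.update","outcome":"failed"},
          {"action_type":"service.deploy","outcome":"success"}]'

# Keep only failed cron-worker mutations.
echo "$entries" | jq '[ .[]
  | select(.action_type | startswith("cron_worker."))
  | select(.outcome == "failed") ]'
```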
## Investigating failures: a checklist
When a run fails, walk through this checklist:
- Read the `failure_reason`. It points to the layer of the failure (network, protocol, destination, etc.) before you read the message.
- Check the trace. Open the trace from the run row; the waterfall shows whether the destination service even received the request and how it responded.
- Check the destination’s logs. Open the destination service in the dashboard and look at logs around the run’s `started_at` time.
- Check connectivity. For HTTP destinations, manually trigger the worker; if it succeeds, the schedule is fine and the original failure was transient.
- Check credentials. A `failure_reason` of `credential` means the catalog service credentials could not be resolved or were rejected. Restart the catalog service and try again.
If the failure is repeatable and you cannot resolve it, contact support with the run ID — the audit log and trace both reference it, so support can find everything in seconds.
## From the CLI
The same run history is available from `guara crons runs <slug>`: table view by default, bordered cards with `--detail`, an append-only live tail with `--watch`. Use it when investigating from a shell session, scripting checks in CI, or watching a job land after fixing something.
```shell
guara crons runs hourly                         # Table, latest 20 runs
guara crons runs hourly --status failed         # Filter to failed runs
guara crons runs hourly --detail                # Bordered cards with full context
guara crons runs hourly --watch --detail        # Live cards as runs complete
guara crons runs hourly --watch --status failed # Wait for a failure to reproduce
```
Sample card output (`--detail`):

```
┌─ 4f2ab91d-5c08-4e6c-9c8f-2b6d1c3a7e10 ────── ✕ failed ─┐
│ Triggered by  schedule                                  │
│ Attempts      3                                         │
│ Started       2026-05-02T13:00:01.214Z                  │
│ Finished      2026-05-02T13:00:31.215Z                  │
│ Duration      30001ms                                   │
│ Failure       timeout                                   │
│               The destination didn't respond within     │
│               the configured timeout.                   │
│ Error         upstream request timeout                  │
│ Payload       {                                         │
│                 "job": "rotate-uploads",                │
│                 "max_age_hours": 24                     │
│               }                                         │
└─────────────────────────────────────────────────────────┘
```
## Failure reasons
Every failed run carries a structured `failure_reason`; the CLI prints it next to a one-line human explanation in `--detail` mode. The full taxonomy:
| Code | Explanation |
|---|---|
| `timeout` | The destination didn’t respond within the configured timeout. |
| `dns` | The destination hostname could not be resolved (DNS failure). |
| `connection` | The destination refused the connection or was unreachable. |
| `tls` | The destination presented an invalid or untrusted TLS certificate. |
| `http_4xx` | The destination returned a 4xx response (client error). |
| `http_5xx` | The destination returned a 5xx response (server error). |
| `ssrf_blocked` | The destination URL was blocked by SSRF protection. |
| `blocked_hostname` | The destination hostname is on the platform’s deny list. |
| `blocked_ip` | The destination IP is on the platform’s deny list. |
| `destination_unresolvable` | The destination reference could not be resolved at trigger time. |
| `credential` | The destination authentication failed. |
| `protocol` | The destination spoke an unexpected protocol or version. |
| `unknown` | An unspecified failure occurred (see `error` field for details). |
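When a worker fails in more than one way, tallying `failure_reason` across recent runs shows which code dominates. A sketch over illustrative records; the `.data.items` envelope matches the `--json` examples on this page:

```shell
# Three illustrative failed runs wrapped in the .data.items envelope.
runs='{"data":{"items":[
  {"failure_reason":"timeout"},
  {"failure_reason":"http_5xx"},
  {"failure_reason":"timeout"}]}}'

# Count runs per failure_reason (group_by also sorts by the grouping key).
echo "$runs" | jq -r '.data.items
  | group_by(.failure_reason)
  | map("\(length)x \(.[0].failure_reason)")
  | .[]'
```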
## From a failed run to its trace
Every cron fire propagates a W3C `traceparent`, so a failed run has a corresponding distributed trace. Copy the trace ID out of the run card (or out of the JSON envelope when using `--json`) and pipe it into the trace viewer:
```shell
# Find the latest failed run for a worker
guara crons runs hourly --status failed --limit 1 --json \
  | jq -r '.data.items[0].id'

# Then open the trace for that run (trace_id is on the run row in --json)
guara services traces get <traceId>
```
The waterfall shows whether the destination service even received the request, what each downstream span did, and where the time was actually spent — almost always the fastest path from “the cron failed” to “this query is the problem”.