Operate

Observability

Production telemetry runs on three legs: Sentry for errors, OpenTelemetry traces for the hot path, Horizon for queue health. The admin Site Health pill (see Site health & failed jobs) summarizes them at a glance.

Sentry

Set SENTRY_DSN and unhandled exceptions across the app flow into Sentry with stack traces, request context, and user/workspace metadata. The breadcrumb trail captures the last 100 log lines for every error. Integrated via sentry-laravel.

Useful filters in Sentry:

Tag workspace_id to scope errors to a tenant.
Tag agent_id when the error originates in a widget request.
Release tags match deploy commit SHA — easy to bisect when a regression appears.

OpenTelemetry traces

The OTEL exporter ships traces to OTEL_EXPORTER_OTLP_ENDPOINT — typically Honeycomb or Grafana Cloud Tempo. Spans wrap the hot path:

widget.message.receive — incoming HTTP, validation, JWT verify.
rag.curated.match — short-circuit check.
rag.embed — query embedding call.
rag.vector.search — ANN search.
rag.rerank — cross-encoder.
rag.prompt.assemble — local CPU work.
rag.llm.first_token — time-to-first-token (the headline metric).
rag.llm.stream — full stream duration.
rag.persist.async — post-stream save.

Each span is tagged with workspace_id, agent_id, conversation_id, provider (cloudflare / openai), and any cache-hit flags. The big one is p95 of rag.llm.first_token — that's your hot-path SLO.

Horizon

/horizon is the queue dashboard. Required for production — without it, you're blind to backlogs. Watch:

Wait time — how long jobs sit before being picked up. Healthy is < 1s; investigate > 10s.
Throughput — jobs/min by queue.
Failed jobs — anything that lands in failed_jobs shows here too.

Queues to monitor:

Queue	What's on it
`default`	Misc: usage events, gap detection, audit logs, webhook deliveries.
`crawl`	CrawlSourceJob, CrawlPageJob, IngestNotionPageJob, IngestGoogleDocJob. Tends to be the longest queue depth.
`index`	IndexDocumentJob, IndexTextSourceJob. Embedding-heavy.

Logs

Standard Laravel logging. Default channels:

stdout — captured by Laravel Cloud / Docker.
sentry — error level and above.
slack — critical level, posts to ops channel.

Tail logs locally with php artisan pail.

Health endpoint

GET /up is the readiness probe — returns 200 with a small JSON body if the app boots. Use it for load balancer health checks. For deeper checks, App\Support\PlatformAdminHeader runs the multi-step health check and exposes the result via the Inertia shared prop on every admin page.

Metrics to watch

The handful of metrics that matter most:

p95 first-token latency — < 1s.
p95 full-response latency — < 5s for short answers.
Crawl queue depth — should drain within minutes.
Index queue depth — should drain within minutes.
Failed-jobs count — 0 in steady state. Anything > 50 is alarm-worthy.
LLM provider error rate — < 1% of streams.
Vector store query latency — p95 < 100ms.

Alerts

Recommended PagerDuty / Slack alerts:

Sentry — new release-blocking error.
Honeycomb — first-token p95 > 1.5s for 5 minutes.
Horizon — failed-jobs delta > 10 in 5 minutes.
Stripe webhook — > 5 consecutive verification failures (signing key mismatch).
Reverb — process down.

Site Health pill

The header pill in the admin panel is a quick visual check that everything is configured. Green is the steady state; if it goes amber, the dropdown tells you exactly which check failed and links to the settings page to fix it. See Site health & failed jobs.