Operate
Observability
Production telemetry runs on three legs: Sentry for errors, OpenTelemetry traces for the hot path, Horizon for queue health. The admin Site Health pill (see Site health & failed jobs) summarizes them at a glance.
Sentry
Set SENTRY_DSN and unhandled exceptions across the app
flow into Sentry with stack traces, request context, and user/workspace
metadata. The breadcrumb trail captures the last 100 log lines for
every error. Integrated via sentry-laravel.
Useful filters in Sentry:
- Tag
workspace_idto scope errors to a tenant. - Tag
agent_idwhen the error originates in a widget request. - Release tags match deploy commit SHA โ easy to bisect when a regression appears.
OpenTelemetry traces
The OTEL exporter ships traces to OTEL_EXPORTER_OTLP_ENDPOINT
โ typically Honeycomb or Grafana Cloud Tempo. Spans wrap the hot path:
widget.message.receiveโ incoming HTTP, validation, JWT verify.rag.curated.matchโ short-circuit check.rag.embedโ query embedding call.rag.vector.searchโ ANN search.rag.rerankโ cross-encoder.rag.prompt.assembleโ local CPU work.rag.llm.first_tokenโ time-to-first-token (the headline metric).rag.llm.streamโ full stream duration.rag.persist.asyncโ post-stream save.
Each span is tagged with workspace_id, agent_id, conversation_id,
provider (cloudflare / openai), and any cache-hit flags. The big one
is p95 of rag.llm.first_token โ that's
your hot-path SLO.
Horizon
/horizon is the queue dashboard. Required for production โ
without it, you're blind to backlogs. Watch:
- Wait time โ how long jobs sit before being picked up. Healthy is < 1s; investigate > 10s.
- Throughput โ jobs/min by queue.
- Failed jobs โ anything that lands in
failed_jobsshows here too.
Queues to monitor:
| Queue | What's on it |
|---|---|
default | Misc: usage events, gap detection, audit logs, webhook deliveries. |
crawl | CrawlSourceJob, CrawlPageJob, IngestNotionPageJob, IngestGoogleDocJob. Tends to be the longest queue depth. |
index | IndexDocumentJob, IndexTextSourceJob. Embedding-heavy. |
Logs
Standard Laravel logging. Default channels:
stdoutโ captured by Laravel Cloud / Docker.sentryโ error level and above.slackโ critical level, posts to ops channel.
Tail logs locally with php artisan pail.
Health endpoint
GET /up is the readiness probe โ returns 200 with a small
JSON body if the app boots. Use it for load balancer health checks.
For deeper checks, App\Support\PlatformAdminHeader
runs the multi-step health check and exposes the result via the
Inertia shared prop on every admin page.
Metrics to watch
The handful of metrics that matter most:
- p95 first-token latency โ < 1s.
- p95 full-response latency โ < 5s for short answers.
- Crawl queue depth โ should drain within minutes.
- Index queue depth โ should drain within minutes.
- Failed-jobs count โ 0 in steady state. Anything > 50 is alarm-worthy.
- LLM provider error rate โ < 1% of streams.
- Vector store query latency โ p95 < 100ms.
Alerts
Recommended PagerDuty / Slack alerts:
- Sentry โ new release-blocking error.
- Honeycomb โ first-token p95 > 1.5s for 5 minutes.
- Horizon โ failed-jobs delta > 10 in 5 minutes.
- Stripe webhook โ > 5 consecutive verification failures (signing key mismatch).
- Reverb โ process down.
Site Health pill
The header pill in the admin panel is a quick visual check that everything is configured. Green is the steady state; if it goes amber, the dropdown tells you exactly which check failed and links to the settings page to fix it. See Site health & failed jobs.