Observability

Production telemetry runs on three legs: Sentry for errors, OpenTelemetry traces for the hot path, Horizon for queue health. The admin Site Health pill (see Site health & failed jobs) summarizes them at a glance.

Sentry

Set SENTRY_DSN and unhandled exceptions across the app flow into Sentry with stack traces, request context, and user/workspace metadata. The breadcrumb trail captures the last 100 log lines for every error. Integrated via sentry-laravel.

Useful filters in Sentry:

  • Tag workspace_id to scope errors to a tenant.
  • Tag agent_id when the error originates in a widget request.
  • Release tags match the deploy commit SHA, which makes it easy to bisect when a regression appears.

OpenTelemetry traces

The OTEL exporter ships traces to OTEL_EXPORTER_OTLP_ENDPOINT, typically Honeycomb or Grafana Cloud Tempo. Spans wrap the hot path:

  • widget.message.receive: incoming HTTP, validation, JWT verify.
  • rag.curated.match: short-circuit check.
  • rag.embed: query embedding call.
  • rag.vector.search: ANN search.
  • rag.rerank: cross-encoder.
  • rag.prompt.assemble: local CPU work.
  • rag.llm.first_token: time-to-first-token (the headline metric).
  • rag.llm.stream: full stream duration.
  • rag.persist.async: post-stream save.

Each span is tagged with workspace_id, agent_id, conversation_id, provider (cloudflare / openai), and any cache-hit flags. The number to watch is the p95 of rag.llm.first_token: that is your hot-path SLO.
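As a rough illustration of what "p95 of rag.llm.first_token" means, the sketch below computes a nearest-rank percentile over a list of span durations. The sample values and the function name are hypothetical; in practice the trace backend computes this for you, possibly with a different interpolation method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical rag.llm.first_token durations in milliseconds.
first_token_ms = [310, 280, 450, 390, 1200, 330, 360, 410, 295, 870]

p95 = percentile(first_token_ms, 95)
print(f"p95 first-token latency: {p95} ms")  # prints 1200 ms for this sample
```

With only ten samples the p95 is simply the slowest request, which is why the percentile is only meaningful over a reasonably large window of traffic.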

Horizon

/horizon is the queue dashboard. It is required for production; without it, you're blind to backlogs. Watch:

  • Wait time: how long jobs sit before being picked up. Healthy is < 1s; investigate anything > 10s.
  • Throughput: jobs/min by queue.
  • Failed jobs: anything that lands in failed_jobs shows here too.

Queues to monitor:

Queue     What's on it
default   Misc: usage events, gap detection, audit logs, webhook deliveries.
crawl     CrawlSourceJob, CrawlPageJob, IngestNotionPageJob, IngestGoogleDocJob. Usually has the greatest queue depth.
index     IndexDocumentJob, IndexTextSourceJob. Embedding-heavy.

Logs

Standard Laravel logging. Default channels:

  • stdout: captured by Laravel Cloud / Docker.
  • sentry: error level and above.
  • slack: critical level, posts to the ops channel.

Tail logs locally with php artisan pail.

Health endpoint

GET /up is the readiness probe. It returns 200 with a small JSON body if the app boots. Use it for load balancer health checks. For deeper checks, App\Support\PlatformAdminHeader runs the multi-step health check and exposes the result via the Inertia shared prop on every admin page.
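A minimal sketch of an external readiness check against /up, for cases where you want to probe the app from a script rather than a load balancer. The host name in the usage comment is a placeholder, and the function name and `opener` parameter are illustrative only; the one behavior taken from the text is that a ready app answers 200.

```python
import urllib.request


def is_ready(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if a GET to `url` answers with HTTP 200 within `timeout`.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urllib's URLError/HTTPError and socket timeouts all derive from
        # OSError; any of them means the app is not ready.
        return False


# Usage (placeholder host):
#   is_ready("https://app.example.com/up")
```

Treating every connection error as "not ready" keeps the probe conservative: a timeout is reported the same way as a 5xx response.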

Metrics to watch

The handful of metrics that matter most:

  • p95 first-token latency: < 1s.
  • p95 full-response latency: < 5s for short answers.
  • Crawl queue depth: should drain within minutes.
  • Index queue depth: should drain within minutes.
  • Failed-jobs count: 0 in steady state. Anything > 50 is alarm-worthy.
  • LLM provider error rate: < 1% of streams.
  • Vector store query latency: p95 < 100ms.
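The numeric targets above can be checked mechanically. The sketch below compares a snapshot of metrics against those bounds and lists the ones that are out of range. The metric names and the snapshot values are hypothetical; the queue-depth targets are left out because the text gives no numeric bound for them.

```python
# Upper bounds taken from the list above; the metric names are hypothetical.
THRESHOLDS = {
    "first_token_p95_s": 1.0,      # p95 first-token latency < 1s
    "full_response_p95_s": 5.0,    # p95 full-response latency < 5s
    "llm_error_rate": 0.01,        # < 1% of streams
    "vector_query_p95_ms": 100.0,  # p95 < 100ms
}
FAILED_JOBS_ALARM = 50  # anything above this is alarm-worthy


def breaches(snapshot):
    """Return the names of metrics in `snapshot` that miss their target."""
    found = [
        name
        for name, limit in THRESHOLDS.items()
        if name in snapshot and snapshot[name] >= limit
    ]
    if snapshot.get("failed_jobs", 0) > FAILED_JOBS_ALARM:
        found.append("failed_jobs")
    return found


snapshot = {
    "first_token_p95_s": 0.82,
    "full_response_p95_s": 6.1,
    "llm_error_rate": 0.004,
    "vector_query_p95_ms": 71.0,
    "failed_jobs": 3,
}
print(breaches(snapshot))  # ['full_response_p95_s']
```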

Alerts

Recommended PagerDuty / Slack alerts:

  • Sentry: new release-blocking error.
  • Honeycomb: first-token p95 > 1.5s for 5 minutes.
  • Horizon: failed-jobs delta > 10 in 5 minutes.
  • Stripe webhook: > 5 consecutive verification failures (signing key mismatch).
  • Reverb: process down.
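The three numeric rules above follow common alerting patterns: a threshold sustained over a window, a counter delta over a window, and a run of consecutive failures. The sketch below shows one way to express each, assuming one sample per minute. The function names and sample data are hypothetical, and real alerting tools implement these conditions natively.

```python
def sustained_above(samples, limit, window):
    """True if the most recent `window` samples all exceed `limit`."""
    if len(samples) < window:
        return False
    return all(value > limit for value in samples[-window:])


def delta_exceeds(counter_samples, limit, window):
    """True if a running total grew by more than `limit` over the
    last `window` sampling intervals."""
    if len(counter_samples) <= window:
        return False
    return counter_samples[-1] - counter_samples[-1 - window] > limit


def trailing_failures(outcomes):
    """Count the consecutive failures (False values) at the end of `outcomes`."""
    count = 0
    for ok in reversed(outcomes):
        if ok:
            break
        count += 1
    return count


# Hypothetical per-minute samples.
first_token_p95_s = [0.9, 1.6, 1.7, 1.8, 1.6, 1.9]
failed_jobs_total = [2, 2, 3, 9, 12, 14]
webhook_verified = [True, False, False, False, False, False, False]

print(sustained_above(first_token_p95_s, 1.5, window=5))  # True
print(delta_exceeds(failed_jobs_total, 10, window=5))     # True (14 - 2 = 12)
print(trailing_failures(webhook_verified) > 5)            # True (6 failures)
```

Requiring the condition to hold across the whole window, rather than on a single sample, is what keeps a brief latency spike from paging anyone.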

Site Health pill

The header pill in the admin panel is a quick visual check that everything is configured. Green is the steady state; if it goes amber, the dropdown tells you exactly which check failed and links to the settings page to fix it. See Site health & failed jobs.