Diagent docs

Auto-index visited pages

Auto-index is a per-agent toggle that grows your knowledge base automatically as visitors browse your site. When a visitor lands on a page the agent has never seen, that page goes into the crawl queue silently, in the background.

How to enable

Open the agent's settings page (/app/agents/{id}/settings) and flip Auto-index visited pages. The change takes effect immediately; there's no separate publish step for this toggle.

What it does

On every /v1/widget/init call, after the agent passes the origin and quota checks, AutoIndexPageVisit::attempt() runs a chain of seven guards. All seven must pass for a crawl to be queued:

  1. The agent has auto_index_visited_pages = true.
  2. The page URL is a valid http/https URL.
  3. The host is not private. RFC1918 (10.x, 172.16-31.x, 192.168.x), loopback (localhost, 127.x, ::1), link-local (169.254.x, fe80:), 0.x, IPv6 ULA (fc00:), and .local / .internal domains are blocked.
  4. The path doesn't look private. Paths such as /admin, /login, /checkout, /profile, /account, /settings, /cart, and /api are skipped.
  5. The URL hasn't already been indexed for this agent.
  6. We're under the rate limit: 30 crawls per agent per hour, tracked in Redis.
  7. The visitor's actual Origin header matches the agent's allowed_origins (or matches the page URL's origin when allowed_origins is *).
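
As an illustration of guard 3, here is a minimal Python sketch of a private-host check. It is not the actual AutoIndexPageVisit code: the function name is_private_host is hypothetical, the CIDR ranges are the standard ones for the categories listed above (the real guard may match narrower prefixes), and the sketch does not resolve DNS names, so a public-looking hostname that points at a private IP would pass here.

```python
import ipaddress
from urllib.parse import urlsplit

# Standard ranges for the categories named in guard 3 (an assumption about
# exactly which prefixes the real guard uses).
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC1918
        "127.0.0.0/8", "::1/128",                          # loopback
        "169.254.0.0/16", "fe80::/10",                     # link-local
        "0.0.0.0/8",                                       # 0.x
        "fc00::/7",                                        # IPv6 ULA
    )
]
BLOCKED_SUFFIXES = (".local", ".internal")


def is_private_host(url: str) -> bool:
    """Return True when the URL's host should be blocked (hypothetical helper)."""
    host = (urlsplit(url).hostname or "").lower()
    # A missing host is treated as blocked in this sketch.
    if not host or host == "localhost" or host.endswith(BLOCKED_SUFFIXES):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # an ordinary DNS name; not resolved here
    return any(addr in net for net in BLOCKED_NETWORKS)
```

With this sketch, http://172.20.0.1/ and http://[::1]/ are blocked, while https://example.com/pricing passes.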

If everything passes, we lazy-create a type=auto source and dispatch a CrawlPageJob on the crawl queue. The visitor's request returns immediately; auto-index never blocks the hot path.
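
The guard chain and dispatch described above can be sketched as a short-circuiting loop. This is an illustrative Python model, not the real implementation: Visit, origin_allowed, and attempt are hypothetical names, allowed_origins is assumed to be a list of origin strings, and the enqueue callback stands in for dispatching CrawlPageJob.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple
from urllib.parse import urlsplit


@dataclass
class Visit:
    page_url: str               # URL reported by the widget
    origin: str                 # value of the request's Origin header
    auto_index: bool            # the agent's auto_index_visited_pages flag
    allowed_origins: List[str]  # assumed shape: list of origins, or ["*"]


def origin_of(url: str) -> str:
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}".lower()


def origin_allowed(visit: Visit) -> bool:
    # Guard 7: with "*", compare the Origin header to the page URL's origin.
    if "*" in visit.allowed_origins:
        return visit.origin.lower() == origin_of(visit.page_url)
    return visit.origin.lower() in {o.lower() for o in visit.allowed_origins}


Guard = Tuple[str, Callable[[Visit], bool]]


def attempt(visit: Visit, guards: List[Guard],
            enqueue: Callable[[str], None]) -> Optional[str]:
    """Run guards in order. Return the name of the first failing guard,
    or None when every guard passed and the crawl was queued."""
    for name, check in guards:
        if not check(visit):
            return name
    enqueue(visit.page_url)  # stand-in for dispatching the crawl job
    return None
```

Because the only work on success is handing the URL to a queue, the caller can return to the visitor without waiting for the crawl itself.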

What it skips

The path blocklist exists because authenticated pages are noisy and risky to index: a logged-in /profile or /account/orders page could leak the visitor's data into your knowledge base. The full list lives in AutoIndexPageVisit::SKIP_PATH_PATTERNS (case-insensitive; each pattern matches the path segment with an optional trailing slash):

  • /account, /my-account, /profile, /settings
  • /admin
  • /login, /signin, /signup, /register, /logout, /auth, /password
  • /checkout, /cart, /order, /orders

Don't index logged-in surfaces
If your app's auth-gated pages live under unusual paths (e.g. /portal, /customer), the default blocklist won't catch them. In that case, disable auto-index, or skip the toggle entirely and pre-list the exact public URLs you want crawled.
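
To check which of your own URLs the default blocklist would skip, a rough Python approximation of the matching rule may help. The segment list is copied from above; the regular expression is an assumption about how SKIP_PATH_PATTERNS behaves (it treats a path as private when its first segment is on the list, case-insensitively, with or without a trailing slash), and looks_private is a hypothetical helper name.

```python
import re
from urllib.parse import urlsplit

SKIP_SEGMENTS = [
    "account", "my-account", "profile", "settings",
    "admin",
    "login", "signin", "signup", "register", "logout", "auth", "password",
    "checkout", "cart", "order", "orders",
]

# Assumed rule: the path starts with a listed segment, followed by "/" or
# the end of the path. The real patterns may be stricter or looser.
SKIP_RE = re.compile(
    r"^/(?:%s)(?:/|$)" % "|".join(re.escape(s) for s in SKIP_SEGMENTS),
    re.IGNORECASE,
)


def looks_private(url: str) -> bool:
    """Return True when the URL's path would be skipped under the assumed rule."""
    return SKIP_RE.search(urlsplit(url).path) is not None
```

Under this reading, /Account/orders and /login/ are skipped, while /accounting and /portal/billing are not, which is why auth-gated pages under custom prefixes need separate handling.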

Rate limiting

The 30-crawls-per-agent-per-hour cap is tracked in Redis under the key auto-index:agent:{id}:hour:{YmdH}, where {YmdH} is the year, month, day, and hour, so each hour gets its own count. If a popular page on your site is getting hammered, the limit throttles new crawls quickly, but normal traffic patterns rarely hit it.
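
The hour-stamped key suggests a per-hour counter, which the following Python sketch models. It is an approximation under stated assumptions: a plain dict stands in for Redis (a real deployment would more likely use an atomic increment plus a key expiry), the hour is taken in UTC, and bucket_key and try_consume are hypothetical names. The production logic may differ.

```python
from datetime import datetime, timezone
from typing import Dict

LIMIT_PER_HOUR = 30


def bucket_key(agent_id: int, now: datetime) -> str:
    # %Y%m%d%H mirrors the {YmdH} suffix, e.g. 2024050113 for 13:xx on 1 May 2024.
    return f"auto-index:agent:{agent_id}:hour:{now.strftime('%Y%m%d%H')}"


def try_consume(counters: Dict[str, int], agent_id: int, now: datetime) -> bool:
    """Count one crawl against the agent's current hour.
    Returns False once the hour's allowance is used up."""
    key = bucket_key(agent_id, now)
    used = counters.get(key, 0)
    if used >= LIMIT_PER_HOUR:
        return False
    counters[key] = used + 1
    return True


if __name__ == "__main__":
    counters: Dict[str, int] = {}
    now = datetime.now(timezone.utc)
    results = [try_consume(counters, 42, now) for _ in range(31)]
    print(results.count(True), "allowed,", results.count(False), "throttled")
```

Because the key changes when the hour rolls over, the count starts from zero again without any explicit reset.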

What gets indexed

Auto-indexed pages become type=auto sources. They show up in the regular Sources list with a small "auto" pill so you can see what's been picked up. You can preview, reindex, or delete them like any other source.

The auto-source's title is the page <title> if available, otherwise the URL. Path-based deduplication means the same URL crawled twice doesn't create two sources.
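
The title fallback and deduplication can be pictured with a small Python sketch. The in-memory dict keyed by URL and the upsert_auto_source helper are hypothetical stand-ins for however sources are actually stored, and the sketch assumes URLs are compared exactly as given, without normalization.

```python
from typing import Dict, Optional


def upsert_auto_source(sources: Dict[str, dict], url: str,
                       title: Optional[str] = None) -> dict:
    """Return the existing auto source for this URL, or create one.
    The title falls back to the URL when the page has no usable <title>."""
    existing = sources.get(url)
    if existing is not None:
        return existing  # same URL seen again: no second source
    source = {
        "type": "auto",
        "url": url,
        "title": (title or "").strip() or url,
    }
    sources[url] = source
    return source
```

Calling the helper twice with the same URL returns the same record, so a repeat crawl never produces a duplicate entry in the list.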

Disabling and pruning

Turn the toggle off and no new pages will be queued, but existing auto-sources stay. To clean them up, filter the sources list by type = auto and delete in bulk.