Auto-index visited pages
Auto-index is a per-agent toggle that grows your knowledge base automatically as visitors browse your site. When a visitor lands on a page the agent has never seen, that page goes into the crawl queue silently, in the background.
How to enable
Open the agent's settings page (/app/agents/{id}/settings) and
flip Auto-index visited pages. The change takes effect
immediately; there's no separate publish step for this toggle.
What it does
On every /v1/widget/init call, after the agent passes the
origin and quota checks, AutoIndexPageVisit::attempt() runs
a chain of seven guards. All seven must pass for a crawl to be queued:
- The agent has auto_index_visited_pages = true.
- The page URL is a valid http/https URL.
- The host is not private: RFC 1918 ranges (10.x, 172.16-31.x, 192.168.x), loopback (localhost, 127.x, ::1), link-local (169.254.x, fe80:), 0.x, IPv6 ULA (fc00:), and .local/.internal domains are blocked.
- The path doesn't look private: /admin, /login, /checkout, /profile, /account, /settings, /cart, /api are skipped.
- The URL hasn't already been indexed for this agent.
- We're under the rate limit: 30 crawls per agent per hour, tracked in Redis.
- The visitor's actual Origin header matches the agent's allowed_origins (or matches the page URL's origin when allowed_origins is *).
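The private-host guard can be sketched with Python's standard ipaddress module. This is a hypothetical stand-in, not the real AutoIndexPageVisit code; the function name and the fail-closed choice for unparseable URLs are assumptions:

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical sketch of the private-host guard. Blocked: RFC 1918 ranges,
# loopback, link-local, 0.x, IPv6 ULA (fc00::/7), and .local/.internal hosts.
BLOCKED_SUFFIXES = (".local", ".internal")

def host_is_private(url: str) -> bool:
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return True  # unparseable URL: fail closed, treat as private
    if host == "localhost" or host.endswith(BLOCKED_SUFFIXES):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a public-looking domain name, not an IP literal
    # is_private covers RFC 1918, loopback, link-local, 0.0.0.0/8 and ULA;
    # the extra checks just make the intent explicit
    return ip.is_private or ip.is_loopback or ip.is_link_local
```

Note the check only inspects the URL itself; a guard that also resolved DNS would additionally catch public hostnames pointing at private IPs.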
If everything passes, we lazy-create a type=auto source and
dispatch a CrawlPageJob on the crawl queue. The
visitor's request returns immediately; auto-index never blocks the hot
path.
What it skips
The path blocklist exists because authenticated pages are noisy and
risky to index: a logged-in /profile or
/account/orders page leaks the visitor's data into your
knowledge base. The full list lives in
AutoIndexPageVisit::SKIP_PATH_PATTERNS (case-insensitive,
matches the path segment with optional trailing slash):
- /account, /my-account, /profile, /settings
- /admin
- /login, /signin, /signup, /register, /logout, /auth, /password
- /checkout, /cart, /order, /orders
If your private pages live under other paths (e.g.
/portal, /customer), the default blocklist
won't catch them. Disable auto-index, or pre-list the exact public
URLs you want crawled and skip the toggle entirely.
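The "path segment with optional trailing slash, case-insensitive" rule can be sketched as a regex over the patterns above. This is an illustrative guess at the matching semantics, assuming patterns are anchored to the leading path segment; the helper name is hypothetical:

```python
import re
from urllib.parse import urlparse

# Hypothetical sketch of the skip-path check. Each pattern matches a leading
# path segment, case-insensitively, with an optional trailing slash, so
# /Admin, /admin/ and /admin/users all match but /administration does not.
SKIP_PATH_PATTERNS = [
    "/account", "/my-account", "/profile", "/settings",
    "/admin",
    "/login", "/signin", "/signup", "/register", "/logout", "/auth", "/password",
    "/checkout", "/cart", "/order", "/orders",
]
_SKIP_RE = re.compile(
    r"^(" + "|".join(re.escape(p) for p in SKIP_PATH_PATTERNS) + r")(/|$)",
    re.IGNORECASE,
)

def path_looks_private(url: str) -> bool:
    return bool(_SKIP_RE.match(urlparse(url).path))
```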
Rate limiting
The 30-crawls-per-agent-per-hour cap is a fixed-window counter keyed in Redis as
auto-index:agent:{id}:hour:{YmdH}. If a popular page on your
site is getting hammered, the limit throttles quickly, but normal
traffic patterns rarely hit it.
What gets indexed
Auto-indexed pages become type=auto sources. They show up
in the regular Sources list with a small "auto" pill so
you can see what's been picked up. You can preview, reindex, or delete
them like any other source.
The auto-source's title is the page <title> if available,
otherwise the URL. Path-based deduplication means the same URL crawled
twice doesn't create two sources.
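One plausible shape for that dedup, sketched here under assumptions the doc doesn't spell out (that host case, query string, fragment, and a trailing slash are all ignored; the function names are hypothetical):

```python
from urllib.parse import urlsplit

# Hypothetical normalization behind path-based dedup: two URLs that differ
# only in query string, fragment, host case, or a trailing slash collapse
# to the same key, so a repeat visit can't create a second source.
def dedup_key(url: str) -> str:
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"

seen: set[str] = set()

def already_indexed(url: str) -> bool:
    key = dedup_key(url)
    if key in seen:
        return True
    seen.add(key)
    return False
```

Dropping the query string is the strongest assumption here; a site whose content varies by query parameter would need a looser key.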
Disabling and pruning
Turn the toggle off and no new pages will be queued, but existing
auto-sources stay. To clean them up, filter the sources list by
type = auto and delete in bulk.