Knowledge sources
Sources are how an agent learns about your business. This page covers every kind of source, the ingestion pipeline, and what to expect after you click "Add".
Source types
| Type | Use it for | What we ingest |
|---|---|---|
| `url` | One specific page | Crawl + extract main content + chunk + embed |
| `sitemap` | A whole site at once | Read the sitemap, fan out to one `CrawlPageJob` per URL |
| `feed` | RSS / Atom blogs | Same as sitemap but reads `<item>` entries |
| `text` | FAQs, snippets, anything you can paste | Skip the crawl, chunk + embed directly |
| `notion` | Notion pages or databases | OAuth into Notion, fetch via API, treat each page as a document |
| `google_doc` | Google Docs (Workspace) | OAuth, fetch via Drive API, ingest as a document |
| `auto` | Pages visitors land on | Auto-queued by `AutoIndexPageVisit` from `/v1/widget/init` |
Add a source
Open `/app/agents/{id}/sources`. The Add source modal handles all types in one form. Behind the scenes:
- Validate. URLs must be http/https; private hosts (`10.x`, `192.168.x`, `127.x`, `::1`) are blocked to prevent SSRF (sketched below).
- Create the source row with `status = pending`.
- Dispatch a job: `CrawlSourceJob` for url/sitemap/feed; `IngestNotionPageJob` / `IngestGoogleDocJob` for connected sources; `IndexTextSourceJob` for pasted text.
- The job runs on the `crawl` queue, fetches content, creates Document rows, then dispatches `IndexDocumentJob` on the `index` queue.
- The status flips from `pending` → `crawling` → `done` (or `failed` with an error message you can read in the UI).
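A minimal sketch of that validation step, with an assumed helper name and flag set (not the production code):

```php
<?php

// Illustrative only: rejects non-http(s) schemes and private/reserved
// hosts (10.x, 192.168.x, 127.x, ::1, ...) before a source is accepted.
function isCrawlableUrl(string $url): bool
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    $host   = parse_url($url, PHP_URL_HOST);

    if (! in_array($scheme, ['http', 'https'], true) || ! $host) {
        return false;
    }

    // Resolve hostnames so a public name pointing at a private IP is caught too.
    $ip = filter_var($host, FILTER_VALIDATE_IP) ? $host : gethostbyname($host);

    // NO_PRIV_RANGE rejects 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16;
    // NO_RES_RANGE rejects 127.0.0.0/8, ::1 and other reserved ranges.
    return filter_var(
        $ip,
        FILTER_VALIDATE_IP,
        FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
    ) !== false;
}
```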
Auto-discovery
On the sources page, the Discover button takes a domain and probes it for crawlable pages without you having to list them. We:
- Read `robots.txt` for sitemap declarations.
- Probe a sitemap directly when present.
- Try a small set of common paths: `/about`, `/pricing`, `/features`, `/products`, `/faq`, `/docs`, `/help`, `/support`, `/contact`.
- Return a checkable list. Tick which to ingest, hit Add selected.
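In code terms, that probe is roughly the following. The function name and timeouts are illustrative, and the real discovery code is more defensive about content types and failures:

```php
<?php

use Illuminate\Support\Facades\Http;

// Rough sketch of the Discover probe described above.
function discoverCandidates(string $domain): array
{
    $base = rtrim($domain, '/');
    if (! str_starts_with($base, 'http')) {
        $base = 'https://'.$base;
    }

    $candidates = [];

    // 1. robots.txt may declare one or more sitemaps.
    $robots = Http::timeout(5)->get($base.'/robots.txt');
    if ($robots->ok()) {
        preg_match_all('/^sitemap:\s*(\S+)/im', $robots->body(), $matches);
        $candidates = $matches[1];
    }

    // 2. Probe a small set of common paths and keep the ones that respond.
    $paths = ['/about', '/pricing', '/features', '/products', '/faq',
              '/docs', '/help', '/support', '/contact'];

    foreach ($paths as $path) {
        if (Http::timeout(5)->get($base.$path)->ok()) {
            $candidates[] = $base.$path;
        }
    }

    return array_values(array_unique($candidates));
}
```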
Sitemap fan-out
Adding a Source of type `sitemap` dispatches one `CrawlPageJob` per URL in the sitemap, staggered by a small per-page delay so Cloudflare Browser Rendering doesn't rate-limit on burst. The discoverer (`SitemapDiscoverer`) handles three input shapes:
- Domain root (`https://example.com`) → probes `/sitemap.xml` + `/sitemap_index.xml`.
- Direct sitemap URL (`https://example.com/sitemap.xml` or `https://example.com/products/sitemap.xml`) → fetched verbatim. Before the fix, the discoverer appended a second `/sitemap.xml` here and 404'd the request.
- Sitemap index (the `<sitemapindex>` XML that many CMSes — WordPress, Shopify, Webflow — emit by default) → recurses one level into each child sitemap and aggregates page URLs.
Output is deduped (so a URL listed in two child sitemaps gets indexed once) and capped at `services.crawl.max_pages_per_source` (default 500, override via `CRAWL_MAX_PAGES_PER_SOURCE`). The cap used to be 25, which meant a buyer adding a 100-URL sitemap silently lost 75 pages; the new default is generous enough for most marketing / docs sites. Very large catalogues should split the sitemap by section anyway.
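Put together, the fan-out amounts to dedupe, cap, then one delayed dispatch per URL. A sketch — the 2-second stagger, job constructor, and class namespaces are assumptions:

```php
<?php

use App\Jobs\CrawlPageJob; // namespace is an assumption
use App\Models\Source;     // likewise

// Dedupe, cap at the configured maximum, then dispatch one delayed
// CrawlPageJob per URL onto the crawl queue.
function fanOutSitemap(Source $source, array $sitemapUrls): void
{
    $urls = array_slice(
        array_values(array_unique($sitemapUrls)),
        0,
        config('services.crawl.max_pages_per_source', 500)
    );

    foreach ($urls as $i => $url) {
        CrawlPageJob::dispatch($source, $url)
            ->onQueue('crawl')
            // Stagger dispatches so the rendering provider isn't hit in a burst.
            ->delay(now()->addSeconds($i * 2));
    }
}
```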
Crawler strategies
The crawler is provider-driven. In order of preference:
- Cloudflare Browser Rendering — preferred. Full JS rendering, fast, no SSRF risk because egress is on Cloudflare. Used when `CLOUDFLARE_ACCOUNT_ID` + `CLOUDFLARE_API_TOKEN` are set.
- Browserless — fallback when `BROWSERLESS_TOKEN` is set. Same headless-Chrome behavior on a different vendor.
- Plain HTTP — last resort for server-rendered sites. No JS execution. Free.
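That preference order boils down to a credential check. A sketch — the label strings stand in for the real provider classes, and a real app would read these through `config()`:

```php
<?php

// Preference cascade keyed off the env vars named above.
function resolveCrawlProvider(): string
{
    if (env('CLOUDFLARE_ACCOUNT_ID') && env('CLOUDFLARE_API_TOKEN')) {
        return 'cloudflare-browser-rendering'; // full JS rendering
    }

    if (env('BROWSERLESS_TOKEN')) {
        return 'browserless';                  // headless Chrome fallback
    }

    return 'plain-http';                       // no JS execution, server-rendered sites only
}
```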
Once HTML is in hand, `ReadabilityExtractor` strips nav, footer, ads, etc., leaving the article body. Pages under 200 chars or detected as 404s are dropped.
File upload parsing
Direct file uploads (Sources → Upload files) are parsed locally first, then handed to the same chunk + embed pipeline crawled pages use. The parser is picked by file extension:
| Extension | Parser | Network call? |
|---|---|---|
| `.pdf`, `.docx`, `.doc`, `.xlsx`, `.xls`, `.odt`, `.ods` | Cloudflare Workers AI `toMarkdown` when CF creds are configured; `Smalot\PdfParser` / `PhpOffice\PhpWord` otherwise | Yes — one multipart POST per file to `/ai/tomarkdown` (free of cost, 0 Neurons) |
| `.csv` | `League\Csv` — emits one segment per row formatted as `col: value \| col: value` | No |
| `.md`, `.markdown`, `.txt` | Plain text, split on H1/H2 headings | No |
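Routing is effectively a switch on the lowercased extension. A sketch mirroring the table — the label strings stand in for the real parser classes, and how unsupported extensions are handled is an assumption:

```php
<?php

// Extension-based parser routing.
function pickParser(string $filename): string
{
    return match (strtolower(pathinfo($filename, PATHINFO_EXTENSION))) {
        'pdf', 'docx', 'doc', 'xlsx', 'xls', 'odt', 'ods'
            => 'cloudflare-tomarkdown-or-php-office',
        'csv'
            => 'league-csv',
        'md', 'markdown', 'txt'
            => 'plain-text',
        default
            => throw new InvalidArgumentException('Unsupported file type'),
    };
}
```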
The Cloudflare path is preferred for binary office formats because
Smalot and PhpWord are unreliable on real-world documents: Word-
exported PDFs that put body text in one big content stream, scanned
PDFs with a thin text layer, and DOCX files with nested tables or
text frames all tend to extract poorly. Workers AI's
toMarkdown returns structured markdown (headings, lists,
tables preserved) which feeds the chunker much better.
Pricing: toMarkdown is free for every format above.
Only image-to-markdown conversion consumes Workers AI Neurons (we
do not send images). When Cloudflare credentials are absent (BYOK
OpenAI customers, fresh installs), or when the Cloudflare call
fails, the in-process PHP parsers take over so uploads never
silently break.
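A hedged sketch of that call and its fallback for the PDF case. The endpoint path follows the `/ai/tomarkdown` route named above; reading `result.0.data` is an assumption about the response JSON shape, so check the actual payload before relying on it:

```php
<?php

use Illuminate\Support\Facades\Http;
use Smalot\PdfParser\Parser as PdfParser;

// Try Workers AI toMarkdown first, fall back to the in-process parser
// so uploads never silently break (PDF-only fallback in this sketch).
function pdfToText(string $path, string $accountId, string $apiToken): string
{
    try {
        $response = Http::withToken($apiToken)
            ->attach('files', file_get_contents($path), basename($path))
            ->post("https://api.cloudflare.com/client/v4/accounts/{$accountId}/ai/tomarkdown");

        if ($response->successful() && $response->json('result.0.data')) {
            return $response->json('result.0.data');
        }
    } catch (\Throwable $e) {
        // Network or auth failure: fall through to the local parser.
    }

    return (new PdfParser())->parseFile($path)->getText();
}
```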
Whichever parser ran, the resulting text is persisted under `storage/app/private/uploads/{source_id}/segment-N.txt`.
That's the file the Reindex button reads — you don't need to
re-upload the original to re-index.
Chunking and embedding
The extractor's text goes into `Chunker`, a recursive splitter that prefers semantic boundaries:
- Split on markdown headings, then blank lines (paragraphs).
- Pack paragraphs greedily up to a target size (~2000 chars / ~500 tokens).
- If a paragraph is too big, fall back to sentence boundaries.
- Char-window as the absolute last resort.
- Add a small overlap between chunks so cross-chunk facts stay linkable.
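A condensed sketch of that strategy, covering headings and paragraphs only; the sentence and char-window fallbacks are omitted, and the 200-char overlap is an assumed value:

```php
<?php

// Split on headings, then blank lines, pack greedily to the target size,
// and carry a small tail over as overlap between chunks.
function chunkText(string $text, int $target = 2000, int $overlap = 200): array
{
    // Split before each markdown heading so a heading stays with its body.
    $sections = preg_split('/^(?=#{1,6}\s)/m', $text) ?: [$text];

    $chunks = [];
    foreach ($sections as $section) {
        // Paragraphs are separated by blank lines.
        $paragraphs = preg_split('/\n{2,}/', trim($section)) ?: [];

        $buffer = '';
        foreach ($paragraphs as $paragraph) {
            // Flush when adding the next paragraph would overshoot the target.
            if ($buffer !== '' && strlen($buffer) + strlen($paragraph) > $target) {
                $chunks[] = trim($buffer);
                // Keep a small tail so cross-chunk facts stay linkable.
                $buffer = substr($buffer, -$overlap)."\n\n";
            }
            $buffer .= $paragraph."\n\n";
        }

        if (trim($buffer) !== '') {
            $chunks[] = trim($buffer);
        }
    }

    return $chunks;
}
```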
Each chunk is embedded in a batch (default 100 chunks per call) and upserted into the vector store with metadata: `agent_id`, `document_id`, `chunk_id`, `url`, `workspace_id`, `source_id`, `lang`.
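In outline, the batching and upsert look like this; the two callables stand in for whichever embedding provider and vector store are configured:

```php
<?php

// Batched embed + upsert step.

/** @param array<int, array{id: string, text: string, meta: array}> $chunks */
function embedAndUpsert(array $chunks, callable $embed, callable $upsert): void
{
    foreach (array_chunk($chunks, 100) as $batch) {          // default 100 chunks per call
        $vectors = $embed(array_column($batch, 'text'));      // one embeddings request per batch

        $points = [];
        foreach ($batch as $i => $chunk) {
            $points[] = [
                'id'       => $chunk['id'],
                'values'   => $vectors[$i],
                // meta carries agent_id, document_id, chunk_id, url,
                // workspace_id, source_id and lang.
                'metadata' => $chunk['meta'],
            ];
        }

        $upsert($points);
    }
}
```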
Crawl retry policy
Each `CrawlPageJob` attempts up to 3 times with backoff `[30s, 90s, 180s]`. The retry path is split by failure class:
- Rate-limited (HTTP 429, "too many requests" upstream) → released back to the queue with a fresh 60-second delay without burning a retry slot. Every fan-out page on the same workspace tends to hit the same 429 wave; the shared wait is productive.
- Permanent failure (curl DNS resolve / connection refused, HTTP 400 / 401 / 403 / 404 / 410 / 451, malformed URL) → short-circuits via `fail()` so the Source row gets the real reason immediately instead of being stranded behind two more retries that will deterministically fail.
- Transient (5xx, network blip) → normal retry with backoff.
- Per-job timeout → 90s, with `failOnTimeout = true` so a worker SIGTERM still flips the source to `failed` with a customer-readable error.
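In Laravel job terms the policy maps onto the standard retry knobs. A condensed sketch — the exception classes and the fetch helper are stand-ins for the real crawler internals:

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;

// Stand-in exceptions; the real crawler classifies failures itself.
class RateLimitedException extends \RuntimeException {}
class PermanentCrawlException extends \RuntimeException {}

class CrawlPageJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    public int $tries = 3;
    public int $timeout = 90;
    public bool $failOnTimeout = true;

    /** Backoff between normal retries, in seconds. */
    public function backoff(): array
    {
        return [30, 90, 180];
    }

    public function handle(): void
    {
        try {
            $this->fetchAndIndex();
        } catch (RateLimitedException $e) {
            // 429 upstream: put the job back with a fresh 60-second delay.
            $this->release(60);
        } catch (PermanentCrawlException $e) {
            // DNS failure, 4xx gone-forever, malformed URL: record the real
            // reason immediately instead of retrying.
            $this->fail($e);
        }
        // Anything else (5xx, network blip) bubbles up and retries with the
        // backoff above.
    }

    private function fetchAndIndex(): void
    {
        // Fetch via the configured provider, extract, persist Document rows,
        // then dispatch IndexDocumentJob on the index queue.
    }
}
```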
Buyer-facing error messages on the Sources list are sanitized via `SourceErrorPresenter`: raw upstream JSON envelopes (Cloudflare 401 bodies, Browserless stack traces) get rewritten to friendly lines like "We couldn't reach this page" or "The crawl service is busy right now — we will retry automatically." Operators still see the full raw message under Show details.
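A sketch of the sanitisation, assuming a simple pattern map; the real `SourceErrorPresenter` has more rules, and the default wording below is an assumption:

```php
<?php

// Map raw upstream errors to buyer-facing lines; the raw message is kept
// elsewhere for the "Show details" view.
function presentCrawlError(string $raw): string
{
    return match (true) {
        str_contains($raw, '429'),
        stripos($raw, 'too many requests') !== false
            => 'The crawl service is busy right now — we will retry automatically.',
        stripos($raw, 'could not resolve host') !== false,
        str_contains($raw, '404'),
        str_contains($raw, '403')
            => "We couldn't reach this page.",
        default => 'Something went wrong while indexing this source.', // assumed fallback
    };
}
```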
Reindex and preview
From the sources list, each row has:
- Reindex — re-runs the crawl + chunk + embed pipeline. For uploaded files the reindex reads back the persisted segment text under `storage/app/private/uploads/{source_id}/segment-N.txt`, so there is no need to re-upload the original. If the persisted file is missing (pre-fix uploads or a disk wipe) the UI surfaces a "Re-upload" prompt.
- Preview — shows the extracted documents and a sample of chunks so you can spot bad extraction (e.g. a nav bar polluting the text).
- Delete — removes the source, its documents, its chunks, and the corresponding vector points.
Notion and Google Docs
Both use OAuth. Connect once from /app/integrations; the
token is encrypted at rest. After connecting, the source modal lets you
pick pages or documents directly.
Re-syncs are manual (per-source Reindex button): we don't poll your Notion / Drive on a schedule. If you change a Notion page, click Reindex on that source.
"My agent doesn't know about the file I just uploaded"
Cloudflare Vectorize has eventual consistency on metadata-filtered queries: even after an upsert returns 200 OK, an `agent_id`-filtered query against that vector typically returns 0 hits for the first 30 to 60 seconds while the metadata index propagates across edge regions.
Practical consequence: a freshly uploaded file shows up as
`status=indexed` in the Sources page immediately, but the
agent won't be able to answer questions about it until the propagation
window closes. The upload-success banner reminds the admin of this.
If the agent still doesn't return relevant chunks after a minute,
open the source's Preview to confirm the extracted
text isn't empty; that's a parser-side issue, not a vector-side one.
The same gotcha applies to the very first upload after a Cloudflare Vectorize index is created: the index itself has a ~2 minute provisioning lag before any queries return results, even unfiltered ones.
Storage and retention
- Postgres — sources, documents, chunks (text + metadata).
- Vector store — embeddings. Cloudflare Vectorize when configured, Qdrant otherwise.
- R2 / object storage — original artifacts (PDFs, images) when uploaded.
Deleting a source cascades: documents, chunks, and vector points all go in one transaction. There's no soft-delete on sources.
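A sketch of what that cascade looks like; the relation names and the vector-store delete call are illustrative, not the actual implementation:

```php
<?php

use App\Models\Source; // namespace is an assumption
use Illuminate\Support\Facades\DB;

// Rows and vector points go together, inside one transaction.
function deleteSource(Source $source, object $vectorStore): void
{
    DB::transaction(function () use ($source, $vectorStore) {
        // Vector points are filtered on the same source_id metadata written at upsert.
        $vectorStore->deleteByMetadata(['source_id' => $source->id]);

        $source->chunks()->delete();
        $source->documents()->delete();
        $source->delete();
    });
}
```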