Knowledge retrieval

Retrieval is the step that turns "answer this question" into "read these specific pages". It is also the step where most AI systems quietly fail: a model is only as good as what it gets to read, and getting the right material in front of it is harder than it looks.

This page explains how OCC retrieves knowledge, why it does so this way, and the trade-offs against more conventional approaches.

The problem with classic RAG

The standard answer to "give a language model access to a knowledge base" is retrieval-augmented generation (RAG): chop documents into small chunks, build a vector embedding of each chunk, embed the query, return the top-K nearest chunks, and dump them into the prompt.

This works for some cases. It fails badly for others, in ways that are structural rather than tunable:

  • Lost context. A chunk taken out of a long document loses its place in the argument. The model sees a fragment and produces a confident answer that misses what the surrounding text would have made clear.
  • Provenance debt. Once chunks are flattened into a vector index, the link between a fact and its source blurs. Citing the source means mapping the chunk back to the document it came from, and even then, which page, which paragraph?
  • No internal structure. The knowledge base is just a flat soup of vectors. There is no way for one chunk to say "if you care about this, also look at that". Cross-references — the connective tissue of a real wiki — are gone.
  • All-or-nothing relevance. Top-K cosine similarity treats the corpus as homogeneous. It has no native concept of "this chunk belongs to topic A; that one to topic B", which is exactly the distinction needed when a question spans more than one topic.

OCC takes a different starting point.

The LLM-wiki pattern

OCC's pack format is adapted from a pattern Andrej Karpathy described in early 2026: rather than chunking documents for retrieval, you have a language model build a structured wiki from the source material — a small, dense graph of pages with titles, summaries, body text, and explicit cross-references.

The pieces of an OCC pack:

  • index.md — a table of every page in the pack, each row a (file, title, summary). This is the entry point. Retrieval finds pages by matching the question against titles and summaries, not by computing similarity against a vector store.
  • Concept pages — one per topic, with a fixed structure: frontmatter metadata, an abstract paragraph, dense factual sections, a ## See Also block with wikilinks to related pages, and a ## Sources block tracing back to the original documents.
  • schema.md — the pack's own conventions: naming, tone, what counts as in-scope.
  • log.md — an append-only record of when sources were ingested.
  • raw/ — the original source documents, immutable.
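
Because index.md is the entry point, it is simple enough to parse directly. Here is a minimal Python sketch. It assumes index.md is a Markdown pipe table whose three columns are file, title and summary in that order. That layout is an assumption, and `IndexRow` and `parse_index` are hypothetical names.

```python
from typing import NamedTuple


class IndexRow(NamedTuple):
    file: str
    title: str
    summary: str


def parse_index(markdown: str) -> list[IndexRow]:
    """Parse a pack's index.md, assumed to be a pipe table: | file | title | summary |."""
    rows: list[IndexRow] = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip prose around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3:
            continue  # not a three-column row (e.g. a summary containing "|")
        if cells[0].lower() == "file" or set(cells[0]) <= set("-: "):
            continue  # header row or the |---|---| separator row
        rows.append(IndexRow(*cells))
    return rows
```

Each row carries only metadata. That is all the search step needs; page bodies stay in their own files.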

key design choice

Pages are the unit of retrieval, not chunks. A page is a coherent statement on a single topic, written to be read whole. When the model needs a topic, it gets the whole page — abstract, body, cross-links, sources — not a slice.

The full pack anatomy is described in Anatomy of a pack. What follows here is how OCC actually uses a pack to answer a question.

Step 1 — Decomposition

A question may touch one topic or several. Searching with a single phrase blurs the relevance signal: a query like "how do I containerize an MCP server in Docker?" returns a confused mix of Docker and MCP pages, or worse, only pages from the dominant domain.

Before retrieval, OCC breaks the question into independent sub-queries:

  • "How do I containerize an MCP server in Docker?" → ["Docker containerization", "MCP server protocol"]
  • "What is photosynthesis?" → ["photosynthesis"]
  • "Compare Roman emperors with Neanderthals" → ["Roman emperors", "Neanderthals"]

A small language model call produces the decomposition. Single-topic questions stay as one sub-query; multi-domain questions become two to four. The decomposer is intentionally minimal — its job is to identify topical seams, not to interpret the question.
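
The sketch below shows one way such a decomposer could work. It assumes the small model is exposed as a plain `llm(prompt) -> str` callable and is asked to reply with a JSON array. The prompt wording, `decompose`, and the fallback behaviour are all hypothetical and are not OCC's actual implementation.

```python
import json
from typing import Callable

# Hypothetical prompt; the real decomposer prompt is not shown on this page.
DECOMPOSE_PROMPT = (
    "Split the question into 1 to 4 independent topical search queries. "
    "Name the topics only; do not answer or interpret the question. "
    "Reply with a JSON array of strings and nothing else.\n\nQuestion: {question}"
)


def decompose(question: str, llm: Callable[[str], str], max_parts: int = 4) -> list[str]:
    """Ask a small model for topical sub-queries; fall back to the question itself."""
    raw = llm(DECOMPOSE_PROMPT.format(question=question))
    try:
        parts = json.loads(raw)
    except json.JSONDecodeError:
        return [question]  # unparseable reply: search with the original question
    if not isinstance(parts, list):
        return [question]
    cleaned = [p.strip() for p in parts if isinstance(p, str) and p.strip()]
    return cleaned[:max_parts] or [question]
```

If the model returns a malformed reply, the fallback to the original question means the query is still searched rather than dropped.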

Step 2 — Search

Each sub-query is sent to the broker's /search endpoint. The broker maintains a full-text index over every page of every approved pack — (pack_path, page_file, title, summary) tuples — using SQLite FTS5, which provides BM25 ranking and sub-millisecond lookups even at scales of millions of rows.

The broker returns, for each sub-query, the top N most relevant pages with their pack path, file, title, summary, and a relevance score.
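
Here is a minimal sketch of a metadata-only FTS5 index with a per-sub-query search, using Python's built-in `sqlite3` module. It needs an SQLite build with FTS5 enabled, which most distributions ship. The table name `pages`, the UNINDEXED columns, and the OR-of-quoted-terms query are illustrative assumptions, not the broker's actual schema.

```python
import re
import sqlite3


def build_index(conn: sqlite3.Connection, rows: list[tuple[str, str, str, str]]) -> None:
    """Create a metadata-only FTS5 index over (pack_path, page_file, title, summary)."""
    # Only title and summary are tokenized; the path columns are stored but not searched.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5("
        "pack_path UNINDEXED, page_file UNINDEXED, title, summary)"
    )
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", rows)


def to_fts_query(text: str) -> str:
    """Quote each word and OR them, so any matching term makes a page a candidate."""
    return " OR ".join(f'"{word}"' for word in re.findall(r"\w+", text))


def search(conn: sqlite3.Connection, query: str, n: int = 5) -> list[tuple]:
    """Return the top-n (pack_path, page_file, title, summary, score) rows for one sub-query."""
    fts_query = to_fts_query(query)
    if not fts_query:
        return []  # nothing searchable in the sub-query
    # bm25() is smaller for better matches: sort ascending, negate for a higher-is-better score.
    return conn.execute(
        "SELECT pack_path, page_file, title, summary, -bm25(pages) AS score "
        "FROM pages WHERE pages MATCH ? ORDER BY bm25(pages) LIMIT ?",
        (fts_query, n),
    ).fetchall()
```

Quoting each word prevents punctuation in a question such as "Docker?" from being parsed as FTS5 query syntax.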

A few properties of this design worth noting:

  • The Node never sees the full pack catalog. It does not download a list of every pack and search locally; it asks the broker a focused question and receives a focused answer. Whether there are three packs or three thousand, the Node sends one request per sub-query and gets back a short ranked list of the same size.
  • The index is metadata-only. The broker indexes titles and summaries, not page bodies. Page bodies stay in their files on disk and are fetched only when needed. This keeps the index compact and the search fast.
  • The broker is a service, not the authority. Anyone could host a replacement. The packs themselves — the truth layer — are version-controlled in a public repository.

Step 3 — Round-robin merge

When several sub-queries each return their own top N, the obvious next step is to merge them by score and pick the best.

the obvious step is wrong

BM25 scores are not comparable across queries. A sub-query whose terms happen to match well in a particular pack produces uniformly higher scores than another sub-query whose target pack uses different vocabulary. A naive merge would let one sub-query starve the others, and a multi-domain question would silently collapse into a single-domain answer.

OCC interleaves results in a round-robin: round 0 takes the top result of each sub-query, round 1 takes the second result of each, and so on, up to a per-sub-query cap and a total cap. The result is a balanced set of pages — every sub-query represented, no domain crowded out.
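
The interleaving takes only a few lines. This sketch assumes each sub-query's results arrive already ranked, as hashable identifiers such as `(pack_path, page_file)`. Skipping a page that another sub-query has already contributed is an extra assumption not stated above.

```python
from typing import Hashable, Sequence, TypeVar

T = TypeVar("T", bound=Hashable)


def round_robin_merge(
    per_query: Sequence[Sequence[T]], per_query_cap: int, total_cap: int
) -> list[T]:
    """Interleave ranked lists: round 0 takes each list's top hit, round 1 the second, and so on."""
    merged: list[T] = []
    seen: set[T] = set()
    for rnd in range(per_query_cap):
        for results in per_query:
            if len(merged) >= total_cap:
                return merged
            if rnd < len(results) and results[rnd] not in seen:
                seen.add(results[rnd])  # assumed: one copy per page across sub-queries
                merged.append(results[rnd])
    return merged
```

The function never compares scores. Only a page's rank within its own sub-query matters, so one sub-query's vocabulary cannot crowd out another's.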

Step 4 — Fetch and See Also expansion

The Node fetches the selected pages from the broker, grouped by pack, in parallel. Each page arrives as a complete markdown file: frontmatter, abstract, body, See Also, Sources.
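
The sketch below groups pages by pack and fetches the groups in parallel with `asyncio`. `fetch_pack` stands in for a hypothetical broker call that returns the requested pages of one pack as `{page_file: markdown}`. This page does not describe the real broker API.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

# Hypothetical broker call: (pack_path, [page_file, ...]) -> {page_file: markdown}.
FetchPack = Callable[[str, list[str]], Awaitable[dict[str, str]]]


async def fetch_pages(
    selected: list[tuple[str, str]], fetch_pack: FetchPack
) -> dict[tuple[str, str], str]:
    """Group (pack_path, page_file) pairs by pack and fetch every group concurrently."""
    by_pack: dict[str, list[str]] = defaultdict(list)
    for pack, page in selected:
        by_pack[pack].append(page)
    packs = list(by_pack)
    # One request per pack, all in flight at once.
    results = await asyncio.gather(*(fetch_pack(p, by_pack[p]) for p in packs))
    pages: dict[tuple[str, str], str] = {}
    for pack, bodies in zip(packs, results):
        for page_file, body in bodies.items():
            pages[(pack, page_file)] = body
    return pages
```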

After the initial fetch, OCC does one more round of retrieval — but this round does not go back to the search index. It walks the See Also wikilinks in the fetched pages.

Each See Also entry is a [[slug]] reference to another page in the same pack. The Node then:

  1. Parses the wikilinks out of the fetched pages.
  2. Filters them against that pack's index.md to make sure they exist.
  3. Fetches the linked pages (capped per pack, deduplicated against pages already fetched).

curated signal

Cross-links exist because someone (the pack author or the Forge cross-link pass) decided they were meaningful. They are a higher-quality signal than vector similarity because they survive a human review. The expansion stops at one hop: one hop is usually enough to surface the supporting context an Expert needs, and going further would risk drifting away from the question.

Step 5 — Tier-aware budget

The retrieved pages are concatenated into a context that the Expert will read. There is a cap on how much can fit — and that cap depends on the user's hardware.

OCC's tier system, detailed in The Node, classifies machines from micro (CPU-only laptops) through server-l (multi-GPU servers). Each tier sets a retrieval_chars budget:

Tier        retrieval_chars
micro       8,000
small       8,000
mid         12,000
large       32,000
xl          32,000
server-s    65,000
server-l    65,000

A laptop assembles roughly 8 KB of curated text per question — enough for two or three pages. A server assembles 65 KB — enough for a dozen pages with room to spare. The deliberation step's caps (Expert draft length, Critic excerpt, Synthesizer input) all derive from the same retrieval_chars value, so the entire pipeline scales together.

This is what allows OCC to run usefully on modest hardware without holding stronger machines to the same small context. The same code path serves both; only the budget changes.

What the next step receives

By the end of retrieval, the Node has produced two things:

  1. A context string — the concatenation of fetched page bodies, labelled with pack/page headers, capped at retrieval_chars. This is what the Expert reads.

  2. A manifest of retrieved pages — a structured record of every page that was fetched, with title, summary, and full body. This goes to the Critic in the next step, alongside the excerpt the Critic actually reads. The manifest lets the Critic know what exists in the retrieval set even when the body falls outside its own excerpt window — and it can call fetch_full_page to read any of those bodies in full when it needs to verify a specific claim.
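
The sketch below puts the tier budget and the two outputs together. The `RETRIEVAL_CHARS` values come from the tier table above. The `=== pack / file ===` header format, the `Page` record, and truncating the last page to fit are assumptions made for illustration.

```python
from dataclasses import dataclass

# retrieval_chars per tier, as listed in the tier table.
RETRIEVAL_CHARS = {
    "micro": 8_000, "small": 8_000, "mid": 12_000, "large": 32_000,
    "xl": 32_000, "server-s": 65_000, "server-l": 65_000,
}


@dataclass
class Page:
    pack: str
    file: str
    title: str
    summary: str
    body: str


def build_context(pages: list[Page], tier: str) -> tuple[str, list[dict[str, str]]]:
    """Concatenate labelled page bodies up to the tier's budget; return (context, manifest)."""
    budget = RETRIEVAL_CHARS[tier]
    parts: list[str] = []
    used = 0
    for page in pages:
        remaining = budget - used
        if remaining <= 0:
            break
        block = f"=== {page.pack} / {page.file} ===\n{page.body}\n"  # hypothetical header
        parts.append(block[:remaining])  # assumed: the last block is truncated to fit
        used += min(len(block), remaining)
    # The manifest lists every retrieved page, including bodies that did not fit.
    manifest = [
        {"pack": p.pack, "file": p.file, "title": p.title, "summary": p.summary, "body": p.body}
        for p in pages
    ]
    return "".join(parts), manifest
```

The context string stops at the budget, but the manifest always lists every retrieved page. That is how a page cut from the context can still be fetched in full later.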

The retrieval step ends here. From this point on, the Expert, Critic, and Synthesizer take over — see Multi-agent deliberation.


Why this design

A short summary of the trade-offs OCC made, for readers comparing against other approaches:

  • Pages over chunks — preserves coherence, makes provenance trivial, supports cross-references.
  • Title/summary search over body embeddings — keeps the index small enough to live on a single broker even at scale, makes it inspectable, avoids embedding-model lock-in.
  • Decomposition + round-robin — handles multi-domain questions without collapsing them.
  • See Also expansion over wider top-K — uses curated relationships instead of statistical similarity.
  • Tier-aware budgets — every hardware class produces its best possible answer with the same code.

Each of these choices makes sense in isolation. Together, they let small models grounded in curated knowledge produce answers that hold up under inspection.
