RAG is only as good as the knowledge you feed it — and most companies' "documentation" is a graveyard of PDFs.
RAG (Retrieval-Augmented Generation) means the system searches your documents first, then generates an answer from what it found — your policies, procedures, and project history — not from the public web.
Building that corpus is less about buying software and more about organizing how you already work. The previous article on system context explained why; this one shows how, step by step.
At a glance
- Start with one domain (HR policies, safety, client onboarding) — not the entire drive
- Clean, current, permissioned sources beat raw volume
- Tie the corpus to data governance and a named owner
- Maintenance is ongoing — stale RAG is worse than none
What belongs in a v1 knowledge base
| Include | Exclude (initially) |
|---|---|
| Approved policies and SOPs | Draft folders, "copy of copy" |
| FAQ maintained by a team | Personal desktop dumps |
| Stable templates and checklists | Years of stale project archives |
| Ticket resolutions (if sanitized) | Unclassified client files |
Same discipline as reducing firefighting — document what people actually use, not what theoretically exists.
In one SMB pilot, three different "onboarding checklists" lived in email, SharePoint, and a manager's desktop. Retrieval kept surfacing the oldest version. Consolidating to one approved doc per topic improved answer quality more than any tweak to the embeddings (the numeric representations of text used for search).
Structure that retrieval loves
- One topic per document where possible — long PDFs retrieve poorly
- Clear titles — "Vacation policy 2026" not "final_v3_NEW"
- Metadata — owner, last review date, language (fr/en), status (draft vs approved)
- Version control — one canonical location; archive old explicitly
- Access mirrors HR/IT permissions — RAG shouldn't leak what SharePoint already restricted
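To make the metadata and permission points concrete, here is a minimal Python sketch of a per-document record with two checks: is the document eligible for the index, and may this user see it? The names `DocRecord`, `allowed_groups`, `is_indexable`, `visible_to`, and `retrievable_docs` are illustrative assumptions, not part of any specific platform. A real setup would read these fields from SharePoint, Notion, or whatever system holds the source documents.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocRecord:
    """Hypothetical metadata record for one document in the corpus."""
    title: str
    owner: str                      # named owner, e.g. "HR"
    last_review: date
    language: str                   # "fr" or "en"
    status: str                     # "draft" or "approved"
    allowed_groups: frozenset[str]  # assumed to mirror source-system permissions


def is_indexable(doc: DocRecord) -> bool:
    """Only approved documents with a named owner enter the index."""
    return doc.status == "approved" and bool(doc.owner.strip())


def visible_to(doc: DocRecord, user_groups: set[str]) -> bool:
    """The user must belong to at least one group allowed on the source doc."""
    return bool(doc.allowed_groups & user_groups)


def retrievable_docs(docs: list[DocRecord], user_groups: set[str]) -> list[DocRecord]:
    """Documents that are both index-eligible and visible to this user."""
    return [d for d in docs if is_indexable(d) and visible_to(d, user_groups)]
```

The point of the design is that filtering happens before retrieval, so a restricted document never reaches the answer step for a user who could not open it at the source.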
Build steps (4–6 weeks typical)
Weeks 1–2 — Inventory and cleanse
- List top 30 questions staff ask repeatedly
- Map which docs should answer them
- Delete duplicates; mark obsolete
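The duplicate hunt can be partly automated. The sketch below groups files whose text is identical once case and whitespace are normalized. It assumes the text has already been extracted into a dictionary of path to content; `fingerprint` and `find_duplicates` are hypothetical helper names. It only catches exact copies. Diverging versions of the same checklist still need a human to decide which one is canonical.

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(docs: dict[str, str]) -> list[list[str]]:
    """Return groups of paths whose normalized content is identical."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path, text in docs.items():
        groups[fingerprint(text)].append(path)
    # Keep only fingerprints shared by more than one file.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```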
Week 3 — Index pilot corpus
- Choose platform (SharePoint, Notion, dedicated RAG SaaS, custom on Azure — depends on stack)
- Index only data classified as green or yellow
- Test retrieval without generation — are the right chunks returned?
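A retrieval-only test can be as simple as checking whether the expected document shows up in the top results for each question. The sketch below assumes your platform exposes some search call that takes a question and a result count and returns document IDs; that signature is an assumption, and `toy_search` is only a keyword stand-in so the example runs without a real index.

```python
from typing import Callable

# Hypothetical signature: search(question, k) returns up to k document IDs.
SearchFn = Callable[[str, int], list[str]]


def hit_rate_at_k(search: SearchFn, cases: dict[str, str], k: int = 5) -> tuple[float, list[str]]:
    """Share of questions whose expected doc is in the top k, plus the misses."""
    misses = []
    for question, expected_doc in cases.items():
        if expected_doc not in search(question, k)[:k]:
            misses.append(question)
    rate = (len(cases) - len(misses)) / len(cases) if cases else 0.0
    return rate, misses


def toy_search(question: str, k: int) -> list[str]:
    """Stand-in keyword matcher; replace with the real platform's search."""
    index = {
        "vacation-policy-2026": "vacation days leave request",
        "onboarding-checklist": "new hire checklist laptop badge",
    }
    words = set(question.lower().split())
    ranked = sorted(
        index,
        key=lambda doc_id: len(words & set(index[doc_id].split())),
        reverse=True,
    )
    return ranked[:k]
```

The list of misses matters more than the rate: each missed question points to either a missing document or a document that retrieves poorly.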
Week 4 — Generate with citations
- Enable answers with source links
- Run 20 test questions
- Score: correct, partial, wrong, no answer
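Scoring can live in a spreadsheet, but a few lines of Python keep the tally consistent from one test run to the next. This sketch assumes a reviewer has already labeled each answer with one of the four labels above; `summarize` and `correct_share` are illustrative names.

```python
from collections import Counter

LABELS = ("correct", "partial", "wrong", "no answer")


def summarize(scores: list[str]) -> dict[str, int]:
    """Count answers per label, rejecting labels outside the agreed set."""
    unknown = sorted(set(scores) - set(LABELS))
    if unknown:
        raise ValueError(f"Unknown labels: {unknown}")
    counts = Counter(scores)
    return {label: counts[label] for label in LABELS}


def correct_share(scores: list[str]) -> float:
    """Fraction of answers labeled fully correct (0.0 for an empty run)."""
    return scores.count("correct") / len(scores) if scores else 0.0
```

Rejecting unknown labels is deliberate: a typo such as "partail" would otherwise silently vanish from the tally.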
Weeks 5–6 — HITL (human in the loop) and rollout
- Human review for sensitive or client-adjacent answers
- Train staff on "ask + verify source"
- Collect misfires into a fix queue — each error improves the corpus
Freshness and ownership
Stale RAG is worse than none — confident wrong policy quotes erode trust.
- Owner per domain (HR, ops, IT)
- Review cadence — quarterly minimum for regulated content
- Update the corpus whenever a policy changes, and remove the superseded doc from the index
Treat the knowledge base like any operational asset: someone owns it, someone reviews it, and stale content is removed — not left indexed "just in case."
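A review cadence is easier to keep when something flags overdue documents automatically. The sketch below assumes each document carries a last review date and treats "quarterly" as roughly 92 days; both the interval and the `overdue_for_review` name are assumptions to adapt to your own policy.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 92 days.
REVIEW_INTERVAL = timedelta(days=92)


def overdue_for_review(
    last_reviews: dict[str, date],
    today: date,
    interval: timedelta = REVIEW_INTERVAL,
) -> list[str]:
    """Titles whose last review is older than the allowed interval."""
    return sorted(
        title
        for title, reviewed_on in last_reviews.items()
        if today - reviewed_on > interval
    )
```

The output can feed the domain owner's task list, or drive automatic removal from the index until the review is done.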
Permissions and Quebec context
Personal information in indexed docs triggers Law 25 obligations. French and English corpora may need separate testing — retrieval quality differs by language.
Common failures
- Indexing everything — noise drowns signal
- No owner — corpus rots in six months
- Skipping retrieval tests and jumping straight to the chat UI
- Letting answers go external without human review
When RAG isn't the first move
If your problem is "we have no documented process," fix operations and friction first. RAG amplifies structure — it doesn't create it from chaos.
Once the pilot domain is stable, connect retrieval to agents only when answers are consistently sourced and reviewed — not when the chat UI looks impressive on day one.
Where you are
You've explored multimodal inputs; this article shows how to build the corpus that anchors AI in your documents. Next: Autonomous AI agents and workflows — useful when the frame is clear.
If your docs are scattered but the questions repeat, let's pick a pilot domain together and set up a simple retrieval test before you buy another chat tool.
