Building company knowledge bases for RAG

RAG is only as good as the knowledge you feed it — and most companies' "documentation" is a graveyard of PDFs.

RAG (Retrieval-Augmented Generation) means the system searches your documents first, then generates an answer from what it found — your policies, procedures, and project history — not from the public web.

Building that corpus is less about buying software and more about organizing how you already work. The previous article on system context explained why; this one shows how, step by step.

At a glance

Start with one domain (HR policies, safety, client onboarding) — not the entire drive
Clean, current, permissioned sources beat raw volume
Tie the corpus to data governance and a named owner
Maintenance is ongoing — stale RAG is worse than none

What belongs in a v1 knowledge base

Include	Exclude (initially)
Approved policies and SOPs	Draft folders, "copy of copy"
FAQ maintained by a team	Personal desktop dumps
Stable templates and checklists	Years of stale project archives
Ticket resolutions (if sanitized)	Unclassified client files

The same discipline applies as in reducing firefighting: document what people actually use, not what theoretically exists.

In one SMB pilot, three different "onboarding checklists" lived in email, SharePoint, and a manager's desktop. Retrieval kept surfacing the oldest version. Consolidating to one approved doc per topic improved answer quality more than any tweak to the embeddings (the numeric text representations used for search).

Structure that retrieval loves

One topic per document where possible — long PDFs retrieve poorly
Clear titles — "Vacation policy 2026" not "final_v3_NEW"
Metadata — owner, last review date, language (fr/en), status (draft vs approved)
Version control — one canonical location; archive old explicitly
Access mirrors HR/IT permissions — RAG shouldn't leak what SharePoint already restricted
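
The metadata and access rules above can be sketched as a small record plus two checks. This is a minimal illustration, not a platform's API: the field names, the status values, and the group names ("all-staff", "hr-team") are assumptions, and "Salary bands 2026" is an invented example document.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class DocRecord:
    """Metadata kept alongside each document (field names are illustrative)."""
    title: str
    owner: str
    last_review: date
    language: str   # "fr" or "en"
    status: str     # "draft", "approved", or "archived"
    allowed_groups: frozenset = field(default_factory=frozenset)


def is_indexable(doc: DocRecord) -> bool:
    """Only approved documents enter the index; drafts and archives stay out."""
    return doc.status == "approved"


def visible_to(doc: DocRecord, user_groups: set) -> bool:
    """A user sees a document only if they share at least one group with it."""
    return bool(doc.allowed_groups & user_groups)


# Hypothetical documents for illustration.
docs = [
    DocRecord("Vacation policy 2026", "HR", date(2026, 1, 15), "en",
              "approved", frozenset({"all-staff"})),
    DocRecord("Salary bands 2026", "HR", date(2026, 1, 15), "en",
              "approved", frozenset({"hr-team"})),
    DocRecord("final_v3_NEW", "unknown", date(2023, 5, 2), "en",
              "draft", frozenset({"all-staff"})),
]

index = [d for d in docs if is_indexable(d)]
results = [d.title for d in index if visible_to(d, {"all-staff"})]
print(results)
```

The draft never reaches the index, and the restricted document is indexed but filtered out for a user who only belongs to "all-staff". In practice the group list would be copied from the source system's permissions rather than maintained by hand.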

Build steps (4–6 weeks typical)

Weeks 1–2 — Inventory and cleanse

List top 30 questions staff ask repeatedly
Map which docs should answer them
Delete duplicates; mark obsolete
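
One way to find candidates for the duplicate cleanup is to hash each document's text after light normalization. The sketch below assumes the text has already been extracted from each file; the paths and contents are made up. It only catches copies that match after lowercasing and whitespace cleanup, so edited versions still need a human comparison.

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(docs: dict) -> list:
    """Group document paths whose normalized text is identical."""
    groups = defaultdict(list)
    for path, text in docs.items():
        groups[fingerprint(text)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


# Hypothetical inventory: path -> extracted text.
inventory = {
    "sharepoint/onboarding.docx": "Day 1: issue laptop.  Day 2: safety briefing.",
    "email/onboarding_copy.docx": "day 1: issue laptop. day 2: safety briefing.",
    "desktop/vacation.docx": "Request leave two weeks ahead.",
}

duplicates = find_duplicates(inventory)
print(duplicates)
```

Each group in the output is a set of files where one should become the canonical copy and the rest should be archived or deleted.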

Week 3 — Index pilot corpus

Choose platform (SharePoint, Notion, dedicated RAG SaaS, custom on Azure — depends on stack)
Index only green/yellow data, with its classification attached
Test retrieval without generation — are the right chunks returned?
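
A retrieval-only test can be as simple as a list of questions paired with the chunk each one should return. In the sketch below, `retrieve` is a toy word-overlap ranker standing in for whatever search your platform provides; the chunk IDs, chunk texts, and questions are hypothetical.

```python
def tokenize(text: str) -> set:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?!:;").lower() for w in text.split()} - {""}


def retrieve(question: str, chunks: dict, k: int = 1) -> list:
    """Toy retriever: rank chunks by word overlap with the question.

    Stand-in only; replace with a call to your platform's search.
    """
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda cid: len(q & tokenize(chunks[cid])),
                    reverse=True)
    return ranked[:k]


def run_retrieval_test(cases: list, chunks: dict, k: int = 1):
    """Return (hit rate, questions whose expected chunk was not in the top k)."""
    misses = [q for q, expected in cases
              if expected not in retrieve(q, chunks, k)]
    return 1 - len(misses) / len(cases), misses


# Hypothetical pilot corpus and test questions.
chunks = {
    "vacation-policy": "Employees request vacation leave two weeks in advance.",
    "onboarding-checklist": "New hire onboarding: laptop, badge, safety briefing.",
    "expense-policy": "Submit expense reports with receipts within 30 days.",
}
cases = [
    ("How far in advance do I request vacation?", "vacation-policy"),
    ("What does a new hire get during onboarding?", "onboarding-checklist"),
    ("When are expense reports due?", "expense-policy"),
    ("Can I carry over unused days?", "vacation-policy"),
]

rate, misses = run_retrieval_test(cases, chunks, k=1)
print(rate, misses)
```

The last question misses because the vacation chunk never mentions carry-over. A miss like that usually points to a gap in the documents rather than in the search settings.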

Week 4 — Generate with citations

Enable answers with source links
Run 20 test questions
Score: correct, partial, wrong, no answer
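
The four-way scoring can be tallied with a few lines. This sketch assumes a human reviewer has already assigned one label per test question; the grades shown are invented to illustrate the shape of the output, not results from a real pilot.

```python
from collections import Counter

LABELS = ("correct", "partial", "wrong", "no answer")


def summarize(grades: list) -> dict:
    """Return each label's share of all graded questions."""
    unknown = set(grades) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(grades)
    return {label: counts[label] / len(grades) for label in LABELS}


# Hypothetical grades for 20 test questions.
grades = ["correct"] * 13 + ["partial"] * 4 + ["wrong"] * 2 + ["no answer"]
summary = summarize(grades)
for label, share in summary.items():
    print(f"{label}: {share:.0%}")
```

Tracking "wrong" separately from "no answer" matters here: a system that declines to answer is easier to trust than one that answers incorrectly.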

Weeks 5–6 — HITL (human in the loop) and rollout

Human review for sensitive or client-adjacent answers
Train staff on "ask + verify source"
Collect misfires into a fix queue — each error improves the corpus

Freshness and ownership

Stale RAG is worse than none — confident wrong policy quotes erode trust.

Owner per domain (HR, ops, IT)
Review cadence — quarterly minimum for regulated content
Trigger updates when policy changes — block old doc from index
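
The review cadence can be checked automatically if each document carries a last-review date. The sketch below reads "quarterly" as 90 days, which is an assumption; the document titles and dates are hypothetical.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # assumed reading of "quarterly"


def overdue(last_reviews: dict, today: date) -> list:
    """Return titles whose last review is older than the review interval."""
    return sorted(
        title for title, reviewed in last_reviews.items()
        if today - reviewed > REVIEW_INTERVAL
    )


# Hypothetical review log: title -> date of last review.
last_reviews = {
    "Vacation policy 2026": date(2026, 1, 15),
    "Safety SOP": date(2025, 9, 1),
}

stale = overdue(last_reviews, today=date(2026, 3, 1))
print(stale)
```

Documents on the overdue list can be sent to their domain owner, and pulled from the index if nobody confirms them.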

Treat the knowledge base like any operational asset: someone owns it, someone reviews it, and stale content is removed — not left indexed "just in case."

Permissions and Quebec context

Personal information in indexed docs triggers Law 25 obligations. French and English corpora may need separate testing — retrieval quality differs by language.
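
To see whether retrieval quality differs by language, the same retrieval test can be reported per language instead of as one combined number. This sketch assumes each test outcome is recorded as a (language, hit) pair; the outcomes listed are invented for illustration.

```python
from collections import defaultdict


def hit_rate_by_language(outcomes: list) -> dict:
    """Compute the share of hits per language from (language, hit) pairs."""
    totals = defaultdict(lambda: [0, 0])  # language -> [hits, questions]
    for lang, hit in outcomes:
        totals[lang][0] += int(hit)
        totals[lang][1] += 1
    return {lang: hits / n for lang, (hits, n) in totals.items()}


# Hypothetical outcomes from a retrieval test run in both languages.
outcomes = [
    ("en", True), ("en", True), ("en", True), ("en", True),
    ("fr", True), ("fr", False), ("fr", True), ("fr", True),
]

rates = hit_rate_by_language(outcomes)
print(rates)
```

A combined score would hide the gap between the two corpora, which is why the split is worth the extra line in the report.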

Common failures

Indexing everything — noise drowns signal
No owner — corpus rots in six months
Skipping retrieval tests and jumping straight to the chat UI
Letting answers go external without human review

When RAG isn't the first move

If your problem is "we have no documented process," fix operations and friction first. RAG amplifies structure — it doesn't create it from chaos.

Once the pilot domain is stable, connect retrieval to agents only when answers are consistently sourced and reviewed — not when the chat UI looks impressive on day one.

Where you are

You've explored multimodal inputs; this article shows how to build the corpus that anchors AI in your documents. Next: Autonomous AI agents and workflows — useful when the frame is clear.

If your docs are scattered but the questions repeat, let's pick a pilot domain together. We'll choose one domain and a simple retrieval test before you buy another chat tool.