Building company knowledge bases for RAG

RAG is only as good as the knowledge you feed it — and most companies' "documentation" is a graveyard of PDFs.

Retrieval-augmented generation lets AI answer from your policies, procedures, and project history instead of guessing. Building that corpus is less about buying software and more about organizing how you already work. Here's a practical path for SMBs and professional firms.

At a glance

Start with one domain (HR policies, safety, client onboarding) — not the entire drive
Clean, current, permissioned sources beat volume
The knowledge base ties into your context strategy and data governance
Maintenance is ongoing — assign an owner

What belongs in a v1 knowledge base

Include	Exclude (initially)
Approved policies and SOPs	Draft folders, "copy of copy"
FAQ maintained by a team	Personal desktop dumps
Stable templates and checklists	Years of stale project archives
Ticket resolutions (if sanitized)	Unclassified client files

This is the same discipline as reducing firefighting: keep only documentation people actually use.

Structure that retrieval loves

One topic per document where possible — long PDFs that mix many topics retrieve poorly
Clear titles — "Vacation policy 2026" not "final_v3_NEW"
Metadata — owner, last review date, language (fr/en), status (draft vs approved)
Version control — one canonical location; archive old versions explicitly
Access mirrors HR/IT permissions — RAG shouldn't leak what SharePoint already restricted
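
As a rough illustration of the metadata and access points above, here is a minimal Python sketch. The `DocRecord` fields, the group names, and the two helper functions are all hypothetical; a real platform will have its own schema and permission model, and the review date shown is invented.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class DocRecord:
    """Hypothetical metadata for one knowledge-base document."""
    title: str
    owner: str
    last_review: date
    language: str   # "fr" or "en"
    status: str     # "draft" or "approved"
    allowed_groups: frozenset = field(default_factory=frozenset)


def is_indexable(doc: DocRecord) -> bool:
    """Only approved documents with a named owner enter the index."""
    return doc.status == "approved" and bool(doc.owner.strip())


def visible_to(doc: DocRecord, user_groups: set) -> bool:
    """A user sees a document only if they share at least one group with it."""
    return bool(doc.allowed_groups & user_groups)


# Illustrative records only; the values are made up.
policy = DocRecord(
    title="Vacation policy 2026",
    owner="HR",
    last_review=date(2026, 1, 15),
    language="en",
    status="approved",
    allowed_groups=frozenset({"all-staff"}),
)
draft = DocRecord("final_v3_NEW", "", date(2024, 3, 1), "en", "draft")
```

In this sketch a document with no groups is visible to nobody, so a missing permission fails closed rather than open.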

Build steps (4–6 weeks typical)

Weeks 1–2 — Inventory and cleanse

List top 30 questions staff ask repeatedly
Map which docs should answer them
Delete duplicates; mark obsolete
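
For the duplicate pass, one possible starting point is to group files by content hash. The sketch below assumes the candidate documents sit in a local folder; `find_duplicates` is a hypothetical helper, and it only catches byte-identical copies. Edited "copy of copy" files still need a human look.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose bytes are identical."""
    by_hash = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads each file fully into memory; fine for a pilot-sized folder.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    # Keep only hashes shared by two or more files.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_duplicates(Path("pilot_corpus"))` (folder name assumed) returns one list per set of identical files, which the domain owner can then reduce to a single canonical copy.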

Week 3 — Index pilot corpus

Choose platform (SharePoint, Notion, dedicated RAG SaaS, custom on Azure — depends on stack)
Index only data that has been classified, and only the green and yellow tiers
Test retrieval without generation — are the right chunks returned?
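
A retrieval-only test can be as simple as checking whether the expected document shows up in the top results for each question. The sketch below is one way to do that; `hit_rate`, the toy `keyword_retrieve` function, and the document IDs are hypothetical stand-ins, and in practice `retrieve` would wrap whatever search call your platform exposes.

```python
from typing import Callable


def hit_rate(
    cases: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected document is in the top k results."""
    if not cases:
        return 0.0
    hits = sum(1 for question, expected in cases if expected in retrieve(question, k))
    return hits / len(cases)


# Toy corpus and retriever, for illustration only: documents are ranked
# by how many words they share with the question.
DOCS = {
    "vacation-policy": "vacation days carry over policy",
    "expense-sop": "expense report submission steps",
}


def keyword_retrieve(question: str, k: int) -> list[str]:
    words = set(question.lower().split())
    ranked = sorted(DOCS, key=lambda d: -len(words & set(DOCS[d].split())))
    return ranked[:k]


cases = [
    ("how many vacation days carry over", "vacation-policy"),
    ("where do I submit an expense report", "expense-sop"),
]
```

Running `hit_rate(cases, keyword_retrieve, k=1)` against the real retriever, with the questions from the weeks 1–2 inventory, gives a single number to track before any generation is switched on.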

Week 4 — Generate with citations

Enable answers with source links
Run 20 test questions
Score: correct, partial, wrong, no answer
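
The four-way score can be tallied with a few lines of Python. This is a sketch only: the `summarize` helper is hypothetical, and the 20 results below are invented to show the shape of the output, not measured data.

```python
from collections import Counter

LABELS = ("correct", "partial", "wrong", "no answer")


def summarize(scores: list[str]) -> dict[str, float]:
    """Share of test questions per label, in a fixed label order."""
    unknown = set(scores) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(scores)
    total = len(scores)
    return {label: (counts[label] / total if total else 0.0) for label in LABELS}


# Invented results for a 20-question run, for illustration only.
scores = ["correct"] * 13 + ["partial"] * 4 + ["wrong"] * 2 + ["no answer"] * 1
summary = summarize(scores)
```

Keeping "wrong" separate from "no answer" matters here, since a system that declines to answer is usually easier to live with than one that answers incorrectly.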

Weeks 5–6 — Human-in-the-loop (HITL) review and rollout

Human review for external or client-adjacent answers
Train staff on "ask + verify source"
Collect misfires into a fix queue

Freshness and ownership

Stale RAG is worse than none: answers that confidently quote an outdated policy erode trust.

Owner per domain (HR, ops, IT)
Review cadence — quarterly minimum for regulated content
Trigger updates when a policy changes — remove the superseded doc from the index
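
The review cadence can be checked mechanically if each document carries a last-review date. A minimal sketch, assuming 90 days as a stand-in for "quarterly"; the `overdue` helper and the titles and dates used with it are hypothetical.

```python
from datetime import date, timedelta

REVIEW_WINDOW = timedelta(days=90)  # assumed stand-in for "quarterly"


def overdue(last_reviews: dict[str, date], today: date) -> list[str]:
    """Titles whose last review is older than the review window, oldest first."""
    late = [
        (reviewed, title)
        for title, reviewed in last_reviews.items()
        if today - reviewed > REVIEW_WINDOW
    ]
    return [title for reviewed, title in sorted(late)]
```

A scheduled run of something like this could feed each domain owner a short list of documents to re-approve or pull from the index.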

Permissions and Quebec context

Personal information in indexed docs triggers Law 25 obligations. French and English corpora may need separate testing — retrieval quality differs by language.
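
To see whether retrieval quality really differs by language, test results can be split by language before averaging. The sketch below assumes each test case has already been recorded as a (language, hit) pair; `hit_rate_by_language` is a hypothetical helper.

```python
from collections import defaultdict


def hit_rate_by_language(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results holds (language, hit) pairs from a retrieval test run."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for language, hit in results:
        totals[language] += 1
        hits[language] += int(hit)
    return {lang: hits[lang] / totals[lang] for lang in totals}
```

A large gap between the "fr" and "en" rates would suggest the corpus or the retriever needs separate attention for one language before rollout.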

Common failures

Indexing everything — noise drowns signal
No owner — corpus rots in six months
Skipping retrieval tests — jumping straight to the chat UI
Letting answers go external without review

When RAG isn't the first move

If your problem is "we have no documented process," fix operations and friction first. RAG amplifies structure — it doesn't create it from chaos.

Bottom line

Company knowledge bases for RAG are operational assets, not IT science projects. Curate, permission, test retrieval, review outputs — then connect to agents when stable.

Ready to index a focused corpus? Let's talk about domain choice and retrieval tests before you buy another chat box.