RAG is only as good as the knowledge you feed it — and most companies' "documentation" is a graveyard of PDFs.
Retrieval-augmented generation lets AI answer from your policies, procedures, and project history instead of guessing. Building that corpus is less about buying software and more about organizing how you already work. Here's a practical path for SMBs and professional firms.
At a glance
- Start with one domain (HR policies, safety, client onboarding) — not the entire drive
- Clean, current, permissioned sources beat volume
- The knowledge base ties into your context strategy and data governance
- Maintenance is ongoing — assign an owner
What belongs in a v1 knowledge base
| Include | Exclude (initially) |
|---|---|
| Approved policies and SOPs | Draft folders, "copy of copy" |
| FAQ maintained by a team | Personal desktop dumps |
| Stable templates and checklists | Years of stale project archives |
| Ticket resolutions (if sanitized) | Unclassified client files |
This is the same discipline as reducing firefighting: keep only documentation people actually use.
Structure that retrieval loves
- One topic per document where possible: long PDFs retrieve poorly
- Clear titles: "Vacation policy 2026", not "final_v3_NEW"
- Metadata: owner, last review date, language (fr/en), status (draft vs approved)
- Version control: keep one canonical location and archive old versions explicitly
- Access that mirrors HR/IT permissions: RAG shouldn't leak what SharePoint already restricts
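As a rough illustration of the metadata and permission points in this list, here is a minimal Python sketch. The `DocRecord` fields, the group names, and the two helper functions are assumptions made up for this example, not the schema of any particular platform; in practice your platform's own permission model should be the source of truth.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class DocRecord:
    """Metadata for one candidate document (field names are illustrative)."""
    title: str
    owner: str
    last_review: date
    language: str   # "fr" or "en"
    status: str     # "draft" or "approved"
    allowed_groups: frozenset = field(default_factory=frozenset)


def is_indexable(doc: DocRecord) -> bool:
    """Only approved documents with a named owner enter the index."""
    return doc.status == "approved" and bool(doc.owner.strip())


def visible_to(doc: DocRecord, user_groups: set) -> bool:
    """A user may retrieve a document only if they share a group with it."""
    return bool(doc.allowed_groups & user_groups)


# Hypothetical sample documents.
docs = [
    DocRecord("Vacation policy 2026", "HR", date(2026, 1, 15), "en",
              "approved", frozenset({"all-staff"})),
    DocRecord("Salary bands 2026", "HR", date(2026, 1, 15), "en",
              "approved", frozenset({"hr-team"})),
    DocRecord("final_v3_NEW", "", date(2023, 6, 1), "en",
              "draft", frozenset({"all-staff"})),
]

index = [d for d in docs if is_indexable(d)]
staff_view = [d.title for d in index if visible_to(d, {"all-staff"})]
print(staff_view)  # ['Vacation policy 2026']
```

The draft never reaches the index, and the salary document is indexed but stays invisible to general staff, which is the behaviour the permissions bullet asks for.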
Build steps (4–6 weeks typical)
Week 1–2 — Inventory and cleanse
- List the top 30 questions staff ask repeatedly
- Map which docs should answer them
- Delete duplicates; mark obsolete documents
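The inventory steps above can be partly scripted. The sketch below is one possible approach, assuming document text has already been extracted into a plain dictionary; the file names, questions, and function names are hypothetical. It flags exact duplicates by hashing normalized text and lists questions that no document has been mapped to yet.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and case, so trivial copies match."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(docs: dict) -> list:
    """Return groups of document names whose normalized content is identical."""
    groups = defaultdict(list)
    for name, text in docs.items():
        groups[content_fingerprint(text)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def unanswered(question_map: dict) -> list:
    """Questions from the staff list that no document is mapped to yet."""
    return [q for q, doc_names in question_map.items() if not doc_names]


# Hypothetical extracted text and question mapping.
docs = {
    "vacation_policy.txt": "Staff accrue  vacation monthly.",
    "copy of vacation_policy.txt": "staff accrue vacation monthly.",
    "expenses.txt": "Submit receipts within 30 days.",
}
question_map = {
    "How many vacation days do I get?": ["vacation_policy.txt"],
    "Who approves overtime?": [],
}

print(find_duplicates(docs))
print(unanswered(question_map))
```

This only catches copies that are identical after normalization. Near-duplicates with small edits still need a human look, and an unanswered question usually means a document has to be written before anything is indexed.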
Week 3 — Index pilot corpus
- Choose a platform (SharePoint, Notion, a dedicated RAG SaaS, or a custom build on Azure, depending on your stack)
- Index only data that has been classified as green or yellow
- Test retrieval without generation — are the right chunks returned?
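A retrieval-only test can be as small as a list of questions paired with the chunk each one should return. In the sketch below, `retrieve` is a toy word-overlap stand-in, not a real search engine; the assumption is that you would swap in a call to whatever platform you chose. The chunk ids and questions are invented for illustration.

```python
def tokenize(text: str) -> set:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(query: str, chunks: dict, k: int = 1) -> list:
    """Toy stand-in retriever: rank chunk ids by words shared with the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda cid: (-len(q & tokenize(chunks[cid])), cid))
    return ranked[:k]


def evaluate(test_cases: list, chunks: dict, k: int = 1):
    """Return the hit rate and the questions whose expected chunk was not in the top k."""
    misses = [q for q, expected in test_cases
              if expected not in retrieve(q, chunks, k)]
    return 1 - len(misses) / len(test_cases), misses


# Hypothetical pilot corpus and test set.
chunks = {
    "vacation#1": "Full-time staff accrue vacation days each month.",
    "expenses#1": "Submit expense receipts within 30 days of purchase.",
    "onboarding#1": "New client onboarding starts with a signed engagement letter.",
}
test_cases = [
    ("How do staff accrue vacation days?", "vacation#1"),
    ("When do I submit expense receipts?", "expenses#1"),
    ("What starts client onboarding?", "onboarding#1"),
    ("Who approves overtime?", "overtime#1"),  # no such chunk exists yet
]

rate, misses = evaluate(test_cases, chunks)
print(rate, misses)  # 0.75 ['Who approves overtime?']
```

The miss here is a content gap rather than a search problem: nothing in the corpus covers overtime. Separating those two causes is the main reason to test retrieval before turning on generation.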
Week 4 — Generate with citations
- Enable answers with source links
- Run 20 test questions
- Score each answer as correct, partial, wrong, or no answer
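Tallying the grades takes only a few lines. This is a minimal sketch; the `summarize` helper is hypothetical and the grade counts are made-up sample data, not results from a real pilot.

```python
from collections import Counter

LABELS = ("correct", "partial", "wrong", "no answer")


def summarize(grades: list) -> dict:
    """Map each label to (count, percent of all graded answers)."""
    if not grades:
        raise ValueError("no grades to summarize")
    unknown = set(grades) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(grades)
    total = len(grades)
    return {label: (counts[label], round(100 * counts[label] / total))
            for label in LABELS}


# Made-up grades for a 20-question run.
grades = ["correct"] * 13 + ["partial"] * 4 + ["wrong"] * 2 + ["no answer"]

summary = summarize(grades)
for label, (count, percent) in summary.items():
    print(f"{label:10} {count:2}  {percent}%")
```

Keeping "wrong" and "no answer" as separate buckets matters because a confident wrong answer costs more trust than an honest gap.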
Weeks 5–6 — HITL and rollout
- Human review for external or client-adjacent answers
- Train staff on "ask + verify source"
- Collect misfires into a fix queue
Freshness and ownership
Stale RAG is worse than none — confident wrong policy quotes erode trust.
- Owner per domain (HR, ops, IT)
- Review cadence — quarterly minimum for regulated content
- When a policy changes, update the corpus and block the old document from the index
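A scheduled check along these lines can support the review cadence. It is a sketch under stated assumptions: "quarterly" is approximated as 90 days, and the `superseded` flag and dictionary layout are invented for the example.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 90 days.
REVIEW_INTERVAL = timedelta(days=90)


def needs_review(docs: list, today: date) -> list:
    """Titles of live documents whose last review is older than the interval."""
    return [d["title"] for d in docs
            if not d["superseded"] and today - d["last_review"] > REVIEW_INTERVAL]


def allowed_in_index(docs: list) -> list:
    """Titles still allowed in the index: anything not superseded by a newer policy."""
    return [d["title"] for d in docs if not d["superseded"]]


# Hypothetical documents and review dates.
docs = [
    {"title": "Vacation policy 2026", "last_review": date(2026, 1, 15), "superseded": False},
    {"title": "Vacation policy 2025", "last_review": date(2025, 1, 10), "superseded": True},
    {"title": "Safety SOP", "last_review": date(2025, 9, 1), "superseded": False},
]
today = date(2026, 3, 1)

print(needs_review(docs, today))  # ['Safety SOP']
print(allowed_in_index(docs))     # ['Vacation policy 2026', 'Safety SOP']
```

In this version an overdue document stays in the index while its owner reviews it, and only superseded documents are blocked. Blocking overdue regulated content as well would be a stricter, equally defensible choice.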
Permissions and Quebec context
Personal information in indexed docs triggers Law 25 obligations. French and English corpora may need separate testing, because retrieval quality can differ by language.
Common failures
- Indexing everything — noise drowns signal
- No owner — corpus rots in six months
- Skipping retrieval tests and jumping straight to the chat UI
- Letting answers go external without review
When RAG isn't the first move
If your problem is "we have no documented process," fix operations and friction first. RAG amplifies structure — it doesn't create it from chaos.
Bottom line
Company knowledge bases for RAG are operational assets, not IT science projects. Curate, permission, test retrieval, review outputs — then connect to agents when stable.
Ready to index a focused corpus? Let's talk about domain choice and retrieval tests before you buy another chat box.
