RAG is only as good as the knowledge you feed it — and most companies' "documentation" is a graveyard of PDFs.

Retrieval-augmented generation lets AI answer from your policies, procedures, and project history instead of guessing. Building that corpus is less about buying software and more about organizing how you already work. Here's a practical path for SMBs and professional firms.

At a glance

  • Start with one domain (HR policies, safety, client onboarding) — not the entire drive
  • Clean, current, permissioned sources beat volume
  • Connects to context strategy and data governance
  • Maintenance is ongoing — assign an owner

What belongs in a v1 knowledge base

Include                             | Exclude (initially)
Approved policies and SOPs          | Draft folders, "copy of copy"
FAQ maintained by a team            | Personal desktop dumps
Stable templates and checklists     | Years of stale project archives
Ticket resolutions (if sanitized)   | Unclassified client files

This is the same discipline as reducing firefighting: keep only documentation people actually use.

Structure that retrieval loves

  1. One topic per document where possible — long PDFs retrieve poorly
  2. Clear titles — "Vacation policy 2026" not "final_v3_NEW"
  3. Metadata — owner, last review date, language (fr/en), status (draft vs approved)
  4. Version control — one canonical location; archive old explicitly
  5. Access mirrors HR/IT permissions — RAG shouldn't leak what SharePoint already restricted
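The metadata and permission rules above can be sketched as a small record plus two checks. This is a minimal illustration, not a reference to any particular platform: `DocRecord`, `is_indexable`, `visible_to`, and the group names are all hypothetical, and a real deployment would read these fields from whatever system holds the source documents.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocRecord:
    """Hypothetical metadata record for one document in the corpus."""
    title: str
    owner: str
    last_reviewed: date
    language: str                            # "fr" or "en"
    status: str                              # "draft" or "approved"
    allowed_groups: frozenset = frozenset()  # groups allowed to read the source


def is_indexable(doc: DocRecord) -> bool:
    """Only approved documents with a named owner go into the index."""
    return doc.status == "approved" and bool(doc.owner.strip())


def visible_to(doc: DocRecord, user_groups: set) -> bool:
    """Mirror source permissions: the user must share a group with the doc."""
    return bool(doc.allowed_groups & user_groups)


# Illustrative records only (made-up data).
policy = DocRecord(
    title="Vacation policy 2026",
    owner="HR",
    last_reviewed=date(2026, 1, 15),
    language="en",
    status="approved",
    allowed_groups=frozenset({"all-staff"}),
)
draft = DocRecord("final_v3_NEW", "", date(2024, 5, 2), "en", "draft")

print(is_indexable(policy), is_indexable(draft))
print(visible_to(policy, {"all-staff"}), visible_to(policy, {"contractors"}))
```

Applying the permission check at query time, rather than only at indexing time, is one way to keep answers in step with access changes made in the source system.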

Build steps (4–6 weeks typical)

Week 1–2 — Inventory and cleanse

  • List top 30 questions staff ask repeatedly
  • Map which docs should answer them
  • Delete duplicates; mark obsolete

Week 3 — Index pilot corpus

  • Choose platform (SharePoint, Notion, dedicated RAG SaaS, custom on Azure — depends on stack)
  • Index only data that has been classified green or yellow
  • Test retrieval without generation — are the right chunks returned?
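A retrieval-only test can be as simple as checking, for each test question, whether the expected document shows up in the top results. The sketch below assumes a toy keyword-overlap retriever (`keyword_retrieve`) as a stand-in for whatever search your platform provides; the chunk texts, document ids, and questions are invented for illustration.

```python
def keyword_retrieve(question, chunks, k=3):
    """Toy stand-in for a real retriever: rank chunks by shared words."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


def hit_rate(test_cases, chunks, k=3):
    """Share of questions whose expected document appears in the top k."""
    hits = sum(
        1 for question, expected in test_cases
        if expected in keyword_retrieve(question, chunks, k)
    )
    return hits / len(test_cases)


# Made-up pilot corpus and test questions.
chunks = {
    "vacation-policy-2026": "vacation days accrue monthly and carry over until march",
    "expense-policy": "submit expense receipts within 30 days",
    "onboarding-checklist": "new client onboarding requires signed engagement letter",
}
test_cases = [
    ("how many vacation days carry over", "vacation-policy-2026"),
    ("when do I submit expense receipts", "expense-policy"),
    ("what does client onboarding require", "onboarding-checklist"),
]

print(f"hit rate at k=1: {hit_rate(test_cases, chunks, k=1):.0%}")
```

Swapping `keyword_retrieve` for a call to the real index keeps the same test harness, so the question list built in weeks 1 and 2 can be reused after every re-index.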

Week 4 — Generate with citations

  • Enable answers with source links
  • Run 20 test questions
  • Score: correct, partial, wrong, no answer
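The four-way scoring above is easy to tally in a few lines. In this sketch the label list comes from the article, while `summarize` and the sample scores are hypothetical and exist only to show the shape of the result.

```python
from collections import Counter

LABELS = ("correct", "partial", "wrong", "no answer")


def summarize(scores):
    """Count each label and reject anything outside the agreed scale."""
    unknown = set(scores) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(scores)
    return {label: counts[label] for label in LABELS}


# Invented scores for a 20-question run, for illustration only.
scores = ["correct"] * 14 + ["partial"] * 3 + ["wrong"] * 2 + ["no answer"]
summary = summarize(scores)

for label, count in summary.items():
    print(f"{label:<10} {count:>2} / {len(scores)}")
```

Keeping "wrong" separate from "no answer" matters: a system that declines to answer is usually a smaller risk than one that answers incorrectly.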

Weeks 5–6 — HITL and rollout

  • Human review for external or client-adjacent answers
  • Train staff on "ask + verify source"
  • Collect misfires into a fix queue

Freshness and ownership

A stale knowledge base is worse than none: confidently quoted but outdated policies erode trust.

  • Owner per domain (HR, ops, IT)
  • Review cadence — quarterly minimum for regulated content
  • Trigger an update whenever a policy changes, and block the old document from the index
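A review cadence can be enforced with a simple date check that flags documents overdue for review. The sketch assumes a 90-day window as a stand-in for "quarterly"; `overdue`, the document names, and the dates are illustrative, and what to do with flagged documents (notify the owner, pull from the index) is left to your process.

```python
from datetime import date, timedelta


def overdue(last_reviewed_by_doc, today, max_age_days=90):
    """Return documents whose last review is older than the allowed window."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        name for name, reviewed in last_reviewed_by_doc.items()
        if reviewed < cutoff
    ]


# Made-up review dates.
last_reviewed_by_doc = {
    "vacation-policy": date(2026, 1, 10),
    "safety-sop": date(2025, 6, 1),
}

stale = overdue(last_reviewed_by_doc, today=date(2026, 3, 1))
print("needs review:", stale)
```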

Permissions and Quebec context

Personal information in indexed docs triggers obligations under Quebec's Law 25. Test French and English corpora separately, since retrieval quality can differ by language.

Common failures

  • Indexing everything — noise drowns signal
  • No owner — corpus rots in six months
  • Skipping retrieval tests and jumping straight to the chat UI
  • Letting answers go external without review

When RAG isn't the first move

If your problem is "we have no documented process," fix operations and friction first. RAG amplifies structure — it doesn't create it from chaos.

Bottom line

Company knowledge bases for RAG are operational assets, not IT science projects. Curate, permission, test retrieval, review outputs — then connect to agents when stable.

Ready to index a focused corpus? Let's talk about domain choice and retrieval tests before you buy another chat box.