"What's the return on investment (ROI) on AI?" — if your answer is only "hours saved," skeptics will rightly push back.
Time matters, but it's one line on a scorecard. Quality, cycle time, risk avoided, and adoption rate tell whether a pilot deserves scale — or a graceful stop. I use this framework with CFOs and operations leads who are tired of slide-deck promises.
At a glance
- Measure before the pilot starts; without a baseline, you'll argue anecdotes
- Balance efficiency metrics with quality, risk, and adoption in one scorecard
- Include human review time in cost — it's not "free"
- Connect to budget reality, not vendor case studies
Baseline first (two weeks)
Before any tool change, capture:
| Metric | How to measure |
|---|---|
| Time on task | Sample 10–20 instances; time each one honestly with a stopwatch |
| Error / rework rate | Misses, corrections, client complaints |
| Cycle time | Request → delivered |
| Cost of delay | Backlog, overtime, missed SLAs |
Without a baseline, "50% faster" is marketing.
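One way to turn those samples into a baseline you can quote later is a small summary script. This is a minimal sketch, not a prescribed tool: the function name `summarize_baseline`, its parameters, and the sample numbers are all illustrative assumptions.

```python
from statistics import mean, median


def summarize_baseline(task_minutes, rework_count, total_items):
    """Summarize a baseline sample (hypothetical helper, not from any library).

    task_minutes: stopwatch timings, one per sampled task instance
    rework_count: items that needed correction or drew a complaint
    total_items:  items delivered during the baseline window
    """
    if not task_minutes or total_items <= 0:
        raise ValueError("need at least one timing and one delivered item")
    return {
        "sample_size": len(task_minutes),
        "mean_minutes": mean(task_minutes),
        "median_minutes": median(task_minutes),
        "rework_rate": rework_count / total_items,
    }


# Illustrative numbers only: 10 timed instances, 3 reworked items out of 40.
baseline = summarize_baseline(
    task_minutes=[42, 38, 55, 47, 40, 61, 44, 39, 50, 46],
    rework_count=3,
    total_items=40,
)
print(baseline)
```

Reporting the median next to the mean keeps one unusually long task from distorting the "before" number.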
The four-quadrant scorecard
1. Efficiency
- Hours saved per week (team level, not a single power user)
- Cost per transaction (if repeatable task)
- Throughput (items processed)
Caution: Shaving minutes on a broken process automates waste. Pair with friction mapping before scaling.
2. Quality
- Error rate before/after
- Rework tickets
- Client satisfaction on affected deliverables
AI that speeds up wrong answers is negative ROI.
3. Speed
- Cycle time reduction
- Time-to-first-draft (with human review still counted)
4. Risk and resilience
- Near-misses caught in review
- Consistency of documentation
- Reduced dependency on one person's tacit knowledge
Harder to quantify — but executives feel these when someone is on vacation.
Adoption metrics (don't skip)
- Active users / eligible users weekly
- Completion rate — started workflow vs finished
- Override rate — humans fixing AI output
- Qualitative — short survey: trust, would recommend
A brilliant tool with 15% adoption fails the business case.
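The three countable adoption metrics above reduce to simple ratios. The sketch below assumes you can export weekly counts from your tool. The function name, the parameter names, and the choice to divide overrides by finished workflows are my assumptions, so adjust the denominators to match how your team defines each metric.

```python
def adoption_metrics(active_users, eligible_users, started, finished, overridden):
    """Compute weekly adoption ratios (hypothetical helper).

    active_users / eligible_users: people who used the tool vs. people who could
    started / finished:            workflows begun vs. completed
    overridden:                    finished outputs a human had to fix
    """
    if eligible_users <= 0 or started <= 0 or finished <= 0:
        raise ValueError("denominators must be positive")
    return {
        "adoption_rate": active_users / eligible_users,
        "completion_rate": finished / started,
        "override_rate": overridden / finished,
    }


# 5 of 6 active users matches the worked example in this article;
# the workflow counts are illustrative only.
week = adoption_metrics(active_users=5, eligible_users=6,
                        started=48, finished=44, overridden=11)
for name, value in week.items():
    print(f"{name}: {value:.0%}")
```

Tracking these weekly makes a drop in completion or a rise in overrides visible before the monthly scorecard.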
Worked example: meeting notes pilot (8 weeks)
A professional services SMB (28 people) measured an automated meeting notes pilot across two teams (6 eligible users).
Annual costs (extrapolated from pilot):
| Item | Amount |
|---|---|
| Enterprise tool licenses | $4,800 |
| Integration time (IT + pilot lead) | $3,200 |
| Training (2 × 90 min sessions) | $1,800 |
| Human review time (12 min × 48 sets/mo) | $6,400 |
| Total cost | $16,200 |
Annual benefits (measured in the pilot and annualized, not projected from vendor claims):
| Item | Amount |
|---|---|
| Drafting time saved (312 h × $85 loaded) | $26,520 |
| Rework avoided (2 errors/mo × 2 h × $85) | $4,080 |
| Publication delays avoided (conservative) | $2,400 |
| Total benefit | $33,000 |
ROI ≈ (33,000 − 16,200) / 16,200 = 104% — with 83% adoption (5/6 users active weekly). The CFO approved extension to a third team. Without the adoption column, leadership would have seen only "312 hours" and missed that 17% of the pilot team wasn't using the tool.
Simple ROI formula (SMB-friendly)
Annual benefit ≈ (hours saved × loaded hourly rate) + rework avoided + delay cost avoided
Annual cost ≈ licenses + integration + training + review time + governance overhead
ROI ≈ (benefit − cost) / cost
Include review time in cost — human-in-the-loop (HITL) review, where a named person approves before outputs leave the organization, is real work. Include ramp-up; month one is rarely steady state.
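The formula can be sketched in a few lines and checked against the meeting notes pilot figures from this article. The function and parameter names are illustrative. Governance overhead defaults to zero here only because the worked example lists no such line item, which is an assumption and not a recommendation.

```python
def simple_roi(hours_saved, loaded_rate, rework_avoided, delay_avoided,
               licenses, integration, training, review_time, governance=0.0):
    """Return (annual benefit, annual cost, ROI as a fraction).

    Mirrors the formula above: review time and governance sit on the
    cost side, so human-in-the-loop work is never treated as free.
    """
    benefit = hours_saved * loaded_rate + rework_avoided + delay_avoided
    cost = licenses + integration + training + review_time + governance
    if cost <= 0:
        raise ValueError("cost must be positive to compute ROI")
    return benefit, cost, (benefit - cost) / cost


# Figures from the meeting notes pilot tables.
benefit, cost, roi = simple_roi(
    hours_saved=312, loaded_rate=85,
    rework_avoided=4_080, delay_avoided=2_400,
    licenses=4_800, integration=3_200, training=1_800, review_time=6_400,
)
print(f"benefit=${benefit:,.0f} cost=${cost:,.0f} ROI={roi:.0%}")
# prints: benefit=$33,000 cost=$16,200 ROI=104%
```

Rerunning it with a higher `review_time` or a nonzero `governance` value shows how quickly the ratio moves when hidden costs are counted.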
What convinces skeptics
- Side-by-side samples (anonymized) — before vs after
- Named process owner endorsing results
- Honest misses — "here's where it failed and what we changed"
- Bounded scale plan — not open-ended spend
When to stop or pivot
- Quality metrics worsen
- Review time exceeds time saved
- Adoption flat after training
- Governance incidents rise
Stopping a pilot isn't failure — it's discipline.
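The four stop signals can be checked mechanically at each monthly review. This is a sketch under stated assumptions: `PilotSnapshot`, its field names, and the strict comparisons are hypothetical, and "adoption flat" is read as "no higher than it was when training ended". Your team may prefer tolerance bands over strict comparisons.

```python
from dataclasses import dataclass


@dataclass
class PilotSnapshot:
    """One review period's numbers (hypothetical structure)."""
    error_rate_baseline: float
    error_rate_now: float
    review_hours: float          # human review time this period
    hours_saved: float           # drafting time saved this period
    adoption_at_training: float  # active / eligible when training ended
    adoption_now: float
    incidents_prev_period: int
    incidents_this_period: int


def stop_signals(s):
    """Return the list of stop-or-pivot signals that are currently firing."""
    signals = []
    if s.error_rate_now > s.error_rate_baseline:
        signals.append("quality metrics worsened")
    if s.review_hours > s.hours_saved:
        signals.append("review time exceeds time saved")
    if s.adoption_now <= s.adoption_at_training:
        signals.append("adoption flat after training")
    if s.incidents_this_period > s.incidents_prev_period:
        signals.append("governance incidents rising")
    return signals


# Monthly view of the meeting notes pilot: 12 min x 48 sets = 9.6 review hours,
# 312 h / 12 = 26 hours saved. Error rates, the starting adoption figure,
# and incident counts are illustrative.
healthy = PilotSnapshot(0.08, 0.05, 9.6, 26.0, 0.50, 0.83, 1, 1)
print(stop_signals(healthy))  # prints: []
```

An empty list is not an automatic "scale" decision. It only means none of the four tripwires fired this period.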
Reporting rhythm
- Weekly during pilot — operational tweaks
- Monthly — scorecard to leadership
- At pilot end — scale / extend / stop decision with numbers
Where you are
You've just completed the Concrete pilots series — meeting notes, field workflow, measured ROI. The next series, Govern and sustain, starts here: Human-in-the-loop: where AI stops and judgment starts — who approves what before expanding externally.
Building a scorecard for your pilot? Let's talk about metrics that match your CFO's language.
