Optimizing delivery across projects: how AI proved its business value to clients
Three years ago, I was sitting across from a CFO who had just approved a significant AI tooling budget and wanted to know what we had to show for it. I had prepared slides full of engineering metrics: velocity improvements, deployment frequency, and lead time reductions. I walked through them carefully; she listened politely, and then she asked, "But are we making more money? Are we delivering faster for clients?"
I didn't have a good answer.
That conversation changed how I think about AI ROI measurement. The issue wasn't that AI wasn't delivering value (it was), but that I was measuring it in a language that meant nothing to the people who'd approved the investment. Velocity scores and deployment frequency simply don't answer the questions they were asking.
This article is about the three metrics I developed with a Product Manager to bridge that gap, the infrastructure required to track them reliably, and what changed when I stopped reporting engineering output and started reporting business outcomes. Before getting into the metrics themselves, it's worth understanding why the standard ones fail, because that failure isn't accidental.
The client's reality: they don't care about your velocity numbers
Here's something most engineering leaders learn the hard way: clients don't read sprint reports. They don't know what story points are, and they don't care. Stakeholders operate in a different frame entirely: what they're tracking is whether the project will land on time, whether it'll cost what you said it would, and whether your team can be trusted to surface problems before they become their problems.
Velocity is an internal coordination tool. It tells your team whether they're moving at a sustainable pace. It tells you nothing about whether that pace is translating into business value for your clients or for your own organization.
When AI entered our workflow, the velocity numbers went up. Engineers were generating code faster, reviews were moving quicker, and standups got shorter because there was less to untangle. Leadership was happy. Clients were unmoved. They were still getting the same surprises, the same scope conversations late in the project, and the same "we need a bit more time" emails.
The metrics I had were measuring the engine. Nobody had built instruments to measure where the car was going.
Why multi-project companies need different AI metrics
Before introducing the three metrics that actually worked, it's worth addressing a structural issue that makes AI ROI measurement harder than it needs to be for most organizations: the problem compounds when you're running multiple projects simultaneously.
Single-project teams can get away with project-level metrics. Multi-project organizations, like agencies and consultancies with in-house teams managing several products in parallel, need a portfolio layer above the project level, and that's where most AI ROI frameworks fall apart.
The problem is that AI benefits don't distribute evenly across projects. A senior engineer using AI-assisted development on a greenfield project in a familiar stack will show dramatic productivity gains, while the same engineer on a legacy migration with an unfamiliar codebase may show modest gains, or none at all, while still doing valuable work.
If you aggregate these effects into a single company-wide AI ROI number, you get a misleading average. If you look only at the strong performers, you overestimate portfolio-wide impact. If you let the weak performers anchor your perception, you underinvest in the contexts where AI is genuinely transformative.
The metrics that work at the portfolio level are ratios rather than absolutes: margin variance across projects, predictability distribution across the portfolio, and ghost FTE density by project type and stack maturity. These reveal where AI is creating value and where it isn't, which is far more actionable than a single headline number.
Enji's portfolio view makes this analysis accessible without requiring a data engineering team. The same project margins and delivery data that produce per-project reports roll up into a portfolio dashboard that shows distribution rather than averages. That's the view I use for quarterly executive reporting.
With that context in place, here are the three metrics that made AI ROI legible, first to my own leadership, then to clients.
The three metrics that changed the conversation
Fixing this required building a different measurement layer, one that tracked what clients and executives actually cared about, rather than what was easiest to extract from engineering tools. Over the course of several projects, I landed on three metrics that, taken together, answer the questions stakeholders are actually asking: is the project profitable, is it on track, and is AI making the team genuinely more capable? Each metric required specific data infrastructure to produce reliably, and each one revealed something the standard engineering dashboard had been hiding.
Metric 1: project margin – the first number clients actually read
Project margin is the ratio of value delivered to cost incurred, tracked in real time across the project lifecycle.
- "Cost incurred" is the sum of all logged engineering hours converted to monetary cost using loaded labor rates, plus any direct project expenses.
- "Value delivered" is the agreed contract value multiplied by the percentage of scope completed to date, earned value in the earned value management sense, not invoiced revenue.
The metric is most useful when tracked weekly and compared against a target margin established at project kickoff. Variance from the target is what triggers conversation; the absolute number matters less than the trend.
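To make the mechanics concrete, here is a minimal sketch of the weekly calculation in Python. The function, rates, and sample figures are illustrative assumptions, not pulled from Enji or any other tool.

```python
# A minimal sketch of the weekly margin calculation; names, rates, and
# sample figures are illustrative.

def project_margin(contract_value: float,
                   scope_completed_pct: float,
                   logged_hours: float,
                   loaded_rate: float,
                   direct_expenses: float = 0.0) -> float:
    """Ratio of value delivered (earned value) to cost incurred."""
    value_delivered = contract_value * scope_completed_pct
    cost_incurred = logged_hours * loaded_rate + direct_expenses
    return value_delivered / cost_incurred

# Example week: $200k contract, 70% of scope complete,
# 1,100 logged hours at a $95/h loaded rate, $8k of direct expenses.
ratio = project_margin(200_000, 0.70, 1_100, 95, 8_000)   # ~1.24
target = 1.35                # target ratio agreed at kickoff
variance = ratio - target    # the variance, not the absolute number, triggers the conversation
print(f"value per dollar of cost: {ratio:.2f} (target {target})")
```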
What it showed us
Before AI tooling, our project margins were opaque until invoicing. We knew roughly what a project cost in headcount terms, and we knew roughly what we'd quoted, but the actual margin, accounting for scope changes, unplanned work, rework, and the true time distribution across features, was something we reconstructed after the fact. By then, the damage was done.
After deploying AI-assisted development and connecting it to Enji's Project Margins, we had live visibility into cost versus value for every active project. Features that used to cost 40 hours were coming in at 28. That gap showed up in the margin data the same week it happened, not in a quarterly retrospective.
The more interesting effect was on client conversations: when a client asked about budget status, I could pull up a real-time margin view rather than going back to a spreadsheet. "We're at 62% of budget with 70% of scope delivered" is a different conversation than "we're on track." One is a number; the other is a story.
Project Margin quickly became the primary metric I report to clients because it translates directly into the language they already speak. They understand profit and cost per outcome. They don't understand burn rate in story points.
Where this metric can mislead you
Project Margin can improve for reasons that have nothing to do with AI. The most common: scope reduction. When AI speeds up delivery, teams sometimes unconsciously descope edge cases and hardening work to hit timelines. Margin improves, the AI gets the credit, but the actual cause is deferred work that shows up as a support burden six months later. Before attributing margin gains to AI, check whether your defect rate and post-launch support hours moved in the same period. If the margin went up and support volume went up, the gain was borrowed, not earned.
Metric 2: delivery predictability – why 92% beats 100% every time
Margin tells you whether the project is profitable, but not whether it's on track. A project can show healthy margins right up until a missed deadline triggers a penalty clause or a client conversation that resets the relationship. That's where the second metric comes in.
Every team I've worked with has had a moment where they hit 100% of their sprint goals and still delivered late. Sprints are not projects, velocity is not predictability, and a team that ships everything it commits to every sprint can still miss a client deadline by three weeks if the commitments themselves are unreliable.
Delivery predictability measures the percentage of committed delivery dates met within an acceptable variance over a defined period. The formula:
(Milestones delivered within variance threshold / Total milestones committed) × 100%
I use a 90-day rolling window and a variance threshold of ±5% of the committed timeline. A milestone delivered three days late on a 60-day commitment falls within the threshold; one delivered two weeks late does not.
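Here is a minimal sketch of that calculation, assuming each milestone is stored with its committed and actual dates; the field names and sample numbers are illustrative.

```python
from dataclasses import dataclass
from datetime import date

# A minimal sketch of the delivery-predictability calculation.
# Field names and the sample comparison are illustrative.

@dataclass
class Milestone:
    committed_start: date
    committed_end: date
    actual_end: date

def within_threshold(m: Milestone, threshold: float = 0.05) -> bool:
    """A milestone counts as met if its slip is within ±5% of the committed timeline."""
    committed_days = (m.committed_end - m.committed_start).days
    slip_days = abs((m.actual_end - m.committed_end).days)
    return slip_days <= committed_days * threshold

def delivery_predictability(milestones: list[Milestone]) -> float:
    """Percentage of committed milestones delivered within the variance threshold."""
    if not milestones:
        return 0.0
    met = sum(within_threshold(m) for m in milestones)
    return met / len(milestones) * 100

# On a 60-day commitment the threshold is 3 days: a 3-day slip still counts
# as met, a 14-day slip does not.
```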
What it showed us
Before AI tooling, our delivery predictability was around 67%. We were consistently hitting sprint goals, but our quarterly delivery commitments were unreliable. The problem was estimation: we were good at breaking down sprints but poor at forecasting how sprint-level work would accumulate into project-level delivery.
Within six months of AI adoption and the surrounding process changes, delivery predictability climbed to 89%, then 92%. We've held it above 90% for the past year.
Here's why 92% beats 100%: a team that claims 100% predictability is either sandbagging estimates or not tracking failures honestly. A team that delivers 92% and can explain the 8% in specific, non-recurring terms is a team that understands its own performance. Clients trust the second team more because the data is credible. That trust translates directly into contract renewals and expanded scope.
A note on attribution
Two things changed at the same time, and conflating them would overstate what AI actually did. AI tooling reduced cycle time variance on individual tickets, and we also changed our estimation process to use historical data from PM Agent to flag optimistic estimates before they were committed. The second change would have improved predictability without AI. My honest assessment is that the process change accounted for roughly half the gain; AI tooling accounted for the rest, primarily through more consistent ticket-level execution.
If you want a cleaner attribution, isolate tickets where AI tooling was actively used and compare estimation accuracy and cycle time variance against a baseline of similar tickets where it wasn't. That's a tighter claim and considerably harder to dispute.
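As a rough illustration, the comparison might look like the sketch below, assuming each ticket records its estimate, actual cycle time, and whether AI tooling was used; all field names are hypothetical, and tickets should be grouped by type and size so the baseline is genuinely comparable.

```python
import statistics

# A rough sketch of an attribution check: compare estimation error and
# cycle-time spread for AI-assisted tickets against similar tickets without AI.
# The ticket fields ("estimate_h", "actual_h", "ai_assisted") are illustrative.

def estimation_error(tickets: list[dict]) -> float:
    """Mean absolute estimation error as a fraction of the estimate."""
    return statistics.mean(abs(t["actual_h"] - t["estimate_h"]) / t["estimate_h"]
                           for t in tickets)

def cycle_time_spread(tickets: list[dict]) -> float:
    """Standard deviation of actual cycle time."""
    return statistics.stdev(t["actual_h"] for t in tickets)

def compare_ai_vs_baseline(tickets: list[dict]) -> dict:
    ai = [t for t in tickets if t["ai_assisted"]]
    baseline = [t for t in tickets if not t["ai_assisted"]]
    return {
        "estimation_error": (estimation_error(ai), estimation_error(baseline)),
        "cycle_time_spread": (cycle_time_spread(ai), cycle_time_spread(baseline)),
    }
```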
Where this metric can mislead you
The 90-day window is long enough to show a trend but short enough to miss debt accumulation. Teams that show strong predictability in months three through six sometimes see it deteriorate in months nine through twelve as AI-generated code enters maintenance. The metric looks healthy right up until accumulated technical debt starts affecting estimation accuracy on the same codebase. Track predictability continuously and watch for a reversal signal after the first major release cycle.
Metric 3: capacity unlock – how many "ghost FTEs" AI created
Margin tells you whether the work is profitable. Predictability tells you whether it lands on time. Neither tells you whether AI is actually making your team more capable, or whether you're just doing the same amount of work with better tooling. That's the question the third metric answers, and it's the one that finally made AI ROI legible to our finance team.
A ghost FTE (full-time equivalent) is the productivity equivalent of a full-time employee that AI tooling created without adding headcount. The calculation works like this:
- Establish baseline output per engineer for a defined period before AI adoption, measured in hours of delivery work completed per week, adjusted for ticket complexity using historical cycle time data by ticket type and size.
- Measure the same figure for an equivalent period after AI adoption, using the same complexity adjustment.
- Express the difference as equivalent headcount: if ten engineers are delivering what twelve delivered previously, the ghost FTE count is two.
- Convert to economic value using loaded labor cost for the equivalent roles: salary, benefits, recruiting, and onboarding for a senior engineer in your market.
The complexity adjustment is the methodologically critical step. Without it, the calculation conflates AI productivity gains with easier work, team composition changes, or reduced scope ambiguity. Use ticket type and historical cycle time to normalize, not story points, which are too subjective to produce a defensible number.
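Here is a minimal sketch of that calculation, with the complexity adjustment reduced to illustrative per-ticket-type weights derived from historical cycle times; none of the names, weights, or numbers come from a real dataset.

```python
# A minimal sketch of the ghost FTE calculation with a complexity adjustment.
# Weights, figures, and ticket fields are illustrative assumptions.

# Weights derived from historical median cycle time by ticket type, so a week
# of migration work counts for more than a week of routine bugfixing.
COMPLEXITY_WEIGHT = {"bugfix": 0.6, "feature": 1.0, "migration": 1.4}

def adjusted_output_per_engineer(tickets: list[dict], engineers: int, weeks: int) -> float:
    """Complexity-adjusted delivery hours per engineer per week."""
    adjusted_hours = sum(t["hours"] * COMPLEXITY_WEIGHT[t["type"]] for t in tickets)
    return adjusted_hours / (engineers * weeks)

def ghost_ftes(baseline_per_eng: float, current_per_eng: float, engineers: int) -> float:
    """Extra output expressed as equivalent headcount at the baseline rate."""
    return (current_per_eng - baseline_per_eng) * engineers / baseline_per_eng

# Ten engineers who move from 24 to 28.8 adjusted hours/week each are doing
# what twelve did before: ghost_ftes(24, 28.8, 10) == 2.0.
# Economic value: ghost FTE count x loaded annual cost of the equivalent role.
```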
What it showed us
The ghost FTE calculation translated AI ROI into terms that finance could evaluate directly: the equivalent of hiring two senior engineers without a recruiting timeline, onboarding period, or additional management overhead. It moved the conversation from engineering metrics to headcount economics, and headcount economics is a language every CFO speaks.
Enji's Green Worklogs made this calculation possible by providing before-and-after time data at the task level. Without that granularity, we would have had to rely on proxy metrics that are too easy to dispute. With actual time data connected to actual deliverables, the number was defensible.
Where this metric can mislead you
Ghost FTE calculations break down when AI tooling produces low-quality output that requires significant human correction. The raw output numbers go up, but rework consumes the gain. If you don't strip rework hours from the denominator, the ghost FTE number flatters the result, sometimes dramatically. In legacy codebases where AI suggestions frequently conflict with established patterns and require substantial revision, we've seen ghost FTE calculations overstate productivity gains by 40% or more when rework isn't tracked separately. Always measure the rework rate alongside ghost FTE creation, and recalculate if rework is rising.
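In practice that means stripping rework hours from credited output before the ghost FTE step. A small, hedged extension of the earlier sketch, with an illustrative rework_hours field:

```python
# Rework-adjusted variant of the ghost FTE sketch; "rework_hours" is an
# illustrative field for hours spent correcting AI-generated output.

COMPLEXITY_WEIGHT = {"bugfix": 0.6, "feature": 1.0, "migration": 1.4}  # same illustrative weights

def net_adjusted_hours(tickets: list[dict]) -> float:
    """Credited delivery hours after stripping rework, complexity-adjusted."""
    return sum((t["hours"] - t.get("rework_hours", 0.0)) * COMPLEXITY_WEIGHT[t["type"]]
               for t in tickets)

def rework_rate(tickets: list[dict]) -> float:
    """Share of logged hours spent on rework; a rising value means recalculate."""
    total = sum(t["hours"] for t in tickets)
    return sum(t.get("rework_hours", 0.0) for t in tickets) / total if total else 0.0
```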
With all three metrics in place and tracked continuously, the next problem is practical: how do you present this to a client without turning a status update into a data literacy exercise? That's where the reporting format matters as much as the metrics themselves.
My client report template: 1 page, 3 charts, fewer questions
After iterating through several reporting formats, I landed on a single-page structure with three charts that consistently produces the reaction I want: understanding rather than follow-up questions.
- Chart 1: Project margin trend. A line chart showing current margin versus target margin over the project lifecycle, updated weekly. One line, one target, one trend direction. If the margin is above target and trending stable or improving, the project is healthy. If it's below target or declining, we discuss why in the narrative.
- Chart 2: Delivery predictability score. A simple gauge or percentage with a 90-day rolling window. Above 88%, green. Between 80% and 88%, yellow. Below 80%, we're having a different conversation.
- Chart 3: Milestone adherence. A timeline showing committed versus actual delivery dates for the three most recent major milestones. Not sprints: milestones. Things clients recognize as meaningful progress markers.
Below the three charts is a four-sentence narrative: what happened last period, what it means for the project trajectory, what we're doing about anything that's off target, and what the client needs to decide or provide for the next period.
I send this weekly. It takes me a couple of minutes to produce, because Enji assembles the underlying data automatically. Before this setup, the same report took two to three hours. The result: the number of status calls dropped sharply because clients already had the information they cared about. Getting to that point, however, required the rollout to be structured in a way that supported the measurement framework from day one, not retrofitted after the fact.
My AI rollout checklist: from experiment to company standard
The measurement framework only works if the rollout is structured to support it. Here's the sequence that's produced consistent results across the teams I've worked with:
Before you roll out any AI tooling:
- Establish baseline metrics for the three core measures: current project margin by project type, current delivery predictability over the trailing 90 days, and current output per engineer by role and stack.
- Connect your delivery data infrastructure. If you don't have worklog data connected to project financials before AI arrives, you won't be able to calculate ghost FTEs after. Enji or equivalent tooling needs to be in place first.
- Define what success looks like in business terms, not engineering terms. Set targets for project margin improvement, predictability score, and ghost FTE creation before rollout. This prevents the inevitable debate about whether the numbers are good enough.
During rollout:
- Run in parallel for at least one project cycle before drawing conclusions. AI adoption curves are real; engineers need time to integrate new tools into existing workflows, and early measurements underestimate eventual gains.
- Track adoption rate separately from output metrics. An engineer who has access to AI tooling but isn't using it will look like a negative data point for AI ROI. The data needs to distinguish between adoption lag and tool inefficacy.
- Watch for debt acceleration signals. If AI is generating code faster than review processes can absorb it, the margin gains will be temporary, and the predictability gains will reverse. Early warning signs include rising PR review times and increasing rework rates.
- Track rework hours from the start. You won't be able to calculate a clean ghost FTE number later without it.
After rollout:
- Publish the ghost FTE calculation internally before publishing externally. Engineering teams need to understand that the metric doesn't threaten their positions but demonstrates the value they're creating. Teams that understand this frame tend to adopt AI tools more enthusiastically and use them more effectively.
- Review the three core metrics quarterly and adjust targets as the team matures. First-year gains are typically the steepest; second-year gains require more deliberate optimization of how AI is integrated into specific workflows.
- Connect metric outcomes to business decisions. When project margin data supports a pricing conversation with a client or when delivery predictability data supports a scope commitment, that connection reinforces the value of the measurement infrastructure and maintains team engagement with the data.
- Recheck ghost FTE calculations if the rework rate is rising. A deteriorating rework signal is the earliest indicator that your AI ROI calculation is overstating the gain.
The CFO conversation I described at the beginning of this piece happened again last quarter, with the same person. This time, I had three charts and a number she recognized: project margins up 18%, delivery predictability at 91%, and a ghost FTE equivalent of 2.4 senior engineers. She asked one question: "Can we do this across the other business units?"
That's the conversation AI ROI measurement is supposed to produce.