In 2026, AI works when it improves real work: here's how to measure it without confusing "use" with "impact."
In 2026, the issue is no longer "who uses AI" or "how much output does it produce." The relevant question is a different one: is it actually improving real work?
Many projects stall not because the technology doesn't work, but because they're measured with metrics that don't capture value. If we only look at usage, frequency, or quantity of text generated, we risk two opposing errors: concluding that AI "isn't needed," or concluding that it "works" when in reality it's introducing friction, rework, or risk.
The key point is simple: value emerges when you measure productivity and quality on the right process, with a clear owner.
Why so many AI projects today are stuck on measurement (not adoption)
At first, adoption seems easy: some try it, some get immediate results, some get enthusiastic. But then comes the most delicate phase: understanding whether the use "holds up" in everyday work and whether it's worth consolidating. This is where everything often stalls. The reason is that, without a clear way to measure, the discussion quickly becomes opinion versus opinion: some say "it saves time," others say "it wastes time," and still others say "it depends on how you use it." All true, but insufficient for decision-making. Without even minimal measurement, you can't figure out what to improve, where to intervene with training, which processes really make sense, and above all, you can't distinguish a "curious" test from operational use.
The most common mistake: “traditional software” KPIs (and why they don't describe the impact)
Many companies measure AI like they would traditional software: logins, frequency of use, number of tasks completed "by the tool." The problem is that AI isn't a system that always performs the same procedure deterministically. It's more like a cognitive aid: it accelerates, suggests, reorganizes, synthesizes. And its value depends heavily on context and on who uses it.
If you measure it like management software, you risk rewarding the wrong signals: for example, "a lot of use" can mean enthusiasm, but it can also mean confusion and repeated attempts; "a lot of output" can mean productivity, but it can also mean rework and rewrites.
So the first step is not to choose "more sophisticated" KPIs. It's to choose KPIs that are closer to real work.
The paradigm shift: from rigid automation to "skill multiplier"
A simple way to understand it: AI brings value especially when it multiplies the competence of the people doing the work, not when you try to "crystallize" a process with rigid constraints.
In practice, it's not just a matter of "automating." It's a matter of reducing friction: time spent searching for information, reformulating, summarizing, verifying, standardizing, and reconstructing context. When that friction is reduced, the expert can do what they already know how to do, better, with greater consistency and less dispersion.
This changes the way we measure: instead of chasing an abstract promise ("AI will do X"), it makes more sense to ask: how much more useful work can I do, of the same or better quality, in the same amount of time?
Where it really makes sense to measure: 4 practical areas
To avoid "creative" metrics, it's best to start with very specific areas. Four dimensions are often used because they're observable and comparable.
1) Speed (time on process)
Not "how fast is the AI," but "how long does the end-to-end process take?" AI can speed up the draft, but if it then increases revisions or generates uncertainty, the overall time won't improve.
2) Rework (how many times do you have to redo or correct)
Rework is often a more revealing metric than speed. If AI produces output that requires too many fixes, the time saved initially is lost later, and the experience deteriorates.
3) Repetitive errors (and reduction of systematic “oversights”)
When a team works on large volumes, certain errors become recurring: divergent interpretations, missing parts, incorrect references, inconsistencies. If AI helps standardize, this is where real signals show up.
4) Perceived quality (but anchored to criteria)
Perceived quality is only useful if it doesn't remain a mere "feeling." It works when you link it to understandable criteria: clarity, consistency with internal rules, completeness, readability, and reduction of unnecessary steps.
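To make this concrete, here's a minimal sketch (in Python) of what logging these four dimensions could look like. The field names, error labels, and the 1-to-5 scoring scale are illustrative assumptions, not a prescribed schema; the point is only that one small record per completed task is enough to aggregate all four signals.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class TaskRecord:
    """One completed unit of work in the observed process (illustrative fields)."""
    minutes_end_to_end: float    # speed: the whole process, not just the draft
    revision_rounds: int         # rework: how many times the output was redone
    recurring_errors: list[str] = field(default_factory=list)     # e.g. "missing reference"
    quality_scores: dict[str, int] = field(default_factory=dict)  # criterion -> 1..5

def summarize(records: list[TaskRecord]) -> dict:
    """Aggregate the four dimensions over a non-empty sample of tasks."""
    n = len(records)
    criteria = {c for r in records for c in r.quality_scores}
    return {
        "avg_minutes": mean(r.minutes_end_to_end for r in records),
        "rework_rate": sum(r.revision_rounds > 0 for r in records) / n,
        "errors_per_task": sum(len(r.recurring_errors) for r in records) / n,
        "avg_quality": {c: mean(r.quality_scores[c]
                                for r in records if c in r.quality_scores)
                        for c in criteria},
    }
```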
Concrete example 1: regulated sector (bank or public administration) – what to measure in a real process
In a regulated sector, the issue is not just doing it fast. It's doing it in a consistent, traceable way, with less ambiguity. An illustrative example of such a process could be a bank manager who must respond to recurring requests in compliance with internal policies and procedures.
Here, AI can become useful when it reduces the time spent reconstructing context: where things are written, which exceptions apply, which version is up to date. What can realistically be measured?
- Information search time: how long it took before to find the correct policy and the relevant steps, and how long it takes after to reach the same point.
- Rework on notes and texts: how many times the manager has to rewrite, supplement, or correct because elements are missing or the wording isn't aligned with internal rules.
- Reduction of "clarification steps": how many times you need to ask a support team for confirmation, or to "realign" because the procedure was unclear.
- Output consistency: not in the abstract, but with respect to an internal reference (procedures, checklists, constraints).
In such contexts, the typical error is to measure only "how many responses it produces." The real point is: how many correct, consistent answers you can close with less friction.
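As a sketch of what "consistency with an internal reference" could look like operationally, a draft response can be screened against a list of required elements. The checklist items and the naive keyword matching below are purely illustrative assumptions; a real check would follow the institution's own procedures.

```python
# Illustrative internal checklist: each required element is paired with a
# marker we expect to find in a compliant response. Both the elements and
# the keyword matching are deliberately naive placeholders.
REQUIRED_ELEMENTS = {
    "policy_reference": "pol-",       # assumed format for internal policy IDs
    "version_statement": "version",   # must state which version applies
    "escalation_path": "escalation",  # must mention where to escalate exceptions
}

def consistency_check(draft: str) -> dict[str, bool]:
    """Return, per required element, whether the draft appears to contain it."""
    text = draft.lower()
    return {name: marker in text for name, marker in REQUIRED_ELEMENTS.items()}

# A response "closes cleanly" only if every required element is present.
result = consistency_check("Per policy POL-12 (version 3), escalation goes to the risk desk.")
print(all(result.values()))  # True
```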
KPIs to avoid: usage metrics that don't equate to value
There are KPIs that seem “objective,” but they don’t tell you whether you’re creating value.
- Number of active users: it can go up because the tool is useful, but also because it's "being tested" without any shared standard.
- Number of prompts or messages: more messages don't mean more efficiency; often they mean the user is struggling to get something that wasn't set up properly.
- Output produced (documents, texts, summaries): it isn't valuable if it then requires heavy revision or creates ambiguity.
- Estimated hours saved: without a baseline, it's an easy number to quote but a hard one to defend.
This doesn't mean these metrics should never be looked at. It means that, on their own, they can't support a decision.
The most misunderstood point: the impact is not measured on the least experienced user
Many tests start with the "least ready" users, because they're the ones struggling the most and needing the most support. That's understandable, but it's the wrong lens if you use it to assess value.
If AI is a skill multiplier, it's natural that value emerges first where there's more experience: people who know what to ask, how to verify, and how to use the output. If you measure only less experienced users, you risk concluding that "it doesn't work," when in reality you're measuring a lack of context, training, and internal standards.
The correct reading is often: if it works for the experienced user, then you can figure out what others are missing to get there (training, guidelines, review processes), rather than rejecting the tool outright.
Owner and baseline: accountability, a primary metric, and a before/after comparison
Two elements make more difference than any dashboard.
Owner means: someone who owns that process and can say whether the outcome is good, consistent, and useful. This isn't a "control" figure; it's the person who knows the work and can guide its adoption.
Baseline means: a simple reference for comparing before and after. There's no need for an endless project. Just decide:
- what is the observed process,
- what is the main metric (time, rework, errors, quality),
- what is the unit of comparison (a typical week, a set of practices, a sample of recurring requests).
Without a baseline, you're just comparing impressions.
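As a sketch of what a minimal baseline comparison could look like, two comparable samples of the primary metric are enough. The numbers, and the choice of minutes per request as the metric, are invented for illustration.

```python
from statistics import mean

def compare(before: list[float], after: list[float], higher_is_better: bool = False) -> dict:
    """Compare the primary metric over two comparable samples (e.g. one typical week each)."""
    b, a = mean(before), mean(after)
    change = (a - b) / b  # relative change on the primary metric
    return {"before": round(b, 1), "after": round(a, 1),
            "relative_change": f"{change:+.0%}",
            "improved": change > 0 if higher_is_better else change < 0}

# Illustrative numbers: minutes per request over two comparable weeks.
baseline_week = [42, 55, 38, 61, 47]
test_week = [31, 40, 36, 44, 35]
print(compare(baseline_week, test_week))
# {'before': 48.6, 'after': 37.2, 'relative_change': '-23%', 'improved': True}
```

The mechanics are trivial on purpose: the hard part is choosing comparable samples, not the arithmetic.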
Concrete example 2: public administration (official) – what to measure before/after
A second (again illustrative) example could be that of a public administration official who handles case files and documents, with repetitive requests, internal regulations, forms, and verification steps.
AI can help here, especially where there's information overload and a need for consistency. What should we measure?
- Processing times for specific steps: not the entire process, but recurring moments (retrieving documents, summarizing requirements, preparing a coherent draft).
- Requests for additions and rework: how many times a file gets returned because a piece is missing, the communication isn't clear, or a document needs to be redone.
- Uniformity of responses/actions: not as a matter of "style," but as adherence to internal criteria and procedures, reducing arbitrary interpretation across different people.
- Reduced search time for internal documentation: how many consultations, steps and checks are needed to get to the right information.
Again, the useful metric is not "how much they use it," but how much it reduces friction and the cost of rework.
Why it "seems not to work": no baseline, unstructured tests, only perceptual evaluations
When a project "seems not to work," it's often not because it doesn't add value. It's because it was poorly evaluated.
It typically happens when:
- there is no baseline (so you don't know what you're comparing);
- the test is left to spontaneity (therefore its use is heterogeneous and not comparable);
- the evaluation is only perceptual (therefore it depends on mood, workload or the individual negative case).
The result is that the project turns into an endless debate, rather than a path to improvement.
Mini operational checklist for measuring AI without complicating the project
To get off to a good start, all you need is an essential checklist, geared towards real work:
- Choose a recurring process, not an "exceptional" case. If the process is rare, measuring it is almost impossible.
- Appoint an owner who knows the job and decides what is “good”.
- Define a primary metric (only one) and 1–2 supporting metrics (e.g. time + rework).
- Create a simple baseline: a before/after sample or a comparable period.
- Establish a minimum quality criterion: what "acceptable output" means in that context.
- Run a short, observable test, then adjust: if it doesn't improve, don't declare failure right away; ask what's missing (context, rules, training, materials). See the sketch after this list.
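To show how little structure this checklist actually requires, here's a minimal sketch of a measurement plan written down before the test starts. Every value is an invented placeholder; what matters is that each checklist item is pinned down in advance, so the evaluation can't drift into opinion.

```python
# One measurement plan per observed process, agreed before the test starts.
measurement_plan = {
    "process": "responses to recurring policy requests",   # recurring, not exceptional
    "owner": "team lead who signs off on outgoing responses",
    "primary_metric": "end-to-end minutes per request",
    "supporting_metrics": ["revision rounds", "clarification steps"],
    "baseline": "one typical week, measured before enabling the assistant",
    "acceptable_output": "complete, cites the current policy version, no realignment needed",
}
```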
Conclusion: value = concrete improvement of work, not "use of the tool"
In 2026, the issue is not to demonstrate that "AI is usable." It's to demonstrate that it improves real work, reduces friction, and increases consistency, with an owner who drives quality and a baseline that makes the change measurable.
If you measure well, adoption becomes easier, because it stops being a matter of opinion and becomes a process: what works, what doesn't yet, and what's needed to make it actually work.