Afternoon Brief: GEO / AEO

Measure GEO Like a Distribution, Not a Ranking

If you are still checking AI visibility with one prompt and one screenshot, your GEO reporting is lying to you. Here is the weekly measurement cadence, sample size, and decision rule I would hand to a growth team right now.

Christian Lehman

If your team is still measuring GEO with one prompt, one model run, and one screenshot, stop. A new April 2026 paper from the University of St. Gallen shows that AI search visibility is too unstable to measure that way, because the same prompt can return different cited sources and brand mentions across runs and across days. The move this week is simple: treat AI visibility as a distribution, sample it repeatedly, and make content decisions from patterns, not anecdotes. That is the difference between reporting noise and managing a real channel. (Don’t Measure Once)

Key Takeaways

  • One prompt run is not a reliable GEO measurement. Consecutive-day source overlap in the St. Gallen study was only 34 to 42 percent. (Don’t Measure Once)
  • Weekly repeated sampling is the minimum viable operating model. If you do not rerun prompts across multiple days, you cannot tell luck from repeatability. (Quantifying Uncertainty in AI Visibility)
  • The best GEO dashboards report uncertainty, not just scores. Mention rate, citation share, and variance are more decision-useful than a single visibility number. (AI Visibility Score Accuracy)

One run is not measurement

AI search visibility changes too much for single-run reporting to be reliable. The St. Gallen paper tracked four AI search engines over a 45 to 46 day window and found that cited-source overlap between consecutive days was only 34 to 42 percent. Brand-set overlap was only 45 to 59 percent. If you check one prompt once, you are not measuring performance. You are catching one draw from a moving system. (Don’t Measure Once)
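If "overlap" feels abstract, here is a minimal sketch of the day-over-day comparison, using hypothetical source sets and a simple Jaccard-style calculation. The paper's exact formula may differ; the point is that overlap is a set comparison across repeated observations, not a single screenshot.

```python
# Hypothetical cited-source sets for the same prompt on two consecutive days.
# A Jaccard-style overlap is one simple way to quantify day-to-day drift;
# the paper's exact metric may be defined differently.
day1_sources = {"vendor-a.com", "vendor-b.com", "analyst-report.com", "review-site.com"}
day2_sources = {"vendor-a.com", "review-site.com", "new-blog.com", "vendor-c.com"}

overlap = len(day1_sources & day2_sources) / len(day1_sources | day2_sources)
print(f"Consecutive-day source overlap: {overlap:.0%}")  # 33% for these example sets
```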

That finding lines up with a second March 2026 paper, which argues that AI visibility metrics should be treated as sample estimators of an underlying response distribution rather than fixed values. That is the statistical version of what operators already feel in practice: the dashboard moves because the environment moves. (Quantifying Uncertainty in AI Visibility)

The reason this matters right now is simple. AI search is becoming a large enough discovery surface that measurement quality changes budget decisions. A 2026 St. Gallen working paper notes that generative search changes performance measurement because brand inclusion behaves more like a probability distribution than a stable ranking set. (Don’t Measure Once PDF)

Here is the operational consequence. If someone on your team says, “we showed up in ChatGPT yesterday,” that is not a durable result. It is a single observation. Useful, but nowhere near enough to justify a content win, a budget shift, or a vendor claim.

I would reset the team around three rules:

  1. Run every priority prompt multiple times per engine.
  2. Repeat the measurement on multiple days.
  3. Report mention probability and citation share ranges, not single scores.
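Here is a minimal sketch of what those three rules look like as a sampling loop. Everything named here is an assumption for illustration: query_engine is a placeholder for whatever API client, browser automation, or monitoring tool you actually use, and the prompt, engine, and brand names are stand-ins for your own.

```python
import datetime
import json

ENGINES = ["chatgpt", "google_ai_mode", "perplexity"]   # stand-ins for your 3 to 4 engines
PRIORITY_PROMPTS = ["best b2b data platform", "..."]    # your 20 to 30 buyer questions
RUNS_PER_PROMPT = 3
BRAND = "YourBrand"                                     # hypothetical brand name

def query_engine(engine: str, prompt: str) -> dict:
    """Placeholder: swap in your real client. Should return the answer text
    and the list of cited source URLs for one run."""
    return {"answer": "", "sources": []}

def sample_day(run_date: str) -> list[dict]:
    """One day's sample: every priority prompt, every engine, multiple runs."""
    records = []
    for engine in ENGINES:
        for prompt in PRIORITY_PROMPTS:
            for run in range(RUNS_PER_PROMPT):
                result = query_engine(engine, prompt)
                records.append({
                    "date": run_date,
                    "engine": engine,
                    "prompt": prompt,
                    "run": run,
                    "mentioned": BRAND.lower() in result["answer"].lower(),
                    "sources": result["sources"],
                })
    return records

# Run this on three non-consecutive days and append everything to one log,
# so the week's numbers come from repeated draws rather than a single screenshot.
today = datetime.date.today().isoformat()
with open("geo_samples.jsonl", "a") as log:
    for record in sample_day(today):
        log.write(json.dumps(record) + "\n")
```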

Those three rules line up with what we have been pushing in this GEO measurement framework, with the more skeptical view in AI visibility score accuracy, and with Machine Relations research on why citation systems break when retrieval verification is weak. (Citation Systems Still Break Without Retrieval Verification)

The weekly GEO measurement setup I would use

A workable GEO measurement system does not need to be fancy; it needs to be repeatable. The paper’s core lesson is methodological: repeated measurement beats screenshot theater. (Don’t Measure Once)

For a lean B2B team, I would run this cadence every week:

Element | Minimum standard | Why it matters
Priority prompts | 20 to 30 | Covers the real buyer questions that shape shortlist formation
Engines | 3 to 4 | ChatGPT, Google AI Mode or AI Overviews, Perplexity, and one buyer-relevant alternate
Runs per prompt | 3 | Reduces false confidence from one lucky mention
Days per week | 3 non-consecutive days | Catches day-to-day citation drift
Core outputs | Mention rate, citation share, source overlap | Tells you whether visibility is recurring or random

That gives you enough volume to see whether your brand is consistently present or just occasionally lucky.

If you want a simple decision rule, use this one: do not call a prompt “won” unless your brand appears in at least half of the sampled runs across the week. Below that threshold, you are still in volatility territory.
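As a sketch, here is how that decision rule and the table's core outputs fall out of the sample log from the earlier snippet. The domain string is hypothetical, and "citation share" is treated here simply as the share of runs in which your own domain appears among the cited sources.

```python
import json
from collections import defaultdict

BRAND_DOMAIN = "yourbrand.com"  # hypothetical; use the domain you expect to see cited

# Load the week's samples written by the collection loop sketched earlier.
with open("geo_samples.jsonl") as f:
    records = [json.loads(line) for line in f]

by_key = defaultdict(list)
for r in records:
    by_key[(r["engine"], r["prompt"])].append(r)

for (engine, prompt), runs in by_key.items():
    mention_rate = sum(r["mentioned"] for r in runs) / len(runs)
    cited_runs = sum(any(BRAND_DOMAIN in url for url in r["sources"]) for r in runs)
    citation_share = cited_runs / len(runs)
    # Decision rule: only call the prompt "won" if the brand shows up
    # in at least half of the sampled runs across the week.
    status = "won" if mention_rate >= 0.5 else "still volatile"
    print(f"{engine} | {prompt}: mentioned {mention_rate:.0%}, "
          f"cited {citation_share:.0%}, {status}")
```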

A practical reason to build the cadence now is that buyer behavior is already moving faster than measurement systems. Forrester reported in January 2026 that 89 percent of business buyers use AI in their buying process and 61 percent use private AI tools provided by their organization. If that research workflow is happening behind private interfaces, you need repeated observation just to regain partial line of sight. (B2B Buyers Make Zero-Click Buying Number One)

What to change when the distribution is weak

When visibility is inconsistent, the fix is usually source quality and evidence spread, not prompt hacking. Forrester has been right about the macro shift: marketers are losing line of sight into buyer intent, and visibility now matters more than raw traffic. In a March 25, 2026 post, Forrester argued that marketers now face a “visibility vacuum” as research shifts into answer engines and buyer activity becomes harder to observe. (Forrester)

If your weekly sample comes back noisy, do these three things first:

  1. Expand independent proof. Publish or earn more third-party references, not just first-party pages.
  2. Tighten entity consistency. Make sure your brand, category, and claims match across the pages likely to be cited.
  3. Map prompt gaps to asset gaps. If you vanish on comparison queries, you probably need comparison-grade evidence. If you vanish on definition queries, you probably need cleaner educational assets.

That is why I keep pointing teams back to the Machine Relations definition of Generative Engine Optimization, the broader AI visibility model, and practical comparisons of how to prove AI search performance without fantasy metrics. (AI Visibility Score Accuracy)

The failure mode: reporting averages without uncertainty

Most GEO reporting fails because it compresses volatility into one neat number. A clean dashboard is seductive. It is also dangerous if it hides spread, variance, and repeatability. The March 2026 uncertainty framework makes this explicit: citation visibility should be interpreted as a statistical estimate with uncertainty, not a fixed truth. (Quantifying Uncertainty in AI Visibility)
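To make "estimate with uncertainty" concrete, here is a minimal sketch that reports a mention rate with a Wilson confidence interval instead of a bare score. The run counts are hypothetical; the point is that a week of sampling still leaves a wide interval, and the report should show it.

```python
import math

def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion,
    e.g. the probability that a prompt run mentions the brand."""
    if n == 0:
        return 0.0, 0.0
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return max(0.0, center - margin), min(1.0, center + margin)

# Hypothetical week: the brand was mentioned in 5 of 9 sampled runs for one prompt.
hits, n = 5, 9
low, high = wilson_interval(hits, n)
print(f"mention rate {hits / n:.0%} (95% interval {low:.0%} to {high:.0%})")
# The wide interval is the signal: a single point score hides how uncertain
# nine runs still are.
```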

So stop sending executives a single “visibility score” with no context. Send this instead:

  • mention rate by engine
  • citation share by engine
  • range across runs
  • week-over-week direction
  • prompts with repeated absence

That gives leadership something useful. Not a vanity number, an operating signal.

Where this fits in Machine Relations

GEO measurement sits in the Machine Relations stack as a diagnostic layer, not the whole game. You still need the upstream inputs that make citation possible: earned media, authoritative pages, entity consistency, and coverage breadth. If those inputs are weak, repeated measurement will only confirm, more precisely, that your visibility is weak. (Machine Relations)

That is the useful part. It tells you whether the problem is instrumentation or infrastructure. In Machine Relations terms, the engine can only cite what the web has already made legible and trustworthy. Start there, then measure the output with enough repetition that the numbers mean something.

If you want the fast next step, run a three-day repeated measurement sprint on your top 20 prompts before the next planning meeting. You will learn more from that than from another month of screenshot collections.

If you want help finding the prompts and source gaps that matter most, run a visibility audit.

FAQ

How many times should I run a GEO prompt before reporting it?

At minimum, run each priority prompt three times per engine, spread across three non-consecutive days. One run tells you almost nothing about repeatability.

What metric matters most in GEO measurement?

Start with mention rate and citation share across repeated runs. Those tell you whether you show up consistently enough to matter.

Can I use one AI visibility score for executive reporting?

You can, but only if you also show the variance behind it. A single score without spread hides the real risk.
