Inference Costs Fell 1,000x — AI Engines Still Cite Only 3–5 Sources Per Query

From $60 to $0.06 per million tokens in 3 years — but cheaper inference didn't expand citations. AI engines still select the minimum high-authority sources per answer. The economics of who gets cited and four structural moves to be in that set.

Christian Lehman, May 29, 2026

Inference costs dropped 1,000x in three years — from $60 to $0.06 per million tokens. Query volume exploded 100x. The net effect on brand visibility: AI engines can now afford to answer far more questions, but each answer still cites only the minimum high-authority sources needed to sound trustworthy. Cheaper compute expanded the surface area of AI-mediated discovery without expanding the citation slots per query. If your content isn't structured to earn one of those slots, the economics are compounding against you on every query your buyers ask.

Here's how inference economics shape citation decisions, and four structural moves to be in the cited set.

Key Takeaways

AI inference costs dropped 1,000x in 3 years: from $60 to $0.06 per million tokens for equivalent-quality models, faster than Moore's Law
Total inference spend is still climbing: per-token costs fell 10x annually, but consumption grew 100x (Jevons paradox)
Citation decisions are shaped by token budgets: AI engines cite the minimum high-authority sources needed for a trustworthy answer
The same model behaves differently across providers, and engines rarely agree on sources: with only 11% citation overlap across engines, single-engine monitoring misses 75%+ of your visibility surface
Brands that are easy to cite win: structured, entity-dense, answer-first content reduces token overhead and earns disproportionate citation share

The 1,000x Price Drop That Changed Everything

The cost of LLM inference has fallen by a factor of 1,000 in three years. Andreessen Horowitz tracked this decline from November 2021, when GPT-3 became publicly available at $60 per million tokens, down to $0.06 per million tokens for equivalent-quality open models by late 2024. They call the trend LLMflation: a 10x cost decline per year, faster than Moore's Law during the PC revolution and faster than bandwidth drops during the dotcom boom.
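The 10x-per-year figure follows directly from the two endpoints quoted above. Here is a minimal sketch of that arithmetic, assuming a constant rate of decline over exactly three years (the function name is illustrative, not taken from the Andreessen Horowitz analysis):

```python
def annual_decline_factor(start_price: float, end_price: float, years: float) -> float:
    """Constant yearly factor by which a price must fall to go from start to end."""
    total_factor = start_price / end_price
    return total_factor ** (1.0 / years)

# Endpoints quoted in the text: $60 -> $0.06 per million tokens over three years.
total = 60.0 / 0.06
per_year = annual_decline_factor(60.0, 0.06, 3)
print(f"total decline: {total:.0f}x, per year: {per_year:.1f}x")
```

A 1,000x drop over three years and a 10x drop per year are the same claim stated two ways, since 10 cubed is 1,000.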

For GPT-4-class models (MMLU score of 83), the decline is 62x since March 2023. This is not a marginal improvement. It is a structural reset in what AI engines can afford to do on every single query.

What does this mean for you? It means the AI engines serving your buyers can now afford to answer far more queries, at greater length and in more detail, than they could 18 months ago. The question is whether your content is structured to be one of the few sources each answer cites.

Cheaper Tokens, Bigger Bills: The Jevons Paradox in AI

Here is the counterintuitive reality: despite steep declines in per-token costs, total enterprise AI inference spend is climbing. VentureBeat reported that while cost per token fell by an order of magnitude over two years, consumption rose more than 100x. Economists call this the Jevons paradox: when a resource becomes cheaper to use, demand can grow so much that total consumption, and total spend, rises instead of falling.
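The spend arithmetic behind that claim is simple enough to check by hand. Here is a minimal sketch, assuming the price and consumption figures cover the same period and the same workloads (the reporting does not guarantee this), with an illustrative function name:

```python
def spend_multiplier(price_decline: float, volume_growth: float) -> float:
    """Change in total spend when unit price falls and volume rises.

    Spend is price times volume, so the multiplier is
    volume_growth / price_decline.
    """
    return volume_growth / price_decline

# Figures from the text: price per token down about 10x, consumption up about 100x.
multiplier = spend_multiplier(price_decline=10.0, volume_growth=100.0)
print(f"total spend changes by {multiplier:.0f}x")
```

Under these assumptions the bill grows roughly tenfold even though each token is ten times cheaper. Spend only stays flat when volume grows by exactly the same factor that price falls.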

Anindo Sengupta, VP of Products at Nutanix, put it plainly: "Every employee with an AI assistant, every automated workflow, every agent pipeline needs models for inferencing and generates a lot of tokens." Cost per token and GPU utilization are now primary operational metrics for enterprise IT, sitting alongside uptime and throughput.

This is directly relevant to brand visibility. More queries, more tokens, more AI-generated answers means more opportunities for your brand to appear — or to be absent. The total surface area of AI-mediated buyer research is expanding by orders of magnitude, even as the cost of each individual answer shrinks.

How Inference Economics Shape Citation Decisions

This is the part most visibility strategies miss entirely. AI engines do not cite sources for free. Every citation in an AI answer costs tokens — the model must retrieve, evaluate, attribute, and format each source reference. A response with six inline citations costs measurably more inference than a response with zero.

That creates an economic pressure inside every AI engine: cite the minimum number of sources needed to produce a trustworthy answer. When researchers at TU Dortmund studied how generative AI disrupts web search, they found that AI answers synthesize from trusted sources and deliver results directly. The structural incentive is to consolidate citations around the highest-authority sources rather than distribute them across many.
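To make the cost pressure concrete, here is a toy model of per-answer token cost as a function of citation count. Every number in it is an illustrative assumption: AI engines do not publish their per-citation token overhead, and the linear relationship is a simplification rather than a measured result.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnswerCostModel:
    """Toy per-answer cost model. All default values are illustrative assumptions."""
    base_tokens: int = 800            # tokens for an answer with no citations
    tokens_per_citation: int = 400    # retrieved passage plus attribution text per source
    price_per_million: float = 0.06   # USD per million tokens (low end quoted in the text)

    def tokens(self, citations: int) -> int:
        """Total tokens processed for an answer with the given citation count."""
        return self.base_tokens + citations * self.tokens_per_citation

    def cost_usd(self, citations: int) -> float:
        """Dollar cost of one answer at the configured token price."""
        return self.tokens(citations) * self.price_per_million / 1_000_000

model = AnswerCostModel()
for n in (0, 3, 6):
    print(n, model.tokens(n), f"${model.cost_usd(n):.8f}")
```

With these assumed values, a six-citation answer processes four times the tokens of a zero-citation answer. The per-answer difference is tiny in dollars, but it is multiplied across every query an engine serves.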

What this means for your brand: if your content is not already established as a primary authority source for your category queries, the inference economics work against you. AI engines under token-budget pressure will default to the sources they have already learned to trust. The rich get richer. I have been tracking this pattern across five AI engines at AuthorityTech, and the data is consistent — brands with high entity density and structured, extractable content earn disproportionate citation share.

The Same Model Is Not the Same Service

If you are evaluating AI visibility based on which model an engine uses, you are looking at the wrong variable. A measurement study from Li et al. (2026) found that the same open-weight LLM served by different providers behaves as a fundamentally different service. Latency, throughput, context capacity, error handling, and pricing all vary by provider and by time of day.

Their key finding: intelligent routing between providers can reduce costs by 37.8% or increase throughput by 90% for the same model. Listed prices are more stable than actual performance, which means AI engines are constantly optimizing where and how they serve inference — and those optimizations affect response quality, citation depth, and source selection.
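As a rough illustration of what routing between providers means, the sketch below picks the cheapest endpoint that still meets a throughput floor. It is not the method from the measurement study; the provider names, prices, and throughput figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class Endpoint:
    name: str
    price_per_million: float   # USD per million tokens
    tokens_per_second: float   # measured throughput

def cheapest_meeting(endpoints: Sequence[Endpoint], min_tps: float) -> Optional[Endpoint]:
    """Return the lowest-priced endpoint whose throughput is at least min_tps."""
    eligible = [e for e in endpoints if e.tokens_per_second >= min_tps]
    return min(eligible, key=lambda e: e.price_per_million, default=None)

# Hypothetical measurements for one open-weight model served by three providers.
endpoints = [
    Endpoint("provider-a", price_per_million=0.20, tokens_per_second=90.0),
    Endpoint("provider-b", price_per_million=0.12, tokens_per_second=45.0),
    Endpoint("provider-c", price_per_million=0.15, tokens_per_second=80.0),
]

choice = cheapest_meeting(endpoints, min_tps=60.0)
print(choice.name if choice else "no endpoint meets the throughput floor")
```

Because measured throughput drifts while listed prices stay put, the selected endpoint can change from hour to hour. That is why the same model can behave like a different service depending on when and where a query lands.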

For brand visibility, this means your content needs to perform across the full service landscape, not just one model snapshot. The Machine Relations framework calls this citation resilience — your ability to be cited regardless of which provider endpoint, quantization level, or routing decision the engine makes on any given query.

The Macro Signal: Inference Costs Pass Through to the Real Economy

This is not just an infrastructure story. A preprint posted to arXiv presents what its authors describe as the first unified economic theory of AI inference costs and their pass-through to inflation, built around what they call the Inference-Cost Phillips Curve. Their reported empirical finding: AI inference costs pass through to consumer prices with near-unit elasticity (0.987) across G7 economies.

The practical implication: as inference costs decline, AI-mediated services become cheaper to deliver, which means more services will be AI-mediated. The total volume of AI-generated answers in buyer research is going to grow faster than most forecasts suggest. Brands that are citation-ready now will compound their advantage as query volume scales.

4 Tactical Moves to Win the Inference Economics Game

Based on what I have seen working across the AuthorityTech portfolio and tracking citation performance across ChatGPT, Perplexity, Gemini, Claude, and Google AI Overviews, here is what actually works:

1. Maximize citation value per token

Structure your content so AI engines can cite you with minimal token overhead. That means answer-first formatting, clear entity attribution, and structured data (Article, FAQPage, BreadcrumbList schema). The less work the model has to do to extract and attribute your answer, the more likely it is to cite you under token-budget pressure. AuthorityTech's quality gate system scores exactly this.
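For the structured-data piece, here is a minimal sketch that generates schema.org FAQPage markup as JSON-LD from question and answer pairs. The helper name and the sample pair are illustrative. Whether a given engine actually reads this markup when selecting citations is the article's working assumption, not something the code demonstrates.

```python
import json

def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build a schema.org FAQPage JSON-LD document from (question, answer) pairs."""
    doc = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(doc, indent=2)

markup = faq_jsonld([
    ("What is the Jevons paradox in AI inference?",
     "Per-token costs fell, but total inference volume grew faster."),
])
print(markup)
```

The resulting string is meant to be embedded in the page inside a script element of type application/ld+json. Keeping each answer text short and self-contained matches the answer-first formatting described above.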

2. Build entity density, not page count

I covered this recently: the content volume trap is real. More pages do not mean more citations. AI engines consolidate around the highest-authority entity nodes. Build fewer, denser pages that the model recognizes as the canonical source for your category queries. In our tracking, two pages with 10 entity associations each consistently outperform 20 thin pages.

3. Monitor across the full service landscape

The same model can cite you on one provider and skip you on another. Use a multi-engine audit approach — we found only 11% citation overlap across engines. If you are only checking one AI engine, you are seeing less than a quarter of your actual visibility surface.
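One way to put a number on cross-engine overlap in your own audit is a mean pairwise Jaccard score over the domains each engine cites for the same query. The article does not say how the 11% figure was calculated, so treat this metric as an assumption; the engine names and domains below are hypothetical.

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Share of cited domains two engines have in common (intersection over union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_pairwise_overlap(citations: dict[str, set[str]]) -> float:
    """Average Jaccard overlap over every pair of engines."""
    pairs = list(combinations(citations.values(), 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

# Hypothetical audit: domains each engine cited for the same query.
cited = {
    "engine_a": {"example.com", "vendor.io", "news.org"},
    "engine_b": {"example.com", "blog.dev", "wiki.net"},
    "engine_c": {"forum.app", "vendor.io", "docs.site"},
}
print(f"mean pairwise overlap: {mean_pairwise_overlap(cited):.0%}")
```

Running the same calculation per query, then averaging across your category queries, gives a figure you can track over time instead of relying on a single-engine spot check.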

4. Treat AI bot traffic as pipeline signal

LLM-referred traffic converts at 5-17x the rate of organic search. When ChatGPT-User or PerplexityBot retrieves your page, that is a buyer research event. Instrument your analytics to capture these signals — they are the leading indicator of whether inference economics are working for or against you.
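Instrumenting for these signals can start with nothing more than user-agent matching on server logs. The sketch below counts fetches per page for the two agents named above. The log records and full user-agent strings are simplified, hypothetical examples, and each vendor's current documentation should be the source of truth for the exact strings to match.

```python
from collections import Counter

# Substrings named in the text. Vendors document their own current
# user-agent strings, so treat this list as a starting assumption.
AI_AGENT_MARKERS = ("ChatGPT-User", "PerplexityBot")

def classify_user_agent(user_agent: str) -> str:
    """Return the matching AI agent marker, or 'other' if none matches."""
    lowered = user_agent.lower()
    for marker in AI_AGENT_MARKERS:
        if marker.lower() in lowered:
            return marker
    return "other"

def count_ai_fetches(records: list[tuple[str, str]]) -> Counter:
    """Count AI-agent fetches per (path, agent) from (path, user_agent) records."""
    counts: Counter = Counter()
    for path, user_agent in records:
        agent = classify_user_agent(user_agent)
        if agent != "other":
            counts[(path, agent)] += 1
    return counts

# Hypothetical log records, simplified to (path, user-agent).
records = [
    ("/pricing", "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"),
    ("/pricing", "Mozilla/5.0 (compatible; PerplexityBot/1.0)"),
    ("/blog/post", "Mozilla/5.0 (Windows NT 10.0) Chrome/124.0"),
    ("/pricing", "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"),
]
print(count_ai_fetches(records))
```

User-agent strings can be spoofed, so counts like these are a directional signal rather than an exact measure. Pairing them with referral data from AI engines gives a fuller picture of which pages buyers are researching.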

The Bottom Line

Inference economics is not an infrastructure topic. It is a visibility strategy topic. The 1,000x cost decline in AI inference is creating a massive expansion in AI-mediated buyer research, while the Jevons paradox ensures total query volume grows faster than costs fall. Inside that expanding system, citation decisions are shaped by token economics — and the brands that make themselves easy to cite under token-budget constraints will compound their advantage.

The operators who understand this will build for citation efficiency. Everyone else will keep measuring clicks in a zero-click world.

Frequently Asked Questions

What is inference economics in the context of AI visibility?

Inference economics refers to the cost structure of running large language models on buyer queries. Every AI-generated answer requires GPU compute measured in tokens. The economics of that compute — how much it costs, how engines optimize for token efficiency, and how citation decisions are affected by cost pressure — directly determines which brands appear in AI answers.

How much does it cost an AI engine to generate one answer?

For GPT-4-class models, the cost is roughly $0.01-0.03 per answer as of mid-2026, depending on response length and provider. Open models can be 10-100x cheaper. The cost has dropped 1,000x in three years, and Andreessen Horowitz projects continued 10x annual declines. These costs are borne by the AI engine operator, not the brand, but they shape which sources the engine chooses to cite.

Does the cost of inference affect which brands get cited?

Yes. Every citation in an AI answer costs additional tokens for retrieval, evaluation, and attribution. Under token-budget optimization, AI engines favor the minimum number of high-authority sources needed to produce a trustworthy answer. Brands with structured, extractable, entity-dense content are cheaper to cite and therefore cited more frequently.

How should brands adapt their content strategy to inference economics?

Focus on citation efficiency: structure content so AI engines can extract and attribute your answers with minimal token overhead. Use answer-first formatting, structured data markup, and high entity density. Monitor citation performance across multiple AI engines and providers, since the same model can behave differently depending on how it is served.

What is the Jevons paradox in AI inference?

The Jevons paradox describes how making a resource cheaper or more efficient to use can increase total consumption instead of reducing it. Applied to AI inference: per-token costs have fallen 10x per year, but total inference volume has grown more than 100x. This means the total surface area of AI-mediated buyer research is expanding rapidly, creating more opportunities for citation-ready brands.