
Open-Source AI Search Agents Now Outperform GPT-5.4 on Recall — Here Is What That Means for Your Brand Visibility

An open-source AI search agent just beat GPT-5.4 on information recall. The retrieval surface your brand needs to appear in is fracturing — and most marketing teams are only measuring three engines. Here is what the research says about getting found across the full AI discovery surface.

Christian Lehman, Jun 12, 2026

An open-source 20-billion-parameter AI search agent called Harness-1 just scored 73% average recall on information retrieval benchmarks. That beats GPT-5.4's 70.9% by 2.1 percentage points and the next-strongest open-source search agent by 11.4 percentage points. That is not a model curiosity. It means the set of AI retrieval systems that can surface, or bury, your brand just widened, and most marketing teams are still only tracking three engines.

The AI Retrieval Surface Is No Longer Three Engines

I have been telling operators to measure brand visibility across ChatGPT, Perplexity, Gemini, Claude, and Google AI Mode. That was the right list six months ago. It is not sufficient now.

Harness-1 separates the search decision-making from the bookkeeping that bogs down traditional AI search agents. The model handles what to search and which documents to keep; a state-externalizing harness manages working memory, candidate pools, and verification records. The result is a leaner agent that generalizes well to domains it was never trained on — gains were "especially strong on held-out transfer benchmarks."
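To make that split concrete, here is a minimal sketch of the general pattern, not the Harness-1 implementation. The names `SearchState`, `run_harness`, and the three toy callables are hypothetical. The "model" only makes two decisions (what to search next, whether to keep a document), while the harness owns the bookkeeping.

```python
from dataclasses import dataclass, field


@dataclass
class SearchState:
    """Bookkeeping held by the harness, outside the model (hypothetical layout)."""
    queries_tried: list = field(default_factory=list)   # working memory
    candidates: dict = field(default_factory=dict)      # candidate pool: doc_id -> text
    verified: set = field(default_factory=set)          # verification records


def run_harness(question, propose_query, keep_document, search, max_steps=3):
    """Loop that delegates decisions to the model and stores all state itself."""
    state = SearchState()
    for _ in range(max_steps):
        # Decision 1 (model): what to search next, given what was already tried.
        query = propose_query(question, state.queries_tried)
        if query is None:
            break
        state.queries_tried.append(query)
        for doc_id, text in search(query):
            if doc_id in state.candidates:
                continue  # the harness deduplicates; the model never sees repeats
            state.candidates[doc_id] = text
            # Decision 2 (model): is this document worth keeping?
            if keep_document(question, text):
                state.verified.add(doc_id)
    return state


# Toy stand-ins for the model and the search backend (illustrative only).
CORPUS = {"d1": "acme crm pricing", "d2": "weather report", "d3": "acme crm reviews"}


def toy_search(query):
    return [(i, t) for i, t in CORPUS.items() if any(w in t for w in query.split())]


def toy_propose(question, tried):
    remaining = [q for q in ["acme crm", "reviews"] if q not in tried]
    return remaining[0] if remaining else None


def toy_keep(question, text):
    return "acme" in text


state = run_harness("best crm?", toy_propose, toy_keep, toy_search)
print(sorted(state.verified), state.queries_tried)
```

The design point is that the decision functions stay small and stateless, which is one plausible reason an agent built this way would transfer to unfamiliar domains.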

What makes this operationally relevant: Harness-1 is open-source, 20B parameters, and runnable by any company that wants to embed search into its product. The same pattern applies to ManuSearch, an open multi-agent search framework designed to democratize deep research in language models. These are not experimental demos. They are production-path architectures that enterprise tools, vertical SaaS products, and internal knowledge systems will adopt to power AI-assisted discovery.

Every one of those deployments becomes another surface where your brand either appears or does not. And each one retrieves differently.

Why Half of Mid-Market Brands Never Surface in AI Recommendations

A 37,000-run audit across retrieval-augmented commercial recommendation systems measured how brand prominence maps to AI visibility. The findings are blunt:

  • L4-L5 brands (specialists, regional players): 48–52% never surfaced in any of the 37,000 runs. Not underrepresented — absent.
  • L1 brands (category leaders): Appeared in nearly all relevant retrievals but won only 25–41% of recommendation slots. Visibility is not their problem; differentiation is.
  • L2 challengers: Showed the strongest conversion rates (37–52%) but lost ground to persona-mediated substitution on Anthropic's models specifically.
  • L3 mid-market: Coverage dropped to 88% with conversion falling to 34–40% — the inflection point where retrieval architecture starts to matter more than content volume.

The researchers' conclusion: "The right marketing investment depends on where the brand sits on the prominence ladder." This is not a content quality problem. A well-written page that no AI retrieval system surfaces in its candidate pool is invisible regardless of how strong the writing is.
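If you log your own test runs, tier-level numbers like these can be approximated with a few lines of code. The sketch below assumes definitions the audit does not spell out here: coverage is the share of brands in a tier that surfaced in at least one run, and conversion is recommendations won divided by retrievals. The record layout, brand names, and function name are all hypothetical.

```python
from collections import defaultdict


def tier_metrics(runs, brand_tiers):
    """Per-tier coverage and conversion under the assumed definitions above."""
    surfaced = defaultdict(int)  # brand -> number of runs where it was retrieved
    won = defaultdict(int)       # brand -> number of runs where it was recommended
    for run in runs:
        for brand in run["retrieved"]:
            surfaced[brand] += 1
        if run["recommended"] in run["retrieved"]:
            won[run["recommended"]] += 1

    by_tier = defaultdict(list)
    for brand, tier in brand_tiers.items():
        by_tier[tier].append(brand)

    report = {}
    for tier, brands in by_tier.items():
        seen = [b for b in brands if surfaced[b] > 0]
        retrievals = sum(surfaced[b] for b in seen)
        wins = sum(won[b] for b in seen)
        report[tier] = {
            "coverage": len(seen) / len(brands),
            "conversion": wins / retrievals if retrievals else 0.0,
        }
    return report


# Made-up example data, four runs and four brands.
runs = [
    {"retrieved": {"BigCo", "Challenger"}, "recommended": "Challenger"},
    {"retrieved": {"BigCo"}, "recommended": "BigCo"},
    {"retrieved": {"BigCo", "Challenger"}, "recommended": "BigCo"},
    {"retrieved": {"BigCo", "LocalA"}, "recommended": "BigCo"},
]
brand_tiers = {"BigCo": "L1", "Challenger": "L2", "LocalA": "L4", "LocalB": "L4"}

report = tier_metrics(runs, brand_tiers)
print(report)
```

In this toy data, LocalB never enters a candidate pool, so the L4 tier shows 50% coverage no matter how good LocalB's pages are.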

Different Engines Have Different Source Preferences — And That Changes Everything

A 12-model study across six providers found that large language models have systematic, latent source preferences — they prioritize information from some sources over others in ways that persist even when explicitly instructed to avoid bias. Source preference can outweigh the actual quality or relevance of the content itself.

That means optimizing for one engine's preferences can actively work against you in another. A page that ChatGPT trusts and absorbs may be deprioritized by a Perplexity-powered retrieval stack or an enterprise agent built on an open-source backbone.

Supporting this: a study of Google Search, Gemini, and AI Overviews found that 51.5% of representative real-user queries now trigger AI Overviews displayed above organic results. The source overlap across these platforms measured less than 0.2 on the Jaccard similarity index — meaning the sources each platform retrieves for the same query are almost entirely different. And websites blocking Google's AI crawler faced reduced visibility in AI summaries even when their content was otherwise accessible.
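For readers unfamiliar with the metric, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; the domain names and platform labels are made up, and treating two empty source lists as zero overlap is a convention chosen here.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here: no sources on either side means no overlap
    return len(a & b) / len(a | b)


def pairwise_overlap(sources_by_platform):
    """Jaccard score for every pair of platforms, keyed by (name, name)."""
    return {
        (p, q): jaccard(sources_by_platform[p], sources_by_platform[q])
        for p, q in combinations(sorted(sources_by_platform), 2)
    }


# Hypothetical sources returned for the same query on three platforms.
sources = {
    "search": {"a.com", "b.com", "c.com", "d.com"},
    "overview": {"d.com", "e.com", "f.com"},
    "chat": {"g.com", "h.com"},
}

overlap = pairwise_overlap(sources)
for pair, score in overlap.items():
    print(pair, round(score, 3))
```

Here "search" and "overview" share one domain out of six distinct ones, a score of about 0.17, which is in the same range as the sub-0.2 overlap the study reports.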

This is the operational reality: each retrieval system is a distinct discovery surface with its own source preferences, crawl behavior, and citation logic. Brands that optimize for one and assume the others will follow are losing ground systematically.

Citation Selection Is Not Citation Absorption — Measure the Right Thing

A 602-prompt measurement study across ChatGPT, Google AI Overview, and Perplexity introduced a distinction most marketing teams have not internalized: citation selection (whether a platform picks your page as a source) is a different metric from citation absorption (whether your content actually influences the generated answer).

The data:

  • Perplexity and Google cite more sources on average; ChatGPT cites fewer, but each cited page carries substantially higher average citation influence.
  • High-absorption pages shared consistent structural features: longer content, structured formatting, semantic alignment with likely queries, and rich extractable elements — definitions, data, comparisons, and step-by-step procedures.

Separately, structural feature engineering research across six generative engines found that optimizing content structure at the macro, meso, and micro levels produced a 17.3% improvement in citation rate and an 18.5% improvement in subjective quality scores. This is content architecture work, not keyword optimization.
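A first-pass structural check can be automated with simple pattern matching. The sketch below is a rough heuristic, not the feature set used in the research: the regular expressions and the feature names in `CHECKS` are assumptions chosen to mirror the extractable elements named above.

```python
import re

# Hypothetical heuristics; each pattern is a crude signal, not a guarantee.
CHECKS = {
    "definition": re.compile(r"\b(is|are) (a|an|the)\b", re.IGNORECASE),
    "numbered_steps": re.compile(r"^\s*\d+[.)]\s+\S", re.MULTILINE),
    "table_row": re.compile(r"^.*\|.*\|.*$", re.MULTILINE),
    "comparison": re.compile(r"\b(vs\.?|versus|compared (to|with))", re.IGNORECASE),
    "data_point": re.compile(r"\b\d+(\.\d+)?\s?%"),
}


def audit_structure(page_text):
    """Report which extractable elements appear at least once in the page."""
    return {name: bool(pattern.search(page_text)) for name, pattern in CHECKS.items()}


page = """A CRM is a system for tracking customer relationships.
1. Export your contacts.
2. Import them into the new tool.
Plan | Price | Seats
Churn fell 12% after migration."""

result = audit_structure(page)
missing = [name for name, present in result.items() if not present]
print(result)
print("missing:", missing)
```

A page that fails several checks is a candidate for restructuring before any new content is written.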

If you are measuring whether your brand gets mentioned in AI responses, you are measuring citation selection. What matters more is whether your content actually shaped the answer the buyer received. That is the difference between a citation and influence.
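One crude way to approximate the difference is to check how much of the generated answer's wording can be traced to each cited page. The sketch below uses shared word pairs (bigrams) as the signal. This is a stand-in proxy chosen for illustration, not the metric from the 602-prompt study, and the function names and sample texts are hypothetical.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bigrams(text):
    """Set of adjacent token pairs in the text."""
    t = tokens(text)
    return set(zip(t, t[1:]))


def absorption_score(page_text, answer_text):
    """Share of the answer's bigrams that also occur in the page (0.0 to 1.0)."""
    answer = bigrams(answer_text)
    if not answer:
        return 0.0
    return len(answer & bigrams(page_text)) / len(answer)


answer = "Acme syncs contacts in real time"
page_a = "Acme syncs contacts in real time across devices"
page_b = "Our platform offers enterprise security"

# Both pages are "selected" (cited); only one shaped the wording of the answer.
score_a = absorption_score(page_a, answer)
score_b = absorption_score(page_b, answer)
print(score_a, score_b)
```

Paraphrased influence will not show up in a lexical measure like this, so treat a low score as a prompt for manual review rather than a verdict.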

What Operators Should Measure Right Now

Here is the framework I use:

1. Multi-engine retrieval coverage. Check whether your brand surfaces across at least five AI discovery systems — ChatGPT, Perplexity, Gemini, Claude, and Google AI Mode — for your top buyer queries. If you only appear in one or two, you are invisible on the others, and open-source agents will widen that gap further.

2. Citation absorption, not just citation count. Use the selection-vs-absorption framework to audit whether your cited pages actually influence the generated answer or just appear in footnotes. If your page is cited but the answer reflects a competitor's framing, you are losing at the absorption layer.

3. Content structure audit. Run your top pages through a structural check: do they have extractable definitions, data tables, comparison frameworks, and step-by-step procedures? The 17.3% citation rate improvement from structural optimization is the cheapest leverage most brands have not pulled.

4. Prominence tier assessment. Use the brand prominence ladder to determine whether your problem is visibility (you do not surface), differentiation (you surface but do not win), or substitution (you win on some models but lose to competitors on others). The fix is different for each tier.
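The four checks above can feed a simple triage rule. The sketch below is one possible encoding, with loud assumptions: the thresholds, the field names `surface_rate` and `win_rate`, and the mapping of "wins on some engines but not all" to substitution are all hypothetical simplifications of the prominence-ladder idea.

```python
def diagnose(per_engine, surfaced_min=0.5, win_min=0.3):
    """Classify a brand's main problem from per-engine rates (assumed thresholds)."""
    surfaced = {e for e, m in per_engine.items() if m["surface_rate"] >= surfaced_min}
    winning = {e for e in surfaced if per_engine[e]["win_rate"] >= win_min}
    if not surfaced:
        return "visibility"       # the brand does not surface anywhere
    if not winning:
        return "differentiation"  # it surfaces but does not win recommendations
    if winning != set(per_engine):
        return "substitution"     # it wins on some engines and loses on others
    return "healthy"


# Made-up rates for one brand across three engines.
results = {
    "ChatGPT": {"surface_rate": 0.8, "win_rate": 0.4},
    "Perplexity": {"surface_rate": 0.7, "win_rate": 0.1},
    "Gemini": {"surface_rate": 0.6, "win_rate": 0.35},
}

verdict = diagnose(results)
print(verdict)
```

With these sample numbers the brand surfaces everywhere but loses on Perplexity, so the rule returns "substitution".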

This is what Jaxon Parrott built AuthorityTech to solve — measuring citation architecture across the full AI retrieval surface, not just individual engine checks. The Machine Relations discipline exists because the retrieval surface is too fragmented and too structurally biased for single-engine optimization to work. The Harness-1 result makes that gap wider, not smaller.

FAQ

What is Harness-1 and why should marketing teams care?

Harness-1 is a 20-billion-parameter open-source AI search agent that scored 73% average recall on retrieval benchmarks, outperforming GPT-5.4 (70.9%). It matters because open-source agents like this will power AI-assisted search in enterprise tools, vertical platforms, and internal discovery systems. Each deployment becomes a new surface where brands either appear or remain invisible — and each retrieves content differently based on distinct source preferences and structural biases.

How do you measure brand visibility across multiple AI search agents?

Start by testing your top buyer queries across at least five AI systems — ChatGPT, Perplexity, Gemini, Claude, and Google AI Mode. Track both citation selection (does the engine pick your page as a source?) and citation absorption (does your content actually shape the generated answer?). A 602-prompt study found these are different metrics with different optimization paths. AuthorityTech tracks this through citation architecture measurement across the full retrieval surface.

What content structure features increase AI citation probability?

Research across six generative engines found that optimizing content structure at macro (document architecture), meso (information chunking), and micro (formatting emphasis) levels improved citation rates by 17.3%. High-influence pages consistently feature longer, structured content with extractable elements: definitions, comparison tables, data with context, and step-by-step procedures. This is structural work, not keyword density — and it is the cheapest visibility lever most brands have not used.
