Signal — Issue 7 — 19/05/2026 — AI Knowledge Signal

Published by Digital Human Assistants · aiknowledgesignal.io · Weekly practitioner briefing

This Week in Brief

Google Gemini has overtaken Perplexity as the second-largest AI chatbot referral source to websites, per Statcounter data published in May 2026. The shift undermines any AI visibility strategy built around only two platforms. Meanwhile, practitioner guidance on GEO (generative engine optimisation) is converging on four concrete levers: citable passage length, JSON-LD schema, llms.txt, and brand entity signals. Two new RAG preprints keep the focus on retrieval: one isolates retrieval strategy as the sole variable in a fixed biomedical pipeline, and the other retrieves over full document page images rather than OCR'd text.

AI Lab Signals

Gemini Surpasses Perplexity as #2 AI Referral Source, Statcounter Data Shows

Searchless.ai Blog · 19/05/2026

Statcounter data published in May 2026 shows Google Gemini now holds approximately 15–21% of AI chatbot referral traffic to websites, up from 2.31%, displacing Perplexity to third place. ChatGPT retains roughly 60–64% share but is down from 70%+ in early 2025. Practitioners optimising solely for ChatGPT and Perplexity are now under-indexed on a platform that has already captured a material share of referral volume.

AI Overviews Appear on 18% of All Google Queries, 57% of Long-Tail Queries

Qwestyon · 01/05/2026

According to figures cited in Qwestyon's May 2026 GEO guide, Google's AI Mode produces zero clicks on 93% of searches, while AI Overviews appear on 18% of all Google queries and 57% of long-tail ones. For practitioners, long-tail informational content now faces the highest probability of answer interception — making citation placement, not click-through, the primary success metric for that query segment.

AI Referral Traffic to Top Websites Spiked 357% Year-Over-Year in June 2025

AuraSearch · 01/05/2026

AuraSearch reports that AI referrals to top websites reached 1.13 billion visits in June 2025, a 357% year-over-year increase, with nearly 29% of buyers now turning to AI-powered search tools more frequently than traditional Google. Content with verifiable data reportedly earns 30–40% more visibility in LLM-generated answers than purely qualitative content — a signal that sourced, quantified claims are a concrete ranking lever, though these figures should be treated as vendor-reported and not independently peer-reviewed. (Unconfirmed)
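A quick way to sanity-check the headline figure is to work out the baseline it implies. A 357% increase means the new value is 4.57 times the old one, not 3.57 times. The sketch below assumes the 1.13 billion and 357% figures are as reported; the function name is ours, and the derived June 2024 number is an implication of those two inputs, not a figure AuraSearch published.

```python
def implied_baseline(current: float, pct_increase: float) -> float:
    """Return the prior-period value implied by a percentage increase.

    A rise of p% means current = baseline * (1 + p / 100).
    """
    return current / (1 + pct_increase / 100)


# Figures as reported by AuraSearch (vendor-reported, unconfirmed).
june_2025_visits = 1.13e9
yoy_increase_pct = 357

baseline = implied_baseline(june_2025_visits, yoy_increase_pct)
print(f"Implied June 2024 visits: {baseline / 1e6:.0f} million")
```

On those inputs the implied prior-year baseline is roughly 247 million visits.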

Training Data & Crawl

LLM Training Pipeline Primer: Web Corpora Remain the Dominant Data Source in 2026

Medium / Ali Salisai · 01/05/2026

A May 2026 practitioner walkthrough of the full LLM build pipeline confirms that raw web text — filtered, deduplicated, and quality-scored — remains the primary pre-training corpus for general-purpose models. The piece notes that data quality decisions made at the crawl stage propagate through every downstream capability, reinforcing why structured, entity-rich, consistently formatted content has an advantage in being retained through quality filters.

iMerit Identifies Top 10 LLM Training Datasets for 2026, Highlighting Niche Domain Corpora

iMerit · 01/05/2026

iMerit's 2026 dataset survey lists leading corpora across web-scale, instruction-tuning, code, and domain-specific categories, with MIMIC-IV (healthcare) highlighted as a widely used credentialed dataset. For GEO practitioners in regulated verticals — healthcare, legal, finance — the implication is that domain-specific LLMs are increasingly trained on curated, credentialed sources, raising the barrier for citation in those sectors.

AI Search & ASO

Perplexity Responds to 99.95% of Queries vs Google AI Overviews' 55% Coverage, and Error Rate Also Favours Perplexity

TechShali · 01/05/2026

A comparative analysis drawing on four independent studies (2025–2026) finds that Perplexity responds to 99.95% of queries versus Google AI Overviews' approximately 55% coverage (SE Ranking, 2025). On accuracy, Perplexity recorded a 13% error rate on PPC questions versus Google AI Overviews' 26% (WordStream, 2026). Perplexity's monthly active users have more than doubled to 45 million (DemandSage, 2026). For ASO practitioners, the near-universal query response rate means Perplexity citation opportunities exist across a broader content surface than AI Overviews — though Gemini's referral surge (see AI Lab Signals) means neither platform should be deprioritised.

Four Concrete GEO Levers Now Represent Practitioner Consensus: Passage Length, Schema, llms.txt, Entity Signals

W2B Agency · 02/05/2026

W2B Agency's May 2026 guide, authored by Esteban Padilla, specifies that citable passages should run 134–167 words as self-contained direct answers, schema should be delivered in JSON-LD naming entities and relationships, llms.txt should be placed at the site root as a structured summary for AI crawlers, and brand entity signals should include consistent name, founders, sameAs links, and Wikidata presence. These four levers align with guidance published independently by Navoto, Qwestyon, and ASP Marketing in the same period, suggesting emerging practitioner consensus — though no controlled trial data is yet available to weight one lever against another.
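The passage-length lever is the easiest of the four to check mechanically. The sketch below is a minimal word-count check against the 134–167 word range cited in the guide. It assumes whitespace-separated tokens are an acceptable proxy for "words"; the function names and messages are illustrative and do not come from the W2B Agency guide.

```python
MIN_WORDS = 134  # lower bound cited in the W2B Agency guide
MAX_WORDS = 167  # upper bound cited in the W2B Agency guide


def word_count(passage: str) -> int:
    """Count whitespace-separated tokens (a rough proxy for words)."""
    return len(passage.split())


def check_passage(passage: str) -> str:
    """Report whether a passage falls inside the cited word range."""
    n = word_count(passage)
    if n < MIN_WORDS:
        return f"too short: {n} words, add {MIN_WORDS - n}"
    if n > MAX_WORDS:
        return f"too long: {n} words, cut {n - MAX_WORDS}"
    return f"in range: {n} words"


# Hypothetical draft passage of 120 words.
draft = " ".join(["word"] * 120)
print(check_passage(draft))
```

A check like this only confirms length. Whether the passage reads as a self-contained direct answer still needs an editor.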

Research Radar (arXiv)

Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study

Bal, D.P. and Puhan, S. · https://arxiv.org/abs/2605.02520v1 · 04/05/2026

This paper runs a controlled comparison of five retrieval strategies (Dense Vector Search, Hybrid BM25 + Dense, Cross-Encoder Reranking, Multi-Query Expansion, and Maximal Marginal Relevance) within a fixed biomedical RAG pipeline using GPT-4o-mini and ChromaDB, isolating retrieval as the sole variable. (Pre-publication / arXiv) Because the generation model is held constant, the design measures how much answer quality moves with retrieval alone; it does not compare retrieval against generation model choice directly. For GEO practitioners, the study still offers early evidence that retrieval architecture is a major determinant of answer quality in RAG systems, which has direct implications for how knowledge bases and content corpora should be structured to survive retrieval filtering.
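Of the five strategies, Maximal Marginal Relevance is the one most directly affected by near-duplicate content, so it is worth seeing in miniature. The sketch below is a generic, pure-Python MMR re-ranker over toy two-dimensional vectors. It is not the paper's implementation: the vectors, the `lam` trade-off parameter, and the function names are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k, lam=0.5):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance.

    Each step picks the candidate maximising
    lam * relevance_to_query - (1 - lam) * max_similarity_to_already_selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: documents 0 and 1 are near-duplicates, document 2 is distinct.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

print(mmr(query, docs, k=2, lam=1.0))  # pure relevance ranking
print(mmr(query, docs, k=2, lam=0.3))  # diversity weighted more heavily
```

With `lam=1.0` the two near-duplicates are both returned. With `lam=0.3` the second slot goes to the distinct document instead, which is why pages that merely restate each other can lose retrieval slots under a diversity-aware retriever.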

Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering (MED-VRAG)

Chen, X. et al. · https://scirate.com/arxiv/2604.27724 · 01/05/2026

MED-VRAG proposes an iterative multimodal RAG framework that retrieves and reasons over full document page images — including tables, figures, and structured layouts — rather than OCR'd text chunks, scaling to approximately 350,000 pages with sub-30ms Stage-1 retrieval. (Pre-publication / arXiv) For content and GEO practitioners, the paper signals that visual document structure — including formatted tables and figures — may become a retrievable signal as multimodal RAG adoption grows, adding weight to the case for structured, visually organised content beyond plain prose.

Practitioner Takeaway

Audit your AI visibility stack for Gemini specifically: check whether your brand is cited in Gemini responses for your target queries. If it is not, prioritise Google's entity and structured data signals (JSON-LD, Google Knowledge Panel, sameAs markup). With Gemini's referral share up from 2.31% to roughly 15–21%, a Gemini citation gap is now a material pipeline gap, not a secondary concern.
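For teams starting from zero on entity markup, the sketch below generates a minimal schema.org Organization block in JSON-LD with sameAs links. Every value in it is a placeholder: the company name, founder, URLs, and the Wikidata identifier are hypothetical and must be replaced with your own. The helper function is ours, not part of any library, and the property set is a starting point rather than a complete or Google-endorsed template.

```python
import json


def organization_jsonld(name, url, founders, same_as):
    """Build a schema.org Organization description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "founder": [{"@type": "Person", "name": f} for f in founders],
        "sameAs": list(same_as),
    }


# Placeholder values only; none of these identify a real entity.
entity = organization_jsonld(
    name="Example Co",
    url="https://www.example.com",
    founders=["Jane Placeholder"],
    same_as=[
        "https://www.wikidata.org/wiki/Q00000000",
        "https://www.linkedin.com/company/example-co",
    ],
)

# Wrap the JSON in the script tag used to embed JSON-LD in a page.
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(entity, indent=2)
    + "\n</script>"
)
print(snippet)
```

Keeping the name and sameAs targets identical across every page and profile is the point of the exercise: inconsistent values work against the entity signal the markup is meant to provide.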
