Signal — Issue 11 — 16/06/2026

Published by Digital Human Assistants · aiknowledgesignal.io · Weekly practitioner briefing

This Week in Brief

Two large-scale studies, one from Ahrefs Brand Radar on AI Overview citations and one from Goodie on AI referral traffic, reframe the core GEO challenge: only 38% of AI Overview citations come from pages ranking in the top 10 organic results, while ChatGPT's share of B2B AI referral traffic has fallen from 89% to 63% as Claude, Gemini, and Perplexity claim growing slices. Meanwhile, Microsoft's MAI-Thinking-1 training-data disclosures face scrutiny over the gap between its 'clean, commercially licensed' claims and evidence of public-web and Common Crawl usage, a fault line with direct implications for content licensing and crawl-access strategy.

Market Analysis — GEO & ASO

AI Overview CTR Study v3: Declining Click-Through Rate Trend Reverses in Q1 2026

Seer Interactive · 24/04/2026

Per a study from Seer Interactive covering 53 brands, 5.47M tracked queries, and 2.43B organic impressions across full-year 2025 plus Q1 2026 actuals, the expected continued decline in organic CTR driven by AI Overviews has levelled off — and early 2026 data shows a directional reversal. For GEO practitioners, this suggests that the zero-sum framing of AI Overviews cannibalising all organic clicks may be overstated, and that tracking CTR segmented by AIO presence remains essential for isolating true impact.
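
As a rough illustration of what segmenting CTR by AIO presence can look like, here is a minimal Python sketch. It is not Seer Interactive's method: the row layout (query, AIO flag, impressions, clicks) and every number in it are hypothetical, chosen only to show the aggregation step.

```python
from collections import defaultdict

# Hypothetical rows: (query, has_ai_overview, impressions, clicks).
# The numbers are illustrative only and do not come from the Seer study.
rows = [
    ("best crm for startups", True, 1200, 18),
    ("crm pricing comparison", True, 800, 12),
    ("acme crm login", False, 500, 150),
    ("what is a crm", False, 1500, 60),
]

def ctr_by_aio_presence(rows):
    """Return aggregate CTR (clicks / impressions) per AIO-presence segment."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [impressions, clicks]
    for _query, has_aio, impressions, clicks in rows:
        segment = "aio_present" if has_aio else "aio_absent"
        totals[segment][0] += impressions
        totals[segment][1] += clicks
    return {
        segment: (clicks / impressions if impressions else 0.0)
        for segment, (impressions, clicks) in totals.items()
    }

segment_ctr = ctr_by_aio_presence(rows)
for segment, ctr in sorted(segment_ctr.items()):
    print(f"{segment}: {ctr:.2%}")
```

Aggregating clicks and impressions before dividing, rather than averaging per-query CTRs, keeps low-volume queries from dominating either segment.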

2026 AI Search Traffic Report: ChatGPT's B2B Referral Share Falls to 63%; Claude Reaches 18.5%

Goodie · 21/05/2026

Wave 2 of Goodie's longitudinal AI Search Market Share Report — drawing on GA4 referrer data from an anonymised brand panel triangulated against SimilarWeb data covering 25.77B visits (January–April 2026) — found ChatGPT's share of B2B AI referrals averaged 62.6% in March–April 2026, down from 89.1% in Wave 1 (May–August 2025). Claude rose from 1.4% to 18.5%, Gemini from roughly 2.5% to 10.6%, and Perplexity more than doubled to 7.3%. For GEO practitioners, single-engine optimisation strategies built around ChatGPT alone now leave material referral share unaddressed.
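
A practitioner wanting a comparable view of their own referral mix could start from something like the Python sketch below. It is not Goodie's methodology: the hostname-to-engine mapping is an assumption to verify against your own analytics data (engines can refer from more than one hostname), and the sample referrers are invented.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed mapping from referrer hostname to engine. Check it against your
# own referrer data before relying on it.
ENGINE_BY_HOST = {
    "chatgpt.com": "ChatGPT",
    "claude.ai": "Claude",
    "gemini.google.com": "Gemini",
    "perplexity.ai": "Perplexity",
    "www.perplexity.ai": "Perplexity",
}

def referral_share(referrers):
    """Return each engine's share of AI referrals; non-AI referrers are ignored."""
    counts = Counter()
    for url in referrers:
        host = (urlparse(url).hostname or "").lower()
        engine = ENGINE_BY_HOST.get(host)
        if engine:
            counts[engine] += 1
    total = sum(counts.values())
    return {engine: n / total for engine, n in counts.items()} if total else {}

# Hypothetical referrer URLs, as might be exported from an analytics tool.
sample = [
    "https://chatgpt.com/",
    "https://chatgpt.com/",
    "https://claude.ai/chat",
    "https://www.perplexity.ai/search",
    "https://www.google.com/",
]
shares = referral_share(sample)
for engine, share in sorted(shares.items()):
    print(f"{engine}: {share:.1%}")
```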

AI Overview Citation Study: 38% of Citations Pull From Top-10 Organic Results (4M URL Analysis)

Ahrefs Brand Radar · 16/06/2026

Ahrefs analysed 863K keyword SERPs and 4M AI Overview URLs, more than double the size of its prior study, and found that 38% of AI Overview citations come from pages ranking in the top 10 organic results for the same query, with the remaining 62% drawn from outside the first page. The analysis notes that AI Overviews have been powered by Gemini 3 since January 2026. For GEO practitioners, the data confirms that organic rank remains a relevant but insufficient signal: the majority of cited pages sit outside conventional top-10 optimisation targets.
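
The overlap metric itself is simple to reproduce on your own tracked queries. The Python sketch below assumes exact URL matching between citations and organic results, which may differ from how Ahrefs normalises URLs; the data structure and the sample URLs are hypothetical.

```python
def top10_overlap(serps):
    """Share of AI Overview citations that also rank in the organic top 10.

    serps: list of dicts with 'organic' (URLs in rank order) and
    'citations' (URLs cited in the AI Overview for the same query).
    """
    cited = 0
    cited_in_top10 = 0
    for serp in serps:
        top10 = set(serp["organic"][:10])
        for url in serp["citations"]:
            cited += 1
            if url in top10:
                cited_in_top10 += 1
    return cited_in_top10 / cited if cited else 0.0

# Hypothetical data for two queries.
serps = [
    {
        "organic": ["https://a.example/", "https://b.example/", "https://c.example/"],
        "citations": ["https://a.example/", "https://x.example/", "https://y.example/"],
    },
    {
        "organic": ["https://d.example/", "https://e.example/"],
        "citations": ["https://e.example/", "https://z.example/"],
    },
]
overlap = top10_overlap(serps)
print(f"{overlap:.0%} of citations rank in the organic top 10")
```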

AI Search & ASO

Conductor 7-Month Study: Each AI Engine Has a Distinct 'Editorial Identity' for Source Selection

Conductor · 07/05/2026

A 7-month tracking study by Conductor covering ChatGPT, ChatGPT Search, Perplexity, Google AI Overviews, Google AI Mode, Gemini, and Claude (September 2025–March 2026; 1,056 data points) found that each engine exhibits a persistent, intent-specific source preference: ChatGPT and ChatGPT Search are alone in surfacing Wikipedia; Perplexity and Google Gemini both favour YouTube across most intents; Google AI Mode preferentially routes users back to Google properties. For GEO practitioners, this signals that a single content strategy cannot achieve consistent citation coverage across the full AI search ecosystem — engine-level optimisation is now a practical requirement, not an edge case.

Perplexity Integrates Deep Research Into 'Computer' Feature, Routing Tasks Across 20+ Models

OpenTools · 13/06/2026

Perplexity has moved its Deep Research capability into its 'Computer' agentic feature, enabling multi-step research tasks to be routed dynamically across more than 20 AI models. For ASO and GEO practitioners, the shift matters because agentic research pipelines retrieve and synthesise content differently from single-turn queries — content structured for extractability and multi-hop reasoning is more likely to survive intact through a chained retrieval workflow than prose optimised purely for conversational snippets.

AI Lab Signals

Microsoft MAI-Thinking-1: 30 Trillion Token Pre-Training Corpus Raises Clean-Data Questions

Winbuzzer · 05/06/2026

Microsoft's 109-page MAI-Thinking-1 technical report describes a 30-trillion-token pre-training corpus — 54.6% code — positioned as fully human-authored and commercially licensed, with no open-source or HuggingFace data included. However, scrutiny of the materials revealed references to public-web and Common Crawl data, raising questions about whether public pages were formally licensed or simply crawled as accessible content; Microsoft states its crawler respects robots.txt opt-out controls. For GEO practitioners, the disclosure reinforces that robots.txt and structured crawl-access signals remain active levers in determining whether content enters frontier model training pipelines — not just real-time retrieval.
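
To check how a given robots.txt treats specific crawlers, Python's standard-library parser is enough for a first pass. The sketch below is illustrative only: the robots.txt content and URL are invented, and the user-agent tokens CCBot and GPTBot are used as examples of publicly documented crawler tokens, not as the crawler Microsoft used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block CCBot entirely, keep GPTBot out of
# /research/, and allow every other crawler.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /research/

User-agent: *
Allow: /
"""

def crawl_access(robots_txt, user_agents, url):
    """Return {user_agent: allowed} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in user_agents}

access = crawl_access(
    ROBOTS_TXT,
    ["CCBot", "GPTBot", "Googlebot"],
    "https://example.com/research/report.html",
)
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

This only reports what the file requests; whether a crawler honours it depends on the crawler.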

Epoch AI Projection: High-Quality Public Internet Text Approaching Full Utilisation for AI Training

AI Advances (Medium / ai.gopubby.com) · 11/06/2026

An analysis citing Epoch AI research forecasts that the supply of high-quality human-generated text on the public internet will be fully utilised by AI training pipelines in the near term, a dynamic the piece characterises as a structural 'data debt' now driving labs toward synthetic data and licensed private corpora. Note: the underlying Epoch AI study is referenced but not directly linked in the source; treat the projected timeline as indicative rather than confirmed. For GEO practitioners, the trend strengthens the case for publishing original, human-authored research and proprietary data: content types that remain scarce and disproportionately valuable to retrieval systems as commodity web text saturates.

Training Data & Crawl

MAI-Thinking-1 Discloses No Data Providers Despite Naming Every Tool in the Pipeline

Kili Technology · 16/06/2026

A detailed breakdown of Microsoft's MAI-Thinking-1 dataset documentation found that while every tool in the training pipeline is named, no data provider is disclosed, including the vendors behind the human preference data used to train safety behaviour. The breakdown notes that the 'no synthetic data' claim holds for pre-training but breaks in reinforcement learning, where both SWE problems and tool-use environments are synthesised; agentic RL retained 265,617 verified SWE environments from 102M GitHub pull requests. The source cites a 5.5% survival rate; measured against the full 102M pull requests the ratio is roughly 0.26%, so the 5.5% figure presumably applies to an intermediate filtering stage. For practitioners managing content licensing, the opacity around data providers makes it difficult to verify whether specific content assets are present in the training corpus or covered by any licensing arrangement.

Research Radar (arXiv)

Retrieval-Augmented Generation for Natural Language Processing: A Survey

Multiple authors (open access) · https://link.springer.com/article/10.1007/s10462-026-11605-7 · 01/06/2026

Published in Artificial Intelligence Review (Springer, open access), this systematic survey introduces a taxonomy of retrieval fusion strategies — query-based, logits-based, latent, and parametric — and provides structured comparisons across accessibility, efficiency, and use cases for RAG systems applied to NLP tasks. For GEO practitioners, the taxonomy is directly applicable: understanding which fusion method an AI search engine is likely to employ helps predict whether passage-level extraction, entity-dense structuring, or parametric knowledge reinforcement will be the dominant citation pathway for a given content type.

Graph RAG: When Knowledge Graphs Beat Vector Search

Perivitta Rajendran · https://pr-peri.github.io/ai-engineering/2026/06/01/graph-rag.html · 01/06/2026

This engineering analysis demonstrates that standard vector-similarity RAG systematically fails on multi-hop relational queries — questions whose answers require connecting entities across multiple documents — while Graph RAG, which builds a structured knowledge graph from the corpus, resolves these cases by making entity relationships explicit and traversable. For GEO practitioners, the implication is concrete: content that explicitly names entities, defines relationships between them, and uses structured markup (Schema.org, JSON-LD) is better positioned for citation in the class of AI answers that require synthesising across multiple sources rather than extracting a single passage.
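
A toy Python sketch can make the multi-hop point concrete. It is not the author's implementation, and real Graph RAG systems extract entities and relations at scale; here the triples are written by hand and every name is invented. The answer depends on joining a fact from one document with a fact from another, which is only possible because each relationship is stated explicitly.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples, as might be extracted
# from two separate documents. All names are invented for illustration.
TRIPLES = [
    ("Acme Analytics", "acquired_by", "Globex"),   # from document A
    ("Globex", "headquartered_in", "Rotterdam"),   # from document B
]

def build_graph(triples):
    """Index triples as graph[subject][relation] = object."""
    graph = defaultdict(dict)
    for subject, relation, obj in triples:
        graph[subject][relation] = obj
    return graph

def follow(graph, start, relations):
    """Walk a chain of relations from start; return None if any hop is missing."""
    node = start
    for relation in relations:
        node = graph.get(node, {}).get(relation)
        if node is None:
            return None
    return node

graph = build_graph(TRIPLES)
# "Where is the company that acquired Acme Analytics headquartered?"
answer = follow(graph, "Acme Analytics", ["acquired_by", "headquartered_in"])
print(answer)
```

If either document had left the relationship implicit, the second hop would have nothing to traverse and the chain would return None.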

Practitioner Takeaway

Audit your GEO programme against the multi-engine reality documented in Conductor's 7-month citation study and Goodie's Wave 2 referral data: if your content and measurement strategy is built primarily around Google AI Overviews and ChatGPT, you are now structurally blind to Claude (18.5% of B2B AI referrals), Gemini, and Perplexity. This week, map your top 20 target queries across at least four engines (ChatGPT, Perplexity, Google AI Overviews, and Claude) and record which source types each engine favours for those intents. Use the gap between your current citations and each engine's revealed source preferences to prioritise your next content investment, whether that is YouTube-hosted explainers for Gemini and Perplexity, Wikipedia-adjacent authority signals for ChatGPT, or answer-first structured pages for Google AI Overviews.
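
One way to keep that audit consistent is to log each observed citation and tally it per engine. The Python sketch below is a minimal illustration: the record layout, the source-type labels, and the sample observations are all hypothetical, and collecting the observations (manually or via a tracking tool) is left out.

```python
from collections import Counter, defaultdict

# Hypothetical audit log: (engine, query, cited source type, is_own_domain).
observations = [
    ("ChatGPT", "best crm for startups", "wikipedia", False),
    ("ChatGPT", "what is a crm", "wikipedia", False),
    ("ChatGPT", "best crm for startups", "vendor_site", True),
    ("Perplexity", "best crm for startups", "youtube", False),
    ("Perplexity", "crm pricing comparison", "youtube", False),
    ("Claude", "crm pricing comparison", "vendor_site", False),
]

def source_preferences(observations):
    """Count cited source types per engine."""
    prefs = defaultdict(Counter)
    for engine, _query, source_type, _is_own in observations:
        prefs[engine][source_type] += 1
    return prefs

def citation_gaps(observations):
    """Per engine: the most-cited source type and whether our domain was cited."""
    prefs = source_preferences(observations)
    engines_citing_us = {engine for engine, _q, _s, is_own in observations if is_own}
    return {
        engine: {
            "top_source": counts.most_common(1)[0][0],
            "cited": engine in engines_citing_us,
        }
        for engine, counts in prefs.items()
    }

gaps = citation_gaps(observations)
for engine, gap in gaps.items():
    status = "cited" if gap["cited"] else "not cited"
    print(f"{engine}: favours {gap['top_source']}, own domain {status}")
```

Engines where the own domain is not cited, paired with the source type that engine favours, give a first-pass priority list for the next content investment.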
