Blog — AI Knowledge Signal

How AI Training Pipelines Actually Work: From Web Crawl to Language Model

AI Knowledge Strategy April 2026

Most content doesn't fail AI training pipelines because it's wrong — it fails because it lacks the structural signals that pipeline filters are calibrated to detect. Here is the five-stage process that decides what AI systems know, and what they don't.

Christopher Foster-McBride

The Atlas of AI Training Data: Every Major Dataset Powering Large Language Models

LLM Training Data April 2026

Every frontier LLM traces its capabilities to roughly 50 datasets — most derived from a single source, Common Crawl — and that finite pool is approaching exhaustion. This structured reference profiles every major dataset: provenance, scale, legal status, and what the data wall means for AI development.

Christopher Foster-McBride

What Large Language Models Are Actually Trained On: A Comprehensive Audit of LLM Training Data

AI Knowledge Strategy April 2026

Frontier LLMs train on corpora ranging from 300 billion to 40 trillion tokens — yet most developers treat data composition as proprietary. This audit documents what is publicly known about training data across every major model family, maps the legal cases rewriting data sourcing rules, and quantifies the industry's transparency collapse using Stanford FMTI scores.

Christopher Foster-McBride

Why GEO/ASO Is Critical Right Now — And What to Do About It

GEO & ASO April 2026

Generative Engine Optimisation (GEO) and AI Search Optimisation (ASO) are reshaping how brands are found, cited, and trusted. This article explains the shift, what it demands of your content, and how structured knowledge publication gives you a systematic response.

Christopher Foster-McBride

GEO & ASO: How to Structure Web Content So AI Systems Cite It

GEO & ASO April 2026

Search behaviour is shifting: generative AI systems now surface answers directly, bypassing click-through entirely. GEO and ASO are the disciplines that determine whether your content is cited, paraphrased, or ignored. This explainer defines both terms, distinguishes them from SEO, and gives a structured method for producing citation-worthy content.

Christopher Foster-McBride