A reference glossary of AI visibility, GEO, crawlability, retrieval, and machine-readable knowledge terms — including canonical definitions for concepts introduced by the AI Knowledge Signal Framework.
| Term | Direct Answer | Why it Matters |
|---|---|---|
| Accessibility | Whether AI crawlers, search engines, and answer systems can technically reach and read your website content. | If content is blocked, hidden, or poorly served, AI systems may never see it. |
| AI Crawler | A bot used by AI companies, search engines, or data providers to discover, visit, and collect web content. | Crawlers are often the first step in whether your content can enter AI search, retrieval, or training pipelines. If they can't access or interpret it, you effectively don't exist to AI systems. |
| AI Knowledge Signal | The set of signals your organisation sends across the web that help AI systems understand, trust, retrieve, and cite you. | Strong knowledge signals increase the chance your organisation is accurately represented in AI-generated answers. |
| AI Visibility | How often and how accurately your organisation appears in AI-generated responses. | Visibility is becoming as important as traditional search ranking because users increasingly receive answers without clicking through to websites. |
| Answer Engine | A system that gives users a direct answer rather than only a list of search results. | ChatGPT, Perplexity, Gemini, Copilot, and Google AI Overviews are shifting discovery from ranked links to generated answers. |
| Answer Search Optimisation | The practice of optimising content so AI answer engines can find, understand, and use it in responses. | It moves beyond traditional SEO by focusing on answer inclusion, representation, and citation — not just ranking on a search results page. |
| Authority | The perceived credibility, expertise, and trustworthiness of your organisation, website, or content. | AI systems are more likely to rely on sources that appear authoritative, consistent, and well-supported. |
| Citation Probability | The likelihood that an AI system will cite, reference, or draw from your content in an answer. | Higher citation probability means greater influence, visibility, and trust in AI-mediated discovery. |
| Common Crawl | A large, publicly available archive of web crawl data collected from across the internet. | Many AI datasets and research pipelines have used Common Crawl as a raw input, so being present and well-structured on the open web can affect downstream AI visibility. |
| Conversational Search | A mode of search where users ask natural-language questions and receive synthesised answers rather than a list of keyword-matched results. | Discovery is shifting from "search and click" to "ask and receive", so organisations must optimise for answer inclusion, not just search ranking. As conversational search becomes the dominant interface for knowledge retrieval, content structured for synthesis rather than click-through gains representation advantages that traditional SEO does not address. |
| Corpus Survival Likelihood (coined term) | The estimated probability that a piece of content survives AI training-data quality filters rather than being removed, ignored, or down-weighted before model training. | Content can be crawled but still fail to influence AI systems if it is filtered out for quality, duplication, low authority, poor structure, or weak provenance. The AI Knowledge Signal Audit scores content against the dimensions that determine this likelihood. |
| Crawlability | Whether bots can technically access, navigate, and index your website. | Poor crawlability means important pages may be missed, even if the content itself is strong. |
| Direct Answer | A clear, concise answer to a specific question, usually placed near the top of a page or section. | AI systems favour content that answers questions directly and can be extracted cleanly. |
| Entity | A distinct person, organisation, place, product, concept, or topic that machines can identify and connect to other information. | Clear entity signals help AI systems understand who you are, what you do, and how you relate to your market. |
| Epistemic Presence Strategy (coined term) | The deliberate design of organisational knowledge so it remains visible, trusted, and influential inside machine-mediated knowledge systems. | Reframes content strategy around AI-era influence: not just publishing information, but ensuring that information survives, is understood, and shapes answers. Distinct from SEO (optimising for search ranking) and authority optimisation (optimising for credibility signals). Epistemic presence strategy targets accurate representation within AI training corpora and retrieval pipelines. |
| Epistemic Risk (coined term) | The risk that organisational knowledge changes meaning as it moves through AI systems due to summarisation, distortion, circular authority, provenance loss, or outdated source reuse. | AI systems can misstate, flatten, or recontextualise organisational knowledge, creating reputational, strategic, legal, and trust risks. This risk does not arise from malicious intent. It arises from how machine learning systems ingest, compress, and re-express knowledge during training and retrieval. |
| Extractability | How easily AI systems can pull useful meaning, facts, answers, entities, and relationships from your content. | If content is vague, buried, visual-only, or poorly structured, AI systems may ignore or misunderstand it. |
| Generative Engine Optimisation (GEO) | The practice of structuring digital content and managing online presence to improve visibility in responses generated by generative AI systems, including ChatGPT, Google Gemini, Claude, and Perplexity AI. Related terms: answer engine optimisation (AEO), artificial intelligence optimisation (AIO). | Where SEO targets ranking in search result pages, GEO targets accurate representation in AI-synthesised answers. GEO is a whole-of-entity strategy: AI systems synthesise signals from your entire digital presence, not a single URL. See also the Wikipedia article "Generative engine optimization". |
| JSON-LD | A JSON-based structured data format used to describe webpages, organisations, articles, FAQs, products, and other entities in machine-readable form. | JSON-LD helps machines interpret your content more accurately and is a common format for publishing schema markup. |
| Knowledge Base | A structured collection of important information, definitions, FAQs, services, evidence, and organisational knowledge. | A strong knowledge base gives AI systems a clear source of truth about your organisation. |
| Large Language Model (LLM) | An AI model trained on large volumes of text and other data to generate, summarise, classify, and reason with language. | LLMs increasingly mediate how users discover, interpret, and trust organisational information. |
| llms.txt | A proposed website file that points AI systems to useful pages, policies, or machine-readable guidance about your content. | It may become a useful signal for guiding AI crawlers and answer engines to high-value content. |
| Machine Knowledge Readiness (MKR) | The degree to which an organisation's public web presence exposes the technical, structural, and semantic signals that support AI-mediated discovery, retrieval, citation, and representation. | MKR reframes web presence around what machines can read, not just what humans can see. An organisation can rank well in traditional search yet have low MKR — invisible to the AI systems through which audiences increasingly discover and decide. Introduced in the AI Knowledge Signal Framework as the organisational-readiness counterpart to Corpus Survival Likelihood (which scores individual content). |
| Measurement | Tracking how your organisation appears across AI systems over time. | Without measurement, you cannot tell whether your AI visibility is improving, declining, or being captured by competitors. |
| Metadata | Descriptive information about a webpage, such as title, description, author, date, topic, and structured tags. | Metadata helps search engines and AI systems understand what a page is about. |
| Model Collapse | The progressive degradation of AI model quality that occurs when models are trained on AI-generated content rather than original human knowledge. As AI-generated text enters the web at scale, the distinction between primary and derivative knowledge erodes. | Model collapse is a structural risk for AI training pipelines: when Common Crawl fills with AI-generated content, each generation of models trains on the distorted output of the last. Organisations that publish original, epistemically rigorous content act as anchors against this degradation. See: Shumailov et al. (2024), Nature, "AI models collapse when trained on recursively generated data." |
| Presence | The breadth, depth, and consistency of your organisation's appearance across the web. | AI systems often rely on repeated, corroborated signals from multiple sources. |
| Prompt Testing | Testing common user questions across AI systems to see how your organisation, competitors, and key topics appear. | It helps identify gaps, inaccuracies, missed citations, competitor visibility, and opportunities for improvement. |
| Provenance | Information about where content came from, who produced it, when it was created, and how it has changed. | Clear provenance helps AI systems and users assess reliability, authority, and freshness. |
| Retrieval | The process where an AI system finds relevant information before generating an answer. | If your content is not retrievable, it is unlikely to influence AI-generated responses. |
| Retrieval-Augmented Generation (RAG) | An AI pattern where a model retrieves external information before generating an answer. | RAG makes source quality, structure, and retrievability central to whether your content appears in AI answers. |
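| robots.txt | A website file that gives instructions to crawlers about what they can or cannot access. | It affects whether search, AI, and data crawlers are allowed to visit parts of your site. Compliance is voluntary: the file states a policy for well-behaved crawlers and does not technically block access. |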
| Schema Markup | Structured data added to webpages, often using JSON-LD, to explain entities, pages, products, FAQs, articles, and organisations. | Schema helps machines interpret content more accurately and confidently. |
| Search Engine Optimisation (SEO) | The practice of improving the visibility and overall performance of websites and web pages in search engine results pages (SERPs). It focuses on increasing the quantity and quality of traffic from unpaid (organic) search results. | Key strategies include creating high-quality content, keyword research, on-page optimisation, improving user experience, and building backlinks. SEO addresses click-through visibility; GEO addresses AI-synthesised representation. |
| Search Engine Results Page (SERP) | The page of results returned by a search engine in response to a query. | SERPs are where search visibility is won or lost, and answer engines that retrieve from live search results often draw on the sources that rank here. |
| Semantic Search | Search based on meaning and intent rather than exact keyword matching. | It rewards clear concepts, entities, relationships, and context rather than simple keyword repetition. |
| Sitemap | A file that lists important website pages and helps crawlers discover them. | A good sitemap improves discovery of key content by search engines and AI crawlers. |
| Source Consistency | The alignment of facts about your organisation across your website, directories, articles, profiles, and third-party sources. | Inconsistent information can reduce trust and cause AI systems to misrepresent your organisation. |
| Structured Content | Content organised with clear headings, summaries, lists, tables, FAQs, definitions, and logical sections. | Structure makes content easier for AI systems to parse, retrieve, and reuse. |
| Training-Time Visibility | The chance your content is included in datasets used to train, fine-tune, or improve AI models. | Content that is filtered out during training may have less influence on model knowledge. |
| Trust Signal | Evidence that supports the reliability of your content, such as citations, authorship, dates, expertise, external references, and corroboration. | Trust signals help AI systems decide whether your content should be used, cited, ignored, or down-weighted. |
| Vector Embedding | A numerical representation of content that captures meaning so machines can compare concepts, documents, and queries. | Embeddings support semantic search and retrieval, so clear content improves how your organisation is matched to user questions. |
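
The Accessibility, AI Crawler, Crawlability, and robots.txt entries above all come down to one checkable question: is a given crawler allowed to fetch a given URL? The following is a minimal sketch of that check using Python's standard-library `urllib.robotparser`. The robots.txt content, the `example.org` URLs, and the helper name `crawler_access` are illustrative assumptions, not part of the AI Knowledge Signal Framework. `GPTBot` and `CCBot` are the user-agent tokens published by OpenAI and Common Crawl respectively.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def crawler_access(robots_txt: str, user_agents: list[str], url: str) -> dict[str, bool]:
    """Report, per user agent, whether the robots.txt rules permit fetching url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in user_agents}


if __name__ == "__main__":
    agents = ["GPTBot", "CCBot"]
    for url in (
        "https://example.org/glossary",
        "https://example.org/private/report",
    ):
        print(url, crawler_access(ROBOTS_TXT, agents, url))
```

With these sample rules, `GPTBot` is refused under `/private/` while `CCBot` falls through to the wildcard group and is allowed everywhere. A check like this only reflects the stated policy. It says nothing about whether a page is rendered, structured, or served in a way a crawler can actually interpret.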