The Three Lives of Schema Markup: A Beginner's Guide to How Google and LLMs Actually Use It

Schema markup is read by 3 completely different systems for 3 completely different purposes. Most articles conflate them. Here is what each one actually does, in plain English, with examples.

By: Suganthan Mohanadasan · · 16 min read

Every schema markup article you have read in the last 12 months is partially right and mostly missing the point.

On 11 May 2026, Ahrefs published a causal study of 1,885 pages and found that adding JSON-LD schema produced no significant uplift in AI citations. Half of SEO Twitter is now declaring schema dead. The other half is quoting vendor blogs claiming schema gives you a 2.5x citation boost. Both sides are talking past each other.

The reason they cannot agree is simple. Schema markup is read by 3 different systems for 3 different purposes, at 3 different points in time. Almost nobody writing about it bothers to separate them. So the same word ends up meaning 3 things depending on who is saying it.

This post walks through each one with examples a beginner can follow. By the end you will know what schema actually does, what it does not do, and which of the 3 jobs you are actually paying for when you add it to your site.

What schema markup actually is

Start at the bottom.

Imagine you have written a page about your dog. To you, the page is obviously about a golden retriever called Rex who was born in 2019. To a computer reading the raw HTML, the page is just a block of words and pictures. The computer has to guess what is a name, what is a breed, what is a date. Sometimes it guesses right. Often it does not.

Schema markup removes the guessing. You add a small block of structured data to the page, usually in a format called JSON-LD, that labels the facts in a machine-readable way.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Pet",
  "name": "Rex",
  "breed": "Golden Retriever",
  "birthDate": "2019-04-10"
}
</script>

That is it. You are telling the computer that this page is about a pet called Rex, breed Golden Retriever, born April 10, 2019. The computer has nothing left to infer. (I’m using this as an example to explain the concept and worth noting the above example is not valid schema)

Schema.org is the dictionary of available labels (Person, Product, Article, Recipe, Organization, and thousands more). JSON-LD is the format. Google, Bing, and Yandex all recommend JSON-LD because it sits in a separate <script> block and does not interfere with the visible HTML.

So far so simple. Now here is where it gets interesting. Different systems read this block of structured data for different reasons, at different times, in completely different ways. There are 3 of them.

Life 1: Google’s index pipeline

This is the original use case and the one schema was designed for.

When Googlebot fetches your page, it extracts both the visible HTML and the JSON-LD block. These feed two different downstream systems inside Google’s indexing layer. The visible content goes to the ranking pipeline, where systems like BERT, RankBrain, and Neural Matching figure out which queries the page should match. The JSON-LD feeds an entity pipeline that does the following.

  1. Identifies what entity the page is about. Rex the dog. Or your business. Or this article.
  2. Checks whether that entity already exists in Google’s Knowledge Graph (the big database of structured facts Google maintains).
  3. If it does not exist yet, Google may add it as a new entity.
  4. If it does, Google updates what it knows about it.
  5. Uses sameAs links inside your schema to connect this entity to other places it lives on the web (your Wikipedia page, your Wikidata entry, your LinkedIn profile).

Deeper version for the curious. Googlebot reads hardcoded JSON-LD on its initial fetch. JavaScript-injected schema queues for Google’s Web Rendering Service (WRS), which renders the page in a second pass and extracts the dynamically generated content. Either path ends in the same place inside Google’s index. The route just differs based on how your page delivers the schema.

JSON-LD is the cleanest, most explicit input to this pipeline, but Google treats it as a signal, not as absolute truth. If your schema conflicts with the visible content, with anchor text from inbound links, or with established authority signals, Google can override or ignore the markup entirely. This is how Google prevents schema spam. Schema’s job is to make the entity unambiguous when the other signals agree with it, not to overrule them.

This all happens at index time, which means it runs offline before any user has searched for anything. Google’s servers do this work in batch, in the background.

The output of this pipeline is the stuff most SEOs already know about.

  • Rich results in search (recipe cards, FAQ snippets, product carousels, breadcrumb trails)
  • Knowledge Panels on the right side of search results for brands and people
  • Better entity matching when someone searches for you by an ambiguous name
  • Author attribution across Google’s properties

This is why your competitor with proper Organization schema has a Knowledge Panel and you do not. It is also why writers with consistent Person schema get attributed correctly across search results, while writers without it disappear into the void.

This entire layer has nothing to do with AI search or LLMs in the direct sense. It is classic Google infrastructure that has been running since the early 2010s (Schema.org launched in 2011, the Knowledge Graph in 2012). It still works. It still matters. It is where most of schema’s value has historically lived.

At Google Search Central Live Toronto in April 2026, Google structured data engineer Ryan Levering reaffirmed that this pipeline also feeds Google’s own generative AI systems. The session recap lists 4 reasons structured data still matters in the AI era. Structured data is more precise than LLM extraction, especially on complex content like product pricing. It also lets you define information that does not appear on the visible page, including identifiers and metadata. Parsing structured data is much cheaper for Google than repeatedly inferring meaning from raw text. And the right markup focuses attention so machines do not pull in irrelevant context.

Even Google’s AI systems are downstream of the Life 1 pipeline. When Google AI Overviews mentions your brand, it is partly because of the entity work this layer did months earlier.

Life 2: LLM pretraining

This is where it gets less obvious.

ChatGPT, Claude, Gemini, and the rest of the large language models are trained on enormous quantities of text scraped from the open web. The most common source is Common Crawl, a public dataset of web pages refreshed roughly monthly.

Common Crawl captures everything when it grabs your page. The visible HTML, the meta tags, AND the JSON-LD block. Your schema markup survives the snapshot intact.

Here is where the debate gets murky. The best-documented LLM training-data cleaning pipeline is C4, the dataset Google used for T5. The T5 paper describes heuristics that strip code-like content, including lines with curly brackets (They removed any pages that contained a curly bracket), which would remove JSON-LD from most pages. So in C4 specifically, schema markup mostly does not survive into the training text.

Modern labs use much more diverse corpora (RedPajama, Dolma, FineWeb, plus heavy proprietary scraping). How each one handles JSON-LD is opaque. Microsoft has been the transparent exception. In March 2025, Fabrice Canel, Principal Product Manager at Microsoft Bing, publicly confirmed at SMX Munich that Microsoft uses structured data to support how its LLMs (Copilot) interpret web content. So at least one major LLM platform uses schema as a direct signal rather than scrubbing it out.

So how does schema earn its keep across the others? Probably as an indirect document-quality signal. Pages with clean, valid structured data tend to be technically well-built, on stable domains, with editorial discipline. These are also the signals quality classifiers tend to favour during corpus filtering. The result is hard to prove from outside, but the directional claim is reasonable. Schema may help your pages survive into training corpora even when the JSON-LD itself gets scrubbed.

If that mechanism is right, the model’s mental map of your entity ends up sharper not because it memorised your schema, but because schema travelled with the kind of pages that survived filtering.

This layer is invisible to you. You cannot run a 30-day test and measure your impact on it directly. But the directional case is plausible, and it compounds. Every new model version trains on a fresh Common Crawl snapshot. Clean schema on your site today plausibly helps shape how next year’s GPT or Claude or Gemini talks about your entity.

Life 3: LLM runtime retrieval

Now the third life. This is the one the Ahrefs causal study measured, and the one that drives most of the current debate.

When you ask ChatGPT “What did Suganthan say about WebMCP?”, ChatGPT does not just answer from memory. It fetches the live web. It hits search results, picks pages, fetches them, parses the content, and uses that to construct an answer with citations.

This happens at query time, in real time, while you wait. It is a completely different pipeline from Life 1 and Life 2.

In fact you can see in the logs the exact time it happened. (I did this search around the same time) as i have enabled web search for this to stimulate the actual real time search. You can see from Cloud flare logs the ChatGPT-User used for real time fetching.

The searchVIU experiment tested this in October 2025 across 5 systems (ChatGPT, Claude, Perplexity, Gemini, Google AI Mode). All of them behaved the same way. When the LLM fetched a live page during this retrieval step, it stripped the JSON-LD blocks and relied entirely on visible HTML content. The schema was ignored at this layer.

A caveat. Behaviour depends entirely on the scraping tool the AI is using at that moment. Standard text extractors strip scripts before parsing. Headless browsers that render the full DOM, including client-side JavaScript, can see schema and even pick up structured data that was injected dynamically. Modern AI crawlers vary in capability. The 5 systems searchVIU tested behave one way today. They might switch to full-rendering retrieval tomorrow. Build your site so it works whether the AI reads your schema or not.

That is also what the Ahrefs causal study was measuring. They tracked 1,885 pages that added schema and asked whether AI citations went up over the following 30 days. They did not. ChatGPT +2.2%, Google AI Mode +2.4%, both indistinguishable from zero. Google AI Overviews actually went down 4.6% (small, but statistically significant).

The reason is not that schema is broken. It is that the study measured Life 3, where schema is mostly not used in the way most people think. Life 1 and Life 2 operate on completely different timescales and cannot be detected in a 30-day citation window on pages that were already being cited.

One subtle nuance is worth knowing here. In February 2026, the SEO community lit up over a test popularised by Mark Williams-Cook, with original credit to Richard Barrett. They created a fake company page (DUCKYEA t-shirts) and placed the address ONLY in invalid, made-up JSON-LD. Both ChatGPT and Perplexity extracted and returned the address when asked. The reason is mechanical. LLMs tokenise the entire HTML response, including the <script> blocks that contain JSON-LD. They can see the text inside schema. But they do not parse it as structured data. They read it as plain text alongside the rest of the page.

So at Life 3, schema is functionally text. It is not being interpreted as structured information. The model is just reading the characters.

Where the confusion comes from

Most of the schema-for-AI debate happens because writers, and the tool vendors paying them, conflate these 3 lives.

You see headlines like “Schema gives you a 2.5x boost in AI citations”. That number comes from correlation studies and is then served up as if schema causes citations. It does not. Pages with schema are usually also pages with strong content, decent backlinks, established entity presence, and Knowledge Graph membership. The schema travels with all the other signals. Take it away and you would still get cited, because the other signals are doing the heavy lifting.

The other way round, when Ahrefs ran a clean causal study and found no significant uplift, the reaction was equally wrong. “Schema is dead” only makes sense if you assume schema’s job was Life 3 (forcing AI citations through runtime fetches). It was not. It still is not. The Ahrefs study explicitly flagged in its limitations that it did not test Life 1 or Life 2 effects.

Gianluca Fiorelli wrote the cleanest explanation of this so far. His argument is that the Ahrefs study is methodologically right but tests the wrong thing. Schema, in his words, is “entity infrastructure”, not a citation lever.

That phrase is the one that should reset the whole conversation.

A simpler mental model

Forget the AI hype for a second. Think of schema markup as business registration, not advertising.

You do not register your company with the government because you expect a direct sales boost. You register because being a formally recognisable legal entity is the foundation that other things sit on top of. You can sign contracts. You can open a bank account. You can be referenced in legal documents and tax filings. None of that produces revenue by itself. All of it is required for the systems that DO produce revenue to function.

Schema works the same way. Being a formally recognisable structured entity is the foundation for everything that follows. Google’s Knowledge Graph can index you properly. AI training pipelines treat your data as high confidence. Entity disambiguation works in your favour when someone searches for an ambiguous name.

As you can see there are two people with the same name and only one tall, dark and handsome SEO guy ;)

This is partly because of how I have implemented my person schema. Over time Google can tell the SEO Suganthan from the Academic Suganthan.

Adding schema does not directly cause citations or rankings. It makes you legible to the systems that decide whether to feature you. Those are very different things.

What schema actually does for you

Broken down by life.

Life 1 (Google’s index pipeline). This is where schema pays off fastest. Rich results in Google search (recipe cards, product carousels, review snippets, breadcrumbs) all require the relevant schema types. A Knowledge Panel for your brand needs Organization schema with sameAs links to Wikipedia, Wikidata, and your social profiles. Author attribution across Google’s properties only works when you have Person schema connected to your articles.

Action - Implement Organization, Person, Article, and Breadcrumb schemas at minimum. Add sameAs links to every authoritative profile you have. More importantly, connect these schemas to each other using @id node identifiers so search engines understand the relationships. An isolated Person block, an isolated Article block, and an isolated Organization block tell Google three separate stories. A nested graph with @id references says they are all part of the same entity. This is the difference between schema as decoration and schema as entity infrastructure. Verify in Google’s Rich Results Test.

Life 2 (LLM pretraining). You cannot directly measure your impact on training corpora. But you can measure your standing within the dataset that feeds them. Metehan Yeşilyurt’s free CC Rank Checker shows your Harmonic Centrality rank and PageRank inside Common Crawl’s WebGraph (lower is better). The metric is link-graph based rather than schema-specific, but the kind of sites that rank well there tend to be the kind that survive filtering into training. It is the closest available proxy for “how visible am I to LLM training data”. Beyond that, stack the deck. Consistent schema across your site builds entity recognition that compounds across snapshots.

Action - Keep your schema clean, consistent, and present across the entire site, not just on a few key pages. Do not change @id URLs once they are set, as those are how the model is learning to identify your entity across pages. Treat your entity identity as a long-term asset.

Life 3 (LLM runtime retrieval). Schema does not directly help here. Your visible content does. Clean HTML, good headings, natural language, decent formatting. That is what AI assistants actually read at retrieval time.

Action - Write content that is structured semantically with proper headings, lists, and paragraphs. Answer questions directly in visible text. Do not try to hide facts in schema and expect AI to find them. It can, technically, but it is unreliable.

One more risk worth knowing about. Google has long maintained that structured data must match the content visible on the page. Adding facts in schema that do not appear in visible HTML is a content parity violation and can trigger a manual action against your site. The Williams-Cook test was a research demonstration, not a deployment strategy. If a fact matters, put it in visible HTML and mirror it in schema.

What to do by site type

Publisher or blog. Article schema with proper Author markup. Get your Person schema right with sameAs links to your professional profiles. This earns you Knowledge Graph membership and author attribution. Do not expect schema to drive AI citations directly. The content does that. The schema makes sure the citations get attributed to you when they happen.

Ecommerce. Product schema is non-negotiable. Add ItemList, OfferCatalog, and AggregateRating where applicable. This unlocks Life 1 rich results which still send enormous amounts of traffic. AI shopping experiences (ChatGPT, Perplexity) are still early, but Bing and Google are already using your structured data when responding to shopping queries.

B2B or SaaS. Organization schema with sameAs to every credible profile you have (LinkedIn, Crunchbase, Wikipedia, Wikidata). SoftwareApplication schema on your product pages. FAQPage schema on those pages too, even though Google killed the visual rich result on 7 May 2026. The SERP feature is gone for everyone, but the schema itself remains valid code, still parsed by Google and AI systems. It has shifted from a visual SEO tactic to entity infrastructure for Life 1 and Life 2. Leave your existing FAQ schema in place and keep adding it to new pages.

Heads up on the phased removal though. FAQ reporting in Search Console and the Rich Results Test both go away in June 2026, and Search Console API support for FAQ drops in August 2026. Do not panic when those disappear from your dashboards. The schema itself is still being read.

Personal brand. Person schema linked from your homepage to your articles. sameAs to LinkedIn, X, GitHub, anywhere you publish. This is the cheapest and highest-leverage schema work you can do. Almost nobody bothers with it, which is exactly why it works.

Closing

Schema markup is not dead. It is also not a magic AI citation lever.

It is entity infrastructure. It makes you formally recognisable to the systems that decide whether and how to feature you. 3 of those systems are reading your schema right now. Google’s index pipeline handles rich results, Knowledge Graph membership, and entity disambiguation. LLM pretraining probably uses schema as an indirect quality signal, though Microsoft’s Copilot has confirmed direct use and the others are opaque. LLM runtime retrieval mostly strips it, though the picture there depends on which scraper the AI is using on the day.

If you treat schema as advertising you will be disappointed. If you treat it as registration, the value compounds quietly in the background for years.

Add it cleanly. Keep it consistent. Do not expect overnight wins. Then go write good visible content and let both work together.

That, properly, is how schema works.

Stay in the loop

I'll email you when I publish something new. No spam. No fluff.

Join other readers. Unsubscribe anytime.

Suganthan Mohanadasan
Suganthan Mohanadasan

Entrepreneur & Search Journey Optimisation Consultant. Co-founder of Keyword Insights and Snippet Digital.