Does Schema Markup Predict AI Citation? A Cross-Platform Empirical Study of Structured Data and Generative Engine Optimization

2026 · 0 citations · Preprint · Green Open Access

Authors

Kurt Fischman · Foundation for Growth Science

Abstract

This study examines whether JSON-LD schema markup independently predicts the probability that a web page will be cited in AI-generated responses. We collected 730 AI citations from ChatGPT (GPT-4o with web browsing) and Gemini (1.5 Pro with search grounding) across 75 commercial queries spanning five categories: SaaS and Technology, Health and Medical, Finance and Insurance, Professional Services, and How-To and DIY. Google top-10 organic results for the same queries were collected via SerpAPI as a control set, yielding 1,006 total unique pages analyzed for schema characteristics and domain authority (Ahrefs DR). Initial pooled analysis produced a significant negative association between schema presence and AI citation (OR = 0.546, p < .001), suggesting that schema actively reduced citation probability. This finding proved to be a methodological artifact: Google's ranking algorithm yields top-10 organic results that are systematically enriched for schema-bearing pages, which inflates schema prevalence in the non-cited control population. A within-Google diagnostic revealed that schema prevalence among AI-cited and non-cited Google pages was statistically indistinguishable (43.1% vs. 44.8%), collapsing the apparent effect entirely. Corrected models using Generalized Estimating Equations with query-clustered standard errors produced a null result for schema presence (OR = 0.678, p = .296), entity richness score (OR = 1.001, p = .833), and schema-to-query alignment (OR = 1.068, p = .626). The dominant predictor of AI citation was Google organic rank position (OR = 0.762 per position, p < .001). Position-1 pages were cited in 43% of queries in which they appeared, declining to 5% at position 7. This gradient implies that each one-position drop in rank reduces citation odds by approximately 24% (1 − 0.762 ≈ 0.24), and that AI citation behavior is substantially mediated by the search backend ranking that precedes AI-level content evaluation.
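The rank gradient reported above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the study's analysis code: it takes the reported odds ratio of 0.762 per position and, under the assumption that the ratio is constant across positions (the function names are illustrative), converts it into a percent change in odds and a cumulative odds multiplier relative to position 1.

```python
# Reported in the abstract: odds ratio per one-position drop in Google rank.
OR_PER_POSITION = 0.762


def percent_change_in_odds(odds_ratio: float) -> float:
    """Convert an odds ratio into a percent change in odds (negative = reduction)."""
    return (odds_ratio - 1.0) * 100.0


def cumulative_odds_ratio(odds_ratio: float, steps: int) -> float:
    """Odds multiplier after `steps` positions, assuming a constant per-step ratio."""
    return odds_ratio ** steps


if __name__ == "__main__":
    change = percent_change_in_odds(OR_PER_POSITION)
    print(f"Change in citation odds per position: {change:.1f}%")
    # Odds multiplier at each rank relative to rank 1 (assumption: constant OR).
    for rank in range(1, 8):
        multiplier = cumulative_odds_ratio(OR_PER_POSITION, rank - 1)
        print(f"rank {rank}: odds x {multiplier:.3f} relative to rank 1")
```

These multipliers apply to odds, not to citation probabilities, and they come from a model with other covariates, so they should not be expected to reproduce the raw 43% and 5% citation rates exactly.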
One significant exception emerged: pages implementing Product or Review schema with populated concrete attribute fields (pricing, aggregateRating, specifications) were cited at substantially higher rates than pages implementing generic schema types such as Article, Organization, or BreadcrumbList (61.7% vs. 41.6%, p = .012). This attribute-rich advantage was most pronounced among lower-authority domains (DR ≤ 60), consistent with the interpretation that factual payload in structured data partially compensates for weak authority signals. Sophisticated entity-linking techniques (Wikidata sameAs links, genuine @id cross-referencing) appeared on fewer than 4% of schema-present pages and could not be evaluated statistically. These findings support a more precise version of the schema-helps hypothesis than the practitioner consensus has articulated: attribute-rich schema that provides extractable factual content may confer modest citation advantages for lower-authority domains, while generic schema provides none. The dominant practical implication is that traditional organic rank position remains the primary lever for AI visibility, and that GEO-specific optimization efforts are most productive when directed at content quality and authority rather than generic structured data implementation.
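To put the attribute-rich comparison on the same scale as the odds ratios quoted elsewhere in the abstract, the two reported citation rates can be converted into an unadjusted odds ratio. This is an illustrative sketch only: the resulting figure is derived here from the published percentages and is not a statistic reported by the study, it ignores query clustering and domain authority, and the helper names are hypothetical.

```python
# Citation rates reported in the abstract.
ATTRIBUTE_RICH_RATE = 0.617  # Product/Review schema with populated attributes
GENERIC_RATE = 0.416         # Article, Organization, BreadcrumbList, etc.


def odds(probability: float) -> float:
    """Convert a probability in (0, 1) to odds."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return probability / (1.0 - probability)


def unadjusted_odds_ratio(rate_a: float, rate_b: float) -> float:
    """Odds of group A divided by odds of group B, with no covariate adjustment."""
    return odds(rate_a) / odds(rate_b)


def percentage_point_gap(rate_a: float, rate_b: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return (rate_a - rate_b) * 100.0


if __name__ == "__main__":
    ratio = unadjusted_odds_ratio(ATTRIBUTE_RICH_RATE, GENERIC_RATE)
    gap = percentage_point_gap(ATTRIBUTE_RICH_RATE, GENERIC_RATE)
    print(f"Unadjusted odds ratio: {ratio:.2f}")
    print(f"Gap: {gap:.1f} percentage points")
```

Reporting both the ratio and the percentage-point gap is a deliberate choice: odds ratios can look large when baseline rates are near 50%, so the absolute gap gives a more intuitive sense of the effect size.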

Topics & Keywords

Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Misinformation and Its Impacts

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.18728696
