What data sources do AI models use to answer questions about brands?

Quick answer

AI models draw on four kinds of sources: training data (the corpus the model learned from), retrieval data (live web pages fetched at query time), structured knowledge (Wikidata, the Google Knowledge Graph), and, increasingly, user-generated content such as Reddit, YouTube, and forums.

Modern AI engines draw on four source categories. The first is the training corpus: a snapshot of the public web up to the model’s training cutoff, including news archives, Wikipedia, books, academic papers, and large amounts of forum and social content. The second is retrieval-augmented generation: live web pages fetched at the moment the user asks a question, used by Perplexity, ChatGPT Search, Google AI Overviews, and others. The third is structured knowledge: Wikidata, the Google Knowledge Graph, and other databases that the engines query directly for entity facts. The fourth, and the one that has shifted the picture most over the last two years, is user-generated content: Reddit threads, YouTube transcripts, podcast episodes, and platform-specific forums that classic SEO ignored. A reputation program that influences only Google search results misses the second, third, and fourth categories.
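The four categories above can be turned into a simple coverage check. The sketch below is illustrative only: the channel names, the `CHANNEL_CATEGORY` mapping, and the `missing_categories` helper are assumptions made for this example, not part of any real tool, and a real channel may feed more than one category (Reddit content, for instance, can also end up in a training corpus).

```python
from enum import Enum


class SourceCategory(Enum):
    """The four source categories described in the text."""
    TRAINING_CORPUS = "training corpus"
    RETRIEVAL = "retrieval at query time"
    STRUCTURED_KNOWLEDGE = "structured knowledge"
    USER_GENERATED = "user-generated content"


# Hypothetical channel-to-category mapping, for illustration only.
# Each channel is assigned a single primary category to keep the sketch small.
CHANNEL_CATEGORY = {
    "news archives": SourceCategory.TRAINING_CORPUS,
    "wikipedia": SourceCategory.TRAINING_CORPUS,
    "live web pages": SourceCategory.RETRIEVAL,
    "wikidata": SourceCategory.STRUCTURED_KNOWLEDGE,
    "google knowledge graph": SourceCategory.STRUCTURED_KNOWLEDGE,
    "reddit": SourceCategory.USER_GENERATED,
    "youtube": SourceCategory.USER_GENERATED,
    "podcasts": SourceCategory.USER_GENERATED,
}


def missing_categories(channels):
    """Return the categories that none of the given channels touches.

    Unknown channel names are ignored; the result follows enum order.
    """
    covered = {
        CHANNEL_CATEGORY[name.lower()]
        for name in channels
        if name.lower() in CHANNEL_CATEGORY
    }
    return [category for category in SourceCategory if category not in covered]


if __name__ == "__main__":
    # A program that only works on news coverage and Wikipedia.
    for category in missing_categories(["News archives", "Wikipedia"]):
        print("not covered:", category.value)
```

Listing the channels a program actually works on and running them through a check like this makes gaps visible at a glance; the mapping itself is the part that would need real judgment.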

Last reviewed: 19/05/2026