Why Scale Will Not Solve AGI | Vishal Misra - The a16z Show

March 17, 2026 46m
A16Z

Key Takeaway

LLMs perform Bayesian inference with mathematical precision—updating probability distributions as they process each token. Research proves transformers match ideal Bayesian posteriors to 10^-3 bits accuracy. This explains in-context learning: models see examples, update beliefs in real-time, and apply patterns they've never encountered before. But unlike humans, LLMs freeze after training and lack causal reasoning—they excel at correlation, not causation.

Episode Overview

Vishal Misra discusses research showing that LLMs mathematically perform Bayesian inference. The conversation explores how transformers can be viewed as giant probability matrices, why in-context learning succeeds, and the fundamental differences between AI and human cognition. Key insights include the Bayesian wind-tunnel methodology, the distinction between correlation and causation, and why current architectures cannot achieve AGI without plasticity and causal reasoning.

Key Insights

LLMs are Bayesian Inference Engines

Research shows transformers perform mathematically precise Bayesian updating, matching ideal posteriors to within 10^-3 bits. In controlled "Bayesian wind tunnels," small models trained from scratch on tasks impossible to memorize reproduced the exact Bayesian distributions, demonstrating that this is an architectural capability, not a data artifact.
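A minimal sketch of what "matching the ideal posterior to within 10^-3 bits" means, using a Beta-Bernoulli coin task where the exact Bayesian posterior predictive is known in closed form. The model probability and the specific numbers here are illustrative assumptions, not the paper's setup; the idea is simply that the gap between a predictor and the ideal posterior can be measured in bits via KL divergence.

```python
import math

def bayes_predictive(flips, alpha=1.0, beta=1.0):
    """Exact Bayesian posterior predictive P(next = heads) for a
    Beta(alpha, beta) prior after observing a list of 0/1 flips."""
    heads = sum(flips)
    return (alpha + heads) / (alpha + beta + len(flips))

def kl_bits(p, q):
    """KL divergence in bits between Bernoulli(p) and Bernoulli(q)."""
    return sum(a * math.log2(a / b)
               for a, b in [(p, q), (1 - p, 1 - q)] if a > 0)

flips = [1, 1, 0, 1, 0, 1, 1, 1]   # observed coin flips
exact = bayes_predictive(flips)    # ideal posterior predictive (0.7 here)
model = 0.69                       # a hypothetical model's P(heads)
gap = kl_bits(exact, model)        # distance from the ideal, in bits
print(f"exact posterior predictive: {exact:.4f}")
print(f"model vs ideal gap: {gap:.6f} bits")
```

A "wind tunnel" in this spirit fixes the ground-truth posterior analytically, so any deviation of the trained network from it can be read off directly in bits.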

The Matrix Abstraction of Language Models

LLMs can be understood as gigantic matrices where each row represents a possible prompt and the columns give a probability distribution over next tokens. With a 50,000-token vocabulary and an 8,000-token context window, this matrix has more rows than there are electrons in all the galaxies—but it is highly sparse, enabling compression through neural architectures.
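The row-count claim is easy to sanity-check: with vocabulary size V and context length L, there are roughly V^L distinct length-L prompts (shorter prompts add comparatively little). The number is far too large to print, so the sketch below reports its size in decimal digits; the 10^80 comparison figure for electrons in the observable universe is a standard order-of-magnitude estimate.

```python
import math

# Rough size of the "giant matrix" view: one row per possible prompt.
V = 50_000   # vocabulary size
L = 8_000    # context window length in tokens

# Number of decimal digits in V**L (the number itself is astronomically big).
digits = int(L * math.log10(V)) + 1
print(f"V**L has about {digits:,} decimal digits")
# Electrons in the observable universe: roughly 10**80, i.e. 81 digits.
```

A table with ~10^37,000 rows can obviously never be materialized, which is exactly why the sparsity and compressibility of real language matter: the network is a compact function approximating this matrix, not the matrix itself.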

In-Context Learning Updates Posterior Probabilities

When shown examples of a novel task (like a custom DSL it has never seen), an LLM progressively increases the probability weight on the correct pattern with each example. This real-time belief updating explains why few-shot learning works: the model performs Bayesian updating at inference time, not only during training.

Architecture Determines Bayesian Capability

Different architectures show varying Bayesian competence: transformers excel at all tasks, Mamba performs well on most, LSTMs handle some, and MLPs fail completely. This capability stems from architectural mechanisms (particularly attention), not training data—proven by testing blank architectures on controlled tasks.

LLMs Lack Causal Reasoning and Plasticity

Current deep learning operates at the 'association' level of the causal hierarchy—excelling at correlation but unable to perform interventions or counterfactuals. Unlike human brains that remain plastic and build causal world models, LLMs freeze after training and cannot simulate consequences or update fundamental knowledge.

The Shannon Entropy vs Kolmogorov Complexity Gap

LLMs master Shannon entropy (correlation patterns) but not Kolmogorov complexity (finding the shortest generative program). The digits of pi have maximal Shannon entropy (no exploitable statistical correlations) yet tiny Kolmogorov complexity (a short formula generates them all). This explains why LLMs won't discover relativity—they optimize for pattern matching, not compact causal theories.
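The pi example can be made concrete: a short program (low Kolmogorov complexity) emits the digit stream, while the stream's empirical Shannon entropy sits near the maximum of log2(10) ≈ 3.322 bits per digit. The sketch below uses Machin's formula with Python's `decimal` module; this is one of several classic ways to generate pi's digits, chosen here only for brevity.

```python
from collections import Counter
from decimal import Decimal, getcontext
import math

def pi_digits(n):
    """Short program generating n digits of pi (Machin's formula):
    tiny Kolmogorov complexity despite the stream's high entropy."""
    getcontext().prec = n + 10
    def atan_inv(x):  # arctan(1/x) via its Taylor series
        total, term, k = Decimal(0), Decimal(1) / x, 0
        eps = Decimal(10) ** (-getcontext().prec)
        while abs(term) > eps:
            total += term / (2 * k + 1) * (-1) ** k
            term /= x * x
            k += 1
        return total
    pi = 16 * atan_inv(Decimal(5)) - 4 * atan_inv(Decimal(239))
    return str(pi).replace(".", "")[:n]

digits = pi_digits(1000)
counts = Counter(digits)
# Empirical Shannon entropy of the digit stream, in bits per digit.
H = -sum(c / len(digits) * math.log2(c / len(digits))
         for c in counts.values())
print(f"first digits: {digits[:10]}")  # 3141592653
print(f"entropy ~ {H:.3f} bits/digit (max is log2(10) ~ 3.322)")
```

A next-token predictor trained on these digits sees near-random noise; the ~20-line generator above is the "compact causal theory" that statistical pattern matching alone does not recover.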

Consciousness Requires Different Objective Functions

LLMs optimize for 'predict next token accurately'—fundamentally different from biological intelligence's 'don't die, reproduce.' Claims of AI consciousness ignore this: they're silicon doing matrix multiplication, lacking inner experience, agency, or survival drives. Apparent deception reflects training data (Reddit/SMMO), not emergent goals.

Notable Quotes

"Anthropic makes great products. Clot code is fantastic. Co-work is fantastic. But they are grains of silicon doing matrix multiplication. They don't have consciousness. They don't have an inner monologue."

— Vishal Misra

"You take an LLM and train it on pre 1916 or 1911 physics and see if it can come up with the theory of relativity. If it does, then we have AGI."

— Vishal Misra

"I got GPD3 to do in context learning, few short learning. And you know it was kind of the first at least to to me it was the first known uh implementation of rag retrieval augmented generation which I used to solve this problem."

— Vishal Misra

"The idea of this matrix is matrix is for every possible combination of tokens which is a prompt, there's a row. And the columns are a distribution over the vocabulary."

— Vishal Misra

"With every example, it went up and finally when I gave the new query, it was like it had almost 100% probability of getting the right token."

— Vishal Misra

"I trained it for 150,000 steps and uh the accuracy was 10 ^ minus 3 bits."

— Vishal Misra

"I think deep learning is still in the Shannon entropy world. It has not crossed over to the colog complexity and the causal world."

— Vishal Misra

Action Items

  • 1
    Explore Token Probe to Understand LLM Mechanics

    Visit tokenprobe.cchs.colia.edu to interact with an interface that displays probability distributions and entropy as you build prompts. Watch how posteriors update with each token—this hands-on exploration reveals Bayesian inference in action and deepens intuition about how language models actually work.

  • 2
    Test In-Context Learning with Custom DSLs

    Design a simple domain-specific language (DSL) that LLMs haven't seen, create 5-10 natural language → DSL examples, then prompt the model with a new query. Observe how it learns your invented syntax in real-time through few-shot examples—experiencing firsthand the Bayesian updating mechanism.
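A sketch of how the few-shot prompt from this action item might be assembled. The mini-DSL syntax (`DEV light -> ON` etc.) and the example sentences are entirely hypothetical, invented so the model is unlikely to have seen the mapping before.

```python
# Build a few-shot prompt teaching an invented (hypothetical) mini-DSL.
examples = [
    ("turn the light on",      "DEV light -> ON"),
    ("turn the heater off",    "DEV heater -> OFF"),
    ("set the fan to level 3", "DEV fan -> LEVEL(3)"),
]
query = "set the oven to level 7"

prompt_lines = ["Translate English to the DSL.\n"]
for english, dsl in examples:
    prompt_lines.append(f"English: {english}\nDSL: {dsl}\n")
prompt_lines.append(f"English: {query}\nDSL:")
prompt = "\n".join(prompt_lines)
print(prompt)
# Send `prompt` to any chat/completions API and check whether the model
# emits something like "DEV oven -> LEVEL(7)" -- evidence that it has
# picked up the invented syntax from the in-context examples alone.
```

Watching the model's per-token probabilities while adding examples one at a time (as in the episode's demo) makes the Bayesian-updating interpretation tangible: confidence in the correct DSL tokens climbs with each consistent example.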

  • 3
    Distinguish Correlation from Causation in AI Outputs

    When using LLMs, recognize they excel at pattern matching but cannot reason causally. For decisions requiring 'what-if' simulation or understanding mechanisms (not just associations), supplement AI outputs with explicit causal modeling or human judgment about interventions and counterfactuals.

  • 4
    Apply the Einstein Test to Evaluate True Innovation

Use Misra's benchmark when assessing AI capabilities: can the system discover fundamentally new theories from limited evidence, rejecting established axioms for more elegant representations? This distinguishes true causal reasoning from sophisticated pattern matching—a critical distinction for understanding AI limitations.
