TL;DR: Large Language Models (LLMs) have transformed Artificial Intelligence (AI) for text and, increasingly, images and video. Acting reliably in the physical world, however, requires more than next-token prediction. It requires world models: learned representations and simulators that maintain state, predict dynamics, and support counterfactual reasoning for planning and control. This article defines what practitioners mean by world models, explains why the approach is timely, lays out evaluation yardsticks, organizes recent systems into a practical taxonomy, and shows what changes for real deployments; it closes with risks, open challenges, and a short roadmap, plus an Appendix with an Annotated Reading List, Glossary, and Timeline.
Introduction
Late last year, OpenAI reportedly offered $500 million to acquire Medal, a platform where gamers share gameplay video. The deal fell through, but the signal was clear. Internet text is largely saturated. The next frontier is interactive video that teaches models how the world unfolds over time.
Video games and simulated worlds are powerful training grounds for a class of models called world models. Unlike systems that only learn statistical patterns in text or pixels, world models learn and simulate environment dynamics. In simulation, an agent can practice, fail safely, and learn cause and effect for actions such as jumping, cornering on a wet road, or recovering a slipping grasp. It is far cheaper and safer to stage a highway pileup, a near miss on a runway, or a factory robot fault in a virtual environment than in the physical world.
LLMs have reshaped AI, yet they often falter on tasks that require persistent world state, accurate physics, and closed loop decision making. World models fill that gap by maintaining an internal representation of the world and predicting the consequences of actions so agents can imagine, plan, and decide before they act.
Reader roadmap
- Why now: Interactive video and simulation are the next data frontier; reliable action needs more than next-token prediction by Large Language Models (LLMs).
- What “world model” means: Three uses in practice: an internal learned dynamics model, an external simulator, and a learned foundation-level world simulator.
- Beyond next-token prediction: Limits of Large Language Models (LLMs) for persistent state, physics, and closed-loop planning.
- Capabilities unlocked: Sample efficiency via imagination; safety through simulation; counterfactual and causal reasoning; tighter bridge from perception to control.
- How to evaluate: Closed-loop task utility; physical and geometric consistency; uncertainty and out-of-distribution detection; sim-to-real correlation.
- Systems, 2024 to 2025: DeepMind Genie 3; NVIDIA Cosmos; Meta V-JEPA 2 (Video Joint Embedding Predictive Architecture 2); Wayve GAIA-2 (Generative Artificial Intelligence for Autonomy 2) and LINGO-2; Waabi Mixed Reality Testing; OpenAI Sora 2; World Labs.
- What changes for practitioners: Robotics and autonomy workflows; data and synthetic-data provenance; simulator-in-the-loop validation.
- Risks and open challenges: Reality gaps and calibration; causal control versus correlation; compute, sustainability, licensing, and provenance.
- What to watch next: Longer stable interactive sessions; open models and benchmarks; zero-shot robotics from video-pretrained backbones; governance for world simulators.
- Appendix: Annotated Reading List, Glossary, and Timeline.
What practitioners mean by “world model”
The phrase covers three related ideas:
- Learned Internal World Models: These are predictive models learned inside an agent that capture the dynamics of the environment. The agent uses the model to imagine future states and outcomes, effectively “dreaming” or planning internally before acting. In reinforcement learning (RL), a world model encodes the agent’s sensory inputs into a compact state and predicts how the state will evolve given potential actions. This lets the agent simulate trajectories and select actions leading to desirable outcomes without direct trial and error in the real environment. For example, a robot might learn a neural network model of physics from experience and then use it to predict the consequences of its actions (e.g. “if I push this cup, it will tip over”). Such internal models are learned from data (often via self-supervised or unsupervised objectives) and live inside the agent, enabling imagination and planning; a minimal code sketch of this imagine-then-act loop follows this list.
- External Simulators: In other cases, “world model” refers to an external simulation environment, not learned by the agent itself, that is used for training or evaluating AI systems. These are the physics engines, virtual reality simulators, or game environments that provide a sandbox for agents. For instance, the driving simulator CARLA or a MuJoCo physics environment can be seen as an external world model that approximates how the world behaves. Developers craft these simulators with programmed rules of physics and realistic graphics; agents can be trained within them to perform tasks. External simulators are invaluable for safe training (no real-world harm) and for generating varied experiences, but they are typically domain-specific (e.g. a driving sim can’t simulate cooking). They serve as platforms to test agents but are not themselves learned or adaptive. In summary, an external world model is like a flight simulator for AI, providing a controllable copy of the world’s dynamics for the agent to learn in.
- “World Foundation Models” (General-Purpose Simulators): The newest meaning of world model, and the focus of much current research, is a general-purpose, broad-prior model that can simulate worlds in an open-ended way, analogous to how large language models simulate text. These might be called foundation world models, taking inspiration from “foundation models” in NLP. A foundation world model is trained on vast amounts of broad data (e.g. internet videos, 3D scenes, physics interactions) so that it learns the general patterns of our world’s physics and semantics. The model can then generate or predict new environments or experiences across many domains. For example, given a text prompt, it might generate an interactive 3D scene with realistic physics, essentially AI as a world simulator. Unlike a fixed game engine, a foundation world model is learned and can generalize to create countless novel scenarios. DeepMind’s Genie models and NVIDIA’s Cosmos models are positioned as the first of these foundation world models, built to “simulate aspects of the world” broadly for any task or agent. They serve as general simulation platforms, potentially allowing AI agents to train in an “unlimited curriculum” of scenarios. In short, this concept envisions one model that captures the physics and rules of many environments, which can be customized or prompted to generate any world needed, much like GPT-4 can generate text on any topic. These foundation world models marry the open-ended creativity of generative models with the structural consistency of simulators, aiming to become “broad-prior” world simulators for AI.
In summary, the phrase ‘world model’ can mean (1) an internal predictive model learned by the agent, (2) an external simulator provided to the agent, or (3) a foundation world model trained to simulate diverse environments. The third category has advanced fastest recently, with several high-profile releases. These meanings are complementary: an agent can maintain a learned internal model and refine it by training inside external simulators, including those instantiated by a foundation world model. All three reflect the same core idea of modeling environment dynamics, but at different levels (internal versus external) and scopes (domain specific versus general).
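To make the first meaning concrete, here is a minimal sketch of an internal latent world model used for imagination-based action selection. All module sizes and names, and the random-shooting planner, are illustrative assumptions for this article, not a reproduction of any system discussed here.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, LATENT_DIM = 64, 4, 32        # hypothetical sizes

encoder = nn.Linear(OBS_DIM, LATENT_DIM)        # sensory input -> compact state
dynamics = nn.GRUCell(ACT_DIM, LATENT_DIM)      # (action, state) -> next state
reward_head = nn.Linear(LATENT_DIM, 1)          # predicted reward for a state

def imagine(z0: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Roll the learned model forward and return total predicted reward."""
    z, total = z0, torch.zeros(z0.shape[0], 1)
    for t in range(actions.shape[1]):
        z = dynamics(actions[:, t], z)          # predict the next latent state
        total = total + reward_head(z)          # accumulate imagined reward
    return total

# Plan by random shooting: imagine K candidate action sequences and execute
# the first action of the best one, with no real-world trial and error.
obs = torch.randn(1, OBS_DIM)                   # current observation
z0 = encoder(obs).repeat(16, 1)                 # K = 16 candidate rollouts
candidates = torch.randn(16, 5, ACT_DIM)        # planning horizon H = 5
best_first_action = candidates[imagine(z0, candidates).argmax(), 0]
```

Real systems replace the random candidates with a learned policy or a trajectory optimizer, but the loop is the same: encode, imagine, score, act.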
Beyond Next-Token Prediction: The Limits of LLMs for Reasoning and Planning
Large Language Models (LLMs) like GPT-4 have dazzled us with their ability to generate coherent text and even code. They learn an implicit “world knowledge” from patterns in language, but they lack an explicit model of the physical world’s dynamics. An LLM is essentially a very sophisticated next-word predictor; it doesn’t truly simulate what it talks about. This leads to obvious gaps: you can prompt an LLM with a physical puzzle or a planning problem, and it might produce a plausible-sounding answer that is actually nonsense when executed in reality. For instance, an LLM might confidently say “Yes, you can fit an elephant in a sedan by folding it,” because the sentence is linguistically plausible, even though any understanding of physics tells us it is absurd.
The next-token paradigm has inherent limitations for multi-step logical reasoning and planning. LLMs don’t have a persistent memory of a “state of the world” that they update as actions are taken. They also have no sense of consequences except by referencing similar sequences in training data. This is why LLMs can struggle with problems that require simulating a sequence of events or the passage of time. They often hallucinate inconsistent outcomes because they aren’t grounded in a model that enforces consistency over multiple steps.
Yann LeCun and others have argued that to achieve true reasoning AIs, we need models that can simulate outcomes, not just recall patterns. In his view, an agent should have a world model that it can query: “if I do X, what might happen?”, something LLMs are not built for. The emergence of chain-of-thought prompting in LLMs (where the model is nudged to internally simulate reasoning steps) is an attempt to give LLMs a pseudo-model, but it’s limited by context window and still doesn’t ensure physical realism.
World models explicitly address this: a learned dynamics model can be rolled forward many steps to predict consequences, enabling planning beyond the immediate next step. This is essential for tasks like robotics (you can’t decide each motor command purely by next-token prediction; you must foresee stability, collisions, etc.) and complex decision-making (in business or science applications, you must simulate “what if we do this… then what” scenarios). In essence, world models give AI a sort of “imagination” that’s grounded in how the world works, a faculty that pure LLMs lack.
Thus, one key motivation is to overcome the myopia of next-word prediction and endow AI with foresight. By having an internal model of the world, an AI can roll out hypothetical futures internally, which is the crux of both reasoning (mental simulation) and planning (evaluating outcomes before acting). This aligns with how humans reason about physical situations: we imagine scenarios unfolding in our mind’s eye, using our intuitive physics (a kind of mental world model) to guide decisions.
Motivations and capabilities unlocked
1. Sample efficiency: learning faster by “dreaming”
Training high-performing AI systems often requires massive real-world data: reinforcement learning agents may take millions of trial-and-error steps, and robots may need years of experience to master a task. World models promise large gains in sample efficiency because they let agents learn from simulated experience that is cheap and safe, which reduces the number of real-world interactions required. This improves efficiency, enables off-policy exploration inside the learned model, and supports transfer across tasks.
The DreamerV3 paper reported that larger world models not only achieved higher final scores but also required fewer real interactions to solve tasks, giving practitioners a predictable way to trade compute for data: train a bigger model and you will need fewer real episodes to learn the behavior. In robotics this matters because physical robot time is slow and expensive.
World models also support safe off-policy learning. Once you have a model, you can evaluate new policies inside it without risking the real system. For example, Wayve’s GAIA-2 can generate diverse driving scenes to train and stress-test a driving policy, effectively augmenting the training data manifold.
Finally, world models enable multi-task learning and transfer. A single model can be reused across many tasks, while model-free systems often learn each task from scratch. V-JEPA 2 (Video Joint Embedding Predictive Architecture 2) is a representative example: it learns physical regularities from web videos that can then transfer to robotic control with minimal additional data, which is far more sample-efficient than training a robot purely on its own experience.
In summary, world models let AI “learn by dreaming” in the sense of generating synthetic experiences, performing mental rehearsal, and consolidating knowledge in ways that raw experience alone cannot. This is vital for scaling AI into domains where real data is scarce, expensive, or slow to collect, such as medical procedures and space robotics.
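The “learning by dreaming” loop can be illustrated with a toy Dyna-style sketch in which every real transition is followed by k updates replayed from a learned tabular model. The states, hyperparameters, and k below are illustrative assumptions, not a recipe from any paper discussed here.

```python
import random
from collections import defaultdict

q = defaultdict(float)   # Q-values: (state, action) -> estimated return
model = {}               # learned world model: (state, action) -> (reward, next_state)
alpha, gamma, k = 0.1, 0.95, 10

def q_update(s, a, r, s2, actions):
    target = r + gamma * max(q[(s2, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])

def dyna_step(s, a, r, s2, actions):
    q_update(s, a, r, s2, actions)          # learn from the single real transition
    model[(s, a)] = (r, s2)                 # record it in the learned model
    for _ in range(k):                      # then "dream": k free imagined updates
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps2, actions)

actions = ["left", "right"]
dyna_step("s0", "left", 1.0, "s1", actions)  # one real step, ten imagined ones
```

The ratio k of imagined to real updates is exactly the compute-for-data trade described above: more dreaming, fewer real episodes.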
2. Safety via simulation: training and testing in virtual worlds
For safety-critical domains, learned world models let teams rehearse hazardous scenarios and validate behavior before any real-world exposure. Two capabilities matter most. First, safe training: an agent can practice handling high-risk events entirely in simulation, accumulating the equivalent of years of experience without endangering people or equipment. Second, rigorous testing and validation: simulation enables exhaustive coverage of rare edge cases that are unlikely to appear during limited field trials, as well as stress testing under distribution shifts such as unusual weather, sensor dropouts, or degraded actuators. Teams can measure scenario coverage, run large Monte Carlo sweeps, and probe counterfactual decisions to identify failure modes early.
These benefits extend beyond faster iteration. Simulated rollouts support offline policy evaluation, safety buffers, and guardrails that are tuned before deployment. Digital twins of vehicles, factories, or hospitals allow policy updates to be vetted at scale and compared across versions with reproducible seeds. Research efforts around video world simulators aim to expand scenario diversity further, giving embodied Artificial Intelligence systems access to virtually unlimited training data without real-world risk. Example: Waabi’s Mixed Reality system exposes a self-driving stack to events such as drunk drivers or tire blowouts entirely in simulation, so the policy practices dangerous situations without endangering people or equipment.
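The reproducible Monte Carlo sweep pattern mentioned above can be sketched in a few lines. Here, simulate_episode is a placeholder for a real simulator and policy, and the scenario names, hazard rates, and seeding scheme are assumptions for illustration.

```python
import random

def simulate_episode(policy, scenario, seed):
    """Placeholder for one closed-loop simulated episode; returns True if safe."""
    rng = random.Random(seed)                      # reproducible per-episode randomness
    return rng.random() > scenario["hazard_rate"]  # stand-in for a real rollout

def sweep(policy, scenarios, runs=1000):
    """Estimate the failure rate per scenario with fixed seeds for reproducibility."""
    return {
        sc["name"]: sum(not simulate_episode(policy, sc, seed)
                        for seed in range(runs)) / runs
        for sc in scenarios
    }

print(sweep(policy=None, scenarios=[
    {"name": "tire_blowout", "hazard_rate": 0.02},
    {"name": "sensor_dropout", "hazard_rate": 0.05},
]))
```

Fixed seeds make every failure replayable, which is what lets teams compare policy versions on identical episodes rather than on incomparable random draws.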
3. Counterfactual and causal reasoning: “what-if” analysis
Strong world models support interventions and counterfactuals, letting agents ask “What if I take action A instead of B?” and “What if variable X had been different?” This moves systems beyond correlation toward causal understanding, which improves planning, diagnosis of near-misses, and human-facing explanations. In practice, counterfactual probing helps isolate the drivers of success or failure, assess sensitivity to timing and uncertainty, and choose actions that avoid hazards rather than merely reacting to them after the fact.
This capability also improves transparency. By simulating alternative histories, an agent can generate concise rationales such as “Had I not braked here, we would have entered an unsafe distance,” which aligns with how experts reason about risk. Although reliable causal edits in learned latent spaces remain an active research frontier, even partial counterfactual competence helps with data-efficient learning, safer policy updates, and clearer accountability. Example: Wayve’s GAIA-2 can answer driving counterfactuals such as “What if a pedestrian stepped out now?” by inserting the event in the modeled scene and projecting the outcome, enabling the policy to plan proactively.
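The branch-intervene-compare pattern behind such counterfactuals can be shown with a deliberately tiny one-dimensional driving scene. Everything here, from the scene fields to the braking numbers, is an illustrative assumption; it demonstrates the pattern, not GAIA-2’s actual interface.

```python
import copy
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scene:
    ego_pos: float = 0.0
    ego_speed: float = 15.0             # m/s
    hazard_pos: Optional[float] = None  # position of an injected hazard, if any

def rollout(scene, brake, horizon=20, dt=0.1):
    """Roll toy dynamics forward; return the closest approach to the hazard."""
    s, clearance = copy.deepcopy(scene), float("inf")
    for _ in range(horizon):
        if brake:
            s.ego_speed = max(0.0, s.ego_speed - 8.0 * dt)  # simple braking model
        s.ego_pos += s.ego_speed * dt
        if s.hazard_pos is not None:
            clearance = min(clearance, abs(s.hazard_pos - s.ego_pos))
    return clearance

factual = Scene()
branch = copy.deepcopy(factual)
branch.hazard_pos = 20.0                # intervention: pedestrian steps out 20 m ahead

for brake in (False, True):             # compare candidate responses to the intervention
    print(f"brake={brake}: clearance {rollout(branch, brake):.1f} m")
```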
4. Bridging perception and control in embodied agents
Embodied agents must convert raw, high-dimensional observations into actions that reliably achieve goals. World models provide the glue between seeing and doing: they learn a latent state that captures “what is where” and “how things move,” then predict how that state will evolve under candidate controls. Planners can therefore search or optimize directly in this compact state space, and controllers can use short-horizon predictions to avoid unsafe trajectories. This reduces brittle hand-offs between perception, planning, and control modules and enables end-to-end training where the representation is shaped by downstream decision quality.
In closed-loop operation, the agent compares predicted and observed outcomes to refine its belief about the current state, similar in spirit to a learned Kalman filter. The result is more robust behavior under partial observability, sensor noise, or delays, because the model maintains a coherent internal narrative of the scene and how actions change it. This architecture also supports sim-to-real transfer: by learning general visual-dynamics structure first, the agent needs fewer task-specific trials to reach competent control. Example: V-JEPA 2 learned a visual latent that predicts future observations given actions, then transferred this representation to robot manipulation with minimal task-specific data, showing that a shared latent can align perception with control outcomes.
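The predict-correct loop described above can also be sketched briefly. The linear maps below stand in for learned networks and the fixed gain g stands in for a learned or uncertainty-derived correction; this is a sketch in the spirit of a learned Kalman filter, not an exact one.

```python
import numpy as np

rng = np.random.default_rng(0)
A = 0.1 * rng.normal(size=(8, 8))   # latent dynamics (stand-in for a learned net)
B = 0.1 * rng.normal(size=(8, 2))   # how actions move the latent state
C = rng.normal(size=(16, 8))        # decoder: latent state -> predicted observation
g = 0.2                             # fixed correction gain (normally learned)

def belief_step(z, action, observation):
    z_pred = A @ z + B @ action               # predict: roll the belief forward
    innovation = observation - C @ z_pred     # compare prediction with what was seen
    return z_pred + g * (C.T @ innovation)    # correct the belief toward the evidence

z = np.zeros(8)                               # initial belief about the scene
z = belief_step(z, action=np.ones(2), observation=rng.normal(size=16))
```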
Overall: World models deliver predictive understanding for safer training, counterfactual reasoning for causal decision-making, and a tight link between perception and control for reliable closed-loop behavior. Together, these capabilities make them a foundational ingredient for Artificial Intelligence systems that must operate autonomously and responsibly in complex real-world settings.
The 2024–2025 wave: what shipped and why it matters
DeepMind Genie 3
- What it is: Text to interactive worlds with real-time control hooks for agents.
- Inputs and control surface: Text prompts; session state; keyboard or programmatic actions.
- Evidence and results: Interactive scenes at high resolution with intuitive physics and playability.
- Why it is different: Focus on interactivity and curriculum generation at scale.
- Notes and limits: Session length and long-horizon stability continue to evolve.
NVIDIA Cosmos
- What it is: Open-weight world foundation models for “physical AI” with evaluation suites.
- Inputs and control surface: Video conditioning, multi-view geometry signals, optional actions.
- Evidence and results: Physical alignment metrics and multi-view consistency benchmarks.
- Why it is different: Emphasis on open weights, reproducibility, and standardized metrics.
- Notes and limits: General-purpose by design; domain tuning required for specific tasks.
Meta Video Joint Embedding Predictive Architecture 2 (V-JEPA 2)
- What it is: Self-supervised video model that transfers to action.
- Inputs and control surface: Video masking; post-training with minimal action data.
- Evidence and results: Strong video understanding and zero- or few-shot robot manipulation.
- Why it is different: Predicts in representation space instead of pixels; efficient post-training.
- Notes and limits: Action grounding remains a bottleneck for the most dexterous tasks.
Wayve Generative Artificial Intelligence for Autonomy 2 (GAIA-2)
- What it is: Controllable multi-camera driving world model.
- Inputs and control surface: Ego actions, maps, other agents, weather, and time of day.
- Evidence and results: Rare, safety-critical scenarios synthesized at scale for training and validation.
- Why it is different: Surround-view realism with explicit scenario controls for autonomy.
- Notes and limits: Geography and sensor distribution shifts still require careful adaptation.
Wayve Language INstructed drivinG mOdel 2 (LINGO-2)
- What it is: Vision-language-action model that drives while narrating reasoning.
- Inputs and control surface: Perception streams and natural language guidance.
- Evidence and results: Closed-loop driving with instruction following and explanations.
- Why it is different: Human-legible commentary aligned with actions improves interpretability.
- Notes and limits: Language alignment must not invite over-trust; evaluation protocols are still emerging.
Waabi Mixed Reality Testing
- What it is: Sensor-level neural simulation blended with real vehicles on closed courses.
- Inputs and control surface: Real-time sensor feeds augmented with reactive virtual actors.
- Evidence and results: Exposure to edge cases at sensor fidelity with safety preserved.
- Why it is different: Mixed reality closes the gap between offline sim and on road testing.
- Notes and limits: Requires high-performance infrastructure and disciplined scenario design.
OpenAI Sora 2
- What it is: Text-to-video with stronger physics, synchronized audio, and multi-scene control.
- Inputs and control surface: Text prompts and shot-level directives.
- Evidence and results: Highly realistic long shots with improved temporal coherence.
- Why it is different: Consumer-facing reach and governance debates shape norms for simulators.
- Notes and limits: Provenance, consent, and watermark robustness are critical for scale.
World Labs
- What it is: A focused bet on spatial intelligence and three-dimensional world models.
- Inputs and control surface: Mixed 3D and video data; developer tools in progress.
- Evidence and results: Early demonstrations oriented to augmented reality and robotics.
- Why it is different: Industry signal and research leadership converging on world modeling.
- Notes and limits: Roadmap and benchmarks are evolving with the ecosystem.
What changes for practitioners
Robotics and autonomous driving
- Action-conditioned world models support planning in latent space and reduce real robot data needs through imagination rollouts.
- Controllable generation produces long-tail hazards and domain shifts on demand.
- Mixed reality enables closed-loop testing at sensor fidelity before on-road or factory deployment.
Data strategy, synthetic data, and provenance
- Learned simulators decouple coverage from what nature provides and allow explicit curriculum design.
- Provenance, consented data, and watermark robustness must be part of the engineering plan as video and image libraries are licensed and synthesized data enters training and validation.
Open challenges
- Reality gaps and uncertainty calibration in long-horizon rollouts, including detection and recovery when predictions drift (a small ensemble-based sketch follows this list).
- Causal control versus correlation in internet-scale training data, including separating true dynamics from shortcut cues.
- Compute and data provenance for high-fidelity simulators, including sustainable training and licensed evaluation at scale.
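One widely used pattern for the first challenge, sketched below under illustrative assumptions: treat disagreement across an ensemble of world models as an epistemic-uncertainty signal and stop trusting a rollout once it drifts. The random linear models stand in for learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)
ensemble = [0.1 * rng.normal(size=(8, 8)) for _ in range(5)]  # 5 toy dynamics models

def guarded_rollout(z, horizon, threshold=0.5):
    """Roll forward until the ensemble disagrees too much, then stop trusting it."""
    for t in range(horizon):
        preds = np.stack([m @ z for m in ensemble])  # each member's next-state guess
        if preds.std(axis=0).mean() > threshold:     # disagreement: likely off-distribution
            return z, t                              # report how far was trustworthy
        z = preds.mean(axis=0)                       # otherwise take the consensus
    return z, horizon

state, trusted_steps = guarded_rollout(np.ones(8), horizon=50)
```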
What to watch next
- Longer, more stable interactive sessions. Enables curriculum learning and long-horizon planning for agents that must reason over minutes, not seconds.
- Open models and benchmarks. Improves reproducibility and apples-to-apples comparison across labs and domains.
- Zero-shot robotics with video-pretrained backbones. Tests whether broad video pretraining truly encodes physics and affordances.
- Governance for world simulators. Addresses provenance, consented data, and watermark robustness as simulation enters consumer and enterprise workflows.
Further reading and primary sources
- Appendix A: Annotated Reading List. Canonical papers and system write-ups.
- Appendix B: Glossary. Definitions of Joint Embedding Predictive Architecture (JEPA), Reinforcement Learning (RL), Neural Radiance Fields (NeRF), Fréchet Video Distance (FVD), Structural Similarity Index (SSIM), and more.
- Appendix C: Timeline. Milestones from Dyna and early model-based Reinforcement Learning through today’s foundation world models and mixed reality testing.
Bottom line
If LLMs made AI good at recalling and composing, world models are how we make AI good at foreseeing and doing. The practical path forward is to combine representation-centric learning, controllable generation, and simulator-in-the-loop validation into a single, testable stack that is judged by physical consistency, closed-loop utility, and sim-to-real correlation.
Appendix
A. Annotated Reading List
Foundational Works:
1. World Models — Ha & Schmidhuber (2018): The paper that rekindled interest in learned world models. Introduced a VAE‑RNN to model game environments and showed an agent can train inside its own dream. Why read: It is accessible and demonstrates key ideas like latent representations and dreaming for policy learning. Ha’s interactive blog version is also insightful.
2. Dreamer: Learning Control by Latent Imagination — Hafner et al. (ICLR 2020): First Dreamer paper. Learned world model (RSSM) on pixel tasks and policy trained by imagining ahead. Why read: It provides technical details on learning a world model with reconstruction and reward‑prediction losses, and shows improved data efficiency.
3. Recurrent World Models Facilitate Policy Evolution — Ha & Schmidhuber (NeurIPS 2018): Conference version of World Models; often cited for the CarRacing dream visuals.
4. Integrated Architectures for Learning, Planning, and Reacting — Sutton (1990): Proposal of the Dyna architecture for model‑based reinforcement learning. Why read: Conceptual precursor explaining how an agent can learn a model and use it for planning.
5. A Path Towards Autonomous Machine Intelligence — Yann LeCun (2022): LeCun’s manifesto for self‑supervised learning and Joint Embedding Predictive Architecture (JEPA). Lays out why predictive world models are needed beyond passive large language models. Why read: Frames the big picture and introduces the JEPA concept that underpins I‑JEPA and V‑JEPA.
6. MuZero — Schrittwieser et al. (Nature 2020): MuZero learns a model of game dynamics on the fly to plan moves. Why read: Landmark achievement using an implicit world model in planning, matching model‑free methods in complex games.
Recent Breakthroughs (2023–2025):
1. V‑JEPA: Video Joint Embedding Predictive Architecture — Meta AI Blog (2024): Blog post introducing V‑JEPA (precursor to V‑JEPA 2). Why read: Gives intuition on masking video and predicting in representation space, and why this approach scales to learning physics.
2. V‑JEPA 2: Self‑Supervised Video Models Enable Understanding, Prediction and Planning — Assran et al. (arXiv 2025): Full paper for V‑JEPA 2. Why read: Shows how web‑scale video pre‑training plus minimal robot data yields an actionable world model. Contains results on video question answering and robotics.
3. Genie 3: A new frontier for world models — DeepMind Blog (2025): Official announcement of Genie 3. Why read: Illustrates foundation world models and their role in interactive simulation, with examples and visuals.
4. Cosmos: World Foundation Model Platform for Physical AI — NVIDIA Technical Report (arXiv 2025): In‑depth report on Cosmos models and benchmarks. Why read: Covers architecture choices, training setup, and introduces evaluation metrics for physical alignment. Useful for building or evaluating video world models.
5. DreamerV3: Mastering Diverse Domains through World Models — Hafner et al. (Nature 2025): Latest Dreamer showing a single algorithm solving many tasks. Why read: Demonstrates state‑of‑the‑art model‑based reinforcement learning and discusses robustness techniques and scaling effects.
6. Sora is here — OpenAI Blog (2024): OpenAI’s post on its first Sora model. Why read: Articulates the goal of world simulators for physical understanding.
7. Sora 2 is here — OpenAI (2025): Follow‑up post detailing improvements in Sora 2 including physics, audio, and control. Also interesting from a deployment viewpoint.
8. GAIA‑2: Pushing the Boundaries of Video Generative Models — Wayve Blog (2025): Explains GAIA‑2’s features for driving simulation. Why read: Real‑world case study of domain‑specific world model usage in autonomous vehicles, with control parameters and scenario generation.
9. LINGO‑2: Driving with Natural Language — Wayve Blog (2024): Describes how LINGO‑2 works and its significance. Why read: Pioneering combination of language and control, offering insights into aligning a model’s reasoning with human communication.
Overviews and Perspectives:
1. The Godmother of AI Wants Everyone to Be a World Builder — Steven Levy, Wired (2024): Profile on Fei‑Fei Li’s World Labs. Why read: High‑level vision of where world models might head and industry perspective, including funding and key hires.
2. How One AI Model Creates a Physical Intuition of Its Environment — Quanta Magazine (2025): Accessible article on world models, likely focusing on V‑JEPA 2. Why read: Provides context and explanation for general audience without heavy math.
3. What are World Models? — Turing Post: Conceptual explanation of world models. Why read: Good for beginners; defines world models, components, and references the 2018 Ha and Schmidhuber work.
4. The Singularity Project on JEPA: Blog that breaks down LeCun’s JEPA framework. Why read: Useful for understanding the rationale behind non‑generative predictive learning and how it differs from traditional generative modeling.
5. Our New Model Helps AI Think Before it Acts — Meta (2025): Press release for V‑JEPA 2. Why read: Straightforward overview of why physical reasoning is needed for advanced AI and new benchmarks for physical reasoning.
B. Glossary of Terms
1. World Model: A model that represents environment state and dynamics so an agent can predict consequences of actions and plan. Can mean an internal learned predictive model, an external simulator, or a general foundation simulator.
2. Model-Based Reinforcement Learning (MBRL): A reinforcement learning approach that learns a model of transition dynamics and reward, then uses it for planning and policy learning.
3. Next-Token Prediction: The language modeling objective used in large language models to predict the next symbol. It lacks explicit persistent world state and long-horizon planning.
4. Sample Efficiency: How much data or experience is required to reach a target performance. World models improve efficiency by learning in imagination and simulation.
5. Counterfactual Reasoning: Reasoning about what would happen under alternative actions or conditions. Enabled by simulating interventions in a world model.
6. Open-Loop vs Closed-Loop Evaluation: Open-loop evaluates predictions without feedback. Closed-loop evaluates models or agents where predictions affect future states, which is crucial for planning and control.
7. Fréchet Video Distance (FVD): A distributional metric for video generation quality that compares generated and real videos using deep features. Useful but insufficient for closed-loop control evaluation.
8. Structural Similarity Index (SSIM): A perceptual metric for frame-wise similarity to a reference image. Often used for short-horizon prediction quality.
9. Foundation Model: A model pretrained on broad data at scale and adaptable to many tasks. A foundation world model aims to simulate diverse environments and physics.
10. Joint Embedding Predictive Architecture (JEPA): A self-supervised framework that predicts latent representations of future or masked content instead of pixels. Basis for I-JEPA and V-JEPA.
11. Latent Dynamics Model: A model that predicts future states in a compact latent space rather than pixels, commonly used in modern world-model agents such as Dreamer.
12. Mixed Reality Testing (MRT): A method that blends real hardware and tracks with simulated actors inserted into sensor streams in real time for safe, reactive closed-loop testing.
13. Sim-to-Real (Sim2Real) Transfer: Deploying agents trained in simulation to the real world while minimizing performance degradation from simulator-reality differences.
14. Imagination or Dreaming in Reinforcement Learning: Using a learned world model to simulate trajectories for planning or policy learning without real-world interaction.
15. Physics Alignment or Physical Consistency: The degree to which model predictions obey geometric and physical laws such as conservation and correct 3D consistency across views.
16. Uncertainty Quantification: Estimating predictive uncertainty to detect out-of-distribution states and avoid overconfident wrong plans in model-based control.
17. Ensemble World Models: Using multiple learned models to improve robustness and provide epistemic uncertainty estimates for safer planning.
18. Neural Radiance Fields (NeRF): A neural scene representation that enables photo-realistic novel view synthesis from images. A building block for spatial world modeling.
19. Gaussian Splatting: A fast neural rendering technique that represents scenes with 3D Gaussians for efficient, high-quality view synthesis.
20. Epipolar Geometry and Sampson Error: Multi-view geometry concepts used to assess 3D consistency of generated sequences. Lower Sampson distance indicates better geometric alignment.
21. PerceptionTest and Temporal Compass: Benchmarks introduced by Meta to evaluate physical reasoning and temporal understanding in video world models.
C. Timeline of Major Milestones in World Models
1. 1990 — Dyna architecture proposed: Richard Sutton describes integrating learning and planning by learning a model and using simulated experience for updates.
2. 2018 — World Models: Demonstrates a learned world model used for training policies in imagination with successful transfer to the real environment.
3. 2019 — MuZero: A planning agent that learns a model of dynamics and rewards implicitly, achieving superhuman performance in games.
4. 2019 — Dreamer (latent imagination): Learns a latent dynamics model from pixels and optimizes control by rolling out imagined trajectories.
5. 2022 — LeCun’s AMI and JEPA vision: A roadmap for autonomous machine intelligence via predictive world models that learn abstract representations.
6. 2023 — Mixed Reality Testing for AVs highlighted by Waabi: Neural simulation fused with real trucks on closed courses to test edge cases safely with sensor-level fidelity.
7. 2024 — OpenAI Sora 1: Text-to-video generation framed as a step toward general world simulators for physical understanding.
8. 2024 — Wayve LINGO-2: First closed-loop vision-language-action model driving on public roads with real-time explanations.
9. 2024 — World Labs vision for spatial intelligence: Fei-Fei Li’s startup raises major funding to build 3D world models for AR and robotics.
10. 2025 — NVIDIA Cosmos 1.0: Open-weight world foundation models and benchmarks focused on 3D consistency and physical alignment.
11. 2025 — DeepMind Genie 3: Interactive text-to-world generation with real-time agent control for training curricula in simulation.
12. 2025 — Meta V-JEPA 2: Self-supervised video world model that transfers to zero-shot robotic planning with minimal action data.
13. 2025 — OpenAI Sora 2: Improved physics, audio, and multi-scene control with consumer-facing access and governance considerations.
14. 2025 — Wayve GAIA-2: Controllable multi-camera video world model for driving to generate rare and safety-critical scenarios at scale.
