Why Robots Need World Models
A working guess about what is around the robot, what is hidden, and what may happen if it moves.
A robot can have cameras and still not understand what is happening.
A camera gives it pixels. A lidar gives it distances. A force sensor gives it contact forces. These are useful signals, but they do not tell the robot what will happen if it reaches behind a box, steps over a cable, or lifts something heavier than it looks.
For that, the robot needs a world model — its working guess about how the world works.
Seeing is not enough
People often talk about robots as if perception is the main problem. Can it see the object? Can it detect the person? Can it recognize the door? Those questions matter — but they are only the start.
A robot also has to understand consequences. If it pushes the tote, will the tote slide or tip? If it grabs the mug by the rim, will it slip? If it steps on the loose mat, will its foot catch?
“I see a box” is not the same as “I can lift this box from this side without hitting the shelf.”
A world model is a working guess
A world model is the robot's best current guess about the world and how it changes.
- 01 A map
Where things are — aisle, wall, shelf, charging dock.
- 02 Body state
Where the robot's limbs are and how it is balanced.
- 03 Object state
What objects exist, how they are placed, whether one is tilted.
- 04 Hidden state
What may exist even when it cannot be seen — a cable continuing behind the cart.
- 05 Dynamics
How things change after actions — a door swings, a ball rolls.
- 06 Task state
What has been done and what remains.
- 07 Risk
What could go wrong — a grasp may slip.
The best systems mix maps, geometry, physics, learned models, safety rules, and human instructions.
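One way to picture the seven parts listed above is as a single data structure. This is only a minimal sketch: the names `WorldModel`, `ObjectEstimate`, and every field are hypothetical, and the "dynamics" method is a deliberately crude stand-in for real physics or a learned model.

```python
from dataclasses import dataclass, field, replace


@dataclass
class ObjectEstimate:
    """One tracked object. Every field is an estimate, not ground truth."""
    name: str
    position_m: tuple[float, float, float]
    tilted: bool = False
    visible: bool = True      # False means the object is inferred (hidden state)
    confidence: float = 1.0   # 0.0 (pure guess) to 1.0 (certain)


@dataclass
class WorldModel:
    """Hypothetical container mirroring the seven parts in the list above."""
    landmarks_xy_m: dict[str, tuple[float, float]] = field(default_factory=dict)  # map
    joint_angles_rad: dict[str, float] = field(default_factory=dict)              # body state
    objects: dict[str, ObjectEstimate] = field(default_factory=dict)              # object + hidden state
    steps_done: list[str] = field(default_factory=list)                           # task state
    steps_remaining: list[str] = field(default_factory=list)                      # task state
    risks: dict[str, float] = field(default_factory=dict)                         # risk: event -> probability

    def hidden_objects(self) -> list[str]:
        """Names of objects the model believes exist but cannot currently see."""
        return sorted(n for n, o in self.objects.items() if not o.visible)

    def predict_after_push(self, name: str, dx_m: float) -> ObjectEstimate:
        """Dynamics (toy assumption): a pushed object slides dx_m along x, nothing else changes.

        Returns a new estimate and leaves the stored belief untouched,
        so the robot can imagine an action without committing to it.
        """
        obj = self.objects[name]
        x, y, z = obj.position_m
        return replace(obj, position_m=(x + dx_m, y, z))


model = WorldModel()
model.objects["tote"] = ObjectEstimate("tote", (1.0, 0.5, 0.8))
model.objects["cable"] = ObjectEstimate("cable", (1.2, 0.0, 0.0), visible=False, confidence=0.4)
model.risks["grasp_slip"] = 0.15
```

The point of the sketch is the shape, not the details: hidden things carry a confidence below 1.0, and predictions are computed on copies rather than written straight into the belief.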
How a world model turns into action
- 01 Sense the world.
- 02 Update the model.
- 03 Imagine a few possible actions.
- 04 Choose one.
- 05 Act.
- 06 Check what actually happened.
- 07 Update the model again.
Take a humanoid moving a tote in a warehouse. It sees the tote. It notices the shelf and a cart behind it. It predicts the tote may swing if lifted too fast. It chooses a grasp. It bends. It shifts balance. It lifts. Then reality answers.
If the difference between expected and actual is small, it keeps going. If the difference is large, it slows down, replans, stops, or asks for help. That compare-and-update step is where much of the intelligence lives.
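The loop above can be sketched in a few lines. This is a toy under stated assumptions, not how any particular robot works: the only thing the model believes is the tote's mass, the physics is a single equation (force divided by mass, minus gravity), and the names `Belief`, `predict_accel`, `choose_force`, `step`, and the `surprise_limit` threshold are all hypothetical.

```python
from dataclasses import dataclass

G = 9.81  # gravity in m/s^2


@dataclass
class Belief:
    tote_mass_kg: float  # the model's current guess, not the true mass


def predict_accel(belief: Belief, force_n: float) -> float:
    """Imagine: upward acceleration the model expects for a given lift force."""
    return force_n / belief.tote_mass_kg - G


def choose_force(belief: Belief, candidates: list[float], target_accel: float) -> float:
    """Choose: pick the candidate whose predicted outcome is closest to the target."""
    return min(candidates, key=lambda f: abs(predict_accel(belief, f) - target_accel))


def step(belief: Belief, true_mass_kg: float, candidates: list[float],
         target_accel: float = 0.5, surprise_limit: float = 1.0) -> tuple[str, Belief]:
    """One pass through the loop: imagine, choose, act, check, update."""
    force = choose_force(belief, candidates, target_accel)  # imagine + choose
    expected = predict_accel(belief, force)
    actual = force / true_mass_kg - G        # act + sense (reality is simulated here)
    surprise = abs(actual - expected)        # check: expected versus actual
    # Update: invert the same toy physics to re-estimate the mass from what was observed.
    updated = Belief(tote_mass_kg=force / (actual + G))
    decision = "continue" if surprise <= surprise_limit else "slow_and_replan"
    return decision, updated
```

If the robot believes the tote weighs 5 kg and it really weighs 10 kg, the first lift produces a large surprise, so `step` returns `"slow_and_replan"` along with a corrected belief. A second call with that corrected belief produces almost no surprise and returns `"continue"`.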
Why humanoid robots need world models
A wheeled robot mainly has to know where it can drive. A fixed arm mainly has to know where to reach. A humanoid has more ways to fail.
It can lose balance. Bump a shelf with its elbow. Place a foot badly. Reach too far. Drop an object because walking changed the load on its hand. The robot's body and the world are linked.
A humanoid does not just need to know “what object is this?” — it needs to know “what will happen if this body does this action to that object right now?”
A world model is not the whole robot brain
- Predicts what may happen.
- Stores a working picture.
- Helps the robot imagine before it moves.
- Can be wrong without consequence — until acted on.
A policy chooses what to try. Controllers make motors move stably. Safety layers slow, stop, or ask for help. Perception turns sensors into useful information. Hardware actually performs the motion. A robot can predict well and still move badly if any of these break.
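A small sketch can make the division of labour concrete. Everything here is a hypothetical toy: the slip-risk formula, the thresholds, and the function names are assumptions chosen only to show that prediction, choice, safety, and control are separate jobs.

```python
from typing import Optional


def predict_slip_risk(speed_m_s: float) -> float:
    """World model (toy assumption): predicted chance the grasp slips, rising with speed."""
    return min(1.0, max(0.0, speed_m_s / 2.0))


def policy(candidate_speeds: list[float]) -> float:
    """Policy (toy): prefer the fastest candidate."""
    return max(candidate_speeds)


def safety_layer(speed_m_s: float, max_risk: float = 0.3) -> Optional[float]:
    """Safety layer: slow down until the predicted risk is acceptable, or stop (None)."""
    while speed_m_s > 0.05:
        if predict_slip_risk(speed_m_s) <= max_risk:
            return speed_m_s
        speed_m_s *= 0.5  # slow down and re-check the prediction
    return None  # still too risky when nearly stopped: stop and ask for help


def controller(speed_m_s: float, motor_limit_m_s: float = 0.5) -> float:
    """Controller: clamp the command to what the motors can track stably."""
    return min(speed_m_s, motor_limit_m_s)


def decide(candidate_speeds: list[float]) -> Optional[float]:
    """Chain the parts: the policy proposes, the safety layer filters, the controller executes."""
    chosen = policy(candidate_speeds)
    safe = safety_layer(chosen)
    return None if safe is None else controller(safe)
```

Notice that the world model never moves anything. It only answers "what is likely to happen?", and a bad policy, an overly loose safety threshold, or a controller with the wrong limit would each spoil the result even if that answer were perfect.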
What people often misunderstand
- Mistake 01
A world model is just a map.
Maps are one kind of model. Humanoids need to know what can move, break, slip, bend, or should not be touched.
- Mistake 02
A world model is a perfect simulation.
It is usually incomplete on purpose. A tote-moving robot does not need every dust particle — only what matters for the task.
- Mistake 03
Realistic video means the robot understands physics.
A model can generate plausible video and still fail at physical action. Contact, friction, weight, delay, and sensor noise are harder than looking right.
- Mistake 04
A world model proves autonomy.
The real test is practical: does the model help the robot complete the task with fewer failures, fewer unsafe actions, and less human help?
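That practical test can be written down as a comparison of trial logs from the same task, run once without the model and once with it. The sketch below is an assumption about how one might score it: `Trial`, `summarize`, and `model_helped` are hypothetical names, and the rule "no worse on any measure and better on at least one" is just one reasonable choice.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    succeeded: bool
    unsafe_events: int
    human_interventions: int


def summarize(trials: list[Trial]) -> dict[str, float]:
    """Average the three measures named above over a set of trials."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    return {
        "success_rate": sum(t.succeeded for t in trials) / n,
        "unsafe_per_trial": sum(t.unsafe_events for t in trials) / n,
        "interventions_per_trial": sum(t.human_interventions for t in trials) / n,
    }


def model_helped(baseline: list[Trial], with_model: list[Trial]) -> bool:
    """True if the model run is no worse on every measure and differs on at least one."""
    b, m = summarize(baseline), summarize(with_model)
    no_worse = (m["success_rate"] >= b["success_rate"]
                and m["unsafe_per_trial"] <= b["unsafe_per_trial"]
                and m["interventions_per_trial"] <= b["interventions_per_trial"])
    return no_worse and m != b
```

A real evaluation would also need enough trials to rule out luck, and the same task, site, and hardware in both runs. The sketch leaves that out.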
Evidence from the real world and research
- SLAM
Established robotics method: simultaneous localization and mapping. A robot builds a map while estimating where it is within that map, which makes it one of the simplest kinds of world model. Useful, but only one part.
- “World Models” (Ha & Schmidhuber, 2018)
Foundational research. An agent learned a compressed model and trained inside its own “dream.” Research benchmark, not humanoid deployment.
- Dreamer / DayDreamer
Dreamer is model-based reinforcement learning tested across many tasks. DayDreamer applied it to four real robots learning online without simulators. Direct robotics research; not commercial scale.
- V-JEPA 2 (Meta)
Video-trained world model fine-tuned on DROID data for robot-arm tasks. Company/research claim — needs independent replication.
- NVIDIA Cosmos
World foundation models for robots, generating photoreal physics-based synthetic data. Platform claim — synthetic data still has to transfer to reality.
What is still hard
- Contact — predicting how an object will move, bend, slip, or resist.
- Hidden state — the back of a shelf, the weight of a box, the friction under a foot.
- Long tasks — small errors stack across many steps.
- Uncertainty — knowing when to act, slow down, get a better view, or ask for help.
- Evaluation — closed-loop performance is harder to measure than nice predictions.
A world model matters only if it helps the robot act more safely, recover better, or need less human help.
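The "long tasks" point in the list above comes down to simple arithmetic. As a rough sketch, assume each step succeeds or fails independently with the same probability. Real errors are often correlated, so treat this as an illustration rather than a prediction. The function names are hypothetical.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every step succeeds, assuming steps fail independently."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


def required_per_step_success(target_task_success: float, steps: int) -> float:
    """Per-step reliability needed to reach a target over the whole task."""
    if not 0.0 < target_task_success <= 1.0:
        raise ValueError("target_task_success must be in (0, 1]")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return target_task_success ** (1.0 / steps)
```

Under that assumption, a robot that gets each step right 99% of the time finishes a 50-step task only about 60% of the time. To finish it 90% of the time, each step would need to succeed about 99.8% of the time.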
The simple test for any world-model claim
Does the model actually change what the robot does — and does that change show up in real behaviour?
- What does the model predict?
- How is it connected to action?
- What happens when the model is wrong?
- Where has it been tested — sim, lab, real robot, named site, measured deployment?
- What evidence shows it helped — better success, fewer failures, fewer interventions, safer behaviour?
- A robot needs more than cameras — it needs a working guess about what will happen next.
- A world model is not a perfect simulation. It is a useful model for action.
- Maps are one kind of world model; humanoids need models of objects, contact, balance, and change.
- The core loop is: sense, update, imagine, choose, act, check, update again.
- Realistic video is not the same as reliable physical action.
- The best evidence is not a nice prediction — it is better robot behaviour in the real task.
- World model: A robot's working guess about the world and how it changes.
- Physical AI: AI that acts through a body in the real world.
- Perception: The part of the system that turns sensor data into useful information.
- SLAM: A method that lets a robot build a map while estimating where it is inside that map.
- Dynamics: How things change after actions (a door swings, a ball rolls, a box tips).
- Latent model: A compressed internal version of what matters, rather than every raw sensor detail.
- Policy: The part of the robot system that chooses actions.
- Planner: A system that compares possible actions before choosing one.
- Controller: The lower-level system that turns a chosen action into motor commands.
- Closed-loop evaluation: Testing whether the robot acts better when its predictions affect its next actions.
- Hallucination: When a model predicts something plausible-looking but wrong.
- Uncertainty: How unsure the robot should be about its own model or prediction.
Sources and evidence notes
What this essay leans on
| Claim | Evidence | Strength | Note |
|---|---|---|---|
| Learned world models can train agents to choose better actions. | Ha & Schmidhuber, “World Models,” 2018. | Strong | Research benchmarks, not humanoid deployment. |
| Maps are a practical world model in real robotics. | SLAM — established robotics method. | Strong | Only one part of world modelling. |
| Model-based learning predicts outcomes and improves behaviour. | Hafner et al., Dreamer, Nature 2025. | Strong | Broad research tasks; not proof of general humanoid work. |
| World models can be applied directly on physical robots. | Wu et al., DayDreamer, 2022 — four robots learning online without simulators. | Strong | Research settings, not commercial scale. |
| Video-trained world models are a current direction. | Meta V-JEPA 2, 2025; NVIDIA Cosmos, 2025. | Medium | Company/platform claims — need independent replication. |
| Open challenges include contact, hallucination, alignment, evaluation. | World Models for Robotic Manipulation survey, 2026. | Strong | Field is changing quickly. |