Why Data Is the Hardest Problem
Robot data is physical experience. It needs bodies, sensors, time, mess, failure, and recovery — none of which sit on the web.
Robots need experience.
A language model can learn from text people already wrote. An image model can learn from pictures people already shared. The internet is full of that kind of data.
Robots need something different. They need records of physical attempts.
A useful robot dataset may include the camera image, the spoken instruction, the joint positions, the gripper command, the force on the fingers, the timing of the motion, whether the object slipped, and whether the task worked. That data usually does not already exist — someone has to make it.
Robot data is not just information
Say a robot has to pick up a soft bag from a table. A human sees the bag and knows a few things at once. The bag may bend. The handle may twist. The bottom may sag. If the bag is full, it will pull downward. If the grip is too weak, it will slip.
For a robot to learn that, it needs more than a photo of a bag. It needs examples of attempts — what the robot saw before the action, where the hand moved, how the fingers closed, what force was used, whether the lift worked, and what failure looked like.
For Physical AI, the action is part of the data.
Why text and image AI had an easier path
There is no public internet full of robot hands trying to fold laundry, route cables, load dishwashers, clean spills, carry odd-shaped boxes, open tight drawers, or recover from bad grasps.
There are videos of people doing those things. Those videos can help — they teach a model about objects, scenes, language, and human goals. But a video of a person folding a towel does not directly tell a robot what motor commands to send to its joints.
Google DeepMind's RT-2 is one example: it is trained on both web data and robot data. Web data helps with “what is a cup.” Robot data teaches the hand how hard to grip.
What robot data includes
A useful robot learning example is dense. One full record of an attempt is often called a trajectory: not just a video clip, but the whole path through a task. A trajectory can capture:
- What the robot saw through its cameras.
- What task it was given.
- Where the robot's joints were.
- What action it took, and how fast.
- Whether it touched the object, and how much force it felt.
- Whether the object moved as expected.
- Whether the task succeeded.
- What a human did when the robot failed.
A failed text prediction costs almost nothing. A failed robot action can break a part, damage a tool, or create a safety risk.
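To make the list above concrete, here is a minimal sketch of how one trajectory could be laid out in code. Every field name, unit, and convention here (for example, a gripper command between 0.0 and 1.0) is an assumption chosen for illustration; real datasets such as DROID or Open X-Embodiment define their own schemas.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One timestep of a recorded attempt (all field names are illustrative)."""
    timestamp_s: float                          # time since the attempt started
    camera_frame_id: str                        # reference to the stored camera image
    joint_positions: List[float]                # where the robot's joints were
    gripper_command: float                      # assumed convention: 0.0 open, 1.0 closed
    fingertip_force_n: Optional[float] = None   # None when no force sensor is present
    human_intervened: bool = False              # True if a person took over at this step


@dataclass
class Trajectory:
    """A full record of one attempt at a task."""
    instruction: str                            # the task the robot was given
    steps: List[Step] = field(default_factory=list)
    success: Optional[bool] = None              # unknown until the attempt is labeled

    def duration_s(self) -> float:
        """Elapsed time between the first and last recorded step."""
        if len(self.steps) < 2:
            return 0.0
        return self.steps[-1].timestamp_s - self.steps[0].timestamp_s

    def intervention_count(self) -> int:
        """Number of steps at which a human took over."""
        return sum(1 for s in self.steps if s.human_intervened)


# A tiny, made-up attempt: the grip force drops and a human steps in.
attempt = Trajectory(instruction="pick up the soft bag")
attempt.steps.append(Step(0.0, "frame_0000", [0.0, 0.5, -0.3], 0.0))
attempt.steps.append(Step(0.5, "frame_0015", [0.1, 0.6, -0.2], 1.0, fingertip_force_n=2.4))
attempt.steps.append(
    Step(1.0, "frame_0030", [0.1, 0.4, -0.2], 1.0, fingertip_force_n=0.3, human_intervened=True)
)
attempt.success = False

print(attempt.duration_s(), attempt.intervention_count())
```

The point of the sketch is that observation, action, sensing, and outcome live in the same record. A video clip alone would fill in only the camera field.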
Why humanoid data is harder
A simple robot arm on a table works in one fixed area. A humanoid has legs, arms, hands, a torso, cameras, many more joints, and balance to maintain. It may move through spaces built for people, carry objects while walking, and use both hands.
A humanoid picking up a tote must decide where to stand, where to place its feet, how to bend, where to put each hand, how much force to use, how to keep balance, how to turn, how to avoid people, and how to place the tote without dropping it.
Each part of that job can fail. A humanoid body does not remove the data problem — it increases it.
The field is building bigger datasets
- DROID
76,000 demonstration trajectories, 350 hours, 564 scenes, 86 tasks, 50 collectors over 12 months across three continents.
- BridgeData V2
60,096 trajectories across 24 environments and 13 skills, most of them teleoperated demonstrations.
- Open X-Embodiment
Pooled 60 datasets from 34 labs. Over 1 million real robot trajectories across 22 robot embodiments.
- OpenVLA
7B vision-language-action model trained on 970,000 robot episodes from Open X-Embodiment.
- π0 (Physical Intelligence)
Internet-scale vision-language pretraining, plus Open X-Embodiment data, plus company-collected dexterous data from eight robots. Described as an early prototype.
These efforts matter. They also show how far the field still has to go: even a million robot trajectories is small compared with text and image data, and many of those trajectories cover narrow actions in limited settings.
Teleoperation is part of the data engine
When a person teleoperates a robot, the system can record the scene and the action together. That record becomes a demonstration. Mobile ALOHA showed that 50 demonstrations per task plus co-training improved success rates on tasks such as opening cabinets, using elevators, and rinsing a pan.
For now, many robots learn physical skills from human hands — even when the goal is to need those hands less over time.
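Recording "the scene and the action together" usually means lining up two streams that tick at different rates. The sketch below shows one simple way to do that: pair each operator command with the most recent camera frame captured at or before it. The function name and the timestamp lists are hypothetical, and real teleoperation stacks handle synchronization in their own ways.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def pair_actions_with_frames(
    frame_times: List[float],
    action_times: List[float],
) -> List[Tuple[float, Optional[float]]]:
    """For each action, find the latest camera frame captured at or before it.

    frame_times must be sorted in ascending order. Returns (action_time,
    frame_time) pairs; frame_time is None if no frame existed yet.
    """
    pairs = []
    for t in action_times:
        i = bisect_right(frame_times, t)  # count of frames with time <= t
        pairs.append((t, frame_times[i - 1] if i > 0 else None))
    return pairs


# Made-up timestamps in seconds: camera at 10 Hz, commands arriving irregularly.
frames = [0.00, 0.10, 0.20, 0.30]
actions = [0.05, 0.10, 0.27]
pairs = pair_actions_with_frames(frames, actions)

for action_t, frame_t in pairs:
    print(f"action at {action_t:.2f}s uses frame from {frame_t:.2f}s")
```

Pairing with the latest earlier frame, rather than the nearest frame in either direction, keeps the record causal: the stored observation is something the operator could have seen before acting.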
Synthetic data helps, but reality still gets a vote
Simulation can create data faster than the real world. NVIDIA reported that its GR00T workflow produced 780,000 synthetic trajectories in 11 hours, and that mixing synthetic with real data improved GR00T N1 performance by 40% compared with real data alone.
That is a serious result. But synthetic data has a limit. Real objects bend, stick, slip, wobble, and get placed in strange ways. The test is not whether synthetic data looks good — it is whether it improves real-world performance.
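A quick back-of-envelope calculation, using only figures quoted in this essay, shows the size of the throughput gap. Treat it as a rough illustration rather than a fair benchmark: the 11 hours is generation time for synthetic data, while DROID's 350 hours is recorded robot data gathered over 12 months, so the two "hours" do not measure the same thing.

```python
# Figures quoted in this essay.
synthetic_trajectories = 780_000   # NVIDIA-reported GR00T workflow output
synthetic_hours = 11               # reported generation time

droid_trajectories = 76_000        # DROID demonstration trajectories
droid_hours = 350                  # hours of recorded robot data

synthetic_per_hour = synthetic_trajectories / synthetic_hours
droid_per_hour = droid_trajectories / droid_hours

print(f"synthetic: about {synthetic_per_hour:,.0f} trajectories per generation hour")
print(f"DROID:     about {droid_per_hour:,.0f} trajectories per recorded hour")
```

The speed advantage is real, but it says nothing about quality. A trajectory from a simulator only earns its place if it improves what the robot does on real objects.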
Failure data may be the most valuable data
A robot that only sees clean success may become fragile. It needs to know what failure looks like: the gripper closing too early, the cable catching, the drawer jamming, the object heavier than expected.
But failure data is hard to collect. A failure can damage hardware, slow a workplace, or create safety risks. A 2024 Stanford paper argued that real-world autonomous collection still faces major challenges — and that more human demonstrations often gave better return per unit of effort.
What people often misunderstand
- Mistake 01
“Internet data will solve robot data.”
It helps with words, scenes, and goals. It does not tell a robot how to move motors, control timing, apply force, or recover from contact.
- Mistake 02
“Human videos are the same as robot demonstrations.”
Videos are clues. They rarely include joint positions, action commands, force, or timing — the things a robot body needs to learn from directly.
- Mistake 03
“More data is always better.”
More bad data makes systems worse. Volume matters; coverage matters more. Good data needs variety and clear labels.
- Mistake 04
“A big dataset proves a general robot.”
Datasets are training material, not workers. The real test is what the trained system can do in a new place, on a real task, over time.
- Mistake 05
“Synthetic data replaces real-world data.”
It can help and lower cost. But if the simulator is wrong about friction, cloth, or human behaviour, the robot learns the wrong lesson.
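Mistake 03 can be made measurable. The sketch below compares two made-up datasets by how many distinct conditions they cover, not only by how many episodes they contain. The (task, scene, object) labels and the report fields are assumptions for illustration; a real coverage audit would use whatever metadata the dataset actually records.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Episode = Tuple[str, str, str]  # (task, scene, object), hypothetical labels


def coverage_report(episodes: Iterable[Episode]) -> Dict[str, float]:
    """Summarize volume and variety for a collection of labeled episodes."""
    counts = Counter(episodes)
    total = sum(counts.values())
    # Share of all episodes taken up by the single most common condition.
    largest_share = counts.most_common(1)[0][1] / total if total else 0.0
    return {
        "episodes": total,
        "distinct_conditions": len(counts),
        "largest_condition_share": largest_share,
    }


# 1,000 episodes of the same thing.
narrow = [("pick", "lab table", "cup")] * 1000

# 300 episodes spread over 2 tasks x 3 scenes x 5 objects = 30 conditions.
varied = [
    (task, scene, obj)
    for task in ("pick", "place")
    for scene in ("lab table", "kitchen", "warehouse shelf")
    for obj in ("cup", "bag", "box", "cable", "towel")
] * 10

narrow_report = coverage_report(narrow)
varied_report = coverage_report(varied)
print(narrow_report)
print(varied_report)
```

In this toy case the narrow set has more than three times the episodes but covers a single condition, while the varied set covers thirty. Counting conditions is still a crude proxy: it ignores label quality and how hard each condition is.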
Deployment data is different from demo data
A demo usually looks like this:
- Object placed in the right spot.
- Good lighting.
- Robot starts from a known pose.
- Clip begins after setup, ends before cleanup.
Deployment looks different: shift changes, bad lighting, clutter, bent objects, tired workers, blocked paths, worn hardware, calibration drift, and edge cases that were not in the demo. This is where you find out whether interventions are falling, failures are getting less severe, and the system still works after a week or a month.
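One way to check whether interventions are falling is to compute the share of attempts that needed human help, week by week. The sketch below assumes a hypothetical log of (week number, intervened) pairs; real deployment logs would carry far more detail, such as failure severity and the cause of each intervention.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (week_number, human_intervened) for each task attempt, hypothetical log format
Attempt = Tuple[int, bool]


def weekly_intervention_rate(log: List[Attempt]) -> Dict[int, float]:
    """Return the fraction of attempts per week that needed a human."""
    attempts = defaultdict(int)
    interventions = defaultdict(int)
    for week, intervened in log:
        attempts[week] += 1
        if intervened:
            interventions[week] += 1
    return {week: interventions[week] / attempts[week] for week in sorted(attempts)}


# Made-up log: four attempts in week 1, four in week 2.
log = [
    (1, True), (1, True), (1, False), (1, False),
    (2, True), (2, False), (2, False), (2, False),
]
rates = weekly_intervention_rate(log)

for week, rate in rates.items():
    print(f"week {week}: {rate:.0%} of attempts needed a human")
```

A falling rate is only meaningful if the tasks and conditions stay comparable from week to week. If the robot is quietly given easier work, the number improves without the system improving.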
The simple test for any robot data claim
Where did the data come from, and would it survive a new robot, room, or object?
- What kind of data was it — real trajectories, human video, teleop, simulation, deployment logs?
- Which robot body collected it?
- What tasks did it cover?
- How was success measured?
- How much failure data was included?
- How much human help was needed?
- Did it transfer — new objects, new rooms, new lighting, new body?
Key takeaways
- Robot data is physical experience, not just information.
- A useful dataset connects what the robot saw, what it did, and what happened next.
- Web data helps with language and vision — it does not replace robot action data.
- Teleoperation is one of the main ways to collect high-quality demonstrations.
- Synthetic data can help, but still needs real-world testing.
- A big dataset does not prove a general-purpose robot.
- Failure data is valuable, but hard and risky to collect.
- For humanoids, the body has more ways to move and fail — so the data problem is bigger.
Glossary
- Robot data
Records of robots acting in the real world or in simulation, including images, actions, sensor readings, task labels, and results.
- Trajectory
One recorded attempt at a task, including what the robot saw, what it did, and what happened next.
- Demonstration
An example of a task done by a human, often through teleoperation, that a robot can learn from.
- Teleoperation data
Data collected while a human controls or guides a robot from a distance.
- Action data
The commands or movements the robot sends to its motors, arms, hands, base, or body.
- State
A snapshot of the robot and scene: joint positions, camera views, object positions, sensor readings.
- Force data
Data about pressure, touch, load, or resistance.
- Imitation learning
A way for robots to learn by copying demonstrations.
- Reinforcement learning
A way for robots to learn through trial and error using rewards or scores.
- Synthetic data
Data made in simulation or generated by software instead of collected from the real world.
- Embodiment
The robot's physical body: its shape, arms, hands, wheels, legs, sensors, limits.
- Cross-embodiment learning
Using data from one kind of robot to help another kind of robot learn.
- Vision-language-action model
A model that takes images and language and outputs actions a robot can perform.
- Generalization
Handling new objects, places, tasks, or conditions not exactly in the training data.
Sources and evidence notes
What this essay leans on
| Claim | Evidence | Strength | Note |
|---|---|---|---|
| Robotics needs action outputs and both web and robot data. | Google DeepMind RT-2. | Strong | Research evidence; not commercial deployment proof. |
| Real robot data collection is hard and slow. | DROID — 76k trajectories, 350 hours, logistical and safety challenges. | Strong | Strong dataset evidence. |
| Human demonstrations remain central. | BridgeData V2 — 50,365 teleoperated demonstrations. | Strong | Dataset evidence. |
| Shared robot datasets are growing. | Open X-Embodiment — 1M+ trajectories, 22 embodiments, 34 labs. | Strong | Research resource. |
| Generalist robot models use large robot datasets. | OpenVLA — 7B VLA trained on 970k Open X episodes. | Strong | Research model. |
| Synthetic data can help robot training. | NVIDIA GR00T N1 — 780k synthetic trajectories in 11 hours; 40% improvement when mixed with real data. | Medium | Company-reported; not independent. |
| Autonomous real-world data collection remains hard. | Stanford autonomous data collection paper, 2024. | Strong | Cautionary research evidence. |