Lilac Flower

Reward Shaping Is the Hardest Part of Robot Learning

If you ask most people what the hard part of training a robot policy is, they'll say data collection, or compute, or sim-to-real transfer. In my experience, the answer is none of those. The hardest part is reward shaping — and it's the part that gets the least attention in papers.

The problem is deceptively simple to state: you need to tell the agent what "good" looks like in a way that's dense enough to learn from but specific enough not to be gamed. In practice, this is where most of the training effort goes, not hyperparameter tuning or architecture search. I've watched well-designed agents learn to exploit reward functions in ways that look like success in simulation but fail completely in the real world, a phenomenon called reward hacking that's far more common than most practitioners admit.
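To make the density/specificity tension concrete, here is a toy sketch for a reaching task (the task, names, and tolerance are mine, not from any particular system). The sparse version is easy to specify correctly but gives almost no signal early on; the shaped version is learnable but rewards proximity rather than success, and that gap is exactly where gaming happens.

```python
import numpy as np

def sparse_reward(gripper_pos, target_pos, tol=0.01):
    """Fires only on success: easy to specify correctly, but gives the
    agent almost no learning signal early in training."""
    return 1.0 if np.linalg.norm(gripper_pos - target_pos) < tol else 0.0

def shaped_reward(gripper_pos, target_pos, tol=0.01):
    """Dense version: adds negative distance as a shaping term.
    Learnable, but the agent now optimizes "be close", not "succeed";
    the difference between those two is where exploits live."""
    dist = np.linalg.norm(gripper_pos - target_pos)
    return sparse_reward(gripper_pos, target_pos, tol) - dist
```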

The subtlest failures come from proxy rewards. When the true objective is hard to measure directly — "successfully harvest this strawberry without bruising it" — you end up specifying something measurable that you hope correlates with success. Sometimes it does. Sometimes the agent finds an entirely different way to maximize your proxy that has nothing to do with the task. I've had to redesign reward functions mid-training more times than I can count because an agent found a loophole I hadn't anticipated.
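A toy illustration of the proxy problem, using the strawberry example above (the state fields and numbers are invented for this sketch): grip force is measurable and seems to correlate with a clean harvest, right up until the agent discovers that squeezing harder scores better and bruises the fruit.

```python
def true_success(state):
    # The real objective; typically not measurable during training.
    # Shown here only as ground truth for the toy example.
    return state["harvested"] and not state["bruised"]

def proxy_reward(state):
    # Measurable proxy: firm grip contact on the fruit. The hope is
    # that strong grip correlates with a successful harvest.
    return state["grip_force"]

# The loophole: squeezing harder maximizes grip_force and bruises the fruit.
exploit = {"harvested": True, "bruised": True,  "grip_force": 50.0}
careful = {"harvested": True, "bruised": False, "grip_force": 8.0}
```

Under the proxy, the exploit policy outscores the careful one even though only the careful one actually succeeds: the mid-training redesign scenario in miniature.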

What I've found helps most: start with a sparse reward and a strong classical baseline. Use the baseline's success/failure signal to understand what "done" actually looks like, then build shaping terms that guide the agent toward that outcome rather than defining it. Treat every shaping term as a hypothesis and test it in isolation before combining. And never trust a policy that achieves very high reward very quickly — that's usually a sign it found an exploit, not that you designed a good reward.
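One way to sketch the "every shaping term is a hypothesis" discipline in code (term names and weights are illustrative): keep shaping terms named and weighted in one place, so testing a term in isolation is a one-line change rather than a rewrite.

```python
def make_reward(terms):
    """Compose a reward from named, weighted shaping terms so each one
    can be toggled and tested in isolation before being combined."""
    def reward(state):
        return sum(weight * fn(state) for weight, fn in terms.values())
    return reward

# Each term is a hypothesis. Keeping them in one dict makes an ablation
# ("train with only this term on") trivial to set up.
terms = {
    "success":  (1.0,  lambda s: 1.0 if s["done"] else 0.0),
    "approach": (0.1,  lambda s: -s["dist_to_goal"]),
    "smooth":   (0.01, lambda s: -s["action_norm"]),
}
reward = make_reward(terms)

# Isolating a single shaping term for testing:
only_approach = make_reward({"approach": terms["approach"]})
```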

Reward shaping is part engineering, part intuition, and part fieldwork. It's the piece of robot learning that most demands real-world experience — and it's why the gap between lab demos and deployed systems is often wider than it looks.