
Reward Design

This chapter documents the reward design used for Asimov locomotion and the main differences from common open-source baselines.

1. Reward design was not the main bottleneck

The locomotion policy did not become deployable through reward shaping alone. Stable transfer depended more strongly on:

  • actuator modeling
  • observation timing
  • deployable observation design
  • real controller constraints

Reward design still matters, but it should be understood as one component of the stack rather than the sole driver of performance.

The practical lesson from the legs stack is that reward changes alone did not solve transfer. The walking policy became deployable only after the actuator model, timing model, and observation interface were brought closer to hardware.

2. Core rewards kept from existing baselines

The Asimov reward set was heavily influenced by open-source humanoid locomotion work, especially Booster-style reward structure.

Representative retained terms include:

| Reward | Weight | Purpose |
| --- | --- | --- |
| tracking_lin_vel | +1.0 (base, curriculum-scaled) | Follow commanded linear velocity |
| tracking_ang_vel | +0.5 (base, curriculum-scaled) | Follow commanded yaw rate |
| orientation | -5.0 | Penalize deviation from upright orientation |
| upright | curriculum | Maintain stable torso posture |
| action_rate | -1.0 | Smooth action changes |
| torques | -2e-4 | Encourage efficient actuation |
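
Velocity-tracking terms in baselines of this style are commonly implemented as an exponential kernel over squared tracking error. A minimal sketch of the two tracking terms above; the kernel shape and the `sigma` value are assumptions, not confirmed by this chapter:

```python
import numpy as np

def tracking_lin_vel(cmd_xy, base_lin_vel_xy, sigma=0.25):
    # Exponential kernel: 1.0 at zero error, decaying with the squared
    # tracking error of horizontal base velocity against the command.
    err = np.sum(np.square(cmd_xy - base_lin_vel_xy))
    return np.exp(-err / sigma)

def tracking_ang_vel(cmd_yaw_rate, base_yaw_rate, sigma=0.25):
    # Same kernel applied to the commanded yaw rate.
    err = np.square(cmd_yaw_rate - base_yaw_rate)
    return np.exp(-err / sigma)

# Weighted sum using the base weights from the table (curriculum scale omitted).
r = 1.0 * tracking_lin_vel(np.array([0.5, 0.0]), np.array([0.5, 0.0])) \
  + 0.5 * tracking_ang_vel(0.0, 0.0)
```

With perfect tracking both kernels evaluate to 1.0, so `r` equals the sum of the weights.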

3. No gait clock

Some locomotion baselines provide an explicit gait phase clock to the policy. Asimov does not.

This choice was made because:

  • Asimov kinematics are not identical to baseline robots
  • the ankle range is limited by the parallel mechanism
  • the policy should discover a gait that fits this hardware rather than follow a hand-imposed gait phase

This makes the training setup less prescriptive and lets the resulting gait adapt to this specific hardware.
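
The practical difference is that no phase features appear in the policy input. A hypothetical sketch of an observation built without a gait clock; the exact observation composition shown here is an illustrative assumption, not the documented Asimov interface:

```python
import numpy as np

def build_observation(base_ang_vel, gravity_b, cmd, q, dq, prev_action):
    # No sin/cos gait-phase features: the policy sees only proprioception
    # and the velocity command, and must discover its own gait timing.
    return np.concatenate([base_ang_vel, gravity_b, cmd, q, dq, prev_action])
```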

4. Asymmetric pose tolerances

Uniform pose tolerances across all joints are not appropriate for Asimov. The legs use different tolerances depending on the joint and the hardware structure.

Representative walking tolerances are:

| Joint | Typical tolerance |
| --- | --- |
| hip pitch | 0.5 |
| hip roll | 0.25 |
| hip yaw | 0.2 |
| knee | 0.5 |
| ankle pitch | 0.2 |
| ankle roll | 0.12 |

The ankle tolerances are tight because the real ankle range is limited.
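
One common way to apply per-joint tolerances is a deadband penalty: deviation from the default pose inside the tolerance is free, and only the excess is penalized. A minimal sketch under that assumption (the deadband form and squared excess are not confirmed by this chapter; the joint ordering follows the table above for one leg):

```python
import numpy as np

# Per-joint tolerances from the table above, for one leg, in the order:
# hip pitch, hip roll, hip yaw, knee, ankle pitch, ankle roll
TOLERANCES = np.array([0.5, 0.25, 0.2, 0.5, 0.2, 0.12])

def pose_penalty(q, q_default, tol=TOLERANCES):
    # Only deviation beyond the per-joint tolerance is penalized,
    # so tight ankle tolerances bite sooner than loose hip ones.
    excess = np.maximum(np.abs(q - q_default) - tol, 0.0)
    return -np.sum(np.square(excess))
```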

5. Narrow-stance stability penalties

Asimov has a narrower stance than many humanoid baselines. This increases lateral balance sensitivity and motivates stronger stability penalties.

Representative terms include:

| Reward | Weight |
| --- | --- |
| body_ang_vel | -0.08 |
| angular_momentum | -0.03 |

These terms help reduce large pelvis rotation and unstable whole-body motion.
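
A minimal sketch of how these two penalties might be computed, assuming both are squared-norm penalties on the pelvis angular velocity and the whole-body angular momentum respectively (the squared-norm form is an assumption; the weights match the table above):

```python
import numpy as np

def stability_penalty(base_ang_vel, angular_momentum):
    # body_ang_vel: penalize roll/pitch/yaw rates of the pelvis.
    r_ang_vel = -0.08 * np.sum(np.square(base_ang_vel))
    # angular_momentum: penalize whole-body angular momentum magnitude.
    r_momentum = -0.03 * np.sum(np.square(angular_momentum))
    return r_ang_vel + r_momentum
```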

6. Contact-force limits

The reward set penalizes excessive ground reaction forces.

This serves two purposes:

  • it discourages aggressive stomping behavior
  • it protects the real robot from unnecessary impact loading

Representative terms include:

| Reward | Weight | Note |
| --- | --- | --- |
| feet_contact_force_limit | -5e-4 | penalizes forces above approximately 350 N |
| feet_stumble | -1.25 | penalizes large horizontal-to-vertical contact ratios |
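
These two terms can be sketched as follows, assuming the force limit penalizes only the excess above the ~350 N threshold and the stumble term counts feet whose horizontal force exceeds some multiple of the vertical force. The linear-excess form and the `ratio=3.0` threshold are assumptions, not values taken from this chapter:

```python
import numpy as np

def contact_force_penalty(foot_forces_z, limit=350.0, weight=-5e-4):
    # Penalize only the portion of vertical ground reaction force above limit,
    # so normal stance forces are free but hard impacts are not.
    excess = np.maximum(foot_forces_z - limit, 0.0)
    return weight * np.sum(excess)

def stumble_penalty(foot_forces_xy, foot_forces_z, ratio=3.0, weight=-1.25):
    # Flag a stumble when horizontal contact force is large relative to
    # vertical contact force, i.e. the foot is scuffing rather than loading.
    horizontal = np.linalg.norm(foot_forces_xy, axis=-1)
    stumbled = horizontal > ratio * foot_forces_z
    return weight * np.sum(stumbled)
```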

7. Air-time reward

Asimov legs are light enough to support dynamic walking with noticeable swing and brief unloaded phases. An air-time reward is therefore used to discourage shuffling behavior.

Representative term:

| Reward | Weight |
| --- | --- |
| air_time | +0.5 |

This reward encourages dynamic gait emergence rather than static stepping.
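
A common implementation of this kind of term credits each foot's accumulated swing time, relative to a target, at the moment of first touchdown; shuffling gaits with near-zero swing time then score negatively. A minimal sketch under that assumption (the `target=0.5` value and the first-contact bookkeeping are assumptions, not taken from this chapter):

```python
import numpy as np

def air_time_reward(feet_air_time, first_contact, target=0.5, weight=0.5):
    # At each first touchdown (first_contact == 1), reward swing durations
    # above the target and penalize ones below it; feet still in the air
    # or in sustained contact contribute nothing this step.
    return weight * np.sum((feet_air_time - target) * first_contact)
```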

8. Consolidated reward table

The legs policy used a compact reward set rather than a large collection of highly specialized terms.

| Reward | Weight | Role |
| --- | --- | --- |
| tracking_lin_vel | +1.0 (curriculum-scaled) | commanded linear velocity tracking |
| tracking_ang_vel | +0.5 (curriculum-scaled) | commanded yaw tracking |
| orientation | -5.0 | penalize orientation deviation |
| air_time | +0.5 | dynamic stepping |
| action_rate | -1.0 | smooth action changes |
| torques | -2e-4 | efficient actuation |
| pose | curriculum | posture shaping |
| upright | curriculum | torso stability |
| body_ang_vel | -0.08 | pelvis rotation penalty |
| angular_momentum | -0.03 | global stability penalty |
| self_collisions | -1.0 | reject self-contact |
| feet_stumble | -1.25 | discourage unstable foot strikes |
| feet_contact_force_limit | -5e-4 | discourage excessive ground impact |
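
The consolidated set composes into a single weighted sum per step. A minimal sketch, assuming each term is implemented elsewhere as a scalar function of simulator state; the `total_reward` helper and the `term_values` mapping are hypothetical scaffolding, and curriculum-scaled terms are shown at their base weights:

```python
# Fixed weights from the table above (curriculum-scaled terms at base values;
# pose and upright omitted because their weights are curriculum-scheduled).
WEIGHTS = {
    "tracking_lin_vel": 1.0,
    "tracking_ang_vel": 0.5,
    "orientation": -5.0,
    "air_time": 0.5,
    "action_rate": -1.0,
    "torques": -2e-4,
    "body_ang_vel": -0.08,
    "angular_momentum": -0.03,
    "self_collisions": -1.0,
    "feet_stumble": -1.25,
    "feet_contact_force_limit": -5e-4,
}

def total_reward(term_values):
    # term_values maps each reward name to its unweighted scalar value
    # for the current step; the total is their weighted sum.
    return sum(WEIGHTS[name] * value for name, value in term_values.items())
```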

9. Practical lesson

The most important lesson from this stack is that reward design should remain consistent with the hardware interface.

It is counterproductive to reward behaviors that require:

  • unavailable sensors
  • unrealistic joint range
  • unrealistically fast force response
  • contact conditions that the deployed robot cannot reproduce

For Asimov, reward design works best when it reflects the real limitations and affordances of the leg hardware.
