
Reward Design

This chapter documents the reward design used for Asimov locomotion and the main differences from common open-source baselines.

1. Reward design was not the main bottleneck

The locomotion policy did not become deployable through reward shaping alone. Stable transfer depended more strongly on:

  • actuator modeling
  • observation timing
  • deployable observation design
  • real controller constraints

Reward design still matters, but it should be understood as one component of the stack rather than the sole driver of performance.

The practical lesson from the legs stack is that reward changes alone did not solve transfer. The walking policy became deployable only after the actuator model, timing model, and observation interface were brought closer to hardware.

2. Core rewards kept from existing baselines

The Asimov reward set was heavily influenced by open-source humanoid locomotion work, especially Booster-style reward structure.

Representative retained terms include:

| Reward | Weight | Purpose |
| --- | --- | --- |
| tracking_lin_vel | +1.0 (base, curriculum-scaled) | Follow commanded linear velocity |
| tracking_ang_vel | +0.5 (base, curriculum-scaled) | Follow commanded yaw rate |
| orientation | -5.0 | Penalize deviation from upright orientation |
| upright | curriculum | Maintain stable torso posture |
| action_rate | -1.0 | Smooth action changes |
| torques | -2e-4 | Encourage efficient actuation |
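
Velocity-tracking terms in baselines of this style are commonly implemented as an exponential kernel over squared tracking error. A minimal sketch of the two tracking terms above; the kernel shape and the `sigma` value are assumptions, not confirmed by this chapter:

```python
import numpy as np

def tracking_lin_vel(cmd_xy, base_lin_vel_xy, sigma=0.25):
    # Exponential kernel: 1.0 at zero error, decaying with the squared
    # tracking error of horizontal base velocity against the command.
    err = np.sum(np.square(cmd_xy - base_lin_vel_xy))
    return np.exp(-err / sigma)

def tracking_ang_vel(cmd_yaw_rate, base_yaw_rate, sigma=0.25):
    # Same kernel applied to the commanded yaw rate.
    err = np.square(cmd_yaw_rate - base_yaw_rate)
    return np.exp(-err / sigma)

# Weighted sum using the base weights from the table (curriculum scale omitted).
r = 1.0 * tracking_lin_vel(np.array([0.5, 0.0]), np.array([0.5, 0.0])) \
  + 0.5 * tracking_ang_vel(0.0, 0.0)
```

With perfect tracking both kernels evaluate to 1.0, so `r` equals the sum of the weights.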

3. No gait clock

Some locomotion baselines provide an explicit gait phase clock to the policy. Asimov does not.

This choice was made because:

  • Asimov kinematics are not identical to baseline robots
  • the ankle range is limited by the parallel mechanism
  • the policy should discover a gait that fits this hardware rather than follow a hand-imposed gait phase

This makes the training setup less prescriptive and lets the resulting gait adapt to this specific hardware.
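
The practical difference is that no phase features appear in the policy input. A hypothetical sketch of an observation built without a gait clock; the exact observation composition shown here is an illustrative assumption, not the documented Asimov interface:

```python
import numpy as np

def build_observation(base_ang_vel, gravity_b, cmd, q, dq, prev_action):
    # No sin/cos gait-phase features: the policy sees only proprioception
    # and the velocity command, and must discover its own gait timing.
    return np.concatenate([base_ang_vel, gravity_b, cmd, q, dq, prev_action])
```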

4. Asymmetric pose tolerances

Uniform pose tolerances across all joints are not appropriate for Asimov. The legs use different tolerances depending on the joint and the hardware structure.

Representative walking tolerances are:

| Joint | Typical tolerance |
| --- | --- |
| hip pitch | 0.5 |
| hip roll | 0.25 |
| hip yaw | 0.2 |
| knee | 0.5 |
| ankle pitch | 0.2 |
| ankle roll | 0.12 |

The ankle tolerances are tight because the real ankle range is limited.
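
One common way to apply per-joint tolerances is a deadband penalty: deviation from the default pose inside the tolerance is free, and only the excess is penalized. A minimal sketch under that assumption (the deadband form and squared excess are not confirmed by this chapter; the joint ordering follows the table above for one leg):

```python
import numpy as np

# Per-joint tolerances from the table above, for one leg, in the order:
# hip pitch, hip roll, hip yaw, knee, ankle pitch, ankle roll
TOLERANCES = np.array([0.5, 0.25, 0.2, 0.5, 0.2, 0.12])

def pose_penalty(q, q_default, tol=TOLERANCES):
    # Only deviation beyond the per-joint tolerance is penalized,
    # so tight ankle tolerances bite sooner than loose hip ones.
    excess = np.maximum(np.abs(q - q_default) - tol, 0.0)
    return -np.sum(np.square(excess))
```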

5. Narrow-stance stability penalties

Asimov has a narrower stance than many humanoid baselines. This increases lateral balance sensitivity and motivates stronger stability penalties.

Representative terms include:

| Reward | Weight |
| --- | --- |
| body_ang_vel | -0.08 |
| angular_momentum | -0.03 |

These terms help reduce large pelvis rotation and unstable whole-body motion.
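
A minimal sketch of how these two penalties might be computed, assuming both are squared-norm penalties on the pelvis angular velocity and the whole-body angular momentum respectively (the squared-norm form is an assumption; the weights match the table above):

```python
import numpy as np

def stability_penalty(base_ang_vel, angular_momentum):
    # body_ang_vel: penalize roll/pitch/yaw rates of the pelvis.
    r_ang_vel = -0.08 * np.sum(np.square(base_ang_vel))
    # angular_momentum: penalize whole-body angular momentum magnitude.
    r_momentum = -0.03 * np.sum(np.square(angular_momentum))
    return r_ang_vel + r_momentum
```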

6. Contact-force limits

The reward set penalizes excessive ground reaction forces.

This serves two purposes:

  • it discourages aggressive stomping behavior
  • it protects the real robot from unnecessary impact loading

Representative terms include:

| Reward | Weight | Note |
| --- | --- | --- |
| feet_contact_force_limit | -5e-4 | penalizes forces above approximately 350 N |
| feet_stumble | -1.25 | penalizes large horizontal-to-vertical contact ratios |
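
These two terms can be sketched as follows, assuming the force limit penalizes only the excess above the ~350 N threshold and the stumble term counts feet whose horizontal force exceeds some multiple of the vertical force. The linear-excess form and the `ratio=3.0` threshold are assumptions, not values taken from this chapter:

```python
import numpy as np

def contact_force_penalty(foot_forces_z, limit=350.0, weight=-5e-4):
    # Penalize only the portion of vertical ground reaction force above limit,
    # so normal stance forces are free but hard impacts are not.
    excess = np.maximum(foot_forces_z - limit, 0.0)
    return weight * np.sum(excess)

def stumble_penalty(foot_forces_xy, foot_forces_z, ratio=3.0, weight=-1.25):
    # Flag a stumble when horizontal contact force is large relative to
    # vertical contact force, i.e. the foot is scuffing rather than loading.
    horizontal = np.linalg.norm(foot_forces_xy, axis=-1)
    stumbled = horizontal > ratio * foot_forces_z
    return weight * np.sum(stumbled)
```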

7. Air-time reward

Asimov legs are light enough to support dynamic walking with noticeable swing and brief unloaded phases. An air-time reward is therefore used to discourage shuffling behavior.

Representative term:

| Reward | Weight |
| --- | --- |
| air_time | +0.5 |

This reward encourages dynamic gait emergence rather than static stepping.
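
A common implementation of this kind of term credits each foot's accumulated swing time, relative to a target, at the moment of first touchdown; shuffling gaits with near-zero swing time then score negatively. A minimal sketch under that assumption (the `target=0.5` value and the first-contact bookkeeping are assumptions, not taken from this chapter):

```python
import numpy as np

def air_time_reward(feet_air_time, first_contact, target=0.5, weight=0.5):
    # At each first touchdown (first_contact == 1), reward swing durations
    # above the target and penalize ones below it; feet still in the air
    # or in sustained contact contribute nothing this step.
    return weight * np.sum((feet_air_time - target) * first_contact)
```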

8. Consolidated reward table

The legs policy used a compact reward set rather than a large collection of highly specialized terms.

| Reward | Weight | Role |
| --- | --- | --- |
| tracking_lin_vel | +1.0 (curriculum-scaled) | commanded linear velocity tracking |
| tracking_ang_vel | +0.5 (curriculum-scaled) | commanded yaw tracking |
| orientation | -5.0 | penalize orientation deviation |
| air_time | +0.5 | dynamic stepping |
| action_rate | -1.0 | smooth action changes |
| torques | -2e-4 | efficient actuation |
| pose | curriculum | posture shaping |
| upright | curriculum | torso stability |
| body_ang_vel | -0.08 | pelvis rotation penalty |
| angular_momentum | -0.03 | global stability penalty |
| self_collisions | -1.0 | reject self-contact |
| feet_stumble | -1.25 | discourage unstable foot strikes |
| feet_contact_force_limit | -5e-4 | discourage excessive ground impact |
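
The consolidated set composes into a single weighted sum per step. A minimal sketch, assuming each term is implemented elsewhere as a scalar function of simulator state; the `total_reward` helper and the `term_values` mapping are hypothetical scaffolding, and curriculum-scaled terms are shown at their base weights:

```python
# Fixed weights from the table above (curriculum-scaled terms at base values;
# pose and upright omitted because their weights are curriculum-scheduled).
WEIGHTS = {
    "tracking_lin_vel": 1.0,
    "tracking_ang_vel": 0.5,
    "orientation": -5.0,
    "air_time": 0.5,
    "action_rate": -1.0,
    "torques": -2e-4,
    "body_ang_vel": -0.08,
    "angular_momentum": -0.03,
    "self_collisions": -1.0,
    "feet_stumble": -1.25,
    "feet_contact_force_limit": -5e-4,
}

def total_reward(term_values):
    # term_values maps each reward name to its unweighted scalar value
    # for the current step; the total is their weighted sum.
    return sum(WEIGHTS[name] * value for name, value in term_values.items())
```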

9. Practical lesson

The most important lesson from this stack is that reward design should remain consistent with the hardware interface.

It is counterproductive to reward behaviors that require:

  • unavailable sensors
  • unrealistic joint range
  • unrealistically fast force response
  • contact conditions that the deployed robot cannot reproduce

For Asimov, reward design works best when it reflects the real limitations and affordances of the leg hardware.
