Locomotion Training

Reinforcement Learning for Locomotion

This chapter describes the policy design used for Asimov locomotion and the reasoning behind its observation interface.

Figure: Asimov legs during early locomotion development. This image anchors the policy chapter to the real hardware platform the controller was trained for, rather than to a generic humanoid benchmark or a simulator-only setup.

1. Locomotion as a data-interface problem

The locomotion policy is a standard feedforward neural network. The central design question is not the novelty of the network itself, but whether the policy receives the correct information at the correct time.

For this reason, the locomotion problem is framed as:

How can the domain of data in simulation be made to match the domain of data on hardware as closely as possible?

This framing leads to several design choices:

  • avoid observations that are unavailable on the robot
  • model timing skew and delay explicitly (a minimal sketch follows this list)
  • give the critic access to privileged training-only information
  • treat actuator behavior as part of the learning problem
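
One way to make the second bullet concrete is to buffer each sensor stream and serve the policy a randomly delayed sample during training. The sketch below is a minimal illustration of that idea; the class name, API, and delay range are assumptions, not the Asimov implementation.

```python
from collections import deque
import random

class DelayedObservation:
    """Serve observations with randomized latency to model sensor
    timing skew in simulation. Illustrative sketch, not Asimov's code."""

    def __init__(self, min_delay_steps=0, max_delay_steps=2):
        self.min_delay = min_delay_steps
        self.buffer = deque(maxlen=max_delay_steps + 1)

    def push(self, obs):
        # Record the newest observation; entries older than the maximum
        # delay age out automatically.
        self.buffer.append(obs)

    def sample(self):
        # Return an observation delayed by a random number of control
        # steps, so the policy learns to tolerate stale sensor data.
        delay = random.randint(self.min_delay, self.buffer.maxlen - 1)
        delay = min(delay, len(self.buffer) - 1)  # clamp early in an episode
        return self.buffer[-1 - delay]
```

In training, `push` would be called once per simulation step, with `sample` used when building the actor observation.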

Figure: Qualitative walking result from the trained legs policy. The motion shown here is useful as a visual reference for the kind of gait the observation design, critic structure, and actuator model were intended to produce on hardware.

2. Actor observations

The actor is limited to signals that can be produced on the real robot. The policy observation vector has 45 dimensions.

| Observation term | Dimensions | Notes |
| --- | --- | --- |
| base angular velocity | 3 | IMU angular velocity |
| projected gravity | 3 | Orientation proxy used instead of ground-truth pose |
| command | 3 | Target v_x, v_y, w_z |
| joint position, groups 1-3 | 12 | Grouped by CAN timing |
| joint velocity, groups 1-3 | 12 | Grouped by CAN timing |
| previous actions | 12 | Smoothed control history |
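
As a concrete sketch, the 45 dimensions above concatenate as follows. The `state` container and its field names are hypothetical; only the grouping and dimension counts come from the table.

```python
import numpy as np

def build_actor_obs(state):
    """Assemble the 45-dim actor observation from deployable signals only.
    `state` is a hypothetical container; field names are illustrative."""
    return np.concatenate([
        state.base_ang_vel,       # 3,  IMU angular velocity
        state.projected_gravity,  # 3,  orientation proxy, no ground-truth pose
        state.command,            # 3,  target v_x, v_y, w_z
        state.joint_pos,          # 12, joint positions, CAN groups 1-3
        state.joint_vel,          # 12, joint velocities, CAN groups 1-3
        state.prev_actions,       # 12, smoothed control history
    ])                            # total: 45
```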

The observation design intentionally excludes base linear velocity.

3. No ground-truth linear velocity

Many locomotion baselines feed ground-truth base linear velocity into the policy. Asimov does not.

The reason is simple:

  • the real robot does not measure ground-truth base velocity
  • the robot has an IMU and encoder-derived joint state
  • training with unavailable information encourages brittle policies

If the actor depends on an observation that disappears at deployment time, transfer quality degrades immediately.

4. Asymmetric actor-critic

Training uses an asymmetric actor-critic structure. The actor is restricted to deployable observations, while the critic receives additional privileged information that improves value estimation.

The critic receives everything the actor sees, plus:

| Privileged term | Dimensions | Purpose |
| --- | --- | --- |
| base linear velocity | 3 | Ground-truth motion during training |
| foot height | 2 | Contact and swing-state context |
| foot air time | 2 | Step timing context |
| foot contact | 2 | Binary contact state |
| foot contact forces | 6 | Ground interaction |
| toe joint position | 2 | Passive toe state |
| toe joint velocity | 2 | Passive toe dynamics |

This setup allows the critic to learn from simulator-only signals without forcing the actor to depend on unavailable hardware data.
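
A minimal sketch of how the critic input extends the actor observation, following the table above. The accessor names are assumptions; the dimension counts come from the table (19 privileged dimensions, for a 64-dim critic input).

```python
import numpy as np

def build_critic_obs(state, actor_obs):
    """Critic input = actor observation plus simulator-only signals.
    The `state` container and field names are illustrative."""
    privileged = np.concatenate([
        state.base_lin_vel,        # 3, ground-truth base velocity (sim only)
        state.foot_height,         # 2, contact and swing-state context
        state.foot_air_time,       # 2, step timing context
        state.foot_contact,        # 2, binary contact state
        state.foot_contact_force,  # 6, ground interaction
        state.toe_joint_pos,       # 2, passive toe state
        state.toe_joint_vel,       # 2, passive toe dynamics
    ])                             # total: 19
    return np.concatenate([actor_obs, privileged])  # 45 + 19 = 64
```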

5. Why toe state belongs in the critic

The passive toes affect support, push-off, and recovery from forward pitching. However, they are not actively actuated and are not instrumented like the main leg joints.

Toe state is therefore exposed to the critic only.

This allows the training process to capture the relationship between toe deflection and stability, while still requiring the actor to infer toe behavior indirectly from:

  • ankle motion
  • IMU state
  • body response during stance and push-off

6. Contact force as privileged information

Contact force is also useful during training. It helps the critic evaluate whether the robot is loading the ground in a stable way, even though the actor does not receive direct force measurements as a deployment input.

In practice, this improves:

  • stance stability
  • push-off behavior
  • foot placement quality

The resulting policy does not react to contact changes as aggressively as force-rich commercial systems, but it achieves useful, stable behavior without relying on direct force sensing at deployment time.

7. Network structure

The policy uses a straightforward multilayer perceptron.

| Network | Structure |
| --- | --- |
| Actor | 45 -> 512 -> 256 -> 128 -> 12 |
| Critic | (45 + privileged) -> 512 -> 256 -> 128 -> 1 |

Additional settings:

  • activation: ELU
  • observation normalization: enabled
  • initial policy noise standard deviation: 1.0

The network design is intentionally simple. Sim2real performance came primarily from the observation and actuation interface, not from architectural novelty.
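
A minimal PyTorch sketch of this structure, assuming the 19 privileged dimensions from the table in section 4. Observation normalization would wrap the inputs with running statistics and is omitted here.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    # Stack of Linear layers with ELU activations between them,
    # matching the settings listed above.
    layers = []
    for i in range(len(sizes) - 2):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ELU()]
    layers.append(nn.Linear(sizes[-2], sizes[-1]))
    return nn.Sequential(*layers)

# Actor maps the 45-dim observation to 12 joint actions; the critic maps
# the 64-dim (45 + 19 privileged) input to a scalar value.
actor = mlp([45, 512, 256, 128, 12])
critic = mlp([45 + 19, 512, 256, 128, 1])

# Learnable per-dimension exploration noise, initialized to std 1.0.
log_std = nn.Parameter(torch.zeros(12))  # exp(0) = 1.0
```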
