Reinforcement Learning for Locomotion
This chapter describes the policy design used for Asimov locomotion and the reasoning behind its observation interface.

Figure: Asimov legs during early locomotion development. This image anchors the policy chapter to the real hardware platform the controller was trained for, rather than to a generic humanoid benchmark or a simulator-only setup.
1. Locomotion as a data-interface problem
The locomotion policy is a standard feedforward neural network. The central design question is not the novelty of the network itself, but whether the policy receives the correct information at the correct time.
For this reason, the locomotion problem is framed as:
How can the domain of data in simulation be made to match the domain of data on hardware as closely as possible?
This framing leads to several design choices:
- avoid observations that are unavailable on the robot
- model timing skew and delay explicitly
- give the critic access to privileged training-only information
- treat actuator behavior as part of the learning problem
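The second point, modeling timing skew and delay explicitly, can be illustrated with a small delay buffer that replays simulated measurements with the same latency the hardware bus introduces. This is a minimal sketch; the class name and the one-step delay are illustrative, not Asimov's actual timing parameters:

```python
from collections import deque

import numpy as np


class DelayedObservation:
    """Replays observations with a fixed latency, mimicking sensor/bus delay."""

    def __init__(self, dim: int, delay_steps: int):
        # Pre-fill with zeros so early reads behave like a cold sensor.
        self.buffer = deque(
            [np.zeros(dim) for _ in range(delay_steps + 1)],
            maxlen=delay_steps + 1,
        )

    def step(self, fresh_obs: np.ndarray) -> np.ndarray:
        """Push the newest measurement, return the one from `delay_steps` ago."""
        self.buffer.append(fresh_obs)
        return self.buffer[0]


# Usage: joint positions arrive one control step late.
delayed = DelayedObservation(dim=12, delay_steps=1)
first = delayed.step(np.ones(12))        # returns the zero pre-fill
second = delayed.step(2 * np.ones(12))   # returns the step-1 measurement
```

Training against delayed observations forces the policy to tolerate the same staleness it will see on the robot, rather than exploiting instantaneous simulator state.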

Figure: Qualitative walking result from the trained legs policy. The motion shown here is useful as a visual reference for the kind of gait the observation design, critic structure, and actuator model were intended to produce on hardware.
2. Actor observations
The actor is limited to signals that can be produced on the real robot. The policy observation vector has 45 dimensions.
| Observation term | Dimensions | Notes |
|---|---|---|
| base angular velocity | 3 | IMU angular velocity |
| projected gravity | 3 | Orientation proxy used instead of ground-truth pose |
| command | 3 | Target v_x, v_y, w_z |
| joint position groups 1-3 | 12 | Grouped by CAN timing |
| joint velocity groups 1-3 | 12 | Grouped by CAN timing |
| previous actions | 12 | Smoothed control history |
The observation design intentionally excludes base linear velocity.
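The 45-dimensional actor input can be expressed as a simple concatenation of the six groups above. A sketch with hypothetical function names; the joint grouping by CAN timing is represented here only by ordering, and the projected-gravity helper shows one common way to compute the orientation proxy from a body-to-world rotation matrix:

```python
import numpy as np


def projected_gravity(base_rotation: np.ndarray) -> np.ndarray:
    """Rotate the world gravity direction into the body frame.

    `base_rotation` is a 3x3 body-to-world rotation matrix; the result is a
    3-vector orientation proxy that requires no absolute pose estimate.
    """
    return base_rotation.T @ np.array([0.0, 0.0, -1.0])


def build_actor_obs(base_ang_vel, gravity_proxy, command,
                    joint_pos, joint_vel, prev_actions):
    """Concatenate the six observation groups into the 45-dim actor input."""
    obs = np.concatenate(
        [base_ang_vel, gravity_proxy, command, joint_pos, joint_vel, prev_actions]
    )
    assert obs.shape == (45,)
    return obs


# Usage with an identity base orientation: gravity points straight down.
g = projected_gravity(np.eye(3))
obs = build_actor_obs(np.zeros(3), g, np.zeros(3),
                      np.zeros(12), np.zeros(12), np.zeros(12))
```

Note that every input here is available on the robot: IMU rates, an attitude estimate for the gravity proxy, encoder-derived joint state, and the policy's own action history.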
3. No ground-truth linear velocity
Many locomotion baselines feed ground-truth base linear velocity into the policy. Asimov does not.
The reason is simple:
- the real robot does not measure ground-truth base velocity
- the robot has an IMU and encoder-derived joint state
- training with unavailable information encourages brittle policies
If the actor depends on an observation that disappears at deployment time, transfer quality degrades immediately.
4. Asymmetric actor-critic
Training uses an asymmetric actor-critic structure. The actor is restricted to deployable observations, while the critic receives additional privileged information that improves value estimation.
The critic receives everything the actor sees, plus:
| Privileged term | Dimensions | Purpose |
|---|---|---|
| base linear velocity | 3 | Ground-truth motion during training |
| foot height | 2 | Contact and swing-state context |
| foot air time | 2 | Step timing context |
| foot contact | 2 | Binary contact state |
| foot contact forces | 6 | Ground interaction |
| toe joint position | 2 | Passive toe state |
| toe joint velocity | 2 | Passive toe dynamics |
This setup allows the critic to learn from simulator-only signals without forcing the actor to depend on unavailable hardware data.
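The asymmetric split can be written as a critic-observation builder that appends the 19 privileged dimensions from the table to the actor's 45-dimensional vector, giving a 64-dimensional critic input. The dictionary keys below are hypothetical names for the privileged terms:

```python
import numpy as np

# Privileged, simulator-only terms and their dimensions (sums to 19).
PRIVILEGED_DIMS = {
    "base_lin_vel": 3,
    "foot_height": 2,
    "foot_air_time": 2,
    "foot_contact": 2,
    "foot_contact_forces": 6,
    "toe_joint_pos": 2,
    "toe_joint_vel": 2,
}


def build_critic_obs(actor_obs: np.ndarray, privileged: dict) -> np.ndarray:
    """Critic input = actor observations + privileged training-only terms."""
    extras = np.concatenate([privileged[k] for k in PRIVILEGED_DIMS])
    obs = np.concatenate([actor_obs, extras])
    assert obs.shape == (45 + sum(PRIVILEGED_DIMS.values()),)
    return obs


# Usage: zero-filled placeholders just to show the shapes.
priv = {k: np.zeros(d) for k, d in PRIVILEGED_DIMS.items()}
critic_obs = build_critic_obs(np.zeros(45), priv)
```

Because only the critic consumes `critic_obs`, the privileged terms can be deleted at deployment time without touching the actor.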
5. Why toe state belongs in the critic
The passive toes affect support, push-off, and recovery from forward pitching. However, they are not actively actuated and are not instrumented like the main leg joints.
Toe state is therefore exposed to the critic only.
This allows the training process to capture the relationship between toe deflection and stability, while still requiring the actor to infer toe behavior indirectly from:
- ankle motion
- IMU state
- body response during stance and push-off
6. Contact force as privileged information
Contact force is also useful during training. It helps the critic evaluate whether the robot is loading the ground in a stable way, even though the actor does not receive direct force measurements as a deployment input.
In practice, this improves:
- stance stability
- push-off behavior
- foot placement quality
The resulting policy does not react to contact changes as aggressively as force-rich commercial systems, but it achieves useful, stable behavior without relying on direct force sensing in the deployed policy.
7. Network structure
The policy uses a straightforward multilayer perceptron.
| Network | Structure |
|---|---|
| Actor | 45 -> 512 -> 256 -> 128 -> 12 |
| Critic | (45 + 19 privileged) -> 512 -> 256 -> 128 -> 1 |
Additional settings:
- activation: ELU
- observation normalization: enabled
- initial policy noise standard deviation: 1.0
The network design is intentionally simple. Sim2real performance came primarily from the observation and actuation interface, not from architectural novelty.
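As a framework-agnostic sketch of the table above, a forward pass through both MLPs with ELU activations can be written in plain NumPy. Weight initialization and the exploration noise are simplified here; only the layer sizes, the ELU activation, and the initial noise standard deviation of 1.0 come from the text:

```python
import numpy as np


def elu(x, alpha=1.0):
    """ELU activation: identity for positive inputs, alpha*(e^x - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))


class MLP:
    """Minimal feedforward network: ELU on hidden layers, linear output."""

    def __init__(self, sizes, seed=0):
        rng = np.random.default_rng(seed)
        self.layers = [
            (rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])
        ]

    def forward(self, x):
        *hidden, last = self.layers
        for w, b in hidden:
            x = elu(x @ w + b)
        w, b = last
        return x @ w + b  # linear output: joint targets or a scalar value


actor = MLP([45, 512, 256, 128, 12])
critic = MLP([45 + 19, 512, 256, 128, 1])

action_mean = actor.forward(np.zeros(45))  # shape (12,)
# Exploration during training: Gaussian noise around the mean, initial std 1.0.
action = action_mean + np.random.default_rng(0).normal(0.0, 1.0, size=12)
```

In a real training stack the observation normalization would maintain running mean and variance statistics, and the noise standard deviation would be a learned per-joint parameter rather than a constant.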