Reinforcement Learning for Locomotion
This chapter describes the policy design used for Asimov locomotion and the reasoning behind its observation interface.

Figure: Asimov legs during early locomotion development. This image anchors the policy chapter to the real hardware platform the controller was trained for, rather than to a generic humanoid benchmark or a simulator-only setup.
1. Locomotion as a data-interface problem
The locomotion policy is a standard feedforward neural network. The central design question is not the novelty of the network itself, but whether the policy receives the correct information at the correct time.
For this reason, the locomotion problem is framed as:
How can the domain of data in simulation be made to match the domain of data on hardware as closely as possible?
This framing leads to several design choices:
- avoid observations that are unavailable on the robot
- model timing skew and delay explicitly
- give the critic access to privileged training-only information
- treat actuator behavior as part of the learning problem
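The second point, modeling timing skew and delay explicitly, can be illustrated with a small delay buffer that replays simulated measurements with the same latency the hardware bus introduces. This is a minimal sketch; the class name and the one-step delay are illustrative, not Asimov's actual timing parameters:

```python
from collections import deque

import numpy as np


class DelayedObservation:
    """Replays observations with a fixed latency, mimicking sensor/bus delay."""

    def __init__(self, dim: int, delay_steps: int):
        # Pre-fill with zeros so early reads behave like a cold sensor.
        self.buffer = deque(
            [np.zeros(dim) for _ in range(delay_steps + 1)],
            maxlen=delay_steps + 1,
        )

    def step(self, fresh_obs: np.ndarray) -> np.ndarray:
        """Push the newest measurement, return the one from `delay_steps` ago."""
        self.buffer.append(fresh_obs)
        return self.buffer[0]


# Usage: joint positions arrive one control step late.
delayed = DelayedObservation(dim=12, delay_steps=1)
first = delayed.step(np.ones(12))        # returns the zero pre-fill
second = delayed.step(2 * np.ones(12))   # returns the step-1 measurement
```

Training against delayed observations forces the policy to tolerate the same staleness it will see on the robot, rather than exploiting instantaneous simulator state.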

Figure: Qualitative walking result from the trained legs policy. The motion shown here is useful as a visual reference for the kind of gait the observation design, critic structure, and actuator model were intended to produce on hardware.
2. Actor observations
The actor is limited to signals that can be produced on the real robot. The policy observation vector has 45 dimensions.
| Observation term | Dimensions | Notes |
|---|---|---|
| base angular velocity | 3 | IMU angular velocity |
| projected gravity | 3 | Orientation proxy used instead of ground-truth pose |
| command | 3 | Target v_x, v_y, w_z |
| joint position groups 1-3 | 12 | Grouped by CAN timing |
| joint velocity groups 1-3 | 12 | Grouped by CAN timing |
| previous actions | 12 | Smoothed control history |
The observation design intentionally excludes base linear velocity.
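The 45-dimensional actor input can be expressed as a simple concatenation of the six groups above. A sketch with hypothetical function names; the joint grouping by CAN timing is represented here only by ordering, and the projected-gravity helper shows one common way to compute the orientation proxy from a body-to-world rotation matrix:

```python
import numpy as np


def projected_gravity(base_rotation: np.ndarray) -> np.ndarray:
    """Rotate the world gravity direction into the body frame.

    `base_rotation` is a 3x3 body-to-world rotation matrix; the result is a
    3-vector orientation proxy that requires no absolute pose estimate.
    """
    return base_rotation.T @ np.array([0.0, 0.0, -1.0])


def build_actor_obs(base_ang_vel, gravity_proxy, command,
                    joint_pos, joint_vel, prev_actions):
    """Concatenate the six observation groups into the 45-dim actor input."""
    obs = np.concatenate(
        [base_ang_vel, gravity_proxy, command, joint_pos, joint_vel, prev_actions]
    )
    assert obs.shape == (45,)
    return obs


# Usage with an identity base orientation: gravity points straight down.
g = projected_gravity(np.eye(3))
obs = build_actor_obs(np.zeros(3), g, np.zeros(3),
                      np.zeros(12), np.zeros(12), np.zeros(12))
```

Note that every input here is available on the robot: IMU rates, an attitude estimate for the gravity proxy, encoder-derived joint state, and the policy's own action history.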
3. No ground-truth linear velocity
Many locomotion baselines feed ground-truth base linear velocity into the policy. Asimov does not.
The reason is simple:
- the real robot does not measure ground-truth base velocity
- the robot has an IMU and encoder-derived joint state
- training with unavailable information encourages brittle policies
If the actor depends on an observation that disappears at deployment time, transfer quality degrades immediately.
4. Asymmetric actor-critic
Training uses an asymmetric actor-critic structure. The actor is restricted to deployable observations, while the critic receives additional privileged information that improves value estimation.
The critic receives everything the actor sees, plus:
| Privileged term | Dimensions | Purpose |
|---|---|---|
| base linear velocity | 3 | Ground-truth motion during training |
| foot height | 2 | Contact and swing-state context |
| foot air time | 2 | Step timing context |
| foot contact | 2 | Binary contact state |
| foot contact forces | 6 | Ground interaction |
| toe joint position | 2 | Passive toe state |
| toe joint velocity | 2 | Passive toe dynamics |
This setup allows the critic to learn from simulator-only signals without forcing the actor to depend on unavailable hardware data.
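The asymmetric split can be written as a critic-observation builder that appends the 19 privileged dimensions from the table to the actor's 45-dimensional vector, giving a 64-dimensional critic input. The dictionary keys below are hypothetical names for the privileged terms:

```python
import numpy as np

# Privileged, simulator-only terms and their dimensions (sums to 19).
PRIVILEGED_DIMS = {
    "base_lin_vel": 3,
    "foot_height": 2,
    "foot_air_time": 2,
    "foot_contact": 2,
    "foot_contact_forces": 6,
    "toe_joint_pos": 2,
    "toe_joint_vel": 2,
}


def build_critic_obs(actor_obs: np.ndarray, privileged: dict) -> np.ndarray:
    """Critic input = actor observations + privileged training-only terms."""
    extras = np.concatenate([privileged[k] for k in PRIVILEGED_DIMS])
    obs = np.concatenate([actor_obs, extras])
    assert obs.shape == (45 + sum(PRIVILEGED_DIMS.values()),)
    return obs


# Usage: zero-filled placeholders just to show the shapes.
priv = {k: np.zeros(d) for k, d in PRIVILEGED_DIMS.items()}
critic_obs = build_critic_obs(np.zeros(45), priv)
```

Because only the critic consumes `critic_obs`, the privileged terms can be deleted at deployment time without touching the actor.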
5. Why toe state belongs in the critic
The passive toes affect support, push-off, and recovery from forward pitching. However, they are not actively actuated and are not instrumented like the main leg joints.
Toe state is therefore exposed to the critic only.
This allows the training process to capture the relationship between toe deflection and stability, while still requiring the actor to infer toe behavior indirectly from:
- ankle motion
- IMU state
- body response during stance and push-off
6. Contact force as privileged information
Contact force is also useful during training. It helps the critic evaluate whether the robot is loading the ground in a stable way, even though the actor does not receive direct force measurements as a deployment input.
In practice, this improves:
- stance stability
- push-off behavior
- foot placement quality
The resulting policy does not react to contact changes as aggressively as force-rich commercial systems, but it achieves useful, stable behavior without relying on direct force sensing in the deployed policy.
7. Network structure
The policy uses a straightforward multilayer perceptron.
| Network | Structure |
|---|---|
| Actor | 45 -> 512 -> 256 -> 128 -> 12 |
| Critic | (45 + 19 privileged) -> 512 -> 256 -> 128 -> 1 |
Additional settings:
- activation: ELU
- observation normalization: enabled
- initial policy noise standard deviation: 1.0
The network design is intentionally simple. Sim2real performance came primarily from the observation and actuation interface, not from architectural novelty.
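As a framework-agnostic sketch of the table above, a forward pass through both MLPs with ELU activations can be written in plain NumPy. Weight initialization and the exploration noise are simplified here; only the layer sizes, the ELU activation, and the initial noise standard deviation of 1.0 come from the text:

```python
import numpy as np


def elu(x, alpha=1.0):
    """ELU activation: identity for positive inputs, alpha*(e^x - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))


class MLP:
    """Minimal feedforward network: ELU on hidden layers, linear output."""

    def __init__(self, sizes, seed=0):
        rng = np.random.default_rng(seed)
        self.layers = [
            (rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])
        ]

    def forward(self, x):
        *hidden, last = self.layers
        for w, b in hidden:
            x = elu(x @ w + b)
        w, b = last
        return x @ w + b  # linear output: joint targets or a scalar value


actor = MLP([45, 512, 256, 128, 12])
critic = MLP([45 + 19, 512, 256, 128, 1])

action_mean = actor.forward(np.zeros(45))  # shape (12,)
# Exploration during training: Gaussian noise around the mean, initial std 1.0.
action = action_mean + np.random.default_rng(0).normal(0.0, 1.0, size=12)
```

In a real training stack the observation normalization would maintain running mean and variance statistics, and the noise standard deviation would be a learned per-joint parameter rather than a constant.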