Humanoid Robot Locomotion Control with Integrated State Estimation

The pursuit of advanced humanoid robots represents a critical frontier in robotics, driven by their inherent compatibility with human-engineered environments. This compatibility positions them as vital agents for achieving human-robot collaboration and intelligent automation across diverse fields such as manufacturing, logistics, and personal assistance. Their anthropomorphic form and operational modalities allow for seamless integration into spaces designed for humans. Recent advancements in high-performance actuators, perception hardware, and intelligent control algorithms, particularly reinforcement learning (RL) and simulation-to-real transfer techniques, have significantly accelerated progress in legged locomotion, laying a foundation for humanoid robot development.

However, deploying a humanoid robot in unstructured, real-world scenarios presents profound challenges that extend beyond those faced by simpler quadrupedal platforms. The humanoid morphology, characterized by many degrees of freedom, complex upper-lower body coordination, and an inherently unstable bipedal gait, drastically increases the difficulty of dynamic balance and whole-body motion control. A core challenge in achieving robust locomotion for any legged system, and especially for a humanoid robot, is the reliance on accurate state estimation. In practice, sensors such as Inertial Measurement Units (IMUs) and joint encoders provide only partial and often noisy observations. Critical states such as the base linear velocity, the height of the feet above the ground, and the timing of foot-ground contact are not directly measured, yet they are essential for stable gait generation and adaptation.

This state estimation problem is exacerbated for a humanoid robot operating on complex terrains or under external disturbances. Slippery surfaces, compliant flooring like carpets, or unexpected pushes can cause significant discrepancies between the robot’s internal belief about its state and reality. Traditional methods that rely on explicit kinematic or dynamic models often struggle with the complexities of non-rigid terrain and incur high computational costs. While some approaches use dedicated foot contact sensors, these can be fragile and add system complexity. Consequently, the development of a control framework that does not depend on explicit contact sensing yet achieves high robustness and environmental adaptability remains a pivotal scientific and engineering challenge for advancing humanoid robot capabilities.

Inspired by biological systems and recent learning-based approaches for quadrupeds, we posit that a humanoid robot must learn an internal “world model” – an implicit understanding of its relationship with the environment – to react robustly. This work presents a novel reinforcement learning control framework that co-trains a locomotion policy with a state estimation network. The framework is designed to accurately infer key unobserved state variables online, using only proprioceptive history, and uses these estimates to enable robust velocity tracking across varied and disturbed environments. The primary contributions of this work are: 1) A unified RL framework featuring an asymmetric actor-critic architecture jointly optimized with a state estimation network, eliminating the need for privileged information or external contact sensors during deployment; 2) A novel reward function component designed for the polygonal foot geometry of a humanoid robot to improve contact quality estimation and stepping behavior; and 3) Successful sim-to-real transfer and experimental validation on a custom-built humanoid robot platform, demonstrating stable locomotion at speeds up to 1 m/s on diverse surfaces including rubber, carpet, and hard factory floors.

Methodology

Our objective is to design a reinforcement learning-based control framework that enables a humanoid robot to perform stable, velocity-tracking locomotion while simultaneously learning to estimate its latent state. The overall architecture is depicted in the schema below, integrating two core neural networks: the State Estimation Network and the Control Policy Network (Actor).

Theoretical Model: Partially Observable Markov Decision Process (POMDP)

The control problem is formulated as an infinite-horizon Partially Observable Markov Decision Process (POMDP), defined by the tuple \( (S, O, A, P, R, Z, \gamma) \). Here, \( S \) represents the true state space of the robot-environment system, which is not fully observable. Instead, the robot receives a partial observation \( obs_t \in O \) at each time step \( t \). The robot takes an action \( a_t \in A \), causing a state transition according to the dynamics \( P = p(s_{t+1} | s_t, a_t) \), and receives a scalar reward \( r_t = R(a_t | s_t) \). The discount factor is \( \gamma \in [0, 1) \). A key element in our training framework is the use of privileged information \( z^*_t \in Z \), which is available from the simulator (e.g., terrain heightmap, ground-truth velocity) but inaccessible on the real robot. The policy \( \pi_\phi = p(a_t | obs_t, z_t) \), parameterized by \( \phi \), maps the current observation and an estimated state \( z_t \) to an action. The estimated state \( z_t \) is produced by the state estimation network and aims to approximate the privileged information relevant for control.

The goal of reinforcement learning is to find the optimal policy \( \pi^* \) that maximizes the expected discounted return:

$$
J(\pi^*) = \mathbb{E}_{\rho \sim p(\rho|\pi)} \left[ \sum_{t=0}^{T-1} \gamma^t r(a_t | s_t) \right]
$$

where \( \rho = \{(s_0, a_0, r_0), (s_1, a_1, r_1), \dots\} \) is a trajectory sampled by following the policy \( \pi \).

Network Architecture and Training Framework

The learning framework consists of four neural networks based on Multi-Layer Perceptrons (MLPs): the Policy Network (Actor), the Value Network (Critic), an Encoder Network, and the State Estimation Network.

1. Asymmetric Actor-Critic Design: We employ an asymmetric actor-critic structure. The Policy Network (Actor) takes as input the proprioceptive observation history \( obs_t \) and the estimated state \( z_t \). The Value Network (Critic), which guides the policy update during training, receives a more informative state: the concatenation of \( obs_t \) and an encoded version of the privileged information \( \tilde{z}^*_t \). This asymmetry allows the critic to make more accurate value assessments during training, leading to better policy gradients, while the actor learns to rely solely on information available in the real world.
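For concreteness, the asymmetric split can be sketched as follows. This is a minimal PyTorch illustration; the layer widths, input dimensions, and activation choices are assumptions rather than the exact architecture used here.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(256, 128)):
    """Small MLP builder used for both networks in this sketch."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ELU()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

# Hypothetical dimensions: flattened proprioceptive history, estimated state z_t,
# encoded privileged vector, and action (target joint positions).
OBS_DIM, EST_DIM, PRIV_DIM, ACT_DIM = 705, 8, 32, 12

# Actor: only inputs that exist on the real robot (observation history + estimated state).
actor = mlp(OBS_DIM + EST_DIM, ACT_DIM)

# Critic: additionally sees the encoded privileged information (training only).
critic = mlp(OBS_DIM + PRIV_DIM, 1)

obs_hist = torch.randn(1, OBS_DIM)   # proprioceptive history
z_est = torch.randn(1, EST_DIM)      # output of the state estimation network
z_priv = torch.randn(1, PRIV_DIM)    # encoder output of privileged information

action = actor(torch.cat([obs_hist, z_est], dim=-1))
value = critic(torch.cat([obs_hist, z_priv], dim=-1))
```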

2. State Estimation and Encoding: The State Estimation Network is trained via supervised learning to predict specific, physically interpretable state variables from the recent history of proprioception. Its output is \( z_t = (v_t, h_t, p_t) \), representing the estimated base linear velocity, foot height above the terrain, and foot-ground contact probability, respectively. The Encoder Network compresses the high-dimensional privileged information \( z^*_t \) (e.g., heightmaps, friction coefficients) into a compact feature vector \( \tilde{z}^*_t \). During training, this vector serves as the regression target for the state estimation network under a Mean Squared Error (MSE) loss:

$$
\mathcal{L}_{MSE} = \| \tilde{z}^*_t - z_t \|^2
$$
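As a sketch of how this supervised objective can sit alongside the RL update, the snippet below adds the MSE term to the PPO loss; the weighting coefficient and the exact placement in the optimization step are assumptions not specified in this section.

```python
import torch
import torch.nn.functional as F

def estimation_loss(z_est: torch.Tensor, z_target: torch.Tensor) -> torch.Tensor:
    """MSE between the state estimator output z_t and the encoder's
    compressed privileged vector (the training target)."""
    return F.mse_loss(z_est, z_target)

def total_loss(ppo_loss: torch.Tensor, z_est: torch.Tensor,
               z_target: torch.Tensor, est_weight: float = 1.0) -> torch.Tensor:
    # est_weight is an assumed coefficient; its value is not given in the text.
    return ppo_loss + est_weight * estimation_loss(z_est, z_target)
```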

3. Observation and Action Spaces: The proprioceptive observation \( obs_t \) is a tuple containing:

$$
obs_t = (c_t, \phi_t, \theta_t, \dot{\theta}_t, a_{t-1}, e_t, \omega_t)
$$

where \( c_t \) is the commanded velocity, \( \phi_t \) is the gait phase, \( \theta_t \) and \( \dot{\theta}_t \) are joint positions and velocities, \( a_{t-1} \) is the previous action, and \( e_t \) and \( \omega_t \) are the base orientation (in Euler angles) and angular velocity from the IMU. A history of the last 15 steps of \( obs_t \) is fed into the networks to capture temporal dynamics.
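The 15-step history can be maintained with a simple rolling buffer, as in the sketch below; treating every component as a flat NumPy array is an assumption made for illustration.

```python
from collections import deque
import numpy as np

HISTORY_LEN = 15  # number of past observations fed to the networks

def build_obs(cmd_vel, gait_phase, q, dq, prev_action, euler, ang_vel):
    """One time step of obs_t = (c_t, phi_t, theta_t, dtheta_t, a_{t-1}, e_t, omega_t).
    All inputs are assumed to be 1-D NumPy arrays."""
    return np.concatenate([cmd_vel, gait_phase, q, dq, prev_action, euler, ang_vel])

class ObsHistory:
    """Rolling buffer of the last HISTORY_LEN observations, flattened for the networks."""
    def __init__(self, obs_dim):
        self.buf = deque([np.zeros(obs_dim)] * HISTORY_LEN, maxlen=HISTORY_LEN)

    def push(self, obs):
        self.buf.append(obs)
        return np.concatenate(list(self.buf))  # shape: (HISTORY_LEN * obs_dim,)
```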

The action \( a_t \) output by the policy network represents target joint positions. These are converted to joint torques \( \tau_t \) via a Proportional-Derivative (PD) controller:

$$
\tau_t = K_p (a_t \cdot s_a + \theta_{def} - \theta_t) - K_d \dot{\theta}_t
$$

where \( K_p \) and \( K_d \) are the proportional and derivative gains, \( s_a = 0.25 \) is an action scaling factor, and \( \theta_{def} \) is the default joint position in a neutral standing pose. This low-level PD controller ensures stable and smooth actuation. The gains for the humanoid robot’s leg joints are configured as follows:

| Joint | Hip Yaw | Hip Roll | Hip Pitch | Knee | Ankle Roll | Ankle Pitch |
|---|---|---|---|---|---|---|
| Stiffness \(K_p\) | 200 | 325 | 1450 | 550 | 130 | 310 |
| Damping \(K_d\) | 7.5 | 12 | 75 | 20 | 3 | 3.5 |
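A minimal sketch of this low-level PD conversion, using the gains from the table (the joint ordering and array conventions are assumptions for illustration):

```python
import numpy as np

# Per-leg gains, ordered: hip yaw, hip roll, hip pitch, knee, ankle roll, ankle pitch.
KP = np.array([200.0, 325.0, 1450.0, 550.0, 130.0, 310.0])
KD = np.array([7.5, 12.0, 75.0, 20.0, 3.0, 3.5])
ACTION_SCALE = 0.25  # s_a

def pd_torques(action, q, dq, q_default, kp=KP, kd=KD):
    """tau_t = Kp * (a_t * s_a + theta_def - theta_t) - Kd * dtheta_t."""
    q_target = action * ACTION_SCALE + q_default
    return kp * (q_target - q) - kd * dq
```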

Reward Function Design

The reward function is carefully engineered to encourage velocity tracking, energy efficiency, and stable, natural motion for the humanoid robot. It is a weighted sum of multiple components: \( r_t = \sum_i r_i w_i \). A key innovation for the humanoid robot is the foot placement reward. Unlike quadrupeds with point feet, a humanoid robot has polygonal feet. To encourage stable, full-foot contact and discourage toe or heel scraping, we sample \( n \) points on the sole of each foot. The reward penalizes sample points that sink below the terrain surface beyond a threshold \( \epsilon \), effectively teaching the robot to place its foot flat on the supporting ground.

$$
r_{foot} = – \sum_{i=1}^{2} c_i \sum_{j=1}^{n} \mathbb{1}(d_{ij} < \epsilon)
$$

Here, \( c_i \) indicates whether foot \( i \) is in the stance phase, \( d_{ij} \) is the height of sample point \( j \) on foot \( i \) relative to the local terrain surface, and \( \mathbb{1} \) is the indicator function. The primary reward components and their weights are summarized below:

| Reward Component \( r_i \) | Mathematical Definition | Weight \( w_i \) |
|---|---|---|
| Linear Velocity Tracking | \( \exp(-5 \lVert c_{t,xy}^{v} - v_{t,xy} \rVert^2 ) \) | 1.8 |
| Angular Velocity Tracking | \( \exp(-5 \lVert c_{t,xy}^{\omega} - \omega_{t,xy} \rVert^2 ) \) | 1.2 |
| Joint Torque | \( -\sum \lvert \tau_t \rvert^2 \) | -1e-7 |
| Joint Acceleration | \( -\sum \lvert \ddot{\theta}_t \rvert^2 \) | -1e-7 |
| Joint Velocity Limit | \( -\sum \mathbb{1}(\lvert \dot{\theta}_t \rvert > \dot{\theta}_{lim}) \) | -1.0 |
| Action Rate | \( -\sum \lvert a_t - a_{t-1} \rvert^2 \) | -5e-6 |
| Foot Placement (Novel) | \( -\sum_{i=1}^{2} c_i \sum_{j=1}^{n} \mathbb{1}(d_{ij} < \epsilon) \) | -0.01 |
| Base Height | \( \exp(-100 \lvert h_{base} - h_{desired} \rvert ) \) | 0.2 |
| Feet Distance | \( (\exp(-100 \lvert d_{min} \rvert) + \exp(-100 \lvert d_{max} \rvert))/2 \) | 0.2 |
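To illustrate the foot-placement term defined above, the sketch below counts sole sample points that fall below the threshold for each stance foot; the sample-point layout, terrain query, and the default threshold value are hypothetical.

```python
import numpy as np

def foot_placement_penalty(stance_flags, sole_point_heights, eps=0.01):
    """r_foot = -sum_i c_i * sum_j 1(d_ij < eps).

    stance_flags: (2,) array of c_i, whether foot i is in stance.
    sole_point_heights: (2, n) array of d_ij, heights of the n sampled sole
        points of each foot relative to the local terrain surface.
    eps: threshold (assumed value here; not specified in the text).
    """
    violations = (sole_point_heights < eps).sum(axis=1)           # per-foot counts
    return -float((stance_flags.astype(float) * violations).sum())
```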

Training and Domain Randomization

Training is conducted in a parallelized simulation environment (Isaac Gym) using the Proximal Policy Optimization (PPO) algorithm. To facilitate robust sim-to-real transfer, extensive domain randomization is applied during training, varying physical parameters within specified ranges to prevent the policy from overfitting to simulation specifics. The key PPO hyperparameters and domain randomization ranges are listed below.

PPO Algorithm Hyperparameters

| Hyperparameter | Value |
|---|---|
| Learning Rate | 1e-3 |
| Discount Factor (\(\gamma\)) | 0.994 |
| GAE Parameter (\(\lambda\)) | 0.95 |
| Entropy Coefficient | 0.01 |
| Training Epochs per Update | 2 |
| Mini-batch Count | 4 |

Domain Randomization Parameters

| Parameter | Randomization Range |
|---|---|
| Terrain Friction Coefficient | [0.2, 2.0] |
| Joint Damping | ±30% of nominal |
| Payload Mass (at base) | [-5.0, 5.0] kg |
| Motor Strength | [0.8, 1.2] of nominal |
| Commanded Velocity | Uniform in [-1.0, 1.0] m/s (linear), [-0.5, 0.5] rad/s (angular) |
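The randomization ranges above might be drawn once per training episode along these lines; the parameter names and per-episode sampling granularity are assumptions for illustration.

```python
import numpy as np

def sample_domain_randomization(rng: np.random.Generator) -> dict:
    """Draw one set of randomized physical parameters within the listed ranges."""
    return {
        "friction": rng.uniform(0.2, 2.0),
        "joint_damping_scale": rng.uniform(0.7, 1.3),   # +/-30% of nominal
        "payload_mass_kg": rng.uniform(-5.0, 5.0),
        "motor_strength_scale": rng.uniform(0.8, 1.2),
        "cmd_lin_vel_mps": rng.uniform(-1.0, 1.0),
        "cmd_ang_vel_radps": rng.uniform(-0.5, 0.5),
    }

# Example: params = sample_domain_randomization(np.random.default_rng(0))
```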

Experiments and Results

Ablation Studies in Simulation

To validate the contribution of the state estimation network and the asymmetric design, we conducted ablation studies within the simulation. We compared three configurations: 1) Ours: The full proposed framework; 2) w/o Estimator: Policy network receives only \( obs_t \), critic receives \( (obs_t, \tilde{z}^*_t) \); 3) w/o Asymmetric: Both actor and critic receive \( (obs_t, \tilde{z}^*_t) \) (i.e., full privileged info during training and testing, an unrealistic upper bound).

The convergence of the average total reward during training is plotted below. The full method achieves both a higher final reward and more stable learning than the ablated versions. The performance drop when the state estimator is removed confirms that learning to infer the latent state is crucial when privileged information is absent, while the full asymmetric design surpassing the "w/o Asymmetric" variant underscores the benefit of restricting privileged information to the critic during training.

We further evaluated the terrain traversal success rate of the final policies across three distinct terrains: flat ground, slopes, and rough uneven terrain. Each policy was tested for 10 episodes per terrain. The results clearly demonstrate the superior robustness of the full method, especially on challenging terrain, highlighting the importance of accurate state estimation for the humanoid robot.

Simulated Terrain Traversal Success Rate (%)

| Method / Terrain | Flat Ground | Slopes | Rough Terrain |
|---|---|---|---|
| Ours (Full Method) | 100% | 100% | 100% |
| w/o Asymmetric | 100% | 80% | 20% |
| w/o Estimator | 100% | 50% | 10% |

The accuracy of the state estimation network was quantified by comparing its output against ground-truth simulator values during a forward walk. The results show high fidelity, confirming the network’s capability to provide reliable state information to the policy.

State Estimation Network Output vs. Ground Truth

| Estimated Parameter | Network Output | Simulator Ground Truth |
|---|---|---|
| Base Linear Velocity (\(v_t\)) | 0.97 m/s | 1.00 m/s |
| Foot Height (\(h_t\)) | 0.038 m | 0.040 m |
| Foot Contact Probability (\(p_t\)) | 92% | 95% |

Sim-to-Sim Transfer and Disturbance Rejection

Before real-world deployment, the policy was transferred from Isaac Gym to the MuJoCo simulator without any fine-tuning (sim-to-sim). This step tests the policy’s robustness to dynamics discrepancies. Furthermore, we tested the policy’s ability to reject external disturbances. Random force pushes were applied to the torso of the humanoid robot during locomotion. The framework, leveraging its state estimation to quickly perceive the induced velocity change, was able to generate corrective actions to recover balance and maintain the desired walking direction, demonstrating inherent robustness.

Real-World Deployment on Humanoid Robot Platform

The final policy was deployed without modification on a custom-built, full-sized humanoid robot. The platform stands approximately 164 cm tall, weighs 43 kg, and has 25 degrees of freedom. The control policy and state estimation network run in real-time at 100 Hz on an onboard computer. The humanoid robot was tested on various indoor surfaces, including rubber mats, carpet, and hard factory floors, with a commanded forward velocity of up to 1 m/s.
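A simplified view of that onboard loop is sketched below; only the 100 Hz rate and the estimator-then-policy structure come from the text, while the robot interface names are placeholders.

```python
import time
import numpy as np

CONTROL_DT = 1.0 / 100.0  # policy and estimator run at 100 Hz

def control_loop(policy, estimator, robot, obs_history, q_default):
    """Placeholder interfaces: robot.read_proprioception(), robot.apply_pd_targets(),
    and robot.ok() stand in for the actual onboard drivers."""
    prev_action = np.zeros(robot.num_joints)
    while robot.ok():
        t0 = time.monotonic()
        obs = robot.read_proprioception(prev_action)    # IMU + joint encoders
        hist = obs_history.push(obs)                     # 15-step history
        z_est = estimator(hist)                          # estimated v_t, h_t, p_t
        action = policy(np.concatenate([hist, z_est]))   # target joint offsets
        robot.apply_pd_targets(action, q_default)        # low-level PD tracking
        prev_action = action
        time.sleep(max(0.0, CONTROL_DT - (time.monotonic() - t0)))
```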

The deployment was successful, with the humanoid robot demonstrating stable and adaptive walking. The joint trajectories showed smooth, periodic motion, and the foot contact forces indicated stable ground interaction without excessive impacts. A direct comparison was made between policies trained with and without the state estimation network. The policy with state estimation showed significantly better velocity tracking accuracy and was able to achieve a higher maximum stable walking speed across all tested surfaces, as quantified below.

Real-Robot Performance Comparison

| Performance Metric | Terrain | With State Estimator | Without State Estimator |
|---|---|---|---|
| Command Tracking Error (m/s) | Rubber | 0.20 | 0.60 |
| | Carpet | 0.30 | 0.70 |
| | Hard Floor | 0.07 | 0.30 |
| State Estimation Error (MSE) | Rubber | 0.07 | N/A |
| | Carpet | 0.09 | N/A |
| | Hard Floor | 0.05 | N/A |
| Max Stable Speed (m/s) | Rubber | 0.8 | 0.4 |
| | Carpet | 0.7 | 0.3 |
| | Hard Floor | 1.0 | 0.7 |

Conclusion

In this work, we presented a robust locomotion control framework for humanoid robots that integrates state estimation directly into the reinforcement learning pipeline. The core idea is to co-train a control policy with a neural network that learns to infer critical, unobserved states—such as base velocity, foot height, and contact probability—from proprioceptive history. This allows the humanoid robot to maintain an accurate internal model of its interaction with the environment without relying on explicit contact sensors or privileged information during deployment.

The proposed asymmetric actor-critic architecture, combined with a novel foot-placement reward designed for polygonal feet, enables the learning of a policy that is both perceptive and robust. Extensive simulation experiments, including ablation studies and disturbance rejection tests, validated the necessity and effectiveness of each component. Most importantly, the framework demonstrated successful zero-shot sim-to-real transfer, enabling a physical humanoid robot to walk stably at speeds up to 1 m/s across diverse and challenging real-world surfaces.

This research marks a step toward more autonomous and adaptive humanoid robots. The current framework relies on a predefined gait structure. Future work will focus on integrating imitation learning from human motion data to generate more natural and versatile locomotion behaviors for the humanoid robot. Furthermore, extending the state estimation to predict a broader range of environmental properties and incorporating exteroceptive perception (e.g., vision) will be crucial for enabling truly robust navigation and manipulation in complex, unstructured environments.
