In recent years, the development of humanoid robots has gained significant attention due to their potential in space missions, disaster response, and industrial applications. Humanoid robots, with their human-like structure, offer unique advantages in adaptability and dexterity, enabling them to navigate unstructured environments and perform complex tasks. However, a critical challenge lies in enabling these humanoid robots to learn diverse whole-body motion skills under a single policy model while ensuring high-quality execution and smooth transitions between skills. Traditional methods often struggle with conflicts arising from differing skill characteristics, such as the need for static balance in standing versus dynamic movements in jumping, leading to suboptimal performance or catastrophic forgetting. This paper addresses these issues by proposing an efficient imitation learning framework that integrates goal-conditioned reinforcement learning and generative adversarial imitation learning, enhanced with preference-based rewards and adaptive sampling techniques.
The core of our approach is to train humanoid robots to imitate a wide range of human demonstrations, including standing, squatting, walking, jumping over obstacles, stooping for inspection, and picking up objects. By leveraging a single policy model, we aim to reduce computational overhead and improve real-time performance. Our method, which we refer to as the Single Model Imitation Learning for Multi-skill Efficiency (SMILE), incorporates several novel components: a discriminator network to ensure motion realism, a preference-based reward function to handle skill conflicts, and a failure-frequency-based priority sampling mechanism to focus on challenging skills. Through extensive simulations, we demonstrate that humanoid robots trained with SMILE achieve a high success rate of over 90% in executing and transitioning between multiple skills, outperforming existing methods that rely solely on reinforcement learning or uniform sampling.

Humanoid robots are particularly valuable in space applications, where they can utilize tools designed for humans and adapt to unpredictable environments. For instance, in space stations, humanoid robots can perform maintenance tasks, while on extraterrestrial surfaces, they can conduct exploration and sample collection. The ability to learn diverse skills in a unified manner is essential for enhancing the autonomy and survivability of humanoid robots in such scenarios. However, previous works have often focused on single-skill learning or decoupled control of upper and lower body movements, which can lead to coordination issues in dynamic tasks. Our work builds upon advances in imitation learning and reinforcement learning for humanoid robots, aiming to overcome limitations in multi-skill learning by addressing the trade-off between motion quality and continuity during transitions.
In this paper, we first describe the problem formulation and the architecture of our humanoid robot system. We then detail the SMILE method, including the network structures, state and action spaces, reward functions, and sampling strategies. We present simulation results that validate the effectiveness of our approach, followed by ablation studies and comparisons with baseline methods. Finally, we discuss the implications of our findings and outline future research directions for deploying humanoid robots in real-world applications.
Background and Related Work
Humanoid robots have been a focal point in robotics research due to their potential to perform human-like tasks in complex environments. Reinforcement learning (RL) has shown promise in enabling humanoid robots to acquire locomotion and manipulation skills. For example, prior studies have demonstrated that humanoid robots can learn to walk, run, and even recover from falls using RL-based methods. However, RL often requires extensive reward engineering and may struggle with learning nuanced human-like motions from limited data. Imitation learning (IL) addresses this by leveraging human demonstration data to guide policy learning, resulting in more natural and efficient behaviors.
Recent approaches combine RL and IL to harness the strengths of both. Adversarial Motion Priors (AMP), for instance, use a discriminator to encourage policies to match the state distribution of human demonstrations, leading to realistic gait patterns in humanoid robots. Despite these advances, multi-skill learning remains challenging. Methods like ExBody decouple upper and lower body control to handle diverse skills, but this can result in coordination issues during dynamic movements. Other frameworks, such as H2O and OmniH2O, enable real-time teleoperation and multi-skill execution but may suffer from instability or reduced motion quality due to frequent step adjustments.
Our work draws inspiration from these methods but introduces key innovations to handle skill conflicts and transition smoothness. By integrating goal-conditioned RL with generative adversarial IL, and incorporating preference-based rewards and adaptive sampling, SMILE enables humanoid robots to learn a broad skill set while maintaining high performance across transitions. This holistic approach sets it apart from existing techniques that often prioritize either stability or dynamism, leading to compromises in multi-skill scenarios.
Methodology
The SMILE method is designed to train humanoid robots to imitate diverse whole-body motion skills through a single policy model. We formulate the problem as a finite-horizon Markov Decision Process (MDP) defined by the tuple $$(S, A, R, P, \gamma)$$, where $$S$$ is the state space, $$A$$ is the action space, $$R$$ is the reward function, $$P$$ is the state transition function, and $$\gamma$$ is the discount factor. The objective is to maximize the expected cumulative reward: $$J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T-1} \gamma^t R(s_t, a_t) \right]$$, where $$\tau$$ is a trajectory generated by the policy $$\pi$$.
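For concreteness, the minimal sketch below evaluates this objective for a single sampled trajectory. It is illustrative only and not part of the paper's implementation; the discount value is an arbitrary placeholder.

```python
# Minimal sketch: discounted return of one sampled trajectory,
# i.e. sum_{t=0}^{T-1} gamma^t * R(s_t, a_t). Gamma is a placeholder value.
def discounted_return(rewards, gamma=0.99):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: a three-step trajectory of per-step rewards.
print(discounted_return([1.0, 0.5, 0.2]))  # 1.0 + 0.99*0.5 + 0.99^2*0.2
```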
Our humanoid robot system consists of multiple layers: perception, decision-making, and motion control. The perception layer includes sensors such as cameras and IMUs, while the decision layer processes data and generates task commands. The motion control layer executes policies to produce joint-level actions. For training, we use a simulated environment with a physics engine, allowing efficient policy optimization through parallel experimentation.
Network Architecture
The policy network is a 6-layer Multilayer Perceptron (MLP) with hidden layer sizes of 2048, 1536, 1024, 1024, 512, and 512 neurons, using SiLU activation functions. The value network has an identical structure. The discriminator network comprises two hidden layers with 1024 and 512 neurons, respectively, using ReLU activations. The discriminator is trained to distinguish between policy-generated trajectories and reference human demonstrations, providing a reward signal that encourages realistic motion. Policy updates are performed using the Proximal Policy Optimization (PPO) algorithm, which ensures stable training by constraining policy changes.
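The following PyTorch sketch mirrors the layer shapes described above. The input dimensions (298 + 480 = 778 for the policy and value networks, following the state table in the next subsection), the discriminator input size, and the scalar discriminator output are assumptions for illustration, not values taken verbatim from the implementation.

```python
# Hedged sketch of the described network shapes (assumed input/output sizes).
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, act):
    layers, last = [], in_dim
    for h in hidden:
        layers += [nn.Linear(last, h), act()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)

obs_dim = 298 + 480   # proprioceptive + goal state (see state table below)
act_dim = 19          # target joint positions (see action space)
disc_obs_dim = 512    # placeholder: discriminator input features are not specified here

# Policy and value networks: six hidden layers with SiLU activations.
policy_net = mlp(obs_dim, [2048, 1536, 1024, 1024, 512, 512], act_dim, nn.SiLU)
value_net  = mlp(obs_dim, [2048, 1536, 1024, 1024, 512, 512], 1, nn.SiLU)

# Discriminator: two hidden layers (1024, 512) with ReLU, scalar realism logit.
discriminator = mlp(disc_obs_dim, [1024, 512], 1, nn.ReLU)
```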
State and Action Spaces
The state space $$S$$ is divided into proprioceptive state $$s^p_t$$ and goal state $$s^g_t$$. The proprioceptive state captures the current state of the humanoid robot, including joint positions, orientations, angular velocities, and linear velocities. It is defined as $$s^p_t = [r_t, p_t, \omega_t, v_t]$$, where $$r_t$$ represents joint orientations in 6D rotation format, $$p_t$$ denotes 3D positions, $$\omega_t$$ is angular velocity, and $$v_t$$ is linear velocity. The goal state measures the discrepancy between the current state and the reference motion: $$s^g_t = (\hat{r}_{t+1} \odot r_t, \hat{p}_{t+1} - p_t, \hat{v}_{t+1} - v_t, \hat{\omega}_t - \omega_t, \hat{\theta}_{t+1}, \hat{p}_{t+1})$$, where symbols with hats indicate reference values from human demonstrations, and $$\odot$$ computes the relative rotation error.
The action space $$A$$ consists of target joint positions for PD controllers, with a dimensionality of 19, excluding certain degrees of freedom such as those in the neck, wrists, and ankles to simplify control. This design focuses on core body movements essential for whole-body skills.
| State Component | Description | Dimension |
|---|---|---|
| Proprioceptive State | Joint orientations, positions, angular and linear velocities | 298 |
| Goal State | Differences in orientation, position, velocity, and angular velocity from reference | 480 |
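The sketch below illustrates, under assumed tensor layouts, how the components above can be assembled into the policy observation and how the 19-D action is interpreted as target joint positions for a PD controller. The orderings, per-term shapes, and PD gains are placeholders for illustration, not values from the paper.

```python
# Illustrative sketch with assumed tensor layouts (dimensions follow the table above).
import torch

def build_observation(r_t, p_t, omega_t, v_t, goal_terms):
    """r_t: flattened 6D joint orientations, p_t: 3D positions,
    omega_t / v_t: angular and linear velocities;
    goal_terms: list of difference tensors against the reference motion."""
    s_p = torch.cat([r_t, p_t, omega_t, v_t], dim=-1)   # proprioceptive state (298-D)
    s_g = torch.cat(goal_terms, dim=-1)                 # goal state (480-D)
    return torch.cat([s_p, s_g], dim=-1)                # policy input (778-D)

def pd_torque(q_target, q, q_dot, kp=100.0, kd=2.0):
    """Convert the 19-D action (target joint positions) into joint torques;
    the gains kp and kd are placeholder values."""
    return kp * (q_target - q) - kd * q_dot
```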
Reward Function
The reward function $$R(s_t, a_t)$$ is a combination of four components: imitation reward, discriminator reward, preference reward, and physical constraint reward. The imitation reward encourages the policy to match reference motions across multiple aspects, including joint angles, velocities, positions, and orientations. Each sub-component is computed using an exponential decay function to emphasize accuracy. The discriminator reward is derived from the discriminator’s output, providing higher rewards when the policy’s state distribution aligns with that of human demonstrations.
The preference reward incorporates human-like biases to penalize undesirable behaviors. It includes a tripping penalty, a foot-sliding penalty, and a foot-orientation term. Mathematically, the preference reward $$r_{\text{pref}}$$ is defined as $$r_{\text{pref}} = k_b r_b + k_s r_s + k_o r_o$$, where $$r_b = \mathbb{I}(\|F^{xy}_{\text{feet}}\|_2 > 5\,|F^{z}_{\text{feet}}|)$$ penalizes tripping (the horizontal contact force greatly exceeding the vertical component), $$r_s = \|v_{\text{feet}}\|_2^2 \cdot \mathbb{I}(\|F_{\text{feet}}\|_2 > 1)$$ penalizes foot sliding while the foot is in contact, and $$r_o = \|g^{xy}_{\text{l\_feet}}\|_2 + \|g^{xy}_{\text{r\_feet}}\|_2$$ encourages proper foot alignment with the ground. Here, $$\mathbb{I}(\cdot)$$ is an indicator function, $$F$$ denotes contact forces, $$v$$ is velocity, and $$g$$ represents the gravity components projected into the foot frames.
The physical constraint reward promotes energy efficiency and reduces high-frequency oscillations, such as foot jittering. The overall reward is a weighted sum: $$R = w_{\text{im}} r_{\text{im}} + w_{\text{disc}} r_{\text{disc}} + w_{\text{pref}} r_{\text{pref}} + w_{\text{con}} r_{\text{con}}$$, where the weights are tuned to balance the contributions.
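As a hedged sketch only, the code below mirrors the structure of the preference reward and the overall weighted sum in PyTorch. Tensor shapes, the sign convention on the $$k$$ coefficients (negative here, since $$r_b$$ and $$r_s$$ are penalties), and all weight values are assumptions for illustration rather than the paper's settings.

```python
# Sketch of the preference reward terms and the weighted total reward.
import torch

def preference_reward(feet_force_xy, feet_force_z, feet_force, feet_vel_xy,
                      grav_xy_left, grav_xy_right,
                      k_b=-1.0, k_s=-1.0, k_o=-1.0):
    """Per-foot inputs (assumed shapes): feet_force_xy (num_feet, 2),
    feet_force_z (num_feet,), feet_force (num_feet, 3), feet_vel_xy (num_feet, 2);
    grav_xy_* (2,) are projected-gravity components in each foot frame."""
    # Tripping: horizontal contact force far exceeds the vertical component.
    r_b = (feet_force_xy.norm(dim=-1) > 5.0 * feet_force_z.abs()).float().sum()
    # Sliding: squared horizontal foot velocity while the foot is in contact.
    r_s = (feet_vel_xy.norm(dim=-1) ** 2
           * (feet_force.norm(dim=-1) > 1.0).float()).sum()
    # Orientation: projected gravity in the foot frames penalizes tilted feet.
    r_o = grav_xy_left.norm() + grav_xy_right.norm()
    return k_b * r_b + k_s * r_s + k_o * r_o

def total_reward(r_im, r_disc, r_pref, r_con, w=(0.5, 0.25, 0.15, 0.1)):
    """R = w_im*r_im + w_disc*r_disc + w_pref*r_pref + w_con*r_con (weights assumed)."""
    w_im, w_disc, w_pref, w_con = w
    return w_im * r_im + w_disc * r_disc + w_pref * r_pref + w_con * r_con
```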
Initialization and Termination
We use reference state initialization, where each training episode starts from a randomly sampled state in the reference motion dataset. This reduces exploration difficulty and accelerates learning. Additionally, an early termination mechanism is employed: if the average Euclidean distance between the humanoid robot’s joint positions and the reference exceeds 0.25 meters, the episode terminates. This prevents the policy from diverging too far from desired behaviors.
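The early-termination rule above reduces to a simple distance check, sketched below under an assumed (num_bodies, 3) position layout.

```python
# Sketch of the early-termination check: end the episode when the mean
# Euclidean distance between tracked body positions and the reference
# exceeds 0.25 m. The tensor layout is an assumption.
import torch

def should_terminate(body_pos, ref_body_pos, threshold=0.25):
    dist = (body_pos - ref_body_pos).norm(dim=-1)  # per-body distance in meters
    return bool(dist.mean() > threshold)
```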
Failure-Frequency-Based Priority Sampling
To address catastrophic forgetting and uneven skill learning, we propose a priority sampling method based on failure counts. The sampling probability for each reference motion sample $$i$$ is given by: $$P_i = \frac{(\rho_i^k + \epsilon)^\alpha}{\sum_j (\rho_j^k + \epsilon)^\alpha}$$, where $$\rho_i^k$$ is the cumulative failure count for sample $$i$$ at evaluation $$k$$, $$\epsilon$$ is a smoothing constant, and $$\alpha$$ controls the sampling bias. The failure count is updated as $$\rho_i^k \leftarrow \eta \rho_i^{k-1} + \Delta_{\text{fail}}$$, where $$\Delta_{\text{fail}} = 1$$ if a failure occurs and $$0$$ otherwise, and $$\eta$$ is a decay factor. This approach prioritizes challenging skills, improving overall learning efficiency.
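The NumPy sketch below illustrates this sampling rule; the class interface and the hyperparameter values ($$\epsilon$$, $$\alpha$$, $$\eta$$) are illustrative assumptions, not the tuned settings used in training.

```python
# Sketch of failure-frequency-based priority sampling:
# P_i = (rho_i + eps)^alpha / sum_j (rho_j + eps)^alpha,
# with the decayed update rho_i <- eta * rho_i + 1{failure}.
import numpy as np

class FailurePrioritySampler:
    def __init__(self, num_motions, eps=1e-2, alpha=1.0, eta=0.9, seed=0):
        self.rho = np.zeros(num_motions)   # decayed cumulative failure counts
        self.eps, self.alpha, self.eta = eps, alpha, eta
        self.rng = np.random.default_rng(seed)

    def update(self, failed_mask):
        """failed_mask: boolean array marking motions that failed this evaluation."""
        self.rho = self.eta * self.rho + failed_mask.astype(float)

    def sample(self, batch_size):
        weights = (self.rho + self.eps) ** self.alpha
        probs = weights / weights.sum()
        return self.rng.choice(len(self.rho), size=batch_size, p=probs)
```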
| Reward Type | Description | Mathematical Form |
|---|---|---|
| Imitation Reward | Encourages matching reference motions | Sum of exponential decays on joint errors |
| Discriminator Reward | Aligns policy state distribution with demonstrations | Based on discriminator output |
| Preference Reward | Penalizes tripping, sliding, and poor foot orientation | $$r_{\text{pref}} = k_b r_b + k_s r_s + k_o r_o$$ |
| Physical Constraint Reward | Promotes energy efficiency and reduces jitter | Penalizes large forces and high frequencies |
Experiments and Results
We evaluated the SMILE method in a simulated environment using the Isaac Gym platform, with policies running at 30 Hz and simulations at 60 Hz. The humanoid robot model has 55 degrees of freedom, emphasizing human-like proportions and dynamics. Training was conducted on a system with an Intel i9-13900K CPU and NVIDIA RTX 4090 GPU. We used 15 diverse human demonstration sequences from the AMASS dataset, covering skills such as standing, squatting, walking, jumping, and object manipulation.
The success rate $$E_s$$ was used as the primary metric, defined as the ability to maintain stability and synchronize with reference motions without exceeding a 0.5-meter average joint position error. Our results show that humanoid robots trained with SMILE achieved a success rate of 93.33%, significantly outperforming baselines. The robots demonstrated smooth transitions between skills, such as from walking to jumping or squatting to standing, with natural human-like motion patterns.
For example, in multi-directional object picking, the humanoid robot coordinated whole-body movements, including waist rotation and leg adjustments, to reach targets efficiently. In walking transitions, it exhibited heel-to-toe gait patterns and seamless direction changes. High-dynamic skills like jumping over obstacles involved brief aerial phases, showcasing the policy’s ability to handle dynamic balance. These outcomes highlight the effectiveness of SMILE in enabling humanoid robots to perform complex, multi-skill tasks.
Ablation Studies
We conducted ablation experiments to assess the impact of key components. Removing the preference reward reduced the success rate to 53.33%, while disabling priority sampling resulted in a 73.33% success rate. Training efficiency also declined, as measured by episode length and cumulative reward. The discriminator reward, which reflects motion realism, was higher in SMILE (around 1.6) compared to ablations (around 0.6), indicating better alignment with human demonstrations.
These results underscore the importance of preference rewards in handling skill conflicts and priority sampling in focusing on difficult skills. Without these elements, humanoid robots struggled with instability and inefficient motion, particularly during transitions between static and dynamic skills.
Comparison with Reinforcement Learning Methods
We compared SMILE with a goal-conditioned RL baseline similar to OmniH2O, which uses reward shaping for multi-skill learning. The baseline initially learned conservative behaviors, such as standing with minimal movement, but failed to master dynamic skills like jumping or fast walking. Fine-tuning with SMILE’s reward functions improved performance but led to increased volatility and skill degradation, such as reduced step size in walking.
In contrast, SMILE maintained high motion quality across all skills, with no significant forgetting. The discriminator reward provided a consistent signal for realism, preventing the policy from adopting suboptimal shortcuts. This demonstrates that SMILE effectively balances skill-specific requirements and transition smoothness, a key advantage for humanoid robots in real-world applications.
| Method | Success Rate (%) | Average Episode Length | Discriminator Reward |
|---|---|---|---|
| SMILE (Ours) | 93.33 | 309 | 1.6 |
| Without Preference Reward | 53.33 | 177 | 0.6 |
| Without Priority Sampling | 73.33 | ~250 | ~1.0 |
| Goal-Conditioned RL Baseline | ~60 | ~200 | ~0.5 |
Conclusion
In this paper, we presented the SMILE method for enabling humanoid robots to learn diverse whole-body motion skills through a single policy model. By integrating goal-conditioned reinforcement learning with generative adversarial imitation learning, and incorporating preference-based rewards and adaptive sampling, SMILE addresses key challenges in multi-skill learning, such as skill conflicts and transition quality. Simulation results confirm that humanoid robots trained with SMILE achieve high success rates and natural motion patterns, outperforming existing methods.
Future work will focus on sim-to-real transfer, leveraging teacher-student frameworks to deploy SMILE on physical humanoid robot platforms. We also plan to extend the method to larger skill sets and more complex environments, further enhancing the scalability and robustness of humanoid robots. The ability to learn and transition between multiple skills efficiently will be crucial for advancing the autonomy of humanoid robots in space exploration, disaster response, and other critical applications.
The SMILE method represents a significant step forward in multi-skill imitation learning for humanoid robots, offering a balanced approach to motion quality and continuity. As humanoid robots continue to evolve, methods like SMILE will play a vital role in unlocking their full potential in real-world scenarios.