Intelligent Whole-Body Motion Strategy Generation for Multi-Skilled Humanoid Robots

The pursuit of creating a versatile humanoid robot capable of performing a wide array of tasks in complex, unstructured environments represents a fundamental challenge in robotics. A core aspect of this challenge is the generation of whole-body motion strategies that are not only physically feasible and stable but also appear natural and human-like. While significant progress has been made in training humanoid robots to master individual skills—such as walking, standing up, or running—in isolation, a critical gap remains: enabling a single policy model to learn and seamlessly execute a diverse portfolio of skills. The central problem I aim to address is the inherent conflict that arises when a monolithic policy attempts to optimize for skills with divergent motion characteristics. For instance, static skills like standing or detailed inspection require the humanoid robot to maintain equilibrium with minimal movement, favoring conditions of static balance. In stark contrast, dynamic skills like jumping over a trench or high-dynamic stepping necessitate the active breaking of static balance, involving rapid, forceful motions and periods of aerial phase before recovering stability. When a single neural network policy is trained on data encompassing such varied skills, the gradient signals from these conflicting objectives can interfere, leading to suboptimal convergence, catastrophic forgetting of previously learned skills, or a compromised “averaged” behavior that excels at none.

Furthermore, even if a policy learns to approximate individual skills, transitioning between them smoothly while maintaining the quality of each motion segment is non-trivial. An over-emphasis on smoothness may dilute the distinctive features of a skill (e.g., the height of a jump), while focusing solely on perfecting individual skills may result in jerky, unstable transitions that could cause the humanoid robot to fall. My research is motivated by the need for humanoid robots in future space missions, where adaptability and the ability to perform a repertoire of mobile manipulation tasks—from routine station maintenance to extraterrestrial surface exploration—are paramount. The goal is to develop a unified framework that allows a humanoid robot to be an agile, general-purpose platform.

To overcome these challenges, I propose a novel framework termed Single-model Imitation Learning for Multi-skill Efficiency (SMILE). This method integrates Goal-Conditioned Reinforcement Learning (GCRL) with Generative Adversarial Imitation Learning (GAIL) to efficiently generate high-quality, diverse motions from a single policy. The core innovation lies in two synergistic components: a preference-based reward shaping mechanism and a failure-frequency-based priority sampling strategy. The preference rewards inject human priors about “good” motion characteristics beyond mere kinematic imitation, guiding the policy to resolve conflicts between skill types. The adaptive sampling algorithm dynamically focuses training on challenging skills that the policy struggles with, ensuring balanced learning progress across the entire skill set and preventing catastrophic forgetting. Formally, the training objective for the humanoid robot policy $\pi$ is to maximize the expected cumulative return over trajectories $\tau$:

$$
J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T-1} \gamma^t R(\mathbf{s}^p_t, \mathbf{s}^g_t, \mathbf{a}_t) \right]
$$

where $\gamma$ is the discount factor, and the composite reward $R$ is key to our approach.
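For concreteness, the discounted return inside the expectation can be computed from a per-step reward sequence as follows (a minimal sketch; the reward values and discount are placeholders):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=0}^{T-1} gamma^t * R_t for one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```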

Methodological Framework

1. Problem Formulation and Motion Retargeting

The task is formalized as a finite-horizon Markov Decision Process (MDP). The state $\mathbf{s}_t$ observed by the humanoid robot is partitioned into two components: the proprioceptive state $\mathbf{s}^p_t$ and a goal state $\mathbf{s}^g_t$.

$$
\mathbf{s}^p_t = [\mathbf{r}_t, \mathbf{p}_t, \boldsymbol{\omega}_t, \mathbf{v}_t]
$$

Here, $\mathbf{r}_t$, $\mathbf{p}_t$, $\boldsymbol{\omega}_t$, and $\mathbf{v}_t$ represent the 6D rotation, 3D position, angular velocity, and linear velocity of all body links, respectively. The goal state $\mathbf{s}^g_t$ encodes the difference between the current state and the upcoming reference motion, providing a target for imitation:

$$
\mathbf{s}^g_t = (\hat{\mathbf{r}}_{t+1} \odot \mathbf{r}_t, \hat{\mathbf{p}}_{t+1} - \mathbf{p}_t, \hat{\mathbf{v}}_{t+1} - \mathbf{v}_t, \hat{\boldsymbol{\omega}}_{t+1} - \boldsymbol{\omega}_t, \hat{\boldsymbol{\theta}}_{t+1}, \hat{\mathbf{p}}_{t+1}^{root})
$$

The notation $\hat{\cdot}$ denotes values from the reference motion dataset. The action $\mathbf{a}_t \in \mathbb{R}^{N}$ consists of target joint positions for a PD controller. To create this reference dataset, human motion capture sequences from sources like AMASS are retargeted onto the humanoid robot’s morphology using inverse kinematics, ensuring the spatial trajectory of key body points is preserved.
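Since the actions are PD targets, the low-level torque computation reduces to a standard PD law. The sketch below assumes diagonal gains; the gain values are illustrative, not taken from this work:

```python
import numpy as np

def pd_torque(q_target, q, q_dot, kp=100.0, kd=2.0):
    """tau = Kp * (q_target - q) - Kd * q_dot, applied per joint."""
    return kp * (np.asarray(q_target) - np.asarray(q)) - kd * np.asarray(q_dot)
```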

2. Composite Reward Function

The reward function $R(\mathbf{s}^p_t, \mathbf{s}^g_t, \mathbf{a}_t)$ is the primary tool for shaping the humanoid robot’s behavior. It consists of four carefully designed components:

a) Imitation Reward ($r^{imp}$): This reward encourages the humanoid robot to closely follow the kinematic profile of the reference motion. It is computed as a sum of exponential error terms for joint positions, rotations, and their velocities.
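A minimal sketch of such an exponential-error reward, with hypothetical weights and error scales (the actual coefficients are tuning choices not specified here):

```python
import numpy as np

def imitation_reward(pos_err, rot_err, vel_err,
                     weights=(0.5, 0.3, 0.2), scales=(5.0, 2.0, 0.1)):
    """Weighted sum of exp(-k * ||e||^2) terms; each term reaches its
    weight when the corresponding tracking error vector is zero."""
    terms = (pos_err, rot_err, vel_err)
    return sum(w * np.exp(-k * np.dot(e, e))
               for w, k, e in zip(weights, scales, terms))
```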

b) Discriminator Reward ($r^{disc}$): Inspired by Adversarial Motion Priors (AMP), a discriminator network $D$ is trained concurrently to distinguish between state transitions from the policy and those from the reference dataset. The policy receives a reward $r^{disc} = -\log(1 – D(\mathbf{s}_t, \mathbf{s}_{t+1}))$, encouraging it to produce state transitions that are distributionally indistinguishable from human motion, thereby capturing the style and coordination nuances.
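The per-step discriminator reward can be computed directly from the discriminator's scalar output; clamping guards against $\log 0$ (a sketch of the stated formula, not the exact implementation):

```python
import math

def discriminator_reward(d_out, eps=1e-8):
    """r_disc = -log(1 - D(s_t, s_{t+1})), with D clamped away from 0 and 1."""
    d = min(max(d_out, eps), 1.0 - eps)
    return -math.log(1.0 - d)
```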

c) Physical Constraint Reward ($r^{con}$): This term enforces realism and efficiency, penalizing excessive joint torque, power consumption, and unnatural joint accelerations to ensure the humanoid robot moves in an energy-plausible manner.

d) Preference-Based Reward ($r^{pref}$): This is a novel contribution to address inter-skill conflict. It encodes explicit human preferences for robust, human-like ground interaction, which are often not fully captured by kinematic imitation or the discriminator alone. It comprises three sub-terms:

$$
r^{pref} = k_b r^{stumble} + k_s r^{slip} + k_o r^{orientation}
$$

$$
\begin{aligned}
r^{stumble} &= \mathbb{I}(\|\mathbf{F}^{feet}_{xy}\|_2 > \tau_{force}) \quad \text{(Penalizes stumbling contact)} \\
r^{slip} &= \|\mathbf{v}^{feet}\|^2_2 \cdot \mathbb{I}(\|\mathbf{F}^{feet}\|_2 > 0) \quad \text{(Penalizes foot sliding)} \\
r^{orientation} &= \|\mathbf{g}^{l\_feet}_{xy}\|_2 + \|\mathbf{g}^{r\_feet}_{xy}\|_2 \quad \text{(Penalizes non-vertical foot orientation)}
\end{aligned}
$$

Here, $\mathbb{I}(\cdot)$ is an indicator function, $\mathbf{F}^{feet}$ is the foot contact force, $\mathbf{v}^{feet}$ is the foot velocity, and $\mathbf{g}^{feet}_{xy}$ is the projection of the gravity vector onto the foot’s local horizontal plane. By discouraging stumbling, sliding, and tilted feet, this reward actively guides the humanoid robot toward a more stable and human-like interaction with the ground, which is crucial for reconciling the demands of both static and dynamic skills. The total reward is:

$$
R = w_{imp} r^{imp} + w_{disc} r^{disc} + w_{con} r^{con} + w_{pref} r^{pref}
$$
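Putting the preference sub-terms together, the sketch below uses hypothetical negative coefficients so that each term acts as a penalty; the force threshold and gains are illustrative, not values from this work:

```python
import numpy as np

def preference_reward(f_feet, v_feet, g_lfoot_xy, g_rfoot_xy,
                      k_b=-1.0, k_s=-0.5, k_o=-0.3, tau_force=50.0):
    """k_b, k_s, k_o < 0 so that stumbling, sliding, and tilt are penalized."""
    # Stumble: large tangential (xy) contact force on the feet.
    r_stumble = float(np.linalg.norm(np.asarray(f_feet)[:2]) > tau_force)
    # Slip: squared foot velocity while in contact (any nonzero contact force).
    in_contact = float(np.linalg.norm(f_feet) > 0.0)
    r_slip = float(np.dot(v_feet, v_feet)) * in_contact
    # Orientation: gravity projected onto each foot's local xy plane.
    r_orient = np.linalg.norm(g_lfoot_xy) + np.linalg.norm(g_rfoot_xy)
    return k_b * r_stumble + k_s * r_slip + k_o * r_orient
```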

Table 1: Summary of Reward Function Components

| Component | Purpose | Key Mechanism |
| --- | --- | --- |
| Imitation Reward ($r^{imp}$) | Kinematic fidelity to reference | Exponential error on poses/velocities |
| Discriminator Reward ($r^{disc}$) | Style and distribution matching | Adversarial training vs. reference data |
| Constraint Reward ($r^{con}$) | Physical realism & efficiency | Penalizes torque, power, jerk |
| Preference Reward ($r^{pref}$) | Ground interaction quality | Penalizes stumble, slip, bad orientation |

3. Failure-Frequency-Based Priority Sampling

Training on a diverse skill set with uniform sampling can be inefficient, as easy skills are mastered quickly while difficult ones remain a source of failure, hindering overall progress. To accelerate learning and ensure the humanoid robot dedicates more capacity to challenging skills, I implement an adaptive sampling scheme for selecting reference motion clips.

Each skill clip $i$ in the dataset is associated with a failure count $\rho_i$. During training, the probability $P_i$ of sampling clip $i$ is given by a prioritized softmax:

$$
P_i = \frac{(\rho_i + \epsilon)^\alpha}{\sum_j (\rho_j + \epsilon)^\alpha}
$$

where $\epsilon$ is a small smoothing constant, and $\alpha$ controls the prioritization strength. The failure count $\rho_i$ for clip $i$ is updated after every training epoch $k$:

$$
\rho_i^{k} \leftarrow \eta \rho_i^{k-1} + \Delta_{fail}
$$

Here, $\eta$ is a decay factor (e.g., 0.99), and $\Delta_{fail}$ is 1 if the policy failed to complete the clip during evaluation in epoch $k$, else 0. Clips on which the humanoid robot consistently fails see their sampling probability increase, focusing the policy’s learning effort. This method ensures that the training curriculum automatically adapts to the policy’s current weaknesses, significantly improving sample efficiency and final performance across all skills.
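The two update rules above can be sketched directly; the smoothing and decay constants below are illustrative:

```python
import numpy as np

def sampling_probs(rho, alpha=1.0, eps=1e-3):
    """P_i proportional to (rho_i + eps)^alpha, normalized over all clips."""
    w = (np.asarray(rho, dtype=float) + eps) ** alpha
    return w / w.sum()

def update_failure_counts(rho, failed, eta=0.99):
    """rho_i <- eta * rho_i + 1 if clip i failed this epoch, else eta * rho_i."""
    return eta * np.asarray(rho, dtype=float) + np.asarray(failed, dtype=float)
```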

4. Network Architecture and Training

The policy $\pi$ and value function $V$ are modeled by separate 6-layer MLPs with SiLU activations. The discriminator $D$ is a smaller 2-layer MLP. The policy is trained using Proximal Policy Optimization (PPO), a stable on-policy RL algorithm. The discriminator is updated using a standard binary cross-entropy loss to classify real vs. policy-generated state transitions. Training is conducted in a high-performance physics simulator (e.g., Isaac Gym) with domain randomization to enhance robustness.
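For reference, the SiLU activation and a plain MLP forward pass can be written in a few lines. This is a framework-agnostic NumPy sketch; the actual networks would be built in a deep-learning framework and trained end-to-end with PPO:

```python
import numpy as np

def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def mlp_forward(x, layers):
    """layers: list of (W, b) pairs; SiLU on hidden layers, linear output."""
    for W, b in layers[:-1]:
        x = silu(x @ W + b)
    W, b = layers[-1]
    return x @ W + b
```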

Experimental Analysis and Results

The SMILE framework was evaluated on a diverse set of 15 skills crucial for a general-purpose humanoid robot, including standing, squatting, multi-directional walking, picking objects from various heights, jumping, and high-dynamic stepping. The policy was trained in simulation, and its performance was measured by the success rate $E_s$—the fraction of episodes where the humanoid robot completed the skill without falling and maintained kinematic accuracy.

Ablation Studies

Ablation experiments were conducted to isolate the contribution of the two core components: the Preference-Based Reward and the Failure-Frequency-Based Priority Sampling.

Table 2: Ablation Study Results (Final Performance)

| Method | Success Rate ($E_s$) | Avg. Episode Length | Avg. Discriminator Reward | Notes |
| --- | --- | --- | --- | --- |
| SMILE (Full Method) | 93.33% | 309 | 1.6 | Balanced, high-quality performance |
| Without Preference Reward | 53.33% | 177 | 0.6 | Poor ground interaction, frequent falls |
| Without Priority Sampling | 73.33% | — | — | Slower convergence, skill imbalance |

The results are clear. Removing the preference reward caused a drastic drop of 40 percentage points in success rate (from 93.33% to 53.33%). The humanoid robot exhibited poor foot-ground interaction, leading to frequent stumbles and slips, especially during transitions between static and dynamic phases. The average episode length was shorter, indicating premature failure. Furthermore, the discriminator reward remained low, suggesting the generated motions lacked the nuanced style of human movement. This confirms that kinematic imitation and adversarial training alone are insufficient to resolve the fundamental conflict between skill types; explicit preference guidance is essential.

Removing the priority sampling mechanism led to a 20-percentage-point reduction in success rate and noticeably slower convergence. The policy showed an imbalance in skill mastery, often excelling at a subset of skills while performing poorly on others, a classic symptom of catastrophic interference in multi-task learning. The adaptive sampling ensures that the humanoid robot’s policy continuously works on its weaknesses, leading to robust, all-around competence.

Comparative Analysis with Goal-Conditioned RL

I compared SMILE against a strong baseline method based purely on Goal-Conditioned Reinforcement Learning (GCRL), similar to frameworks like OmniH2O, which uses carefully shaped rewards without an adversarial discriminator. The GCRL baseline was first trained to a stable point, then fine-tuned using the SMILE reward (excluding the discriminator reward) to see if reward shaping alone could bridge the gap.

Table 3: Comparison with Goal-Conditioned RL Baseline

| Aspect | SMILE Method | GCRL Baseline (Fine-tuned) |
| --- | --- | --- |
| Policy Style | Dynamic, confident, human-like transitions | Conservative, “standing-prioritized” |
| Motion Quality | High-fidelity jumps, clean foot placement | Substituted jumps with shuffling steps, degraded gait |
| Skill Transition | Smooth, coherent, and intentional | Hesitant, often used upper body to compensate for stiff legs |
| Forgetting | Minimal; retained proficiency across all skills | Significant; fine-tuning degraded previously learned walking |

The GCRL baseline converged to a conservative strategy that prioritized static stability above all else. While it could stand and perform small, stable movements, it failed to execute dynamic skills like a proper jump, replacing them with inefficient shuffling. When fine-tuned with the composite reward (but no discriminator), it attempted more diverse motions but suffered from severe motion degradation and forgetting. For example, a previously learned robust walk deteriorated into a mincing gait. This highlights a key advantage of SMILE: the adversarial discriminator component provides a powerful, data-driven constraint on motion quality and style. It prevents the policy from “cheating” the reward function by adopting bizarre but high-reward behaviors, and it enforces the preservation of human-like motion characteristics across all skills, thereby mitigating forgetting. The discriminator acts as a continuous, evolving benchmark of what constitutes valid humanoid motion.

Capability Demonstration

The humanoid robot trained with the SMILE framework successfully learned and integrated all 15 skills into a single, cohesive policy. It demonstrated:
Static-Dynamic Transitions: Smoothly moving from a stable stand into a dynamic jump and recovering stability on landing.
Multi-Directional Mobility: Transitioning between forward, backward, and sidestepping gaits with human-like turning steps (e.g., pivoting on the ball of the foot).
Whole-Body Coordination: Performing actions like picking up objects from the floor, which required coordinated bending at the hip and knee while maintaining balance, rather than just leaning over.
Distinct Skill Execution: The jump skill clearly displayed a distinct aerial phase with both feet off the ground, and the high-step skill showed a deliberate, high knee lift—neither was blended into a generic “leg movement”.

The policy’s ability to execute these distinct skills from a common ready state and chain them together in sequences showcases its generalizability and the effectiveness of the proposed framework in managing the multi-skill optimization problem for a humanoid robot.

Conclusion and Future Work

In this work, I have presented SMILE, a unified imitation learning framework that enables a single policy to generate high-quality, diverse whole-body motions for a humanoid robot. The integration of adversarial imitation learning with carefully designed preference-based rewards and an adaptive, failure-driven sampling curriculum directly addresses the core challenges of multi-skill learning: optimization conflict and imbalanced progress. The preference rewards explicitly guide the humanoid robot towards robust, human-like ground interaction principles, which serve as a common ground for both static and dynamic skills. The priority sampling ensures efficient use of training resources by focusing on the current limitations of the policy.

The experimental results demonstrate the necessity of each component. The full SMILE method achieved a 93.33% success rate across a challenging skill set, significantly outperforming ablated versions and a strong GCRL baseline. It produced motions that were not only functionally correct but also exhibited the style and coordination nuances characteristic of human movement, enabling smooth and natural transitions between disparate skills.

The framework is inherently scalable. The preference reward formulation and priority sampling algorithm are skill-agnostic. To incorporate new skills, one would simply add the retargeted reference motions to the dataset. The sampling mechanism would automatically identify the new skill as a “challenge” and allocate appropriate training attention, while the adversarial discriminator would ensure the new motions conform to the overall distribution of human-like behavior. Future work will focus on the critical step of sim-to-real transfer. Techniques like system identification, domain randomization, and potentially a teacher-student distillation pipeline will be explored to deploy the policies trained in simulation onto a physical humanoid robot platform, bringing us closer to realizing the vision of a truly versatile, multi-skilled humanoid robot capable of operating in real-world environments.
