Humanoid robots, with their anthropomorphic structure, offer significant advantages in adapting to diverse terrains and operating existing human-made tools, combining flexibility with versatility. In everyday production and living scenarios, these robots reduce the need for physical environment modifications, thereby lowering the cost of large-scale deployment. Gait switching is a core technology for humanoid robots to overcome the dynamic adaptation challenges of complex, continuous terrains. Existing methods rely primarily on proprioception and lack the ability to actively interpret external environmental features, making autonomous and smooth gait transitions difficult. This paper explores a gait switching method that integrates the semantic mapping capabilities of vision-language models (VLMs) with the adaptive learning characteristics of the proximal policy optimization (PPO) algorithm. We first generate human-like gait sequences through motion retargeting, then train gait primitives using a reward-shaped PPO algorithm to build a multi-terrain gait library. A gait scheduler based on a VLM dynamically matches suitable gait primitives, and Lagrange interpolation is employed to constrain joint trajectories for adaptive gait transitions. Experiments in typical scenarios validate the effectiveness of the proposed method.
The construction of a gait library is fundamental for enabling humanoid robots to perform seamless locomotion across multiple terrains. We begin by retargeting source gait sequences to generate human-like motion trajectories, which serve as reference gaits. A reward-shaped PPO algorithm is then used to train gait primitives, building a comprehensive gait library that adapts to flat ground, stairs, slopes, and narrow passages, thereby expanding the multi-terrain mobility of humanoid robots.
Motion retargeting maps source gait sequences to the joint space of humanoid robots to produce motion trajectories. Because source and target robots differ in joint distribution and range limits, and because raw sequences contain noise and redundant segments, gait sequences must be preprocessed to ensure physical plausibility. We employ time-window cropping and outlier removal to eliminate invalid start and end frames and to correct abrupt changes. First, the start and end frames of actions are manually annotated to remove stationary frames, ensuring the sequence contains a complete motion process. Second, the $3\sigma$ rule is applied to correct outliers in joint position sequences $\theta_i(t)$. The arithmetic mean $\mu_\theta$ and standard deviation $\sigma_\theta$ are calculated as follows:
$$ \mu_\theta = \frac{1}{T} \sum_{t=1}^{T} \theta_i(t) $$
$$ \sigma_\theta = \sqrt{\frac{1}{T} \sum_{t=1}^{T} (\theta_i(t) - \mu_\theta)^2} $$
Here, $T$ is the total number of frames. If a frame's angle $\theta_i(t)$ satisfies $|\theta_i(t) - \mu_\theta| > 3\sigma_\theta$, it is marked as an outlier and replaced with the average of adjacent frames. The processed human-like gait sequences are then used for motion retargeting.
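For concreteness, the $3\sigma$ correction can be sketched in a few lines of Python; the function name and the NumPy dependency are illustrative rather than part of our implementation:

```python
import numpy as np

def correct_outliers_3sigma(theta: np.ndarray) -> np.ndarray:
    """Replace 3-sigma outliers in a joint-angle sequence (shape (T,))
    with the average of the adjacent frames."""
    theta = theta.copy()
    mu = theta.mean()       # arithmetic mean over all T frames
    sigma = theta.std()     # standard deviation over all T frames
    outliers = np.where(np.abs(theta - mu) > 3.0 * sigma)[0]
    for t in outliers:
        lo, hi = max(t - 1, 0), min(t + 1, len(theta) - 1)
        theta[t] = 0.5 * (theta[lo] + theta[hi])  # neighbor average, clamped at ends
    return theta
```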
Human-like gait sequences include walking, running, crouching, slope traversal, and stair climbing. Motion retargeting applies constraints to each joint via a joint parameter mapping function, scaling joint positions, adjusting initial joint positions, and limiting joint ranges to generate human-like gaits. To maintain similarity between the retargeted and source gaits, the joint parameter mapping function is defined as:
$$ Q_t(t) = f(Q_s(t), P_i) $$
$$ f(Q_s(t), P_i) = [k_s \cdot Q_s(t) + k_0, P] $$
where $Q_s(t)$ is the source gait sequence, $Q_t(t)$ is the retargeted gait sequence for the humanoid robot, $P_i$ is the set of joint parameters (comprising the scaling factor, offset, and the joint limit range $P$), $f$ is the mapping function, $k_s$ is the joint scaling factor, and $k_0$ is the joint offset coefficient. If the converted gait sequence exceeds the joint limits, truncation is applied:
$$ Q_{tf}(t) = \text{range}(Q_t(t), P) $$
where $Q_{tf}(t)$ is the truncated sequence, and the $\text{range}()$ function ensures joint positions remain within limits to avoid motion interference.
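A minimal sketch of the per-joint mapping and truncation, assuming each joint's parameter set $P_i$ carries its scaling factor, offset, and limit range (names are illustrative):

```python
import numpy as np

def retarget_joint(q_s: np.ndarray, k_s: float, k_0: float,
                   q_min: float, q_max: float) -> np.ndarray:
    """Map one source joint trajectory Q_s(t) to the target robot:
    scale by k_s, offset by k_0, then truncate to [q_min, q_max]
    (the range() operation in the text)."""
    q_t = k_s * q_s + k_0              # linear joint parameter mapping
    return np.clip(q_t, q_min, q_max)  # enforce the joint limit range
```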
We incorporate reward shaping theory to design a hierarchical reward function and use the PPO algorithm to train gait primitives, enabling the reproduction of human-like gait sequences while adapting to terrains of different sizes. The state space for the multi-terrain gait generation task includes velocity commands, joint states, network action outputs, and base states, as summarized in Table 1. The action space comprises joint positions for the arms (8 joints), legs (12 joints), and waist (2 joints). A PD controller computes joint torques to drive the robot in the simulation platform.
| State Category | State Name | State Dimension (dim) |
|---|---|---|
| Velocity Command | Forward Linear Velocity | 1 |
| Velocity Command | Yaw Angular Velocity | 1 |
| Joint State | Joint Position | 22 |
| Joint State | Joint Velocity | 22 |
| Action Output | Previous Action Output | 22 |
| Base State | Three-Axis Angular Velocity | 3 |
| Base State | Quaternion | 4 |
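As a sketch of how these quantities fit together, the observation concatenation follows Table 1 and the PD law follows the text; the gain values below are placeholders, not the tuned controller gains:

```python
import numpy as np

def build_observation(cmd_vx, cmd_wz, q, dq, prev_action,
                      base_ang_vel, base_quat):
    """Assemble the 75-dim state of Table 1: 1 + 1 + 22 + 22 + 22 + 3 + 4."""
    return np.concatenate([[cmd_vx], [cmd_wz], q, dq, prev_action,
                           base_ang_vel, base_quat])

def pd_torque(q_target, q, dq, kp=100.0, kd=2.0):
    """PD controller mapping the policy's joint-position actions to torques."""
    return kp * (q_target - q) - kd * dq
```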
Let the reward function for the gait generation task be $R(s, a, s')$, where $s$ is the current state, $a$ is the action, and $s'$ is the next state. The shaped reward $R_s(s, a, s')$ is introduced as:
$$ R_s(s, a, s') = \gamma \Phi(s') - \Phi(s) $$
where $\Phi(s)$ is a potential function indicating the ease of reaching the goal from state $s$, and $\gamma$ is the discount factor. If $\gamma \Phi(s') - \Phi(s) > 0$, the transition from $s$ to $s'$ moves the agent toward higher potential (closer to the goal), providing positive feedback; otherwise, it penalizes the agent. According to potential-based reward shaping theory, adding $R_s$ to $R$ leaves the optimal policy $\pi^*$ unchanged, accelerating learning without altering the task's optimal solution.
For the gait generation task, state $s$ includes the joint position vector $\mathbf{q}$ of the humanoid robot:
$$ \mathbf{q} = [\mathbf{q}_l, \mathbf{q}_a, \mathbf{q}_w] $$
where $\mathbf{q}_l$, $\mathbf{q}_a$, and $\mathbf{q}_w$ are joint positions for the legs, arms, and waist, respectively. The reference gait has target joint positions $\mathbf{q}_{\text{ref}}$, so the joint trajectory error vector $\mathbf{e}_q$ is:
$$ \mathbf{e}_q = \mathbf{q} - \mathbf{q}_{\text{ref}} = [\mathbf{e}_l, \mathbf{e}_a, \mathbf{e}_w] $$
where $\mathbf{e}_l$, $\mathbf{e}_a$, and $\mathbf{e}_w$ are error vectors for the legs, arms, and waist. Since joint errors are bounded, the potential function $\Phi_r(s)$ and its difference $\Delta \Phi_r(s) = \Phi_r(s') - \Phi_r(s)$ are bounded, satisfying the conditions for reward shaping. To quantify joint trajectory errors, a weighted error norm is used as the potential function $\Phi_r(s)$:
$$ \Phi_r(s) = l_1 \|\mathbf{e}_l\|^2 + l_2 \|\mathbf{e}_a\|^2 + l_3 \|\mathbf{e}_w\|^2 $$
where $l_1, l_2, l_3$ are weight coefficients adjusting the importance of different body parts. The shaped reward $R_s^r$ is:
$$ R_s^r(s, a, s') = \gamma \Phi_r(s') - \Phi_r(s) $$
The human-like gait tracking reward $r_r$ is:
$$ r_r = \exp(-\Phi_r(s)) $$
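A sketch of the potential and tracking reward, using the joint ordering $\mathbf{q} = [\mathbf{q}_l, \mathbf{q}_a, \mathbf{q}_w]$ (12 leg, 8 arm, 2 waist joints); the weight values are illustrative:

```python
import numpy as np

def potential(e_l, e_a, e_w, l1=1.0, l2=0.5, l3=0.5):
    """Weighted squared-error potential Phi_r(s) over legs, arms, waist."""
    return l1 * np.sum(e_l**2) + l2 * np.sum(e_a**2) + l3 * np.sum(e_w**2)

def tracking_reward(q, q_ref, l1=1.0, l2=0.5, l3=0.5):
    """Human-like gait tracking reward r_r = exp(-Phi_r(s))."""
    e = q - q_ref
    e_l, e_a, e_w = e[:12], e[12:20], e[20:22]   # legs, arms, waist
    return np.exp(-potential(e_l, e_a, e_w, l1, l2, l3))
```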
After the state transition, the reward $r_r'$ becomes:
$$ r_r' = \exp(-\Phi_r(s')) = r_r \cdot \exp\left(-(\Phi_r(s') - \Phi_r(s))\right) $$
Taking logarithms on both sides:
$$ \log \left( \frac{r_r'}{r_r} \right) = -(\Phi_r(s') - \Phi_r(s)) $$
which equals $-R_s^r$ when $\gamma = 1$ and approximates it for $\gamma$ close to 1. Thus, maximizing $r_r$ is equivalent to minimizing $R_s^r$, i.e., driving $\Phi_r(s)$ toward zero. By designing $r_r$, the agent is guided to reduce $\Phi_r(s)$, bringing $\mathbf{q}$ closer to $\mathbf{q}_{\text{ref}}$. The overall reward function $R_h$ is:
$$ R_h = m_1 r_r + m_2 r_d + m_3 r_g + m_4 r_v + m_5 r_b $$
where $r_d$ is a leg distance reward to prevent interference, $r_g$ is a gait state reward for timing coordination between support and swing phases, $r_v$ is a velocity tracking reward, $r_b$ is a base state reward for torso stability, and $m_1$ to $m_5$ are weight coefficients. Adjusting $m_i$ allows flexible control of PPO rewards for different terrains. The trained gait primitives form a multi-terrain gait library, including flat ground (walking and running), slopes (adapting to angles <15°), stairs (adapting to heights <0.15 m), and narrow passages (crouching walking for low spaces).
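A sketch of the weighted combination; the per-terrain weight values below are placeholders, not the coefficients used in training:

```python
# Illustrative weight profiles (m1..m5) per terrain; the actual values
# are tuned per gait primitive during training.
REWARD_WEIGHTS = {
    "flat":   (1.0, 0.3, 0.5, 1.0, 0.5),
    "stairs": (1.2, 0.5, 0.8, 0.6, 0.8),
}

def total_reward(r_r, r_d, r_g, r_v, r_b, terrain="flat"):
    """R_h = m1*r_r + m2*r_d + m3*r_g + m4*r_v + m5*r_b."""
    m1, m2, m3, m4, m5 = REWARD_WEIGHTS[terrain]
    return m1 * r_r + m2 * r_d + m3 * r_g + m4 * r_v + m5 * r_b
```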
Autonomous gait switching is achieved through a gait scheduler based on Grounded SAM2, which extracts environmental features, identifies terrains, segments passable regions, and plans paths. The scheduler selects gait primitives based on terrain characteristics, enabling seamless transitions. Grounded SAM2 combines Grounding DINO, an open-vocabulary object detector, with SAM2 for precise segmentation. Grounding DINO identifies terrains using prompt words and provides bounding boxes, while SAM2 generates high-resolution semantic masks for fine-grained boundary extraction. Depth camera data from the humanoid robot's head is integrated to output terrain categories, masks, and distances for gait switching.
The gait scheduler uses terrain category, distance, and mask information to extract features for switching. Terrain categories include flat ground, slopes, stairs, and narrow passages. The terrain distance is the relative distance from the robot's base to the terrain boundary. For non-flat terrains, the scheduler accumulates detection frequency and confidence statistics. If the confidence exceeds a threshold ($p > 0.9$), the number of consecutive detections exceeds a set value ($n > 5$), and the terrain distance falls below a threshold ($x < 1$ m), the scheduler selects the corresponding gait primitive from the library; otherwise, the current gait is maintained for stability. After switching, gait parameters are updated and new control commands are executed.
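The switching rule can be summarized as a small state machine; the class and field names are illustrative, but the thresholds match the text:

```python
class GaitScheduler:
    """Switch gaits only after a terrain is detected with confidence
    p > 0.9 more than n = 5 consecutive times within x = 1 m."""

    def __init__(self, p_thresh=0.9, n_thresh=5, x_thresh=1.0):
        self.p_thresh, self.n_thresh, self.x_thresh = p_thresh, n_thresh, x_thresh
        self.last_terrain, self.count = None, 0

    def update(self, terrain, confidence, distance, current_gait):
        confident = terrain != "flat" and confidence > self.p_thresh
        if confident and terrain == self.last_terrain:
            self.count += 1          # extend the consecutive-detection streak
        elif confident:
            self.count = 1           # new terrain: restart the streak
        else:
            self.count = 0           # low confidence or flat ground: reset
        self.last_terrain = terrain if confident else None
        if self.count > self.n_thresh and distance < self.x_thresh:
            return terrain           # select the matching gait primitive
        return current_gait          # keep the current gait for stability
```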
Direct gait switching on complex terrains may cause instability. Thus, transition states are designed: for walking and running, the default standing state is used; for slope, stair, and narrow passage gaits, the default crouching state is adopted. When switching from walking or running to slope, stair, or narrow passage gaits, the robot first transitions to standing, then crouches, and finally switches to the target gait. Reverse switches follow the same process. Lagrange interpolation generates smooth transition gait sequences:
$$ S(t) = \frac{t - t_1}{t_0 - t_1} y_0 + \frac{t - t_0}{t_1 - t_0} y_1 $$
where $y_0$ and $y_1$ are state parameters at the start and end points, and $t_0$ and $t_1$ are corresponding sampling points. $S(t)$ is the state value at each sampling point during transition. Transition times are dynamically adjusted based on pre- and post-switch speeds and gait cycles to ensure natural alignment and avoid failure due to abrupt changes.
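A sketch of the interpolation and of the transition-state sequencing described above (the gait names are illustrative labels for the library's primitives):

```python
def lagrange_transition(y0, y1, t0, t1, t):
    """Two-point Lagrange (linear) interpolation S(t) between gait states."""
    return (t - t1) / (t0 - t1) * y0 + (t - t0) / (t1 - t0) * y1

def transition_sequence(src, dst):
    """Route switches through the standing and crouching default states."""
    flat = {"walk", "run"}
    if src in flat and dst not in flat:
        return ["stand", "crouch", dst]   # e.g. walk -> stand -> crouch -> stairs
    if src not in flat and dst in flat:
        return ["crouch", "stand", dst]   # reverse switches mirror the process
    return [dst]
```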
We conduct experiments in a high-fidelity virtual fire scenario to validate the autonomous gait switching method. Fire scenarios feature low visibility and dynamic terrains with flames, smoke, and collapsed objects, requiring environmental understanding and multi-terrain adaptation. Using Isaac Lab, we build a multi-terrain warehouse scene based on real-world objects like metal shelves, wooden boxes, metal slopes, fire hydrants, and sorting boxes. The simulation parameters are set as in Table 2. The simulation frequency is 1000 Hz, synchronizing with the control frequency for rapid gait adjustments.
| Parameter | Value |
|---|---|
| Physics Engine | PhysX |
| Renderer | Omniverse RTX |
| Simulation Frequency (Hz) | 1000 |
| Control Frequency (Hz) | 1000 |
| Robot Mass (kg) | 55.655 |
| Robot Height (m) | 1.7 |
The multi-terrain gait library is constructed using the reward-shaped PPO algorithm, with hyperparameters listed in Table 3. Gait primitives for flat ground, slopes, stairs, and narrow passages are generated. The crouching walking gait reduces body height for confined terrains; the stair climbing and slope traversal gaits adjust step length, frequency, and center of mass for stable motion.
| Parameter | Value |
|---|---|
| Number of Environments | 4096 |
| Clipping Parameter | 0.2 |
| Entropy Regularization Coefficient | 0.001 |
| Mini-batch Size | 8 |
| Learning Rate | 1×10⁻⁵ |
| Discount Factor | 0.994 |
| Generalized Advantage Estimate Parameter | 0.9 |
| Target KL Divergence | 0.01 |
For autonomous terrain understanding in fire scenarios, we fine-tune Grounded SAM2 with a small dataset of 200 images covering flat ground, slopes, stairs, shelves, hydrants, and exits, including features such as flames, collapsed shelves, and smoke occlusion. Data augmentation techniques such as scaling and cropping are applied. We compare its performance with YOLOv11 using average precision (AP) for single categories and mean average precision (mAP) over IoU thresholds [0.5:0.95]. Images are resized to 640×480, and training uses an NVIDIA RTX 4080 GPU with 15 epochs and a batch size of 16. Results in Table 4 show that our fine-tuned model outperforms YOLOv11, achieving 93.3% mAP and over 90% AP for all categories. The VLM's cross-modal decoder adaptively fuses text and visual features, and its self-attention mechanism captures semantic relations that CNN-based detectors handle poorly, demonstrating strong generalization from few-shot fine-tuning.
| Method | mAP | AP (Floor) | AP (Slopes) | AP (Steps) | AP (Shelf) | AP (Hydrant) | AP (Exit) |
|---|---|---|---|---|---|---|---|
| YOLOv11 | 0.911 | 0.892 | 0.890 | 0.891 | 0.930 | 0.892 | 0.969 |
| Ours | 0.933 | 0.900 | 0.915 | 0.916 | 0.942 | 0.944 | 0.982 |
For passable region path planning, SAM2 performs precise segmentation based on Grounding DINO’s detections, outputting terrain masks connected into passable regions. After binarization, a planned trajectory is formed by connecting midpoints along the boundaries of passable areas. To avoid sudden yaw angle changes when the robot reaches the top of a passable region, the yaw angle and angular velocity are computed as:
$$ p_{ix} = \frac{1}{2} \{X_a\}_i, \qquad p_{iy} = \max_k \{Y_a\}_k $$
$$ a_y = \arctan\left( \frac{\sum_{i=1}^{n} (p_{ix} - p_{x0})}{\sum_{i=1}^{n} (p_{iy} - p_{y0})} \right) $$
where $p_{ix}$ and $p_{iy}$ are horizontal and vertical coordinates of points in the passable region, $\{X_a\}_i$ is the pixel length of a row, $\{Y_a\}_k$ is the pixel length of a column, $p_{x0}$ and $p_{y0}$ are start point coordinates, and $a_y$ is the yaw angle. The yaw angular velocity $\omega_a$ is:
$$ \omega_a = \frac{a_y[k+1] - a_y[k]}{t_{k+1} - t_k} $$
where $t_k$ is the $k$-th sample, and $a_y[k]$ is the yaw angle at sample $k$.
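A sketch of the yaw computation; `np.arctan2` is used in place of a bare arctangent so the angle remains well defined when the denominator vanishes (a numerical choice on our part, not prescribed by the formulas above):

```python
import numpy as np

def yaw_from_region(px, py, px0, py0):
    """Yaw angle a_y toward the passable region from summed point offsets."""
    return np.arctan2(np.sum(px - px0), np.sum(py - py0))

def yaw_rate(a_y, t):
    """Finite-difference yaw angular velocity omega_a between samples."""
    return np.diff(a_y) / np.diff(t)
```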
Autonomous gait switching experiments are conducted in the virtual fire scene with multi-terrain parameters in Table 5. The slope has a 15° angle, stairs have 9 steps of 0.15 m height each, and the narrow passage has a height limit of 1.6 m, below the robot’s 1.7 m height. Results show that on flat ground, the humanoid robot uses walking gait and switches to running for rapid exploration when no other terrain is detected. When slopes or stairs are identified, Grounded SAM2 generates passable regions, and the scheduler selects corresponding gaits for successful traversal. For stairs, the robot follows the planned path, adjusts step length, transitions to standing, crouches, and matches stair gait primitives to climb steadily. In narrow passages, crouching walking gait reduces base height and swing amplitude, enabling passage through collapsed shelves.
| Terrain Category | Slope | Stairs | Narrow Passage |
|---|---|---|---|
| Length (m) | 4.4 | 14 | 3 |
| Width (m) | 1.4 | 2 | 0.8 |
| Height (m) | 0.4 | 1.35 | 1.6 |

In conclusion, we propose a gait switching method for humanoid robots that integrates vision-language models and proximal policy optimization, significantly enhancing adaptability in multi-terrain environments. Key contributions include: (1) designing a hierarchical reward function combining human-like gait sequences and proprioceptive states to train multi-terrain gait primitives with PPO, building a comprehensive gait library; (2) developing a gait scheduler based on Grounded SAM2 for cross-modal terrain feature extraction and dynamic gait selection, with Lagrange interpolation ensuring smooth joint trajectories for autonomous switching. Experiments in virtual fire scenes validate the method’s effectiveness, demonstrating robust performance across diverse terrains. Future work will focus on real-world deployment and handling more dynamic obstacles.