Efficient Learning of Robust Multigait Quadruped Locomotion for Minimizing the Cost of Transport

Quadruped robots, often referred to as robot dogs, have demonstrated remarkable capabilities in navigating complex environments through various control strategies. However, their adaptability and energy efficiency still lag behind biological counterparts. A key reason is that animals seamlessly switch between different gait patterns to optimize traversability and energy consumption across diverse terrains and speeds. This study explores learning-based approaches to develop robust and energy-efficient multigait locomotion policies for quadruped robots. By integrating gait-conditioned policies into a unified framework, we enable smooth transitions and adaptive behavior, minimizing the cost of transport (CoT) while maintaining robustness. Our methodology leverages reinforcement learning (RL) to train individual gait policies and a gait selector module, resulting in a multimodal locomotion controller that outperforms single-gait strategies. Experimental results in simulation validate the effectiveness of our approach, showing significant improvements in energy efficiency, velocity tracking, and terrain adaptability.

In legged locomotion, the choice of gait—such as walk, trot, or gallop—involves trade-offs between stability, speed, and energy expenditure. For instance, a walk gait offers high stability at low speeds but is inefficient for rapid movement, whereas a gallop gait allows high velocity at the cost of increased energy consumption. Traditional control methods for quadruped robots often rely on predefined gaits or heuristic transitions, which can be suboptimal and require manual tuning. In contrast, learning-based methods, particularly RL, can autonomously discover efficient policies by optimizing rewards related to energy, stability, and task performance. This work focuses on developing a framework that learns gait-conditioned policies for a robot dog and synthesizes them into an adaptive multigait controller, enabling real-time gait switching based on velocity commands and terrain feedback.

The core of our approach involves training individual policies for distinct gaits using privileged learning, where precise simulation states are used during training to improve robustness. Each policy is conditioned on gait phase signals generated by a periodic phase generator, which ensures consistent gait patterns. A gait selector module, trained with RL, then dynamically chooses the optimal gait and parameters based on environmental inputs. This modular architecture allows for efficient integration of multiple skills without the need for complex state machines or manual intervention. Key contributions include a unified training framework for diverse gaits, a safe transition mechanism, and extensive analysis of energy efficiency across different velocity commands.

To evaluate our method, we compare it against single-gait policies and a finite state machine (FSM) baseline in simulated environments. Metrics such as velocity tracking error, CoT, success rate on challenging terrains, and joint velocity safety during transitions are used. Results demonstrate that our multigait policy achieves lower energy consumption, smoother transitions, and superior adaptability compared to alternatives. The following sections detail the related work, methodology, experimental setup, and results, concluding with discussions on limitations and future directions.

Related Work

Research in quadruped robot locomotion has evolved from model-based controllers to learning-based approaches. Early works utilized central pattern generators (CPGs) and heuristic rules to generate gait patterns, but these often required extensive parameter tuning and lacked adaptability. With advances in RL, researchers have developed policies in which gaits either emerge indirectly through reward shaping or are specified explicitly via gait conditioning. For example, privileged learning methods use simulated privileged information (e.g., accurate velocity estimates) to train policies that transfer robustly to real-world robot dogs. Studies have shown that energy-based rewards can lead to natural gait emergence, but they may not enforce specific gait patterns, potentially resulting in suboptimal efficiency.

Gait-conditioned locomotion has been explored using various techniques, such as tracking reference contact sequences or incorporating phase-based rewards. These methods enable a quadruped robot to follow desired gait patterns, but they often focus on single gaits or lack smooth transition mechanisms. Skill integration approaches, including policy distillation and mixture of experts, aim to combine multiple behaviors into a single policy. However, these can be computationally intensive and may not guarantee optimal transitions. Our work builds on these ideas by training individual gait policies with explicit phase conditioning and using a learned selector for seamless integration, prioritizing energy efficiency and safety.

In terms of energy optimization, the cost of transport is a critical metric, defined as the energy consumed per unit weight and distance. Biological studies indicate that animals minimize CoT by switching gaits at specific speeds, inspiring similar strategies for robot dogs. Previous RL-based methods have incorporated CoT rewards, but they often do not explore multigait scenarios comprehensively. Our framework explicitly addresses this by analyzing CoT across gaits and velocities, enabling the selector to choose the most efficient gait dynamically. Additionally, domain randomization and curriculum training enhance sim-to-real transfer, ensuring robustness across varied terrains and conditions.
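For concreteness, the dimensionless cost of transport used throughout this article can be written, using joint mechanical power as the energy proxy (matching the energy reward term introduced in the Methodology), as:

$$\mathrm{CoT} = \frac{E}{m g d} = \frac{P}{m g \|\mathbf{v}\|} \approx \frac{|\dot{\mathbf{q}}^{T}\boldsymbol{\tau}|}{m g \|\mathbf{v}\|},$$

where $E$ is the energy expended over a distance $d$, $P$ is the instantaneous power, $m$ is the robot mass, $g$ is gravitational acceleration, $\dot{\mathbf{q}}$ and $\boldsymbol{\tau}$ are the joint velocities and torques, and $\mathbf{v}$ is the base velocity.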

Methodology

Our control architecture consists of three main components: single-gait policies, a gait phase generator, and a gait selector module. The single-gait policies are trained using RL to follow velocity commands and gait phase signals. Each policy for a specific gait type $j$ is denoted as $\sigma_j$ and includes a variational autoencoder (VAE) for proprioceptive encoding and velocity estimation, followed by a backbone network that outputs joint position targets. The policies operate at 50 Hz, with a low-level PD controller running at 1 kHz for joint tracking. The gait selector module, trained separately, takes velocity commands, terrain height maps, and proprioceptive history as inputs to compute gait parameters $\theta_t$ and a one-hot vector $\mathbf{w}$ for policy blending. It runs at 5 Hz to reduce computational load.
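To make the interaction between these components and their update rates concrete, the following minimal Python sketch queries the selector every ten policy steps (5 Hz versus 50 Hz) and combines the per-gait actions with the selection weights. The class and argument names are illustrative placeholders rather than the actual implementation, and the 1 kHz PD loop is assumed to track the returned joint position targets.

```python
import numpy as np

class MultiGaitController:
    """Illustrative wrapper around a gait selector, N gait policies, and a
    phase generator; the callables are assumed to mirror the equations below."""

    def __init__(self, selector, gait_policies, phase_generator,
                 selector_dt=0.2, policy_dt=0.02):
        self.selector = selector          # gait selector psi, queried at 5 Hz
        self.policies = gait_policies     # per-gait policies, queried at 50 Hz
        self.phase_gen = phase_generator  # periodic phase generator g_theta(t)
        self.selector_period = int(round(selector_dt / policy_dt))  # 10 steps
        self.w = np.eye(len(gait_policies))[0]  # one-hot gait selection weights
        self.theta = None                       # current gait parameters
        self.steps = 0

    def step(self, v_cmd, height_map, state_hist, state, prev_actions, t):
        # Re-select the gait and its parameters only at the slower 5 Hz rate.
        if self.steps % self.selector_period == 0:
            self.w, self.theta = self.selector(v_cmd, height_map,
                                               state_hist, self.theta)
        phase = self.phase_gen(t, self.theta)
        # Query every gait policy; with a one-hot w only one action survives,
        # while a soft w would blend the per-gait actions.
        actions = [policy(state_hist, v_cmd, phase, state, prev_actions[j])
                   for j, policy in enumerate(self.policies)]
        self.steps += 1
        return sum(wj * aj for wj, aj in zip(self.w, actions))
```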

The overall system can be described by the following equations:

$$\mathbf{w}, \theta = \psi(\mathbf{v}^*, H, \mathbf{s}_{t:t-h}, \theta_{t-1}),$$

$$\mathbf{z}, \hat{\mathbf{v}} = \mu_j(\mathbf{s}_{t-1:t-h}),$$

$$\hat{\mathbf{s}}_{t+1} = \eta_j(\mathbf{s}_{t-1:t-h}),$$

$$\mathbf{a}_{j,t} = \rho_j(\mathbf{z}, \hat{\mathbf{v}}, \mathbf{v}^*, \mathbf{g}_\theta(t), \mathbf{s}_t, \mathbf{a}_{j,t-1}),$$

$$\mathbf{a}_t = \sum_{j=1}^N w_j \cdot \mathbf{a}_{j,t},$$

where $\psi$ is the gait selector, $\mu_j$ and $\eta_j$ are the encoder and decoder of the VAE, $\rho_j$ is the backbone policy, $\mathbf{g}_\theta(t)$ is the phase vector, and $\mathbf{a}_t$ is the final action. The phase generator produces periodic signals based on sine waves to define gait rhythms. For each leg $i$, the phase signal is computed as:

$$\mathbf{g}_\theta(t) = [g_0, g_1, g_2, g_3]^T,$$

$$\theta = [\Delta\phi_0, \cdots, \Delta\phi_3, \gamma_0, \cdots, \gamma_3, T_0, \cdots, T_3],$$

$$g_i = [\sin\phi_i, \cos\phi_i]^T, \quad i = 0,1,2,3,$$

$$\phi_i =
\begin{cases}
\frac{\pi}{\gamma_i T_i} t + \Delta\phi_i, & 0 \leq t < \gamma_i T_i, \\
\frac{\pi t}{(1-\gamma_i)T_i} + \frac{1-2\gamma_i}{1-\gamma_i} \pi + \Delta\phi_i, & \gamma_i T_i \leq t < T_i,
\end{cases}$$

where $T_i$ is the gait period, $\gamma_i$ is the duty cycle (stance phase proportion), $\Delta\phi_i$ is the phase offset, and $t$ is taken modulo $T_i$ within each gait cycle. This formulation allows flexible gait design, such as walk ($\Delta\phi = [0, 0.5\pi, 1.5\pi, \pi]^T$, $\gamma=0.25$), trot ($\Delta\phi = [0, \pi, \pi, 0]^T$, $\gamma=0.5$), and gallop ($\Delta\phi = [0, 1.6\pi, 0.8\pi, 0.4\pi]^T$, $\gamma=0.75$).
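A minimal NumPy sketch of this phase generator is given below. It assumes a shared duty cycle and period per gait when scalars are passed and takes time modulo the gait period; `GAIT_PRESETS` and `phase_vector` are illustrative names rather than the authors' implementation, and the gait periods (frequencies) are left as free parameters.

```python
import numpy as np

# Gait presets from the text: per-leg phase offsets (rad) and duty cycles gamma.
GAIT_PRESETS = {
    "walk":   {"offsets": np.array([0.0, 0.5, 1.5, 1.0]) * np.pi, "gamma": 0.25},
    "trot":   {"offsets": np.array([0.0, 1.0, 1.0, 0.0]) * np.pi, "gamma": 0.5},
    "gallop": {"offsets": np.array([0.0, 1.6, 0.8, 0.4]) * np.pi, "gamma": 0.75},
}

def phase_vector(t, offsets, gamma, T):
    """Return g_theta(t) = [sin(phi_i), cos(phi_i)] for the four legs.

    Implements the piecewise-linear phase above: the first gamma_i*T_i of each
    period maps to [0, pi), the remainder to [pi, 2*pi), shifted per leg by
    Delta_phi_i.
    """
    offsets = np.asarray(offsets, dtype=float)
    gamma = np.broadcast_to(np.asarray(gamma, dtype=float), offsets.shape)
    T = np.broadcast_to(np.asarray(T, dtype=float), offsets.shape)
    tau = np.mod(t, T)                      # time within the current gait cycle
    first_segment = tau < gamma * T
    phi = np.where(
        first_segment,
        np.pi * tau / (gamma * T) + offsets,
        np.pi * tau / ((1.0 - gamma) * T)
        + np.pi * (1.0 - 2.0 * gamma) / (1.0 - gamma) + offsets,
    )
    return np.stack([np.sin(phi), np.cos(phi)], axis=-1)   # shape (4, 2)

# Example: trot phase signal 0.3 s into a ~0.83 s cycle (1.2 Hz trot).
g = phase_vector(0.3, GAIT_PRESETS["trot"]["offsets"],
                 GAIT_PRESETS["trot"]["gamma"], 1.0 / 1.2)
```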

For training, we use proximal policy optimization (PPO) with a reward function that combines task-oriented, gait-conditioned, and efficiency terms. The reward for single-gait policies includes:

$$R_{\text{velocity}} = \alpha_1 e^{-\beta_1 \|\mathbf{v}^* - \mathbf{v}\|^2},$$

$$R_{\text{orientation}} = \alpha_2 e^{-\beta_2 \|\mathbf{k}^* - \mathbf{k}\|^2},$$

$$R_{\text{height}} = \alpha_3 e^{-\beta_3 |p_z^* - p_z|^2},$$

$$R_{\text{swing}} = \alpha_4 e^{-\beta_4 \sum_i C_i \|\mathbf{F}_i\|^2},$$

$$R_{\text{stance}} = \alpha_5 e^{-\beta_5 \sum_i (1-C_i) \|\mathbf{v}_{f,i}\|^2},$$

$$R_{\text{energy}} = \alpha_6 e^{-\beta_6 \frac{|\dot{\mathbf{q}}^T \boldsymbol{\tau}|}{m g \|\mathbf{v}\|}},$$

$$R_{\text{smooth}} = \alpha_7 e^{-\beta_7 \|\boldsymbol{\tau} - \boldsymbol{\tau}_{\text{prev}}\|^2},$$

$$R_{\text{jointvel}} = \alpha_8 e^{-\beta_8 \|\dot{\mathbf{q}}\|^2},$$

where $C_i$ indicates the commanded contact state from the phase signal (1 when leg $i$ is in swing, 0 when in stance), $\mathbf{F}_i$ is the foot contact force, $\mathbf{v}_{f,i}$ is the foot velocity, $\boldsymbol{\tau}$ is the joint torque vector, and $m$ is the robot mass. The energy term $R_{\text{energy}}$ directly rewards a low instantaneous CoT. The gait selector reward includes tracking accuracy, a survival (failure) penalty, decision smoothness, safety, and CoT minimization:

$$R_{\text{tracking}} = \alpha_9 e^{-\beta_9 \|\mathbf{v}^* - \mathbf{v}\|^2},$$

$$R_{\text{survival}} = -\alpha_{10} K_{\text{fail}},$$

$$R_{\text{decision}} = -\alpha_{11} \|\theta – \theta_{\text{prev}}\|,$$

$$R_{\text{safety}} = -\alpha_{12} \max \|\dot{\mathbf{q}}\|,$$

$$R_{\text{CoT}} = -\alpha_{13} \frac{|\dot{\mathbf{q}}^T \boldsymbol{\tau}|}{m g \|\mathbf{v}\|}.$$
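The sketch below computes a representative subset of these terms (velocity tracking, the swing/stance consistency terms, and the CoT-based energy term). The $\alpha$ and $\beta$ coefficients are placeholders, since the article does not list the actual weights, and the selector-specific penalty terms follow the same pattern.

```python
import numpy as np

def single_gait_rewards(v_cmd, v, q_dot, tau, swing_mask, foot_forces,
                        foot_vels, mass, g=9.81, alphas=None, betas=None):
    """Sketch of a subset of the single-gait reward terms.

    alphas/betas are per-term weights and scales; unit placeholders are used
    because the article does not report the actual coefficient values.
    """
    alphas, betas = alphas or {}, betas or {}
    a = lambda k: alphas.get(k, 1.0)
    b = lambda k: betas.get(k, 1.0)

    r_vel = a("vel") * np.exp(-b("vel") * np.sum((v_cmd - v) ** 2))
    # Swing legs should not push on the ground; stance feet should not slip.
    r_swing = a("swing") * np.exp(
        -b("swing") * np.sum(swing_mask * np.sum(foot_forces ** 2, axis=-1)))
    r_stance = a("stance") * np.exp(
        -b("stance") * np.sum((1 - swing_mask) * np.sum(foot_vels ** 2, axis=-1)))
    # Cost of transport: joint mechanical power over m * g * speed.
    cot = np.abs(np.dot(q_dot, tau)) / (mass * g * (np.linalg.norm(v) + 1e-6))
    r_energy = a("energy") * np.exp(-b("energy") * cot)
    return r_vel + r_swing + r_stance + r_energy
```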

Domain randomization is applied during training to enhance robustness, with parameters varied across episodes, as summarized in Table 1.

Table 1: Domain Randomization Parameters

| Parameter | Range |
| --- | --- |
| Payload mass (kg) | [-2.0, 4.0] |
| Friction coefficient | [0.3, 1.7] |
| Motor strength | [0.8, 1.2] |
| PD factor | [0.9, 1.1] |
| Latency (ms) | [0, 20] |
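As a small illustration, the ranges in Table 1 can be resampled once per training episode as shown below; uniform sampling is an assumption, since the table only specifies the ranges.

```python
import numpy as np

# Ranges copied from Table 1; one sample is drawn per training episode.
RANDOMIZATION_RANGES = {
    "payload_mass_kg": (-2.0, 4.0),
    "friction_coeff":  (0.3, 1.7),
    "motor_strength":  (0.8, 1.2),
    "pd_factor":       (0.9, 1.1),
    "latency_ms":      (0.0, 20.0),
}

def sample_episode_randomization(rng=None):
    """Uniformly sample one set of domain-randomization parameters."""
    rng = rng or np.random.default_rng()
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in RANDOMIZATION_RANGES.items()}
```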

Curriculum training is used to gradually increase difficulty, with terrain challenges tailored to each gait. For example, walk policies train on shorter, more rugged terrains, while gallop policies use longer, smoother courses. This ensures that each gait policy masters locomotion under appropriate conditions.
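As a rough illustration of how such a curriculum can be driven, the toy rule below promotes a robot to harder terrain when its velocity tracking is accurate and demotes it after a fall. The thresholds and the promotion criterion are assumptions; the article only states that difficulty increases gradually and that terrains are tailored per gait.

```python
def update_terrain_level(level, max_level, tracking_error, fell,
                         promote_thresh=0.1):
    """Toy terrain-curriculum rule: harder terrain after good tracking,
    easier terrain after a fall."""
    if fell:
        return max(level - 1, 0)
    if tracking_error < promote_thresh:
        return min(level + 1, max_level)
    return level
```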

Experimental Setup

We evaluate our method using the Jueying Lite3 and Unitree A1 quadruped robots in simulation. Training is conducted in Isaac Gym with PyTorch, and the trained policies are additionally tested in RaiSim for a more realistic evaluation. Policies are trained with 4096 parallel environments, an episode length of 8 seconds, and a discount factor of 0.995. Network architectures use hidden layers of sizes [256, 128] for the backbone, [512, 128] for the encoder, and [256, 256, 64] for the gait selector. The observation history length is 25 steps.
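The PyTorch sketch below instantiates MLPs with the listed hidden sizes. The hidden layers, the 25-step history, and the 12 gait parameters (4 offsets, 4 duty cycles, 4 periods) follow the text; all other dimensions (observation size, latent size, height-map resolution) and the ELU activations are illustrative placeholders.

```python
import torch.nn as nn

def mlp(in_dim, hidden_sizes, out_dim, act=nn.ELU):
    """Build a simple MLP with the given hidden layer sizes."""
    layers, dim = [], in_dim
    for h in hidden_sizes:
        layers += [nn.Linear(dim, h), act()]
        dim = h
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

# Placeholder dimensions (not specified in the text), except HIST_LEN and
# NUM_GAIT_PARAMS, which follow the setup described above.
OBS_DIM, HIST_LEN, LATENT_DIM = 45, 25, 16
NUM_JOINTS, NUM_GAITS, NUM_GAIT_PARAMS, HEIGHT_MAP_DIM = 12, 3, 12, 187

encoder = mlp(OBS_DIM * HIST_LEN, [512, 128], LATENT_DIM + 3)    # z and v_hat
backbone = mlp(LATENT_DIM + 3 + 3 + 8 + OBS_DIM + NUM_JOINTS,    # z, v_hat, v*, g_theta, s_t, a_prev
               [256, 128], NUM_JOINTS)                           # joint position targets
selector = mlp(OBS_DIM * HIST_LEN + 3 + HEIGHT_MAP_DIM + NUM_GAIT_PARAMS,
               [256, 256, 64], NUM_GAITS + NUM_GAIT_PARAMS)      # w logits and theta

PPO_CONFIG = {"num_envs": 4096, "episode_length_s": 8.0, "gamma": 0.995}
```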

Comparison groups include single-gait policies at different frequencies, an FSM baseline that switches gaits based on velocity thresholds, and prior methods like emergent gait (EG) distillation and convex MPC. Evaluation metrics are:

  • RMSE: Velocity tracking error at low (0.375 m/s), medium (0.9 m/s), and high (1.5 m/s) commands.
  • CoT: Average cost of transport over feasible velocity ranges (the RMSE and CoT computations are sketched after this list).
  • TSR: Traverse success rate on challenging terrains (fractal, slopes, discrete platforms, gaps, projectiles).
  • Safety: Ratio of joint velocities during gait transitions to those during steady-state locomotion (values near 1.0 indicate smooth transitions).
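A minimal sketch of how the first two metrics can be computed from logged rollouts is given below; the function names and logging format are illustrative, and the CoT uses the same joint-power proxy as the energy reward.

```python
import numpy as np

def velocity_rmse(v_cmd, v_log):
    """RMSE between the commanded and measured base velocity over an episode."""
    v_log = np.asarray(v_log)
    return float(np.sqrt(np.mean((v_log - v_cmd) ** 2)))

def cost_of_transport(q_dot_log, tau_log, v_log, mass, g=9.81):
    """Average dimensionless CoT: joint mechanical power / (m * g * speed)."""
    power = np.abs(np.einsum("ij,ij->i", q_dot_log, tau_log))  # |q_dot . tau| per step
    speed = np.linalg.norm(v_log, axis=-1)
    return float(np.mean(power / (mass * g * np.clip(speed, 1e-3, None))))
```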

Feasible velocity ranges for each gait are determined based on failure rates, tracking errors, and contact state errors. For instance, walk policies perform well below 0.5 m/s, trot up to 1.5 m/s, and gallop beyond 2.0 m/s. The gait selector is trained to operate within these ranges, switching gaits smoothly to minimize CoT.
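For comparison, the FSM baseline reduces to a rule of the following form. The threshold values are assumptions inferred from the feasible ranges quoted above; the learned selector replaces this rule with a policy that also conditions on terrain and proprioceptive history.

```python
def fsm_gait(v_cmd_speed, walk_to_trot=0.5, trot_to_gallop=1.5):
    """Velocity-threshold gait switching in the style of the FSM baseline."""
    if v_cmd_speed < walk_to_trot:
        return "walk"
    if v_cmd_speed < trot_to_gallop:
        return "trot"
    return "gallop"
```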

Results and Analysis

Single-gait policies exhibit distinct performance characteristics. Walk policies achieve low CoT at low speeds but fail at high velocities, while gallop policies enable fast locomotion with higher energy consumption. Trot policies balance speed and efficiency, with a wide feasible range. Training with multiple frequencies shows that higher frequencies expand the velocity range but can reduce robustness. For example, a trot policy at 2.4 Hz tracks commands up to 2.5 m/s but has a higher failure rate on rough terrain compared to 1.2 Hz.

Table 2 summarizes the performance of single-gait policies at typical frequencies, our multigait policy, and the FSM baseline. Our method achieves the lowest CoT across all velocities, with smooth transitions that avoid the spikes seen in FSM. Velocity tracking errors are comparable to the best single-gait policies, and success rates on challenging terrains are higher due to adaptive gait selection.

Table 2: Performance Comparison Under Different Conditions

| Metric | Walk (0.8 Hz) | Walk (1.6 Hz) | Trot (1.2 Hz) | Trot (2.4 Hz) | Gallop (2.0 Hz) | Gallop (3.2 Hz) | FSM | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RMSE(ΔvL) | 0.1034 | 0.0920 | 0.0954 | 0.0982 | 0.0966 | 0.0941 | 0.1125 | 0.0929 |
| RMSE(ΔvM) | 0.3768 | 0.1430 | 0.1365 | 0.0860 | 0.0621 | 0.0625 | 0.0827 | 0.0620 |
| RMSE(ΔvH) | 0.8916 | 0.4997 | 0.3042 | 0.0880 | 0.0760 | 0.0500 | 0.0650 | 0.0522 |
| CoTavg | 0.7241 | 0.4733 | 0.6886 | 0.6041 | 0.5675 | 0.4514 | 0.4531 | 0.4306 |
| TSRfractal | 0.90 | 0.95 | 1.00 | 1.00 | 0.80 | 0.85 | 1.00 | 1.00 |
| TSRslope | 0.85 | 0.85 | 1.00 | 1.00 | 0.65 | 0.70 | 1.00 | 1.00 |
| TSRdiscrete | 0.80 | 0.95 | 1.00 | 0.95 | 0.70 | 0.60 | 0.75 | 1.00 |
| TSRgaps | 0.80 | 0.80 | 0.90 | 1.00 | 0.80 | 0.80 | 1.00 | 0.95 |
| TSRprojectile | 0.95 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |

RMSE(ΔvL/ΔvM/ΔvH): velocity tracking RMSE (m/s) at the low (0.375 m/s), medium (0.9 m/s), and high (1.5 m/s) commands; CoTavg: average cost of transport; TSR: traverse success rate on each terrain type.

Energy efficiency analysis, as shown in Figure 1, reveals that our multigait policy consistently achieves the lowest CoT, with no abrupt increases at transition points. In contrast, FSM exhibits spikes due to discrete switching. The gait selector prefers robust gaits like trot on complex terrains, optimizing for both success rate and energy. For instance, on discrete platforms, our policy maintains a 100% success rate by switching to trot, whereas single-gallop policies drop to 60%.

Safety during transitions is assessed through joint velocity ratios. Our method produces smoother transitions, with lower peak joint velocities compared to FSM. Figure 2 illustrates the distribution of joint velocity ratios, where our approach clusters closer to 1.0, indicating minimal disruption. This aligns with the energy efficiency results, as violent transitions increase CoT.

Comparison with prior methods in Table 3 shows that our approach outperforms EG distillation and convex MPC in CoT across velocities. For example, at 0.375 m/s, our CoT is 0.3551, while EG and MPC are 0.8059 and 2.2390, respectively. This highlights the benefits of explicit gait conditioning and adaptive selection.

Table 3: Energy Efficiency Comparison at Different Velocity Commands

| v* (m/s) | Ours | EG | Convex MPC |
| --- | --- | --- | --- |
| 0.375 | 0.3551 | 0.8059 | 2.2390 |
| 0.9 | 0.3803 | 0.3936 | 0.7506 |
| 1.5 | 0.3937 | 0.5841 | 0.5871 |

Conclusion and Discussion

This work presents a learning framework for efficient multigait locomotion in quadruped robots. By training gait-conditioned policies and a gait selector, we enable adaptive behavior that minimizes the cost of transport while maintaining robustness. Experimental results demonstrate seamless transitions, improved energy efficiency, and superior terrain adaptability compared to single-gait and FSM baselines. The modular architecture allows for easy extension to additional gaits and skills, making it suitable for real-world applications.

However, limitations exist. The controller currently switches between gait policies discretely; although the resulting transitions are smooth, continuous blending of gait policies could improve them further. Additionally, energy efficiency, while improved, still falls short of biological systems due to the lack of elastic elements such as tendons. Future work will incorporate compliant mechanisms and explore more complex motor skills, such as jumping and dancing, within the same framework. Integration with perception-based gait selection could also enhance adaptability in unstructured environments.

In summary, our approach advances the state of the art in quadruped robot locomotion by combining learning-based gait conditioning with efficient skill integration. The resulting robot dog controller achieves near-optimal energy consumption across speeds, paving the way for more autonomous and sustainable legged robots.
