In recent years, the development of quadruped robots, often referred to as robot dogs, has garnered significant attention due to their potential in applications such as industrial inspection, disaster response, and exploration. These robots must navigate complex environments with varying ground roughness, which poses substantial challenges for stable and efficient locomotion. Traditional control methods, including model predictive control (MPC), rely on simplified dynamics and struggle in low-friction conditions where uncertainties in ground interaction lead to performance degradation. Reinforcement learning (RL) offers an alternative by enabling robots to learn adaptive strategies through environmental interaction, but it often faces issues with simulation-to-real transfer and inadequate handling of centroid adjustments for stability. In this work, we propose a hierarchical control framework that integrates model-based approaches with learning-based techniques to enhance the adaptability of quadruped robots across different terrain types. Our method focuses on real-time estimation of ground properties, continuous contact state modeling, and dynamic centroid adjustment to maintain high-speed motion while minimizing foot slippage. Through extensive simulations, we demonstrate that our approach outperforms baseline controllers, particularly in extreme low-friction scenarios, by effectively balancing speed and stability.

The core of our framework lies in a two-layer architecture: a decision layer for high-level adaptation and an execution layer for real-time control. The decision layer utilizes a Long Short-Term Memory (LSTM) network to estimate ground friction coefficients and predict sliding distances based on historical state sequences. This allows for adaptive adjustment of the centroid height, a critical factor in maintaining stability on slippery surfaces. For instance, in low-friction environments, lowering the centroid reduces overturning moments and increases foot-ground contact area, thereby enhancing friction utilization. The execution layer, trained via reinforcement learning, translates high-level commands into joint-level actions using a proximal policy optimization (PPO) algorithm. Key innovations include a penalty mechanism based on single foot placement deviation, which avoids over-penalization of transient slips, and a continuous contact state description using hyperbolic tangent functions to smooth phase transitions. Experiments conducted in the Isaac Gym simulation environment validate our method’s robustness across friction coefficients of $\mu \in \{0.05, 0.2, 1.0\}$ and speeds of 1.5 m/s and 2.0 m/s. Results show that our adaptive control reduces foot sliding distance to as low as $0.308 \pm 0.005$ cm while maintaining a speed of 1.428 m/s on $\mu = 0.05$ terrain, underscoring its practical value for real-world deployments.
Quadruped robots exhibit a variety of gait patterns, such as trotting, pacing, bounding, and pronking, each suited to specific environmental conditions. The trot gait, characterized by diagonal leg support, is particularly effective for flat and uneven terrains due to its inherent stability and efficient force distribution. In our approach, we model gait generation using phase variables derived from central pattern generators (CPGs), which provide a unified framework for coordinating leg movements. The phase $\phi_i$ for each leg $i \in \{\text{LF}, \text{RF}, \text{LH}, \text{RH}\}$ is defined relative to a global time variable $t$, updated as $t_{\text{new}} = t_{\text{old}} + \frac{f_{\text{cmd}}}{f_{\pi}}$, where $f_{\text{cmd}}$ is the commanded frequency and $f_{\pi}$ is the system update rate. The phase offsets $\delta_i$ are determined by parameters $\theta = \{\theta_1, \theta_2, \theta_3\}$ to achieve specific gaits, as summarized in Table 1.
| Gait Type | $\theta_1$ | $\theta_2$ | $\theta_3$ |
|---|---|---|---|
| Pronk | 0 | 0 | 0 |
| Trot | $\pi$ | $\pi$ | 0 |
| Pace | $\pi$ | 0 | $\pi$ |
| Bound | 0 | $\pi$ | $\pi$ |
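To make the phase bookkeeping concrete, the following minimal Python sketch implements the global clock update and the per-leg offsets of Table 1. The mapping of $\theta_1, \theta_2, \theta_3$ to individual legs and the wrapping of phases to $[0, 1)$ are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

# Offset parameters (theta_1, theta_2, theta_3) for each gait, from Table 1.
GAIT_THETAS = {
    "pronk": (0.0, 0.0, 0.0),
    "trot":  (np.pi, np.pi, 0.0),
    "pace":  (np.pi, 0.0, np.pi),
    "bound": (0.0, np.pi, np.pi),
}

def update_global_phase(t_old, f_cmd, f_update):
    """One control tick of the global clock: t_new = t_old + f_cmd / f_update."""
    return (t_old + f_cmd / f_update) % 1.0  # wrap the phase to [0, 1)

def leg_phases(t, gait="trot"):
    """Per-leg phases for (LF, RF, LH, RH), expressed as fractions of one cycle.

    Assumed convention: the reference leg (LF) carries no offset and the three
    theta parameters shift the remaining legs; offsets are theta / (2*pi).
    """
    th1, th2, th3 = GAIT_THETAS[gait]
    offsets = np.array([0.0, th1, th2, th3]) / (2.0 * np.pi)
    return (t + offsets) % 1.0

# Example: advance the clock at f_cmd = 3 Hz with a 100 Hz update rate.
t = update_global_phase(0.0, f_cmd=3.0, f_update=100.0)
print(leg_phases(t, gait="trot"))  # the two diagonal pairs sit half a cycle apart
```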
Contact state modeling is crucial for stable locomotion. Traditional binary methods (0 for no contact, 1 for full contact) cause abrupt transitions and impact forces. We propose a continuous description using the hyperbolic tangent function: $$\Psi(x, \sigma) = \frac{1}{2} \left(1 + \tanh\left(\frac{x}{\sigma}\right)\right),$$ where $\sigma$ controls smoothing. The contact state for a single foot is then: $$C_{\text{foot}}(t) = \Psi(t, \sigma) \times (1 - \Psi(t - 0.5, \sigma)) + \Psi(t - 1, \sigma) \times (1 - \Psi(t - 1.5, \sigma)).$$ This formulation ensures smooth transitions between swing and stance phases, reducing impulses and improving gait similarity by up to 8.93% compared to discrete methods, as validated in our experiments.
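As a minimal sketch, the smoothed contact indicator above translates directly into NumPy; the value of $\sigma$ used here is only an illustrative choice.

```python
import numpy as np

def psi(x, sigma):
    """Smoothed step Psi(x, sigma) = 0.5 * (1 + tanh(x / sigma))."""
    return 0.5 * (1.0 + np.tanh(x / sigma))

def contact_state(t, sigma=0.02):
    """Continuous contact state C_foot(t) for a normalised phase t.

    The two windowed terms softly switch the foot into stance on [0, 0.5]
    and again on [1, 1.5], so swing/stance transitions carry no step change.
    """
    return (psi(t, sigma) * (1.0 - psi(t - 0.5, sigma))
            + psi(t - 1.0, sigma) * (1.0 - psi(t - 1.5, sigma)))

# Example: a sweep over one and a half cycles shows smooth 0 -> 1 -> 0 ramps.
phases = np.linspace(0.0, 1.5, 7)
print(np.round(contact_state(phases), 3))
```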
The hierarchical control framework operates at two time scales: the decision layer runs at 10 Hz for environmental adaptation and the execution layer at 100 Hz for real-time control. The decision layer takes as input a history of 50 states $\{s_{t-49}, \ldots, s_t\}$ together with the velocity command $v_{\text{cmd}}$, and outputs the estimated friction $\mu_{\text{est}}$, the predicted sliding distance $d_{\text{slid}}$, and the desired centroid height $h_{\text{com}}^{\text{cmd}}$. The state vector includes joint positions $q_j$, joint velocities $\dot{q}_j$, base angular velocity $\dot{q}_b$, gravity direction $r_g$, the command velocity $v_{\text{cmd}}$, and control parameters $c_t$. The LSTM network, with 128 hidden units and self-attention, is trained on data from $\mu = 1.0$ environments using the loss function: $$\mathcal{L} = \alpha \|\mu_{\text{est}} - \mu_{\text{true}}\|^2 + \beta \|d_{\text{slid}} - d_{\text{true}}\|^2,$$ with $\alpha = 0.2$ and $\beta = 0.8$. Gaussian noise is added during training to improve generalization.
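A minimal PyTorch sketch of the decision-layer estimator and its loss is given below, assuming a single 128-unit LSTM over the 50-step history, a multi-head self-attention layer, and a linear head producing $(\mu_{\text{est}}, d_{\text{slid}}, h_{\text{com}}^{\text{cmd}})$; the attention configuration and input dimensionality are assumptions, not the exact architecture.

```python
import torch
import torch.nn as nn

class DecisionLayer(nn.Module):
    """LSTM estimator: 50-step state history -> (mu_est, d_slid, h_com_cmd)."""

    def __init__(self, state_dim=37, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden_dim, 3)

    def forward(self, history):
        # history: (batch, 50, state_dim); Gaussian noise can be added here
        # during training to improve generalisation.
        feats, _ = self.lstm(history)
        feats, _ = self.attn(feats, feats, feats)  # self-attention over time
        return self.head(feats[:, -1])             # read out the last step

def estimation_loss(pred, mu_true, d_true, alpha=0.2, beta=0.8):
    """L = alpha * ||mu_est - mu_true||^2 + beta * ||d_slid - d_true||^2."""
    mu_est, d_slid = pred[:, 0], pred[:, 1]
    return (alpha * (mu_est - mu_true).pow(2).mean()
            + beta * (d_slid - d_true).pow(2).mean())
```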
Centroid adaptation is inspired by biological observations where animals lower their center of mass on slippery surfaces. The desired height $h_{\text{com}}^{\text{des}}$ is computed as: $$h_{\text{com}}^{\text{des}} = \text{clip}(h_{\text{base}} - \Delta h_{\text{adapt}}, 0, h_{\text{max}}),$$ where $\Delta h_{\text{adapt}} = \frac{|d_{\text{slid}} - d_{\text{slid}}^{\text{avg}}|}{d_{\text{slid}}^{\text{avg}}} \times c$, with $d_{\text{slid}}^{\text{avg}}$ being the 50-step average sliding distance and $c$ a gain coefficient. To ensure smoothness, the commanded height is adjusted gradually: $$h_{\text{com}}^{\text{cmd}} = h_{\text{com}}^{\text{prev}} + \text{clip}(h_{\text{com}}^{\text{des}} - h_{\text{com}}^{\text{prev}}, -\dot{h}_{\text{max}} \Delta t, \dot{h}_{\text{max}} \Delta t).$$ This strategy enhances stability by optimizing joint torque distribution and increasing effective friction.
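The height-adaptation rule translates into a few lines of Python; the values passed in the example (base height, gain $c$, rate limit) are placeholders, not tuned parameters from this work.

```python
import numpy as np

def desired_com_height(d_slid, d_slid_avg, h_base, h_max, c):
    """h_des = clip(h_base - dh, 0, h_max), dh = |d_slid - d_avg| / d_avg * c."""
    dh_adapt = abs(d_slid - d_slid_avg) / d_slid_avg * c
    return float(np.clip(h_base - dh_adapt, 0.0, h_max))

def rate_limited_height(h_des, h_prev, h_dot_max, dt):
    """Move the commanded height toward h_des by at most h_dot_max * dt per step."""
    step = np.clip(h_des - h_prev, -h_dot_max * dt, h_dot_max * dt)
    return h_prev + float(step)

# Example with placeholder values: sliding 30% above its running average
# lowers the desired height, applied at most 1 cm per 0.1 s decision step.
h_des = desired_com_height(d_slid=0.39, d_slid_avg=0.30, h_base=0.32, h_max=0.35, c=0.05)
h_cmd = rate_limited_height(h_des, h_prev=0.32, h_dot_max=0.10, dt=0.1)
```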
The execution layer employs RL to map states to actions. The state space $s_t \in \mathbb{R}^{37}$ includes the robot state, commands, and control parameters; the command and control-parameter ranges are detailed in Table 2. The action $a_t \in \mathbb{R}^{12}$ represents target joint positions, converted to torques via a PD controller: $$\tau = k_p (a_t - q_{\text{current}}) + k_d (\dot{a}_t - \dot{q}_{\text{current}}),$$ with $k_p = 20$ N·m/rad and $k_d = 0.5$ N·m·s/rad. We use a teacher-student training approach, where the teacher network has access to privileged information (e.g., true friction and contact forces), and the student network learns from observable states only. The reward function combines task-oriented and constraint terms, with a novel single foot placement penalty: $$r_{\text{contact}} = \|p_{\text{foot}}^c - p_{\text{foot}}^{c, \text{cmd}}\| \times I(t = t_{\text{contact}}),$$ where $I(\cdot)$ is an indicator function activated only at the initial contact time $t_{\text{contact}}$. This avoids continuous penalization and encourages precise foot placement. The total reward is: $$r_{\text{total}} = r_{\text{task}} \times \exp(r_{\text{aux}}),$$ where $r_{\text{task}}$ includes positive rewards for tracking and stability, and $r_{\text{aux}}$ aggregates constraint penalties. The PPO algorithm optimizes the policy by maximizing the expected discounted reward, with hyperparameters listed in Table 3; a minimal code sketch of the torque conversion and contact penalty follows Table 3.
| Parameter | Min | Max | Unit |
|---|---|---|---|
| $v_x^{\text{cmd}}$ | -3.00 | 3.00 | m/s |
| $v_y^{\text{cmd}}$ | -1.00 | 1.00 | m/s |
| $\omega_z^{\text{cmd}}$ | -1.00 | 1.00 | rad/s |
| $f^{\text{cmd}}$ | 1.50 | 4.00 | Hz |
| $h_{\text{com}}^{\text{cmd}}$ | -0.45 | 0.10 | m |
| $h_{\text{foot}}^{\text{cmd}}$ | 0.03 | 0.30 | m |

| Parameter | Value |
|---|---|
| Batch Size | 4096 × 24 |
| Mini-batch Size | 4096 × 6 |
| Iterations | 5 |
| Clipping Range | 0.20 |
| Entropy Coefficient | 0.01 |
| Discount Factor | 0.99 |
| GAE Discount | 0.95 |
| Target KL Divergence | 0.01 |
| Learning Rate | Adaptive |
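As referenced above, the following is a minimal sketch of the execution-layer torque conversion, the single-foot-placement penalty, and the reward composition; the way initial contacts are detected (a per-foot boolean mask) is an assumption for illustration.

```python
import numpy as np

KP, KD = 20.0, 0.5  # PD gains from the text

def action_to_torque(a_t, q, q_dot, a_dot=None):
    """tau = kp * (a_t - q) + kd * (a_dot - q_dot) for 12 target joint positions."""
    a_dot = np.zeros_like(q_dot) if a_dot is None else a_dot
    return KP * (a_t - q) + KD * (a_dot - q_dot)

def contact_placement_penalty(p_foot, p_foot_cmd, new_contact):
    """Placement error counted only at the instant a foot first touches down.

    new_contact is a per-foot boolean mask that is True only on the step where
    contact begins, so transient slips later in stance are not re-penalised.
    """
    err = np.linalg.norm(p_foot - p_foot_cmd, axis=-1)
    return float(np.sum(err * new_contact))

def total_reward(r_task, r_aux):
    """r_total = r_task * exp(r_aux), with r_aux aggregating (negative) penalties."""
    return r_task * np.exp(r_aux)
```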
Experiments were conducted in Isaac Gym to evaluate performance across friction coefficients $\mu \in \{0.05, 0.2, 1.0\}$ and speeds of 1.5 m/s and 2.0 m/s. We compared our adaptive controller against a baseline that uses a fixed centroid height. The trot gait was selected as the base gait due to its diagonal support pattern, which provides inherent stability. Performance metrics included centroid height adjustment, foot sliding distance, and actual speed. Results for the trot gait under low-speed conditions (1.5 m/s) are summarized in Table 4, showing that our method achieves lower sliding distances with only small height adjustments while maintaining high speeds. For example, at $\mu = 0.05$, adaptive control reduced sliding to $0.308 \pm 0.005$ cm with a height change of only -0.061 m, whereas the fixed-height baseline had to lower its centroid by 0.400 m to reach similar sliding and its speed dropped to 1.334 m/s.
| Adaptive Control | $\mu$ | Centroid Height Change (m) | Foot Sliding Distance (cm) | Actual Speed (m/s) |
|---|---|---|---|---|
| Yes | 1.0 | -0.0311 | 0.290 ± 0.0004 | 1.4900 |
| No | 1.0 | 0 | 0.316 ± 0.0500 | 1.5039 |
| Yes | 0.2 | -0.0380 | 0.312 ± 0.0040 | 1.5020 |
| No | 0.2 | 0 | 0.327 ± 0.0600 | 1.5481 |
| Yes | 0.05 | -0.0610 | 0.308 ± 0.0050 | 1.4284 |
| No | 0.05 | 0 | 0.343 ± 0.0800 | 1.4550 |
Under high-speed conditions (2.0 m/s), our adaptive controller maintained superior performance. For instance, at $\mu = 0.05$, it achieved a sliding distance of $0.239 \pm 0.002$ cm with a height adjustment of -0.275 m, while the baseline with -0.400 m height change resulted in $0.389 \pm 0.0116$ cm sliding and a lower speed of 1.4957 m/s. These results highlight the efficiency of our centroid adaptation strategy in balancing speed and stability. The continuous contact model also improved gait similarity by reducing phase transition impacts, with gains of up to 3.96% for trot and 8.93% for pronk gaits compared to discrete methods.
In conclusion, our hierarchical control framework enables quadruped robots to adapt dynamically to varying ground roughness by integrating real-time estimation, continuous state modeling, and centroid adjustment. The use of LSTM networks for environmental perception and a reward function with single foot placement penalties enhances learning efficiency and stability. Future work will focus on extending this approach to more diverse terrains, optimizing adaptive parameters, and deploying on physical robot dog platforms to validate real-world applicability. This research underscores the potential of combining model-based and learning-based methods for robust locomotion in challenging environments.
