Structure-Control Co-design for Quadruped Robots Using Pre-training and Fine-tuning

In nature, animals such as cougars exhibit remarkable locomotion capabilities, often characterized by asymmetric leg structures where powerful hind legs provide the energy for leaps and jumps. This biological inspiration has motivated our research into optimizing the mechanical design of quadruped robots, specifically focusing on leg length adjustments to enhance overall motion performance. Traditional robot design often relies on empirical knowledge, lacking a systematic approach to co-optimize morphology and control. To address this, we propose a novel pre-training and fine-tuning framework for structure-control co-design, which efficiently learns optimal control policies for various robot morphologies while reducing computational costs. Our method leverages deep reinforcement learning (DRL) to handle complex tasks like parkour, where jumping and climbing require precise coordination between physical structure and control strategies. By integrating spatial domain randomization and discount regularization during pre-training, we develop a generalized policy that adapts to diverse leg lengths. Subsequently, fine-tuning this policy for specific structures within a Bayesian optimization loop ensures optimal performance. This approach not only improves training efficiency but also surpasses traditional methods that optimize control policies independently, offering a new solution for enhancing the extreme capabilities of quadruped robots.

The co-design of robot morphology and control is a challenging problem, as it involves a bi-level optimization where the lower level finds the best control policy for a given structure, and the upper level searches for the optimal structure based on the policy’s performance. Previous work has either trained dedicated controllers for each candidate structure, which is computationally expensive, or used generalized policies that may not guarantee optimality. Our framework bridges this gap by pre-training a robust policy that generalizes across structures and then fine-tuning it quickly for specific parameters. This allows us to explore the design space efficiently using Bayesian optimization, where the fitness function is the cumulative reward from parkour tasks. We validate our method through simulations in Isaac Gym, demonstrating significant improvements in jumping height and distance compared to baseline algorithms. The key contributions include the introduction of spatial domain randomization to handle multiple morphologies simultaneously, the application of discount regularization to enhance generalization, and a comprehensive experimental evaluation showing the superiority of our approach in both single-task and multi-task scenarios.

In the following sections, we detail our methodology, starting with the pre-training phase that employs spatial domain randomization and discount regularization. We then describe the fine-tuning process integrated with Bayesian optimization for structure selection. Experimental results compare our approach with existing methods, highlighting gains in training efficiency and task performance. Finally, we discuss implications for robot design, including the potential for adaptive leg structures in future quadruped robots.

Methodology

Our approach formulates the structure-control co-design as a bi-level optimization problem. The lower level involves learning an optimal control policy for a given robot morphology, while the upper level optimizes the morphological parameters (e.g., leg lengths) to maximize performance in specific tasks. We model the control problem as a Markov Decision Process (MDP) defined by the tuple $$(S, A, R, P, \gamma)$$, where $$S$$ is the state space, $$A$$ is the action space, $$R(s_t, a_t)$$ is the reward function, $$P(s_{t+1} | s_t, a_t)$$ is the state transition probability, and $$\gamma$$ is the discount factor. The goal is to find a policy $$\pi^*(a_t | s_t)$$ that maximizes the expected discounted cumulative reward: $$\mathbb{E} \left[ \sum_{t=0}^{T} \gamma^t r_t \right]$$. We use the Proximal Policy Optimization (PPO) algorithm to solve this, due to its stability and effectiveness in robotic control tasks.
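As a minimal illustration of this objective, the sketch below (in Python, with placeholder reward values) computes the discounted cumulative reward of a single recorded episode; PPO maximizes the expectation of this quantity over trajectories.

```python
import numpy as np

def discounted_return(rewards: np.ndarray, gamma: float = 0.99) -> float:
    """Discounted cumulative reward sum_t gamma^t * r_t for one episode."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Placeholder rewards for a short five-step episode.
episode_rewards = np.array([0.1, 0.2, 0.0, 0.5, 1.0])
print(discounted_return(episode_rewards))  # the quantity whose expectation PPO maximizes
```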

The state space for our quadruped robot includes several components: proprioceptive information $$x_t$$ (e.g., body angular velocity, roll and pitch angles, joint positions and velocities), explicit privileged information $$e'_t$$ (e.g., linear velocity), privileged information $$e_t$$ (e.g., robot mass, center of mass, structural parameters, friction coefficients), external perception $$m_t$$ (e.g., terrain height samples), and historical proprioception $$h_t$$. The action space consists of target joint angles for the 12 actuators, which are converted to torques using a PD controller. The reward function is designed to encourage task completion, such as reaching waypoints in parkour tasks, and includes terms for goal tracking, foot clearance, velocity tracking, and penalties for excessive actions or torques. Specifically, the reward $$r_t$$ at time $$t$$ is computed as:

$$r_t = w_1 \cdot \text{goal\_track} + w_2 \cdot \text{foot\_clearance} + w_3 \cdot \text{yaw\_track} + \sum_{j} w_j \cdot \text{penalty}_j$$

where $$w_i$$ are weights, and the terms are defined as follows: goal tracking encourages movement toward a target waypoint, foot clearance penalizes contact near edges, and yaw tracking aligns the robot’s orientation. Additional penalties regulate joint accelerations, torques, and body rotations to ensure smooth and efficient motion.
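The exact term definitions and weights are task-specific and not listed in full here; the sketch below shows how such a weighted reward could be assembled, with every key name, term formula, and weight being an illustrative assumption rather than the paper's implementation.

```python
import numpy as np

def parkour_reward(state: dict, w: dict) -> float:
    """Illustrative weighted reward; the `state` and `w` keys are hypothetical."""
    # Reward progress toward the current waypoint (larger when closer to the goal).
    goal_track = np.exp(-np.linalg.norm(state["base_pos"] - state["waypoint"]))
    # Penalize foot contacts near obstacle edges (edge_contact in [0, 1]).
    foot_clearance = -state["edge_contact"]
    # Reward alignment of the body yaw with the direction to the waypoint.
    yaw_track = np.exp(-abs(state["yaw"] - state["target_yaw"]))
    # Regularization penalties on joint torques and accelerations for smooth motion.
    penalties = -(w["torque"] * np.sum(state["torques"] ** 2)
                  + w["joint_acc"] * np.sum(state["joint_acc"] ** 2))
    return (w["goal"] * goal_track
            + w["clearance"] * foot_clearance
            + w["yaw"] * yaw_track
            + penalties)
```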

Pre-training Phase

In the pre-training phase, we aim to learn a generalized control policy that performs well across a wide range of leg lengths. To achieve this, we introduce spatial domain randomization, which involves training multiple robots with different morphologies simultaneously in a parallel simulation environment. Unlike temporal domain randomization, which changes parameters over time, spatial domain randomization leverages the massive parallelism of simulators like Isaac Gym to train thousands of robots with varying leg lengths concurrently. This approach enhances the policy’s robustness without significant computational overhead.

We modify the URDF (Unified Robotics Description Format) file of the robot to adjust leg lengths. The parameters include scaling factors $$\xi_i$$ for the thigh and shank segments of each leg, where $$i = 0$$ for front thighs, $$i = 1$$ for front shanks, $$i = 2$$ for hind thighs, and $$i = 3$$ for hind shanks. Each $$\xi_i$$ is sampled from a uniform distribution $$\xi_i \sim U(0.6, 1.4)$$, and the corresponding URDF parameters (e.g., link origins, masses, inertias) are updated accordingly. To maintain stability, symmetric adjustments are applied to left and right legs. The PD controller gains are also scaled using a polynomial function to account for changes in leg length:

$$\eta_i = a \xi_i^3 + b \xi_i^2 + c \xi_i + d$$

where $$a, b, c, d$$ are hand-tuned coefficients. The proportional and derivative gains for each joint are then adjusted as $$k_p^i = \eta_i \cdot \bar{k}_p^i$$ and $$k_d^i = \eta_i \cdot \bar{k}_d^i$$, where $$\bar{k}_p^i$$ and $$\bar{k}_d^i$$ are the default values.
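A minimal sketch of how the per-robot leg scales and gain scaling could be set up in a NumPy pipeline; the polynomial coefficients and default gains below are placeholders, not the hand-tuned values used in the paper.

```python
import numpy as np

# Placeholder polynomial coefficients (a, b, c, d); the paper's values are hand-tuned.
A, B, C, D = 0.2, 0.1, 0.6, 0.1

def sample_leg_scales(num_envs: int, low: float = 0.6, high: float = 1.4) -> np.ndarray:
    """One scaling factor per segment group (front thigh, front shank, hind thigh, hind shank)
    for every parallel environment; left and right legs share a factor for symmetry."""
    return np.random.uniform(low, high, size=(num_envs, 4))

def scale_pd_gains(xi: np.ndarray, kp_default: np.ndarray, kd_default: np.ndarray):
    """Scale the default PD gains with eta = a*xi^3 + b*xi^2 + c*xi + d."""
    eta = A * xi**3 + B * xi**2 + C * xi + D
    return eta * kp_default, eta * kd_default

# Example: 4096 parallel environments, one default gain per segment group.
xi = sample_leg_scales(4096)
kp, kd = scale_pd_gains(xi, kp_default=np.full(4, 30.0), kd_default=np.full(4, 0.7))
```

In the spatial-randomization setting, each parallel environment would load a URDF regenerated with its own $$\xi$$, so the scales stay fixed per robot rather than being resampled over time.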

To further improve generalization, we apply discount regularization by reducing the discount factor $$\gamma$$ to a lower value $$\gamma_{\text{reg}} = 0.98$$ (compared to the standard $$\gamma = 0.99$$). This encourages the value network to focus on short-term rewards, reducing overfitting to specific morphologies. The advantage function $$A_t^{\gamma_{\text{reg}}}$$ and temporal difference error $$\delta_t^{\gamma_{\text{reg}}}$$ are computed as:

$$A_t^{\gamma_{\text{reg}}} = \sum_{l=0}^{\infty} (\gamma_{\text{reg}} \lambda)^l \delta_{t+l}^{\gamma_{\text{reg}}}$$

$$\delta_t^{\gamma_{\text{reg}}} = r_t + \gamma_{\text{reg}} V_\phi(s_{t+1}) - V_\phi(s_t)$$

where $$\lambda$$ is the GAE parameter, and $$V_\phi$$ is the value network. The actor and critic loss functions in PPO are modified accordingly:

$$L_{\text{Actor}} = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}} \left[ \min \left( \frac{\pi_\theta}{\pi_{\theta_{\text{old}}}} A_t^{\gamma_{\text{reg}}}, \text{clip} \left( \frac{\pi_\theta}{\pi_{\theta_{\text{old}}}}, 1-\epsilon, 1+\epsilon \right) A_t^{\gamma_{\text{reg}}} \right) \right]$$

$$L_{\text{Critic}} = \mathbb{E}_{s_t \sim \tau} \left[ \frac{1}{2} \left( r_t + \gamma_{\text{reg}} V_\phi(s_{t+1}) - V_\phi(s_t) \right)^2 \right]$$

This combination of spatial domain randomization and discount regularization enables the pre-trained policy to adapt to unseen leg lengths during fine-tuning.
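The sketch below shows how the regularized discount enters the advantage estimation and PPO losses defined above; tensor shapes and the episode-termination handling are assumptions about the training loop, not details taken from the paper.

```python
import torch

def gae_advantages(rewards, values, dones, gamma_reg=0.98, lam=0.95):
    """GAE with the regularized discount. rewards, dones: (T, N) float tensors;
    values: (T + 1, N) critic estimates, including the bootstrap value for the last state."""
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    gae = torch.zeros_like(rewards[0])
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma_reg * values[t + 1] * not_done - values[t]  # TD error with gamma_reg
        gae = delta + gamma_reg * lam * not_done * gae
        advantages[t] = gae
    return advantages

def ppo_actor_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, negated so it can be minimized by gradient descent."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

def ppo_critic_loss(rewards, values, next_values, gamma_reg=0.98):
    """Squared one-step TD error with the regularized discount, mirroring L_Critic above."""
    target = rewards + gamma_reg * next_values.detach()
    return 0.5 * (target - values).pow(2).mean()
```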

Fine-tuning Phase and Bayesian Optimization

In the fine-tuning phase, we optimize the structural parameters $$\xi = [\xi_0, \xi_1, \xi_2, \xi_3]$$ for specific tasks, such as high jump or long jump. We use Bayesian optimization (BO) as a black-box optimizer to maximize the fitness function $$f(\xi | \kappa, \pi_\xi)$$, which is the average cumulative reward over multiple episodes for a given task $$\kappa$$ and policy $$\pi_\xi$$. The fitness is computed as:

$$f = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} r(s_t^i, a_t^i)$$

where $$N$$ is the number of robots, and $$T$$ is the episode length. For each candidate $$\xi$$ proposed by BO, we fine-tune the pre-trained policy by disabling spatial domain randomization and focusing on the specific morphology. Fine-tuning involves a short training period (e.g., 400 steps) with the highest difficulty level in the curriculum, ensuring rapid adaptation. This process provides accurate fitness evaluations for BO, leading to efficient convergence to the optimal structure.
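As a sketch of how a single candidate structure might be scored, `fine_tune` and `rollout` below are hypothetical helpers standing in for the fine-tuning run and the parallel simulation rollouts:

```python
import numpy as np

def evaluate_structure(xi, pretrained_policy, task, num_robots=100, finetune_steps=400):
    """Fitness f(xi | task, pi_xi): fine-tune the pre-trained policy for this morphology
    (spatial domain randomization disabled), then average cumulative reward over rollouts."""
    policy = fine_tune(pretrained_policy, xi, task, steps=finetune_steps)  # hypothetical helper
    rewards = rollout(policy, xi, task, num_robots=num_robots)             # hypothetical, shape (N, T)
    return float(np.asarray(rewards).sum(axis=1).mean())
```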

The overall optimization problem is:

$$\xi^* = \arg \max_{\xi \in \Xi} f(\xi | \kappa, \pi_\xi)$$

where $$\Xi$$ is the feasible space of leg lengths. By integrating fine-tuning into BO, we reduce the time required to evaluate each structure while maintaining policy optimality.
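The paper treats BO as a black box and does not name an implementation; as one possible realization, a Gaussian-process optimizer such as scikit-optimize's `gp_minimize` could drive the search, reusing the `evaluate_structure` sketch above (the `pretrained_policy` object and task label are assumed to be available).

```python
from skopt import gp_minimize  # scikit-optimize; any GP-based Bayesian optimizer would work

# Each leg-scaling factor is searched over the same range used during pre-training.
search_space = [(0.6, 1.4)] * 4

def objective(xi):
    # gp_minimize minimizes, so negate the fitness of the candidate structure.
    return -evaluate_structure(xi, pretrained_policy, task="long_jump")

result = gp_minimize(objective, search_space, n_calls=50, random_state=0)
xi_star, best_fitness = result.x, -result.fun  # optimized [xi_0, xi_1, xi_2, xi_3] and its fitness
```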

Experimental Setup

We conduct experiments in the Isaac Gym simulator using a model based on the Jueying Lite3 robot dog. The control policy runs at 50 Hz, and the PD controller at 200 Hz. Torques are computed as:

$$\tau_{\text{target}}^i = k_p^i (q_i^* - q_i) + k_d^i (\dot{q}_i^* - \dot{q}_i)$$

$$\tau_{\text{scale}}^i = \tau_{\text{target}}^i \cdot \alpha_i$$

$$\tau_{\text{real}}^i = \text{clip}(\tau_{\text{scale}}^i, -\tau_{\text{max}}, \tau_{\text{max}})$$

where $$\alpha_i$$ is a damping ratio adjusted during training to address sim-to-real gaps, and $$\tau_{\text{max}}$$ is the maximum joint torque. We train with $$N = 8192$$ robots in parallel for pre-training, sampling leg scaling factors from $$U(0.6, 1.4)$$. Tasks include high jump and long jump parkour scenarios, with rewards tailored to encourage aggressive motions.
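The torque pipeline above maps directly to a few lines of code; the sketch below assumes length-12 NumPy arrays, one entry per actuated joint.

```python
import numpy as np

def joint_torques(q_target, q, qd_target, qd, kp, kd, alpha, tau_max):
    """PD torque with per-joint damping-ratio scaling and clipping to the actuator limit."""
    tau = kp * (q_target - q) + kd * (qd_target - qd)  # PD law, evaluated at 200 Hz
    tau = tau * alpha                                   # sim-to-real damping-ratio scaling
    return np.clip(tau, -tau_max, tau_max)              # respect the maximum joint torque
```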

We compare our method against three baseline co-design algorithms: (1) Online PCODP, which uses temporal domain randomization and trains policies sequentially for different structures; (2) Offline PCODP, which employs offline RL to learn a generalized policy from expert data; and (3) EAT, which uses a Transformer-based architecture for generalization. Performance is evaluated based on cumulative reward and task-specific metrics like jump height and distance.

Results and Discussion

Control Policy Performance

We first evaluate the effectiveness of spatial domain randomization and discount regularization in pre-training. Table 1 compares the average cumulative rewards for 100 randomly sampled structures using different domain randomization methods. Our spatial domain randomization achieves significantly higher rewards than temporal domain randomization and no randomization, with p-values below 0.05 in t-tests, confirming its superiority.

Table 1: Comparison of domain randomization methods

| Method | Average Cumulative Reward | p-value vs. Spatial |
| --- | --- | --- |
| Temporal Domain Randomization | 3.51 ± 0.30 | 4.42e-05 |
| No Randomization | 9.91 ± 0.37 | 4.55e-04 |
| Spatial Domain Randomization | 15.58 ± 0.67 | — |

Figure 1 shows the training curves for different regularization techniques. Discount regularization converges to higher rewards than activation regularization, no regularization, and phasic policy gradient, demonstrating its efficacy in enhancing generalization for locomotion tasks.

We also assess training efficiency by comparing standard training (from scratch for each structure) with our pre-training and fine-tuning approach. As shown in Figure 2, fine-tuning converges in about 400 steps (6.67% of standard training time) while achieving comparable performance. For example, training 30 structures with standard methods requires 180,000 episodes, whereas our method needs only 18,000 episodes (6000 for pre-training + 30 × 400 for fine-tuning), reducing time by 90%.

Co-design Performance

We evaluate co-design performance on 81 structures uniformly sampled from the parameter space. Heatmaps in Figure 3 illustrate the cumulative rewards for each method. Our approach achieves the highest rewards across all structures, indicating that fine-tuning provides near-optimal policies for each morphology. In contrast, Online PCODP suffers from forgetting previous structures, Offline PCODP fails to generalize to unseen parameters, and EAT performs poorly due to high-dimensional state spaces.

Table 2 summarizes the cumulative rewards and time consumption for the jump tasks. Our method outperforms the baselines by a factor of three or more in reward: in the long jump, for instance, it achieves 28.23 versus 6.18 for Online PCODP. It also requires under 9 hours of training, compared with over 86 hours for Offline PCODP and EAT.

Table 2: Co-design performance in jump tasks

| Method | Long Jump Reward | High Jump Reward | Time Consumption |
| --- | --- | --- | --- |
| Online PCODP | 6.18 | 10.37 | 3h 5min |
| Offline PCODP | 4.78 | 4.63 | 86h 45min |
| EAT | 0.66 | 0.71 | 86h 17min |
| Our Method | 28.23 | 29.06 | 8h 51min |

Robot Parkour Performance

In simulation, we compare our co-designed robot dog with the Extreme Parkour algorithm, a state-of-the-art method. Table 3 shows that our approach achieves a jump distance of 1.00 m and height of 0.55 m, surpassing Extreme Parkour (0.87 m and 0.47 m). This highlights the importance of structure optimization alongside control.

Table 3: Parkour performance in simulation

| Method | Jump Distance (m) | Jump Height (m) |
| --- | --- | --- |
| Extreme Parkour | 0.87 | 0.47 |
| Our Method | 1.00 | 0.55 |

Figure 4 depicts robots with optimized vs. default structures in jump tasks. For long jump, optimized hind legs (e.g., $$\xi_3 = 1.24$$) provide greater propulsion, while shorter front legs ($$\xi_1 = 0.67$$) prevent tipping. In high jump, longer legs overall (e.g., $$\xi_0 = 1.39$$) aid in climbing. These results underscore the synergy between morphology and control in extreme tasks.

For multi-task environments, we analyze the crural index (shank-to-thigh ratio × 100). Structures with a crural index above 100 achieve higher fitness (29.177 ± 2.02) than those below 100 (27.898 ± 1.76), with a p-value of 0.027 in a t-test. This aligns with biological observations and existing robot designs, such as the Jueying Lite3, which has a thigh length of 200 mm and shank length of 210 mm (crural index of 105).

Conclusion

We present a pre-training and fine-tuning framework for structure-control co-design in quadruped robots, enabling efficient optimization of leg lengths for enhanced parkour performance. By combining spatial domain randomization and discount regularization, we learn a generalized policy that adapts quickly to specific morphologies through fine-tuning. Experimental results demonstrate significant improvements in training efficiency and task performance over existing methods, with our approach achieving higher jump distances and heights in simulation. The co-designed robot dog exhibits morphology tailored to task requirements, such as longer hind legs for jumping and balanced ratios for multi-task scenarios. Future work could explore adaptive leg structures that dynamically adjust to different environments, further pushing the limits of quadruped robot capabilities. This research provides a foundation for data-driven robot design, bridging the gap between biological inspiration and engineering innovation.
