Reinforcement Learning-Based Gait Generation for Bipedal Robots with Telescopic Legs

In recent years, the field of robot technology has witnessed significant advancements, particularly in the development of bipedal robots. These systems are increasingly applied across various industries, including healthcare, manufacturing, and services, due to their exceptional adaptability to diverse environments. However, traditional bipedal robots often suffer from high energy consumption and structural inefficiencies. To address these challenges, I have focused on a novel telescopic leg structure that enhances energy efficiency by reducing the number of active joints and simplifying the overall design. This approach draws inspiration from passive dynamic walkers, which mimic human-like movements with minimal energy input. In this article, I present a reinforcement learning (RL) method for generating stable and efficient gaits for such telescopic-legged bipedal robots. By incorporating periodic functions into the reward design, I aim to encourage symmetric and periodic walking patterns, thereby improving energy efficiency and robustness. The integration of advanced robot technology with RL algorithms allows for adaptive control without requiring extensive domain expertise, making it a promising direction for future developments in legged locomotion.

The core of my approach lies in leveraging deep reinforcement learning (DRL) to optimize gait generation. Traditional control methods for bipedal robots often rely on complex model-based strategies, such as those based on the linear inverted pendulum (LIP) model or angular momentum regulation. While effective, these methods can be computationally intensive and may not fully exploit the energy-saving potential of telescopic leg structures. In contrast, RL enables the robot to learn optimal policies through interaction with the environment, adapting to uncertainties and disturbances autonomously. I utilize the proximal policy optimization (PPO) algorithm within an actor-critic framework, which has proven effective in high-dimensional control tasks. The state space includes joint positions, velocities, body orientation, and control targets, while the action space consists of target values for the joint actuators. A key innovation is the introduction of a periodic signal in the state observations to guide the learning process toward symmetric gaits. This signal alternates rewards and penalties for foot-ground contact, ensuring that the robot develops a rhythmic and efficient walking pattern. The reward function is densely designed to include components for tracking control targets, minimizing torque and action changes, and enforcing periodicity. Through extensive simulations in the IsaacGym environment, I demonstrate that this method not only achieves stable velocity tracking but also exhibits superior energy efficiency compared to conventional approaches. The results highlight the potential of combining telescopic leg designs with RL-based control to advance robot technology in real-world applications.
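
To make the PPO update concrete, the snippet below sketches the clipped surrogate loss at the heart of the algorithm. It is a minimal illustration in PyTorch, not the training code used for the robot; the tensor names and the clipping value of 0.2 are assumptions for the example.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective used by PPO (illustrative sketch).

    new_log_probs : log pi_theta(a|s) under the current policy
    old_log_probs : log pi_theta_old(a|s) recorded when the data was collected
    advantages    : advantage estimates, e.g. from GAE, one per sample
    """
    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Unclipped and clipped surrogate terms
    surr_unclipped = ratio * advantages
    surr_clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # PPO maximizes the minimum of the two terms, so the loss is its negation
    return -torch.min(surr_unclipped, surr_clipped).mean()
```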

To provide a theoretical foundation, I begin with the Markov decision process (MDP) formulation, which underpins the RL framework. The rigid-body dynamics of a bipedal robot can be described by the following equation:

$$ M(q) \ddot{q} + b(q, \dot{q}) + g(q) = S \tau + J_{\text{ext}}^T F_{\text{ext}} $$

where \( q \) represents the joint positions, \( \dot{q} \) and \( \ddot{q} \) are the joint velocities and accelerations, \( M(q) \) is the mass matrix, \( b(q, \dot{q}) \) accounts for Coriolis and centrifugal forces, \( g(q) \) is the gravity vector, \( S \) is the selection matrix for actuated joints, \( \tau \) denotes the joint torques, \( F_{\text{ext}} \) represents external forces, and \( J_{\text{ext}}^T \) is the Jacobian transpose for the force application points. This equation captures the full dynamics, but in practice not all states are observable due to sensor limitations. Thus, I approximate the system as an MDP defined by the tuple \( \{ \mathcal{S}, \mathcal{A}, P, r, \gamma \} \), where \( \mathcal{S} \subseteq \mathbb{R}^n \) is the state space, \( \mathcal{A} \subseteq \mathbb{R}^m \) is the action space, \( P \) is the state transition function, \( r \) is the reward function, and \( \gamma \in [0,1] \) is the discount factor. The goal is to find a policy \( \pi_\theta(a|s) \) that maximizes the expected return \( R(\tau) \) over trajectories \( \tau \) sampled from the policy. The actor-critic architecture is employed, with the actor network outputting action means and the critic network estimating value functions. Generalized advantage estimation (GAE) is used to compute advantages, balancing bias and variance in the gradient estimates.
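
As a concrete reference for the advantage computation, here is a minimal GAE sketch over a single rollout, assuming per-step rewards, value estimates (with one bootstrap value appended), and termination flags are available as NumPy arrays; the function and argument names are illustrative.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout (illustrative sketch).

    rewards, dones : arrays of length T
    values         : array of length T + 1 (bootstrap value appended at the end)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Work backwards so each advantage accumulates discounted future TD errors
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    # Returns used as targets for the value function
    returns = advantages + values[:-1]
    return advantages, returns
```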

The design of the state and action spaces is critical for effective learning. The state space is a 33-dimensional vector that includes the following components (a sketch of how they are assembled into a single observation appears after the list):

  • Joint positions \( q \in \mathbb{R}^n \)
  • Joint velocities \( \dot{q} \in \mathbb{R}^n \)
  • Previous actions \( A_{t-1} \in \mathbb{R}^n \)
  • Body linear velocity \( v_b = [v_x, v_y, v_z]^T \in \mathbb{R}^3 \)
  • Body angular velocity \( \omega_b = [\omega_x, \omega_y, \omega_z]^T \in \mathbb{R}^3 \)
  • Body orientation \( o = [w, x, y, z] \in \mathbb{R}^4 \) (represented as a quaternion)
  • Control target \( v_{\text{cmd}} = [v_x^{\text{cmd}}, v_y^{\text{cmd}}, \omega_{\text{yaw}}^{\text{cmd}}]^T \in \mathbb{R}^3 \)
  • Periodic signal \( P = \{ p_r(t), p_l(t) \} \in \mathbb{R}^2 \), where \( p_r(t) \) and \( p_l(t) \) are phase signals for the right and left feet, respectively.
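
To make the 33-dimensional observation concrete (with \( n = 6 \) actuated joints: \( 6 + 6 + 6 + 3 + 3 + 4 + 3 + 2 = 33 \)), the sketch below simply concatenates the components in the order listed; the argument names are placeholders rather than the simulator's API.

```python
import numpy as np

def build_observation(q, dq, prev_action, v_b, omega_b, quat, v_cmd, p_r, p_l):
    """Concatenate the state components into a single 33-dimensional vector.

    All arguments are assumed to be NumPy arrays (or floats for the phase
    signals) in the order described in the list above; names are illustrative.
    """
    obs = np.concatenate([
        q,            # joint positions, R^6
        dq,           # joint velocities, R^6
        prev_action,  # previous action A_{t-1}, R^6
        v_b,          # body linear velocity [vx, vy, vz]
        omega_b,      # body angular velocity [wx, wy, wz]
        quat,         # body orientation quaternion [w, x, y, z]
        v_cmd,        # command [vx_cmd, vy_cmd, yaw_rate_cmd]
        [p_r, p_l],   # periodic phase signals for the right/left foot
    ])
    assert obs.shape == (33,)
    return obs
```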

The action space consists of target values for the six joint actuators, which are controlled using a PD controller. The reward function is composed of multiple terms to guide the learning process:

$$ r = r_{\text{cmd}} + r_{\text{diff}} + r_{\text{tq}} + r_{\text{foot}} + r_T $$

where \( r_{\text{cmd}} \) encourages velocity tracking, \( r_{\text{diff}} \) penalizes large action changes, \( r_{\text{tq}} \) minimizes torque usage, \( r_{\text{foot}} \) enforces periodic foot contact, and \( r_T \) is a termination penalty. The foot reward \( r_{\text{foot}} \) is defined as:

$$ r_{\text{foot}} = k_{\text{foot}} \sum_{l=0}^{N} \gamma^l \left[ p_r(\tau_l) q_{\text{rf}}(\tau_l) + p_l(\tau_l) q_{\text{lf}}(\tau_l) \right] $$

with \( \tau_l = t - (N - l) \Delta t \), where \( \Delta t \) is the time step, \( q_{\text{rf}}(t) \) and \( q_{\text{lf}}(t) \) are force sensor outputs for the right and left feet, \( \gamma \) is a decay factor, \( k_{\text{foot}} \) is a scaling coefficient, and \( N \) is the window size, so the sum covers the most recent \( N + 1 \) samples. The periodic signals \( p_r(t) \) and \( p_l(t) \) are generated using sinusoidal functions with a phase difference of \( \pi \) to alternate between reward and penalty phases. For example, if the gait period is \( T \), the signals can be expressed as:

$$ p_r(t) = \sin\left( \frac{2\pi t}{T} \right), \quad p_l(t) = \sin\left( \frac{2\pi t}{T} + \pi \right) $$

This design ensures that when one foot is in the stance phase (rewarded), the other is in the swing phase (penalized), promoting symmetric walking. The control reward \( r_{\text{cmd}} \) is given by:

$$ r_{\text{cmd}} = k_{xy} \exp\left( -\frac{\| v_{xy}^{\text{cmd}} - v_{xy}^{\text{act}} \|}{\sigma_{xy}} \right) + k_{\omega} \exp\left( -\frac{|\omega_{\text{yaw}}^{\text{cmd}} - \omega_{\text{yaw}}^{\text{act}}|}{\sigma_{\omega}} \right) $$

where \( k_{xy} \) and \( k_{\omega} \) are scaling factors, and \( \sigma_{xy} \) and \( \sigma_{\omega} \) control the sensitivity. The torque penalty \( r_{\text{tq}} \) and action change penalty \( r_{\text{diff}} \) are defined as:

$$ r_{\text{tq}} = k_{\text{tq}} \| \tau_t \|, \quad r_{\text{diff}} = k_{\text{diff}} \| A_t - A_{t-1} \| $$

with \( k_{\text{tq}} \) and \( k_{\text{diff}} \) as negative coefficients to discourage inefficient movements. The termination penalty \( r_T \) is -1 if the robot falls (e.g., hip height below 0.4 m) and 0 otherwise.
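
Putting the reward terms together, the following sketch evaluates one control step of the total reward. The phase signals and the scaling values taken from the hyperparameter table (\( k_{\text{foot}} = 0.1 \), \( k_{xy} = 0.2 \), \( k_{\omega} = 0.15 \), \( \sigma_{xy} = 0.3 \), \( \sigma_{\omega} = 0.2 \), \( T = 0.5 \) s) follow the definitions above, while the torque and action-change coefficients and the window bookkeeping are assumptions for the example.

```python
import numpy as np

def phase_signals(t, T=0.5):
    """Periodic signals for the right/left foot, offset by a phase of pi."""
    p_r = np.sin(2.0 * np.pi * t / T)
    p_l = np.sin(2.0 * np.pi * t / T + np.pi)
    return p_r, p_l

def step_reward(v_cmd_xy, v_act_xy, yaw_cmd, yaw_act,
                tau, action, prev_action,
                phase_window, force_window, fallen,
                k_xy=0.2, k_w=0.15, sigma_xy=0.3, sigma_w=0.2,
                k_tq=-1e-4, k_diff=-0.01, k_foot=0.1, decay=0.99):
    """Total reward for one control step (illustrative sketch).

    phase_window and force_window hold the last N + 1 samples of (p_r, p_l)
    and of the right/left foot force sensors, oldest first, matching the
    sliding-window foot reward defined above.
    """
    # Velocity-tracking reward with exponential kernels
    v_err = np.linalg.norm(np.asarray(v_cmd_xy) - np.asarray(v_act_xy))
    r_cmd = (k_xy * np.exp(-v_err / sigma_xy)
             + k_w * np.exp(-abs(yaw_cmd - yaw_act) / sigma_w))

    # Torque and action-change penalties (coefficients are negative)
    r_tq = k_tq * np.linalg.norm(tau)
    r_diff = k_diff * np.linalg.norm(np.asarray(action) - np.asarray(prev_action))

    # Periodic foot-contact reward accumulated over the sliding window
    r_foot = 0.0
    for l, ((p_r, p_l), (f_r, f_l)) in enumerate(zip(phase_window, force_window)):
        r_foot += decay ** l * (p_r * f_r + p_l * f_l)
    r_foot *= k_foot

    # Termination penalty when the robot falls (e.g. hip height below 0.4 m)
    r_T = -1.0 if fallen else 0.0

    return r_cmd + r_diff + r_tq + r_foot + r_T
```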

For the experimental setup, I used the IsaacGym simulator, which allows for parallel execution of multiple environments on a GPU. The simulation frequency was set to 200 Hz, with the neural network output at 50 Hz. A total of 4096 environments were run in parallel to accelerate data collection. The PPO algorithm was configured with an Adam optimizer and a learning rate of 0.0002. The discount factor \( \gamma \) was 0.99, and the GAE parameter \( \lambda \) was 0.95. The policy and value networks were implemented as multilayer perceptrons (MLPs) with layers of 512, 256, and 128 neurons, using ELU activation functions. To enhance robustness, disturbances such as random forces applied to the body center of mass, observation noise, and randomized ground friction were introduced during training. The table below summarizes the key hyperparameters:

| Parameter | Value |
| --- | --- |
| Adam learning rate | 0.0002 |
| Discount factor (\( \gamma \)) | 0.99 |
| GAE discount (\( \lambda \)) | 0.95 |
| Number of epochs | 4 |
| Batch size | 81,920 (4096 × 20) |
| Mini-batch size | 20,480 (4096 × 5) |
| Foot reward scale (\( k_{\text{foot}} \)) | 0.1 |
| Control reward scales (\( k_{xy} \), \( k_{\omega} \)) | 0.2, 0.15 |
| Control sensitivities (\( \sigma_{xy} \), \( \sigma_{\omega} \)) | 0.3, 0.2 |
| Gait period (\( T \)) | 0.5 s |
| Sliding window size (\( N \)) | 25 |
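
For reference, the actor-critic networks described above (512-256-128 MLPs with ELU activations) and the Adam optimizer settings from the table can be set up in PyTorch roughly as follows; this is a sketch under the stated sizes, with a learnable Gaussian action standard deviation added as an assumption.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(512, 256, 128)):
    """MLP with ELU activations, matching the layer sizes listed above."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ELU()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class ActorCritic(nn.Module):
    def __init__(self, obs_dim=33, act_dim=6):
        super().__init__()
        self.actor = mlp(obs_dim, act_dim)   # outputs action means (joint targets)
        self.critic = mlp(obs_dim, 1)        # outputs the state-value estimate
        # Learnable log standard deviation for the Gaussian policy (assumption)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.actor(obs)
        value = self.critic(obs).squeeze(-1)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        return dist, value

# Optimizer settings from the hyperparameter table
model = ActorCritic()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
```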

The training process involved collecting approximately 3 million data samples per velocity range, with convergence observed around 8 million steps. The average episode length and reward were monitored to assess performance. For velocity tracking, I tested three ranges: 0.4–0.7 m/s, 0.7–1.0 m/s, and a backward walking case. The results showed that the 0.7–1.0 m/s range achieved the most stable learning, with the robot quickly reaching and maintaining the target velocity. The following table compares the average speed and cost of transport (COT) for different methods, including the proposed RL approach, a model-based angular momentum method, and a DCM-based method. COT is defined as:

$$ \text{COT} = \frac{\sum_{i=1}^{n} |\tau_i \dot{\theta}_i|}{W \cdot \| v \|} $$

where \( \tau_i \) and \( \dot{\theta}_i \) are the torque and velocity of joint \( i \), \( v \) is the robot’s velocity, and \( W \) is the weight. Lower COT values indicate higher energy efficiency.
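
For completeness, the COT metric above can be evaluated from logged joint torques, joint velocities, and body speed roughly as in the sketch below; averaging per-step values over the trajectory and the small epsilon guard are my own assumptions.

```python
import numpy as np

def cost_of_transport(torques, joint_vels, body_speed, mass, g=9.81):
    """Cost of transport: mechanical power per unit weight per unit speed.

    torques, joint_vels : arrays of shape (T, n_joints) logged over a walk
    body_speed          : array of shape (T,) with the robot's speed ||v||
    mass                : robot mass in kg (weight W = mass * g)
    """
    power = np.sum(np.abs(torques * joint_vels), axis=1)  # per-step mechanical power
    weight = mass * g
    eps = 1e-6  # avoid division by zero if the robot is momentarily stationary
    cot = power / (weight * np.maximum(body_speed, eps))
    return cot.mean()
```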

| Method | Control Speed (m/s) | Average Speed (m/s) | Mean COT |
| --- | --- | --- | --- |
| RL (Proposed) | 0.5 | 0.532 | 0.465 |
| RL (Proposed) | 0.7 | 0.693 | 0.427 |
| RL (Proposed) | 1.0 | 0.961 | 0.484 |
| Angular Momentum | 0.5 | 0.556 | 0.582 |
| Angular Momentum | 0.7 | 0.767 | 0.534 |
| Angular Momentum | 1.0 | 1.102 | 0.547 |
| DCM | 0.5 | 0.547 | 0.503 |
| DCM | 0.7 | 0.741 | 0.471 |
| DCM | 1.0 | 1.086 | 0.489 |

The proposed RL method achieved an average COT of 0.459 across speeds, compared to 0.554 for the angular momentum method and 0.488 for DCM, representing improvements of approximately 20% and 6%, respectively. This demonstrates the energy efficiency of the telescopic leg structure combined with RL-based control. Additionally, robustness was evaluated by applying external forces in forward, backward, and lateral directions during 10-second and 20-second walking tests. The maximum tolerable forces before failure are summarized below:

| Method | Time (s) | Forward Force (N) | Backward Force (N) | Lateral Force (N) |
| --- | --- | --- | --- | --- |
| RL (Proposed) | 10 | 75 | 90 | 25 |
| RL (Proposed) | 20 | 65 | 60 | 30 |
| Angular Momentum | 10 | 70 | 100 | 30 |
| Angular Momentum | 20 | 65 | 65 | 30 |
| DCM | 10 | 65 | 100 | 25 |
| DCM | 20 | 50 | 60 | 20 |

The RL controller exhibited competitive robustness, particularly in the forward and lateral directions, though it was slightly less robust to backward pushes than the angular momentum method. This can be attributed to limited exposure to backward motion during training. The periodic symmetry of the gait was analyzed by examining foot force patterns over time. With the periodic reward, the robot maintained a consistent 0.5 s gait cycle with alternating foot contacts, whereas without it, the contact pattern was irregular. When the period was adjusted to 0.6 s, the symmetry persisted, validating the flexibility of the approach.

In conclusion, the integration of telescopic leg structures with reinforcement learning represents a significant advancement in robot technology. The proposed method efficiently generates symmetric and periodic gaits, leading to improved energy efficiency and robustness. The use of periodic signals in the reward function effectively guides the learning process, reducing the need for manual tuning. Future work will explore omnidirectional movement and adaptation to uneven terrains, further pushing the boundaries of bipedal locomotion. As robot technology continues to evolve, such learning-based approaches will play a crucial role in developing autonomous and efficient robotic systems for real-world applications.
