Dynamic Potential-Based Rewards for Learning Bipedal Locomotion

In recent years, the integration of deep reinforcement learning (DRL) into robot technology has revolutionized the control of legged robots, enabling them to perform complex tasks such as walking, running, and navigating uneven terrain. However, challenges like insufficient exploration capability, low sample efficiency, and unstable walking patterns persist in the learning process. To address these issues, we propose a novel reward shaping method based on dynamic potential energy, termed Dynamic Potential-Based Reward Shaping (DPBRS). This approach dynamically adjusts the rewards obtained during training, enhancing exploration and accelerating convergence to optimal policies. Our work demonstrates the effectiveness of DPBRS in controlling a bipedal robot for steady walking at commanded speeds, with results showing improved training efficiency and more natural, stable gaits. Furthermore, we validate the generality of DPBRS by applying it to quadrupedal robots, underscoring its broad applicability in advancing robot technology.

The core of our method lies in reformulating the reward function within the reinforcement learning framework. Traditional reward shaping, such as Potential-Based Reward Shaping (PBRS), relies on static potential functions that may limit exploration. In contrast, DPBRS incorporates dynamic elements that vary across training episodes, fostering more adaptive learning. We implement this using the Proximal Policy Optimization (PPO) algorithm in a simulated environment built with the MuJoCo physics engine. Our experiments involve a bipedal robot with six degrees of freedom, and we compare DPBRS against standard PBRS and other reward functions. The results indicate that DPBRS not only speeds up training but also produces smoother and more robust locomotion, highlighting its potential for real-world applications in robot technology.

This article is structured as follows: First, we provide background on reinforcement learning and reward shaping in the context of robot technology. Next, we detail the formulation of DPBRS, including mathematical derivations and algorithmic integration. We then describe the experimental setup, including the robot model, simulation environment, and training parameters. Following this, we present comprehensive results and analyses, covering training performance, gait stability, speed tracking, and generalization tests. Finally, we conclude with insights and future directions for enhancing robot technology through adaptive reward mechanisms.

Background and Motivation

Reinforcement learning (RL) has emerged as a powerful tool for developing autonomous systems in robot technology, particularly for locomotion tasks. In RL, an agent learns to make decisions by interacting with an environment, with the goal of maximizing cumulative rewards. The problem is typically modeled as a Markov Decision Process (MDP), defined by the tuple $$M = (S, A, P, R)$$, where $$S$$ is the state space, $$A$$ is the action space, $$P$$ represents the state transition probabilities, and $$R$$ is the reward function. For legged robots, the state space often includes kinematic and dynamic parameters, such as joint angles, velocities, and body orientation, while the action space consists of motor torques or positions.

In DRL, a neural network, such as an actor network parameterized by $$\theta$$, is trained to output actions based on states, denoted as $$\pi_\theta(a|s)$$. The objective is to find the optimal policy $$\pi_\theta^*(a|s)$$ that maximizes the expected discounted return $$R = \sum_{t=0}^{T} \gamma^t r_t$$, where $$\gamma$$ is the discount factor and $$T$$ is the episode length. However, designing an effective reward function $$R$$ is critical yet challenging; poor reward design can lead to suboptimal behaviors or failed learning. This is especially true in robot technology, where tasks require precise coordination and stability.
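As a quick numerical illustration of the discounted return (a minimal sketch, independent of any particular robot task):

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted return R = sum_t gamma^t * r_t over one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```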

Reward shaping techniques, such as PBRS, have been proposed to guide learning by providing dense rewards that reflect progress toward goals. PBRS defines a potential function $$\Phi(s)$$ over states, and the shaped reward is given by $$F(s_t, a_t, s_{t+1}) = \gamma \Phi(s_{t+1}) - \Phi(s_t)$$. This ensures policy invariance while accelerating learning. Despite its benefits, PBRS assumes a fixed potential function, which can restrict exploration. Our DPBRS method addresses this by introducing dynamic potential functions that evolve during training, thereby enhancing exploration and sample efficiency in complex robot technology applications.
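To make the shaping term concrete, the following minimal Python sketch evaluates $$F$$ for a single transition; the distance-to-goal potential here is a hypothetical stand-in, not the potential used in this work:

```python
def pbrs_shaping(phi, s_t, s_next, gamma=0.99):
    """PBRS shaping term F(s_t, a_t, s_{t+1}) = gamma * Phi(s_{t+1}) - Phi(s_t)."""
    return gamma * phi(s_next) - phi(s_t)

# Hypothetical static potential: negative distance of the torso to a goal x-position.
goal_x = 5.0
phi = lambda s: -abs(goal_x - s["torso_x"])

f = pbrs_shaping(phi, {"torso_x": 1.0}, {"torso_x": 1.2})  # positive: the step moved toward the goal
```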

Dynamic Potential-Based Reward Shaping Formulation

We formulate DPBRS by extending the PBRS framework to include time-varying potential functions. Let $$\Phi_D(s_t)$$ be the dynamic potential function at state $$s_t$$, defined as $$\Phi_D(s_t) = \sum_i w_i(t) \Delta_i^E(s_t, s_0)$$, where $$w_i(t)$$ are dynamic weights for each reward component, and $$\Delta_i^E(s_t, s_0)$$ are potential-based terms computed from the current state $$s_t$$ and a target state $$s_0$$. The DPBRS reward at each step is then derived as $$F_D(s_t, a_t, s_{t+1}) = \gamma \Phi_D(s_{t+1}) - \Phi_D(s_t)$$. By allowing $$w_i(t)$$ to vary, we inject stochasticity into the reward signal, promoting diverse exploration strategies.
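In code, the construction reduces to a few lines (a sketch; how the $$\Delta_i^E$$ terms and the dynamic weights are obtained is detailed below):

```python
import numpy as np

def dynamic_potential(deltas, weights):
    """Phi_D(s_t) = sum_i w_i(t) * Delta_i^E(s_t, s_0)."""
    return float(np.dot(weights, deltas))

def dpbrs_shaping(deltas_t, deltas_next, weights, gamma=0.99):
    """F_D(s_t, a_t, s_{t+1}) = gamma * Phi_D(s_{t+1}) - Phi_D(s_t), with `weights`
    holding the dynamic weights w_i(t) drawn for the current episode."""
    return gamma * dynamic_potential(deltas_next, weights) - dynamic_potential(deltas_t, weights)
```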

The dynamic weights $$w_i(t)$$ can be implemented using various scaling methods: uniform random scaling, Gaussian random scaling, or quadratic random scaling. For instance, uniform random scaling is defined as $$w_i^D = U(a_i, b_i) w_i$$, where $$U(a_i, b_i)$$ samples uniformly from the interval $$[a_i, b_i]$$, and $$w_i$$ is a baseline weight. Similarly, Gaussian scaling uses $$w_i^D = N(\mu_i, \sigma_i^2) w_i$$, and quadratic scaling employs $$w_i^D = P(a_i, b_i) w_i$$, where $$P$$ represents a parabolic distribution. These methods ensure that the reward components are dynamically adjusted, making the learning process more robust to local optima.
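A sketch of the three scaling schemes is given below, using the baseline weights and the uniform bounds quoted in the experimental section; the Gaussian parameters and the exact form of the parabolic density are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng()

def sample_dynamic_weights(w_base, method="uniform", lo=0.67, hi=1.37, sigma=0.2):
    """Resample the dynamic weights w_i^D = scale_i * w_i at the start of each episode.

    The (lo, hi) defaults follow the bounds quoted for the speed-tracking term; the
    Gaussian mean/std and the parabolic density below are illustrative assumptions.
    """
    w_base = np.asarray(w_base, dtype=float)
    if method == "uniform":            # w_i^D = U(a_i, b_i) * w_i
        scale = rng.uniform(lo, hi, size=w_base.shape)
    elif method == "gaussian":         # w_i^D = N(mu_i, sigma_i^2) * w_i, centered at 1 here
        scale = rng.normal(1.0, sigma, size=w_base.shape)
    elif method == "quadratic":        # w_i^D = P(a_i, b_i) * w_i; P read here as an
        mid, half = 0.5 * (lo + hi), 0.5 * (hi - lo)   # Epanechnikov-style parabolic density
        scale = np.empty_like(w_base)
        for i in range(w_base.size):   # simple rejection sampling on [lo, hi]
            while True:
                x = rng.uniform(lo, hi)
                if rng.uniform() <= 1.0 - ((x - mid) / half) ** 2:
                    scale.flat[i] = x
                    break
    else:
        raise ValueError(f"unknown scaling method: {method}")
    return scale * w_base

w_dyn = sample_dynamic_weights([7.5, 2.0, 1.0, 0.5], method="uniform")
```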

For bipedal locomotion, we design the potential-based terms $$\Delta_i^E$$ to encourage specific behaviors: speed tracking, direction alignment, survival, and foot orientation. For example, the speed tracking term is $$\Delta_0^E(s_t, s_0) = \exp\left(-\frac{(v_x - v_0)^2}{\sigma_0^2}\right)$$, where $$v_x$$ is the current forward velocity, $$v_0$$ is the target velocity, and $$\sigma_0$$ is a scaling parameter. The direction term is $$\Delta_1^E(s_t, s_0) = \exp\left(-\frac{(\theta_z - \theta_0)^2}{\sigma_1^2}\right)$$, with $$\theta_z$$ being the current yaw angle and $$\theta_0$$ the target direction. The survival term $$\Delta_2^E(s_t, s_0) = H(s_t, s_0)$$ provides a constant reward if the robot remains upright, and the foot orientation term $$\Delta_3^E(s_t, s_0)$$ encourages flat foot placement by computing the Euclidean distance between current and target foot orientations.
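A possible implementation of these four terms follows, assuming illustrative field names for the state and target; the exact mapping from foot-orientation distance to reward is an assumption:

```python
import numpy as np

def potential_terms(state, target, sigma0=0.5, sigma1=0.667):
    """The four potential-based terms Delta_i^E(s_t, s_0) described above (field names assumed)."""
    # Speed tracking: Gaussian kernel on the forward-velocity error.
    d0 = np.exp(-((state["v_x"] - target["v0"]) ** 2) / sigma0 ** 2)
    # Direction alignment: Gaussian kernel on the yaw error.
    d1 = np.exp(-((state["yaw"] - target["yaw0"]) ** 2) / sigma1 ** 2)
    # Survival: constant bonus while the robot remains upright.
    d2 = 1.0 if state["upright"] else 0.0
    # Foot orientation: Euclidean distance to the target (flat) orientation,
    # mapped through exp so the term stays in a range comparable to the others.
    d3 = np.exp(-np.linalg.norm(np.asarray(state["foot_orient"]) - np.asarray(target["foot_orient0"])))
    return np.array([d0, d1, d2, d3])
```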

Additionally, we include penalty terms to regulate energy consumption, foot contact forces, and lateral deviation. The energy penalty is $$P_{\text{ctrl}} = p_0 \sum_{i=1}^6 a_i^2$$, where $$a_i$$ are joint torques; the contact force penalty is $$P_{\text{con}} = p_1 \sum_{n=1}^2 \sum_{i=1}^3 (C_n^i)^2$$, with $$C_n^i$$ being contact forces; and the lateral deviation penalty is $$P_y = p_2 |y - y_0|$$, where $$y$$ is the current lateral position and $$y_0$$ its target. The total reward at each step is thus:

$$R(s_t) = \sum_{i=0}^3 w_i^D \Delta_i^E(s_t, s_0) + \frac{w_0 + w_1 + w_2 + w_3}{2} - P_{\text{ctrl}} - P_{\text{con}} - P_y$$

where $$w_i^D$$ are the dynamic weights. If the robot falls (e.g., pitch or roll angles exceed a threshold $$\beta_0$$), the episode terminates with a large negative reward $$w_4$$. This comprehensive reward structure ensures that the robot learns to walk efficiently while maintaining stability.
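Putting the pieces together, one possible per-step implementation is sketched below, with default coefficients taken from the parameter values listed in the experimental section; treating a fall as returning only the termination penalty $$w_4$$ is one reading of the rule above:

```python
import numpy as np

def penalty_terms(torques, contact_forces, y, y0, p0=0.1, p1=5e-7, p2=0.3):
    """Energy, contact-force, and lateral-deviation penalties."""
    p_ctrl = p0 * float(np.sum(np.square(torques)))        # 6 joint torques
    p_con = p1 * float(np.sum(np.square(contact_forces)))  # 2 feet x 3 force components
    p_y = p2 * abs(y - y0)
    return p_ctrl, p_con, p_y

def step_reward(deltas, w_dyn, w_base, penalties, fell, w4=-5.0):
    """Per-step reward R(s_t); returning only w_4 on a fall is an assumed reading."""
    if fell:  # pitch or roll exceeded the threshold beta_0
        return w4
    p_ctrl, p_con, p_y = penalties
    return (float(np.dot(w_dyn, deltas))
            + 0.5 * float(np.sum(w_base))  # the constant (w_0 + w_1 + w_2 + w_3) / 2 offset
            - p_ctrl - p_con - p_y)
```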

To theoretically justify DPBRS, we analyze its impact on policy gradient methods. The objective function in RL is $$J(\theta) = \sum_{M_n} P(M_n | \theta) R(M_n)$$, where $$M_n$$ is a trajectory and $$P(M_n | \theta)$$ is its probability under policy $$\pi_\theta$$. The gradient update is $$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$. With DPBRS, the reward $$R(M_n)$$ becomes $$R_D(M_n)$$, and the gradient of the shaped objective $$J_D(\theta)$$ decomposes into the original gradient plus an additional term $$\delta$$ contributed by the dynamic weights, so the update becomes $$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) + \alpha \delta$$. This stochastic update direction enhances exploration by varying the reward landscape across episodes, which is crucial for complex tasks in robot technology.
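One way to make the additional term explicit is to expand the gradient of the shaped objective and separate out the original return:

$$
\nabla_\theta J_D(\theta) = \sum_{M_n} \nabla_\theta P(M_n | \theta)\, R_D(M_n)
= \underbrace{\sum_{M_n} \nabla_\theta P(M_n | \theta)\, R(M_n)}_{\nabla_\theta J(\theta)}
+ \underbrace{\sum_{M_n} \nabla_\theta P(M_n | \theta)\, \bigl(R_D(M_n) - R(M_n)\bigr)}_{\delta}
$$

The first term is the ordinary policy gradient, while $$\delta$$ collects the contribution of the dynamically rescaled rewards; because the weights are resampled each episode, $$\delta$$ changes from episode to episode, which is the source of the perturbed update direction described above.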

Experimental Setup and Implementation

We conducted experiments using the Bolt bipedal robot, an open-source platform with six degrees of freedom: hip pitch, hip roll, and knee pitch for each leg. The action space consists of six-dimensional torque commands, normalized to $$[-1, 1]$$ N·m. The state space is 48-dimensional, including body orientation Euler angles (3D), linear velocity and acceleration (6D), joint angles and angular velocities (12D), historical joint information (18D), foot positions relative to the body (6D), and external commands for target velocity and yaw angle (2D). The target velocities are sampled from $$v_0 = \{0.3, 0.4, 0.5, 0.6, 0.7\}$$ m/s at the start of each episode.

The simulation environment is built with MuJoCo, a physics engine renowned for its accuracy in modeling robot technology. We use PPO as the DRL algorithm due to its stability and performance in continuous control tasks. The actor and critic networks have architectures with hidden layers [256, 128, 32], and we use a discount factor $$\gamma = 0.99$$, learning rate $$10^{-4}$$, clip parameter 0.1, and batch size 64. Each training episode involves 2,048 environment steps, and we update the policy for 10 iterations per episode. The reward parameters are set as follows: $$w_0 = 7.5$$, $$w_1 = 2.0$$, $$w_2 = 1.0$$, $$w_3 = 0.5$$, $$w_4 = -5.0$$, $$\sigma_0 = 0.5$$, $$\sigma_1 = 0.667$$, $$\sigma_2 = 1.0$$, $$p_0 = 0.1$$, $$p_1 = 5 \times 10^{-7}$$, $$p_2 = 0.3$$. For dynamic scaling, we use uniform random bounds $$a_0 = 0.67$$, $$b_0 = 1.37$$ for $$\Delta_0^E$$, and similar ranges for other terms.
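For reference, the same hyperparameters can be plugged into an off-the-shelf PPO implementation; the snippet below uses Stable-Baselines3 and a built-in MuJoCo walker task as a stand-in environment, both of which are assumptions rather than the actual training code:

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Stand-in environment: the custom Bolt MuJoCo environment is not specified here,
# so a built-in MuJoCo walker task keeps the sketch runnable.
env = gym.make("Walker2d-v4")

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=1e-4,
    n_steps=2048,                                 # environment steps per policy update
    batch_size=64,
    n_epochs=10,                                  # policy iterations per update
    gamma=0.99,
    clip_range=0.1,
    policy_kwargs=dict(net_arch=[256, 128, 32]),  # hidden layers for actor and critic
    verbose=1,
)
model.learn(total_timesteps=650 * 2048)           # roughly 650 training episodes
```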

We compare DPBRS against standard PBRS and other baseline rewards, such as sparse rewards and unidirectional tracking rewards. Training is performed over 650 episodes, with evaluation based on the average reward over 10 episodes. A baseline reward of 1,800 (corresponding to 0.9 per step for 2,000 steps) is used to indicate successful walking. We also test generalization on a quadruped robot and on sloped terrain with random inclines of $$[-5^\circ, 5^\circ]$$.

Results and Analysis

Our experiments demonstrate that DPBRS significantly outperforms PBRS in training efficiency and final performance. The evaluation reward curves show that DPBRS variants achieve higher rewards faster, with DPBRS using uniform scaling (DPBRS_U) reaching the baseline of 1,800 by episode 538, whereas PBRS only attains 1,655 by episode 554. The sample efficiency, defined as $$SE = (R_{\text{eval}} - R_b) / N$$ where $$N$$ is the number of samples, is $$9.10 \times 10^{-7}$$ for DPBRS_U compared to $$-1.28 \times 10^{-4}$$ for PBRS, indicating better exploration. Table 1 summarizes the cumulative rewards for different algorithms combined with DPBRS_U and PBRS, highlighting PPO’s superiority.

Table 1: Cumulative Rewards for Different RL Algorithms with PBRS and DPBRS_U

| Algorithm | PBRS | DPBRS_U |
| --- | --- | --- |
| PPO | 1,831.8 | 1,843.3 |
| SAC | 1,759.9 | 1,768.0 |
| TRPO | 1,785.8 | 1,720.4 |
| A2C | 456.0 | 906.5 |

In terms of walking performance, DPBRS_U enables more accurate velocity tracking across different command speeds. As shown in Table 2, the average velocities $$v_x$$ closely match the targets, with errors below 1.4% for DPBRS_U versus up to 2.4% for PBRS. The standard deviations of velocity are also lower for DPBRS_U, indicating smoother motion. During gradual speed increase tests, DPBRS_U maintains stable velocity with fluctuations around 8.5%, while PBRS exhibits fluctuations up to 13.8%.

Table 2: Average Walking Velocities and Standard Deviations for Different Reward Functions

| Target Velocity $$v_0$$ (m/s) | DPBRS_U $$v_x \pm \sigma_v$$ (m/s) | PBRS $$v_x \pm \sigma_v$$ (m/s) |
| --- | --- | --- |
| 0.3 | 0.303 ± 0.026 | 0.305 ± 0.042 |
| 0.4 | 0.402 ± 0.025 | 0.404 ± 0.038 |
| 0.5 | 0.488 ± 0.030 | 0.493 ± 0.045 |
| 0.6 | 0.598 ± 0.038 | 0.592 ± 0.061 |
| 0.7 | 0.692 ± 0.045 | 0.693 ± 0.053 |

Straight-line walking is another key metric. DPBRS_U results in smaller yaw angles and reduced lateral drift compared to PBRS. For instance, at $$v_0 = 0.7$$ m/s, the average yaw angle for DPBRS_U is $$0.035 \pm 0.018$$ rad, while for PBRS, it is $$0.040 \pm 0.024$$ rad. In tests with gradually increasing speed, DPBRS_U limits the maximum lateral deviation to -0.076 m, whereas PBRS reaches -0.201 m, demonstrating better direction adherence.

Stability, reflected in center of mass height, is also improved with DPBRS_U. The robot’s center of mass remains stable at around 0.480 m across speeds, slightly increasing with higher velocities due to more extended leg postures. In contrast, PBRS leads to a decreasing height, dropping to 0.438 m at $$v_0 = 0.7$$ m/s, which compromises stability. Gait analysis reveals that DPBRS_U produces more human-like walking patterns, with natural leg alternation and foot placement, whereas PBRS results in crouched and unstable motions.

We further evaluated the sensitivity of DPBRS to the dynamic scaling range. Varying the bounds for $$\Delta_0^E$$ weights, such as using $$(a_0, b_0) = (0.5, 1.5)$$ or $$(0.1, 3.0)$$, we found that broader ranges initially cause higher reward variance but ultimately lead to faster convergence than PBRS. This robustness simplifies reward design in robot technology, as precise weight tuning is less critical.

Generalization tests on a quadruped robot confirm DPBRS’s applicability beyond bipedal systems. DPBRS_U enables stable trotting gaits with proper foot support and minimal yaw deviation, whereas PBRS leads to erratic movements and poor direction tracking. On sloped terrain, DPBRS_U achieves a highest evaluation reward of 1,662.9, a 6.3% improvement over PBRS’s 1,564.7, underscoring its adaptability in challenging environments.

Comparison with alternative reward functions, such as sparse rewards (SPRS) and unidirectional tracking rewards (TRRS), highlights DPBRS’s superiority. SPRS gives a reward of -1 unless the exact target velocity is achieved, leading to conservative policies with an average reward of 1,354.7. TRRS, which rewards any velocity above the target, results in overspeeding and an average reward of 1,521.3. Both fail to reach the baseline, whereas DPBRS_U consistently excels, as summarized in Table 3.

Table 3: Average Walking Speeds for Different Reward Strategies (Unit: m/s)

| Target Velocity $$v_0$$ | DPBRS_U $$v_x$$ | SPRS $$v_x$$ | TRRS $$v_x$$ |
| --- | --- | --- | --- |
| 0.3 | 0.303 | 0.141 | 0.671 |
| 0.4 | 0.402 | 0.140 | 0.680 |
| 0.5 | 0.488 | 0.140 | 0.669 |
| 0.6 | 0.598 | 0.138 | 0.668 |
| 0.7 | 0.692 | 0.138 | 0.672 |

Conclusion and Future Work

In this work, we introduced Dynamic Potential-Based Reward Shaping (DPBRS) to address challenges in deep reinforcement learning for bipedal robot locomotion. By incorporating dynamic potential functions, DPBRS enhances exploration and sample efficiency, leading to faster training and more stable, natural walking gaits. Our extensive experiments in simulated environments validate its superiority over traditional PBRS and other reward design methods, with improvements in velocity tracking, straight-line walking, and stability. The generality of DPBRS is demonstrated through successful application to quadrupedal robots and sloped terrain, emphasizing its potential to advance robot technology.

Future research will focus on several directions. First, we plan to investigate more advanced dynamic scaling methods, such as adaptive weight tuning based on learning progress. Second, we aim to transfer the learned policies to physical robots, addressing sim-to-real gaps through domain randomization and system identification. Third, we will explore DPBRS in more complex tasks, like dynamic balancing and obstacle avoidance, to further push the boundaries of robot technology. Lastly, integrating meta-learning techniques could enable robots to quickly adapt to new environments, making DPBRS a cornerstone for autonomous robot technology development.

In summary, DPBRS offers a robust and flexible framework for reward shaping in reinforcement learning, with significant implications for robot technology. By dynamically adjusting rewards, it fosters efficient learning of complex behaviors, paving the way for more intelligent and adaptable robotic systems. We believe that this approach will inspire further innovations in the field, ultimately contributing to the widespread deployment of robots in real-world scenarios.
