In recent years, the exploration of small celestial bodies such as asteroids and comets has gained significant attention due to their scientific and economic value. These bodies preserve primordial information about the solar system and contain abundant rare resources. However, their weak and irregular gravitational fields pose unique challenges for surface exploration. Traditional wheeled robots struggle with insufficient traction in low-gravity conditions, making legged robots, particularly quadruped robots, a promising alternative due to their ability to navigate rough terrain through hopping locomotion. In such environments, the prolonged free-fall phase during jumps—lasting up to several seconds—necessitates effective attitude control to correct orientation deviations caused by uneven ground reactions during takeoff. Without proper control, these deviations can lead to unsafe landings, instrument damage, or even escape from the gravitational field. This article addresses the attitude maneuver problem for quadruped robots in weak gravitational environments using a model-free reinforcement learning approach, eliminating the need for additional actuators like reaction wheels.
The core challenge lies in achieving three-axis attitude control—pitch, roll, and yaw—using only the robot’s leg joints, which are inherently underactuated for this purpose. Inspired by the cat righting reflex, where felines reorient themselves during free fall without external torque, we propose a controller trained via the Proximal Policy Optimization (PPO) algorithm. This method leverages deep reinforcement learning to handle the complex, nonlinear dynamics without explicit model derivation. Simulations demonstrate that the trained controller enables the robot dog to adjust its attitude efficiently, and experimental validation on a microgravity simulation platform confirms its practicality for real-world applications.
Problem Formulation and Dynamics Analysis
Small celestial bodies exhibit gravitational accelerations ranging from $10^{-5}$ m/s² to $10^{-1}$ m/s², depending on their size. For large asteroids like 216 Kleopatra, with surface gravity around 0.04–0.05 m/s², a quadruped robot’s jump phase can extend to 8–10 seconds. During this period, the robot is in free fall, and conservation of angular momentum dictates that leg movements can induce body rotations. The quadruped robot is modeled as a multi-rigid-body system, comprising a main body and four legs, each with three rotational joints: abduction hip (a_hip), flexion hip (f_hip), and knee. The joint position vector is denoted as $\mathbf{q} = [q_{\text{a\_hipFL}}, q_{\text{f\_hipFL}}, q_{\text{kneeFL}}, \dots, q_{\text{kneeRR}}]^T \in \mathbb{R}^{12}$. The dynamics during free fall are described by the equation:
$$ \mathbf{M}(\mathbf{q}) \begin{bmatrix} \ddot{\boldsymbol{\theta}} \\ \ddot{\mathbf{q}} \end{bmatrix} + \mathbf{C}(\dot{\mathbf{q}}, \mathbf{q}) \begin{bmatrix} \dot{\boldsymbol{\theta}} \\ \dot{\mathbf{q}} \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ \boldsymbol{\tau} \end{bmatrix} $$
where $\mathbf{M} \in \mathbb{R}^{15 \times 15}$ is the inertia matrix, $\boldsymbol{\theta} \in \mathbb{R}^3$ represents the body’s Euler angles (roll, pitch, yaw), $\mathbf{C} \in \mathbb{R}^{15 \times 15}$ is the Coriolis matrix, and $\boldsymbol{\tau} \in \mathbb{R}^{12}$ is the joint torque vector. During free fall no external torque acts on the system, so leg movements rotate the body only through internal reaction torques while total angular momentum is conserved. This coupling between the joints and the underactuated body attitude makes model-based control impractical, motivating a data-driven approach such as reinforcement learning.
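To make the structure of this model concrete, the sketch below solves the equation for the generalized accelerations. It is only an illustration: the `mass_matrix` and `coriolis_matrix` helpers are hypothetical stand-ins for what a rigid-body dynamics library would provide, not the article's implementation.

```python
import numpy as np

# Minimal sketch of the free-fall dynamics above. mass_matrix(q) and
# coriolis_matrix(q_dot, q) are placeholder callables returning M and C for the
# 15-DOF model (3 body Euler angles + 12 joint angles); a rigid-body dynamics
# library would normally supply them.

def forward_dynamics(theta_dot, q, q_dot, tau, mass_matrix, coriolis_matrix):
    """Solve M(q) [theta_ddot; q_ddot] = [0; tau] - C(q_dot, q) [theta_dot; q_dot]."""
    M = mass_matrix(q)                      # (15, 15) inertia matrix
    C = coriolis_matrix(q_dot, q)           # (15, 15) Coriolis matrix
    v = np.concatenate([theta_dot, q_dot])  # generalized velocities, length 15
    rhs = np.concatenate([np.zeros(3), tau]) - C @ v
    acc = np.linalg.solve(M, rhs)           # generalized accelerations
    return acc[:3], acc[3:]                 # body angular accel., joint accel.
```

Because the first three rows of the right-hand side contain no actuation term, any body rotation must arise from the velocity coupling through $\mathbf{C}$ and the off-diagonal blocks of $\mathbf{M}$, which is exactly the mechanism the learned controller exploits.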
| Parameter | Value |
|---|---|
| Abduction Hip Joint Range (rad) | -0.80 to 0.80 |
| Flexion Hip Joint Range (rad) | -1.04 to 4.18 |
| Knee Joint Range (rad) | -2.69 to 0.91 |
| Max Joint Torque (N·m) | 33.5 |
| Max Joint Speed (rad/s) | 21.0 |
| Body Mass (kg) | 4.713 |
| Motor Mass per Joint (kg) | 0.605 |
| Thigh Mass (kg) | 1.013 |
| Shank Mass (kg) | 0.166 |
Reinforcement Learning Controller Design
The reinforcement learning framework employs the PPO algorithm, an on-policy method suitable for continuous control tasks. The agent interacts with a simulation environment built in PyBullet, where the quadruped robot’s dynamics are accurately modeled. The goal is to learn a policy that maps observed states to joint actions, maximizing cumulative rewards over time. The Markov decision process is defined by the state space $\mathcal{O}$, action space $\mathcal{A}$, and reward function $R$.
The state observation $\mathbf{o}_t \in \mathcal{O}$ includes joint positions, joint velocities, and body orientation angles:
$$ \mathbf{o}_t = [\mathbf{q}, \dot{\mathbf{q}}, \boldsymbol{\theta}] \in \mathbb{R}^{27} $$
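As an illustration of how this observation could be assembled in PyBullet (the simulator used for training), the sketch below reads the joint states and the body orientation. The `robot_id` handle and the list of 12 actuated joint indices are assumptions about the environment setup, not details given in the article.

```python
import pybullet as p

# Sketch of building the 27-dimensional observation from PyBullet state queries.
# robot_id and joint_indices (the 12 actuated joints) are placeholders assumed
# to come from the environment setup.

def get_observation(robot_id, joint_indices):
    states = p.getJointStates(robot_id, joint_indices)
    q = [s[0] for s in states]                          # 12 joint positions
    q_dot = [s[1] for s in states]                      # 12 joint velocities
    _, quat = p.getBasePositionAndOrientation(robot_id)
    theta = p.getEulerFromQuaternion(quat)              # body roll, pitch, yaw
    return q + q_dot + list(theta)                      # 12 + 12 + 3 = 27 values
```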
The action $\mathbf{a}_t \in \mathcal{A}$ consists of desired joint positions for all 12 motors, normalized to their operational ranges:
$$ \mathbf{a}_t = [q_1^*, q_2^*, \dots, q_{12}^*] \in \mathbb{R}^{12} $$
Joint torques are computed using a PD controller:
$$ \tau[n] = K_p (a_t[n] - q[n]) - K_d \dot{q}[n] $$
with gains $K_p = 10$ and $K_d = 1$. Both the Actor and Critic networks are multi-layer perceptrons (MLPs) with tanh activation functions: the Actor has three hidden layers of 128 neurons each, while the Critic has two hidden layers of 128 neurons. Training hyperparameters are summarized in the table below.
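A minimal sketch of this PD law is given below, using the gains quoted above and the torque limit from the robot parameter table; saturating the output at the motor limit is an assumption about how actuator constraints would be enforced.

```python
import numpy as np

# PD mapping from the policy's desired joint positions a_t to joint torques,
# tau[n] = Kp * (a_t[n] - q[n]) - Kd * q_dot[n], with the gains quoted in the
# text. Clipping at the 33.5 N*m motor limit (parameter table) is an assumption.

KP, KD = 10.0, 1.0
TAU_MAX = 33.5  # N*m, maximum joint torque from the parameter table

def pd_torques(a_t, q, q_dot):
    tau = KP * (np.asarray(a_t) - np.asarray(q)) - KD * np.asarray(q_dot)
    return np.clip(tau, -TAU_MAX, TAU_MAX)
```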
| Hyperparameter | Value |
|---|---|
| Total Training Steps | 1.2 × 10⁸ |
| Max Steps per Episode | 2000 |
| Learning Rate | 0.0003 |
| Discount Factor (γ) | 0.999 |
| Batch Size | 48,000 |
| Gradient Clipping Threshold | 1 |
| Entropy Coefficient | 0.02 |
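The article does not name a particular PPO implementation. As one plausible way to realize the architecture and hyperparameters above, the sketch below configures Stable-Baselines3's PPO; the mapping of "Batch Size" to the rollout length `n_steps` and of the gradient-clipping threshold to `max_grad_norm` is an assumption, and `QuadrupedFreeFallEnv` is a placeholder for the PyBullet environment described in the text.

```python
import torch
from stable_baselines3 import PPO

# Hypothetical training setup matching the quoted hyperparameters. The episode
# length (2000 steps) is assumed to be enforced inside the environment.
env = QuadrupedFreeFallEnv()  # placeholder Gym-style PyBullet environment

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,      # learning rate 0.0003
    gamma=0.999,             # discount factor
    n_steps=48_000,          # samples collected per policy update ("batch size")
    ent_coef=0.02,           # entropy coefficient
    max_grad_norm=1.0,       # gradient clipping threshold
    policy_kwargs=dict(
        activation_fn=torch.nn.Tanh,
        net_arch=dict(pi=[128, 128, 128], vf=[128, 128]),  # Actor / Critic MLPs
    ),
)
model.learn(total_timesteps=120_000_000)  # 1.2 × 10^8 training steps
```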
The reward function combines primary and auxiliary components. The primary reward $R_1$ drives attitude correction toward desired angles $\boldsymbol{\theta}^*$:
$$ R_1 = w_1 \exp(-|\theta_{\text{roll}} - \theta_{\text{roll}}^*|^2) + w_2 \exp(-|\theta_{\text{pitch}} - \theta_{\text{pitch}}^*|^2) + w_3 \exp(-|\theta_{\text{yaw}} - \theta_{\text{yaw}}^*|^2) $$
where $w_1, w_2, w_3$ are positive weights (e.g., 1), so the reward is largest when the attitude errors vanish. Auxiliary rewards $R_2$ penalize undesirable behaviors such as self-collision ($r_1 = -0.03$) and joint motor reversals ($r_2 = -0.001$). The total reward is $R = R_1 + R_2$, with the auxiliary terms kept orders of magnitude smaller so that the primary attitude objective dominates.
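A minimal sketch of this reward, assuming hypothetical flags for self-collision and motor-reversal detection, is shown below; the weights follow the values quoted above rather than the article's exact implementation.

```python
import numpy as np

# Total reward R = R1 + R2 as described in the text. The collision and reversal
# signals are placeholders assumed to be provided by the simulation environment.

W = np.array([1.0, 1.0, 1.0])   # w1, w2, w3 for roll, pitch, yaw
R_COLLISION = -0.03             # r1: self-collision penalty
R_REVERSAL = -0.001             # r2: penalty per joint motor reversal

def total_reward(theta, theta_des, self_collision, n_reversals):
    err = np.asarray(theta) - np.asarray(theta_des)   # attitude errors (rad)
    r1 = float(np.sum(W * np.exp(-err ** 2)))         # primary attitude reward R1
    r2 = (R_COLLISION if self_collision else 0.0) + R_REVERSAL * n_reversals
    return r1 + r2
```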
Simulation Results for Attitude Maneuvers
Extensive simulations were conducted to evaluate the controller’s performance in single-axis and multi-axis attitude adjustments. The quadruped robot was trained in PyBullet under weak gravity conditions (0.05 m/s²), with episodes lasting 8 seconds to match typical jump durations.
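As a small illustration of this setup, the snippet below configures PyBullet with the stated weak gravity; the URDF path is a placeholder, and the 240 Hz step rate is taken from the comparison table later in the article.

```python
import pybullet as p

# Hypothetical PyBullet setup reflecting the stated conditions: 0.05 m/s^2
# gravity and roughly 8 s episodes at a 240 Hz simulation rate.
p.connect(p.DIRECT)                      # headless physics server
p.setGravity(0, 0, -0.05)                # weak-gravity condition (m/s^2)
p.setTimeStep(1.0 / 240.0)               # 240 Hz, matching the simulation rate
robot_id = p.loadURDF("quadruped.urdf")  # placeholder URDF path
MAX_EPISODE_STEPS = 2000                 # ~8 s episode (hyperparameter table)
```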
Pitch Angle Adjustment
Pitch control is effectively fully actuated because the flexion hip and knee joint axes are parallel to the body’s pitch axis, allowing large-angle maneuvers. The robot dog achieved pitch corrections from extreme initial angles (e.g., +140°) to near-zero (within ±10°) in under 8 seconds. The maneuver uses coordinated leg swinging to exchange angular momentum between the legs and the body. Pitch angle trajectories from various initial conditions all converge to a safe landing orientation:
$$ \theta_{\text{pitch}}(t) \rightarrow [-10^\circ, 10^\circ] \quad \text{within} \quad t \leq 8 \, \text{s} $$
Roll Angle Adjustment
Roll control is underactuated, relying on abduction hip joints and coupled leg movements. The controller adjusted roll angles by up to 90° while suppressing pitch and yaw deviations to within ±15°. For instance, from an initial roll of -90°, the robot reoriented to near-level attitude, as visualized in simulation. The ability to minimize cross-axis errors is critical for stability during landing.
Yaw Angle Adjustment
Yaw maneuvers enable the quadruped robot to change heading direction mid-air, essential for path planning. The controller achieved yaw rotations of up to 90° (e.g., from 0° to -90°) while keeping pitch and roll variations below 30°. This demonstrates the controller’s capability to perform complex three-axis attitude maneuvers without dedicated actuators.

Experimental Validation with Microgravity Simulation
To validate the simulation results, a microgravity test platform was developed using air bearing technology. The quadruped robot was mounted on a lightweight support structure with air bearings, floating on a flat surface to simulate two-dimensional motion in a frictionless environment. Body orientation was tracked via motion capture cameras, and control commands were sent at 120 Hz due to communication constraints.
Experiments focused on pitch adjustment, as the platform confines motion to the sagittal plane. The robot successfully corrected pitch angles from various initial states (e.g., -90° to 0°) within 18 seconds—slower than simulation due to hardware limits but still effective. The plot of pitch angle versus time confirms stable convergence, with the robot dog achieving landing-ready orientations. This real-world test underscores the transferability of the learned policy to physical systems.
| Metric | Simulation | Experiment |
|---|---|---|
| Control Frequency (Hz) | 240 | 120 |
| Max Pitch Adjustment (°) | 140 | 90 |
| Time to Correct (s) | 8 | 18 |
| Cross-Axis Error (°) | < 15 | N/A (2D only) |
Conclusion and Future Work
This article presents a reinforcement learning-based controller for attitude maneuvers of quadruped robots in weak gravitational environments. The PPO-trained policy enables three-axis orientation control using only leg joints, eliminating the mass penalty of additional actuators. Simulations show robust performance in pitch, roll, and yaw adjustments, while experiments on a microgravity platform validate its practicality for pitch control. The approach leverages the robot dog’s inherent dynamics, offering a scalable solution for small body exploration.
Future work will focus on integrating hopping control for full mobility on asteroid surfaces and developing three-dimensional experimental setups. Although reinforcement learning provides model-free advantages, challenges remain in generalization and noise robustness. Advances in simulation fidelity and algorithm design will further enhance this method, paving the way for autonomous quadruped robots in deep space missions.
