Compliant Control Method for On-orbit Assembly by Space Robots Based on Reinforcement Learning

In the context of advancing space missions, the demand for large-scale and modular spacecraft, such as space-based solar power stations, satellite antennas, and segmented space telescopes, has driven the development of on-orbit assembly techniques. Space robots, with their high agility, reusability, and multifunctionality, have emerged as critical tools for assembling these structures in orbit. However, challenges like structural vibrations of assembly components, dynamic coupling between the robot and the structure, and limited control accuracy of the base satellite pose significant hurdles. Traditional compliance control methods, such as impedance control and force/position hybrid control, often require meticulous parameter tuning and struggle with adaptability to task variations. To address these issues, this paper proposes a model-data hybrid-driven approach that integrates impedance control with deep reinforcement learning (DRL) to efficiently learn compliant assembly strategies for space robots. By framing the assembly task as a Markov Decision Process (MDP) and leveraging the prior knowledge from impedance control, the method enhances training efficiency and robustness. The approach is validated through a parallelized simulation environment built using Isaac Gym, demonstrating improved compliance control performance and adaptability to uncertainties.

The on-orbit assembly scenario involves a free-floating space robot performing modular assembly tasks, such as assembling a segmented space telescope. The robot consists of a base satellite and a multi-degree-of-freedom manipulator, operating in a microgravity environment. Key assumptions include treating the robot as a multi-rigid-body system, ignoring gravitational effects, and simplifying the assembly interfaces as peg-in-hole components with a uniform clearance. The dynamic coupling between the robot and the flexible structure leads to base satellite drift and contact force variations, complicating the control process. The assembly process is divided into a non-contact alignment phase and a contact-based insertion phase, with the latter being the focus of compliance control. The dynamics of the space robot are described by the Lagrangian formulation, accounting for base-manipulator interactions and external forces. Structural vibrations of the assembly components are modeled using sinusoidal functions with randomized parameters to simulate real-world uncertainties.
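
Since the text states only that structural vibrations are modeled as sinusoids with randomized parameters, a minimal sketch of such a disturbance model is given below; the amplitude and frequency ranges are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_vibration(rng, amp_range=(0.001, 0.005), freq_range=(0.1, 1.0)):
    """Sample one sinusoidal vibration mode with randomized amplitude (m),
    frequency (Hz), and phase (rad). The ranges are illustrative assumptions."""
    amp = rng.uniform(*amp_range)
    freq = rng.uniform(*freq_range)
    phase = rng.uniform(0.0, 2.0 * np.pi)
    return lambda t: amp * np.sin(2.0 * np.pi * freq * t + phase)

# Example: perturb the nominal hole position along one axis at simulation time t
rng = np.random.default_rng(0)
vibration = sample_vibration(rng)
hole_offset_x = vibration(0.02)  # displacement (m) added to the nominal hole pose
```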

To formalize the assembly problem, it is modeled as an MDP defined by the tuple {S, A, P, R, γ}, where S is the state space, A is the action space, P is the state transition function, R is the reward function, and γ is the discount factor. The state space includes sensor data such as joint positions and velocities, contact forces and moments, and base and end-effector poses and velocities. The action space consists of incremental joint position commands, which are fed to a low-level controller. The reward function is designed to encourage successful assembly while penalizing excessive contact forces and base displacements. The Proximal Policy Optimization (PPO) algorithm is employed to learn the assembly policy, using an actor-critic framework with neural networks for the policy and value functions. To improve learning efficiency, impedance control is incorporated as a prior model, reducing the exploration space and enhancing stability during training.
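
The actor-critic architecture is described only at this level of detail, so the following is a minimal sketch of a policy/value network pair in PyTorch; the layer sizes, activations, and the Gaussian action distribution are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Minimal actor-critic sketch: a Gaussian policy over the six joint
    increments and a scalar state-value head (sizes are illustrative)."""

    def __init__(self, obs_dim: int, act_dim: int = 6, hidden: int = 256):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor):
        dist = torch.distributions.Normal(self.actor(obs), self.log_std.exp())
        value = self.critic(obs).squeeze(-1)
        return dist, value
```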

The impedance control prior model establishes a desired dynamic behavior between the robot and the environment, modeled as a second-order system. In joint space, the impedance relationship is given by:

$$ M_d \Delta \ddot{\theta} + B_d \Delta \dot{\theta} + K_d \Delta \theta = \tau_d - \tau_e $$

where \( M_d \), \( B_d \), and \( K_d \) are the desired inertia, damping, and stiffness matrices, respectively; \( \Delta \theta \) is the joint position error; \( \tau_d \) is the desired torque; and \( \tau_e \) is the external torque. The control law derived from this model simplifies to:

$$ \tau_i = H_m(\theta) M_d^{-1} (B_d \Delta \dot{\theta} + K_d \Delta \theta) + (I_n - H_m(\theta) M_d^{-1}) \tau_e $$

where \( H_m(\theta) \) is the manipulator inertia matrix. This is contrasted with a computed torque controller:

$$ \tau_c = H_m(\theta) (K_d \Delta \dot{\theta} + K_p \Delta \theta) $$
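
To make the two control laws concrete, the sketch below evaluates \( \tau_i \) and \( \tau_c \) for a single control step, assuming the manipulator inertia matrix \( H_m(\theta) \) is supplied by a dynamics model; the diagonal gain values follow those reported later in the text.

```python
import numpy as np

def impedance_torque(H_m, M_d, B_d, K_d, dtheta, dtheta_dot, tau_e):
    """tau_i = H_m M_d^{-1} (B_d*dtheta_dot + K_d*dtheta) + (I - H_m M_d^{-1}) tau_e."""
    HMinv = H_m @ np.linalg.inv(M_d)
    return HMinv @ (B_d @ dtheta_dot + K_d @ dtheta) \
        + (np.eye(H_m.shape[0]) - HMinv) @ tau_e

def computed_torque(H_m, K_d, K_p, dtheta, dtheta_dot):
    """tau_c = H_m (K_d*dtheta_dot + K_p*dtheta)."""
    return H_m @ (K_d @ dtheta_dot + K_p @ dtheta)

# Gains as stated in the text for the 6-DOF arm: M_d = I, B_d = 8I, K_d = 50I
n = 6
M_d, B_d, K_d = np.eye(n), 8.0 * np.eye(n), 50.0 * np.eye(n)
```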

Domain randomization is applied during training to enhance robustness, including noise in observations, actions, and initial base poses. The simulation environment is built in Isaac Gym, enabling parallelized training of multiple robot instances on GPU hardware. The space robot parameters, such as mass and inertia properties, are based on a realistic model, and the assembly task involves inserting a peg into a hole with a 1 mm gap. Training uses the PPO algorithm with hyperparameters like a learning rate of 0.0003 and a discount factor of 0.99.
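
A minimal sketch of the domain randomization described above might look like the following; the noise standard deviations and pose ranges are placeholders, as the exact distribution parameters are not given here.

```python
import numpy as np

rng = np.random.default_rng()

def randomize_observation(obs, sigma=0.01):
    """Add zero-mean Gaussian noise to the stacked observation vector (sigma is a placeholder)."""
    return obs + rng.normal(0.0, sigma, size=obs.shape)

def randomize_action(action, sigma=0.005):
    """Perturb the commanded joint increments before the low-level controller sees them."""
    return action + rng.normal(0.0, sigma, size=action.shape)

def sample_initial_base_pose(pos_range=0.05, rot_range=0.1):
    """Draw initial base position (m) and per-axis orientation (rad) offsets
    from uniform distributions; the ranges are illustrative assumptions."""
    dp = rng.uniform(-pos_range, pos_range, size=3)
    drot = rng.uniform(-rot_range, rot_range, size=3)
    return dp, drot
```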

Simulation results demonstrate the effectiveness of the proposed method. The impedance control prior model reduces training time by approximately 17.3% compared to the computed torque controller, as shown by cumulative reward convergence curves. The number of assembly steps per episode decreases over training, indicating improved efficiency. Contact forces during assembly are maintained within safe limits, with maximum absolute values lower in the impedance control case. Base satellite displacements are constrained within thresholds, ensuring operational stability. The method also generalizes to unseen conditions, such as larger base pose errors, without significant performance degradation. These results highlight the advantage of combining a model-based control prior with reinforcement learning for compliant control.

In conclusion, the hybrid approach combining impedance control and deep reinforcement learning offers a robust solution for space robot on-orbit assembly. It addresses dynamic coupling and structural vibrations while reducing parameter tuning effort. Future work will explore handling higher-frequency vibrations and extending the framework to full assembly sequences, including the transport phase, paving the way for autonomous on-orbit operations.

Table 1: Space Robot Kinematic and Dynamic Parameters
Symbol       | Base   | Joint 1 | Joint 2 | Joint 3 | Joint 4 | Joint 5 | Joint 6
θ (rad)      | –      | θ₁      | θ₂      | θ₃      | θ₄      | θ₅      | θ₆
α (rad)      | –      | 0       | -π/2    | 0       | 0       | π/2     | -π/2
a (m)        | –      | 0       | 0       | 0.612   | 0.572   | 0       | 0
d (m)        | –      | 0.228   | 0       | 0       | 0.164   | 0.116   | 1.208
m (kg)       | 570.81 | 3.55    | 9.85    | 6.20    | 1.26    | 1.10    | 195.98
Ixx (kg·m²)  | 67.046 | 0.014   | 0.026   | 0.012   | 0.002   | 0.002   | 11.966
Iyy (kg·m²)  | 55.382 | 0.015   | 0.575   | 0.315   | 0.003   | 0.002   | 20.108
Izz (kg·m²)  | 70.94  | 0.012   | 0.012   | 0.311   | 0.002   | 0.002   | 11.965
pCoM x (m)   | 0      | -0.008  | 0.266   | 0.252   | -0.002  | 0       | 0
pCoM y (m)   | -0.373 | 0       | 0       | 0       | 0       | 0       | 0.033
pCoM z (m)   | -0.373 | 0       | 0       | 0.003   | -0.015  | -0.005  | -0.524
(The DH parameters θ, α, a, and d apply to the manipulator joints only; "–" marks entries that do not apply to the base.)

The reinforcement learning framework operates directly on the robot's sensor measurements. The state space is defined as \( s_t = [\theta, \dot{\theta}, f, m, r_b, q_b, r_e, q_e, v_b, \omega_b, v_e, \omega_e]^T \), where \( \theta \) and \( \dot{\theta} \) are the joint positions and velocities, \( f \) and \( m \) are the measured contact force and moment, \( r_b, q_b \) and \( r_e, q_e \) are the base and end-effector positions and orientation quaternions, and \( v_b, \omega_b, v_e, \omega_e \) are the corresponding linear and angular velocities. The action space is \( a_t = [\Delta\theta_1, \Delta\theta_2, \Delta\theta_3, \Delta\theta_4, \Delta\theta_5, \Delta\theta_6]^T \), the incremental joint position commands, and the reward function is:

$$ r_t = \begin{cases}
1 - l/L & \text{if assembly successful} \\
-1 + d/D & \text{if assembly failed}
\end{cases} $$

where \( l \) is the current step, \( L \) is the maximum number of steps per episode, \( d \) is the current insertion depth, and \( D \) is the required insertion depth.
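
The reward above can be transcribed almost directly; the zero reward for non-terminal steps is an assumption, since only the terminal cases are defined here.

```python
def assembly_reward(step, max_steps, depth, total_depth, success, failed):
    """Terminal reward: shorter successful episodes score higher, and failed
    episodes are penalized less the deeper the peg was inserted."""
    if success:
        return 1.0 - step / max_steps
    if failed:
        return -1.0 + depth / total_depth
    return 0.0  # non-terminal step (assumption: no intermediate shaping term)
```

The PPO algorithm optimizes the policy network \( \pi_\phi(a|s) \) and value network \( V_\psi(s) \) using the clipped objective function: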

$$ L^{\text{clip}}(\phi) = \mathbb{E} \left[ \min(r_t(\phi) \hat{A}_t, \text{clip}(r_t(\phi), 1-\epsilon, 1+\epsilon) \hat{A}_t) \right] $$

where \( r_t(\phi) = \pi_\phi(a_t|s_t) / \pi_{\phi_{\text{old}}}(a_t|s_t) \), \( \hat{A}_t \) is the advantage estimate, and \( \epsilon \) is a clipping parameter. The advantage function is computed as:

$$ \hat{A}_t = \delta_t + (\gamma\lambda) \delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1} \delta_{T-1} $$

with \( \delta_t = r_t + \gamma V_\psi(s_{t+1}) - V_\psi(s_t) \). This formulation enables stable and efficient policy updates.
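
A compact sketch of these two computations is shown below; it ignores episode-boundary masking for brevity and assumes the value estimates include a bootstrap entry for the final state.

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.99):
    """Truncated GAE: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t), accumulated
    backwards with weight gamma*lam. `values` has length T + 1 (bootstrap)."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages)
    return -surrogate.mean()
```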

Table 2: PPO Algorithm Hyperparameters
Parameter                        | Value
Learning rate α                  | 0.0003
Discount factor γ                | 0.99
Time horizon T                   | 32
Advantage (GAE) parameter λ      | 0.99
Batch size M                     | 8192
Parallel environment instances N | 512
Clipping factor ε                | 0.2
Value-function coefficient c₁    | 1
Entropy coefficient c₂           | 0.01
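
For reference, the hyperparameters in Table 2 can be collected into a single configuration dictionary; the key names are illustrative, not taken from the authors' code.

```python
ppo_config = {
    "learning_rate": 3e-4,   # alpha
    "gamma": 0.99,           # discount factor
    "horizon": 32,           # rollout length T
    "gae_lambda": 0.99,      # advantage parameter
    "batch_size": 8192,
    "num_envs": 512,         # parallel Isaac Gym instances
    "clip_eps": 0.2,
    "vf_coef": 1.0,
    "entropy_coef": 0.01,
}
```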

Combining the impedance prior with deep reinforcement learning enables adaptive compliance control. The impedance control parameters are set as \( M_d = \text{diag}(1,1,1,1,1,1) \), \( B_d = \text{diag}(8,8,8,8,8,8) \), and \( K_d = \text{diag}(50,50,50,50,50,50) \). For comparison, the computed torque controller uses \( K_d = \text{diag}(8,8,8,8,8,8) \) and \( K_p = \text{diag}(50,50,50,50,50,50) \). Domain randomization adds Gaussian noise to observations and actions and draws initial base poses from uniform distributions, ensuring robustness to uncertainties.

Performance metrics from simulations show that the impedance control prior model achieves higher cumulative rewards and lower contact forces. For instance, the maximum contact force in the x-direction is reduced to 2.5 N compared to 7.5 N with computed torque control. Base displacements remain within 0.1 m for position and 0.5 rad for orientation, below safety thresholds. The table below summarizes the statistical results of contact forces during assembly.

Table 3: Maximum Contact Force Statistics
Direction | Impedance control mean (N) | Impedance control variance (N²) | Computed torque mean (N) | Computed torque variance (N²)
F_x       | 2.5  | 0.5 | 7.5 | 1.2
F_y       | 1.75 | 0.3 | 3.0 | 0.8
F_z       | 0.75 | 0.1 | 2.0 | 0.4

The convergence of assembly steps over training iterations demonstrates the learning progress: with the impedance control prior, the number of steps per episode decreases from 36 to 30, while with the computed torque controller it decreases from 30 to 25; the impedance-based policy, however, achieves its result with better force compliance. The reward function effectively guides the policy toward successful assembly, with episodes terminating early if contact forces exceed 10 N or moments exceed 2 N·m, so that safety and efficiency are balanced.

In summary, the proposed method showcases the potential of combining model-based control with data-driven learning for space robot applications. The impedance control prior model enhances training efficiency and stability, while deep reinforcement learning adapts to dynamic uncertainties. This hybrid approach enables autonomous assembly of large space structures in orbit. Future research will focus on extending the framework to more complex assembly tasks and real-world deployment.
