The coordinated behaviors observed in natural fish schools, such as cooperative hunting and dynamic obstacle avoidance, offer profound inspiration for the design of multi-agent systems, particularly in challenging environments like underwater domains. This collective intelligence, characterized by distributed perception and adaptive coordination, is a compelling model for developing advanced control strategies for bionic robot collectives. The pursuit-evasion problem, a classic testbed for multi-agent cooperation, becomes significantly more complex in underwater settings due to inherent challenges like strong disturbances, noisy perception, and the underactuated nature of bionic robot platforms. Traditional methods often rely on precise dynamical models or impose restrictive assumptions on opponent behavior, limiting their applicability in real-world, dynamic scenarios where opponents are intelligent and adaptive.
To address these limitations, this work presents a comprehensive framework for training cooperative pursuit policies for underwater bionic robot swarms. The core of our approach integrates a Multi-Agent Reinforcement Learning architecture with a Multi-Head Self-Attention mechanism (MARL-MHSA) and is grounded in a data-driven simulation environment. This synergy is designed to enhance inter-agent cooperation, improve policy training efficiency, and crucially, bridge the notorious simulation-to-reality gap that often hinders the deployment of learned controllers onto physical bionic robot systems.

The fundamental challenge in underwater bionic robot control stems from the difficulty of obtaining accurate hydrodynamic models. Instead of relying on such first-principle models, we adopt a data-driven modeling paradigm. We collect real motion data from a physical bionic robot platform (a multi-joint robotic shark) by executing various control commands. A dataset $B = \{(u_i, x_i) \mid i = 1, \dots, N\}$ is constructed, where $u_i \in \mathbb{R}^m$ represents the control input (e.g., frequency and bias of a Central Pattern Generator) and $x_i \in \mathbb{R}^n$ represents the observed state (e.g., forward velocity and angular velocity). A deep neural network is then trained to learn the mapping from the control space to the motion state space:
$$F_s: (f, b) \rightarrow (v, \omega)$$
where $f$ is the tail-beat frequency, $b$ is the control bias, $v$ is the linear velocity, and $\omega$ is the angular velocity. This learned model forms the dynamics engine of our simulation environment, effectively aligning the simulation’s kinematic behavior with that of the real bionic robot. To further enhance realism, Gaussian noise is injected into the state transitions to mimic environmental uncertainties and actuator imperfections.
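As a minimal illustration of this step, the PyTorch sketch below fits a small fully connected network to the collected $(u_i, x_i)$ pairs; the layer sizes, optimizer, and training schedule are illustrative assumptions rather than the exact configuration used for the robotic shark.

```python
# Minimal sketch of the data-driven dynamics model (assumed 2-D input (f, b), 2-D output (v, omega)).
import torch
import torch.nn as nn

class SurrogateDynamics(nn.Module):
    """Approximates F_s: (f, b) -> (v, omega) from recorded robot motion data."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        return self.net(u)

def fit_dynamics(model, controls, states, epochs=2000, lr=1e-3):
    """controls: (N, 2) tensor of (f, b); states: (N, 2) tensor of (v, omega)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(controls), states).backward()
        opt.step()
    return model
```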
The behavior captured by this data-driven model is summarized in the following mapping table, where each cell lists the linear velocity $v$ and angular velocity $\omega$ (formatted as $v$ / $\omega$) predicted for the given tail-beat frequency $f$ and control bias $b$, demonstrating the model's ability to capture the bionic robot's motion characteristics.
| $f$ (Hz) | $b = -30^\circ$ | $b = -20^\circ$ | $b = -10^\circ$ | $b = 0^\circ$ | $b = 10^\circ$ | $b = 20^\circ$ | $b = 30^\circ$ |
|---|---|---|---|---|---|---|---|
| 0.5 | 0.040 / -0.763 | 0.089 / -0.619 | 0.180 / -0.419 | 0.155 / 0 | 0.190 / 0.311 | 0.093 / 0.501 | 0.045 / 0.611 |
| 0.8 | 0.046 / -0.912 | 0.104 / -0.771 | 0.189 / -0.409 | 0.217 / 0 | 0.184 / 0.301 | 0.106 / 0.580 | 0.045 / 0.699 |
| 1.0 | 0.072 / -1.177 | 0.141 / -0.996 | 0.279 / -0.574 | 0.336 / 0 | 0.267 / 0.553 | 0.143 / 1.056 | 0.070 / 1.168 |
| 1.2 | 0.061 / -1.028 | 0.116 / -0.863 | 0.243 / -0.519 | 0.269 / 0 | 0.229 / 0.451 | 0.128 / 0.873 | 0.059 / 1.000 |
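To show how such a learned model could drive the simulator, the sketch below advances one bionic robot's planar pose using the predicted $(v, \omega)$ plus injected Gaussian noise; the unicycle-style pose update, time step, and noise scales are assumptions of this sketch.

```python
import math
import numpy as np
import torch

def simulate_step(model, pose, f, b, dt=0.1, noise_std=(0.01, 0.05)):
    """One simulator step. pose = (x, y, theta); control = (f, b).

    The learned model predicts (v, omega); Gaussian noise mimics environmental
    disturbance and actuator imperfection, as described above.
    """
    with torch.no_grad():
        v, omega = model(torch.tensor([[f, b]], dtype=torch.float32))[0].tolist()
    v += np.random.normal(0.0, noise_std[0])        # noise on linear velocity
    omega += np.random.normal(0.0, noise_std[1])    # noise on angular velocity
    x, y, theta = pose
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + omega * dt)
```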
With a high-fidelity simulation environment established, we focus on the core algorithmic challenge: learning an effective distributed cooperative policy for the pursuing bionic robot team. We formulate the problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP). Each pursuer agent receives a local observation $O_i$, which typically includes information about nearby teammates and the evader, such as relative distances and bearings. The dynamic nature of underwater environments means the number of observable neighbors can change, leading to variable-length observations. To handle this, we employ a mean-embedding technique to create a fixed-dimensional input for the policy network:
$$\bar{O}_i = \frac{1}{|\hat{O}_i|} \sum_{o_{i,j} \in \hat{O}_i} \phi_{\text{NN}}(o_{i,j})$$
where $\hat{O}_i$ is the set of raw observations for agent $i$, $o_{i,j}$ is an observation of neighbor $j$, and $\phi_{\text{NN}}$ is a neural network feature extractor. This ensures permutation invariance and a consistent input size.
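A minimal sketch of this mean-embedding encoder is given below, assuming a fixed per-neighbor observation dimension; $\phi_{\text{NN}}$ is realized here as a small two-layer network.

```python
import torch
import torch.nn as nn

class MeanEmbedding(nn.Module):
    """Permutation-invariant, fixed-size encoding of a variable number of neighbor observations."""
    def __init__(self, obs_dim: int, embed_dim: int = 32):
        super().__init__()
        # phi_NN: shared feature extractor applied to each raw observation o_{i,j}
        self.phi = nn.Sequential(
            nn.Linear(obs_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, neighbor_obs: torch.Tensor) -> torch.Tensor:
        # neighbor_obs: (num_neighbors, obs_dim); num_neighbors may change every step
        # (assumes at least one entry, e.g. the evader, is always observed).
        return self.phi(neighbor_obs).mean(dim=0)
```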
The policy architecture follows the Centralized Training with Decentralized Execution (CTDE) paradigm. During training, a centralized critic has access to global information to guide learning, while during execution, each bionic robot uses only its own local observation and a decentralized actor network to make decisions. The key innovation is the incorporation of a Multi-Head Self-Attention (MHSA) mechanism into the centralized critic. This allows each agent to dynamically weight the importance of information from all other agents when assessing the value of a joint action.
For each agent $i$, the critic network first encodes that agent's observation-action pair into a feature vector $v_i$. For each attention head $j$, an attention weight $\alpha_j^i$ is then computed, signifying the relevance of the other agents' information to agent $i$ from the perspective of head $j$:
$$\alpha_j^i = \frac{\exp(v_{-i}^T K_j^T Q_j v_i)}{\sum_{-i} \exp(v_{-i}^T K_j^T Q_j v_i + \lambda \cdot \text{Reg}(v_{-i}))}$$
Here, $v_{-i}$ denotes the features of all agents except $i$, $Q_j$ and $K_j$ are the learned query and key matrices for head $j$, and $\lambda$ is a regularization coefficient. The output of head $j$ for agent $i$ is then a weighted sum:
$$x_j^i = [c_j^1, \dots, c_j^N] \cdot \alpha_j^i, \quad \text{where } c_j^k = V_j v_k$$
with $V_j$ being the value matrix. The final state-action value for agent $i$ is constructed by concatenating the outputs from all $H$ attention heads:
$$Q_i = [x_1^i, x_2^i, …, x_H^i]$$
This structured attention enables the bionic robot agents to learn nuanced cooperative strategies by focusing on the most relevant inter-agent dependencies for the task at hand.
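A simplified sketch of this attention computation for a single agent $i$ follows. It omits the regularization term $\lambda \cdot \text{Reg}(\cdot)$, and the choice to pass the agent's own features together with the concatenated head outputs through a final linear layer to obtain a scalar value is an assumption of the sketch, not a detail taken from the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCritic(nn.Module):
    """Multi-head attention over other agents' encoded observation-action features (simplified)."""
    def __init__(self, feat_dim: int, num_heads: int = 4, head_dim: int = 32):
        super().__init__()
        self.q = nn.ModuleList([nn.Linear(feat_dim, head_dim, bias=False) for _ in range(num_heads)])
        self.k = nn.ModuleList([nn.Linear(feat_dim, head_dim, bias=False) for _ in range(num_heads)])
        self.v = nn.ModuleList([nn.Linear(feat_dim, head_dim, bias=False) for _ in range(num_heads)])
        self.out = nn.Linear(feat_dim + num_heads * head_dim, 1)

    def forward(self, feats: torch.Tensor, i: int) -> torch.Tensor:
        # feats: (N, feat_dim), one encoded observation-action feature vector v_k per agent.
        others = torch.cat([feats[:i], feats[i + 1:]], dim=0)          # features of all agents except i
        heads = []
        for q_j, k_j, v_j in zip(self.q, self.k, self.v):
            query = q_j(feats[i])                                      # Q_j v_i
            keys, values = k_j(others), v_j(others)                    # K_j v_k, V_j v_k
            alpha = F.softmax(keys @ query, dim=0)                     # attention weights alpha_j^i
            heads.append((alpha.unsqueeze(1) * values).sum(dim=0))     # weighted sum x_j^i
        return self.out(torch.cat([feats[i]] + heads, dim=0))          # scalar value for agent i
```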
The policy for each bionic robot, $\pi_{\theta}^i(a^i | O_i)$, is a decentralized actor network optimized using policy gradient methods, with the gradient computed as:
$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{O \sim D, a \sim \pi} \left[ \nabla_{\theta_i} \log(\pi_{\theta}^i(a^i | O_i)) Q_i(O, A) \right]$$
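A compact sketch of this update for one agent is given below; it assumes the actor returns a `torch.distributions` object over discrete actions (so that `log_prob` yields one value per sample) and that the batch of Q-values has already been produced by the centralized critic.

```python
def actor_update(actor, actor_optimizer, obs_embed, stored_actions, q_values):
    """One policy-gradient step for agent i's decentralized actor.

    obs_embed:      (batch, embed_dim) mean-embedded observations
    stored_actions: actions taken in the sampled replay batch
    q_values:       Q_i(O, A) from the centralized MHSA critic, shape (batch,)
    """
    dist = actor(obs_embed)
    log_prob = dist.log_prob(stored_actions)
    # Gradient ascent on E[log pi(a|O) * Q(O, A)]  <=>  descent on its negative.
    loss = -(log_prob * q_values.detach()).mean()
    actor_optimizer.zero_grad()
    loss.backward()
    actor_optimizer.step()
    return loss.item()
```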
The reward function $r^i$ is designed to promote efficient collaboration. It combines a sparse success bonus, a dense shaping reward based on distance to the evader, and a penalty for leaving the operational area:
$$r^i = r_{\text{main}}^i + r_{\text{att}}^i + r_{\text{sub}}^i$$
$$r_{\text{main}}^i = \begin{cases} 10, & \text{if } d_{i,e} \leq R_c \\ 0, & \text{otherwise} \end{cases}$$
$$r_{\text{att}}^i = -0.15 \| p_i - p_e \|$$
$$r_{\text{sub}}^i = \begin{cases} 0, & |p_i| < 1.5 \\ 2(|p_i| - 1.5), & 1.5 \leq |p_i| < 2 \\ \min(e^{2|p_i|-4}, 10), & |p_i| \geq 2 \end{cases}$$
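The reward terms above can be assembled as in the following sketch; because the boundary term is described as a penalty, it is applied with a negative sign here, and the position scale for $|p_i|$ is assumed to be the arena's normalized coordinates.

```python
import math
import numpy as np

def pursuit_reward(p_i, p_e, d_ie, capture_radius):
    """Per-agent reward: sparse capture bonus + distance shaping + boundary penalty.

    p_i, p_e: planar positions of pursuer i and the evader; d_ie: their distance.
    The boundary term is subtracted so it acts as the penalty described in the text.
    """
    r_main = 10.0 if d_ie <= capture_radius else 0.0
    r_att = -0.15 * float(np.linalg.norm(np.asarray(p_i) - np.asarray(p_e)))
    rho = float(np.linalg.norm(p_i))
    if rho < 1.5:
        r_sub = 0.0
    elif rho < 2.0:
        r_sub = 2.0 * (rho - 1.5)
    else:
        r_sub = min(math.exp(2.0 * rho - 4.0), 10.0)
    return r_main + r_att - r_sub
```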
The complete training algorithm for the bionic robot swarm is outlined below.
Algorithm 1: MARL-MHSA Training for Bionic Robot Pursuit
Input: Initial policy networks $\pi_{\theta}^i$, centralized critic networks $Q_{\omega}^i$, environment.
Output: Optimized policy networks $\pi_{\theta^*}^i$.
- Initialize target networks $\pi_{\theta'}^i$, $Q_{\omega'}^i$ and experience replay buffer $D$.
- for episode = 1 to M do
  - Initialize environment state $s_0$.
  - for t = 1 to T do
    - for each bionic robot agent i = 1 to N do
      - Receive local observation $\hat{O}_t^i$.
      - Compute mean-embedded observation $\bar{O}_t^i$.
      - Sample action $a_t^i \sim \pi_{\theta}^i(\bar{O}_t^i)$.
    - end for
    - Execute joint action $A_t = \{a_t^1, \dots, a_t^N\}$, receive reward $r_t$ and next state $s_{t+1}$.
    - Store transition $(s_t, A_t, r_t, s_{t+1})$ in $D$.
  - end for
  - if training condition is met then
    - Sample a random batch from $D$.
    - for each agent i = 1 to N do
      - Compute $Q_{\omega}^i(O, A)$ using the MHSA critic.
      - Compute TD target: $y = r^i + \gamma Q_{\omega'}^i(O', A')$.
      - Update critic by minimizing loss: $L(\omega^i) = (Q_{\omega}^i(O, A) - y)^2$.
      - Update actor $\pi_{\theta}^i$ using the policy gradient with $Q_{\omega}^i(O, A)$.
    - end for
    - Soft-update target networks.
  - end if
- end for
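The per-agent update step of Algorithm 1 reduces to a TD regression for the critic and a Polyak (soft) update of the target networks, sketched below with illustrative values for the discount $\gamma$ and update rate $\tau$.

```python
import torch.nn.functional as F

def soft_update(target_net, net, tau=0.01):
    """Polyak-average target network parameters toward the online network."""
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)

def critic_td_update(q_value, reward, next_q_value, critic_optimizer, gamma=0.95):
    """Minimize L(omega^i) = (Q - y)^2 with y = r^i + gamma * Q_target(O', A')."""
    target = reward + gamma * next_q_value.detach()   # TD target from the target critic
    loss = F.mse_loss(q_value, target)
    critic_optimizer.zero_grad()
    loss.backward()
    critic_optimizer.step()
    return loss.item()
```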
To facilitate learning in the challenging sparse-reward setting, a phased curriculum learning strategy is employed. Training begins in an easier setting (e.g., larger capture radius $R_c$, slower evader) and gradually transitions to the full, difficult task parameters. This significantly accelerates policy convergence compared to direct training on the hardest scenario from the outset.
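Such a phased curriculum can be implemented as a simple schedule over the episode count, as in the sketch below; the stage boundaries, capture radii, and evader speed scales are illustrative assumptions rather than the exact values used in training.

```python
def curriculum_stage(episode: int) -> dict:
    """Phased curriculum: begin with a generous capture radius and a slow evader,
    then tighten toward the full task difficulty."""
    if episode < 2000:
        return {"capture_radius": 0.6, "evader_speed_scale": 0.5}
    if episode < 5000:
        return {"capture_radius": 0.4, "evader_speed_scale": 0.8}
    return {"capture_radius": 0.25, "evader_speed_scale": 1.0}
```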
We evaluate the proposed MARL-MHSA framework against several state-of-the-art multi-agent reinforcement learning baselines, including MAPPO, QMIX, and COMA. The bionic robot pursuit team consists of three agents, and the evader is controlled by a pre-trained adversarial policy. Performance is measured by pursuit success rate and the average number of steps required for a successful capture.
The results demonstrate the superior performance of our approach. The integration of the multi-head self-attention mechanism allows the bionic robot team to learn more sophisticated coordination patterns, leading to higher success rates and more efficient captures. The ablation studies confirm the contribution of the MHSA module, showing clear improvements over a standard actor-critic baseline without attention. The scalability of the approach is also validated, showing robust performance as the number of pursuing bionic robots increases.
| Method | Success Rate (%) | Avg. Success Steps | Training Episodes (Convergence) |
|---|---|---|---|
| MAPPO | 74 | 82 | ~4,800 |
| QMIX | 83 | 65 | ~8,300 |
| COMA | 89 | 72 | ~9,700 |
| Actor-Critic (No Attention) | 82 | 73 | ~9,500 |
| MARL-MHSA (Ours) | 92 | 63 | ~7,500 |
The final and most critical validation involves deploying the policy, trained entirely in the data-driven simulation, onto a team of physical bionic robot platforms in a real underwater pool environment. The bionic robots successfully executed the learned cooperative pursuit strategy against an autonomously evading target. The experiment confirmed the effective transferability of the policy, with the bionic robot pursuers demonstrating adaptive coordination, dynamically encircling and intercepting the more maneuverable evader. This successful transition from simulation to reality underscores the critical role of the data-driven modeling component in aligning the simulation dynamics with the true physics of the bionic robot system, effectively bridging the sim-to-real gap.
In conclusion, this work presents a robust and effective framework for learning cooperative control policies for underwater bionic robot swarms. By integrating a data-driven simulation model with a novel multi-agent reinforcement learning architecture enhanced by multi-head self-attention, we address key challenges in policy training, inter-agent coordination, and real-world deployment. The MARL-MHSA framework enables bionic robot agents to learn complex, context-aware cooperative strategies, while the data-driven approach ensures these strategies are executable on physical hardware. The results, validated through extensive simulations and real-world pool experiments, show significant improvements in pursuit success rate and efficiency over strong baselines. This research provides a promising pathway toward deploying intelligent, collaborative bionic robot systems for complex underwater tasks, pushing the boundaries of what is achievable with bio-inspired robotic collectives.
