Collaborative Pursuit Strategies for Underwater Bionic Robots Using Multi-Agent Reinforcement Learning

In recent years, underwater bionic robots have garnered significant attention for their potential in exploring and utilizing marine resources. Inspired by the morphology and locomotion of aquatic organisms such as fish, these robots exhibit remarkable maneuverability and cause minimal environmental disturbance. Their design enables efficient navigation through complex underwater environments, making them well suited to tasks such as resource detection, security patrols, and ecological monitoring. However, the inherent challenges of underwater settings, including nonlinear dynamics, turbulence, and limited communication, complicate the implementation of cooperative behaviors among multiple robots. This paper addresses these issues by proposing a multi-agent reinforcement learning (MARL) framework for collaborative pursuit tasks, in which a team of bionic robots works together to capture an evader in a dynamic underwater scenario.

The pursuit-evasion problem, observed in natural systems such as predator-prey interactions, involves a group of pursuers coordinating to intercept one or more evaders. In underwater contexts this task is particularly demanding because the robots are underactuated and susceptible to hydrodynamic forces. Traditional methods, such as those based on dynamic game theory or particle swarm optimization, often rely on precise models that are difficult to derive in such environments, and their computational cost grows rapidly with the number of robots, limiting scalability. To overcome these limitations, we leverage MARL, which allows the robots to learn adaptive strategies through experience without explicit environmental modeling. Our approach employs the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm, combining centralized training with decentralized execution to foster cooperation among the robots while preserving individual autonomy.

Central to our framework is the design of the robot platform. Inspired by sharks, we developed a prototype with a multi-joint tail driven by servomotors, mimicking the oscillatory motion of biological fish. The prototype measures 0.68 m in length, weighs approximately 3.05 kg, and achieves a maximum speed of 0.85 body lengths per second with a minimum turning radius of 0.15 m. Its movement is governed by a Central Pattern Generator (CPG) model, which produces periodic signals for tail undulation; adjusting parameters such as the frequency and offset of the CPG output controls the robot's speed and heading. For instance, with a tail-beat frequency of 0.8 Hz and an amplitude of 35 degrees, the robot advances about 0.08 m in 0.5 s. The technical specifications are summarized in Table 1.
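
As a concrete illustration, the sketch below implements a minimal single-joint CPG as a sinusoidal oscillator, using the 0.8 Hz frequency and 35-degree amplitude quoted above. The actual controller likely couples several joints, and the offset-to-turning mapping is an assumption made for illustration.

```python
import numpy as np

def cpg_tail_angle(t, freq=0.8, amplitude_deg=35.0, offset_deg=0.0):
    """Sinusoidal CPG output for one tail joint.

    freq:          tail-beat frequency in Hz (0.8 Hz in the prototype)
    amplitude_deg: oscillation amplitude in degrees (35 deg in the prototype)
    offset_deg:    bias added to the oscillation; a nonzero offset skews the
                   tail to one side, which we assume induces turning
    """
    return offset_deg + amplitude_deg * np.sin(2.0 * np.pi * freq * t)

# One tail-beat cycle sampled at 50 Hz: offset = 0 gives straight swimming,
# while a positive offset would bias the tail and turn the robot.
t = np.arange(0.0, 1.0 / 0.8, 0.02)
angles = cpg_tail_angle(t)
```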

Table 1: Technical Parameters of the Bionic Robot

| Attribute | Parameter |
| --- | --- |
| Length | 0.68 m |
| Weight | 3.05 kg |
| Control Unit | STM32F407 |
| Actuator | KingMax 4510s |
| Communication | RF200 |
| IMU | JY901s |
| Max Speed | 0.85 BL/s |
| Min Turning Radius | 0.15 m |

In the pursuit-evasion task, we consider a 2D bounded underwater arena where two bionic robot pursuers collaborate to capture a single bionic robot evader. The evader possesses superior agility, with a smaller turning radius (0.18 m) and higher angular velocity (1.2 rad/s) compared to the pursuers (0.2 m and 0.9 rad/s, respectively). Each bionic robot has a sensing range $R_s$, and capture occurs when a pursuer’s distance to the evader $d_{pe}$ falls below a threshold $d_{\text{capture}}$. The state transition for each bionic robot is modeled discretely, accounting for the periodic motion of the tail. Specifically, the position and orientation of a bionic robot at time $t$ and $t+1$ are denoted as $\mathbf{p}_t$ and $\mathbf{p}_{t+1}$, with the displacement between consecutive target points exceeding the distance covered in one tail beat cycle. The kinematic equations for the bionic robot are derived using a backstepping control approach to ensure stable tracking of target trajectories. For a bionic robot $i$, the state vector $\mathbf{s}_i$ is defined as:

$$ \mathbf{s}_i = [\mathbf{p}_i, \psi_i, d_{i1}, d_{i2}, \dots, d_{iN}, \theta_{i1}, \theta_{i2}, \dots, \theta_{iN}] $$

where $\mathbf{p}_i = (x_i, y_i)$ is the global position, $\psi_i$ is the yaw angle, $d_{ij} = \|\mathbf{p}_i - \mathbf{p}_j\|$ is the distance to robot $j$, and $\theta_{ij}$ is the relative heading angle to robot $j$. This state representation enables each bionic robot to perceive inter-agent relationships and environmental dynamics, facilitating informed decision-making.
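
The following sketch shows how such a state vector might be assembled from global positions and yaw angles, together with the capture test $d_{pe} < d_{\text{capture}}$. The helper names and the angle-wrapping convention are illustrative, and the capture threshold value is a placeholder, since the text does not specify it.

```python
import numpy as np

def build_state(i, positions, yaws):
    """Assemble the state vector s_i for robot i.

    positions: (N, 2) array of global (x, y) positions
    yaws:      (N,) array of yaw angles psi in radians
    Returns [p_i, psi_i, d_i1..d_iN, theta_i1..theta_iN] over all j != i.
    """
    p_i, psi_i = positions[i], yaws[i]
    dists, rel_headings = [], []
    for j in range(len(positions)):
        if j == i:
            continue
        delta = positions[j] - p_i
        dists.append(np.linalg.norm(delta))        # d_ij = ||p_i - p_j||
        bearing = np.arctan2(delta[1], delta[0])   # world-frame bearing to j
        # relative heading angle theta_ij, wrapped to [-pi, pi)
        rel_headings.append((bearing - psi_i + np.pi) % (2 * np.pi) - np.pi)
    return np.concatenate([p_i, [psi_i], dists, rel_headings])

def is_captured(p_pursuer, p_evader, d_capture=0.3):
    """Capture test d_pe < d_capture; 0.3 m is a placeholder threshold."""
    return np.linalg.norm(p_pursuer - p_evader) < d_capture
```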

The action space for each bionic robot is designed to reflect its locomotion capabilities. An action $\mathbf{a}_i$ consists of a heading change $\Delta \psi_i$ and a movement distance $\Delta d_i$. The heading change is bounded by $\Delta \psi_i \in [-\pi/4, \pi/4]$ radians, allowing sharp turns for pursuit or evasion. The movement distance $\Delta d_i$ is lower-bounded by the distance covered in one tail-beat cycle: for pursuers, $\Delta d_i \geq 0.15$ m, while for the faster evader, $\Delta d_i \geq 0.2$ m. This asymmetry emphasizes the evader's speed advantage, necessitating coordinated strategies among the pursuers. Action execution is governed by the CPG model, where adjustments in frequency and offset translate to changes in speed and direction. The dynamics of the bionic robot can be approximated as:

$$ \mathbf{p}_{t+1} = \mathbf{p}_t + \Delta d_i \cdot [\cos(\psi_t + \Delta \psi_i), \sin(\psi_t + \Delta \psi_i)]^T $$

ensuring smooth transitions between states.
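
A minimal sketch of this state transition follows, assuming the bounds quoted above: the heading change clipped to $[-\pi/4, \pi/4]$ and the movement distance lower-bounded at 0.15 m for pursuers and 0.2 m for the evader.

```python
import numpy as np

def step_kinematics(p, psi, d_psi, d_dist, role="pursuer"):
    """One discrete transition p_{t+1} = p_t + d * [cos, sin] per the model above.

    d_psi:  heading change, clipped to [-pi/4, pi/4] rad
    d_dist: movement distance, lower-bounded by the per-step minimum
            (0.15 m for pursuers, 0.2 m for the evader)
    """
    min_dist = 0.15 if role == "pursuer" else 0.2
    d_psi = float(np.clip(d_psi, -np.pi / 4, np.pi / 4))
    d_dist = max(d_dist, min_dist)
    psi_next = psi + d_psi
    p_next = p + d_dist * np.array([np.cos(psi_next), np.sin(psi_next)])
    return p_next, psi_next
```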

For the MARL framework, we adopt MADDPG, which assigns actor-critic networks to each bionic robot. During training, critics access global information to evaluate actions, while actors rely only on local observations during execution. This centralized-training, decentralized-execution scheme enhances scalability and robustness. The reward function for a pursuer $i$ is structured into three components: extrinsic rewards, auxiliary rewards, and intrinsic penalties. The extrinsic reward $R_{\text{ext}}$ is given upon successful capture of the evader, providing a strong positive signal. Auxiliary rewards encourage behaviors that aid pursuit, such as reducing distance to the evader. For example, the distance-based reward $R_{\text{dist}}$ is defined as:

$$ R_{\text{dist}} = -\alpha \cdot d_{pe} $$

where $\alpha$ is a scaling factor. Intrinsic penalties deter undesirable actions, like exiting the workspace, with a penalty $R_{\text{penalty}} = -\beta$ if the bionic robot violates boundaries. The total reward $R_i$ for pursuer $i$ is:

$$ R_i = R_{\text{ext}} + R_{\text{dist}} + R_{\text{penalty}} $$

For the evader, rewards are inverted to promote escape. This reward structure is summarized in Table 2.

Table 2: Reward Function Components for Bionic Robots

| Component | Type | Expression |
| --- | --- | --- |
| Extrinsic Reward | Positive | $+100$ on capture |
| Distance Reward | Negative | $-\alpha \cdot d_{pe}$ |
| Boundary Penalty | Negative | $-\beta$ if out-of-bounds |
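
Putting the three components together, a pursuer's per-step reward might be computed as below. The $+100$ capture bonus follows Table 2, while $\alpha$, $\beta$, and the boundary test are placeholders, since the text leaves their values unspecified.

```python
def pursuer_reward(d_pe, captured, out_of_bounds,
                   alpha=0.1, beta=10.0, r_capture=100.0):
    """Total reward R_i = R_ext + R_dist + R_penalty for one pursuer.

    alpha and beta are placeholder scaling factors; the +100 capture
    bonus is the value given in Table 2.
    """
    r_ext = r_capture if captured else 0.0        # extrinsic: capture bonus
    r_dist = -alpha * d_pe                        # auxiliary: close the distance
    r_penalty = -beta if out_of_bounds else 0.0   # intrinsic: boundary violation
    return r_ext + r_dist + r_penalty
```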

Training follows a curriculum learning strategy in which task difficulty gradually increases to encourage exploration. The bionic robots first learn in simplified settings with slower evaders and then progress to the full dynamics, which accelerates convergence and improves policy robustness. MADDPG updates the actor and critic networks by gradient descent, with the critic loss $L_c$ and actor loss $L_a$ for each robot $i$ given by:

$$ L_c = \mathbb{E}[ (Q_i(\mathbf{s}, \mathbf{a}) - y_i)^2 ] $$
$$ L_a = -\mathbb{E}[ Q_i(\mathbf{s}, \pi_i(\mathbf{o}_i)) ] $$

where $y_i = r_i + \gamma Q_i'(\mathbf{s}', \pi'(\mathbf{o}'))$ is the target value computed with the target critic and target policies, $\gamma$ is the discount factor, and $\mathbf{o}_i$ is the local observation of robot $i$. This ensures that each bionic robot learns a policy that maximizes collective performance.
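
To make the update concrete, here is a schematic PyTorch sketch of one MADDPG step under these losses. The agent container (actor, critic, target networks, index, optimizers), the replay-batch layout, and the discount value are assumptions for illustration, not details given in the text; soft target-network updates and exploration noise are omitted.

```python
import torch
import torch.nn.functional as F

def maddpg_update(agent_i, agents, batch, gamma=0.95):
    """One MADDPG update for agent i: centralized critic, decentralized actor.

    batch = (obs, acts, rews, next_obs, done), where obs/acts/next_obs are
    lists with one (B, dim) tensor per agent, and rews/done are (B, 1).
    Each agent object is assumed to hold actor, critic, target_actor,
    target_critic, an index, and optimizers; gamma = 0.95 is an assumed value.
    """
    obs, acts, rews, next_obs, done = batch

    # Critic loss L_c = E[(Q_i(s, a) - y_i)^2], with the target value
    # y_i = r_i + gamma * Q_i'(s', a') built from the target networks.
    with torch.no_grad():
        next_acts = [ag.target_actor(o) for ag, o in zip(agents, next_obs)]
        q_next = agent_i.target_critic(torch.cat(next_obs + next_acts, dim=-1))
        y = rews + gamma * (1.0 - done) * q_next
    q = agent_i.critic(torch.cat(obs + acts, dim=-1))
    critic_loss = F.mse_loss(q, y)
    agent_i.critic_opt.zero_grad()
    critic_loss.backward()
    agent_i.critic_opt.step()

    # Actor loss L_a = -E[Q_i(s, pi_i(o_i))]: only agent i's action is
    # replaced by its differentiable policy output; the rest stay sampled.
    acts_pi = list(acts)
    acts_pi[agent_i.index] = agent_i.actor(obs[agent_i.index])
    actor_loss = -agent_i.critic(torch.cat(obs + acts_pi, dim=-1)).mean()
    agent_i.actor_opt.zero_grad()
    actor_loss.backward()
    agent_i.actor_opt.step()
```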

Experiments were conducted in a 5m × 4m indoor pool with three bionic robots: two pursuers and one evader. The bionic robots were equipped with control units and IMUs, and their positions were tracked via a global vision system. In one trial, the pursuers demonstrated effective cooperation by encircling the evader. Pursuer 1 engaged in direct chasing, while Pursuer 2 anticipated the evader’s path and executed a blocking maneuver. The trajectories, plotted with arrows indicating movement directions, show how the pursuers reduced the evader’s escape space, leading to capture. In another trial, both pursuers coordinated from left and right flanks, compressing the evader against a boundary. The bionic robots’ ability to adapt their strategies in real-time highlights the efficacy of the learned policies. Statistical analysis of multiple runs revealed a significant increase in capture rate and stability compared to baseline methods, with the bionic robot team achieving over 80% success in dynamic conditions.

In conclusion, our MARL-based approach enables underwater bionic robots to perform collaborative pursuit tasks efficiently. The integration of bionic robot dynamics with reinforcement learning fosters robust coordination, addressing the challenges of nonlinear underwater environments. Future work will focus on enhancing the bionic robot’s perceptual capabilities and reducing the sim-to-real gap by incorporating real-world data into training. This study lays a foundation for advanced multi-bionic robot systems in marine applications, underscoring the potential of bionic robots in complex autonomous operations.

Bionic robot technology continues to evolve, with ongoing improvements in materials, sensors, and algorithms. For instance, the CPG model can be extended to more degrees of freedom, enabling more complex maneuvers, and hierarchical reinforcement learning could further optimize multi-robot cooperation. As these platforms become more prevalent, their impact on underwater exploration and security will grow, driven by innovations in AI and robotics. The bionic paradigm, inspired by nature, offers a sustainable path toward intelligent ocean systems in which robots operate in schools, much like their biological counterparts.

To quantify system performance, we analyzed metrics such as capture time and energy efficiency. The learned pursuers consistently outperformed non-learning baselines, reducing average capture time by 30%, and the evader's survival time decreased as the pursuers acquired coordinated tactics, demonstrating the adaptive nature of the MARL framework. These results validate the bionic robot as a viable platform for autonomous underwater tasks and pave the way for larger-scale deployments. Coupled with advanced learning algorithms, the platform represents a significant step toward fully autonomous marine swarms that collaborate to achieve complex objectives.

In summary, the bionic approach harnesses biological principles to overcome engineering challenges, and combined with MARL it unlocks new possibilities for underwater robotics. The iterative learning process lets the robots refine their behaviors in response to environmental feedback, leading to emergent cooperation. As research progresses, we anticipate that bionic robots will play a crucial role in oceanography, defense, and environmental monitoring thanks to their versatility and efficiency. Their journey from concept to real-world application exemplifies the synergy between biomimicry and artificial intelligence.
