Multi-Robot Social Formation Navigation Based on Intrinsic Reward Mechanism

In recent years, the field of robot technology has witnessed significant advancements, particularly in multi-robot systems where collaboration is essential for complex tasks. As a researcher deeply involved in this domain, I have focused on addressing the challenges of multi-robot social formation navigation, which involves coordinating multiple robots to maintain predefined formations while navigating through dynamic environments filled with obstacles like pedestrians. One critical issue in this area is relative over-generalization, where robots settle for suboptimal joint strategies due to inefficient exploration. To overcome this, I propose an innovative approach that leverages intrinsic reward mechanisms to enhance exploration and coordination among robots. This article details my methodology, which combines random network distillation and elliptical episodic bonuses to form a double-timescale intrinsic reward, integrated within a centralized training with decentralized execution (CTDE) framework. Through extensive simulations, I demonstrate that this approach outperforms existing baseline algorithms in terms of success rate, collision avoidance, navigation time, and formation accuracy. The integration of advanced robot technology here not only improves performance but also paves the way for more robust and scalable multi-robot applications.

Multi-robot social formation navigation is a subset of multi-robot systems where robots must move in a coordinated manner while adhering to social norms, such as avoiding collisions with humans and other robots. In my work, I model this as a partially observable Markov decision process (POMDP) to handle the uncertainties in dynamic environments. The state space for each robot includes observable components like position, velocity, and radius, as well as hidden states such as goal position and orientation. Formally, the joint state space is defined as $\mathcal{S}$, with each robot’s observation $o_i$ comprising its own state and the observable states of other agents. The action space involves continuous controls for linear and angular velocities, represented as $\mathbf{a}_i = [v_i, \omega_i]$ for robot $i$. The optimization goal is to minimize the navigation time and formation error while satisfying collision constraints, which can be expressed as:

$$ \arg\min_{\pi} \mathbb{E} \left[ t_g + \sum_{t=0}^{t_g} \sum_{i=1}^{n} \| \mathbf{p}_i^t - \mathbf{p}_0^t - \mathbf{H}_i \| \right] $$

subject to:

$$ \| \mathbf{p}_i^t - \mathbf{p}_{-i}^t \| > r_i + r_{-i}, \quad \forall i, t $$

where $t_g$ is the time for the leader robot to reach the goal, $\mathbf{p}_i^t$ is the position of robot $i$ at time $t$, $\mathbf{H}_i$ is the desired offset of robot $i$ relative to the leader, and $r_i$ is the radius of robot $i$. The main challenges for such tasks stem from the non-cooperative nature of pedestrian movements, which makes exploration with standard methods such as $\epsilon$-greedy strategies inefficient.
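
To make the objective concrete, here is a minimal sketch that evaluates the two quantities above: the per-robot formation error $e_i^t = \| \mathbf{p}_i^t - \mathbf{p}_0^t - \mathbf{H}_i \|$ and the pairwise collision constraint. The positions, offsets, and radii in the example are illustrative placeholders, not values taken from my simulator.

```python
import numpy as np

def formation_error(p_i, p_leader, H_i):
    """Formation error e_i^t = ||p_i^t - p_0^t - H_i|| for one follower."""
    return np.linalg.norm(p_i - p_leader - H_i)

def collision_free(positions, radii):
    """Check the pairwise constraint ||p_i - p_j|| > r_i + r_j for all pairs."""
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(positions[i] - positions[j]) <= radii[i] + radii[j]:
                return False
    return True

# Illustrative example: leader at the origin, two followers in a triangle.
positions = [np.array([0.0, 0.0]), np.array([-0.9, -1.0]), np.array([0.9, -1.0])]
offsets = [np.zeros(2), np.array([-1.0, -1.0]), np.array([1.0, -1.0])]
radii = [0.3, 0.3, 0.3]

errors = [formation_error(p, positions[0], H) for p, H in zip(positions, offsets)]
print(errors, collision_free(positions, radii))
```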

To address relative over-generalization, I developed a double-timescale intrinsic reward mechanism that encourages robots to explore the joint observation space more effectively. This mechanism combines two types of intrinsic rewards: random network distillation (RND) and elliptical episodic bonuses (E3B). The RND intrinsic reward operates on a global timescale, promoting exploration of novel states over the entire training process. It is computed as the prediction error between a fixed target network $\eta$ and a trainable predictor network $\hat{\eta}$:

$$ r_{\text{RND}}(\mathbf{o}_t) = \| \eta(\mathbf{o}_t) - \hat{\eta}(\mathbf{o}_t) \|^2 $$

where $\mathbf{o}_t = [o_0^t, o_1^t, \dots, o_n^t]$ is the joint observation at time $t$. The E3B intrinsic reward, on the other hand, functions on an episodic timescale, incentivizing diverse state visits within a single episode. It uses an elliptical measure based on an embedding network $\phi$ trained via an inverse dynamics model $\delta$:

$$ b(\mathbf{o}_t) = \phi(\mathbf{o}_t)^T \mathbf{C}^{-1} \phi(\mathbf{o}_t) $$

with the covariance matrix $\mathbf{C}$ defined as:

$$ \mathbf{C} = \lambda \mathbf{I} + \sum_{i=1}^{t-1} \phi(\mathbf{o}_i) \phi(\mathbf{o}_i)^T $$

where $\lambda$ is a scaling coefficient. By combining these, the double-timescale intrinsic reward $r_t$ is given by:

$$ r_t = 2 \cdot r_{\text{RND}}(\mathbf{o}_t) \cdot b(\mathbf{o}_t) $$
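
A minimal PyTorch sketch of this double-timescale signal follows. The layer sizes, the `obs_dim` placeholder, and the untrained embedding network `phi` (in my method $\phi$ is trained through the inverse dynamics model $\delta$, and the predictor is trained to minimize the RND error) are simplifying assumptions; the computation mirrors the three equations above, with a Sherman-Morrison rank-one update keeping $\mathbf{C}^{-1}$ incremental within an episode.

```python
import torch
import torch.nn as nn

obs_dim, emb_dim, lam = 24, 16, 0.1  # placeholder sizes; lambda from Table 1

# RND: fixed target network eta and trainable predictor eta_hat.
target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))
for p in target.parameters():
    p.requires_grad_(False)  # the target stays frozen for the whole training run

# E3B: embedding network phi (trained with an inverse dynamics model in practice;
# randomly initialized here for illustration).
phi = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))

def rnd_reward(obs):
    """Global-timescale novelty: r_RND(o_t) = ||eta(o_t) - eta_hat(o_t)||^2."""
    with torch.no_grad():
        return (target(obs) - predictor(obs)).pow(2).sum()

class EllipticalBonus:
    """Episodic bonus b(o_t) = phi(o_t)^T C^{-1} phi(o_t); reset at episode start."""
    def __init__(self, dim, lam):
        self.cov_inv = torch.eye(dim) / lam  # C^{-1} with C = lambda * I

    def __call__(self, obs):
        with torch.no_grad():
            e = phi(obs)
            bonus = e @ self.cov_inv @ e          # uses observations up to t-1
            u = self.cov_inv @ e                  # Sherman-Morrison rank-one update
            self.cov_inv -= torch.outer(u, u) / (1.0 + e @ u)
        return bonus

e3b = EllipticalBonus(emb_dim, lam)
obs = torch.randn(obs_dim)               # stand-in joint observation o_t
r_t = 2.0 * rnd_reward(obs) * e3b(obs)   # double-timescale intrinsic reward
print(float(r_t))
```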

This reward is then integrated with the extrinsic rewards designed for the leader and follower robots. For the leader robot, the extrinsic reward $r_0^t$ includes penalties for collisions and bonuses for reaching the goal. For follower robots, the extrinsic reward $r_i^t$ combines formation maintenance and collision avoidance components:

$$ r_i^t = r_f^i(t) + r_c^i(t) $$

where $r_f^i(t)$ is based on the formation error $e_i^t = \| \mathbf{p}_i^t - \mathbf{p}_0^t - \mathbf{H}_i \|$, and $r_c^i(t)$ penalizes proximity to other agents. The total reward for each robot is:

$$ R_i^t = r_i^t + \beta r_t $$

with $\beta$ controlling the weight of the intrinsic reward. This reward design is embedded in a CTDE framework: during training, the robots share observations to learn a centralized critic, while at execution time each robot acts in a decentralized manner using only its local observations, preserving decentralized deployment while still benefiting from the shared intrinsic signal during training to foster better coordination and exploration.
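
As a usage note, the short sketch below shows how the shaping comes together for a single time step: each robot's extrinsic reward is computed from its own formation and clearance terms, and the shared intrinsic term is added with weight $\beta$. The specific penalty values and the safe-distance threshold are hypothetical; only the additive structure $R_i^t = r_i^t + \beta r_t$ comes from the formulation above.

```python
BETA = 1.0  # weight of the intrinsic reward (Table 1)

def follower_extrinsic(formation_err, min_clearance, safe_dist=0.2):
    """r_i^t = r_f^i(t) + r_c^i(t): formation keeping plus collision avoidance."""
    r_f = -formation_err                               # penalize formation error
    r_c = -0.25 if min_clearance < safe_dist else 0.0  # penalize near-collisions
    return r_f + r_c

def total_rewards(extrinsic_rewards, intrinsic_reward, beta=BETA):
    """R_i^t = r_i^t + beta * r_t, with the intrinsic term shared by all robots."""
    return [r + beta * intrinsic_reward for r in extrinsic_rewards]

# Example: a goal-reaching leader bonus plus two follower rewards.
extrinsic = [1.0, follower_extrinsic(0.4, 0.5), follower_extrinsic(0.7, 0.1)]
print(total_rewards(extrinsic, intrinsic_reward=0.05))
```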

In my experiments, I evaluated the proposed algorithm against state-of-the-art baselines, namely FCCADRL, MLGA2C, and SAMARL, in a simulated environment with dynamic pedestrians. The simulation involved three robots (one leader and two followers) aiming to maintain a triangular formation while navigating around five pedestrians controlled by ORCA and Social Force models. I used Python and PyTorch for implementation, with parameters summarized in Table 1.

Table 1: Experimental Parameters

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Discount Factor $\gamma$ | 0.99 | Batch Size | 256 |
| Activation Function | ReLU | Maximum Time Limit | 21 s |
| Scaling Coefficient $\lambda$ | 0.1 | Maximum Training Episodes | 80,000 |
| Learning Rate | 5e-4 | Weight $\beta$ | 1 |
| Time Step | 0.25 s | Replay Buffer Capacity | 200,000 |
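
For readers reimplementing the setup, these settings map directly onto a small configuration dictionary. The key names below are my own shorthand, and anything not listed in Table 1 (network widths, optimizer choice, and so on) is deliberately omitted.

```python
# Table 1 settings as a configuration dictionary; key names are illustrative.
CONFIG = {
    "gamma": 0.99,               # discount factor
    "batch_size": 256,
    "activation": "relu",
    "max_episode_time_s": 21.0,  # maximum time limit
    "e3b_lambda": 0.1,           # scaling coefficient lambda
    "max_train_episodes": 80_000,
    "learning_rate": 5e-4,
    "intrinsic_beta": 1.0,       # weight beta
    "time_step_s": 0.25,
    "replay_buffer_capacity": 200_000,
}
```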

The evaluation metrics included success rate, collision rate, navigation time, and formation error. After 80,000 training episodes, my algorithm demonstrated superior performance, with the training curves showing higher success rates and lower collision rates than the baselines. For instance, in a test scenario with five pedestrians, augmenting FCCADRL with the proposed intrinsic reward (JDIR-FCCADRL) raised the success rate from 90.2% to 91.7%, highlighting the benefits of enhanced exploration. Quantitative results across different pedestrian densities are summarized in Table 2.

Table 2: Quantitative Evaluation Results for Various Pedestrian Counts

| Pedestrians | Algorithm | Success Rate (%) | Collision Rate (%) | Navigation Time (s) | Formation Error (m) |
|---|---|---|---|---|---|
| 5 | MLGA2C | 92.3 | 7.7 | 9.57 | 0.68 |
| 5 | JDIR-MLGA2C | 93.7 | 6.3 | 9.52 | 0.64 |
| 5 | FCCADRL | 90.2 | 9.8 | 9.64 | 0.70 |
| 5 | JDIR-FCCADRL | 91.7 | 8.3 | 9.56 | 0.65 |
| 5 | SAMARL | 87.9 | 12.1 | 9.24 | 0.78 |
| 5 | JDIR-SAMARL | 89.8 | 10.2 | 9.15 | 0.74 |
| 7 | MLGA2C | 90.8 | 9.2 | 9.75 | 1.02 |
| 7 | JDIR-MLGA2C | 92.2 | 7.8 | 9.71 | 0.99 |
| 7 | FCCADRL | 89.0 | 11.0 | 9.95 | 1.05 |
| 7 | JDIR-FCCADRL | 90.5 | 9.5 | 9.84 | 0.95 |
| 7 | SAMARL | 86.4 | 13.6 | 9.42 | 1.14 |
| 7 | JDIR-SAMARL | 88.6 | 11.7 | 9.35 | 1.09 |
| 9 | MLGA2C | 88.1 | 11.9 | 10.08 | 1.37 |
| 9 | JDIR-MLGA2C | 89.5 | 10.5 | 9.94 | 1.33 |
| 9 | FCCADRL | 87.2 | 12.8 | 10.11 | 1.39 |
| 9 | JDIR-FCCADRL | 88.6 | 11.4 | 9.96 | 1.34 |
| 9 | SAMARL | 83.5 | 16.5 | 9.69 | 1.54 |
| 9 | JDIR-SAMARL | 85.6 | 14.4 | 9.58 | 1.48 |

Qualitative analysis further validated the effectiveness of my approach. For example, in trajectories generated by FCCADRL, followers often collided with pedestrians or exhibited large formation errors due to inefficient exploration. In contrast, my algorithm enabled smoother navigation with timely obstacle avoidance and quicker formation recovery. The use of intrinsic rewards motivated robots to discover optimal joint strategies, reducing instances of relative over-generalization. This is evident in the formation error curves, where my method maintained lower errors throughout the episodes. Additionally, I conducted ablation studies to dissect the contribution of each intrinsic reward component. Results in Table 3 show that combining RND and E3B yields the best performance, underscoring the synergy between global and episodic exploration in robot technology.

Table 3: Ablation Study Results

| Algorithm | Success Rate (%) | Collision Rate (%) | Navigation Time (s) | Formation Error (m) |
|---|---|---|---|---|
| FCCADRL | 90.2 | 9.8 | 9.64 | 0.70 |
| RND-FCCADRL | 91.0 | 9.0 | 9.59 | 0.67 |
| E3B-FCCADRL | 90.9 | 9.1 | 9.61 | 0.68 |
| JDIR-FCCADRL | 91.7 | 8.3 | 9.56 | 0.65 |

In conclusion, my work presents a novel intrinsic reward-based approach to multi-robot social formation navigation that effectively mitigates relative over-generalization. By integrating double-timescale intrinsic rewards within a CTDE framework, I have enhanced the exploration capabilities of robots, leading to improved coordination and performance in dynamic environments. The results affirm the potential of this method in advancing robot technology for real-world applications, such as search and rescue or automated logistics. Future research will focus on scaling to larger robot teams and exploring alternative intrinsic reward mechanisms to further push the boundaries of multi-robot systems. Through continuous innovation in robot technology, I aim to develop more intelligent and adaptive robotic solutions that can operate seamlessly in complex social settings.
