The integration of interactive technology has positioned robotic instruction and companionship as a transformative new paradigm in education, playing an increasingly vital role in early childhood learning. These intelligent robot systems offer the potential for real-time feedback on student engagement and enhanced pedagogical strategies. Technologies such as dialogue policy learning and speech emotion recognition models based on capsule networks present promising solutions for fostering children’s social skills and emotional well-being. However, persistent challenges remain in the domain of intelligent robot control, notably concerning insufficient information capture and inaccurate environmental perception. A critical issue is the robot’s inability to fully meet the demands of intelligent virtual simulation and its often inadequate recognition of children’s actions and learning contexts. While research into robot control and action recognition is abundant, specific investigations targeting the unique requirements of preschool education are relatively scarce. To address the weaknesses in human-robot interaction and the imprecise recognition of children’s postures in this setting, this study proposes and designs a novel intelligent robot control system based on the Salp Swarm Algorithm (SSA). The system aims to provide a new pedagogical approach for preschool education, ultimately enhancing the learning experience for young children.

1. Action Recognition Algorithm for Preschool Intelligent Robots Based on SSA
As technology advances, intelligent robot systems are finding broader application within preschool education. They offer young learners vivid and engaging educational experiences and provide institutions with more diverse teaching methodologies. Nevertheless, robotic simulation technology also introduces specific challenges, including cumbersome gesture-based online instruction and inaccurate recognition of students' emotions. To tackle the instability in motion capture for educational robots, this study proposes an action recognition algorithm based on SSA, designed to strengthen the system's autonomous ability to discern instructional gestures. The SSA is particularly effective at exploring diverse regions of the solution space while converging quickly. The backpropagation procedure for pose estimation, enhanced by SSA, is formulated as follows:
$$ z_l = w_{l-1}a_{l-1} + b_{l-1} $$
$$ a_l = f(z_l) $$
$$ \delta^L = (y - a_L) \odot \sigma'(z_L) $$
where \( x \) and \( l \) denote the input matrix and the current network-layer index, respectively (the input layer satisfies \( a_0 = x \)). The value obtained after the nonlinear activation \( f(\cdot) \) is represented by \( a \). The linear weighted sum and the bias of the current layer are \( z \) and \( b \), respectively, and \( w \) is the weight matrix. The weights and biases are updated as:
$$ w^l_{kj} \rightarrow (w^l_{kj})' = w^l_{kj} - \eta \frac{\partial f}{\partial w^l_{kj}} = w^l_{kj} - \eta \, a^{l-1}_k \delta^l_j $$
$$ b^l_{j} \rightarrow (b^l_{j})' = b^l_{j} - \eta \frac{\partial f}{\partial b^l_{j}} = b^l_{j} - \eta \, \delta^l_j $$
Here, \( \eta \) is the learning rate, and the partial derivatives for the weight and bias are computed separately. To further improve the target pose recognition accuracy of the SSA, a hybrid attention module is introduced to enhance the ability to aggregate target information. The Convolutional Block Attention Module (CBAM) strengthens feature representation while suppressing irrelevant and noisy information. The computation for the hybrid attention module is:
$$ F' = M_c(F) \otimes F $$
$$ F'' = M_s(F') \otimes F' $$
where \( F \) is the input feature map to the improved attention module, \( F' \) is the feature map weighted by the channel attention module, and \( F'' \) is the final target feature map processed by the spatial attention module. \( M_c(F) \) and \( M_s(F') \) represent the extraction operations of the channel and spatial attention modules, respectively. The design framework of the SSA-improved motion capture system follows a logical flow that integrates sensory input, data processing via the hybrid attention and SSA mechanisms, and output to a software platform for educational evaluation. The weight calculation for a feature channel in this system is given by:
$$ \omega_i = \sigma \left( \sum_{j=1}^{k} w_j y^i_j \right), \quad y^i_j \in \Omega^i_k $$
where \( y^i_j \) and \( \Omega^i_k \) represent the pooled set features and the set of adjacent feature channels, respectively, and \( k \) is the number of adjacent features. In summary, the SSA-based intelligent robot action recognition algorithm facilitates effective target pose matching for preschool children, offering a robust foundation for enhanced human-robot interaction in educational curricula.
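To make the optimization step concrete, the core SSA position updates described above (a leader salp searching around the current best "food source" while followers chain behind it) can be sketched as follows. The sphere objective, search bounds, and swarm hyperparameters below are illustrative placeholders, not the study's actual pose-estimation loss:

```python
import numpy as np

def salp_swarm(objective, lb, ub, n_salps=30, dim=2, iters=200, seed=0):
    """Minimal Salp Swarm Algorithm sketch: the leader explores around the
    best-known solution (food source); followers average with the salp ahead."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(n_salps, dim))        # initial salp positions
    fitness = np.apply_along_axis(objective, 1, X)
    best = X[np.argmin(fitness)].copy()                 # food source
    best_fit = fitness.min()
    for t in range(iters):
        # c1 decays over iterations, shifting from exploration to exploitation
        c1 = 2 * np.exp(-(4 * (t + 1) / iters) ** 2)
        for i in range(n_salps):
            if i == 0:                                  # leader update
                c2 = rng.uniform(0, 1, dim)
                c3 = rng.uniform(0, 1, dim)
                step = c1 * ((ub - lb) * c2 + lb)
                X[i] = np.where(c3 < 0.5, best + step, best - step)
            else:                                       # follower chain update
                X[i] = (X[i] + X[i - 1]) / 2
            X[i] = np.clip(X[i], lb, ub)
            f = objective(X[i])
            if f < best_fit:
                best_fit, best = f, X[i].copy()
    return best, best_fit
```

On a simple convex objective the chain rapidly collapses toward the leader, so the swarm converges once `c1` becomes small; in the pose-matching setting, the objective would instead score candidate network parameters or pose hypotheses.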
2. Optimization of the Intelligent Robot Control System via Multi-Sensor Information Fusion
While the SSA-based hybrid action recognition algorithm effectively identifies human movements, the robot's intelligent control capabilities require further enhancement. Robots are now extensively utilized across preschool education, offering improved interaction and instructional experiences for young children. However, preschool robots often suffer from motion-control failures and weak adaptability during interactions. To improve the robot's interactive capability with children, this study proposes a Multi-Sensor Information Fusion (MSIF)-based intelligent robot control system that builds upon the SSA action recognition algorithm to strengthen positioning and navigation. MSIF comprehensively processes data from multiple sensors, employing different fusion methods to achieve optimal state estimation. Within the MSIF framework, the Extended Kalman Filter (EKF) performs state prediction:
$$ \hat{x}_{k|k-1} = f(\hat{x}_{k-1}, u_{k-1}) $$
where \( u_{k-1} \) is the system input, assumed to have zero mean, and the initial state estimate is set to \( \hat{x}_0 \). The covariance matrix for the preschool robot system utilizing MSIF is represented as:
$$ \Sigma_{k-1} = \begin{bmatrix} k_r \Delta S_{r,\,k-1} & 0 \\ 0 & k_l \Delta S_{l,\,k-1} \end{bmatrix} $$
Here, \( k-1 \) denotes a specific time instance, \( k_l \) is a common constant for robot odometry error, \( k_r \) is a constant related to ground friction, and \( \Delta S_{r, k-1} \) and \( \Delta S_{l, k-1} \) are the distances moved by the robot's right and left wheels during time step \( k-1 \), respectively. The robot's position observation equation is given by:
$$ Z_k = h(X_k) = \begin{bmatrix} x_{uo,k} \\ y_{uo,k} \\ \theta_{uo,k} \end{bmatrix} = \begin{bmatrix} x_k \\ y_k \\ \theta_k \end{bmatrix} + \begin{bmatrix} n_{x,k} \\ n_{y,k} \\ n_{\theta,k} \end{bmatrix} $$
where \( x_{uo,k} \) and \( y_{uo,k} \) are the observed robot positions from a monocular camera, \( \theta_{uo,k} \) is the observed attitude from an Inertial Measurement Unit (IMU), and \( n_{x,k} \), \( n_{y,k} \), and \( n_{\theta,k} \) represent observation noises on the x-axis, y-axis, and attitude angle, respectively. To address the challenge of capturing dynamic data from a moving preschool robot, Stochastic Configuration Networks (SCNs) are further integrated to optimize the robot’s follow-up control system. SCNs can randomly generate network structures and weights, offering good generalization performance when dealing with complex nonlinear problems. The robot output definition after introducing SCNs is:
$$ u_{scn} = \hat{w}^T h = \hat{d} $$
where \( u_{scn} \) represents the output definition, \( \hat{d} \) is the estimated value of uncertain random parameters in the control system, \( \hat{w} \) denotes the output weight, and \( h \) is the output from the hidden layer. Consequently, this study proposes a comprehensive intelligent robot control system for preschool education that synergistically combines SSA and multi-sensor information fusion with SCNs. The system first involves researching preschool children’s behavior to understand their action habits. The intelligent robot then receives behavioral instructions to form an interactive feedback loop. To ensure accurate human-robot interaction, the SSA algorithm performs pose and action recognition, while MSIF technology and SCNs handle positioning and tracking control, enabling the system to generate effective responses. Finally, data on posture, dialogue interaction, and other parameters are transmitted in real-time to a platform terminal, meeting the intelligent demands of preschool education.
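One predict-update cycle of the MSIF pipeline above can be sketched as follows: the prediction propagates wheel-odometry uncertainty through the diagonal covariance \( \Sigma_{k-1} \), and the update fuses the camera position with the IMU heading. The wheel base, noise constants, and the identity observation model are illustrative assumptions, not values from the study:

```python
import numpy as np

def ekf_predict(x, P, dSr, dSl, b, kr=0.01, kl=0.01):
    """EKF prediction for a differential-drive robot.
    State x = [px, py, theta]; b is the wheel base (assumed);
    dSr/dSl are right/left wheel displacements since the last step."""
    dS = (dSr + dSl) / 2.0
    dTh = (dSr - dSl) / b
    th = x[2] + dTh / 2.0
    x_new = x + np.array([dS * np.cos(th), dS * np.sin(th), dTh])
    # Jacobian of the motion model w.r.t. the state
    F = np.array([[1.0, 0.0, -dS * np.sin(th)],
                  [0.0, 1.0,  dS * np.cos(th)],
                  [0.0, 0.0,  1.0]])
    # Wheel-space noise (the diagonal covariance Sigma_{k-1}),
    # mapped into state space by the first-order input Jacobian G
    Sigma_w = np.diag([kr * abs(dSr), kl * abs(dSl)])
    G = np.array([[0.5 * np.cos(th), 0.5 * np.cos(th)],
                  [0.5 * np.sin(th), 0.5 * np.sin(th)],
                  [1.0 / b,         -1.0 / b]])
    P_new = F @ P @ F.T + G @ Sigma_w @ G.T
    return x_new, P_new

def ekf_update(x, P, z, R):
    """Fuse a direct observation z = [x_cam, y_cam, theta_imu] (H = I)."""
    H = np.eye(3)
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(3) - K @ H) @ P
    return x_new, P_new
```

Because the camera and IMU observe the state directly in this sketch, the observation Jacobian is the identity; a real deployment would also need time alignment between the two sensors.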
3. Experimental Design and Setup
To validate the performance and application effectiveness of the proposed methodology, a structured experimental design was implemented. The experimental environment and key parameters are summarized below.
| Component | Specification |
|---|---|
| Hardware Platform | Main controller: STM32F103 embedded system. Sensors: Inertial Measurement Unit (IMU), monocular vision camera, ultrasonic sensors. |
| Software Environment | OS: Windows 10. Language: Python 3.8. Framework: PyTorch 1.9. |
| Computing Resources | RAM: 16 GB. GPU: NVIDIA GTX 980 (for neural network acceleration). |
| Datasets | UCF Sport (10 sports action classes) and HMDB51 (51 daily action classes). |
| Evaluation Metrics | Average Relative Error (ARE), Recognition Accuracy, Trajectory Error, User Satisfaction Score (0-100). |
| Experimental Scenarios | 1. Algorithm comparison on UCF Sport & HMDB51. 2. Path tracking in a 5m×5m square. 3. User interaction with 13 preschool children. |
4. Performance Analysis of the Improved Intelligent Robot Control System
4.1 Evaluation of the SSA-based Action Recognition Algorithm
To verify the performance of the SSA-fused robot action recognition algorithm, it was compared against three prevalent alternatives: Deep Neural Network (DNN), Deep Recurrent Q-Network (DRQN), and Generative Adversarial Network (GAN). The action pose test results on the UCF Sport and HMDB51 datasets are illustrated through characteristic curves. The actual motion angles fluctuated primarily between -35° and 35°. The SSA-fused algorithm's sampled curve largely coincided with the actual angle trajectory, particularly at sample points 300 and 700 on the UCF Sport dataset. In contrast, the DRQN algorithm deviated significantly; at sample point 200 on the HMDB51 dataset, for example, its estimated angle was -38°, differing from the actual value by -8°. These observations suggest that the SSA-fused algorithm recognizes different actions effectively.
To further validate the precision of the SSA-fused algorithm, the Average Relative Error (ARE) was calculated for 13 test subjects across all four algorithms. The results are summarized in the table below, highlighting key performance differences.
| Algorithm | Subjects with Max \|ARE\| | Notable ARE Values | Overall Precision Assessment |
|---|---|---|---|
| SSA-Fused | #2, #7, #11 | 3.1%, -2.8%, 3.2% | Maintains high precision across test subjects. |
| DRQN | – | Minimum ARE: -7.64% | Shows moderate error levels. |
| DNN | #8 | ARE: -41.68% | Exhibits large error spikes for some subjects. |
| GAN | – | Max positive ARE: 38.78%; min positive ARE: 14.18% | Shows a consistently high positive error bias. |
The data clearly indicates that the SSA-fused intelligent robot action recognition algorithm maintains a high level of precision in posture detection across different test subjects, with a maximum absolute ARE of only 3.2%, significantly outperforming the compared methods.
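For reference, the signed Average Relative Error used in this comparison can be computed as below. This is one common definition (mean relative deviation in percent, assuming nonzero ground-truth values); the study's exact per-subject aggregation may differ:

```python
import numpy as np

def average_relative_error(estimated, actual):
    """Signed Average Relative Error (ARE) in percent between estimated
    and ground-truth pose angles; the sign indicates over- or
    under-estimation bias. Assumes all ground-truth values are nonzero."""
    estimated = np.asarray(estimated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    rel = (estimated - actual) / actual     # per-sample relative error
    return float(np.mean(rel) * 100.0)
```

Under this definition, the GAN system's strictly positive AREs (14.18% to 38.78%) would indicate systematic over-estimation, whereas the SSA-fused values scatter around zero.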
4.2 Performance Verification of the SSA-Improved Intelligent Control System
Following the validation of the action recognition component, the performance of the complete preschool intelligent robot control system was analyzed and compared against systems built upon DRQN, DNN, and GAN algorithms. The experimental setup utilized a TP-LINK TL-WDR5620 router for secure LAN data transmission, a DiMP tracker for image feature extraction, and the TC-Bot platform for building the robot dialogue simulation system. The trajectory estimation test involved navigating a square path with sides of 5 meters.
The proposed SSA-based intelligent robot control system demonstrated a high degree of accuracy. Its estimated trajectory closely followed the actual path, with perfect coincidence at the two right-angle turns when both x and y axes reached 5 meters. At a point where x=6m and y=5m, a minor deviation of only 0.2m was observed. In stark contrast, the DNN-based system showed significant deviation at the turns, with errors of approximately 0.4m on the x-axis and 0.2m on the y-axis compared to the actual trajectory. This underscores the superior control efficacy and higher estimation accuracy of the SSA-based system.
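The trajectory comparison above reduces to per-point deviations between the estimated and reference paths. A minimal sketch, assuming both paths are sampled at matching waypoints as N×2 arrays of metre coordinates:

```python
import numpy as np

def trajectory_deviation(estimated, reference):
    """Per-waypoint Euclidean deviation between an estimated path and a
    reference path (both N x 2 arrays of x, y in metres).
    Returns the maximum and mean deviation."""
    est = np.asarray(estimated, dtype=float)
    ref = np.asarray(reference, dtype=float)
    d = np.linalg.norm(est - ref, axis=1)   # distance at each waypoint
    return float(d.max()), float(d.mean())
```

For unmatched samplings, each estimated point would instead be compared against its nearest point on the reference path, but the matched-waypoint form suffices to reproduce figures such as the 0.2 m deviation reported above.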
To test the action recognition accuracy within the control system context, different preset actions were evaluated. The recognition accuracy results are compared below:
| Control System | Action: Hands Up Stretch | Action: Single-Leg Stand | Action: Clasp Hands | Action: Clapping |
|---|---|---|---|---|
| SSA-Based | 0.99 | 0.99 | 0.99 | 0.98 |
| DRQN-Based | 0.94 | 0.93 | 0.95 | 0.92 |
| DNN-Based | 0.88 | 0.90 | 0.91 | 0.89 |
| GAN-Based | 0.95 | 0.94 | 0.95 | 0.96 |
The SSA-based intelligent robot control system achieved the highest accuracy across all tested actions, scoring 0.99 on three of the four actions, demonstrating robust and precise recognition capabilities.
Finally, a user experience test was conducted with the SSA-based system. The results, gathered from multiple users, are summarized as follows:
| Evaluation Dimension | Score Range (Typical) | Highest Score | Key Observation |
|---|---|---|---|
| Overall Satisfaction | 85 – 100 | 98 | Majority of users rated the experience highly. |
| Interaction Naturalness | 75 – 95 | 95 | Generally high, with room for improvement in fluidity. |
| Emotional Experience | 70 – 92 | 92 | Good, though some users (subjects 3-7) reported lower scores. |
| Interaction Response Speed | 88 – 96 | 96 | Rated as fast and responsive. |
| Privacy Protection Perception | 85 – 92 | 92 | Users felt confident about data safety. |
| Human-Robot Collaboration | 88 – 96 | 96 | High satisfaction with cooperative tasks. |
The system successfully met user expectations, receiving high satisfaction scores across most dimensions. The areas of emotional experience and interaction naturalness, while generally positive, indicate potential targets for future refinement to make the intelligent robot interaction even more seamless and emotionally resonant.
5. Conclusion
To address deficiencies in human-robot interaction within preschool settings, this study constructed an intelligent robot control system based on the Salp Swarm Algorithm (SSA), designed for intelligent information capture and emotional communication with young children. The experimental results demonstrate that the SSA-fused robot action recognition algorithm closely matched actual angle motion trajectories at key sampling points. The proposed SSA-based preschool intelligent robot control system achieved a recognition accuracy of 0.99 for actions like clasping hands, outperforming DRQN, DNN, and GAN-based systems which scored 0.95, 0.91, and 0.95 respectively for the same action. Furthermore, the system exhibited high trajectory tracking accuracy and received strong user satisfaction ratings. This research presents innovations in posture information capture and human-robot interaction specifically for preschool education. However, the study did not delve deeply into the analysis of children’s vocal or facial data. Therefore, future work on this intelligent robot platform will involve more detailed research into multimodal data integration, including speech and facial expression analysis, to create an even more comprehensive and responsive educational companion.
