In recent years, the rapid advancement of robot technology has led to its integration across various industries to enhance operational efficiency and safety. In the power sector, live operations on substation switchgear pose significant risks due to high-voltage environments and the need for precise manual interventions. These tasks require skilled personnel and involve potential hazards, highlighting the urgent need for autonomous solutions. Our work addresses this challenge by developing an embodied intelligent robot capable of performing autonomous operations on switchgear components. We propose a comprehensive framework centered on “one system, three tasks,” which integrates hardware design with a multi-task perception algorithm. This approach enables the robot to detect operation targets, regress operational postures, and determine force application points through an end-to-end deep learning network. By combining robot technology with embodied intelligence principles, our system achieves stable and efficient performance in real-world scenarios, as demonstrated through rigorous testing.
The overall design of the autonomous robot incorporates a modular structure to facilitate adaptability in diverse operational environments. Key components include a four-wheel drive chassis, control system modules, interchangeable tools, and a robotic arm. The chassis measures 738 mm in length, 500 mm in width, and 338 mm in height, with a vertical load capacity of 100 kg. It is powered by a 48 V, 24 Ah lithium battery, which features a quick-release design for easy maintenance. The robotic arm, weighing 18 kg, has a payload capacity of 4 kg at its end-effector and is equipped with a universal quick-change interface. This interface allows for seamless switching between various tools, such as those for pressing buttons, turning knobs, and operating handcarts, all mounted on a tool rack with standardized mechanical and electrical connections. The control system is hierarchically organized, comprising a main industrial computer that coordinates navigation, motion control, and perception modules. This integration ensures real-time responsiveness and reliability, critical for high-stakes operations in substations.

To enable multifunctional capabilities, we designed an automated tool-changing mechanism that operates without additional power sources. The process follows a four-step sequence: contact, shift, extract, and engage, driven by the rotation of the robotic arm’s end-axis. Each tool is assigned a unique code for identification, and the quick-change assembly includes terminals for transmitting power and communication signals, supporting up to 8 channels. This design ensures that tools can be swapped efficiently while maintaining electrical and data connectivity, as summarized in Table 1; a minimal control sketch follows the table. The tool engagement force during the contact step can be modeled as $$ F_t = k \cdot \Delta x $$ where \( F_t \) is the tool engagement force, \( k \) is the stiffness coefficient, and \( \Delta x \) is the displacement during contact. This mechanism underscores the role of robot technology in enhancing operational flexibility.
Table 1. Quick-change tool interface parameters.

| Parameter | Value |
|---|---|
| Tool Types | 3 (Button, Knob, Handcart) |
| Quick-Change Channels | 8 |
| Engagement Time | < 5 seconds |
| Power Transmission | 48 V DC |
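To make the sequence concrete, the following Python sketch walks through the four steps and the engagement-force check \( F_t = k \cdot \Delta x \). The `ToolChanger` class, its stiffness and force-threshold values, and the printed steps are illustrative assumptions, not the robot’s actual control interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tool:
    code: int   # unique identification code assigned to each tool
    name: str

class ToolChanger:
    """Drives the contact -> shift -> extract -> engage sequence (hypothetical API)."""
    def __init__(self, stiffness_k: float = 2500.0, min_engage_force: float = 5.0):
        self.k = stiffness_k            # N/m, assumed stiffness coefficient k
        self.f_min = min_engage_force   # N, assumed engagement threshold
        self.current: Optional[Tool] = None

    def engagement_force(self, dx: float) -> float:
        """F_t = k * dx, with dx the contact displacement in metres."""
        return self.k * dx

    def swap(self, new_tool: Tool, dx: float) -> None:
        # Four-step sequence driven by rotation of the arm's end-axis.
        for step in ("contact", "shift", "extract", "engage"):
            print(f"end-axis rotation step: {step}")
        f_t = self.engagement_force(dx)
        if f_t < self.f_min:
            raise RuntimeError(f"engagement force {f_t:.1f} N below threshold")
        self.current = new_tool
        print(f"engaged tool {new_tool.code} ({new_tool.name}), F_t = {f_t:.1f} N")

ToolChanger().swap(Tool(code=2, name="knob"), dx=0.004)
```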
The core of our autonomous operation lies in the multi-task perception algorithm, which integrates target detection, posture regression, and force point determination into a single end-to-end deep learning network. This architecture consists of a backbone network, a neck network, and a multi-task head. The backbone employs a lightweight structure to extract features from input images, producing outputs at multiple scales (P3, P4, P5, P6). The neck utilizes a Bidirectional Feature Pyramid Network (BiFPN) to fuse these features, enhancing the model’s ability to handle varying object sizes. Finally, the multi-task head splits into three branches: one for detecting operation targets, one for regressing the robotic arm’s posture, and one for identifying key points for force application. The overall network can be represented by the function $$ \mathcal{F}(I) = \mathcal{H}_{\text{task}} \circ \mathcal{N}_{\text{BiFPN}} \circ \mathcal{B}_{\text{light}}(I) $$ where \( I \) is the input image, \( \mathcal{B}_{\text{light}} \) is the backbone, \( \mathcal{N}_{\text{BiFPN}} \) is the neck, and \( \mathcal{H}_{\text{task}} \) is the multi-task head.
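As a schematic illustration of this composition, the PyTorch sketch below wires a placeholder backbone, neck, and three-branch head in the order \( \mathcal{H}_{\text{task}} \circ \mathcal{N}_{\text{BiFPN}} \circ \mathcal{B}_{\text{light}} \). Every layer choice, channel width, and head dimension is an assumption made for readability, not the paper’s actual configuration; the weighted BiFPN fusion itself is sketched separately further below.

```python
import torch
import torch.nn as nn

class BackboneLight(nn.Module):
    """Lightweight backbone emitting features at strides 8/16/32/64 (P3-P6)."""
    def __init__(self, c=32):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, c, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.stages = nn.ModuleList(
            [nn.Conv2d(c, c, 3, stride=2, padding=1) for _ in range(4)])

    def forward(self, x):
        x, feats = self.stem(x), []
        for stage in self.stages:   # each stage halves resolution
            x = stage(x)
            feats.append(x)
        return feats                # [P3, P4, P5, P6]

class BiFPNNeck(nn.Module):
    """Stand-in for the BiFPN: a per-level projection only."""
    def __init__(self, c=32):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, c, 1) for _ in range(4)])

    def forward(self, feats):
        return [p(f) for p, f in zip(self.proj, feats)]

class MultiTaskHead(nn.Module):
    """Three branches: detection, posture regression, key-point heatmaps."""
    def __init__(self, c=32, n_cls=3, n_angles=3, n_kpts=5):
        super().__init__()
        self.det = nn.Conv2d(c, n_cls + 4, 1)   # class scores + box (x, y, w, h)
        self.pose = nn.Conv2d(c, n_angles, 1)   # rotation angles
        self.kpt = nn.Conv2d(c, n_kpts, 1)      # force-point heatmaps

    def forward(self, feats):
        f = feats[0]                # finest level (P3), for illustration
        return self.det(f), self.pose(f), self.kpt(f)

class PerceptionNet(nn.Module):
    """F(I) = H_task o N_BiFPN o B_light."""
    def __init__(self):
        super().__init__()
        self.backbone, self.neck, self.head = BackboneLight(), BiFPNNeck(), MultiTaskHead()

    def forward(self, img):
        return self.head(self.neck(self.backbone(img)))

det, pose, kpt = PerceptionNet()(torch.randn(1, 3, 256, 256))
```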
The backbone network incorporates identity mapping and linear operations to concatenate feature maps efficiently. It comprises two lightweight modules: one expands channels to improve feature extraction, while the other reduces channels to match output dimensions. The feature transformation can be expressed as $$ F_{\text{out}} = \text{Concat}(F_{\text{identity}}, W \cdot F_{\text{in}}) $$ where \( F_{\text{in}} \) and \( F_{\text{out}} \) are input and output features, \( W \) is a weight matrix, and Concat denotes channel concatenation. This design minimizes computational overhead while maintaining high accuracy, essential for real-time applications in robot technology.
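A minimal sketch of this concatenation pattern, in the spirit of ghost-style lightweight modules, is shown below; the 1×1 primary convolution and the depthwise “cheap” convolution standing in for \( W \) are assumptions, since the exact layers are not specified here. The channel-expanding variant is shown; the channel-reducing counterpart would swap the output widths.

```python
import torch
import torch.nn as nn

class LightModule(nn.Module):
    """F_out = Concat(F_identity, W * F_in): half the output channels come
    from a primary conv, the other half from a cheap depthwise transform."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_primary = c_out // 2
        # Primary path: ordinary 1x1 conv producing half the output channels.
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_primary, 1, bias=False),
            nn.BatchNorm2d(c_primary), nn.ReLU(inplace=True))
        # Cheap path: depthwise conv (the linear operation W) on those features.
        self.cheap = nn.Sequential(
            nn.Conv2d(c_primary, c_out - c_primary, 3, padding=1,
                      groups=c_primary, bias=False),
            nn.BatchNorm2d(c_out - c_primary), nn.ReLU(inplace=True))

    def forward(self, x):
        identity = self.primary(x)
        return torch.cat([identity, self.cheap(identity)], dim=1)

y = LightModule(32, 64)(torch.randn(1, 32, 64, 64))
assert y.shape == (1, 64, 64, 64)   # channels expanded 32 -> 64
```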
The BiFPN neck network assigns learnable weights to different input features during fusion, prioritizing more informative layers. The weighted fusion output \( o \) is computed as $$ o = \sum_{i} \frac{w_i \cdot I_i}{\epsilon + \sum_{j} w_j} $$ where \( w_i \) and \( w_j \) are learnable weights, \( I_i \) is the input feature, and \( \epsilon \) is a small constant to avoid division by zero. This approach enhances the network’s representational power by dynamically adjusting to feature importance, a key advancement in robot technology for complex environments.
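This fast normalized fusion can be sketched in a few lines; clamping the learnable weights with ReLU follows the common BiFPN recipe and is an assumption about this particular implementation.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """o = sum_i (w_i * I_i) / (eps + sum_j w_j) with learnable weights."""
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))  # one weight per input
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.w)   # keep weights non-negative before normalizing
        return sum(wi * x for wi, x in zip(w, inputs)) / (self.eps + w.sum())

fuse = WeightedFusion(n_inputs=2)
a, b = torch.randn(1, 32, 40, 40), torch.randn(1, 32, 40, 40)
out = fuse([a, b])   # same shape as the inputs, weighted by learned importance
```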
The multi-task head processes fused features to generate predictions for each subtask. The target detection head outputs class probabilities and bounding boxes using decoupled convolutions. For a feature position \( i \), the predicted class \( \hat{c}_i \) is given by $$ \hat{c}_i = \arg \max_c P(c \mid f_i) $$ and the bounding box \( (\hat{b}_{ix}, \hat{b}_{iy}, \hat{b}_{iw}, \hat{b}_{ih}) \) is derived as $$ (\hat{b}_{ix}, \hat{b}_{iy}, \hat{b}_{iw}, \hat{b}_{ih}) = \arg \max_b P(b \mid f_i) $$ where \( f_i \) is the feature representation. The posture regression head estimates the robotic arm’s orientation relative to the base coordinate frame, outputting rotation angles \( \hat{\theta}_i \) as $$ \hat{\theta}_i = \arg \max_\theta P(\theta \mid f_i) $$ The key point detection head employs a stacked hourglass network to identify force application points \( \hat{K}_i \) using $$ \hat{K}_i = \arg \max_K P(K \mid f_i) $$ This integration allows the robot to perform precise operations autonomously.
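The following sketch shows how the three head outputs could be decoded into the predictions above: the class via an arg max over class probabilities, boxes and angles as direct readouts, and force points as heatmap peaks. The tensor layouts and the global-average pose readout are illustrative assumptions.

```python
import torch

def decode(det, pose, kpt, n_cls=3):
    # det: [B, n_cls+4, H, W] -> class probabilities + box (x, y, w, h)
    cls_prob = det[:, :n_cls].softmax(dim=1)
    c_hat = cls_prob.argmax(dim=1)          # predicted class per feature position
    box_hat = det[:, n_cls:]                # (b_x, b_y, b_w, b_h) per position
    # pose: [B, n_angles, H, W]; global average as a crude readout (assumption)
    theta_hat = pose.mean(dim=(2, 3))       # rotation angles [B, n_angles]
    # kpt: [B, K, H, W] heatmaps; peak location = force application point
    B, K, H, W = kpt.shape
    flat = kpt.flatten(2).argmax(dim=2)     # index of the max per heatmap
    k_hat = torch.stack((flat % W, flat // W), dim=2)  # (x, y) per key point
    return c_hat, box_hat, theta_hat, k_hat

det = torch.randn(1, 7, 32, 32)    # 3 classes + 4 box params
pose = torch.randn(1, 3, 32, 32)
kpt = torch.randn(1, 5, 32, 32)    # e.g. five boundary points for a knob
c, b, t, k = decode(det, pose, kpt)
```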
Training the network involves an end-to-end approach with a multi-task loss function defined as $$ l = \lambda l_1 + \alpha l_2 + \beta l_3 $$ where \( l_1 \) is the detection loss, \( l_2 \) is the posture loss, and \( l_3 \) is the key point loss, with \( \lambda \), \( \alpha \), and \( \beta \) as weighting coefficients. The detection loss \( l_1 \) combines a classification loss (binary cross-entropy), an objectness loss based on a generalized distribution, and a bounding box loss (complete IoU). The posture loss \( l_2 \) uses a dynamically scaled cross-entropy loss to handle class imbalance, and the key point loss \( l_3 \) applies a smooth L1 loss for robustness against outliers. This comprehensive loss function ensures balanced learning across tasks, a cornerstone of reliable robot technology.
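A minimal sketch of the weighted loss is given below. The detection term is reduced to its classification (BCE) part for brevity, the posture term uses a focal-style dynamically scaled cross-entropy over assumed angle bins, and the key-point term uses smooth L1, matching the text; the weight values and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Dynamically scaled cross-entropy: down-weights well-classified examples."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                     # probability of the true class
    return ((1 - p_t) ** gamma * ce).mean()

def multitask_loss(cls_logits, cls_t, pose_logits, pose_t, kpt_pred, kpt_t,
                   lam=1.0, alpha=0.5, beta=0.5):   # lambda, alpha, beta assumed
    l1 = F.binary_cross_entropy_with_logits(cls_logits, cls_t)  # detection (cls part only)
    l2 = focal_loss(pose_logits, pose_t)                        # posture, imbalance-aware
    l3 = F.smooth_l1_loss(kpt_pred, kpt_t)                      # key points, outlier-robust
    return lam * l1 + alpha * l2 + beta * l3

loss = multitask_loss(
    torch.randn(8, 3), torch.rand(8, 3),                # class logits vs soft targets
    torch.randn(8, 12), torch.randint(0, 12, (8,)),     # pose over 12 assumed angle bins
    torch.randn(8, 5, 2), torch.randn(8, 5, 2))         # key-point coordinates
```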
We conducted extensive experiments to validate the robot’s performance in functional and precision tests. The functional tests simulated navigation errors of up to ±10 cm and varied initial poses, assessing the robot’s ability to operate buttons, knobs, and handcarts. For button operations, the algorithm identified the control panel, computed its pose, and regressed key points for pressing. Knob operations involved detecting the knob axis and five boundary points for rotation. Handcart operations focused on locating the operation hole and its center for engagement. All tasks were executed successfully, demonstrating the system’s robustness. Precision tests comprised 10 repeated trials per task, comparing results against manually taught ground truths. Position and orientation errors were recorded, as shown in Table 2 for button operations. The mean position error was under 1 mm, and orientation errors were below 1 degree, confirming the algorithm’s high accuracy.
Table 2. Precision test results for button operations over 10 repeated trials.

| Parameter | Ground Truth | Max Error |
|---|---|---|
| x (mm) | 96 | 0.83 |
| y (mm) | 26 | 0.76 |
| z (mm) | 672 | 0.76 |
| Rx (°) | -90 | 0.74 |
| Ry (°) | 0 | 0.99 |
| Rz (°) | 0 | 0.53 |
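For illustration, the bookkeeping behind such a table can be reproduced in a few lines; the trial data below are synthetic, generated around the taught ground truth, not measurements from the robot.

```python
import numpy as np

# Manually taught ground-truth pose: x, y, z in mm; Rx, Ry, Rz in degrees.
ground_truth = np.array([96.0, 26.0, 672.0, -90.0, 0.0, 0.0])
# 10 repeated trials, here simulated with small random deviations.
trials = ground_truth + np.random.uniform(-1.0, 1.0, size=(10, 6))

max_err = np.abs(trials - ground_truth).max(axis=0)   # max error per component
for name, e in zip(("x", "y", "z", "Rx", "Ry", "Rz"), max_err):
    print(f"{name}: max |error| = {e:.2f}")
```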
The success of these tests underscores the effectiveness of our embodied intelligence approach. By fusing hardware design with advanced algorithms, the robot achieves sub-millimeter precision in positioning and sub-degree accuracy in orientation, meeting the stringent requirements of substation operations. The multi-task perception algorithm processes images in real-time, enabling the robot to adapt to environmental variations. This capability is vital for scaling robot technology in critical infrastructure. Furthermore, the tool-changing mechanism ensures operational diversity without human intervention, reducing downtime and enhancing safety.
In conclusion, our work presents a significant step forward in autonomous robot technology for power systems. The integration of embodied intelligence principles allows the robot to perceive, decide, and act in dynamic environments, fulfilling complex tasks with high reliability. Future directions include expanding the algorithm to handle more switchgear components and improving network efficiency for edge deployment. As robot technology evolves, such systems will play a pivotal role in modernizing infrastructure, ensuring safety, and boosting productivity. The demonstrated synergy between software and hardware in our robot sets a benchmark for autonomous operations in high-risk industries.
