Design of a Visual Recognition System for Humanoid Robots Using Convolutional Neural Networks

In recent years, the rapid advancement of artificial intelligence has significantly impacted various fields, with humanoid robots emerging as a key application. These robots aim to mimic human behavior and cognition, enabling more natural interactions and collaborations. One critical task in robot technology is line following, where robots must accurately detect and follow paths in real-world environments. Traditional methods often face challenges such as high computational complexity, low recognition accuracy, and slow processing speeds, limiting their practical deployment. To address these issues, I developed a visual recognition system based on an improved convolutional neural network (CNN) that combines the lightweight MobileNetV2 architecture with the Convolutional Block Attention Module (CBAM). This approach reduces model parameters while enhancing accuracy, making it suitable for embedded systems in robot technology applications. In this paper, I describe the design, implementation, and testing of this system, highlighting its efficiency and reliability in line-following tasks for humanoid robots.

The core of my design focuses on achieving real-time performance on embedded hardware. The humanoid robot I designed has 12 degrees of freedom, simulating human-like movements in the head, arms, torso, and legs. It uses two types of servos: RD3115 servos for basic motions such as walking and turning, and SG90 servos for lightweight components such as the arms and the camera gimbal. This configuration provides the stability and flexibility needed in dynamic environments. The control system integrates an image recognition unit and a servo motion unit, forming a closed loop in which images are captured, processed, and used to drive the robot's actions. Key components include an STM32F103C8T6 microcontroller for command processing, an OpenMV4 H7 PLUS module for neural-network-based image recognition, and a servo driver board capable of controlling up to 24 servos. This hardware setup enables efficient data flow and responsive control.
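To illustrate the closed-loop data flow, the following is a minimal sketch of how the OpenMV module might forward a classification result to the STM32 over a UART link. The single-byte command encoding, UART bus number, and baud rate are illustrative assumptions, not the exact protocol used on the robot.

```python
# Minimal sketch: OpenMV -> STM32 command link over UART.
# The bus number, baud rate, and one-byte command encoding are
# illustrative assumptions, not the robot's actual protocol.
from pyb import UART

uart = UART(3, 115200)  # UART 3 maps to the OpenMV header's TX/RX pair on many boards

# Hypothetical mapping from classifier output index to a command byte
COMMANDS = {0: b"\x00",   # move straight
            1: b"\x01",   # turn left
            2: b"\x02"}   # turn right

def send_command(class_index):
    """Forward the predicted class to the STM32 servo controller."""
    uart.write(COMMANDS.get(class_index, b"\x00"))
```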

For the visual recognition system, I constructed a dataset specifically for line-following tasks, covering three categories: moving straight, turning left, and turning right. The images were captured with an OpenMV camera to reflect real-world conditions, and preprocessing included binarization to remove noise and enhance feature extraction. Binarization converts each image to binary form, highlighting the track line against the background. To improve generalization, I applied data augmentation techniques such as brightness adjustment, contrast variation, rotation, Gaussian noise addition, and blurring. This produced a roughly balanced dataset of 150 left-turn images, 150 right-turn images, and 128 straight-moving images. The augmentation mitigates issues such as lighting changes and motion blur, which are common in real-world deployments, ensuring robust performance.
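As a concrete illustration, the sketch below shows one way the binarization and augmentation steps could be implemented with OpenCV and NumPy. The threshold value and augmentation parameters are illustrative assumptions; the exact values used in the pipeline are not specified here.

```python
# Sketch of the preprocessing pipeline: binarization plus simple augmentations.
# Threshold and augmentation parameters are illustrative assumptions.
import cv2
import numpy as np

def binarize(image_bgr, threshold=128):
    """Convert a track image to binary form so the line stands out from the background."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    return binary

def augment(image):
    """Apply one random augmentation: brightness/contrast, rotation, noise, or blur."""
    choice = np.random.randint(4)
    if choice == 0:                                   # brightness / contrast adjustment
        image = cv2.convertScaleAbs(image,
                                    alpha=np.random.uniform(0.8, 1.2),
                                    beta=np.random.uniform(-40, 40))
    elif choice == 1:                                 # small rotation
        h, w = image.shape[:2]
        m = cv2.getRotationMatrix2D((w / 2, h / 2), np.random.uniform(-15, 15), 1.0)
        image = cv2.warpAffine(image, m, (w, h))
    elif choice == 2:                                 # Gaussian noise
        noise = np.random.normal(0, 10, image.shape).astype(np.float32)
        image = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    else:                                             # blur (simulates motion blur)
        image = cv2.GaussianBlur(image, (5, 5), 0)
    return image
```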

The neural network model is based on MobileNetV2, a lightweight CNN that employs depthwise separable convolution to reduce computational cost. This decomposition replaces standard convolution with a depthwise convolution followed by a pointwise convolution, significantly lowering the number of parameters. Mathematically, standard convolution for an input feature map $X \in \mathbb{R}^{H \times W \times C}$ with a kernel $W_{\text{std}} \in \mathbb{R}^{K \times K \times C \times M}$ produces an output $Y \in \mathbb{R}^{H' \times W' \times M}$ as follows: $$ Y = X * W_{\text{std}} + b $$ where $*$ denotes the convolution operation and $b$ is the bias term. In depthwise separable convolution, this is split into two steps: depthwise convolution applies a separate $K \times K$ kernel to each input channel, yielding $Y_{\text{depth}} \in \mathbb{R}^{H' \times W' \times C}$: $$ Y_{\text{depth}} = X \odot W_{\text{depth}} $$ where $\odot$ denotes channel-wise convolution. Then, pointwise convolution uses a $1 \times 1$ kernel $W_{\text{point}} \in \mathbb{R}^{1 \times 1 \times C \times M}$ to combine channels: $$ Y = Y_{\text{depth}} * W_{\text{point}} + b $$ This reduces the parameter count from $O(K^2 \cdot C \cdot M)$ to $O(K^2 \cdot C + C \cdot M)$, making the model well suited to resource-constrained embedded hardware. MobileNetV2 also incorporates inverted residual blocks, which first expand the channel dimension with a $1 \times 1$ convolution, apply a depthwise convolution, and then project back with another $1 \times 1$ convolution. This structure, combined with linear bottlenecks, preserves features and mitigates gradient vanishing, enhancing performance in visual recognition tasks.
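To make the parameter saving concrete, the short sketch below builds a standard 3×3 convolution and its depthwise separable counterpart in Keras and compares their parameter counts. The channel sizes (32 in, 64 out) and feature-map size are arbitrary illustration values, not the layer sizes of the deployed network.

```python
# Sketch comparing parameter counts of a standard 3x3 convolution with its
# depthwise separable equivalent (K=3, C=32 input channels, M=64 output channels).
# The specific sizes are illustrative, not taken from the deployed model.
import tensorflow as tf
from tensorflow.keras import layers

# Standard convolution: K*K*C*M weights (plus M biases)
standard = tf.keras.Sequential([layers.Conv2D(64, 3, padding="same")])
standard.build((None, 56, 56, 32))

# Depthwise separable convolution: K*K*C depthwise weights + C*M pointwise weights
separable = tf.keras.Sequential([
    layers.DepthwiseConv2D(3, padding="same"),  # one K x K filter per input channel
    layers.Conv2D(64, 1),                       # 1 x 1 pointwise projection to M channels
])
separable.build((None, 56, 56, 32))

print("standard conv parameters: ", standard.count_params())   # ~18.5k
print("separable conv parameters:", separable.count_params())  # ~2.4k, roughly 7x fewer
```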

To further improve accuracy, I integrated the CBAM attention mechanism into MobileNetV2. CBAM sequentially infers attention maps along the channel and spatial dimensions, allowing the model to focus on critical features. The channel attention module compresses the spatial dimensions using global max pooling and average pooling, followed by a shared multilayer perceptron (MLP) with a reduction ratio $r$. The channel attention $M_c$ is computed as: $$ M_c(F) = \sigma(\text{MLP}(\text{AvgPool}(F)) + \text{MLP}(\text{MaxPool}(F))) $$ where $F$ is the input feature map and $\sigma$ is the sigmoid function. The spatial attention module then processes the channel-refined features by applying max and average pooling along the channel axis, concatenating the results, and convolving with a $7 \times 7$ kernel to produce a spatial attention map $M_s$: $$ M_s(F) = \sigma(f^{7 \times 7}([\text{AvgPool}(F); \text{MaxPool}(F)])) $$ where $f^{7 \times 7}$ denotes a convolution operation. The final refined feature map is obtained by: $$ F' = M_c(F) \otimes F $$ $$ F'' = M_s(F') \otimes F' $$ where $\otimes$ denotes element-wise multiplication. By adding CBAM after the last bottleneck layer in MobileNetV2, the model achieves better focus on the track lines, boosting accuracy without a substantial increase in parameters. This integration is a key element of my approach, as it enhances feature discrimination in cluttered environments.
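A minimal Keras sketch of such a CBAM block is shown below; it follows the equations above, with the reduction ratio (here $r = 8$) chosen for illustration rather than taken from the deployed model.

```python
# Sketch of a CBAM block as a Keras layer, mirroring the channel and spatial
# attention equations above. The reduction ratio r=8 is an illustrative choice.
import tensorflow as tf
from tensorflow.keras import layers

class CBAM(layers.Layer):
    def __init__(self, channels, reduction_ratio=8, **kwargs):
        super().__init__(**kwargs)
        self.channels = channels
        # Shared MLP used by the channel attention branch
        self.mlp_hidden = layers.Dense(channels // reduction_ratio, activation="relu")
        self.mlp_out = layers.Dense(channels)
        # 7x7 convolution used by the spatial attention branch
        self.spatial_conv = layers.Conv2D(1, 7, padding="same", activation="sigmoid")

    def call(self, f):
        # Channel attention: Mc(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        avg = tf.reduce_mean(f, axis=[1, 2])          # (batch, C)
        mx = tf.reduce_max(f, axis=[1, 2])            # (batch, C)
        mc = tf.sigmoid(self.mlp_out(self.mlp_hidden(avg)) +
                        self.mlp_out(self.mlp_hidden(mx)))
        f_prime = f * tf.reshape(mc, (-1, 1, 1, self.channels))   # F' = Mc(F) * F

        # Spatial attention: pool along the channel axis, concatenate, 7x7 convolution
        avg_map = tf.reduce_mean(f_prime, axis=-1, keepdims=True)
        max_map = tf.reduce_max(f_prime, axis=-1, keepdims=True)
        ms = self.spatial_conv(tf.concat([avg_map, max_map], axis=-1))
        return f_prime * ms                           # F'' = Ms(F') * F'
```

In my model, one block of this kind is appended after the final bottleneck stage of the MobileNetV2 backbone.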

I evaluated the model using different width multipliers $\alpha$ (0.25, 0.50, 0.75, 1.00) to scale MobileNetV2’s size and balance performance and efficiency. The training involved a learning rate of 0.001 over 100 epochs, with the dataset split into training and testing sets. The results, summarized in Table 1, show that $\alpha = 0.50$ offers an optimal trade-off, with high accuracy and minimal parameters. After incorporating CBAM, the model’s accuracy improved further, as detailed in Table 2. These experiments demonstrate the effectiveness of my approach in robot technology, where lightweight models are essential for real-time deployment.

Table 1: Performance of MobileNetV2 with Different Width Multipliers

| Width Multiplier (α) | Model Size (MB) | Training Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| 0.25 | 1.2 | 94.5 | 92.8 |
| 0.50 | 1.96 | 97.12 | 95.33 |
| 0.75 | 2.8 | 97.5 | 95.7 |
| 1.00 | 3.5 | 97.8 | 95.9 |

Table 2: Comparison of MobileNetV2 and CBAM-Enhanced MobileNetV2

| Model | Model Size (MB) | Training Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| MobileNetV2 (α=0.50) | 1.96 | 97.12 | 95.33 |
| CBAM-MobileNetV2 (α=0.50) | 2.08 | 98.26 | 95.86 |
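For reference, the width multiplier corresponds to the `alpha` argument of the Keras MobileNetV2 implementation. The sketch below shows how a scaled backbone with a small head for the three line-following classes might be assembled; the pooling and head choices are assumptions consistent with the training setup described below, not the exact architecture definition.

```python
# Sketch: building a width-scaled MobileNetV2 backbone with a small
# classification head for the three line-following classes.
# include_top=False and the pooling/head choices are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    alpha=0.5,            # width multiplier from Table 1
    include_top=False,
    weights=None,         # train from scratch on the line-following dataset
    pooling="avg",
)

model = tf.keras.Sequential([
    backbone,
    layers.Dense(3, activation="softmax"),  # straight, left turn, right turn
])
model.summary()
```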

The training process used the preprocessed dataset, with images resized to 224×224 pixels to match MobileNetV2's input dimensions. I employed the Adam optimizer and categorical cross-entropy loss, defined as: $$ L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c}) $$ where $N$ is the number of samples, $C$ is the number of classes, $y_{i,c}$ is the true label, and $\hat{y}_{i,c}$ is the predicted probability. Data augmentation was applied on the fly during training to mitigate overfitting, a common risk with a dataset of this size. The model was implemented in TensorFlow and deployed on the OpenMV module, where it achieved an average inference time of 50 ms per frame, meeting the real-time requirements of the line-following task.
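A minimal training sketch consistent with this setup is shown below, reusing the `model` assembled in the earlier sketch. It assumes the images are organized in per-class directories and uses Keras utilities for on-the-fly augmentation; the directory layout, validation split, and augmentation ranges are illustrative assumptions.

```python
# Sketch of the training setup: Adam, categorical cross-entropy, 224x224 inputs,
# and on-the-fly augmentation. Directory layout, split ratio, and augmentation
# ranges are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,
    brightness_range=(0.7, 1.3),
    validation_split=0.2,
)

train_data = datagen.flow_from_directory(
    "dataset/",            # hypothetical folder with straight/, left/, right/ subfolders
    target_size=(224, 224),
    batch_size=16,
    class_mode="categorical",
    subset="training",
)
val_data = datagen.flow_from_directory(
    "dataset/",
    target_size=(224, 224),
    batch_size=16,
    class_mode="categorical",
    subset="validation",
)

model.compile(                                   # `model` from the previous sketch
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_data, validation_data=val_data, epochs=100)
```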

For practical testing, I built a physical humanoid robot prototype and conducted experiments in a controlled environment simulating line-following scenarios. The robot reliably identified straight paths, left turns, and right turns. Despite the low-resolution images from the OpenMV camera, the model's robustness, enhanced by data augmentation and the attention mechanism, ensured accurate classifications. During testing, I visualized the results by drawing bounding boxes around detected tracks and displaying the class label and confidence score. Performance remained consistent, with the CBAM-enhanced model achieving 95.86% accuracy on the test set and outperforming the baseline MobileNetV2. This demonstrates the viability of my design in real-world conditions, where efficiency and accuracy are paramount.
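A rough sketch of the on-robot inference loop is shown below. It assumes the legacy OpenMV `tf` module API and a converted `.tflite` model file; the label order, model filename, and exact method names are assumptions and vary across firmware versions.

```python
# Rough sketch of on-robot inference on the OpenMV module, assuming the legacy
# OpenMV `tf` API; module and method names differ across firmware versions.
import sensor, tf

sensor.reset()
sensor.set_pixformat(sensor.RGB565)
sensor.set_framesize(sensor.QVGA)
sensor.skip_frames(time=2000)

labels = ["straight", "left", "right"]            # hypothetical label order
net = tf.load("trained.tflite", load_to_fb=True)  # hypothetical model filename

while True:
    img = sensor.snapshot()
    for obj in net.classify(img):
        scores = obj.output()
        best = scores.index(max(scores))
        # Overlay the predicted class and confidence on the frame
        img.draw_rectangle(obj.rect())
        img.draw_string(4, 4, "%s %.2f" % (labels[best], scores[best]))
```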

In conclusion, my work presents a novel visual recognition system for humanoid robots that leverages advanced CNN techniques to address the challenges of line-following tasks. By integrating MobileNetV2 with CBAM attention, I achieved a lightweight yet accurate model suitable for embedded deployment in robot technology. The system’s design, from hardware components to software algorithms, ensures seamless operation and adaptability to environmental variations. Experimental results confirm its superiority over traditional methods, with significant improvements in recognition rates and operational stability. This contribution not only advances the state of robot technology but also opens avenues for future research, such as extending the approach to more complex navigation tasks or integrating additional sensors for enhanced perception. As robot technology continues to evolve, my system provides a scalable framework for developing intelligent, autonomous robots capable of performing in diverse real-world applications.
