In recent years, the field of robotics has witnessed remarkable advancements, particularly in the domain of bipedal humanoid robots. These AI human robot systems are designed to mimic human morphology and locomotion, enabling them to operate in complex settings such as household service, emergency rescue, and industrial automation. A critical component of their functionality is intelligent pose detection, which involves accurately identifying and tracking the positions of key body joints in real time. This capability is essential for tasks requiring interaction with other robots or humans, environmental navigation, and adaptive behavior. However, existing pose detection algorithms are predominantly optimized for human subjects, and their direct application to AI human robot platforms often yields suboptimal results due to differences in skeletal structure, movement patterns, and environmental contexts. In this article, we address this gap by proposing an enhanced pose detection method tailored specifically for bipedal humanoid robots, leveraging deep learning techniques and attention mechanisms to improve accuracy and robustness.

The importance of pose detection in AI human robot systems cannot be overstated. It serves as the foundation for higher-level cognitive functions, such as action recognition, gesture interpretation, and collaborative task execution. Traditional approaches to pose detection have been broadly categorized into top-down and bottom-up methods. Top-down methods first detect individual instances in an image using object detection algorithms and then perform single-person pose estimation on each detected region. For example, methods like Faster R-CNN combined with ResNet architectures have been employed to achieve high precision in human pose estimation. However, these approaches struggle with occlusions and scalability issues when multiple robots are present in a scene. In contrast, bottom-up methods detect all keypoints in an image first and then group them into individual instances using techniques like part affinity fields (PAF) or associative embeddings. While bottom-up methods offer better robustness to occlusions, they often face challenges in accurately associating keypoints to the correct AI human robot instances, especially in dense environments.
Our work builds upon these foundations by introducing a novel bottom-up pose detection framework that incorporates a channel-based attention mechanism and a deeper backbone network. We replace the commonly used ResNet18 with ResNet34 to enhance feature extraction capabilities while optimizing computational efficiency through reduced channel dimensions in skip connections and additive feature fusion. This design not only improves the model’s ability to capture fine-grained details but also maintains real-time performance, which is crucial for dynamic AI human robot applications. The integration of attention mechanisms allows the model to dynamically weight important feature channels, thereby focusing on relevant spatial information for keypoint detection. In the following sections, we provide a comprehensive overview of related work, detail our methodology, present experimental results, and discuss future directions.
Related Work
Pose detection has been extensively studied in the context of human motion analysis, with numerous algorithms developed for various applications. Early methods relied on handcrafted features and traditional computer vision techniques, but the advent of deep learning has revolutionized the field. Convolutional neural networks (CNNs) have become the de facto standard for pose estimation due to their ability to learn hierarchical representations from data. For AI human robot systems, however, these methods must be adapted to account for differences in joint kinematics and environmental factors.
Top-down methods, such as those based on Mask R-CNN or HRNet, achieve high accuracy by first localizing individuals and then estimating their poses. For instance, HRNet maintains high-resolution representations throughout the network, enabling precise keypoint localization. However, these methods are computationally intensive and may not be suitable for real-time AI human robot operations where multiple instances need to be processed simultaneously. Moreover, their performance degrades when robots are partially occluded or closely interacting, which is common in collaborative scenarios.
Bottom-up methods, on the other hand, offer a more scalable solution. OpenPose, for example, uses PAFs to associate keypoints with individual instances, achieving real-time performance on human pose datasets. Similarly, associative embedding methods learn to group keypoints by assigning them instance-specific tags. While these approaches have shown promise, they often require post-processing steps that can introduce errors. In the context of AI human robot pose detection, previous work has demonstrated the feasibility of bottom-up methods using datasets collected from robot soccer competitions. However, these implementations typically use lighter backbone networks like ResNet18, which may limit their ability to capture complex features. Our approach addresses these limitations by incorporating a deeper network and attention mechanisms, resulting in improved performance for AI human robot applications.
Methodology
Our proposed method for bipedal humanoid robot pose detection follows a bottom-up paradigm, consisting of an encoder-decoder architecture with integrated attention mechanisms. The overall goal is to accurately detect keypoints and associate them with individual robot instances in an image. Let us define the input image as $I$, with a set of pixels $P$, where each pixel $p = (u, v) \in P$ has coordinates $u$ and $v$. Each robot has $N$ keypoints, represented as a set $H = \{h_1, h_2, \dots, h_N\}$, where $h_i$ denotes the image coordinates of the $i$-th keypoint. Additionally, we define $M$ skeletal connections between keypoints, forming a set $L = \{l_1, l_2, \dots, l_M\}$, where each $l_j$ is the set of pixels covering the $j$-th skeletal connection.
The network processes the input image $I$ and generates heatmaps for keypoints and skeletons. Keypoint candidates are extracted by identifying local maxima in the heatmaps via non-maximum suppression, yielding the set $H$; the skeleton heatmaps are then used to group these keypoints into individual AI human robot instances. The network architecture, as illustrated in Figure 3, comprises an encoder based on ResNet34 and a decoder with transposed convolution layers. We introduce a squeeze-and-excitation (SE) block between the encoder and decoder to enhance feature representation by recalibrating channel weights. This attention mechanism computes a channel-wise descriptor $z$ through global average pooling, followed by two fully connected layers with a ReLU activation and a sigmoid function, respectively. The output is used to rescale the feature maps, emphasizing informative channels. Mathematically, for a feature map $F \in \mathbb{R}^{C \times H' \times W'}$, where $H'$ and $W'$ denote the spatial height and width (primed to avoid confusion with the keypoint set $H$), the SE block computes:
$$ z_c = \frac{1}{H' \times W'} \sum_{i=1}^{H'} \sum_{j=1}^{W'} F_c(i, j) $$
$$ s = \sigma(W_2 \delta(W_1 z)) $$
where $z_c$ is the $c$-th channel descriptor, $\delta$ is the ReLU activation, $W_1$ and $W_2$ are weights of the fully connected layers, and $\sigma$ is the sigmoid function. The scaled feature map $\tilde{F}$ is then obtained as $\tilde{F}_c = s_c \cdot F_c$.
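To make the recalibration step concrete, the following is a minimal PyTorch sketch of an SE block implementing the two equations above. The reduction ratio of 16 and the module name `SEBlock` are illustrative assumptions; the article does not specify these details.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global average pooling, two fully connected
    layers (ReLU then sigmoid), and channel-wise rescaling of the feature map."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W1
        self.fc2 = nn.Linear(channels // reduction, channels)  # W2

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        z = f.mean(dim=(2, 3))                                # squeeze: z_c averaged over H' x W'
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # excitation: s = sigma(W2 delta(W1 z))
        return f * s.view(b, c, 1, 1)                         # scale: F~_c = s_c * F_c
```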
The decoder consists of successive transposed convolution layers with batch normalization and ReLU activation, gradually upsampling the feature maps to the original resolution. Skip connections from the encoder are incorporated to preserve spatial details; however, we reduce the channel dimensions from 128 to 64 to minimize computational cost. Furthermore, instead of concatenating features, we use additive fusion to combine encoder and decoder features, which reduces memory usage and maintains information integrity. The final output includes two sets of heatmaps: one for keypoints and another for skeletons, which are used to compute the loss during training.
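As a rough illustration of this decoder design, the sketch below shows one upsampling stage with a transposed convolution, a 1×1 projection of the encoder skip feature down to 64 channels, and additive fusion, followed by the two output heads. Kernel sizes, the number of stages, and the module names are assumptions rather than the exact configuration used here.

```python
import torch.nn as nn

class DecoderStage(nn.Module):
    """Transposed-convolution upsampling (with BN and ReLU) plus an additively
    fused skip connection reduced to 64 channels."""

    def __init__(self, in_ch: int, skip_ch: int, out_ch: int = 64):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.skip_proj = nn.Conv2d(skip_ch, out_ch, kernel_size=1)  # reduce skip channels (e.g., 128 -> 64)

    def forward(self, x, skip):
        return self.up(x) + self.skip_proj(skip)  # additive fusion instead of concatenation

class HeatmapHeads(nn.Module):
    """1x1 convolutions producing N keypoint heatmaps and M skeleton heatmaps."""

    def __init__(self, in_ch: int, num_keypoints: int, num_skeletons: int):
        super().__init__()
        self.kpt = nn.Conv2d(in_ch, num_keypoints, kernel_size=1)
        self.skel = nn.Conv2d(in_ch, num_skeletons, kernel_size=1)

    def forward(self, x):
        return self.kpt(x), self.skel(x)
```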
The loss function is based on the mean squared error (MSE) between the predicted heatmaps and the ground-truth annotations. For keypoint heatmaps, the loss $\mathcal{L}_k$ is defined as:
$$ \mathcal{L}_k = \frac{1}{N} \sum_{i=1}^{N} \| \hat{Y}_i - Y_i \|^2 $$
where $\hat{Y}_i$ is the predicted heatmap for the $i$-th keypoint and $Y_i$ is the corresponding ground truth. Similarly, the skeleton loss $\mathcal{L}_s$ is computed over the predicted and ground-truth skeleton heatmaps $\hat{S}_j$ and $S_j$ as:
$$ \mathcal{L}_s = \frac{1}{M} \sum_{j=1}^{M} \| \hat{S}_j - S_j \|^2 $$
The total loss $\mathcal{L}_{total}$ is a weighted sum of $\mathcal{L}_k$ and $\mathcal{L}_s$, and the network is trained with the AdamW optimizer at an initial learning rate of $10^{-4}$.
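A minimal sketch of this training objective is given below; the weighting factor `lam` is hypothetical, since the article only states that the total loss is a weighted sum, and `model` stands in for the full network.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_kpt, gt_kpt, pred_skel, gt_skel, lam: float = 1.0):
    """Weighted sum of the keypoint and skeleton heatmap MSE losses."""
    # F.mse_loss averages over pixels as well as heatmaps, which differs from the
    # equations above only by a constant factor.
    l_k = F.mse_loss(pred_kpt, gt_kpt)    # keypoint heatmap loss
    l_s = F.mse_loss(pred_skel, gt_skel)  # skeleton heatmap loss
    return l_k + lam * l_s                # lam is an assumed weighting factor

# AdamW with the stated initial learning rate of 1e-4:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```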
Experimental Setup and Results
We evaluated our proposed method on the HumanoidRobotPose dataset, which comprises images collected from robot soccer competitions. The dataset includes multiple AI human robot instances under various poses and occlusion conditions. We split the data into training and testing sets, with 80% used for training and 20% for evaluation. Data augmentation techniques such as random horizontal flipping, rotation, scaling, and translation were applied to enhance model generalization.
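As an example of how such geometric augmentation must transform the annotations together with the images, the following is a small sketch of a random horizontal flip that also mirrors the keypoint coordinates. The `FLIP_PAIRS` left/right index mapping is hypothetical, since the exact keypoint layout of the HumanoidRobotPose dataset is not specified here.

```python
import numpy as np

# Hypothetical left/right keypoint index pairs; the actual indexing of the
# HumanoidRobotPose dataset may differ.
FLIP_PAIRS = [(1, 2), (3, 4), (5, 6)]

def random_hflip(image: np.ndarray, keypoints: np.ndarray, p: float = 0.5):
    """Randomly mirror an (H, W, C) image and its (K, 2) keypoint array.

    Rotation, scaling, and translation can be handled analogously by applying
    a shared affine matrix to both the image and the keypoints.
    """
    if np.random.rand() >= p:
        return image, keypoints
    h, w = image.shape[:2]
    image = image[:, ::-1].copy()                 # mirror the columns
    keypoints = keypoints.copy()
    keypoints[:, 0] = (w - 1) - keypoints[:, 0]   # mirror the x coordinates
    for a, b in FLIP_PAIRS:                       # swap left/right keypoints
        keypoints[[a, b]] = keypoints[[b, a]]
    return image, keypoints
```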
The model was trained for 200 epochs with a batch size of 16, and the encoder was initialized with ImageNet pre-trained weights. We compared our method against several state-of-the-art bottom-up pose detection approaches, including OpenPose, associative embedding (AE), PifPaf, HigherHRNet, and a baseline method from prior work. Performance was measured using the object keypoint similarity (OKS) metric, which scores the agreement between predicted and ground-truth keypoints, normalized by the object scale and per-keypoint falloff constants. OKS is defined as:
$$ OKS = \frac{\sum_i \exp(-d_i^2 / (2s^2 \kappa_i^2)) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)} $$
where $d_i$ is the Euclidean distance between the $i$-th predicted and ground-truth keypoints, $s$ is the object scale, $\kappa_i$ is a per-keypoint constant controlling the falloff, and $v_i$ is the visibility flag. We report average precision (AP) and average recall (AR) at different OKS thresholds (e.g., AP50 and AP75 use thresholds of 0.50 and 0.75), along with the medium- and large-instance breakdowns (APM, APL, ARM, ARL), to comprehensively assess model performance.
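A small NumPy sketch of this metric for a single predicted/ground-truth instance pair is shown below; the function name and argument layout are illustrative only.

```python
import numpy as np

def object_keypoint_similarity(pred, gt, vis, scale, kappa):
    """OKS for one instance.

    pred, gt : (K, 2) arrays of keypoint coordinates
    vis      : (K,) visibility flags (v_i > 0 means the keypoint is annotated)
    scale    : object scale s
    kappa    : (K,) per-keypoint constants kappa_i
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)                # squared distances d_i^2
    sim = np.exp(-d2 / (2.0 * scale**2 * kappa**2))      # per-keypoint similarity
    mask = vis > 0
    return float(sim[mask].sum() / max(mask.sum(), 1))   # average over annotated keypoints
```

AP50, for example, treats a detection as correct when its OKS with a matched ground-truth instance exceeds 0.50.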
Table 1 summarizes the architectural details and computational costs of each method. Our approach uses an input size of 384×384 with ResNet34 as the backbone, resulting in 21.8 million parameters and 23.5 GFLOPs. Despite the deeper backbone, our model runs at 62 FPS, making it suitable for real-time AI human robot applications. In contrast, methods such as OpenPose and AE have substantially higher computational demands, while the ResNet18-based baseline is lighter but, as Table 2 shows, less accurate.
| Method | Input Size | Backbone | Parameters (M) | GFLOPs | FPS |
|---|---|---|---|---|---|
| OpenPose | 368 | VGG19 | 25.8 | 159.8 | 14 |
| AE | 512 | Hourglass | 138.8 | 441.6 | 5 |
| PifPaf | 385 | ShuffleNetV2 | 9.4 | 46.3 | 13 |
| HigherHRNet | 512 | HRNet-W32 | 28.6 | 94.7 | 13 |
| Baseline | 384 | ResNet18 | 12.8 | 28.0 | 48 |
| Our Method | 384 | ResNet34 | 21.8 | 23.5 | 62 |
Table 2 presents the quantitative results on the test set. Our method achieves an AP of 83.0% and an AR of 83.8%, outperforming all compared methods. Specifically, we observe gains of 4.9 percentage points in AP and 4.4 points in AR over the baseline, demonstrating the effectiveness of our architectural enhancements. The attention mechanism and deeper backbone enable better feature learning, while the reduced skip-connection channels and additive fusion lower the computational overhead without sacrificing accuracy.
| Method | AP | AP50 | AP75 | APM | APL | AR | AR50 | AR75 | ARM | ARL |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenPose | 67.9 | 80.0 | 70.0 | 73.8 | 73.1 | 68.7 | 80.1 | 70.4 | 74.8 | 74.4 |
| AE | 62.9 | 71.9 | 64.1 | 64.0 | 72.9 | 64.6 | 73.9 | 65.6 | 64.7 | 76.0 |
| PifPaf | 76.1 | 81.6 | 75.6 | 76.0 | 91.0 | 77.9 | 83.6 | 77.2 | 77.7 | 93.0 |
| HigherHRNet | 73.4 | 84.1 | 75.6 | 80.3 | 78.7 | 76.2 | 85.3 | 77.2 | 81.4 | 83.0 |
| Baseline | 78.1 | 84.6 | 79.6 | 87.5 | 80.2 | 79.4 | 85.4 | 80.6 | 88.4 | 81.6 |
| Our Method | 83.0 | 88.9 | 83.9 | 89.2 | 82.3 | 83.8 | 89.2 | 84.4 | 89.8 | 83.1 |
To further analyze the impact of our modifications, we conducted ablation studies by varying components of the network. For instance, removing the SE block resulted in a 2% drop in AP, highlighting the importance of attention mechanisms for AI human robot pose detection. Similarly, using ResNet18 instead of ResNet34 led to a reduction in accuracy, confirming that deeper networks capture more discriminative features. The additive fusion strategy also contributed to a 1% improvement in AP compared to concatenation, while reducing GFLOPs by 15%.
Conclusion and Future Work
In this article, we presented an intelligent pose detection method for bipedal humanoid robots that combines a deep encoder-decoder architecture with channel-based attention mechanisms. Our approach addresses the limitations of existing methods by enhancing feature representation and computational efficiency, resulting in state-of-the-art performance on the HumanoidRobotPose dataset. The integration of ResNet34, SE blocks, and optimized skip connections enables accurate keypoint detection and association, which is crucial for real-world AI human robot applications such as collaborative tasks and operation in dynamic environments.
Despite these advancements, there are several avenues for future research. First, expanding the dataset to include more diverse scenarios, such as varying lighting conditions and complex backgrounds, could improve model generalization. Second, exploring transformer-based architectures or graph neural networks may further enhance the model’s ability to capture long-range dependencies between keypoints. Additionally, incorporating temporal information through recurrent networks or 3D convolutions could enable pose tracking in video sequences, which is essential for understanding AI human robot motion over time. Finally, deploying the model on embedded systems with limited resources would facilitate its adoption in practical AI human robot platforms. We believe that continued innovation in pose detection will play a pivotal role in advancing the capabilities of bipedal humanoid robots, ultimately enabling them to perform more complex and autonomous tasks in real-world settings.