Robotics is being reshaped by rapid advances in artificial intelligence, particularly in deep learning and computer vision. Within this landscape, humanoid robot technology has progressed from laboratory research to the cusp of widespread commercial application, with potential uses spanning domestic assistance, emergency rescue operations, and complex industrial manufacturing. A core competency required for these applications is sophisticated perception and interaction. A humanoid robot can accurately sense its own state using integrated sensors, but collaborative or reactive tasks also require it to interpret the actions and intents of other agents in a shared environment. This capability relies on advanced visual understanding, with pose detection as a cornerstone technology.
Pose detection, the process of locating key anatomical joints or parts in an image and connecting them to form a skeletal representation, is well established for humans. However, directly applying these algorithms to humanoid robots is suboptimal: despite morphological similarities, differences in dimensions, textures, and movement patterns call for tailored solutions. This paper addresses this gap by investigating and enhancing pose detection methods specifically for bipedal humanoid robots. We review the theoretical foundations of existing human pose estimation algorithms and propose an architecture that integrates a channel-wise attention mechanism, improving detection accuracy and robustness for robotic platforms.

Multi-person pose estimation has traditionally been approached via two paradigms: top-down and bottom-up. Top-down methods first detect individual human instances (e.g., using a region-based detector such as Faster R-CNN) and then perform single-person skeleton estimation within each cropped region. Networks like High-Resolution Net (HRNet) maintain high-resolution representations through multi-scale fusion, yielding superior accuracy. Conversely, bottom-up methods first detect all potential keypoints across the entire image and then group them into person instances. Pioneering work like DeepCut used integer linear programming for grouping, while OpenPose introduced Part Affinity Fields (PAFs), a set of 2D vector fields that encode the position and orientation of limbs, to associate keypoints. Stacked Hourglass networks capture multi-scale contextual information through repeated pooling and upsampling, facilitating keypoint detection in crowded scenes.
For humanoid robots, a bottom-up approach is often more suitable. It avoids the computational bottleneck and error propagation associated with a preliminary detection stage, especially when robots are partially occluded. Prior work, such as that by Amini et al., created a dedicated dataset from RoboCup footage and applied a bottom-up model for robot pose estimation. Building upon this foundation, our work introduces significant architectural improvements. We replace the backbone network, incorporate an attention mechanism, and optimize feature fusion pathways, leading to a more precise and efficient model tailored for the unique characteristics of a humanoid robot.
Problem Formulation and Overview
Given an input RGB image $\mathbf{I}$, with a set of pixels $\mathcal{P}$ where a pixel is defined as $\mathbf{p} = (u, v)^\top \in \mathcal{P}$, the goal is to detect the pose of all bipedal humanoid robots present. A pose is defined by a set of $N$ keypoints (e.g., head, shoulders, elbows, knees), denoted as $\mathcal{H} = \{ \mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_N \}$, where $\mathbf{h}_i = (x_i, y_i)^\top$ represents the 2D coordinates of the $i$-th keypoint. These keypoints are connected by a predefined skeleton consisting of $M$ limbs (e.g., torso, upper arm). Each limb $l_j$ is associated with a subset of pixels that belong to it.
The model, a Convolutional Neural Network (CNN), processes $\mathbf{I}$ to produce two sets of output maps: keypoint confidence heatmaps and limb association maps (analogous to PAFs). The keypoint heatmaps have $N$ channels, where each channel predicts the probability of a specific keypoint being located at each pixel. The limb association maps have $2M$ channels, two per limb, encoding the components of a 2D vector field for that limb. The final pose estimation involves two steps: 1) detecting keypoints by finding local maxima in the keypoint heatmaps, and 2) parsing the skeleton by connecting keypoints based on the scores from the limb association maps. The pipeline is designed to balance real-time performance against high precision in the dynamic environments where a humanoid robot operates.
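The two parsing steps can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `find_peaks` and `paf_score`, the 8-neighbourhood rule, the threshold of 0.1, and the number of sample points are all assumptions chosen for illustration, and the toy maps are synthetic.

```python
import numpy as np

def find_peaks(heatmap, threshold=0.1):
    """Return (row, col, score) for pixels above `threshold` that are
    >= all 8 neighbours (a simple local-maximum rule, assumed here)."""
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    is_peak = heatmap > threshold
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            is_peak &= heatmap >= neighbour
    rows, cols = np.nonzero(is_peak)
    return [(int(r), int(c), float(heatmap[r, c])) for r, c in zip(rows, cols)]

def paf_score(paf_x, paf_y, p_a, p_b, num_samples=10):
    """Average dot product between the limb field and the unit vector
    from p_a to p_b, sampled along the segment. Points are (row, col)."""
    (ra, ca), (rb, cb) = p_a, p_b
    direction = np.array([cb - ca, rb - ra], dtype=float)  # (dx, dy)
    norm = np.linalg.norm(direction)
    if norm == 0:
        return 0.0
    direction /= norm
    rs = np.linspace(ra, rb, num_samples).round().astype(int)
    cs = np.linspace(ca, cb, num_samples).round().astype(int)
    return float(np.mean(paf_x[rs, cs] * direction[0] + paf_y[rs, cs] * direction[1]))

# Synthetic keypoint heatmap with two Gaussian blobs.
yy, xx = np.mgrid[0:32, 0:32]

def blob(r, c, sigma=2.0):
    return np.exp(-((yy - r) ** 2 + (xx - c) ** 2) / (2 * sigma ** 2))

heatmap = np.maximum(blob(10, 12), blob(20, 25))
peaks = find_peaks(heatmap)

# Synthetic limb field pointing in the +x direction everywhere.
paf_x, paf_y = np.ones((32, 32)), np.zeros((32, 32))
aligned = paf_score(paf_x, paf_y, (5, 5), (5, 20))        # horizontal candidate limb
perpendicular = paf_score(paf_x, paf_y, (5, 5), (20, 5))  # vertical candidate limb
print(peaks, aligned, perpendicular)
```

A candidate connection whose direction agrees with the field scores high and a perpendicular one scores near zero, which is what lets the parser choose between competing keypoint pairs.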
Proposed Model Architecture
Our proposed model follows an encoder-decoder structure, optimized for the pose detection task for a humanoid robot. The design philosophy prioritizes a balance between representational capacity and computational efficiency to ensure practical deployability.
Encoder: Enhanced Feature Extraction
The encoder extracts hierarchical features from the input image. We employ ResNet34 as the backbone, a deeper network than the ResNet18 used in the baseline model. With its 34 weight layers, ResNet34 provides a larger receptive field and greater capacity to learn complex features pertinent to the often metallic and rigid structure of a humanoid robot, as opposed to the organic shapes of humans. The residual connections mitigate the vanishing gradient problem, allowing for effective training. The encoder outputs a high-level feature map $\mathbf{F}_{enc} \in \mathbb{R}^{C_e \times H_e \times W_e}$, where $C_e=512$.
Formally, let the encoder function be $E(\cdot)$. For an input image $\mathbf{I}$:
$$\mathbf{F}_{enc} = E(\mathbf{I}; \theta_E)$$
where $\theta_E$ represents the parameters of the ResNet34 encoder, initialized with weights pre-trained on ImageNet.
Squeeze-and-Excitation Attention Block
To enhance the network’s focus on the most informative feature channels for humanoid robot pose estimation, we integrate a Squeeze-and-Excitation (SE) block between the encoder and decoder. This attention mechanism dynamically recalibrates channel-wise feature responses. The SE block performs two operations: Squeeze and Excitation.
Squeeze: A global average pooling operation aggregates spatial information over the $H_e \times W_e$ positions of each channel of $\mathbf{F}_{enc}$, producing a channel descriptor $\mathbf{z} \in \mathbb{R}^{C_e}$.
$$z_c = \frac{1}{H_e \times W_e} \sum_{i=1}^{H_e} \sum_{j=1}^{W_e} F_{enc}(c, i, j)$$
Excitation: A simple gating mechanism with a sigmoid activation learns a nonlinear interaction between channels and outputs per-channel modulation weights $\mathbf{s}$.
$$\mathbf{s} = \sigma(\mathbf{W}_2 \delta(\mathbf{W}_1 \mathbf{z}))$$
Here, $\delta$ is the ReLU activation, $\mathbf{W}_1 \in \mathbb{R}^{\frac{C_e}{r} \times C_e}$ and $\mathbf{W}_2 \in \mathbb{R}^{C_e \times \frac{C_e}{r}}$ are learnable weight matrices, and $r$ is a reduction ratio (set to 16). The final output of the SE block is obtained by rescaling $\mathbf{F}_{enc}$:
$$\tilde{\mathbf{F}}_{enc} = \mathbf{s} \odot \mathbf{F}_{enc}$$
where $\odot$ denotes channel-wise multiplication. This process allows the model to emphasize features from channels relevant to robot joints and suppress less useful ones.
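The squeeze, excitation, and rescaling equations can be traced in a few lines of NumPy. This is a sketch under stated assumptions rather than the trained block: the weights are random, bias terms are omitted because the equations above contain none, and the spatial size $12 \times 12$ is a hypothetical value chosen only for the demonstration.

```python
import numpy as np

def se_block(f_enc, w1, w2):
    """Apply squeeze-and-excitation to f_enc of shape (C, H, W).

    w1 has shape (C // r, C) and w2 has shape (C, C // r), matching
    W_1 and W_2 in the text. Returns the rescaled features and s.
    """
    z = f_enc.mean(axis=(1, 2))                  # squeeze: global average pool -> (C,)
    hidden = np.maximum(w1 @ z, 0.0)             # delta: ReLU on the reduced vector
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigma: sigmoid gate in (0, 1)
    return s[:, None, None] * f_enc, s           # channel-wise rescaling

rng = np.random.default_rng(0)
C, H, W, r = 512, 12, 12, 16                     # C_e = 512 and r = 16 as in the text
f_enc = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.05     # random stand-ins for learned weights
w2 = rng.standard_normal((C, C // r)) * 0.05

f_tilde, s = se_block(f_enc, w1, w2)
print(f_tilde.shape, s.min(), s.max())
```

With $r = 16$ the bottleneck has $512 / 16 = 32$ units, so the two matrices add only $2 \times 512 \times 32 = 32{,}768$ weights.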
Decoder and Feature Fusion
The decoder $D(\cdot)$ reconstructs high-resolution feature maps from the attended encoder features $\tilde{\mathbf{F}}_{enc}$. It consists of a series of transposed convolution (ConvTranspose) layers, each followed by Batch Normalization and ReLU activation. To preserve fine-grained spatial details lost during encoding, we employ skip connections from intermediate encoder layers. However, to optimize computational load, we reduce the number of channels in these skip connection features from 128 to 64 before fusion.
A critical design change is the method of feature fusion. Instead of concatenating decoder features with skip connection features (which increases channel count and subsequent computation), we use an element-wise addition operation. This requires the decoder and skip connection features to have the same number of channels (64). The fusion at stage $k$ can be represented as:
$$\mathbf{F}_{dec}^k = \text{ReLU}(\text{BN}(\text{ConvTranspose}(\mathbf{F}_{dec}^{k-1}))) + \text{Conv}_{1\times1}(\mathbf{F}_{skip}^k)$$
where $\text{Conv}_{1\times1}$ is a $1\times1$ convolution used to project the skip connection features to 64 channels. This approach maintains information flow while reducing model complexity.
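A small NumPy sketch makes the channel bookkeeping concrete. The tensor sizes, the random weights, and the follow-up $3 \times 3$ layer used for the weight count are hypothetical; `dec_up` stands in for the already upsampled term $\text{ReLU}(\text{BN}(\text{ConvTranspose}(\cdot)))$.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution without bias: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def fuse_add(dec_up, skip, proj_weight):
    """Additive fusion: project the skip features, then add element-wise."""
    return dec_up + conv1x1(skip, proj_weight)

def fuse_concat(dec_up, skip):
    """Concatenation fusion along the channel axis, shown for comparison."""
    return np.concatenate([dec_up, skip], axis=0)

rng = np.random.default_rng(0)
dec_up = rng.standard_normal((64, 24, 24))    # upsampled decoder features
skip = rng.standard_normal((128, 24, 24))     # encoder skip features before projection
proj = rng.standard_normal((64, 128)) * 0.05  # 1x1 projection from 128 to 64 channels

added = fuse_add(dec_up, skip, proj)
concatenated = fuse_concat(dec_up, skip)

# Weights of a hypothetical following 3x3 layer with 64 output channels.
weights_after_add = added.shape[0] * 64 * 3 * 3
weights_after_concat = concatenated.shape[0] * 64 * 3 * 3
print(added.shape, concatenated.shape, weights_after_add, weights_after_concat)
```

Under these assumptions the layer that consumes the fused features sees 64 input channels instead of 192, so its weight count and multiply-accumulate cost shrink by a factor of three.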
The final decoder output passes through two separate $1\times1$ convolutional heads to produce the target heatmaps: one for keypoint confidence $\mathbf{H}_{kp} \in \mathbb{R}^{N \times H_o \times W_o}$ and one for part affinity fields $\mathbf{H}_{paf} \in \mathbb{R}^{2M \times H_o \times W_o}$, where $H_o \times W_o$ is the output resolution.
Loss Function
We adopt the same loss function as the baseline for fair comparison. The total loss $L$ is a sum of Mean Squared Error (MSE) losses applied to the predicted heatmaps and the ground truth:
$$L = \lambda_{kp} \cdot L_{kp} + \lambda_{paf} \cdot L_{paf}$$
$$L_{kp} = \frac{1}{N \cdot H_o \cdot W_o} \sum_{n=1}^{N} \sum_{i=1}^{H_o} \sum_{j=1}^{W_o} \left\| \mathbf{H}_{kp}^{(n,i,j)} - \mathbf{H}_{kp, gt}^{(n,i,j)} \right\|_2^2$$
$$L_{paf} = \frac{1}{2M \cdot H_o \cdot W_o} \sum_{m=1}^{2M} \sum_{i=1}^{H_o} \sum_{j=1}^{W_o} \left\| \mathbf{H}_{paf}^{(m,i,j)} - \mathbf{H}_{paf, gt}^{(m,i,j)} \right\|_2^2$$
where $\lambda_{kp}$ and $\lambda_{paf}$ are weighting factors, and the subscript $gt$ denotes ground truth. This objective guides the network to accurately localize keypoints and predict the orientation of limbs for the humanoid robot.
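The objective reduces to two mean squared errors, as the following NumPy sketch shows. The weighting factors are not stated in the text, so the defaults of 1.0 are an assumption, as are the toy values $N = 6$, $M = 5$, and the $48 \times 48$ output resolution.

```python
import numpy as np

def pose_loss(h_kp, h_kp_gt, h_paf, h_paf_gt, lambda_kp=1.0, lambda_paf=1.0):
    """Weighted sum of per-element MSE on keypoint and limb maps.

    np.mean divides by N*H_o*W_o (or 2M*H_o*W_o), matching the
    normalisation in L_kp and L_paf.
    """
    l_kp = float(np.mean((h_kp - h_kp_gt) ** 2))
    l_paf = float(np.mean((h_paf - h_paf_gt) ** 2))
    return lambda_kp * l_kp + lambda_paf * l_paf, l_kp, l_paf

N, M, H_o, W_o = 6, 5, 48, 48                  # hypothetical sizes for the demo
rng = np.random.default_rng(0)
h_kp_gt = rng.random((N, H_o, W_o))
h_paf_gt = rng.uniform(-1.0, 1.0, (2 * M, H_o, W_o))

# Keypoint prediction off by a constant 0.5, limb prediction exact.
total, l_kp, l_paf = pose_loss(h_kp_gt + 0.5, h_kp_gt, h_paf_gt, h_paf_gt)
print(total, l_kp, l_paf)
```

A constant error of 0.5 on every keypoint pixel gives $L_{kp} = 0.5^2 = 0.25$, which is a quick sanity check on the normalisation.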
Experimental Setup and Results
We evaluate our proposed method on the publicly available HumanoidRobotPose dataset, derived from RoboCup soccer matches, which contains images with single and multiple robot instances.
Training Details
The model is trained for 200 epochs using the AdamW optimizer with an initial learning rate of $1 \times 10^{-4}$ and a batch size of 16. Data augmentation includes random horizontal flipping, rotation ($\pm 30^\circ$), scaling (0.75-1.25), and translation. The encoder weights are initialized from a model pre-trained on ImageNet.
Evaluation Metric
We use the standard Object Keypoint Similarity (OKS)-based Average Precision (AP) and Average Recall (AR) as our primary metrics. OKS acts like IoU for keypoints, calculating the similarity between predicted and ground truth poses based on normalized distance:
$$\text{OKS} = \frac{\sum_i \exp\left(-\frac{d_i^2}{2 s^2 k_i^2}\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$
where $d_i$ is the Euclidean distance between the predicted and true keypoint, $s$ is the object scale, $k_i$ is a per-keypoint constant that controls falloff, $v_i$ is the visibility flag, and $\delta$ is the indicator function. AP is computed over multiple OKS thresholds (e.g., AP50 at OKS=0.50, AP75 at OKS=0.75).
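The OKS formula can be evaluated directly. The sketch below follows the equation as written above; the keypoint coordinates, the scale $s = 10$, and the constants $k_i = 0.1$ are made-up values for illustration, not dataset parameters.

```python
import math
import numpy as np

def oks(pred, gt, visibility, scale, k):
    """Object Keypoint Similarity for one predicted/ground-truth pose pair.

    pred, gt: (N, 2) coordinates; visibility: (N,) flags v_i;
    scale: object scale s; k: (N,) per-keypoint falloff constants.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    labelled = np.asarray(visibility) > 0            # delta(v_i > 0)
    if not labelled.any():
        return 0.0
    d2 = np.sum((pred - gt) ** 2, axis=1)            # squared distances d_i^2
    similarity = np.exp(-d2 / (2.0 * scale ** 2 * np.asarray(k, dtype=float) ** 2))
    return float(similarity[labelled].sum() / labelled.sum())

gt = [[10.0, 10.0], [20.0, 10.0], [30.0, 30.0]]
pred = [[10.0, 10.0], [21.0, 10.0], [90.0, 90.0]]   # second keypoint is 1 px off
visibility = [1, 1, 0]                               # third keypoint is unlabelled
score = oks(pred, gt, visibility, scale=10.0, k=[0.1, 0.1, 0.1])
print(score)
```

Here the first keypoint contributes 1, the second contributes $\exp(-1 / (2 \cdot 100 \cdot 0.01)) = \exp(-0.5)$, and the unlabelled third keypoint is ignored despite its large error, so the score is $(1 + e^{-0.5}) / 2$.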
Comparative Analysis
We compare our method against several state-of-the-art bottom-up human pose estimation methods adapted to the humanoid robot task: OpenPose, Associative Embedding (AE), PifPaf, HigherHRNet, and the baseline method from prior work. Table 1 details the architectural and computational characteristics of each network.
| Method | Input Size | Backbone | Params (M) | GFLOPs | FPS |
|---|---|---|---|---|---|
| OpenPose | 368 | VGG19 | 25.8 | 159.8 | 14 |
| AE | 512 | Hourglass | 138.8 | 441.6 | 5 |
| PifPaf | 385 | ShuffleNetV2 | 9.4 | 46.3 | 13 |
| HigherHRNet | 512 | HRNet-W32 | 28.6 | 94.7 | 13 |
| Baseline [11] | 384 | ResNet18 | 12.8 | 28.0 | 48 |
| Our Method | 384 | ResNet34 + SE | 21.8 | 23.5 | 62 |
Our model, while having more parameters (21.8M) than the baseline (12.8M) due to the deeper backbone, is more computationally efficient in terms of GFLOPs (23.5 vs. 28.0) and achieves a higher inference speed (62 FPS vs. 48 FPS). This is attributed to the optimized feature fusion (addition instead of concatenation) and reduced channel count in skip connections.
Table 2 presents the quantitative results on the test set. Our method achieves the highest overall AP and AR, as well as the highest AP50, AP75, APM, AR50, AR75, and ARM. The exceptions are APL and ARL, where PifPaf leads (91.0 vs. 82.3 and 93.0 vs. 83.1, respectively).
| Method | AP | AP50 | AP75 | APM | APL | AR | AR50 | AR75 | ARM | ARL |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenPose | 67.9 | 80.0 | 70.0 | 73.8 | 73.1 | 68.7 | 80.1 | 70.4 | 74.8 | 74.4 |
| AE | 62.9 | 71.9 | 64.1 | 64.0 | 72.9 | 64.6 | 73.9 | 65.6 | 64.7 | 76.0 |
| PifPaf | 76.1 | 81.6 | 75.6 | 76.0 | 91.0 | 77.9 | 83.6 | 77.2 | 77.7 | 93.0 |
| HigherHRNet | 73.4 | 84.1 | 75.6 | 80.3 | 78.7 | 76.2 | 85.3 | 77.2 | 81.4 | 83.0 |
| Baseline [11] | 78.1 | 84.6 | 79.6 | 87.5 | 80.2 | 79.4 | 85.4 | 80.6 | 88.4 | 81.6 |
| Our Method | 83.0 | 88.9 | 83.9 | 89.2 | 82.3 | 83.8 | 89.2 | 84.4 | 89.8 | 83.1 |
The results show an improvement over the baseline of 4.9 points in AP (83.0 vs. 78.1) and 4.4 points in AR (83.8 vs. 79.4). This gain is consistent with the intended effect of our architectural enhancements: the deeper ResNet34 backbone provides stronger feature representation, and the SE block enables dynamic channel-wise feature refinement, which is particularly beneficial for distinguishing the structured components of a humanoid robot from cluttered backgrounds or other robots.
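The differences between our method and the baseline can be recomputed from the figures reported in Tables 1 and 2. The short script below does only that arithmetic; the helper name `compare` is ours.

```python
# Figures copied from the "Baseline [11]" and "Our Method" rows of Tables 1 and 2.
baseline = {"AP": 78.1, "AR": 79.4, "Params (M)": 12.8, "GFLOPs": 28.0, "FPS": 48}
ours = {"AP": 83.0, "AR": 83.8, "Params (M)": 21.8, "GFLOPs": 23.5, "FPS": 62}

def compare(metric):
    """Return (absolute difference, relative difference in percent)."""
    delta = ours[metric] - baseline[metric]
    return delta, 100.0 * delta / baseline[metric]

for metric in baseline:
    delta, relative = compare(metric)
    print(f"{metric:>10}: {delta:+.1f} absolute, {relative:+.1f}% relative")
```

The AP and AR gains are best read as absolute point differences, while the GFLOPs and FPS changes are more naturally expressed as relative differences.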
Conclusion and Future Work
In this work, we presented a pose detection framework specifically designed for bipedal humanoid robots. Rather than directly applying human-centric algorithms, we introduced a model incorporating a deeper ResNet34 backbone and a channel-wise Squeeze-and-Excitation attention mechanism. Optimizations in the decoder’s feature fusion process, namely channel reduction and additive fusion, resulted in a model that is both more accurate and computationally more efficient than its predecessor. Our method achieved the highest overall AP (83.0) and AR (83.8) among the compared methods on a standard humanoid robot pose dataset.
Despite these promising results, challenges remain for deploying such systems in unconstrained real-world scenarios. Future work will focus on expanding and diversifying the training dataset to include more varied environments, lighting conditions, and robot models to improve generalization. Furthermore, exploring transformer-based architectures or test-time augmentation strategies could yield additional performance gains. Ultimately, robust and accurate pose detection is a critical stepping stone towards seamless interaction between multiple humanoid robots and their environments, supporting their broader adoption in complex service and industrial tasks.
