Humanoid robots have gained significant attention in human-robot interaction due to their anthropomorphic appearance and ability to emulate human-like facial expressions. These robots are increasingly deployed in fields such as healthcare, education, and customer service, where emotional engagement is crucial. However, a persistent challenge in facial expression imitation for humanoid robots lies in the mismatch between hardware execution speeds and data processing rates. Traditional methods, such as uniform down-sampling of expression data, often lead to jittery and unnatural mimicry due to the loss of subtle yet critical expression changes. This paper addresses this issue by proposing a novel two-stage strategy that combines facial keyframe detection and facial keypoint recognition. The approach leverages a self-supervised multi-scale optical flow technique to identify keyframes capturing essential expression transitions, followed by facial keypoint extraction to generate control commands for robot actuators. By focusing on keyframes, the method reduces data processing frequency without sacrificing expression fidelity, resulting in smoother and more natural imitation. Experimental validation on public datasets, including VoxCeleb, 300VW, and CAER, demonstrates the effectiveness of the proposed strategy in enhancing the fluidity and accuracy of humanoid robot expression mimicry.

The integration of humanoid robots into daily life necessitates advanced interaction capabilities, with facial expressions playing a pivotal role in conveying emotions and intentions. Humanoid robots equipped with sophisticated mechanical systems, such as servo-driven facial components, can replicate a wide range of expressions. However, the inherent limitations of servo motors, such as rotational delays, often cause a disconnect between the rapid data stream from vision systems and the slower mechanical response. This discrepancy leads to choppy and unrealistic expression sequences, undermining the naturalness of human-robot interaction. Existing solutions, including frame rate reduction, inadvertently discard vital micro-expressions, such as slight eyebrow raises or lip twitches, which are essential for authentic emotional communication. To overcome these drawbacks, this work introduces a keyframe-based methodology that dynamically selects the most informative frames from a video sequence, ensuring that only significant expression changes are processed and imitated by the humanoid robot. The core innovation lies in the fusion of multi-scale optical flow features with deep learning-based keyframe detection, enabling precise capture of temporal facial dynamics. This approach not only optimizes computational efficiency but also aligns data processing with the physical constraints of humanoid robot hardware, paving the way for more lifelike and responsive robotic companions.
The remainder of this paper is organized as follows: Section 2 reviews related work on humanoid robot facial expression imitation, video keyframe detection, and optical flow estimation. Section 3 elaborates on the proposed two-stage strategy, detailing the keyframe detection model and facial keypoint extraction process. Section 4 presents experimental results and discussions, including dataset descriptions, implementation details, and performance comparisons. Finally, Section 5 concludes the paper and outlines future research directions.
Related Work
Facial expression imitation in humanoid robots has evolved from pre-programmed sequences to data-driven approaches enabled by artificial intelligence. Early humanoid robots, such as Kismet and Albert HUBO, utilized complex arrays of servos and actuators to generate facial expressions through mechanical deformation. While these systems achieved a degree of expressiveness, they lacked adaptability and real-time responsiveness. Recent advancements have incorporated computer vision and machine learning techniques to enable dynamic expression mimicry. For instance, Ren et al. developed an automatic facial expression learning method that allows humanoid robots to imitate expressions in real-time by analyzing visual input. Similarly, Huang et al. integrated dual LSTM networks to enhance the smoothness of expression sequences, producing more natural animations. Despite these improvements, the fundamental issue of hardware-data rate mismatch remains largely unaddressed. In many humanoid robot platforms, servo motors require significant time to complete movements, leading to command queuing and execution delays. This often results in expression jitter, where the robot struggles to keep pace with rapid changes in human expressions.
Video keyframe detection has been extensively studied in computer vision for applications like video summarization and compression. Traditional methods rely on frame differencing, histogram comparisons, or feature-based similarity measures to identify representative frames. For example, frame difference techniques compute pixel-wise variations between consecutive frames, selecting frames with significant changes as keyframes. However, these approaches are often insensitive to subtle motions, such as facial muscle movements, making them unsuitable for expression analysis. Deep learning-based keyframe detection methods, such as those using recurrent neural networks or convolutional architectures, have shown promise in capturing temporal dependencies. Yet, they typically require large annotated datasets and may not generalize well to the nuanced domain of facial expressions. Optical flow estimation, which models pixel-level motion between frames, provides a robust foundation for analyzing facial dynamics. Self-supervised optical flow networks, like FastFlowNet, learn motion patterns without manual labels, making them ideal for real-time applications. By integrating multi-scale optical flow features with keyframe detection, our method enhances sensitivity to minor expression changes, ensuring that critical frames are accurately identified for humanoid robot imitation.
The proposed strategy builds upon these foundations by combining the strengths of optical flow and deep learning for keyframe detection, followed by facial keypoint extraction to drive humanoid robot actuators. This two-stage process effectively bridges the gap between high-frequency visual data and low-frequency hardware execution, enabling seamless and natural expression mimicry.
Proposed Methodology
The proposed two-stage facial expression imitation strategy for humanoid robots consists of a keyframe detection phase and a facial keypoint extraction phase. In the first stage, a novel keyframe detection model identifies frames with significant expression changes from an input video sequence. This model incorporates a multi-scale optical flow feature extraction module and a keyframe detection module based on ResNet-50. In the second stage, a facial keypoint detection algorithm, such as PFLD, processes the selected keyframes to extract geometric information of facial components. These keypoints are then translated into control angles for servo motors in the humanoid robot’s head, enabling precise imitation of expressions like eye blinking, mouth opening, and eyebrow movement. The overall workflow ensures that only essential expression data is processed, reducing computational load and aligning with the mechanical limitations of humanoid robots.
Keyframe Detection Model
The keyframe detection model is designed to capture subtle facial expression variations by leveraging temporal motion information. It comprises two main components: the optical flow feature extraction module and the keyframe detection module.
Optical Flow Feature Extraction Module: This module employs FastFlowNet, a lightweight neural network, to compute multi-scale optical flow features between consecutive frames. Given two frames, $F_{t-1}$ and $F_t$, the network generates optical flow features $F_{\text{flow}}$, which represent pixel-level displacements in the horizontal and vertical directions. A warping operation is applied to align $F_{t-1}$ with $F_t$ using the flow field, producing $F_{t-1}^{\text{warp}}$. The transformation is defined as:
$$ F_{t-1}^{\text{warp}}(i,j) = F_{t-1}(i + u(i,j), j + v(i,j)) $$
where $i$ and $j$ denote pixel coordinates, and $u$ and $v$ are the horizontal and vertical components of $F_{\text{flow}}(i,j)$, respectively. The structural similarity (SSIM) loss is used to measure the discrepancy between $F_{t-1}^{\text{warp}}$ and $F_t$ across multiple scales. The SSIM between two frames $x$ and $y$ is calculated as:
$$ \text{SSIM}(x,y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $$
where $\mu_x$ and $\mu_y$ are pixel means, $\sigma_x$ and $\sigma_y$ are standard deviations, $\sigma_{xy}$ is the covariance, and $C_1$ and $C_2$ are constants. The total loss for this module is the sum of SSIM losses at each scale, guiding the network to learn accurate motion representations.
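To make the warping and multi-scale SSIM loss concrete, below is a minimal PyTorch sketch. It assumes each flow field is given in pixels in (x, y) order; the 3×3 average-pooling window and the clamped $(1-\text{SSIM})/2$ form are common implementation choices for self-supervised flow losses, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (B,C,H,W) with per-pixel flow (B,2,H,W) so it aligns with the next frame."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2,H,W), (x, y) order
    coords = base.unsqueeze(0) + flow                              # displaced sampling coordinates
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                    # normalize to [-1, 1] for grid_sample
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                   # (B,H,W,2)
    return F.grid_sample(frame, grid, align_corners=True)

def ssim_loss(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Dissimilarity (1 - SSIM) / 2 averaged over the image, with a 3x3 mean filter as the local window."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return torch.clamp((1.0 - ssim) / 2.0, 0.0, 1.0).mean()

def multiscale_photometric_loss(prev_frame, cur_frame, flows):
    """Sum the SSIM discrepancy between the warped previous frame and the current frame at every scale."""
    total = 0.0
    for flow in flows:  # one flow field per pyramid scale
        h, w = flow.shape[-2:]
        p = F.interpolate(prev_frame, size=(h, w), mode="bilinear", align_corners=True)
        c = F.interpolate(cur_frame, size=(h, w), mode="bilinear", align_corners=True)
        total = total + ssim_loss(warp(p, flow), c)
    return total
```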
Keyframe Detection Module: This module uses ResNet-50 as a backbone to extract spatial features from input frames. The extracted features are fused with corresponding multi-scale optical flow features to capture temporal dynamics. The fused features are passed through a fully connected network that outputs a score between 0 and 1, indicating the magnitude of expression change relative to the previous frame. A score close to 1 signifies a substantial change, marking the frame as a potential keyframe. The module is trained using mean squared error (MSE) loss, with the target scores derived from blendshape coefficients and head pose parameters. During inference, local extremum search is applied to the score sequence to identify keyframes corresponding to peak expression changes.
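As an illustration, the sketch below shows a plausible scoring head and a simple local-maximum search over the resulting score sequence. The feature dimensions, hidden width, and the `min_gap` peak-suppression parameter are assumptions made for the example, not values reported in the paper.

```python
import torch
import torch.nn as nn

class KeyframeScoreHead(nn.Module):
    """Fuses ResNet-50 spatial features with flow features and regresses a change score in [0, 1]."""
    def __init__(self, feat_dim=2048, flow_dim=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim + flow_dim, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, 1),
            nn.Sigmoid(),
        )

    def forward(self, spatial_feat, flow_feat):
        fused = torch.cat([spatial_feat, flow_feat], dim=1)
        return self.fc(fused).squeeze(1)

def find_keyframes(scores, min_gap=3):
    """Return indices of local maxima in the per-frame score sequence, skipping peaks that are too close."""
    keyframes = []
    for t in range(1, len(scores) - 1):
        if scores[t] >= scores[t - 1] and scores[t] > scores[t + 1]:
            if not keyframes or t - keyframes[-1] >= min_gap:
                keyframes.append(t)
    return keyframes
```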
The joint training of both modules is achieved through multi-task learning, where the total loss is a linear combination of the SSIM loss and MSE loss. This ensures that the model simultaneously learns motion patterns and expression dynamics, enhancing keyframe detection accuracy for humanoid robot applications.
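Reusing the sketches above, the joint objective can be written as a weighted sum; the weighting factor below is illustrative, since the paper only states that the total loss is a linear combination of the two terms.

```python
import torch.nn.functional as F

def total_loss(prev_frame, cur_frame, flows, pred_scores, target_scores, lam=1.0):
    """Multi-task objective: self-supervised multi-scale SSIM loss plus supervised MSE on change scores."""
    flow_term = multiscale_photometric_loss(prev_frame, cur_frame, flows)  # defined in the sketch above
    score_term = F.mse_loss(pred_scores, target_scores)
    return flow_term + lam * score_term
```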
Facial Keypoint Detection and Robot Control
Once keyframes are selected, a facial keypoint detection algorithm processes them to locate landmarks on eyebrows, eyes, nose, and mouth. The PFLD model is employed for its high precision and efficiency. The detected keypoints are used to quantify facial expressions, which are then mapped to control commands for humanoid robot actuators. For example, the mouth opening angle $\theta$ is computed from upper and lower lip keypoints. Let $P = \{p_1, p_2, \dots, p_n\}$ and $Q = \{q_1, q_2, \dots, q_n\}$ represent sets of upper and lower lip keypoints, respectively. The angle $\theta$ is given by:
$$ \theta = \alpha \times \frac{1}{n} \sum_{i=1}^{n} \left\| p_i - q_i \right\| $$
where $\alpha$ is a hardware-specific constant determined empirically. Similar calculations are performed for other facial components, such as eye aperture and eyebrow elevation. The derived angles are transmitted to the humanoid robot’s control system, which adjusts servo motors to replicate the expression. This process ensures that the humanoid robot mimics only the most expressive frames, reducing unnecessary movements and improving imitation naturalness.
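The sketch below illustrates this mapping for the mouth; the landmark coordinates, the value of $\alpha$, and the servo limits are placeholders, since these depend on the specific robot head.

```python
import numpy as np

def mouth_opening_angle(upper_lip, lower_lip, alpha):
    """Mean Euclidean distance between paired upper/lower-lip keypoints, scaled by the hardware constant alpha."""
    upper = np.asarray(upper_lip, dtype=float)   # shape (n, 2) pixel coordinates
    lower = np.asarray(lower_lip, dtype=float)
    return alpha * np.linalg.norm(upper - lower, axis=1).mean()

# Illustrative usage: clamp the angle to the jaw servo's range before sending it to the controller.
theta = mouth_opening_angle(upper_lip=[(120, 200), (130, 202)],
                            lower_lip=[(121, 231), (131, 236)],
                            alpha=0.8)
jaw_command_deg = float(np.clip(theta, 0.0, 45.0))   # placeholder servo limits in degrees
```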
Experimental Setup and Results
To evaluate the proposed strategy, experiments were conducted on three public datasets: VoxCeleb, 300VW, and CAER. These datasets contain diverse facial expressions in unconstrained environments, making them suitable for testing real-world scenarios. The keyframe detection model was implemented using PyTorch and trained on an NVIDIA RTX 3090 Ti GPU. The ResNet-50 backbone was initialized randomly, and training parameters included a batch size of 96, a learning rate of 0.0001, a weight decay of 0.0001, and a dropout rate of 0.5. The optimizer was stochastic gradient descent (SGD), and training proceeded for 1,000 epochs.
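For reference, a minimal training-loop sketch matching the reported hyperparameters is shown below; the model, dataset, loss wrapper, and the SGD momentum value are assumptions, as they are not specified here.

```python
import torch

# Hyperparameters from the experimental setup: batch size 96, lr 1e-4, weight decay 1e-4, 1,000 epochs.
loader = torch.utils.data.DataLoader(train_set, batch_size=96, shuffle=True)   # `train_set` is assumed
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,                       # `model` is assumed
                            weight_decay=1e-4, momentum=0.9)                   # momentum value is an assumption

for epoch in range(1000):
    for batch in loader:
        optimizer.zero_grad()
        loss = compute_total_loss(model, batch)   # hypothetical wrapper around the joint loss above
        loss.backward()
        optimizer.step()
```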
The performance of the keyframe detection model was assessed using two metrics: keyframe quantity matching error ($E_n$) and keyframe position matching error ($E_l$). Let $G_n$ and $P_n$ denote the ground truth and predicted number of keyframes, respectively. Then:
$$ E_n = |P_n - G_n| $$
For position error, let $G_l^{(x)}$ and $P_l^{(x)}$ represent the ground truth and predicted positions of the $x$-th keyframe in a sequence of $k$ keyframes. The position error is defined as:
$$ E_l = \frac{1}{k} \sum_{x=1}^{k} \left| P_l^{(x)} - G_l^{(x)} \right| $$
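A straightforward way to compute both metrics is sketched below; it assumes predicted and ground-truth keyframes are matched in temporal order after sorting, since the matching rule is not spelled out here.

```python
def keyframe_errors(pred_positions, gt_positions):
    """Quantity error E_n and mean position error E_l (in frames) for one video sequence."""
    e_n = abs(len(pred_positions) - len(gt_positions))
    k = min(len(pred_positions), len(gt_positions))
    if k == 0:
        return e_n, 0.0
    pred, gt = sorted(pred_positions)[:k], sorted(gt_positions)[:k]
    e_l = sum(abs(p - g) for p, g in zip(pred, gt)) / k
    return e_n, e_l

# Example: predictions at frames [12, 40, 77], ground truth at [11, 42, 76] -> E_n = 0, E_l = (1 + 2 + 1) / 3
```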
Table 1 compares the proposed method with baseline approaches, including ResNet-50 without optical flow and frame difference methods, on the three datasets. The results demonstrate that the integration of multi-scale optical flow features significantly reduces both quantity and position errors, achieving an average position error of less than one frame.
| Method | Dataset | Quantity Error ($E_n$) | Position Error ($E_l$, frames) |
|---|---|---|---|
| ResNet-50 | VoxCeleb | ±6.61 | ±1.483 |
| ResNet-50 | 300VW | ±5.96 | ±1.654 |
| ResNet-50 | CAER | ±4.89 | ±1.552 |
| Frame Difference | VoxCeleb | ±5.28 | ±1.357 |
| Frame Difference | 300VW | ±5.15 | ±1.391 |
| Frame Difference | CAER | ±4.32 | ±1.605 |
| Proposed Method | VoxCeleb | ±3.24 | ±0.640 |
| Proposed Method | 300VW | ±3.74 | ±0.759 |
| Proposed Method | CAER | ±3.85 | ±0.852 |
For facial expression imitation, the proposed strategy was deployed on a physical humanoid robot platform. A video sequence of a human subject displaying various expressions was processed, and keyframes were identified based on score peaks. The robot successfully imitated expressions such as smiling, surprise, and neutral poses, with smooth transitions between keyframes. Imitation quality was quantified using the cosine similarity between the robot’s expression parameters and the human’s ground-truth expressions. Averaged over 100 keyframes, the cosine similarity was 0.7226, indicating a high degree of resemblance. This confirms that the keyframe-based approach effectively preserves essential expression details while mitigating hardware-induced jitter.
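For completeness, the similarity measure can be computed as below; the parameter vectors (e.g., stacked servo angles or blendshape coefficients) and the sequences `robot_seq` and `human_seq` are placeholders for this example.

```python
import numpy as np

def expression_similarity(robot_params, human_params):
    """Cosine similarity between a robot expression-parameter vector and the human ground truth."""
    a = np.asarray(robot_params, dtype=float)
    b = np.asarray(human_params, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Average over all processed keyframes (robot_seq and human_seq are placeholder lists of parameter vectors).
mean_similarity = np.mean([expression_similarity(r, h) for r, h in zip(robot_seq, human_seq)])
```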
Discussion
The experimental results validate the efficacy of the proposed two-stage strategy for humanoid robot facial expression imitation. By focusing on keyframes, the method reduces the data processing burden and ensures that the humanoid robot only executes meaningful expression changes. The integration of optical flow features enables the detection of subtle facial motions that are often missed by traditional keyframe detection algorithms. This is particularly important for humanoid robots operating in dynamic environments, where real-time responsiveness is critical. However, the method relies on accurate facial keypoint detection, which may be affected by occlusions or extreme head poses. Future work could incorporate robust keypoint detectors or adversarial training to handle such scenarios. Additionally, the hardware-specific constant $\alpha$ in the control equation requires calibration for different humanoid robot platforms, which may limit plug-and-play applicability. Despite these limitations, the strategy represents a significant step toward natural and efficient human-robot interaction, with potential extensions to other domains like virtual avatars or assistive technologies.
Conclusion
This paper presents a keyframe-based approach for facial expression imitation in humanoid robots, addressing the challenge of hardware-data rate mismatch. The proposed two-stage strategy combines multi-scale optical flow-based keyframe detection with facial keypoint extraction to select and imitate only the most expressive frames. Experimental results on public datasets demonstrate superior performance in keyframe accuracy and imitation naturalness compared to baseline methods. The approach not only enhances the fluidity of humanoid robot expressions but also optimizes computational resources, making it suitable for real-time applications. Future research will focus on adapting the method to various humanoid robot architectures and exploring its integration with emotional intelligence models for more context-aware interactions.