Keyframe-Based Facial Expression Mimicry for Humanoid Robots

In recent years, humanoid robots have gained significant attention due to their human-like appearance and ability to express emotions, making them ideal for applications in healthcare, education, and social interaction. However, a critical challenge in facial expression mimicry for humanoid robots is the mismatch between hardware execution speed and data processing rates. This discrepancy often leads to jerky and unnatural expressions, undermining the robot’s ability to engage in seamless human-robot interaction. Traditional methods, such as uniform down-sampling of expression data, partially address this issue but result in the loss of subtle yet crucial expression details. To overcome these limitations, we propose a novel two-stage strategy that combines facial keyframe detection and keypoint recognition. Our approach leverages a self-supervised multi-scale optical flow technique to identify keyframes in expression sequences, ensuring that only the most representative frames are used for imitation. This method not only reduces computational overhead but also enhances the naturalness and fluidity of the robot’s expressions. By integrating this strategy with existing facial keypoint detection algorithms, we enable humanoid robots to mimic human expressions more accurately and efficiently. In this paper, we detail the design and implementation of our keyframe detection model, validate its performance on public datasets, and demonstrate its effectiveness on a physical humanoid robot platform. Our contributions include a robust keyframe detection framework and a comprehensive evaluation that highlights improvements in expression mimicry quality.

The development of humanoid robots has revolutionized human-robot interaction by enabling machines to exhibit social behaviors akin to humans. Facial expressions play a pivotal role in conveying emotions and intentions, making them a key component of effective communication. However, replicating these expressions on humanoid robots poses significant technical challenges. The mechanical actuators in robot faces, such as servo motors, have limited rotation speeds, which often lag behind the high frame rates of video data. For instance, a typical servo might take 120 ms to rotate 60 degrees, while video data is captured at 25–30 Hz. This mismatch causes the robot to process new expression commands before completing previous ones, resulting in choppy and disjointed mimicry. Traditional solutions involve reducing the data frequency through down-sampling, but this approach indiscriminately discards frames, leading to the loss of critical expression transitions. Our work addresses this issue by selectively identifying keyframes that capture significant expression changes, thereby optimizing the data flow to match the hardware capabilities of humanoid robots. This strategy ensures that the robot mimics expressions smoothly and naturally, enhancing the overall interaction experience.

Related Work

Facial expression mimicry in humanoid robots has been explored through various approaches, including the use of LCD displays for rendering expressions and mechanical systems with silicone skins for realistic facial movements. Early humanoid robots like Kismet, Albert HUBO, and WE-4RII employed numerous servos to generate a wide range of expressions. While pre-programmed expressions can be effective, they lack the flexibility required for dynamic interactions. Recent advances have incorporated data-driven AI algorithms to enable real-time expression mimicry. For example, some studies have developed automated imitation methods using deep neural networks, while others have integrated Long Short-Term Memory (LSTM) networks to improve smoothness. Additionally, research on soft-skinned robotic faces has utilized self-supervised learning frameworks for imitation. Despite these innovations, the compatibility between algorithms and hardware remains a challenge. Most studies focus on algorithmic design without adequately addressing the rate mismatch between data processing and mechanical execution. This gap motivates our work on keyframe-based mimicry, which aims to balance data efficiency and expression fidelity for humanoid robots.

Keyframe detection is a well-established technique in video analysis for compressing data and eliminating redundancy. Conventional methods rely on frame differencing, similarity metrics, and motion analysis to identify representative frames. However, these approaches often struggle with subtle facial expression changes due to their sensitivity thresholds. Deep learning-based methods have shown promise in capturing temporal dynamics, but they require extensive labeled data. Optical flow estimation, which tracks pixel-level motion between frames, has been widely used for motion analysis. Traditional optical flow methods assume brightness constancy and smoothness, but they can be enhanced with self-supervised learning and multi-scale features. Our keyframe detection model builds on these concepts by combining a ResNet-50 backbone with multi-scale optical flow features. This hybrid approach allows us to capture minute expression variations effectively, making it suitable for the nuanced requirements of humanoid robots.

Proposed Method

Our two-stage facial expression mimicry strategy for humanoid robots consists of keyframe detection and facial keypoint extraction. In the first stage, we develop a keyframe detection model that identifies frames with significant expression changes. This model integrates a multi-scale optical flow feature extraction module and a keyframe detection module based on ResNet-50. The optical flow module computes motion features between consecutive frames, while the keyframe module evaluates the magnitude of expression changes. By fusing these features, the model assigns a score to each frame, and keyframes are selected based on local maxima in the score sequence. In the second stage, we use a facial keypoint detection algorithm, such as PFLD, to extract geometric information from the keyframes. These keypoints are then mapped to control commands for the robot’s servo motors, enabling precise imitation of expressions like eye blinking or mouth opening. The overall process ensures that the humanoid robot only processes essential data, reducing latency and improving mimicry quality.
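To make the overall data flow concrete, the sketch below outlines how the two stages could be chained in practice. It is a minimal illustration in Python: the detector objects, the keypoint-to-command mapping, and the robot interface are passed in as arguments because their exact interfaces are implementation details not fixed by the description above.

```python
# Minimal sketch of the two-stage mimicry pipeline. The detector objects and
# the keypoint-to-command mapping are supplied by the caller; their interfaces
# are placeholders, not the exact ones used in our implementation.
import cv2


def mimic_expressions(video_path, keyframe_detector, keypoint_detector,
                      to_servo_commands, robot):
    # Read the full expression sequence from the video.
    frames = []
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    # Stage 1: keep only the frames with significant expression changes.
    keyframes = keyframe_detector.select_keyframes(frames)

    # Stage 2: extract facial keypoints from each keyframe and translate the
    # geometry into servo commands for the robot's actuators.
    for frame in keyframes:
        landmarks = keypoint_detector.detect(frame)
        robot.execute(to_servo_commands(landmarks))
```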

Keyframe Detection Model

The keyframe detection model is designed to capture subtle facial expression changes in video sequences. It comprises two main components: the optical flow feature extraction module and the keyframe detection module. The optical flow module uses FastFlowNet to compute multi-scale motion features between frame \( F_{t-1} \) and frame \( F_t \). The optical flow feature \( F_{\text{flow}} \) is a two-channel tensor representing horizontal and vertical displacements. At each scale, we warp \( F_{t-1} \) using \( F_{\text{flow}} \) to obtain \( F_{t-1}^{\text{warp}} \), and then compute the structural similarity (SSIM) loss between \( F_{t-1}^{\text{warp}} \) and \( F_t \). The warping operation is defined as:

$$F_{t-1}^{\text{warp}}(i,j) = F_{t-1}(i + u, j + v)$$

where \( u = F_{\text{flow}}(i,j,0) \) and \( v = F_{\text{flow}}(i,j,1) \) are the horizontal and vertical displacements at pixel \( (i,j) \). The SSIM losses from all scales are summed to train the module; after training, the multi-scale optical flow features it produces are passed on for inference.
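As a concrete illustration, the sketch below implements the backward warping above with PyTorch's grid_sample and sums an SSIM-based photometric loss over the pyramid scales. The pyramid and loss interfaces are assumptions; for instance, kornia.losses.ssim_loss (wrapped with a fixed window size) could serve as the SSIM term.

```python
import torch
import torch.nn.functional as F


def warp(frame, flow):
    """Backward-warp a batch of frames (B, C, H, W) with a pixel-space flow
    field (B, 2, H, W): each output pixel samples the input shifted by its
    displacement (u horizontal, v vertical), as in the warping equation."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    x = xs.unsqueeze(0) + flow[:, 0]          # horizontal displacement u
    y = ys.unsqueeze(0) + flow[:, 1]          # vertical displacement v
    # grid_sample expects coordinates normalized to [-1, 1], ordered (x, y).
    grid = torch.stack((2.0 * x / (w - 1) - 1.0,
                        2.0 * y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(frame, grid, align_corners=True)


def multiscale_ssim_loss(prev_pyramid, cur_pyramid, flow_pyramid, ssim_loss):
    """Sum the SSIM loss between the warped previous frame and the current
    frame at every scale; `ssim_loss` can be any SSIM-based loss function,
    e.g. kornia.losses.ssim_loss with a fixed window size."""
    total = 0.0
    for prev, cur, flow in zip(prev_pyramid, cur_pyramid, flow_pyramid):
        total = total + ssim_loss(warp(prev, flow), cur)
    return total
```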

The keyframe detection module uses ResNet-50 as a backbone to extract spatial features from each frame. These features are combined with the multi-scale optical flow features to capture temporal dynamics. The fused features are passed through fully connected layers to produce a score between 0 and 1, indicating the magnitude of expression change. A higher score signifies a more significant change. During inference, keyframes are selected by identifying local maxima in the score sequence. The model is trained using a multi-task loss function that combines the SSIM loss from the optical flow module and the mean squared error (MSE) loss from the keyframe module. The total loss \( L_{\text{total}} \) is computed as:

$$L_{\text{total}} = \alpha L_{\text{SSIM}} + \beta L_{\text{MSE}}$$

where \( \alpha \) and \( \beta \) are weighting coefficients. This approach ensures robust keyframe detection for humanoid robots.
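A minimal sketch of the inference-time selection rule and the combined loss is given below. Only the local-maximum rule and the form of \( L_{\text{total}} \) are specified above; the default weights of 1.0 are placeholders.

```python
import numpy as np


def select_keyframes(scores):
    """Return the indices of strict local maxima in the per-frame score
    sequence produced by the keyframe detection module."""
    scores = np.asarray(scores, dtype=np.float64)
    keyframes = []
    for t in range(1, len(scores) - 1):
        if scores[t] > scores[t - 1] and scores[t] > scores[t + 1]:
            keyframes.append(t)
    return keyframes


def total_loss(l_ssim, l_mse, alpha=1.0, beta=1.0):
    # Weighted multi-task loss; alpha and beta are tunable coefficients
    # (1.0 is only a placeholder value).
    return alpha * l_ssim + beta * l_mse
```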

Facial Keypoint Detection

Once keyframes are identified, we apply a facial keypoint detection algorithm to extract geometric features. We use the PFLD model for its high accuracy and efficiency. The model detects keypoints for the eyebrows, eyes, nose, and mouth, which are distributed symmetrically across the face. For example, the mouth opening angle \( \theta \) is calculated from the upper and lower lip keypoints. Let \( P = \{p_1, p_2, \dots, p_n\} \) and \( Q = \{q_1, q_2, \dots, q_n\} \) denote the sets of upper and lower lip keypoints, respectively. The angle \( \theta \) is given by:

$$\theta = c \times \frac{1}{n} \sum_{i=1}^{n} \lVert p_i - q_i \rVert$$

where \( c \) is a hardware-specific scaling constant (distinct from the loss weight \( \alpha \) above) and \( \lVert p_i - q_i \rVert \) is the Euclidean distance between a matched pair of lip keypoints. This angle is converted into a command for the robot’s servo motors, controlling the mouth movement. Similar calculations are performed for the other facial features, enabling the humanoid robot to mimic expressions accurately.
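For illustration, the sketch below turns matched lip keypoints into a single mouth-servo angle. The value of \( c \), the servo range, and the clipping step are hardware-dependent placeholders, not parameters taken from our platform.

```python
import numpy as np


def mouth_servo_angle(upper_lip, lower_lip, c=0.5, max_angle=60.0):
    """Map the mouth-opening geometry to a servo angle.

    upper_lip, lower_lip : (n, 2) arrays of matched upper/lower lip keypoints.
    c                    : hardware-specific scaling constant (placeholder).
    max_angle            : mechanical limit of the mouth servo (assumed).
    """
    upper_lip = np.asarray(upper_lip, dtype=np.float64)
    lower_lip = np.asarray(lower_lip, dtype=np.float64)
    # Mean Euclidean distance between the matched lip keypoint pairs.
    gap = np.linalg.norm(upper_lip - lower_lip, axis=1).mean()
    # Clamp to the servo's mechanical range before issuing the command.
    return float(np.clip(c * gap, 0.0, max_angle))
```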

Experiments and Results

We evaluated our keyframe-based mimicry strategy on three public datasets: VoxCeleb, CAER, and 300VW. These datasets contain diverse facial expressions in real-world scenarios, making them suitable for testing humanoid robot applications. We implemented our model in PyTorch on an NVIDIA RTX 3090Ti GPU. The ResNet-50 backbone was initialized randomly, and training used a batch size of 96, a learning rate of 0.0001, and the SGD optimizer. The model was trained for 1000 epochs with dropout and weight decay for regularization.
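A minimal PyTorch sketch of this configuration is shown below; the momentum value and the regularization strengths are assumptions, since only their use is stated above.

```python
import torch
from torchvision.models import resnet50

# Reported settings: randomly initialized ResNet-50, batch size 96,
# learning rate 1e-4, SGD, dropout and weight decay. The momentum and the
# weight-decay value below are assumed, not reported.
backbone = resnet50(weights=None)
optimizer = torch.optim.SGD(
    backbone.parameters(), lr=1e-4, momentum=0.9, weight_decay=1e-4
)
```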

Dataset Description

The VoxCeleb dataset includes 19,348 video clips of over 6,000 speakers, exhibiting a wide range of expressions like talking and smiling. The CAER dataset consists of 13,000 video clips from movies, annotated with emotions such as happiness and anger. The 300VW dataset contains 114 videos with annotated facial landmarks. We preprocessed these datasets by extracting blendshape coefficients and head poses using existing keypoint detectors. The root mean square of these 55-dimensional vectors served as ground truth labels for keyframe detection. Manual inspection ensured the quality of the labels.
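As a sketch of the labeling step, the per-frame ground-truth value can be computed as the root mean square of the extracted 55-dimensional blendshape-and-pose vector, as below; any rescaling of the labels to the model's [0, 1] score range is an implementation detail not covered here.

```python
import numpy as np


def frame_label(coeffs):
    """Ground-truth label for one frame: the root mean square of its
    55-dimensional blendshape-and-head-pose vector."""
    coeffs = np.asarray(coeffs, dtype=np.float64)
    return float(np.sqrt(np.mean(coeffs ** 2)))
```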

Keyframe Detection Performance

We compared our keyframe detection method with ResNet-50 and frame differencing on 30 randomly selected videos from each dataset. The evaluation metrics included keyframe quantity error \( E_n \) and position error \( E_l \), defined as:

$$E_n = |P_n – G_n|$$

$$E_l = \frac{1}{k} \sum_{x=1}^{k} |P_l^{(x)} – G_l^{(x)}|$$

where \( P_n \) and \( G_n \) are the predicted and ground truth keyframe counts, and \( P_l^{(x)} \) and \( G_l^{(x)} \) are the predicted and ground truth positions. The results are summarized in Table 1.
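For reference, the sketch below shows one way to compute these two per-video metrics. The one-to-one pairing of predicted and ground-truth keyframes (here simply by sorted order over the \( k \) overlapping positions) is an assumption rather than a rule stated above.

```python
import numpy as np


def keyframe_errors(pred_positions, gt_positions):
    """Quantity error E_n and position error E_l for one video.

    Predicted and ground-truth keyframes are matched by sorted order over
    the first k = min(|P|, |G|) positions; this pairing is an assumption.
    """
    pred = np.sort(np.asarray(pred_positions, dtype=np.float64))
    gt = np.sort(np.asarray(gt_positions, dtype=np.float64))
    e_n = abs(len(pred) - len(gt))
    k = min(len(pred), len(gt))
    e_l = float(np.mean(np.abs(pred[:k] - gt[:k]))) if k > 0 else 0.0
    return e_n, e_l
```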

Table 1: Keyframe Detection Errors on Different Datasets

| Method | Dataset | Quantity Error (\( E_n \)) | Position Error (\( E_l \)) |
| --- | --- | --- | --- |
| ResNet-50 | VoxCeleb | ±6.61 | ±1.483 |
| ResNet-50 | 300VW | ±5.96 | ±1.654 |
| ResNet-50 | CAER | ±4.89 | ±1.552 |
| Frame differencing | VoxCeleb | ±5.28 | ±1.357 |
| Frame differencing | 300VW | ±5.15 | ±1.391 |
| Frame differencing | CAER | ±4.32 | ±1.605 |
| Our method | VoxCeleb | ±3.24 | ±0.640 |
| Our method | 300VW | ±3.74 | ±0.759 |
| Our method | CAER | ±3.85 | ±0.852 |

Our method achieved the lowest errors across all datasets, with an average quantity error of ±3.61 and position error of ±0.750. This demonstrates its superiority in accurately identifying keyframes for humanoid robots. The training loss curves in Figure 1 show that our model converged faster and to lower loss values compared to baseline methods.

Facial Expression Mimicry Results

We deployed our strategy on a physical humanoid robot platform to assess its mimicry performance. A 7-second video of a student’s facial expressions was processed, and 8 keyframes were selected. The keypoint detection model extracted mouth angles, which were translated into servo commands. The robot successfully imitated the expressions, as shown in the example commands for mouth servos. We quantified the similarity between the robot’s expressions and the human expressions using cosine similarity. For 100 keyframes, the average cosine similarity was 0.7226, indicating a high degree of accuracy. This result confirms that our keyframe-based approach enhances the naturalness and fluidity of expression mimicry in humanoid robots.
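The similarity score can be computed as in the sketch below. The feature vectors being compared (for example, flattened keypoint coordinates or the derived servo angles) are an assumption, since the exact representation is an implementation detail.

```python
import numpy as np


def expression_similarity(human_vec, robot_vec):
    """Cosine similarity between a human expression feature vector and the
    corresponding vector measured from the robot's face."""
    a = np.asarray(human_vec, dtype=np.float64).ravel()
    b = np.asarray(robot_vec, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0
```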

Discussion

Our keyframe-based strategy effectively addresses the rate mismatch between data processing and hardware execution in humanoid robots. By selecting only essential frames, we reduce the data load without sacrificing expression quality. The integration of multi-scale optical flow features allows the model to capture subtle expression changes, which is crucial for realistic mimicry. However, the performance may vary with lighting conditions and video quality. Future work could explore adaptive thresholding for keyframe selection and real-time optimization for dynamic environments. Additionally, extending this approach to multi-modal interactions, such as combining facial expressions with speech, could further enhance the capabilities of humanoid robots.

Conclusion

We presented a novel two-stage strategy for facial expression mimicry in humanoid robots, leveraging keyframe detection and keypoint recognition. Our keyframe detection model, based on ResNet-50 and multi-scale optical flow, accurately identifies significant expression changes in video sequences. The facial keypoint detection then translates these changes into control commands for the robot’s actuators. Experimental results on public datasets and a physical robot platform demonstrate that our method reduces jerks and improves mimicry naturalness. This work provides a robust solution for enhancing human-robot interaction, with potential applications in healthcare, education, and beyond. Future research will focus on optimizing the algorithm for real-time performance and exploring its integration with other sensory modalities.
