Implicit Emotion-Oriented Speech-Driven Bionic Robot Facial Animation

The advancement of artificial intelligence has significantly enhanced the capabilities of bionic robots in simulating human behaviors and expressions. This progress opens new possibilities across various fields, particularly in proactive health, social interaction, and educational entertainment. Proactive health emphasizes improving overall well-being through individual participation and health management. Bionic robots, serving as interactive companions or therapeutic aids, can assist vulnerable groups such as people with hearing impairments, the elderly, and children with Autism Spectrum Disorder (ASD). Research indicates that using robots to help children with ASD develop social and emotional skills can yield positive outcomes. Within this context, speech-driven generation of facial expressions and head movements in bionic robots emerges as a critical research focus for enabling natural and effective human-robot interaction.

Facial expressions and head motions are not only important media for emotional communication but also vital cues for understanding human intent. Studies suggest that over 60% of information in human communication is conveyed through non-verbal behaviors, with facial expressions and head movements playing a significant role. Therefore, enabling a bionic robot to produce accurate and natural expressions is crucial for making human-robot interaction more approachable. This work aims to bridge the gap between audio signals and physical robot actuation, exploring how to transform speech into coordinated facial and neck movements for a bionic robot platform.

Traditional approaches for animating robotic faces often relied on pre-programmed, fixed action sequences or random movements, which are insufficient for generating precise, context-aware emotional expressions aligned with speech. Recent methods leverage facial landmark detection from video to map movements to servo motors. Meanwhile, in the computer graphics domain, significant breakthroughs have been made in audio-driven facial animation for virtual avatars. However, translating these advancements to physical bionic robots presents unique challenges, including hardware constraints, the need for real-time performance, and the precise mapping of continuous control parameters to discrete servo movements. This paper proposes a novel, holistic approach to address these challenges, focusing on generating natural, emotion-aware animations for a physical bionic robot head directly from audio input.

The core contributions of this work are threefold: First, we propose a novel deep learning-based method for bionic robot actuation. This method predicts a unified set of control coefficients encompassing both facial expression and neck motion parameters directly from speech features, achieving precise audio-to-robot expression control. Second, we design an implicit emotion-oriented feature fusion autoencoder framework. This architecture can infer emotional features implicitly from the audio signal without requiring explicit emotion labels or parameters as input, thereby synthesizing facial expressions and neck movements rich in emotional nuance. Third, we construct a series of robot-specific motion templates corresponding to each servo motor. Through a servo mapping strategy, these templates allow for the reconstruction of expressions and head poses on the physical bionic robot platform.

Related Work

Optimal Servo Displacement Mapping

Current techniques for generating talking faces in virtual environments primarily revolve around two core methodologies: direct mapping of speech signals to vertex coordinates of a face mesh, and the prediction of coefficients related to a parametric facial model, such as Blendshapes. The Blendshape model is a popular linear model that uses a compact set of parameters (e.g., 52) to represent key deformations of a human face, enabling the simulation of various 3D facial expressions. These coefficients are generally independent of a specific template mesh, allowing them to be reused across different face models to display consistent expressions.

In the domain of controlling physical expression robots, traditional methods depended on a fixed set of pre-programmed actions. Recent research has improved upon this by mapping facial landmarks detected from video to servo displacements, enabling more detailed facial control. Our work instead maps audio-derived features, rather than video, to a comprehensive set of control parameters that include neck motion. We formulate the relationship between high-level control coefficients and individual servo displacements as an optimization problem, solved in collaboration with professional animators to respect mechanical limits. This expert-optimized servo driving method, analogous to human muscle control mechanisms, can effectively reproduce subtle changes in facial detail, providing a viable technical pathway for expression generation in physical bionic robots.

Audio-Driven Facial Expression and Head Pose

In speech-driven facial animation research, early works utilized models like Hidden Markov Models (HMMs). While producing some results, they struggled with capturing complex speech-lip relationships and often required significant manual post-processing. The advent of Deep Neural Networks (DNNs) has led to substantial progress. Systems have been developed to estimate facial model parameters from phonemes or acoustic features, drive 3D face models, and predict 2D lip landmarks. More recently, methods employing Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and Generative Adversarial Networks (GANs) have significantly improved the quality and realism of synthesized talking faces.

Many synthesized talking face videos feature a fixed head pose. Our method incorporates three-dimensional geometric information, enabling the simultaneous generation of personalized head poses, expressions, and lip movements. Crucially, our implicit emotion-oriented framework infers emotional representation from audio without needing emotion as an explicit input parameter, synthesizing emotionally rich facial expressions directly. Furthermore, the method retains the editable nature of the 3D model pipeline, allowing flexible adjustment of expression intensity for different scenarios, a feature we extend to the physical bionic robot domain.

Proposed Methodology

Our system employs a custom-built 25-Degree-of-Freedom (DOF) bionic robot head for demonstrating facial expressions and head motions. The overall pipeline is as follows: Audio data is input into a deep learning model (termed speech2head) which outputs unified 3D facial unit control coefficients. These motion control coefficients can not only drive a digital avatar based on a Blendshape model but are also transformed into servo control commands for the bionic robot through a pre-defined optimal mapping strategy.

The Bionic Robot Head Platform

The bionic robot head platform, with its soft skin, microprocessor, advanced servo control system, and precision mechanical structure, can replicate human facial muscle actions and neck postures.

1) Hardware: The platform consists of a head frame, internal modules, and a neck module. The frame is 3D-printed based on a realistic human face and covered with soft skin. The internal cavity houses the mechanical control structures, which use linkage and hemispheric mechanisms. A pair of hemispheric mechanisms controls eyelid opening/closing (50°–80° range). Movements of eyebrows and cheeks are achieved via linkage structures with a maximum displacement of 5 mm. The neck module utilizes three high-performance servo motors enabling three-DOF rotation.

2) Control: An STM32 microprocessor on the robot head communicates with a server via serial port to receive standardized optimal servo displacement data. The microprocessor maps these values to Pulse-Width Modulation (PWM) signals to drive the servos. The platform is equipped with 25 servos, controlling the mouth, cheeks, eyelids, eyebrows, and neck, with independent control possible for left/right eyelids, eyebrows, and cheeks.
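As a rough illustration of the displacement-to-PWM step, the sketch below linearly maps a normalized servo displacement to a pulse width. It is written in Python for readability, although the actual firmware runs on the STM32. The function name, the [0, 1] normalization, and the 500 to 2500 microsecond pulse range are assumptions for illustration, not values reported for the platform.

```python
def displacement_to_pulse_us(d, min_us=500.0, max_us=2500.0):
    """Map a normalized displacement d in [0, 1] to a PWM pulse width in microseconds.

    min_us and max_us are assumed pulse limits; real servos need per-unit calibration.
    """
    # Clamp so that out-of-range commands never exceed the assumed mechanical limits.
    d = min(max(d, 0.0), 1.0)
    return min_us + d * (max_us - min_us)
```

Clamping before the linear map is a simple way to keep a faulty upstream value from driving a servo past its limit.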

Feature Extraction

We use Mel-Frequency Cepstral Coefficients (MFCCs) as the acoustic feature representation. The audio signal $x(t)$ is pre-processed (pre-emphasis, framing, windowing) into a sequence of frames $\{x_n\}_{n=1}^{T}$. For each frame $x_n$, a 39-dimensional MFCC feature vector $F_n \in \mathbb{R}^{D}$ (so $D = 39$) is extracted. To capture dynamic characteristics, we also compute first-order ($\Delta$) and second-order ($\Delta\Delta$) differential features. The first-order delta is calculated as:

$$
d_t = \begin{cases}
C_{t+1} - C_t & t < K \\
\frac{\sum_{k=1}^{K} k (C_{t+k} - C_{t-k})}{2 \sum_{k=1}^{K} k^2} & K \leq t < Q-K \\
C_t - C_{t-1} & t \geq Q-K
\end{cases}
$$

where $d_t$ is the first-order delta at time $t$, $C_t$ is the $t$-th cepstral coefficient, $Q$ is the order, and $K$ is the time difference for the derivative. The second-order delta is obtained by applying the same formula to the first-order delta result.
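The piecewise delta formula can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it treats `c` as the trajectory of a single cepstral coefficient over frames, uses 0-based frame indices, takes $Q$ to be the number of frames, and the function name `delta_features` and default `K=2` are assumptions.

```python
def delta_features(c, K=2):
    """First-order delta of a coefficient trajectory c (list of floats, one per frame)."""
    Q = len(c)
    if Q <= 2 * K:
        raise ValueError("this sketch assumes more than 2*K frames")
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    out = []
    for t in range(Q):
        if t < K:
            # Leading edge: forward difference.
            d = c[t + 1] - c[t]
        elif t < Q - K:
            # Interior: weighted regression over K frames on each side.
            d = sum(k * (c[t + k] - c[t - k]) for k in range(1, K + 1)) / denom
        else:
            # Trailing edge: backward difference.
            d = c[t] - c[t - 1]
        out.append(d)
    return out
```

Applying `delta_features` to its own output gives the second-order delta, as described above.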

Model Architecture: The Speech2Head Framework

The speech2head model is designed to map sequences of acoustic features to sequences of facial and neck control coefficients. Its core is a feature fusion module containing two sub-networks: a Content Encoder and an Emotion Encoder. The overall fusion is defined as:

$$
F_{\text{fusion}} = f_{\text{concat}}(E_{\text{con}}, E_{\text{emo}})
$$

where $E_{\text{con}}$ and $E_{\text{emo}}$ are the outputs of the content and emotion encoders respectively, and $f_{\text{concat}}$ is a concatenation operation.

The preprocessed audio feature matrix $F \in \mathbb{R}^{T \times D}$ (one row per frame) is processed frame-by-frame through convolutional layers:

$$
z^{(i)}_t = \text{ReLU}(x_t * W^{(i)} + b^{(i)})
$$

The convolved features are stacked temporally into $Z = [z_1, z_2, \ldots, z_T] \in \mathbb{R}^{T \times D'}$. This sequence $Z$ is fed into a multi-layer LSTM (mLSTM) module. The final hidden state at the last timestep serves as the content encoder output $E_{\text{con}}$. The mLSTM update rules are:

$$
\begin{aligned}
f_t &= \exp(w_f^T x_t), \quad i_t = \exp(w_i^T x_t), \quad o_t = \sigma(W_o x_t) \\
C_t &= f_t C_{t-1} + i_t v_t k_t^T \\
n_t &= f_t n_{t-1} + i_t k_t \\
h_t &= o_t \odot \tanh\left(\frac{C_t q_t}{\max(\|n_t^T q_t\|, 1)}\right)
\end{aligned}
$$

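where $x_t$ is the input, $q_t$, $k_t$, and $v_t$ are the query, key, and value vectors obtained from $x_t$ by learned linear projections (as in the standard mLSTM formulation), $C_t$ is the matrix memory, $n_t$ is the normalizer state, $W_*$ and $w_*$ are weights, $\sigma$ is the sigmoid activation, and the $\max$ function bounds the denominator below by 1 so that it is never zero.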
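To make the recurrence concrete, the following NumPy sketch performs one mLSTM update for a single timestep. It is an illustration under stated assumptions rather than the authors' code: the weight names in `p` are hypothetical, the query, key, and value vectors are taken to be plain linear projections of the input, the output gate is assumed to be a sigmoid of a linear projection of the input, and the numerical stabilization usually applied to exponential gates is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_step(x, C_prev, n_prev, p):
    """One mLSTM update.

    x: input vector (d_in,); C_prev: matrix memory (d, d); n_prev: normalizer (d,).
    p: dict of hypothetical weights W_q, W_k, W_v, W_o with shape (d, d_in)
       and w_f, w_i with shape (d_in,).
    """
    q, k, v = p["W_q"] @ x, p["W_k"] @ x, p["W_v"] @ x
    f = np.exp(p["w_f"] @ x)              # scalar forget gate
    i = np.exp(p["w_i"] @ x)              # scalar input gate
    o = sigmoid(p["W_o"] @ x)             # vector output gate (assumed form)
    C = f * C_prev + i * np.outer(v, k)   # matrix memory update
    n = f * n_prev + i * k                # normalizer update
    denom = max(abs(n @ q), 1.0)          # lower bound of 1 avoids division by zero
    h = o * np.tanh(C @ q / denom)
    return h, C, n
```

Running this step over the frames of $Z$ and keeping the last `h` would give the final hidden state used as the encoder output.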

Simultaneously, the audio features are passed to the Emotion Encoder, a separate stack of mLSTM layers. The final hidden state $h_T \in \mathbb{R}^{D''}$ is projected through a linear layer to produce the emotion encoding $E_{\text{emo}}$.

In the fusion stage, $E_{\text{con}}$ and $E_{\text{emo}}$ are concatenated. The fused feature vector undergoes further processing through convolutional and pooling layers, followed by Batch Normalization and ReLU activation. This allows the model to jointly consider emotional and content information from the audio. Finally, the processed features are fed into a regression network that outputs the predicted motion control coefficients for each timestep.

Loss Function

To train the model, we use a composite loss function combining self-reconstruction loss and velocity loss to ensure both accuracy and temporal smoothness of the generated coefficients.

$$
\mathcal{L} = \lambda_1 \mathcal{L}_{\text{self}} + \lambda_2 \mathcal{L}_{\text{velocity}}
$$

$$
\mathcal{L}_{\text{self}} = \|\hat{b}_t - b_t\|_2
$$

$$
\mathcal{L}_{\text{velocity}} = \|(\hat{b}_t - \hat{b}_{t-1}) - (b_t - b_{t-1})\|_2
$$

where $\mathcal{L}_{\text{self}}$ and $\mathcal{L}_{\text{velocity}}$ are the self-reconstruction and velocity losses, $\lambda_1$ and $\lambda_2$ are balancing weights, and $b_t$ and $\hat{b}_t$ are the ground-truth and predicted control coefficients at frame $t$.
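A minimal NumPy sketch of the composite loss is given below. The per-frame norms follow the definitions above; averaging them over frames, the function name, and the default weights of 1.0 are assumptions, since the text does not state how frames are aggregated or what values $\lambda_1$ and $\lambda_2$ take.

```python
import numpy as np

def composite_loss(pred, target, lam_self=1.0, lam_vel=1.0):
    """Self-reconstruction plus velocity loss for coefficient sequences of shape (T, D), T >= 2."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Per-frame L2 error between predicted and ground-truth coefficients.
    l_self = np.linalg.norm(pred - target, axis=1).mean()
    # Per-frame L2 error between frame-to-frame differences (velocities).
    vel_pred = pred[1:] - pred[:-1]
    vel_true = target[1:] - target[:-1]
    l_vel = np.linalg.norm(vel_pred - vel_true, axis=1).mean()
    return lam_self * l_self + lam_vel * l_vel
```

A constant offset between prediction and ground truth is penalized only by the self-reconstruction term, whereas frame-to-frame jitter is penalized by both terms.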

Servo Control Parameter Mapping Strategy

To actuate the physical bionic robot, we collaborate with professional animators to create 25 semantically meaningful robot head motion templates. Each template corresponds to the action of a single servo (e.g., raising the left eyebrow). The servo displacement for each of the 25 motors is generated by a linear combination of the predicted Blendshape coefficients. The optimal mapping is formulated as a constrained optimization problem to minimize the difference between the desired facial expression combination and the achievable servo movements, respecting physical limits:

$$
\begin{aligned}
\min_{W} & \sum_{j=1}^{25} \left( s_j - \sum_{i=1}^{52} w_{ij} x_i \right)^2 \\
\text{s.t.} & \quad w_{ij} \geq 0 \quad \forall i, \forall j \\
& \quad s_{j}^{\text{min}} \leq s_j \leq s_{j}^{\text{max}} \quad \forall j
\end{aligned}
$$

where $s_1, \ldots, s_{25}$ are the servo states, $x_1, \ldots, x_{52}$ are the facial control coefficients, $w_{ij}$ is the contribution weight of the $i$-th coefficient to the $j$-th servo, and $W$ is the weight matrix containing all $w_{ij}$.
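Once a weight matrix has been fixed, applying the mapping at run time reduces to a clipped linear combination. The sketch below shows only that application step, not the optimization carried out with the animators; the function name and the array layout (`W` with one row per coefficient and one column per servo) are assumptions.

```python
import numpy as np

def coefficients_to_servos(x, W, s_min, s_max):
    """Map control coefficients x (n_coeff,) to servo states (n_servo,).

    W has shape (n_coeff, n_servo), so W[i, j] is the weight of coefficient i on servo j.
    """
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    if np.any(W < 0):
        raise ValueError("weights must be non-negative")
    s = W.T @ x                      # s_j = sum_i w_ij * x_i
    return np.clip(s, s_min, s_max)  # enforce per-servo physical limits
```

For the platform described here, `x` would have 52 entries and the result 25, one per servo.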

Experiments and Analysis

Datasets

We utilize two widely used open-source audiovisual datasets: RAVDESS and HDTF. Since these are 2D datasets, we employ a 3D reconstruction pipeline to extract the necessary 3D facial and head motion control coefficients (Blendshapes and neck rotation angles) from the 2D video frames, creating a time-aligned dataset of audio and control parameters.

Dataset	Description	Use Case
RAVDESS	Multimodal emotional speech. 24 actors, 1,440 clips, 8 emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprised).	Primary training and evaluation for emotion-oriented generation.
HDTF	High-resolution talking face videos. ~16 hours, 300+ subjects, 10,000 sentences.	Evaluation of generalization ability on unseen, in-the-wild data.

The processed datasets are split 80%/10%/10% for training, validation, and testing, respectively.

Experimental Setup

We train our model primarily on the RAVDESS dataset. We use dynamic batching (batch size of 4), the AdamW optimizer with an initial learning rate of $10^{-5}$, and weight decay of $10^{-2}$. Audio augmentation (time-shifting with 50% probability within 1/30 sec) is applied for better generalization. Training runs for 500 epochs on an NVIDIA 2090 GPU, and we select the model with the lowest validation loss.
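The time-shift augmentation can be sketched as follows. This is one plausible reading of the description, not the authors' code: the shift is drawn uniformly in samples, the vacated region is zero-padded rather than wrapped around, and the function name and parameters are assumptions.

```python
import random

def random_time_shift(samples, sample_rate, max_shift_s=1.0 / 30.0, p=0.5, rng=random):
    """With probability p, shift a waveform by a random offset of at most max_shift_s seconds."""
    samples = list(samples)
    if not samples or rng.random() >= p:
        return samples
    limit = min(len(samples), round(sample_rate * max_shift_s))
    shift = rng.randint(-limit, limit)
    if shift > 0:
        # Delay: pad the start with silence and drop the tail.
        return [0.0] * shift + samples[:-shift]
    if shift < 0:
        # Advance: drop the head and pad the end with silence.
        return samples[-shift:] + [0.0] * (-shift)
    return samples
```

Keeping the output length equal to the input length means the audio stays aligned with a fixed-length coefficient sequence, up to the small shift.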

Comparative Experiments

Baseline Models: We compare our speech2head method against two state-of-the-art speech-driven blendshape animation models: SAiD and EmoTalk. SAiD uses a lightweight Transformer, while EmoTalk employs an emotion-disentanglement encoder.

Evaluation Metrics: We use two metrics: 1) Lip Sync Error (LSE): The mean L2 norm error of lip-related control coefficients, measuring audio-lip synchronization. 2) Emotion Sync Error (ESE): The mean L2 norm error of non-lip coefficients (eyebrows, cheeks, head pose), measuring the synchronization of overall emotional expression with speech content and prosody.
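Both metrics can be computed with the same routine by splitting the coefficient vector into lip-related and remaining channels. The sketch below is illustrative: which indices count as lip-related (`lip_idx`) is an assumption that depends on the coefficient layout, and the mean is taken over frames.

```python
import numpy as np

def sync_errors(pred, target, lip_idx):
    """Return (LSE, ESE) for coefficient sequences of shape (T, D)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    lip = np.zeros(pred.shape[1], dtype=bool)
    lip[list(lip_idx)] = True
    # LSE: mean per-frame L2 error over lip-related coefficients.
    lse = np.linalg.norm(pred[:, lip] - target[:, lip], axis=1).mean()
    # ESE: mean per-frame L2 error over all remaining coefficients.
    ese = np.linalg.norm(pred[:, ~lip] - target[:, ~lip], axis=1).mean()
    return lse, ese
```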

Quantitative Results: The results on both RAVDESS and the unseen HDTF dataset are shown below. Our method achieves the lowest ESE on both datasets, indicating superior overall emotional expression generation. EmoTalk achieves a lower LSE on both datasets; our LSE remains below that of SAiD, and our method additionally generates neck motion, which the baselines do not.

Method	RAVDESS ESE ↓	RAVDESS LSE ↓	HDTF ESE ↓	HDTF LSE ↓	Includes Neck
SAiD	0.04784	0.02518	0.04702	0.01981	×
EmoTalk	0.02382	0.00482	0.02083	0.00278	×
Our Method	0.01373	0.00941	0.01655	0.00965	√

Efficiency Analysis: A key requirement for deployment on a bionic robot is efficiency, so our model is designed to be lightweight. As shown in the table below, our model size is only 8.3 MB, roughly 99% smaller than either baseline (1288.8 MB for SAiD and 835.9 MB for EmoTalk), making it well suited to mobile or embedded deployment. The average inference time per sample is 0.324 seconds, within the threshold for real-time interaction (<0.5s). While EmoTalk is faster (0.076 seconds), it does not generate neck motion and focuses only on facial expressions.

Method	Inference Time (s) ↓	Model Size (MB) ↓
SAiD	10.053	1288.8
EmoTalk	0.076	835.9
Our Method	0.324	8.3
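
The size and latency comparisons can be checked directly from the table; the arithmetic below uses only the reported values:

$$
1 - \frac{8.3}{1288.8} \approx 0.994, \qquad 1 - \frac{8.3}{835.9} \approx 0.990, \qquad \frac{10.053}{0.324} \approx 31.0, \qquad \frac{0.324}{0.076} \approx 4.3
$$

That is, the model is about 99% smaller than either baseline, about 31 times faster than SAiD, and about 4.3 times slower than EmoTalk.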

Ablation Study

We conduct ablation studies to validate the contribution of key modules in our framework. The results, presented in the table below, clearly show that removing either the mLSTM-based emotion encoder or the content encoder leads to an increase in both ESE and LSE on both datasets. This confirms the effectiveness of our implicit emotion-oriented feature fusion design.

Method	RAVDESS ESE ↓	RAVDESS LSE ↓	HDTF ESE ↓	HDTF LSE ↓
w/o mLSTM (Emo Encoder)	0.0176	0.0112	0.0206	0.0163
w/o Content Encoder	0.0215	0.0145	0.0257	0.0149
Full Model	0.0137	0.0094	0.0166	0.0097

Velocity Loss Evaluation

We retrained the model on RAVDESS without the velocity loss term ($\lambda_2=0$). Visual analysis of the predicted coefficient trajectories (e.g., for jaw opening or brow furrowing) showed that the absence of velocity loss led to increased frame-to-frame jitter and instability. The inclusion of $\mathcal{L}_{\text{velocity}}$ effectively suppressed these erratic jumps, resulting in smoother and more natural-looking facial animation sequences, which is critical for the perceived naturalness of the bionic robot’s movements.

Emotional Expression Validation

To objectively assess the emotional content of our generated animations, we trained a separate emotion recognition network on the Blendshape coefficients from the RAVDESS dataset. We then used this classifier to label the emotion of the Blendshape sequences generated by our speech2head model from audio. The results, compared against the ground-truth labels from the original dataset, are shown below. The generated sequences reach an average accuracy of 50.49%, compared with 57.30% for the ground-truth sequences, with the highest accuracy for disgust (82.73%) and anger (63.64%). This indicates that our implicit method encodes emotional features into the output control parameters.

Emotion Category	Ground-Truth Accuracy (%)	Generated Sequence Accuracy (%)
Calm	66.67	58.33
Happy	56.25	47.50
Sad	41.67	35.00
Fearful	35.29	32.88
Surprised	33.33	33.33
Disgusted	87.91	82.73
Angry	80.01	63.64
Average	57.30	50.49

Application Analysis: Simulated Interactive Dialogue

We demonstrate a potential application in proactive health by configuring the bionic robot as a conversational companion. A large language model provides dialogue responses, which are converted to speech via a text-to-speech engine. This audio is then processed by our speech2head system to drive the robot’s expressions in real-time. In a simulated dialogue scenario (e.g., a user expressing sadness over losing a game, with the robot offering consolation), the robot generates appropriate empathetic expressions (softened eyes, slight head tilt) synchronized with the comforting speech.

An informal evaluation with 10 participants rating the interaction on four factors showed promising results: the bionic robot received high ratings for Dialogue Reaction Sensitivity and Expression Motion Fluency. Ratings for Emotional Expression Accuracy and Human-Robot Dialogue Realism were positive but indicated room for improvement, often related to the limitations of the underlying speech synthesis or the mechanical constraints of the platform rather than the expression mapping itself.

Evaluation Factor	Excellent	Good	Fair	Poor
Dialogue Reaction Sensitivity	7	2	1	0
Emotional Expression Accuracy	4	3	1	2
Expression Motion Fluency	5	3	2	0
Human-Robot Dialogue Realism	4	4	1	1

Conclusion and Future Work

This paper presented a novel speech-driven method for controlling a bionic robot’s facial expressions and head movements. Unlike previous works, our approach holistically generates control parameters for both face and neck, enabling more lifelike animations. The proposed implicit emotion-oriented feature fusion autoencoder successfully infers emotional cues from speech without explicit labels, enriching the generated expressions. Through a specialized servo mapping strategy, these high-level parameters are accurately realized on a physical 25-DOF bionic robot head. Quantitative and qualitative experiments demonstrate that our method outperforms existing audio-driven animation techniques in generating emotionally synchronized expressions and achieves this with a highly lightweight model suitable for real-time deployment on resource-constrained platforms.

Despite these advancements, several limitations point to future research directions. First, our current model operates on pre-recorded audio clips and is not designed for true streaming inference, which is essential for uninterrupted live interaction. Second, the training data is derived from 2D video reconstruction, which may not capture the full fidelity of 3D facial micro-expressions compared to data from 3D scanners. Third, the mechanical design of the bionic robot imposes inherent constraints, limiting the reproduction of certain expressions (e.g., sticking out the tongue, showing teeth in a broad grin). Future work will focus on developing streamable models, incorporating high-fidelity 4D scan data for training, and exploring more advanced mechanical designs to expand the expressive repertoire of the bionic robot, ultimately moving closer to perfectly natural and responsive human-robot communication.