In the realm of robot technology, enabling seamless human-robot interaction through voice commands is a critical advancement. However, environmental noise often degrades the clarity and accuracy of speech recognition systems, posing significant challenges for reliable robot control. To address this, we propose a novel speech recognition method that integrates improved Mel Frequency Cepstral Coefficient (MFCC) feature extraction with a Deep Neural Network (DNN)-based framework. This approach not only enhances feature discrimination but also leverages advanced acoustic modeling and speech enhancement techniques to improve robustness in noisy conditions. By focusing on robot technology applications, our method aims to facilitate more intuitive and efficient control of robotic systems, such as autonomous vehicles and industrial assistants, where precise voice command interpretation is essential.
The core of our methodology involves refining MFCC features using Linear Discriminant Analysis (LDA), Maximum Likelihood Linear Transformation (MLLT), and Speaker Adaptive Training (SAT). These transformations reduce feature dimensionality and redundancy, leading to more discriminative representations. For the acoustic model, we enhance the traditional DNN-Hidden Markov Model (DNN-HMM) by incorporating a Deep Boltzmann Machine (DBM), which improves the modeling of complex speech patterns. Additionally, we develop a speech enhancement module combining DNNs with Harmonic Enhancement (HE) to suppress noise interference. This integrated framework ensures that robot technology can operate effectively in real-world environments, such as factories or public spaces, where background noise is prevalent.
To begin, let us delve into the MFCC feature extraction process. The MFCC is derived from the human auditory system’s nonlinear perception of frequency, which is modeled by the Mel scale. The conversion from linear frequency to Mel frequency is given by:
$$ F_{\text{mel}} = 2595 \log_{10} \left(1 + \frac{f}{700}\right) $$
where \( F_{\text{mel}} \) is the perceived frequency and \( f \) is the actual frequency in Hz. This transformation aligns with how robot technology processes auditory inputs, mimicking human-like perception. The Mel filter bank consists of triangular filters spaced according to the Mel scale, with the frequency response for each filter defined as:
$$ H_m(k) =
\begin{cases}
0, & k < f(m-1) \\
\frac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \leq k \leq f(m) \\
\frac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \leq k \leq f(m+1) \\
0, & k > f(m+1)
\end{cases} $$
Here, \( H_m(k) \) is the transfer function of the \( m \)-th filter, \( k \) is the frequency bin index from the Fast Fourier Transform (FFT), and \( f(m) \) is the center frequency of the \( m \)-th filter, calculated as:
$$ f(m) = \frac{N}{F_s} B^{-1} \left( F_{\text{mel}}(f_l) + m \frac{F_{\text{mel}}(f_h) - F_{\text{mel}}(f_l)}{M + 1} \right) $$
where \( N \) is the FFT length, \( F_s \) is the sampling frequency, \( B^{-1} \) is the inverse of the Mel frequency transformation, \( f_l \) and \( f_h \) are the lower and upper frequency bounds, and \( M \) is the number of filters. This setup is crucial for robot technology to handle varied acoustic environments. After pre-emphasis, framing, and windowing the speech signal, we compute the FFT to obtain the frequency domain representation:
$$ X(i, k) = \text{FFT}[x_i(m)], \quad 0 \leq m \leq 320 $$
where \( X(i, k) \) is the frequency domain signal and \( x_i(m) \) is the time-domain speech frame. The spectral energy is then derived as \( E_i(k) = |X(i, k)|^2 \). Applying the Mel filter bank yields:
$$ S(i, m) = \sum_{k=0}^{N-1} E_i(k) H_m(k), \quad 0 \leq m < M $$
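As an illustrative sketch (not the exact configuration used in our experiments), the filter bank above can be built and applied with a few lines of numpy; the sampling rate, FFT length, filter count, and frequency bounds below are assumed example values.

```python
import numpy as np

def hz_to_mel(f):
    # F_mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(mel):
    # Inverse Mel transformation B^{-1}
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(n_fft=512, fs=16000, n_filters=30, f_low=0.0, f_high=8000.0):
    """Triangular filters H_m(k) over the N/2 + 1 FFT bins."""
    # M + 2 points equally spaced on the Mel scale, mapped back to Hz, then to bin indices
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bins = np.floor((n_fft / fs) * mel_to_hz(mel_points)).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                     # rising edge of the triangle
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right + 1):                # falling edge of the triangle
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return H

# One pre-emphasized, windowed 320-sample frame x_i(m) at an assumed 16 kHz:
# E = np.abs(np.fft.rfft(frame, n=512)) ** 2    # spectral energy E_i(k)
# S = mel_filter_bank() @ E                     # S(i, m): one energy per filter
```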
Finally, the MFCC coefficients are obtained through the Discrete Cosine Transform (DCT):
$$ \text{MFCC}(i, n) = \sqrt{\frac{2}{M}} \sum_{m=0}^{M-1} \log_{10}[S(i, m)] \cos \left( \frac{\pi n (2m + 1)}{2M} \right) $$
where \( n \) is the feature dimension. To capture dynamic features, we compute the Delta coefficients:
$$ D_i = \frac{\sum_{\theta=1}^{\Theta} \theta \left( \text{MFCC}_{i+\theta} - \text{MFCC}_{i-\theta} \right)}{2 \sum_{\theta=1}^{\Theta} \theta^2} $$
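Continuing the sketch, the DCT and Delta steps map the filter bank energies to the final feature vectors; keeping 13 cepstral coefficients and using \( \Theta = 2 \) are common defaults assumed here rather than values fixed by the method.

```python
import numpy as np

def mfcc_from_filterbank(S, n_ceps=13):
    """S: (num_frames, M) Mel filter bank energies S(i, m)."""
    num_frames, M = S.shape
    log_S = np.log10(np.maximum(S, 1e-10))           # guard against log of zero
    m = np.arange(M)                                  # filter index m = 0..M-1
    mfcc = np.zeros((num_frames, n_ceps))
    for n in range(n_ceps):
        basis = np.cos(np.pi * n * (2 * m + 1) / (2.0 * M))   # DCT basis from the formula above
        mfcc[:, n] = np.sqrt(2.0 / M) * (log_S * basis).sum(axis=1)
    return mfcc

def delta(features, theta=2):
    """Delta coefficients D_i computed over a +/- theta frame window."""
    padded = np.pad(features, ((theta, theta), (0, 0)), mode="edge")
    denom = 2.0 * sum(t * t for t in range(1, theta + 1))
    out = np.zeros_like(features)
    for t in range(1, theta + 1):
        out += t * (padded[theta + t: theta + t + len(features)]
                    - padded[theta - t: theta - t + len(features)])
    return out / denom

# feats = np.hstack([mfcc, delta(mfcc)])   # static + dynamic features per frame
```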
These steps form the basis of our feature extraction, but to enhance performance in robot technology, we apply LDA, MLLT, and SAT. LDA reduces dimensionality by maximizing inter-class variance and minimizing intra-class variance. The between-class and within-class scatter matrices are defined as:
$$ S_b = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^T $$
$$ S_w = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_j^i - \mu_i)(x_j^i - \mu_i)^T $$
where \( \mu_i \) is the mean of class \( i \), \( \mu \) is the overall mean, \( n_i \) is the number of samples in class \( i \), and \( c \) is the number of classes. The transformation matrix \( W_L \) is derived to project features into a lower-dimensional space: \( Y = W_L^T X \). MLLT further decorrelates the feature parameters by maximizing the likelihood under a linear transformation, while SAT adapts the speaker-independent model to individual speakers, improving personalization in robot technology applications.
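To make the LDA step concrete, a minimal sketch follows that forms \( S_b \) and \( S_w \) from labeled frames and extracts the leading discriminative directions; the class labels would come from the aligned HMM states, and the output dimensionality of 40 is an assumed example.

```python
import numpy as np

def lda_transform(X, labels, out_dim=40):
    """X: (num_frames, D) feature vectors; labels: (num_frames,) class ids.
    Returns W_L whose columns span the most discriminative directions."""
    overall_mean = X.mean(axis=0)
    D = X.shape[1]
    S_b = np.zeros((D, D))
    S_w = np.zeros((D, D))
    for c in np.unique(labels):
        X_c = X[labels == c]
        mu_c = X_c.mean(axis=0)
        diff = (mu_c - overall_mean)[:, None]
        S_b += len(X_c) * (diff @ diff.T)            # between-class scatter S_b
        centered = X_c - mu_c
        S_w += centered.T @ centered                 # within-class scatter S_w
    # Solve S_b w = lambda * S_w w and keep the leading eigenvectors
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:out_dim]].real

# Projection Y = W_L^T X, applied frame by frame:
# W_L = lda_transform(X, labels); Y = X @ W_L
```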

For the acoustic model, we improve upon the DNN-HMM framework by integrating a Deep Boltzmann Machine (DBM). Traditional DNN-HMM models struggle with complex speech patterns, but DBM enhances the model’s ability to capture high-level features. The energy function of a Restricted Boltzmann Machine (RBM), the building block of DBM, is:
$$ E(v, h) = -\sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij} h_i v_j - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i $$
where \( v \) and \( h \) are visible and hidden units, \( w_{ij} \) are weights, and \( b_j \), \( c_i \) are biases. The joint probability is given by \( p(v, h) = \frac{e^{-E(v, h)}}{\sum_{v,h} e^{-E(v, h)}} \). In DBM, multiple layers of RBMs are stacked with undirected connections, allowing for better representation learning. The training involves backpropagation with a cross-entropy loss function:
$$ F_{\text{CE}} = -\sum_{u=1}^{U} \sum_{t=1}^{T} \log y_{u_t}(s_{u_t}) $$
where \( y_{u_t}(s_{u_t}) \) is the network's output probability for the reference state \( s_{u_t} \) at time \( t \) of utterance \( u \). The gradient with respect to the pre-softmax activation \( a_{u_t}(s) \) at the output layer is:
$$ \frac{\partial F_{\text{CE}}}{\partial a_{u_t}(s)} = y_{u_t}(s) - \delta_{s, s_{u_t}} $$
where \( \delta_{s, s_{u_t}} \) is the Kronecker delta. This improved DNN-HMM model significantly boosts recognition accuracy in robot technology scenarios.
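For concreteness, the output-layer gradient in the last equation can be checked with a few lines of numpy; the five-state example below is purely illustrative.

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax over the HMM states
    e = np.exp(a - a.max())
    return e / e.sum()

def output_layer_gradient(activations, target_state):
    """activations: pre-softmax outputs a_{u_t}(s) for one frame.
    target_state: index of the aligned state s_{u_t}."""
    y = softmax(activations)              # y_{u_t}(s)
    delta = np.zeros_like(y)
    delta[target_state] = 1.0             # Kronecker delta term
    return y - delta                      # dF_CE / da_{u_t}(s)

# Example: five states, aligned state index 2
grad = output_layer_gradient(np.array([0.3, 1.2, 2.0, -0.5, 0.1]), target_state=2)
```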
To mitigate noise interference, we incorporate a speech enhancement module based on DNN and Harmonic Enhancement (HE). The DNN predicts autoregressive parameters for both clean speech and noise. The cost function for DNN training is:
$$ J(w, b) = \frac{1}{M_b} \sum_{i=1}^{M_b} [d(i) – h_{w,b}(x(i))]^2 $$
where \( M_b \) is the batch size, \( d(i) \) is the desired output, and \( h_{w,b} \) is the DNN mapping function. The Wiener filter is then constructed using the estimated parameters:
$$ \text{WF}(k) = \frac{\hat{g}_x |A_x(k)|^{-2}}{\hat{g}_x |A_x(k)|^{-2} + \hat{g}_w |A_w(k)|^{-2}} $$
where \( \hat{g}_x \) and \( \hat{g}_w \) are the gains for speech and noise, and \( |A_x(k)|^2 \), \( |A_w(k)|^2 \) are the spectral shapes. HE identifies harmonic frequencies by checking peak conditions:
$$ 20 \log_{10}[A(w_a)] - 20 \log_{10}[\max\{A(w_i)\}] > 8 \text{ dB} $$
$$ \frac{|w_a - l w_0|}{l w_0} < 0.1 $$
where \( w_a \) is the peak frequency, \( w_0 \) is the fundamental frequency, and \( l \) is the harmonic index. A comb filter is applied to enhance harmonics:
$$ H_I(w_k) =
\begin{cases}
\text{WF}(w_k) \cdot e^{-2(w_k - w_a)^2 / \sigma^2}, & w_k \in \left[w_a - \frac{w_0}{2}, w_a + \frac{w_0}{2}\right] \\
\text{WF}(w_k), & \text{otherwise}
\end{cases} $$
This approach effectively reduces noise while preserving speech quality, which is vital for robot technology operating in dynamic environments.
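A minimal sketch of this enhancement stage is given below: the Wiener gain \( \text{WF}(k) \) is formed from AR (LPC) spectral shapes, and the Gaussian comb weighting is then applied around detected harmonic peaks. The gains, AR coefficients, peak list, fundamental frequency, and \( \sigma \) are placeholders for quantities the DNN and pitch analysis would supply.

```python
import numpy as np

def wiener_gain(g_x, a_x, g_w, a_w, n_fft=512):
    """WF(k) from estimated gains and AR coefficients of speech (x) and noise (w)."""
    Ax2 = np.abs(np.fft.rfft(a_x, n_fft)) ** 2       # |A_x(k)|^2 on the FFT grid
    Aw2 = np.abs(np.fft.rfft(a_w, n_fft)) ** 2       # |A_w(k)|^2
    num = g_x / np.maximum(Ax2, 1e-12)
    return num / (num + g_w / np.maximum(Aw2, 1e-12))

def comb_enhance(wf, bin_hz, harmonic_peaks_hz, f0_hz, sigma=50.0):
    """Apply H_I(w_k): within +/- f0/2 of each harmonic peak w_a, weight WF(w_k)
    with a Gaussian centred on w_a; leave WF unchanged elsewhere."""
    freqs = np.arange(len(wf)) * bin_hz
    H = wf.copy()
    for wa in harmonic_peaks_hz:
        mask = np.abs(freqs - wa) <= f0_hz / 2.0
        H[mask] = wf[mask] * np.exp(-2.0 * (freqs[mask] - wa) ** 2 / sigma ** 2)
    return H

# Hypothetical usage: AR coefficients and gains predicted by the DNN,
# harmonic peaks and f0 from pitch analysis of the noisy frame.
# wf = wiener_gain(g_x=1.0, a_x=np.array([1.0, -0.9]), g_w=0.1, a_w=np.array([1.0]))
# enhanced_gain = comb_enhance(wf, bin_hz=16000 / 512, harmonic_peaks_hz=[200, 400], f0_hz=200)
```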
In our experiments, we evaluated the proposed method under various noise conditions and compared it with existing techniques. The performance metrics included Segmental Signal-to-Noise Ratio (SSNR), Perceptual Evaluation of Speech Quality (PESQ), and Word Error Rate (WER). The following table summarizes the SSNR results for different algorithms across noise types and input SNRs:
| Noise | Input SNR (dB) | MIC | SHMM | DNNHE |
|---|---|---|---|---|
| Babble | -5 | 11.4 | 10.1 | 12.7 |
| Babble | 0 | 9.7 | 8.5 | 10.7 |
| Babble | 5 | 8.4 | 6.6 | 9.1 |
| Babble | 10 | 6.6 | 4.4 | 7.8 |
| Factory | -5 | 15.4 | 12.6 | 16.0 |
| Factory | 0 | 13.5 | 10.2 | 14.4 |
| Factory | 5 | 11.6 | 8.1 | 13.0 |
| Factory | 10 | 9.4 | 5.8 | 10.5 |
| White | -5 | 17.0 | 13.2 | 18.5 |
| White | 0 | 14.0 | 9.2 | 14.3 |
| White | 5 | 11.5 | 6.3 | 12.4 |
| White | 10 | 8.9 | 3.2 | 10.6 |
The results demonstrate that our DNNHE method consistently achieves higher SSNR values, indicating better noise suppression. For instance, in factory noise at 5 dB input SNR, DNNHE attains an SSNR of 13.0, compared to 11.6 for MIC and 8.1 for SHMM. Similarly, the PESQ scores, as shown in the next table, highlight improved speech quality:
| Noise | Input SNR (dB) | MIC | SHMM | DNNHE |
|---|---|---|---|---|
| Babble | -5 | 1.5 | 1.7 | 1.7 |
| Babble | 0 | 1.9 | 2.0 | 2.1 |
| Babble | 5 | 2.3 | 2.4 | 2.5 |
| Babble | 10 | 2.7 | 2.8 | 2.8 |
| Factory | -5 | 1.9 | 1.8 | 2.2 |
| Factory | 0 | 2.4 | 2.4 | 2.6 |
| Factory | 5 | 2.7 | 2.6 | 2.8 |
| Factory | 10 | 3.0 | 2.9 | 3.0 |
| White | -5 | 2.0 | 1.0 | 2.1 |
| White | 0 | 2.2 | 1.3 | 2.4 |
| White | 5 | 2.4 | 1.7 | 2.7 |
| White | 10 | 2.7 | 2.2 | 2.8 |
In factory noise at -5 dB input SNR, DNNHE achieves a PESQ of 2.2, outperforming MIC (1.9) and SHMM (1.8). These enhancements are critical for robot technology, as they ensure clear voice command interpretation. Furthermore, we assessed the word error rate (WER) for different feature sets and models. The Tri-LDA-MLLT-SAT features, combined with the improved DNN-HMM, yielded the lowest WER. For example, at 20 dB SNR, the average WER for Tri-LDA-MLLT-SAT features was 24.9%, compared to 28.6% for triphone features and 27.3% for Tri-LDA-MLLT. The improved DNN-HMM model further reduced the WER to 22.1%, whereas GMM-HMM and QRNN-CTC models achieved 25.3% and 26.3%, respectively.
We also analyzed the impact of the number of filter banks on model performance. As the number of filters increased, both the sentence error rate (SER) and word error rate (WER) initially decreased and then rose slightly. With 30 filter banks, the improved DNN-HMM achieved an SER of 20.7% and a WER of 3.1%, outperforming the standard DNN-HMM (21.7% SER and 3.5% WER). This underscores the importance of optimal filter bank configuration in robot technology applications.
In conclusion, our proposed method significantly advances robot technology by enhancing speech recognition accuracy in noisy environments. The improved MFCC features and DNN-based models reduce word error rates and improve speech quality, enabling more reliable robot control. However, the extensive data requirements for DNN training pose challenges for real-time deployment. Future work will focus on optimizing the model for efficiency, such as through quantization or pruning, to make it more suitable for resource-constrained robotic systems. By continuing to refine these techniques, we can further integrate voice commands into various robot technology domains, from healthcare to manufacturing, ultimately fostering more natural human-robot collaborations.