Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What's In-Between
Speech processing plays an important role in any speech system, whether it's Automatic Speech Recognition (ASR), speaker recognition, or something else. Mel-Frequency Cepstral Coefficients (MFCCs) were very popular features for a long time; but more recently, filter banks are becoming increasingly popular. In this post, I will discuss filter banks and MFCCs and why filter banks are becoming increasingly popular.
Computing filter banks and MFCCs involves largely the same procedure: in both cases filter banks are computed, and with a few extra steps MFCCs can be obtained. In a nutshell, a signal goes through a pre-emphasis filter; then gets sliced into (overlapping) frames and a window function is applied to each frame; afterwards, we do a Fourier transform on each frame (or more specifically a Short-Time Fourier Transform) and calculate the power spectrum; and subsequently compute the filter banks. To obtain MFCCs, a Discrete Cosine Transform (DCT) is applied to the filter banks, retaining a number of the resulting coefficients while the rest are discarded. A final step in both cases is mean normalization.
Setup
For this post, I used a 16-bit PCM wav file from here, called “OSR_us_000_0010_8k.wav”, which has a sampling frequency of 8000 Hz. The wav file is a clean speech signal comprising a single voice uttering some sentences with some pauses in-between. For simplicity, I used the first 3.5 seconds of the signal, which corresponds roughly to the first sentence in the wav file.
I’ll be using Python 2.7.x, NumPy and SciPy. Some of the code used in this post is based on code available in this repository.
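A minimal sketch of the loading step, assuming the wav file has already been downloaded locally. The helper name load_first_seconds is hypothetical and only used for illustration; scipy.io.wavfile.read returns the sampling rate and the samples as a NumPy array.

```python
import numpy
import scipy.io.wavfile


def load_first_seconds(path, seconds=3.5):
    """Read a PCM wav file and keep only its first `seconds` seconds."""
    sample_rate, signal = scipy.io.wavfile.read(path)
    # Slice by sample count: seconds * samples-per-second
    signal = signal[0:int(seconds * sample_rate)]
    return sample_rate, signal
```

With the file from this post in the working directory, the call would look like sample_rate, signal = load_first_seconds('OSR_us_000_0010_8k.wav').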
The raw signal has the following form in the time domain:
Signal in the Time Domain
Pre-Emphasis
The first step is to apply a pre-emphasis filter on the signal to amplify the high frequencies. A pre-emphasis filter is useful in several ways: (1) balance the frequency spectrum since high frequencies usually have smaller magnitudes compared to lower frequencies, (2) avoid numerical problems during the Fourier transform operation and (3) may also improve the Signal-to-Noise Ratio (SNR).
The pre-emphasis filter can be applied to a signal \(x\) using the first order filter in the following equation:
\[y(t) = x(t) - \alpha x(t-1)\]
where typical values for the filter coefficient (\(\alpha\)) are 0.95 or 0.97. This can be implemented in a single line; here we use pre_emphasis = 0.97.
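One possible way to write the pre-emphasis filter with NumPy is sketched below. The short array standing in for the speech samples and the name emphasized_signal are assumptions made for illustration.

```python
import numpy

pre_emphasis = 0.97

# Stand-in for the speech samples (assumption: any 1-D array works here)
signal = numpy.array([1.0, 2.0, 3.0, 4.0])

# y(t) = x(t) - alpha * x(t-1); the first sample has no predecessor and is kept as-is
emphasized_signal = numpy.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
```

The slices signal[1:] and signal[:-1] line up each sample with its predecessor, so no explicit loop is needed.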
Pre-emphasis has a modest effect in modern systems, mainly because most of the motivations for the pre-emphasis filter can be achieved using mean normalization (discussed later in this post) except for avoiding the Fourier transform numerical issues which should not be a problem in modern FFT implementations.
The signal after pre-emphasis has the following form in the time domain:
Signal in the Time Domain after Pre-Emphasis
Framing
After pre-emphasis, we need to split the signal into short-time frames. The rationale behind this step is that frequencies in a signal change over time, so in most cases it doesn’t make sense to do the Fourier transform across the entire signal in that we would lose the frequency contours of the signal over time. To avoid that, we can safely assume that frequencies in a signal are stationary over a very short period of time. Therefore, by doing a Fourier transform over this short-time frame, we can obtain a good approximation of the frequency contours of the signal by concatenating adjacent frames.
Typical frame sizes in speech processing range from 20 ms to 40 ms with 50% (+/-10%) overlap between consecutive frames.
Popular settings are 25 ms for the frame size, frame_size = 0.025, and a 10 ms stride (15 ms overlap), frame_stride = 0.01.
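The framing step can be sketched as follows. This is one possible implementation, not necessarily the one used to produce the figures in this post: the stand-in signal, the zero-padding of the last frame, and the index-matrix approach are assumptions made for illustration.

```python
import numpy

sample_rate = 8000
frame_size = 0.025    # seconds
frame_stride = 0.01   # seconds

# Stand-in for the pre-emphasized signal: 1 second of samples (assumption)
emphasized_signal = numpy.arange(sample_rate, dtype=float)

# Convert from seconds to samples
frame_length = int(round(frame_size * sample_rate))
frame_step = int(round(frame_stride * sample_rate))
signal_length = len(emphasized_signal)

# Number of frames needed to cover the whole signal (always at least one)
num_frames = 1 + int(numpy.ceil(max(signal_length - frame_length, 0) / float(frame_step)))

# Zero-pad the end of the signal so that the last frame is complete
pad_signal_length = (num_frames - 1) * frame_step + frame_length
pad_signal = numpy.append(emphasized_signal, numpy.zeros(pad_signal_length - signal_length))

# indices[i, j] is the position in pad_signal of the j-th sample of the i-th frame
indices = (numpy.arange(frame_length)[numpy.newaxis, :]
           + frame_step * numpy.arange(num_frames)[:, numpy.newaxis])
frames = pad_signal[indices]
```

Each row of frames is one frame, and consecutive rows start frame_step samples apart, so neighbouring frames share frame_length - frame_step samples.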
Window
After slicing the signal into frames, we apply a window function such as the Hamming window to each frame. A Hamming window has the following form:
\[w[n] = 0.54 - 0.46 \cos \left( \frac{2 \pi n}{N - 1} \right)\]
where \(0 \leq n \leq N - 1\) and \(N\) is the window length. Plotting the previous equation yields the following plot:
Hamming Window
There are several reasons why we need to apply a window function to the frames, notably to counteract the assumption made by the FFT that the data is infinite and to reduce spectral leakage.
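As a small sketch, the Hamming window from this section can be computed directly from its equation and compared against numpy.hamming, which implements the same formula. The constant-valued frames are a stand-in used only for illustration.

```python
import numpy

frame_length = 200  # 25 ms at 8000 Hz

# Stand-in frames (assumption): 3 frames filled with ones
frames = numpy.ones((3, frame_length))

# Explicit form: w[n] = 0.54 - 0.46 * cos(2 * pi * n / (N - 1))
n = numpy.arange(frame_length)
window = 0.54 - 0.46 * numpy.cos(2 * numpy.pi * n / (frame_length - 1))

# The built-in NumPy window should agree with the explicit formula
matches_builtin = numpy.allclose(window, numpy.hamming(frame_length))

# Broadcasting multiplies every frame (row) by the same window
windowed_frames = frames * window
```

The window tapers each frame towards 0.08 at both ends, which softens the abrupt edges introduced by slicing.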
Fourier-Transform and Power Spectrum
We can now do an \(N\)-point FFT on each frame to calculate the frequency spectrum, which is also called the Short-Time Fourier-Transform (STFT), where \(N\) is typically 256 or 512 (NFFT = 512), and then compute the power spectrum (periodogram) using the following equation:
\[P = \frac{|FFT(x_i)|^2}{N}\]
where \(x_i\) is the \(i^{th}\) frame of signal \(x\).
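A minimal sketch of the FFT and power spectrum step, assuming the frames are stored one per row. The single 1000 Hz test frame and the names mag_frames and pow_frames are assumptions made for illustration; numpy.fft.rfft is used because the input is real, so only the non-negative frequency bins are needed.

```python
import numpy

NFFT = 512
sample_rate = 8000
frame_length = 200

# Stand-in frame (assumption): a Hamming-windowed 1000 Hz sine
t = numpy.arange(frame_length) / float(sample_rate)
frames = (numpy.sin(2 * numpy.pi * 1000 * t) * numpy.hamming(frame_length))[numpy.newaxis, :]

# rfft zero-pads each frame to NFFT samples and returns NFFT/2 + 1 bins per frame
mag_frames = numpy.absolute(numpy.fft.rfft(frames, NFFT))  # magnitude spectrum
pow_frames = (1.0 / NFFT) * (mag_frames ** 2)              # power spectrum (periodogram)

# Frequency (in Hz) of the strongest bin in the first frame
peak_hz = numpy.argmax(pow_frames[0]) * sample_rate / float(NFFT)
```

Bin k corresponds to the frequency k * sample_rate / NFFT, so the strongest bin of this test frame should sit close to 1000 Hz.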
Filter Banks
The final step to computing filter banks is applying triangular filters, typically 40 filters (nfilt = 40), on a Mel-scale to the power spectrum to extract frequency bands.
The Mel-scale aims to mimic the non-linear human ear perception of sound, by being more discriminative at lower frequencies and less discriminative at higher frequencies.
We can convert between Hertz (\(f\)) and Mel (\(m\)) using the following equations:
\[m = 2595 \log_{10} (1 + \frac{f}{700})\]
\[f = 700 (10^{m/2595} - 1) \]
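The two conversions translate directly into code. The function names hz_to_mel and mel_to_hz are hypothetical helpers introduced here for illustration.

```python
import numpy


def hz_to_mel(f):
    """Convert a frequency in Hertz (scalar or array) to Mel."""
    return 2595.0 * numpy.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Convert a value in Mel (scalar or array) back to Hertz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


# Points equally spaced in Mel become progressively wider apart in Hertz
mel_points = numpy.linspace(hz_to_mel(0.0), hz_to_mel(4000.0), 5)
hz_points = mel_to_hz(mel_points)
```

Because the two functions are inverses of each other, converting to Mel and back should recover the original frequency up to floating-point error.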
Each filter in the filter bank is triangular, having a response of 1 at the center frequency and decreasing linearly towards 0 until it reaches the center frequencies of the two adjacent filters, where the response is 0, as shown in this figure:
Filter bank on a Mel-Scale
This can be modeled by the following equation (taken from here):
\[
H_m(k) =
\begin{cases}
\hfill 0 \hfill & k < f(m - 1) \\
\\
\hfill \dfrac{k - f(m - 1)}{f(m) - f(m - 1)} \hfill & f(m - 1) \leq k < f(m) \\
\\
\hfill 1 \hfill & k = f(m) \\
\\
\hfill \dfrac{f(m + 1) - k}{f(m + 1) - f(m)} \hfill & f(m) < k \leq f(m + 1) \\
\\
\hfill 0 \hfill & k > f(m + 1) \\
\end{cases}
\]
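The equation above can be turned into a filter bank matrix roughly as follows. This is a sketch under stated assumptions: the flat stand-in power spectrum, the mapping of frequencies to FFT bins via floor((NFFT + 1) * f / sample_rate), and the final conversion to decibels are choices made here for illustration, not details taken from the text.

```python
import numpy

sample_rate = 8000
NFFT = 512
nfilt = 40

# Stand-in power spectrum (assumption): 5 frames of flat, unit power
pow_frames = numpy.ones((5, NFFT // 2 + 1))

# nfilt + 2 points equally spaced in Mel between 0 Hz and the Nyquist frequency
low_freq_mel = 0.0
high_freq_mel = 2595 * numpy.log10(1 + (sample_rate / 2.0) / 700)
mel_points = numpy.linspace(low_freq_mel, high_freq_mel, nfilt + 2)
hz_points = 700 * (10 ** (mel_points / 2595) - 1)

# f(m) expressed as FFT bin indices (assumed mapping)
bins = numpy.floor((NFFT + 1) * hz_points / sample_rate).astype(int)

fbank = numpy.zeros((nfilt, NFFT // 2 + 1))
for m in range(1, nfilt + 1):
    f_m_minus, f_m, f_m_plus = bins[m - 1], bins[m], bins[m + 1]
    # Rising slope: 0 at f(m - 1), approaching 1 at f(m)
    for k in range(f_m_minus, f_m):
        fbank[m - 1, k] = (k - f_m_minus) / float(f_m - f_m_minus)
    # Falling slope: 1 at f(m), approaching 0 at f(m + 1)
    for k in range(f_m, f_m_plus):
        fbank[m - 1, k] = (f_m_plus - k) / float(f_m_plus - f_m)

# Apply every filter to every frame: (frames x bins) times (bins x filters)
filter_banks = numpy.dot(pow_frames, fbank.T)
# Replace exact zeros to avoid log(0), then convert power to dB (assumed convention)
filter_banks = numpy.where(filter_banks == 0, numpy.finfo(float).eps, filter_banks)
filter_banks = 10 * numpy.log10(filter_banks)
```

Each row of fbank holds one triangular filter, and each row of filter_banks holds the nfilt band energies of one frame.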
After applying the filter bank to the power spectrum (periodogram) of the signal, we obtain the following spectrogram:
Spectrogram of the Signal
If the Mel-scaled filter banks are the desired features, we can skip to mean normalization.
Mel-frequency Cepstral Coefficients (MFCCs)
It turns out that filter bank coefficients computed in the previous step are highly correlated, which could be problematic in some machine learning algorithms.
Therefore, we can apply a Discrete Cosine Transform (DCT) to decorrelate the filter bank coefficients and yield a compressed representation of the filter banks.
Typically, for Automatic Speech Recognition (ASR), the resulting cepstral coefficients 2-13 are retained and the rest are discarded (num_ceps = 12).
The reason for discarding the other coefficients is that they represent fast changes in the filter bank coefficients, and these fine details don’t contribute to Automatic Speech Recognition (ASR).
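A minimal sketch of this step using SciPy's DCT. The random stand-in features, the type-II DCT and the orthonormal scaling (norm='ortho') are assumptions made for illustration; the text above only fixes which coefficients are kept.

```python
import numpy
from scipy.fftpack import dct

num_ceps = 12

# Stand-in filter bank features (assumption): 5 frames x 40 filters of random values
rng = numpy.random.RandomState(0)
filter_banks = rng.randn(5, 40)

# DCT along the filter axis of every frame
cepstra = dct(filter_banks, type=2, axis=1, norm='ortho')

# Column 0 holds the 1st coefficient, so columns 1..12 are coefficients 2-13
mfcc = cepstra[:, 1:(num_ceps + 1)]
```

Slicing after the transform keeps the low-order coefficients, which describe the smooth shape of the filter bank energies across bands.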
One may apply sinusoidal liftering^{1} to the MFCCs to de-emphasize higher MFCCs, which has been claimed to improve speech recognition in noisy signals.
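A sinusoidal lifter can be sketched as a per-coefficient weight applied to every frame. The parameter name cep_lifter, its value of 22, and indexing the kept coefficients from 0 are assumptions made for illustration; conventions differ between toolkits.

```python
import numpy

cep_lifter = 22  # assumption: lifter parameter, not specified in the text

# Stand-in MFCCs (assumption): 5 frames x 12 coefficients, all ones
mfcc = numpy.ones((5, 12))

(nframes, ncoeff) = mfcc.shape
n = numpy.arange(ncoeff)

# One weight per coefficient: 1 + (L / 2) * sin(pi * n / L)
lift = 1 + (cep_lifter / 2.0) * numpy.sin(numpy.pi * n / cep_lifter)

# Broadcasting applies the same weights to every frame
liftered_mfcc = mfcc * lift
```

The weights follow half a period of a sine over cep_lifter coefficients, so their effect on the retained coefficients depends on both cep_lifter and num_ceps. With the values assumed here, the weights grow from 1 for the first kept coefficient to 12 for the last.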
The resulting MFCCs:
MFCCs
Mean Normalization
As previously mentioned, to balance the spectrum and improve the Signal-to-Noise Ratio (SNR), we can simply subtract the mean of each coefficient from all frames.
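Mean normalization amounts to one subtraction per coefficient. In this sketch the random stand-in features and the name filter_banks_norm are assumptions; the same line works unchanged on an MFCC matrix.

```python
import numpy

# Stand-in features (assumption): 100 frames x 40 coefficients with a non-zero mean
rng = numpy.random.RandomState(0)
filter_banks = rng.randn(100, 40) + 5.0

# Mean of each coefficient over all frames (axis 0), subtracted from every frame
filter_banks_norm = filter_banks - numpy.mean(filter_banks, axis=0)
```

After this step every coefficient has zero mean across the utterance, while the shape of the feature matrix is unchanged.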
The mean-normalized filter banks:
Normalized Filter Banks
The same applies to MFCCs. The mean-normalized MFCCs:
Normalized MFCCs
Filter Banks vs MFCCs
To this point, the steps to compute filter banks and MFCCs were discussed in terms of their motivations and implementations. It is interesting to note that all steps needed to compute filter banks were motivated by the nature of the speech signal and the human perception of such signals. On the contrary, the extra steps needed to compute MFCCs were motivated by the limitation of some machine learning algorithms. The Discrete Cosine Transform (DCT) was needed to decorrelate filter bank coefficients, a process also referred to as whitening. In particular, MFCCs were very popular when Gaussian Mixture Models - Hidden Markov Models (GMMs-HMMs) were very popular and together, MFCCs and GMMs-HMMs co-evolved to be the standard way of doing Automatic Speech Recognition (ASR)^{2}. With the advent of Deep Learning in speech systems, one might question if MFCCs are still the right choice given that deep neural networks are less susceptible to highly correlated input and therefore the Discrete Cosine Transform (DCT) is no longer a necessary step. It is beneficial to note that Discrete Cosine Transform (DCT) is a linear transformation, and therefore undesirable as it discards some information in speech signals which are highly non-linear.
It is sensible to question if the Fourier Transform is a necessary operation. Given that the Fourier Transform itself is also a linear operation, it might be beneficial to ignore it and attempt to learn directly from the signal in the time domain. Indeed, some recent work has already attempted this and positive results were reported. However, the Fourier transform operation is a difficult operation to learn and may arguably increase the amount of data and model complexity needed to achieve the same performance. Moreover, in doing Short-Time Fourier Transform (STFT), we’ve assumed the signal to be stationary within this short time and therefore the linearity of the Fourier transform would not pose a critical problem.
Conclusion
In this post, we’ve explored the procedure to compute Mel-scaled filter banks and Mel-Frequency Cepstral Coefficients (MFCCs). The motivations and implementation of each step in the procedure were discussed. We’ve also argued the reasons behind the increasing popularity of filter banks compared to MFCCs.
tl;dr: Use Mel-scaled filter banks if the machine learning algorithm is not susceptible to highly correlated input. Use MFCCs if the machine learning algorithm is susceptible to correlated input.

^{1} Liftering is filtering in the cepstral domain. Note the abuse of notation in spectral and cepstral with filtering and liftering respectively.

^{2} An excellent discussion on this topic is in this thesis.