Archiv/Státnice I3: Analýza a syntéza mluvené řeči (Analysis and Synthesis of Spoken Speech)

{{Work in progress|25.8.2010}}

= Speech Recognition =

*The task of transforming an acoustic signal into text.
*Based on statistical methods
*Steady increase in performance over the last decade
*Parameters influencing a recognizer's performance:
**size of the vocabulary
***yes/no recognition
***digit recognition - 10 words
***large vocabulary - 20,000 to 60,000 words
**fluency of speech
***isolated words recognition
***continuous speech recognition
**signal-to-noise ratio

*The following text focuses on '''Large-Vocabulary Continuous Speech Recognition''' (although the methods are applicable universally); the methods shown are '''speaker independent''' (the system was not trained on the particular speaker to be recognized)

== Speech Recognition Architecture ==
*'''Noisy-channel paradigm'''
*Acoustic input <math>O = o_1, o_2, o_3, ..., o_t</math>
**consists of individual "acoustic observations" <math>o_i</math>
***<math>o_i</math> is represented as a feature vector
***usually one acoustic observation every 10 ms
*Sentence treated as a string of words <math>W = w_1, w_2, w_3, ..., w_n</math>
*Probability model:
**<math>W^*=\textit{argmax}_{W \in \textit{L}} P(W|O) = \textit{argmax}_{W \in \textit{L}} \frac{P(O|W)P(W)}{P(O)} = \textit{argmax}_{W \in \textit{L}} P(O|W)P(W)</math>
**<math>P(W)</math> - the prior probability - computed by '''language model'''
**<math>P(O|W)</math> - the observation likelihood - computed by '''acoustic model'''
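The decoder therefore searches for the word sequence that maximizes the product of the two model scores, usually in log space. A minimal sketch of this objective (the scoring functions <code>acoustic_log_likelihood</code> and <code>lm_log_prob</code> are hypothetical placeholders; a real recognizer explores the hypothesis space with Viterbi or beam search instead of enumerating it):

<syntaxhighlight lang="python">
def hypothesis_score(W, O, acoustic_log_likelihood, lm_log_prob):
    """log P(O|W) + log P(W) for one candidate word sequence W."""
    return acoustic_log_likelihood(O, W) + lm_log_prob(W)

def decode(candidate_sentences, O, acoustic_log_likelihood, lm_log_prob):
    """argmax_W P(O|W) P(W) over an enumerable set of candidate sentences."""
    return max(candidate_sentences,
               key=lambda W: hypothesis_score(W, O,
                                              acoustic_log_likelihood,
                                              lm_log_prob))
</syntaxhighlight>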

== The Hidden Markov Model Applied To Speech ==

*An HMM is characterized by the following components
**<math>Q = q_1 q_2 ... q_n</math> - set of '''states'''
**<math>A = a_{01} a_{02} ... a_{n1} ... a_{nn}</math> - '''transition probability matrix'''
***<math>a_{ij}</math> representing the probability of moving from state ''i'' to state ''j''
**<math>B = b_i(o_t)</math> - set of '''emission probabilities''', each representing the probability of an observation ''o_t'' being generated from a state ''i''
***observations are real-valued vectors (unlike in the part-of-speech tagging HMM, where the observations were discrete part-of-speech tags)
**<math>q_0, q_{end}</math> - special '''start and end states''' not associated with observations
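A toy sketch of these components in code (the state names, sizes and probabilities are made up for illustration):

<syntaxhighlight lang="python">
import numpy as np

Q = ["q1", "q2", "q3"]                  # set of states

# A: transition probability matrix, A[i, j] = P(moving from state i to state j);
# each row sums to 1
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])

# B: emission probabilities b_i(o_t); since o_t is a real-valued feature vector,
# b_i is a probability density rather than a lookup table (its concrete form is
# the topic of "Acoustic Likelihood Computation" below)
def b(i, o_t):
    raise NotImplementedError("see Acoustic Likelihood Computation")
</syntaxhighlight>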

=== HMM representation of phone ===
[[Image:Fig6.png|thumb|550px|(Speech and Language Processing(draft) - Jurafsky, Martin)]]
*''Figure 9.6''
**Phones are non-homogeneous over time
***Thus there are separate states modelling the beginning, middle, and end of each phone.

=== HMM representation of word ===
*''Figure 9.7''
**Concatenation of HMM representations of phones
**The figure depicts a typical feature of ASR HMMs: '''left-to-right''' HMM structure
***the HMM does not allow transitions from a state back to earlier states in the word
***states can transition to themselves or to successive states
****transitions from states to themselves are extremely important - the duration of a phone can vary significantly (for example, the duration of the [aa] phone varies from 7 to 387 milliseconds, i.e. 1 to 40 frames)
[[Image:Fig7.PNG|thumb|650px|(Speech and Language Processing(draft) - Jurafsky, Martin)]]
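A minimal sketch of building such a left-to-right transition matrix for a word, with three states per phone and a hypothetical fixed self-loop probability:

<syntaxhighlight lang="python">
import numpy as np

def left_to_right_word_hmm(phones, self_loop=0.5):
    """Left-to-right transition matrix for a word: 3 states per phone
    (beginning, middle, end); each state may loop on itself or move one
    state forward, never backwards."""
    n = 3 * len(phones)                     # total number of emitting states
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = self_loop                 # self-loop: phone durations vary widely
        if i + 1 < n:
            A[i, i + 1] = 1.0 - self_loop   # forward transition to the next state
    return A

A_six = left_to_right_word_hmm(["s", "ih", "k", "s"])   # the word "six"
</syntaxhighlight>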

=== Speech recognition HMM ===
[[Image:Fig22.PNG|thumb|500px|(Speech and Language Processing(draft) - Jurafsky, Martin)]]
*''Figure 9.22'':
**Combination of word HMMs
**an added optional state modelling silence
**transition from the ''end'' state back to the ''start'' state - a sentence can be constructed out of an arbitrary number of words
**transitions from the ''start'' state are assigned unigram LM probabilities

*''Figure 9.29'':
[[Image:Fig29.PNG|thumb|500px|(Speech and Language Processing(draft) - Jurafsky, Martin)]]
**the ''start'' state, ''end'' state and silence states are omitted here for convenience
**a bigram language model is used here - its probabilities are attached to the transitions from the ends of words to the beginnings of words
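A sketch of how such cross-word transitions could be weighted (the bigram table below is a made-up example; real systems use a trained language model):

<syntaxhighlight lang="python">
import math

# hypothetical bigram LM: P(next word | previous word)
bigram = {("one", "two"): 0.20,
          ("two", "one"): 0.10,
          ("one", "one"): 0.05}

def cross_word_log_prob(prev_word, next_word):
    """Log probability attached to the transition from the last state of
    prev_word's HMM to the first state of next_word's HMM."""
    return math.log(bigram.get((prev_word, next_word), 1e-6))  # crude floor for unseen pairs
</syntaxhighlight>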

== Feature Extraction: MFCC Vectors ==
*''MFCC - mel frequency cepstral coefficients'' - most common feature representation

=== Analog-to-Digital Conversion ===
*'''sampling'''
**measuring the amplitude of the signal at a particular time
**''sampling rate'' - number of samples per second
**maximum frequency that can be measured is half of the ''sampling rate''
*'''quantization'''
**Representation of the real-valued amplitude measurements as integers
***8 bits => values from -128 to 127
***16 bits => values from -32768 to 32767
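A small illustration of both steps with NumPy (the 440 Hz test tone and the 16 kHz sampling rate are arbitrary example values):

<syntaxhighlight lang="python">
import numpy as np

sampling_rate = 16000                            # samples per second; frequencies up to 8 kHz are representable
t = np.arange(0, 1.0, 1.0 / sampling_rate)       # time stamps for 1 second of signal
signal = 0.8 * np.sin(2 * np.pi * 440.0 * t)     # toy 440 Hz tone, amplitude in [-1, 1]

# 16-bit quantization: real-valued amplitudes mapped to integers in [-32768, 32767]
quantized = np.clip(np.round(signal * 32767), -32768, 32767).astype(np.int16)
</syntaxhighlight>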

=== Preemphasis ===
*boosting the amount of energy in the high frequencies
*improves phone detection accuracy
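Preemphasis is usually implemented as a first-order high-pass filter; a minimal sketch (the coefficient 0.97 is a typical but not mandated choice):

<syntaxhighlight lang="python">
import numpy as np

def preemphasize(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]: boosts the energy of the high frequencies."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
</syntaxhighlight>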

=== Windowing ===
*spectral features are extracted from small segments of speech
**assumption that the signal is stationary within a small segment
*(roughly) stationary segments of speech are extracted using a '''windowing''' technique
*windowing characterized by
**window's '''width'''
**'''offset''' between successive windows
**'''shape''' of the window
*segments of speech extracted by windowing are called '''frames'''
**'''frame size''' - number of milliseconds in each frame
**'''frame shift''' - milliseconds between the left edges of successive windows
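A sketch of extracting Hamming-windowed frames (the 25 ms frame size and 10 ms frame shift are common example values):

<syntaxhighlight lang="python">
import numpy as np

def extract_frames(signal, sampling_rate, frame_size_ms=25, frame_shift_ms=10):
    """Cut the signal into overlapping frames and apply a Hamming window to each."""
    size = int(sampling_rate * frame_size_ms / 1000)     # samples per frame
    shift = int(sampling_rate * frame_shift_ms / 1000)   # samples between frame starts
    window = np.hamming(size)                            # window shape
    return np.array([signal[start:start + size] * window
                     for start in range(0, len(signal) - size + 1, shift)])
</syntaxhighlight>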

=== Discrete Fourier Transform (DFT) ===
*extracts spectral information from the windowed signal
*tells how much energy each frame contains in different frequency bands
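A sketch of the per-frame power spectrum (the FFT length of 512 is a typical example value):

<syntaxhighlight lang="python">
import numpy as np

def power_spectrum(frame, n_fft=512):
    """Energy of one windowed frame at each discrete frequency bin."""
    spectrum = np.fft.rfft(frame, n=n_fft)     # complex DFT coefficients
    return (np.abs(spectrum) ** 2) / n_fft
</syntaxhighlight>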

=== Mel Filter Bank and Log ===
*human ears are less sensitive to higher frequencies (above 1000 Hz)
*human hearing is "logarithmic"
**(i.e. for humans, the distance between 440 Hz and 880 Hz equals the distance between 880 Hz and 1760 Hz - in both cases, the distance is one musical octave)
*these features of hearing are exploited in speech recognition
*DFT outputs are warped onto the '''mel scale'''
**Frequency scale divided into bands
***10 bands linearly spaced below 1000 Hz
***remaining bands spread logarithmically above 1000 Hz
**new '''mel-scaled''' vectors
***energy collected from each frequency band
***logarithms of energy values - the human response to signal level is logarithmic
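A sketch of the mel warping and filterbank step. The formula below is the commonly used mel-scale approximation, and the filters are spaced uniformly on the mel scale, which is one standard way of realizing the linear-below-1000 Hz / logarithmic-above-1000 Hz behaviour described above; the filter count and FFT size are example values:

<syntaxhighlight lang="python">
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sampling_rate=16000):
    """Triangular filters with centres spaced evenly on the mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(sampling_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sampling_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):                      # rising slope
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                     # falling slope
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def log_mel_energies(power_spec, fbank):
    """Energy collected in each mel band, then the logarithm is taken."""
    return np.log(fbank.dot(power_spec) + 1e-10)
</syntaxhighlight>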

=== The Cepstrum ===
''briefly:''
*inverse DFT of the log mel-scaled spectrum
*result - 12 cepstral coefficients for each frame
*motivation - these coefficients are uncorrelated and better model the vocal tract
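In practice the inverse transform is computed as a discrete cosine transform of the log mel energies; a minimal sketch (keeping coefficients 1-12 and leaving out the 0th, which is a common but not universal convention):

<syntaxhighlight lang="python">
from scipy.fftpack import dct

def cepstral_coefficients(log_mel, n_ceps=12):
    """DCT of the log mel spectrum; keep the first 12 cepstral coefficients."""
    return dct(log_mel, type=2, norm='ortho')[1:n_ceps + 1]
</syntaxhighlight>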

=== Deltas and Energy ===
*So far - 12 cepstral features
*13th feature - '''Energy''': obtained by summing the squares of the sample amplitudes over all samples in the given frame
*For each feature:
**'''delta''' cepstral coefficient obtained as an estimate of the derivative of the feature values, computed as
 <math>d(t) = \frac{c(t+1) - c(t-1)}{2}</math>
 for a particular cepstral value ''c(t)'' at time ''t''
**'''double delta''' cepstral coefficient obtained as an estimate of the derivative of the '''delta''' cepstral feature values
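A sketch of the energy feature and of the two-point delta estimate above, applied along the time axis of a (frames x features) matrix:

<syntaxhighlight lang="python">
import numpy as np

def frame_energy(frame):
    """Energy feature: sum of squared sample amplitudes in the frame
    (the logarithm of this value is often used in practice)."""
    return np.sum(frame ** 2)

def deltas(feature_track):
    """d(t) = (c(t+1) - c(t-1)) / 2 for every feature column; the first and
    last frames are padded by repetition."""
    padded = np.pad(feature_track, ((1, 1), (0, 0)), mode='edge')
    return (padded[2:] - padded[:-2]) / 2.0
</syntaxhighlight>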

=== MFCC Summary ===
*39 MFCC features
 12 cepstral coefficients
 12 delta cepstral coefficients
 12 double delta cepstral coefficients
 1 energy coefficient
 1 delta energy coefficient
 1 double delta energy coefficient

== Acoustic Likelihood Computation ==
*Previous section - how to obtain feature vectors
*This section - how to compute the likelihood of these feature vectors given an HMM state
**i.e. how to obtain '''emission probabilities'''
 <math>b_i(o_t) = P(o_t \mid q_i)</math>
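The subsection below describes the simple vector-quantization approach; HMM systems more commonly attach a Gaussian (or Gaussian mixture) density to each state. A minimal sketch of a single diagonal-covariance Gaussian log-likelihood, as an illustration of what computing <math>b_i(o_t)</math> for a real-valued vector involves:

<syntaxhighlight lang="python">
import numpy as np

def diag_gaussian_log_likelihood(o_t, mean, var):
    """log b_i(o_t) for a state modelled by one diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o_t - mean) ** 2 / var)
</syntaxhighlight>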

=== Vector Quantization ===
*a simple method, not used in state-of-the-art systems
*idea: map feature vectors into a small number of classes
*'''codebook''' - list of possible classes
*'''prototype vector''' - feature vector representing the class
*the codebook is created by '''clustering''' all the feature vectors in the training set into the given number of classes
*a prototype vector can be chosen as the central point of each cluster
*each incoming feature vector is compared to all prototype vectors, the closest prototype vector is selected, and the feature vector is replaced by the class label of the selected prototype vector.
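A sketch of both steps, using plain k-means for the clustering (codebook size, iteration count and initialization are arbitrary example choices):

<syntaxhighlight lang="python">
import numpy as np

def make_codebook(training_vectors, n_classes=256, n_iterations=10):
    """Cluster the training feature vectors and return one prototype per class."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(training_vectors), n_classes, replace=False)
    prototypes = training_vectors[idx].astype(float)
    for _ in range(n_iterations):
        # assign every training vector to its nearest prototype
        dists = np.linalg.norm(training_vectors[:, None] - prototypes[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each prototype to the centre of its cluster
        for c in range(n_classes):
            members = training_vectors[labels == c]
            if len(members):
                prototypes[c] = members.mean(axis=0)
    return prototypes

def quantize(feature_vector, prototypes):
    """Replace an incoming feature vector by the class label of the closest prototype."""
    return int(np.linalg.norm(prototypes - feature_vector, axis=1).argmin())
</syntaxhighlight>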

= Speech Synthesis =