
''Analysis of spoken speech'' according to the book ''Speech and Language Processing - Jurafsky & Martin, 2nd Edition''. Compiled by: ''Michališek''

= Speech Recognition =

*The task of transforming an acoustic signal into text.
*Based on statistical methods
*Constant increase of performance over last decade
*Parameters influencing a recognizer's performance:
**size of the vocabulary
***yes/no recognition
***digit recognition - 10 words
***large vocabulary - 20,000 to 60,000 words
**fluency of speech
***isolated words recognition
***continuous speech recognition
**noise level (signal-to-noise ratio)

*The following text will focus on '''Large-Vocabulary Continuous Speech Recognition''' (although the methods are applicable universally); the methods shown are '''speaker-independent''' (the system is not trained on the particular speaker to be recognized)

== Speech Recognition Architecture ==
*'''Noisy-channel paradigm'''
*Acoustic input <math>O = o_1, o_2, o_3, ..., o_t</math>
**consists of individual "acoustic observations" <math>o_i</math>
***<math>o_i</math> is represented as a feature vector
***typically one acoustic observation per 10 ms
*Sentence treated as a string of words <math>W = w_1, w_2, w_3, ..., w_n</math>
*Probability model:
 <math>W^*=\textit{argmax}_{W \in \textit{L}} P(W|O) = \textit{argmax}_{W \in \textit{L}} \frac{P(O|W)P(W)}{P(O)} = \textit{argmax}_{W \in \textit{L}} P(O|W)P(W)</math>
*<math>P(W)</math> - the prior probability - computed by '''language model'''
*<math>P(O|W)</math> - the observation likelihood - computed by '''acoustic model'''
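A minimal sketch of the decomposition above, working in log space; the scoring functions <code>acoustic_log_likelihood</code> and <code>language_model_log_prob</code> are hypothetical stand-ins for the acoustic and language models, and a real decoder searches this space with the Viterbi algorithm rather than enumerating candidate sentences:
<syntaxhighlight lang="python">
import math

def decode(observations, candidate_sentences,
           acoustic_log_likelihood, language_model_log_prob):
    """Pick W* = argmax_W P(O|W) P(W), scored in log space to avoid underflow."""
    best_sentence, best_score = None, -math.inf
    for W in candidate_sentences:
        # log P(O|W) from the acoustic model + log P(W) from the language model
        score = acoustic_log_likelihood(observations, W) + language_model_log_prob(W)
        if score > best_score:
            best_sentence, best_score = W, score
    return best_sentence
</syntaxhighlight>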

== The Hidden Markov Model Applied To Speech ==

*An HMM is characterized by the following components
**<math>Q = q_1 q_2 ... q_n</math> - set of '''states'''
**<math>A = a_{01} a_{02} ... a_{n1} ... a_{nn}</math> - '''transition probability matrix'''
***<math>a_{ij}</math> representing the probability of moving from state ''i'' to state ''j''
**<math>B = b_i(o_t)</math> - set of '''emission probabilities''', each representing the probability of an observation ''o_t'' being generated from a state ''i''
***observations are real-valued vectors (unlike in the part-of-speech tagging HMM, where the observations were discrete symbols - part-of-speech tags)
**<math>q_0, q_{end}</math> - special '''start and end states''' not associated with observations
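A minimal container illustrating the components listed above; the class and attribute names are for illustration only, not from the book:
<syntaxhighlight lang="python">
import numpy as np

class HMM:
    """Toy container for an HMM used in ASR.

    states        - list of state names q_1..q_n
    trans         - transition probability matrix A (including start/end states)
    emission_pdfs - one callable b_i(o_t) per state, mapping a real-valued
                    feature vector to a likelihood
    """
    def __init__(self, states, trans, emission_pdfs):
        self.states = states
        self.trans = np.asarray(trans)
        self.emission_pdfs = emission_pdfs

    def emission(self, state_index, observation):
        # b_i(o_t): likelihood of the observation vector in the given state
        return self.emission_pdfs[state_index](observation)
</syntaxhighlight>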

=== HMM representation of a phone ===
[[Image:Fig6.png|thumb|550px|(Speech and Language Processing(draft) - Jurafsky, Martin)]]
*''Figure 9.6''
**Phones are non-homogeneous over time
***Thus there are separate states modelling a beginning, middle, and end of each phone.

=== HMM representation of a word ===
*a pronunciation lexicon is needed - it tells us, for each word, which phones it consists of (sometimes with multiple pronunciation variants)
**a pronunciation lexicon of English: the CMU dictionary (publicly available)
*''Figure 9.7''
**Concatenation of HMM representations of phones
**The figure depicts a typical feature of ASR HMMs: '''left-to-right''' HMM structure
***the HMM does not allow transitions from a state to an earlier state in the word
***states can transition to themselves or to successive states
****self-loop transitions are extremely important - the duration of a phone can vary significantly (for example, the duration of the [aa] phone varies from 7 to 387 milliseconds, i.e. 1 to 40 frames); a sketch of this structure follows below the figure
[[Image:Fig7.PNG|thumb|650px|(Speech and Language Processing(draft) - Jurafsky, Martin)]]
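A sketch of building the left-to-right transition structure of a word HMM from its pronunciation, assuming three subphone states per phone and a uniform 0.5 self-loop probability (an arbitrary illustrative value, not taken from the text):
<syntaxhighlight lang="python">
import numpy as np

def left_to_right_transitions(phones, subphones_per_phone=3, self_loop=0.5):
    """Left-to-right word HMM: each state either loops back to itself or
    advances to the next state, which lets phone durations stretch over
    many 10 ms frames."""
    n = len(phones) * subphones_per_phone
    A = np.zeros((n, n + 1))           # last column: transition out of the word
    for i in range(n):
        A[i, i] = self_loop            # stay in the same subphone state
        A[i, i + 1] = 1.0 - self_loop  # advance to the next subphone (or exit)
    return A

# e.g. a word with pronunciation [w ah n]:
A = left_to_right_transitions(["w", "ah", "n"])
</syntaxhighlight>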

=== Speech recognition HMM ===
[[Image:Fig22.PNG|thumb|500px|(Speech and Language Processing(draft) - Jurafsky, Martin)]]
*''Figure 9.22'':
**Combination of word HMMs
**an added optional state modelling silence
**transition from the ''end'' state to the ''start'' state - a sentence can be constructed out of an arbitrary number of words
**transitions from the ''start'' state are assigned unigram LM probabilities

*''Figure 9.29'':
[[Image:Fig29.PNG|thumb|500px|(Speech and Language Processing(draft) - Jurafsky, Martin)]]
**the ''start'' state, ''end'' state, and silence states are omitted here for convenience
**Bigram language model used here - probabilities of transitions from the ends to the beginnings of the words

== Feature Extraction: MFCC Vectors ==
*''MFCC - mel frequency cepstral coefficients'' - most common feature representation

=== Analog-to-Digital Conversion ===
*'''sampling'''
**measuring the amplitude of the signal at a particular time
**''sampling rate'' - number of samples per second
**maximum frequency that can be measured is half of the ''sampling rate''
*'''quantization'''
**representation of real-valued numbers (amplitude measurements) as integers
***8 bits => values from -128 to 127
***16 bits => values from -32768 to 32767
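A small sketch of the quantization step; the clipping behaviour and the example signal are illustrative assumptions:
<syntaxhighlight lang="python">
import numpy as np

def quantize(signal, num_bits=16):
    """Quantize real-valued samples in [-1.0, 1.0] to signed integers
    (with 16 bits the representable range is -32768..32767)."""
    max_int = 2 ** (num_bits - 1) - 1
    min_int = -2 ** (num_bits - 1)
    return np.clip(np.round(np.asarray(signal) * max_int),
                   min_int, max_int).astype(np.int64)

# a 16 kHz sampling rate can represent frequencies up to 8 kHz (half the rate)
samples = quantize(np.sin(2 * np.pi * 440 * np.arange(0, 0.01, 1 / 16000)))
</syntaxhighlight>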

=== Preemphasis ===
*boosting the amount of energy in the high frequencies
*improves phone detection accuracy
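Preemphasis is commonly implemented as a first-order filter; the sketch below assumes the frequently used coefficient 0.97 (the text does not fix a value):
<syntaxhighlight lang="python">
import numpy as np

def preemphasize(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1],
    which boosts the high-frequency part of the spectrum."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
</syntaxhighlight>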

=== Windowing ===
*spectral features are to be extracted from small segments of speech
**assumption that the signal is stationary within a small segment
*(roughly) stationary segments of speech are extracted using a '''windowing''' technique
*windowing is characterized by
**the window's '''width'''
**the '''offset''' between successive windows
**the '''shape''' of the window
*segments of speech extracted by windowing are called '''frames'''
**'''frame size''' - number of milliseconds in each frame
**'''frame shift''' - number of milliseconds between the left edges of successive windows
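A sketch of the windowing step with typical values (25 ms frames every 10 ms, a Hamming window); these concrete choices are common in practice but are assumptions here:
<syntaxhighlight lang="python">
import numpy as np

def frames(signal, sample_rate, frame_size_ms=25, frame_shift_ms=10):
    """Cut the signal into overlapping frames and apply a Hamming window,
    which tapers the frame edges towards zero."""
    frame_len = int(sample_rate * frame_size_ms / 1000)
    frame_shift = int(sample_rate * frame_shift_ms / 1000)
    window = np.hamming(frame_len)
    out = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        out.append(window * signal[start:start + frame_len])
    return np.array(out)
</syntaxhighlight>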

=== Discrete Fourier Transform (DFT) ===
*extracts spectral information from the windowed signal
*i.e. how much energy each frame contains in different frequency bands
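A minimal sketch of this step using the real FFT (one common way to compute the DFT); each row of the result describes the energy of one frame per frequency bin:
<syntaxhighlight lang="python">
import numpy as np

def power_spectrum(windowed_frames, n_fft=512):
    """Per-frame power spectrum: squared magnitude of the DFT of each frame."""
    return np.abs(np.fft.rfft(windowed_frames, n=n_fft)) ** 2
</syntaxhighlight>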

=== Mel Filter Bank and Log ===
*the human ear is less sensitive to higher frequencies (above 1000 Hz)
*human hearing is "logarithmic"
**(i.e. for a human, the distance between 440 Hz and 880 Hz equals the distance between 880 Hz and 1760 Hz - in both cases, the distance is one musical octave)
*these features of hearing are exploited in speech recognition
*DFT outputs are warped onto the '''mel scale'''
**the frequency scale is divided into bands
***10 bands linearly spaced below 1000 Hz
***the remaining bands spread logarithmically above 1000 Hz
**new '''mel-scaled''' vectors
***energy is collected from each frequency band
***logarithms of the energy values are taken - the human response to signal level is logarithmic
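A sketch of a mel filter bank, assuming the common formula mel(f) = 2595·log10(1 + f/700) and triangular filters equally spaced on the mel scale (approximately linear below 1000 Hz and logarithmic above); the filter shapes and counts are implementation choices not fixed by the text:
<syntaxhighlight lang="python">
import numpy as np

def hz_to_mel(f):
    # common mel-scale formula (one standard variant)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_energies(power_spec, sample_rate, n_fft=512, n_filters=26):
    """Warp the power spectrum onto the mel scale and take logs:
    build triangular filters, sum the energy under each filter,
    and return the log of the collected band energies."""
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    filters = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):                    # rising slope
            filters[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                   # falling slope
            filters[i, k] = (right - k) / max(right - center, 1)
    energies = power_spec @ filters.T                    # energy per mel band
    return np.log(np.maximum(energies, 1e-10))           # log response, avoid log(0)
</syntaxhighlight>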

=== The Cepstrum ===
''briefly:''
*inverse DFT of the mel-scaled signal
*result - 12 cepstral coefficients for each frame
*motivation - these coefficients are uncorrelated and model the vocal tract better
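A sketch of the cepstral step; in practice the inverse DFT of the log mel spectrum is usually realized with a type-II DCT (an implementation detail not spelled out above), and only the first 12 coefficients are kept:
<syntaxhighlight lang="python">
import numpy as np
from scipy.fft import dct

def cepstral_coefficients(log_mel, n_ceps=12):
    """Decorrelating transform of the log mel spectrum; keep the first
    n_ceps coefficients (dropping the 0th one is a common choice)."""
    ceps = dct(log_mel, type=2, axis=-1, norm='ortho')
    return ceps[..., 1:n_ceps + 1]
</syntaxhighlight>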

=== Deltas and Energy ===
*So far - 12 cepstral features
*13th feature - '''Energy''': obtained by summing the squares of the signal values over all samples in the given frame
*For each feature:
**'''delta''' cepstral coefficient obtained as an estimate of the derivative of the feature values over time, computed as
 <math>d(t) = \frac{c(t+1) - c(t-1)}{2}</math>
 for a particular cepstral value ''c(t)'' at time ''t''
**'''double delta''' cepstral coefficient obtained as the derivative of the '''delta''' cepstral feature values
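A sketch of the energy and delta computations following the formula above; the treatment of the first and last frames (edge padding) is an assumption:
<syntaxhighlight lang="python">
import numpy as np

def frame_energy(frames):
    # energy of a frame: sum of squared sample values within the frame
    return np.sum(np.asarray(frames) ** 2, axis=-1)

def delta(features):
    """d(t) = (c(t+1) - c(t-1)) / 2 per feature dimension; the first and
    last frames are handled by repeating the edge values."""
    padded = np.concatenate([features[:1], features, features[-1:]], axis=0)
    return (padded[2:] - padded[:-2]) / 2.0
</syntaxhighlight>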

=== MFCC Summary ===
*39 MFCC features
 12 cepstral coefficients
 12 delta cepstral coefficients
 12 double delta cepstral coefficients
 1 energy coefficient
 1 delta energy coefficient
 1 double delta energy coefficient
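Putting the pieces together, a sketch of assembling the 39-dimensional vector per frame; deltas are computed as central differences, matching the formula in the previous subsection:
<syntaxhighlight lang="python">
import numpy as np

def mfcc_39(ceps, energy):
    """ceps: (T, 12) cepstral coefficients, energy: (T,) frame energies."""
    static = np.column_stack([ceps, energy])   # 12 cepstral + 1 energy = 13
    d = np.gradient(static, axis=0)            # 13 delta features
    dd = np.gradient(d, axis=0)                # 13 double-delta features
    return np.hstack([static, d, dd])          # 13 + 13 + 13 = 39 features
</syntaxhighlight>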

== Acoustic Likelihood Computation ==
*Last chapter - how to obtain feature vectors
*This chapter - how to compute likelihood of these feature vectors given an HMM state
**i.e. how to obtain '''emission probabilities'''
 <math>B = b_i(o_t) = p(o_t | q_i)</math>

=== Vector Quantization ===
*a simple method, not used in state-of-the-art systems
*idea: map feature vectors into a small number of classes
*'''codebook''' - list of possible classes
*'''prototype vector''' - feature vector representing the class
*codebook is created by '''clustering''' of all the feature vectors in the training set into the given number of classes
*prototype vector can be chosen as a central point of each cluster
*each incoming feature vector is compared to all prototype vectors, the closest prototype vector is selected, and the feature vector is replaced by the class label of the selected prototype vector.
*disadvantage of this method: loss of specific information about the given feature vector, with a significant impact on performance
*advantage: emission probabilities can be stored in a table for each pair of HMM state and output symbol, and Baum-Welch training is conceptually easier
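A toy sketch of codebook training and quantization using plain k-means with Euclidean distance; the codebook size of 256 and the number of iterations are arbitrary illustrative choices:
<syntaxhighlight lang="python">
import numpy as np

def train_codebook(vectors, n_classes=256, n_iters=10, seed=0):
    """Cluster the training feature vectors into n_classes; the cluster
    centres serve as the prototype vectors of the codebook."""
    rng = np.random.default_rng(seed)
    prototypes = vectors[rng.choice(len(vectors), n_classes, replace=False)].astype(float)
    for _ in range(n_iters):
        # assign every training vector to its closest prototype
        dists = np.linalg.norm(vectors[:, None, :] - prototypes[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # move each prototype to the centre of its cluster
        for c in range(n_classes):
            if np.any(labels == c):
                prototypes[c] = vectors[labels == c].mean(axis=0)
    return prototypes

def quantize_vector(o_t, prototypes):
    # replace the feature vector by the index of the closest prototype
    return int(np.linalg.norm(prototypes - o_t, axis=1).argmin())
</syntaxhighlight>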

=== Gaussian Probability Density Functions ===
*a more adequate method for modelling emission probabilities
*used in state-of-the-art systems
*for each HMM state, the emission probability distribution over the space of possible feature vectors is expressed by a '''Gaussian Mixture Model'''
 <math>b_j(o_t) = \sum_{m=1}^M c_{jm} \frac{1}{\sqrt{(2\pi)^D |\Sigma_{jm}|}} \exp\left[-\tfrac{1}{2}(o_t-\mu_{jm})^T \Sigma_{jm}^{-1}(o_t-\mu_{jm})\right]</math>
 <math>c_{jm}</math> - weight of the ''m-th'' Gaussian of the state ''j''
 <math>\mu_{jm}</math> - mean of the ''m-th'' Gaussian of the state ''j''
 <math>\Sigma_{jm}</math> - covariance matrix of the ''m-th'' Gaussian of the state ''j''
 ''D'' - dimension of the feature vector <math>o_t</math>
*Estimation of the GMM parameters (mixture weights, means, and covariance matrices) - '''Baum-Welch algorithm'''
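A direct, unoptimized reading of the GMM formula above as code; in real systems this is computed in the log domain and the covariances are often diagonal:
<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

def gmm_emission_likelihood(o_t, weights, means, covariances):
    """b_j(o_t) for one HMM state: weighted sum of multivariate Gaussians.
    weights, means, covariances hold c_jm, mu_jm and Sigma_jm for the M
    mixture components of state j."""
    return sum(c * multivariate_normal.pdf(o_t, mean=mu, cov=sigma)
               for c, mu, sigma in zip(weights, means, covariances))
</syntaxhighlight>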

== Search and Decoding ==
*Bayes probability formula:
 <math>W^* = \textit{argmax}_{W \in L} P(O|W) P(W)</math>
*typically, a more complex formula is used:
 <math>W^* = \textit{argmax}_{W \in L} P(O|W) P(W)^{LMSF} \textit{WIP}^N</math>
 ''LMSF'' - language model scaling factor (language model weight)
 ''WIP'' - word insertion penalty
 ''N'' - number of words in the sentence
*decoding - '''Viterbi algorithm'''
**finds the most probable sequence of HMM states
**the output sentence can be easily constructed out of this sequence of HMM states
**'''beam search pruning''' (see the sketch below the figure)
***at each trellis stage, compute the probability of the best state/path ''D''; prune away any state that is less probable than <math>D \times \Theta</math>, where <math>\Theta</math> is the beam width (a value lower than 1)
[[Image:Fig26.PNG||550px|(Speech and Language Processing (draft) - Jurafsky & Martin)]]
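A sketch of Viterbi decoding with beam pruning in the log domain; the interfaces (a transition matrix in log space, a <code>log_emission(j, o_t)</code> callable, state 0 as the start state) are assumptions made for illustration:
<syntaxhighlight lang="python">
import numpy as np

def viterbi_beam(observations, log_trans, log_emission, log_beam=np.log(1e-3)):
    """log_trans[i, j] = log a_ij; log_emission(j, o_t) = log b_j(o_t),
    e.g. a log GMM likelihood; log_beam = log of the beam width theta."""
    n_states = log_trans.shape[0]
    scores = np.full(n_states, -np.inf)
    scores[0] = 0.0                          # start in state 0
    backpointers = []
    for o_t in observations:
        new_scores = np.full(n_states, -np.inf)
        back = np.zeros(n_states, dtype=int)
        active = np.flatnonzero(scores > -np.inf)   # states surviving the beam
        for j in range(n_states):
            cand = scores[active] + log_trans[active, j]
            best = cand.argmax()
            new_scores[j] = cand[best] + log_emission(j, o_t)
            back[j] = active[best]
        # beam pruning: drop states much less probable than the best one
        new_scores[new_scores < new_scores.max() + log_beam] = -np.inf
        scores = new_scores
        backpointers.append(back)
    # trace back the most probable state sequence
    path = [int(scores.argmax())]
    for back in reversed(backpointers):
        path.append(int(back[path[-1]]))
    return list(reversed(path))
</syntaxhighlight>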

== Embedded Training ==
*Recall: ''A'' - HMM transition probabilities, ''B'' - emission probabilities (modeled by GMMs)
*Given: a phoneset, a pronunciation lexicon, and the transcribed wavefiles:
**Build a whole-sentence HMM for each training sentence (a concatenation of the HMMs of the individual words)
**Initialize the ''A'' probabilities to 0.5 (for the self-loop and for the transition to the correct next subphone) or to zero (for all other transitions)
**Initialize the ''B'' probabilities by setting the mean and variance of each Gaussian to the global mean and variance of the entire training set
**Run multiple iterations of the Baum-Welch algorithm
*This process will optimize the ''A'' and ''B'' probabilities
**i.e. for each HMM state corresponding to a distinct subphone (each phone is modelled by 3 subphones), the probability of the self-loop transition and of the leaving transition is re-estimated, as well as the parameters of the GMM for the given state
**a phone is modelled by the same parameters independently of which word it occurs in (i.e. the same HMM for the phone [a] in every word containing the [a] phone) - ''(this is not stated explicitly in Jurafsky & Martin, but it seems intuitive and is hopefully true)''
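A sketch of the flat-start initialization described above; a single Gaussian per state is used here for simplicity (real systems use GMMs, and Baum-Welch iterations then re-estimate everything):
<syntaxhighlight lang="python">
import numpy as np

def initialize_sentence_hmm(n_states, global_mean, global_var):
    """Transitions: 0.5 for the self-loop and 0.5 for moving to the next
    subphone state, 0 elsewhere.  Emissions: every state starts from the
    global mean and variance of the training data."""
    A = np.zeros((n_states, n_states + 1))   # last column: leaving the sentence HMM
    for i in range(n_states):
        A[i, i] = 0.5
        A[i, i + 1] = 0.5
    means = np.tile(global_mean, (n_states, 1))
    variances = np.tile(global_var, (n_states, 1))
    return A, means, variances
</syntaxhighlight>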

= Speech Synthesis =