Archiv/Státnice I3: Analýza a syntéza mluvené řeči

{{Work in progress|25.8.2010}}

= Speech Recognition =

*The task of transforming an acoustic signal into text.
*Based on statistical methods
*Constant increase of performance over the last decade
*Parameters influencing a recognizer's performance:
**size of the vocabulary
***yes/no recognition - 2 words
***digit recognition - 10 words
***large vocabulary - 20,000 to 60,000 words
**fluency of speech
***isolated words recognition
***continuous speech recognition
**signal-to-noise ratio

*The following text focuses on '''Large-Vocabulary Continuous Speech Recognition''' (although the methods are applicable universally); the methods shown are '''speaker independent''' (the system was not trained on the particular speaker to be recognized)

== Speech Recognition Architecture ==
*'''Noisy-channel paradigm'''
*Acoustic input <math>O = o_1, o_2, o_3, ..., o_t</math>
**consists of individual "acoustic observations" <math>o_i</math>
***<math>o_i</math> is represented as a feature vector
***usually one acoustic observation for every 10 ms
*Sentence treated as a string of words <math>W = w_1, w_2, w_3, ..., w_n</math>
*Probability model:
**<math>W^*=\textit{argmax}_{W \in \textit{L}} P(W|O) = \textit{argmax}_{W \in \textit{L}} \frac{P(O|W)P(W)}{P(O)} = \textit{argmax}_{W \in \textit{L}} P(O|W)P(W)</math>
**<math>P(W)</math> - the prior probability - computed by the '''language model'''
**<math>P(O|W)</math> - the observation likelihood - computed by the '''acoustic model'''
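As a minimal sketch of this decision rule (all words and probability values below are made up for illustration; the scores are combined in log space to avoid numeric underflow):

<source lang="python">
import math

# Hypothetical candidate transcriptions with made-up acoustic-model
# and language-model probabilities (purely illustrative values).
candidates = {
    "recognize speech":   {"p_o_given_w": 1e-9, "p_w": 1e-4},
    "wreck a nice beach": {"p_o_given_w": 2e-9, "p_w": 1e-7},
}

def score(models):
    # log P(O|W) + log P(W) is equivalent to maximizing P(O|W) * P(W)
    return math.log(models["p_o_given_w"]) + math.log(models["p_w"])

best = max(candidates, key=lambda w: score(candidates[w]))
print(best)  # -> "recognize speech"
</source>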

== The Hidden Markov Model Applied To Speech ==

*HMM characterized by the following components:
**<math>Q = q_1 q_2 ... q_n</math> - set of '''states'''
**<math>A = a_{01} a_{02} ... a_{n1} ... a_{nn}</math> - '''transition probability matrix'''
***<math>a_{ij}</math> representing the probability of moving from state ''i'' to state ''j''
**<math>B = b_i(o_t)</math> - set of '''emission probabilities''', each representing the probability of an observation <math>o_t</math> being generated from a state <math>i</math>
***observations are real-valued vectors (unlike in the part-of-speech tagging HMM, where the observations were discrete symbols - part-of-speech tags)
**<math>q_0, q_{end}</math> - special '''start and end states''' not associated with observations
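A minimal decoding sketch over these components, using toy discrete observations in place of the real-valued feature vectors (the states, probabilities, and observation symbols are all illustrative; this is the Viterbi recursion in its simplest form):

<source lang="python">
# Toy Viterbi decoding over an HMM with states Q, transition matrix A,
# and emission probabilities B. All probability values are illustrative.
states = ["s1", "s2"]
A = {("s1", "s1"): 0.6, ("s1", "s2"): 0.4,
     ("s2", "s1"): 0.0, ("s2", "s2"): 1.0}   # left-to-right: no s2 -> s1
start = {"s1": 1.0, "s2": 0.0}               # role of the special start state
B = {"s1": {"a": 0.7, "b": 0.3},
     "s2": {"a": 0.2, "b": 0.8}}

def viterbi(obs):
    V = [{s: start[s] * B[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            p, best = max((prev[r] * A[(r, s)], r) for r in states)
            col[s], ptr[s] = p * B[s][o], best
        V.append(col)
        back.append(ptr)
    last = max(V[-1], key=V[-1].get)          # best final state
    path = [last]
    for ptr in reversed(back):                # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["a", "a", "b", "b"]))  # -> ['s1', 's1', 's2', 's2']
</source>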

=== HMM representation of phone ===
[[Image:Fig6.png|thumb|550px|(Speech and Language Processing(draft) - Jurafsky, Martin)]]
*''Figure 9.6''
**Phones are non-homogeneous over time
***Thus there are separate states modelling the beginning, middle, and end of each phone.

=== HMM representation of word ===
*''Figure 9.7''
**Concatenation of HMM representations of phones
**The figure depicts a typical feature of ASR HMMs: '''left-to-right''' HMM structure
***the HMM does not allow transitions from a state back to earlier states in the word (see the transition-matrix sketch after the figure)
***states can transition to themselves or to successive states
****transitions from states to themselves are extremely important - the durations of phones vary significantly (for example, the duration of the [aa] phone varies from 7 to 387 milliseconds, i.e. 1 to 40 frames)
[[Image:Fig7.PNG|thumb|650px|(Speech and Language Processing(draft) - Jurafsky, Martin)]]
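The left-to-right constraint can be read directly off the transition matrix; below is a sketch for a single 3-state phone model (the probability values are illustrative, not trained):

<source lang="python">
import numpy as np

# Left-to-right transition matrix for a 3-state phone model
# (beginning, middle, end). Rows = source state, columns = target state.
# Zeros below the diagonal forbid transitions back to earlier states;
# the diagonal self-loops let a state repeat to model phone duration.
A = np.array([
    [0.7, 0.3, 0.0],   # beg -> beg (self-loop) or mid
    [0.0, 0.6, 0.4],   # mid -> mid or end
    [0.0, 0.0, 1.0],   # end -> end (exit handled by the word-level HMM)
])
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a distribution
</source>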

=== Speech recognition HMM ===
[[Image:Fig22.PNG|thumb|500px|(Speech and Language Processing(draft) - Jurafsky, Martin)]]
*''Figure 9.22'':
**Combination of word HMMs
**an added optional state modelling silence
**transition from the ''end'' state back to the ''start'' state - a sentence can be composed of an arbitrary number of words
**transitions from the ''start'' state are assigned unigram LM probabilities

*''Figure 9.29'':
[[Image:Fig29.PNG|thumb|500px|(Speech and Language Processing(draft) - Jurafsky, Martin)]]
**the ''start'' state, ''end'' state, and silence states are omitted here for convenience
**a bigram language model is used here - probabilities attached to transitions from the ends of words to the beginnings of words

== Feature Extraction: MFCC Vectors ==
*''MFCC - mel frequency cepstral coefficients'' - the most common feature representation

=== Analog-to-Digital Conversion ===
*'''sampling'''
**measuring the amplitude of the signal at a particular time
**''sampling rate'' - number of samples per second
**maximum frequency that can be measured is half of the ''sampling rate''
*'''quantization'''
**Representation of the real-valued amplitude measurements as integers
***8 bits => values from -128 to 127
***16 bits => values from -32768 to 32767
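A sketch of both steps on a synthetic signal (the 16 kHz sampling rate and the test tone are values chosen here for illustration, not prescribed by the text):

<source lang="python">
import numpy as np

sampling_rate = 16000       # 16 kHz: can represent frequencies up to 8 kHz
t = np.arange(0, 0.1, 1.0 / sampling_rate)    # 100 ms time axis
signal = 0.5 * np.sin(2 * np.pi * 440 * t)    # 440 Hz tone, amplitude 0.5

# 16-bit quantization: map real-valued amplitudes to integers
# in [-32768, 32767].
quantized = np.clip(np.round(signal * 32767), -32768, 32767).astype(np.int16)
</source>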

=== Preemphasis ===
*boosting the amount of energy in the high frequencies
*improves phone detection accuracy
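Preemphasis is commonly implemented as the first-order filter <math>y[n] = x[n] - \alpha x[n-1]</math>; the coefficient 0.97 below is a typical choice, assumed here:

<source lang="python">
import numpy as np

def preemphasize(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; boosts high-frequency energy
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
</source>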

=== Windowing ===
*spectral features are to be extracted from small segments of speech
**assumption that the signal is stationary within a small segment
*(roughly) stationary segments of speech are extracted using a '''windowing''' technique
*windowing characterized by
**window's '''width'''
**'''offset''' between successive windows
**'''shape''' of the window
*segments of speech extracted by windowing are called '''frames'''
**'''frame size''' - number of milliseconds in each frame
**'''frame shift''' - number of milliseconds between the left edges of successive windows
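A sketch of framing with a Hamming window; the 25 ms frame size and 10 ms frame shift are common values, assumed here:

<source lang="python">
import numpy as np

def frames(signal, sampling_rate, size_ms=25, shift_ms=10):
    size = int(sampling_rate * size_ms / 1000)    # frame size in samples
    shift = int(sampling_rate * shift_ms / 1000)  # frame shift in samples
    window = np.hamming(size)                     # window shape
    return np.array([signal[i:i + size] * window
                     for i in range(0, len(signal) - size + 1, shift)])
</source>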

=== Discrete Fourier Transform (DFT) ===
*extracts spectral information from the windowed signal
*i.e. how much energy each frame contains in different frequency bands
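A sketch of the per-frame power spectrum, assuming the frames() helper from the previous sketch (the 512-point FFT size is a typical choice, not taken from the text):

<source lang="python">
import numpy as np

def power_spectrum(framed, n_fft=512):
    # Magnitude of the DFT of each windowed frame; squaring gives the
    # energy in each frequency bin.
    return np.abs(np.fft.rfft(framed, n=n_fft)) ** 2
</source>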

=== Mel Filter Bank and Log ===
*human ears are less sensitive to higher frequencies (above 1000 Hz)
*human hearing is "logarithmic"
**(i.e. for humans, the distance between 440 Hz and 880 Hz equals the distance between 880 Hz and 1760 Hz - in both cases, the distance is one musical octave)
*these features of hearing are exploited in speech recognition
*DFT outputs are warped onto the '''mel scale'''
**frequency scale divided into bands
***10 bands linearly spaced below 1000 Hz
***remaining bands spread logarithmically above 1000 Hz
**new '''mel-scaled''' vectors
***energy collected from each frequency band
***logarithms of energy values - the human response to signal level is logarithmic
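The mel scale itself is commonly computed as <math>\textit{mel}(f) = 2595 \cdot \log_{10}(1 + f/700)</math> (a standard formula, not given above). A sketch of collecting log energies through a triangular mel filter bank (the number of filters is an assumption):

<source lang="python">
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_energies(power_spec, sampling_rate, n_filters=26, n_fft=512):
    # Triangular filters spaced evenly on the mel scale
    mel_points = np.linspace(0, hz_to_mel(sampling_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points)
                    / sampling_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        bank[i - 1, bins[i - 1]:bins[i]] = np.linspace(0, 1, bins[i] - bins[i - 1])
        bank[i - 1, bins[i]:bins[i + 1]] = np.linspace(1, 0, bins[i + 1] - bins[i])
    # Collect energy in each band, then take logs (logarithmic hearing)
    return np.log(power_spec @ bank.T + 1e-10)
</source>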

=== The Cepstrum ===
''briefly:''
*inverse DFT of the log mel spectrum
*result - 12 cepstral coefficients for each frame
*motivation - these coefficients are uncorrelated and model the vocal tract better
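In practice the inverse-DFT step is usually implemented as a discrete cosine transform (DCT) of the log mel energies; a sketch (keeping coefficients 1-12 and dropping the 0th is one common convention, assumed here):

<source lang="python">
from scipy.fftpack import dct

def cepstral_coefficients(log_mel):
    # DCT decorrelates the log filter-bank energies; keep the first
    # 12 coefficients (coefficient 0, overall energy, is dropped here)
    return dct(log_mel, type=2, axis=-1, norm="ortho")[..., 1:13]
</source>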

=== Deltas and Energy ===
*So far - 12 cepstral features
*13th feature - '''Energy''': obtained by summing the squares of the sample values in the given frame
*For each feature:
**'''delta''' cepstral coefficient obtained as a time derivative of the feature values, estimated as
 <math>d(t) = \frac{c(t+1) - c(t-1)}{2}</math>
 for a particular cepstral value ''c(t)'' at time ''t''
**'''double delta''' cepstral coefficient obtained as a time derivative of the '''delta''' cepstral feature values
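A sketch of the two-point delta formula above, applied framewise; padding the edge frames by repetition is an implementation choice, not prescribed by the text:

<source lang="python">
import numpy as np

def deltas(features):
    # d(t) = (c(t+1) - c(t-1)) / 2, with edge frames padded by repetition
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0
</source>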

=== MFCC Summary ===
*39 MFCC features in total:
 12 cepstral coefficients
 12 delta cepstral coefficients
 12 double delta cepstral coefficients
 1 energy coefficient
 1 delta energy coefficient
 1 double delta energy coefficient
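Putting the pieces together, the 39-dimensional vector is a concatenation of the blocks above; a sketch assuming the deltas() helper from the previous sketch:

<source lang="python">
import numpy as np

def frame_energy(framed):
    # 13th base feature: sum of squared sample values in each frame
    return np.sum(framed ** 2, axis=1, keepdims=True)

def mfcc_39(cepstra, framed):
    # cepstra: (n_frames, 12) cepstral coefficients; framed: windowed frames
    base = np.hstack([cepstra, frame_energy(framed)])  # 12 + 1 = 13
    d = deltas(base)                                   # 13 delta features
    dd = deltas(d)                                     # 13 double-delta features
    return np.hstack([base, d, dd])                    # 39 features per frame
</source>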

= Speech Synthesis =