Things to Remember

Probability

  • Joint and conditional probability: $p(A,B) = p(A \cap B)$; $p(A|B) = \frac{p(A,B)}{p(B)}$

  • Bayes Rule: $p(A|B) = p(B|A)\cdot\frac{p(A)}{p(B)}$

  • Chain Rule: $p(A_1, A_2, \dots, A_n) = p(A_1|A_2, \dots, A_n) \cdot p(A_2|A_3, \dots, A_n) \cdots p(A_n)$

  • The Golden Rule (of stat. NLP): $A_{\mathrm{best}} = \mathrm{argmax}_A\ p(B|A)\cdot p(A)$ (sketched after this list)

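A minimal Python sketch (made-up joint distribution) of how the conditional-probability definition and Bayes rule above fit together; the Golden Rule is just Bayes rule with the constant denominator $p(B)$ dropped inside the argmax.

    # Toy joint distribution p(A, B) over two binary events (made-up numbers).
    p_joint = {
        (True, True): 0.30, (True, False): 0.20,
        (False, True): 0.10, (False, False): 0.40,
    }

    def p_A(a):
        return sum(p for (x, _), p in p_joint.items() if x == a)

    def p_B(b):
        return sum(p for (_, y), p in p_joint.items() if y == b)

    def p_A_given_B(a, b):
        # p(A|B) = p(A,B) / p(B)
        return p_joint[(a, b)] / p_B(b)

    def p_B_given_A(b, a):
        return p_joint[(a, b)] / p_A(a)

    # Bayes rule: p(A|B) = p(B|A) * p(A) / p(B)
    a, b = True, True
    print(p_A_given_B(a, b))                      # 0.75, directly from the definition
    print(p_B_given_A(b, a) * p_A(a) / p_B(b))    # 0.75 again, via Bayes rule
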
Information Theory

  • Entropy: $H(X) = -\sum_x p(x)\cdot \log_2(p(x))$ (sketched after this list)

  • Perplexity: $G(p) = 2^{H(p)}$

  • Conditional entropy: $H(Y|X) = -\sum_{x,y} p(x,y)\cdot\log_2(p(y|x))$

    • Chain Rule: $H(X,Y) = H(Y|X) + H(X) = H(X|Y) + H(Y)$

  • Kullback-Leibler distance: $D(p||q) = \sum_x p(x)\cdot\log_2\left(\frac{p(x)}{q(x)}\right)$

  • Mutual Information: $I(X,Y) = D(p(x,y)||p(x)\cdot p(y))$ (sketched after this list)

    • $I(X,Y) = \sum_{x,y} p(x,y)\cdot\log_2\left(\frac{p(x,y)}{p(x)\cdot p(y)}\right)$

    • $I(X,Y) = H(X) - H(X|Y)$

    • $D(p||q) \geq 0$

  • Cross Entropy: $H_{p'}(p) = -\sum_x p'(x)\cdot\log_2(p(x))$

    • conditional: $H_{p'}(p) = -\sum_{x,y} p'(x,y)\cdot\log_2(p(y|x))$

    • conditional over data: $-\frac{1}{|T'|}\cdot\sum_{i\ \text{over data}}\log_2(p(y_i|x_i))$ (sketched after this list)

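The entropy, perplexity, conditional-entropy and chain-rule formulas above, evaluated on an assumed toy joint distribution (standard-library Python only).

    import math

    # Toy joint distribution p(x, y) with made-up numbers.
    p_xy = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.2, ("b", 1): 0.3}

    def H(p):
        # H = -sum p * log2 p (terms with p = 0 contribute nothing)
        return -sum(v * math.log2(v) for v in p.values() if v > 0)

    def marginal(p, axis):
        m = {}
        for key, v in p.items():
            m[key[axis]] = m.get(key[axis], 0.0) + v
        return m

    def H_cond(p, given_axis):
        # H(Y|X) = -sum_{x,y} p(x,y) * log2 p(y|x), with p(y|x) = p(x,y)/p(x)
        m = marginal(p, given_axis)
        return -sum(v * math.log2(v / m[k[given_axis]]) for k, v in p.items() if v > 0)

    p_x, p_y = marginal(p_xy, 0), marginal(p_xy, 1)
    print(H(p_x), 2 ** H(p_x))          # entropy of X and its perplexity 2^H
    print(H(p_xy))                      # joint entropy H(X,Y)
    print(H_cond(p_xy, 0) + H(p_x))     # = H(Y|X) + H(X)  (chain rule)
    print(H_cond(p_xy, 1) + H(p_y))     # = H(X|Y) + H(Y)  (same value)
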
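KL distance and mutual information on the same kind of toy joint distribution, checking both the $D(p(x,y)||p(x)\cdot p(y))$ form and the $H(X) - H(X|Y)$ form.

    import math

    p_xy = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.2, ("b", 1): 0.3}  # toy joint

    p_x, p_y = {}, {}
    for (x, y), v in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + v
        p_y[y] = p_y.get(y, 0.0) + v

    def kl(p, q):
        # D(p||q) = sum p(x) * log2( p(x) / q(x) ); always >= 0
        return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)

    # Mutual information as the KL distance between the joint and the product of marginals
    q_indep = {(x, y): p_x[x] * p_y[y] for (x, y) in p_xy}
    print(kl(p_xy, q_indep))            # I(X,Y) = D(p(x,y) || p(x)p(y))

    # Same value via I(X,Y) = H(X) - H(X|Y)
    H_x = -sum(v * math.log2(v) for v in p_x.values())
    H_x_given_y = -sum(v * math.log2(v / p_y[y]) for (x, y), v in p_xy.items())
    print(H_x - H_x_given_y)
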
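The "conditional over data" form of cross entropy is what gets computed when a model is evaluated on held-out text; a sketch with a hypothetical model $p(y|x)$ and made-up test pairs.

    import math

    # Hypothetical conditional model p(y|x) and held-out data T' (both made up).
    model = {("the", "cat"): 0.2, ("the", "dog"): 0.1,
             ("cat", "sat"): 0.3, ("dog", "sat"): 0.3}
    test_data = [("the", "cat"), ("cat", "sat"), ("the", "dog"), ("dog", "sat")]  # (x_i, y_i)

    # Conditional cross entropy over data: -1/|T'| * sum_i log2 p(y_i|x_i)
    H = -sum(math.log2(model[(x, y)]) for x, y in test_data) / len(test_data)
    print(H, "bits per word; perplexity =", 2 ** H)
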
Language Modeling

  • The Golden Rule (again): $A_{\mathrm{best}} = \mathrm{argmax}_A\ p(B|A)\cdot p(A)$, where

    • $p(B|A)$ – the application-specific model

    • $p(A)$ – the language model

  • Markov Chain (n-gram LM): $p(W) = \prod_i p(w_i|w_{i-n+1}, w_{i-n+2}, \dots, w_{i-1})$

  • Maximum Likelihood Estimate (3-grams): $p(w_i|w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$ (sketched after this list)

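A sketch of the trigram MLE and the Markov-chain product on an assumed toy corpus (sentence-boundary markers, which a real LM needs, are omitted).

    from collections import Counter
    import math

    # Toy training corpus (assumed); real data would be a large tokenized text.
    tokens = "the cat sat on the mat the cat ate the rat".split()

    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))   # c(w_{i-2}, w_{i-1}, w_i)
    bi = Counter(zip(tokens, tokens[1:]))                # c(w_{i-2}, w_{i-1})

    def p_mle(w, w2, w1):
        # p(w_i | w_{i-2}, w_{i-1}) = c(w_{i-2}, w_{i-1}, w_i) / c(w_{i-2}, w_{i-1})
        return tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0

    print(p_mle("sat", "the", "cat"))   # 0.5: "the cat" is followed once by "sat", once by "ate"

    # Markov chain: p(W) is the product of the conditional trigram probabilities
    sent = ["the", "cat", "sat", "on", "the", "mat"]
    logp = sum(math.log2(p_mle(w, w2, w1)) for w2, w1, w in zip(sent, sent[1:], sent[2:]))
    print(2 ** logp)                    # p(W) over the inner trigrams of the toy sentence
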
Smoothing

  • Adding 1: $p'(w|h) = \frac{c(w,h) + 1}{c(h) + |V|}$

  • Adding less than 1: $p'(w|h) = \frac{c(w,h) + \lambda}{c(h) + \lambda\cdot|V|}$ (sketched after this list)

  • Good-Turing: $p'(w_i) = \frac{(c(w_i) + 1)\cdot N(c(w_i) + 1)}{|T|\cdot N(c(w_i))}$, where $N(c)$ is the number of words occurring exactly $c$ times in the training data of size $|T|$ (sketched after this list)

      • normalize so that the estimates sum to 1

  • Linear Interpolation using MLE:

    • $p'_{\lambda}(w_i|w_{i-2}, w_{i-1}) = \lambda_3\cdot p_3(w_i|w_{i-2}, w_{i-1}) + \lambda_2\cdot p_2(w_i|w_{i-1}) + \lambda_1\cdot p_1(w_i) + \lambda_0\cdot\frac{1}{|V|}$

    • minimize the cross-entropy on heldout data: $-\frac{1}{|H|}\sum_{i=1}^{|H|}\log_2(p'_{\lambda}(w_i|h_i))$

    • compute expected counts for the lambdas: $c(\lambda_j) = \sum_{i=1}^{|H|}\frac{\lambda_j\cdot p_j(w_i|h_i)}{p'_{\lambda}(w_i|h_i)}$

    • compute the next lambdas: $\lambda_{j,\mathrm{next}} = \frac{c(\lambda_j)}{\sum_k c(\lambda_k)}$; iterate until the lambdas converge (full EM loop sketched after this list)

  • Bucketed Smoothing – divide heldout data into buckets according to frequency and use LI+MLE

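Add-one and add-less-than-one smoothing differ only in the value of $\lambda$; a sketch with assumed toy counts for a single history.

    from collections import Counter

    # Toy counts c(w, h) for a single history h = "the" and a small vocabulary (assumed).
    counts = Counter({"cat": 3, "dog": 1})
    V = ["cat", "dog", "mat", "rat"]          # vocabulary
    c_h = sum(counts.values())                # c(h)

    def p_smoothed(w, lam=0.1):
        # p'(w|h) = (c(w,h) + lambda) / (c(h) + lambda * |V|); lambda = 1 gives "adding 1"
        return (counts[w] + lam) / (c_h + lam * len(V))

    print(sum(p_smoothed(w) for w in V))      # 1.0: still a proper distribution
    print(p_smoothed("mat"))                  # an unseen word now gets a small nonzero probability
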
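A Good-Turing sketch on assumed toy unigram counts; falling back to the MLE when $N(c+1) = 0$ and the simple renormalization are simplifications of what a real implementation does.

    from collections import Counter

    # Toy unigram counts from a training text T (assumed).
    tokens = "a a a a b b b c c d".split()
    c = Counter(tokens)              # c(w)
    N = Counter(c.values())          # N(r) = number of word types occurring exactly r times
    T = len(tokens)                  # |T|

    def p_gt(w):
        # Good-Turing: p'(w) = (c(w)+1) * N(c(w)+1) / (|T| * N(c(w)))
        r = c[w]
        if N.get(r + 1, 0) == 0:
            return r / T             # keep the MLE when N(r+1) = 0 (typical for the largest counts)
        return (r + 1) * N[r + 1] / (T * N[r])

    raw = {w: p_gt(w) for w in c}
    Z = sum(raw.values())
    print({w: p / Z for w, p in raw.items()})   # normalized so the estimates sum to 1
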
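The expected-count and re-estimation steps above form an EM loop; a sketch with stand-in component models (made-up probabilities) and toy heldout data.

    # EM for the interpolation weights; p3, p2, p1 are stand-ins with made-up
    # probabilities (in practice they are the trigram/bigram/unigram MLE models).
    V = 1000                                        # assumed vocabulary size |V|
    def p3(w, h): return 0.6 if (w, h) == ("sat", ("the", "cat")) else 0.0
    def p2(w, h): return 0.3 if w == "sat" else 0.1
    def p1(w, h): return 0.01
    models = [lambda w, h: 1.0 / V, p1, p2, p3]     # order: lambda_0 ... lambda_3

    heldout = [("sat", ("the", "cat"))] * 5 + [("on", ("cat", "sat"))] * 3  # toy heldout data H

    lambdas = [0.25] * 4                            # start from uniform weights
    for _ in range(20):                             # iterate EM until (near) convergence
        expected = [0.0] * 4
        for w, h in heldout:
            p_interp = sum(l * m(w, h) for l, m in zip(lambdas, models))
            for j, (l, m) in enumerate(zip(lambdas, models)):
                expected[j] += l * m(w, h) / p_interp   # expected count c(lambda_j)
        total = sum(expected)
        lambdas = [e / total for e in expected]         # next lambdas = normalized counts

    print(lambdas)                                  # the weights sum to 1
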
Mutual Information and Word Classes

Word Classes

  • 3-gram LM using classes: $p_k(w_i|c_{i-2}, c_{i-1}) = p(w_i|c_i)\cdot p_k(c_i|c_{i-2}, c_{i-1})$

  • Which classes (words) to merge – objective function: $-H(W) + I(D,E)$, where $D, E$ are the LHS and RHS classes of the bigrams in $W$

  • Greedy Algorithm

    • Start with each word in a separate class

    • Merge classes $k, l$ so that $(k,l) = \mathrm{argmax}_{k,l}\ I_{\mathrm{merge}\ k,l}(D,E)$

    • Repeat the previous step until $|C|$ is as small as desired (greedy loop sketched below)
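
Since $H(W)$ does not change when classes are merged, the greedy step amounts to picking the merge that keeps $I(D,E)$ highest. A naive sketch on an assumed toy corpus, recomputing $I(D,E)$ for every candidate pair (real implementations update it incrementally).

    import math
    from collections import Counter
    from itertools import combinations

    tokens = "the cat sat on the mat the dog sat on the rug".split()   # toy corpus

    def bigram_mi(class_of):
        # I(D,E) = sum_{d,e} p(d,e) * log2( p(d,e) / (p(d) * p(e)) ) over class bigrams
        bigrams = Counter((class_of[a], class_of[b]) for a, b in zip(tokens, tokens[1:]))
        n = sum(bigrams.values())
        left, right = Counter(), Counter()
        for (d, e), cnt in bigrams.items():
            left[d] += cnt
            right[e] += cnt
        return sum((cnt / n) * math.log2((cnt / n) / ((left[d] / n) * (right[e] / n)))
                   for (d, e), cnt in bigrams.items())

    class_of = {w: w for w in set(tokens)}          # start: each word in its own class

    while len(set(class_of.values())) > 4:          # stop when |C| is as small as desired
        classes = sorted(set(class_of.values()))
        # merge the pair (k, l) that keeps I(D,E) highest after the merge
        k, l = max(combinations(classes, 2),
                   key=lambda kl: bigram_mi({w: (kl[0] if c == kl[1] else c)
                                             for w, c in class_of.items()}))
        class_of = {w: (k if c == l else c) for w, c in class_of.items()}
        print("merged", l, "into", k)

    print(class_of)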