Things to Remember

Probability

  • Joint and conditional probability: $p(A,B) = p(A \cap B)$; $p(A|B) = \frac{p(A,B)}{p(B)}$

  • Bayes Rule: $p(A|B) = p(B|A)\cdot\frac{p(A)}{p(B)}$

  • Chain Rule: $p(A_1, A_2, \dots, A_n) = p(A_1|A_2, \dots, A_n) \cdot p(A_2|A_3, \dots, A_n) \cdots p(A_n)$

  • The Golden Rule (of stat. NLP): $A_{\mathrm{best}} = \mathrm{argmax}_A\ p(B|A)\cdot p(A)$ (sketched after this list)

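A minimal Python sketch (made-up joint distribution) of how the conditional-probability definition and Bayes rule above fit together; the Golden Rule is just Bayes rule with the constant denominator $p(B)$ dropped inside the argmax.

    # Toy joint distribution p(A, B) over two binary events (made-up numbers).
    p_joint = {
        (True, True): 0.30, (True, False): 0.20,
        (False, True): 0.10, (False, False): 0.40,
    }

    def p_A(a):
        return sum(p for (x, _), p in p_joint.items() if x == a)

    def p_B(b):
        return sum(p for (_, y), p in p_joint.items() if y == b)

    def p_A_given_B(a, b):
        # p(A|B) = p(A,B) / p(B)
        return p_joint[(a, b)] / p_B(b)

    def p_B_given_A(b, a):
        return p_joint[(a, b)] / p_A(a)

    # Bayes rule: p(A|B) = p(B|A) * p(A) / p(B)
    a, b = True, True
    print(p_A_given_B(a, b))                      # 0.75, directly from the definition
    print(p_B_given_A(b, a) * p_A(a) / p_B(b))    # 0.75 again, via Bayes rule
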
Information Theory

  • Entropy: $H(X) = -\sum_x p(x)\cdot \log_2(p(x))$ (sketched after this list)

  • Perplexity: $G(p) = 2^{H(p)}$

  • Conditional entropy: $H(Y|X) = -\sum_{x,y} p(x,y)\cdot\log_2(p(y|x))$

    • Chain Rule: $H(X,Y) = H(Y|X) + H(X) = H(X|Y) + H(Y)$

  • Kullback-Leibler distance: $D(p||q) = \sum_x p(x)\cdot\log_2\left(\frac{p(x)}{q(x)}\right)$

  • Mutual Information: $I(X,Y) = D(p(x,y)||p(x)\cdot p(y))$ (sketched after this list)

    • $I(X,Y) = \sum_{x,y} p(x,y)\cdot\log_2\left(\frac{p(x,y)}{p(x)\cdot p(y)}\right)$

    • $I(X,Y) = H(X) - H(X|Y)$

    • $D(p||q) \geq 0$

  • Cross Entropy: $H_{p'}(p) = -\sum_x p'(x)\cdot\log_2(p(x))$

    • conditional: $H_{p'}(p) = -\sum_{x,y} p'(x,y)\cdot\log_2(p(y|x))$

    • conditional over data: $-\frac{1}{|T'|}\cdot\sum_{i\ \text{over data}}\log_2(p(y_i|x_i))$ (sketched after this list)

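The entropy, perplexity, conditional-entropy and chain-rule formulas above, evaluated on an assumed toy joint distribution (standard-library Python only).

    import math

    # Toy joint distribution p(x, y) with made-up numbers.
    p_xy = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.2, ("b", 1): 0.3}

    def H(p):
        # H = -sum p * log2 p (terms with p = 0 contribute nothing)
        return -sum(v * math.log2(v) for v in p.values() if v > 0)

    def marginal(p, axis):
        m = {}
        for key, v in p.items():
            m[key[axis]] = m.get(key[axis], 0.0) + v
        return m

    def H_cond(p, given_axis):
        # H(Y|X) = -sum_{x,y} p(x,y) * log2 p(y|x), with p(y|x) = p(x,y)/p(x)
        m = marginal(p, given_axis)
        return -sum(v * math.log2(v / m[k[given_axis]]) for k, v in p.items() if v > 0)

    p_x, p_y = marginal(p_xy, 0), marginal(p_xy, 1)
    print(H(p_x), 2 ** H(p_x))          # entropy of X and its perplexity 2^H
    print(H(p_xy))                      # joint entropy H(X,Y)
    print(H_cond(p_xy, 0) + H(p_x))     # = H(Y|X) + H(X)  (chain rule)
    print(H_cond(p_xy, 1) + H(p_y))     # = H(X|Y) + H(Y)  (same value)
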
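KL distance and mutual information on the same kind of toy joint distribution, checking both the $D(p(x,y)||p(x)\cdot p(y))$ form and the $H(X) - H(X|Y)$ form.

    import math

    p_xy = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.2, ("b", 1): 0.3}  # toy joint

    p_x, p_y = {}, {}
    for (x, y), v in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + v
        p_y[y] = p_y.get(y, 0.0) + v

    def kl(p, q):
        # D(p||q) = sum p(x) * log2( p(x) / q(x) ); always >= 0
        return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)

    # Mutual information as the KL distance between the joint and the product of marginals
    q_indep = {(x, y): p_x[x] * p_y[y] for (x, y) in p_xy}
    print(kl(p_xy, q_indep))            # I(X,Y) = D(p(x,y) || p(x)p(y))

    # Same value via I(X,Y) = H(X) - H(X|Y)
    H_x = -sum(v * math.log2(v) for v in p_x.values())
    H_x_given_y = -sum(v * math.log2(v / p_y[y]) for (x, y), v in p_xy.items())
    print(H_x - H_x_given_y)
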
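The "conditional over data" form of cross entropy is what gets computed when a model is evaluated on held-out text; a sketch with a hypothetical model $p(y|x)$ and made-up test pairs.

    import math

    # Hypothetical conditional model p(y|x) and held-out data T' (both made up).
    model = {("the", "cat"): 0.2, ("the", "dog"): 0.1,
             ("cat", "sat"): 0.3, ("dog", "sat"): 0.3}
    test_data = [("the", "cat"), ("cat", "sat"), ("the", "dog"), ("dog", "sat")]  # (x_i, y_i)

    # Conditional cross entropy over data: -1/|T'| * sum_i log2 p(y_i|x_i)
    H = -sum(math.log2(model[(x, y)]) for x, y in test_data) / len(test_data)
    print(H, "bits per word; perplexity =", 2 ** H)
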
Language Modeling

  • The Golden Rule (again): $A_{\mathrm{best}} = \mathrm{argmax}_A\ p(B|A)\cdot p(A)$, where

    • $p(B|A)$ – the application-specific model

    • $p(A)$ – the language model

  • Markov Chain (n-gram LM): $p(W) = \prod_i p(w_i|w_{i-n+1}, w_{i-n+2}, \dots, w_{i-1})$

  • Maximum Likelihood Estimate (3-grams): $p(w_i|w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$ (sketched after this list)

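A sketch of the trigram MLE and the Markov-chain product on an assumed toy corpus (sentence-boundary markers, which a real LM needs, are omitted).

    from collections import Counter
    import math

    # Toy training corpus (assumed); real data would be a large tokenized text.
    tokens = "the cat sat on the mat the cat ate the rat".split()

    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))   # c(w_{i-2}, w_{i-1}, w_i)
    bi = Counter(zip(tokens, tokens[1:]))                # c(w_{i-2}, w_{i-1})

    def p_mle(w, w2, w1):
        # p(w_i | w_{i-2}, w_{i-1}) = c(w_{i-2}, w_{i-1}, w_i) / c(w_{i-2}, w_{i-1})
        return tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0

    print(p_mle("sat", "the", "cat"))   # 0.5: "the cat" is followed once by "sat", once by "ate"

    # Markov chain: p(W) is the product of the conditional trigram probabilities
    sent = ["the", "cat", "sat", "on", "the", "mat"]
    logp = sum(math.log2(p_mle(w, w2, w1)) for w2, w1, w in zip(sent, sent[1:], sent[2:]))
    print(2 ** logp)                    # p(W) over the inner trigrams of the toy sentence
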
Smoothing

  • Adding 1: $p'(w|h) = \frac{c(w,h) + 1}{c(h) + |V|}$

  • Adding less than 1: $p'(w|h) = \frac{c(w,h) + \lambda}{c(h) + \lambda\cdot|V|}$ (sketched after this list)

  • Good-Turing: $p'(w_i) = \frac{(c(w_i) + 1)\cdot N(c(w_i) + 1)}{|T|\cdot N(c(w_i))}$, where $N(c)$ is the number of words occurring exactly $c$ times in the training data of size $|T|$ (sketched after this list)

      • normalize so that the estimates sum to 1

  • Linear Interpolation using MLE:

    • $p'_{\lambda}(w_i|w_{i-2}, w_{i-1}) = \lambda_3\cdot p_3(w_i|w_{i-2}, w_{i-1}) + \lambda_2\cdot p_2(w_i|w_{i-1}) + \lambda_1\cdot p_1(w_i) + \lambda_0\cdot\frac{1}{|V|}$

    • minimize the cross-entropy on heldout data: $-\frac{1}{|H|}\sum_{i=1}^{|H|}\log_2(p'_{\lambda}(w_i|h_i))$

    • compute expected counts for the lambdas: $c(\lambda_j) = \sum_{i=1}^{|H|}\frac{\lambda_j\cdot p_j(w_i|h_i)}{p'_{\lambda}(w_i|h_i)}$

    • compute the next lambdas: $\lambda_{j,\mathrm{next}} = \frac{c(\lambda_j)}{\sum_k c(\lambda_k)}$; iterate until the lambdas converge (full EM loop sketched after this list)

  • Bucketed Smoothing – divide heldout data into buckets according to frequency and use LI+MLE

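Add-one and add-less-than-one smoothing differ only in the value of $\lambda$; a sketch with assumed toy counts for a single history.

    from collections import Counter

    # Toy counts c(w, h) for a single history h = "the" and a small vocabulary (assumed).
    counts = Counter({"cat": 3, "dog": 1})
    V = ["cat", "dog", "mat", "rat"]          # vocabulary
    c_h = sum(counts.values())                # c(h)

    def p_smoothed(w, lam=0.1):
        # p'(w|h) = (c(w,h) + lambda) / (c(h) + lambda * |V|); lambda = 1 gives "adding 1"
        return (counts[w] + lam) / (c_h + lam * len(V))

    print(sum(p_smoothed(w) for w in V))      # 1.0: still a proper distribution
    print(p_smoothed("mat"))                  # an unseen word now gets a small nonzero probability
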
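A Good-Turing sketch on assumed toy unigram counts; falling back to the MLE when $N(c+1) = 0$ and the simple renormalization are simplifications of what a real implementation does.

    from collections import Counter

    # Toy unigram counts from a training text T (assumed).
    tokens = "a a a a b b b c c d".split()
    c = Counter(tokens)              # c(w)
    N = Counter(c.values())          # N(r) = number of word types occurring exactly r times
    T = len(tokens)                  # |T|

    def p_gt(w):
        # Good-Turing: p'(w) = (c(w)+1) * N(c(w)+1) / (|T| * N(c(w)))
        r = c[w]
        if N.get(r + 1, 0) == 0:
            return r / T             # keep the MLE when N(r+1) = 0 (typical for the largest counts)
        return (r + 1) * N[r + 1] / (T * N[r])

    raw = {w: p_gt(w) for w in c}
    Z = sum(raw.values())
    print({w: p / Z for w, p in raw.items()})   # normalized so the estimates sum to 1
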
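The expected-count and re-estimation steps above form an EM loop; a sketch with stand-in component models (made-up probabilities) and toy heldout data.

    # EM for the interpolation weights; p3, p2, p1 are stand-ins with made-up
    # probabilities (in practice they are the trigram/bigram/unigram MLE models).
    V = 1000                                        # assumed vocabulary size |V|
    def p3(w, h): return 0.6 if (w, h) == ("sat", ("the", "cat")) else 0.0
    def p2(w, h): return 0.3 if w == "sat" else 0.1
    def p1(w, h): return 0.01
    models = [lambda w, h: 1.0 / V, p1, p2, p3]     # order: lambda_0 ... lambda_3

    heldout = [("sat", ("the", "cat"))] * 5 + [("on", ("cat", "sat"))] * 3  # toy heldout data H

    lambdas = [0.25] * 4                            # start from uniform weights
    for _ in range(20):                             # iterate EM until (near) convergence
        expected = [0.0] * 4
        for w, h in heldout:
            p_interp = sum(l * m(w, h) for l, m in zip(lambdas, models))
            for j, (l, m) in enumerate(zip(lambdas, models)):
                expected[j] += l * m(w, h) / p_interp   # expected count c(lambda_j)
        total = sum(expected)
        lambdas = [e / total for e in expected]         # next lambdas = normalized counts

    print(lambdas)                                  # the weights sum to 1
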
Mutual Information and Word Classes

Word Classes

  • 3-gram LM using classes: $p_k(w_i|c_{i-2}, c_{i-1}) = p(w_i|c_i)\cdot p_k(c_i|c_{i-2}, c_{i-1})$

  • Which classes (words) to merge – objective function: $-H(W) + I(D,E)$, where $D, E$ are the LHS and RHS classes of the bigrams in $W$

  • Greedy Algorithm

    • Start with each word in a separate class

    • Merge classes $k, l$ so that $(k,l) = \mathrm{argmax}_{k,l}\ I_{\mathrm{merge}\ k,l}(D,E)$

    • Repeat the previous step until $|C|$ is as small as desired (greedy loop sketched below)
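
Since $H(W)$ does not change when classes are merged, the greedy step amounts to picking the merge that keeps $I(D,E)$ highest. A naive sketch on an assumed toy corpus, recomputing $I(D,E)$ for every candidate pair (real implementations update it incrementally).

    import math
    from collections import Counter
    from itertools import combinations

    tokens = "the cat sat on the mat the dog sat on the rug".split()   # toy corpus

    def bigram_mi(class_of):
        # I(D,E) = sum_{d,e} p(d,e) * log2( p(d,e) / (p(d) * p(e)) ) over class bigrams
        bigrams = Counter((class_of[a], class_of[b]) for a, b in zip(tokens, tokens[1:]))
        n = sum(bigrams.values())
        left, right = Counter(), Counter()
        for (d, e), cnt in bigrams.items():
            left[d] += cnt
            right[e] += cnt
        return sum((cnt / n) * math.log2((cnt / n) / ((left[d] / n) * (right[e] / n)))
                   for (d, e), cnt in bigrams.items())

    class_of = {w: w for w in set(tokens)}          # start: each word in its own class

    while len(set(class_of.values())) > 4:          # stop when |C| is as small as desired
        classes = sorted(set(class_of.values()))
        # merge the pair (k, l) that keeps I(D,E) highest after the merge
        k, l = max(combinations(classes, 2),
                   key=lambda kl: bigram_mi({w: (kl[0] if c == kl[1] else c)
                                             for w, c in class_of.items()}))
        class_of = {w: (k if c == l else c) for w, c in class_of.items()}
        print("merged", l, "into", k)

    print(class_of)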