# Státnice I3: Automatická analýza jazyka


Taken from the wiki notes for the course Statistical Methods of Natural Language Processing.

This section is incomplete and needs expansion: some parts should be elaborated, others shortened.

## Morphology & Tagging

• Task, formally: $A^{+} \to T$ (simplified), split morphology & tagging (disambiguation): $A^{+} \to 2^{(L,C_1,C_2,\dots,C_n)}\to T$. Tagging must look at context.
• Tagset: influenced by linguistics as well as technical decisions.
• Tag ~ n-tuple of grammar information, often thought of as a flat list.
• Eng. tagsets ~ 45 (PTB), 87 (Brown), presentation: short names.
• Other: bigger tagsets -- only positional tags, size: up to 10k.
• Tagging inside morphology: first, find the right tag, then the morphological categories: $A^{+}\to T\to (L,C_1,\dots,C_n)$.
• Feasible only for languages with weak inflection.
• Possibly only decrease the ambiguity for the purposes of tagging (i.e. morphology doesn't have to be so precise).
• Lemmatization: normally a part of morphology, sometimes (for searching) done separately.
• Stemming -- simple for Eng., no dictionary needed; now outdated.
• Possible methods for morphology:
• Word lists: lists of possible tags for each word form in a language
• Works well for Eng. (avg. ca. 2.5 tags/word), not so good for Cze. (avg. ca. 12-15 tags/word).
• Direct coding: splitting into morphemes (problem: split and find possible combinations)
• Finite State Machinery (FSM)
• CFG, DATR, Unification: better for generation than analysis

### Finite State Machinery

Finite-State-Automata

• Smarter word form lists: compression of a long word list into a compact automaton.
• Trie + Grammar information, minimize the automaton (automaton reduction)
• Need to minimize the automaton & not overgenerate
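The compressed word-form list can be sketched as a plain character trie (a minimal illustration only, before minimization into an automaton; the word forms and tags below are invented examples):

```python
# A toy trie mapping word forms to their sets of possible tags.
# Tags/words are invented; a real system would load a full word list.

class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.tags = None    # set of possible tags if a word ends here

def insert(root, word, tags):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.tags = set(tags)

def lookup(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return None
        node = node.children[ch]
    return node.tags  # None if the path exists but no word ends here

root = TrieNode()
insert(root, "stop", {"NN", "VB"})
insert(root, "stops", {"NNS", "VBZ"})

lookup(root, "stops")  # -> {"NNS", "VBZ"}
```

Minimizing such a trie (merging shared suffix subtrees into a DAG) yields the compact automaton mentioned above.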

Two-Level-Morphology

• phonology + morphotactics, two-level rules, converted to FSA
• solves e.g. Eng. doubling (stopping), Ger. umlaut, Cze. "ský" -->pl. "čtí" etc.
• Finite State Transducer: an automaton, where the input symbols are tuples
• run modes: check (for a sequence of pairs, gives out Y/N) + analysis (computes the "resolved" (upper) member of the pair for a sequence of "surface" symbols) + synthesis (the other way round)
• used mostly for analysis
• usually, one writes small independent rules (watch out for collisions), one FST per rule -- they are run in parallel and all must hold
• zero-symbols, one side only, check for max. count for a language
• we may eliminate zero-symbols using ordinary FSA on lexicon entries (upper layer alphabet; prefixes: according to them, the possible endings are treated specially)
• FSTs + FSAs may be combined, concatenated; the result is always an FST

Two-level morphology: analysis

1. initialize paths $P := \{\}$
2. read input surface symbols, one at a time; repeat (this loop consumes one char of the input):
   1. at each symbol, until the max. number of consecutive zeroes for the given language is reached, repeat (this loop just adds one possible zero):
      1. generate all lexical (upper-level) symbols possibly corresponding to zero (+ all possible zeroes on the surface)
      2. prolong all paths in $P$ by all such possible $(x:0)$ pairs (big expansion)
      3. check each new path extension against the phonological FST and the lexical FSA (lexical symbols only), delete impossible path prefixes
   2. generate all possible lexical symbols (from all FSTs) for the current input symbol, create pairs
   3. extend all paths with the pairs
   4. check all paths (next step in FST/FSA) and delete the impossible ones
3. collect lexical "glosses" from all surviving paths
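A toy FST in check mode may illustrate the pair-symbol idea. The rule, states, and alphabet below are invented for illustration (a `y`-to-`i` alternation in the style of fly+s -> flies), not the lecture's actual rule set:

```python
# Hypothetical two-level rule as an FST in check mode:
# lexical 'y' may surface as 'i' only when 'e' follows (identity elsewhere).
# State 0: neutral; state 1: just saw y:i, the next pair must be e:e.

def check(pairs):
    state = 0
    for lex, surf in pairs:
        if state == 1:
            if (lex, surf) == ("e", "e"):
                state = 0
            else:
                return False          # rule violated -> reject
        elif lex == surf:
            state = 0                 # identity pair always allowed
        elif (lex, surf) == ("y", "i"):
            state = 1                 # alternation started, expect e:e
        else:
            return False              # no rule licenses this pair
    return state == 0                 # must not end mid-alternation

check([("f","f"), ("l","l"), ("y","i"), ("e","e"), ("s","s")])  # True
check([("f","f"), ("l","l"), ("y","i"), ("s","s")])             # False
```

In a real system several such small FSTs run in parallel, and a pair sequence is accepted only if every one of them accepts it.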

### Rule-Based Tagging

• Rules using: word forms/tags/tag sets for context and current position, combinations
• If-then / regular-expression style (negative rules block impossible readings)
• Implementation: FSA, for all rules -- intersection
• algorithm ~ similar to Viterbi (dynamic programming: if the FSA rejects a path, throw it away)
• May even work, sometimes does not disambiguate enough
• Tagging by parsing: write rules for syntactic analysis and get morphological analysis as a by-product
• in a way, this is the linguistically correct approach
• better, cleaner rules
• more difficult than tagging itself, nobody has ever done it right
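A minimal sketch of the if-then blocking style (the rules and tagset are invented examples): negative rules delete candidate tags given an unambiguous left context, and are applied until nothing changes:

```python
# Toy rule-based disambiguation: each rule (prev_tag, blocked) says
# "after an unambiguous prev_tag, the tag `blocked` is impossible here".
# A tag is only removed while some ambiguity remains.

def apply_rules(words_tags, rules):
    # words_tags: list of (word, set_of_candidate_tags); mutated in place
    changed = True
    while changed:
        changed = False
        for i in range(1, len(words_tags)):
            prev_tags = words_tags[i - 1][1]
            for prev_tag, blocked in rules:
                if (prev_tags == {prev_tag}
                        and blocked in words_tags[i][1]
                        and len(words_tags[i][1]) > 1):
                    words_tags[i][1].discard(blocked)
                    changed = True
    return words_tags

sent = [("the", {"DT"}), ("can", {"MD", "NN", "VB"}), ("rusted", {"VBD"})]
rules = [("DT", "MD"), ("DT", "VB")]  # no modal/base verb right after a determiner
apply_rules(sent, rules)
# "can" is left with {"NN"}
```

As noted above, such rule sets may fail to disambiguate fully; whatever ambiguity survives all rules is simply passed on.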


### Maximum-Entropy Tagging

• using Max. Entropy model
• feature functions -- look at the data and answer Y/N (or output a real number), each weighted by $\lambda_i$
• maximize entropy ~ train the weights
• respect the proportions of counts of features in the data (1's)
• objective function optimization under certain constraints: Lagrange multipliers
• the only freely adjustable part is the set of features (which is much more than what we may tune in an HMM -- there's only $n$ for $n$-grams)
• features may use tags only at positions to the left (enabling dynamic programming ~ Viterbi) and words at all positions; they don't have to, but it is very useful to do so
• the model in general: $p(y|x) = \frac{1}{Z}e^{\sum_{i=1}^{N}\lambda_i f_i(y,x)}$, find $\lambda_i$ satisfying the model and constraints ($E_p(f_i(y,x)) = d_i$ where $d_i = E'(f_i(y,x))$ -- empirical expectation of the feature frequency)
• for tagging: $p(t|x) = \frac{1}{Z}e^{\sum_{i=1}^{N}\lambda_i f_i(t,x)}$, where $t\in$tagset, $x~$context (words and tags, e.g. up to 3 positions L-R (tags left only))
• features may ask any information from this window
• careful for too many details -- it slows things down and there's an imminent danger of overtraining
• best way: select features automatically -- we then have a combination of HMM & Transformation-Based Learning
• rules involve tags at previous places and are selected by machine
• we still have to define the templates for the features
• features are independent --> parallelizable; they may even capture interesting linguistic phenomena ~ final punctuation -> mood of the verb etc.
• feature selection
• manually: impossible --> greedy selection: iterative scaling algorithm ($O(n^2)$ -- still too much)
• limiting the number of features: be reasonable in creating the templates
• use contexts which appear in the training data
• use only rarer features (max. ca. 10 times in the training data) -- too frequent features are probably too unspecific -- get rid of them, e.g. if they have a distribution close to uniform (use relative entropy)
• do not use all combinations of context
• feature types
• any, even lexical features -- words with most errors etc.
• the best way of tagging, the only problem is that we don't know what the optimal features are (the neural networks/perceptron are the best for Cze.)
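The exponential model above can be made concrete with a toy tagger. The features, weights, and tagset below are invented for illustration; real weights $\lambda_i$ would come from training (e.g. iterative scaling), not be set by hand:

```python
# Toy maximum-entropy tagger: p(t|x) = exp(sum_i lambda_i * f_i(t, x)) / Z(x).
import math

TAGS = ["NN", "VB"]

def features(tag, context):
    # context: (previous_tag, current_word); binary feature functions
    prev_tag, word = context
    return [
        1.0 if tag == "NN" and prev_tag == "DT" else 0.0,   # DT usually precedes a noun
        1.0 if tag == "VB" and word.endswith("s") else 0.0, # crude suffix cue
    ]

LAMBDAS = [2.0, 0.5]  # invented weights standing in for trained values

def p(tag, context):
    def score(t):
        return math.exp(sum(l * f for l, f in zip(LAMBDAS, features(t, context))))
    z = sum(score(t) for t in TAGS)   # normalization Z(x) over the tagset
    return score(tag) / z

p("NN", ("DT", "can"))  # high: the determiner context votes for NN
```

Tagging a sentence then means choosing, position by position, the tag maximizing $p(t|x)$ (with Viterbi over left-context tags, since features look only leftward at tags).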

### Feature-Based Tagging

This section is incomplete and needs expansion. I don't understand this; it apparently corresponds to some kind of feature selection. -- Tuetschek 11:36, 7 Sep 2010 (CEST)
• idea: save on computing feature weights, select a few but good features
• training criterion: error rate
• model form: same as for maximum entropy $p(y|x) = \frac{1}{Z(x)}e^{\sum_{i=1}^N \lambda_i f_i (y,x)}$ -- exponential or loglinear model


## Parsing

Chomsky Hierarchy

1. Type 0 Grammars/Languages -- $\alpha\to\beta$, where $\alpha,\beta$ are arbitrary strings of terminals and nonterminals ($\alpha$ must contain at least one nonterminal)
2. Type 1: Context-Sensitive Grammars/Languages -- $\alpha X\beta\to\alpha\gamma\beta$, where $X$ is a nonterminal, $\alpha,\beta,\gamma$ any strings of terminals and nonterminals (and $\gamma$ must not be empty)
3. Type 2: Context-Free Grammars/Languages -- $X\to\gamma$, where $X$ is a nonterminal and $\gamma$ is any string of terminals and nonterminals
4. Type 3: Regular Grammars/Languages -- $X\to\alpha Y$, where $X,Y$ are nonterminals and $\alpha$ a string of terminals; $Y$ may be missing

Chomsky's Normal Form

• for CFGs -- every CFG is convertible to normal form
• rules only of the forms $A\to BC$ (two nonterminals), $A\to a$ (a single terminal), $S\to\varepsilon$ (empty string, for the start symbol only)

Parsing grammars

• Regular Grammars -- FSA, constant space, linear time
• CFG -- widely used for surface syntax description of natural languages, needed: stack space, $O(n^3)$ time -- CKY/CYK algorithm


### Probabilistic CFG

• relations among mother & daughter nodes in terms of derivation probability
• probability of a parse tree: $p(T) = \prod_{i=1}^n p(r(i))$, where $p(r(i))$ is the probability of the $i$-th rule used in the derivation of the parse tree $T$
• probability of a string is not as trivial, as there may be many trees resulting from parsing the string: $p(W) = \sum_{j=1}^n p(T_j)$, where $T_j$'s are all possible parses of the string W.
• assumptions (very strong!):
• independence of context (neighboring subtrees)
• independence of ancestors (upper levels)
• place independence (regardless where the rule appears in the tree) ~ similar to time invariance in HMM
• probability of a rule -- distribution $r(i): A\to \alpha$ ~ $0\leq p(r)\leq 1;\ \ \sum_{r\in\{A\to\dots\}} p(r) = 1$
• may be estimated by MLE from a treebank following a CFG grammar: counting rules & counting non-terminals
• inside probability: $\beta_N(p,q) = p(N\to^{\star}\ w_{pq})$ (probability that the nonterminal $N$ generates the part of the sentence $w_p \dots w_q$)
• for CFG in Normal Form may be computed recursively: $\beta_N(p,q) = \sum_{A,B}\sum_{d=p}^{q-1} p(N\to A\ B)\beta_A(p,d)\beta_B(d+1,q)$

Computing string probability -- Chart Parsing (CYK algorithm)

• create a table $n \times n$, where $n$ is the length of the string
• initialize on the diagonal, using $N\to\alpha$ rules (tuples: nonterminal + probability), compute along the diagonal towards the upper right corner
• fill the cell with a nonterminal + probability that the given part of the string, i.e. row = from, col = to, is generated by the particular rule
• consider more probabilities that lead to the same results & sum them (here: for obtaining the probability of a string, not parsing)
• for parsing: need to store what was the best (most probable) tree -- everything is computable from the table, but it's slow: usually, you want a list of cells / rules that produced this one
• if the CFG is in Chomsky Normal Form, the reverse computation is much more effective
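The chart computation above can be sketched for a grammar in Chomsky Normal Form (the toy grammar below is invented for illustration). Note the sum over split points and nonterminal pairs, matching the inside-probability recursion:

```python
# Toy probabilistic CYK: fills the chart with inside probabilities
# beta_N(i, j) and returns p(W) = sum over all parses.
from collections import defaultdict

# Invented CNF grammar: terminal rules N -> w and binary rules N -> A B.
TERM = {("NP", "she"): 0.4, ("NP", "stars"): 0.6, ("V", "saw"): 1.0}
BIN = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}

def string_probability(words, start="S"):
    n = len(words)
    # chart[i][j]: nonterminal -> inside probability for span words[i..j]
    chart = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):          # diagonal: terminal rules
        for (nt, word), p in TERM.items():
            if word == w:
                chart[i][i][nt] += p
    for span in range(2, n + 1):           # towards the upper right corner
        for i in range(n - span + 1):
            j = i + span - 1
            for d in range(i, j):          # split point
                for (nt, a, b), p in BIN.items():
                    if chart[i][d][a] and chart[d + 1][j][b]:
                        # sum (not max): we want the string probability
                        chart[i][j][nt] += p * chart[i][d][a] * chart[d + 1][j][b]
    return chart[0][n - 1][start]

string_probability(["she", "saw", "stars"])  # 1.0 * 0.4 * 1.0 * 1.0 * 0.6 = 0.24
```

For parsing rather than string probability, one would replace the sum by a max and store back-pointers to the cells/rules that produced each entry, as described above.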
