NPFL124 Natural Language Processing
Materials
Assignments 2024
Boolean Retrieval
Implement an inverted index that uses a hash table for the dictionary part of the index.
Implement algorithms for postings intersection and union.
Index the provided document collection.
Write a query parser for AND, OR, and NOT.
Process the provided set of boolean queries and submit the results.
Write a short report on your work.
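The indexing and merging steps above can be sketched as follows. This is only a minimal illustration, not a reference solution: the toy documents, the whitespace tokenization, and the function names (`build_index`, `intersect`, `union`, `complement`) are assumptions made for the example. A Python `dict` serves as the hash-based dictionary, and postings are kept as sorted lists of document IDs so that they can be merged in linear time.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a sorted postings list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(p1, p2):
    """AND: linear merge of two sorted postings lists."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def union(p1, p2):
    """OR: linear merge that keeps every ID exactly once."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            result.append(p1[i])
            i += 1
        else:
            result.append(p2[j])
            j += 1
    result.extend(p1[i:])
    result.extend(p2[j:])
    return result


def complement(p, all_ids):
    """NOT: IDs from the sorted list all_ids that are missing from p."""
    result = []
    j = 0
    for doc_id in all_ids:
        while j < len(p) and p[j] < doc_id:
            j += 1
        if j == len(p) or p[j] != doc_id:
            result.append(doc_id)
    return result


# Toy collection (hypothetical, for illustration only).
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}
index = build_index(docs)
all_ids = sorted(docs)

july_and_new = intersect(index["july"], index["new"])
new_or_rise = union(index["new"], index["rise"])
not_rise = complement(index["rise"], all_ids)
print(july_and_new, new_or_rise, not_rise)
```

A query parser for AND, OR, and NOT can then evaluate a parsed query by calling these three functions on the postings of the individual terms.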
Diacritics restoration
Implement a program that reads a Czech text with removed diacritics from STDIN and prints the same text with restored diacritics to STDOUT.
A possible solution: build a Czech corpus of your own (e.g. by using a few e-books or news or Wikipedia or ...) that contains at least 100k tokens (words and punctuation marks). Create a modified copy of the corpus in which all Czech diacritics are removed. Extract a mapping from words without diacritics to words with diacritics. For out-of-vocabulary words, use a letter-trigram language model.
Evaluate the accuracy of the restoration as the percentage of correct non-whitespace characters in the output.
Evaluation datasets - 2 randomly chosen articles from vesmir.cz:
development set
evaluation set (to be used only for evaluating the very final version of your system!)
You can use any programming language as long as it can be compiled/executed on Linux without too much tweaking (esp. without purchasing any license). Recommended choice: Python 3.
You can use the devtest data as many times as you need, but you should use the etest data for evaluation only once.
Ideally, organize the execution of the whole experiment into a Makefile that (after typing make all) downloads your training data, as well as the development and evaluation test sets from the links above, trains the model, applies it on the development data and evaluates the accuracy.
Write a short summary (1-2 paragraphs) of the experiment and store it into a README file (txt, md, or pdf).
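The word-mapping approach and the character-level accuracy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the tiny corpus, the function names, and the choice of the most frequent diacritized variant for each stripped word are all made up for the example, and out-of-vocabulary words are simply left unchanged instead of being handled by a letter-trigram language model.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove combining marks, e.g. 'kůň' -> 'kun'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def build_mapping(tokens):
    """Map each stripped word to its most frequent diacritized variant."""
    variants = defaultdict(Counter)
    for token in tokens:
        variants[strip_diacritics(token)][token] += 1
    return {plain: counts.most_common(1)[0][0] for plain, counts in variants.items()}


def restore(text, mapping):
    """Replace known words; whitespace and unknown words stay unchanged."""
    return re.sub(r"\S+", lambda m: mapping.get(m.group(0), m.group(0)), text)


def char_accuracy(predicted, gold):
    """Percentage of matching non-whitespace characters."""
    pred = [ch for ch in predicted if not ch.isspace()]
    ref = [ch for ch in gold if not ch.isspace()]
    if len(pred) != len(ref):
        raise ValueError("outputs differ in the number of non-whitespace characters")
    if not ref:
        return 100.0
    correct = sum(p == g for p, g in zip(pred, ref))
    return 100.0 * correct / len(ref)


# Toy training corpus (hypothetical; a real one needs at least 100k tokens).
corpus_tokens = "žluťoučký kůň pěl ďábelské ódy . kůň pěl".split()
mapping = build_mapping(corpus_tokens)

gold = "kůň pěl ódy v létě"
predicted = restore(strip_diacritics(gold), mapping)
accuracy = char_accuracy(predicted, gold)
print(predicted, accuracy)
```

In this toy run the word "létě" is not in the corpus, so it stays as "lete" and lowers the accuracy; this is the case that a letter-trigram model is meant to cover.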
Analysis of a Trained Model for Sentiment Classification
In this assignment, you will analyze the weights of a trained neural network. In the practical following Lecture 9, you trained several classifiers for sentiment analysis. Your goal in this assignment is to interpret the weights of one of the networks you trained in the practicals: Model 2, based on 1D convolution. If you did not manage to finish the model in the practical or you are unsure about your solution, you will receive a reference solution of the CNN-based model on April 24 via email from SIS (email the instructor if you do not receive it).
The first step in the convolution is multiplying the word embeddings by a weight matrix. The output of this multiplication can be considered a measure of how strongly the embeddings match the weight vectors in the convolution, the so-called filters. These filter responses are the values that you will work with.
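The multiplication described above can be sketched in PyTorch as follows. This is only an illustration under assumptions: the tensors are random stand-ins for the trained parameters, the sizes are arbitrary, the filter weights are assumed to have the usual `Conv1d` layout (number of filters, embedding dimension, kernel size) with kernel size 1, and the bias of the convolution is ignored.

```python
import torch

torch.manual_seed(0)

# Random stand-ins for the trained parameters (sizes are arbitrary assumptions).
vocab_size, emb_dim, num_filters = 50, 8, 4
embeddings = torch.randn(vocab_size, emb_dim)        # plays the role of the embedding matrix
conv_weight = torch.randn(num_filters, emb_dim, 1)   # Conv1d-style weight, kernel size 1

# Drop the kernel dimension: one weight vector (filter) per row.
filters = conv_weight[:, :, 0]                       # (num_filters, emb_dim)

# Dot product of every word embedding with every filter.
responses = embeddings @ filters.T                   # (vocab_size, num_filters)

# For each filter (column), take the 5 vocabulary indices with the highest response.
top_values, top_ids = torch.topk(responses, k=5, dim=0)  # both (5, num_filters)

for f in range(num_filters):
    # With a real tokenizer, these indices would be converted back to tokens.
    print(f"filter {f}: word ids {top_ids[:, f].tolist()}")
```

Taking `topk` over dimension 0 ranks the vocabulary separately for each filter; taking it over dimension 1 would instead rank the filters for each word.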
Using the input word embeddings (you will likely find them in model.embeddings.weight) and the convolutional filter weights (likely in model.conv[0][1].weight), find the words that lead to the highest filter responses. The response is computed as a dot product of the respective word embeddings and vectors from the weight matrices (you will have to transpose the weights correctly; then you can find the best-scoring words using the topk function, taking care to set the correct dimension). For simplicity, you can work only with kernels of size 1, but feel free to consider longer spans too. (The method tokenizer.convert_ids_to_tokens might be useful to convert the indices back to tokens.) [50% of the assignment]
Look at the results and qualitatively assess what words appear among the best-scoring ones. Write a few paragraphs of 100 to 400 words. [20% of the assignment]
Analyze which POS triggers the convolutional filters the most: compute a statistic of how often different POS tags appear among the best-scoring words. For each word, consider only its most frequent POS tag. (You can get the most frequent POS tags, e.g., from the English Web Treebank.) Speculate about the reasons for the statistics that you observe. Present your results in a table and write your thoughts and comments in at most 200 words. [30% of the assignment]
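The POS statistic from the last step can be sketched as follows. This is a minimal illustration, assuming the tagged data is already available as (word, tag) pairs; the toy pairs, the tag names, the function names, and the "UNK" label for words missing from the tagged data are all assumptions of this example, and reading an actual treebank is left out.

```python
from collections import Counter


def most_frequent_tags(tagged_tokens):
    """Map each lowercased word to the POS tag it carries most often."""
    counts = {}
    for word, tag in tagged_tokens:
        counts.setdefault(word.lower(), Counter())[tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def pos_distribution(top_words, word_to_tag, unknown="UNK"):
    """Relative frequency of POS tags among the best-scoring words."""
    tags = Counter(word_to_tag.get(word.lower(), unknown) for word in top_words)
    total = sum(tags.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in tags.most_common()}


# Toy tagged data and toy best-scoring words (hypothetical, for illustration only).
tagged = [
    ("good", "ADJ"), ("good", "ADJ"), ("good", "NOUN"),
    ("bad", "ADJ"), ("movie", "NOUN"),
    ("love", "VERB"), ("love", "NOUN"), ("love", "VERB"),
]
word_to_tag = most_frequent_tags(tagged)
distribution = pos_distribution(["good", "bad", "love", "movie", "xyzzy"], word_to_tag)

for tag, share in distribution.items():
    print(f"{tag}\t{share:.0%}")
```

The resulting tag shares can be copied directly into the table required for this part of the assignment.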