# **NTIN066** Datové struktury I

<{Box(infobox)}>
|-----|-----|
| **Učitelé:** | [Mgr. Pavel Veselý, Ph.D.](https://iuuk.mff.cuni.cz/~vesely/) |
| | [RNDr. Jiří Fink, Ph.D.](https://ktiml.mff.cuni.cz/~fink/) |
| **Odkaz do SISu:** | [NMAI054](https://is.cuni.cz/studium/predmety/index.php?do=predmet&kod=NTIN066) |
| **Diskuze:** | [Discord kanál](https://discord.com/channels/625428723302137876/762425623062249482) |
<{/Box}>

## Skripta a záznamy
- [Medvědovy záznamy 2023](https://mj.ucw.cz/vyuka/2324/ds1/)
- [Průvodce labyrintem algoritmů(2. vydání)](http://pruvodce.ucw.cz/) - ultra mega dobrý skripta
- [Medvědovy poznámky k přednáškám](https://mj.ucw.cz/vyuka/dsnotes/)

## Materiály k předmětu
- ![Vzorový teoretický úkol](/NTIN066/vzorovy-teoreticky-ukol.pdf)

## Úkoly 2025

### 1. tree_successor
<{Details(Klikni sem pro zobrazení)}>
Given an implementation of a simple binary search tree including parent
pointers, implement a `successor` method. The methods is given a node
and it should return another node of the tree with the next higher key
(i.e., the smallest of keys which are greater than the given one).

- If there is no such node, it should return a null pointer.
- The given node can also be a null pointer, in which case the method should
  return the smallest node in the tree.

You can expect that `successor` method is never called on an empty tree.

You should submit the file `tree_successor.*` (but not the
`tree_successor_test.*`).

Source code templates can be found in [the git repository](https://gitlab.kam.mff.cuni.cz/datovky/assignments/-/tree/master).
<{/Details}>


### 2. splay_operation
<{Details(Klikni sem pro zobrazení)}>
Given an implementation of a binary search tree including parent pointers:
- implement `splay` method, preferably utilizing the provided `rotate` operation
  performing a single rotation;
- update `lookup`, `insert` and `remove` methods to utilize it correctly.
You can implement the `remove` operation in more ways. However, if it is not the one presented in the lecture, you should provide a reference to the proof that it has the right amortized efficiency (e.g., a reference to a standard textbook).

You should submit the `splay_operation.*` file (but not the
`splay_operation_test.*`).

Source code templates can be found in [the git repository](https://gitlab.kam.mff.cuni.cz/datovky/assignments/-/tree/master).
<{/Details}>


### 3. Splay experiment
<{Details(Klikni sem pro zobrazení)}>
The goal of this assignment is to evaluate your implementation of Splay trees
experimentally and to compare it with a "naive" implementation which splays
using single rotations only.

You are given a test program (`splay_experiment`) which calls your
implementation from the previous assignment to perform the following
experiments:

- _Sequential test:_ Insert _n_ elements sequentially and then repeatedly
  find them all in sequential order.
- _Random test:_ Insert _n_ elements in random order and then find _5n_
  random elements.
- _Subset test:_ Insert a sequence of _n_ elements, which contains arithmetic
  progressions interspersed with random elements. Then repeatedly access
  a small subset of these elements in random order. Try this with subsets of
  different cardinalities.

The program tries each experiment with different values of _n_. In each try,
it prints the average number of rotations per splay operation.

You should perform these experiments and write a report, which contains the following
plots of the measured data. Each plot should show the dependence of the average
number of rotations on the set size _n_.

- The sequential test: one curve for the standard implementation, one for the naive one.
- The random test: one curve for the standard implementation, one for the naive one.
- The subset test: three curves for the standard implementation with different sizes
  of the subset, three for the naive implementation with the same sizes.

The report should discuss the experimental results and try to explain the observed
behavior using theory from the lectures. (If you want, you can carry out further
experiments to gain better understanding of the data structure and include these
in the report. This is strictly optional.)

You should submit a PDF file with the report (and no source code).
You will get 1 temporary point upon submission if the file is syntantically correct;
proper points will be assigned later.
Please, add the time it took you to complete the assignment into the report as well.

#### Test program

The test program is given three arguments:
- The name of the test (`sequential`, `random`, `subset`).
- The random seed: you should use the last 2 digits of your student ID (you can find
  it in the Study Information System – just click on the Personal data icon). Please
  include the random seed in your report.
- The implementation to test (`std` or `naive`).

The output of the program contains one line per experiment, which consists of:
- For the sequential and random test: the set size and the average number of rotations.
- For the subset test: the subset size, the set size, and the average number of rotations
  per find. The initial insertions of the full set are not counted.

#### Your implementation

Please use your implementation from the previous exercise. Methods `splay()`
and `rotate()` will be augmented by the test program. If you are performing
a double rotation directly instead of composing it from single rotations, you
need to adjust the `BenchmarkingTree` class accordingly.

#### Hints

The following tools can be useful for producing nice plots:
- [pandas](https://pandas.pydata.org/)
- [matplotlib](https://matplotlib.org/)
- [gnuplot](http://www.gnuplot.info/)

A quick checklist for plots:
- Is there a caption explaining what is plotted?
- Are the axes clearly labelled? Do they have value ranges and units?
- Have you mentioned that this axis has logarithmic scale? (Logarithmic graphs
  are more fitting in some cases, but you should tell.)
- Is it clear which curve means what?
- Is it clear what are the measured points and what is an interpolated
  curve between them?
- Are there any overlaps? (E.g., the most interesting part of the curve
  hidden underneath a label?)

In your discussion, please distinguish the following kinds of claims.
It should be always clear which is which:
- Experimental results (i.e., the raw data you obtained from the experiments)
- Theoretical facts (i.e., claims we have proved mathematically)
- Your hypotheses (e.g., when you claim that the graph looks like something is true,
  but you are not able to prove rigorously that it always holds)

Source code templates can be found in [git](https://gitlab.kam.mff.cuni.cz/datovky/assignments/-/tree/master).
<{/Details}>

### 4. (a,b)-tree
<{Details(Klikni sem pro zobrazení)}>
You are given a representation of _(a, b)-tree_ with a `find` operation,
and a representation of an _(a, b)-tree node_.

Your goal is to implement an `insert` operation, which inserts the given
key in the tree (or does nothing if the key is already present). Preferably, 
you should also implement `split_node` method and use it properly in 
your `insert` implementation.

The implementation uses the variant of (a,b)-trees from lecture notes by [Martin Mares,
Chapter 3](http://mj.ucw.cz/vyuka/dsnotes/03-abtree.pdf) where the actual values are
stored also in the internal nodes of the tree and not only in leaves.

You should submit the `ab_tree.*` file (but not `ab_tree_test.*` files).

Source code templates can be found in [git](https://gitlab.kam.mff.cuni.cz/datovky/assignments/-/tree/master).
<{/Details}>

### 5. ab experiment
<{Details(Klikni sem pro zobrazení)}>
The goal of this assignment is to evaluate your implementation of (a,b)-trees
experimentally and compare performance of (2,3) and (2,4)-trees.

You are given a test program (`ab_experiment`) which is used to evaluate your
implementation of the previous assignment. The test program auguments your implementation
by implementing a `remove` method and it performs the following experiments:

- _Insert test:_ Insert _n_ elements in random order.
- _Min test:_ Insert _n_ elements sequentially and then _n_ times repeat: remove the minimal
  element in the tree and then insert it back.
- _Random test:_ Insert _n_ elements sequentially and then _n_ times repeat: remove random
  element from the tree and then insert random element into the tree. Removed element is
  always present in the tree and inserted element is always *not* present in the tree.


The program tries each experiment with different values of _n_. In each try,
it prints the average number of _structural changes_ per operation. Structural change is
either a node split (in insert) or merging of two nodes (in delete).

You should perform these experiments and write a report, which contains the following
plots of the measured data. Each plot should show the dependence of the average
number of structural changes on the set size _n_.

- The insert test: one curve for (2,3) tree, one for (2,4) tree.
- The min test: one curve for (2,3) tree, one for (2,4) tree.
- The random test: one curve for (2,3) tree, one for (2,4) tree.

The report should discuss the experimental results and try to explain the observed
behavior using theory from the lectures. (If you want, you can carry out further
experiments to gain better understanding of the data structure and include these
in the report. This is strictly optional.)

You should submit a PDF file with the report (and no source code).
You will get 1 temporary point upon submission if the file is syntantically correct;
proper points will be assigned later.
Please, add the time it took you to complete the assignment into the report as well.

#### Test program

The test program is given three arguments:
- The name of the test (`insert`, `min`, `random`).
- The random seed: you should use the last 2 digits of your student ID (you can find
  it in the Study Information System – just click on the Personal data icon). Please
  include the random seed in your report.
- The type of the tree to test (`2-3` or `2-4`).

The output of the program contains one line per experiment, which consists of _n_ and the
average number of structural changes.

#### Your implementation

Please use your implementation from the previous exercise. Methods `split_node(...)`
and `insert()` will be augmented by the test program. If you are performing
a node splits directly instead of using `split_node(...)` method, you
need to adjust the `BenchmarkingABTree` class accordingly.

#### Hints
The following tools can be useful for producing nice plots:
- [pandas](https://pandas.pydata.org/)
- [matplotlib](https://matplotlib.org/)
- [gnuplot](http://www.gnuplot.info/)

A quick checklist for plots:
- Is there a caption explaining what is plotted?
- Are the axes clearly labelled? Do they have value ranges and units?
- Have you mentioned that this axis has logarithmic scale? (Logarithmic graphs
  are more fitting in some cases, but you should tell.)
- Is it clear which curve means what?
- Is it clear what are the measured points and what is an interpolated
  curve between them?
- Are there any overlaps? (E.g., the most interesting part of the curve
  hidden underneath a label?)

In your discussion, please distinguish the following kinds of claims.
It should be always clear which is which:
- Experimental results (i.e., the raw data you obtained from the experiments)
- Theoretical facts (i.e., claims we have proved mathematically)
- Your hypotheses (e.g., when you claim that the graph looks like something is true,
  but you are not able to prove rigorously that it always holds)


Source code templates can be found in [git](https://gitlab.kam.mff.cuni.cz/datovky/assignments/-/tree/master).
<{/Details}>

### 6. Matrix transpose
<{Details(Klikni sem pro zobrazení)}>
Implement cache-oblivious transposition of square matrices.

The Python version requires the NumPy module for storing the matrix
in memory efficiently.

Source code templates can be found in [git](https://gitlab.kam.mff.cuni.cz/datovky/assignments/-/tree/master).
<{/Details}>

### 7. Matrix experiment
<{Details(Klikni sem pro zobrazení)}>
The goal of this assignment is to evaluate your implementation of cache-oblivious
matrix transposition experimentally and to compare it with the trivial algorithm
which transposes by definition.

You are given a test program (`matrix_experiment_sim`) which evaluates your
implementation from the previous assignment on different simulated caches and
matrices of different sizes. For each experiment, the average number of cache
misses per item is reported (the diagonal items which do not move are not
counted). The program also evaluates performance of the trivial transposition algorithm.
The simulated cache is fully associative and uses LRU replacement strategy.

You should run these experiments and write a report, which contains one plot of
the measured data for each cache type, showing dependency of the average number of
misses per item on the matrix size. There should be two curves in each plot: one for your
algorithm, another for the trivial one.

The report should discuss the experimental results and try to explain the observed
behavior (including any strange anomalies) using theory from the lectures.
If you want, you can carry out further experiments to gain better understanding
of the algorithm and include these in the report.

You should submit a PDF file with the report (and no source code).
You will get 1 temporary point upon submission if the file is syntactically correct;
proper points will be assigned later.
Please, add the time it took you to complete the assignment into the report as well.

#### Test program
The test program is given two arguments:
- Cache type:
    - `m1024-b16` – cache of 1024 items organized in 16-item blocks
    - `m8192-b64` – cache of 8192 items organized in 64-item blocks
    - `m65536-b256` – cache of 65536 items organized on 256-item blocks
    - `m65536-b4096` – cache of 65536 items organized in 4096-item blocks
- The implementation to test (`smart` or `naive`).

The output of the program contains one line per experiment, which consists of
the matrix size and the average number of cache misses per item.

*Warning:* The Python tests are slow, even though they use only a subset of the
matrix sizes. They can take about one hour to complete.
If your machine has multiple processors or cores, you can try `make -j`
to run the tests in parallel.

#### Optional: Tests on real hardware (for 5 extra points)
You can also test your transposition algorithm on real hardware
using the `matrix_experiment_real` program. The matrix is stored in row-major
order, each item takes 4 bytes.

The program takes one parameter, the implementation to test: `smart` or `naive`.
Its output contains one line per experiment, which consists of the matrix size
and the average time per item in nanoseconds.

However, the program is available only for C++, because general slowness of
Python completely hides all cache effects.

Again, your report should show a plot of the measured data and discuss the observed
effects. You should also include the configuration of caches in your machine.
(If you are using Linux, you can try the `machinfo` script from
[this repository](https://gitlab.kam.mff.cuni.cz/mj/aim.git).)

#### Hints
The following tools can be useful for producing nice plots:
- [pandas](https://pandas.pydata.org/)
- [matplotlib](https://matplotlib.org/)
- [gnuplot](http://www.gnuplot.info/)

A quick checklist for plots:
- Is there a caption explaining what is plotted?
- Are the axes clearly labelled? Do they have value ranges and units?
- Have you mentioned that this axis has logarithmic scale? (Logarithmic graphs
  are more fitting in some cases, but you should tell.)
- Is it clear which curve means what?
- Is it clear what are the measured points and what is an interpolated
  curve between them?
- Are there any overlaps? (E.g., the most interesting part of the curve
  hidden underneath a label?)
- **Is the graph distorted by compression artifacts?** (No, you shouldn't use JPEG for plots!)

In your discussion, please distinguish the following kinds of claims.
It should be always clear which is which:
- Experimental results (i.e., the raw data you obtained from the experiments)
- Theoretical facts (i.e., claims we have proved mathematically)
- Your hypotheses (e.g., when you claim that the graph looks like something is true,
  but you are not able to prove rigorously that it always holds)

Source code templates can be found in [git](https://gitlab.kam.mff.cuni.cz/datovky/assignments/-/tree/master).
<{/Details}>

### 8. Cuckoo hash
<{Details(Klikni sem pro zobrazení)}>
Implement Cuckoo hash table with simple tabulation hashing.

You are given a skeleton code which defines the table, implements
`lookup()`, and provides hash functions. You have to add an `insert()`
method.

If too many elements are moved during a single insert, the table must
be rehashed with new hash functions. See lecture notes for the particular
bounds.

The size of the table should stay constant
throughout the existence of the data structure.

Source code templates can be found in [git](https://gitlab.kam.mff.cuni.cz/datovky/assignments/-/tree/master).
<{/Details}>


### 9. Find duplcates
<{Details(Klikni sem pro zobrazení)}>
In this assignment, you are given a large file on input. Your goal is to find
duplicated lines and return every duplicated line once.

The challenging part of this assignment is the fact, that your program has to
run in a limited memory, using at most `64MB` for C++ and `12MB` for Python
(and Python itself requires about 5MB), and the input file can be considerably
larger than this memory limit. However, you can rely on the fact that the
number of duplicated lines is considerably smaller (so that all duplicated
lines fit in the memory at the same time).

Instead of handling a real file, you are given a data generator (an `iterator`
in C++ and a `generator` in Python). Note that limiting memory during the
tests works only on Linux (and not on Windows), and of course also in ReCodEx.

You can use full standard library of Python and C++ in this assignment,
including data structure implementations (also, `bytearray` might come handy).
Your solution must also work on other input data of the same size with similar
number of duplicates. Hence, solutions depending on the fact that each string is
uniquely determined by some its substring or similar properties of the input
will not be accepted.

As usual, you should submit only the `find_duplicates.{h,py}` file.

Note that due to the space constraints of the Python solutions, tests `10M` and `16M` are
not used and are always considered successful by ReCodEx.

Source code templates can be found in [git](https://gitlab.kam.mff.cuni.cz/datovky/assignments/-/tree/master).

Hints:
* Array [ False ] * 2**20 requires approximately 8 MB since Python stores it as an array of pointers to one value False. Use bytearray instead.
* Read carefully the documentation of bytearray and distinguish the terms bit and byte.
* In Python, do not import numpy or other libraries consuming more memory to load than available.
* The memory limit prevents storing all keys, so trying trivial solutions which store all keys in a dictionary is a waste of time.
* Count the number of duplicates and candidates for duplicates. For properly implemented hashing, those two numbers should be very close.
* Use profilers to trace memory usage; see e.g. https://docs.python.org/3/library/tracemalloc.html or https://valgrind.org/.

## Bonus: Explain parameter setting

For up to two bonus points, properly justify the choice of parameters in your data structure (e.g., by a comment in the source code or in ReCodEx).
One or two paragraphs should be sufficient, and you can, of course, use the theory from the lectures.
<{/Details}>

### 10. K-gram
<{Details(Klikni sem pro zobrazení)}>
Your task is to write a function which takes a string and an integer K
and it reports how many different K-grams (K-character substrings) the
string has.

You are given an algorithm for construction of the suffix array. For
simplicity, this algorithm has time complexity $O(n \log^2 n)$. Except
for constructing the suffix array, your algorithm should run in linear
time.

Source code templates can be found in [git](https://gitlab.kam.mff.cuni.cz/datovky/assignments/-/tree/master).
<{/Details}>