============================= Assignment 5: Bloom filters ============================= :Date: 13-03-2017 :Deadline: 22-03-2017 23:59 Objectives ========== You must implement a hash function API, an array API and a program that uses a bloom filter to detect duplicates in its input. Requirements ============ You must write a program `dups` which can process a very large number of lines on its standard input (containing possibly billions of input lines) and reproduces its input on its standard output without any duplicates. The program may drop a small fraction of the input which are not duplicates. Your solution must use a `Bloom filter`__ to detect duplicate items. .. __: https://en.wikipedia.org/wiki/Bloom_filter The new aspect in this assignment is that there is not one single good answer: a part of your grade will be proportional to "how well" your program is able to reproduce the original input without duplicates. Program arguments ================= Your program must accept multiple positional arguments on the command line: - *k*, the number of hash functions to use; - *n*, the size of the bloom filter. - *ks*, a set of `k` parameters to be passed to initialize the `k` different hash functions. Each parameter can be a single value or a combination of multiple arguments of different types, separated in a format of your choosing. Input / output definition ========================= The input items will be provided as zero or more separate lines of text on the standard input. The input can be arbitrarily large: you cannot assume a maximum number of lines in the input. Each line of input will contain only printable characters, in particular there will be no nul character in the input, and will contain at most 1024 characters. .. Should we give a limit on line length? Or just give them the my_get_line() function from the huffman assignment? Your program must print out the *unique* input items as-is on its standard output, as soon as possible after they are read. A unique input item is an item *that has a low probability to have been read before* (since the program started). For example in the following input:: hello world hello jane There are 3 unique input items, and the corresponding expected output is:: hello world jane Because when "hello" is encountered the 2nd time, it was seen before with a high probability (in this case, the probability is 1), so it is skipped. When there are no more input lines available to read (if that ever happens), Your program must then: 1. print a last line containing the number of input items read so far and, separated by a single space, the number of output lines produced so far. This line should be printed to `stderr`; 2. terminate with exit code `EXIT_SUCCESS`. For example with the input above the complete output would be:: hello world jane 4 3 Overall structure of the assignment =================================== 1. Implement a first version of the array API defined in ``bitvec.h``. Your first version can be very simple, and use standard C arrays. This would be a correct implementation. 2. Implement the hash function API defined in ``hash.h``, which requires you to define a *family* of k hash functions (from 0 to k-1). The quality of the Bloom filter you will use in step 3 is dependent on how different the k functions are from each other. *Note: Hash functions and Bloom filters will be covered in lecture on Thursday.* .. Do we need hash_init() (referred to in hash()) or will hash_alloc initialise the hash functions? 3. Implement a Bloom filter based on your array and hash function API from the previous two steps. You may build the Bloom filter directly into ``main.c`` or create a separate file ``bloom.c`` with its own header API ``bloom.h`` (+0.5pt for separate files). *Be sure to add all additional files to the Makefile under all the relevant targets, including the bloom_submit.tar.gz target so the files will be included in your tarball.* 4. Implement the `dups` program to read the input from standard input and detect potential duplicates using your Bloom filter, printing the non-duplicate entries to standard output. 5. Find good values for the `ks` parameter of your `dups` program and store these, separated by spaces, in the file called ``PARAMS``. Please note that you are free to determine the format (and even type) of these parameters. This does require you edit the behaviour of the ``init_ks`` function to match your format. You should document this, if you do so. 6. Iterate on step 2 (hash functions) and 5 to see if you can decrease your false positive rate (number of items your algorithm incorrectly decides are duplicates). 7. (Optionally) Improve your implementation of the array API, taking advantage of your knowledge that there are only two possible values at each position. Please note that you should try to assume as little as possible about the machine on which the code is used. Hint: https://en.wikipedia.org/wiki/Bit_array **Informatics only:** While you are iterating in step 6, keep a journal of your results and your intermediate implementations to present in your PAV report. Grading ======= Your grade starts from 0, and the following tests determine your grade: - +0.5pt if you have submitted an archive in the right format with an ``AUTHORS`` file. - +0.5pt if your source code builds without errors and you have modified ``hash.c``, ``main.c`` or ``bitvec.c`` in any way. - +1pt if your array API works properly. - +2pt if your hash function API works properly, and the k functions return a different value when applied to an empty string as input. Additionally, you have provided at least 10 different initialization parameters in the ``PARAMS`` file to test your functions. - +0.5pt if your Bloom filter is written in a separate files ``bloom.c`` and ``bloom.h``. - +2pt if your `dups` program works minimally: - the program never outputs the same item more than once. - given an input that contains *n* unique items, your program reproduces a non-zero fraction of *n* output items, regardless of the value of *n*. - +0-3pt depending on the false positive rate of your implementation (more output items overall = better program). - +1pt if your implementation of the array API is using less than one byte per position overall.