Assignment 5: Bloom filters
Date: |
13-03-2017 |
Deadline: | 22-03-2017 23:59 |
Objectives
You must implement a hash function API, an array API and a program
that uses a bloom filter to detect duplicates in its input.
Requirements
You must write a program dups which can process a very large number
of lines on its standard input (containing possibly billions of input
lines) and reproduces its input on its standard output without any duplicates.
The program may drop a small fraction of the input which are not duplicates.
Your solution must use a Bloom filter to detect duplicate items.
The new aspect in this assignment is that there is not one single good
answer: a part of your grade will be proportional to "how well" your
program is able to reproduce the original input without duplicates.
Program arguments
Your program must accept multiple positional arguments on the command line:
- k, the number of hash functions to use;
- n, the size of the bloom filter.
- ks, a set of k parameters to be passed to initialize the k different
hash functions. Each parameter can be a single value or a combination of
multiple arguments of different types, separated in a format of your
choosing.
Overall structure of the assignment
- Implement a first version of the array API defined in
bitvec.h. Your first version can be very simple, and use standard
C arrays. This would be a correct implementation.
- Implement the hash function API defined in hash.h, which requires you to
define a family of k hash functions (from 0 to k-1).
The quality of the Bloom filter you will use in step 3 is
dependent on how different the k functions are from each other.
Note: Hash functions and Bloom filters will be covered in lecture on
Thursday.
- Implement a Bloom filter based on your array and hash function API from
the previous two steps. You may build the Bloom filter directly into
main.c or create a separate file bloom.c with its own header API
bloom.h (+0.5pt for separate files). Be sure to add all
additional files to the Makefile under all the relevant targets, including
the bloom_submit.tar.gz target so the files will be included in your
tarball.
- Implement the dups program to read the input from standard input and
detect potential duplicates using your Bloom filter, printing the
non-duplicate entries to standard output.
- Find good values for the ks parameter of your dups program and store
these, separated by spaces, in the file called PARAMS. Please note that
you are free to determine the format (and even type) of these parameters.
This does require you edit the behaviour of the init_ks function to
match your format. You should document this, if you do so.
- Iterate on step 2 (hash functions) and 5 to see if you can decrease your
false positive rate (number of items your algorithm incorrectly decides are
duplicates).
- (Optionally) Improve your implementation of the array API, taking advantage
of your knowledge that there are only two possible values at
each position. Please note that you should try to assume as little as
possible about the machine on which the code is used. Hint:
https://en.wikipedia.org/wiki/Bit_array
Informatics only: While you are iterating in step 6, keep a journal of your
results and your intermediate implementations to present in your PAV report.
Grading
Your grade starts from 0, and the following tests determine your grade:
- +0.5pt if you have submitted an archive in the right format with an AUTHORS file.
- +0.5pt if your source code builds without errors and you have
modified hash.c, main.c or bitvec.c in any way.
- +1pt if your array API works properly.
- +2pt if your hash function API works properly, and the k functions return
a different value when applied to an empty string as input. Additionally, you
have provided at least 10 different initialization parameters in the
PARAMS file to test your functions.
- +0.5pt if your Bloom filter is written in a separate files bloom.c and
bloom.h.
- +2pt if your dups program works minimally:
- the program never outputs the same item more than once.
- given an input that contains n unique items, your program reproduces a
non-zero fraction of n output items, regardless of the value of n.
- +0-3pt depending on the false positive rate of your
implementation (more output items overall = better program).
- +1pt if your implementation of the array API is using less than one
byte per position overall.