PAssignment 1: F0 and F2 estimation
Getting started
In this assignment you will implement some of the F0 and F2 estimators we covered in class. The assignment will be autograded on gradescope, and you will be able to see your grades after you submit. You are also allowed up to 62 submissions (that’s 1 per day from the time of release until June 11).
To make things more fun, we are also running a public scoreboard! The scoreboard will rank by the amount of memory you use in your solutions (you can be anonymous in this scoreboard). The top teams on the scoreboard will receive bonus credit at the end of the quarter. At this time we are still deciding on the criteria for “top” teams and the amount of bonus credit.
The starter template is here. To set up the coding environment, you’ll need to install python 2.7 along with packages mmh3, numpy, and sortedcontainers. With python installed, you can install these packages with the command (executed in bash for Linux and cmd for Windows):
python -m pip install mmh3 numpy sortedcontainers
After you install these packages, verify that you can run the stream.py python file. You should see the output (some of the numbers may be different):
F0 estimate: 1234
Total memory units used: 10002
F2 estimate: 90784
Total memory units used: 10003
Implementing the assignment
To complete the assignment, you will have to fill out the implementations of f0_estimator and f2_estimator in stream.py. No other files need to be modified. Also, please do not change any of the file names.
The estimators you implement can be any of the algorithms we’ve seen so far in class (or other low-memory ones you are aware of). The functions you’ll implement must take in a stream of integers (a normal python list) and an error parameter eps. The output is a floating point value for the estimated quantity. Please do not modify the input stream variable (such as sorting it in place). This is a violation of the streaming model.
Using cs368lib
Since we are working in the streaming model, your algorithms must be extremely memory efficient. To track the amount of memory your program is using, you must use the functions provided to you in cs368lib.py. cs368lib.py must not be modified. The cs368lib file provides to you the following functions/classes that you may find useful:
class tracked_sortedlist: A dynamic list that retains sorted order. Any operations on this list run in logarithmic time. Any python list operations (such as indexing, iteration, or slicing) will work on this list.class tracked_list: A normal dynamic list. Appending or popping from the end is constant time. Any python list operations (such as indexing, iteration, or slicing) will work on this list.tracked_intandtracked_double: Use these functions when you want to create an integer or a double variable.hashandhash_double: These functions take in an integerelemand an integerseed.hashwill map the integerelemto a random 64-bit integer.hash_doublewill map the integer to a random value in [0, 1]. To get different hash functions, use differentseedvalues.1
Any memory you use must be tracked using cs368lib. Failure to do so will result in penalization.
Submissions, Grading, and ⚝Scoreboard⚝
Scoreboard names
When you submit to gradescope, please only upload stream.py. Upon submission, you will get a prompt for filling out your scoreboard name. Feel free to choose any name you wish.
Time limits
The judge will run your estimators on a variety of input distributions under a some-what tight time limit. This means that the time complexity of your solution is important as well. Each test case will get 1-10 seconds of run time. If you timeout on a test case, no credit is given for that test case.
Memory limits
The grader also grades you based on the amount of memory you use. If your submission uses too much memory on a test case, no credit is given. The accepted level of memory is min(3n, 20 \log n/eps^2) where n is the length of the stream. You are encouraged to optimize your solution to use even less memory.
Accepted level of error
If your solution does not return something within a (1 ± eps) factor of the exact solution, no credit is given.
Lastly, please do not submit excessively. Each submission takes a few minutes to judge and we only have 1 machine. Instead, test things locally when possible.
Final words
- Start early! Please do not leave the programming assignments to the last minute.
- For testing your solution, it may help to write random input generators and test locally. To help you test, the default implementation of the estimators returns the exact solution.
- Always keep in mind the time complexity of your solution. It must run fast enough to process 10-million integers in less than 10 seconds.
- This is the first year we’re piloting a programming assignment. If you have suggestions for new problems or improvements to current problems, let me know!
-
These hash functions use MurmurHash underneath the hood, which is not technically a k-wise independent hash function. However it is fast, and will work in practice. ↩︎