Assignment 3: A Heap of Fun

Due: Thu Feb 6 11:59 pm
Late submissions accepted until Sun Feb 9 11:59 pm

Assignment by Julie Zelenski and Michael Chang, with modifications by Nick Troccoli

Learning Goals

This assignment delves into pointers, arrays, and dynamic memory. You will be building your skills with:

manipulating pointers & arrays
managing memory allocation on the stack and heap
using the C library I/O functions
using tools such as Valgrind to help track down memory errors and leaks

Overview

In this assignment, you will implement simplified versions of the unix commands uniq and tail. There are three parts to the assignment:

implementing read_line, a helper function that reads a single line from a file. You'll use a heap-allocated array to store it and resize the array to perfectly fit the line.
implementing the mytail program that prints out the last n lines of the provided input, using your read_line function
implementing the myuniq program that prints out just the unique lines and their frequencies from the provided input, using your read_line function and a heap-allocated array of structs.

A few reminders:

The working on assignments page contains info about the assignment process.
The collaboration policy page outlines permitted assignment collaboration, emphasizing that you are to do your own independent thinking, design, coding, and debugging. For this assignment, we are happy to look at your implementation of your read_line function and your myuniq program if necessary, but we won't look at any code associated with mytail. Later systems classes like CS111, CS112, CS143, CS144, and CS149 have a blanket no-looking-at-code-ever policy, and we have the same policy for your final assignment this quarter. We want you to understand what it takes to code something up all by yourself without relying on outside resources to find your bugs for you.

If you are having trouble completing the assignment on your own, please reach out to the course staff.

To get started on the assignment, clone the starter project using the command

git clone /afs/ir/class/cs107/repos/assign3/$USER assign3

The starter project contains the following:

util.c mytail.c, myuniq.c and Makefile: three files that you modify, and their Makefile for compiling
custom_tests: the file where you should add custom tests for your programs
mycat.c: a program that calls your read_line function in util.c for testing purposes.
samples: a symbolic link to the shared directory for this assignment. It contains:
- SANITY.ini, prototypes.h and sanity.py: files to configure and run Sanity Check. You can ignore these.
- myuniq_soln, mytail_soln and mycat_soln: executable solutions for the programs you will write.
- colors, colors_no_end_newline, dictionary and names: sample text files for testing your unix filters
tools: contains symbolic links to the sanitycheck and submit programs for testing and submitting your work.

Starter Code

As you work through each part of the assignment, make sure to take time to review and understand the provided code in each part. The provided code relies on reading files, which works similarly, but has different syntax, across different languages. Refer to our guide to the C Standard Library (or your favorite C reference) to get an overview of how to read files using C. The test program mycat.c is given to you fully implemented for testing read_line, and you do not need to modify or add comments to it. The starter files for myuniq.c and mytail.c contain code to handle the command-line arguments and file setup, and you should add comments to them and implement the rest of their functionality. As you work through the assignment, make sure to understand the given code, and refer back to these questions to confirm your understanding:

What are the possible options for the mode argument to fopen? What happens if you attempt to open a file that doesn't exist or you don't have permissions for?
What is the consequence of failing to fclose a file?
How does a program switch to reading standard input when there is no filename argument?
Are the operations that read from a file also used to read standard input or is different code required for each?
The default behavior of the mycat program corresponds to invoking which command-line flag on standard cat?
mytail supports a command-line argument of -number to control the number of lines printed. How does the given code process that optional argument?
What range of values is accepted as an argument for mytail -number?
Do you see anything unexpected or erroneous? We intend for our code to be bug-free; if you find otherwise, please let us know!

Testing

This assignment heavily emphasizes testing. For read_line and each of the 2 programs you write below, you should also add at least 3 to 5 additional tests of your own (total of at least 10) in the custom_tests file that show thoughtful effort to develop comprehensive test coverage. When you add a test, also document your work by including comments in the custom_tests file that explain why you included each test and how the tests relate to one another. You can find suggested testing strategies on the testing page, linked to from the Assignments dropdown. We strongly encourage you to write some tests before you write any code, and then add additional tests as you write code.

We also encourage you to make your own text files for testing! Similar to the provided test files in the samples/ folder (like colors, names, etc.), having custom file inputs can help you test more thoroughly. When you submit your assignment, you'll be asked whether you'd like to add any custom files to your submission, because by default only the starter files are submitted. At that time, you can specify any additional test files of your own that you created.

The programs that you will write for this assignment are part of a special category of commands called "filters", and there are some cool ways we recommend to test them. A filter is a program that reads input (typically line by line) from the user or from a file, transforms the input in some way, and produces some output. Other examples of commands like this are cat, head, wc, sort, etc. Most filters accept either a filename as an argument or, if no argument is specified, read input directly typed into the terminal (called "standard input") - you enter ctl-d to signal you are done entering input. Reading from standard input can be handy for running a quick test without having to first create a file.

It is also very handy to test these kinds of programs by using them as part of a "compound" unix command. You can write a compound Unix command where the output of one command is "piped" to be the input of another command, like this, using the pipe | character:

sort samples/names | ./samples/mytail_soln

This sorts the samples/names file but instead of outputting the result to the terminal, it feeds it ("pipes it") as input to the mytail_soln program, which will print out just the last 10 lines of the input it receives.

This "compound" unix Command is called a pipeline.You can have pipelines with as many commands as you'd like, each piping their output to be the next one's input. You can read more about piping input in our Unix guide. This demonstrates the Unix philosophy, which is to move away from the construction of monolithic do-everything programs and instead encourage a proliferation of smaller commands, each focused and streamlined to perform one task well. Those small, discrete tools are combined to accomplish larger tasks.

Pipelines allow you to easily come up with more tests for the programs you'll write on this assignment, since they let you easily feed the output of another program as input to your program. Another pipeline example is grep include util.c | wc -l, which counts the number of lines in util.c that contain "include". It does this by using grep to find lines in util.c containing "include", and then feeds that output to wc, which outputs the number of lines in its received input.

If you would like to write pipelines in custom tests, if they start with a command other than one of the assignment programs (e.g. sort file.txt | mytail), you need to prefix the assignment program name with a $: sort file.txt | $mytail.

Important note: unfortunately GDB and Valgrind don't work well with pipeline commands. Instead,you can make a file with the input you want to test and then gdb/valgrind your program running on that file. If you want to save the output of a command to a file, you can use redirection >: check out our Unix guide for more information. For example, if you wish to debug the following pipeline:

sort samples/names | ./mytail

Since this is feeding the output of sort samples/names into your mytail program, you could instead save the output of sort samples/names into a file and debug with that file. You could save the output like this, and then run with that file instead:

myth$ sort samples/names > testfile.txt
myth$ valgrind ./mytail testfile.txt

Memory Usage and Efficiency

Memory

This assignment provides a lot of targeted practice in managing memory. You'll be using both the stack and heap for various purposes and need to take care to appropriately allocate and properly deallocate each. Here are some general memory guidelines:

Stack allocation is as simple as declaring a variable. The stack is convenient (auto-allocation/deallocation), type-safe, and very efficient.
Dynamic allocation is via malloc/realloc/free. The heap gives the programmer control over lifetime (explicit allocation/deallocation) and flexibility (storage can be re-sized/re-purposed), but comes at a cost in safety and efficiency.
Take care to understand proper use of stack versus heap allocation and when each is appropriate. As a general rule of thumb, unless a situation requires dynamic allocation, stack allocation is preferred. Often both techniques are used together in a program. Heap allocation is a necessity when:
- you have a very large allocation that could blow out the stack
- you need to control the memory lifetime, or memory must persist outside of a function call
- you need to resize memory after its initial allocation

It is essential that you are using Valgrind throughout. Check out our Valgrind guide, and/or revisit the Valgrind exercises in labs 2 and 3 to be sure you know how to put Valgrind to work for you. In particular, let Valgrind help you with:

Memory errors. This is by far the most important reason to have Valgrind on the job. If Valgrind is reporting errors, they need your immediate attention.
Memory leaks. If Valgrind reports leaks, you have some heap allocations without their matching deallocation. Given that leaks generally do not cause bugs, and incorrect deallocation can, we recommend that you do not worry about freeing memory until your program is finished. Then, you can go back and deallocate your memory as appropriate, ensuring correctness at each step.

Efficiency

We will also be looking at the memory and runtime efficiency of your programs. For memory efficiency, Valgrind can be a huge help, as the Valgrind report includes stats on your total heap usage. Look for a line at the end of a Valgrind report in this format:

total heap usage: 10 allocs, 10 frees, 888 bytes allocated

This is the aggregate total of all heap allocations during the program's lifetime. The count of allocs should always equal the count of frees. The count of allocations and total size allocated is a measure of your program's memory "footprint".

Runtime efficiency (how fast your program runs) is also important. Prefix a command with time to execute and report on the time used to complete it. For example, time ./mytail will measure your program's runtime, and time samples/mytail_soln will measure the sample solution's runtime. The time labeled "user" is the one of interest.

    % time ./mytail -1 samples/dictionary
    zygote

    real    0m0.010s
    user    0m0.006s
    sys     0m0.000s

    % time samples/mytail_soln -1 samples/dictionary
    zygote

    real    0m0.009s
    user    0m0.007s
    sys     0m0.000s

The best strategy for ensuring appropriate memory and runtime efficiency is to measure the sample executable on a given input and compare to the measurement of your program running on the same input. If your performance is in the same ballpark as ours (say within a factor of 2), you're golden, but you should aim for closer to validate your usage of memory and time is correct. When considering tradeoffs of space versus time, the sample executable shows you the balance we are looking for. If your program uses noticeably more memory or time, it indicates you have some unnecessary or redundant work that should be eliminated. Be sure to note that very small inputs are, well, too small for meaningful measurement; instead also measure on heftier input to see the program operating at scale.

1. `read_line`

Your first task is to implement the function read_line in util.c, with the following signature:

char *read_line(FILE *file_pointer);

The read_line function takes in one parameter, which is a file to read from, and returns the next line read from the file as a heap-allocated C string, or NULL if there's nothing more to read from the file. It's the client's responsibility to free the allocation when they're done using it. read_line is a more convenient version of the fgets library function, which also reads in the next line from a file (check out man fgets); but while fgets requires the caller pass in a fixed-size buffer to write to and truncates the line if it doesn't fit, read_line will heap-allocate space for the line and resize it as needed so that it can fit the entire line.

To implement read_line, you'll call fgets in a loop to read in the line in chunks and build up a full heap-allocated line that you then return. In this way, it's a "wrapper function" around fgets that makes it easier to use. You should not need to use other file I/O functions such as feof, fgetc, etc.

A FILE * is a type that represents a file in C, but it's not like other pointers in that you can't dereference it; instead, you pass it to other library functions such as fgets which can do things with that file like read from it. A FILE * refers to a specific place in the file, and as you read from it using e.g. fgets, it updates its state to advance through the file so the next time you call e.g. fgets, it will pick up where it left off.

The read_line function should be implemented with the following structure - make sure to call assert after every heap allocation!:

Make a 32 byte heap-allocated string (our "buffer") on the heap
Read as much of the next line as you can into the buffer using fgets
If you haven't reached the end of the line, realloc the buffer to double its current size, and go back to step 2. A line is considered to end after the first newline (\n) character or once you've reached the end of the file, whichever comes first.
Return the line, but omitting the newline character if it ends with one (if the entire line is just \n, then the returned string would be the empty string)

Note that if you immediately encounter the end of the file when calling fgets for the first time, meaning there's nothing to read, you should return NULL.

Make sure to read the manual page for fgets to learn more about how to use the function and to understand its quirks. For instance, fgets tells you that you've reached the end of the file by returning NULL if you try to call it when there are no more characters to be read. If there are more characters to be read, even if you read up to the end of the file, fgets won't return NULL until the next time you try to call it and there's nothing to be read. Here's an example:

FILE *file = ... // pretend file contains "hi"
char buf[1000];

// ptr isn't NULL even though we read up to the end
char *ptr = fgets(buf, 1000, file);
printf("%s\n", buf);  // prints "hi"

// now ptr is NULL, since there's nothing left to read
ptr = fgets(buf, 1000, file);

You may assume the following while implementing read_line:

fgets will never encounter an error
the specified file will be a text file and its contents can be read in as C strings
the caller will free the memory returned by read_line when they're done

Testing

This function can be tested in isolation with the provided mycat.c program, which you do not need to modify or comment, but whose code you should read over to understand how it calls your read_line function. You can also write sanitycheck custom tests with mycat. mycat is similar to the built-in cat Unix command; if you run it with a filename as an argument, it will print the contents of that file, one line at a time. If you run it with no arguments, it will expect input in the terminal, and once you enter ctl-d (to signal end of input), it will print the contents you entered in the same way, one line at a time.

Study the provided mycat.c code, along with our guide to the C Standard Library (or your favorite C reference) to get an overview of how file-reading works in C. You should review the given code to understand how it tests your read_line function. Refer to our questions in the starter code section to confirm your understanding.

If read_line is properly implemented, mycat should have no memory leaks or errors. Make sure to test thoroughly to ensure that in all cases, read_line doesn't leak any memory! If you do encounter leaks, see if the number of leaks is proportional to the length of the input. If it is, it means the leak occurs in many calls to read_line. If it isn't, it means the leak occurs just in a fixed place, such as the first line or the last line. Shrink your test case to be as small as possible to help narrow in on the leak!

The ./mycat Makefile sanity check test is a great test to ensure that you correctly output whitespace and newlines and correctly handle reaching the end of the file. Unlike other tests that are lenient with whitespace, this test requires an exact match. If this test is failing, it may indicate problems with detecting / handling the end-of-file case, or returning an empty string instead of NULL or vice versa.

Before moving on: make sure you have thoroughly tested your read_line function, covering various different cases of possible inputs, and that you have written your custom tests. You will use this function later in the assignment, so it's vital to ensure it's thoroughly tested before moving on!

2. `mytail`

Your next task is to use your read_line function to implement the mytail.c program, which is a simplified version of the Unix tail command. It takes in the name of a file and a number N > 0 and prints the last N lines of the file in order. If no N is specified, it defaults to N=10. Try out the provided sample solution, e.g. ./samples/mytail_soln util.c or ./samples/mytail_soln -4 Makefile.

myth$ ./samples/mytail_soln -4 Makefile
clean::
    rm -f $(PROGRAMS) *.o

.PHONY: clean all

If the input contains fewer than N total lines, all lines from the input are printed:

myth$ ./samples/mytail_soln -50 samples/colors
red
green
green
yellow
blue
blue
blue
blue

If you don't specify a filename, it will expect input in the terminal, and once you enter ctl-d (to signal end of input), it will print the contents you entered in the same way - printing the last N lines.

Review Starter Code

mytail.c comes with provided code to read user input, open the specified file, and call the print_last_n function that you must implement. Study the provided program code, along with our guide to the C Standard Library (or your favorite C reference) to get an overview of how file-reading works in C. You should read and understand the given code, work out how you will add to it, supply program and per-function comments to document your approach, and add inline comments where necessary to highlight any particularly tricky details. Refer to our questions in the starter code section to confirm your understanding of the code.

Note that we aren't looking at code for this particular program, because we want you to gain some experience triaging your own bugs early on in the quarter.

Implementation

Since we know in advance how many lines we ultimately need to store, the program should be implemented by creating a stack array of N strings, and reading lines into that array using your read_line function. This array will represent a "circular queue" that always stores at most the N most recent lines read in. When reading in a line, if there is space available in the array, insert the line in the next available space. Otherwise, replace the oldest line in the array with this new line. Continue with this process until you have read in all lines - the resulting array will contain the last <= N lines from the file (but not necessarily in order!). For example, let's say N = 3, and we have a file as follows:

line1
line2
line3
line4
line5

After we read in 3 lines, our array should look like [line1, line2, line3]. When we read in the 4th line, it would then look like [line4, line2, line3]. And when we read in the 5th line, we would then have [line4, line5, line3]. From there, you will need to print out the lines in the correct order as they appeared in the file: line3, line4, then line5. Try to think about how you can deduce the correct ordering for printing the lines based on what you know about how the array is constructed. You must use this circular queue approach in your implementation, and should not use a different approach, such as sliding all the elements down each time you insert a new element.

As a client of the read_line function, make sure you free memory appropriately to avoid any memory leaks.

Testing

This program can be tested with sanity check. We encourage you to use the provided test files in the samples/ folder and also to create your own test text files to use for testing and debugging.

3. `myuniq`

Your final task is to use your read_line function to implement the myuniq.c program, which is a simplified version of the Unix uniq command. It takes in the name of a file and prints the unique lines in the file along with how many times they appear. Only one copy of each unique line is printed, and the unique lines are printed in the same order they appeared in the input. Try out the provided sample solution, e.g. ./samples/myuniq_soln samples/colors:

myth$ ./samples/myuniq_soln samples/colors
      1 red
      2 green
      1 yellow
      4 blue

Review Starter Code

myuniq.c comes with provided code to read user input, open the specified file, and call the print_uniq_lines function that you must implement. Study the provided program code, along with our guide to the C Standard Library (or your favorite C reference) to get an overview of how file-reading works in C. You should read and understand the given code, work out how you will add to it, supply program and per-function comments to document your approach, and add any inline comments where necessary to highlight any particularly tricky details. Refer to our questions in the starter code section to confirm your understanding of the code.

Implementation

The program needs to keep track of all the unique lines it has seen and how many times it has seen them - to do this, you should use an array of structs. A struct allows us to store the lines and their frequencies in one array instead of two - it's up to you to specifically define the struct and what it contains. Because we do not know in advance how many unique lines are in the file and thus how long our array should be, you should use the "resize-as-you-go" approach, similar to your read_line function. Specifically, you should allocate the array on the heap with space initially for ESTIMATE entries (a provided constant), and if/when that fills up, resize the array to add another ESTIMATE entries. Repeat this process as needed, adding an additional ESTIMATE entries each time.

When you read in a line, you will need to manually linearly search through your array to find whether or not you have already seen it.

When printing information about each line, to match the output format of the sample solution when printing out the line count number, you should use the printf token %7d instead of just %d. This will print the number with width 7, meaning the output will look something like this:

      1 red
   5000 blue
  12413 green
1253473 yellow

As a client of the read_line function, make sure you free memory appropriately to avoid any memory leaks.

Testing

This program can be tested with sanity check. We encourage you to use the provided test files in the samples/ folder and also to create your own test text files to use for testing and debugging.

Submitting

Once you are finished working and have saved all your changes, check out the guide to working on assignments for how to submit your work. When you submit, you may optionally indicate that you do not plan to make a submission after the on-time deadline. This allows the staff to start grading some submissions as soon as the on-time deadline passes, rather than waiting until after the late period to start grading. You will also have an opportunity to add any custom files to your submission that you created (e.g. for testing), because by default only the starter files are submitted.

Grading

Here is the tentative point breakdown:

Functionality (95 points)

Sanity cases (25 points) Correct results on the default sanity check tests.
Comprehensive/stress cases (40 points) Correct results for additional test cases with broad, comprehensive coverage and larger, more complex inputs.
Clean compile (2 points) Compiles cleanly with no warnings.
Clean runs under valgrind (15 points) Clean memory reports for all programs when run under valgrind. Memory errors (invalid read/write, use of freed memory, etc) are significant deductions. Memory leaks are a minor deduction. Every normal execution path is expected to run cleanly with no memory errors nor leaks reported. We will not test exceptional/error cases under Valgrind.
Reasonable efficiency (8 points) We expect programs to be reasonably efficient in use of time and space. Full points are awarded for being on par (2-3x) with the sample program; deductions will apply for immoderate use of memory/time. There is no bonus for outperforming the benchmark (and extreme efforts to do so could detract from your code quality).
custom_tests (5 points) Your custom_tests file should include at least 10 tests of your own, 3-5 per program, that show thoughtful effort to develop comprehensive testing coverage. Please include comments that explain your choices. We will run your custom tests against your submission as well as review the cases to assess the strength of your testing efforts.

Code review (buckets together weighted to contribute ~20 points)

The grader's code review is scored into a bucket per assignment part to emphasize the qualitative features of the review over the quantitative. The styleguide is a great overall resource for good program style. Here are some highlights for this assignment:

Use of pointers and memory. We expect you to show proficiency in handling pointers/memory, as demonstrated by appropriate use of stack versus heap allocation, no unnecessary levels of indirection, correct use of pointee types and typecasts, const correctness, and so on.
Program design. We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. When you need the same code in more than one place, you should unify, not copy and paste. If the C library provides functionality needed for a task, you should leverage these library functions rather than re-implement that functionality.
Style and readability. We expect your code to be clean and readable. We will look for descriptive names, defined constants (not magic numbers!), and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
Documentation. You should document both the code you wrote and what we provided. We expect reasonably complete program overview and per-function comments that explain the overall design along with occasional use of inline comments to draw attention to noteworthy details or shed light on a dense or obscure passage. The audience for the comments is your C-savvy peer.

Post-Assignment Check-in

How did the assignment go for you? We encourage you to take a moment to reflect on how far you've come and what new knowledge and skills you have to take forward. Once you finish this assignment, you will have written your own implementation of two Unix filter commands and an improved version of a C I/O function, using pointers, arrays and dynamic memory. That's a pretty darn impressive accomplishment!

To help you gauge your progress, for each assignment/lab, we identify some of its takeaways and offer a few thought questions you can use as a self-check on your post-task understanding. If you find the responses don't come easily, it may be a sign a little extra review is warranted. These questions are not to be handed in or graded. You're encouraged to freely discuss these with your peers and course staff to solidify any gaps in you understanding before moving on from a task. They could also be useful as review before the exams.

The scan_token function you wrote for assignment 2 required that the client supply the memory to write the scanned token. The read_line function of assignment 3 instead allocates memory on the client's behalf. What are the advantages and disadvantages between the two approaches?
It is stated that a correct program should have a one-to-one correspondence between malloc calls and free calls. How does adding a realloc call change the needed number of malloc and free calls? Briefly explain your reasoning.
Which is the bigger problem: memory errors or memory leaks? Why? How can you tell which is which in a Valgrind report?
How can you gauge whether your program has acceptable performance in terms of runtime or memory efficiency?
When it is appropriate to use calloc instead of malloc?

Learning Goals

Overview

Starter Code

Testing

Memory Usage and Efficiency

Memory

Efficiency

1. read_line

Testing

2. mytail

Review Starter Code

Implementation

Testing

3. myuniq

Review Starter Code

Implementation

Testing

Submitting

Grading

Post-Assignment Check-in

1. `read_line`

2. `mytail`

3. `myuniq`