Assignment 3: A Heap of Trouble

Due: Mon Apr 30 11:59 pm
Ontime bonus 5%. Grace period for late submissions until Wed May 2 11:59 pm

Assignment by Julie Zelenski and Michael Chang

Learning goals

This assignment should sharpen your skills in

  • wrangling pointers & arrays
  • managing memory allocation, both stack and heap
  • using the C library I/O functions
  • debugging memory errors (because we are sure you will encounter at least one or two :-)

You will be writing simplified version of the unix filters uniq and tail.

Get started

Check out the starter project from your cs107 repo using the command

git clone /afs/ir/class/cs107/repos/assign3/$USER assign3

The starter project contains various C files and supporting Makefile and custom_tests files. In the samples subdirectory, you'll find our sample solution programs.

Background: Unix filters

In this assignment, you will implement several filters. A filter is a program that reads input (typically line by line), transforms the input in some way, and produces some output. Some example unix filters are cat, sort, and grep.

cat is the simplest of all filters; the output just repeats the input. Invoking cat filename will print the contents of the named file. The man page for cat indicates that if no filename argument is given, the command will read from standard input. This means the program will read lines typed on the terminal by the user. Almost every unix filter (e.g. grep wc less head sort and more) conforms to this same interface, either reading from a named file or standard input. Reading from standard input can be handy for running a quick test without the overhead of creating a file.

Try this out for yourself:

  • Invoke cat Makefile to view the contents of the file.
  • Invoke with the -b flag to number the non-blank lines: cat -b Makefile.
  • Now invoke cat -b without a filename argument. The program will wait for you to type input.
  • Type a line of text and hit enter. It echoes the line with its line number. Type a few more lines and see they are also echoed back and numbered.
  • When you have finished typing your input, press control-d on a line by itself. This signals the end of input (EOF). The filter program exits.

Below are a few other unix filters to explore. For each, skim the man page to learn about its options. Run the filter on an existing file and then again without a filename to operate on standard input. Remember that control-d is used to signal EOF.

  • head outputs the first lines of the input
  • tail outputs the last lines of the input
  • wc outputs the count of lines, words, and characters in the input
  • grep outputs lines from the input that match the search pattern
  • uniq filters out adjacent duplicate lines from the input
  • sort outputs the input rearranged into sorted order

When running on standard input, you will notice that some filters immediately respond to each line you enter, e.g grep echoes the line if it matches the pattern. Others (e.g. sort) don't output anything until you signal EOF. The depends on whether the filter operation needs to see all of the input before it can determine what to output.

The true power of filters is that they can be combined into pipelines. A pipeline is a sequence of commands where the output of one command is passed to (or "piped into") the input of the next command. The pipe character | feeds the output of one command as input to the next. Filters are especially useful in pipelines because they can be used at the beginning, in the middle, or at the end.

Pipelines allow you to build fancier operations out of the basic filters. Some examples:

  • grep include util.c | wc -l extracts the matching lines and feeds them to wc. This reports the count of matches.
  • cat -b util.c | tail outputs the last 10 lines of the file, labeled with the original line numbers.
  • The standard uniq command only filters out duplicate lines if they are adjacent in the input. If instead your data has duplicate lines scattered throughout, you need to first rearrange so duplicates are adjacent for uniq to detect them. Use a pipeline to sort first. This three-stage pipeline will report the count of distinct lines: sort samples/names | uniq | wc -l.

Now it's your turn! How could you write a pipeline that prints out the three most common names in samples/names? (Hint: review the manpage for uniq to see what flags it has, and keep in mind that the same command can appear multiple times in a pipeline.)

The unix philosophy is to move away from the construction of monolithic do-everything programs and instead encourage a proliferation of smaller commands, each focused and streamlined to perform one task well. Those small, discrete tools are combined to accomplish larger tasks.

Review and comment starter code

The filter programs you will write for this assignment use simple file-reading. Dealing with input/output is one of the more mundane programming tasks. Each programming language has an I/O facility, they are all different and quirky, with many minute details that would be overkill to learn in advance. Peruse our guide to the C standard library to get an overview of reading files in C; go back for detailed specifics on a strictly need-to-know basis.

The test program mycat.c is given to you fully implemented. The starter files for myuniq.c and mytail.c contain code to handle the command-line arguments and file setup, you will implement the rest. You should read and understand the given code, work out how to change/extend it to suit your needs, and finally add comments to document your strategy.

Some issues to consider when reviewing the given code:

  • What are the possible options for the mode argument to fopen? What happens if you attempt to open a file that doesn't exist or you don't have permissions for?
  • What is the consequence of failing to fclose a file?
  • How does a program switch to reading standard input when there is no filename argument?
  • Are the operations that read from a file also used to read standard input or is different code required for each?
  • The default behavior of the mycat program corresponds to invoking which command-line flag on standard cat?
  • mytail supports a command-line argument of -number to control the number of lines printed. How does the given code process that optional argument?
  • What range of values is accepted as an argument for mytail -number?
  • Do you see anything unexpected or erroneous? We intend for our code to be bug-free; if you find otherwise, please let us know!

As per usual, the code we provide has been stripped of its comments and it will be your job to provide the missing documentation.

1. Write read_line

The standard library function fgets is used to read text from a file.

char *fgets(char *buf, int bufsize, FILE *stream);

Like most library functions, fgets does not do any allocation, instead it depends on the client to providing the buffer to write into. The client must pre-allocate memory based on an expectation of what size will be "large enough" to store the line. If the client's choice of size is too small, a line-reading function might overflow the buffer (gets) or truncate the read operation (fgets).

A more client-friendly design would be for the read function to handle allocation internally and make of appropriate size. You are going to implement a read_line function that operates in this manner:

char *read_line(FILE *fp);

The read_line function reads the next line from a file. The return value is a dynamically-allocated and null-terminated string (or NULL if at EOF). This function performs like a snazzier version of fgets (in fact, internally it will be implemented using fgets) in that it reads the next line but this version also allocates the storage, making life easy for the client.

Specific details of the function's operation:

  • The function reads the next line from the file. A line is considered to end after the first newline character or EOF, whichever comes first.
  • The returned string includes the characters read up to the newline character, but does not include the newline character itself. If the next line consists of just a newline character, an empty string is returned. If there are no more characters remaining to be read (i.e. at EOF), the function returns NULL.
  • To allocate memory for the returned string, the function makes an initial allocation and will successively enlarge it if needed. To be more specific, it first mallocs a buffer of minimum size (32) and reads the first part of the line into the buffer, using a single call to fgets. If there is more to read, it reallocs the buffer to double its current size (64), and makes another call to fgets to read the rest of the line. It continues like this, doubling the allocation and reading more, until finally reaching the newline or EOF that signals the end of this line.
  • If unable to allocate sufficient memory, read_line should raise a fatal assert. As a habit, you should assert the result from every allocation request. Allocation failures are rare but deadly.
  • The return value of read_line is the address of a dynamically-allocated piece of memory. The client is responsible for deallocating this memory with free when no longer needed.

Write your implementation of read_line the util.c file. You can test it using our provided mycat.c program. The mycat program is integrated with sanitycheck. You will later use the read_line function writing the mytail and myuniq programs.

2. Implement mytail

What does the tail command do?

tail is a unix filter that prints the final N lines of the input. The value of N defaults to 10, but can also be set as a command-line flag, e.g. tail -4. Consider the following sample use of tail:

myth> cat samples/colors
red
green
green
yellow
blue
blue
blue
blue
myth> tail -3 samples/colors 
blue
blue
blue

If the input contains fewer than N total lines, all lines from the input are printed.

How does mytail operate?

The mytail program you are to write is similar in operation to the standard tail with these differences:

  • mytail supports only one command-line flag, -N where N is the number of lines to output (standard tail has a number of other flags)
  • mytail only reads one file either the named file (if specified) or standard input (if not) (standard tail can read from any number of file arguments)

While mytail is processing its input, it will need to keep track of N lines. For this you are to use an array of N entries. At any given point, this array holds the most recent N lines read. The array is a "sliding window" of N lines that is moving its way through the input. When mytail reaches the end of the input, the window will contain the final N lines.

When mytail starts processing the input, it will read the first N lines into the array. Upon reading the N+1st line, it should not enlarge the array, but instead should discard the 1st line of the input and use that array slot for the N+1st. The N+2nd overwrites the second line and so on. In this way, an array is used as a circular queue.

Since the array size is known in advance and doesn't change over its lifetime, this array is good opportunity to use stack allocation and keep things simple. There is some logic to work out in the circular queue, but most of the challenge in this program concerns being careful with the strings being allocated and deallocated.

3. Implement myuniq

What does the uniq command do?

uniq is a unix filter that removes duplicate lines from the input. The output contains one copy of each unique line in the input. When uniq is invoked with the -c flag, it also counts the number of occurrences for each line. Consider the following sample use of uniq -c:

myth> cat samples/colors
red
green
green
yellow
blue
blue
blue
blue
myth> uniq -c samples/colors 
    1 red
    2 green
    1 yellow
    4 blue

One detail to note about standard uniq is that it only detects duplicate lines when they are immediately adjacent in the output. Your version will improve on this by coalescing duplicate lines no matter where they occur in the input.

How does myuniq operate?

The myuniq program you are to write is similar in operation to the standard uniq with a few differences:

  • myuniq filters out all duplicate lines, no matter where they appear in the input (standard uniq only detects repeated lines when they are immediately adjacent to one another in the input)
  • Only one copy of each unique line is printed. The unique lines are printed in the same order they appeared in the input.
  • myuniq always prefixes each line of output with its count of occurrences in the input (this is the behavior of standard uniq when invoked with the -c flag)
  • myuniq supports no command-line flags (standard uniq has a number of flags)

In order to detect repeated lines anywhere in the input, myuniq must keep track of all of the lines it has seen as it moves through the input. It will also need to count the frequency for each line. An array of structs seems like just the tool for the job, but we have a conundrum. How many entries do we need for this array? Unlike mytail which knew in advance the exact number of entries to allocate, myuniq is the dark. It might need an array of 10 entries or 10,0000. Any arbitrary fixed choice will be too large for some inputs and too small for others. Sounds like an opportunity for resize-as-you-go!

The strategy we require for myuniq is to allocate the array for 100 entries as the initial estimate, and if/when that fills up, resize the array to add another 100 entries. Repeat as needed.

Most of the challenge in myuniq program concerns proper memory handling. By the time you are done with this assignment, you will be a heap master!

Valgrind

This assignment provides a lot of targeted practice in managing memory. You'll be using both the stack and heap for various purposes and need to take care to appropriately allocate and properly deallocate each. It is essential that you are using Valgrind throughout for its help with:

  1. Memory errors. This is by far the most important reason to have Valgrind on the job. If Valgrind is reporting errors, these are Red Alert and require your immediate attention.
  2. Memory leaks. If Valgrind reports leaks, you have some heap allocations without their matching deallocation. Cleaning these up is a more minor polish issue.
  3. Memory efficiency. The Valgrind report includes stats on your total heap usage. Look for a line at the end of a Valgrind report in this format:

    total heap usage: 10 allocs, 10 frees, 888 bytes allocated
    

    This is the aggregate total of all heap allocations during the program's lifetime. The count of allocs should always equal the count of frees. The count of allocations and total size allocated is a measure of your program's memory "footprint". Run our solution under Valgrind to see its footprint and compare yours to it as a measure of your program's memory efficiency. We very tightly specify the use of stack/heap memory for these programs, so the memory usage of a correct program should track the solution very closely. In grading, we usually give your programs some slack (say within 2-3x the sample) but you should aim for closer to validate your usage of memory is correct.

Advice/FAQ

Don't miss out on the good stuff in our companion document!
Go to advice/FAQ page

Grading

Here is the tentative point breakdown:

Functionality (90 points)

  • Sanity cases (25 points) Correct results on the default sanity check tests.
  • Comprehensive/stress cases (40 points) Correct results for additional test cases with broad, comprehensive coverage and larger, more complex inputs.
  • Clean compile (2 points) Compiles cleanly with no warnings.
  • Clean runs under valgrind (15 points) Clean memory reports for all programs when run under valgrind. Memory errors (invalid read/write, use of freed memory, etc) are significant deductions. Memory leaks are a minor deduction. Every normal execution path is expected to run cleanly with no memory errors nor leaks reported. We will not test exceptional/error cases under Valgrind.
  • Reasonable efficiency (8 points) We expect programs to be reasonably efficient in use of time and space. Full points are awarded for being on par (2-3x) with the sample program; deductions will apply for immoderate use of memory/time. There is no bonus for outperforming the benchmark (and extreme efforts to do so could detract from your code quality...).

Code review (buckets together weighted to contribute ~20 points)

  • Use of pointers and memory. We expect you to show proficiency in handling pointers/memory, as demonstrated by appropriate use of stack versus heap allocation, no unnecessary levels of indirection, correct use of pointee types and typecasts, const correctness, and so on.
  • Program design. We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. When you need the same code in more than one place, you should unify, not copy and paste. If the C library provides functionality needed for a task, you should leverage these library functions rather than re-implement that functionality.
  • Style and readability. We expect your code to be clean and readable. We will look for descriptive names, defined constants (not magic numbers!), and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
  • Documentation. You are to document both the code you wrote and what we provided. We expect program overview and per-function comments that explain the overall design along with sparing use of inline comments to draw attention to noteworthy details or shed light on a dense or obscure passage. The audience for the comments is your C-savvy peer.

On-time bonus (+5%)

The on-time bonus for this assignment is 5%. Submissions received by the due date earn the on-time bonus. The bonus is calculated as a percentage of the point score earned by the submission.

Finish and submit

Review the How to Submit page for instructions. Submissions received by the due date receive the on-time bonus. If you miss the due date, late work may be submitted during the grace period without penalty. No submissions will be accepted after the grace period ends, please plan accordingly!

How did it go for you? Review the post-task self-check.