Due: Thu Feb 6 11:59 pm
Late submissions accepted until Sun Feb 9 11:59 pm
Assignment by Julie Zelenski and Michael Chang, with modifications by Nick Troccoli
Learning Goals
This assignment delves into pointers, arrays, and dynamic memory. You will be building your skills with:
- manipulating pointers & arrays
- managing memory allocation on the stack and heap
- using the C library I/O functions
- using tools such as Valgrind to help track down memory errors and leaks
Overview
In this assignment, you will implement simplified versions of the unix commands uniq
and tail
. There are three parts to the assignment:
- implementing
read_line
, a helper function that reads a single line from a file. You'll use a heap-allocated array to store it and resize the array to perfectly fit the line. - implementing the
mytail
program that prints out the lastn
lines of the provided input, using yourread_line
function - implementing the
myuniq
program that prints out just the unique lines and their frequencies from the provided input, using yourread_line
function and a heap-allocated array of structs.
A few reminders:
- The working on assignments page contains info about the assignment process.
- The collaboration policy page outlines permitted assignment collaboration, emphasizing that you are to do your own independent thinking, design, coding, and debugging. For this assignment, we are happy to look at your implementation of your
read_line
function and yourmyuniq
program if necessary, but we won't look at any code associated withmytail
. Later systems classes like CS111, CS112, CS143, CS144, and CS149 have a blanket no-looking-at-code-ever policy, and we have the same policy for your final assignment this quarter. We want you to understand what it takes to code something up all by yourself without relying on outside resources to find your bugs for you.
If you are having trouble completing the assignment on your own, please reach out to the course staff.
To get started on the assignment, clone the starter project using the command
git clone /afs/ir/class/cs107/repos/assign3/$USER assign3
The starter project contains the following:
util.c
mytail.c
,myuniq.c
andMakefile
: three files that you modify, and their Makefile for compilingcustom_tests
: the file where you should add custom tests for your programsmycat.c
: a program that calls yourread_line
function inutil.c
for testing purposes.samples
: a symbolic link to the shared directory for this assignment. It contains:SANITY.ini
,prototypes.h
andsanity.py
: files to configure and run Sanity Check. You can ignore these.myuniq_soln
,mytail_soln
andmycat_soln
: executable solutions for the programs you will write.colors
,colors_no_end_newline
,dictionary
andnames
: sample text files for testing your unix filters
tools
: contains symbolic links to thesanitycheck
andsubmit
programs for testing and submitting your work.
Starter Code
As you work through each part of the assignment, make sure to take time to review and understand the provided code in each part. The provided code relies on reading files, which works similarly, but has different syntax, across different languages. Refer to our guide to the C Standard Library (or your favorite C reference) to get an overview of how to read files using C. The test program mycat.c
is given to you fully implemented for testing read_line
, and you do not need to modify or add comments to it. The starter files for myuniq.c
and mytail.c
contain code to handle the command-line arguments and file setup, and you should add comments to them and implement the rest of their functionality. As you work through the assignment, make sure to understand the given code, and refer back to these questions to confirm your understanding:
- What are the possible options for the mode argument to
fopen
? What happens if you attempt to open a file that doesn't exist or you don't have permissions for? - What is the consequence of failing to
fclose
a file? - How does a program switch to reading standard input when there is no filename argument?
- Are the operations that read from a file also used to read standard input or is different code required for each?
- The default behavior of the
mycat
program corresponds to invoking which command-line flag on standardcat
? mytail
supports a command-line argument of-number
to control the number of lines printed. How does the given code process that optional argument?- What range of values is accepted as an argument for
mytail -number
? - Do you see anything unexpected or erroneous? We intend for our code to be bug-free; if you find otherwise, please let us know!
Testing
This assignment heavily emphasizes testing. For read_line
and each of the 2 programs you write below, you should also add at least 3 to 5 additional tests of your own (total of at least 10) in the custom_tests
file that show thoughtful effort to develop comprehensive test coverage. When you add a test, also document your work by including comments in the custom_tests
file that explain why you included each test and how the tests relate to one another. You can find suggested testing strategies on the testing page, linked to from the Assignments dropdown. We strongly encourage you to write some tests before you write any code, and then add additional tests as you write code.
We also encourage you to make your own text files for testing! Similar to the provided test files in the samples/
folder (like colors
, names
, etc.), having custom file inputs can help you test more thoroughly. When you submit your assignment, you'll be asked whether you'd like to add any custom files to your submission, because by default only the starter files are submitted. At that time, you can specify any additional test files of your own that you created.
The programs that you will write for this assignment are part of a special category of commands called "filters", and there are some cool ways we recommend to test them. A filter is a program that reads input (typically line by line) from the user or from a file, transforms the input in some way, and produces some output. Other examples of commands like this are cat
, head
, wc
, sort
, etc. Most filters accept either a filename as an argument or, if no argument is specified, read input directly typed into the terminal (called "standard input") - you enter ctl-d to signal you are done entering input. Reading from standard input can be handy for running a quick test without having to first create a file.
It is also very handy to test these kinds of programs by using them as part of a "compound" unix command. You can write a compound Unix command where the output of one command is "piped" to be the input of another command, like this, using the pipe |
character:
sort samples/names | ./samples/mytail_soln
This sorts the samples/names
file but instead of outputting the result to the terminal, it feeds it ("pipes it") as input to the mytail_soln
program, which will print out just the last 10 lines of the input it receives.
This "compound" unix Command is called a pipeline.You can have pipelines with as many commands as you'd like, each piping their output to be the next one's input. You can read more about piping input in our Unix guide. This demonstrates the Unix philosophy, which is to move away from the construction of monolithic do-everything programs and instead encourage a proliferation of smaller commands, each focused and streamlined to perform one task well. Those small, discrete tools are combined to accomplish larger tasks.
Pipelines allow you to easily come up with more tests for the programs you'll write on this assignment, since they let you easily feed the output of another program as input to your program. Another pipeline example is grep include util.c | wc -l
, which counts the number of lines in util.c
that contain "include". It does this by using grep
to find lines in util.c
containing "include", and then feeds that output to wc
, which outputs the number of lines in its received input.
If you would like to write pipelines in custom tests, if they start with a command other than one of the assignment programs (e.g. sort file.txt | mytail
), you need to prefix the assignment program name with a $
: sort file.txt | $mytail
.
Important note: unfortunately GDB and Valgrind don't work well with pipeline commands. Instead,you can make a file with the input you want to test and then gdb/valgrind your program running on that file. If you want to save the output of a command to a file, you can use redirection >
: check out our Unix guide for more information. For example, if you wish to debug the following pipeline:
sort samples/names | ./mytail
Since this is feeding the output of sort samples/names
into your mytail program, you could instead save the output of sort samples/names
into a file and debug with that file. You could save the output like this, and then run with that file instead:
myth$ sort samples/names > testfile.txt
myth$ valgrind ./mytail testfile.txt
Memory Usage and Efficiency
Memory
This assignment provides a lot of targeted practice in managing memory. You'll be using both the stack and heap for various purposes and need to take care to appropriately allocate and properly deallocate each. Here are some general memory guidelines:
- Stack allocation is as simple as declaring a variable. The stack is convenient (auto-allocation/deallocation), type-safe, and very efficient.
- Dynamic allocation is via malloc/realloc/free. The heap gives the programmer control over lifetime (explicit allocation/deallocation) and flexibility (storage can be re-sized/re-purposed), but comes at a cost in safety and efficiency.
- Take care to understand proper use of stack versus heap allocation and when each is appropriate. As a general rule of thumb, unless a situation requires dynamic allocation, stack allocation is preferred. Often both techniques are used together in a program. Heap allocation is a necessity when:
- you have a very large allocation that could blow out the stack
- you need to control the memory lifetime, or memory must persist outside of a function call
- you need to resize memory after its initial allocation
It is essential that you are using Valgrind throughout. Check out our Valgrind guide, and/or revisit the Valgrind exercises in labs 2 and 3 to be sure you know how to put Valgrind to work for you. In particular, let Valgrind help you with:
- Memory errors. This is by far the most important reason to have Valgrind on the job. If Valgrind is reporting errors, they need your immediate attention.
- Memory leaks. If Valgrind reports leaks, you have some heap allocations without their matching deallocation. Given that leaks generally do not cause bugs, and incorrect deallocation can, we recommend that you do not worry about freeing memory until your program is finished. Then, you can go back and deallocate your memory as appropriate, ensuring correctness at each step.
Efficiency
We will also be looking at the memory and runtime efficiency of your programs. For memory efficiency, Valgrind can be a huge help, as the Valgrind report includes stats on your total heap usage. Look for a line at the end of a Valgrind report in this format:
total heap usage: 10 allocs, 10 frees, 888 bytes allocated
This is the aggregate total of all heap allocations during the program's lifetime. The count of allocs should always equal the count of frees. The count of allocations and total size allocated is a measure of your program's memory "footprint".
Runtime efficiency (how fast your program runs) is also important. Prefix a command with time
to execute and report on the time used to complete it. For example, time ./mytail
will measure your program's runtime, and time samples/mytail_soln
will measure the sample solution's runtime. The time labeled "user" is the one of interest.
% time ./mytail -1 samples/dictionary
zygote
real 0m0.010s
user 0m0.006s
sys 0m0.000s
% time samples/mytail_soln -1 samples/dictionary
zygote
real 0m0.009s
user 0m0.007s
sys 0m0.000s
The best strategy for ensuring appropriate memory and runtime efficiency is to measure the sample executable on a given input and compare to the measurement of your program running on the same input. If your performance is in the same ballpark as ours (say within a factor of 2), you're golden, but you should aim for closer to validate your usage of memory and time is correct. When considering tradeoffs of space versus time, the sample executable shows you the balance we are looking for. If your program uses noticeably more memory or time, it indicates you have some unnecessary or redundant work that should be eliminated. Be sure to note that very small inputs are, well, too small for meaningful measurement; instead also measure on heftier input to see the program operating at scale.
1. read_line
Your first task is to implement the function read_line
in util.c
, with the following signature:
char *read_line(FILE *file_pointer);
The read_line
function takes in one parameter, which is a file to read from, and returns the next line read from the file as a heap-allocated C string, or NULL if there's nothing more to read from the file. It's the client's responsibility to free the allocation when they're done using it. read_line
is a more convenient version of the fgets
library function, which also reads in the next line from a file (check out man fgets
); but while fgets
requires the caller pass in a fixed-size buffer to write to and truncates the line if it doesn't fit, read_line
will heap-allocate space for the line and resize it as needed so that it can fit the entire line.
To implement read_line
, you'll call fgets
in a loop to read in the line in chunks and build up a full heap-allocated line that you then return. In this way, it's a "wrapper function" around fgets
that makes it easier to use. You should not need to use other file I/O functions such as feof
, fgetc
, etc.
A FILE *
is a type that represents a file in C, but it's not like other pointers in that you can't dereference it; instead, you pass it to other library functions such as fgets
which can do things with that file like read from it. A FILE *
refers to a specific place in the file, and as you read from it using e.g. fgets
, it updates its state to advance through the file so the next time you call e.g. fgets
, it will pick up where it left off.
The read_line
function should be implemented with the following structure - make sure to call assert
after every heap allocation!:
- Make a 32 byte heap-allocated string (our "buffer") on the heap
- Read as much of the next line as you can into the buffer using
fgets
- If you haven't reached the end of the line, realloc the buffer to double its current size, and go back to step 2. A line is considered to end after the first newline (
\n
) character or once you've reached the end of the file, whichever comes first. - Return the line, but omitting the newline character if it ends with one (if the entire line is just
\n
, then the returned string would be the empty string)
Note that if you immediately encounter the end of the file when calling fgets
for the first time, meaning there's nothing to read, you should return NULL
.
Make sure to read the manual page for fgets
to learn more about how to use the function and to understand its quirks. For instance, fgets
tells you that you've reached the end of the file by returning NULL
if you try to call it when there are no more characters to be read. If there are more characters to be read, even if you read up to the end of the file, fgets
won't return NULL
until the next time you try to call it and there's nothing to be read. Here's an example:
FILE *file = ... // pretend file contains "hi"
char buf[1000];
// ptr isn't NULL even though we read up to the end
char *ptr = fgets(buf, 1000, file);
printf("%s\n", buf); // prints "hi"
// now ptr is NULL, since there's nothing left to read
ptr = fgets(buf, 1000, file);
You may assume the following while implementing read_line
:
fgets
will never encounter an error- the specified file will be a text file and its contents can be read in as C strings
- the caller will free the memory returned by
read_line
when they're done
Testing
This function can be tested in isolation with the provided mycat.c
program, which you do not need to modify or comment, but whose code you should read over to understand how it calls your read_line
function. You can also write sanitycheck custom tests with mycat
. mycat
is similar to the built-in cat
Unix command; if you run it with a filename as an argument, it will print the contents of that file, one line at a time. If you run it with no arguments, it will expect input in the terminal, and once you enter ctl-d (to signal end of input), it will print the contents you entered in the same way, one line at a time.
Study the provided mycat.c
code, along with our guide to the C Standard Library (or your favorite C reference) to get an overview of how file-reading works in C. You should review the given code to understand how it tests your read_line
function. Refer to our questions in the starter code section to confirm your understanding.
If read_line
is properly implemented, mycat
should have no memory leaks or errors. Make sure to test thoroughly to ensure that in all cases, read_line
doesn't leak any memory! If you do encounter leaks, see if the number of leaks is proportional to the length of the input. If it is, it means the leak occurs in many calls to read_line
. If it isn't, it means the leak occurs just in a fixed place, such as the first line or the last line. Shrink your test case to be as small as possible to help narrow in on the leak!
The ./mycat Makefile
sanity check test is a great test to ensure that you correctly output whitespace and newlines and correctly handle reaching the end of the file. Unlike other tests that are lenient with whitespace, this test requires an exact match. If this test is failing, it may indicate problems with detecting / handling the end-of-file case, or returning an empty string instead of NULL or vice versa.
Before moving on: make sure you have thoroughly tested your read_line
function, covering various different cases of possible inputs, and that you have written your custom tests. You will use this function later in the assignment, so it's vital to ensure it's thoroughly tested before moving on!
2. mytail
Your next task is to use your read_line
function to implement the mytail.c
program, which is a simplified version of the Unix tail
command. It takes in the name of a file and a number N
> 0 and prints the last N
lines of the file in order. If no N
is specified, it defaults to N=10. Try out the provided sample solution, e.g. ./samples/mytail_soln util.c
or ./samples/mytail_soln -4 Makefile
.
myth$ ./samples/mytail_soln -4 Makefile
clean::
rm -f $(PROGRAMS) *.o
.PHONY: clean all
If the input contains fewer than N
total lines, all lines from the input are printed:
myth$ ./samples/mytail_soln -50 samples/colors
red
green
green
yellow
blue
blue
blue
blue
If you don't specify a filename, it will expect input in the terminal, and once you enter ctl-d (to signal end of input), it will print the contents you entered in the same way - printing the last N
lines.
Review Starter Code
mytail.c
comes with provided code to read user input, open the specified file, and call the print_last_n
function that you must implement. Study the provided program code, along with our guide to the C Standard Library (or your favorite C reference) to get an overview of how file-reading works in C. You should read and understand the given code, work out how you will add to it, supply program and per-function comments to document your approach, and add inline comments where necessary to highlight any particularly tricky details. Refer to our questions in the starter code section to confirm your understanding of the code.
Note that we aren't looking at code for this particular program, because we want you to gain some experience triaging your own bugs early on in the quarter.
Implementation
Since we know in advance how many lines we ultimately need to store, the program should be implemented by creating a stack array of N
strings, and reading lines into that array using your read_line
function. This array will represent a "circular queue" that always stores at most the N
most recent lines read in. When reading in a line, if there is space available in the array, insert the line in the next available space. Otherwise, replace the oldest line in the array with this new line. Continue with this process until you have read in all lines - the resulting array will contain the last <= N
lines from the file (but not necessarily in order!). For example, let's say N = 3
, and we have a file as follows:
line1
line2
line3
line4
line5
After we read in 3 lines, our array should look like [line1, line2, line3]
. When we read in the 4th line, it would then look like [line4, line2, line3]
. And when we read in the 5th line, we would then have [line4, line5, line3]
. From there, you will need to print out the lines in the correct order as they appeared in the file: line3
, line4
, then line5
. Try to think about how you can deduce the correct ordering for printing the lines based on what you know about how the array is constructed. You must use this circular queue approach in your implementation, and should not use a different approach, such as sliding all the elements down each time you insert a new element.
As a client of the read_line
function, make sure you free memory appropriately to avoid any memory leaks.
Testing
This program can be tested with sanity check. We encourage you to use the provided test files in the samples/
folder and also to create your own test text files to use for testing and debugging.
3. myuniq
Your final task is to use your read_line
function to implement the myuniq.c
program, which is a simplified version of the Unix uniq
command. It takes in the name of a file and prints the unique lines in the file along with how many times they appear. Only one copy of each unique line is printed, and the unique lines are printed in the same order they appeared in the input. Try out the provided sample solution, e.g. ./samples/myuniq_soln samples/colors
:
myth$ ./samples/myuniq_soln samples/colors
1 red
2 green
1 yellow
4 blue
If you don't specify a filename, it will expect input in the terminal, and once you enter ctl-d (to signal end of input), it will print the contents you entered in the same way - printing just the unique lines and the number of occurrences of each line.
Review Starter Code
myuniq.c
comes with provided code to read user input, open the specified file, and call the print_uniq_lines
function that you must implement. Study the provided program code, along with our guide to the C Standard Library (or your favorite C reference) to get an overview of how file-reading works in C. You should read and understand the given code, work out how you will add to it, supply program and per-function comments to document your approach, and add any inline comments where necessary to highlight any particularly tricky details. Refer to our questions in the starter code section to confirm your understanding of the code.
Implementation
The program needs to keep track of all the unique lines it has seen and how many times it has seen them - to do this, you should use an array of structs. A struct allows us to store the lines and their frequencies in one array instead of two - it's up to you to specifically define the struct and what it contains. Because we do not know in advance how many unique lines are in the file and thus how long our array should be, you should use the "resize-as-you-go" approach, similar to your read_line
function. Specifically, you should allocate the array on the heap with space initially for ESTIMATE
entries (a provided constant), and if/when that fills up, resize the array to add another ESTIMATE
entries. Repeat this process as needed, adding an additional ESTIMATE
entries each time.
When you read in a line, you will need to manually linearly search through your array to find whether or not you have already seen it.
When printing information about each line, to match the output format of the sample solution when printing out the line count number, you should use the printf
token %7d
instead of just %d
. This will print the number with width 7, meaning the output will look something like this:
1 red
5000 blue
12413 green
1253473 yellow
As a client of the read_line
function, make sure you free memory appropriately to avoid any memory leaks.
Testing
This program can be tested with sanity check. We encourage you to use the provided test files in the samples/
folder and also to create your own test text files to use for testing and debugging.
Submitting
Once you are finished working and have saved all your changes, check out the guide to working on assignments for how to submit your work. When you submit, you may optionally indicate that you do not plan to make a submission after the on-time deadline. This allows the staff to start grading some submissions as soon as the on-time deadline passes, rather than waiting until after the late period to start grading. You will also have an opportunity to add any custom files to your submission that you created (e.g. for testing), because by default only the starter files are submitted.
Grading
Here is the tentative point breakdown:
Functionality (95 points)
- Sanity cases (25 points) Correct results on the default sanity check tests.
- Comprehensive/stress cases (40 points) Correct results for additional test cases with broad, comprehensive coverage and larger, more complex inputs.
- Clean compile (2 points) Compiles cleanly with no warnings.
- Clean runs under valgrind (15 points) Clean memory reports for all programs when run under valgrind. Memory errors (invalid read/write, use of freed memory, etc) are significant deductions. Memory leaks are a minor deduction. Every normal execution path is expected to run cleanly with no memory errors nor leaks reported. We will not test exceptional/error cases under Valgrind.
- Reasonable efficiency (8 points) We expect programs to be reasonably efficient in use of time and space. Full points are awarded for being on par (2-3x) with the sample program; deductions will apply for immoderate use of memory/time. There is no bonus for outperforming the benchmark (and extreme efforts to do so could detract from your code quality).
- custom_tests (5 points) Your custom_tests file should include at least 10 tests of your own, 3-5 per program, that show thoughtful effort to develop comprehensive testing coverage. Please include comments that explain your choices. We will run your custom tests against your submission as well as review the cases to assess the strength of your testing efforts.
Code review (buckets together weighted to contribute ~20 points)
The grader's code review is scored into a bucket per assignment part to emphasize the qualitative features of the review over the quantitative. The styleguide is a great overall resource for good program style. Here are some highlights for this assignment:
- Use of pointers and memory. We expect you to show proficiency in handling pointers/memory, as demonstrated by appropriate use of stack versus heap allocation, no unnecessary levels of indirection, correct use of pointee types and typecasts, const correctness, and so on.
- Program design. We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. When you need the same code in more than one place, you should unify, not copy and paste. If the C library provides functionality needed for a task, you should leverage these library functions rather than re-implement that functionality.
- Style and readability. We expect your code to be clean and readable. We will look for descriptive names, defined constants (not magic numbers!), and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
- Documentation. You should document both the code you wrote and what we provided. We expect reasonably complete program overview and per-function comments that explain the overall design along with occasional use of inline comments to draw attention to noteworthy details or shed light on a dense or obscure passage. The audience for the comments is your C-savvy peer.
Post-Assignment Check-in
How did the assignment go for you? We encourage you to take a moment to reflect on how far you've come and what new knowledge and skills you have to take forward. Once you finish this assignment, you will have written your own implementation of two Unix filter commands and an improved version of a C I/O function, using pointers, arrays and dynamic memory. That's a pretty darn impressive accomplishment!
To help you gauge your progress, for each assignment/lab, we identify some of its takeaways and offer a few thought questions you can use as a self-check on your post-task understanding. If you find the responses don't come easily, it may be a sign a little extra review is warranted. These questions are not to be handed in or graded. You're encouraged to freely discuss these with your peers and course staff to solidify any gaps in you understanding before moving on from a task. They could also be useful as review before the exams.
- The
scan_token
function you wrote for assignment 2 required that the client supply the memory to write the scanned token. Theread_line
function of assignment 3 instead allocates memory on the client's behalf. What are the advantages and disadvantages between the two approaches? - It is stated that a correct program should have a one-to-one correspondence between
malloc
calls andfree
calls. How does adding arealloc
call change the needed number ofmalloc
andfree
calls? Briefly explain your reasoning. - Which is the bigger problem: memory errors or memory leaks? Why? How can you tell which is which in a Valgrind report?
- How can you gauge whether your program has acceptable performance in terms of runtime or memory efficiency?
- When it is appropriate to use
calloc
instead ofmalloc
?