Assignment 1: Reading Unix v6 Filesystems

Due: Thu Jan 26 11:59 pm
Late submissions accepted until Sat Jan 28 11:59 pm

Thanks to Mendel Rosenblum for this wonderful assignment. Based on writeups by John Ousterhout, Jerry Cain and Chris Gregg, with modifications by Nick Troccoli.

Assignment 1 Overview Session: Thursday, January 19 7-8PM on Zoom (link on Canvas). Note that the session is recorded and will be posted to the course website shortly after it’s over. Slides here, video here.

Unix V6 was an early version of the Unix operating system. It was released in 1975 and was the first version of Unix to see widespread usage outside Bell Labs. Unix V6 was a beautiful piece of software: simple and very small (less than 64 KBytes of code!), yet with many of the most powerful features found in modern Linux systems. For this project, we have created images of Unix V6 disks in a format virtually identical to how they would have appeared in 1976. Your task for this project is to recreate enough of the V6 file system code to read files from the disk. You will work in C for this project, since that was the language used for Unix V6.

Learning Goals

This assignment focuses on understanding the different layers in the Unix v6 filesystem. You will be building your skills with:

  • understanding the implementation of the Unix v6 filesystem as outlined in class
  • appreciating the complexity and tradeoffs in implementing a filesystem
  • harnessing the power of layering to implement filesystem components and make bland hardware work to be a fully operational filesystem
  • navigating a large code base, adding specific features and functions, and testing their behavior
[NEW FOR ASSIGN1] Assignment Support: Starting with assign1, we are not able to look at code when providing help in Helper Hours. This doesn't mean we can't help at the code level with questions - but we aren't able to look over assignment code when answering questions. We are happy to look over example code such as code from lecture or section, talk through concepts or bugs, debugging strategies, and more!

Overview

A key design principle we saw with filesystems is that of layering: one way to manage complexity in systems is to break it down into layers that sit on top of each other. There are several layers involved in the Unix v6 filesystem, from lowest to highest:

  • The block layer manages the details of communication with the disk, reading and writing individual sectors. This layer sits underneath almost every filesystem operation, and above this layer, we don’t want to have to think about the nitty-gritty of talking to the hardware.
  • The inode layer supplies higher layers with metadata about files. When we need to get a specific inode or know which block number is storing a particular portion of a file, the inode layer tells us that. Above this layer, we don’t want to think about the mechanics of retrieving or updating metadata (which isn’t simple, considering inodes are smaller than sectors).
  • The file layer supplies higher layers with actual file contents. We request some number of bytes from a file at a particular offset, and the file layer populates a buffer with that information.
  • The directory layer allows us to find a file within a directory. Given a filename and a directory presumably containing that file, this layer figures out what inode stores the information for that file.
  • Lastly, the pathname layer implements full absolute path lookups, starting from the root directory. If you hand /usr/class/cs111/hello.txt to the pathname layer, it will utilize the directory layer beneath it, looping through the components of that full path, until it finds hello.txt.

Any user software accessing the filesystem will sit on top of the directory and pathname layers. If we’re trying to write an application that does meaningful work, we almost certainly don’t want to be bogged down thinking about the organization of bytes on disk, so we leave that to the aforementioned layers beneath us.

For this assignment, we give you a fully implemented block layer, and you’ll flesh out the others. By the end of the assignment, your program will be able to list and read files from a disk formatted with the Unix v6 filesystem!

Starter Code

To get started on the assignment, clone the starter project using the command

git clone /usr/class/cs111/repos/assign1/$USER assign1

This line creates a new folder called assign1 in your current working directory containing a copy of the starter code.

The starter project contains the following files listed below. There are a lot of provided files in this assignment, and part of the challenge is learning to navigate them, follow the logic, and add code where appropriate. Expect that it will take some time to figure out the starter code before you can actually start doing anything. This may be frustrating, but we want to push you to get better at reading code and finding your way around an existing system; such skills are critical if you want to work on systems comprised of hundreds of thousands or millions of lines of code. Here are some tips to help with this navigation:

  • consult the documentation above and further down in the spec
  • consult the comments throughout the provided files, particularly in the .h files for functions you must implement
  • skim through the test program code (diskimageaccess.c and function_tester.c) to see how your functions are tested
  • use grep to search the files for definitions or unknown types; for example, let's say you forget where struct inode is defined and want to see its definition. The format for grep is grep <pattern> <file(s) to search>. So if you wanted to search everything in the assignment folder for the struct inode definition, you might do grep "struct inode {" * to search all files in the current folder (that's the asterisk) for the text "struct inode {". This is also helpful to search for where a function is called. So if you wanted to search for where file_getblock is called, you might do grep "file_getblock(" *.

  • readme.txt: a text file where you will confirm your course enrollment details as part of the assignment
  • diskimg.[ch]: files for the block layer, which is fully implemented for you
  • inode.[ch]: files for the inode layer, which you will modify
  • file.[ch]: files for the file layer, which you will modify
  • directory.[ch]: files for the directory layer, which you will modify
  • pathname.[ch]: files for the pathname layer, which you will modify
  • diskimageaccess.c, function_tester.c, chksumfile.[ch], unixfilesystems.[ch], Makefile: files for the provided diskimageaccess and function_tester test programs and the Makefile for compiling.
  • direntv6.h: contains the definition of struct direntv6, which represents a single directory entry.
  • filsys.h: contains the definition of struct filsys, which represents the filesystem superblock. This is a slightly modified copy of the header file from Version 6 Unix. It also contains several useful constants. You can largely ignore the struct and all of the fields defined inside - it is included for the purposes of the testing framework, but none of the code you write needs to deal with the struct.
  • ino.h: contains the definition of struct inode, which represents a single inode. This comes from Version 6 of Unix, with some small modifications.
  • unixfilesystem.[ch]: contains the definition of struct unixfilesystem, which represents the filesystem as a whole. It also contains constants specifying the file system layout, including the sector address of the superblock and of the start of the inode region.
  • custom_tests: a file where you will add custom tests that you can run through sanitycheck
  • samples: a symbolic link to the shared directory for this assignment. It contains:
    • SANITY.ini and sanity.py: files to configure and run Sanity Check. You can ignore these.
    • diskimageaccess_soln: executable solution for the provided test program. Use this program to verify correct behavior for these programs.
    • function_tester_soln: executable solution for the other provided test program. Use this program to verify correct behavior for these programs.
    • disk_images: contains "basic", "depthFile", "dirFnameSize" and "basicExtended" disk images for testing, along with corresponding ".structure" files describing the file system structure (what files/directories are in them and what their inode numbers are) and corresponding ".output" files with expected output when run with diskimageaccess (you can also run these disk images with diskimageaccess_soln to see the expected output as well).
  • tools: contains symbolic links to the sanitycheck and submit programs for testing and submitting your work.

You should only need to modify files for the 4 layers you will implement, along with your custom_tests file.

Testing

There are 2 provided test harness programs to help you test on this assignment. You can run both in GDB for debugging, since they both call your filesystem layer functions.

diskimageaccess

The diskimageaccess program does more elaborate testing of entire filesystem layers and some individual functions. Compile using make and you will create the diskimageaccess program that executes your code. Run it with the -h flag to see how to use it:

$ ./diskimageaccess -h
Usage: ./diskimageaccess <options> diskimagePath
where <options> can be:
-h     display this help message
-q     don't print extra info
-b     test the inode_iget function in isolation
-m     test the inode layer in isolation
-i     print all inode checksums (test the inode and file layers)
-d     tests directory_findname to ensure no reading of invalid dirents when dir is only partly filled
-l     tests directory_findname to ensure names of length 14 work
-p     print all pathname checksums (test the directory and pathname layers)

diskimagePath should be the path to one of the test disks located in samples/disk_images. We currently have four test disks: basicDiskImage, depthFileDiskImage, dirFnameSizeDiskImage and basicDiskImageExtended. Depending on what flag you specify, you can test different layers of your implementation. For example, to run both the inode and pathname tests on the basic disk, you could run

$ ./diskimageaccess -ip samples/disk_images/basicDiskImage

This program tries to read all the files on the image using your code and outputs checksums of the inode and file contents, which uniquely summarize each file and should match ours.

function_tester

For more fine-grained debugging and testing, we have also provided the function_tester program which allows you to test individual function calls to your filesystem functions. You can use this if one of the diskimageaccess tests is failing or if you'd like to more specifically investigate a single call or calls to your functions to ensure they are working properly.

Compile using make and you will create the function_tester program that executes your code. Run it with the -h flag to see how to use it:

$ ./function_tester -h
Usage: ./function_tester <diskimagePath> <function> <arg1>...<argn>
<diskimagePath> is the path to a disk image file.
<function> is one of the assignment functions, e.g. inode_indexlookup or file_getblock.
<argX> is an arg to pass to that function.  Here are examples:
  inode_iget: specify the inode number to test with
  inode_indexlookup: specify the inode number followed by the fileBlockIndex to test with
  file_getblock: specify the inode number followed by the fileBlockIndex to test with
  directory_findname: specify the directory inumber followed by the entry name
  pathname_lookup: specify the absolute path

Sanity check runs your program against the sample solutions for these testing programs and compares results for various tests. You can also run the solution programs manually to see the intended behavior.

On this assignment, one testing requirement is that you add at least THREE additional tests to custom_tests using the function_tester program (don't add custom tests with diskimageaccess, since all possible test variations in the diskimageaccess program are already included in the default sanitycheck). In other words, you should add at least three tests that run function_tester with various arguments. Add a comment above each one (start the line with '#') documenting what that test is verifying.

Error Checking

Check out each layer's header file for a description of the required error checking for that function. In general, you should check for disk errors, as well as errors in calling any of your own functions from lower layers, in inode_indexlookup you should check to see if the specified fileBlockIndex is too large, and in higher layers you should return -1 if the directory entry or path is invalid. A "disk error" means checking the return value of diskimg_readsector. If your code does detect an error, be sure to log information about the error using fprintf(stderr, ...) and then return -1. E.g.

fprintf(stderr, "inode_indexlookup(fileBlockIndex=%d) error calling diskimg_readsector.\n",
  fileBlockIndex);
return -1;

The error messages are not required to exactly match the sample solution, but should be reasonable error messages that describe what error occurred. When in doubt, though, matching the sample solution error messages is the easiest way to go!

Clarification on printing error messages: We're updating the assignment requirements to make it so that printing error messages won’t be required for grading - but we still highly encourage them to help with debugging! In particular, they can help track down why a particular function is returning -1 (useful for debugging checksum errors). Another note - if you do print error messages, our solution does not print an error message if directory_findname can’t find an entry; so make sure to not print one there either to ensure you pass sanity check test 41. If you write custom tests for error cases, you will need to match the solution error messages in order for the tests to pass, as custom tests does a direct output comparison to determine whether the test passed or failed.

Working On The Assignment

Your job is to complete the implementation of 4 layers of the filesystem: the "inode layer", "file layer", "directory layer", and "pathname layer". Each of these layers has 2 corresponding files, and layers build on each other. You will need to modify the .c files for each layer. We also provide a fully-implemented "block layer" in diskimg.[ch] that provides the code for reading and writing sectors (note that the words block and sector are used interchangeably here) on the disk image.

This spec is long, but written incrementally - you can work as you read through in steps. However, check out our few additional tips at the end as you work.

Note: back in the 1970s, storage space — both on disk and in main memory — was at a premium. As a result, the Unix v6 file system goes to lengths to reduce the size of data it stores. You'll notice that many integer values in the structs we provided are stored using only 16 bits, rather than today's more standard 32 or 64. (In our code we use the type int16_t from stdint.h to get a 16-bit integer, but back in the '70s, the C int type was 16 bits wide.)

As you implement them, consider what functions from lower layers you have already implemented and how they can help! We recommend implementing them in the order specified below.

As you work, we highly recommend testing incrementally and ensuring each function is working before moving on. Sanity check can be helpful at certain points, but it only tests certain layers directly, and it also omits some output from the program in order to do an output comparison. To view the full testing output and to test more thoroughly, make sure to run the diskimageaccess and function_tester programs outside of sanity check. You can see more specific testing recommendations for each function below. There is also a custom_tests file you can put your own tests in and run that with sanitycheck.

(Provided) Block layer (diskimg.h and diskimg.c)

The lowest layer of the file system provides facilities for reading and writing individual disk blocks. In this project, the block layer reads and writes blocks from a disk image file rather than an actual disk, but the API is the same as it would be for an actual disk. The starter kit provides you with a complete implementation of this layer; you will use the function diskimg_readsector in your code (see the comments in diskimg.h for more information). You do not need to use diskimg_writesector.

When calling diskimg_readsector, make sure to do error-checking by checking if its return value isn't equal to DISKIMG_SECTOR_SIZE! This can help catch issues, since it returns -1 on error and should otherwise always return DISKIMG_SECTOR_SIZE (512) as the number of bytes read.

Inode layer (inode.h, inode.c)

This layer provides functions for finding inodes on disk and also for using the information in inodes to translate from logical block indexes within a file to physical block/sector numbers on disk. You will implement the functions inode_iget and inode_indexlookup. These functions take in a struct unixfilesystem * parameter that specifies info about the disk being read from. The "handle" within that struct is needed to call diskimg_readsector. Remember: the first inode in the first inode block has inumber 1! See unixfilesystem.h for details and helpful constants. Make sure to check out the documentation in inode.h for what the functions must do.

inode_iget

This function reads the data for the inode with the specified inumber from the filesystem and fills in an inode struct with that information.

Testing Recommendation

Sanity check includes some testing for this by running diskimageaccess with the -b flag. If any tests are failing, try running them outside of sanity check for more information. Another way to test this function in isolation is to pick an inode from the .structure file for a disk image and test it vs. the sample solution (try custom_tests!) to see if the inode is read in properly.

inode_indexlookup

This function returns the block number of the file data block at the specified index. What this means is that, if we take just the blocks for this file holding actual file data (e.g. ignoring indirect blocks) and index them sequentially, the fileBlockIndex parameter to this function specifies one of those indexes, and we must return the block/sector number at that index. In other words, the returned block number will always be for a block that stores actual file data. As a concrete example, let's say a file is 1,536 (512 * 3) bytes big. This means it takes up 3 blocks of data. Let's say furthermore that the block numbers it is using are 35, 65 and 21. This means in the inode it stores these three block numbers, and if you fetched blocks 35, 65 and 21 from disk you would have data for the entire file. But for inode_indexlookup, fileBlockIndex would be either 0, 1 or 2 - it's specifying the index into the file's data of the block to get, and it's your job to return the disk block number it corresponds to. For instance, if fileBlockIndex was 0, you would return 35. If fileBlockIndex was 1, you would return 65, etc. For a large file, let's say fileBlockIndex was 256; that would mean you should return the first direct block number contained in the second singly-indirect block (since the 256th block number for the file's data would be in the second singly-indirect block).

This implementation needs to manage direct, singly indirect, and doubly indirect block numbers. Translating for direct block numbers is straightforward (as we did in the example above), but translating for indirect block numbers is hard to get right.

Tips:

  • We suggest implementing this function first just for direct blocks; once you do this, you should pass the inode-level test for the depthFileDiskImage as that is the only disk image that doesn't contain any large files.
  • Next, we suggest implementing this function only handling singly-indirect blocks; in other words, assume large files will take up at most 7 singly-indirect blocks, and ignore the 8th doubly-indirect one. Once you do this, you should pass the inode-level test for the dirFnameSizeDiskImage as that disk image doesn't contain any files bigger than 7 * 256 * 512 bytes. Remember that indirect block numbers identify other blocks that contain 256 two-bytes integers (best managed via the uint16_t type).
  • Finally, we suggest implementing the case where you need to examine the 8th doubly-indirect block; once you do this, all inode-level tests should pass.
  • Make sure to share and unify code as much as possible between the singly-indirect and doubly-indirect block cases! Since doubly indirect block numbers lead to singly indirect block numbers, which lead to block numbers. Make sure whatever code you write to translate singly indirect block numbers also contributes (without cutting and pasting or otherwise copying it) to the process used to ultimately convert doubly indirect block numbers.
  • use the i_mode field of the inode struct to check if the file uses the large or small mapping scheme. The i_mode is a 16-bit integer whose individual bits indicate various properties of the inode. ino.h contains #defines that describe what each bit means. Specifically, bit 12 of i_mode indicates whether the file uses the large file mapping scheme. So if (i_mode & ILARG) != 0, then the inode's i_addr fields point at indirect and doubly-indirect blocks rather than directly at the data blocks.

Testing Recommendation

Sanity check includes some testing for this by running diskimageaccess with the -m flag. If any tests are failing, try running them outside of sanity check for more information. Another way to test this function in isolation is to use function_tester; choose an inode number from the .structure file that is either a small or large file depending on what format you'd like to test. Then try fetching each of its file blocks (0, 1, 2,...) via function_tester and verify that each of the numbers is correct (try custom_tests!).

File layer (file.h and file.c)

This layer provides a function file_getblock that reads a block of data from a file, given the file's i-number and the desired block within the file. You will write the implementation for this function. Make sure to check out the documentation in file.h for what the function must do.

file_getblock's job is to read the file data block at the specified index. In other words, whereas inode_indexlookup finds which block number contains the data, this function actually reads in that data. This function returns the number of meaningful bytes in a block/sector that contribute to the payload of some file (be it a regular file or a directory). In other words, it returns the number of bytes in that block that are actually used to store data for the file. As an example, if the file size happens to be a multiple of 512, then this function always returns 512. Alternatively, if a file isn’t a multiple of 512 and we have to get the payload stored in the very last block, you should expect to return a number that’s less than 512 (because only a part of what’s stored in the block is relevant).

Note: in a space-saving move, the designers of the file system stored the inode's size field as a 24-bit integer. There's no 24-bit integer type in C, so we represent this value using two fields in the inode struct: i_size1, which contains the least-significant 16 bits of the value, and i_size0, which contains the most-significant 8 bits of the value. We provide a function inode_getsize in inode.c that assembles these two fields into a normal C integer for you. This may come in handy in file_getblock!

Testing Recommendation

Sanity check includes some testing for this by running diskimageaccess with the -i flag. If any tests are failing, try running them outside of sanity check for more information. If your inode information isn't matching, revisit your inode_iget function. If your checksums aren't matching, identify the inode that is outputting incorrectly and check the .output file for more information about it. Pay close attention to whether that inode represents a small or large file; that could give you a hint as to where your logic might be going wrong! Then you can use function_tester to investigate further.

Directory layer (directory.h and directory.c)

This layer encapsulates knowledge about the structure of directories. You will write the function directory_findname, which reads a directory file to find the directory entry with a given name, if there is one. Make sure to check out the documentation in directory.h for what the function must do.

This function's job is to get the directory entry (defined as struct direntv6) from the given directory that matches the given name. As a hint, your implementation should make use of strncmp instead of strcmp, because some of the stored directory names aren’t '\0'-terminated. They are, however, guaranteed to be of length 14 or less. Check out the man page for strncmp and strcmp so you understand the difference.

Testing Recommendation

Sanity check includes some testing for this by running diskimageaccess with the -l and -d flags. If any tests are failing, try running them outside of sanity check for more information. Another way to test this function in isolation is with function_tester. Open the .structure file for the test disk and choose a file or folder to test with. Note both its name, as well as the inode number of its parent directory. Then test with function_tester, specifying the parent inode number as dirinumber and the chosen name as name.

Pathname layer (pathname.h, pathname.c)

This layer provides a function pathname_lookup which looks up a path and returns the i-number for the file specified by the path. You will implement this function. The function strsep will come in handy here! We encourage you to implement this function iteratively (though recursion is fine provided it is clean and tidy). Make sure to check out the documentation in pathname.h for what the function must do.

Testing Recommendation

Sanity check includes some testing for this by running diskimageaccess with the -p flag. If any tests are failing, try running them outside of sanity check for more information. diskimageaccess walks the entire file system directory tree, starting at the root directory, and prints out all of the paths that it finds, along with inode and checksum information for each file or directory found. If your inode information isn't matching, try tracing the lookup by hand to make sure that the correct step is taken each time (e.g. / to /foo, /foo to /foo/CVS, etc.). The .output output files show the inumbers for various path depths, so if you are looking up /a/b/c you can check that your lookup for /a, /a/b and /a/b/c are all correct. You can also use function_tester to test the pathname and directory layer functions directly.

Submitting and Grading

Once you are finished working and have saved all your changes, submit by running the submit tool in the tools folder: ./tools/submit. We recommend you do a trial submit in advance of the deadline to allow time to work through any snags. You may submit as many times as you would like; we will grade the latest submission. Submitting a stable but unpolished/unfinished is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, this previous submit will earn points. Without a submission, we cannot grade your work.

We will grade your assignment with the following criteria:

  • Code test results: we will run the sanity check tests as well as potential additional test cases to confirm your functions work as intended and implement the required error checking. You have access to all of the disk images we'll use for grading.
  • Custom tests: we will verify that you've added at least 3 custom tests using function_tester and that each of them has a comment above them describing the test
  • Clean build and clean valgrind reports (eg. no memory leaks or errors)
  • Code style: we will manually review your code and give feedback on coding style and design (see tips below for some style guidelines!)

General Tips

Layering

Try to avoid "breaking the layering" of the file system where possible. You want to aim to have a higher layer build on the layer below it as much as possible, rather than bypassing it and manually calling functions lower down - though in some places calling a function from a lower down layer (like diskimg_readsector) may be appropriate.

Constants

Make sure to use provided constants, and define your own constants, where appropriate. For instance, use INODE_START_SECTOR and ROOT_INUMBER in unixfilesystem.h instead of hardcoding numbers such as 2 and 1 in your code. There are calculations throughout your code, and your goal should be to make them as easy to follow as possible.

Buffers, Types and Casting

When creating a "buffer" (space to store data), carefully consider the type of the buffer. In places like inode_iget and directory_findname, we know what type of data we are reading in, so you should not use void * and casting, and instead declare the type as specifically as you can - i.e. instead of an array of chars, use an array of struct inodes or some other type. Only go with void *, manual pointer arithmetic, and memcpy when you don’t know the data types that are being stored, copied, and passed around. In the case of inode_iget, for instance, you know the sectors you’re examining contain arrays of struct inodes. Similarly, your implementation of directory_findname should declare an array of struct direntv6 instead of a character array of length 512, and you should be able to take advantage of the fact that an indirect block contains an array of 256 two-byte integers (uint16_t).

Memory Allocation

You should not need to use dynamic memory allocation on this assignment (i.e. malloc/realloc/free).