Assignment 2: Reading Unix v6 Filesystems


Due: Thu Jan 28 11:59 pm
Late submissions accepted until Sat Jan 30 11:59 pm

Thanks to Mendel Rosenblum for this wonderful assignment. Writeup by Jerry Cain, with modifications by Nick Troccoli and Chris Gregg.

An archaeological expedition in Murray Hill, NJ has uncovered some magnetic disk drives from the mid-1970s. After considerable effort, the dig's technical expert read the contents of the drives, and for each disk drive, she created a local file to be an exact bitwise copy of the disk’s data. The archaeologists determined the data on the disks was stored using the Version 6 Unix filesystem. Sadly, none of the archaeologists are familiar with that filesystem, so they haven't been able to read the image contents. Your task is to write a program that understands the Unix v6 filesystem to extract the file system data.

Learning Goals

This assignment focuses on understanding the different layers in the Unix v6 filesystem. This is a real system, people! You will be building your skills with:

  • understanding the implementation of the Unix v6 filesystem
  • appreciating the complexity and tradeoffs in implementing a filesystem
  • harnessing the power of layering to implement filesystem components
  • navigating a large code base and adding specific features and functions

Starter Code

To get started on the assignment, clone the starter project using the command

git clone /usr/class/cs110/repos/assign2/$USER assign2

This line creates a new folder called assign2 in your current working directory containing a copy of the starter code.

The starter project contains the following:

  • diskimg.[ch]: files for the block layer, which is fully implemented for you
  • inode.[ch]: files for the inode layer, which you will modify
  • file.[ch]: files for the file layer, which you will modify
  • directory.[ch]: files for the directory layer, which you will modify
  • pathname.[ch]: files for the pathname layer, which you will modify
  • diskimageaccess.c, chksumfile.[ch], unixfilesystems.[ch], Makefile: files for the provided diskimageaccess test program and the Makefile for compiling.
  • direntv6.h: contains the definition of struct direntv6, which represents a single directory entry. This was copied from section 2.5 in the textbook.
  • filsys.h: contains the definition of struct filsys, which represents the filesystem superblock. This is a slightly modified copy of the header file from Version 6 Unix.
  • ino.h: contains the definition of struct inode, which represents a single inode. This comes from Version 6 of Unix, with some small modifications.
  • unixfilesystem.[ch]: contains the definition of struct unixfilesystem, which represents the filesystem as a whole. It also contains constants specifying the file system layout, including the sector address of the superblock and of the start of the inode region.
  • samples: a symbolic link to the shared directory for this assignment. It contains:
    • SANITY.ini and sanity.py: files to configure and run Sanity Check. You can ignore these.
    • diskimageaccess_soln: executable solution for the provided test program. Use this program to verify correct behavior for these programs.
    • testdisks: contains "basic", "depthFile" and "dirFnameSize" disk images for testing, along with corresponding ".gold" files with expected output.
  • tools: contains symbolic links to the sanitycheck and submit programs for testing and submitting your work.

There are a lot of provided files in this assignment, and part of the challenge is learning to navigate them, follow the logic, and add code where appropriate. Here are some tips to help with this navigation:

  • consult the documentation above and further down in the spec
  • consult the comments throughout the provided files
  • skim through the test program code to see how your functions are tested
  • use grep to search the files for definitions or unknown types; for example, let's say you forget where struct inode is defined and want to see its definition. The format for grep is grep <pattern> <file(s) to search>. So if you wanted to search everything in the assignment folder for the struct inode definition, you might do grep "struct inode {" * to search all files in the current folder (that's the asterisk) for the text "struct inode {". This is also helpful to search for where a function is called. So if you wanted to search for where file_getblock is called, you might do grep "file_getblock(" *.

You are welcome to modify any of the files we give you. When making changes in the test harness, do so in a backward compatible way so that the testing scripts we give you continue to work. In other words, it's fine to add testing and debugging code, but make sure that when we run your program with -i and -p (see below), its output has the same format with or without your changes, so our automated tests won't break.

Important Reading

Before you try to write any code, you'll want to familiarize yourself with the details of the Unix V6 file system by reading Section 2.5 of the Saltzer and Kaashoek textbook. Click here to view the free online copy through Stanford.

Testing

Compile using make and you will create the diskimageaccess program that executes your code. Run it with the -h flag to see how to use it:

$ ./diskimageaccess -h
./diskimageaccess: invalid option -- 'h'
Usage: ./diskimageaccess <options> diskimagePath
where <options> can be:
-q     don't print extra info
-i     print all inode checksums (test the inode and file layers)
-p     print all pathname checksums (test the filename and pathname layers)

diskimagePath should be the path to one of the test disks located in samples/testdisks. We currently have three test disks: basicDiskImage, depthFileDiskImage, and dirFnameSizeDiskImage. Depending on what flag you specify, you can test different layers of your implementation. For example, to run both the inode and filename tests on the basic disk, you could run

$ ./diskimageaccess -ip samples/testdisks/basicDiskImage

This program tries to read all the files on the image using your code and outputs checksums of the inode and file contents, which should match ours. There is a fully implemented version in samples/diskimageaccess_soln for you to compare with. The expected output for running diskimageaccess on each disk image X is stored in the X.gold file inside the testdisks directory as well.

We encourage you to make use of the sanity check tool when testing your solution. Running sanitycheck allows you to test against a set of pre-provided test cases, which for this assignment are the same functionality tests we will use during grading. Check out the guide to sanitycheck here for more information.

Working On The Assignment

Your job is to complete the implementation of 4 layers of the filesystem: the "inode layer", "file layer", "directory layer", and "pathname layer". Each of these layers has 2 corresponding files, and layers build on each other. You will need to modify the .c files for each layer. We also provide a fully-implemented "block layer" in diskimg.[ch] that provides the code for reading and writing sectors (note that the words block and sector are used interchangeably here) on the disk image.

As you implement them, consider what functions from lower layers you have already implemented and how they can help! We recommend implementing them in the following order.

Testing Recommendations

As you work, we highly recommend testing incrementally and ensuring each function is working before moving on. Sanity check can be helpful at certain points, but it only tests certain layers directly, and it also omits some output from the program in order to do an output comparison. To view error messages or other testing output you are printing, make sure to run the diskimageaccess program outside of sanity check. You can see more specific testing recommendations for each function below.

Inode Layer (inode.[ch])

This defines and implements the interface for getting information about an inode and the blocks its data is stored in. There are two functions you must implement:

inode_iget

This function reads the data for the inode with the specified inumber from the filesystem and fills in an inode struct with that information.

Testing Recommendation

One way to test this function in isolation is to run the diskimageaccess test program with the -i flag in GDB, with a breakpoint on inode_iget. For a given call, inspect the inumber being passed in, and look at the corresponding .gold file for the disk image you are testing with to see what the mode and size for that inode should be (e.g. samples/testdisks/basicDiskImage.gold for the basic disk image). Then use gdb to confirm that the inode your function is reading in has these same values. Note that you may need to print out the inode's mode in hex (try p/x myInode.i_mode in GDB) and you may need to call the inode_getsize function to print out the correct inode size (try call inode_getsize(&myInode) in GDB).

inode_indexlookup

This function returns the block number of the file data block at the specified index. What this means is that, if we take just the blocks for this file holding actual file data (e.g. ignoring indirect blocks) and index them sequentially, the fileBlockIndex parameter to this function specifies which of those blocks we must return the block/sector number for. In other words, the returned block number will always be for a block that stores actual file data. As a concrete example, let's say a file is 1,536 (512 * 3) bytes big. This means it takes up 3 blocks of data. Let's say furthermore that the block numbers it is using are 35, 65 and 21. This means in the inode it stores these three block numbers, and if you fetched blocks 35, 65 and 21 from disk you would have data for the entire file. But for inode_indexlookup, fileBlockIndex would be either 0, 1 or 2 - it's specifying the index into the file's data of the block to get, and it's your job to return the disk block number it corresponds to. For instance, if fileBlockIndex was 0, you would return 35. If fileBlockIndex was 1, you would return 65, etc.

This implementation needs to manage direct, singly indirect, and doubly indirect block numbers. Translating for direct block numbers is straightforward (as we did in the example above), but translating for indirect block numbers is hard to get right.

  • Hint 1: Understand that indirect block numbers identify other blocks that contain 256 two-bytes integers (best managed via the uint16_t type).
  • Hint 2: Understand that doubly indirect block numbers lead to singly indirect block numbers, which lead to block numbers. Make sure whatever code you write to translate singly indirect block numbers also contributes (without cutting and pasting or otherwise copying it) to the process used to ultimately convert doubly indirect block numbers.
Testing Recommendation

One way to test this function in isolation is to write temporary testing code in file_getblock. Choose an inumber from the .gold output (e.g. samples/testdisks/basicDiskImage.gold for the basic disk image) that is either a small or large file depending on what format you'd like to test. Then fetch that inode and try fetching each of its file blocks (0, 1, 2,...) via inode_indexlookup and verify that each of the numbers is correct. For a small file, you can use GDB to examine that inode's list of block numbers and ensure the right one is returned. For example, if you have an inode with blocks {45, 23, 14, ...} then file block index 0 should return 45, 1 should return 23, 2 should return 14, etc. For a large file, you could write additional testing code to manually fetch the block(s) that should contain that file block to ensure you are getting the right answer. For example, if you have an inode with blocks {45, 23, 14, ...} then for file block index 256 you would fetch block 23 in your test code so that you could verify the return value from inode_indexlookup matches its first number.

File Layer (file.[ch])

The inode layer supports getting information about a file. This layer's job is to actually read the data a file contains, in chunks. This is one of the two layers our tests explicitly exercise. After this step, your code should pass the inode level functionality tests. There is one function you must implement:

file_getblock

This function's job is to read the file data block at the specified index. In other words, whereas inode_indexlookup finds which block number contains the data, this function actually reads in that data. This function returns the number of meaningful bytes in a block/sector that contribute to the payload of some file (be it a regular file or a directory). In other words, it returns the number of bytes in that block that are actually used to store data for the file. As an example, if the file size happens to be a multiple of 512, then this function always returns 512. Alternatively, if a file isn’t a multiple of 512 and we have to get the payload stored in the very last block, you should expect to return a number that’s less than 512 (because only a part of what’s stored in the block is relevant).

Testing Recommendation

This function is directly tested by sanity check. If any tests are failing, try running them outside of sanity check for more information. If your inode information isn't matching, revisit your inode_iget function. If your checksums aren't matching, identify the inode that is outputting incorrectly and check the .gold file for more information about it. Pay close attention to whether that inode represents a small or large file; that could give you a hint as to where your logic might be going wrong!

Directory Layer (directory.[ch])

This layer implements looking up a file with a certain name in a directory. There is one function you must implement:

directory_findname

This function's job is to get the directory entry (defined as struct direntv6) from the given directory that matches the given name. As a hint, your implementation should make use of strncmp instead of strcmp, because some of the stored directory names aren’t '\0'-terminated. They are, however, guaranteed to be of length 14 or less. Check out the man page for strncmp and strcmp so you understand the difference.

Testing Recommendation

One way to test this function in isolation is to write temporary testing code in pathname_lookup. Open the .gold output for the test disk (e.g. samples/testdisks/basicDiskImage.gold and choose a file or folder to test with that is directly contained within the root directory (/). For example, for the basic disk image, these would include "bigfile", "medfile", "o", "verybig", "CVS" and "foo". For any of these, add testing code that calls directory_findname to lookup that file name in directory inumber 1 (the root directory). You can inspect that the function fetches the correct struct direntv6 - specifically, the name should match, and the inumber should match the inumber listed in the .gold file (E.g. "Path /verybig 5 mode 0x91ff size 7340032 checksum" means verybig was inode 5). Additionally, you can take the inumber from that direntv6, fetch the inode with inode_iget, and verify its fields match the remainder of the listed .gold output, specifically the mode and size, like in the testing for inode_iget. You can further expand this testing method to test any of the file lookups by hand as long as you identify the correct dirinumber (e.g. in the basic disk image you could try looking up "Repository" within dirinumber 6 (/CVS), "Entries" within dirinumber 12 (/foo/CVS), etc.).

Pathname Layer (pathname.[ch])

The directory layer supports looking up a file in a single directory. This layer's job is to look up a file given its entire absolute path, potentially nested in multiple directories. This is the second of the two layers we explicitly test. Refer to Section 2.5.6 of the Salzer and Kaashoek book for this part. Afterward, all the pathname tests should pass. There is one function you must implement:

pathname_lookup

This function takes in an absolute path referring to a file or directory and returns the inumber of that specified file or directory. The function strsep will come in handy here!

Testing Recommendation

This function is directly tested by sanity check. If any tests are failing, try running them outside of sanity check for more information. If your inode information isn't matching, try tracing the lookup by hand to make sure that the correct step is taken each time (e.g. / to /foo, /foo to /foo/CVS, etc.). The .gold output files show the inumbers for various path depths, so if you are looking up /a/b/c you can check that your lookup for /a, /a/b and /a/b/c are all correct.

Implementation Notes

Saving Space

Back in the 1970s, storage space — both on disk and in main memory — was at a premium. As a result, the Unix v6 filesystem goes to lengths to reduce the size of data it stores. You'll notice that many integer values in the structs we provided are stored using only 16 bits, rather than today's more standard 32 or 64. (In our code we use the type int16_t from stdint.h to get a 16-bit integer, but back in the '70s, the C int type was 16 bits wide.)

In another space-saving move, the designers of the filesystem stored the inode's size field as a 24-bit integer. There's no 24-bit integer type in C, so we represent this value using two fields in the inode struct: i_size1, which contains the least-significant 16 bits of the value, and i_size0, which contains the most-significant 8 bits of the value. We provide a function inode_getsize in inode.c that assembles these two fields into a normal C integer for you.

The First Inode

Since there is no inode with an inumber of 0, the designers of the filesystem decided not to waste the 32 bytes of disk space to store it. The first inode in the first inode block has inumber 1; this inode corresponds to the root directory for the filesystem. (See unixfilesystem.h for details.)

Be careful not to assume that the first inode has an inumber of 0. Off-by-one errors are the worst.

Inode's i_mode

The 16-bit integer i_mode in the inode struct isn't really a number; rather, the individual bits of the field indicate various properties of the inode. ino.h contains #defines which describe what each bit means.

For instance, we say an inode is allocated if it points to an existing file. The most-significant bit (i.e. bit 15) of i_mode indicates whether the inode is allocated or not. So the C expression (i_mode & IALLOC) == 0 is true if the inode is unallocated and false otherwise.

Similarly, bit 12 of i_mode indicates whether the file uses the large file mapping scheme. So if (i_mode & ILARG)!= 0, then the inode's i_addr fields point at indirect and doubly-indirect blocks rather than directly at the data blocks.

Bits 14 and 13 form a 2-bit wide field specifying the type of file. This field is 0 for regular files and 2 (i.e. binary 10, or the constant IFDIR) for directories. So the expression (i_mode & IFMT) == IFDIR is true if the inode is a directory, and false otherwise.

Submitting and Grading

Once you are finished working and have saved all your changes, submit by running the submit tool in the tools folder: ./tools/submit. We recommend you do a trial submit in advance of the deadline to allow time to work through any snags. You may submit as many times as you would like; we will grade the latest submission. Submitting a stable but unpolished/unfinished is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, this previous submit will earn points. Without a submission, we cannot grade your work.

We will grade your assignment based on a combination of test results and code quality.

  • Code test results: 60 points (same tests as sanity check for this assignment)
  • Clean build and clean valgrind reports (eg. no memory leaks or errors): 6 points
  • Code quality: graded on bucket system [exceptional, solid, minor-problems, major-problems, etc.]

We'll run your code on a set of test disk images and check that your program gives the expected output. You have access to all of the images we'll use for grading and the correct output for each of them.

Finally, we'll read through your code and give you comments on style and overall quality. We'll look for places where you inappropriately break the layering of the file system or make other mistakes that don't necessarily cause your program to generate the wrong output. This means that a higher layer should build on the layer below it as much as possible, rather than bypassing it and manually calling functions lower down - though in some places calling a function from a lower down layer may be appropriate. We'll also look for common C programming errors. Your code should not need to use any heap allocation. Finally, we'll check that your code is easy to read: It should be well-formatted, organized, and be easy to follow.

Additional Tips and Tricks

  • Constants: make sure to use provided constants, and define your own constants, where appropriate. There are calculations throughout your code, and your goal should be to make them as easy to follow as possible. For example, use the constants in unixfilesystem.h where appropriate!
  • Declaring Buffers: When creating a buffer to store data, carefully consider the type of the buffer. In places like inode_iget and directory_findname, we know what type of data we are reading in, so you should not use void * and casting, and instead declare the type as specifically as you can. This is a very common style deduction, so consider this carefully!