Assignment 2: Journaling File System and OS Trust

NOTE: this website is out of date. This is the course web site from a past quarter, Fall 2023. If you are a current student taking the course, you should visit the current class web site instead. If the current website is not yet visible by going to cs111.stanford.edu, it may be accessible by visiting this link until the new page is mounted at this address. Please be advised that courses' policies change with each new quarter and instructor, and any information on this out-of-date page may not apply to you.

Due: Wed Oct 18 11:59 pm
Late submissions accepted until Fri Oct 20 11:59 pm

Assignment by David Mazières, with modifications by John Ousterhout, Nick Troccoli, Xiyu Zhang and Benji Xie.

In this project you will (1) implement part of a C++ program recover to recover filesystem metadata after a crash (you write a total of roughly 10-15 lines to replay the log entries when a crash occurs), (2) explore the structure and mechanisms of logging (journaling),and (3) explore more generally the trust that we place in operating systems (and filesystems in particular). We will be using the Unix V6 file system format from 1975 that you learned about in the previous assignment, but bringing it (at least partially) into the 21st century by journaling all metadata updates to a write-ahead log and tracking free blocks in a bitmap instead of a linked list.

Here are the learning goals for this assignment:

  • Learn about write-ahead logging, which is an important concept in file systems, databases, and many flash devices.
  • Learn how to repair inconsistencies by replaying log entries.
  • Better understand the tradeoffs in implementing crash recovery mechanisms
  • Navigate a large code base and add specific features and functions
  • Learn about the protections implemented by operating systems and the consequences if these protections are broken
  • Consider the ethical issues around long-term support (or lack thereof) for operating systems.

You can do each of the 3 parts of this assignment independently.

Assignment Infrastructure

This assignment takes the Unix V6 filesystem design and adds both a freemap bitmap (so that free list updates can be idempotent - a linked list does not allow this) and a write-ahead log. It also uses something called FUSE (File system in USErspace) to allow us to play around with live instances of this filesystem.

The C++ implementation we've provided uses some cool techniques and approaches; while not required to understand for the assignment, if you're interested we encourage you to explore the additional provided code files (it's a great example of a professional quality C++ project and it uses many advanced C++ features) and check out this page for more about how this updated design is possible:

Read about assign2 infrastructure design

Starter Code

Login to the myth cluster and clone the starter repo with this command:

git clone /usr/class/cs111/repos/assign2/$USER assign2

This will create a new directory assign2 in your current directory and clone a Git starter repository into that directory. Do your work for the assignment in this directory. The starter project contains many files, but you only need to modify replay.cc and questions.txt, though you will need to be familiar with a few more. In total, they are listed below:

  • questions.txt: a text file where you will answer questions for the assignment
  • replay.hh and replay.cc: the C++ class you will complete that can execute operations specified in a log. Part of the compiled recover program that can replay a log to repair a disk image.
  • examine: a program you can execute with the -r flag to run your recover program on a given disk image to repair it and see the results. It outputs information about the state of the filesystem after the recover program runs.
  • v6fs.hh: the header for the C++ type that models the Unix v6 filesystem used for this assignment. You will specifically use the Buffer type defined there which represents a single entry in the block cache.
  • bitmap.hh: the header for the C++ type that models the free list bitmap. You will specifically use the public methods defined there to update the freemap.
  • logentry.hh: the header file containing the definitions of different structs, each representing a different kind of log entry you may need to replay.
  • samples: a symbolic link to the shared directory for this assignment, containing sanity check files, a sample solution called examine_soln, and disk images to test with in disk_images.
  • mountv6, dumplog, fsckv6, v6: provided, fully-implemented tools that allow you to, in order: mount a filesystem image to interact with it, dump a filesystem image's log entries, run fsck on the image, and inspect various components of the filesystem state.
  • tools: contains symbolic links to the sanitycheck and submit programs for basic testing, and submitting your work.

Exercise #1: Experiments with Crash Recovery

In this exercise you will intentionally crash a V6 file system, leaving its on-disk state inconsistent. Then you will recover from the crash in two different ways, once with fsck and once with the log, compare the results, and answer some questions. To do this, you will use the same tools that you saw in section2.

Step 1: Crashing the file system

First, let's crash a file system. Working in the top-level directory for the assignment, we will create an empty disk image with mkfsv6 and mount it with mountv6, just like you did in section. To do this, we'll use the provided mnt folder, though this works with any folder you may create.

Notice that mnt is empty except for a readme file - for now! Once we run mountv6, the contents of the Unix v6 filesystem image will replace what is in that folder. Try it now (note: the & runs this command in the background - the filesystem is only mounted while this command is running, so we use & to run it in the background so we can use our shell for other things):

./mkfsv6 crash.img
./mountv6 -j --crash 935 crash.img mnt &

The --crash argument tells the file system driver that it should crash itself (exit without flushing dirty cache blocks) after 935 writes to the disk image file. Now type the following commands:

cd mnt
../create_files.py

The create_files.py script tries to create a large number of files with names file0, file1, and so on. Each file contains a single line of the form

This line belongs to file 101

The script groups the files in subdirectories, with 10 files in each subdirectory. The script will generate enough data to cause the file system to reach its --crash limit; when this happens the mountv6 process will exit and the file system will be unmounted, so create_files.py will exit with an error message. It should look exactly like the following, including the matching name of the file that was about to be created when it crashed - 795.

Crashing because of --crash option
Couldn't create file dir79/file795: Software caused connection abort

NOTE: If you're seeing file796 instead of file795, try deleting crash.img and following the steps again, but run the commands promptly one after the other (it turns out that a delay of ~30 seconds between mounting and running create_files.py causes the file it crashes on to be different).

Once you confirm that the file system has crashed identically to this output, exit mnt with cd .. and move on to the next section. We are going to recover the crashed file system twice: once with fsck and once using the file system's log.

Step 2: fsck to the rescue?

Let's give fsck a try in repairing this disk. We supply you with a version of fsck for this filesystem called fsckv6 that can restore a disk image to a consistent state, but often with heavy data loss. Since we are going to modify the disk image, make a copy of it like this:

cp crash.img crash2.img

Now run fsck on it:

./fsckv6 -y crash.img

The -y switch tells fsckv6 to print out information about any inconsistencies, and also repair the disk; without that switch it will print information about inconsistencies without actually repairing them. This version of fsck does not implement the lost+found recovery described in lecture: if it finds an inode that is allocated but there are no directory entries pointing to the inode, it just deletes the inode's file (this results in "freeing unreachable inode" messages in the output). Once you have run fsckv6, answer Questions 2A and 2B in questions.txt.

Next, remount the recovered image and explore it with the following commands:

./mountv6 -j crash.img mnt &
cd mnt
../check_files.py

The check_files.py script will scan over all of the files in the directory; its output indicates how many files it found, plus any errors it found in the files (such as a file whose contents are entirely zero, or a directory with no files in it). Once you have run check_files.py, answer Questions 2C, 2D, and 2E in questions.txt.

Once you're done, navigate out of the mnt directory and terminate the filesystem mount to make mnt go back to its original state:

cd ..
fusermount -u mnt

Step 2: Examining the log

Let's now examine the log for this corrupted disk image before trying to recover it by replaying the log. The dumplog tool lets you see the filesystem log - you can run it like this to view all log entries starting at the most recent checkpoint, shortened to just the first 150 lines - the "c" specifies that we just want entries from the last checkpoint onwards, and head -150 takes just the first 150 lines:

./dumplog crash2.img c | head -150

The log shows the same log entry types as described in section2 - some entries are grouped within a LogBegin and LogCommit message, which indicates they are part of a single transaction that should be completed atomically. Some log entries are also labeled with what they are updating (e.g. inode i_mtime field, which is last modified time, inode (i_size0, i_size1) fields, which store file size, individual block numbers, etc.).

Examine the log entries in the first transaction (from the first LogBegin to LogCommit - the first 18 log entries total) and answer Questions 2F and 2G in questions.txt.

Step 3: Journaling for the (partial) win

Now recover the second copy of the crashed image using log-based recovery:

samples/recover_soln crash2.img

samples/recover_soln is the sample solution program that repairs the disk by replaying the log (which you will complete the implementation of yourself later on this assignment!). Now mount the disk image just like in the previous steps to see what it looks like after repair:

./mountv6 -j crash2.img mnt &
cd mnt
../check_files.py

Then answer Questions 2H, 2I, and 2J in questions.txt.

Once you're done, navigate out of the mnt directory and terminate the filesystem mount:

cd ..
fusermount -u mnt

Exercise #2: recover

In this exercise you will write code that replays individual log entries to restore consistency to a disk image. Your code will be in the file replay.cc; this file will be compiled with additional code we've written to form the recover executable. You need to implement the following 3 short methods in replay.cc, each of which replays one type of entry that could appear in the log. Note that these method implementations will be short; a combined 10-15 lines across all 3, with only a handful of lines at most for each:

// Replays a "patch" log entry, which specifies a 
// change to all or part of the contents of a block,
// such as part of an inode or other metadata block.
void apply(const LogPatch &);

// Replays a "block alloc" log entry, which specifies
// that a block should be marked as allocated (and
// possibly zero-ed out).
void apply(const LogBlockAlloc &);

// Replays a "block free" log entry, which specifies
// that a block should be marked as free.
void apply(const LogBlockFree &);

These methods are called by the starter code, which handles all the logic for reading in the log and determining which part of the log should be replayed. Each of these methods takes in a different struct that contains information about the operation that should be performed. There are some other log entry types - for example, transactions (groups of entries that should be applied atomically) each begin with a LogBegin and end with a LogCommit. But those don't actually require any updates to "replay", and the starter project ensures that operations grouped in a transaction are replayed in their entirety or not at all (e.g. if only half of a transaction was written to disk, and there is no ending LogCommit).

Testing

You will write code to implement the recover program that can repair a Unix v6 disk image by replaying its log entries since the last log checkpoint. recover modifies the disk image it is repairing - so to test your code, instead of running recover directly, we've created an examine program that duplicates the disk image you specify, runs recover on that duplicate, and then outputs information about the resulting repaired disk. You can run it like this:

./examine -r samples/disk_images/create-file.img

It will output the disk image state after running your recover program. Specifically, it outputs a list of all used blocks followed by a list of all used inodes. Here is an example:

Reached log end: bad checksum
played log entries 2670558939 to 2670558952
2 used blocks (out of 2)
Block 3:
  0 01002e00 00000000 00000000 00000000  >  .             <
 16 01002e2e 00000000 00000000 00000000  >  ..            <
 32 1000612d 66696c65 2e747874 00000000  >  a-file.txt    <
 48 00000000 00000000 00000000 00000000  >                <
*
Block 4:
  0 54686973 20697320 736f6d65 2066696c  >This is some fil<
 16 6520636f 6e74656e 74732e0a 00000000  >e contents.     <
 32 00000000 00000000 00000000 00000000  >                <
*
2 used inodes (out of 16)
    1 drwxr-xr-x   2   0   0       48 Oct 11 2022 09:58:24  #1
        ino: 1
        i_mode: 0140755
        i_nlink: 2
        i_uid: 0
        i_gid: 0
        size(): 48
        i_addr[0]: 3
        i_addr[1]: 0
        i_addr[2]: 0
        i_addr[3]: 0
        i_addr[4]: 0
        i_addr[5]: 0
        i_addr[6]: 0
        i_addr[7]: 0
        atime: Oct 11 2022 09:58:24
        mtime: Oct 11 2022 09:58:24
   16 -rw-------   1   0   0       28 Oct 11 2022 09:58:24  #16
        ino: 16
        i_mode: 0100600
        i_nlink: 1
        i_uid: 0
        i_gid: 0
        size(): 28
        i_addr[0]: 4
        i_addr[1]: 0
        i_addr[2]: 0
        i_addr[3]: 0
        i_addr[4]: 0
        i_addr[5]: 0
        i_addr[6]: 0
        i_addr[7]: 0
        atime: Oct 11 2022 09:58:24
        mtime: Oct 11 2022 09:58:24

The output from examine contains the following information:

  • The first two lines print info about what was replayed and when it stopped.
  • The next line prints how many disk blocks are in use.
  • The next lines display the contents of each of those blocks in a side-by-side format showing hexadecimal on the left and the corresponding ASCII text on the right, with 16 bytes per line -- the same size as the direntv6 structure. The text version is useful to see the text in directory entries and text files, while the hexadecimal version is useful for block numbers and other information.
  • Next, it prints how many inodes are used; this is followed by the contents of each inode, starting with its number and including information like its permissions, last modified time, block numbers, size, etc.

You can also run samples/examine_soln on a disk image; this program will use our sample solution for recover instead of your code, so you can compare its output with the output generated by ./examine.

To run under GDB, you can run it on the recover program directly. But because it modifies the disk image it is repairing, make a copy of it first like this:

cp samples/disk_images/create-file.img .

Then run under GDB like this:

gdb recover
...
(gdb) run create-file.img

There is also a tool dumplog that lets you view the filesystem log. You can run it like this to view all log entries starting at the most recent checkpoint - the "c" specifies that we just want entries from the last checkpoint onwards:

./dumplog samples/disk_images/create-file.img c

For this assignment, all functionality grading test cases are included in sanity check's provided test cases. See the grading section for more info.

Essential Utility Modules

In order to implement log entry replay you will need to interact with the file system buffer cache and the freemap.

Buffer cache interface

In the previous assignment you used diskimg_readsector to read blocks from the disk. For this assignment, however, we have a block cache, so you will need to interface with that instead. The methods you will write all have access to an instance variable V6FS &fs_, which you will use for file I/O. The main methods on this variable (defined in v6fs.h) are:

struct V6FS {
    ...

    /* reads a block from disk, and returns the contents
     * in a buffer in the file system block cache.  
     * You can read its contents or write
     * to it and those changes will be made on disk.
     */
    Ref<Buffer> bread(uint16_t blockno);

    /* Returns a buffer for a particular block, but
     * doesn't actually read it from disk.  You therefore
     * cannot read its contents, but you can write a new
     * 512 bytes to it and those changes will be made on disk.
     * This is useful/more efficient than bread if you are
     * going to overwrite an entire block's contents, as
     * there's no need to read in the existing contents.
     */
    Ref<Buffer> bget(uint16_t blockno);

    ...
};

You will need to use both bread and bget in your solution, and it's very important to understand the difference between them. One of the most common (and confusing) mistakes is using bget when bread is needed; this results in the sudden appearance of garbage in metadata.

Both of the above methods return a Ref<Buffer>. You can treat a Ref<Buffer> as if it were a Buffer*, but it's actually a smart pointer that keeps a reference count of how many Ref's exist for a block in the cache. The cache won't evict a block while there exist Ref's for it. A Buffer contains the cached contents of a disk block, and it has the following relevant fields:

struct Buffer {
    ...
    char mem_[SECTOR_SIZE];
    void bdwrite();
    ...
};

mem_ is the actual contents of the block, which you can modify - it is 512 bytes big. If you modify mem_, you must then call bdwrite() to tell the cache that you have modified it and it should at some point be written back to disk.

Bitmap interface

The freemap is a bitmap stored in the freemap_ instance variable of V6Replay. The freemap is compact enough to store the entire map in memory. It is read from disk for you in the constructor V6Replay::V6Replay, and written back to disk for you at the end of the V6Replay::replay method. Normally you might think to use bit operations to manipulate individual bits in the freemap, but this implementation provides an .at method that gets a reference to a bit at a given index and also lets you modify it. For instance, you could say:

if (freemap_.at(blockNum)) {
  // blockNum is free
}

...

// mark block as allocated
freemap_.at(blockNum) = false;

We recommend implementing the log entries in the following order below.

LogPatch

This log entry specifies a change to portion of metadata on disk. This is used for operations like updating the block numbers stored in an inode or indirect block, updating an inode's last modified time, or updating directory entries. It contains specific bytes that must be written to the disk at position offset_in_block of block blockno; the length of bytes determines how many bytes are overwritten. logentry.hh contains the full definition of the LogPatch type passed in as a parameter; here is a snippet:

struct LogPatch {
    uint16_t blockno;           // Block number to patch
    uint16_t offset_in_block;   // Offset within block of patch
    std::vector<uint8_t> bytes; // Bytes to place into the block
};

Diagram by John Ousterhout

Note: the interface for reading and writing disk blocks is different in this assignment than the previous one; see Essential Utility Modules for more information.

Hint #1: Check out the std::vector data() method, which gives you the raw byte representation of the vector contents.

Hint #2: Note that the -> operator is the same as dereference + dot. In other words, data->field is the same as (*data).field.

Once you complete this method, no sanity check tests will pass yet - you will need to implement one more function before passing any new tests.

LogBlockFree

This log entry specifies that block number blockno, which was previously allocated, should be marked free (1 or true in the freemap). logentry.hh contains the full definition of the LogBlockFree type passed in as a parameter; here is a snippet:

struct LogBlockFree {
    uint16_t blockno;           // Block number of freed block
};

See Essential Utility Modules for information on how to manipulate the freemap.

Once you complete this method, along with LogPatch, you should now pass test #2, which must fix a deleted directory. If the filesystem still has lingering files that should have been deleted, or if it is missing directories like "." or "..", make sure your buffer cache writing is correct and written to disk. If the output says "free list was incorrect", double check how you are updating the free list.

LogBlockAlloc

This log entry specifies that block number blockno, which was previously free, should now be marked as allocated (0 or false in the freemap). In addition, if zero_on_replay is non-zero then the contents of the block must be cleared to zeroes. If zero_on_replay is zero, then the contents of the block must be left as-is. logentry.hh contains the full definition of the LogBlockAlloc type passed in as a parameter; here is a snippet:

struct LogBlockAlloc {
    uint16_t blockno;       // Block number that was allocated
    uint8_t zero_on_replay; // Metadata--should zero out block on replay
};

The zero_on_replay field is needed so that metadata and data blocks can be handled differently. It will be set whenever the allocated block is going to be used for metadata — i.e., as an indirect (or double-indirect) block, or for a directory. This is needed because new metadata blocks are always zeroed out when they are allocated during normal operation (e.g., all indirect block pointers in an indirect block will start off as zero, as will all entries in a new directory block). However, there is no guarantee that this zeroed-out block reached disk before the crash. If the block isn't cleared during crash recovery, it could contain arbitrary garbage that causes the file system to misbehave. If the block needs to contain any nonzero information, that information will be provided by subsequent LogPatch log entries.

For data blocks, it would be safe to zero out the block during crash recovery. However, this could lose valid data. The reason is that data block updates are not logged. If the data block was actually written safely to disk before the crash, zeroing the block will lose those contents. So, we don't clear data blocks during log replay; of course, this means that the file could inherit random garbage data, but this will not cause the file system to misbehave (presumably users will notice the garbage data and figure out what to do with it).

You may notice that whenever the log contains a LogBlockAlloc entry, there's also a LogPatch entry setting a pointer to that block in the same transaction. You may also find multiple LogBlockAlloc entries in the same transaction (one to allocate an indirect block, and another to allocate a data block, whose disk address will be stored in the indirect block using a LogPatch entry).

As a hint, consider using memset to help zero out the contents.

If you complete just the free list updating portion of this method, test #3, which must fix a created file, should pass. Once you completely implement this method, including possible block zeroing, all tests should pass -woohoo! If test #3 does not pass due to the free list being incorrect, double check how you are updating the free list.

3. Long-Term Support and Trust

For the last part of the assignment, we'll kick off our exploration of ethics in operating systems and systems software this quarter. File systems have been a great first example of one of the many critical pieces of functionality operating systems provide. Operating systems are large, complex systems that are built upon by many developers and used by a wide variety of people for long periods of time. This means that many people put their trust in operating systems and expect them to be reliable, secure and correct, and when this is not true, the impacts can be extremely far-reaching. In this assignment, we’ll start to examine this idea of operating system trust, the assumptions we make about them, and what happens when these are not upheld.

Elevated Privileges

We saw in lecture how the operating system runs in a privileged "kernel mode" where it is allowed to do things that regular user programs cannot do. This is to ensure that the OS handles important responsibilities itself, such as reading data from disk or reading the filesystem log, and prevents user programs from accessing things they shouldn't.

Imagine that a user using the Unix v6 filesystem just now created a folder called secret in the root directory with an unknown number of private files in it. Answer question 4A in questions.txt.

Trust

Much of our use of operating systems today is built on the trust that users have in operating systems and their security / reliability. For example, all CS111 students (and CS107 students, too!) share usage of the myth machines, which run Linux, and include a shared networked filesystem called AFS ("Andrew File System"). A key part of users' trust in filesystems relates to filesystem permissions; permissions dictate, for a given file, who can access it, and in what way (e.g. read, write, execute). Permissions are critical particularly for multi-user systems like the myth machines where many users share a filesystem. For instance, on the myth machines, by default you cannot read the contents of other users’ files.

Imagine that a bug was disclosed in the OS used on the myth machines (Ubuntu) that impacted the integrity of filesystem permissions - meaning that filesystem permissions may not function in the way they are expected to. Answer question 4B in questions.txt.

OS Longevity

Operating systems and filesystems have a very long lifespan. For instance, the current Windows filesystem, NTFS, was introduced in 1993. Once a system is adopted, it can be challenging to transition users to newer systems.

Although there is no requirement for operating systems to have a minimum support period, most organizations that develop operating systems provide support for around 10 years. This is typically enough for most users, but not all. For example, critical government infrastructure in the US, UK, and Netherlands relied on Windows XP (which debuted in 2001) through at least 2014.

Consider the following questions relating to OS minimum support periods and trust:

  • What is one argument for strengthening requirements for a minimum support period? (e.g., requiring it by law)?
  • What is one argument against requiring a minimum support period?

Once you have considered these questions, answer Questions 4C-4D in questions.txt.

Here are some other articles related to operating system support, in case you are interested:

Submitting and Grading

Once you are finished working and have saved all your changes, submit by running the submit tool in the tools folder: ./tools/submit. We recommend you do a trial submit in advance of the deadline to allow time to work through any snags. You may submit as many times as you would like; we will grade the latest submission. Submitting a stable but unpolished/unfinished is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, this previous submit will earn points. Without a submission, we cannot grade your work. You can confirm the timestamp of your latest submission in your course gradebook.

Here is a recap of the work that will be graded on this assignment (point amounts subject to slight change):

  • replay.cc: 40 points (10 points per test) (same tests as sanity check for this assignment) We'll run your code on a set of test disk images and check that your program gives the expected output. You have access to all of the images we'll use for grading and a sample solution to view the correct output for each of them.
  • Code style: we will manually review your code and give feedback on coding style (e.g. use of appropriate functions)
  • questions.txt: about 4 points per question (~56 points total)