Assignment 2: Journaling Filesystem and OS Trust

Due: Thu Feb 2 11:59 pm
Late submissions accepted until Sat Feb 4 11:59 pm

Assignment by David Mazières and John Ousterhout, with modifications by Nick Troccoli, Xiyu Zhang and Benji Xie.

Assignment 2 Overview Session: Sat. 1/28 afternoon at 2-2:50PM on Zoom (Zoom link on Canvas). Note that the session is recorded and will be posted to the course website shortly after it’s over. Slides here, video here.

In this project you will (1) implement part of a C++ program to recover filesystem metadata after a crash, (2) explore the structure and mechanisms of logging (journaling), and (3) explore more generally the trust that we place in operating systems (and filesystems in particular). We are using the same Unix V6 filesystem as the previous assignment (which had a block cache), but we've implemented it in C++ and modernized its structure to add a write-ahead metadata log and a free block bitmap (neither of which were present in the original version). Your task is to write part of the code for the apply program (you write a total of roughly 10-15 lines) to replay the log entries when a crash occurs, and also to answer some written short answer questions in readme.txt that explore how filesystems can use logging to recover from crashes, as well as how we rely on OSes and what can happen when assumptions we make about them are broken.

Learning Goals

This assignment focuses on understanding the mechanisms of write-ahead logging, crash recovery , and OS trust. You will be building your skills with:

  • understanding the implementation of a write-ahead log to record metadata operations
  • harnessing the power of a log to replay metadata operations and restore filesystem integrity
  • appreciating the tradeoffs in implementing crash recovery mechanisms
  • navigating a large code base and adding specific features and functions
  • learning about the protections implemented by operating systems and the consequences if these protections are broken
  • considering how we explicitly and implicitly trust operating systems and other large systems, and how this trust affects anyone who uses them.

Starter Code

To get started on the assignment, clone the starter project using the command

git clone /usr/class/cs111/repos/assign2/$USER assign2

This line creates a new folder called assign2 in your current working directory containing a copy of the starter code.

The starter project contains many files, but you do not need to work with most of them. However, there are several files you will need to partially familiarize yourself with, including the following - note that readme.txt and replay.cc are the only two files you need to modify for this assignment:

  • readme.txt: a text file where you will answer questions for the assignment
  • replay.hh and replay.cc: the C++ class you will complete that can execute operations specified in a log. Part of the compiled apply program that can replay a log to repair a disk image.
  • examine: a program you can execute with the -a flag to run your apply program on a given disk image to repair it and see the results. It outputs information about the state of the filesystem after the apply program runs.
  • v6fs.hh: the header for the C++ type that models the Unix v6 filesystem used for this assignment. You will specifically use the Buffer type defined there which represents a single entry in the block cache.
  • bitmap.hh: the header for the C++ type that models the free list bitmap. You will specifically use the public methods defined there to update the freemap.
  • logentry.hh: the header file containing the definitions of different structs, each representing a different kind of log entry you may need to replay.
  • samples: a symbolic link to the shared directory for this assignment, containing sanity check files, a sample solution called examine_soln, and disk images to test with in disk_images.
  • mountv6, dumplog, fsckv6, v6: provided, fully-implemented tools that allow you to, in order: mount a filesystem image to interact with it, dump a filesystem image's log entries, run fsck on the image, and inspect various components of the filesystem state.
  • tools: contains symbolic links to the sanitycheck and submit programs for basic testing, and submitting your work.

Testing

You will write code to implement the apply program that can repair a Unix v6 disk image by replaying its log entries since the last log checkpoint. apply modifies the disk image it is repairing - so to test your code, instead of running apply directly, we've created an examine program that duplicates the disk image you specify, runs apply on that duplicate, and then outputs information about the resulting repaired disk. You can run it like this:

./examine -a samples/disk_images/create-file.img

It will output the disk image state after running your apply program. Specifically, it outputs a list of all used blocks followed by a list of all used inodes. Here is an example:

Reached log end: bad checksum
played log entries 2670558939 to 2670558952
2 used blocks (out of 2)
Block 3:
  0 01002e00 00000000 00000000 00000000  >  .             <
 16 01002e2e 00000000 00000000 00000000  >  ..            <
 32 1000612d 66696c65 2e747874 00000000  >  a-file.txt    <
 48 00000000 00000000 00000000 00000000  >                <
*
Block 4:
  0 54686973 20697320 736f6d65 2066696c  >This is some fil<
 16 6520636f 6e74656e 74732e0a 00000000  >e contents.     <
 32 00000000 00000000 00000000 00000000  >                <
*
2 used inodes (out of 16)
    1 drwxr-xr-x   2   0   0       48 Oct 11 2022 09:58:24  #1
        ino: 1
        i_mode: 0140755
        i_nlink: 2
        i_uid: 0
        i_gid: 0
        size(): 48
        i_addr[0]: 3
        i_addr[1]: 0
        i_addr[2]: 0
        i_addr[3]: 0
        i_addr[4]: 0
        i_addr[5]: 0
        i_addr[6]: 0
        i_addr[7]: 0
        atime: Oct 11 2022 09:58:24
        mtime: Oct 11 2022 09:58:24
   16 -rw-------   1   0   0       28 Oct 11 2022 09:58:24  #16
        ino: 16
        i_mode: 0100600
        i_nlink: 1
        i_uid: 0
        i_gid: 0
        size(): 28
        i_addr[0]: 4
        i_addr[1]: 0
        i_addr[2]: 0
        i_addr[3]: 0
        i_addr[4]: 0
        i_addr[5]: 0
        i_addr[6]: 0
        i_addr[7]: 0
        atime: Oct 11 2022 09:58:24
        mtime: Oct 11 2022 09:58:24
  • The first two lines print info about what was replayed and when it stopped.
  • Then it prints how many blocks are used
  • It prints the contents of each of those blocks in a side-by-side hexadecimal and ASCII (text) format, with 16 bytes per line - the same size as the direntv6 structure. The text version is useful to see the text in directory entries and text files.
  • Then it prints how many inodes are used
  • Lastly it prints each inode, starting with its number and including information like its permissions, last modified time, block numbers, size, etc.

We provide a sample solution version that you can run to see the expected end state of the filesystem, like this:

./samples/examine_soln -a samples/disk_images/create-file.img

Sanity check works by running a series of tests and comparing your program's and the sample solution's outputs.

To run under GDB, you can run it on the apply program directly. But because it modifies the disk image it is repairing, make a copy of it first like this:

cp samples/disk_images/create-file.img .

Then run under GDB like this:

gdb apply
...
(gdb) run create-file.img

There is also a tool dumplog that lets you view the filesystem log. You can run it like this to view all log entries starting at the most recent checkpoint - the "c" specifies that we just want entries from the last checkpoint onwards:

./dumplog samples/disk_images/create-file.img c

For this assignment, all functionality grading test cases are included in sanity check's provided test cases. See the grading section for more info.

Assignment Infrastructure

This assignment takes the Unix V6 filesystem design and adds both a freemap bitmap (so that free list updates can be idempotent - a linked list does not allow this) and a write-ahead log. It uses some cool techniques and approaches to implement this; while not required to understand for the assignment, if you're interested we encourage you to explore the additional provided code files (it's a great example of a professional quality C++ project and it uses many advanced C++ features) and check out this page for more about how this updated design is possible:

Read about assign2 infrastructure design

1. apply

You need to implement the following 3 short methods in replay.cc, each of which replays one type of entry that could appear in the log. Note that these method implementations will be short; a combined 10-15 lines across all 3, with only a handful of lines at most for each:

// Replays a "patch" log entry, which specifies a 
// change to all or part of the contents of a block,
// such as part of an inode or other metadata block.
void apply(const LogPatch &);

// Replays a "block alloc" log entry, which specifies
// that a block should be marked as allocated (and
// possibly zero-ed out).
void apply(const LogBlockAlloc &);

// Replays a "block free" log entry, which specifies
// that a block should be marked as free.
void apply(const LogBlockFree &);

These methods are called by the starter code, which handles all the logic for reading in the log and determining which part of the log should be replayed. Each of these methods takes in a different struct that contains information about the operation that should be performed. There are some other log entry types - for example, transactions (groups of entries that should be applied atomically) each begin with a LogBegin and end with a LogCommit. But those don't actually require any updates to "replay", and the starter project ensures that operations grouped in a transaction are replayed in their entirety or not at all (e.g. if only half of a transaction was written to disk, and there is no ending LogCommit).

We recommend implementing them in the following order:

LogPatch

This log entry specifies a change to a portion of metadata on disk. This is used for operations like updating an inode's stored block numbers, updating an inode's last modified time, updating an indirect block's block numbers, updating directory entries, etc. This entry contains specific bytes that must be written to a specific disk block at a specific position in that block. logentry.hh contains the full definition of the LogPatch type passed in as a parameter; here is a snippet:

struct LogPatch {
    uint16_t blockno;           // Block number to patch
    uint16_t offset_in_block;   // Offset within block of patch
    std::vector<uint8_t> bytes; // Bytes to place into the block
};

Diagram by John Ousterhout

On the previous assignment, for this task you might have used diskimg_readsector and diskimg_writesector to modify a certain portion of a certain block. For this assignment, however, we have the block cache and need to interface with that instead. You have access to an fs_ instance variable that gives you access to the block cache. There are two relevant methods you can call on it:

/* Gets a block, and returns the contents
 * in a buffer.  You can read its contents or write
 * to it and those changes will be made on disk.
 */
Ref<Buffer> bread(uint16_t blockno);

/* Returns a buffer for a particular block, but
 * does not actually read it from disk.  You therefore
 * cannot read its contents, but you can write a new
 * 512 bytes to it and those changes will be made on disk.
 * This is useful/more efficient than bread if you are
 * going to overwrite an entire block's contents, as
 * there's no need to read in the existing contents.
 */
Ref<Buffer> bget(uint16_t blockno);

One of the most common (and confusing) mistakes is using bget when bread is needed; this results in the sudden appearance of garbage in metadata. Make sure to use the correct versions where needed!

These methods return a reference to a buffer Ref<Buffer> (a special type that helps with the block cache implementation by tracking how many Refs there are for a cache block, so the cache won't evict a block while it's referenced), but it can be treated just like a Buffer *. A Buffer represents a single block in the cache, and it has the following relevant fields:

struct Buffer {
    char mem_[SECTOR_SIZE];
    void bdwrite();
    ...
};

mem_ is the actual contents of the block, which you can modify - it is 512 bytes big. If you modify mem_, you must then call bdwrite() to tell the cache that you have modified it and it should at some point be written back to disk.

Your task for this method is to perform the operation specified in the LogPatch struct.

Hint #1: Check out the std::vector data() method, which gives you the raw byte representation of the vector contents.

Hint #2: Note that the -> operator is the same as dereference + dot. In other words, data->field is the same as (*data).field.

Once you complete this method, no sanity check tests will pass yet - you will need to implement one more function before passing any new tests.

LogBlockFree

This log entry specifies that a specific block number which was previously allocated should now be marked free. logentry.hh contains the full definition of the LogBlockFree type passed in as a parameter; here is a snippet:

struct LogBlockFree {
    uint16_t blockno;           // Block number of freed block
};

To make this change, you need to update the freemap to reflect that this block is now free. You have access to a freemap_ instance variable that represents the freemap. Normally you might think to use bit operations to manipulate individual bits in the freemap, but this implementation provides an .at method that gets a reference to a bit at a given index and also lets you modify it. For instance, you could say:

if (freemap_.at(blockNum)) {
  // blockNum is free
}

...

// mark block as allocated
freemap_.at(blockNum) = false;

Once you complete this method, along with LogPatch, you should now pass test #2, which must fix a deleted directory. If the filesystem still has lingering files that should have been deleted, or if it is missing directories like "." or "..", make sure your buffer cache writing is correct and written to disk. If the output says "free list was incorrect", double check how you are updating the free list.

LogBlockAlloc

This log entry specifies that a specific block number which was previously free should now be marked allocated, and potentially zeroed-out. logentry.hh contains the full definition of the LogBlockAlloc type passed in as a parameter; here is a snippet:

struct LogBlockAlloc {
    uint16_t blockno;       // Block number that was allocated
    uint8_t zero_on_replay; // Metadata--should zero out block on replay
};

In addition to updating the free list, one other requirement is that if zero_on_replay is is nonzero, then you must also set the contents of that block to be all zeroes.

The reason for the possible zeroing is because new metadata blocks should always be zeroed out when they are allocated, but new data blocks (actual file data) should not be (and this log entry is used for both categories of operations). For instance, for metadata blocks all indirect block pointers in an indirect block will be initialized to zero, all directory entries will be initialized to zero, etc. (If the block needs to contain any nonzero information, an additional LogPatch entry in the log would specify that). However, data blocks should not be zeroed out as this may cause data loss. The reason is that data block content changes are not logged, so if we zero one out it will not be restored by later log entries, thus potentially losing data. (This does mean there's a chance that data blocks coud inherit random garbage data, but this will not cause file system inconsistency).

As a hint, consider using memset to help zero out the contents.

If you complete just the free list updating portion of this method, test #3, which must fix a created file, should pass. Once you completely implement this method, including possible block zeroing, all tests should pass -woohoo! If test #3 does not pass due to the free list being incorrect, double check how you are updating the free list.

2. Crash Recovery Exploration

Note: this part is separate from the code you wrote in part 1, but requires understanding the different types of log entries as described in part 1.

Along with the apply program and the infrastructure to build and test log replay, this assignment also comes with tools to play around with simulated filesystems, corrupt them, look at their logs, and inspect the filesystem state. These tools are the same ones you saw in section2, and aren't required to complete part 1, but can be a lot of fun to play around with and also to better understand the mechanisms of crash recovery.

For the second part of the assignment, your task is to answer 4 short readme questions in your readme.txt file that ask about your findings from some short filesystem exploration with a disk image we have provided, samples/disk_images/corrupted-image.img. This filesystem was intended to have the following structure. However, in the middle of executing the user's command to create files 1-100, it crashed suddenly right before making the files 98-100:

.
├── dir1/
│   └── hello.txt
└── dir2/
    ├── 1
    ├── 2
    ├── 3
    ├── 4
    ├── 5
    ...
    ├── 98
    ├── 99
    └── 100

Step 1: Examining the log

Let's first examine the log for this corrupted disk image. The dumplog tool lets you see the filesystem log - you can run it like this to view all log entries starting at the most recent checkpoint, shortened to just the first 150 lines - the "c" specifies that we just want entries from the last checkpoint onwards, and head -150 takes just the first 150 lines:

./dumplog samples/disk_images/corrupted-image.img c | head -150

The log shows the same log entry types as described in part 1 - some entries are grouped within a LogBegin and LogCommit message, which indicates they are part of a single transaction that should be completed atomically. Some log entries are also labeled with what they are updating (e.g. inode i_mtime field, which is last modified time, inode (i_size0, i_size1) fields, which store file size, individual block numbers, etc.).

Examine the log entries in the first transaction (between the first LogBegin and LogCommit - the first 18 log entries total) and answer the following questions:

Readme Q1: The 4th log entry is a LogPatch for block 1026. Assuming block 1026 is the first payload block for the root directory, what operation does this log entry represent?

Readme Q2: Inode #101 is a newly-created directory. What block is marked as allocated to be used to store its directory entries?

Step 2: fsck to the rescue?

Let's give fsck a try in repairing this disk. We supply you with a version of fsck for this filesystem called fsckv6 that can restore a disk image to a consistent state, but often with heavy data loss. If there have been a lot of metadata changes, you will likely lose files. fsckv6 does not implement a lost+found mechanism but simply throws away unreachable files. In the presence of cross-allocation, it somewhat arbitrarily punches holes in files to fix the cross-allocation (without even checking which file has a later modification time, which might lessen the damage).

Since we are going to modify the disk image, make your own copy of it in your assign2 folder like this:

cp samples/disk_images/corrupted-image.img .

Now run fsck on it (without the -y flag, it just tells you what's wrong, without fixing any issues):

./fsckv6 -y corrupted-image.img

Now let's mount this disk image to see what it looks like after repair. There is a provided program mountv6 that lets you interact with a Unix v6 filesystem image as though it were a real filesystem on your computer. You can create an empty directory and run the provided mountv6 tool to mount a Unix v6 disk image in that folder. What this means is that while mountv6 is running, the root of the file system image will appear at this directory location, and you can traverse it, view/edit files, etc. as if it were a real filesystem on your computer. Pretty cool!

To do this, first make an empty folder mnt:

mkdir mnt

Notice that this is empty - for now! Once we run mountv6, the contents of the Unix v6 filesystem image will appear in that folder. Try it now (note: the & runs this command in the background - the filesystem is only mounted while this command is running, so we use & to run it in the background so we can use our shell for other things):

./mountv6 corrupted-image.img mnt &

Now cd into mnt and see what's there. (don't get your hopes up too much...)

Once you're done, navigate out of the mnt directory and run the following command to terminate the filesystem mount:

fusermount -u mnt

Readme Q3: What does the disk image look like after fsck runs? Why might that be?

Step 3: Journaling for the (partial) win

Now let's give apply a try in repairing this disk. Make a fresh copy of corrupted-image like this (you can overwrite the old one):

cp samples/disk_images/corrupted-image.img .

Now run apply_soln on it - this is the sample solution program that repairs the disk by replaying the log:

./samples/apply_soln corrupted-image.img

Now mount the disk image just like in step 2 to see what it looks like after repair:

./mountv6 corrupted-image.img mnt &

cd into mnt and see what's there. Note the intended structure of the filesystem - dir1 was supposed to store hello.txt (which contained the text "hello, world!") and dir2 was supposed to contain files named 1-97, all empty (they intended for 100 files, but it crashed before making file 98). (Hint: to sort files numerically, try ls -v).

Once you're done, navigate out of the mnt directory and run the following command to terminate the filesystem mount:

fusermount -u mnt

Readme Q4: What does the disk image look like after apply_soln runs, and how does this compare to what was recoverable with fsck? Are any files/directories missing? Why is the contents of hello.txt not recoverable?

3. Operating System Trust

For the last part of the assignment, we'll kick off our exploration of ethics in operating systems and systems software this quarter. File systems have been a great first example of one of the many critical pieces of functionality operating systems provide. Operating systems are large, complex systems that are built upon by many developers and used by a wide variety of people for long periods of time. This means that many people put their trust in operating systems and expect them to be reliable, secure and correct, and when this is not true, the impacts can be extremely far-reaching. In this assignment, we’ll start to examine this idea of operating system trust, the assumptions we make about them, and what happens when these are not upheld.

Elevated Privileges

We saw in lecture how the operating system runs in a privileged "kernel mode" where it is allowed to do things that regular user programs cannot do. This is to ensure that the OS handles important responsibilities itself, such as reading data from disk or reading the filesystem log, and prevents user programs from accessing things they shouldn't.

Readme Q5: Imagine that a user using the Unix v6 filesystem creates a folder in the root directory with an unknown number of private files in it. For each of the following scenarios, explain in 1-3 sentences how a malicious program could take advantage of the stated functionality to learn the names of the files contained in this private folder:

a) A user program discovered a way to directly call the readSector function, which is normally private to the operating system.

b) A user program discovered a way to directly access the filesystem log, which logs filesystem metadata operations in this case and is normally private to the operating system.

Trust

Much of our use of operating systems today is built on the trust that users have in operating systems and their security / reliability. For example, all CS111 students (and CS107 students, too!) share usage of the myth machines, which run Linux, and include a shared networked filesystem called AFS ("Andrew File System"). A key part of users' trust in filesystems relates to filesystem permissions; permissions dictate, for a given file, who can access it, and in what way (e.g. read, write, execute). Permissions are critical particularly for multi-user systems like the myth machines where many users share a filesystem. For instance, on the myth machines, by default you cannot read the contents of other users’ files.

Imagine that a bug was disclosed in the OS used on the myth machines (Ubuntu) that impacted the integrity of filesystem permissions - meaning that filesystem permissions may not function in the way they are expected to.

Readme Q6: In 2-4 sentences, explain whether or not this discovery would cause you to change your user behavior in how you use the myth machines. If it would, give an example of one way your behavior would change as a result of this discovery. If it would not, explain why not.

OS Longevity

Operating systems and filesystems have a very long lifespan. For instance, the current Windows filesystem, NTFS, was introduced in 1993. Once a system is adopted, it can be challenging to transition users to newer systems.

Though many operating systems are supported for a reasonably long time, there is no requirement for operating system makers to provide a minimum duration of software and security support - a “minimum support period". And if users continue to use operating systems after they are no longer supported, they can be at risk for security vulnerabilities and other problems. For example, the British, and Dutch governments paid Microsoft to continue supporting Windows XP beyond its retirement in 2014, and the US government also struggled in 2014 to transition from Windows XP to a more modern operating system version.

Readme Q7: In 1-3 sentences each, provide one argument in favor of requiring minimum support periods for operating systems, and one argument against requiring minimum support periods for operating systems.

Submitting and Grading

Once you are finished working and have saved all your changes, submit by running the submit tool in the tools folder: ./tools/submit. We recommend you do a trial submit in advance of the deadline to allow time to work through any snags. You may submit as many times as you would like; we will grade the latest submission. Submitting a stable but unpolished/unfinished is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, this previous submit will earn points. Without a submission, we cannot grade your work. You can confirm the timestamp of your latest submission in your course gradebook.

We will grade your assignment with the following criteria:

  • Code test results: 40 points (10 points per test) (same tests as sanity check for this assignment) We'll run your code on a set of test disk images and check that your program gives the expected output. You have access to all of the images we'll use for grading and a sample solution to view the correct output for each of them.
  • Code style: we will manually review your code and give feedback on coding style (e.g. use of appropriate functions)
  • Readme: 28 points (4 points per question)