Assignment 2: Journaling File System and OS Trust

Due: Wed Jan 28 11:59 pm
Late submissions accepted until Fri Jan 30 11:59 pm

Assignment and V6 FUSE implementation by David Mazières, with modifications by John Ousterhout, Nick Troccoli, Xiyu Zhang and Benji Xie.

In this project you will (1) implement part of a C++ program recover to recover filesystem metadata after a crash (you write a total of roughly 10-15 lines to replay the log entries when a crash occurs), (2) explore the structure and mechanisms of logging (journaling),and (3) explore more generally the trust that we place in operating systems (and filesystems in particular). We will be using the Unix V6 file system format from 1975 that you learned about in the previous assignment, but bringing it (at least partially) into the 21st century by journaling all metadata updates to a write-ahead log and tracking free blocks in a bitmap instead of a linked list.

Here are the learning goals for this assignment:

Learn about write-ahead logging, which is an important concept in file systems, databases, and many flash devices.
Learn how to repair inconsistencies by replaying log entries.
Better understand the tradeoffs in implementing crash recovery mechanisms
Navigate a large code base and add specific features and functions
Learn about the protections implemented by operating systems and the consequences if these protections are broken
Consider the ethical issues around long-term support (or lack thereof) for operating systems.

You can do each of the 3 parts of this assignment independently.

Assignment Infrastructure

This assignment takes the Unix V6 filesystem design and adds both a freemap bitmap and a write-ahead log. It also uses something called FUSE (File system in USErspace) to allow us to play around with live instances of this filesystem.

The C++ implementation we've provided uses some cool techniques and approaches; while not required to understand for the assignment, if you're interested we encourage you to explore the additional provided code files (it's a great example of a professional quality C++ project and it uses many advanced C++ features) and check out this page for more about how this updated design is possible:

Read about assign2 infrastructure design

Getting Started

git clone /usr/class/cs111/repos/assign2/$USER assign2

This will create a new directory assign2 in your current directory and clone a Git starter repository into that directory. Do your work for the assignment in this directory. The starter project contains many files, but you only need to modify replay.cc and questions.txt, though you will need to be familiar with a few more. In total, they are listed below:

questions.txt: a text file where you will answer questions for the assignment
replay.hh and replay.cc: the C++ class you will complete that can execute operations specified in a log. Part of the compiled recover program that can replay a log to repair a disk image.
examine: a program you can execute with the -r flag to run your recover program on a given disk image to repair it and see the results. It outputs information about the state of the filesystem after the recover program runs.
v6fs.hh: the header for the C++ type that models the Unix v6 filesystem used for this assignment. You will specifically use the Buffer type defined there which represents a single entry in the block cache.
bitmap.hh: the header for the C++ type that models the free list bitmap. You will specifically use the public methods defined there to update the freemap.
logentry.hh: the header file containing the definitions of different structs, each representing a different kind of log entry you may need to replay.
samples: a symbolic link to the shared directory for this assignment, containing sanity check files, a sample solution called examine_soln, and disk images to test with in disk_images.
mountv6, dumplog, fsckv6, v6: provided, fully-implemented tools that allow you to, in order: mount a filesystem image to interact with it, dump a filesystem image's log entries, run fsck on the image, and inspect various components of the filesystem state.
tools: contains symbolic links to the sanitycheck and submit programs for basic testing, and submitting your work.

The V6 file system we provide you has a block cache, and uses delayed writes to push modified data back to the underlying disk image. If the file system crashes, this is likely to result in metadata inconsistencies. For example, blocks still allocated in files might be marked free, posing a danger of cross-allocation, or directory entries might point at unallocated inodes. The good news is that we've augmented the file system to keep a write-ahead log of all metadata changes. Below, you'll explore the behavior of the filesystem and write code to replay 3 different kinds of log entries. You'll also explore questions of trust in operating systems.

Exercise #1: Experiments with Crash Recovery

In this exercise you will intentionally crash a V6 file system, leaving its on-disk state inconsistent. Then you will recover from the crash in two different ways, once with fsck and once with the log, and compare the results. To do this, you will use the same tools that you saw in section2. First run make in your assignment directory to compile those tools.

Step 1: Crashing the file system

First, let's crash a file system. Working in the top-level directory for the assignment, create an empty disk image and mount it in the provided mnt folder with the following commands, just like you did in section. Note that mnt is empty except for a readme file - for now! Once we run mountv6, the contents of the Unix v6 filesystem image will replace what is in that folder. Try it now:

./mkfsv6 crash.img
./mountv6 -j --crash 935 crash.img mnt &

The --crash argument tells the file system driver that it should crash itself (exit without flushing dirty cache blocks) after 935 writes to the disk image file. The & runs this command in the background - the filesystem is only mounted while this command is running, so we use & to run it in the background so we can use our shell for other things. Now type the following commands:

cd mnt
../create_files.py

The create_files.py script tries to create a large number of files with names file0, file1, and so on. Each file contains a single line of the form

This line belongs to file 101

The script groups the files in subdirectories, with 10 files in each subdirectory, where file0 - file9 are in dir0, file10 - file19 are in dir1, etc. The script will generate enough data to cause the file system to reach its --crash limit; when this happens the mountv6 process will exit and the file system will be unmounted, so create_files.py will exit with an error message. The error message should look exactly like the following, including the name of the file that was about to be created when it crashed: file795

Crashing because of --crash option
Couldn't create file dir79/file795: Software caused connection abort

NOTE: if the error message contains the file name file796, delete crash.img and repeat the steps above but run them promptly one after the other (a delay of 30 seconds or more between mounting and running create_files.py can change the behavior of the system so that you won't get the results we're expecting).

At this point the mnt directory will be unusable. Run

cd -

to return to the previous working directory. You may be wondering "why not just type cd ..?" Crashing the file system can leave things in a state where the .. link doesn't work properly.

Now we are going to recover the crashed file system. We'll do it twice: once with fsck and once using the file system's log.

Step 2: `fsck` to the rescue?

Let's give fsck a try in repairing this disk. Since we are going to modify the disk image, make a copy of it like this:

cp crash.img crash2.img

We have implemented a version of fsck for the V6 file system called fsckv6 that can restore a disk image to a consistent state, but often with heavy data loss. Run it on the original crashed image:

./fsckv6 -y crash.img

The -y switch tells fsckv6 to print out information about any inconsistencies, and also repair the disk; without that switch it will print information about inconsistencies without actually repairing them. This version of fsck does not implement the lost+found recovery described in lecture: if it finds an inode that is allocated but there are no directory entries pointing to the inode, it just deletes the inode's file (this results in "freeing unreachable inode" messages in the output).

Once you have run fsckv6, answer Questions 2A and 2B in questions.txt.

Next, remount the recovered image and explore it with the following commands:

./mountv6 -j crash.img mnt &
cd mnt
../check_files.py

The check_files.py script will scan over all of the files in the directory; its output indicates how many files it found, plus any errors it found in the files (such as a file whose contents are entirely zero, or a directory with no files in it).

Once you have run check_files.py, answer Questions 2C, 2D, and 2E in questions.txt.

Once you have answered the questions, navigate out of the mnt directory and terminate the file system mount:

cd ..
fusermount -u mnt

The mnt directory will now revert to its original form with nothing but a README.txt file.

Step 3: Examining the log

Before recovering the disk image by replaying the log, let's see take a look at the contents of the log. Run the dumplog tool to print out some of the entries in the log:

./dumplog crash2.img c | head -150

The c argument specifies that we only want entries generated after the most recent checkpoint, and head -150 will print out just the first 150 lines of output, discarding the rest.

The log shows the same log entry types as described in section2. In the output you'll be able to see log entries grouped between LogBegin and LogCommit entries. Each group of entries represents a single transaction that must be replayed atomically. Some log entries are also labeled with information about what they are updating (e.g. i_mtime is the last modified time in an inode, isize0 and isize1 show the file size, and so on).

Examine the log entries in the first transaction (from the first LogBegin to LogCommit, 18 entries in all) then answer Questions 2F and 2G in questions.txt.

Step 4: Journaling for the (partial) win

Now recover the second copy of the crashed image using log-based recovery:

samples/recover_soln crash2.img

This uses the sample solution to recover the crashed file system by replaying the log that was generated by the file system (you will complete the implementation of this yourself later in the assignment!). This will output a message "Reached log end (indicated by bad checksum)". The failed checksum is normal; it's how the replay program detects the end of the log.

Now mount the recovered disk image just like in the previous steps and inspect that file system:

./mountv6 -j crash2.img mnt &
cd mnt
../check_files.py

Then answer Questions 2H, 2I, and 2J in questions.txt.

Finally, navigate out of the mnt directory and terminate the file system mount:

cd ..
fusermount -u mnt

Exercise #2: `recover`

In this exercise you will write code for the recover program that replays individual log entries to restore consistency to a disk image. Your code will be in the file replay.cc; this file will be compiled with additional code we've written to form the recover executable. You need to implement the following 3 short methods in replay.cc, each of which replays one type of entry that could appear in the log. Note that these method implementations will be short; a combined 10-15 lines across all 3, with only a handful of lines at most for each:

// Replays a "patch" log entry, which specifies a 
// change to all or part of the contents of a block,
// such as part of an inode or other metadata block.
void apply(const LogPatch &);

// Replays a "block alloc" log entry, which specifies
// that a block should be marked as allocated (and
// possibly zero-ed out).
void apply(const LogBlockAlloc &);

// Replays a "block free" log entry, which specifies
// that a block should be marked as free.
void apply(const LogBlockFree &);

These methods are called by the starter code, which does all of the following for you:

Reads the log from the disk image
Finds all the valid log entries that must be applied. For example, it skips any entries preceding the most recent checkpoint, and stops replaying either at the end of the log or if it finds an inconsistent entry. The log entries contain checksums that allow us to detect if an entry was not correctly written (e.g. if the system crashed in the middle of writing an entry).
Checks for complete transactions: the logging mechanism allows a collection of log entries to be grouped into an atomic transaction. If a transaction isn't complete (the system crashed before writing all of the entries in a transaction) then none of its entries will be replayed.
We have also added code to the file system to generate log entries as needed and ensure that they are written to disk before any of the disk blocks affected by the log entries are written.

Once our code has determined that a log entry should be applied, it will invoke your code. There are some other log entry types - for example, transactions (groups of entries that s hould be applied atomically) each begin with a LogBegin and end with a LogCommit. But those don't actually require any updates to "replay".

Testing

As with previous assignments, you can run tools/sanitycheck to exercise your replay code with a collection of test cases. We will run sanitycheck when we grade your assignment. To test your code outside of sanity check, becauserecover modifies the disk image it is repairing, we've created an examine program that makes a copy of the disk image you specify, runs the recover program on that image to repair it, and then prints information about the repaired disk. Specifically, it outputs a list of all used blocks followed by a list of all used inodes. You can run it like this:

./examine -r samples/disk_images/create-file.img

Here is an example:

Expand/Collapse Example Output

Reached log end: bad checksum
played log entries 2670558939 to 2670558952
2 used blocks (out of 2)
Block 3:
  0 01002e00 00000000 00000000 00000000  >  .             <
 16 01002e2e 00000000 00000000 00000000  >  ..            <
 32 1000612d 66696c65 2e747874 00000000  >  a-file.txt    <
 48 00000000 00000000 00000000 00000000  >                <
*
Block 4:
  0 54686973 20697320 736f6d65 2066696c  >This is some fil<
 16 6520636f 6e74656e 74732e0a 00000000  >e contents.     <
 32 00000000 00000000 00000000 00000000  >                <
*
2 used inodes (out of 16)
    1 drwxr-xr-x   2   0   0       48 Oct 11 2022 09:58:24  #1
        ino: 1
        i_mode: 0140755
        i_nlink: 2
        i_uid: 0
        i_gid: 0
        size(): 48
        i_addr[0]: 3
        i_addr[1]: 0
        i_addr[2]: 0
        i_addr[3]: 0
        i_addr[4]: 0
        i_addr[5]: 0
        i_addr[6]: 0
        i_addr[7]: 0
        atime: Oct 11 2022 09:58:24
        mtime: Oct 11 2022 09:58:24
   16 -rw-------   1   0   0       28 Oct 11 2022 09:58:24  #16
        ino: 16
        i_mode: 0100600
        i_nlink: 1
        i_uid: 0
        i_gid: 0
        size(): 28
        i_addr[0]: 4
        i_addr[1]: 0
        i_addr[2]: 0
        i_addr[3]: 0
        i_addr[4]: 0
        i_addr[5]: 0
        i_addr[6]: 0
        i_addr[7]: 0
        atime: Oct 11 2022 09:58:24
        mtime: Oct 11 2022 09:58:24

The output from examine contains the following information:

The first two lines print info about what was replayed and when it stopped.
The next line prints how many disk blocks are in use.
The next lines display the contents of each of those blocks in a side-by-side format showing hexadecimal on the left and the corresponding ASCII text on the right, with 16 bytes per line -- the same size as the direntv6 structure. The text version is useful to see the text in directory entries and text files, while the hexadecimal version is useful for block numbers and other information.
Next, it prints how many inodes are used; this is followed by the contents of each inode, starting with its number and including information like its permissions, last modified time, block numbers, size, etc.

You can also run samples/examine_soln on a disk image; this program will use our sample solution for recover instead of your code, so you can compare its output with the output generated by ./examine.

To run under GDB, you will need to run recover (the executable containing your log replaying code) directly without using examine. But because it modifies the disk image it is repairing, make a copy of it first like this, replacing IMAGE with the image you want to copy:

cp samples/disk_images/IMAGE crash.img

Then run under GDB like this:

gdb recover
...
(gdb) run crash.img

There is also a tool dumplog that lets you view the filesystem log. You can run it like this (the "c" specifies that it should only display entries more recent than the last checkpoint), replacing IMAGE with the disk image you want to view the log for:

./dumplog samples/disk_images/IMAGE c

If you run dumplog and nothing appears, try making a fresh copy of the crashed disk and repeat the dumplog command, but be sure not to mount the crashed file system before running dumplog (mounting a file system will clear its log).

For this assignment, all functionality grading test cases are included in sanity check's provided test cases. See the grading section for more info.

Essential Utility Modules

In order to implement log entry replay you will need to interact with the file system buffer cache and the freemap.

Buffer cache interface

In the previous assignment you used diskimg_readsector to read blocks from the disk. For this assignment, however, we have a block cache, so you will need to interface with that instead. The methods you will write all have access to an instance variable V6FS &fs_, which you will use for file I/O. The main methods on this variable (defined in v6fs.hh) are:

struct V6FS {
    ...

    /* reads a block from disk, and returns the contents
     * in a buffer in the file system block cache.  
     * You can read its contents or write
     * to it and those changes will be made on disk.
     */
    Ref<Buffer> bread(uint16_t blockno);

    /* Returns a buffer for a particular block, but
     * doesn't actually read it from disk. The contents
     * of the buffer could be garbage.  However, if you are
     * going to overwrite an entire block (for instance
     * by zeroing it out), there is no reason to read the
     * old contents from disk, so bget is more efficient.
     */
    Ref<Buffer> bget(uint16_t blockno);

    ...
};

You will need to use both bread and bget in your solution, and it's very important to understand the difference between them. One of the most common (and confusing) mistakes is using bget when bread is needed; this results in the sudden appearance of garbage in metadata.

Both of the above methods return a Ref<Buffer>. You can treat a Ref<Buffer> as if it were a Buffer*, but it's actually a smart pointer that keeps a reference count of how many Ref's exist for a block in the cache. The cache won't evict a block while there exist Ref's for it. A Buffer contains the cached contents of a disk block. The two things you need to do to a buffer are to access the memory, and to tell the system when the buffer is dirty. The memory is in a byte array called mem_, while the method bdwrite() is used to tell the system that you have modified a buffer and it should at some point be written back. (bdwrite is a commonly used function name in Unix kernels, where the "d" stands for "delayed." A name such as mark_dirty() would arguably be more intuitive.)

struct Buffer {
    ...
    char mem_[SECTOR_SIZE];
    void bdwrite();
    ...
};

Bitmap interface

The freemap is a bitmap stored in the freemap_ instance variable of V6Replay. The freemap is compact enough to store the entire map in memory. It is read from disk for you in the constructor V6Replay::V6Replay, and written back to disk for you at the end of the V6Replay::replay method. Normally you might think to use bit operations to manipulate individual bits in the freemap, but this implementation provides an .at method that gets a reference to a bit at a given index and also lets you modify it. For instance, you could say:

if (freemap_.at(blockNum)) {
  // blockNum is free
}

...

// mark block as allocated
freemap_.at(blockNum) = false;

If you have time, take a look through the implementation of Bitmap in bitmap.hh. It's interesting to see how the class allows you to address individual bits in the bitmap.

We recommend implementing the log entries in the following order below.

`LogPatch`

This log entry specifies a change to portion of metadata on disk. This is used for operations like updating the block numbers stored in an inode or indirect block, updating an inode's last modified time, or updating directory entries. It contains specific bytes that must be written to the disk at position offset_in_block of block blockno; the length of bytes determines how many bytes are overwritten. logentry.hh contains the full definition of the LogPatch type passed in as a parameter; here is a snippet:

struct LogPatch {
    uint16_t blockno;           // Block number to patch
    uint16_t offset_in_block;   // Offset within block of patch
    std::vector<uint8_t> bytes; // Bytes to place into the block
};

A diagram showing where bytes in a logpatch should be added. An arrow labeled 'blockno' points to the start of a disk block, and some amount of initial space in that block is labeled 'offset_in_block'. Immediately following that, there is a highlighted portion of the block labeled 'bytes.size()'.

Diagram by John Ousterhout

When implementing this, check out the std::vector data() method, which gives you the raw byte representation of the vector contents.

Once you complete this method, no sanity check tests will pass yet - you will need to implement one more function before passing any new tests.

`LogBlockFree`

This log entry specifies that block number blockno, which was previously allocated, should be marked free (1 or true in the freemap). logentry.hh contains the full definition of the LogBlockFree type passed in as a parameter; here is a snippet:

struct LogBlockFree {
    uint16_t blockno;           // Block number of freed block
};

Once you complete this method, along with LogPatch, you should now pass test #2, which must fix a deleted directory. If the filesystem still has lingering files that should have been deleted, or if it is missing directories like "." or "..", make sure your buffer cache writing is correct and written to disk.

`LogBlockAlloc`

This log entry specifies that block number blockno, which was previously free, should now be marked as allocated (0 or false in the freemap). In addition, if zero_on_replay is non-zero then the contents of the block must be cleared to zeroes. If zero_on_replay is zero, then the contents of the block must be left as-is. logentry.hh contains the full definition of the LogBlockAlloc type passed in as a parameter; here is a snippet:

struct LogBlockAlloc {
    uint16_t blockno;       // Block number that was allocated
    uint8_t zero_on_replay; // Metadata--should zero out block on replay
};

The zero_on_replay field is needed because metadata and data blocks must be handled differently when replaying LogBlockAlloc entries. This field is set to 1 for all metadata blocks and 0 for all data blocks. When this field is 1, then after the block is allocated during replay its contents will be zeroed out. This is needed because new metadata blocks are always zeroed out when they are allocated during normal operation (e.g., all indirect block pointers in an indirect block will start off as zero, as will all entries in a new directory block). However, there is no guarantee that this zeroed-out block reached disk before the crash. If the block isn't cleared during crash recovery, it could contain arbitrary garbage that causes the file system to misbehave. Later metadata updates for that block will be reconstructed by subsequent LogPatch log entries. Data blocks should not be zeroed out because the recovery mechanism does not log data. If we zero out data blocks when replaying LogBlockAlloc entries, we will wipe any file data that has made it to disk (in other words, whenever we replay an operation on a file that causes a new data block to be allocated, we would end up wiping that block clean and lose valid data).

You may notice that whenever the log contains a LogBlockAlloc entry, there's also a LogPatch entry setting a pointer to that block in the same transaction. You may also find multiple LogBlockAlloc entries in the same transaction (one to allocate an indirect block, and another to allocate a data block, whose disk address will be stored in the indirect block using a LogPatch entry).

As a hint, consider using memset to help zero out the contents.

If you complete just the free list updating portion of this method, test #3, which must fix a created file, should pass. Once you completely implement this method, including possible block zeroing, all tests should pass -woohoo!

Exercise 3. Long-Term Support and Trust

For the last part of the assignment, we'll kick off our exploration of ethics in operating systems and systems software this quarter. File systems have been a great first example of one of the many critical pieces of functionality operating systems provide. Operating systems are large, complex systems that are built upon by many developers and used by a wide variety of people for long periods of time. This means that many people put their trust in operating systems and expect them to be reliable, secure and correct, and when this is not true, the impacts can be extremely far-reaching. In this assignment, we’ll start to examine this idea of operating system trust, the assumptions we make about them, and what happens when these are not upheld.

Elevated Privileges

We saw in lecture how the operating system runs in a privileged "kernel mode" where it is allowed to do things that regular user programs cannot do. This is to ensure that the OS handles important responsibilities itself, such as reading data from disk or reading the filesystem log, and prevents user programs from accessing things they shouldn't.

Imagine that a user using the Unix v6 filesystem just now created a folder called secret in the root directory with an unknown number of private files in it.

Answer question 3A in questions.txt.

Trust

Much of our use of operating systems today is built on the trust that users have in operating systems and their security / reliability. For example, all CS111 students (and CS107 students, too!) share usage of the myth machines, which run Linux, and include a shared networked filesystem called AFS ("Andrew File System"). A key part of users' trust in filesystems relates to filesystem permissions; permissions dictate, for a given file, who can access it, and in what way (e.g. read, write, execute). Permissions are critical particularly for multi-user systems like the myth machines where many users share a filesystem. For instance, on the myth machines, by default you cannot read the contents of other users’ files.

Imagine that a bug was disclosed in the OS used on the myth machines (Ubuntu) that impacted the integrity of filesystem permissions - meaning that filesystem permissions may not function in the way they are expected to.

Answer question 3B in questions.txt.

OS Longevity

Operating systems and filesystems have a very long lifespan. For instance, the current Windows filesystem, NTFS, was introduced in 1993. Once a system is adopted, it can be challenging to transition users to newer systems.

Although there is no requirement for operating systems to have a minimum support period, most organizations that develop operating systems provide support for around 10 years. This is typically enough for most users, but not all. For example, critical government infrastructure in the US, UK, and Netherlands relied on Windows XP (which debuted in 2001) through at least 2014.

Consider the following questions relating to OS minimum support periods and trust:

What is one argument for strengthening requirements for a minimum support period? (e.g., requiring it by law)?
What is one argument against requiring a minimum support period?

Once you have considered these questions, answer Questions 3C-3D in questions.txt.

Here are some other articles related to operating system support, in case you are interested:

Submitting and Grading

Once you are finished working and have saved all your changes, submit by running tools/submit.

We recommend you do a trial submit in advance of the deadline to allow time to work through any snags. You may submit as many times as you would like; we will grade the latest submission. Submitting a stable but unpolished/unfinished is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, the earlier submit will earn points. Without a submission, we cannot grade your work. You can confirm the timestamp of your latest submission in your course gradebook.

Here is a recap of the work that will be graded on this assignment (point amounts subject to slight change):

replay.cc: 40 points (10 points per test) (same tests as sanity check for this assignment) We'll run your code on a set of test disk images and check that your program gives the expected output. You have access to all of the images we'll use for grading and a sample solution to view the correct output for each of them.
Code style: we will manually review your code and give feedback on coding style (e.g. use of appropriate functions)
questions.txt: about 4 points per question (~56 points total)

Frequently-Asked Questions

What is a disk image? A disk image is like a "dehydrated" filesystem; it's a way that we can store a filesystem within a single file. This allows us to create filesystems with different contents, package them up, and then use them for easy testing.

What does it mean to mount a disk image? This means that the filesystem contained in the disk image will magically appear in mnt; mnt "becomes its root directory". And you can interact with it just like any other folders/files, except it's simulated - it's a completely separate filesystem! This allows us to easily test/play around with different-looking filesystems, and in particular ones in the Unix V6 filesystem format, which is no longer used on modern systems.