Section 2: Filesystem Crash Recovery

Sections Thu Oct 13 to Fri Oct 14

Learning Goals

During this section, you will:

use the assign2 tools to play around with and try out crash recovery mechanisms
Learn how the log stores metadata operations and can restore filesystems after a crash

Get Started

The code for this week is the same as for assign2 - we'll be playing around with some of the provided assignment tools to explore crash recovery, similar to what you'll be doing on part 2 of the assignment. If you haven't already cloned the starter code, get a copy by using the command below.

git clone /afs/ir/class/cs111/repos/assign2/$USER assign2

Next, pull up the online section checkoff and have it open in a browser so you can jot things down as you go.

Exercises

assign2 has you implement part of a C++ program to recover filesystem metadata after a crash, and answer some short answer questions after using provided tools to explore the structure and mechanisms of logging (journaling). The assignment provides a C++ implementation of the Unix V6 Filesystem that has been modified to add a bitmap for the free list, and also add logging support for metadata operations. The exercises below will let you try out some of the tools included in the assignment (particularly useful for the assignment short answer questions) and see logging in action.

1. Viewing Logs

assign2 provides Unix v6 disk images like the prior assignment, but this time you have tools to both create your own and mount these images to navigate them just like your actual filesystem - create/edit/delete files, traverse folders, etc. Pretty cool!

In this exercise, we'll create a blank disk image with a single file in it, and then look at the log to see how it reflects the operations we performed.

First, run make in your assign2 folder to compile all the provided tools. Then, make your own blank disk image called v6.img:

./mkfsv6 v6.img

Now let's make a (for now, empty) directory where we will mount this disk image:

mkdir mnt

Mounting v6.img in mnt means that the v6.img filesystem will magically appear in mnt; mnt "becomes its root directory". And you can interact with it just like any other folder, except it's simulated - it's a completely separate filesystem! When you unmount it, mnt will be empty again. Let's try it:

./mountv6 -j v6.img mnt &

(The -j flag enables logging, and the & means run this command in the background - the filesystem will only appear in mnt while this command is running, so we need to keep this running). Note: the command runs in the background, so it may output text after the terminal reprompts. Just ignore it and continue entering commands as normal.

Now navigate into mnt - it's still empty, but now it's actually the mounted Unix v6 image you created. Try creating a file called file.txt there with the contents "Hello, world!" like this:

echo "Hello, world!" > file.txt

(The > saves the output of echo to a file, instead of printing it to the terminal).

Navigate out of mnt and run the following command to save and close the filesystem mount - this will save the current state of mnt back to your image v6.img:

fusermount -u mnt

(Note now that mnt is empty again!) Now let's look at the filesystem log to see how it logged the metadata operations:

./dumplog v6.img

Try to read through the log and match its operations back to our previous action of creating a new file called file.txt in the root directory. Here are some tips:

every transaction (group of log operations that should be atomic) starts with a LogBegin entry and ends with a LogCommit entry. Transactions are replayed in their entirety, or not at all.
LogPatch is an entry that updates a specified block number with new bytes starting at a specified offset. This is used for e.g. adding a directory entry, updating an inode's field, etc.
LogBlockAlloc is an entry that marks a specified block as allocated in the freemap.
An inode's size0 and size1 fields store the file size
An inode's i_mtime field stores the last modification time
Each log entry has an "LSN" (log sequence number) that should increment for each entry.
Some log entries may be added more than once

As you read through the log, try to answer the following questions:

Q1: What is the inumber of the file we created, file.txt?

Q2: What block number was allocated to store file.txt's payload data?

Once you have come up with answers to these questions, try using the v6 tool to confirm your answers. v6 lets you examine specific parts of the filesystem stored in v6.img without mounting it first (it assumes it's called v6.img). For instance, run the following to show the contents of a specified inode (#1 here - we need to put it in quotes because # is a special terminal character):

./v6 stat '#1'

This will show the inode for the root directory. Find the block number it uses to store its directory entries, and print out its contents like this, replacing NUM with the block number:

./v6 block NUM

This should show the names of 3 entries: ., .. and file.txt. It shows the hexadecimal on the left, and the text equivalent on the right.

Now try to use these commands to view the contents of the inode for file.txt, as well as the contents in its payload block. You can specify a path for an inode instead of an inode number like this: ./v6 stat /file.txt.

As you examine the inodes and blocks, try to answer the following questions:

Q3: What is the inumber for file.txt? Does that line up with your findings from the log?

Q4: What is the reported file size of file.txt? Does that line up with what you expect given the contents you stored there? (note: echo adds an ending \n character to the text you ask it to print)

2. Log Replay

The assignment comes with several test disk images that crashed suddenly while doing various operations, and one of your tasks on the assignment is to write code to replay the log entries and restore metadata. Let's look at one, samples/disk_images/create-file.txt (an image where the user created a file when the system crashed, and they hope to recover it).

First, make your own copy of the image to modify - the samples folder is read-only, and we will eventually run the apply_soln program to repair the image.

cp samples/disk_images/create-file.img .

Now, work through the following steps:

1.) Try mounting this image in the mnt folder as you did in the first problem, and running ls to see what's there. Use fusermount to unmount when you're done.

Q5: Is there anything there in the version of the filesystem from just after the crash?

2.) Make a new copy, overwriting the existing one (it turns out that mounting may alter the filesystem state and you want to see the original log). Now use dumplog to inspect the log. Run it with c to display the log from just the latest checkpoint - this is the portion that will be replayed to repair the disk: ./dumplog create-file.img c

3.) Inspect the log and read through each entry to gather information about what operations were being performed.

Q6: What do you predict the filesystem will look like after these log entries are replayed? What block(s) will no longer be in the free list?

4.) Now run the sample repair program solution to repair the disk by replaying the log from the latest checkpoint. Run ./samples/apply_soln create-file.img.

5.) Lastly, try mounting this image again in the mnt folder and see what's there. Use fusermount to unmount when you're done.

Q7: What was recovered by replaying the log?

3. Where the Log Ends

When replaying the log after a crash, finding the end of the log is tricky. The log just turns to garbage at some point after the last write, which might not even be for a full log record if the file system crashed. The starter project for assign2 handles the reading of the log and detecting where it ends so that only the correct log entries are replayed - these are the ones passed to your apply methods that you implement, so you don't need to worry about finding the end of the log in your own code. But it's a cool problem, so let's take a closer look at the journaled V6 filesystem design to see how the provided code tries to detect the end of the log.

The implementation uses the following approaches to verify the integrity of the log and stop once it gets to invalid data:

Diagram by John Ousterhout

each log entry has a header, payload and footer within it
each log entry has an LSN (Log sequence number), a unique number assigned to each log entry, incremented for each subsequent entry. This LSN is contained in both the header and footer.
the footer for an entry includes a checksum over the header and payload
transactions start with a LogBegin entry and end with a LogCommit entry. LogCommit also includes the LSN of the matching LogBegin.

Discuss how each of these approaches helps to detect edge cases in finding the end of the log. Here are some possible scenarios to consider:

Q8: the log space is reused and looped back over repeatedly, meaning the garbage data could appear as valid entries. How can we use the LSN to know that an entry is old?

Q9: sometimes, log entries may span sector boundaries (i.e. part of entry at the end of one sector, part at the start of another sector). What if a seemingly valid entry actually consists of the first half of a new log entry followed by the second half of an old one? How can storing the LSN in the footer help with this?

Q10: There's still the possibility that the "right" log sequence number for the footer could appear in arbitrary payload data from an older log record. How can we use the checksum to help with this?

Q11: Multiple log entries may be grouped together as part of a transaction. How might we know that a transaction wasn't fully completed and shouldn't be replayed?