Section 2: Filesystem Crash Recovery and System Calls

Sections Wed Jan 25 to Sat Jan 28

Learning Goals

During this section, you will:

  1. use the assign2 tools to play around with and try out crash recovery mechanisms
  2. Learn how the log stores metadata operations and can restore filesystems after a crash
  3. Get practice with file descriptors and using valgrind to identify open file descriptors

Get Started

Clone the section starter code by using the command below. This command creates a section2 directory containing the project files.

git clone /afs/ir/class/cs111/repos/lab2/shared section2

Pull up the online section checkoff and have it open in a browser so you can jot things down as you go.

Exercises

assign2 has you implement part of a C++ program to recover filesystem metadata after a crash, and answer some short answer questions after using provided tools to explore the structure and mechanisms of logging (journaling). The assignment provides a C++ implementation of the Unix V6 Filesystem that has been modified to add a bitmap for the free list, and also add logging support for metadata operations. The first 2 exercises below will let you try out some of the tools included in the assignment (particularly useful for the assignment short answer questions) and see logging in action. The third question will provide some practice with file descriptors (which come up on assign3).

1. Viewing Logs

assign2 provides Unix v6 disk images like the prior assignment, but this time you have tools to both create your own and mount these images to navigate them just like your actual filesystem - create/edit/delete files, traverse folders, etc. Pretty cool! The tools below (mkfsv6, mountv6, v6, and dumplog) are all ones that you'll see on assign2 and enable you to explore crash recovery behavior in detail.

In this exercise, we'll create a blank disk image with a single file in it, and then look at the log to see how it reflects the operations we performed.

First, make your own blank disk image called v6.img - the mkfsv6 tool lets us make a new blank Unix v6 disk image:

./mkfsv6 v6.img

Now let's make a (for now, empty) directory where we will mount this disk image:

mkdir mnt

Mounting v6.img in mnt means that the v6.img filesystem will magically appear in mnt; mnt "becomes its root directory". And you can interact with it just like any other folder, except it's simulated - it's a completely separate filesystem! When you unmount it, mnt will be empty again. We can mount a disk image with the mountv6 tool. Let's try it:

./mountv6 -j v6.img mnt &

(The -j flag enables logging, and the & means run this command in the background - the filesystem will only appear in mnt while this command is running, so we need to keep this running). Note: the command runs in the background, so it may output text after the terminal reprompts. Just ignore it and continue entering commands as normal.

Now navigate into mnt - it's still empty, but now it's actually the mounted Unix v6 image you created. Try creating a file called file.txt there with the contents "Hello, world!" like this:

echo "Hello, world!" > file.txt

(The > saves the output of echo to a file, instead of printing it to the terminal).

Navigate out of mnt and run the following command to save and close the filesystem mount - this will save the current state of mnt back to your image v6.img:

fusermount -u mnt

(Note now that mnt is empty again!) Now let's look at the filesystem log with the dumplog tool to see how it logged the metadata operations:

./dumplog v6.img

Try to read through the log and match its operations back to our previous action of creating a new file called file.txt in the root directory. Here are some tips:

  • every transaction (group of log operations that should be atomic) starts with a LogBegin entry and ends with a LogCommit entry. Transactions are replayed in their entirety, or not at all.
  • LogPatch is an entry that updates a specified block number with new bytes starting at a specified offset. This is used for e.g. adding a directory entry, updating an inode's field, etc.
  • LogBlockAlloc is an entry that marks a specified block as allocated in the freemap.
  • An inode's size0 and size1 fields store the file size
  • An inode's i_mtime field stores the last modification time
  • Each log entry has an "LSN" (log sequence number) that should increment for each entry.
  • Some log entries may be added more than once

As you read through the log, try to answer the following questions:

Q1: What is the inumber of the file we created, file.txt?

Q2: What block number was allocated to store file.txt's payload data?

Once you have come up with answers to these questions, try using the v6 tool to confirm your answers. v6 lets you examine specific parts of the filesystem stored in v6.img without mounting it first (it assumes it's called v6.img). For instance, run the following to show the contents of a specified inode (#1 here - we need to put it in quotes because # is a special terminal character):

./v6 stat '#1'

This will show the inode for the root directory. Find the block number it uses to store its directory entries, and print out its contents like this, replacing NUM with the block number:

./v6 block NUM

This should show the names of 3 entries: ., .. and file.txt. It shows the hexadecimal on the left, and the text equivalent on the right.

Now try to use these commands to view the contents of the inode for file.txt, as well as the contents in its payload block. You can specify a path for an inode instead of an inode number like this: ./v6 stat /file.txt.

As you examine the inodes and blocks, try to answer the following questions:

Q3: What is the inumber for file.txt? Does that line up with your findings from the log?

Q4: What is the reported file size of file.txt? Does that line up with what you expect given the contents you stored there? (note: echo adds an ending \n character to the text you ask it to print)

2. Log Replay

assign2 comes with several test disk images that crashed suddenly while doing various operations, and one of your tasks on the assignment is to write code to replay the log entries and restore metadata. Let's look at one, samples/create-file.img (an image where the user created a file when the system crashed, and they hope to recover it).

First, make your own copy of the image to modify - the samples folder is read-only, and we will eventually run the apply_soln program (the fully-implemented solution version of what you will write on assign2) to repair the image.

cp samples/create-file.img .

Now, work through the following steps:

1.) Try mounting this image in the mnt folder as you did in the first problem, and running ls to see what's there. Use fusermount to unmount when you're done.

Q5: Is there anything there in the version of the filesystem from just after the crash?

2.) Make a new copy, overwriting the existing one (it turns out that mounting may alter the filesystem state and you want to see the original log). Now use dumplog to inspect the log. Run it with c to display the log from just the latest checkpoint - this is the portion that will be replayed to repair the disk: ./dumplog create-file.img c

3.) Inspect the log and read through each entry to gather information about what operations were being performed.

Q6: What do you predict the filesystem will look like after these log entries are replayed? What block(s) will no longer be in the free list?

4.) Now run the sample repair program solution to repair the disk by replaying the log from the latest checkpoint. Run ./samples/apply_soln create-file.img.

5.) Lastly, try mounting this image again in the mnt folder and see what's there. Use fusermount to unmount when you're done.

Q7: What was recovered by replaying the log?

3) valgrind and File Descriptors

We saw in lecture how file descriptors are like "ticket numbers" for open resources in your program - like files and STDIN / STDOUT / STDERR. To practice with code that uses file descriptors and system calls, take a look at the provided catn.c program. It is a program that, like the built-in cat command, prints the contents of a file to the terminal. But catn prints only the first N bytes, which you specify when you run the program, like this, which would print the first 50 bytes of Makefile:

./catn Makefile 50

First, try tracing through the program to better understand the file descriptors and system calls used. Compile using make and try out the program. Consider the following as you do:

Q8: Why does the read function need to be called in a loop?

The author of this program forgot about closing file descriptors in their code. We must make sure that all file descriptors we open are properly closed. Valgrind can help notify us of any file descriptors left open. Run valgrind --track-fds=yes ./catn Makefile 50 to get information about any open file descriptors. (note: it's fine for file descriptors 0-2 to be left open, which refer to STDIN / STDOUT / STDERR; all file descriptors are closed when the program terminates).

Q9: Which file descriptor was left open that should be closed? Add a line to the code to ensure all file descriptors we need to close are closed.

Note: if you're SSHing via VSCode, you may see file descriptors other than the ones in this program being used, since VSCode uses some file descriptors as part of connecting to myth. You can ignore those (e.g. sometimes it's file descriptors 19, 20, 21 or 99). To confirm what file descriptors you're seeing may be part of VSCode, try logging in to myth directly via a terminal program and run valgrind through there to see what file descriptors are open.

4. [Extra]: Where the Log Ends

When replaying the log after a crash, finding the end of the log is tricky. The log just turns to garbage at some point after the last write, which might not even be for a full log record if the file system crashed. The starter project for assign2 handles the reading of the log and detecting where it ends so that only the correct log entries are replayed, so you don't need to worry about finding the end of the log in your own code. But it's a cool problem, so let's take a closer look at the journaled V6 filesystem design to see how the provided code tries to detect the end of the log.

The implementation uses the following approaches to verify the integrity of the log and stop once it gets to invalid data:

Diagram by John Ousterhout

  • each log entry has a header, payload and footer within it
  • each log entry has an LSN (Log sequence number), a unique number assigned to each log entry, incremented for each subsequent entry. This LSN is contained in both the header and footer.
  • the footer for an entry includes a checksum over the header and payload
  • transactions start with a LogBegin entry and end with a LogCommit entry. LogCommit also includes the LSN of the matching LogBegin.

Discuss how each of these approaches helps to detect edge cases in finding the end of the log. Here are some possible scenarios to consider:

Q10: the log space is reused and looped back over repeatedly, meaning the garbage data could appear as valid entries. How can we use the LSN to know that an entry is old?

Q11: sometimes, log entries may span sector boundaries (i.e. part of entry at the end of one sector, part at the start of another sector). What if a seemingly valid entry actually consists of the first half of a new log entry followed by the second half of an old one? How can storing the LSN in the footer help with this?

Q12: There's still the possibility that the "right" log sequence number for the footer could appear in arbitrary payload data from an older log record. How can we use the checksum to help with this?

Q13: Multiple log entries may be grouped together as part of a transaction. How might we know that a transaction wasn't fully completed and shouldn't be replayed?