Assignment 8: Journaling File System (Write-Ahead Log)
In this assignment you will implement file system recovery based on journaled metadata. We will be using the Unix V6 file system format from 1975 that you learned about in the previous assignment, but bringing it (at least partially) into the 21st century by journaling all metadata updates to a write-ahead log and tracking free blocks in a bitmap instead of a linked list.
This assignment also contains an additional ethics exercise related to long-term support for operating systems.
Here are the learning goals for this assignment:
- Learn about write-ahead logging, which is an important concept in file systems, databases, and many flash devices.
- Learn how to repair inconsistencies by replaying log entries.
- Learn about user-space file systems.
- Consider the ethical issues around long-term support (or lack thereof) for operating systems.
Project Overview
We will supply you with a C++ implementation of the V6 file system using FUSE (File system in USErspace). FUSE is a Linux feature allowing developers to implement file systems in ordinary user processes known as FUSE drivers, without modifying the kernel. When applications make file system calls, the kernel passes the system calls through as messages to the user-level FUSE driver, which then passes the results back to the kernel. The result is a fully functional file system that you can use just like the system's built-in file system to store files, run builds, and so on. Though this architecture incurs a bit of performance overhead, it is used by production file systems such as NTFS-3G and the exFAT implementation commonly used on Linux. The V6 file system doesn't store its data directly on a disk; instead, it reads and writes a disk image stored in a file, similar to what you used in Assignment 7.
The V6 file system we provide you has an inode cache and a block cache, and uses delayed writes to push modified data back to the underlying disk image. If the file system crashes, this is likely to result in metadata inconsistencies. For example, blocks still allocated in files might be marked free, posing a danger of cross-allocation, or directory entries might point at unallocated inodes.
We supply you with a file system scavenger program fsckv6 that can restore a disk image to a consistent state using techniques similar to those discussed in lecture, but it may result in heavy data loss. If there have been a lot of metadata changes, you will likely lose files.
The good news is that we've augmented the file system to keep a write-ahead log of all metadata changes. We've also created a program that can replay log entries after a crash to recover lost metadata and ensure that file system data structures are consistent.
Your task for this assignment is to write code to replay 3 different kinds of log entries. The actual code you will write is tiny (only about a dozen lines), but you'll need to spend a bit of time learning about how the log works and about other facilities in the file system that you will use to write your code, such as the block cache and the freemap.
Getting Started
Log in to the myth cluster and clone the starter repo with this command:
git clone /afs/ir/class/archive/cs/cs111/cs111.1236/repos/assign8/$USER assign8
This will create a new directory assign8 in your current directory and clone a Git starter repository into that directory. Do your work for the assignment in this directory.
You will make all of your code changes in replay.cc, which is responsible for replaying log entries after a crash.
The directory also contains a Makefile; if you type make, it will create several executables, including the following:
- mkfsv6: Creates an image file for a disk with no files other than a root directory; the file system will use this as if it were a disk, to store file system data.
- mountv6: This is the file system driver.
- fsckv6: A version of fsck for the V6 file system. It will restore consistency to the disk, but it's likely to lose a lot of information.
- recover: This program will contain your code for crash recovery; it will run after crashes to replay the log and restore consistency to the disk.
Try running make now, followed by tools/sanitycheck: almost all of the tests should fail. See the section Testing Your Code below for more information about how to test your assignment (the infrastructure is a bit different from previous assignments).
Exercise 1: Using the FUSE Implementation of Unix V6
This exercise will walk you through how to run the FUSE implementation of the Unix V6 file system. In order to use a file system you must mount it. When a file system is mounted, its root directory overlays an existing directory in a parent file system; all of the information in the mounted file system will suddenly appear where the mount directory used to be.
Your starter repo contains a directory mnt; you'll mount the V6 file system on top of it. Make sure that mnt exists (it should have a README.txt file explaining the behavior of the mount directory).
Now invoke the following commands:
cp samples/disk_images/assign1.img v6.img
./mountv6 -j v6.img mnt &
The first command makes a private copy of a disk image that we have created for you. The mountv6 command runs our implementation of the V6 file system using v6.img as the disk image: it mounts the file system over the mnt directory and will handle requests to access that file system (the -j option tells it to use its journaling mechanism to track metadata changes). This command will continue running in the background until the file system is unmounted.
The output that appears on your screen may seem confusing because mountv6 is running concurrently with the shell and each program generates output (the shell writes a prompt and mountv6 generates an information message). This is a race, so the outputs may appear in any order. Nonetheless, the shell is ready for you to type more commands.
Now type
ls mnt
You will see that the old contents of the directory have disappeared. Instead, you'll see the root directory of the mounted file system. It contains one subdirectory, assign1, in which we have made a copy of the starter files for Assignment 1.
The behavior of the mounted file system is indistinguishable from other directories and files except that performance will be a little worse because requests have to be forwarded to the mountv6 process. Type the following commands:
cd mnt/assign1
ls
make
./nearest
All of the file system kernel calls invoked by these commands were forwarded to the mountv6 process, which carried them out using its V6 implementation. Instead of reading and writing blocks on a disk, mountv6 read and wrote blocks in the disk image file.
Once you have run the commands above, answer Question 1 in questions.txt.
Finally, unmount the file system by invoking the following commands (you must run the fusermount command in the same directory where you invoked ./mountv6; otherwise it will fail):
cd ../..
fusermount -u mnt
This command will cause the mountv6 process to exit after shutting down the file system cleanly. The pre-existing contents of the mnt directory will now reappear. The changes you made to the V6 file system have been saved in the disk image file. If you invoke ./mountv6 to remount the file system, its contents will appear just as they were at the time it was unmounted.
Exercise 2: Experiments with Crash Recovery
In this exercise you will intentionally crash a V6 file system, leaving its on-disk state inconsistent. Then you will recover from the crash in two different ways, once with fsck and once with the log, and compare the results.
First, let's crash a file system. Working in the top-level directory for the assignment, create an empty disk image and mount it with the following commands:
./mkfsv6 crash.img
./mountv6 -j --crash 935 crash.img mnt &
The --crash argument tells the file system driver that it should crash itself (exit without flushing dirty cache blocks) after 935 writes to the disk image file. Now type the following commands:
cd mnt
../create_files.py
The create_files.py script tries to create a large number of files with names file0, file1, and so on. Each file contains a single line of the form
This line belongs to file 101
The script groups the files in subdirectories, with 10 files in each subdirectory.
The script will generate enough data to cause the file system to reach its --crash limit; when this happens the mountv6 process will exit and the file system will be unmounted, so create_files.py will exit with an error message.
Once the file system has crashed, answer Question 2A in questions.txt.
Now we are going to recover the crashed file system. We'll do it twice: once with fsck and once using the file system's log. First, make a copy of the crashed disk image:
cp crash.img crash2.img
We have implemented a version of fsck for the V6 file system; run it on the original crashed image:
./fsckv6 -y crash.img
The -y switch tells fsckv6 to print out information about any inconsistencies, and also repair the disk; without that switch it will print information about inconsistencies without actually repairing them. This version of fsck does not implement the lost+found recovery described in lecture: if it finds an inode that is allocated but there are no directory entries pointing to the inode, it just deletes the inode's file (this results in "freeing unreachable inode" messages in the output).
Once you have run fsckv6, answer Questions 2B and 2C in questions.txt.
Next, remount the recovered image and explore it with the following commands:
./mountv6 -j crash.img mnt &
cd mnt
../check_files.py
The check_files.py script will scan over all of the files in the directory; its output indicates how many files it found, plus any errors it found in the files (such as a file whose contents are entirely zero, or a directory with no files in it).
Once you have run check_files.py, answer Questions 2D, 2E, and 2F in questions.txt.
Now recover the second copy of the crashed image using log-based recovery:
cd ..
samples/recover_soln crash2.img
This program recovers the crashed file system by replaying the log that was generated by the file system. It includes our sample solution for the code you will write.
Finally, unmount the file system recovered with fsckv6, mount the file system recovered with recover_soln, and inspect that file system:
fusermount -u mnt
./mountv6 -j crash2.img mnt &
cd mnt
../check_files.py
Then answer Questions 2G, 2H, and 2I in questions.txt.
Exercise 3: Implementing Log Replay
In this exercise you will write code that replays individual log entries to restore consistency to a disk image. Your code will be in the file replay.cc; this file will be compiled with additional code we've written to form the recover executable. We've already written code that does the following things:
- Reads the log from the disk image
- Finds all the valid log entries that must be applied. For example, it skips any entries preceding the most recent checkpoint, and stops replaying either at the end of the log or if it finds an inconsistent entry. The log entries contain checksums that allow us to detect if an entry was not correctly written (e.g. if the system crashed in the middle of writing an entry).
- Checks for complete transactions: the logging mechanism allows a collection of log entries to be grouped into an atomic transaction. If a transaction isn't complete (the system crashed before writing all of the entries in a transaction) then none of its entries will be replayed.
- We have also added code to the file system to generate log entries as needed and ensure that they are written to disk before any of the disk blocks affected by the log entries are written.
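To make the transaction rule above concrete, here is a minimal sketch of how a replay pass might pick out the entries that are safe to apply. It is not the provided code: the Entry struct, the Kind enum, the checksum_ok flag, and the committed_entries function are all hypothetical simplifications, and the real on-disk log format differs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified log entry; the real log format differs.
enum class Kind { Begin, Commit, Patch, BlockAlloc, BlockFree };

struct Entry {
    Kind kind;
    bool checksum_ok;  // assumed to be computed when the entry is read
    uint16_t blockno;  // unused for Begin/Commit
};

// Return the entries that are safe to replay: only those belonging to
// transactions whose Commit was seen before any corrupt entry.
std::vector<Entry> committed_entries(const std::vector<Entry> &log) {
    std::vector<Entry> result;
    std::vector<Entry> pending;  // entries of the transaction in progress
    bool in_txn = false;
    for (const Entry &e : log) {
        if (!e.checksum_ok)
            break;               // torn write: stop scanning the log
        switch (e.kind) {
        case Kind::Begin:
            pending.clear();
            in_txn = true;
            break;
        case Kind::Commit:
            if (in_txn)
                result.insert(result.end(), pending.begin(), pending.end());
            pending.clear();
            in_txn = false;
            break;
        default:
            if (in_txn)
                pending.push_back(e);
            break;
        }
    }
    return result;               // anything still pending is discarded
}
```

Buffering entries until the commit marker is what makes a transaction atomic here: a crash in the middle of a transaction leaves entries in pending that are never copied to the result.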
If you're interested in learning more about the structure of the V6 file system and its log, check out this page with additional information.
Once our code has determined that a log entry should be applied, it will invoke your code, which consists of several V6Replay::apply methods in replay.cc. The methods are overloaded: they all have the same name, but each method has a different argument type reflecting the specific type of log entry that it must process. You must fill in the bodies of these methods.
There are 3 different types of log entry that you must process:
Patch bytes
struct LogPatch {
uint16_t blockno; // Block number to patch
uint16_t offset_in_block; // Offset within block of patch
std::vector<uint8_t> bytes; // Bytes to place at offset_in_block
};
This log entry specifies a change to a portion of metadata on disk. This is used for operations like updating the block numbers stored in an inode or indirect block, updating an inode's last modified time, or updating directory entries. It contains specific bytes that must be written to the disk at position offset_in_block of block blockno; the length of bytes determines how many bytes are overwritten.
Note: the interface for reading and writing disk blocks is different in this assignment than in Assignment 7; see Essential Utility Modules below.
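As an illustration of what patching means at the byte level, here is a small sketch that works on a plain in-memory array rather than the assignment's block cache. The Sector alias, the 512-byte size, and patch_sector are assumptions made for this example only; your apply method must go through the buffer cache interface instead.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kSectorSize = 512;  // assumed block size for this sketch
using Sector = std::array<uint8_t, kSectorSize>;

// Overwrite patch.size() bytes of `sector` starting at `offset`.
// Bytes outside that range keep their old values.
void patch_sector(Sector &sector, uint16_t offset,
                  const std::vector<uint8_t> &patch) {
    assert(offset + patch.size() <= sector.size());  // patch must fit
    std::copy(patch.begin(), patch.end(), sector.begin() + offset);
}
```

The key property is that a patch is a partial update: everything in the block outside the patched range must survive, which is why the old contents have to be available before the patch is applied.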
Free block
struct LogBlockFree {
uint16_t blockno; // Block number of freed block
};
Specifies that block number blockno, which was previously allocated, should be marked free (1 or true in the freemap). See Essential Utility Modules below for information on how to manipulate the freemap.
Once you have implemented the apply methods for LogPatch and LogBlockFree you should be able to pass the first sanity test (repairing unlink-rmdir.img).
Allocate block
struct LogBlockAlloc {
uint16_t blockno; // Block number that was allocated
uint8_t zero_on_replay; // Metadata--should zero out block on replay
};
Specifies that block number blockno, which was previously free, should now be marked as allocated (0 or false in the freemap). In addition, if zero_on_replay is non-zero then the contents of the block must be cleared to zeroes. If zero_on_replay is zero, then the contents of the block must be left as-is.
The zero_on_replay field is needed so that metadata and data blocks can be handled differently. It will be set whenever the allocated block is going to be used for metadata (i.e., as an indirect or double-indirect block, or for a directory). This is needed because new metadata blocks are always zeroed out when they are allocated during normal operation (e.g., all indirect block pointers in an indirect block will start off as zero, as will all entries in a new directory block). However, there is no guarantee that this zeroed-out block reached disk before the crash. If the block isn't cleared during crash recovery, it could contain arbitrary garbage that causes the file system to misbehave. If the block needs to contain any nonzero information, that information will be provided by subsequent LogPatch log entries.
For data blocks, it would be safe to zero out the block during crash recovery. However, this could lose valid data. The reason is that data block updates are not logged. If the data block was actually written safely to disk before the crash, zeroing the block will lose those contents. So, we don't clear data blocks during log replay; of course, this means that the file could inherit random garbage data, but this will not cause the file system to misbehave (presumably users will notice the garbage data and figure out what to do with it).
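The difference between the two cases can be seen in a toy model. The sketch below uses a made-up ToyDisk type (a vector of sectors plus a vector of free flags) and a hypothetical replay_alloc function; it is only meant to show the metadata/data distinction, not the structure of the real replay code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Sector = std::array<uint8_t, 512>;  // block size assumed for this sketch

// Toy stand-in for a disk image plus its freemap (true = free).
struct ToyDisk {
    std::vector<Sector> blocks;
    std::vector<bool> is_free;
    explicit ToyDisk(std::size_t n) : blocks(n), is_free(n, true) {}
};

// Replay one allocation: the block stops being free, and it is cleared
// only when it will hold metadata. Data blocks keep whatever reached disk.
void replay_alloc(ToyDisk &disk, uint16_t blockno, bool zero_on_replay) {
    assert(blockno < disk.blocks.size());
    disk.is_free[blockno] = false;
    if (zero_on_replay)
        disk.blocks[blockno].fill(0);
}
```

In this model a data block that was written before the crash keeps its contents after replay, while a metadata block always starts from zeroes and is then filled in by later patches.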
You may notice that whenever the log contains a LogBlockAlloc entry, there's also a LogPatch entry setting a pointer to that block in the same transaction. You may also find multiple LogBlockAlloc entries in the same transaction (one to allocate an indirect block, and another to allocate a data block, whose disk address will be stored in the indirect block using a LogPatch entry).
Once all three of the apply methods have been implemented, you should be able to pass all of the sanity tests.
Other log entry types
In addition to the log entry types above, which you must implement, there are three other types of log entry that we have implemented for you: LogBegin and LogCommit, which identify the start and end of each transaction, and LogRewind, which indicates that the log wraps back to the beginning of the log storage area. For more details on these entries, see the V6 extra information page.
Essential Utility Modules
In order to implement log entry replay you will need to interact with the file system buffer cache and the freemap.
Buffer cache interface
In the previous assignment you used diskimg_readsector to read blocks from the disk. For this assignment, however, we have a block cache, so you will need to interface with that instead.
The methods you will write all have access to an instance variable V6FS &fs_, which you will use for file I/O. The main methods on this variable (defined in v6fs.h) are:
struct V6FS {
...
Ref<Buffer> bread(uint16_t blockno);
Ref<Buffer> bget(uint16_t blockno);
...
};
- bread reads a block from disk, and returns the contents in a buffer in the file system block cache.
- bget returns a buffer for a particular disk block, but doesn't actually read it from disk. The contents of the buffer could be garbage. However, if you are going to overwrite an entire block (for instance by zeroing it out), there is no reason to read the old contents from disk, so bget is more efficient.
You will need to use both bread and bget in your solution, and it's very important to understand the difference between them. One of the most common (and confusing) mistakes is using bget when bread is needed; this results in the sudden appearance of garbage in metadata.
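The following toy cache shows why the distinction matters. It is a simplified model written for this handout, not the provided block cache: ToyCache, read_block, and get_block are hypothetical names, and the 0xEE fill merely simulates leftover garbage.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>

using Sector = std::array<uint8_t, 512>;  // block size assumed for this sketch

// Toy cache over an in-memory "disk"; behavior is a simplified assumption.
struct ToyCache {
    std::map<uint16_t, Sector> disk;   // backing store
    std::map<uint16_t, Sector> cache;  // cached (possibly dirty) buffers

    // Like bread: the buffer starts with the block's on-disk contents.
    Sector &read_block(uint16_t bn) {
        auto it = cache.find(bn);
        if (it == cache.end())
            it = cache.emplace(bn, disk[bn]).first;
        return it->second;
    }

    // Like bget: the buffer is handed out without touching the disk, so
    // its contents are meaningless until the caller overwrites all of it.
    Sector &get_block(uint16_t bn) {
        auto it = cache.find(bn);
        if (it == cache.end()) {
            Sector junk;
            junk.fill(0xEE);           // simulate leftover garbage
            it = cache.emplace(bn, junk).first;
        }
        return it->second;
    }
};
```

If you change a single byte in a buffer obtained from get_block and the buffer is later written back, the other 511 bytes of the block on disk are replaced with garbage. With read_block, those bytes are preserved.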
Both of the above methods return a Ref<Buffer>. You can treat a Ref<Buffer> as if it were a Buffer*, but it's actually a smart pointer that keeps a reference count of how many Ref's exist for a block in the cache. The cache won't evict a block while there exist Ref's for it. It's a clever class that is similar in some ways to std::unique_lock, which you used earlier in this class, and also to std::shared_ptr, which some of you may have seen before.
If you have time, take a look at the implementation of Ref in cache.hh.
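For intuition about how such a reference-counting handle can work, here is a minimal sketch. It is not the code in cache.hh: Slot and Pin are hypothetical names, and the real Ref class may differ in many details.

```cpp
#include <cassert>
#include <utility>

// Toy cache slot: evictable only while no handle refers to it.
struct Slot {
    int refcount = 0;
    bool evictable() const { return refcount == 0; }
};

// Simplified handle in the spirit of Ref<Buffer>; not the cache.hh code.
class Pin {
  public:
    explicit Pin(Slot *s) : slot_(s) { if (slot_) ++slot_->refcount; }
    Pin(const Pin &other) : slot_(other.slot_) { if (slot_) ++slot_->refcount; }
    Pin(Pin &&other) noexcept : slot_(std::exchange(other.slot_, nullptr)) {}
    Pin &operator=(Pin other) noexcept {  // copy-and-swap covers copy and move
        std::swap(slot_, other.slot_);
        return *this;
    }
    ~Pin() { if (slot_) --slot_->refcount; }

    Slot *operator->() const { return slot_; }  // use it like a pointer

  private:
    Slot *slot_;
};
```

Because the count is adjusted in constructors and the destructor, a slot is automatically unpinned when the last handle goes out of scope, much as std::unique_lock releases its mutex when it is destroyed.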
A Buffer contains the cached contents of a disk block. The two things you need to do to a buffer are to access the memory, and to tell the system when the buffer is dirty. The memory is in a simple byte array called mem_, while the method bdwrite() is used to tell the system that you have modified a buffer and it should at some point be written back. (bdwrite is a commonly used function name in Unix kernels, where the "d" stands for "delayed." A name such as mark_dirty() would arguably be more intuitive.)
struct Buffer {
...
char mem_[SECTOR_SIZE];
void bdwrite();
...
};
Bitmap interface
The freemap is a simple bitmap stored in the freemap_ instance variable of V6Replay. The freemap is compact enough to store the entire map in memory. It is read from disk for you in the constructor V6Replay::V6Replay, and written back to disk for you at the end of the V6Replay::replay method.
The freemap is implemented by the Bitmap structure defined in bitmap.hh, which is similar to a std::vector<bool>. Unlike std::vector<bool>, however, Bitmap offers a feature in which valid indices start at an arbitrary number. Since the first data block is INODE_START_SECTOR + s_isize (a.k.a. datastart()) rather than block zero, the first bit physically on disk in the freemap area corresponds to this "datastart" block. The Bitmap structure handles the translation because we construct it with a min_index. What this means in practice is that to mark block bn free, you just say freemap_.at(bn) = true; to mark it allocated, you say freemap_.at(bn) = false. Bitmap itself will do the work of translating bn to the appropriate bit.
If you have time, take a look through the implementation of Bitmap in bitmap.hh. It's interesting to see how the class allows you to address individual bits in the bitmap.
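The index translation can be sketched in a few lines. The class below is a hypothetical stand-in built on std::vector<bool>, not the Bitmap in bitmap.hh; OffsetBitmap, its constructor arguments, and num_set are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Simplified stand-in for a bitmap whose valid indices start at min_index
// and end just before max_index.
class OffsetBitmap {
  public:
    OffsetBitmap(std::size_t min_index, std::size_t max_index)
        : min_(min_index), bits_(max_index - min_index, false) {
        assert(min_index <= max_index);
    }

    // Index by block number; the subtraction maps it to a bit position.
    std::vector<bool>::reference at(std::size_t i) {
        if (i < min_ || i - min_ >= bits_.size())
            throw std::out_of_range("block number outside bitmap");
        return bits_[i - min_];
    }

    // Count the bits that are currently set (i.e., free blocks).
    std::size_t num_set() const {
        std::size_t n = 0;
        for (bool b : bits_)
            n += b;
        return n;
    }

  private:
    std::size_t min_;
    std::vector<bool> bits_;
};
```

With this sketch, an OffsetBitmap constructed as fm(10, 20) accepts block numbers 10 through 19, so fm.at(10) = true sets the very first bit. The at method returns a proxy reference, which is what lets a single bit appear on the left side of an assignment.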
Testing Your Code
As with previous assignments, you can run tools/sanitycheck to exercise your replay code with a collection of test cases. We will run sanitycheck when we grade your assignment.
However, the underlying infrastructure for testing is different from previous assignments. There is no test program.
If you would like to run a single test in isolation, use the shell script examine, which can be invoked as follows:
./examine -r samples/disk_images/image
where image is the name of a corrupt disk image that we have prepared for you. This script will make a copy of the disk image, run the recover program on that image to repair it, and then print information about the repaired disk.
The output from examine contains the following information:
- The first two lines print info about what was replayed and when it stopped.
- The next line prints how many disk blocks are in use.
- The next lines display the contents of each of those blocks in a side-by-side format showing hexadecimal on the left and the corresponding ASCII text on the right, with 16 bytes per line -- the same size as the direntv6 structure. The text version is useful to see the text in directory entries and text files, while the hexadecimal version is useful for block numbers and other information.
- Next, it prints how many inodes are used; this is followed by the contents of each inode, starting with its number and including information like its permissions, last modified time, block numbers, size, etc.
You can also run samples/examine_soln on a disk image; this program will use our sample solution for recover instead of your code, so you can compare its output with the output generated by ./examine.
If you wish to run gdb on your code, you will need to run recover (the executable containing your log replaying code) directly without using examine. To do this, you will first need to make a copy of the image to repair:
cp samples/disk_images/image crash.img
Then you can run recover, either directly or under gdb like this:
gdb recover
...
(gdb) run crash.img
There is also a tool dumplog that lets you view the file system log. You can run it like this (the "c" specifies that it should only display entries more recent than the last checkpoint):
./dumplog samples/disk_images/image c
As usual, we do not guarantee that the tests we have provided are exhaustive, so passing all of the tests is not necessarily sufficient to ensure a perfect score (CAs may discover other problems in reading through your code).
Troubleshooting
- If your file system ever crashes and gets Linux into a bad state, you may need to remove the FUSE mount. The command to do so is:
fusermount -u mnt
(assuming mnt is the name of your mount point). This won't work if you still have a process in the directory, so make sure to cd out of the mnt directory if you are having problems.
- Remember that .. won't work if your file system has crashed, so you'll need to give cd an absolute pathname. A convenient choice is cd $PWD or cd $PWD/.., since your shell keeps the $PWD environment variable set to the path of the current working directory.
Exercise 4: Long-Term Support and Trust
The final exercise for this project continues our discussion of ethical issues related to trust.
Operating systems and file systems have a very long lifespan. For instance, the current Windows filesystem, NTFS, was introduced in 1993. Once a system is adopted, it can be challenging for customers to transition to newer systems.
Although there is no requirement for operating systems to have a minimum support period, most organizations that develop operating systems provide support for around 10 years. This is typically enough for most users, but not all. For example, critical government infrastructure in the US, UK, and Netherlands relied on Windows XP (which debuted in 2001) through at least 2014.
Consider the following questions relating to OS minimum support periods and trust:
- What is one argument for strengthening requirements for a minimum support period (e.g., requiring it by law)?
- What is one argument against requiring a minimum support period?
- Identify one way that an organization developing & maintaining an operating system could support users in inferring trust that the minimum support period will be respected.
- Identify one way that an organization developing & maintaining an operating system could support users in substituting some need for trusting that the minimum support period would be respected.
Once you have considered these questions, answer Questions 4A-4D in questions.txt.
Here are some other articles related to operating system support, in case you are interested:
- Dutch and British Governments Pay to Keep Windows XP Alive
- Why the Military Can't Quit Windows XP
- Government Computers Vulnerable To Hackers
Submitting Your Work
Once you are finished working and have saved all your changes, submit by running tools/submit.
We recommend you do a trial submission in advance of the deadline to allow time to work through any snags. You may submit as many times as you like; we will grade the latest submission. Submitting a stable but unpolished/unfinished version is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, the earlier submit will earn points. Without a submission, we cannot grade your work. You can confirm the timestamp of your latest submission in your course gradebook.
Grading
Here is a recap of the work that will be graded on this assignment:
- questions.txt: answer all of the questions.
- replay.cc: add enough code to properly replay the three types of log entries.
We will grade your code using the provided sanity check tests and possible additional autograder tests. We will also review your code for style and complexity. Check out our course style guide for tips and guidelines for writing code with good style!
Credits
This assignment, along with the FUSE implementation of the V6 file system, was created by David Mazières.