File System Crash Recovery
Lecture Notes for CS 140
Spring 2014
John Ousterhout
- Readings for this topic from Operating Systems: Principles and Practice:
Chapter 14 up through Section 14.1.
- The problem: crashes can happen anywhere, even in the middle
of critical sections:
- Lost data: information cached in main memory may not
have been written to disk yet.
- E.g. original Unix: up to 30 seconds' worth of changes could be lost
- Inconsistency:
- If a modification affects multiple blocks, a crash
could occur when some of the blocks have been written
to disk but not the others.
- Examples:
- Adding block to file: free list was updated to indicate
block in use, but file descriptor wasn't yet written to point to
block.
- Creating link to a file: new directory entry refers to
file descriptor, but reference count wasn't updated in file descriptor.
- Ideally, we'd like atomicity: each multi-block operation happens either
in its entirety or not at all.
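- To make the window concrete, here is a minimal sketch in C of adding a block
to a file (all structures, block numbers, and helper names are made up for
illustration): the free bitmap and the file descriptor live in different disk
blocks, so a crash between the two writes leaves them inconsistent no matter
which order is used.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 512
    #define NUM_BLOCKS 1024

    /* Toy in-memory "disk" standing in for real block I/O (illustration only). */
    static uint8_t disk[NUM_BLOCKS][BLOCK_SIZE];
    static void read_block(uint32_t b, void *buf, size_t n)        { memcpy(buf, disk[b], n); }
    static void write_block(uint32_t b, const void *buf, size_t n) { memcpy(disk[b], buf, n); }

    /* Hypothetical on-disk structures; layouts and field names are made up. */
    struct inode  { uint32_t size; uint32_t block_ptrs[12]; };
    struct bitmap { uint8_t bits[BLOCK_SIZE]; };       /* one bit per disk block */

    /* Adding a block to a file touches TWO different disk blocks. */
    void file_grow(uint32_t inode_blk, uint32_t bitmap_blk,
                   uint32_t new_block, int index)
    {
        struct bitmap bm;
        struct inode  in;

        read_block(bitmap_blk, &bm, sizeof bm);
        bm.bits[new_block / 8] |= 1u << (new_block % 8);  /* mark block in use   */
        write_block(bitmap_blk, &bm, sizeof bm);          /* disk write #1       */

        /* A crash HERE leaks new_block: marked in use but referenced nowhere.
           Doing the writes in the other order is worse: after a crash the file
           could point at a block that is still on the free list. */

        read_block(inode_blk, &in, sizeof in);
        in.block_ptrs[index] = new_block;                 /* point file at block */
        write_block(inode_blk, &in, sizeof in);           /* disk write #2       */
    }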
Approach #1: check consistency during reboot, repair problems
- Example: Unix fsck ("file system check")
- fsck is executed during every system boot.
- Checks to see if disk was shut down cleanly; if so, no more
work to do.
- If disk didn't shut down cleanly (e.g., system crash,
power failure, etc.), then scan disk contents, identify
inconsistencies, repair them.
- Example: block in file and also in free list
- Example: reference count for a file descriptor doesn't match
the number of links in directories
- Example: block in two different files
- Example: file descriptor has a reference count > 0 but is not
referenced in any directory.
- Limitations of fsck:
- Restores disk to consistency, but doesn't prevent loss
of information; system could end up unusable.
- Security issues: a block could migrate from the password
file to some other random file.
- Can take a long time: 1.5 hours to read every block in a
medium-size disk today. Can't restart system until
fsck completes. As disks get larger, recovery
time increases.
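- A minimal sketch in C of the kind of cross-check fsck performs on block usage
(structures hypothetical; a real fsck also checks directory link counts,
reference counts, etc.): count how many files reference each block, then
compare against the free list.

    #include <stdio.h>
    #include <stdint.h>

    #define NUM_BLOCKS 1024
    #define NUM_INODES 128

    /* Hypothetical in-memory copies of metadata fsck has already read from disk. */
    struct inode { int in_use; uint32_t nblocks; uint32_t block_ptrs[12]; };

    /* Every block should be EITHER on the free list OR referenced by exactly
       one file, never both and never neither. */
    void check_blocks(const struct inode inodes[NUM_INODES],
                      const uint8_t block_is_free[NUM_BLOCKS])
    {
        uint32_t refs[NUM_BLOCKS] = {0};   /* how many files reference each block */

        for (int i = 0; i < NUM_INODES; i++) {
            if (!inodes[i].in_use)
                continue;
            for (uint32_t j = 0; j < inodes[i].nblocks; j++)
                refs[inodes[i].block_ptrs[j]]++;
        }

        for (uint32_t b = 0; b < NUM_BLOCKS; b++) {
            if (refs[b] > 1)
                printf("block %u appears in %u different files\n", b, refs[b]);
            if (refs[b] > 0 && block_is_free[b])
                printf("block %u is in a file AND on the free list\n", b);
            if (refs[b] == 0 && !block_is_free[b])
                printf("block %u leaked: not free, not in any file\n", b);
        }
    }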
Approach #2: ordered writes
- Prevent certain kinds of inconsistencies by making updates
in a particular order.
- For example, when adding a block to a file, first write
back the free list so that it no longer contains the
file's new block.
- Then write the file descriptor, referring to the new block.
- What can we say about the system state after a crash?
- In general:
- Never write a pointer before initializing the block
it points to (e.g., indirect block).
- Never reuse a resource (inode, disk block, etc.) before
nullifying all existing pointers to it.
- Never clear last pointer to a live resource before
setting new pointer (e.g. mv).
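- As a concrete case of the third rule, a sketch in C of a crash-safe rename
(directory structures and the write helper are hypothetical): the new directory
entry reaches disk before the old one is cleared, so a crash can at worst leave
the file briefly reachable under both names, never under zero names.

    #include <stdint.h>
    #include <string.h>

    #define NAME_LEN 28

    /* Hypothetical on-disk directory entry. */
    struct dir_entry {
        uint32_t inode_num;                  /* 0 means "slot unused" */
        char     name[NAME_LEN];
    };

    /* Stand-in for a synchronous write of the directory block holding an entry. */
    static void write_dir_block_sync(uint32_t blockno) { (void)blockno; }

    /* Move a file to a new name without ever clearing the last pointer to it. */
    void rename_entry(struct dir_entry *old_ent, uint32_t old_blk,
                      struct dir_entry *new_ent, uint32_t new_blk,
                      const char *new_name)
    {
        /* 1. Create and persist the NEW name first. */
        new_ent->inode_num = old_ent->inode_num;
        strncpy(new_ent->name, new_name, NAME_LEN - 1);
        new_ent->name[NAME_LEN - 1] = '\0';
        write_dir_block_sync(new_blk);

        /* A crash HERE leaves the file reachable under both names, and the
           file descriptor's reference count no longer matches the number of
           directory links; fsck can repair that.  Clearing the old entry
           first would risk losing the file entirely. */

        /* 2. Only now remove the OLD name. */
        old_ent->inode_num = 0;
        write_dir_block_sync(old_blk);
    }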
- Result: no need to wait for fsck when rebooting
- Problems:
- Can leak resources (run fsck in background to reclaim
leaked resources).
- Requires lots of synchronous metadata writes, which
slows down file operations.
- Improvement:
- Don't actually write the blocks synchronously, but record
dependencies in the buffer cache.
- For example, after adding a block to a file add
dependency between file descriptor block and free list block.
- When it's time to write the file descriptor back to disk, make
sure that the free list block has been written first.
- Tricky to get right: can end up with circular dependencies between
blocks (e.g., a single disk block containing two structures with
conflicting ordering requirements).
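- A minimal sketch in C of the dependency idea (a greatly simplified version of
the idea behind BSD's soft updates; all names hypothetical): each dirty block
in the cache remembers which blocks must reach disk before it, and the flush
path honors that order.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_DEPS 8
    #define BLOCK_SIZE 512

    /* A dirty block in the buffer cache, plus the blocks that must be written
       to disk before it. */
    struct cached_block {
        uint32_t blockno;
        bool     dirty;
        bool     flushing;                    /* guard against cycles        */
        int      ndeps;
        struct cached_block *deps[MAX_DEPS];  /* must reach disk before us   */
        uint8_t  data[BLOCK_SIZE];
    };

    /* Stand-in for the real disk write (illustration only). */
    static void write_block_to_disk(const struct cached_block *b) {
        printf("writing block %u to disk\n", b->blockno);
    }

    /* Record "before must be on disk before after", e.g. after adding a block
       to a file: add_dependency(free_list_block, file_descriptor_block). */
    void add_dependency(struct cached_block *before, struct cached_block *after)
    {
        if (after->ndeps < MAX_DEPS)
            after->deps[after->ndeps++] = before;
    }

    /* Flush a block, writing its dependencies first.  This naive recursion
       does not handle circular dependencies; real systems must detect and
       break them. */
    void flush_block(struct cached_block *b)
    {
        if (!b->dirty || b->flushing)
            return;
        b->flushing = true;
        for (int i = 0; i < b->ndeps; i++)
            flush_block(b->deps[i]);          /* dependencies go first */
        write_block_to_disk(b);
        b->dirty = false;
        b->ndeps = 0;
        b->flushing = false;
    }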
Approach #3: write-ahead logging
- Also called journaling file systems
- Implemented in Linux ext3 and NTFS (Windows).
- Similar in function to logs in database systems; allows
inconsistencies to be corrected quickly during reboots
- Before performing an operation, record information about
the operation in a special append-only log file; flush this
info to disk before modifying any other blocks.
- Example: adding a block to a file
- Log entry: "I'm about to add block 99421 to file descriptor 862 at block
index 93"
- Then the actual block updates can be carried out later.
- If a crash occurs, replay the log to make sure all updates
are completed on disk.
- Guarantees that once an operation's log record reaches disk, the operation
will eventually be completed, even if a crash intervenes.
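- A minimal sketch in C of the write-ahead rule for the block-append example
above (log format and helpers are toy stand-ins, not any real file system's):
the log record describing the whole operation is forced to disk before any of
the affected metadata blocks are written.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical log record: enough to redo "add block B to file descriptor I
       at block index N" during replay. */
    enum { OP_ADD_BLOCK = 1 };
    struct log_record {
        uint32_t op, inode_num, block_num, block_index;
    };

    /* Toy stand-ins for the real on-disk log and buffer cache (illustration only). */
    static uint8_t log_buf[4096];
    static size_t  log_len;

    static void log_append(const void *rec, size_t len) {
        memcpy(log_buf + log_len, rec, len);
        log_len += len;
    }
    static void log_flush(void) {
        printf("synchronous write: %zu bytes of log forced to disk\n", log_len);
    }
    static void mark_dirty(const char *what, uint32_t num) {
        printf("buffer cache: %s %u dirtied (written back lazily)\n", what, num);
    }

    void file_grow_logged(uint32_t inode_num, uint32_t block_num, uint32_t index)
    {
        /* 1. Describe the whole multi-block operation in one log record. */
        struct log_record r = { OP_ADD_BLOCK, inode_num, block_num, index };
        log_append(&r, sizeof r);

        /* 2. Force the log to disk BEFORE any of the affected metadata blocks
              are written.  From here on a crash is harmless: replaying the
              log during reboot will redo both updates. */
        log_flush();

        /* 3. The actual block updates can be applied lazily, in any order. */
        mark_dirty("free list entry for block", block_num);
        mark_dirty("file descriptor", inode_num);
    }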
- Problem: log grows over time, so recovery could be slow.
- Solution: checkpoint
- Occasionally stop and flush all dirty blocks to disk.
- Once this is done, the log can be cleared.
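- A sketch in C of how checkpointing and reboot-time replay fit together
(hypothetical helpers; real journaling file systems are considerably more
involved): a checkpoint flushes every dirty block and then clears the log, so
replay after a crash only has to redo operations logged since the last
checkpoint.

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    struct log_record { uint32_t op, inode_num, block_num, block_index; };

    /* Toy stand-ins for the real buffer cache and on-disk log (illustration only). */
    static void flush_all_dirty_blocks(void) { printf("all dirty blocks written back\n"); }
    static void log_truncate(void)           { printf("log cleared\n"); }
    static size_t log_read_all(struct log_record *recs, size_t max)
                                             { (void)recs; (void)max; return 0; }
    static void redo(const struct log_record *r) { printf("redo op %u\n", r->op); }

    /* Checkpoint: once every dirty block is safely on disk, nothing in the log
       is needed any more, so the log can be cleared.  Run occasionally, e.g.
       whenever the log gets large. */
    void checkpoint(void)
    {
        flush_all_dirty_blocks();
        log_truncate();
    }

    /* Reboot-time recovery: replay everything logged since the last checkpoint.
       Redoing an operation must be idempotent, since some of its block updates
       may already have reached disk before the crash. */
    void recover(void)
    {
        static struct log_record recs[1024];
        size_t n = log_read_all(recs, 1024);
        for (size_t i = 0; i < n; i++)
            redo(&recs[i]);
    }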
- Typically the log is used only for metadata (free list, file descriptors,
indirect blocks), not for actual file data.
- Logging advantages:
- Recovery much faster.
- Eliminate inconsistencies such as blocks confused between files.
- Log can be localized in one area of disk, so writes are faster
(no seeks).
- Metadata writes can be delayed a long time, for better performance.
- Logging disadvantages:
- Requires a synchronous disk write (the log flush) for every metadata operation.
Remaining problems
- Can still lose recently-written data after crash
- Solution: apps can use fsync to force data to disk.
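- For example, an application that cannot afford to lose a write can force it
to disk explicitly (standard POSIX calls; error handling abbreviated):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Durably append a record to a file: without the fsync, the data could sit
       in the kernel's buffer cache for many seconds and be lost in a crash. */
    int append_durably(const char *path, const char *record)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, record, strlen(record)) < 0 || fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        /* (If the file was just created, a fully careful application also
           fsyncs the containing directory so the new name itself is durable.) */
        return close(fd);
    }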
- Disks fail
- One of the most significant sources of problems in large datacenters
- Solution: replication or backup copies (e.g., on tape)
- Disk writes are not atomic:
- If a block is being written at the time of the crash,
it may be left in inconsistent state (neither old contents
nor new).
- At the level of individual sectors, inconsistencies are detectable (the
disk's per-sector error-correcting code flags a partially written sector);
after a crash, a sector will be either
- Old contents
- New contents
- Unreadable trash
- But, blocks are typically multiple sectors. After crash:
- Sectors 0-5 of block may have new contents.
- Sectors 6-7 of block may have old contents.
- Example: appending to log
- If new log entries are appended to a log block that already contains
entries, a crash during that block write could destroy the old entries.
- Solution:
- Replicated log writes (if crash corrupts one of the logs,
the other will still be safe).
- Add checksums and/or versions to detect incomplete writes.
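- A minimal sketch in C of the checksum idea for log records (record format
and hash are illustrative only; real journals typically use a CRC): the
checksum covers the whole record and is computed before the record is written,
so a record that was only partially on disk at the time of the crash fails
verification during replay and can be discarded.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical log record with a trailing checksum over the other fields. */
    struct log_record {
        uint32_t op, inode_num, block_num, block_index;
        uint32_t checksum;
    };

    /* Simple FNV-1a hash; a real journal would typically use a CRC. */
    static uint32_t fnv1a(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Compute the checksum just before the record is written to the log. */
    void seal_record(struct log_record *r)
    {
        r->checksum = fnv1a(r, offsetof(struct log_record, checksum));
    }

    /* During replay: a record whose sectors were only partially written before
       the crash will (almost certainly) fail this check and can be discarded. */
    bool record_is_valid(const struct log_record *r)
    {
        return r->checksum == fnv1a(r, offsetof(struct log_record, checksum));
    }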
- Conclusions:
- To get highest performance, must give up some crash recovery
capability.
- Must decide what kinds of failures you want to recover from.