File System Crash Recovery
Optional readings for this topic from Operating Systems: Principles and Practice: Chapter 14 up through Section 14.1.
The problem: crashes can happen anywhere, even in the middle of critical sections:
- Lost data: information cached in main memory may not have been written to disk yet.
- E.g., original Unix: up to 30 seconds' worth of changes could be lost.
- If a modification affects multiple blocks, a crash could occur when some of the blocks have been written to disk but not the others.
- Adding block to file: free list was updated to indicate block in use, but inode wasn't yet written to point to block.
- Creating link to a file: new directory entry refers to inode, but reference count wasn't updated in inode.
- The block cache may reorder writes.
- Ideally, we'd like something like an atomic operation where multi-block operations happen either in their entirety or not at all.
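The "adding a block to a file" case above can be explored with a small simulation. This is only an illustrative sketch: the disk is modeled as a dictionary, block number 7 and the names `crash_states` and `classify` are made up for the example, and a crash is modeled as stopping after some prefix of the pending writes.

```python
from itertools import permutations

def crash_states(initial, writes):
    """Yield (order, writes_completed, disk) for every write order and crash point."""
    for order in permutations(writes):
        for n in range(len(order) + 1):
            disk = dict(initial)
            # Only the first n writes reach the disk before the crash.
            for key, value in order[:n]:
                disk[key] = value
            yield [key for key, _ in order], n, disk

def classify(disk, block=7):
    """Describe the on-disk state of one block after a crash."""
    in_free = block in disk["free_list"]
    in_file = block in disk["inode_blocks"]
    if in_free and in_file:
        return "CORRUPT: block is both free and in the file"
    if not in_free and not in_file:
        return "leak: block is neither free nor in the file"
    return "consistent"

# Before the operation: blocks 7 and 8 are free, the file has no blocks.
initial = {"free_list": frozenset({7, 8}), "inode_blocks": ()}
# The operation needs two separate block writes.
writes = [("free_list", frozenset({8})), ("inode_blocks", (7,))]

results = {(tuple(order), n): classify(disk)
           for order, n, disk in crash_states(initial, writes)}
for key, verdict in results.items():
    print(key, verdict)
```

In this model the only bad crash points are those where exactly one of the two writes has completed: if the free list went first the block is leaked, and if the inode went first the block is both free and in use.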
Approach #1: check consistency during reboot, repair problems
fsck ("file system check")
- During every system boot
- Checks to see if system was shut down cleanly; if so, no more work to do.
- If system didn't shut down cleanly (e.g., system crash, power failure, etc.), then scan disk contents, identify inconsistencies, repair them.
- Example: block in file and also in free list
- Example: reference count for an inode doesn't match the number of links in directories
- Example: block in two different files
- Example: inode has a reference count > 0 but is not referenced in any directory.
- Restores disk to consistency, but doesn't prevent loss of information; system could end up unusable.
- Security issues: a block could migrate from the password file to some other random file.
- Can take a long time: can't restart system until fsck completes. As disks get larger, recovery time increases.
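The four example inconsistencies listed above can be detected with a single scan over the metadata. The following is a minimal sketch of that idea, not how any real fsck is implemented: the in-memory layout (`inodes`, `directories`, `free_list`) and the function name `check` are assumptions made for illustration, and the sketch only reports problems rather than repairing them.

```python
from collections import Counter

def check(inodes, directories, free_list):
    """Return a list of inconsistencies found in the (hypothetical) metadata.

    inodes:      {inode number: {"refcount": int, "blocks": [block numbers]}}
    directories: list of {name: inode number} mappings
    free_list:   set of free block numbers
    """
    problems = []
    owners = {}  # block number -> first inode seen using it
    for ino, inode in sorted(inodes.items()):
        for blk in inode["blocks"]:
            if blk in free_list:
                problems.append(f"block {blk} is in inode {ino} and also in the free list")
            if blk in owners:
                problems.append(f"block {blk} is in both inode {owners[blk]} and inode {ino}")
            else:
                owners[blk] = ino

    # Count how many directory entries refer to each inode.
    links = Counter(ino for d in directories for ino in d.values())
    for ino, inode in sorted(inodes.items()):
        if inode["refcount"] > 0 and links[ino] == 0:
            problems.append(f"inode {ino} has refcount {inode['refcount']} but no directory entry")
        elif inode["refcount"] != links[ino]:
            problems.append(f"inode {ino} has refcount {inode['refcount']} but {links[ino]} link(s)")
    return problems

inodes = {
    1: {"refcount": 1, "blocks": [10, 11]},
    2: {"refcount": 2, "blocks": [11, 12]},
    3: {"refcount": 1, "blocks": []},
}
directories = [{"a": 1, "b": 2}]
free_list = {12, 13}

problems = check(inodes, directories, free_list)
for p in problems:
    print(p)
```

Even in this toy form the check has to visit every inode and every directory entry, which is why recovery time grows with the size of the disk.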
Approach #2: ordered writes
Prevent certain kinds of inconsistencies by making updates in a particular order.
- For example, when adding a block to a file, first write back the free list so that it no longer contains the file's new block.
- Then write the inode, referring to the new block.
- What can we say about the system state after a crash?
- In general:
- Never write a pointer before initializing the block it points to (e.g., indirect block).
- Never reuse a resource (inode, disk block, etc.) before nullifying all existing pointers to it.
- Never clear last pointer to a live resource before setting new pointer (e.g.
Result: no need to wait for fsck when rebooting
- Can leak resources (run fsck in background to reclaim leaked resources).
- Requires lots of synchronous metadata writes, which slows down file operations.
- Improvement: don't actually write the blocks synchronously; instead, record dependencies between blocks in the buffer cache.
- For example, after adding a block to a file, add a dependency between the inode block and the free list block.
- When it's time to write the inode back to disk, make sure that the free list block has been written first.
- Tricky to get right: potentially end up with circular dependencies between blocks.
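The dependency-tracking idea can be sketched as a small buffer cache that remembers, for each dirty block, which other blocks must reach disk first. Everything here is hypothetical (the class `BufferCache`, its methods, and the block names), and the sketch simply refuses a dependency that would create a cycle, which is a simplification rather than what a production file system does.

```python
class BufferCache:
    """Toy buffer cache that orders write-backs by recorded dependencies."""

    def __init__(self):
        self.dirty = {}      # block name -> pending contents
        self.deps = {}       # block name -> blocks that must be written first
        self.disk_log = []   # order in which blocks actually reached "disk"

    def modify(self, block, contents):
        self.dirty[block] = contents
        self.deps.setdefault(block, set())

    def add_dependency(self, block, must_be_written_first):
        # Adding this edge creates a cycle if `block` is already (transitively)
        # required to be written before `must_be_written_first`.
        if self._reaches(must_be_written_first, block):
            raise ValueError(f"circular dependency: {block} <-> {must_be_written_first}")
        self.deps.setdefault(block, set()).add(must_be_written_first)

    def _reaches(self, start, target):
        stack, seen = [start], set()
        while stack:
            current = stack.pop()
            if current == target:
                return True
            if current not in seen:
                seen.add(current)
                stack.extend(self.deps.get(current, ()))
        return False

    def write_back(self, block):
        if block not in self.dirty:
            return  # clean blocks are already on disk
        # Write every prerequisite block before this one.
        for dep in sorted(self.deps.get(block, ())):
            self.write_back(dep)
        del self.dirty[block]
        self.deps[block] = set()
        self.disk_log.append(block)

cache = BufferCache()
cache.modify("free_list_block", "block 7 marked in use")
cache.modify("inode_block", "inode points to block 7")
cache.add_dependency("inode_block", "free_list_block")
cache.write_back("inode_block")
print(cache.disk_log)
```

Asking for the inode block to be written forces the free list block out first, so the on-disk order matches the ordered-writes rule without any synchronous write at modification time.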
Approach #3: write-ahead logging
Also called journaling file systems
Implemented in Linux ext3 and NTFS (Windows).
Similar in function to logs in database systems; allows inconsistencies to be corrected quickly during reboots
- Before performing an operation, record information about the operation in a special append-only log file; flush this info to disk before modifying any other blocks.
- Example: adding a block to a file
- Log entry: "I'm about to add block 99421 to inode 862 at block index 93"
- Then the actual block updates can be carried out later, in any order.
- If a crash occurs, replay the log to make sure all updates are completed on disk.
- Guarantees that once an operation's log entry has reached disk, the operation will eventually complete, even if a crash intervenes.
- Problem: log grows over time, so recovery could be slow.
- Solution: occasional checkpoints:
- Record current log head
- Flush all dirty blocks to disk
- Once this is done, the log can be cleared up to the recorded position
- Typically the log is used only for metadata (free list, inodes, indirect blocks), not for actual file data.
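The log, replay, and checkpoint steps above can be sketched with a toy model. This assumes redo-style log entries that record the new value of each affected metadata item (so replaying an entry twice is harmless); the class `JournaledDisk`, its methods, and the key format are invented for the example and do not reflect the on-disk format of ext3 or NTFS.

```python
class JournaledDisk:
    """Toy write-ahead log: durable log and blocks, volatile cache."""

    def __init__(self):
        self.log = []      # durable, append-only list of logged operations
        self.blocks = {}   # durable home locations of metadata
        self.cache = {}    # volatile dirty blocks, lost on a crash

    def update(self, changes):
        # Step 1: append the operation to the log (stands in for a flushed log write).
        self.log.append(dict(changes))
        # Step 2: apply the changes in memory only; disk writes happen later.
        self.cache.update(changes)

    def write_back(self, block):
        # Cached blocks may be written to their home locations in any order.
        self.blocks[block] = self.cache.pop(block)

    def crash(self):
        self.cache.clear()  # everything in memory is lost

    def recover(self):
        # Replay every logged operation; entries hold new values, so this is idempotent.
        for entry in self.log:
            self.blocks.update(entry)

    def checkpoint(self):
        head = len(self.log)            # record the current log head
        self.blocks.update(self.cache)  # flush all dirty blocks
        self.cache.clear()
        del self.log[:head]             # the log can now be cleared up to that point

fs = JournaledDisk()
# "Add block 99421 to inode 862 at block index 93"
op = {("inode", 862, 93): 99421, ("free", 99421): False}
fs.update(op)
fs.write_back(("free", 99421))  # only one of the two updates reaches disk
fs.crash()
fs.recover()
print(fs.blocks)
```

After `recover()`, both halves of the operation are on disk even though the crash happened when only the free list update had been written.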
- Recovery is much faster than a full fsck scan: only the log needs to be replayed.
- Eliminates inconsistencies such as blocks confused between files.
- Log written sequentially, so log writes are faster (no seeks).
- Metadata writes can be delayed a long time, for better performance.
- Drawback: a synchronous disk write (to the log) is needed before every metadata operation.
Solution: delay log writes
- Assign log positions immediately, but don't write the log entries to disk right away
- Mark each cache block with latest log position related to that cache block
- Before evicting cache block, flush log
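- This separates durability from consistency: the disk is always consistent after a crash, but the most recent operations may be lost if their log entries were never flushed.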
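A minimal sketch of delayed log writes, under the assumption that log positions are simple increasing integers: entries are buffered in memory, each cached block remembers the latest log position that touched it, and evicting a block first flushes the log up to that position. The class `DelayedLog` and its fields are hypothetical names for illustration.

```python
class DelayedLog:
    """Toy model: log entries stay in memory until a dependent block is evicted."""

    def __init__(self):
        self.next_pos = 0
        self.pending = []      # in-memory log entries: (position, changes)
        self.durable_log = []  # log entries that have reached disk
        self.cache = {}        # block -> (value, latest log position affecting it)
        self.disk = {}         # home locations on disk

    def update(self, changes):
        # Assign the log position immediately, but do not write the log yet.
        pos = self.next_pos
        self.next_pos += 1
        self.pending.append((pos, dict(changes)))
        for block, value in changes.items():
            self.cache[block] = (value, pos)
        return pos

    def flush_log(self, up_to):
        # Move buffered entries with position <= up_to to the durable log, in order.
        while self.pending and self.pending[0][0] <= up_to:
            self.durable_log.append(self.pending.pop(0))

    def evict(self, block):
        value, pos = self.cache.pop(block)
        self.flush_log(pos)       # the log must reach disk before the block does
        self.disk[block] = value

log = DelayedLog()
log.update({"inode 862": "v1", "free list": "v1"})  # position 0
log.update({"inode 900": "v1"})                     # position 1
before = list(log.durable_log)
log.evict("free list")
print(before, log.durable_log, log.pending)
```

No log entry reaches disk until a block that depends on it is evicted, so the write-ahead rule still holds while the synchronous write per operation is avoided.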
Crashes can still lose recently-written data if it hasn't been flushed to disk.
- Solution: apps can use fsync to force data to disk.
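As a small illustration of the application side, Python exposes fsync as `os.fsync`. The helper name `durable_write` and the file name are made up for this sketch, which also assumes the underlying device honors the flush request.

```python
import os
import tempfile

def durable_write(path, data):
    """Write bytes to path and ask the OS to force them to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move data from Python's buffer into the OS cache
        os.fsync(f.fileno())  # ask the OS to write its cached data for this file to disk

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "important.dat")
durable_write(path, b"must survive a crash\n")
```

Without the `os.fsync` call, the data could sit in the OS cache and be lost in a crash, as described above.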
- Disk failures: one of the greatest causes of problems in large datacenters
- Solution: replication or backup copies (e.g., on tape)
- Interesting tradeoffs between performance, durability, and consistency
- To get highest performance, must give up some crash recovery capability.
- Must decide what kinds of failures you want to recover from.