Managing Flash Memory
Lecture Notes for CS 140
Spring 2020
John Ousterhout
- Readings for this topic from Operating Systems: Principles and Practice:
Section 12.2.
- Solid-state (semiconductor) storage devices ("SSDs") have replaced disks
in many applications (e.g., phones and other devices).
- Comparison to disk:
- No moving parts, so more reliable
- 100x faster access
- More shock-resistant
- Cost/bit 5-10x higher than disk
- Comparison to DRAM:
- Nonvolatile: values persist even if device is powered off
- Cost/bit 5-10x lower
- 1000x slower
- Flash memory hardware:
- Total chip capacity up to 512 Gbytes
- Storage divided into erase units (typically 256 Kbytes),
which are subdivided into pages (typically 512 bytes or 4 Kbytes)
- Storage is read in units of pages
- Two kinds of writes:
- Erase: sets all of the bits in an erase unit to 1's.
- Write: modifies an individual page, can only clear bits
to 0 (writing 1's has no effect).
- Can write repeatedly to clear more bits.
- Wear-out: once a page has been erased many times (typically
around 100,000, as low as 1,000 in some new devices) it no longer
stores information reliably.
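The erase/write rules above can be modeled in a few lines. This is only a sketch: the class and method names are hypothetical, and the 4 Kbyte page and 64-page erase unit are assumptions picked to match the "typical" sizes in these notes. A write is modeled as a bitwise AND with the old contents, so it can clear bits but never set them.

```python
PAGE_SIZE = 4096       # bytes per page (assumed; the notes give 512 bytes or 4 Kbytes)
PAGES_PER_UNIT = 64    # 64 pages * 4 Kbytes = 256 Kbyte erase unit (assumed)


class EraseUnit:
    """Toy model of one erase unit; not a real device interface."""

    def __init__(self):
        self.erase_count = 0
        self.pages = [bytes([0xFF]) * PAGE_SIZE for _ in range(PAGES_PER_UNIT)]

    def erase(self):
        """Set every bit in the unit to 1 and record the erasure (wear)."""
        self.pages = [bytes([0xFF]) * PAGE_SIZE for _ in range(PAGES_PER_UNIT)]
        self.erase_count += 1

    def write(self, page_no, data):
        """Program one page: bits can go from 1 to 0, never from 0 to 1."""
        if len(data) != PAGE_SIZE:
            raise ValueError("data must be exactly one page long")
        old = self.pages[page_no]
        self.pages[page_no] = bytes(o & d for o, d in zip(old, data))

    def read(self, page_no):
        return self.pages[page_no]
```

Writing 0xF0 and then 0x3F to the same byte leaves 0x30 in this model: the second write clears more bits, and only erase() brings any bit back to 1.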
- Typical flash memory performance:
- Read performance: 20-100 microseconds latency,
100-2500 MBytes/sec.
- Erasure time: 2 ms
- Write performance: 200 microseconds latency,
100-500 MBytes/sec.
- Most flash memory devices are packaged with a
flash translation layer (FTL):
- Software that manages the flash device
- Typically provides an interface like that for a disk
(read and write blocks)
- Allows flash to be used with existing file system software
- FTLs are interesting pieces of software, but they have several
problems:
- Sacrifice performance
- Waste capacity
- Proprietary: no information available about implementation and
performance pathologies
- One possible approach for FTLs: direct mapped (e.g., some cheap
flash sticks)
- FAT format often used
- Virtual block i is stored on page i of the flash device
- Reads are simple
- To write virtual block i:
- Read erase unit containing page i
- Erase the entire unit
- Rewrite erase unit with modified page
- What's wrong with this approach?
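The direct-mapped write path above can be sketched as follows, with counters that make its costs visible. The class and field names are hypothetical, pages are modeled as Python values rather than bytes, and the 64-page erase unit is an assumption.

```python
PAGES_PER_UNIT = 64   # assumed: 256 Kbyte erase unit / 4 Kbyte pages


class DirectMappedFlash:
    """Toy direct-mapped FTL: virtual block i always lives on page i."""

    def __init__(self, num_units):
        self.pages = [None] * (num_units * PAGES_PER_UNIT)   # None = erased
        self.erase_counts = [0] * num_units
        self.pages_programmed = 0

    def read(self, block):
        return self.pages[block]

    def write(self, block, data):
        unit = block // PAGES_PER_UNIT
        start = unit * PAGES_PER_UNIT
        # 1. Read the whole erase unit into memory and patch the one page.
        saved = self.pages[start:start + PAGES_PER_UNIT]
        saved[block - start] = data
        # 2. Erase the entire unit.
        self.pages[start:start + PAGES_PER_UNIT] = [None] * PAGES_PER_UNIT
        self.erase_counts[unit] += 1
        # 3. Rewrite every page that held data, including the modified one.
        for offset, contents in enumerate(saved):
            if contents is not None:
                self.pages[start + offset] = contents
                self.pages_programmed += 1
```

In this sketch, ten writes to block 3 cause ten erasures of the same erase unit while every other unit stays at zero, and each write reprograms all of the populated pages in that unit.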
- To avoid these problems:
- Separate virtual block number from physical location in flash memory
- A given virtual block can occupy different pages in flash memory
over time.
- Keep a block map that maps from virtual blocks to physical pages
- Reads must first look up the physical location in the block map
- For writes:
- Find a free and erased page
- Write virtual block to that page
- Update block map with new location
- Mark previous page for virtual block as free (it cannot be reused
until its erase unit has been erased)
- This introduces additional issues
- How to manage map (is it stored on the flash device?)
- How to manage free space (e.g. wear leveling)
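A minimal sketch of the block-map read and write paths described above. All names are hypothetical, the map is a Python dict held in memory, and free-page selection is a naive linear scan; a real FTL would also have to deal with the map-management and free-space issues just listed.

```python
class MappedFlash:
    """Toy FTL with a block map; page states are tracked in memory only."""

    ERASED, LIVE, GARBAGE = "erased", "live", "garbage"

    def __init__(self, num_pages):
        self.data = [None] * num_pages
        self.state = [self.ERASED] * num_pages
        self.block_map = {}     # virtual block number -> physical page

    def read(self, block):
        # One map lookup, then one page read.
        return self.data[self.block_map[block]]

    def write(self, block, contents):
        # Find a free and erased page (raises ValueError if none is left).
        page = self.state.index(self.ERASED)
        # Write the virtual block to that page.
        self.data[page] = contents
        self.state[page] = self.LIVE
        # Update the block map with the new location.
        old = self.block_map.get(block)
        self.block_map[block] = page
        # The previous copy is no longer in use, but it needs an erase
        # before the page can hold new data.
        if old is not None:
            self.state[old] = self.GARBAGE
```

Note that overwriting a block never erases anything in this sketch; the cost is deferred until the no-longer-used pages have to be reclaimed.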
- One approach: keep block map in memory, rebuild on startup:
- Don't store block map on flash device
- Each page on flash contains an additional header:
- Virtual block number
- Allocated bit (1 => free, 0 => allocated)
- Written bit (1 => not yet written (all ones, as left by erase), 0 => written)
- Garbage bit (1 => not garbage, 0 => no longer in use, must be erased
before reusing page)
- A-W-G bits track lifecycle of page:
- Just erased: 1-1-1
- About to write data: 0-1-1
- Block successfully written: 0-0-1
- Block deleted (new copy written elsewhere): 0-0-0
- Why is 0-1-1 state needed?
- On startup, read entire contents of flash memory to rebuild
block map (32 seconds for 8GB, 512 seconds for 128GB).
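The startup scan can be sketched as below. The names are hypothetical, and the sketch assumes that at most one page per virtual block is in the 0-0-1 state; a crash between writing the new copy and marking the old one as garbage would violate that, and a real FTL would need some tie-breaker (for example a sequence number, which is not part of the header described above). The 250 MBytes/sec default is inferred from the two scan times quoted in these notes, not a measured figure.

```python
from dataclasses import dataclass


@dataclass
class PageHeader:
    block: int = -1       # virtual block number (hypothetical: -1 while unset)
    allocated: int = 1    # 1 => free, 0 => allocated
    written: int = 1      # 1 => not yet written, 0 => written
    garbage: int = 1      # 0 => no longer in use


def rebuild_block_map(headers):
    """Scan every page header; return (block_map, free_pages)."""
    block_map, free_pages = {}, []
    for page, h in enumerate(headers):
        bits = (h.allocated, h.written, h.garbage)
        if bits == (1, 1, 1):
            free_pages.append(page)       # just erased, ready for use
        elif bits == (0, 0, 1):
            block_map[h.block] = page     # live copy of a virtual block
        # 0-1-1 (write never completed) and 0-0-0 (superseded) pages are
        # skipped: neither is usable until its erase unit is erased.
    return block_map, free_pages


def scan_seconds(capacity_gb, mb_per_sec=250):
    """Time to read the whole device once, taking 1 GB = 1000 MB."""
    return capacity_gb * 1000 / mb_per_sec
```

With these assumptions, scan_seconds(8) and scan_seconds(128) reproduce the 32-second and 512-second figures above.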
- To reduce memory utilization for block map, store block map in
flash, cache parts of it in memory
- Header for each flash page indicates whether that page is a
data page or a map page
- Keep locations of map pages in memory (map-map)
- Scan flash on startup to re-create map-map
- During writes, must write new map page plus new data page
- Some reads may require 2 flash operations
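The two-level lookup can be sketched as follows. This is an illustration only: the names are hypothetical, the flash device is stood in for by a dict from physical page to contents, the map-page size is an assumption, and the write path (which must also write a new map page) is not shown.

```python
ENTRIES_PER_MAP_PAGE = 1024   # assumed: 4 Kbyte map page, 4-byte entries


class TwoLevelMap:
    """Block map stored in flash, with an in-memory map-map and cache."""

    def __init__(self, flash, map_map):
        self.flash = flash        # physical page -> contents (stand-in for device)
        self.map_map = map_map    # map page index -> physical page, kept in memory
        self.cache = {}           # map page index -> cached map page
        self.flash_reads = 0

    def _read_flash(self, page):
        self.flash_reads += 1
        return self.flash[page]

    def lookup(self, block):
        """Return the physical page that holds a virtual block."""
        index, slot = divmod(block, ENTRIES_PER_MAP_PAGE)
        if index not in self.cache:               # miss: extra flash read
            self.cache[index] = self._read_flash(self.map_map[index])
        return self.cache[index][slot]

    def read(self, block):
        return self._read_flash(self.lookup(block))
```

A read that misses in the cache costs two flash reads (map page, then data page); a read that hits costs one.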
- Garbage pages accumulate in erase units, which reduces
effective capacity.
- Solution: garbage collection
- Find erase units with many free pages (pages no longer in use)
- Copy live pages to a clean erase unit (update block map)
- Erase and reuse old erase unit
- Note: must always keep at least one clean erase unit to use for
garbage collection!
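One garbage collection pass might look like the sketch below. The data layout and names are hypothetical, and the victim is chosen greedily (the unit with the most pages no longer in use), which is one possible policy rather than the policy. The sketch assumes clean_unit is completely erased, which is why a clean unit must always be held in reserve.

```python
def garbage_collect(units, clean_unit, block_map):
    """Reclaim one erase unit, copying its live pages into clean_unit.

    units: dict of unit id -> list of pages, where each page is
           ("live", block, data), ("garbage",) or ("erased",).
    block_map: dict of virtual block -> (unit id, page offset).
    Returns the id of the reclaimed unit, which is now the clean spare.
    """
    candidates = [u for u in units if u != clean_unit]
    victim = max(candidates,
                 key=lambda u: sum(p[0] == "garbage" for p in units[u]))
    dest = 0
    for page in units[victim]:
        if page[0] == "live":
            units[clean_unit][dest] = page           # copy the live page
            block_map[page[1]] = (clean_unit, dest)  # update the block map
            dest += 1
    units[victim] = [("erased",)] * len(units[victim])  # erase the old unit
    return victim
```

After the pass, the old clean unit holds the surviving data plus some erased pages for new writes, and the victim becomes the reserved clean unit.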
- Hard to achieve good performance and good utilization at the same time:
- If the flash device is 90% utilized, write cost increases by
more than 10x:
- To get space for one new erase unit, must garbage collect 10 old
erase units
- 9 will still be valid and must be copied
- 1 new erase unit gets written
- Total: 9 erase units read and 10 erase units written in order to
write 1 new erase unit!
- This is called write amplification
- Frequent garbage collection (e.g. because of high utilization)
also wears out the device faster
- Lower utilization makes writes cheaper, but wastes space.
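The arithmetic above generalizes to any utilization. The sketch below assumes every erase unit is equally full (the worst case for garbage collection); the function name is hypothetical. At utilization u, freeing one erase unit of space requires collecting 1/(1-u) units, of which a fraction u is still valid and must be copied.

```python
def write_cost(utilization):
    """Return (units_read, units_written) per erase unit of new data.

    Assumes all erase units are equally utilized.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    collected = 1 / (1 - utilization)   # erase units that must be cleaned
    live = collected * utilization      # valid data that must be copied
    return live, live + 1               # copies, plus the 1 unit of new data
```

At 90% utilization this gives roughly 9 units read and 10 written, matching the example above; at 50% it drops to 1 read and 2 written.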
- Ideal situation: hot and cold data
- Some erase units contain only data that is never modified ("cold"),
so they are always full and never need to be garbage collected.
- Other erase units contain data that is quickly
overwritten; we can just wait until all of the pages have been
overwritten, then reuse the erase unit.
- There are ways to encourage such a bimodal distribution.
- Wear-leveling:
- Want all erase units to be erased at about the same rate
- Use garbage collection to move data between "hot" and "cold"
erase units.
- Main problem is with cold erase units (hot ones turn over
quickly)
- Every once in a while, garbage collect a cold erase unit (one
that hasn't been erased very often) even if it has no free
pages
- This moves the cold data to another erase unit and allows
the unworn erase unit to be assigned new data, which will
probably turn over more quickly.
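One way to fold wear-leveling into the choice of which erase unit to collect is sketched below. The wear_gap threshold and the field names are hypothetical; the notes only say that a cold unit should be collected "every once in a while".

```python
def pick_unit_to_collect(units, wear_gap=1000):
    """Return the index of the erase unit to garbage collect next.

    units: list of dicts with 'erase_count' and 'garbage_pages' keys.
    wear_gap: hypothetical limit on how far the least-erased unit may
              lag behind the most-erased one.
    """
    coldest = min(range(len(units)), key=lambda i: units[i]["erase_count"])
    most_worn = max(u["erase_count"] for u in units)
    if most_worn - units[coldest]["erase_count"] > wear_gap:
        # Wear-leveling pass: collect the cold unit even if it has no
        # reclaimable pages, so that it can take on new (hotter) data.
        return coldest
    # Otherwise collect the unit that frees the most space.
    return max(range(len(units)), key=lambda i: units[i]["garbage_pages"])
```

A small wear_gap keeps erase counts close together at the price of extra copying; a large one copies less but lets wear diverge.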
- Incorporating flash memory as a disk-like device with FTL is inefficient:
- Duplication:
- OS already keeps various index structures for files:
- These are equivalent to the block map
- If OS could manage the flash directly, it could combine
the block map with file indexes
- Lack of information:
- FTL doesn't know when OS has freed a block; only finds out when
block is overwritten
- Thus FTL may rewrite dead blocks during garbage collection!
- Newer flash devices offer a trim command that allows the OS to
indicate deletion (but OS file systems must be modified to use it).
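The effect of trim on the FTL's bookkeeping can be sketched as below. The class is hypothetical and tracks only the block map and the set of pages awaiting erasure; it does not model any real device's trim interface.

```python
class TrimAwareMap:
    """Toy block map showing what trim tells the FTL."""

    def __init__(self):
        self.block_map = {}     # virtual block -> physical page
        self.garbage = set()    # physical pages awaiting erasure

    def write(self, block, page):
        old = self.block_map.get(block)
        if old is not None:
            self.garbage.add(old)    # the FTL learns of death on overwrite
        self.block_map[block] = page

    def trim(self, block):
        """OS reports that the block was freed; its page need not be copied."""
        page = self.block_map.pop(block, None)
        if page is not None:
            self.garbage.add(page)

    def live_pages(self):
        """Pages that garbage collection would have to preserve."""
        return set(self.block_map.values())
```

Without the trim() call, the page of a deleted file's block stays in live_pages() and keeps being copied by garbage collection until the OS happens to overwrite that block.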
- Better long-term solution: new file systems designed just for flash memory
- Lots of interesting issues and design alternatives
- These have been explored by research teams, but there are no
widely used implementations
- Need ability to bypass the FTL
- Interesting opportunity
- Organize entire file system like a log
- Newest alternative: nonvolatile memory such as Intel 3D XPoint
- Reads: 300ns
- Writes: 100ns
- Capacity: 512 GB per DIMM!
- Not yet clear how to use these (a file system sacrifices most
of the performance)