Virtual Memory
Time: 180 min.
Optional readings for this topic from Operating Systems: Principles and Practice: Chapter 8.
How can one memory be shared among several concurrent processes?
But first, let's think about ...
Single-tasking (no sharing):
- Highest memory holds OS.
- Process is allocated memory starting at 0, up to the OS area.
- Examples: early batch monitors where only one job ran at a time. It could corrupt the OS, which would be rebooted by an operator. Some early personal computers were similar.
Goals for sharing memory:
- Multitasking: allow multiple processes to be memory-resident at once.
- Transparency: no process should need to be aware of the fact that memory is shared. Each must run regardless of the number and/or locations of processes.
- Isolation: processes mustn't be able to corrupt each other.
- Efficiency (both of CPU and memory) shouldn't be degraded badly by sharing. After all, the purpose of sharing is to increase overall efficiency.
Load-time relocation:
- Highest memory holds OS.
- First process loaded at 0; others fill empty spaces.
- Use first-fit or best-fit allocation to manage available memory.
- When a process is loaded, relocate it so that it can run
in its allocated memory area, similar to linking:
- Linker outputs information about addresses in executable files
- Similar to unresolved references in object files: indicates which locations contain memory addresses
- OS modifies addresses when it loads process (add base address)
- What are the problems with this approach?
- Any process can corrupt any other process and/or the operating system.
- Only one region per process.
- Must declare size statically.
- Can't grow regions if adjacent space is in use.
- Can't move once loaded.
- Memory fragmentation.
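The relocation step described above can be sketched in a few lines. This is a minimal model, not a real executable format: the "image" is just a list of words, and the linker is assumed to have recorded which word offsets contain absolute addresses.

```python
# Minimal sketch of load-time relocation (hypothetical image format):
# the linker records which word offsets hold absolute addresses, and the
# loader adds the base address to each of them.

def load_and_relocate(image, reloc_offsets, base):
    """Copy the image into memory at `base`, adding `base` to every
    word the linker flagged as containing an address."""
    memory = list(image)          # copy of the program image
    for off in reloc_offsets:
        memory[off] += base       # patch the absolute address
    return base, memory

# Program compiled as if loaded at 0: word 2 holds a jump target (address 1).
image = [10, 20, 1, 30]
base, mem = load_and_relocate(image, reloc_offsets=[2], base=5000)
# mem[2] is now 5001: the jump target after relocation to base 5000.
```

Note that once the patched image is in memory, there is no record of which words were modified in what order, which is why the process can't be moved later.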
Dynamic Address Translation
Instead of translating addresses statically when a program is loaded, add hardware (memory management unit) that changes addresses dynamically during every memory reference.
Draw picture of processor and memory, with MMU in between. Show address and data lines.
Each address generated by a thread (called a virtual address) is translated in hardware to a physical address. This happens during every memory reference.
Results in two views of memory, called address spaces:
- Virtual address space is what the program sees
- Physical address space is the actual allocation of memory
Draw virtual address space and physical address space under processor/memory diagram, show correspondences.
Base and Bound Translation
Two hardware registers:
- Base: physical address corresponding to virtual address 0.
- Bound: highest allowable virtual address.
On each memory reference:
- Compare virtual address to bound register, trap if >=
- Add virtual address to base to produce physical address.
Each process has its own base and bound values, which are
saved in the process control block.
Draw address spaces.
Each process appears to have a completely private memory whose size is determined by the bound register.
Processes are isolated from each other and OS.
No address relocation is necessary when a process is loaded.
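The per-reference check-and-add can be sketched as follows (names are illustrative; in hardware the compare and add happen in parallel):

```python
# Sketch of base-and-bound translation, performed by the MMU on every
# memory reference.

class BoundError(Exception):
    """Models the hardware trap on an out-of-range virtual address."""

def translate(vaddr, base, bound):
    if vaddr >= bound:            # compare against bound register
        raise BoundError(vaddr)   # trap to the OS
    return base + vaddr           # add base register

# With base = 6000 and bound = 2000 (the example below):
translate(140, base=6000, bound=2000)   # -> 6140
```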
Go through base & bound example:
- Show memory, ip, sp, base, bound
- Initially: base = 6000, bound = 2000
- Ask about physical locations for stack, current code
- Walk through "CALL 140" instruction, pushing the return address onto the stack.
- Note how the return address is sent to memory on the data lines: at this point it is data, not an address.
- Now stop the process and relocate the process to 3000
- Walk through the return.
- Point out that physical addresses are never visible to the thread.
Philosophy of meaning: comes only from use; it's not innate. Hmmm, not clear this fits here; skip?
OS runs with relocation disabled, so it can access all of memory (a bit in the processor status word controls relocation).
- Must prevent users from turning off relocation or modifying the base and bound registers (another bit in PS for user/kernel mode).
Problem: how does OS regain control once it has given it up? Must do two things simultaneously:
- Branch into or out of the OS.
- Change protection (turn relocation on or off).
- Special instructions control this. Note that the branch address must be carefully controlled. This is like a new form of procedure call that saves and restores more than just the IP.
Base & bound is cheap (only 2 hardware registers) and fast: the add and compare can be done in parallel.
Explain how swapping can work.
What's wrong with base and bound relocation?
- Only one contiguous region per program.
- How to manage separate stack?
- Can't support read-only information (e.g., code)
- How to share memory between processes (e.g., code)?
- Memory fragmentation.
Multiple segments
Each process is split among several variable-size areas of memory, called segments.
- E.g. one segment for code, one segment for data and heap, one segment for stack.
- Segments correspond roughly to linker sections.
Segment map holds the bases and bounds for all the segments of a process, plus protection bit for each segment: read-write versus read-only.
- The segment map is stored in the MMU; modified during context switches
Memory mapping logic consists of table lookup + add + compare.
Each memory reference must indicate a segment number and offset:
- Top bits of address select segment, low bits the offset.
- Example: PDP-10 with high and low segments selected by high-order address bit.
- Or, segment can be selected implicitly by the instruction (e.g. code vs. data, stack vs. data, or x86 prefixes). This approach was most commonly used to expand the address space of machines with very small addresses, such as PDP-11s.
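The top-bits scheme can be sketched as a table lookup plus add plus compare. The bit layout below is an assumption for illustration (4 segments, 16 KB max each), not any particular machine's:

```python
# Sketch of segmented translation: top address bits pick a segment-map
# entry, low bits are the offset within the segment.

SEG_BITS, OFF_BITS = 2, 14     # assumed layout: 4 segments, 16 KB max each

def translate(vaddr, seg_map):
    seg = vaddr >> OFF_BITS
    off = vaddr & ((1 << OFF_BITS) - 1)
    base, bound, read_only = seg_map[seg]   # one entry per segment
    if off >= bound:
        raise MemoryError("segment bound violation")
    return base + off

# Segment 0: read-only code at 0x8000, 4 KB; segment 1: data at 0x20000, 8 KB.
seg_map = {0: (0x8000, 0x1000, True), 1: (0x20000, 0x2000, False)}
translate(0x0010, seg_map)                   # code reference -> 0x8010
translate((1 << OFF_BITS) | 0x100, seg_map)  # data reference -> 0x20100
```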
Advantage of segmentation: flexibility
- Manage each segment separately:
- Grow and shrink independently
- Swap to disk
- Can share segments between processes (e.g., shared code).
- Can move segments to compact memory and eliminate fragmentation.
Talk about stacks growing downward? This takes about 15 minutes... probably not worth it. Save for an exam question?
What's wrong with segmentation?
- Variable length results in memory fragmentation.
- Fixed number of segments still creates limits: can't mmap files.
- Address space is rigidly divided.
Paging
Divide virtual and physical memory into fixed-sized chunks called pages. The most common size is 4 Kbytes. Unfortunately, this is way too small today.
For each process, a page map specifies the physical page for each of that process's virtual pages, along with read-only and "present" bits. No need for bound registers, since all pages are always the same size.
Page map stored in contiguous memory (with base register in hardware).
Translation process: page number always comes directly from the address. Since page size is a power of two, no comparison or addition is necessary. Just do table lookup and bit concatenation.
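The lookup-and-concatenate step can be sketched as below (a one-level map with illustrative entry fields; real page map entries pack these bits into one word):

```python
# Sketch of one-level page translation: split the virtual address with a
# shift and a mask, look up the physical page, concatenate the offset.

PAGE_BITS = 12                       # 4 KB pages
PAGE_MASK = (1 << PAGE_BITS) - 1

def translate(vaddr, page_map):
    vpn = vaddr >> PAGE_BITS         # virtual page number
    off = vaddr & PAGE_MASK          # offset within the page
    entry = page_map[vpn]
    if not entry["present"]:
        raise MemoryError("page fault")
    return (entry["ppn"] << PAGE_BITS) | off

page_map = {0: {"ppn": 7, "present": True, "read_only": True}}
translate(0x0123, page_map)          # -> (7 << 12) | 0x123 = 0x7123
```

No add, no compare: because the page size is a power of two, the offset bits pass straight through.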
Easy to allocate: keep a free list of available pages and grab the first one. Easy to swap since everything is the same size, which is usually the same size as disk blocks.
Flexible management: Can represent a segment with a collection of pages, starting on any page boundary.
Problem: for modern machines, page maps can be very large:
- Consider x86-64 addressing architecture: 48-bit addresses, 4096-byte pages.
- Draw diagram.
- Class exercise: compute size of page map (assuming 8 bytes per entry): 2**39 bytes (512 GB)!
- Ideally, each page map should fit in a page.
- Most processes are small, so most page map entries are unused.
- Even large processes use their address space sparsely (e.g., code at the bottom, stack at the top)
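The class exercise above, worked out:

```python
# Flat page map size for x86-64: 48-bit addresses, 4 KB pages,
# 8 bytes per page map entry.
ADDR_BITS, PAGE_BITS, ENTRY_BYTES = 48, 12, 8

num_pages = 1 << (ADDR_BITS - PAGE_BITS)   # 2**36 virtual pages
map_bytes = num_pages * ENTRY_BYTES        # 2**39 bytes
print(map_bytes >> 30, "GB")               # 512 GB per process!
```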
Derive the full Intel x86-64 architecture bottom-up:
- PML1 (how big is top-level page map?)
- PML2 ... PML4
- Make it clear that the page maps form a tree structure
Solution: multi-level page maps. Intel x86-64 addressing architecture:
- 4 Kbyte pages: low-order 12 bits of virtual address hold offset within page.
- 4 levels of page map, each indexed with 9 bits of virtual address.
- Each page map fits in one page (page map entries are 8 bytes).
- Can omit empty page maps.
Have class go through 3 memory references:
- Code page 0
- Data page
- Highest stack page
What subset of the page map structure is needed for the simplest process with code, data, and stack?
What needs to happen if the data segment grows to two pages?
Note that the virtual address space need not be the same size as the physical address space.
Note that x86 processors actually support several different translation schemes.
With paging, processes can share memory flexibly:
- A single page
- An entire page map at any level (PML1, PML2, etc.)
Next problem: page maps are too large to load into fast memory in the MMU.
- Page maps kept in main memory
- Relocation unit holds base address for top-level page map
- With x86-64 architecture, must make 4 memory references to translate a virtual address!
Translation Lookaside Buffers (TLBs)
Solution to page translation overhead: create a small hardware cache of recent translations.
- Each cache entry stores the page number portion of a virtual address (36 bits for x86-64) and the corresponding physical page number (40 bits for x86-64).
- Typical TLB sizes: 64-2048 entries.
- On each memory reference, compare the page number from the virtual address with the virtual page numbers in every TLB entry (in parallel).
- If there is a match, use the corresponding physical page number.
- If no match, perform the full address translation and save the information in the TLB (replace one of the existing entries).
- TLB "hit rates" typically 95% or more.
- Mention that the scheme above is fully associative, so it requires one comparator for each entry. Other schemes are possible, but beyond the scope of this class.
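The TLB logic above can be modeled in software (a real TLB does the comparisons in parallel hardware; the FIFO replacement policy here is just an assumption for illustration):

```python
# Software model of a fully associative TLB with FIFO replacement.
from collections import OrderedDict

class TLB:
    def __init__(self, capacity=64):
        self.entries = OrderedDict()   # vpn -> ppn, oldest first
        self.capacity = capacity
        self.hits = self.misses = 0

    def lookup(self, vpn, full_translate):
        if vpn in self.entries:        # all entries "compared in parallel"
            self.hits += 1
            return self.entries[vpn]
        self.misses += 1
        ppn = full_translate(vpn)      # full multi-reference page map walk
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # replace an existing entry
        self.entries[vpn] = ppn
        return ppn

    def invalidate_all(self):          # e.g., on context switch
        self.entries.clear()

tlb = TLB(capacity=2)
full = lambda vpn: vpn + 100           # stand-in for the 4-reference walk
tlb.lookup(5, full)                    # miss: walks the page map
tlb.lookup(5, full)                    # hit: no walk needed
```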
TLB complications:
- When context switching, must invalidate all of the entries in the TLB (mappings will be different for the next process). Chip hardware does this automatically when the page map base register is changed.
- If virtual memory mappings change for the current process (e.g. page moved), must invalidate some TLB entries. Special hardware instruction for this.
Miscellaneous Topics
How does the operating system get information from user memory? E.g. I/O buffers, parameter blocks. Note that the user passes the OS a virtual address.
- In some systems the OS just runs unmapped:
- OS reads page maps and translates user addresses in software.
- Addresses that are contiguous in the virtual address space may not be contiguous physically. Thus I/O operations may have to be split up into multiple blocks. Show example from slides.
- Most newer systems include kernel and user memory in the same virtual address space, with kernel memory inaccessible in user mode (a special bit in page map entries). This makes life easier for the kernel, although it doesn't solve the I/O problem.
Aliasing: the same physical address can appear at multiple virtual locations
Another issue with paging: internal fragmentation.
- Can't allocate partial pages, so for small chunks of information only part of the page will be used
- Result: wasted space at the ends of some pages
- Define external fragmentation, compare with internal fragmentation.
- Not much of a problem in today's systems:
Ask students why:
- Logical segments such as code or stack tend to be much larger than a page.
- Percentage wasted space from fragmentation is small.
- What happens if page sizes grow?
- We will see more about internal and external fragmentation when we talk about files.
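The internal-fragmentation arithmetic above can be made concrete:

```python
# Internal fragmentation, worked out: a segment of `seg_bytes` needs
# ceil(seg_bytes / page_bytes) pages; the waste is in the last page.
def internal_frag(seg_bytes, page_bytes=4096):
    pages = -(-seg_bytes // page_bytes)        # ceiling division
    return pages * page_bytes - seg_bytes

internal_frag(10_000)      # 3 pages allocated, 2288 bytes wasted
internal_frag(100_000)     # 25 pages, 2400 bytes wasted: about 2.4%
```

At most one page minus one byte is wasted per segment, so the percentage shrinks as segments grow, and grows again if page sizes grow.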