

# **Key Nehalem Choices**

Glenn Hinton, Intel Fellow, Nehalem Lead Architect, Feb 17, 2010

# Outline

- Review NHM timelines and overall issues
- Converged core tensions
- Big debate: wider vectors vs SMT vs more cores
- How features were decided
- Power, Power, Power
- Summary

### **Tick-Tock Development Model: Pipelined developments – 5+ year projects**



- Merom started in 2001, released 2006
- Nehalem started in 2003 (with research even earlier)
- Sandy Bridge started in 2005
- Clearly already working on new Tock after Sandy Bridge
- Most of Nehalem uArch was decided by mid 2004
  - But most detailed engineering work happened in 2005/06/07

<sup>1</sup>Intel® Core<sup>™</sup> microarchitecture (formerly Merom) 45nm next generation Intel® Core<sup>™</sup> microarchitecture (Penryn) Intel® Core<sup>™</sup> Microarchitecture (Nehalem) Intel® Microarchitecture (Westmere) Intel® Microarchitecture (Sandy Bridge)

All dates, product descriptions, availability and plans are forecasts and subject to change without notice.

### **Nehalem:**

### Lots of competing aspects



### **The Blind Men and the Elephant**

It was six men of Indostan To learning much inclined, Who went to see the Elephant (Though all of them were blind), That each by observation Might satisfy his mind

...

And so these men of Indostan Disputed loud and long, Each in his own opinion Exceeding stiff and strong, Though each was partly in the right, And all were in the wrong!

- Common CPU Core for multiple uses
- Mobile (Laptops)
- Desktop
- Server/HPC
- Workloads?
- How tradeoff?

- Mobile
  - 1/2/4 core options; scalable caches
  - Low TDP power CPU and GPU
  - Very low "average" power (great partial sleep state power)
  - Very low sleep state power
  - Low V-min to give best power efficiency when active
  - Moderate DRAM bandwidth at low power
  - Very dynamic power management
  - Low cost for volume
  - Great single threaded performance
    - Most apps single threaded
- Desktop
- Server
- Workloads: Productivity, Media, ISPEC, FSPEC, 32 vs 64 bit
- How tradeoff?

- Mobile
- Desktop
  - 1/2/4 core options; scalable caches
  - Media processing, high end game performance
  - Moderate DRAM bandwidth
  - Low cost for volume
  - Great single threaded performance
    - Most apps single threaded
- Server
- Workloads: Productivity, Media, Games, ISPEC, FSPEC, 32 vs 64 bit
- How tradeoff?

- Mobile
- Desktop
- Server
  - More physical address bits (speed paths, area, power)
  - More RAS features (ECC on caches, TLBs, etc)
  - Larger caches, TLBs, BTBs, multi-socket snoop, etc
  - Fast Locks and multi-threaded optimizations
  - More DRAM channels (BW and capacity) & more external links
  - Dynamic power management
  - Many cores (4, 8, etc) so need low power per core
  - SMT gives large perf gain since threaded apps
  - Low V-min to allow many cores to fit in low blade power envelopes
- Workloads: Workstation, Server, ISPEC, FSPEC, 64 bit
- How tradeoff?

- Mobile
  - 1/2/4 core options; scalable caches
  - Low TDP power CPU and GPU
  - Very low "average" power (great partial sleep state power)
  - Very low sleep state power
  - Low V-min to give best power efficiency when active
  - Moderate DRAM bandwidth at low power
  - Very dynamic power management
  - Low cost for volume
  - Great single threaded performance (most apps single threaded)
- Desktop
  - 1/2/4 core options; scalable caches
  - Media processing, high end game performance
  - Moderate DRAM bandwidth
  - Low cost for volume
  - Great single threaded performance (most apps single threaded)
- Server
  - More physical address bits (speed paths, area, power)
  - More RAS features (ECC on caches, TLBs, etc)
  - Larger caches, TLBs, BTBs, multi-socket snoop, etc
  - Fast Locks and multi-threaded optimizations
  - More DRAM channels (BW and capacity) and more external links
  - Dynamic power management
  - Many cores (4, 8, etc) so need low power per core
  - SMT gives large perf gain since threaded apps
  - Low V-min to allow many cores to fit in low blade power envelopes
- Workloads: Productivity, Media, Workstation, Server, ISPEC, FSPEC, 32 vs 64 bit, etc
- How tradeoff?

# **Early Base Core Selection - 2003**

### Goals

- Best single threaded perf?
- Lowest cost dual core?
- Lowest power dual core?
- Best laptop battery life?
- Most cores that fit in server size?
- Best total throughput for cost/power in multi-core?
- Least engineering costs?
- Major options
  - Enhanced Northwood (P4) pipeline?
  - Enhanced P6 pipeline (like Merom Core 2 Duo)?
  - New from scratch pipeline?
- Why went with enhanced P6 (Merom) pipeline?
  - Lower power per core, lower per core die size, lower total effort
  - Better SW optimization consistency
- Likely gave up some ST perf (10-20%?)
  - But unlikely to have been able to do 'bigger' alternatives

# 2004 Major Decision: Cores vs Vectors vs SMT

- Just use older Penryn cores and have 3 or 4 of them?
  - No single threaded performance gains
- Put in wider vectors (like recently announced AVX)?
  - 256bit wide vectors (called VSSE back then)
  - Very power and area efficient, *if* doing wide vectors
  - Consumes die size and power when not using
- Add SMT per core and have fewer cores?
  - Very power and area efficient
  - Adds a lot of complexity; some die cost and power when not using
- What do servers want? Lots of cores/threads
- Laptops? Low power cores
- HE Desktops? Great media/game performance
- Options with similar die cost:
  - 2 enhanced cores + SMT + Wide Vectors?
  - 3 enhanced cores + SMT?
  - 4 simplified cores?

### 2 cores vs 4 Cores pros/cons

- 2 Cores+VSSE+SMT
  - Somewhat smaller die size
  - Lower power than 4 core
  - Better ST perf
  - Best media if use VSSE?
    - Not clear; looks like a wash
  - Specialized MIPS sometimes unused
  - VSSE gives perf for apps not easily threaded
    - Is threading really mainly for wizards?
  - New visible ISA feature like MMX, SSE

- 4 Cores
  - Better generic 4T perf
    - TPC-C, multi-tasking
  - Best media perf on legacy 4T-enabled apps
  - Simpler HW design
  - Die size somewhat bigger
  - Simpler/harder SW enabling
    - Simpler since no VSSE
    - No SMT asymmetries
    - Harder since general 4T
  - 4T perf is also specialized
    - But probably less than VSSE
  - TDP Power somewhat higher
  - Somewhat lower average power, since smaller single core
  - More granular to hit finer segments (1/2/3/4 core options)
  - More complex power management
  - Trade uArch change resources for power reduction

# **Nehalem Direction**

- Tech Readiness Direction
  - 4/3/2/1 cores supported
  - VSSE dropped
    - Saves power and die size
  - SMT maintained
- VSSE in core less valued by servers and mobile
  - Casualty of Converged Core
  - Utilize scaled 4 core solution to recover media perf
- SMT to increase threaded perf
  - Initially target servers
- Spend more effort to reduce power





### Early goal: Remove Multi-Core Perf tax (In power constrained usages)

- Early Multi-cores lower freq than single core variants
- When started Nehalem dual cores still 2-3 years away...
  - Wanted 4 cores in high-end volume systems: a big increase
- Lower power envelopes planned for all Nehalem usages
  - Thinner laptops, blade servers, small-form-factor desktops, etc
- Many apps still single threaded
- Power with all cores active limits the highest TDP freq
  - If just had one core the highest frequency could be a lot higher
- Turbo Mode/Power Gates removed this multi-core tax
  - One of biggest ST perf gains for mobile/power constrained usages
- Considered ways to do Turbo Mode
  - Decided must have flexible means to tune late in project
  - Added PCU with micro-controller to dynamically adapt
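
The multi-core tax and the way Turbo Mode removes it can be sketched with a toy power-budget model. Every constant below (package budget, uncore power, per-core power curve, bin size, bin cap) is an assumption chosen for illustration; the slides do not describe the actual PCU firmware policy.

```python
# Toy model of Turbo Mode: pick the highest frequency bin whose total
# power still fits the package budget, given how many cores are active.
# All constants are hypothetical, not Nehalem specifications.

BASE_GHZ = 2.66       # assumed rated frequency with all cores active
BIN_GHZ = 0.133       # assumed size of one turbo bin
TDP_WATTS = 95.0      # assumed package power budget
UNCORE_WATTS = 15.0   # assumed fixed uncore power


def core_power(freq_ghz: float) -> float:
    """Assumed per-core power: cubic in frequency (voltage tracks frequency)."""
    watts_at_base = 20.0
    return watts_at_base * (freq_ghz / BASE_GHZ) ** 3


def turbo_bins(active_cores: int, max_bins: int = 8) -> int:
    """Number of bins above base that fit in the budget.

    Inactive cores are assumed to be power gated (C6) and draw nothing.
    """
    bins = 0
    while bins < max_bins:
        freq = BASE_GHZ + (bins + 1) * BIN_GHZ
        if UNCORE_WATTS + active_cores * core_power(freq) > TDP_WATTS:
            break
        bins += 1
    return bins


for n in (4, 2, 1):
    b = turbo_bins(n)
    print(f"{n} active cores: +{b} bins -> {BASE_GHZ + b * BIN_GHZ:.2f} GHz")
```

In this model four active cores already consume the whole budget at the base frequency, while one or two active cores leave headroom that is converted into extra bins. That is the single-threaded gain the slide attributes to Turbo Mode plus power gates.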

### **Power Control Unit**



# Intel<sup>®</sup> Core<sup>™</sup> Microarchitecture (Nehalem) Turbo Mode

Power Gating

Zero power for inactive cores (C6 state)

### No Turbo





Turbo Mode

In response to workload adds additional performance bins within headroom


### uArch features – Converged Core

- Difficult balancing the needs of the 3 conflicting requirements
- All CPU core features must be very power efficient
  - Helps all segments, especially laptops and multi-core servers
  - Requirement was to beat a 2:1 power/perf ratio
  - Ended up more like 1.3:1 power/perf ratio for perf features added
- Segment specific features can't add much power or die size
- Initial Uncore optimized for 4 core DP server but had to scale down well to 2 core volume part
- Many things mobile wanted also helped servers
  - Low power cores, lower die size per core, etc
  - Active power management
  - More synergy than originally thought
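
The power/perf requirement above can be read as a simple acceptance filter on candidate features. The sketch below assumes the ratio means "percent power added per percent performance gained"; the candidate names and numbers are invented for illustration and do not come from the talk.

```python
# Sketch of the "power-efficient features only" rule: accept a feature
# only if its power cost per unit of performance gain beats a limit
# (2:1 per the slide). Candidate features below are hypothetical.

def power_perf_ratio(power_gain_pct: float, perf_gain_pct: float) -> float:
    """Percent power added per percent performance gained."""
    if perf_gain_pct <= 0:
        return float("inf")  # costs power but buys no performance
    return power_gain_pct / perf_gain_pct


def accept(power_gain_pct: float, perf_gain_pct: float, limit: float = 2.0) -> bool:
    """True when the feature beats the required power/perf ratio."""
    return power_perf_ratio(power_gain_pct, perf_gain_pct) < limit


candidates = {  # hypothetical features: (power %, perf %)
    "feature_a": (1.3, 1.0),
    "feature_b": (3.0, 1.0),
    "feature_c": (0.5, 0.0),
}
for name, (power, perf) in candidates.items():
    print(name, "accept" if accept(power, perf) else "reject")
```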

# **Nehalem Power Efficiency Features**

- Only adding power efficient uArch features
  - Net power : performance ratio of Nehalem core ~1.3 : 1
    - Far better than voltage scaling
- Reducing min operating voltage with linear freq decline
  - Cubic power reduction with ~linear perf reduction
- Implementing C6/Power Gated low-power state
  - Provides significant reduction in average mobile power
- Turbo mode
  - Allows processor to utilize entire available power envelope
  - Reduces performance penalty from multi-core on ST apps





# The First Intel<sup>®</sup> Core<sup>™</sup> Microarchitecture (Nehalem) Processor



QPI: Intel® QuickPath Interconnect (Intel® QPI)

### A Modular Design for Flexibility

### **Scalable Cores**

Same core for all segments

Common software optimization

**Common feature set** 

#### Intel® Core™ Microarchitecture (Nehalem) **45nm**









#### Servers/Workstations

Energy Efficiency, Performance, Virtualization, Reliability, Capacity, Scalability

#### <u>Desktop</u>

Performance, Graphics, Energy Efficiency, Idle Power, Security

#### **Mobile**

Battery Life, Performance, Energy Efficiency, Graphics, Security

#### **Optimized cores to meet multiple market segments**

# Modularity

### 45nm Lynnfield/Clarksfield

### 45nm Nehalem Core i7







### 32nm Clarkdale/Arrandale

#### **Converged Core**

#### Intel® Xeon® Processor 5500 series based Server platforms Server Performance comparison to Xeon 5400 Series




### Leadership on key server benchmarks

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm Copyright © 2009, Intel Corporation. \* Other names

#### Intel® Xeon® Processor 5500 series based Server platforms HPC Performance comparison to Xeon 5400 Series




### **Exceptional gains on HPC applications**


### **Summary**

- Nehalem was Intel's first truly 'converged core'
- Difficult tradeoffs between segments
- Result: Outstanding Server, DT, and mobile parts
- Acknowledge the outstanding Architects, Designers, and Validators that made this project a great success
  - A great team overcomes almost all challenges

### Intel® Xeon® 5500 Performance



### **Over 30 New 2S Server and Workstation World Records!**

Percentage gains shown are based on comparison to Xeon 5400 series; Performance results based on published/submitted results as of April 27, 2009. Platform configuration details are available at <a href="http://www.intel.com/performance/server/xeon/summary.htm">http://www.intel.com/performance/server/xeon/summary.htm</a> \*Other names and brands may be claimed as the property of others


#### Intel® Xeon® Processor 5500 Series based Platforms

**ISV Application Performance** 

| ISV Application                        | Xeon 5500 vs<br>Xeon 5400 | ISV Application                       | Xeon 5500 vs<br>Xeon 5400 |
|----------------------------------------|---------------------------|---------------------------------------|---------------------------|
| Ansys* CFX11* – CFD Simulation         | +88%                      | Kingsoft* JXIII Online* – Game Server | +98%                      |
| ERDAS* ERMapper Suite*                 | +69%                      | Mediaware* Instream* – Video Conv.    | +73%                      |
| ESI Group* PAM-CRASH*                  | +50%                      | Neowiz* Pmang* – Game Portal          | +100%                     |
| ExitGames* Neutron* – Service Platform | +80%                      | Neusoft* – Healthcare                 | +131%                     |
| Epic* – EMR Solution                   | +82%                      | Neusoft* – Telecom BSS VMware*        | +115%                     |
| Giant* Juren* – Online Game            |                           | NHN Corp* Cubrid* – Internet DB       | +44%                      |
| IBM* DB2* 9.5 – TPoX XML               | +60%                      | QlikTech* QlikView* – BI              | +36%                      |
| IBM* Informix* Dynamic Server          | +84%                      | SAS* Forecast Server*                 | +80%                      |
| IBM* solidDB* – In Memory DB           | +87%                      | SAP* NetWeaver* – BI                  | +51%                      |
| Image Analyzer* – Image Scanner        | +100%                     | SAP* ECC 6.0* – ERP Workload          | +71%                      |
| Infowrap* – Small Data Set             | +129%                     | Schlumberger* Eclipse300*             | +213%                     |
| Infowrap* – Weather Forecast           | +155%                     | SunGard* BancWare Focus ALM*          | +38%                      |
| Intense* IECCM* – Output Mgmt.         | +150%                     | Supcon* – APC Intel. Sensor           | +191%                     |
| Intersystems* Cache* – EMR             | +63%                      | TongTech* – Middleware                | +95%                      |
| Kingdee* APUSIC* – Middleware          | +93%                      | UFIDA* NC* – ERP Solution             | +230%                     |
| Kingdee* EAS* – ERP                    |                           | UFIDA* Online – SaaS Hosting          | +237%                     |
| Kingdom* – Stock Transaction           | +141%                     | Vital Images* – Brain Perfusion 4DCT* | +77%                      |

### Exceptional gains (1.6x to 3x) on ISV applications

Source: Results measured and approved by Intel and ISV partners on pre-production platforms. March 30, 2009.


### **Execution Unit Overview**

Unified Reservation Station

- Schedules operations to Execution units
- Single Scheduler for all Execution Units
- Can be used by all integer, all FP, etc.

Execute 6 operations/cycle

- 3 Memory Operations
  - 1 Load
  - 1 Store Address
  - 1 Store Data
- 3 "Computational" Operations



# **Increased Parallelism**

- Goal: Keep powerful execution engine fed
- Nehalem increases size of out-of-order window by 33%
- Must also increase other corresponding structures

Concurrent uOps Possible



| Structure           | Intel <sup>®</sup> Core™<br>microarchitecture<br>(formerly Merom) | Intel® Core™<br>microarchitecture<br>(Nehalem) | Comment                                  |
|---------------------|-------------------------------------------------------------------|------------------------------------------------|------------------------------------------|
| Reservation Station | 32                                                                | 36                                             | Dispatches operations to execution units |
| Load Buffers        | 32                                                                | 48                                             | Tracks all load<br>operations allocated  |
| Store Buffers       | 20                                                                | 32                                             | Tracks all store operations allocated    |
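
As a quick arithmetic check on how much each structure grew, the entry counts from the table can be turned into percentages. Only the table values are used; the helper name `growth_pct` is just for this sketch.

```python
# Relative growth of each buffer from Merom to Nehalem,
# using the entry counts listed in the table above.

merom = {"Reservation Station": 32, "Load Buffers": 32, "Store Buffers": 20}
nehalem = {"Reservation Station": 36, "Load Buffers": 48, "Store Buffers": 32}


def growth_pct(old: int, new: int) -> float:
    """Percentage increase from old to new."""
    return 100.0 * (new - old) / old


for name in merom:
    pct = growth_pct(merom[name], nehalem[name])
    print(f"{name}: {merom[name]} -> {nehalem[name]} (+{pct:.1f}%)")
```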

#### **Increased Resources for Higher Performance**

<sup>1</sup>Intel<sup>®</sup> Pentium<sup>®</sup> M processor (formerly Dothan) Intel<sup>®</sup> Core<sup>™</sup> microarchitecture (formerly Merom) Intel<sup>®</sup> Core<sup>™</sup> microarchitecture (Nehalem)

## **New TLB Hierarchy**

- Problem: Applications continue to grow in data size
- Need to increase TLB size to keep the pace for performance
- Nehalem adds new low-latency unified 2<sup>nd</sup> level TLB

|                                        | # of Entries |
|----------------------------------------|--------------|
| 1<sup>st</sup> Level Instruction TLBs  |              |
| Small Page (4k)                        | 128          |
| Large Page (2M/4M)                     | 7 per thread |
| 1<sup>st</sup> Level Data TLBs         |              |
| Small Page (4k)                        | 64           |
| Large Page (2M/4M)                     | 32           |
| New 2<sup>nd</sup> Level Unified TLB   |              |
| Small Page Only                        | 512          |


# **Faster Synchronization Primitives**

- Multi-threaded software becoming more prevalent
- Scalability of multi-thread applications can be limited by synchronization
- Synchronization primitives: LOCK prefix, XCHG
- Reduce synchronization latency for legacy software



### Greater thread scalability with Nehalem

<sup>1</sup>Intel<sup>®</sup> Pentium<sup>®</sup> 4 processor Intel<sup>®</sup> Core<sup>™</sup>2 Duo processor Intel<sup>®</sup> Core<sup>™</sup> microarchitecture (Nehalem)-based processor

### **Intel® Hyper-Threading Technology**

- Also known as Simultaneous Multi-Threading (SMT)
  - Run 2 threads at the same time per core
- Take advantage of 4-wide execution engine
  - Keep it fed with multiple threads
  - Hide latency of a single thread
- Most *power efficient* performance feature
  - Very low die area cost
  - Can provide significant performance benefit depending on application
  - Much more efficient than adding an entire core
- Intel<sup>®</sup> Core<sup>™</sup> microarchitecture (Nehalem) advantages
  - Larger caches
  - Massive memory BW



Simultaneous multi-threading enhances performance and energy efficiency

# **Intel® Hyper-Threading Technology**

- Nehalem is a scalable multi-core architecture
- Hyper-Threading Technology augments benefits
  - Power-efficient way to boost performance in all form factors:
    higher multi-threaded performance, faster multi-tasking
    response




| Resource             | Hyper-Threading:<br>Shared or Partitioned | Hyper-Threading:<br>Replicated | Multi-cores:<br>Replicated |
|----------------------|-------------------------------------------|--------------------------------|----------------------------|
| Register State       |                                           | X                              | X                          |
| Return Stack         |                                           | X                              | X                          |
| Reorder Buffer       | X                                         |                                | X                          |
| Instruction TLB      | X                                         |                                | X                          |
| Reservation Stations | X                                         |                                | X                          |
| Cache (L1, L2)       | X                                         |                                | X                          |
| Data TLB             | X                                         |                                | X                          |
| Execution Units      | X                                         |                                | X                          |

- Next generation Hyper-Threading Technology:
  - Low-latency pipeline architecture
  - Enhanced cache architecture
  - Higher memory bandwidth

### Enables 8-way processing in Quad Core systems, 4-way processing in Small Form Factors

Intel<sup>®</sup> Microarchitecture codenamed Nehalem

## Caches

- New 3-level Cache Hierarchy
- 1<sup>st</sup> level caches
  - 32kB Instruction cache
  - 32kB, 8-way Data Cache
    - Support more L1 misses in parallel than Intel<sup>®</sup> Core<sup>™</sup>2 microarchitecture
- 2<sup>nd</sup> level Cache
  - New cache introduced in Intel<sup>®</sup> Core<sup>™</sup> microarchitecture (Nehalem)
  - Unified (holds code and data)
  - 256 kB per core (8-way)
  - Performance: Very low latency
    - 10 cycle load-to-use
  - Scalability: As core count increases, reduce pressure on shared cache

#### Core



## **3rd Level Cache**

- Shared across all cores
- Size depends on # of cores
  - Quad-core: Up to 8MB (16-ways)
  - Scalability:
    - Built to vary size with varied core counts
    - Built to easily increase L3 size in future parts
- Perceived latency depends on frequency ratio between core & uncore
- Inclusive cache policy for best performance
  - Address residing in L1/L2 *must* be present in 3<sup>rd</sup> level cache
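
The inclusion rule can be sketched as a small model. The back-invalidation on L3 eviction shown here is the common textbook way to preserve inclusion and is an assumption about the mechanism; the class and method names are hypothetical, and the private caches are left unbounded to keep the example short.

```python
# Toy inclusive hierarchy: every line held in a core's private caches
# (standing in for L1/L2) must also be present in the shared L3.
from collections import OrderedDict


class InclusiveHierarchy:
    def __init__(self, n_cores: int, l3_lines: int):
        self.private = [set() for _ in range(n_cores)]  # per-core L1/L2 contents
        self.l3 = OrderedDict()                          # line -> None, in LRU order
        self.l3_lines = l3_lines

    def access(self, core: int, line: str) -> None:
        if line in self.l3:
            self.l3.move_to_end(line)  # simplification: every access refreshes L3 LRU
        else:
            if len(self.l3) >= self.l3_lines:
                victim, _ = self.l3.popitem(last=False)
                for cache in self.private:
                    cache.discard(victim)  # back-invalidate to keep inclusion
            self.l3[line] = None
        self.private[core].add(line)

    def may_be_in_a_core(self, line: str) -> bool:
        """An L3 miss proves no core holds the line, so no core needs snooping."""
        return line in self.l3


h = InclusiveHierarchy(n_cores=2, l3_lines=2)
h.access(0, "A")
h.access(1, "B")
h.access(0, "C")  # L3 is full: evicts "A" and removes it from core 0 as well
print(sorted(h.private[0]), sorted(h.private[1]), list(h.l3))
```

The payoff of inclusion in this model is the `may_be_in_a_core` check: the shared cache can answer for all cores on a miss, at the cost of occasionally evicting lines a core was still using.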



# Hardware Prefetching (HWP)

- HW Prefetching critical to hiding memory latency
- Structure of HWPs similar to Intel<sup>®</sup> Core<sup>™</sup>2 microarchitecture
  - Algorithmic improvements in Intel<sup>®</sup> Core<sup>™</sup> microarchitecture (Nehalem) for higher performance
- L1 Prefetchers
  - Based on instruction history and/or load address pattern
- L2 Prefetchers
  - Prefetches loads/RFOs/code fetches based on address pattern
  - Intel Core microarchitecture (Nehalem) changes:
    - Efficient Prefetch mechanism
      - Remove the need for Intel® Xeon® processors to disable HWP
    - Increase prefetcher **aggressiveness** 
      - Locks on address streams quicker, adapts to change faster, issues more prefetches more aggressively (when appropriate)
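
To make "locks on address streams" and "aggressiveness" concrete, here is a generic stride-prefetcher sketch. It is a textbook scheme, not the Nehalem algorithm, which the slides do not describe; `confirm` and `degree` are hypothetical tuning knobs standing in for how quickly a stream is locked and how many prefetches are issued.

```python
# Generic stride prefetcher: after `confirm` consecutive accesses with the
# same non-zero stride, prefetch the next `degree` addresses in the stream.

class StridePrefetcher:
    def __init__(self, confirm: int = 2, degree: int = 2):
        self.confirm = confirm    # lower value = locks onto a stream sooner
        self.degree = degree      # higher value = more prefetches per access
        self.last_addr = None
        self.stride = 0
        self.matches = 0          # consecutive accesses that repeated the stride

    def observe(self, addr: int) -> list:
        """Feed one demand address; return the addresses to prefetch."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.stride:
                self.matches += 1
            else:
                self.stride = stride  # pattern changed: start tracking anew
                self.matches = 0
            if self.matches >= self.confirm:
                prefetches = [addr + self.stride * i
                              for i in range(1, self.degree + 1)]
        self.last_addr = addr
        return prefetches


pf = StridePrefetcher()
for address in (0, 64, 128, 192, 1000):
    print(address, pf.observe(address))
```

With the default settings the prefetcher stays quiet for the first three accesses, issues prefetches for 256 and 320 once the 64-byte stride is confirmed at address 192, and goes quiet again when the pattern breaks at 1000.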

# **SW Prefetch Behavior**

- PREFETCHT0: Fills L1/L2/L3
- PREFETCHT1/T2: Fills L2/L3
- PREFETCHNTA: Fills L1/L3, L1 LRU is not updated
- SW prefetches can conduct page walks
- SW prefetches can spawn HW prefetches
  - SW prefetch caching behavior not obeyed on HW prefetches
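
The fill behavior listed above is easy to keep straight as a lookup table. The sketch below only restates the slide's mapping; the helper `hints_filling` is a made-up name for this example.

```python
# Which cache levels each software prefetch hint fills, per the list above.

PREFETCH_FILLS = {
    "PREFETCHT0": ("L1", "L2", "L3"),
    "PREFETCHT1": ("L2", "L3"),
    "PREFETCHT2": ("L2", "L3"),
    "PREFETCHNTA": ("L1", "L3"),  # L1 LRU is not updated
}


def hints_filling(level: str) -> list:
    """All hints that bring the line into the given cache level."""
    return sorted(h for h, levels in PREFETCH_FILLS.items() if level in levels)


print(hints_filling("L1"))
print(hints_filling("L2"))
```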

### Memory Bandwidth – Initial Intel<sup>®</sup> Core<sup>™</sup> Microarchitecture (Nehalem) Products

- 3 memory channels per socket
- ≥ DDR3-1066 at launch
- Massive *memory BW*

- Scalability
  - Design IMC and core to take advantage of BW
  - Allow performance to scale with cores
    - Core enhancements
      - Support more cache misses per core
      - Aggressive hardware prefetching w/ throttling enhancements
    - Example IMC Features
      - Independent memory channels
      - Aggressive Request Reordering
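
The theoretical peak behind "massive memory BW" can be worked out from the channel count and transfer rate. The sketch assumes a standard 64-bit (8-byte) DDR3 channel and computes a peak figure only; sustained bandwidth is lower and is not given in the text.

```python
# Theoretical peak DRAM bandwidth = channels * transfers/s * bytes per transfer.
# A 64-bit DDR3 channel moves 8 bytes per transfer.

def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: float,
                       bus_bytes: int = 8) -> float:
    """Peak bandwidth in GB/s (decimal gigabytes)."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


one_socket = peak_bandwidth_gbs(channels=3, mega_transfers_per_s=1066)
print(f"3 x DDR3-1066: {one_socket:.1f} GB/s peak per socket")
```

Three DDR3-1066 channels give roughly 25.6 GB/s of peak bandwidth per socket, and because each socket has its own controller, the total grows with the number of sockets.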





Source: Intel Internal measurements – August 2008<sup>1</sup>

#### Massive memory BW provides performance and scalability

<sup>1</sup>HTN: Intel® Xeon® processor 5400 Series (Harpertown) NHM: Intel® Core™ microarchitecture (Nehalem)

# **Memory Latency Comparison**

- Low memory latency critical to high performance
- Design integrated memory controller for low latency
- Need to optimize both local and remote memory latency
- Intel<sup>®</sup> Core<sup>™</sup> microarchitecture (Nehalem) delivers
  - Huge reduction in local memory latency
  - Even remote memory latency is fast
- Effective memory latency depends on the application/OS
  - Percentage of local vs. remote accesses
  - Intel Core microarchitecture (Nehalem) has lower latency regardless of mix
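
The dependence on the local/remote mix is a weighted average. In the sketch below the two latency values are placeholders chosen for round numbers, not measurements from the talk.

```python
# Effective latency as a weighted average of local and remote accesses.

def effective_latency_ns(local_fraction: float, local_ns: float,
                         remote_ns: float) -> float:
    """Average latency given the fraction of accesses served from local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


LOCAL_NS = 60.0    # hypothetical local memory latency
REMOTE_NS = 100.0  # hypothetical remote (other socket) latency

for frac in (1.0, 0.75, 0.5):
    avg = effective_latency_ns(frac, LOCAL_NS, REMOTE_NS)
    print(f"{frac:.0%} local: {avg:.0f} ns")
```

An OS or application that keeps most accesses local moves the average toward the local figure, which is why NUMA-aware placement matters on this platform.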



<sup>1</sup>Next generation Quad-Core Intel® Xeon® processor (Harpertown) Intel® Core™ microarchitecture (Nehalem)

# Virtualization

- To get best virtualized *performance* 
  - Have best native performance
  - Reduce:
    - # of transitions into/out of virtual machine
    - Latency of transitions
- Intel<sup>®</sup> Core<sup>™</sup> microprocessor (Nehalem) virtualization features
  - Reduced latency for transitions
  - Virtual Processor ID (VPID) to reduce effective cost of transitions
  - Extended Page Table (EPT) to reduce # of transitions

Great virtualization performance with Intel® Core™ microarchitecture (Nehalem)

## **Latency of Virtualization Transitions**

- Microarchitectural
  - Huge latency reduction generation over generation
  - Nehalem continues the trend
- Architectural
  - Virtual Processor ID (VPID) added in Intel<sup>®</sup> Core<sup>™</sup> microarchitecture (Nehalem)
  - Removes need to flush TLBs on transitions
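
Why a VPID tag removes the need to flush can be shown with a toy TLB. This is a conceptual sketch with hypothetical class and method names; it models only the tagging idea, not the hardware structure.

```python
# Toy TLB: with VPID, entries are tagged by virtual processor ID and survive
# VM transitions; without it, every transition must flush all entries.

class TaggedTLB:
    def __init__(self, use_vpid: bool):
        self.use_vpid = use_vpid
        self.entries = {}        # (tag, virtual page) -> physical page
        self.current_vpid = 0    # 0 stands in for the VMM in this sketch

    def _tag(self) -> int:
        return self.current_vpid if self.use_vpid else 0

    def switch_to(self, vpid: int) -> None:
        """Model a VM entry or exit."""
        if not self.use_vpid:
            self.entries.clear()  # untagged entries could belong to anyone
        self.current_vpid = vpid

    def fill(self, vpage: int, ppage: int) -> None:
        self.entries[(self._tag(), vpage)] = ppage

    def lookup(self, vpage: int):
        return self.entries.get((self._tag(), vpage))


for use_vpid in (False, True):
    tlb = TaggedTLB(use_vpid)
    tlb.switch_to(1)          # enter guest 1
    tlb.fill(0x10, 0xA)       # guest 1 caches a translation
    tlb.switch_to(0)          # VM exit to the VMM
    tlb.switch_to(1)          # VM entry back into guest 1
    print("VPID" if use_vpid else "no VPID", "->", tlb.lookup(0x10))
```

Without tags the guest returns to a cold TLB after every exit; with tags its translations are still there, which is the "reduced effective cost of transitions" the slide refers to.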



### Higher Virtualization Performance Through Lower Transition Latencies

<sup>1</sup>Intel<sup>®</sup> Core<sup>™</sup> microarchitecture (formerly Merom) 45nm next generation Intel<sup>®</sup> Core<sup>™</sup> microarchitecture (Penryn) Intel<sup>®</sup> Core<sup>™</sup> microarchitecture (Nehalem)

## **Extended Page Tables (EPT) Motivation**



- A VMM needs to protect physical memory
  - Multiple Guest OSs share the same physical memory
  - Protections are implemented through page-table virtualization
- Page table virtualization accounts for a significant portion of virtualization overheads
  - VM Exits / Entries
- The goal of EPT is to reduce these overheads

# **EPT Solution**



- Intel<sup>®</sup> 64 Page Tables
  - Map Guest Linear Address to Guest Physical Address
  - Can be read and written by the guest OS
- New EPT Page Tables under VMM Control
  - Map Guest Physical Address to Host Physical Address
  - Referenced by new EPT base pointer
  - No VM Exits due to Page Faults, INVLPG or CR3 accesses
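
The two-stage mapping can be sketched with two lookup tables. Flat dictionaries stand in for the real multi-level page tables, and all page numbers are made up for illustration.

```python
# Two-stage address translation under EPT:
#   guest linear -> guest physical  (guest-owned page tables)
#   guest physical -> host physical (EPT, owned by the VMM)

PAGE = 4096


def translate(addr: int, table: dict) -> int:
    """Translate one address through a single-level page -> page mapping."""
    page, offset = divmod(addr, PAGE)
    if page not in table:
        raise KeyError(f"page fault at {addr:#x}")
    return table[page] * PAGE + offset


guest_page_table = {0x10: 0x2}   # hypothetical guest mapping
ept = {0x2: 0x7F}                # hypothetical VMM mapping


def guest_to_host(linear: int) -> int:
    guest_physical = translate(linear, guest_page_table)
    return translate(guest_physical, ept)


print(hex(guest_to_host(0x10123)))  # 0x7f123

# The guest can edit its own table without involving the VMM, because the
# EPT still confines it to the host pages the VMM granted.
guest_page_table[0x11] = 0x2
print(hex(guest_to_host(0x11000)))  # 0x7f000
```

Since protection is enforced in the second stage, the VMM no longer has to intercept guest page-table updates, which is where the saved VM exits come from.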

### Intel<sup>®</sup> Core<sup>™</sup> Microarchitecture (Nehalem-EP) Platform Architecture

- Integrated Memory Controller
  - 3 DDR3 channels per socket
  - Massive memory *bandwidth*
  - Memory Bandwidth scales with # of processors
  - Very low memory latency
- Intel® QuickPath Interconnect (Intel® QPI)
  - New point-to-point interconnect
  - Socket to socket connections
  - Socket to chipset connections
  - Build *scalable* solutions



### Significant performance leap from new platform

Intel<sup>®</sup> Core<sup>™</sup> microarchitecture (Nehalem-EP) Intel<sup>®</sup> Next Generation Server Processor Technology (Tylersburg-EP)

### **Intel Next-Generation Mainstream Processors<sup>1</sup>**

| Feature                                                               | Core™ i7                      | Lynnfield               | Clarkdale               | Clarksfield             | Arrandale               |
|-----------------------------------------------------------------------|-------------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
| Processing Threads<br>[via Intel® Hyper-Threading<br>Technology (HT)] | 8                             | Up to 8                 | Up to 4                 | Up to 8                 | Up to 4                 |
| Processor Cores                                                       | 4                             | 4                       | 2                       | 4                       | 2                       |
| Shared Cache                                                          | 8MB                           | Up to 8MB               | Up to 4MB               | Up to 8MB               | Up to 4MB               |
| Integrated Memory<br>Controller Channels                              | 3 ch. DDR3                    | 2 ch. DDR3              | 2 ch. DDR3              | 2 ch. DDR3              | 2 ch. DDR3              |
| DDR Freq Support<br>(sku dependent)                                   | 800, 1066                     | 1066, 1333              | 1066, 1333              | 1066, 1333              | 800, 1066               |
| # DIMMs/Channel                                                       | 2                             | 2                       | 2                       | 1                       | 1                       |
| PCI Express* 2.0                                                      | 2x16 or 4x8, 1x4<br>(via X58) | 1x16 or 2x8             | 1x16 or 2x8             | 1x16 or 2x8             | 1x16 (1.0)              |
| Processor Graphics                                                    | No                            | No                      | Yes                     | No                      | Yes                     |
| Processor Package TDP                                                 | 130W                          | 95W                     | 73W                     | 55W and 45W             | 35W, 25W, 18W           |
| Socket                                                                | LGA 1366                      | LGA 1156                | LGA 1156                | rPGA, BGA               | rPGA, BGA               |
| Platform Support                                                      | X58 & ICH10                   | Intel® 5 series Chipset | Intel® 5 series Chipset | Intel® 5 series Chipset | Intel® 5 series Chipset |
| Processor Core Process<br>Technology                                  | 45nm                          | 45nm                    | 32nm                    | 45nm                    | 32nm                    |

### **Bringing Intel<sup>®</sup> Core<sup>™</sup> i7 Benefits into Mainstream**

<sup>1</sup>Not all features are on all products, subject to change