detect loops that have a very low number of iterations. Core Duo’s loop
detector can detect loops with smaller iteration counts, a feature that saves
power and improves performance by lowering the number of instruction
fetches and BTB accesses.
SSE3
Core Duo introduces a new member into the SSE family of ISA extensions:
SSE3. The SSE3 instruction set consists of 13 new instructions, which Intel's
Software Developer's Manual summarizes as follows:
- One x87 FPU instruction used in integer conversion
- One SIMD integer instruction that addresses unaligned data loads
- Two SIMD floating-point packed ADD/SUB instructions
- Four SIMD floating-point horizontal ADD/SUB instructions
- Three SIMD floating-point LOAD/MOVE/DUPLICATE instructions
- Two thread-synchronization instructions
These new instructions fill in some gaps left in the SSE family and in the
x87 extensions, mostly in the areas of byte shuffling and floating-point
inter-element arithmetic. These are the areas in which the SSE family has
been weakest when compared with AltiVec.
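To make the horizontal ADD/SUB entries above concrete, here is a minimal
C sketch (mine, not from Intel's manual) that exercises the SSE3 haddps
instruction through its compiler intrinsic; build with SSE3 enabled (for
example, gcc -msse3):

    #include <pmmintrin.h>  /* SSE3 intrinsics */
    #include <stdio.h>

    int main(void)
    {
        /* _mm_set_ps lists elements high-to-low, so a = {1,2,3,4}. */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);

        /* haddps adds adjacent pairs within each operand:
           sum = {a0+a1, a2+a3, b0+b1, b2+b3} = {3, 7, 11, 15}.
           Before SSE3, this pattern needed extra shuffle work. */
        __m128 sum = _mm_hadd_ps(a, b);

        float out[4];
        _mm_storeu_ps(out, sum);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }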
Floating-Point Improvement
When programmers use the x87 floating-point instructions to perform
floating-point computations, they have multiple options available to them
for dealing with the more complicated aspects of floating-point math, like
number formats and rounding behavior. The x87 FPU has a special register
called the floating-point control word (FPCW), which programmers can use to
tell the FPU how they'd like it to handle these issues. In short, the FPCW
holds configuration data for the floating-point unit, and programmers write
new data into that register whenever they'd like to change the FPU's
configuration.
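As an illustration of the kind of code that rewrites the FPCW, the C99
sketch below toggles the rounding mode around two divisions, the way
interval-arithmetic libraries often do; on x87 hardware, each fesetround()
call ultimately becomes a control-word update:

    #include <fenv.h>
    #include <stdio.h>

    int main(void)
    {
        /* volatile keeps the compiler from folding the divisions at
           compile time, so the runtime rounding mode actually applies. */
        volatile double x = 1.0, y = 3.0;

        fesetround(FE_DOWNWARD);    /* round toward minus infinity */
        double lo = x / y;

        fesetround(FE_UPWARD);      /* round toward plus infinity */
        double hi = x / y;

        fesetround(FE_TONEAREST);   /* restore the default mode */
        printf("1/3 lies in [%.17g, %.17g]\n", lo, hi);
        return 0;
    }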
All Intel designs prior to Core Duo have assumed that programmers very
rarely write to the FPCW. Because of this assumption, Intel’s chip architects
have never associated any rename registers with the FPCW. As it turns out,
however, some types of programs contain code that writes to the FPCW fairly
frequently, most often to change the FPU’s rounding control options. For
such programs, a single copy of the FPCW is a significant bottleneck, because
the entire floating-point pipeline must stall until that one register is finished being updated.
Core Duo is the first Intel processor to feature a set of microarchitectural
rename registers for the FPCW. These four new rename registers enable Core
Duo to extract more parallelism from floating-point code by eliminating false
register name conflicts associated with the FPCW. (For more on false register
name conflicts, data hazards, and register renaming, see Chapter 4.)
Integer Divide Improvement
Integer divisions are rare in most code, but when they do occur, they stall the
complex integer unit for many cycles. The CIU must grind through the large
number of computations and bit shifts that it takes to produce a division
result; no other instructions can enter the CIU’s pipeline during this time.
Core Duo's complex integer unit tries to shorten integer division's long
latencies by examining each x86 integer divide instruction (idiv) that it
encounters in order to see if it can exit the division process early. For idiv
instructions that have smaller data sizes and need fewer iterations inside the
ALU hardware to produce a valid result, the integer unit stops the division
once the required number of iterations has completed. This technique
reduces average idiv latencies because the ALU no longer forces every idiv,
regardless of data size, to go through the same number of iterations. In some
cases, an idiv that would take 12 cycles on Dothan takes only 4 cycles on Core
Duo, and in others the latency can be reduced from 20 cycles (Dothan) to
12 cycles (Core Duo).
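The following C sketch of a bit-serial restoring divider is only an analogy
for what the hardware does (and it leans on GCC's __builtin_clz), but it
shows why the early exit helps: by starting at the dividend's highest set
bit rather than at a fixed bit 31, a small dividend finishes in far fewer
iterations.

    #include <stdint.h>

    /* Restoring division, one quotient bit per iteration.
       The divisor must be nonzero. */
    static uint32_t divide(uint32_t dividend, uint32_t divisor,
                           uint32_t *remainder)
    {
        uint32_t quotient = 0, rem = 0;

        /* Early exit: iterate only over the dividend's significant
           bits, the analog of Core Duo ending an idiv once the
           required number of iterations has completed. */
        int bits = 32 - __builtin_clz(dividend | 1);

        for (int i = bits - 1; i >= 0; i--) {
            rem = (rem << 1) | ((dividend >> i) & 1); /* bring down a bit */
            if (rem >= divisor) {                     /* restoring step */
                rem -= divisor;
                quotient |= 1u << i;
            }
        }
        *remainder = rem;
        return quotient;
    }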
Virtualization Technology
The SSE3 instructions aren't the only new extensions added to the x86 ISA.
Intel also used Core Duo to introduce its Virtualization Technology, called
VT-x, along with a set of supporting ISA extensions called Virtual Machine
Extensions (VMX).
VT-x is worthy of its own chapter, but I’ll summarize it very briefly here.
In a nutshell, VT-x enables a single processor to run multiple operating
system/application stacks simultaneously, with each stack thinking that it has
complete control of the processor. VT-x accomplishes this by presenting a
virtual processor
to each operating system instance. A
virtual machine monitor
(VMM)
then runs at a level beneath the operating systems, closest to the processor hardware, and manages the multiple operating system instances
running on the virtual processors.
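A small example of how software can at least detect these extensions: CPUID
leaf 1 reports VMX support in bit 5 of ECX. Actually entering VMX operation
(VMXON and friends) requires ring-0 code and is beyond a sketch like this:

    #include <stdio.h>
    #include <cpuid.h>   /* GCC/Clang wrapper for the CPUID instruction */

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            printf("CPUID leaf 1 not supported\n");
            return 1;
        }
        /* ECX bit 5 is the VMX feature flag. */
        printf("VMX supported: %s\n", (ecx & (1u << 5)) ? "yes" : "no");
        return 0;
    }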
With virtualization technology, a single, possibly underutilized multi-
core processor can be made to do the work of multiple computers, thereby
keeping more of its execution hardware busy during each cycle. Indeed,
VT-x can be thought of as a way to increase power efficiency simply by giving
the processor more work to do, so that fewer execution slots per cycle are
wasted due to idleness.
Summary: Core Duo in Historical Context
Core Duo’s improvements on the Dothan design enabled Intel to offer a
dual-core part with the power dissipation characteristics of previous single-
core parts. Because it integrated two cores onto a single die, Core Duo could
also offer a significant speedup for workloads involving multiple instruction
streams (or threads of execution in computer science parlance). However,
more radical changes to the microarchitecture were needed if Intel was to
meet its goal of dramatically increasing performance on single instruction
streams without also increasing clockspeed and power consumption.
Core 2 Duo
The Intel Core microarchitecture introduced in the Core 2 Duo line of
processors represents Intel’s most ambitious attempt since the Pentium Pro
to increase single-threaded performance independently of clockspeeds.
Because its designers took a “more hardware” instead of “more clockspeed”
approach to performance, Core is bigger and wider than just about any
mass-market design that has come before it (see Table 12-3). Indeed, this
“more of everything” approach is readily apparent with a glance at the diagram of the
new microarchitecture in Figure 12-10.
In every phase of Core’s 14-stage pipeline, there is more of just about
anything you could think of: more decoding logic, more re-order buffer
space, more reservation station entries, more issue ports, more execution
hardware, more memory buffer space, and so on. In short, Core’s designers
took everything that has already been proven to work and added more of it,
along with a few new tricks and tweaks.
Table 12-3: Features of the Core 2 Duo/Solo

    Introduction Date              July 27, 2006
    Process                        65 nanometer
    Transistor Count               291 million
    Clock Speed at Introduction    1.86 to 2.93 GHz
    L1 Cache Size                  32KB instruction, 32KB data
    L2 Cache Size (on-die)         2MB or 4MB
    x86 ISA Extensions             EM64T for 64-bit support
Core is wider in the decode, dispatch, issue, and commit pipeline phases
than every processor covered in this book except the PowerPC 970. Core’s
instruction window, which consists of a 96-entry reorder buffer and a 32-entry
reservation station, is bigger than that of any previous Intel
microarchitecture except for Netburst. However, as I've mentioned before,
bigger doesn't automatically mean better. There are real-world limits on the
number of instructions that can be executed in parallel, so the wider the
machine, the more execution slots per cycle that can potentially go unused
because of limits to instruction-level parallelism (ILP). Furthermore,
Chapter 3 described how memory latency can starve a wide machine for code
and data, resulting in a
waste of execution resources. Core has a number of features that are there
solely to address ILP and memory latency issues and to ensure that the
processor is able to keep its execution units full.
[Figure 12-10 is a block diagram of the Core microarchitecture: the front
end's instruction fetch, branch prediction unit (BPU), and x86 translate/
decode hardware feed the reorder buffer (ROB) and reservation station (RS).
Issue ports 0 through 5 dispatch micro-ops to the back end's scalar integer
ALUs (CIU1, CIU2, SIU), the floating-point and vector MMX/SSE units (FADD,
FMUL, VFADD, VFMUL, MMX0, MMX1, MMX5, VSHUF, F/VMOV), the branch unit, and
the load-store unit's memory access units (load address, store address,
store data). A commitment unit with its re-order buffer retires results.]

Figure 12-10: The Intel Core microarchitecture
NOTE
The Intel Core microarchitecture family actually consists of three nearly identical
microarchitectural variants, each of which is known by its code name. Merom is the
low-power mobile microarchitecture, Conroe is the desktop microarchitecture, and
Woodcrest is the server microarchitecture.
In the front end, micro-ops fusion and a new trick called macro-fusion
work together to keep code moving into the back end; and in the back end,
a greatly enlarged instruction window ensures that more instructions can
reach the execution units on each cycle. Intel has also fixed an important
SSE bottleneck that existed in previous designs, thereby massively improving
Core’s vector performance over that of its predecessors.
In the remainder of this chapter, I’ll talk about all of these improvements
and many more, placing each of Core’s new features in the context of Intel’s
overall focus on balancing performance, scalability, and power consumption.
The Fetch Phase
As I’ll discuss in more detail later, Core has a higher decode rate than any of
its predecessors. This higher decode rate means that more radical design
changes were needed in the fetch phase to prevent the decoder from being
starved for instructions. A simple increase in the size of the fetch buffer
wouldn’t cut it this time, so Intel tried a different approach.
Core’s fetch buffer is only 32 bytes—the size of the fetch buffer on the
original P6 core. In place of an expanded fetch buffer, Core sports an entirely
new structure that sits in between the fetch buffer and the decoders: a bona
fide instruction queue.
Core's 18-entry IQ, depicted in Figure 12-11, holds about the same number
of x86 instructions as the Pentium M's 64-byte fetch buffer. The predecode
hardware can move up to six x86 instructions per cycle from the fetch buffer
into the IQ, where a new feature called macro-fusion is used to prepare
between four and five x86 instructions each cycle for transfer from the IQ
to the decode hardware.
[Figure 12-11 is a block diagram of the fetch and decode hardware: the L1
instruction cache fills a 2 x 16-byte fetch buffer, which feeds the 18-entry
instruction queue. The queue sends x86 instructions, some of them macro-fused,
to three simple decoders and one complex decoder backed by the
translate/microcode engine; the resulting micro-ops, some of them fused,
flow onward to the reorder buffer (ROB).]

Figure 12-11: Core's fetch and decode hardware
NOTE
Core’s instruction queue also takes over the hardware loop buffer function of previous
designs’ fetch buffers.
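To see why a true queue beats a simple fetch buffer here, consider the toy
model below. The insertion rate (six per cycle) and capacity (18 entries)
come from the text; the fixed drain rate of four per cycle is my simplifying
assumption. The queue's occupancy absorbs a two-cycle fetch bubble, so the
decoders starve for only one cycle:

    #include <stdio.h>

    int main(void)
    {
        const int IQ_CAPACITY = 18;
        int occupancy = 0;

        /* Pretend fetch delivers 6 instructions per cycle, with a
           two-cycle bubble (say, a cache hiccup) on cycles 2 and 3. */
        int fetched[8] = {6, 6, 0, 0, 6, 6, 6, 6};

        for (int cycle = 0; cycle < 8; cycle++) {
            int in = fetched[cycle];
            if (occupancy + in > IQ_CAPACITY)  /* queue full: fetch stalls */
                in = IQ_CAPACITY - occupancy;
            occupancy += in;

            int out = occupancy < 4 ? occupancy : 4; /* decode drains 4 */
            occupancy -= out;

            printf("cycle %d: enqueued %d, decoded %d, occupancy %d\n",
                   cycle, in, out, occupancy);
        }
        return 0;
    }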
Macro-Fusion
A major new feature of Core's front end hardware is its ability to fuse pairs
of x86 instructions together in the predecode phase and send them through a
single decoder to be translated into a single micro-op. This feature, called
macro-fusion, can be used only on certain types of instructions; specifically,
compare and test instructions can be macro-fused with branch instructions.
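For a concrete picture of what qualifies, the C loop below compiles to a
compare instruction immediately followed by a conditional branch; that
adjacent cmp/jcc pair is the pattern Core's predecoder can fuse into one
micro-op (the assembly in the comment is representative output, not from
any particular compiler):

    /* The loop's back edge becomes something like:

           cmp  eax, 100    ; test the counter...
           jl   loop_top    ; ...and branch: a macro-fusible pair
    */
    int sum_to_100(void)
    {
        int total = 0;
        for (int i = 0; i < 100; i++)
            total += i;
        return total;
    }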
Core's predecode phase can send one macro-fused x86 instruction per cycle to
any one of the front end's four decoders. (As we'll see later, Core has four
instruction decoders, one more than its predecessors.) In turn, the decode