Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
one processor: the microarchitecture’s code name, the microarchitecture’s
brand name, and the processor’s brand name.
The processors described in this chapter are commonly known by the
code names that Intel assigned to their microarchitectures prior to their
launch. The first version of the Pentium M was called
Banias
, and a later revision was called
Dothan
; the code name for Core Duo was
Yonah
. The code name situation for Core 2 Duo is a bit complicated, and will be unraveled in
the appropriate section. These code names are still used periodically by Intel
and others, so you’ll often need to be familiar with them in order to know
which design is being discussed in a particular article or white paper.
Figure 12-1 should help you understand and keep track of the different
names and codenames used throughout this chapter. Note that the related
information for the Pentium 4 is included for reference.
Microarchitecture Code Name    Microarchitecture Brand Name               Processor Brand Name
Willamette, Prescott, etc.     Netburst Microarchitecture                 Pentium 4
Banias, Dothan                 Intel Original Mobile Microarchitecture    Pentium M
Yonah                          Intel Original Mobile Microarchitecture    Core Duo/Solo
Merom/Conroe/Woodcrest         Intel Core Microarchitecture               Core 2 Duo/Solo

Figure 12-1: Code names and official brand names for some recent Intel architectures and implementations
236
Chapter 12
To forestall any potential confusion, I have avoided the use of both the
brand names and the code names for the microarchitectures under discussion.
Instead, I typically employ the official brand name that Intel has given the desktop processor (as opposed to the mobile or server processor) that implements a particular microarchitecture.
The Rise of Power-Efficient Computing
Although the so-called “mobile revolution” had clearly arrived by the time
Intel introduced the Pentium M in 2003, the previous years’ rapid growth in
portable computer sales wasn’t the only reason Intel and other processor
manufacturers had begun to pay serious attention to the power dissipation of
their chips. As transistor sizes steadily shrank and designers became able to
cram more power-hungry circuits into each square millimeter of a chip’s
surface area, a new barrier to processor performance loomed on the near
horizon: the power wall.
Power wall is a term used by Intel to describe the point at which its chips’ power density (the number of watts dissipated per unit area) began to seriously limit further integration and clockspeed scaling. The general idea behind the power wall is straightforward. Though the explanation here leaves out a number of factors, like the effects of per-device capacitance and supply voltage, it should nonetheless give you enough of a handle on the phenomenon that you can understand some of the major design decisions behind the processors covered in this chapter.
Power Density
The amount of power that a chip dissipates per unit area is called its power
density, and there are two types of power density that concern processor
architects: dynamic power density and static power density.
Dynamic Power Density
Each transistor on a chip dissipates a small amount of power when it is
switched, and transistors that are switched rapidly dissipate more power than
transistors that are switched slowly. The total amount of power dissipated per
unit area due to switching of a chip’s transistors is called
dynamic power density
.
There are two factors that work together to cause an increase in dynamic
power density: clockspeed and transistor density.
Increasing a processor’s clockspeed involves switching its transistors
more rapidly, and as I just mentioned, transistors that are switched more
rapidly dissipate more power. Therefore, as a processor’s clockspeed rises, so
does its dynamic power density, because each of those rapidly switching tran-
sistors contributes more to the device’s total power dissipation. You can also
increase a chip’s dynamic power density by cramming more transistors into
the same amount of surface area.
Intel’s Pentium M, Core Duo, and Core 2 Duo
237
Figure 12-2 illustrates how transistor density and clockspeed work
together to increase dynamic power density. As the clockspeed of the device
and the number of transistors per unit area rise, so does the overall dynamic
power density.
Figure 12-2: Dynamic power density. The figure plots power density rising along both the clockspeed axis and the transistor density axis.
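The relationship pictured in Figure 12-2 follows from the standard first-order dynamic power equation, P = aCV²f per transistor. The sketch below illustrates the scaling behavior only; the activity factor, capacitance, and supply voltage used here are illustrative placeholders, not real process parameters.

```python
def dynamic_power_density(transistors_per_mm2, clock_hz,
                          activity=0.1, cap_farads=1e-15, vdd=1.2):
    """Watts per square millimeter dissipated by switching transistors,
    using the first-order model P = activity * C * V^2 * f per device.
    The default constants are illustrative placeholders only."""
    return transistors_per_mm2 * activity * cap_farads * vdd ** 2 * clock_hz

# Doubling either the clockspeed or the transistor density doubles the
# dynamic power density; raising both raises it multiplicatively.
base = dynamic_power_density(1e6, 2e9)
assert abs(dynamic_power_density(2e6, 2e9) - 2 * base) < 1e-9
assert abs(dynamic_power_density(1e6, 8e9) - 4 * base) < 1e-9
```

This is why the two axes of Figure 12-2 compound: each multiplies the same per-transistor switching term.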
Static Power Density
In addition to clockspeed-related increases in dynamic power density, chip
designers must also contend with the fact that even transistors that aren’t
switching will still leak current during idle periods, much like how a faucet
that is shut off can still leak water if the water pressure behind it is high
enough. This leakage current causes an idle transistor to constantly dissipate a trace amount of power. The amount of power dissipated per unit area due to leakage current is called static power density.
Transistors leak more current as they get smaller, and consequently
static power densities begin to rise across the chip when more transistors
are crammed into the same amount of space. Thus even relatively low clockspeed devices with very small transistor sizes are still subject to increases in power density if leakage current is not controlled. If a silicon device’s overall power density gets high enough, it will begin to overheat and will eventually fail entirely. Thus it’s critical that designers of highly integrated devices like modern x86 processors take power efficiency into account when designing a new microarchitecture.
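The leakage term can be sketched in the same first-order fashion. The per-transistor leakage figure below is an illustrative placeholder, not a measured value for any real process; the point is only that static power density scales with transistor density alone, independent of clockspeed.

```python
def static_power_density(transistors_per_mm2, leak_watts_per_transistor=5e-8):
    """Watts per square millimeter from leakage current. The default
    per-transistor leakage is an illustrative placeholder, not a
    measured value for any real process."""
    return transistors_per_mm2 * leak_watts_per_transistor

# Static power is paid even when the clock is slow or stopped: packing
# twice as many (smaller, leakier) transistors into the same area
# doubles the leakage term regardless of clockspeed.
assert abs(static_power_density(2e6) - 2 * static_power_density(1e6)) < 1e-12
```

Total power density is the sum of the dynamic and static terms, which is why leakage matters even for low-clockspeed parts.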
Power density is a major, growing concern for every type of micro-
processor, regardless of the type of computer in which the processor is
intended to be used. The same types of power-aware design decisions
that are important for a mobile processor are now just as critical for a
server processor.
The Pentium M
In order to meet the new challenges posed by the power-efficient computing
paradigm, Intel’s Israel-based design team drew on the older, time-tested P6
microarchitecture as the basis for its new low-power design, the Pentium M.
The Pentium M takes the overall pipeline organization and layout of the P6
in its Pentium III incarnation and builds on it substantially with a number of
innovations, allowing it to greatly exceed its predecessor in both power
efficiency and raw performance (see Table 12-1).
Table 12-1: Features of the Pentium M

Introduction Date              March 12, 2003
Process                        0.13 micron
Transistor Count               77 million
Clock Speed at Introduction    1.3 to 1.6 GHz
L1 Cache Size                  32KB instruction, 32KB data
L2 Cache Size (on-die)         1MB
Most of the Pentium M’s new features are in its front end, specifically in
its fetch, decode, and branch-prediction hardware.
The Fetch Phase
As I explained in Chapter 5, the original P6 processor fetches one 16-byte
instruction packet per cycle from the I-cache into a buffer that’s two instruc-
tion packets (or 32 bytes) deep. (This fetch buffer is roughly analogous to the
PowerPC instruction queue [IQ] described in previous chapters.) From the
fetch buffer, x86 instructions can move at a rate of up to three instructions per cycle into the P6 core’s three decoders. This fetch and decode process
is illustrated in Figure 12-3.
On the Pentium M and its immediate successor, Core Duo, the fetch
buffer has been widened to 64 bytes. Thus the front end’s predecode hard-
ware can hold and examine up to four 16-byte instruction packets at a time.
This deeper buffer, depicted in Figure 12-4, is necessary to keep the newer
design’s much improved decode hardware (described later) from starving.
A second version of the Pentium M, commonly known by its code name, Dothan, modifies this 64-byte fetch buffer to do double duty as a hardware loop buffer. This fetch/loop buffer combination is also used in the Pentium M’s successor, the Core Duo.
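As a rough illustration of why the deeper buffer matters, the toy simulation below feeds 16-byte packets into a fixed-size buffer and drains it at a steady rate, counting the cycles on which the decoders find the buffer empty. This is a sketch, not Intel’s actual predecode logic; the fetch gap pattern and drain rate are assumptions chosen only to show the effect.

```python
def decoder_starved_cycles(buffer_bytes, fetch_pattern, drain=8):
    """Toy front-end model (not Intel's actual predecode logic).
    Each cycle a 16-byte packet arrives when fetch_pattern is True and
    the buffer has room; the decoders then drain up to `drain` bytes.
    Returns how many cycles the decoders found the buffer empty."""
    buffered, starved = 0, 0
    for fetched in fetch_pattern:
        if fetched and buffered + 16 <= buffer_bytes:
            buffered += 16
        if buffered == 0:
            starved += 1
        else:
            buffered -= min(drain, buffered)
    return starved

# A bursty fetch stream (four packets, then a three-cycle gap): the
# 64-byte buffer rides out gaps that repeatedly empty a 32-byte buffer.
pattern = ([True] * 4 + [False] * 3) * 20
assert decoder_starved_cycles(64, pattern) == 0
assert decoder_starved_cycles(32, pattern) == 20
```

In this toy run, the deeper buffer banks enough surplus bytes during fetch bursts to cover the gaps, so the decoders never starve.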
Figure 12-3: The P6 architecture’s fetch and decode hardware. The L1 instruction cache feeds a 2 x 16-byte fetch buffer; x86 instructions pass to two simple decoders and one complex decoder, which is backed by the microcode engine, and the decoded micro-ops flow through a 6-entry micro-op queue into the 40-entry reorder buffer (ROB).
The Hardware Loop Buffer
A hardware loop buffer caches the block of instructions that is located inside a program loop. Because the instructions inside a loop are repeated many times,
storing them in a front-end buffer keeps the processor from having to re-fetch
them on each loop iteration. Thus the loop buffer is a feature that saves power
because it cuts down on the number of accesses to the I-cache and to the
branch prediction unit’s branch target buffer.
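The power-saving idea can be sketched as follows. The buffer organization here (16-byte blocks, simple FIFO replacement) is an assumption for illustration, not Intel’s actual design.

```python
def fetch_with_loop_buffer(trace, buffer_size=64):
    """Toy model (not Intel's design): recently fetched instruction
    blocks stay in a small front-end buffer, and re-fetching a buffered
    block skips the I-cache and branch target buffer entirely.
    Returns (icache_accesses, accesses_saved)."""
    buffered = []                 # addresses of 16-byte blocks in the buffer
    icache, saved = 0, 0
    for addr in trace:
        if addr in buffered:
            saved += 1            # loop body replayed from the buffer
        else:
            icache += 1
            buffered.append(addr)
            if len(buffered) > buffer_size // 16:
                buffered.pop(0)   # FIFO replacement of the oldest block
    return icache, saved

# A tight loop spanning four 16-byte blocks, iterated 100 times: only
# the first pass touches the I-cache; the other 99 run from the buffer.
trace = [0x00, 0x10, 0x20, 0x30] * 100
assert fetch_with_loop_buffer(trace) == (4, 396)
```

Every avoided I-cache and branch target buffer access is dynamic power not spent, which is the loop buffer’s whole purpose.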
The Decode Phase: Micro-ops Fusion
One of the most important ways that Intel’s Pentium M architects were able
to get more performance out of the P6 architecture was by improving on the
predecessor design’s decode rate.
You’ll recall from Chapter 5 that each of the P6’s two simple/fast decoders
can output a maximum of one micro-op per cycle to the micro-op queue,
for a total of two micro-ops per cycle. All instructions that translate into more than one micro-op must use the single complex/slow decoder, which can
output up to four micro-ops per cycle. Thus, the P6 core’s decode hardware
can output a maximum of six micro-ops per cycle to its micro-op queue.
Figure 12-4: The Pentium M’s fetch and decode hardware. The layout mirrors the P6 design in Figure 12-3, but the fetch buffer is 4 x 16 bytes deep, and the decoders now emit fused micro-ops along a third path into the micro-op queue and reorder buffer (ROB).
For certain types of operations, especially memory operations, the P6’s
decoding scheme can cause a serious bottleneck. As I’ll discuss in more
detail later, x86 store instructions and a specific category of load instructions decode into two micro-ops, which means that most x86 memory accesses must use the complex/slow decoder. During bursts of memory instructions, the complex/slow decoder becomes backed up with work while the other two decoders sit idle. At such times, the P6’s decoding hardware decodes only two micro-ops per cycle, a number far short of its peak decode rate of six micro-ops per cycle.
The Pentium M’s redesigned decoding unit contains a new feature called micro-ops fusion that eliminates this bottleneck for memory accesses and enables the processor to increase the number of x86 instructions per cycle that it can convert to micro-ops. The Pentium M’s two simple/fast decoders are able to take certain x86 instructions that normally translate into two micro-ops and translate them into a single fused micro-op. These two decoders can send either one micro-op or one fused micro-op per cycle to the micro-op queue, as depicted in Figure 12-5. Because both simple/fast decoders can now process these formerly two-micro-op memory instructions, the Pentium M’s front end can actually achieve the maximum decode rate of six micro-ops per cycle during long stretches of memory traffic.
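The bottleneck and its fix can be sketched with a toy decode model. This is not cycle-accurate, and the routing rule (the complex decoder takes the next instruction, the simple decoders take following eligible ones) is a simplifying assumption; it only illustrates how letting the simple decoders handle fusible two-micro-op memory instructions restores the three-instruction-per-cycle decode rate.

```python
def decode_cycles(instructions, fusion=False):
    """Toy model of a P6-style decode group (not cycle-accurate).
    Each instruction is represented by its micro-op count. Per cycle,
    the complex decoder takes the next instruction; the two simple
    decoders take following instructions only if they decode to one
    micro-op or, with micro-ops fusion enabled, to one fusible pair.
    Returns the total cycles needed to decode the stream."""
    i, cycles, n = 0, 0, len(instructions)
    while i < n:
        cycles += 1
        i += 1                        # complex decoder takes one instruction
        for _ in range(2):            # the two simple/fast decoders
            if i < n and (instructions[i] == 1 or
                          (fusion and instructions[i] == 2)):
                i += 1
            else:
                break
    return cycles

# A burst of twelve store instructions, each decoding to two micro-ops:
stores = [2] * 12
assert decode_cycles(stores) == 12               # P6: one instruction per cycle
assert decode_cycles(stores, fusion=True) == 4   # Pentium M: three per cycle
```

In the fused case, every decoder accepts the memory instructions, so the decode group runs at full width instead of serializing through the complex decoder.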
Figure 12-5: Micro-ops fusion on the Pentium M. Variable-length x86 instructions feed the two simple decoders and the complex decoder; the simple decoders can emit micro-fused stores and micro-fused load-ops, as well as ordinary micro-ops, into the micro-op queue.
From the micro-op queue, each fused micro-op moves on to the instruc-
tion window, where it is assigned to a single ROB and RS entry and tracked
just like a normal micro-op through the rest of the Pentium M’s pipeline.
Note that the Pentium M’s back end treats the two constituent parts of a fused
micro-op as independent of each other for the purposes of issuing and exe-
cution. Thus the two micro-ops that make up a fused micro-op can issue in
parallel through two different issue ports or serially through the same port,
whichever is appropriate. Once the two micro-ops have completed execution,