that it takes to support such instructions has not tended to grow either.
Transistors, on the other hand, have shrunk rapidly since the Pentium was
introduced. When you put these two facts together, this means that the
relative cost (in transistors) of x86 support, a cost that is mostly
concentrated in an x86 CPU’s front end, has dropped as CPU transistor counts
have increased. x86 support accounts for well under 10 percent of the
transistors on the Pentium 4, and this percentage is even smaller for the
very latest Intel processors. This steady and dramatic decrease in the
relative cost of legacy support has contributed significantly to the ability
of x86 hardware to catch up to and even surpass its RISC competitors in both
integer and floating-point performance. In other words, Moore’s Curves have
been extremely kind to the x86 ISA.
In spite of the high price it paid for x86 support, the Pentium was
commercially successful, and it furthered Intel’s dominance in the x86 market
that the company had invented. But for Intel to take x86 performance to the
next level, it needed to take a series of radical steps with the follow-on to
the Pentium, the Pentium Pro.
NOTE: Here and throughout this book, I use the term Moore’s Curves in place
of the more popular phrase Moore’s Law. For a detailed explanation of the
phenomenon referred to by both of these terms, see my article at Ars Technica
entitled “Understanding Moore’s Law”
(http://arstechnica.com/paedia/m/moore/moore-1.html).
The Intel P6 Microarchitecture: The Pentium Pro
Intel’s P6 microarchitecture, first implemented in the Pentium Pro, was by any
reasonable metric a resounding success. Its performance was significantly
better than that of the Pentium, and the market rewarded Intel handsomely
for it. The microarchitecture also proved extremely scalable, furnishing Intel
with a good half-decade of desktop dominance and paving the way for
x86 systems to compete with RISC in the workstation and server markets.
Table 5-2 summarizes the evolution of the P6 microarchitecture’s features.
Table 5-2: The Evolution of the P6

                             Pentium Pro Vitals    Pentium II Vitals      Pentium III Vitals
Introduction Date            November 1, 1995      May 7, 1997            February 26, 1999
Process                      0.60/0.35 micron      0.35 micron            0.25 micron
Transistor Count             5.5 million           7.5 million            9.5 million
Clock Speed at Introduction  150, 166, 180,        233, 266, and          450 and 500 MHz
                             and 200 MHz           300 MHz
L1 Cache Size                8KB instruction,      16KB instruction,      16KB instruction,
                             8KB data              16KB data              16KB data
L2 Cache Size                256KB or 512KB        512KB (off-die)        512KB (on-die)
                             (on-die)
x86 ISA Extensions           (none)                MMX                    SSE added in 1999
What was the P6’s secret, and how did it offer such a quantum leap in
performance? The answer is complex and involves the contribution of numerous
technologies and techniques, the most important of which had already been
introduced into the x86 world by Intel’s smaller x86 competitors (most
notably, AMD’s K5): the decoupling of the front end’s fetching and decoding
functions from the back end’s execution function by means of an instruction
window.
Figure 5-6 illustrates the basic P6 microarchitecture. As you can see, this
microarchitecture sports quite a few prominent features that distinguish it
fundamentally from the designs we’ve studied thus far.
[Figure 5-6: The Pentium Pro. Block diagram: the front end (instruction
fetch, x86 translate/decode, and branch unit) feeds the reorder buffer (ROB)
and reservation station (RS); issue ports 0 through 4 feed the scalar ALUs
(CIU and SIU), the floating-point unit, and the load-store/memory access
units; a commitment unit retires results.]
Decoupling the Front End from the Back End
In the Pentium and its predecessors, instructions travel directly from the
decoding hardware to the execution hardware, as depicted in Figure 5-7. In
this simple processor, instructions are
statically scheduled
by the dispatch logic for execution by the two ALUs. First, instructions are fetched and decoded.
Next, the control unit’s dispatch logic examines a pair of instructions using a
set of hardwired rules to determine whether or not they can be executed in
parallel. If the two instructions can be executed in parallel, the control unit
sends them to the two ALUs, where they’re simultaneously executed on the
same clock cycle. When the two instructions have
completed
their execution
phase (i.e., their results are available on the data bus), they’re put back in
program order, and their results are written back to the register file in the
proper sequence.
[Figure 5-7: Static scheduling in the original Pentium. Instructions flow
from the front end’s fetch and decode/dispatch logic directly to the back
end’s two ALUs, which execute them and write back the results.]
This static, rules-based approach to dispatching instructions is rigid and
simplistic, and it has two major drawbacks, both stemming from the fact that
although the code stream is inherently sequential, a superscalar processor
attempts to execute parts of it in parallel. Specifically, static scheduling

- adapts poorly to the dynamic and ever-changing code stream;
- makes poor use of wider superscalar hardware.
Because the Pentium can dispatch at most two operations simultaneously
from its decode hardware to its execution hardware on each clock cycle, its
dispatch rules look at only two instructions at a time to see whether they can be dispatched simultaneously. If more execution hardware were added to
the Pentium, and the dispatch width were increased to three instructions
per cycle (as it is in the P6), the rules for determining which instructions go
where would need to be able to account for various possible combinations
of two and three instructions at a time in order to get those instructions to
the right execution unit at the right time. Furthermore, such rules would
inevitably be difficult for programmers to optimize for, and unless they were
made very complex, there would necessarily be many common instruction
sequences that performed suboptimally under the default rule set.
In plain English, the makeup of the code stream would change from
application to application and from moment to moment, but the rules
responsible for scheduling the code stream’s execution on the Pentium’s
back end would be forever fixed.
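To make that rigidity concrete, here is a toy sketch in Python of how a fixed, hardwired rule set might decide whether two decoded instructions can dispatch together. Everything in it (the instruction format, the “simple ops” set, and the three rules) is invented for illustration; these are not the Pentium’s actual pairing restrictions.

    from dataclasses import dataclass

    @dataclass
    class Instruction:
        op: str         # mnemonic, e.g., "add" or "load"
        dest: str       # destination register
        srcs: tuple     # source registers

    SIMPLE_OPS = {"add", "sub", "mov"}   # ops that either ALU can execute

    def can_pair(a: Instruction, b: Instruction) -> bool:
        """Hardwired check: may a and b dispatch to the two ALUs together?"""
        if a.op not in SIMPLE_OPS or b.op not in SIMPLE_OPS:
            return False                 # rule 1: only simple ops may pair
        if a.dest in b.srcs:
            return False                 # rule 2: b must not read a's result
        if a.dest == b.dest:
            return False                 # rule 3: no write-write conflict
        return True

    i1 = Instruction("add", "eax", ("ebx", "ecx"))
    print(can_pair(i1, Instruction("sub", "edx", ("esi", "edi"))))  # True
    print(can_pair(i1, Instruction("sub", "edx", ("eax", "edi"))))  # False

Whatever the running program looks like, these same few checks run on every pair, and widening the machine to three-way dispatch would multiply the combinations that the hardwired rules must cover.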
The Issue Phase
The solution to the dilemma posed by static execution is to dispatch newly
decoded instructions into a special buffer that sits between the front end and
the execution units. Once this buffer collects a handful of instructions that
are waiting to execute, the processor’s
dynamic scheduling
logic can examine the instructions and, after taking into account the state of the processor
and the resources currently available for execution,
issue
instructions from the buffer to the execution units at the most opportune time and in the
optimal order. The dynamic scheduling logic has quite a bit of freedom to
reorder the code stream so that instructions execute optimally, even if it
means that two (or more) instructions must be executed not just in parallel
but in reverse order. With dynamic scheduling, the current context in which
a particular instruction finds itself executing can have much more of an
impact on when and how it’s executed. In replacing the Pentium’s control
unit with the combination of a buffer and a dynamic scheduler, the P6
microarchitecture replaces fixed rules with flexibility.
Of course, instructions that have been issued from the buffer to the
execution units out of program order must be put back in program order
once they’ve completed their execution phase, so another buffer is needed
to catch the instructions that have completed execution and to place them
back in program order. We’ll discuss that second buffer more in a moment.
Figure 5-8 shows the two new buffers, both of which work together to
decouple the execute phase from the rest of the instruction’s lifecycle.
In the processor depicted in Figure 5-8, instructions flow in program
order from the decode stage into the first buffer, the
issue buffer
, where they sit until the processor’s dynamic scheduler determines that they’re ready to
execute. Once the instructions are ready to execute, they flow from the issue
buffer into the execution unit. This move, in which instructions travel from
the issue buffer, where they’re scheduled for optimal execution, into the
execution units themselves, is called issuing.
There are a number of factors that can prevent an instruction from executing
out of order in the manner described earlier. The instruction may depend for
input on the results of an as-yet-unexecuted instruction, or it may be
waiting on data to be loaded from memory, or it may be waiting for a busy
execution unit to become available, or any one of a number of other
conditions may need to be met before the decoded instruction is ready to be
sent off to the proper execution unit. But once the instruction is ready, the
scheduler sees to it that the instruction is issued to the execution unit,
where it will be executed.
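The following toy model in Python pulls these ideas together: instructions wait in a buffer, a readiness check applies the kinds of conditions just listed, and ready instructions issue even while older ones remain stalled. All of the names and data structures here are invented for illustration; this is nothing like the P6’s actual scheduling hardware.

    from dataclasses import dataclass

    @dataclass
    class Inst:
        seq: int        # position in program order
        op: str         # which kind of execution unit it needs
        srcs: tuple     # inputs ("mem" stands for pending load data)

    def schedule(buffer, ready_regs, free_units):
        """Issue every buffered instruction whose conditions are all met."""
        issued = []
        for inst in list(buffer):                  # scan oldest first
            inputs_ready = all(s in ready_regs for s in inst.srcs)
            if inputs_ready and inst.op in free_units:
                buffer.remove(inst)
                free_units.remove(inst.op)         # that unit is now busy
                issued.append(inst)                # possibly out of order
        return issued

    # Instruction 1 is a load still waiting on memory, so it stalls, but
    # the younger, independent instructions behind it can issue anyway:
    buf = [Inst(1, "load", ("mem",)),
           Inst(2, "add", ("ecx", "edx")),
           Inst(3, "add", ("edi",))]
    issued = schedule(buf, ready_regs={"ecx", "edx", "edi"},
                      free_units=["load", "add", "add"])
    print([i.seq for i in issued])   # -> [2, 3]: both adds pass the load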
This new twist on the standard instruction lifecycle is called
out-of-order
execution
, or
dynamic execution
, and it requires the addition of two new phases to our instruction’s lifecycle, as shown in Table 5-3. The first new phase is the
issue phase
, and it encompasses the buffering and reordering of the code stream that I’ve just described.
The issue phase is implemented in different ways by different processors.
It may take multiple pipeline stages, and it may involve the use of multiple
buffers arranged in different configurations. What all of the different
implementations have in common, though, is that instructions enter the issue
phase and then wait there for an unspecified amount of time until the
moment is right for them to execute. When they execute, they may do so
out of program order.
[Figure 5-8: Dynamic scheduling using buffers. Fetched and decoded
instructions are dispatched into an issue buffer in the back end; from there
they issue to the two ALUs, and completed instructions collect in the commit
unit’s buffer before write-back.]
Aside from its use in dynamic scheduling, another important function of
the issue buffer is that it allows the processor to “squeeze” bubbles out of the pipeline prior to the execution phase. The buffer is a queue, and instructions
that enter it drop down into the bottommost available entry.
Table 5-3: Phases of a Dynamically Scheduled Instruction’s Lifecycle

1. Fetch                   In order
2. Decode/dispatch         In order
3. Issue                   Reorder
4. Execute                 Out of order
5. Complete                Reorder
6. Write-back (commit)     In order
So if an instruction is preceded by a pipeline bubble, when it enters the
issue buffer, it will drop down into the empty space directly behind the
instruction ahead of it, thereby eliminating the bubble.
Of course, the issue buffer’s ability to squeeze out pipeline bubbles
depends on the front end’s ability to produce more instructions per cycle
than the back end can consume. If the back end and front end move in lock
step, the pipeline bubbles will propagate through the issue queues into the
back end.
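Here is a minimal sketch of that bubble-squeezing behavior, with made-up batch contents and widths. Because this front end is wider (three slots per cycle) than the back end (two issues per cycle), work queued up in earlier cycles can cover later fetch bubbles:

    from collections import deque

    def run(front_end_batches, issue_width):
        queue, log = deque(), []
        for batch in front_end_batches:
            for slot in batch:
                if slot is not None:       # bubbles never enter the queue;
                    queue.append(slot)     # work drops to the next free entry
            # the back end drains at most issue_width instructions per cycle
            n = min(issue_width, len(queue))
            log.append([queue.popleft() for _ in range(n)])
        return log

    batches = [["a", "b", "c"], ["d", None, None], ["e", "f", None]]
    for cycle, issued in enumerate(run(batches, issue_width=2)):
        print(f"cycle {cycle}: issued {issued}")
    # cycle 0: issued ['a', 'b']   ('c' waits in the queue)
    # cycle 1: issued ['c', 'd']   (queued 'c' covers this cycle's bubbles)
    # cycle 2: issued ['e', 'f']

If the front end in this sketch could deliver only two instructions per cycle, the queue would never get ahead of the back end, and every fetch bubble would reach the ALUs.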
The Completion Phase
The second phase that out-of-order execution adds to an instruction’s
lifecycle is the
completion phase
. In this phase, instructions that have finished executing, or
completed execution
, wait in a second buffer to have their results written back to the register file
in program order
. When an instruction’s results are written back to the register file and the programmer-visible machine state