
that it takes to support such instructions has not tended to grow either. Transistors, on the other hand, have shrunk rapidly since the Pentium was introduced. When you put these two facts together, this means that the relative cost (in transistors) of x86 support, a cost that is mostly concentrated in an x86 CPU’s front end, has dropped as CPU transistor counts have increased.

x86 support accounts for well under 10 percent of the transistors on the Pentium 4, and this percentage is even smaller for the very latest Intel processors. This steady and dramatic decrease in the relative cost of legacy support has contributed significantly to the ability of x86 hardware to catch up to and even surpass its RISC competitors in both integer and floating-point performance. In other words, Moore’s Curves have been extremely kind to the x86 ISA.


In spite of the high price it paid for x86 support, the Pentium was commercially successful, and it furthered Intel’s dominance in the x86 market that the company had invented. But for Intel to take x86 performance to the next level, it needed to take a series of radical steps with the follow-on to the Pentium, the Pentium Pro.

NOTE
Here and throughout this book, I use the term Moore’s Curves in place of the more popular phrase Moore’s Law. For a detailed explanation of the phenomenon referred to by both of these terms, see my article at Ars Technica entitled “Understanding Moore’s Law” (http://arstechnica.com/paedia/m/moore/moore-1.html).

The Intel P6 Microarchitecture: The Pentium Pro

Intel’s P6 microarchitecture, first implemented in the Pentium Pro, was by any reasonable metric a resounding success. Its performance was significantly better than that of the Pentium, and the market rewarded Intel handsomely for it. The microarchitecture also proved extremely scalable, furnishing Intel with a good half-decade of desktop dominance and paving the way for x86 systems to compete with RISC in the workstation and server markets. Table 5-2 summarizes the evolution of the P6 microarchitecture’s features.

Table 5-2: The Evolution of the P6

                              Pentium Pro Vitals           Pentium II Vitals             Pentium III Vitals
Introduction Date             November 1, 1995             May 7, 1997                   February 26, 1999
Process                       0.60/0.35 micron             0.35 micron                   0.25 micron
Transistor Count              5.5 million                  7.5 million                   9.5 million
Clock Speed at Introduction   150, 166, 180, and 200 MHz   233, 266, and 300 MHz         450 and 500 MHz
L1 Cache Size                 8KB instruction, 8KB data    16KB instruction, 16KB data   16KB instruction, 16KB data
L2 Cache Size                 256KB or 512KB (on-die)      512KB (off-die)               512KB (on-die)
x86 ISA Extensions            None                         MMX                           SSE added in 1999

What was the P6’s secret, and how did it offer such a quantum leap in performance? The answer is complex and involves the contribution of numerous technologies and techniques, the most important of which had already been introduced into the x86 world by Intel’s smaller x86 competitors (most notably, AMD’s K5): the decoupling of the front end’s fetching and decoding functions from the back end’s execution function by means of an instruction window.

Figure 5-6 illustrates the basic P6 microarchitecture. As you can see, this microarchitecture sports quite a few prominent features that distinguish it fundamentally from the designs we’ve studied thus far.


[Figure omitted: block diagram of the P6. The front end contains the instruction fetch hardware, the x86 translate/decode unit, and the branch unit (BU). Decoded instructions pass through the reorder buffer (ROB) and reservation station (RS) into the back end, whose execution units sit on ports 0 through 4: the complex and simple integer units (CIU, SIU), the floating-point unit (FPU), and the load-store/memory access units (load address, store address, store data). Results retire through the commitment unit.]

Figure 5-6: The Pentium Pro

Decoupling the Front End from the Back End

In the Pentium and its predecessors, instructions travel directly from the decoding hardware to the execution hardware, as depicted in Figure 5-7. In this simple processor, instructions are statically scheduled by the dispatch logic for execution by the two ALUs. First, instructions are fetched and decoded. Next, the control unit’s dispatch logic examines a pair of instructions using a set of hardwired rules to determine whether or not they can be executed in parallel. If the two instructions can be executed in parallel, the control unit sends them to the two ALUs, where they’re simultaneously executed on the same clock cycle. When the two instructions have completed their execution phase (i.e., their results are available on the data bus), they’re put back in program order, and their results are written back to the register file in the proper sequence.

[Figure omitted: a simple pipeline in which the front end’s fetch and decode/dispatch stages feed the back end’s two ALUs directly, followed by the write-back stage.]

Figure 5-7: Static scheduling in the original Pentium

This static, rules-based approach to dispatching instructions is rigid and simplistic, and it has two major drawbacks, both stemming from the fact that although the code stream is inherently sequential, a superscalar processor attempts to execute parts of it in parallel. Specifically, static scheduling

- adapts poorly to the dynamic and ever-changing code stream;
- makes poor use of wider superscalar hardware.

Because the Pentium can dispatch at most two operations simultaneously from its decode hardware to its execution hardware on each clock cycle, its dispatch rules look at only two instructions at a time to see if they can or cannot be dispatched simultaneously. If more execution hardware were added to the Pentium, and the dispatch width were increased to three instructions per cycle (as it is in the P6), the rules for determining which instructions go where would need to be able to account for various possible combinations of two and three instructions at a time in order to get those instructions to the right execution unit at the right time. Furthermore, such rules would inevitably be difficult for programmers to optimize for, and if they weren’t overly complex, there would necessarily exist many common instruction sequences that would perform suboptimally under the default rule set.

In plain English, the makeup of the code stream would change from application to application and from moment to moment, but the rules responsible for scheduling the code stream’s execution on the Pentium’s back end would be forever fixed.
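
To make the contrast concrete, here is a minimal sketch in Python of the kind of fixed, hardwired pairing check described above. Everything in it (the Instruction type, the notion of a “simple” op that the second ALU can handle, and the two rules themselves) is an illustrative assumption for this sketch; the Pentium’s real pairing rules are considerably longer and more detailed.

# A minimal sketch of static, rules-based dual issue, loosely in the spirit of
# the Pentium's hardwired pairing checks. The Instruction class, the idea of a
# "simple" op, and the two rules below are illustrative assumptions, not
# Intel's actual pairing logic.

from dataclasses import dataclass, field

@dataclass
class Instruction:
    op: str                                    # mnemonic, e.g. "add"
    dests: set = field(default_factory=set)    # registers written
    srcs: set = field(default_factory=set)     # registers read
    simple: bool = True                        # only simple ops may use the second ALU

def can_pair(first: Instruction, second: Instruction) -> bool:
    """Fixed rules applied at dispatch: no data dependence within the pair,
    and the second instruction must be simple enough for ALU2."""
    if first.dests & (second.srcs | second.dests):
        return False        # read-after-write or write-after-write hazard
    if not second.simple:
        return False        # second pipe handles only simple ops
    return True

def dispatch(pair):
    """Send one or two instructions to the ALUs for this cycle."""
    first, second = pair
    if can_pair(first, second):
        return [("ALU1", first), ("ALU2", second)]   # both execute this cycle
    return [("ALU1", first)]                         # second one waits a cycle

# Example: the second add reads eax, which the first add writes, so the
# pair cannot be dispatched together.
i1 = Instruction("add", dests={"eax"}, srcs={"eax", "ebx"})
i2 = Instruction("add", dests={"ecx"}, srcs={"eax"})
print(dispatch((i1, i2)))    # only ALU1 is used this cycle

The point of the sketch is that the rules are baked in at design time: whatever the code stream looks like, the same checks are applied to the same fixed window of two instructions.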


The Issue Phase

The solution to the dilemma posed by static execution is to dispatch newly decoded instructions into a special buffer that sits between the front end and the execution units. Once this buffer collects a handful of instructions that are waiting to execute, the processor’s dynamic scheduling logic can examine the instructions and, after taking into account the state of the processor and the resources currently available for execution, issue instructions from the buffer to the execution units at the most opportune time and in the optimal order. The dynamic scheduling logic has quite a bit of freedom to reorder the code stream so that instructions execute optimally, even if it means that two (or more) instructions must be executed not just in parallel but in reverse order. With dynamic scheduling, the current context in which a particular instruction finds itself executing can have much more of an impact on when and how it’s executed. In replacing the Pentium’s control unit with the combination of a buffer and a dynamic scheduler, the P6 microarchitecture replaces fixed rules with flexibility.

Of course, instructions that have been issued from the buffer to the execution units out of program order must be put back in program order once they’ve completed their execution phase, so another buffer is needed to catch the instructions that have completed execution and to place them back in program order. We’ll discuss that second buffer more in a moment. Figure 5-8 shows the two new buffers, both of which work together to decouple the execute phase from the rest of the instruction’s lifecycle.

In the processor depicted in Figure 5-8, instructions flow in program order from the decode stage into the first buffer, the issue buffer, where they sit until the processor’s dynamic scheduler determines that they’re ready to execute. Once the instructions are ready to execute, they flow from the issue buffer into the execution unit. This move, when instructions travel from the issue buffer where they’re scheduled for optimal execution into the execution units themselves, is called issuing.

There are a number of factors that can prevent an instruction from executing out of order in the manner described earlier. The instruction may depend for input on the results of an as-yet-unexecuted instruction, or it may be waiting on data to be loaded from memory, or it may be waiting for a busy execution unit to become available, or any one of a number of other conditions may need to be met before the decoded instruction is ready to be sent off to the proper execution unit. But once the instruction is ready, the scheduler sees that it is issued to the execution unit, where it will be executed.
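
The following sketch shows, in highly simplified form, how a dynamic scheduler might apply those readiness checks to an issue buffer each cycle. The Entry record, the set of “ready” registers, and the oldest-ready-first policy are assumptions made for the sake of illustration; they are not the P6’s actual reservation station logic.

# A minimal sketch of dynamic scheduling from an issue buffer. The data
# structures and the "oldest ready first" policy are illustrative assumptions;
# real schedulers (including the P6's reservation station) are far more elaborate.

from dataclasses import dataclass

@dataclass
class Entry:
    seq: int        # program order
    op: str
    srcs: set       # registers this instruction reads
    dest: str       # register it writes
    unit: str       # execution unit it needs, e.g. "ALU" or "LSU"

def issue(buffer, ready_regs, free_units):
    """Pick instructions whose inputs are available and whose execution unit
    is free, regardless of program order."""
    issued = []
    for entry in sorted(buffer, key=lambda e: e.seq):    # prefer older entries
        if entry.srcs <= ready_regs and entry.unit in free_units:
            issued.append(entry)
            free_units.remove(entry.unit)
    for entry in issued:
        buffer.remove(entry)
    return issued

# Example: eax is still in flight from memory, so the instruction that needs it
# is skipped this cycle while a younger, independent add issues ahead of it.
buffer = [
    Entry(1, "load", srcs={"esi"},        dest="eax", unit="LSU"),
    Entry(2, "add",  srcs={"eax", "ebx"}, dest="ebx", unit="ALU"),
    Entry(3, "add",  srcs={"ecx", "edx"}, dest="ecx", unit="ALU"),
]
ready = {"esi", "ebx", "ecx", "edx"}    # eax not yet produced
print([e.seq for e in issue(buffer, ready, {"ALU", "LSU"})])    # [1, 3]

In the example, the add that needs the still-in-flight load result simply waits, while the independent add behind it issues first, which is exactly the kind of reordering described above.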

This new twist on the standard instruction lifecycle is called out-of-order execution, or dynamic execution, and it requires the addition of two new phases to our instruction’s lifecycle, as shown in Table 5-3. The first new phase is the issue phase, and it encompasses the buffering and reordering of the code stream that I’ve just described.

The issue phase is implemented in different ways by different processors. It may take multiple pipeline stages, and it may involve the use of multiple buffers arranged in different configurations. What all of the different implementations have in common, though, is that instructions enter the issue phase and then wait there for an unspecified amount of time until the moment is right for them to execute. When they execute, they may do so out of program order.

[Figure omitted: the same pipeline as Figure 5-7, but with an issue buffer inserted between the front end’s decode/dispatch stage and the two ALUs, and a completion stage and commit unit placed between execution and write-back.]

Figure 5-8: Dynamic scheduling using buffers

Aside from its use in dynamic scheduling, another important function of the issue buffer is that it allows the processor to “squeeze” bubbles out of the pipeline prior to the execution phase. The buffer is a queue, and instructions that enter it drop down into the bottommost available entry.

Table 5-3: Phases of a Dynamically Scheduled Instruction’s Lifecycle

1  Fetch                 In order
2  Decode/dispatch       In order
3  Issue                 Reorder
4  Execute               Out of order
5  Complete              Reorder
6  Write-back (commit)   In order


So if an instruction is preceded by a pipeline bubble, when it enters the issue buffer, it will drop down into the empty space directly behind the instruction ahead of it, thereby eliminating the bubble.

Of course, the issue buffer’s ability to squeeze out pipeline bubbles depends on the front end’s ability to produce more instructions per cycle than the back end can consume. If the back end and front end move in lockstep, the pipeline bubbles will propagate through the issue queues into the back end.
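
As a rough illustration of that behavior, the sketch below models the issue buffer as a simple queue that the front end feeds each cycle with either an instruction or a bubble (None). The IssueQueue class and its accept/drain interface are inventions for this sketch only; the point is just that bubbles never occupy a queue entry, so the back end sees a packed instruction stream as long as the front end stays ahead.

# A minimal sketch of how an issue queue can squeeze out pipeline bubbles.
# The front end hands the queue whatever it produced this cycle (an instruction
# or a bubble, modeled as None); instructions fall to the lowest free slot, so
# bubbles never occupy an entry. This only helps if the front end can, on
# average, outrun the back end; otherwise the queue drains and the back end
# sees the bubbles anyway.

from collections import deque

class IssueQueue:
    def __init__(self, depth):
        self.depth = depth
        self.entries = deque()          # bottom of the queue is entries[0]

    def accept(self, slot):
        """Take the front end's output for this cycle; drop bubbles."""
        if slot is not None and len(self.entries) < self.depth:
            self.entries.append(slot)   # drops into the lowest available entry

    def drain(self, width):
        """Give the back end up to `width` packed instructions per cycle."""
        return [self.entries.popleft() for _ in range(min(width, len(self.entries)))]

# Front-end output over five cycles, with a bubble (None) after a branch:
q = IssueQueue(depth=8)
for produced in ["add", "sub", None, "mul", "load"]:
    q.accept(produced)
print(q.drain(width=2))    # ['add', 'sub'] -- the bubble never reached the back end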

The Completion Phase

The second phase that out-of-order execution adds to an instruction’s lifecycle is the completion phase. In this phase, instructions that have finished executing, or completed execution, wait in a second buffer to have their results written back to the register file in program order. When an instruction’s results are written back to the register file and the programmer-visible machine state
