Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
The Intel Pentium and Pentium Pro
Because the integer pipeline is the shortest, it’s normally taken to be
the default pipeline in discussions of a processor’s microarchitecture. So
when you see a reference, either in this text or in another text, to a superscalar processor’s pipeline or basic pipeline, you should assume it’s referring to the processor’s integer pipeline.
The Pentium’s basic integer pipeline is five stages long, with the stages
broken down as follows:
1. Prefetch/Fetch: Instructions are fetched from the instruction cache and aligned in prefetch buffers for decoding.
2. Decode-1: Instructions are decoded into the Pentium’s internal instruction format using a fast set of hardware-based rules. Branch prediction also takes place at this stage.
3. Decode-2: Instructions that require the microcode ROM are decoded here. Also, address computations take place at this stage.
4. Execute: The integer hardware ALU executes the instruction.
5. Write-back: The results of the computation are written back to the register file.
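The overlap that these five stages make possible can be sketched with a toy timing model. This is a minimal sketch assuming an ideal pipeline (one instruction issued per cycle, no stalls or bubbles); the function and its name are illustrative, not part of any real tool:

```python
# Toy model of an ideal five-stage pipeline like the Pentium's
# integer pipeline. Assumes one instruction enters per cycle and
# nothing ever stalls.
STAGES = ["Prefetch/Fetch", "Decode-1", "Decode-2", "Execute", "Write-back"]

def completion_cycle(instr_index: int) -> int:
    """Cycle (1-based) in which instruction number `instr_index`
    (0-based) leaves the write-back stage."""
    return len(STAGES) + instr_index

# The first instruction takes five cycles to fill the pipeline;
# after that, one instruction completes every cycle.
print(completion_cycle(0))  # 5
print(completion_cycle(9))  # 14 -> ten instructions done in 14 cycles
```

The point of the model is the second print: once the pipeline is full, throughput approaches one instruction per cycle regardless of pipeline depth.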
These stages should be familiar to you, although the first three stages are
slightly different from those of the simple four-stage pipelines described so
far. Let’s take a quick trip through the Pentium’s pipeline stages, so that you
can examine each one in a bit more detail.
The prefetch/fetch stage corresponds to the fetch phase of the standard instruction lifecycle. Unlike the simple, uniformly sized two-byte instructions of our example DLW architecture, x86 instructions can range from 1 to 17 bytes in length, though the average instruction length is a little under 3 bytes.
x86’s widely variable instruction length complicates the Pentium’s fetch stage, because instructions cannot simply be fetched and then fed directly into the decode stage. Instead, x86 instructions are first fetched into a buffer, where each instruction’s boundaries are detected and marked. From this buffer, the marked instructions are then aligned and sent to the Pentium’s decode hardware.
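The boundary-marking step can be illustrated with a toy scanner. The one-byte opcode-to-length table below is invented for the sketch; real x86 length decoding must also parse prefixes, ModRM/SIB bytes, and immediates:

```python
# Toy illustration of marking variable-length instruction boundaries
# in a fetch buffer. The opcode->length table is a tiny invented
# subset, not a real x86 length decoder.
LENGTHS = {0x90: 1, 0xB8: 5, 0xC3: 1}  # nop, mov eax imm32, ret

def mark_boundaries(buffer: bytes) -> list[int]:
    """Return the starting offset of each instruction in the buffer."""
    starts, pos = [], 0
    while pos < len(buffer):
        starts.append(pos)          # mark where this instruction begins
        pos += LENGTHS[buffer[pos]] # step over its full length
    return starts

# nop; mov eax, 1; ret
code = bytes([0x90, 0xB8, 1, 0, 0, 0, 0xC3])
print(mark_boundaries(code))  # [0, 1, 6]
```

Only after this scan does the fetch hardware know where one instruction ends and the next begins, which is exactly the work a fixed-length RISC ISA avoids.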
The decode phase of the Pentium’s execution process is split into two
pipeline stages, the first of which corresponds most closely to the decode pipeline stage with which you’re already familiar. The decode-1 stage takes the newly fetched instruction and decodes it into the Pentium’s internal instruction format, so that it can be used to tell the processor’s execution units how to manipulate the data stream. The decode-1 stage also involves the Pentium’s branch unit, which checks the currently decoding instruction to see whether it’s a branch and, if so, determines its type. It’s at this point that branch prediction takes place, but we’ll cover branch prediction in more detail in a moment.
The main difference between the Pentium’s five-stage pipeline and the
four-stage pipelines discussed in Chapter 3 lies in the second decode stage.
RISC ISAs, like the primitive DLW ISA of Chapters 1 and 2, support only a
few simple, load-store addressing modes. In contrast, the x86 ISA supports
multiple complex addressing modes, which were originally designed to make
assembly language programmers’ lives easier but ended up making everyone’s
lives more difficult. These addressing modes require extra address computations, and these computations are relegated to the decode-2 stage, where dedicated address computation hardware handles them before dispatching
the instruction to the execution units.
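The arithmetic that the decode-2 stage’s address hardware performs can be sketched in a few lines. The base + index × scale + displacement form is the real x86 addressing calculation; the function itself is just an illustration:

```python
def effective_address(base: int, index: int, scale: int, disp: int) -> int:
    """Compute an x86-style effective address:
    base + index * scale + displacement.
    In real x86 encodings, scale must be 1, 2, 4, or 8."""
    assert scale in (1, 2, 4, 8)
    return base + index * scale + disp

# e.g. the operand [ebx + esi*4 + 16] with ebx = 0x1000, esi = 3
print(hex(effective_address(0x1000, 3, 4, 16)))  # 0x101c
```

On a RISC design, a multi-term calculation like this would be the programmer’s (or compiler’s) job, spread across simple instructions; on the Pentium it is folded into a single instruction and handled by dedicated hardware in decode-2.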
The decode-2 stage is also where the Pentium’s microcode ROM kicks in.
The Pentium decodes many x86 instructions directly in its decoding hardware, but the longer instructions are decoded by means of a microcode ROM, as described in the section “A Brief History of the ISA” on page 71.
Once instructions have been decoded, the Pentium’s control unit deter-
mines when they can be dispatched to the back end and to which execution
unit. So the control unit’s job is to coordinate the movement of instructions
from the processor’s front end to its back end, so that they can enter the
execute and write-back phases of the execution process.
The last two pipeline stages—execute and write-back—should be familiar
to you by now. The next major section will describe the Pentium’s execution
units, so think of it as a more detailed discussion of the execute stage.
The Branch Unit and Branch Prediction
Before we take a closer look at the back end of the Pentium, let’s look at one
aspect of the Pentium’s front end in a bit more detail: the branch unit (BU).
On the right side of the Pentium’s front end, shown earlier in Figure 5-1,
notice a branch unit attached to the instruction fetch and decode/dispatch
pipeline stages. I’ve depicted the branch unit as part of the front end of the
machine—even though it technically still counts as a memory access unit—because the BU works closely with the instruction fetcher, steering it by means of the program counter to different sections of the code stream.
The branch unit contains the branch execution unit (BEU) and the branch prediction unit (BPU), and whenever the front end’s decoder encounters a conditional branch instruction, it sends it to the BU to be executed. The
BU in turn usually needs to send it off to one of the other execution units
to have the instruction’s branch condition evaluated, so that the BU can
determine if the branch is taken or not taken. Once the BU determines that the branch has been taken, it has to get the starting address of the next block
of code to be executed. This address, the branch target, must be calculated, and the front end must be told to begin fetching code at the new address.
In older processors, the entire processor just sort of sat idle and waited for
the branch condition to be evaluated, a wait that could be quite long if the
evaluation involved a complex calculation of some sort. Modern processors
use a technique called speculative execution, which involves making an educated guess at which direction the branch will ultimately take and then beginning execution at the new branch target before the branch’s condition is actually evaluated. This educated guess is made using one of a variety of branch prediction techniques, about which I’ll talk more in a moment. Speculative
execution is used to keep the delays associated with evaluating branches
from introducing bubbles into the pipeline.
Instructions that are speculatively executed cannot write their results
back to the register file until the branch condition is evaluated. If the BPU
predicted the branch correctly, those speculative instructions can then be
marked as non-speculative and have their results written back just like
regular instructions.
Branch prediction can backfire when the processor incorrectly predicts a
branch. Such mispredictions are bad, because if all of those instructions that
the processor has loaded into the pipeline and begun speculatively executing
turn out to be from the wrong branch target, the pipeline must be flushed
of the erroneous, speculative instructions and their attendant results. After
this pipeline flush, the front end must then fetch the correct branch target
address so that the processor can begin executing at the right place in the
code stream.
As you learned in Chapter 3, flushing the pipeline of instructions and
then refilling it again causes a huge hit to the processor’s average completion
rate and overall performance. Furthermore, there’s a delay (and therefore a few cycles’ worth of pipeline bubbles) associated with calculating the correct branch target and loading the new instruction stream into the front end.
This delay can degrade performance significantly, especially on branch-
intensive code. It is therefore imperative that a processor’s branch prediction
hardware be as accurate as possible.
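A back-of-the-envelope cost model shows why accuracy matters so much. The numbers below (branch frequency, flush penalty) are assumptions chosen for illustration, not Pentium specifics:

```python
def avg_branch_penalty(accuracy: float, branch_freq: float,
                       flush_cycles: int) -> float:
    """Average extra cycles per instruction lost to mispredictions,
    assuming every mispredicted branch costs `flush_cycles` bubbles.
    All parameters here are assumed example values."""
    return branch_freq * (1.0 - accuracy) * flush_cycles

# Assumed workload: 20% of instructions are branches, 4-cycle flush.
print(round(avg_branch_penalty(0.80, 0.20, 4), 2))  # 0.16 extra CPI
print(round(avg_branch_penalty(0.95, 0.20, 4), 2))  # 0.04 extra CPI
```

Under these assumed numbers, raising prediction accuracy from 80 to 95 percent cuts the misprediction overhead by a factor of four, which is why designers keep spending transistors on better predictors.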
There are two main types of branch prediction: static prediction and
dynamic prediction.
Static branch prediction is simple and relies on the assumption that the majority of backward-pointing branches occur in the context of repetitive loops, where the branch instruction is used to determine whether or not to repeat the loop again. Most of the time, a loop’s exit condition will be false, which means that the loop’s branch will evaluate to taken, thereby instructing the machine to repeat the loop’s code one more time. This being the case, static branch prediction merely assumes that all backward branches are taken. For a branch that points forward to a block of code that comes later in the program, the static predictor assumes that the branch is not taken.
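The backward-taken/forward-not-taken rule is simple enough to state in a couple of lines; this sketch is generic, not the Pentium’s exact circuitry:

```python
def static_predict(branch_pc: int, target_pc: int) -> bool:
    """Static 'backward taken, forward not taken' prediction:
    a branch whose target is at a lower address (typically a loop's
    back edge) is predicted taken; a forward branch is predicted
    not taken."""
    return target_pc < branch_pc

print(static_predict(0x400, 0x3F0))  # True  (backward branch: a loop)
print(static_predict(0x400, 0x420))  # False (forward branch: skip ahead)
```

Note that the rule needs nothing but the two addresses, which is exactly why static prediction is so fast: no tables, no history, no training.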
Static prediction is very fast because it doesn’t involve any table lookups
or calculations, but its success rate varies widely with the program’s instruction mix. If the program is full of loops, static prediction works fairly well; if it’s not, static prediction performs quite poorly.
To get around the problems associated with static prediction, computer
architects use a variety of algorithms for predicting branches based on a
program’s past behavior. These dynamic branch prediction algorithms usually involve the use of two types of tables—the branch history table (BHT) and the branch target buffer (BTB)—to record information about the outcomes of branches that have already been executed. The BHT creates an entry for each conditional branch that the BU has encountered in the last few cycles.
This entry includes some bits that indicate the likelihood that the branch
will be taken based on its past history. When the front end encounters a
branch instruction that has an entry in its BHT, the branch predictor uses
this branch history information to decide whether or not to speculatively
execute the branch.
Should the branch predictor decide to execute the branch speculatively,
it needs to know exactly where in memory the branch is pointing—in other
words, it needs a branch target. The BTB stores the branch targets of pre-
viously executed branches, so when a branch is taken, the BPU grabs the
speculative branch target from the BTB and tells the front end to begin
fetching instructions from that address. Hopefully, the BTB contains an
entry for the branch you’re trying to execute, and hopefully that entry is
correct. If the branch target either isn’t there or is wrong, you’ve got a
problem. I won’t get into the issues surrounding BTB performance, but
suffice it to say that a larger BTB is usually better, because it can store more branch targets and thus lower the chances of a BTB miss.
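A common way to implement the BHT’s “some bits” is a two-bit saturating counter per branch, paired with a BTB that caches targets. The sketch below (table size, indexing, the class itself) is illustrative and not the Pentium’s actual organization:

```python
class DynamicPredictor:
    """Toy dynamic predictor: a BHT of two-bit saturating counters
    (0-1 predict not taken, 2-3 predict taken) plus a BTB mapping
    branch addresses to their last-seen targets. Sizes and indexing
    are invented for illustration."""
    def __init__(self, entries: int = 256):
        self.entries = entries
        self.bht = [1] * entries        # start weakly not-taken
        self.btb: dict[int, int] = {}   # branch PC -> predicted target

    def predict(self, pc: int):
        taken = self.bht[pc % self.entries] >= 2
        return taken, self.btb.get(pc)  # target is None on a BTB miss

    def update(self, pc: int, taken: bool, target: int):
        i = pc % self.entries
        if taken:
            self.bht[i] = min(3, self.bht[i] + 1)  # saturate upward
            self.btb[pc] = target                  # remember the target
        else:
            self.bht[i] = max(0, self.bht[i] - 1)  # saturate downward

bp = DynamicPredictor()
for _ in range(3):              # train on a loop branch at 0x40, taken
    bp.update(0x40, True, 0x10)
print(bp.predict(0x40))  # (True, 16) -> predict taken, target 0x10
```

The two-bit counters give the predictor a little hysteresis: a single surprise outcome (such as a loop’s final exit) won’t immediately flip a strongly taken branch to not-taken.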
The Pentium uses both static and dynamic branch prediction techniques
to prevent mispredictions and branch delays. If a branch instruction does
not have an entry in the BHT, the Pentium uses static prediction to decide
which path to take. If the instruction does have a BHT entry, dynamic pre-
diction is used. The Pentium’s BHT holds 256 entries, which means that it
does not have enough space to store information on most of the branches in
an average program. Nonetheless, the BHT allows the Pentium to predict branches with a much higher success rate (75 to 85 percent, according to Intel) than if it used static prediction alone. The Pentium also uses a
BTB to store predicted branch targets. In most of Intel’s literature and diagrams, the BTB and BHT are combined under the label front-end BTB and are considered a single structure.
The Pentium’s Back End
The Pentium’s superscalar back end is fairly straightforward. It has two
five-stage integer pipelines, which Intel has designated U and V, and one
six-stage floating-point pipeline. This section will take a closer look at each
of these ALUs.
The Integer ALUs
The Pentium’s U and V integer pipes are not fully symmetric. U, as the
default pipe, is slightly more capable and contains a shifter, which V lacks.
For this reason, in Figure 5-1, I’ve labeled the U pipe complex integer unit (CIU) and the V pipe simple integer unit (SIU). Most of the designs that we’ll study throughout this book have asymmetrical integer units, where one integer
unit is more complex and capable of handling more types of instructions
than the other, simpler unit.
The Pentium’s integer units aren’t fully independent. There is a set
of restrictions, which I won’t take time to outline, that places limits on
which combinations of integer instructions can be issued in parallel. All
told, though, the Pentium’s two integer units initially provided solid enough
integer performance for it to be competitive in its day, especially for integer-