Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
The Intel Pentium and Pentium Pro
Because the integer pipeline is the shortest, it’s normally taken to be
the default pipeline in discussions of a processor’s microarchitecture. So
when you see a reference, either in this text or in another text, to a superscalar processor’s pipeline or basic pipeline, you should assume it’s referring to the processor’s integer pipeline.
The Pentium’s basic integer pipeline is five stages long, with the stages
broken down as follows:
1. Prefetch/Fetch: Instructions are fetched from the instruction cache and aligned in prefetch buffers for decoding.
2. Decode-1: Instructions are decoded into the Pentium’s internal instruction format using a fast set of hardware-based rules. Branch prediction also takes place at this stage.
3. Decode-2: Instructions that require the microcode ROM are decoded here. Also, address computations take place at this stage.
4. Execute: The integer hardware ALU executes the instruction.
5. Write-back: The results of the computation are written back to the register file.
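The overlap that these five stages make possible can be sketched with a toy timing model. This is a minimal sketch assuming an ideal pipeline (one instruction issued per cycle, no stalls or bubbles); the function and its name are illustrative, not part of any real tool:

```python
# Toy model of an ideal five-stage pipeline like the Pentium's
# integer pipeline. Assumes one instruction enters per cycle and
# nothing ever stalls.
STAGES = ["Prefetch/Fetch", "Decode-1", "Decode-2", "Execute", "Write-back"]

def completion_cycle(instr_index: int) -> int:
    """Cycle (1-based) in which instruction number `instr_index`
    (0-based) leaves the write-back stage."""
    return len(STAGES) + instr_index

# The first instruction takes five cycles to fill the pipeline;
# after that, one instruction completes every cycle.
print(completion_cycle(0))  # 5
print(completion_cycle(9))  # 14 -> ten instructions done in 14 cycles
```

The point of the model is the second print: once the pipeline is full, throughput approaches one instruction per cycle regardless of pipeline depth.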
These stages should be familiar to you, although the first three stages are
slightly different from those of the simple four-stage pipelines described so
far. Let’s take a quick trip through the Pentium’s pipeline stages, so that you
can examine each one in a bit more detail.
The prefetch/fetch stage corresponds to the fetch phase of the standard instruction lifecycle. Unlike the simple, uniformly sized two-byte instructions of our example DLW architecture, x86 instructions can range from 1 to 17 bytes in length, though the average instruction length is a little under 3 bytes.
x86’s widely variable instruction length complicates the Pentium’s fetch stage, because instructions cannot simply be fetched and then fed directly into the decode stage. Instead, x86 instructions are first fetched into a buffer, where each instruction’s boundaries are detected and marked. From this buffer, the marked instructions are then aligned and sent to the Pentium’s decode hardware.
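The boundary-marking step can be illustrated with a toy scanner. The one-byte opcode-to-length table below is invented for the sketch; real x86 length decoding must also parse prefixes, ModRM/SIB bytes, and immediates:

```python
# Toy illustration of marking variable-length instruction boundaries
# in a fetch buffer. The opcode->length table is a tiny invented
# subset, not a real x86 length decoder.
LENGTHS = {0x90: 1, 0xB8: 5, 0xC3: 1}  # nop, mov eax imm32, ret

def mark_boundaries(buffer: bytes) -> list[int]:
    """Return the starting offset of each instruction in the buffer."""
    starts, pos = [], 0
    while pos < len(buffer):
        starts.append(pos)          # mark where this instruction begins
        pos += LENGTHS[buffer[pos]] # step over its full length
    return starts

# nop; mov eax, 1; ret
code = bytes([0x90, 0xB8, 1, 0, 0, 0, 0xC3])
print(mark_boundaries(code))  # [0, 1, 6]
```

Only after this scan does the fetch hardware know where one instruction ends and the next begins, which is exactly the work a fixed-length RISC ISA avoids.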
The decode phase of the Pentium’s execution process is split into two
pipeline stages, the first of which corresponds most closely to the decode pipeline stage with which you’re already familiar. The decode-1 stage takes the newly fetched instruction and decodes it into the Pentium’s internal instruction format, so that it can be used to tell the processor’s execution units how to manipulate the data stream. The decode-1 stage also involves the Pentium’s branch unit, which checks the currently decoding instruction to see whether it’s a branch and, if so, determines its type. It’s at this point that branch prediction takes place, but we’ll cover branch prediction in more detail in a moment.
The main difference between the Pentium’s five-stage pipeline and the
four-stage pipelines discussed in Chapter 3 lies in the second decode stage.
RISC ISAs, like the primitive DLW ISA of Chapters 1 and 2, support only a
few simple, load-store addressing modes. In contrast, the x86 ISA supports
multiple complex addressing modes, which were originally designed to make
assembly language programmers’ lives easier but ended up making everyone’s
lives more difficult. These addressing modes require extra address computations, and these computations are relegated to the decode-2 stage, where dedicated address computation hardware handles them before dispatching
the instruction to the execution units.
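The arithmetic that the decode-2 stage’s address hardware performs can be sketched in a few lines. The base + index × scale + displacement form is the real x86 addressing calculation; the function itself is just an illustration:

```python
def effective_address(base: int, index: int, scale: int, disp: int) -> int:
    """Compute an x86-style effective address:
    base + index * scale + displacement.
    In real x86 encodings, scale must be 1, 2, 4, or 8."""
    assert scale in (1, 2, 4, 8)
    return base + index * scale + disp

# e.g. the operand [ebx + esi*4 + 16] with ebx = 0x1000, esi = 3
print(hex(effective_address(0x1000, 3, 4, 16)))  # 0x101c
```

On a RISC design, a multi-term calculation like this would be the programmer’s (or compiler’s) job, spread across simple instructions; on the Pentium it is folded into a single instruction and handled by dedicated hardware in decode-2.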
The decode-2 stage is also where the Pentium’s microcode ROM kicks in.
The Pentium decodes many x86 instructions directly in its decoding hardware, but the longer instructions are decoded by means of a microcode ROM, as described in the section “A Brief History of the ISA” on page 71.
Once instructions have been decoded, the Pentium’s control unit deter-
mines when they can be dispatched to the back end and to which execution
unit. So the control unit’s job is to coordinate the movement of instructions
from the processor’s front end to its back end, so that they can enter the
execute and write-back phases of the execution process.
The last two pipeline stages—execute and write-back—should be familiar
to you by now. The next major section will describe the Pentium’s execution
units, so think of it as a more detailed discussion of the execute stage.
The Branch Unit and Branch Prediction
Before we take a closer look at the back end of the Pentium, let’s look at one
aspect of the Pentium’s front end in a bit more detail: the branch unit (BU).
On the right side of the Pentium’s front end, shown earlier in Figure 5-1,
notice a branch unit attached to the instruction fetch and decode/dispatch
pipeline stages. I’ve depicted the branch unit as part of the front end of the
machine—even though it technically still counts as a memory access unit—because the BU works closely with the instruction fetcher, steering it by means of the program counter to different sections of the code stream.
The branch unit contains the branch execution unit (BEU) and the branch prediction unit (BPU), and whenever the front end’s decoder encounters a conditional branch instruction, it sends it to the BU to be executed. The
BU in turn usually needs to send it off to one of the other execution units
to have the instruction’s branch condition evaluated, so that the BU can
determine if the branch is taken or not taken. Once the BU determines that the branch has been taken, it has to get the starting address of the next block
of code to be executed. This address, the branch target, must be calculated, and the front end must be told to begin fetching code at the new address.
In older processors, the entire processor just sort of sat idle and waited for
the branch condition to be evaluated, a wait that could be quite long if the
evaluation involved a complex calculation of some sort. Modern processors
use a technique called speculative execution, which involves making an educated guess at which direction the branch will ultimately take and then beginning execution at the new branch target before the branch’s condition is actually evaluated. This educated guess is made using one of a variety of branch prediction techniques, about which I’ll talk more in a moment. Speculative
execution is used to keep the delays associated with evaluating branches
from introducing bubbles into the pipeline.
Instructions that are speculatively executed cannot write their results
back to the register file until the branch condition is evaluated. If the BPU
predicted the branch correctly, those speculative instructions can then be
marked as non-speculative and have their results written back just like
regular instructions.
Branch prediction can backfire when the processor incorrectly predicts a
branch. Such mispredictions are bad, because if all of those instructions that
the processor has loaded into the pipeline and begun speculatively executing
turn out to be from the wrong branch target, the pipeline must be flushed
of the erroneous, speculative instructions and their attendant results. After
this pipeline flush, the front end must then fetch the correct branch target
address so that the processor can begin executing at the right place in the
code stream.
As you learned in Chapter 3, flushing the pipeline of instructions and
then refilling it again causes a huge hit to the processor’s average completion
rate and overall performance. Furthermore, there’s a delay (and therefore a few cycles’ worth of pipeline bubbles) associated with calculating the correct branch target and loading the new instruction stream into the front end.
This delay can degrade performance significantly, especially on branch-
intensive code. It is therefore imperative that a processor’s branch prediction
hardware be as accurate as possible.
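A back-of-the-envelope cost model shows why accuracy matters so much. The numbers below (branch frequency, flush penalty) are assumptions chosen for illustration, not Pentium specifics:

```python
def avg_branch_penalty(accuracy: float, branch_freq: float,
                       flush_cycles: int) -> float:
    """Average extra cycles per instruction lost to mispredictions,
    assuming every mispredicted branch costs `flush_cycles` bubbles.
    All parameters here are assumed example values."""
    return branch_freq * (1.0 - accuracy) * flush_cycles

# Assumed workload: 20% of instructions are branches, 4-cycle flush.
print(round(avg_branch_penalty(0.80, 0.20, 4), 2))  # 0.16 extra CPI
print(round(avg_branch_penalty(0.95, 0.20, 4), 2))  # 0.04 extra CPI
```

Under these assumed numbers, raising prediction accuracy from 80 to 95 percent cuts the misprediction overhead by a factor of four, which is why designers keep spending transistors on better predictors.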
There are two main types of branch prediction: static prediction and
dynamic prediction.
Static branch prediction is simple and relies on the assumption that the majority of backward-pointing branches occur in the context of repetitive loops, where the branch instruction is used to determine whether or not to repeat the loop again. Most of the time, a loop’s exit condition will be false, which means that the loop’s branch will evaluate to taken, thereby instructing the machine to repeat the loop’s code one more time. This being the case, static branch prediction merely assumes that all backward branches are taken. For a branch that points forward to a block of code that comes later in the program, the static predictor assumes that the branch is not taken.
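The backward-taken/forward-not-taken rule is simple enough to state in a couple of lines; this sketch is generic, not the Pentium’s exact circuitry:

```python
def static_predict(branch_pc: int, target_pc: int) -> bool:
    """Static 'backward taken, forward not taken' prediction:
    a branch whose target is at a lower address (typically a loop's
    back edge) is predicted taken; a forward branch is predicted
    not taken."""
    return target_pc < branch_pc

print(static_predict(0x400, 0x3F0))  # True  (backward branch: a loop)
print(static_predict(0x400, 0x420))  # False (forward branch: skip ahead)
```

Note that the rule needs nothing but the two addresses, which is exactly why static prediction is so fast: no tables, no history, no training.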
Static prediction is very fast because it doesn’t involve any table lookups
or calculations, but its success rate varies widely with the program’s instruction mix. If the program is full of loops, static prediction works fairly well; if it’s not, static prediction performs quite poorly.
To get around the problems associated with static prediction, computer
architects use a variety of algorithms for predicting branches based on a
program’s past behavior. These dynamic branch prediction algorithms usually involve the use of two types of tables—the branch history table (BHT) and the branch target buffer (BTB)—to record information about the outcomes of branches that have already been executed. The BHT creates an entry for each conditional branch that the BU has encountered in the last few cycles.
This entry includes some bits that indicate the likelihood that the branch
will be taken based on its past history. When the front end encounters a
branch instruction that has an entry in its BHT, the branch predictor uses
this branch history information to decide whether or not to speculatively
execute the branch.
Should the branch predictor decide to execute the branch speculatively,
it needs to know exactly where in memory the branch is pointing—in other
words, it needs a branch target. The BTB stores the branch targets of pre-
viously executed branches, so when a branch is taken, the BPU grabs the
speculative branch target from the BTB and tells the front end to begin
fetching instructions from that address. Hopefully, the BTB contains an
entry for the branch you’re trying to execute, and hopefully that entry is
correct. If the branch target either isn’t there or is wrong, you’ve got a
problem. I won’t get into the issues surrounding BTB performance, but
suffice it to say that a larger BTB is usually better, because it can store more branch targets and thus lower the chances of a BTB miss.
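A common way to implement the BHT’s “some bits” is a two-bit saturating counter per branch, paired with a BTB that caches targets. The sketch below (table size, indexing, the class itself) is illustrative and not the Pentium’s actual organization:

```python
class DynamicPredictor:
    """Toy dynamic predictor: a BHT of two-bit saturating counters
    (0-1 predict not taken, 2-3 predict taken) plus a BTB mapping
    branch addresses to their last-seen targets. Sizes and indexing
    are invented for illustration."""
    def __init__(self, entries: int = 256):
        self.entries = entries
        self.bht = [1] * entries        # start weakly not-taken
        self.btb: dict[int, int] = {}   # branch PC -> predicted target

    def predict(self, pc: int):
        taken = self.bht[pc % self.entries] >= 2
        return taken, self.btb.get(pc)  # target is None on a BTB miss

    def update(self, pc: int, taken: bool, target: int):
        i = pc % self.entries
        if taken:
            self.bht[i] = min(3, self.bht[i] + 1)  # saturate upward
            self.btb[pc] = target                  # remember the target
        else:
            self.bht[i] = max(0, self.bht[i] - 1)  # saturate downward

bp = DynamicPredictor()
for _ in range(3):              # train on a loop branch at 0x40, taken
    bp.update(0x40, True, 0x10)
print(bp.predict(0x40))  # (True, 16) -> predict taken, target 0x10
```

The two-bit counters give the predictor a little hysteresis: a single surprise outcome (such as a loop’s final exit) won’t immediately flip a strongly taken branch to not-taken.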
The Pentium uses both static and dynamic branch prediction techniques
to prevent mispredictions and branch delays. If a branch instruction does
not have an entry in the BHT, the Pentium uses static prediction to decide
which path to take. If the instruction does have a BHT entry, dynamic pre-
diction is used. The Pentium’s BHT holds 256 entries, which means that it
does not have enough space to store information on most of the branches in
an average program. Nonetheless, the BHT allows the Pentium to predict branches with a much higher success rate (75 to 85 percent, according to Intel) than if it used static prediction alone. The Pentium also uses a
BTB to store predicted branch targets. In most of Intel’s literature and diagrams, the BTB and BHT are combined under the label front-end BTB and are considered a single structure.
The Pentium’s Back End
The Pentium’s superscalar back end is fairly straightforward. It has two
five-stage integer pipelines, which Intel has designated U and V, and one
six-stage floating-point pipeline. This section will take a closer look at each
of these ALUs.
The Integer ALUs
The Pentium’s U and V integer pipes are not fully symmetric. U, as the
default pipe, is slightly more capable and contains a shifter, which V lacks.
For this reason, in Figure 5-1, I’ve labeled the U pipe complex integer unit (CIU) and the V pipe simple integer unit (SIU). Most of the designs that we’ll study throughout this book have asymmetrical integer units, where one integer
unit is more complex and capable of handling more types of instructions
than the other, simpler unit.
The Pentium’s integer units aren’t fully independent. There is a set
of restrictions, which I won’t take time to outline, that places limits on
which combinations of integer instructions can be issued in parallel. All
told, though, the Pentium’s two integer units initially provided solid enough
integer performance for it to be competitive in its day, especially for integer-