Authors: jon stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
is permanently altered, that instruction is said to commit. Instructions must
commit in program order if the illusion of sequential execution is to be
maintained. This means that no instruction can commit until all of the
instructions that were originally ahead of it in the code stream have committed.
The requirement that all instructions must commit in their original
program order is what necessitates the second buffer shown in Figure 5-8.
The processor needs a place to collect instructions as they complete the out-
of-order execution phase of their lifecycle, so that they can be put back in
their original order before being sent to the final write stage, where they’re
committed. Like the issue buffer described earlier, this completion buffer
can take a number of forms. We’ll look at the form that this buffer takes in
the P6 shortly.
I stated previously that an instruction sits in the completion phase’s
buffer, which I’ll call the completion buffer for now, and waits to have its
result written back to the register file. But where does the instruction’s result wait
during this interim period? When an instruction is executed out of order, its
result goes into a special rename register that has been allocated especially
for use by that instruction. Note that this rename register is part of the
processor’s internal bookkeeping apparatus, which means it is not a part of
the programming model and is therefore not visible to the programmer. The
result waits in this hidden rename register until the instruction commits, at
which time the result is written from the rename register into the programmer-
visible architectural register file. After the instruction’s result is committed, the rename register then goes back into the pool of available rename
registers, where it can be assigned to another instruction on a later cycle.
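The bookkeeping described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the P6's actual logic: the class name `CompletionBuffer`, its methods, and the dictionary-based register file are all hypothetical names introduced for illustration. It shows results parking in hidden rename registers, committing strictly from the oldest instruction, and returning each rename register to the free pool.

```python
from collections import deque


class CompletionBuffer:
    """Toy model: results finish in any order but commit in program order."""

    def __init__(self, num_rename_regs):
        self.free_rename = deque(range(num_rename_regs))  # pool of hidden registers
        self.rename_value = {}  # rename register -> result waiting to commit
        self.entries = deque()  # in-flight instructions, oldest first
        self.arch_regs = {}     # programmer-visible register file

    def dispatch(self, dest):
        """Enter an instruction in program order and allocate its rename register."""
        if not self.free_rename:
            raise RuntimeError("no free rename register; dispatch must stall")
        entry = {"dest": dest, "rename": self.free_rename.popleft(), "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Execution finished, possibly out of order: park the result."""
        self.rename_value[entry["rename"]] = value
        entry["done"] = True

    def commit(self):
        """Commit from the oldest entry only; stop at the first unfinished one."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.arch_regs[entry["dest"]] = self.rename_value.pop(entry["rename"])
            self.free_rename.append(entry["rename"])  # register returns to the pool
            committed.append(entry["dest"])
        return committed


if __name__ == "__main__":
    buf = CompletionBuffer(num_rename_regs=2)
    older = buf.dispatch("A")
    younger = buf.dispatch("B")
    buf.complete(younger, 7)   # the younger instruction finishes first
    print(buf.commit())        # nothing commits while the older one is pending
    buf.complete(older, 3)
    print(buf.commit())        # both commit, oldest first
```

In this sketch a finished younger instruction simply waits behind an unfinished older one, which is the property that preserves the illusion of sequential execution.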
The P6’s Issue Phase: The Reservation Station
The P6 microarchitecture feeds each newly decoded instruction into a buffer
called the reservation station (RS), where it waits until all of its execution
requirements are met. Once they’ve been met, the instruction then moves out of the
reservation station into an execution unit (i.e., it is issued), where it executes.
Chapter 5
A glance at the P6 diagram (Figure 5-6) shows that up to three instruc-
tions per cycle can be dispatched from the decoders into the reservation
station. And as you’ll see shortly, up to five instructions per cycle can be
issued from the reservation station into the execution units. Thus the
Pentium’s original superscalar design, in which two instructions per cycle
could dispatch from the decoders directly into the back end, has been
replaced with a buffered design in which three instructions can dispatch
into the buffer and five instructions can issue out of it on any given cycle.
This buffering action, and the decoupling of the front end’s fetch/
decode bandwidth from the back end’s execution bandwidth that it enables,
are at the heart of the P6’s performance gains.
The P6’s Completion Phase: The Reorder Buffer
Because the P6 microarchitecture must commit its instructions in order, it
needs a place to keep track of the original program order of each instruction
that enters the reservation station. Therefore, after the instructions are
decoded, they must travel through the reorder buffer (ROB) before flowing into
the reservation station. The ROB is like a large logbook in which the P6 can
record all the essential information about each instruction that enters the
out-of-order back end. The primary function of the ROB is to ensure that
instructions come out the other side of the out-of-order back end in the same
order in which they entered it. In other words, it’s the reservation station’s
job to see that instructions are executed in the optimal order, even if
that means executing them out of program order. It’s the reorder buffer’s
job to ensure that the finished instructions get put back in program order
and that their results are written to the architectural register file in the
proper sequence. To this end, the ROB stores data about each instruction’s
status, operands, register needs, original place in the program, and so on.
So newly decoded instructions flow into the ROB, where their relevant
information is logged in one of 40 available entries. From there, they pass on
to the reservation station, and then on to the back end. Once they’re done
executing, they wait in the ROB until they’re ready to be committed.
The role I’ve just described for the reorder buffer should be familiar to
you at this point. The reorder buffer corresponds to the structure that I
called the completion buffer earlier, but with a few extra duties assigned to it.
If you look at my diagram of the P6 microarchitecture, you’ll notice that
the reorder buffer is depicted in two spots: the front end and the commit unit.
This is because the ROB is active in both of these phases of the instruction’s
lifecycle. The ROB is tasked with tracking instructions as they move through
the phases of their lifecycle and with putting the instructions back in program
order at the end of their lifecycle. So newly decoded instructions must be
given a tracking entry in the ROB and have a temporary rename register
allocated for their private use. Similarly, newly executed instructions must
wait in the ROB before they can commit by having the contents of the
temporary rename register that holds their result permanently written to
the architectural register file.
The Intel Pentium and Pentium Pro
As implied in the previous sentence, not only does the P6’s ROB act as a
completion buffer and an instruction tracker, but it also handles register
renaming. Each of the P6 microarchitecture’s 40 ROB entries has a data field
that holds program data just like an x86 register. These fields give the P6’s
back end 40 microarchitectural rename registers to work with, and they’re
used in combination with the P6’s register allocation table (RAT) to implement
register renaming in the P6 microarchitecture.
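To make the interplay between the ROB's data fields and the RAT more concrete, here is a minimal sketch. The text does not describe the RAT's internal format, so everything here is an assumption made for illustration: the class `Renamer`, its method names, and the idea that the RAT is a simple map from an architectural register name to the ROB entry holding that register's newest value.

```python
class Renamer:
    """Toy register renamer: ROB entries double as rename registers."""

    def __init__(self, rob_size=40):
        self.rob_size = rob_size
        self.next_entry = 0  # next ROB slot to allocate (circular)
        self.in_flight = 0   # ROB entries currently occupied
        self.rat = {}        # architectural register -> ROB entry with newest value

    def rename(self, dest, sources):
        """Allocate a ROB entry for dest and translate the source registers."""
        if self.in_flight == self.rob_size:
            raise RuntimeError("ROB full; the front end must stall")
        # Sources consult the RAT before the destination is remapped, so an
        # instruction that reads and writes the same register sees the old value.
        renamed_sources = [
            ("rob", self.rat[src]) if src in self.rat else ("arch", src)
            for src in sources
        ]
        entry = self.next_entry
        self.next_entry = (self.next_entry + 1) % self.rob_size
        self.in_flight += 1
        self.rat[dest] = entry
        return entry, renamed_sources

    def commit(self, entry, dest):
        """The oldest instruction commits; its ROB entry becomes free."""
        self.in_flight -= 1
        # Drop the mapping only if no younger instruction has remapped dest.
        if self.rat.get(dest) == entry:
            del self.rat[dest]


if __name__ == "__main__":
    r = Renamer()
    print(r.rename("eax", ["eax", "ebx"]))  # both sources come from the register file
    print(r.rename("ecx", ["eax"]))         # eax is now read from a ROB entry
    print(r.rename("eax", ["edx"]))         # a second, independent copy of eax
```

The third instruction in the usage example gets its own ROB entry for `eax`, so it no longer has to wait for the earlier users of `eax`; that removal of false dependencies is the point of renaming.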
The Instruction Window
The reservation station and the reorder buffer together make up the heart
of the P6’s out-of-order back end, and they account for its drastic clock-for-
clock performance advantage over the original Pentium. These two buffers—
the one for reshuffling and optimizing the code stream (the RS) and the
other for unshuffling and reordering the code stream (the ROB)—enable
the P6 processor to dynamically and intelligently adapt its operation to fit the needs of the ever-changing code stream.
A common metaphor for thinking about and talking about the P6’s RS +
ROB combination, or analogous structures on other processors, is that of an
instruction window. The P6’s ROB can track up to 40 instructions in various
stages of execution, and its reservation station can hold and examine up to
20 instructions to determine the optimal time for them to execute. Think of
the reservation station’s 20-instruction buffer as a window that moves along the sequentially ordered code stream; on any given cycle, the P6 is looking through
this window at that visible segment of the code stream and thinking about
how its hardware can optimally execute the 20 or so instructions that it sees
there.
A good analogy for this is the game of Tetris, where a small preview
window shows you the next piece that will come your way while you’re
deciding how best to place the currently falling piece. Thus at any given
moment, you can see a total of two Tetris pieces and think about how those
two should fit with the pieces that have gone before and those that might
come after.
The P6 microarchitecture’s job is a little harder than the average Tetris
player’s, because it must maneuver and optimally place as many as three
falling pieces at a time; hence it needs to be able to see farther ahead into
the future in order to make the best decisions about what to place where and
when. The P6’s wider instruction window allows the processor to look further
ahead in the code stream and to juggle its instructions so that they fit together with the currently available execution resources in the optimal manner.
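The benefit of a wider window can be illustrated with a small scheduling sketch. It is a simplification under stated assumptions: every instruction takes one cycle, the function `schedule` and the six-instruction program are hypothetical, and the window simply covers the oldest not-yet-issued instructions in program order.

```python
def schedule(program, window_size, issue_width):
    """Return the instructions issued in each cycle.

    program is a list of (name, deps) pairs in program order, where deps
    names earlier instructions whose results are needed.
    """
    done = set()            # instructions whose results are available
    waiting = list(program)
    cycles = []
    while waiting:
        window = waiting[:window_size]  # the visible slice of the code stream
        ready = [ins for ins in window if all(d in done for d in ins[1])]
        ready = ready[:issue_width]     # oldest ready instructions go first
        if not ready:
            raise ValueError("a dependency refers to a later or unknown instruction")
        for ins in ready:
            waiting.remove(ins)
        done.update(name for name, _ in ready)
        cycles.append([name for name, _ in ready])
    return cycles


# Two independent dependency chains, written one after the other.
PROGRAM = [
    ("a", ()), ("b", ("a",)), ("c", ("b",)),
    ("d", ()), ("e", ("d",)), ("f", ("e",)),
]

if __name__ == "__main__":
    for size in (1, 2, 4):
        print(size, schedule(PROGRAM, window_size=size, issue_width=3))
```

With a one-instruction window the processor in this model is forced to execute in program order. A four-instruction window lets it see the start of the second chain and interleave the two, which halves the cycle count for this particular program.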
The P6 Pipeline
The P6 has a 12-stage pipeline that’s considerably longer than the Pentium’s
five-stage pipeline. I won’t enumerate and describe all 12 stages individually,
but I will give a general overview of the phases that the P6’s pipeline passes
through.
BTB access and instruction fetch
The first three-and-a-half pipeline stages are dedicated to accessing the
branch target buffer and fetching the next instruction. The P6’s two-
cycle instruction fetch phase is longer than the Pentium’s one-cycle fetch
phase, but it keeps the L1 cache access latency from holding back the
clock speed of the processor as a whole.
Decode
The next two-and-a-half stages are dedicated to decoding x86 instructions
and breaking them down into the P6’s internal, RISC-like instruction
format. We’ll discuss this instruction set translation, which takes place
in all modern x86 processors and even in some RISC processors, in
more detail shortly.
Register rename
This stage takes care of register renaming and logging instructions in
the ROB.
Write to RS
Writing instructions from the ROB into the RS takes one cycle, and it
occurs in this stage.
Read from RS
At this point, the issue phase of the instruction’s lifecycle is under way.
Instructions can sit in the RS for an unspecified number of cycles before
being read from the RS. Even if they’re read from the RS immediately
after entering it, it takes one cycle to move instructions out of the RS,
through the issue ports, and into the execution units.
Execute
Instruction execution can take one cycle, as in the case of simple
integer instructions, or multiple cycles, as in the case of floating-point
instructions.
Commit
These two final cycles are dedicated to writing the results of the instruc-
tion execution back into the ROB, and then committing the instructions
by writing their results from the ROB into the architectural register file.
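As a quick check, the phase lengths listed above can be tallied to confirm that they add up to the 12-stage figure. The sketch below assumes the single-cycle integer case for the execute phase; the names `P6_PHASES`, `total_stages`, and `stages_ahead_of` are introduced here for illustration only.

```python
# Stage counts as given in the overview above.
P6_PHASES = [
    ("BTB access and instruction fetch", 3.5),
    ("Decode", 2.5),
    ("Register rename", 1),
    ("Write to RS", 1),
    ("Read from RS", 1),
    ("Execute", 1),  # assumption: a simple one-cycle integer instruction
    ("Commit", 2),
]


def total_stages(phases):
    """Sum the stage counts of all phases."""
    return sum(stages for _, stages in phases)


def stages_ahead_of(phases, phase_name):
    """Count the stages an instruction passes through before phase_name."""
    total = 0
    for name, stages in phases:
        if name == phase_name:
            return total
        total += stages
    raise KeyError(phase_name)


if __name__ == "__main__":
    print(total_stages(P6_PHASES))
    print(stages_ahead_of(P6_PHASES, "Execute"))
```

Under that assumption the phases total 12 stages, with nine of them ahead of the execute stage. That agrees with the count of nine stages ahead of execute used in the discussion that follows.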
Lengthening the P6’s pipeline as described in this chapter has two primary
beneficial effects. First, it allows Intel to crank up the processor’s clock speed, since each of the stages is shorter and simpler and can be completed quicker.
The second effect is a little more subtle and less widely appreciated.
The P6’s longer pipeline, when combined with its buffered decoupling
of fetch/decode bandwidth from execution bandwidth, allows the processor
to hide hiccups in the fetch and decode stages. In short, the nine pipeline
stages that lie ahead of the execute stage combine with the RS to form a deep
buffer for instructions. This buffer can hide gaps and hang-ups in the flow
of instructions in much the same way that a large water reservoir can hide
interruptions in the flow of water to a facility.
But on the downside (to continue the water reservoir example), when
one dead animal is spotted floating in the reservoir, the whole thing has to
be flushed. This is sort of the case with the P6 and a branch misprediction.
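The buffering half of the reservoir analogy can be sketched with a small simulation. The numbers are illustrative assumptions, not P6 measurements: a front end that delivers three instructions per cycle but stalls for two cycles, and a back end that sustains two instructions per cycle. The function `idle_cycles` and its parameters are hypothetical.

```python
def idle_cycles(supply, capacity, issue_width):
    """Count the cycles in which the back end has nothing to issue.

    supply[i] is the number of instructions the front end can deliver in
    cycle i; capacity is the number of buffer slots that can carry
    instructions over into later cycles.
    """
    buffered = 0
    idle = 0
    for arriving in supply:
        available = buffered + arriving
        issued = min(available, issue_width)
        if issued == 0:
            idle += 1
        # Whatever does not fit in the buffer is front-end bandwidth that
        # goes unused (the front end stalls instead of delivering it).
        buffered = min(available - issued, capacity)
    return idle


# Four productive cycles, a two-cycle fetch hiccup, then two more cycles.
SUPPLY = [3, 3, 3, 3, 0, 0, 3, 3]

if __name__ == "__main__":
    print(idle_cycles(SUPPLY, capacity=0, issue_width=2))   # no buffer
    print(idle_cycles(SUPPLY, capacity=20, issue_width=2))  # deep buffer
```

Without a buffer, the back end in this model sits idle during both stall cycles. With a deep buffer, the surplus that accumulated during the productive cycles covers the gap and the back end never runs dry.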
Branch Prediction on the P6
The P6’s architects devoted considerably more resources to branch prediction
than the Pentium’s designers did, and they managed to boost dynamic branch
prediction accuracy from the Pentium’s approximately 75 percent rate to upwards
of 90 percent. The P6 has a 512-entry BHT + BTB, and it uses four bits to
record branch history information (compared to the Pentium’s two-bit
predictor). The four-bit prediction scheme allows the P6 to store more
of each branch’s history, thereby increasing its ability to correctly predict
branch outcomes.
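To see why more history helps, consider a sketch that compares a plain two-bit saturating counter with a predictor that also keeps the last four outcomes of a branch. The text says only that the P6 records four bits of history; how those bits are used is not stated, so the two-level arrangement below (history bits selecting one of 16 two-bit counters) is an assumption, and the class names are hypothetical.

```python
class TwoBitCounter:
    """Saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.state = 2  # start at "weakly taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


class HistoryPredictor:
    """Assumed two-level scheme: recent outcomes select a two-bit counter."""

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.history = 0  # last history_bits outcomes, newest in bit 0
        self.counters = [TwoBitCounter() for _ in range(1 << history_bits)]

    def predict(self):
        return self.counters[self.history].predict()

    def update(self, taken):
        self.counters[self.history].update(taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def count_correct(predictor, outcomes):
    """Run the predictor over a sequence of outcomes and count the hits."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct


# A branch that alternates between taken and not taken, 100 times in total.
ALTERNATING = [True, False] * 50

if __name__ == "__main__":
    print(count_correct(TwoBitCounter(), ALTERNATING))
    print(count_correct(HistoryPredictor(), ALTERNATING))
```

On this alternating pattern the lone counter is right only half the time, because it never sees the pattern, only the most recent drift. The history-based predictor mispredicts a couple of times while it warms up and then follows the pattern exactly.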
As you learned in Chapter 2, branch prediction gets more important as
pipelines get longer, because a pipeline flush due to a misprediction means
more lost cycles and a longer recovery time for the processor’s instruction
throughput and completion rate.
Consider the case of a conditional branch whose outcome depends on
the result of an integer calculation. On the original Pentium, the calculation
happens in the fourth pipeline stage, and if the branch prediction unit
(BPU) has guessed incorrectly, only three cycles’ worth of work would be lost
in the pipeline flush. On the P6, though, the conditional calculation isn’t
performed until stage 10, which means nine cycles’ worth of work get flushed if
the BPU guesses incorrectly.
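The trade-off between pipeline depth and prediction accuracy can be put into rough numbers. The sketch below uses the stage-minus-one accounting of the Pentium example (a branch resolved in stage 4 loses three cycles) and the approximate accuracy figures quoted earlier, taking 90 percent as the P6's rate. The function names are hypothetical, and the model ignores everything except the flush itself.

```python
def flushed_cycles(resolve_stage):
    """Cycles of work lost when a branch resolved in this stage was mispredicted."""
    return resolve_stage - 1


def average_penalty(accuracy, resolve_stage):
    """Expected cycles lost per conditional branch."""
    return (1 - accuracy) * flushed_cycles(resolve_stage)


if __name__ == "__main__":
    print(average_penalty(0.75, 4))   # Pentium: ~75% accurate, resolves in stage 4
    print(average_penalty(0.90, 10))  # P6: ~90% accurate, resolves in stage 10
    print(average_penalty(0.75, 10))  # P6 pipeline with the Pentium's accuracy
```

By this estimate the Pentium loses about 0.75 cycles per branch and the P6 about 0.9, while a P6-length pipeline with Pentium-level accuracy would lose about 2.25. In other words, the improved predictor is what keeps the longer pipeline's misprediction cost close to the Pentium's.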
When a dynamically scheduled processor executes instructions specula-
tively, those speculative instructions and their results are stored in the ROB
just like non-speculative instructions. However, the ROB entries for the spec-
ulative instructions are marked as speculative and prevented from committing