Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
[Figure 7-4 diagram: the decode/dispatch unit feeds three issue queues (vector, floating-point, and general); reservation stations sit in front of the execution units — the vector ALUs (VPU, VSIU, VCIU, VFPU), the scalar FPU, the integer units, and the load-store unit — and the back end terminates in the completion queue and the write/commit unit.]
Figure 7-4: The basic microarchitecture of the G4e
144
Chapter 7
Before instructions can enter the G4e's pipeline, they have to be available in its 32KB instruction cache. This instruction cache, together with the 32KB data cache, makes up the G4e's 64KB L1 cache. An instruction leaves the L1 and goes down through the various front-end stages until it hits the back end, at which point it's executed by one of the G4e's eight execution units (not counting the branch execution unit, which we'll talk about in a second).
As I've already noted, the G4e breaks down the G4's classic four-stage pipeline into seven shorter stages:
G4                      G4e
1  Fetch                1  Fetch-1
                        2  Fetch-2
2  Decode/dispatch      3  Decode/dispatch
                        4  Issue
3  Execute              5  Execute
                        6  Complete
4  Write-back           7  Write-back (Commit)
Notice that the G4e dedicates one pipeline stage each to the characteristic issue and complete phases that bracket the out-of-order execution phase of a dynamically scheduled instruction's lifecycle.
Let’s take a quick look at the basic pipeline stages of the G4e, because
this will highlight some of the ways in which the G4e differs from the original
G4. Also, an understanding of the G4e’s more classic RISC pipeline will
provide you with a good foundation for our upcoming discussion of the
Pentium 4’s much longer, more peculiar pipeline.
Stages 1 and 2: Instruction Fetch
These two stages are both dedicated primarily to grabbing an instruction
from the L1 cache. Like its predecessor, the G4, the G4e can fetch up to four
instructions per clock cycle from the L1 cache and send them on to the next
stage. Ideally, the needed instructions are in the L1 cache; if they aren't, the G4e has to go to the much slower L2 cache to find them, which can add up to nine cycles of delay to the instruction pipeline.
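That L2 penalty can be folded into a rough expected-latency estimate. The sketch below is purely illustrative: the 95 percent hit rate is a made-up assumption, and only the nine-cycle L2 penalty comes from the text.

```python
# Hypothetical average fetch latency, assuming an L1 hit costs 1 cycle
# and an L1 miss that hits in the L2 adds up to 9 cycles of delay.
def expected_fetch_latency(l1_hit_rate, l1_cycles=1, l2_penalty=9):
    """Weighted average of the L1-hit and L1-miss cases."""
    return l1_hit_rate * l1_cycles + (1 - l1_hit_rate) * (l1_cycles + l2_penalty)

# With a (made-up) 95% L1 hit rate, the average fetch costs 1.45 cycles.
print(round(expected_fetch_latency(0.95), 3))  # → 1.45
```

Even a small miss rate noticeably inflates the average, which is why keeping the working set inside the L1 matters so much for a pipelined fetch engine.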
Stage 3: Decode/Dispatch
Once an instruction has been fetched, it goes into the G4e's 12-entry instruction queue to be decoded. Once instructions are decoded, they're dispatched at a rate of up to three non-branch instructions per cycle to the proper issue queue.
Note that the G4e's dispatch logic dispatches instructions to the issue queues in accordance with "The Four Rules of Instruction Dispatch" on page 127. The only modification to the rules is in the issue buffer rule; instead of requiring that the proper execution unit and reservation station be available before an instruction can be dispatched, the G4e requires that there be space in one of the three issue queues.
Stage 4: Issue
The issue stage is the place where the G4e differs the most from the G4.
Specifically, the presence of the G4e’s three issue queues endows it with
power and flexibility that the G4 lacks.
As you learned in Chapter 6, instructions can stall in the original G4’s
dispatch stage if there is no execution unit available to take them. The G4e
eliminates this potential dispatch stall condition by placing a set of buffers,
called issue queues, in between the dispatch stage and the reservation
stations. On the G4e, it doesn’t matter if the execution units are busy and
their reservation stations are full; an instruction can still dispatch to the
back end if there is space in the proper issue queue.
The six-entry general issue queue (GIQ) feeds the integer ALUs and can accept up to three instructions per cycle from the dispatch unit. It can also issue up to three instructions per cycle, out of order, from its bottommost three entries to any of the G4e's three integer units or to its LSU.
The four-entry vector issue queue (VIQ) can accept up to two instructions per cycle from the dispatch unit, and it can issue up to two instructions per cycle from its bottommost two entries to any two of the four vector execution units. But note that unlike the GIQ, instructions must issue in order from the bottom of the VIQ.
Finally, the single-entry floating-point issue queue (FIQ) can accept one instruction per cycle from the dispatch unit, and it can issue one instruction per cycle to the FPU.
With the help of the issue queues, the G4e's dispatcher can keep dispatching instructions and clearing the instruction queue, even if the execution units and their attached reservation stations are full. Furthermore, the GIQ's out-of-order issue ability allows integer and memory instructions in the code stream to flow around instructions that are stalled in the execute phase, so that a stalled instruction doesn't back up the pipeline and cause pipeline bubbles. For example, if a multicycle integer instruction is stalled in the bottom GIQ entry because the complex integer unit is busy, single-cycle integer instructions and load/store instructions can continue to issue to the simple integer units and the LSU from the two slots behind the stalled instruction.
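The GIQ behavior just described can be sketched as a toy model. Nothing here comes from Motorola's documentation: the instruction names are merely PowerPC-flavored, and the unit mapping and ready-check are invented for illustration. Only the rule itself — issue up to three per cycle, out of order, but from the bottommost three entries only — follows the text.

```python
# Toy model of the G4e's general issue queue (GIQ): instructions may
# issue out of order, but only from the bottommost three entries.
def giq_issue(queue, unit_busy):
    """queue: list of (name, unit) tuples, index 0 = bottommost entry.
    unit_busy: set of execution-unit names that can't accept work.
    Returns (instructions issued this cycle, the remaining queue)."""
    issued = []
    for name, unit in queue[:3]:              # only the bottom three slots
        if unit not in unit_busy and len(issued) < 3:
            issued.append(name)
    # Issued instructions leave the queue; stalled ones stay put.
    return issued, [entry for entry in queue if entry[0] not in issued]

# A multicycle instruction stalled at the bottom (complex unit busy)
# doesn't block the single-cycle and load/store instructions behind it.
issued, remaining = giq_issue(
    [("mulhw", "CIU"), ("addi", "SIU"), ("lwz", "LSU"), ("subf", "SIU")],
    unit_busy={"CIU"})
print(issued)  # → ['addi', 'lwz']
```

Swapping `queue[:3]` for just `queue[:1]` would turn this into the VIQ-style in-order bottom-of-queue policy described above.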
Stage 5: Execute
The execute stage is pretty straightforward. Here, the instructions pass from the reservation stations into the execution units to be executed. Floating-point instructions move into the floating-point execution unit, vector instructions move into one of the four AltiVec units, integer instructions move into one of the G4e's four integer execution units, and memory accesses move into the LSU. We'll talk about these units in a bit more detail when we discuss the G4e's back end.
Stages 6 and 7: Complete and Write-Back
In these two stages, the instructions enter the completion queue to be put
back into program order, and their results are written back to the register
file. It’s important that the instructions are rearranged to reflect their
original ordering so that the illusion of in-order execution is maintained.
To the user, the program must appear to have executed its instructions one after the other, exactly as they were written.
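The completion queue's reordering job can be sketched in a few lines. The instruction IDs below are invented; the point is only that results commit strictly in program order, no matter the order in which they finished executing.

```python
# Toy completion queue: instructions may finish out of order, but their
# results are committed (written back) strictly in program order.
def commit_in_order(program_order, finished):
    """Retire from the head of program order, stopping at the first
    instruction whose result hasn't arrived yet."""
    committed = []
    for insn in program_order:
        if insn not in finished:
            break          # head not done: nothing younger may commit
        committed.append(insn)
    return committed

# i4 finished early, but it can't commit past the unfinished i3.
print(commit_in_order(["i1", "i2", "i3", "i4"],
                      finished={"i1", "i2", "i4"}))  # → ['i1', 'i2']
```

This head-of-queue stall is exactly what preserves the illusion of in-order execution: a finished result simply waits in the queue until every older instruction has committed.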
Branch Prediction on the G4e and Pentium 4
The G4e and the Pentium 4 each use both static and dynamic branch prediction techniques to prevent mispredictions and branch delays. If a branch instruction does not have an entry in the BHT, both processors will use static prediction to decide which path to take. If the instruction does have a BHT entry, dynamic prediction is used. The Pentium 4's BHT is quite large; at 4,000 entries, it has enough space to store information on most of the branches in an average program.
The earlier PIII's branch predictor had a success rate of around 91 percent, and the Pentium 4 allegedly uses an even more advanced algorithm to predict branches, so it should perform even better. The Pentium 4 also uses a BTB to store predicted branch targets. Note that in most of Intel's literature and diagrams, the BTB and BHT are combined under the label the front-end BTB.
The G4e has a BHT size of 2,000 entries, up from 512 entries in the
original G4. I don’t have any data on the G4e’s branch prediction success
rate, but I’m sure it’s fairly good. The G4e has a 128-entry BTIC, which is
twice as large as the original G4’s 64-entry BTIC. The G4e’s BTIC stores the
first four instructions in the code stream starting at each branch target, so it goes even further than the original G4 in preventing branch-related pipeline
bubbles.
Because of its long pipeline, the Pentium 4 has a minimum misprediction penalty of 20 clock cycles for code that's in the L1 cache. That's the minimum; the damage can be much worse, especially if the correct branch can't be found in the L1 cache. (In such a scenario, the penalty is upward of 30 cycles.) The G4e's seven-stage pipeline doesn't pay nearly as high a price for a misprediction as the Pentium 4, but it does take more of a hit than its four-stage predecessor, the G4. The G4e has a minimum misprediction penalty of six clock cycles, as opposed to the G4's minimum misprediction penalty of only four cycles.
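Those penalties can be turned into a rough per-branch cost estimate. The 95 percent prediction rate below is an assumption for illustration (only the PIII's roughly 91 percent figure appears in the text); the penalties are the minimums quoted above.

```python
# Rough expected branch cost, in cycles per branch: the minimum
# misprediction penalty weighted by the probability of a mispredict.
def mispredict_cost(min_penalty, predict_rate):
    return min_penalty * (1 - predict_rate)

# Minimum penalties from the text; the 95% rates are made up.
print(round(mispredict_cost(20, 0.95), 2))  # Pentium 4
print(round(mispredict_cost(6, 0.95), 2))   # G4e
```

Even at identical prediction rates, the Pentium 4's deeper pipeline makes each mispredict several times more expensive, which is why it needs a better predictor just to break even.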
In conclusion, both the Pentium 4 and the G4e spend more resources
than their predecessors on branch prediction, because their deeper pipelines
make mispredicted branches a major performance killer.
The Pentium 4 and G4e actually have one more branch prediction trick up their sleeves that's worth at least noting, even though I won't discuss it in any detail. That trick comes in the form of software branch hints: extra information that a compiler or programmer can attach to conditional branch instructions. This information gives the branch predictor clues as to the expected behavior of the branch, that is, whether the compiler or programmer expects it to be taken or not taken. There doesn't seem to be much information available on how big a help these hints are, and Intel, at least, recommends that they be used sparingly, since they can increase code size.
An Overview of the Pentium 4’s Architecture
Even though the Pentium 4’s pipeline is much longer than that of the
G4e, it still performs most of the same functions. Figure 7-5 illustrates the
Pentium 4’s basic architecture so that you can compare it to the picture of
the G4e presented in Figure 7-4. Due to space and complexity constraints,
I haven’t attempted to show each pipeline stage individually like I did with
the G4e. Rather, I’ve grouped the related ones together so you can get a
more general feel for the Pentium 4’s layout and instruction flow.
[Figure 7-5 diagram: the front end contains the instruction fetch and x86 translate/decode stages, the branch unit, the L1 instruction cache (trace cache), and the trace cache fetch (TC) stage; the back end contains the uop queue, the memory and integer/general-FP queues, four schedulers (fast integer, slow integer and general FP, simple FP, memory) feeding ports 0 and 1 plus the load and store ports, the execution units (SIU1, SIU2, CIU, the SIMD/FPU and vector units, and the load-store unit), and the reorder buffer (ROB), write stage, and completion unit.]
Figure 7-5: Basic architecture of the Pentium 4
The first thing to notice about Figure 7-5 is that the L1 instruction cache actually sits after the fetch and decode stages in the Pentium 4's front end. This oddly located instruction cache, called the trace cache, is one of the Pentium 4's most innovative and important features. It also greatly affects the Pentium 4's pipeline and basic instruction flow, so you have to understand it before we can talk about the Pentium 4's pipeline in detail.
Expanding the Instruction Window
Chapter 5 talked about the buffering effect of deeper pipelining on the P6
and how it allows the processor to smooth out gaps and hiccups in the code
stream. The analogy I used was that of a reservoir, which can smooth out
interruptions in the flow of water from a central source.
One of the innovations that makes this reservoir approach effective is the
decoupling of the back end from the front end by means of the reservation