doesn’t know or care if the instructions are coming from the trace cache or
the microcode ROM. All it sees is a constant, uninterrupted stream of
instructions.
Intel hasn’t said exactly how big the trace cache is—only that it holds
12,000 micro-ops. Intel claims this is roughly equivalent to a 16KB to 18KB
I-cache.
By way of finishing up our discussion of the trace cache and introducing
our detailed walk-through of the Pentium 4’s pipeline, I should note two
final aspects of the trace cache’s effect on the pipeline. First, the trace cache still needs a short instruction fetch stage so that micro-ops can be fetched
from it and sent to the allocation and scheduling logic. When we look at the
Pentium 4’s basic execution pipeline, you’ll see this stage. Second, the trace
cache actually has its own little BPU for predicting the directions and return
addresses of branches within the trace cache itself. So the trace cache doesn’t
eliminate branch processing and prediction entirely from the picture; it just
alleviates their effects on performance.
An Overview of the Pentium 4’s Pipeline
Now let’s step back and take a look at the Pentium 4’s basic execution pipeline.
Here’s a breakdown of the various pipeline stages.
Stages 1 and 2: Trace Cache Next Instruction Pointer
In these stages, the Pentium 4’s trace cache fetch logic gets a pointer to the
next instruction in the trace cache.
Stages 3 and 4: Trace Cache Fetch
These two stages fetch an instruction from the trace cache to be sent to the
back end.
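By way of illustration, here is a minimal Python sketch of what these first fetch stages do conceptually. It is not Intel’s actual structure: the addresses, micro-ops, and the predicted_next table (standing in for the trace cache’s own little BPU mentioned earlier) are all invented for the example.

    # Hypothetical model: the trace cache holds traces of already-decoded
    # micro-ops, indexed by the address that starts each trace.
    trace_cache = {
        0x1000: ["load r1", "add r1, r2", "store r2", "jz 0x2000"],
        0x2000: ["sub r3, r1", "jmp 0x1000"],
    }
    # Stand-in for the trace cache's own branch logic: a predicted next trace
    # for each trace (a pure assumption, just to keep the example closed).
    predicted_next = {0x1000: 0x2000, 0x2000: 0x1000}

    def fetch_cycle(ip, offset, width=3):
        """Return up to `width` micro-ops from the current trace, plus the updated pointer."""
        trace = trace_cache[ip]
        uops = trace[offset:offset + width]
        offset += len(uops)
        if offset >= len(trace):            # end of trace: follow the prediction
            ip, offset = predicted_next[ip], 0
        return uops, ip, offset

    ip, offset = 0x1000, 0
    for cycle in range(4):
        uops, ip, offset = fetch_cycle(ip, offset)
        print(f"cycle {cycle}: fetched {uops}")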
Stage 5: Drive
This is the first of two special
drive stages
in the Pentium 4’s pipeline, each of which is dedicated to driving signals from one part of the processor to the
next. The Pentium 4 runs so fast that sometimes a signal can’t make it all the
way to where it needs to be in a single clock pulse, so the processor dedicates
some pipeline stages to letting these signals propagate across the chip. These
drive stages are there because the Pentium 4’s designers intend for the chip
to reach such stratospheric clock speeds that stages like this are absolutely
necessary.
At the end of these first five stages, the Pentium 4’s trace cache sends up to three micro-ops per cycle into a large, FIFO micro-op queue. This in-order queue, which sits in between the Pentium 4’s front end and back end, smoothes out the flow of instructions to the back end by squeezing out any fetch- or decode-related bubbles. Micro-ops enter the top of the queue and fall down to rest at the lowest available entry, directly above the most recent micro-op to enter the queue. Thus, any bubbles that may have been ahead of the micro-ops disappear from the pipeline at this point. The micro-ops leave the bottom of the micro-op queue in program order and proceed to the next pipeline stage.
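To make the bubble-squeezing behavior concrete, here is a small Python sketch of the idea, not of Intel’s hardware: empty fetch slots simply never enter the queue, so whatever drains out the bottom is a dense, in-order stream. The micro-op names and group sizes are invented.

    from collections import deque

    # Hypothetical fetch groups of up to three slots each; None marks a
    # fetch- or decode-related bubble (an empty slot).
    fetch_groups = [
        ["uop0", "uop1", None],
        [None, None, None],              # a whole cycle lost to a front-end stall
        ["uop2", None, "uop3"],
    ]

    micro_op_queue = deque()             # the in-order FIFO between front end and back end

    for group in fetch_groups:
        for slot in group:
            if slot is not None:         # bubbles are simply never enqueued
                micro_op_queue.append(slot)

    # The back end drains dense groups of up to three micro-ops per cycle.
    while micro_op_queue:
        group = [micro_op_queue.popleft() for _ in range(min(3, len(micro_op_queue)))]
        print("to back end:", group)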
Stages 6 Through 8: Allocate and Rename (ROB)
In this group of stages, up to three instructions per cycle move from the
bottom of the micro-op queue and are allocated entries in the Pentium 4’s
ROB and rename registers. With regard to the latter, the x86 ISA specifies only eight GPRs, eight FPRs, and eight VPRs, but the Pentium 4 has 128 of each type of register in its rename register files.
The allocator/renamer also allocates each micro-op an entry in one of the two micro-op queues detailed in the next section and can send up to three micro-ops per cycle into these queues.
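To see what the rename step buys, consider a hedged Python sketch of a rename table: the eight architectural GPR names are just labels that get remapped onto a much larger pool of physical registers, so two back-to-back writes to the “same” x86 register no longer collide. The register names and micro-ops here are purely illustrative.

    # Hypothetical rename sketch: 8 architectural GPR names, 128 physical registers.
    free_physical = list(range(128))     # free list of physical register IDs
    rename_table = {}                    # architectural name -> physical register ID

    def rename(dest_arch_reg):
        """Point an architectural destination register at a fresh physical register."""
        phys = free_physical.pop(0)
        rename_table[dest_arch_reg] = phys
        return phys

    # Two back-to-back micro-ops that both write eax end up targeting different
    # physical registers, so the false (write-after-write) dependency disappears.
    p1 = rename("eax")
    p2 = rename("eax")
    print("the two eax writes go to physical registers", p1, "and", p2)
    print("eax currently maps to physical register", rename_table["eax"])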
Stage 9: Queue
To implement out-of-order execution, the Pentium 4 flows micro-ops from
its trace cache through the ROB and into two deep micro-op queues that sit
between its instructions’ dispatch and execution phases. These two queues are the memory micro-op queue and the arithmetic micro-op queue. The memory micro-op queue holds memory operations (loads and stores) that are destined for the Pentium 4’s LSU, while the arithmetic micro-op queue holds all other types of operations.
The two main micro-op queues are roughly analogous to the G4e’s issue queues, but with one crucial difference: the Pentium 4’s micro-op queues are FIFO queues, while the G4e’s issue queues are not. For the Pentium 4, this means an instruction passes into and out of a micro-op queue in program order with respect to the other instructions in its own queue. However, instructions can still exit the bottom of each queue out of program order with respect to instructions in the other queue.
These two micro-op queues feed micro-ops into the scheduling logic in
the next stage.
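A rough Python sketch of that behavior, with invented micro-op names: each queue is a strict FIFO, but because the two queues drain independently, a younger memory micro-op can leave before an older arithmetic micro-op that is stuck in the other queue.

    from collections import deque

    memory_q = deque()       # loads and stores, headed for the LSU
    arithmetic_q = deque()   # everything else

    # Program order: a0, m0, a1, m1 (a = arithmetic, m = memory).
    program = [("a0", "arith"), ("m0", "mem"), ("a1", "arith"), ("m1", "mem")]
    for name, kind in program:
        (memory_q if kind == "mem" else arithmetic_q).append(name)

    # Suppose the arithmetic side is stalled this cycle but the memory side is not:
    # m0 and m1 leave ahead of the older a0 (out of order across queues), yet m0
    # still leaves before m1, and a0 before a1 (in order within each queue).
    print("memory queue drains:", [memory_q.popleft() for _ in range(len(memory_q))])
    print("arithmetic queue drains:", [arithmetic_q.popleft() for _ in range(len(arithmetic_q))])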
Stages 10 Through 12: Schedule
The micro-op queues described in the preceding section are only part of the
Pentium 4’s dynamic scheduling logic. The other half consists of a set of four
micro-op schedulers whose job it is to schedule micro-ops for execution and
to determine to which execution unit the micro-ops should be passed. Each
of these schedulers consists of a smaller, 8- to 12-entry micro-op queue attached to a bit of scheduling logic. Each scheduler’s logic arbitrates with the other schedulers for access to the Pentium 4’s four issue ports, and it removes micro-ops from its in-order scheduling queue and sends them through the right port at the right time.
An instruction cannot exit from a scheduling queue until its input
operands are available and the appropriate execution unit is available.
When the micro-op is ready to execute, it is removed from the bottom of
its scheduling queue and passed to the proper execution unit through one
of the Pentium 4’s four issue ports, which are analogous to the P6 core’s
five issue ports in that they act as gateways to the back end’s execution
units.
Here’s a breakdown of the four schedulers:
Memory scheduler
Schedules memory operations for the LSU.
Fast IU scheduler
Schedules ALU operations (simple integer and logical instructions) for
the Pentium 4’s two double-speed integer execution units. As you’ll see
in the next chapter, the Pentium 4 contains two integer ALUs that run at
twice the main core’s clock speed.
Slow IU/general FPU scheduler
Schedules the rest of the integer instructions and most of the floating-
point instructions.
Simple FP scheduler
Schedules simple FP instructions and FP memory operations.
These schedulers feed micro-ops through the four dispatch ports
described in the next stage.
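Here is a loose Python sketch of the readiness rule the schedulers enforce; the micro-ops, operand names, and port labels are all assumptions made up for the example. A micro-op leaves its scheduling queue only when its input operands are available and its port’s execution unit is free, and the schedulers compete for ports each cycle.

    # Hypothetical scheduler sketch: each scheduler owns a small in-order queue
    # and a target issue port; only the head of a queue may issue in a cycle.
    ready_operands = {"r1", "r2"}        # operand values that are already available
    port_busy = {"exec0": False, "exec1": False, "load": False, "store": True}

    schedulers = {
        "fast_iu": {"port": "exec0", "queue": [("add r3, r1, r2", {"r1", "r2"})]},
        "memory":  {"port": "store", "queue": [("store r3", {"r3"})]},
    }

    for name, sched in schedulers.items():
        if not sched["queue"]:
            continue
        uop, inputs = sched["queue"][0]
        if inputs <= ready_operands and not port_busy[sched["port"]]:
            sched["queue"].pop(0)
            port_busy[sched["port"]] = True      # claim the port for this cycle
            print(f"{name}: issued '{uop}' through port {sched['port']}")
        else:
            print(f"{name}: '{uop}' waits (operands or port not ready)")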
Stages 13 and 14: Issue
The P6 core’s reservation station sends instructions to the back end via one
of five issue ports. The Pentium 4 uses a similar scheme, but with four issue
ports instead of five. There are two memory ports for memory instructions: the load port and the store port, for loads and stores, respectively. The remaining two ports, called execution ports, are for all the other instructions: execution port 0 and execution port 1. The Pentium 4 can send a total of six micro-ops per cycle through the four issue ports. This issue rate of six micro-ops per cycle is higher than the rate at which the front end can fetch and decode (three micro-ops per cycle) or the back end can complete (three micro-ops per cycle), but that’s okay because it gives the machine some headroom in its middle so that it can have bursts of activity.
You might be wondering how six micro-ops per cycle can move through
four ports. The trick is that the Pentium 4’s two execution ports are double-
speed, meaning that they can dispatch instructions (integer only) on the
rising and falling edges of the clock. But we’ll talk more about this in the
next chapter. For now, here’s a breakdown of the two execution ports and
which execution units are attached to them:
Execution port 0:
Fast integer ALU1
This unit performs integer addition, subtraction,
and logical operations. It also evaluates branch conditionals and exe-
cutes store-data micro-ops, which store data into the outgoing store
buffer. This is the first of two double-speed integer units, which operate
at twice the core clock frequency.
Floating-point/SSE move
This unit performs floating-point and SSE
moves and stores. It also executes the FXCH instruction, which means that
it’s no longer “free” on the Pentium 4.
Execution port 1:
Fast integer ALU2
This very simple integer ALU performs only integer
addition and subtraction. It’s the second of the two double-speed inte-
ger ALUs.
Slow integer ALU
This integer unit handles all of the more time-
consuming integer operations, like shift and rotate, that can’t be
completed in half a clock cycle by the two fast ALUs.
Floating-point/SSE/MMX ALU
This unit handles floating-point and
SSE addition, subtraction, multiplication, and division. It also handles all
MMX instructions.
In Figure 7-5, I’ve labeled the instruction flow paths going into each
execution unit to show which dispatch port instructions must pass through
in order to reach which execution unit.
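As a back-of-the-envelope check on where the peak of six micro-ops per cycle comes from, here is a small Python accounting sketch, not a timing model: each memory port dispatches once per core cycle, while each double-speed execution port can dispatch on both the rising and falling clock edges.

    # Peak dispatch accounting for one core clock cycle (a simplification).
    ports = {
        "load":  1,   # memory port: one dispatch per cycle
        "store": 1,   # memory port: one dispatch per cycle
        "exec0": 2,   # double-speed: dispatches on the rising and falling edges
        "exec1": 2,   # double-speed: dispatches on the rising and falling edges
    }

    peak_per_cycle = sum(ports.values())
    print("peak micro-ops dispatched per core cycle:", peak_per_cycle)   # -> 6

    # The front end delivers, and the back end completes, only three per cycle,
    # so the middle of the machine has headroom for bursts.
    print("front-end rate:", 3, "micro-ops/cycle;", "completion rate:", 3, "micro-ops/cycle")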
Stages 15 and 16: Register Files
These stages are where the execution units, upon receiving the instructions,
read each instruction’s input operands from the appropriate register file.
To return to the discussion from Chapter 1, this step is the read step in the
read-execute-write cycle of computation.
Stage 17: Execute
In this stage, the instructions are actually executed by the back end’s execution units. We’ll take a closer look at the Pentium 4’s back end in the next chapter.
Stage 18: Flags
If an instruction’s outcome stipulates that it needs to set any flags in the PSW, then it does so at this stage.
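For example, an integer subtraction that produces zero needs to set the zero flag so that a later conditional branch can test it. Here is a deliberately simplified Python sketch of that kind of flag computation; it models only a few x86-style flags and ignores many details of the real PSW.

    def alu_sub(a, b, width=32):
        """Subtract b from a and report the flags the result would set (simplified)."""
        mask = (1 << width) - 1
        result = (a - b) & mask
        flags = {
            "ZF": result == 0,                   # zero flag
            "SF": bool(result >> (width - 1)),   # sign flag: top bit of the result
            "CF": b > a,                         # borrow out of an x86-style subtract
        }
        return result, flags

    print(alu_sub(5, 5))   # result 0: ZF set
    print(alu_sub(3, 7))   # wraps negative: SF and CF set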
Stage 19: Branch Check
Here’s the stage where the Pentium 4 checks the outcome of a conditional
branch to see if it has just wasted 19 cycles of its time executing some code
that it’ll have to throw away. By stage 19, the branch condition has been
evaluated, and the front end knows whether the branch predictor’s guess was right.
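A toy Python sketch of that check follows; the 19-cycle figure is just the pipeline depth described above, and everything else (the function, its arguments) is invented for illustration.

    MISPREDICT_PENALTY = 19      # roughly the pipeline stages that must be refilled

    def branch_check(predicted_taken, condition_value):
        """Compare the predictor's earlier guess against the now-known outcome."""
        actually_taken = condition_value != 0
        if predicted_taken == actually_taken:
            return "prediction correct: no penalty"
        # Wrong guess: the speculative work is thrown away and the front end
        # restarts fetching from the correct path.
        return f"misprediction: flush and refill (~{MISPREDICT_PENALTY} cycles lost)"

    print(branch_check(predicted_taken=True, condition_value=0))   # guessed wrong
    print(branch_check(predicted_taken=True, condition_value=1))   # guessed right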
Stage 20: Drive
You’ve already met the drive stage. Again, this stage is dedicated to propa-
gating signals across the chip.
Stages 21 and Onward: Complete and Commit
Although stage 20 is the last stage Intel lists as part of the “normal Pentium 4 pipeline,” for completeness, I’ll include the
write-back phase. Completed instructions file into their pre-assigned entries
in the ROB, where they’re put back in program order before having their
results update the machine’s architectural state.
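Here is a minimal Python sketch of that in-order commit rule, with made-up entries and values: micro-ops may finish executing in any order, but retirement happens only from the head of the ROB, so results reach the architectural registers in program order.

    # Hypothetical reorder buffer: entries were assigned in program order.
    rob = [
        {"uop": "uop0", "done": True,  "dest": "eax", "value": 1},
        {"uop": "uop1", "done": False, "dest": "ebx", "value": None},   # still executing
        {"uop": "uop2", "done": True,  "dest": "ecx", "value": 3},      # finished early
    ]
    architectural_regs = {}

    # Retire only from the head: uop2 must wait for uop1 even though it is done.
    while rob and rob[0]["done"]:
        entry = rob.pop(0)
        architectural_regs[entry["dest"]] = entry["value"]
        print("committed", entry["uop"])

    print("architectural state so far:", architectural_regs)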
As you can see, the Pentium 4’s 20-stage pipeline does much of the same
work in mostly the same order as the G4e’s seven-stage pipeline. By dividing
the pipeline into more stages, though, the Pentium 4 can reach higher clock
rates. As I’ve already noted, this deeply pipelined approach fits with the
Pentium 4’s “narrow and deep” design philosophy.
The Pentium 4’s Instruction Window
Before I discuss the nature of the Pentium 4’s instruction window, note that
I’m using the terms instruction pool and instruction window somewhat interchangeably. These two terms represent two slightly different metaphors for
thinking about the set of queues and buffers positioned between a processor’s
front end and its back end. Instructions collect up in these queues and
buffers—just like water collects in a pool or reservoir—before being drained
away by the processor’s back end. Because the instruction pool represents a
small segment of the code stream, which the processor can examine for
dependencies and reorder for optimal execution, this pool can also be said
to function as a window on the code stream. Now that that’s clear, let’s take a
look at the Pentium 4’s instruction window.
As I explained in the previous chapter, the older P6 core’s RS and ROB
made up the heart of its instruction window. The Pentium 4 likewise has a
ROB for tracking micro-ops, and in fact, its 126-entry ROB is much larger
than that of the P6. The buffer functions of the P6’s reservation station,
however, have been divided among multiple structures. The previous section’s
pipeline description explains how these structures are configured.
This partitioning of the instruction window into memory and arithmetic
portions by means of the two scheduling queues has the effect of ensuring
that both types of instructions will always have space in the window, and that
an overabundance of one instruction type will not crowd the other type out of
the window. The multiple schedulers provide fine-grained control over the
instruction flow, so that it’s optimally reordered for the fast execution units.
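A small Python sketch can illustrate the point; the queue capacities are invented, not the Pentium 4’s real sizes. With one shared window, a burst of memory micro-ops can fill every entry and shut arithmetic micro-ops out entirely, while partitioned queues guarantee the arithmetic side some space.

    # Illustrative capacities only; these are not the Pentium 4's real sizes.
    burst = ["mem"] * 8 + ["arith"] * 2      # a burst of memory work, then some arithmetic

    # One shared 8-entry window: the memory burst fills it completely and the
    # arithmetic micro-ops are shut out until something drains.
    shared = burst[:8]
    print("shared window holds:", shared.count("mem"), "mem,", shared.count("arith"), "arith")

    # Partitioned windows (say, 6 memory entries plus 6 arithmetic entries):
    # the extra memory micro-ops stall, but arithmetic work still gets in.
    mem_q = [u for u in burst if u == "mem"][:6]
    arith_q = [u for u in burst if u == "arith"][:6]
    print("partitioned windows hold:", len(mem_q), "mem,", len(arith_q), "arith")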