Authors: jon stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
the appropriate execution unit’s reservation station so that the instructions
behind it in the instruction queue can move up and be dispatched.
The small size of the 604’s reservation stations compared to similar structures
on the P6 is due to the fact that the 604’s pipeline is relatively short.
Pipeline stalls aren’t quite as devastating for performance on a machine with
a 6-stage pipeline as they are on a machine with a 12-stage pipeline, so the
604 doesn’t need as large an instruction window as its super-pipelined
counterparts.
The Four Rules of Instruction Dispatch
Here are the four most important rules governing instruction dispatch on
the 604:
The in-order dispatch rule
Before an instruction can dispatch, all of the instructions preceding that
instruction must have dispatched. In other words, instructions dispatch
from the instruction queue in program order. It is not until instructions
have arrived at the reservation stations, where they may issue out of order
to the execution units, that the original program order is disrupted.
The issue buffer/execution unit availability rule
Before the dispatch logic can send an instruction to an execution unit’s
reservation station, that reservation station must have an entry available.
If an instruction doesn’t need to go to a reservation station because its
inputs are available at the time of dispatch, the required execution unit
must have a pipeline slot available, and the unit’s reservation station
must be empty (i.e., there are no older instructions waiting to execute)
before the instruction can be sent to the execution unit. (This rule is
modified on the PowerPC 7450, aka the G4e, and we’ll cover the modification
in “The PowerPC 7400 (aka the G4)” on page 133.)
PowerPC Processors: 600 Series, 700 Series, and 7400
The completion buffer availability rule
For an instruction to dispatch, there must be space available in the
completion queue so that a new entry can be created for the instruction.
Remember, the completion queue (or ROB) keeps track of the program
order of each in-flight instruction, so any instruction that enters the out-
of-order back end must be logged in the completion queue first.
The rename register availability rule
There must be enough rename registers available to temporarily store
the results for each register that the instruction will modify.
If an instruction meets the requirements imposed by these rules, and if it
also meets the other, more instruction-specific dispatch rules not listed
here, it can dispatch from the instruction queue to the back end.
All of the PowerPC processors discussed in this chapter that have reservation
stations are subject to (at least) these four dispatch rules, so keep these
rules in mind as we talk about instruction dispatch throughout the rest of this
chapter. Note that all of the processors, including the 604, have additional
rules that govern the dispatch of specific types of instructions, but these
four general dispatch rules are the most important.
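The four rules can be sketched as a single per-cycle check. This is a simplified model, not the 604's actual dispatch logic: the class names, the single shared pool of rename registers, and the `width` parameter are assumptions made for illustration, and the case where an instruction bypasses its reservation station is left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    unit: str            # execution unit the instruction needs, e.g. "IU"
    dest_regs: int = 1   # number of registers the instruction will modify


@dataclass
class BackEnd:
    rs_free: dict        # free reservation-station entries per execution unit
    rob_free: int        # free completion-queue (ROB) entries
    rename_free: int     # free rename registers (one shared pool, an assumption)


def dispatch_cycle(queue, back_end, width):
    """Dispatch up to `width` instructions from the head of `queue`."""
    dispatched = []
    while queue and len(dispatched) < width:
        instr = queue[0]  # rule 1: only the oldest waiting instruction is considered
        if back_end.rs_free.get(instr.unit, 0) < 1:   # rule 2: reservation station entry
            break
        if back_end.rob_free < 1:                     # rule 3: completion buffer entry
            break
        if back_end.rename_free < instr.dest_regs:    # rule 4: rename registers
            break
        back_end.rs_free[instr.unit] -= 1
        back_end.rob_free -= 1
        back_end.rename_free -= instr.dest_regs
        dispatched.append(queue.popleft())
    return dispatched


queue = deque([Instr("add", "IU"), Instr("fmul", "FPU"), Instr("sub", "IU")])
back_end = BackEnd(rs_free={"IU": 2, "FPU": 0}, rob_free=4, rename_free=4)
print([i.name for i in dispatch_cycle(queue, back_end, width=4)])  # ['add']
```

In the example, `sub` has a free integer reservation station entry but still cannot dispatch, because the older `fmul` ahead of it is blocked. That is the in-order dispatch rule at work.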
The Completion Phase: The 604’s Reorder Buffer
As with the P6 microarchitecture, the reservation stations aren’t the only
structures that make up the 604’s instruction window. The 604 has a 16-entry
reorder buffer (ROB) that performs the same function as the P6
microarchitecture’s much larger 40-entry ROB.
The ROB corresponds to the simpler completion queue on older PPC
processors. In the dispatch stage, instructions are not only sent to the back
end’s reservation stations; each dispatched instruction is also allocated an
entry in the ROB and a set of rename registers. In the completion stage, the
instructions are put back in program order so that their results can be
written back to the register file in the subsequent write-back stage. The
completion stage corresponds to what I’ve called the completion phase of an
instruction’s lifecycle, and the write-back stage corresponds to what I’ve
called the commit phase.
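The way a reorder buffer restores program order can be sketched in a few lines. This is a minimal model under stated assumptions: the `ReorderBuffer` class and its method names are hypothetical, and it ignores rename registers and any per-cycle limit on how many instructions may complete.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()   # oldest instruction sits at the left

    def allocate(self, name):
        """Called at dispatch; returns False if the ROB is full."""
        if len(self.entries) >= self.size:
            return False
        self.entries.append({"name": name, "done": False})
        return True

    def mark_done(self, name):
        """Called when an execution unit finishes an instruction."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True
                return
        raise KeyError(name)

    def complete(self):
        """Retire finished instructions from the head, in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer(size=16)  # 16 entries, as on the 604
for name in ["i0", "i1", "i2"]:
    rob.allocate(name)

rob.mark_done("i2")           # the youngest instruction finishes first
print(rob.complete())         # [] because i0 is still outstanding
rob.mark_done("i0")
print(rob.complete())         # ['i0'] because i1 still blocks i2
rob.mark_done("i1")
print(rob.complete())         # ['i1', 'i2']
```

Instructions may finish executing in any order, but they leave the buffer only from the head, so results reach the register file in program order.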
The 604’s ROB is much smaller than the P6’s ROB for the same reason
that the 604’s reservation stations are fewer: the 604 has a much shallower
pipeline, which means that it needs a much smaller instruction window
for tracking fewer in-flight instructions in order to achieve the same
performance.
The trade-off for this lack of complexity and lower pipeline depth is a
lower clock speed. The 6-stage 604 debuted in May 1995 at 120 MHz, while
the 12-stage Pentium Pro debuted later that year (November 1995) at speeds
ranging from 150 to 200 MHz.
Chapter 6
Summary: The 604 in Historical Context
With a 32KB split L1 cache, the 604 had a much heftier cache than its
predecessors, which it needed to help keep its deeper pipeline fed. The larger
cache, higher dispatch and issue rate, wider back end, and deeper pipeline
made for a solid RISC performer that was easily able to keep pace with its
x86 competitors.
Still, the Pentium Pro was no slouch, and its performance was scaling
well with improvements in processor manufacturing techniques. Apple
needed more power from AIM to keep the pace, and more power is what
they got with a minor microarchitectural revision that came to be called
the 604e.
The PowerPC 604e
The 604e built on gains made by the 604 with a few core changes that
included a doubling of the L1 cache size (to 32KB instruction/32KB data)
and the addition of a new independent execution unit: the condition register
unit (CRU).
The previous 600-series processors had moved the responsibility for
handling condition register logical operations back and forth among various
units (the integer unit in the 601, the system unit in the 603/603e, and the
branch unit in the 604). Now with the 604e, these operations got an
execution unit of their own. The 604e sported a functional block in its back end
that was dedicated to handling condition register logical operations, which
meant that these not uncommon operations didn’t tie up other execution
units—like the integer unit or the branch unit—that had more serious
work to do.
The 604e’s branch unit, now that it was free from having to handle CR
logical operations, got a few expanded capabilities that I won’t detail here.
The 604e’s caches, in addition to being enlarged, also got additional copy-
back buffers and a handful of other enhancements.
The 604e was ultimately able to scale up to 350 MHz once it moved from
a 0.35 to a 0.25 micron manufacturing process, making it a successful part for
Apple’s budding RISC media workstation line.
The PowerPC 750 (aka the G3)
The PowerPC 750—known to Apple users as the G3—is a design based heavily
on the 603/603e. Its four-stage pipeline is the same as that of the 603/603e,
and many of the features of its front end and back end will be familiar to you
from our discussion of the older processor. Nonetheless, the 750 sports a few
very powerful improvements over the 603e that make it faster than even the
604e, as you can see in Table 6-4.
Table 6-4: Features of the PowerPC 750

Introduction Date              September 1997
Process                        0.25 micron
Transistor Count               6.35 million
Die Size                       67 mm²
Clock Speed at Introduction    200–300 MHz
Cache Sizes                    64KB split L1, 1MB L2
First Appeared In              Power Macintosh G3
The 750’s significant improvement in performance over the 603/603e is
the result of a number of factors, not the least of which are the improvements
that IBM made to the 750’s integer and floating-point capabilities.
A quick glance at the 750’s layout (see Figure 6-4) reveals that its back end
is wider than that of the 603. More specifically, where the 603 has a single
integer unit, the 750 has two: a simple integer unit (SIU) and a complex
integer unit (CIU). The 750’s complex integer unit handles all integer
instructions, while the simple integer unit handles all integer instructions
except multiply
and divide. Most of the integer instructions that execute in the SIU are
single-cycle instructions.
Like the 603 (and the 604), the 750’s floating-point unit can execute all
single-precision floating-point operations, including multiply, with a latency
of three cycles. And like the 603, early versions of the 750 had to insert a
pipeline bubble after every third floating-point instruction in its pipeline;
this was fixed in later IBM-produced versions of the 750. Double-precision
floating-point operations, with the exception of operations involving
multiplication, also take three cycles on the 750. Double-precision multiply and
multiply-add operations take four cycles, because the 750 doesn’t have a full
double-precision FPU.
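The latencies just described can be written as a small lookup. This sketch only restates the numbers given in the text; the function name and the operation labels are illustrative, not PowerPC mnemonics, and operations the text doesn't discuss are not modeled.

```python
# Operations that involve multiplication, per the description above.
MULTIPLY_OPS = {"multiply", "multiply-add"}


def fp_latency_750(op, double_precision):
    """Return the FPU latency in cycles that the text gives for `op` on the 750."""
    if double_precision and op in MULTIPLY_OPS:
        return 4   # no full double-precision FPU
    return 3       # single precision, or double precision without multiply


for op in ("add", "multiply", "multiply-add"):
    print(op, fp_latency_750(op, False), fp_latency_750(op, True))
```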
The 750’s load-store unit and system register unit perform the same
functions described in the preceding section for the 603, so they don’t merit
further comment.
The 750’s Front End, Instruction Window, and Branch Instruction
The 750 fetches up to four instructions per cycle into its six-entry instruction queue, and it dispatches up to two non-branch instructions per cycle from
the IQ’s two bottom entries. The dispatch logic follows the four dispatch rules
described earlier when deciding when an instruction is eligible to dispatch,
and each dispatched instruction is assigned an entry in the 750’s six-entry
ROB (compare the 603’s five-entry ROB).
[Block diagram. Front end: instruction fetch, branch unit (BU), instruction queue, decode/dispatch. Back end: four reservation stations feeding the floating-point unit (FPU-1 to FPU-3), the two integer units (IU1, IU2), and the load-store unit (LSU-1, LSU-2). Commit unit: completion queue and write.]
Figure 6-4: Microarchitecture of the PowerPC 750
As on the 603 and 604, newly dispatched instructions enter the reservation
station of the execution unit to which they have been dispatched, where
they wait for their operands to become available so that they can issue. The
750’s reservation station configuration is similar to that of the 603 in that, with the exception of the two-entry reservation station attached to the 750’s LSU,
all of the execution units have single-entry reservation stations. And like the
603, the 750’s branch unit has no reservation station.
Because the 750’s instruction window is so small, it has half the rename
registers of the 604. Nonetheless, the 750’s six general-purpose and six
floating-point rename registers still put it ahead of the 603, which has five
general-purpose and four floating-point rename registers. Like the 603, the
750 has one rename register each for the CR, LR, and CTR.
You would think that the 750’s smaller reservation stations and shorter
ROB would put it at a disadvantage with respect to the 604, which has a larger
instruction window. But the 750’s pipeline is shorter than that of the 604, so
it needs fewer buffers to track fewer in-flight instructions. More importantly,
though, the 750 has one very clever trick up its sleeve that it uses to keep its pipeline full.
Recall that standard dynamic branch prediction schemes generally use
a branch history table (BHT) in combination with a branch target buffer (BTB)
to speculate on the outcome of branch instructions and to redirect the
processor’s front end to a different point in the code stream based on this
speculation. The BHT stores information on the past behavior (taken or not
taken) of the most recently executed branch instructions, so that the processor
can determine whether or not it should take these branches if it encounters
them again. The target addresses of recently taken branches are stored in the
BTB, so that when the branch prediction hardware decides to speculatively
take a branch, it has immediate access to that branch’s target address without
having to recalculate it. The target address of the speculatively taken branch
is loaded from the BTB into the program counter, so that on the next fetch
cycle, the processor can begin fetching and speculatively executing
instructions from the target address.
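A rough sketch of the standard BHT-plus-BTB scheme follows. It is an illustration under stated assumptions rather than any particular processor's design: the two-bit saturating counter is one common way to record branch history (the text says only that past behavior is stored), the tables are unbounded dictionaries instead of fixed-size hardware structures, and the class and method names are hypothetical.

```python
class BranchPredictor:
    def __init__(self):
        self.bht = {}   # branch address -> 2-bit saturating counter (0..3)
        self.btb = {}   # branch address -> target address of a taken branch

    def predict(self, branch_addr, fall_through_addr):
        """Return the address the front end should fetch from next."""
        counter = self.bht.get(branch_addr, 1)   # default: weakly not taken
        if counter >= 2 and branch_addr in self.btb:
            return self.btb[branch_addr]         # speculatively taken
        return fall_through_addr                 # speculatively not taken

    def update(self, branch_addr, taken, target_addr):
        """Record the real outcome once the branch has executed."""
        counter = self.bht.get(branch_addr, 1)
        if taken:
            self.bht[branch_addr] = min(counter + 1, 3)
            self.btb[branch_addr] = target_addr  # remember the target for next time
        else:
            self.bht[branch_addr] = max(counter - 1, 0)


bp = BranchPredictor()
print(hex(bp.predict(0x100, 0x104)))   # 0x104: no history yet, fall through
bp.update(0x100, taken=True, target_addr=0x200)
print(hex(bp.predict(0x100, 0x104)))   # 0x200: target supplied by the BTB
```

The BTB lookup is what saves the recalculation: once a branch has been taken, its target is available as soon as the predictor decides to take it again.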
The 750 improves on this standard scheme in a very clever way. Instead
of storing only the target addresses of recently taken branches in a BTB, the
750’s 64-entry branch target instruction cache (BTIC) stores the instruction
that is located at the branch’s target address. When the 750’s branch prediction