Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
The G5: IBM's PowerPC 970

When the individual IOPs in a group reach their proper issue queues, they
can then be issued out of program order to the execution units at a rate of
eight IOPs/cycle for all the queues combined. Before they reach the completion
stage, however, they need to be placed back into their group so that an
entire group of five IOPs can be completed on each cycle. (Don't worry if this
sounds a bit confusing right now. The group dispatch and formation scheme
will become clearer when we discuss the 970’s peculiar issue queue structure.)
The price the 970 pays for the reduced bookkeeping overhead afforded
it by the dispatch grouping scheme is a loss of execution efficiency brought
on by the diminished granularity of control that comes from being able to
dispatch, schedule, issue, and complete instructions on an individual basis.
Let me explain.
The 970’s Dispatch Rules
When the 970’s front end assembles an IOP group, there are certain rules it
must follow. The first rule is that the group’s five slots must be populated with IOPs in program order, starting with the oldest IOP in slot 0 and moving up
to the newest IOP in slot 4. Another rule is that all branch instructions must go
in slot 4, and slot 4 is reserved for branch instructions only. This means that
if the front end can’t find a branch instruction to put in slot 4, it dispatches
one fewer instruction that cycle.
Similarly, there are some situations in which the front end must insert
noops into the group’s slots in order to force a branch instruction into slot 4.
Noop (pronounced “no op”) is short for no operation. It is a kind of
non-instruction instruction that means “Do nothing.” In other words, the front
end must sometimes insert empty execution slots, or pipeline bubbles, into
the instruction stream in order to make the groups comply with the rules.
The preceding rules aren’t the only ones that must be adhered to when
building groups. Another rule dictates that instructions destined for the
condition register unit (CRU) can go only in slots 0 and 1.
And then there are the rules dealing with cracked and millicoded
instructions. Consider the following from IBM’s POWER4 white paper:
Cracked instructions flow into groups as any other instructions
with one restriction. Both IOPs must be in the same group. If both
IOPs cannot fit into the current group, the group is terminated
and a new group is initiated. The instruction following the cracked
instruction may be in the same group as the cracked instruction,
assuming there is room in the group. Millicoded instructions
always start a new group. The instruction following the millicoded
instruction also initiates a new group.
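The slotting rules described so far can be sketched as a toy group-formation routine. This is an illustrative model only, not IBM's hardware logic: it covers the program-order, branch-in-slot-4, and CRU-slot rules, but not cracked or millicoded instructions, and the instruction "kinds" are invented for the example.

```python
# Toy model of PowerPC 970 dispatch-group formation (illustrative only).
# A group has five slots: slots 0-3 hold non-branch IOPs in program
# order; slot 4 is reserved for branches. Groups are padded with noops
# (pipeline bubbles) where the rules require it.

BRANCH, CRU, NORMAL = "branch", "cru", "normal"

def form_groups(iops):
    """iops: list of (kind, name) tuples in program order."""
    groups, current = [], []
    for kind, name in iops:
        if kind == BRANCH:
            # A branch must land in slot 4: pad slots up to index 4
            # with noops, place the branch, and close the group.
            while len(current) < 4:
                current.append(("noop", "-"))
            current.append((kind, name))
            groups.append(current)
            current = []
        else:
            if kind == CRU and len(current) > 1:
                # CRU ops may occupy only slots 0 and 1: close the
                # current group (slot 4 stays empty) and start fresh.
                while len(current) < 5:
                    current.append(("noop", "-"))
                groups.append(current)
                current = []
            current.append((kind, name))
            if len(current) == 4:
                # Slots 0-3 are full and no branch arrived, so slot 4
                # goes empty: one fewer instruction dispatched.
                current.append(("noop", "-"))
                groups.append(current)
                current = []
    if current:                      # flush any partial final group
        while len(current) < 5:
            current.append(("noop", "-"))
        groups.append(current)
    return groups
```

Running this on a short instruction stream makes the cost of the rules visible: a CRU op arriving after slot 1 is filled forces an early group termination and a string of noops, exactly the kind of bubble the text describes.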
And that’s not all! A group has to have the following resources available
before it can even dispatch to the back end. If just one of the following resources
is too tied up to accommodate the group or any of its instructions, then the
entire group has to wait until that resource is freed up before it can dispatch:
Group completion table (GCT) entry
The group completion table is the 970’s equivalent of a reorder buffer
or completion queue. While a normal ROB keeps track of individual
in-flight instructions, the GCT tracks whole dispatch groups. The GCT
has 20 entries for keeping track of 20 active groups as the groups’
constituent instructions make their way through the ~100 execution
slots available in the back end’s pipelines. Regardless of how few instruc-
tions are actually in the back end at a given moment, if those instructions
are grouped so that all 20 GCT entries happen to be full, no new groups
can be dispatched.
Issue queue slot
If there aren’t enough slots available in the appropriate issue queues to
accommodate all of a group’s instructions, the group must wait to dispatch.
(In a moment I’ll elaborate on what I mean by “appropriate issue queues.”)
Rename registers
There must be enough register rename resources available so that any
instruction that requires register renaming can issue when it’s dispatched
to its issue queue.
Again, when it comes to the preceding restrictions, one bad instruction
can keep the whole group from dispatching.
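The all-or-nothing nature of this check can be sketched in a few lines. This is a simplified model under assumed data shapes (the field names `queue` and `writes_reg` are invented for the example), not the 970's actual dispatch logic:

```python
def can_dispatch(group, free_gct, free_slots, free_renames):
    """group: list of IOPs, each a dict {'queue': str, 'writes_reg': bool}.
    free_slots: free entries per issue queue; free_renames: free rename
    registers. The group dispatches only if EVERY resource is available."""
    if free_gct < 1:
        return False                 # need one GCT entry per group
    needed_slots, needed_renames = {}, 0
    for iop in group:
        needed_slots[iop["queue"]] = needed_slots.get(iop["queue"], 0) + 1
        if iop["writes_reg"]:
            needed_renames += 1
    if needed_renames > free_renames:
        return False                 # not enough rename registers
    # Every target queue must have room for all of the group's IOPs
    # headed its way; one crowded queue stalls the entire group.
    return all(free_slots.get(q, 0) >= n for q, n in needed_slots.items())
```

Note how a single IOP whose queue is full returns `False` for the whole group, which is the "one bad instruction" effect described above.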
Because of its use of groups, the 970’s dispatch bandwidth is sensitive
to a complex host of factors, not the least of which is a sort of “internal
fragmentation” of the group completion table that could potentially arise
and needlessly choke dispatch bandwidth if too many of the groups in the
GCT are partially or mostly empty.
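This "internal fragmentation" is easy to quantify with a back-of-the-envelope calculation using the numbers from the text: 20 GCT entries and five IOPs per group.

```python
GCT_ENTRIES = 20       # active groups the GCT can track
SLOTS_PER_GROUP = 5    # IOPs per dispatch group

def max_inflight_iops(avg_iops_per_group):
    """IOPs the back end can have in flight once the GCT is full of
    groups holding avg_iops_per_group instructions each."""
    return GCT_ENTRIES * avg_iops_per_group

# Fully packed groups: 20 * 5 = 100 IOPs tracked at once.
print(max_inflight_iops(SLOTS_PER_GROUP))
# Badly fragmented groups of one IOP each: dispatch stalls with only
# 20 instructions in flight, despite the ~100 available execution slots.
print(max_inflight_iops(1))
```

In the worst case, then, fragmentation costs a factor of five in trackable in-flight instructions.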
In order to keep dispatch bottlenecks from stopping the fetch/decode
portion of the pipeline, the 970 can buffer up to four dispatch groups in a
four-entry dispatch queue. So if the preceding requirements are not met and
there is space in the dispatch queue, a dispatch group can move into the
queue and wait there for its dispatch requirements to be met.
Predecoding and Group Dispatch
The 970 uses a trick called predecoding in order to move some of the work of
group formation higher up in the pipeline, thereby simplifying and speeding
up the later decode and group formation phases in the front end. As instruc-
tions are fetched from the L2 cache into the L1 I-cache, each instruction is
predecoded and marked with a set of five predecode bits. These bits indicate
how the instruction should be grouped—in other words, if it should be first
in its group, last in its group, or unrestricted; if it will be a microcoded instruction; if it will trigger an exception; if it will be split or not; and so on. This information is used by the decode and group formation hardware to quickly
route instructions for decoding and to group them for dispatch.
The predecode hardware also identifies branches and marks them for
type—conditional or unconditional. This information is used by the 970’s
branch prediction hardware to implement branch folding, fall-through,
and branch prediction with minimal latencies.
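Conceptually, predecoding attaches grouping hints to each instruction at cache-fill time. The sketch below illustrates the idea only; the hint names and the opcode-to-hint mapping are invented for this example and do not reflect IBM's actual predecode-bit encoding.

```python
def predecode(opcode):
    """Return illustrative grouping hints for one fetched instruction.
    (Hypothetical flag names; not the 970's real predecode bits.)"""
    hints = {"first_in_group": False, "last_in_group": False,
             "is_branch": False, "is_millicoded": False}
    if opcode in ("b", "bc", "bclr"):
        # Branches must land in slot 4, so they end their group,
        # and the branch flag feeds the branch prediction hardware.
        hints["last_in_group"] = True
        hints["is_branch"] = True
    elif opcode in ("lmw", "stmw"):
        # Load/store-multiple ops are millicoded on the 970 and
        # therefore always start a new group.
        hints["first_in_group"] = True
        hints["is_millicoded"] = True
    return hints
```

With hints like these computed once at L1 fill time, the group formation hardware can route each instruction without re-examining its encoding on every fetch.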
Some Preliminary Conclusions on the 970’s Group Dispatch Scheme
In the preceding section, I went into some detail on the ins and outs of group
formation and group dispatching in the 970. If you only breezed through
the section and thought, “All of that seems like kind of a pain,” then you got
90 percent of the point I wanted to make. Yes, it is indeed a pain, and that
pain is the price the 970 pays for having both width and depth at the same
time. The 970’s big trade-off is that it needs less logic to support its long pipeline and extremely wide back end, but in return, it has to give up a measure
of granularity, flexibility, and control over the dispatch and issuing of its
instructions. Depending on the makeup of the instruction stream and how
the IOPs end up being arranged, the 970 could possibly end up with quite a
few groups that are either mostly empty, partially empty, or stalled waiting
for execution resources.
So while the 970 may be theoretically able to accommodate 200 instruc-
tions in varying stages of fetch, decode, execution, and completion, the reality is probably that under most circumstances, a decent number of its execution
slots will be empty on any given cycle due to dispatch, scheduling, and com-
pletion limitations. The 970 makes up for this with the fact that it just has so many available slots that it can afford to waste some on group-related pipeline
bubbles.
The PowerPC 970’s Back End
The PowerPC 970 sports a total of 12 execution units, depending on how you
count them. Even a more conservative count that lumps together the three
SIMD integer and floating-point units and doesn’t count the branch execution
unit would still give nine execution units.
In the following three sections, I’ll discuss each of the 970’s execution
units, comparing them to the analogous units on the G4e, and in some cases
the Pentium 4. As the discussion develops, keep in mind that a simple com-
parison of the types and numbers of execution units for each of the three
processors is not at all adequate to the task of sorting out the real differences between the processors. Rather, there are complicating factors that make comparisons much more difficult than one might naïvely expect. Some of these
factors will be evident in the sections dealing with each type of execution
unit, but others won’t turn up until we discuss the 970’s issue queues in the
last third of the chapter.
NOTE
As I cover each part of the 970’s back end, I’ll specify the number of rename registers of
each type (integer, floating-point, vector) that the 970 has. If you compare these numbers to the equivalent numbers for the G4e, you’ll see that the 970 has many more rename
registers than its predecessor. This increased number of rename registers is necessary
because the 970’s instruction window (up to 200 instructions in-flight) is significantly
larger than that of the G4e (up to 16 instructions in-flight). The more instructions a
processor can hold in-flight at once, the more rename registers it needs in order to pull
off the kinds of tricks that a large instruction window enables you to do—i.e., dynamic
scheduling, loop unrolling, speculative execution, and the like. In a nutshell, more
instructions on the chip in various stages of execution means more data needs to be
stored in more registers.
Integer Unit, Condition Register Unit, and Branch Unit
In the chapters on the Pentium 4 and G4e, I described how both of these
processors embody a similar approach to integer computation in that they
divide integer operations into two types: simple and complex. Simple integer
instructions, like add, are the most common type of integer instruction and
take only one cycle to execute on most hardware. Complex integer instruc-
tions (e.g., integer division) are rarer and take multiple cycles to execute.
In keeping with the quantitative approach to computer design’s central
dictum, “Make the common case fast,”1 both the Pentium 4 and G4e split up
their integer hardware into two specialized types of execution units: a group
of units that handle only simple, single-cycle instructions and a single unit
that handles complex, multi-cycle instructions. By dedicating the majority of
their integer hardware solely to the rapid execution of the most common
types of instructions (the simple, single-cycle ones), the Pentium 4 and the
G4e are able to get increased integer performance out of a smaller amount
of overall hardware.
Think of the multiple fast IUs (or SIUs) as express checkout lanes for
one-item shoppers and the single slow IU as a general-purpose checkout lane
for multiple-item shoppers in a supermarket where most of the shoppers
only buy a single item. This kind of specialization keeps that one guy who’s
stocking up for Armageddon from slowing down the majority of shoppers
who just want to duck in and grab eggs or milk on the way home from work.
The PPC 970 differs from both of these designs in that it has two
general-purpose IUs that execute almost all integer instructions. To return to
the supermarket analogy, the 970 has two general-purpose checkout lanes in a
supermarket where most of the shoppers are one-item shoppers. The 970’s
two IUs are attached to 80 64-bit GPRs (32 architected and 48 rename).
Why doesn’t the 970 have more specialized hardware (express checkout
lanes) like the G4e and Pentium 4? The answer is complicated, and I’ll take an
initial stab at answering it in a moment, but first I should clear something up.
The Integer Units Are Not Fully Symmetric
I said that the 970’s two IUs execute “almost all” integer instructions, because the units are not, in fact, fully symmetric. One of the IUs performs fixed-point divides, and the other handles SPR operations. So the 970’s IUs are
slightly specialized, but not in the same manner as the IUs of the G4e and
Pentium 4. If the G4e and Pentium 4 have express checkout lanes, the 970
has something more akin to a rule that says, “All shoppers who bought some-
thing at the deli must go through line 1, and all shoppers who bought
something at the bakery must go through line 2; everyone else is free to
go through either line.”
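Under that rule, steering an integer op looks roughly like this. It's a sketch only: the unit labels `IU1`/`IU2` and the instruction-class strings are stand-ins, not the 970's real internal names.

```python
def steer_integer_op(op):
    """Return which of the 970's two integer units may execute an op
    (illustrative model of the slightly asymmetric IUs)."""
    if op == "divide":
        return ["IU1"]            # fixed-point divides: one unit only
    if op == "spr":
        return ["IU2"]            # SPR operations: the other unit only
    return ["IU1", "IU2"]         # everything else: either unit
```

Because divides and SPR ops are rare, most integer instructions see two fully interchangeable "checkout lanes."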
Thankfully, integer divides and SPR instructions are relatively rare, so
the impact on performance of this type of forced segregation is minimized.
In fact, if the 970 didn’t have the group formation scheme, this seemingly
1 See John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Third Edition (Morgan Kaufmann Publishers: 2003).
minor degree of specialization might hardly be worth commenting on. But
as it stands, group-related scheduling issues turn this specialization into a
potentially negative factor—albeit a minor one—for integer performance
for reasons that we’ll discuss later on in this chapter.
Integer Unit Latencies and Throughput