Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
four fully pipelined vector processing units, it has plenty of hardware to go
around. As a brief recap from the last chapter, here’s a breakdown of the
G4e’s four AltiVec units:
• vector permute unit (VPU)
• vector simple integer unit (VSIU)
• vector complex integer unit (VCIU)
• vector floating-point unit (VFPU)
All of this vector execution hardware is tied to a generous register file that
consists of thirty-two 128-bit architectural registers and sixteen additional
vector rename registers. Furthermore, each of the units is attached to a
four-entry vector issue queue that can issue two vector ops per cycle to any
of the four vector units.
NOTE
The SIMD instruction set known as AltiVec was codeveloped by both IBM and Motorola,
and it is co-owned by both of them. Motorola has a trademark on the AltiVec name, so
in most of IBM’s literature (but not all), the instruction set is referred to as VMX, presumably for Vector Multimedia Extensions. VMX and AltiVec are therefore two different
names for the same group of 162 vector instructions added to the PowerPC ISA. This
chapter uses the name AltiVec for both the G4e and the 970.
The 970’s AltiVec implementation looks pretty much like the original
G4’s—the MPC7400—except for the addition of issue queues for dynamic
execution. The 970 has the following units:
• vector permute unit (VPU)
• vector arithmetic logic unit (VALU)
    • vector simple integer unit (VSIU)
    • vector complex integer unit (VCIU)
    • vector floating-point unit (VFPU)
These units are attached to 80 vector registers—32 architectural registers
and 48 rename registers.
Notice that the vector execution units listed here are essentially the same
units as in the G4e, but they’re grouped differently. This grouping makes all
the difference. Take a look at the simplified diagram in Figure 10-3.
The 970 can dispatch up to four vector IOPs per cycle total to its two
vector logical issue queues: two IOPs per cycle in program order to the
16-entry VPU logical issue queue and two IOPs per cycle in program order to
the 20-entry VALU logical issue queue. (Each of these logical issue queues
consists physically of a pair of interleaved queues that operate together as a
single queue, but we’ll talk more about how the logical queues are actually
implemented in a moment.) Each of the two logical queues can then issue
one vector operation per cycle to any of the units attached to it. This means
that the 970 can issue one IOP per cycle to the VPU and one IOP per cycle to
any of the VALU’s three subunits.
As you can see, if you place the 970’s vector unit and the G4e’s vector unit
side by side, the 970 and the G4e can issue the same number of vector
instructions per cycle to their AltiVec units, but the 970 is much more limited in the
combinations of instructions it can issue, because one of its two instructions
must be a vector permute. For instance, the G4e would have no problem
issuing both a complex integer instruction and a simple integer instruction to
its VCIU and VSIU in the same cycle, whereas the 970 would only be able to
issue one of these in a cycle.
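
The difference in issue flexibility can be made concrete with a small sketch. The Python below is only an illustration: the unit groupings come from the lists above, while the function names and the rule that a single unit accepts at most one instruction per cycle are assumptions made for the example, not details taken from IBM's or Motorola's documentation.

```python
from itertools import combinations

UNITS = ("VPU", "VSIU", "VCIU", "VFPU")

# On the 970, each unit hangs off one of two logical issue queues,
# and each logical queue issues at most one IOP per cycle.
PPC970_QUEUE = {
    "VPU": "VPU queue",
    "VSIU": "VALU queue",
    "VCIU": "VALU queue",
    "VFPU": "VALU queue",
}

def g4e_can_pair(a, b):
    """G4e: two vector ops per cycle to any of the four units.
    Assumption: a single unit accepts only one op per cycle."""
    return a != b

def ppc970_can_pair(a, b):
    """970: the two ops must come from different logical queues,
    so one must be a permute and the other a VALU op."""
    return PPC970_QUEUE[a] != PPC970_QUEUE[b]

g4e_pairs = [p for p in combinations(UNITS, 2) if g4e_can_pair(*p)]
ppc970_pairs = [p for p in combinations(UNITS, 2) if ppc970_can_pair(*p)]

print(len(g4e_pairs), "unit pairings on the G4e")
print(len(ppc970_pairs), "unit pairings on the 970:", ppc970_pairs)
```

Under these assumptions, every pairing that survives on the 970 includes the VPU, which is the restriction described in the paragraph above.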
The G5: IBM’s PowerPC 970
[Diagram: dispatch slots 0 through 4 send up to 4 IOPs per cycle (total) to two vector issue queues. Each queue issues 1 IOP per cycle: one to the vector permute unit (VPU-1, VPU-2) and one to the vector ALU (VSIU-1; VCIU-1 through VCIU-4; VFPU-1 through VFPU-4).]
Figure 10-3: The 970’s vector unit
Table 10-3 replicates a chart from Apple that compares the AltiVec
execution unit latencies for the two G4s and the 970.[2]
Table 10-3: Vector Instruction Latencies for the G4, G4e, and 970

Hardware Unit                         7400/7410   7450/7455   PPC 970
Vector simple integer unit (VSIU)     1           1           2¹
Vector complex integer unit (VCIU)    3           4           5¹
Vector floating-point unit (VFPU)     4 (5)²      4 (5)²,³    8¹
Vector permute unit (VPU)             1           2           2⁴

¹ An extra cycle latency is required if the data is next used in VPU.
² The VFPU takes an extra cycle if Java mode is turned on. (It is off by default on Mac OS X, but on by default on Mac OS 9.)
³ Some FP-related VSIU instructions were moved to the VFPU for later G4. These only take two cycles instead of the usual 4 (5).
⁴ An extra cycle latency is required if the data is next used in VCIU/VSIU/VFPU.
Notice that the 970’s vector instruction latencies are close to those in the
G4e, which is an excellent sign given the fact that the 970’s pipeline is longer and its issue queues are much deeper (18 instructions on the 970 versus 4
instructions on the G4e). The 970’s larger instruction window and deeper
issue queues allow the processor to look farther ahead in the instruction
stream and extract more instruction-level parallelism (ILP) from the vector
instruction stream. This means that the 970’s vastly superior dynamic
scheduling hardware can squeeze more performance out of slightly inferior vector
execution hardware, a capability that, when coupled with a high-bandwidth
memory subsystem and an ultrafast front-side bus, enables its vector
performance to exceed that of the G4e.

[2] This is a truncated version of Apple’s AltiVec Instruction Cross-Reference (http://developer.apple.com/hardware/ve/instruction_crossref.html).
Floating-Point Issue Queues
Figure 10-1 shows the five dispatch slots connected to issue queues for each
of the functional units or groups of functional units. Figure 10-1 is actually
oversimplified, since the true relationship between the dispatch slots, issue
queues, and functional units is more complicated than depicted there. The
actual physical issue queue configuration is a bit hard to explain in words,
so Figure 10-4 shows how the floating-point issue queues really look.
[Diagram: dispatch slots 0 through 4 above two logical issue queues (Logical Issue Queue 1 and Logical Issue Queue 2), which feed the floating-point units FPU1 and FPU2.]
Figure 10-4: The 970’s floating-point issue queues
Each of the floating-point execution units is fed by what I’ve called a
logical issue queue. As you can see in Figure 10-4, each 10-entry logical issue
queue actually consists of an interleaved pair of five-entry physical issue queues, which together feed a single floating-point execution unit.
Figure 10-4 also shows how the four non-branch dispatch slots each feed
a specific physical issue queue. Floating-point IOPs that dispatch from slots
0 and 3 go into the tops of the two physical issue queues that are attached
to them. Similarly, floating-point IOPs that dispatch from slots 1 and 2 go
into the tops of their two attached physical issue queues. As IOPs issue from
different places in the physical queues, the IOPs “above” them in the queue
“fall down” to fill in the gaps.
Each pair of physical issue queues—the pair attached to slots 0 and 3,
and the pair attached to slots 1 and 2—is interleaved and functions as a
single “logical” issue queue. An IOP can issue from a logical issue queue to
an execution unit when all of its sources are set, and the oldest IOP with all
of its sources set in a logical queue is issued to the attached execution unit.
This means that, as is the case with the G4e’s issue queues, instructions are
issued in program order from within each logical issue queue, but out of
program order with respect to the overall code stream.
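
The behavior described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of IBM's circuitry: the class and field names are hypothetical, program-order age is modeled with a simple sequence number, and "sources set" is reduced to a single boolean flag.

```python
from dataclasses import dataclass

@dataclass
class IOP:
    seq: int             # program-order age; a lower number is older
    name: str
    ready: bool = False  # True once all source operands are available

class LogicalIssueQueue:
    """Two interleaved physical queues that issue as one logical queue."""

    def __init__(self, slots, entries_per_queue=5):
        self.capacity = entries_per_queue
        # One physical queue per attached dispatch slot.
        self.physical = {slot: [] for slot in slots}

    def dispatch(self, slot, iop):
        """Place an IOP into the physical queue for its dispatch slot."""
        queue = self.physical[slot]
        if len(queue) >= self.capacity:
            return False  # queue full; the IOP cannot enter this cycle
        queue.append(iop)
        return True

    def issue(self):
        """Issue the oldest ready IOP across both physical queues."""
        candidates = [iop for q in self.physical.values() for iop in q if iop.ready]
        if not candidates:
            return None
        oldest = min(candidates, key=lambda iop: iop.seq)
        for q in self.physical.values():
            if oldest in q:
                q.remove(oldest)  # younger entries "fall down" to fill the gap
                break
        return oldest

fp_queue = LogicalIssueQueue(slots=(0, 3))
fp_queue.dispatch(0, IOP(0, "fadd"))              # still waiting on a source
fp_queue.dispatch(3, IOP(1, "fmul", ready=True))
fp_queue.dispatch(0, IOP(2, "fsub", ready=True))
first = fp_queue.issue()
second = fp_queue.issue()
third = fp_queue.issue()
print(first.name, second.name, third)
```

In this run the fmul and fsub issue ahead of the older fadd, which is still waiting on its sources. Among the IOPs that are ready, the oldest goes first.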
NOTE
The term logical issue queue is one that I’ve coined for the purposes of this chapter,
and is not to my knowledge used in IBM’s literature. IBM prefers the phrase
common interleaved queues.
It’s important for me to emphasize that individual IOPs issue from
their respective logical issue queues completely independently of their dispatch
group. So the execution units’ schedulers are blind to which group an IOP
is in when it comes time to schedule the IOP and issue it to an execution
unit. Thus the dispatch groups are allowed to break apart after they reach
the level of the issue queue, and they’re reassembled after execution in the
group completion table (GCT).
Integer and Load-Store Issue Queues
The integer and load-store execution units are fed by issue queues in a similar
manner to the floating-point units, but they’re slightly more complex because
they share issue queues. Take a look at Figure 10-5 to see how this works.
As with the floating-point issue queues, integer or memory access
IOPs in dispatch slots 0 and 3 go into the two integer physical issue queues
that are specifically intended for them. The twist is that this pair of queues is shared by two execution units: IU1 and LSU1. So integer or memory IOPs from
slots 0 and 3 are sent into the appropriate issue queue pair (or logical issue
queue), and these IOPs then issue to either IU1 or LSU1 as the integer
scheduler sees fit. Similarly, integer or memory access IOPs from dispatch slots
1 and 2 go into their respective physical queues, both of which feed IU2
and LSU2. As with the floating-point issue queues, these four 9-entry
interleaved queues should be thought of as being grouped into two 18-entry logical
issue queues, where each logical issue queue works together with the
appropriate scheduler to feed a pair of execution units.
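
As a rough illustration of this routing, the sketch below maps a dispatch slot and an IOP type to the execution unit that would receive it. The slot-to-unit pairings come from the paragraph above. The function name, the "integer" and "memory" labels, and the simplification that integer IOPs always go to the IU and memory IOPs always go to the LSU are assumptions made for the example.

```python
# Dispatch slot -> (integer unit, load-store unit) sharing that slot's
# logical issue queue. Slot 4 is absent because it holds only branches.
SLOT_TO_UNITS = {
    0: ("IU1", "LSU1"),
    3: ("IU1", "LSU1"),
    1: ("IU2", "LSU2"),
    2: ("IU2", "LSU2"),
}

def execution_unit(slot, kind):
    """Return the unit an integer or memory IOP in `slot` would issue to."""
    if slot not in SLOT_TO_UNITS:
        raise ValueError(f"slot {slot} does not feed the integer/LSU queues")
    integer_unit, load_store_unit = SLOT_TO_UNITS[slot]
    if kind == "integer":
        return integer_unit
    if kind == "memory":
        return load_store_unit
    raise ValueError(f"unknown IOP kind: {kind!r}")

for slot in (0, 1, 2, 3):
    print(slot, execution_unit(slot, "integer"), execution_unit(slot, "memory"))
```

The practical consequence is that the slot an IOP lands in during group formation determines which of the two integer units or load-store units can execute it.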
[Diagram: dispatch slots 0 through 4 above two integer/LSU logical issue queues (Logical Issue Queue 1 and Logical Issue Queue 2), which feed IU1, LSU1, LSU2, and IU2.]
Figure 10-5: The 970’s IU and LSU issue queues
BU and CRU Issue Queues
The branch unit and condition register unit issue queues work in a similar
manner to what I’ve described previously, with the differences depending on
grouping and issue restrictions. So the CRU has a single 10-entry logical issue
queue comprised of two issue queues with five entries each, one for slot 0
and one for slot 1 (because CR IOPs can be placed only in slots 0 and 1).
The branch execution unit has a single logical issue queue comprised of one
12-entry issue queue for slot 4 (because branch IOPs can go only in slot 4).
Vector Issue Queues
The vector issue queues are laid out slightly differently than the other issue
queues. The vector issue queue configuration is depicted in Figure 10-6.
The vector permute unit is fed from one 16-entry (four entries × four
queues) logical issue queue connected to all four non-branch dispatch slots,
and the vector ALU is fed from a 20-entry (five entries × four queues) logical
issue queue that’s also connected to all four non-branch dispatch slots. As with all of the other issue queue pairs, one IOP per cycle can issue in order from
each logical issue queue to any of the execution units that are attached to it.
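
The queue sizes quoted in the last few sections all follow the same arithmetic: the number of physical queues times the entries in each. The short Python sketch below tabulates them. The figures are the ones given in the text, while the dictionary layout and function name are just one hypothetical way to organize them.

```python
# Logical issue queue -> (physical queues, entries per physical queue),
# using the figures quoted in the text.
LOGICAL_QUEUES = {
    "FPU (each of two)": (2, 5),
    "IU/LSU (each of two)": (2, 9),
    "CRU": (2, 5),
    "BU": (1, 12),
    "VPU": (4, 4),
    "VALU": (4, 5),
}

def logical_entries(name):
    """Entries in a logical queue = physical queues x entries per queue."""
    physical_queues, entries_each = LOGICAL_QUEUES[name]
    return physical_queues * entries_each

for name in LOGICAL_QUEUES:
    print(f"{name}: {logical_entries(name)} entries")
```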
The Performance Implications of the 970’s Group Dispatch Scheme
Because of the way it affects the 970’s issue queue design, the group formation
scheme has some interesting performance implications. Specifically, proper
code scheduling is important in ways that it wouldn’t normally be for the
other processors discussed here.
[Diagram: dispatch slots 0 through 4 send up to 4 IOPs per cycle (total) to Vector Logical Issue Queue 1 and Vector Logical Issue Queue 2. Each queue issues 1 IOP per cycle: one to the vector permute unit (VPU-1, VPU-2) and one to the vector arithmetic logic unit (VSIU-1; VCIU-1 through VCIU-4; VFPU-1 through VFPU-4).]
Figure 10-6: The 970’s vector issue queues
Instead of trying to explain this point, I’ll illustrate it with an example.
Let’s look at an instruction with few group restrictions—the floating-point add.
The 970’s group formation rules dictate that the fadd can go into any of the