Authors: jon stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
The vast majority of integer instructions take one cycle to execute on the 970,
while a few more complex integer instructions can take more cycles. Note
that this one-cycle number signifies integer throughput. As a result, simple,
non-dependent integer IOPs can issue and finish at a rate of one per cycle.
Dependent integer IOPs, on the other hand, must be separated by a dead
cycle, giving a latency of two cycles. Note that this two-cycle latency applies
to operations in the same IU or in different IUs.
In the end, this unfortunate increase in latency is somewhat mitigated
by other factors, which I’ll discuss shortly. Nonetheless, the two-cycle
latency issue has turned out to have a non-trivial impact on the 970’s integer
performance.
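The difference between one-cycle throughput and two-cycle latency can be sketched in a few lines of Python. This is only an illustration of the timing described above for a single integer unit; the function name and its parameters are assumptions made for the example, not a model of the 970’s actual issue logic.

```python
def issue_schedule(n_ops, dependent, throughput_interval=1, dependent_latency=2):
    """Return the cycle in which each of n_ops simple integer IOPs issues.

    Independent IOPs issue back to back (one per cycle). In a dependent
    chain, each IOP waits for its predecessor's result, so a dead cycle
    separates consecutive IOPs.
    """
    step = dependent_latency if dependent else throughput_interval
    return [i * step for i in range(n_ops)]


# Four independent adds issue in cycles 0, 1, 2, 3.
independent = issue_schedule(4, dependent=False)

# Four adds that each consume the previous result issue in cycles 0, 2, 4, 6.
chained = issue_schedule(4, dependent=True)

print(independent, chained)
```

Under these assumptions, a long chain of dependent simple integer operations takes roughly twice as many cycles as the same number of independent ones, which is why instruction scheduling matters so much for this design.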
The CRU
I haven’t said all there is to say about the 970’s integer-processing capabilities, so the preceding summary isn’t quite complete. As I mentioned, the 970
divides up its integer resources in a slightly different way from that of the
Pentium 4 or G4e.
Some of the operations handled by the integer units on the Pentium 4
and G4e are instead handled in different places by the 970. Specifically, there’s one type of fixed-point operation normally handled by an integer unit that
in the 970’s case gets its own separate execution unit. The 970 has a dedi-
cated unit for handling logical operations related to the PPC’s condition
register: the CRU. On the G4e these condition register (CR) operations are
handled by the complex integer unit. Thus the 970, in giving these operations
their own separate unit, takes some of the processing load off of the two
integer units.
The PowerPC Condition Register
The CR is a part of the PowerPC ISA that handles some of the functions of
the x86’s processor status word (PSW), but in a more flexible and programmer-friendly way. Almost all PowerPC arithmetic operations, when they finish
executing, have the option of setting various bits (or flags) in the PPC’s 32-bit condition register as a way of recording information about their outcome
for future reference (in other words, Was the result positive or negative?
Did it overflow or underflow the register?). So you might think of the CR as a
place to store metadata for arithmetic results. Subsequent instructions, like
conditional branch instructions, can then check the CR’s flags and thereby
use that metadata to decide what to do.
Chapter 10
The flag combinations that instructions can set in the condition register
are called condition codes, and the condition register has room enough to
store up to eight separate condition codes, which can describe the outcome of
eight different instructions. A collection of PowerPC instructions lets the
programmer manipulate those condition codes by performing logical
operations (AND, OR, NOT, NOR, etc.) directly on the flags in the CR.
The CRU is the execution unit that executes those instructions.
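As a rough illustration of what a CR logical operation does, the following Python sketch models the 32-bit CR as eight 4-bit condition codes and applies an AND to two flags. The helper names are hypothetical; the layout assumed here (bit 0 as the most significant bit, with each 4-bit field holding less-than, greater-than, equal, and summary-overflow flags in that order) follows the PowerPC convention, and the summary-overflow flag is simply left clear to keep the example small.

```python
LT, GT, EQ, SO = 0, 1, 2, 3  # flag offsets within one 4-bit condition code


def cr_bit_mask(bit_index):
    """Mask for CR bit `bit_index`, counting bit 0 as the most significant."""
    return 1 << (31 - bit_index)


def get_bit(cr, bit_index):
    return (cr >> (31 - bit_index)) & 1


def set_compare_result(cr, field, a, b):
    """Record a signed comparison of a and b in condition code `field` (0-7)."""
    base = 4 * field
    for offset in range(4):          # clear the old condition code
        cr &= ~cr_bit_mask(base + offset)
    if a < b:
        cr |= cr_bit_mask(base + LT)
    elif a > b:
        cr |= cr_bit_mask(base + GT)
    else:
        cr |= cr_bit_mask(base + EQ)
    return cr & 0xFFFFFFFF


def cr_and(cr, target, src_a, src_b):
    """CR logical AND: CR[target] = CR[src_a] & CR[src_b]."""
    result = get_bit(cr, src_a) & get_bit(cr, src_b)
    cr &= ~cr_bit_mask(target)
    if result:
        cr |= cr_bit_mask(target)
    return cr & 0xFFFFFFFF


cr = set_compare_result(0, field=0, a=5, b=5)   # first compare: equal
cr = set_compare_result(cr, field=1, a=7, b=7)  # second compare: equal
# Combine the two "equal" flags into the EQ flag of condition code 7.
cr = cr_and(cr, target=4 * 7 + EQ, src_a=0 * 4 + EQ, src_b=1 * 4 + EQ)
print(hex(cr))
```

A later conditional branch could then test the single combined flag instead of testing the two comparison results separately.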
Preliminary Conclusions About the 970’s Integer Performance
To summarize, the G4e dedicates the majority of its integer hardware to
the execution of simple, one-cycle instructions, with the remainder of the
hardware dedicated to the execution of less common complex instructions.
In contrast, with two rare exceptions, all of the PPC 970’s integer hardware
can execute almost any type of integer instruction. Table 10-2 shows a break-
down of how the three types of integer operations we’ve discussed—simple,
complex, and CR logical—are handled by the G4e and the 970.
Table 10-2: Integer Operations on the PowerPC 970 and G4e

           Simple Int.   Complex Int.   CR Logical   SPR
PPC 970    IU1, IU2      IU1, IU2¹      CRU          IU1
G4e        SIU1-SIU3     CIU            CIU          CIU

¹ IU2 handles all divides on the 970.
As you can see from Table 10-2, the G4e has more integer hardware at its
disposal than the 970, and that hardware is more specialized. Also, in terms
of instruction latencies, the G4e’s integer hardware is slightly faster for
common, simple integer operations than that of the 970. Finally, as I hinted
at earlier and will develop in more detail shortly, the 970’s integer
performance, as well as its performance on other types of code, is fairly
sensitive to compiler optimization and scheduling. All of this adds up to make
32-bit integer computation
the place where the 970 looks the weakest compared to the competition.
Load-Store Units
Chapter 1 discussed the difference between instructions that access memory
(loads and stores) and instructions that do actual computation (integer
instructions, floating-point instructions, etc.). Just as integer instructions are executed in the IUs and floating-point instructions are executed in the
FPUs, memory access instructions have their own specialized execution units
in the form of one or more load-store units (LSUs).
Chapter 1 also discussed the fact that in order to access memory via a
load or a store, it’s usually necessary to perform an address calculation so the processor can figure out the location in memory that it should access. Even
though such address calculations are just simple integer operations, they’re
usually not handled by the processor’s integer hardware. Instead, all of the
processors under discussion here have dedicated address generation hardware
as part of their LSUs. Consequently, address generation takes place as part of
the execution phase of the LSU’s pipeline.
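For a sense of how small such an address calculation is, here is a minimal Python sketch of one common form: a base register value plus a sign-extended 16-bit displacement. The function names are hypothetical, and the choice of the register-plus-displacement form and a 64-bit address width are assumptions made for the example; the text above does not say which addressing forms the LSU’s address generation hardware handles.

```python
MASK64 = (1 << 64) - 1


def sign_extend_16(value):
    """Interpret the low 16 bits of `value` as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base_value, displacement):
    """Base register contents plus a signed 16-bit displacement, wrapped to 64 bits."""
    return (base_value + sign_extend_16(displacement)) & MASK64


print(hex(effective_address(0x1000, 8)))       # 8 bytes past the base
print(hex(effective_address(0x1000, 0xFFFC)))  # displacement encodes -4
```

This is a single integer add, which is why it is cheap to give the LSU its own adder rather than tie up an integer unit for every load and store.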
The G5: IBM’s PowerPC 970
The G4e has one LSU that executes all of the loads and stores for the
entire chip (integer, floating-point, and vector). As mentioned earlier, the
G4e’s LSU contains dedicated integer hardware for address calculations.
The 970 has two identical LSUs that execute all of the loads and stores for
the entire chip. This gives it literally twice the load-store hardware of the
G4e, which it needs in order to keep all the instructions in its much larger
instruction window fed with data. The 970’s load-store hardware is more
comparable
to that of the Pentium 4, which also features a larger instruction window.
Front-Side Bus
A bus is an electrical conduit that connects two components in a computer
system, allowing them to communicate and share data and/or code. If a
computer system is like a large office building, then buses are like the
phone lines that keep all the employees connected to each other and to the
outside world.
The front-side bus (FSB) is the bus that connects the computer’s CPU to the
core logic chipset. If buses are the phone lines of a computer system, then
the core logic chipset is the operator and switchboard. The core logic
chipset, or chipset for short, opens and closes bus connections between
components and routes data around the system. Figure 10-2 shows a simple
computer
system consisting of a CPU, RAM, a video card, a hard drive, and a chipset.
Figure 10-2: Core logic chipset (diagram labels: front-side bus, memory bus, AGP bus, ATA bus, core logic chipset)
As you can see from Figure 10-2, the front-side bus is the processor’s sole
means of communication with the rest of the system, so it needs to be very fast.
A processor’s front-side bus usually operates at a clock speed that is some
fraction of the core clock speed of the CPU, and the 970 is no different. On
the first release of Apple’s G5 towers, the 970’s front-side bus operates at
half the clock speed of the 970. So for a 2 GHz 970, the FSB runs at 1 GHz
DDR. (DDR stands for double data rate, which means that the bus physically
runs at 500 MHz, but data is transferred on the rising and falling edges of
the clock pulse.)
The 970 can run at other multiples of the FSB clock, including three, four,
and six times the FSB clock speed.
Because the 970’s front-side bus is composed of two unidirectional
channels, each of which is 32 bits wide, the total theoretical peak bandwidth
for a 900 MHz bus (the half-speed bus of a 1.8 GHz 970) is 7.2 GB per second.
Address and control information is
multiplexed onto the bus along with data, so IBM estimates that the bus’s total
peak bandwidth for data alone (after subtracting the bandwidth used for
address and control information) is somewhere around 6.4 GB per second.
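The peak figure is straightforward arithmetic: two channels, four bytes per transfer on each, and 900 million transfers per second. The short Python sketch below reproduces it. The function name is hypothetical, and a gigabyte is taken to mean 10^9 bytes, which is the assumption under which the numbers above work out.

```python
def peak_bandwidth_gb_per_s(transfers_per_second, channel_width_bits=32, channels=2):
    """Theoretical peak bandwidth, in GB/s (10**9 bytes), summed over all channels."""
    bytes_per_transfer = channel_width_bits // 8
    return channels * bytes_per_transfer * transfers_per_second / 1e9


total = peak_bandwidth_gb_per_s(900e6)  # both 32-bit channels of a 900 MHz bus
data_only = 6.4                         # IBM's estimate for data alone, from the text
data_fraction = data_only / total       # share of peak bandwidth left for data

print(f"peak: {total:.1f} GB/s, data share: {data_fraction:.0%}")
```

By the same arithmetic, the 1 GHz bus paired with a 2 GHz 970 would have a theoretical peak of 8 GB per second, and IBM’s data-only estimate for the 900 MHz bus implies that roughly a ninth of the peak bandwidth goes to address and control traffic.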
In the end, this high-bandwidth FSB is one of the 970’s largest perfor-
mance advantages over its competitors. The 970’s two LSUs place high
demands on the FSB in order to keep the 970’s large instruction window
full and its wide back end fed with data. When coupled with a high-
bandwidth memory subsystem like dual-channel DDR400, the 970’s fast
FSB and dual LSUs make it a great media workstation platform.
The Floating-Point Units
The G4e’s very straightforward floating-point implementation has a single
FPU that takes a minimum of five cycles to finish executing the fastest floating-point instructions. (Some instructions take many more cycles.) The FPU is
served by 48 microarchitectural floating-point registers (32 registers for the
PPC ISA and 16 additional rename registers). Finally, single- and double-
precision floating-point operations take the same amount of time.
The 970’s floating-point implementation is almost exactly like the G4e’s,
except there’s twice as much hardware. The 970 has two identical FPUs, each
of which can execute the fastest floating-point instructions (like the fadd) in
six cycles. As with the G4e, single- and double-precision operations take the
same amount of time to execute. The 970’s two FPUs are fully pipelined for
all operations except floating-point divides, which are very costly in terms of
cycles and stall both FPUs. And finally, the 970 has a larger number of FPRs:
80 total microarchitectural registers, where 32 are PowerPC architectural
registers and the remaining 48 are rename registers.
Before moving on, I should note one peculiarity in the way that the
970 handles floating-point instructions. Some floating-point instructions—
particularly the fused multiply-add series of instructions—are not translated
into IOPs by the decode hardware, but instead are executed directly by the
FPU. The reason for this is fairly straightforward.
Recall from Chapter 4 that the amount of die space taken up by the reg-
ister file increases approximately with the square of the number of register
file ports. The vast majority of PowerPC instructions specify at most two source registers and one destination register, which means that they use at most two
register file read ports and one register file write port. However, there is a
handful of PowerPC instructions that require more ports. In order to keep
the size of the 970’s register files to a minimum, the 970’s designers opted
not to add more ports to the register files in order to accommodate this small
number of instructions. Instead, the 970 has two ways of keeping this small
group of instructions from stalling due to the structural hazards that might
be brought on by their larger-than-average port requirements:
• Non–floating-point instructions that require more than three ports total
  are dealt with at the decode stage. All of the 970’s IOPs are restricted to
  read at most two registers and write at most one register, so instructions
  that do not obey this restriction are cracked into multiple IOPs that do
  obey it.
• There are a few types of floating-point instructions that do not fit the
  three-port requirement, the most common of which is the fused
  multiply-add series of instructions. For performance reasons, cracking a
  floating-point fused multiply-add instruction into multiple IOPs is neither
  necessary nor desirable. Instead, such instructions are simply passed to
  the FPU as decoded PowerPC instructions (not as IOPs), where they
  execute by accessing the register file on more than one cycle. Since these
  instructions take multiple cycles to execute anyway, the extra register file
  read and/or write cycles are simply overlapped with computation cycles
  so that they don’t add to the instruction’s latency.
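The two cases above amount to a simple decision based on how many registers an instruction reads and writes. The Python sketch below captures that decision. It is a simplified illustration of the rule as described here, not the 970’s decode logic: the class, function, and return strings are hypothetical, and `int_op_3src` is a made-up instruction used only to exercise the non-floating-point case.

```python
from dataclasses import dataclass

MAX_READS = 2   # register file read ports available to one IOP
MAX_WRITES = 1  # register file write ports available to one IOP


@dataclass
class Instruction:
    name: str
    reads: int            # source registers
    writes: int           # destination registers
    floating_point: bool


def handling(instr):
    """Classify an instruction by its register file port requirements."""
    if instr.reads <= MAX_READS and instr.writes <= MAX_WRITES:
        return "single IOP"
    if instr.floating_point:
        # Extra register file accesses are overlapped with execution cycles.
        return "passed to FPU as a PowerPC instruction"
    return "cracked into multiple IOPs"


examples = [
    Instruction("add", reads=2, writes=1, floating_point=False),
    Instruction("fmadd", reads=3, writes=1, floating_point=True),
    Instruction("int_op_3src", reads=3, writes=1, floating_point=False),  # hypothetical
]
for instr in examples:
    print(f"{instr.name}: {handling(instr)}")
```

The fused multiply-add reads three source registers (two multiplicands and an addend), which is exactly one more than the two read ports an IOP is allowed to use.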
As the preceding discussion indicates, the 970’s floating-point hardware
compares quite favorably with that of the G4e. Twice the hardware does not
necessarily equal twice the performance, but it’s clear that the 970 performs
significantly better, clock for clock, on floating-point code. This performance
advantage is not only due to the doubled amount of floating-point hardware,
but also to the 970’s longer pipeline and clock-speed advantage. Furthermore,
the 970’s fast front-side bus (effectively half the core clock speed), when
coupled with a high-bandwidth memory subsystem, gives it a distinct advan-
tage over the G4e in bandwidth-intensive floating-point code. Note that this
last point also applies to vector code, but more on that in a moment.
Vector Computing on the PowerPC 970
The G4e’s AltiVec implementation is the strongest part of its design. With