Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (86 page)

Authors: jon stokes


The vast majority of integer instructions take one cycle to execute on the 970,

while a few more complex integer instructions can take more cycles. Note

that this one-cycle number signifies integer throughput. As a result, simple,

non-dependent integer IOPs can issue and finish at a rate of one per cycle.

Dependent integer IOPs, on the other hand, must be separated by a dead

cycle, giving a latency of two cycles. Note that this two-cycle latency applies

to operations in the same IU or in different IUs.

In the end, this unfortunate increase in latency is somewhat mitigated

by other factors, which I’ll discuss shortly. Nonetheless, the two-cycle

latency issue has turned out to have a non-trivial impact on the 970’s integer

performance.
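
To make the difference between throughput and latency concrete, here is a minimal sketch of a single IU under the timing just described. The function name and the one-IU simplification are my own assumptions for illustration; the model ignores dispatch groups, the second IU, and everything else in the pipeline.

```python
def cycles_to_finish(num_ops, dependent):
    """Cycles needed for one IU to finish num_ops simple integer IOPs.

    Toy model: an independent IOP can issue every cycle, while a dependent
    IOP must wait out one dead cycle after the IOP it depends on.
    """
    if num_ops <= 0:
        return 0
    issue_gap = 2 if dependent else 1        # cycles between successive issues
    last_issue = (num_ops - 1) * issue_gap   # cycle on which the final IOP issues
    return last_issue + 1                    # a simple IOP executes in one cycle


print(cycles_to_finish(4, dependent=False))  # 4
print(cycles_to_finish(4, dependent=True))   # 7
```

In this model, a chain of four dependent IOPs takes nearly twice as long as four independent ones, which is why instruction scheduling matters so much for the 970's integer code.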

The CRU

I haven’t said all there is to say about the 970’s integer-processing capabilities, so the preceding summary isn’t quite complete. As I mentioned, the 970

divides up its integer resources in a slightly different way from that of the

Pentium 4 or G4e.

Some of the operations handled by the integer units on the Pentium 4

and G4e are instead handled in different places by the 970. Specifically, there’s one type of fixed-point operation normally handled by an integer unit that

in the 970’s case gets its own separate execution unit. The 970 has a dedicated

unit for handling logical operations related to the PPC’s condition

register: the CRU. On the G4e these condition register (CR) operations are

handled by the complex integer unit. Thus the 970, in giving these operations

their own separate unit, takes some of the processing load off of the two

integer units.

The PowerPC Condition Register

The CR is a part of the PowerPC ISA that handles some of the functions of

the x86’s processor status word (PSW), but in a more flexible and programmer-friendly way. Almost all PowerPC arithmetic operations, when they finish

executing, have the option of setting various bits (or flags) in the PPC’s 32-bit condition register as a way of recording information about their outcome

for future reference (in other words, Was the result positive or negative?

Did it overflow or underflow the register?). So you might think of the CR as a

place to store metadata for arithmetic results. Subsequent instructions, like

conditional branch instructions, can then check the CR’s flags and thereby

use that metadata to decide what to do.

Chapter 10

The flag combinations that instructions can set in the condition register

are called condition codes, and the condition register has room enough to store up to eight separate condition codes, which can describe the outcome of eight

different instructions. A collection of PowerPC instructions lets the

programmer manipulate those condition codes by performing logical

operations (AND, OR, NOT, NOR, etc.) directly on the flags in the CR.

The CRU is the execution unit that executes those instructions.
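
The following is a rough sketch of the idea, not a model of the real hardware. It treats the 32-bit CR as eight 4-bit fields, each holding the PowerPC's LT, GT, EQ, and SO flags, and shows one CR logical operation combining flags from two fields. The helper names (set_field, cr_and, bit_index) are hypothetical, and the overflow handling is simplified: here the SO flag is just passed in as an argument.

```python
FLAGS = ("LT", "GT", "EQ", "SO")   # flag order within each 4-bit CR field


def bit_index(field, flag):
    """Position of a flag in the 32-bit CR (field 0 occupies bits 0-3)."""
    return 4 * field + FLAGS.index(flag)


def set_field(cr, field, result, overflow=False):
    """Return a copy of cr whose field (0-7) records the sign of result."""
    new_cr = list(cr)
    flags = [result < 0, result > 0, result == 0, overflow]
    new_cr[4 * field:4 * field + 4] = [int(f) for f in flags]
    return new_cr


def cr_and(cr, target, a, b):
    """CR logical AND: write (bit a AND bit b) into bit target."""
    new_cr = list(cr)
    new_cr[target] = cr[a] & cr[b]
    return new_cr


cr = [0] * 32
cr = set_field(cr, 0, -5)   # field 0 now says "negative"
cr = set_field(cr, 2, 0)    # field 2 now says "zero"

# Combine the two outcomes into a single flag that a later branch could test.
target = bit_index(7, "EQ")
cr = cr_and(cr, target, bit_index(0, "LT"), bit_index(2, "EQ"))
print(cr[target])           # 1
```

The point of the sketch is that two separate outcomes can be merged into one flag without touching the integer registers, which is the kind of work the CRU takes off the integer units.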

Preliminary Conclusions About the 970’s Integer Performance

To summarize, the G4e dedicates the majority of its integer hardware to

the execution of simple, one-cycle instructions, with the remainder of the

hardware dedicated to the execution of less common complex instructions.

In contrast, with two rare exceptions, all of the PPC 970’s integer hardware

can execute almost any type of integer instruction. Table 10-2 shows a breakdown

of how the three types of integer operations we’ve discussed (simple,

complex, and CR logical) are handled by the G4e and the 970.

Table 10-2: Integer Operations on the PowerPC 970 and G4e

           Simple Int.    Complex Int.    CR Logical    SPR
PPC 970    IU1, IU2       IU1, IU2 (1)    CRU           IU1
G4e        SIU1-SIU3      CIU             CIU           CIU

(1) IU2 handles all divides on the 970.

As you can see from Table 10-2, the G4e has more and more specialized

integer hardware at its disposal than the 970. Also, in terms of instruction

latencies, the G4e’s integer hardware is slightly faster for common, simple

integer operations than that of the 970. Finally, as I hinted at earlier and will develop in more detail shortly, the 970’s integer performance, as well as its

performance on other types of code, is fairly sensitive to compiler optimization

and scheduling. All of this adds up to make 32-bit integer computation

the place where the 970 looks the weakest compared to the competition.

Load-Store Units

Chapter 1 discussed the difference between instructions that access memory

(loads and stores) and instructions that do actual computation (integer

instructions, floating-point instructions, etc.). Just as integer instructions are executed in the IUs and floating-point instructions are executed in the

FPUs, memory access instructions have their own specialized execution units

in the form of one or more load-store units (LSUs).

Chapter 1 also discussed the fact that in order to access memory via a

load or a store, it’s usually necessary to perform an address calculation so the processor can figure out the location in memory that it should access. Even

though such address calculations are just simple integer operations, they’re

usually not handled by the processor’s integer hardware. Instead, all of the

processors under discussion here have dedicated address generation hardware

as part of their LSUs. Consequently, address generation takes place as part of

the execution phase of the LSU’s pipeline.
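
As a small illustration of the kind of integer work an LSU's address generation hardware does, the sketch below computes an effective address for a register-plus-displacement access, where a 16-bit signed displacement is added to a base register value. The function names are hypothetical, and the 64-bit wraparound is an assumption chosen to match a 64-bit processor like the 970.

```python
MASK64 = (1 << 64) - 1   # assume 64-bit addresses


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed displacement."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base, displacement):
    """Base register contents plus a sign-extended 16-bit displacement."""
    return (base + sign_extend16(displacement)) & MASK64


print(hex(effective_address(0x1000, 8)))       # 0x1008
print(hex(effective_address(0x1000, 0xFFFC)))  # 0xffc (displacement of -4)
```

The calculation is only an add, but doing it inside the LSU means loads and stores never have to compete with ordinary integer instructions for an IU.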

The G5: IBM’s PowerPC 970

The G4e has one LSU that executes all of the loads and stores for the

entire chip (integer, floating-point, and vector). As mentioned earlier, the

G4e’s LSU contains dedicated integer hardware for address calculations.

The 970 has two identical LSUs that execute all of the loads and stores for the

entire chip. This gives it literally twice the load-store hardware of the G4e,

which it needs in order to keep all the instructions in its much larger instruction

window fed with data. The 970’s load-store hardware is more comparable

to that of the Pentium 4, which also features a larger instruction window.

Front-Side Bus

A bus is an electrical conduit that connects two components in a computer system, allowing them to communicate and share data and/or code. If a computer system is like a large office building, then buses are like the phone lines that keep all the employees connected to each other and to the outside world.

The front-side bus (FSB) is the bus that connects the computer’s CPU to the core logic chipset. If buses are the phone lines of a computer system, then

the core logic chipset is the operator and switchboard. The core logic chipset, or chipset for short, opens and closes bus connections between components and routes data around the system. Figure 10-2 shows a simple computer

system consisting of a CPU, RAM, a video card, a hard drive, and a chipset.

Figure 10-2: Core logic chipset (diagram labels: Core Logic Chipset, Front-Side Bus, Memory Bus, AGP Bus, ATA Bus)

As you can see from Figure 10-2, the front-side bus is the processor’s sole

means of communication with the rest of the system, so it needs to be very fast.

A processor’s front-side bus usually operates at a clock speed that is some

fraction of the core clock speed of the CPU, and the 970 is no different. On the first release of Apple’s G5 towers, the 970’s front-side bus operates at half the clock speed of the 970. So for a 2 GHz 970, the FSB runs at 1 GHz DDR. (DDR stands for double data rate, which means that the bus physically runs at 500 MHz, but data is transferred on both the rising and falling edges of the clock pulse.)

The 970 can run at other multiples of the FSB clock, including three, four,

and six times the FSB clock speed.

Because the 970’s front-side bus is composed of two unidirectional

channels, each of which is 32 bits wide, the total theoretical peak bandwidth

for a 900 MHz bus (half the core clock of a 1.8 GHz 970) is 7.2GB per second. Address and control information is

multiplexed onto the bus along with data, so IBM estimates that the bus’s total

peak bandwidth for data alone (after subtracting the bandwidth used for

address and control information) is somewhere around 6.4GB per second.
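
The arithmetic behind these figures is easy to check. The sketch below recomputes the effective bus speed from the core clock and the peak bandwidth from the channel layout. The function names are my own, and I'm assuming that "GB" here means 10^9 bytes.

```python
def fsb_effective_mhz(core_mhz, ratio):
    """Effective (DDR) bus speed for a given core clock and core:bus ratio."""
    return core_mhz / ratio


def peak_bandwidth_gb_per_s(effective_mhz, channels=2, bits_per_channel=32):
    """Peak transfer rate across all unidirectional channels, in GB/s."""
    bytes_per_transfer = channels * bits_per_channel // 8
    return effective_mhz * 1e6 * bytes_per_transfer / 1e9


print(fsb_effective_mhz(2000, 2))     # 1000.0 MHz effective, 500 MHz physical
print(peak_bandwidth_gb_per_s(900))   # 7.2

# Share of the 7.2GB peak that the quoted 6.4GB data-only estimate implies
# is spent on address and control traffic (about 11 percent).
overhead = 1 - 6.4 / peak_bandwidth_gb_per_s(900)
print(round(overhead, 3))
```

Note that the 7.2GB figure counts both directions at once; either channel on its own can move at most half of that.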

In the end, this high-bandwidth FSB is one of the 970’s largest performance

advantages over its competitors. The 970’s two LSUs place high

demands on the FSB in order to keep the 970’s large instruction window

full and its wide back end fed with data. When coupled with a high-bandwidth

memory subsystem like dual-channel DDR400, the 970’s fast

FSB and dual LSUs make it a great media workstation platform.

The Floating-Point Units

The G4e’s very straightforward floating-point implementation has a single

FPU that takes a minimum of five cycles to finish executing the fastest floating-point instructions. (Some instructions take many more cycles.) The FPU is

served by 48 microarchitectural floating-point registers (32 registers for the

PPC ISA and 16 additional rename registers). Finally, single- and double-precision

floating-point operations take the same amount of time.

The 970’s floating-point implementation is almost exactly like the G4e’s,

except there’s twice as much hardware. The 970 has two identical FPUs, each

of which can execute the fastest floating-point instructions (like the fadd) in

six cycles. As with the G4e, single- and double-precision operations take the

same amount of time to execute. The 970’s two FPUs are fully pipelined for

all operations except floating-point divides, which are very costly in terms of

cycles and stall both FPUs. And finally, the 970 has a larger number of FPRs:

80 total microarchitectural registers, where 32 are PowerPC architectural

registers and the remaining 48 are rename registers.
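
To get a feel for what the extra FPU buys, here is a toy calculation using the latencies quoted above. It assumes a stream of completely independent floating-point adds, that every FPU is fully pipelined and can start a new operation each cycle, and that nothing else in the processor gets in the way. Those are my simplifying assumptions, not measurements, and the function name is hypothetical.

```python
import math


def cycles_for_independent_fp_ops(num_ops, num_fpus, latency):
    """Cycles to finish num_ops independent FP ops on pipelined FPUs."""
    if num_ops <= 0:
        return 0
    ops_per_fpu = math.ceil(num_ops / num_fpus)   # ops queued on the busiest FPU
    return (ops_per_fpu - 1) + latency            # last issue cycle plus latency


# 100 independent adds: two six-cycle FPUs versus one five-cycle FPU.
print(cycles_for_independent_fp_ops(100, num_fpus=2, latency=6))  # 55
print(cycles_for_independent_fp_ops(100, num_fpus=1, latency=5))  # 104
```

Under these assumptions the extra cycle of latency barely matters for long runs of independent operations, while the second FPU nearly halves the total. Real code with dependent operations or divides will see a smaller gain.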

Before moving on, I should note one peculiarity in the way that the

970 handles floating-point instructions. Some floating-point instructions—

particularly the fused multiply-add series of instructions—are not translated

into IOPs by the decode hardware, but instead are executed directly by the

FPU. The reason for this is fairly straightforward.

Recall from Chapter 4 that the amount of die space taken up by the register

file increases approximately with the square of the number of register

file ports. The vast majority of PowerPC instructions specify at most two source registers and one destination register, which means that they use at most two

register file read ports and one register file write port. However, there is a

handful of PowerPC instructions that require more ports. In order to keep

the size of the 970’s register files to a minimum, the 970’s designers opted

not to add more ports to the register files in order to accommodate this small

number of instructions. Instead, the 970 has two ways of keeping this small

group of instructions from stalling due to the structural hazards that might

be brought on by their larger-than-average port requirements:

• Non-floating-point instructions that require more than three ports total are dealt with at the decode stage. All of the 970’s IOPs are restricted to read at most two registers and write at most one register, so instructions that do not obey this restriction are cracked into multiple IOPs that do obey it.

• There are a few types of floating-point instructions that do not fit the three-port requirement, the most common of which is the fused multiply-add series of instructions. For performance reasons, cracking a floating-point fused multiply-add instruction into multiple IOPs is neither necessary nor desirable. Instead, such instructions are simply passed to the FPU as decoded PowerPC instructions (not as IOPs), where they execute by accessing the register file on more than one cycle. Since these instructions take multiple cycles to execute anyway, the extra register file read and/or write cycles are simply overlapped with computation cycles so that they don’t add to the instruction’s latency.
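
The two rules above, along with the port-count argument that motivates them, can be summarized in a short sketch. The function names and the returned strings are hypothetical, and the area function encodes only the approximate square-law relationship mentioned earlier, not real die measurements.

```python
MAX_READ_PORTS = 2    # an IOP may read at most two registers
MAX_WRITE_PORTS = 1   # an IOP may write at most one register


def relative_register_file_area(ports):
    """Approximate die area, growing with the square of the port count."""
    return ports ** 2


def handling(reads, writes, is_floating_point):
    """How an instruction with the given register needs would be handled."""
    if reads <= MAX_READ_PORTS and writes <= MAX_WRITE_PORTS:
        return "single IOP"
    if is_floating_point:
        return "sent to FPU whole; extra register file cycles overlap execution"
    return "cracked into multiple IOPs"


# A fused multiply-add reads three source registers and writes one.
print(handling(reads=3, writes=1, is_floating_point=True))
# A hypothetical integer instruction that writes two registers.
print(handling(reads=2, writes=2, is_floating_point=False))

# Rough cost of adding a fourth port instead: about 1.78 times the area.
print(relative_register_file_area(4) / relative_register_file_area(3))
```

Seen this way, the designers' choice is a trade: a little extra decode and FPU complexity for a small group of instructions, in exchange for avoiding a register file that would be nearly 80 percent larger under the square-law approximation.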

As the preceding discussion indicates, the 970’s floating-point hardware

compares quite favorably with that of the G4e. Twice the hardware does not

necessarily equal twice the performance, but it’s clear that the 970 performs

significantly better, clock for clock, on floating-point code. This performance

advantage is not only due to the doubled amount of floating-point hardware,

but also to the 970’s longer pipeline and clock-speed advantage. Furthermore,

the 970’s fast front-side bus (effectively half the core clock speed), when

coupled with a high-bandwidth memory subsystem, gives it a distinct advantage

over the G4e in bandwidth-intensive floating-point code. Note that this

last point also applies to vector code, but more on that in a moment.

Vector Computing on the PowerPC 970

The G4e’s AltiVec implementation is the strongest part of its design. With
