Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (87 page)

Authors: Jon Stokes

Tags: #Computers, #Systems Architecture, #General, #Microprocessors

four fully pipelined vector processing units, it has plenty of hardware to go around. As a brief recap from the last chapter, here’s a breakdown of the G4e’s four AltiVec units:

vector permute unit (VPU)

vector simple integer unit (VSIU)

Chapter 10

vector complex integer unit (VCIU)

vector floating-point unit (VFPU)

All of this vector execution hardware is tied to a generous register file that consists of thirty-two 128-bit architectural registers and sixteen additional vector rename registers. Furthermore, each of the units is attached to a four-entry vector issue queue that can issue two vector ops per cycle to any of the four vector units.

NOTE

The SIMD instruction set known as AltiVec was codeveloped by both IBM and Motorola,
and it is co-owned by both of them. Motorola has a trademark on the AltiVec name, so
in most of IBM’s literature (but not all), the instruction set is referred to as VMX, presumably for Vector Multimedia Extensions. VMX and AltiVec are therefore two different
names for the same group of 162 vector instructions added to the PowerPC ISA. This
chapter uses the name AltiVec for both the G4e and the 970.

The 970’s AltiVec implementation looks pretty much like the original G4’s (the MPC7400), except for the addition of issue queues for dynamic execution. The 970 has the following units:

vector permute unit (VPU)

vector arithmetic logic unit (VALU), which contains three subunits:

    vector simple integer unit (VSIU)

    vector complex integer unit (VCIU)

    vector floating-point unit (VFPU)

These units are attached to 80 vector registers: 32 architectural registers and 48 rename registers.

Notice that the vector execution units listed here are essentially the same units as in the G4e, but they’re grouped differently. This grouping makes all the difference. Take a look at the simplified diagram in Figure 10-3.

The 970 can dispatch up to four vector IOPs per cycle total to its two vector logical issue queues: two IOPs per cycle in program order to the 16-entry VPU logical issue queue and two IOPs per cycle in program order to the 20-entry VALU logical issue queue. (Each of these logical issue queues consists physically of a pair of interleaved queues that operate together as a single queue, but we’ll talk more about how the logical queues are actually implemented in a moment.) Each of the two logical queues can then issue one vector operation per cycle to any of the units attached to it. This means that the 970 can issue one IOP per cycle to the VPU and one IOP per cycle to any of the VALU’s three subunits.

As you can see, if you place the 970’s vector unit and the G4e’s vector unit side by side, the 970 and the G4e can issue the same number of vector instructions per cycle to their AltiVec units, but the 970 is much more limited in the combinations of instructions it can issue, because one of its two instructions must be a vector permute. For instance, the G4e would have no problem issuing both a complex integer instruction and a simple integer instruction to its VCIU and VSIU in the same cycle, whereas the 970 would only be able to issue one of these in a cycle.
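The difference in issue flexibility can be made concrete with a toy model. The following Python sketch is only an illustration of the per-cycle rules described above, not a description of either chip's real scheduling logic. The function and dictionary names are hypothetical, and the sketch assumes that two ops issued in the same cycle on the G4e must target different units.

```python
# Toy model of the per-cycle vector issue rules described in the text.
# Unit names come from the chapter; all function names are hypothetical.

G4E_UNITS = {"VPU", "VSIU", "VCIU", "VFPU"}

# On the 970 the VPU has its own logical issue queue, and the three VALU
# subunits share a second one. Each logical queue issues one IOP per cycle.
PPC970_QUEUE_OF = {
    "VPU": "vpu_queue",
    "VSIU": "valu_queue",
    "VCIU": "valu_queue",
    "VFPU": "valu_queue",
}


def g4e_can_issue_pair(unit_a, unit_b):
    """G4e: two vector ops per cycle to any of the four units.

    Assumption (not stated in the text): the two ops target different units.
    """
    return unit_a in G4E_UNITS and unit_b in G4E_UNITS and unit_a != unit_b


def ppc970_can_issue_pair(unit_a, unit_b):
    """970: one op per logical queue per cycle, so a pair must use both queues."""
    return PPC970_QUEUE_OF[unit_a] != PPC970_QUEUE_OF[unit_b]


if __name__ == "__main__":
    for pair in [("VCIU", "VSIU"), ("VPU", "VFPU")]:
        print(pair, "G4e:", g4e_can_issue_pair(*pair),
              "970:", ppc970_can_issue_pair(*pair))
```

Under these assumptions, the complex-integer and simple-integer pair from the example is accepted by the G4e model and rejected by the 970 model, while any pair that includes a permute is accepted by both.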

The G5: IBM’s PowerPC 970

[Figure 10-3 diagram. Labels: up to 4 IOPs per cycle (total) enter the vector issue queues; 1 IOP per cycle issues from a queue; the vector permute unit has stages VPU-1 and VPU-2; the vector ALU contains VSIU-1, VCIU-1 through VCIU-4, and VFPU-1 through VFPU-4.]

Figure 10-3: The 970’s vector unit

Table 10-3 replicates a chart from Apple that compares the AltiVec execution unit latencies for the two G4s and the 970 (see footnote 2).

Table 10-3: Vector Instruction Latencies for the G4, G4e, and 970

Hardware Unit                         7400/7410    7450/7455    PPC 970
Vector simple integer unit (VSIU)
Vector complex integer unit (VCIU)
Vector floating-point unit (VFPU)     4 (5)²       4 (5)²,³
Vector permute unit (VPU)

1 An extra cycle latency is required if the data is next used in VPU.

2 The VFPU takes an extra cycle if Java mode is turned on. (It is off by default on Mac OS X, but on by default on Mac OS 9.)

3 Some FP-related VSIU instructions were moved to the VFPU for later G4. These only take two cycles instead of the usual 4 (5).

4 An extra cycle latency is required if the data is next used in VCIU/VSIU/VFPU.

Notice that the 970’s vector instruction latencies are close to those in the G4e, which is an excellent sign given the fact that the 970’s pipeline is longer and its issue queues are much deeper (18 instructions on the 970 versus 4 instructions on the G4e). The 970’s larger instruction window and deeper issue queues allow the processor to look farther ahead in the instruction stream and extract more instruction-level parallelism (ILP) from the vector instruction stream. This means that the 970’s vastly superior dynamic scheduling hardware can squeeze more performance out of slightly inferior vector execution hardware, a capability that, when coupled with a high-bandwidth memory subsystem and an ultrafast front-side bus, enables its vector performance to exceed that of the G4e.

2 This is a truncated version of Apple’s AltiVec Instruction Cross-Reference (http://developer.apple.com/hardware/ve/instruction_crossref.html).

Floating-Point Issue Queues

Figure 10-1 shows the five dispatch slots connected to issue queues for each of the functional units or groups of functional units. Figure 10-1 is actually oversimplified, since the true relationship between the dispatch slots, issue queues, and functional units is more complicated than depicted there. The actual physical issue queue configuration is a bit hard to explain in words, so Figure 10-4 shows how the floating-point issue queues really look.

[Figure 10-4 diagram. Labels: logical issue queues Queue 1 and Queue 2, feeding floating-point units FPU1 and FPU2.]

Figure 10-4: The 970’s floating-point issue queues

Each of the floating-point execution units is fed by what I’ve called a logical issue queue. As you can see in Figure 10-4, each 10-entry logical issue queue actually consists of an interleaved pair of five-entry physical issue queues, which together feed a single floating-point execution unit.

Figure 10-4 also shows how the four non-branch dispatch slots each feed a specific physical issue queue. Floating-point IOPs that dispatch from slots 0 and 3 go into the tops of the two physical issue queues that are attached to them. Similarly, floating-point IOPs that dispatch from slots 1 and 2 go into the tops of their two attached physical issue queues. As IOPs issue from different places in the physical queues, the IOPs “above” them in the queue “fall down” to fill in the gaps.

Each pair of physical issue queues (the pair attached to slots 0 and 3, and the pair attached to slots 1 and 2) is interleaved and functions as a single “logical” issue queue. An IOP can issue from a logical issue queue to an execution unit when all of its sources are set, and the oldest IOP with all of its sources set in a logical queue is issued to the attached execution unit. This means that, as is the case with the G4e’s issue queues, instructions are issued in program order from within each logical issue queue, but out of program order with respect to the overall code stream.
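The selection rule just described (pick the oldest IOP whose sources are all set) can be sketched in a few lines of Python. This is a minimal illustration that follows the stated rule literally. The `IOP` class, its fields, and the function names are hypothetical, and the sketch models one logical queue as a simple list rather than as a pair of interleaved physical queues.

```python
from dataclasses import dataclass, field


@dataclass
class IOP:
    name: str
    age: int                                    # program-order position; lower is older
    sources: set = field(default_factory=set)   # registers this IOP reads


def select_for_issue(logical_queue, ready_registers):
    """Return the oldest IOP whose sources are all ready, or None if none is."""
    ready = [iop for iop in logical_queue if iop.sources <= ready_registers]
    return min(ready, key=lambda iop: iop.age, default=None)


def issue_cycle(logical_queue, ready_registers):
    """Issue at most one IOP and remove it from the queue.

    Removing the entry stands in for the remaining IOPs "falling down"
    to fill the gap.
    """
    chosen = select_for_issue(logical_queue, ready_registers)
    if chosen is not None:
        logical_queue.remove(chosen)
    return chosen


if __name__ == "__main__":
    queue = [IOP("fadd", 0, {"f1", "f2"}), IOP("fmul", 1, {"f3"})]
    issued = issue_cycle(queue, {"f1", "f2", "f3"})
    print(issued.name, [iop.name for iop in queue])
```

Because each logical queue runs this selection on its own, two queues working through the same code stream can issue their IOPs in an interleaving that differs from program order.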

NOTE

The term logical issue queue is one that I’ve coined for the purposes of this chapter, and is not to my knowledge used in IBM’s literature. IBM prefers the phrase common interleaved queues.

It’s important for me to emphasize that individual IOPs issue from their respective logical issue queues completely independently of their dispatch group. So the execution units’ schedulers are blind to which group an IOP is in when it comes time to schedule the IOP and issue it to an execution unit. Thus the dispatch groups are allowed to break apart after they reach the level of the issue queue, and they’re reassembled after execution in the group completion table (GCT).

Integer and Load-Store Issue Queues

The integer and load-store execution units are fed by issue queues in a similar manner to the floating-point units, but they’re slightly more complex because they share issue queues. Take a look at Figure 10-5 to see how this works.

As with the floating-point issue queues, integer or memory access IOPs in dispatch slots 0 and 3 go into the two integer physical issue queues that are specifically intended for them. The twist is that this pair of queues is shared by two execution units: IU1 and LSU1. So integer or memory IOPs from slots 0 and 3 are sent into the appropriate issue queue pair (or logical issue queue), and these IOPs then issue to either IU1 or LSU1 as the integer scheduler sees fit. Similarly, integer or memory access IOPs from dispatch slots 1 and 2 go into their respective physical queues, both of which feed IU2 and LSU2. As with the floating-point issue queues, these four 9-entry interleaved queues should be thought of as being grouped into two 18-entry logical issue queues, where each logical issue queue works together with the appropriate scheduler to feed a pair of execution units.
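The slot-to-queue relationship for integer and memory access IOPs can be written down as a small lookup table. The Python sketch below is only an illustration of the routing described above. The names are hypothetical, and it assumes that an integer IOP ends up on the integer unit and a memory access IOP on the load-store unit of whichever pair its logical queue feeds.

```python
# Dispatch slot -> logical issue queue for integer and memory access IOPs.
# Slot 4 is the branch slot and never feeds these queues.
SLOT_TO_LOGICAL_QUEUE = {0: 1, 3: 1, 1: 2, 2: 2}

# Execution units fed by each logical issue queue: (integer unit, load-store unit).
UNITS_FOR_QUEUE = {1: ("IU1", "LSU1"), 2: ("IU2", "LSU2")}


def route(slot, kind):
    """Return the execution unit an IOP dispatched in `slot` would issue to.

    `kind` is "integer" or "memory". Mapping integer IOPs to the IU and
    memory IOPs to the LSU is an assumption made for this sketch.
    """
    if slot not in SLOT_TO_LOGICAL_QUEUE:
        raise ValueError(f"slot {slot} does not feed an integer/LSU queue")
    integer_unit, load_store_unit = UNITS_FOR_QUEUE[SLOT_TO_LOGICAL_QUEUE[slot]]
    if kind == "integer":
        return integer_unit
    if kind == "memory":
        return load_store_unit
    raise ValueError(f"unknown IOP kind: {kind!r}")


if __name__ == "__main__":
    for slot in range(4):
        print(slot, route(slot, "integer"), route(slot, "memory"))
```

The table makes the practical consequence visible: an integer IOP placed in slot 0 or 3 can only ever execute on IU1, and one placed in slot 1 or 2 can only ever execute on IU2.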

[Figure 10-5 diagram. Labels: integer/LSU logical issue queues Queue 1 and Queue 2; integer units IU1 and IU2; load-store units LSU1 and LSU2.]

Figure 10-5: The 970’s IU and LSU issue queues

BU and CRU Issue Queues

The branch unit and condition register unit issue queues work in a similar manner to what I’ve described previously, with the differences depending on grouping and issue restrictions. So the CRU has a single 10-entry logical issue queue composed of two issue queues with five entries each, one for slot 0 and one for slot 1 (because CR IOPs can be placed only in slots 0 and 1). The branch execution unit has a single logical issue queue composed of one 12-entry issue queue for slot 4 (because branch IOPs can go only in slot 4).

Vector Issue Queues

The vector issue queues are laid out slightly differently from the other issue queues. The vector issue queue configuration is depicted in Figure 10-6.

The vector permute unit is fed from one 16-entry (four entries × four queues) logical issue queue connected to all four non-branch dispatch slots, and the vector ALU is fed from a 20-entry (five entries × four queues) logical issue queue that’s also connected to all four non-branch dispatch slots. As with all of the other issue queue pairs, one IOP per cycle can issue in order from each logical issue queue to any of the execution units that are attached to it.
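As a quick check on the numbers, the capacity of each logical issue queue is just the number of physical queues multiplied by the entries in each. The short Python sketch below tabulates the figures quoted in this chapter; the dictionary keys and the function name are hypothetical labels chosen for the illustration.

```python
# (number of physical queues, entries per physical queue) for each logical
# issue queue, using the figures quoted in this chapter.
LOGICAL_QUEUES = {
    "FPU (each of two)": (2, 5),
    "IU/LSU (each of two)": (2, 9),
    "CRU": (2, 5),
    "BU": (1, 12),
    "VPU": (4, 4),
    "VALU": (4, 5),
}


def logical_capacity(name):
    """Total entries in a logical issue queue: physical queues x entries each."""
    physical_queues, entries_each = LOGICAL_QUEUES[name]
    return physical_queues * entries_each


if __name__ == "__main__":
    for name in LOGICAL_QUEUES:
        print(f"{name}: {logical_capacity(name)} entries")
```

The products match the sizes given in the text: 10 entries for each floating-point queue, 18 for each integer/load-store queue, 10 for the CRU, 12 for the branch unit, 16 for the VPU, and 20 for the VALU.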

The Performance Implications of the 970’s Group Dispatch Scheme

Because of the way it affects the 970’s issue queue design, the group formation scheme has some interesting performance implications. Specifically, proper code scheduling is important in ways that it wouldn’t normally be for the other processors discussed here.

[Figure 10-6 diagram. Labels: up to 4 IOPs per cycle (total) enter vector logical issue queues 1 and 2; each logical issue queue issues 1 IOP per cycle; the vector permute unit has stages VPU-1 and VPU-2; the vector arithmetic logic unit contains VSIU-1, VCIU-1 through VCIU-4, and VFPU-1 through VFPU-4.]

Figure 10-6: The 970’s vector issue queues

Instead of trying to explain this point, I’ll illustrate it with an example. Let’s look at an instruction with few group restrictions: the floating-point add. The 970’s group formation rules dictate that the fadd can go into any of the
