Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture

Author: Jon Stokes

though, since, except for a few fairly old legacy applications, modern x86 apps use protected mode.

Conclusion

x86 wasn’t the only consumer ISA to make the 64-bit leap in the first few years of the 21st century. IBM’s 970, more popularly known as the G5, brought the PowerPC ISA into the same commodity 64-bit desktop and server space as AMD’s Hammer. The next chapter will take an in-depth look at the G5, comparing it to the processors we’ve studied so far.


Chapter 10

The G5: IBM’s PowerPC 970

The last PowerPC processor to succeed the Motorola 74xx family as the heart of Apple’s workstation line is the IBM PowerPC 970—the processor in Apple’s G5 computer. This chapter takes an in-depth look at this processor, comparing it to Motorola’s 7455 and, where appropriate, Intel’s Pentium 4.

I’ll begin by taking a look at the 970’s overall design philosophy, and then I’ll step through the stages of the 970’s pipeline, much as we did in the previous two chapters on the Pentium 4 and G4e. Then we’ll talk about instruction fetching, decoding, dispatching, issuing, execution, and completion, and we’ll end with a look at the 970’s back end.

At the outset, I should note that one of the most significant features of the 970 is its 64-bit integer and address hardware. If you want to learn more about 64-bit computing—what it can and cannot do, and what makes a 64-bit processor like the 970 different from 32-bit processors like the G4e and Pentium 4—make sure to read Chapter 9.

NOTE

With the exception of the section on vector processing, most of the discussion below is
relevant to IBM’s POWER4 microarchitecture—the multiprocessor server microarchitecture on which the PowerPC 970 is based.

Overview: Design Philosophy

In the previous chapters’ comparison of the design philosophies behind the Pentium 4 and G4e, I tried to summarize each processor’s overall approach to organizing its execution resources to squeeze the most performance out of today’s software. I characterized the G4e’s approach as wide and shallow, because the G4e moves a few instructions through its very wide back end in as few clock cycles as possible. I contrasted this approach with the Pentium 4’s narrow and deep approach, which focuses on pushing large numbers of instructions through a narrow back end over the course of many clock cycles.

Using similar language, the 970’s approach could be characterized as wide and deep. In other words, the 970 wants to have it both ways: an extremely wide back end and a 14-stage (integer) pipeline that, while not as deep as the Pentium 4’s, is nonetheless built for speed. Using a special technique, which we’ll discuss shortly, the 970 can have a whopping 200 instructions on-chip in various stages of execution, a number that dwarfs not only the G4e’s 16-instruction window but also the Pentium 4’s 126-instruction one.

You can’t have everything, though, and the 970 pays a price for its “more is better” design. When we discuss instruction dispatching and out-of-order execution on the 970, you’ll see what trade-offs IBM made in choosing this design.

Figure 10-1 shows the microarchitecture’s main functional blocks, and it should give you a preliminary feel for just how wide the 970 is.

Don’t worry if all the parts of Figure 10-1 aren’t immediately intelligible, because by the time this chapter is finished, you’ll have a good understanding of everything depicted there.

Caches and Front End

Let’s take a short look at the caches of the 970 and the G4e. Table 10-1 should give you a rough idea of how the two chips compare.

Table 10-1: The Caches of the PowerPC 970 and G4e

                 L1 I-cache               L1 D-cache               L2 Cache
PowerPC 970      64KB, direct-mapped      32KB, two-way assoc.     512KB, eight-way assoc.
G4e              32KB, eight-way assoc.   32KB, eight-way assoc.   256KB, eight-way assoc.

Table 10-1 shows that the 970 sports a larger instruction cache than the G4e. This is because the 970’s pipeline is roughly twice the length of the G4e’s, which means that, like the Pentium 4, the 970 pays a much higher performance penalty when its pipeline stalls due to a miss in the I-cache. In short, the 970’s 64KB I-cache is intended to keep bubbles out of the chip’s pipeline.
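The cache organizations in Table 10-1 can be turned into concrete geometry. Here is a minimal sketch of the standard sets = size / (line size × ways) relation; the line sizes used below (128 bytes for the 970, 32 bytes for the G4e) are assumptions for illustration, since the table doesn’t give them:

```python
# Cache geometry sketch: sets = size / (line_size * ways).
# Line sizes here are illustrative assumptions, not from Table 10-1.

def cache_sets(size_bytes, line_bytes, ways):
    """Number of sets in a cache of the given size, line size, and associativity."""
    return size_bytes // (line_bytes * ways)

# PowerPC 970 L1 I-cache: 64KB, direct-mapped (1 way), assumed 128-byte lines
sets_970_icache = cache_sets(64 * 1024, 128, 1)

# G4e L1 I-cache: 32KB, eight-way associative, assumed 32-byte lines
sets_g4e_icache = cache_sets(32 * 1024, 32, 8)

print(sets_970_icache, sets_g4e_icache)  # 512 128
```

Note that direct mapping buys the 970 a fast, simple lookup at the cost of more conflict misses per set than an eight-way design, which is part of why the cache is made so large.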


[Figure 10-1: The IBM PowerPC 970. The diagram shows the front end (instruction fetch, branch scan, branch prediction, the instruction queue, and the decode/crack/group stage feeding five dispatch slots) above a back end of issue queues feeding the vector permute unit, the vector math units (simple integer, complex integer, and floating-point), two floating-point units (FPU1, FPU2), two integer units (IU1, IU2), two load-store units (LSU1, LSU2), the condition register and branch execution units, and a five-slot completion queue with the commit unit.]

Figure 10-1: The IBM PowerPC 970

When you combine the 970’s 32KB D-cache with its sizable 512KB L2 cache, its high-speed double data rate (DDR) frontside bus, and its support for up to eight data prefetch streams, you can see that this chip was made for floating-point- and SIMD-intensive media applications. On these features alone, the 970 performs better on vector code than the G4e.

Branch Prediction

Because of the depth of its pipeline and the width of its back end, the 970’s designers spent a sizable chunk of the chip’s resources on branch prediction. Like a high-hit-rate instruction cache, accurate branch prediction is essential if the 970 is to keep its pipeline full and its extensive execution resources in constant use. As such, the 970’s extremely robust branch prediction unit (BPU) is one of its greatest strengths. This section takes a closer look at the top half of the 970’s front end and at the role that branch prediction plays in steering that front end through the instruction stream.

The 970’s instruction fetch logic fetches up to eight instructions per cycle from the L1 I-cache into an instruction queue, and on each fetch, the front end’s branch unit scans these eight instructions in order to pick out up to two branches. If either of the two branches is conditional, the branch prediction unit predicts the condition’s outcome (taken or not taken) and possibly its target address, using one of two branch prediction schemes.

The first branch prediction scheme employed by the 970 is the standard BHT-based scheme first described in Chapter 5. The 970’s BHT has 16K entries—four times the number of entries in the Pentium 4’s BHT and eight times the number of entries in the G4e’s BHT. For each of these 16K entries, a one-bit flag tells the branch predictor whether the branch should be taken or not taken.

The second scheme involves another 16K-entry table, called the global predictor table. Each entry in this global predictor table is associated with an 11-bit vector that records the actual execution path taken by the previous 11 fetch groups. The processor uses this vector, which it constantly keeps up to date with the latest information, to set another one-bit flag for the global predictor table that specifies whether the branch should be taken or not taken.

Finally, there’s a third 16K-entry table that the 970’s front end uses to keep track of which of the two schemes works best for each branch. When a branch is finally evaluated, the processor compares the success of both schemes and records in this selector table which scheme has done the best job so far of predicting the outcome of that particular branch.
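The interplay of the three tables can be sketched in a few lines of Python. This is an illustrative model, not IBM’s implementation: the hash combining the branch address with the 11-bit global history is made up, and the update policy is simplified to the one-bit flags described above.

```python
# Toy model of the 970's tournament-style prediction: a per-branch
# table, a global-history-indexed table, and a selector table that
# records which scheme has worked best for each branch.

TABLE_SIZE = 16 * 1024  # 16K entries, as in the text

class ToyPredictor:
    def __init__(self):
        self.local = [False] * TABLE_SIZE       # one-bit taken/not-taken flags
        self.global_tbl = [False] * TABLE_SIZE
        self.selector = [False] * TABLE_SIZE    # False = local, True = global
        self.history = 0                        # 11-bit global history vector

    def _gidx(self, addr):
        # Combine branch address with global history (hypothetical hash)
        return (addr ^ self.history) % TABLE_SIZE

    def predict(self, addr):
        if self.selector[addr % TABLE_SIZE]:
            return self.global_tbl[self._gidx(addr)]
        return self.local[addr % TABLE_SIZE]

    def update(self, addr, taken):
        li, gi = addr % TABLE_SIZE, self._gidx(addr)
        local_ok = self.local[li] == taken
        global_ok = self.global_tbl[gi] == taken
        if local_ok != global_ok:
            # Remember which scheme predicted this branch correctly
            self.selector[li] = global_ok
        self.local[li] = taken
        self.global_tbl[gi] = taken
        # Shift the outcome into the 11-bit history vector
        self.history = ((self.history << 1) | int(taken)) & 0x7FF
```

A branch that strictly alternates taken/not-taken defeats the one-bit per-branch flag, but because the global table is indexed by recent history, it learns both phases of the pattern, and the selector learns to trust the global scheme for that branch.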

Spending all those transistors on such a massive amount of branch prediction resources may seem like overkill right now, but when you’ve completed the next section, you’ll see that the 970 can’t afford to introduce any unnecessary bubbles into its pipeline.

The Trade-Off: Decode, Cracking, and Group Formation

As noted earlier, IBM’s PowerPC 970 fetches eight instructions per cycle from the L1 cache into an instruction queue, from which the instructions are pulled for decoding at a rate of eight per cycle. This compares quite favorably to the G4e’s four-instructions-per-cycle fetch and decode rate.

Much like the Pentium 4 and its predecessor, the P6, the PowerPC 970 translates PowerPC instructions into an 86-bit internal instruction format that not only makes the instructions easier for the back end to schedule but also explicitly encodes register dependency information. IBM calls these internal instructions IOPs, presumably short for internal operations. Like micro-ops on the Pentium 4, it is these IOPs that are actually executed out of order by the 970’s back end. And also like micro-ops, cracking instructions down into multiple, more atomic, and more strictly defined IOPs can help the back end squeeze out some extra instruction-level parallelism (ILP) by giving it more freedom to schedule code.


One important difference to note is that the architected PowerPC instructions are very close in form and function to the 970’s IOPs; in fact, the latter are probably just a restricted subset of the former. (Whether this is actually true is not clear from the publicly available documentation.) The Pentium 4, in contrast, uses an internal instruction format that differs in many important respects from the x86 ISA. So the process of “ISA translation” on the 970 is significantly less complex and less resource-intensive than the analogous process on the Pentium 4.

Almost all PowerPC ISA instructions translate into exactly one IOP on the 970. Of the instructions that translate into more than one IOP, IBM distinguishes two types:

- A cracked instruction is an instruction that splits into exactly two IOPs.
- A millicoded instruction is an instruction that splits into more than two IOPs.

This difference in the way instructions are classified is not arbitrary. Rather, it ties into a very important design decision that the POWER4’s designers made regarding how the chip tracks instructions at various stages of execution.
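To see why cracking helps, consider a PowerPC load-with-update instruction such as lwzu, which both loads a value into one register and writes the incremented address back to the base register; instructions of this class are cracked into two IOPs. The sketch below is hypothetical (the real IOP encoding is not public), but it shows how splitting yields two operations with explicit, separate register dependencies:

```python
# Hypothetical sketch of cracking: a PowerPC load-with-update
# (lwzu rD, d(rA)) splits into a load IOP and an add IOP, each with
# its destination and source registers explicitly encoded.

def crack_lwzu(rd, ra, disp):
    """Split lwzu into two IOPs (illustrative encoding only)."""
    load_iop = {"op": "load", "dest": rd, "srcs": [ra], "disp": disp}
    add_iop = {"op": "addi", "dest": ra, "srcs": [ra], "imm": disp}
    return [load_iop, add_iop]

iops = crack_lwzu(rd=3, ra=4, disp=8)
print([i["op"] for i in iops])  # ['load', 'addi']
```

Because the two IOPs write different destination registers, the back end can schedule and complete them independently rather than treating the whole instruction as one indivisible unit.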

Dispatching and Issuing Instructions on the PowerPC 970

If you take a look at the middle of the large PPC 970 diagram in Figure 10-1, you’ll notice that right below the Decode, Crack, and Group phase I’ve placed a group of five boxes. These five boxes represent what IBM calls a dispatch group (or group, for short), and each group consists of five IOPs arranged in program order according to certain rules and restrictions. It is these organized and packaged groups of five IOPs that the 970 dispatches in parallel to the six issue queues in its back end.

I probably shouldn’t go any further in discussing how these groups work without first explaining the reason for their existence. By assembling IOPs into specially ordered groups of five for dispatch and completion, the 970 can track these groups, and not individual IOPs, through the various stages of execution. So instead of tracking 100 individual IOPs in flight as they work their way through the 100 or so execution slots available in the back end, the 970 need only track 20 groups. IOP grouping, then, significantly reduces the overhead associated with tracking and reordering the huge volume of instructions that can fit into the 970’s deep and wide design.
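The arithmetic of that bookkeeping win can be sketched in a couple of lines; the grouping itself is simplified here to pure program-order packing, ignoring the formation rules discussed below:

```python
# Sketch of the tracking win from group formation: 100 in-flight
# IOPs become 20 five-IOP dispatch groups, so the completion logic
# handles 20 entries instead of 100.

GROUP_SIZE = 5

def form_groups(iops):
    """Pack IOPs, in program order, into groups of five (simplified)."""
    return [iops[i:i + GROUP_SIZE] for i in range(0, len(iops), GROUP_SIZE)]

in_flight = [f"iop{i}" for i in range(100)]
groups = form_groups(in_flight)
print(len(groups))  # 20 tracking entries instead of 100
```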

As noted earlier, the 970’s peculiar group dispatch scheme doesn’t go into action until after the decode stage. Decoded PowerPC instructions flow from the bottom of the 970’s instruction queue in order and fill up the five available dispatch slots in a group (four non-branch instructions and one branch instruction). Once the five slots have been filled from the instruction queue, the entire group is dispatched to the back end, where the individual instructions that constitute it (loads, stores, adds, fadds, etc.) enter the tops of their respective issue queues (i.e., an add goes into an integer issue queue, a fadd goes into a floating-point issue queue, etc.).
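The slot-filling and routing just described can be sketched as follows. This is a loose model under stated assumptions: each instruction is reduced to a type tag, a branch is assumed to take the final slot and end the group, and the routing table covers only the handful of types named above.

```python
from collections import deque

# Simplified model of 970 group dispatch: pull decoded instructions
# in program order, fill four non-branch slots plus one branch slot,
# then route each one to an issue queue by type.

QUEUE_FOR = {  # simplified routing table
    "add": "integer",
    "load": "load-store",
    "store": "load-store",
    "fadd": "floating-point",
    "branch": "branch",
}

def dispatch_group(iq, issue_queues):
    """Fill one dispatch group from the front of the instruction
    queue and route its members to their issue queues."""
    group = []
    while iq and len(group) < 5:
        op = iq[0]
        if op == "branch":
            group.append(iq.popleft())  # branch takes the final slot
            break                       # and always closes the group
        if len(group) == 4:
            break                       # non-branch slots are full
        group.append(iq.popleft())
    for op in group:
        issue_queues[QUEUE_FOR[op]].append(op)
    return group

iq = deque(["add", "load", "fadd", "add", "branch", "store", "add"])
queues = {q: [] for q in ("integer", "load-store", "floating-point", "branch")}
print(dispatch_group(iq, queues))  # ['add', 'load', 'fadd', 'add', 'branch']
```

After this dispatch, the integer queue holds both adds, the load-store queue holds the load, and the remaining store and add wait in the instruction queue for the next group.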
