Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture

Authors: Jon Stokes

unit by spreading them out into other functional blocks.

Latency and Throughput Revisited

On superscalar processors like the 601 and its more modern counterparts,

different instructions take different numbers of cycles to pass through the

processor. Different execution units often have different pipeline depths,

and even within one execution unit, different instructions sometimes take

different numbers of cycles. Regarding this latter case, one instruction can

take longer to pass through an ALU than another instruction, either because

the instruction has a mandatory stall in a certain stage or because the particular subunit that is handling the instruction has a longer pipeline than the other subunits that together make up the ALU. This being the case, it no longer makes sense for us to simplistically treat instruction latency as a property of the processor as a whole. Rather, instruction latency is actually a matter of the individual instruction, so our discussion will reflect that from now on.

Earlier, we defined an instruction’s latency as the minimum number of

cycles that an instruction must spend in the execution phase. Here are the

latencies of some commonly used PowerPC instructions on the PowerPC G4,

a processor that we’ll discuss in “The PowerPC 7400 (aka the G4)” on page 133:

Mnemonic    Cycles to Execute
add         1
and         1
cmp         1
divw        19
mul         6

Notice that most of the instructions take only one cycle to execute, while

a few, like division and multiplication, take more. An integer division for a full word, for example, takes 19 cycles to execute, while a 32-bit multiply takes 6

cycles. This means that the division instruction sits in IU1’s integer pipeline

for 19 cycles, and during this time, no other instruction can execute in IU1.
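
As a rough illustration of the point above, the sketch below adds up how long a single integer unit stays busy for a short instruction sequence. The latencies come from the table; the function name `iu_busy_cycles` is hypothetical, and the model assumes the instructions run strictly back to back in one unit with no overlap, which is a simplification and not a description of the G4's real scheduling.

```python
# Integer latencies (in cycles) taken from the table above.
INT_LATENCY = {"add": 1, "and": 1, "cmp": 1, "divw": 19, "mul": 6}


def iu_busy_cycles(mnemonics):
    """Return how many cycles one integer unit is occupied.

    Simplifying assumption: instructions execute one after another in the
    same unit, so the busy time is just the sum of their latencies.
    """
    return sum(INT_LATENCY[m] for m in mnemonics)


print(iu_busy_cycles(["add", "cmp", "add"]))   # three single-cycle instructions
print(iu_busy_cycles(["add", "divw", "add"]))  # the divw dominates the total
```

Under this model, replacing one single-cycle instruction with a divw raises the unit's busy time from 3 cycles to 21, which is why long-latency instructions matter so much for a unit that cannot overlap them.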

PowerPC Processors: 600 Series, 700 Series, and 7400

Now let’s look at the floating-point instruction latencies for the G4:

Mnemonic    Cycles to Execute
fabs        1-1-1
fadd        1-1-1
fdiv        32
fmadd       1-1-1
fmul        1-1-1
fsub        1-1-1

These latencies are listed a bit differently than the integer instruction latencies. The numbers separated by dashes tell how long the instruction spends

in each of the FPU’s three pipeline stages. Most of the instructions listed spend one cycle in each stage, for a total of three cycles in the G4’s FPU pipeline, so if a program is using only these instructions, the FPU can start and finish one

instruction on each cycle.

A few instructions, like floating-point division, have only a single number

in the latency column. This is because an fdiv ties up the entire floating-point pipeline when executing. While an fdiv is grinding out its 32 cycles in the FPU, no other instructions can execute along with it in the floating-point pipeline.

This means that any floating-point instructions that come immediately after

the fdiv in the code stream must wait in the instruction queue because they

cannot be dispatched while the fdiv is executing.
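
The difference between a pipelined 1-1-1 instruction and a blocking fdiv can be sketched with a small timing model. Everything here is an illustration under stated assumptions: the function name `fpu_total_cycles` is hypothetical, pipelined instructions are assumed to enter the FPU on consecutive cycles with no dependencies between them, and an fdiv is assumed to wait for the pipeline to drain before it starts its 32 cycles.

```python
# Instructions listed as 1-1-1 in the table above.
PIPELINED = {"fabs", "fadd", "fmadd", "fmul", "fsub"}
FPU_STAGES = 3     # one cycle in each of the three stages
FDIV_CYCLES = 32   # fdiv occupies the whole FPU for this long


def fpu_total_cycles(mnemonics):
    """Return the cycle on which the last instruction leaves the FPU."""
    next_issue = 0   # earliest cycle the next instruction may enter the FPU
    last_finish = 0  # cycle on which the most recent instruction completes
    for m in mnemonics:
        if m in PIPELINED:
            issue = next_issue
            last_finish = issue + FPU_STAGES
            next_issue = issue + 1        # another instruction can follow next cycle
        elif m == "fdiv":
            # Assumption: fdiv starts only once the pipeline has drained.
            issue = max(next_issue, last_finish)
            last_finish = issue + FDIV_CYCLES
            next_issue = last_finish      # nothing else enters until fdiv is done
        else:
            raise ValueError(f"unknown mnemonic: {m}")
    return last_finish


print(fpu_total_cycles(["fadd", "fmul", "fsub", "fabs"]))  # fully pipelined
print(fpu_total_cycles(["fadd", "fdiv", "fadd"]))          # fdiv blocks the FPU
```

In this model, four pipelined instructions finish in 6 cycles (3 to fill the pipeline, then one result per cycle), while a single fdiv in the middle of a three-instruction sequence pushes the total to 38 cycles.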

Summary: The 601 in Historical Context

The 601 could spend a ton of transistors (at least, a ton for its day) on a 32KB cache, because its front end was so much simpler than that of its x86 counterpart, the Intel Pentium. This was a significant advantage to using RISC at that time. The chip made its debut in the PowerMac 6100 to good reviews, and it put Apple in the performance lead over its x86 competition. The 601 was definitive in firmly establishing the cult of Apple as a high-end computer maker.

Nonetheless, the 601 did leave some room for improvement. The

sequencer unit that it inherited from its mainframe ancestor took up valuable

die space that could have been put to better use. With a little more time to

tweak it, the 601 could have been closer to perfect. But near perfection would

have to wait for the one-two punch of the 603e and 604.

The PowerPC 603 and 603e

While one team was putting the finishing touches on the 601, another team

at IBM’s Somerset Design Center in Austin had already begun working

on the 601’s successor—the 603. The 603 was a significantly different design

than the 601, so it was less of an evolutionary shift than it was a completely

different processor. Table 6-2 summarizes the features of the PowerPC 603

and 603e.

Chapter 6

Table 6-2: Features of the PowerPC 603 and 603e

                              PowerPC 603 Vitals           PowerPC 603e Vitals
Introduction Date             May 1, 1995                  October 16, 1995
Process                       0.50 micron                  0.50 micron
Transistor Count              1.6 million                  2.6 million
Die Size                      81 mm²                       98 mm²
Clock Speed at Introduction   75 MHz                       100 MHz
L1 Cache Size                 16KB split L1                32KB split L1
First Appeared In             Macintosh Performa 5200CD    Macintosh Performa 6300CD

The 603 was designed to run on very little power, because Apple needed

a chip for its PowerBook line of laptop computers. As a result, the processor

had a very good performance-per-watt ratio on native PowerPC code, and in

fact was able to match the 601 in clock-for-clock performance even though it

had about half the number of transistors as the older processor. But the 603’s

smaller 16KB split L1 cache meant that it was pretty bad at emulating the

legacy 68K code that formed a large part of Apple’s OS and application base.

As a result, the 603 was relegated to the very lowest end of Apple’s product

line (the Performas, beginning with the 6200, and the all-in-ones designed

for the education market, beginning with the 5200), until a tweaked version

(the 603e) with an enlarged, 32KB split cache was released. The 603e

performed better on emulated 68K code, so it saw widespread use in the

PowerBook line.

This section will take a quick look at the microarchitecture of the 603e,

illustrated in Figure 6-2, because it was the version of the 603 that saw the

most widespread use.

NOTE

The 604 was also released at the same time as the original 603. The 604, which was intended for Apple’s high-end products just like the 603e was intended for its low-end products, was yet another brand new design. We’ll cover the 604 in “The PowerPC 604” on page 123.

The 603e’s Back End

Like the 601, the 603e sports the classic RISC four-stage pipeline. But unlike

the 601, which can decode and dispatch up to three instructions per cycle

to any of its execution units—including its branch unit—the 603e has one

important restriction that constrains how it uses its dispatch bandwidth of

three instructions per cycle.

On the 603e, and on all processors derived from it (the 750 and the

7400/7410), branches that aren’t folded or don’t fall through are dispatched

from the instruction queue to the branch unit over a dispatch bus that isn’t connected to any of the other execution units. This way, branch instructions

don’t take up any of the available dispatch bandwidth that feeds the main

part of the back end. The 603e and its derivatives can dispatch one branch

instruction per cycle to the branch unit over this particular bus.

Figure 6-2: Microarchitecture of the PowerPC 603e
[Diagram not reproduced. Labeled blocks: Front End (Instruction Fetch, Instruction Queue, Branch Unit, Decode/Dispatch); Back End (reservation stations feeding the Floating-Point Unit, Integer ALU, System Unit, and Load-Store Unit); Commit Unit (Completion Queue, Write).]

Non-branch instructions can dispatch at a rate of up to two instructions

per cycle to the back end, which means that the 603 has a maximum dispatch

rate of three instructions per cycle (two non-branch + one branch). However,

because two non-branch instructions per cycle can dispatch, branch instructions are often ignored when discussing the dispatch rate of the 603 and its successors. Therefore, these processors are often said to have a dispatch rate of up to two instructions per cycle, even though the dispatch rate is technically three instructions per cycle.
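
The "two non-branch plus one branch" limit can be sketched as a simple counting model. This is only an illustration: the function name `dispatch_cycles` and the marker string "b" for a branch are hypothetical, and the model assumes strictly in-order dispatch with no stalls from full reservation stations, branch folding, or dependencies.

```python
def dispatch_cycles(instructions, max_non_branch=2, max_branch=1):
    """Count the cycles needed to dispatch a stream in program order.

    `instructions` is a list of strings. The hypothetical marker "b" stands
    for any branch; every other string is treated as a non-branch.
    """
    cycles = 0
    i = 0
    while i < len(instructions):
        non_branch = 0
        branch = 0
        # Keep dispatching in order until one of the per-cycle limits is hit.
        while i < len(instructions):
            if instructions[i] == "b":
                if branch == max_branch:
                    break
                branch += 1
            else:
                if non_branch == max_non_branch:
                    break
                non_branch += 1
            i += 1
        cycles += 1
    return cycles


print(dispatch_cycles(["add", "add", "b"]))    # two non-branch + one branch
print(dispatch_cycles(["add", "add", "add"]))  # third non-branch must wait
```

Under these assumptions, a stream of two non-branch instructions and a branch dispatches in a single cycle, while three non-branch instructions need two cycles, which matches the distinction the text draws between the "technical" rate of three and the commonly quoted rate of two.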

The 603e’s dispatch logic takes a maximum of two non-branch instructions per cycle from the bottom of the instruction queue and passes them to the back end, where they are executed by one of five execution units:

• Integer unit
• Floating-point unit
• Branch unit
• Load-store unit
• System unit

Notice that this list contains two more units than the analogous list for

the 601: the load-store unit (LSU) and the system unit. The 603e’s load-store unit takes over all of the address-calculating labors that the older 601 foisted onto its lone integer ALU. Because the 603e has a dedicated LSU for performing address calculations and executing store-data operations, its

integer unit is freed up from having to handle memory traffic and can

therefore focus solely on integer arithmetic. This helps improve the 603e’s

performance on integer code.

The 603e’s dedicated system unit also takes over some of the functions

of the 601’s integer unit, in that it handles updates to the PowerPC condition

register. We’ll talk more about the condition register in Chapter 10, so don’t

worry if you don’t know what it is. The 603e’s system unit also contains a

limited integer adder, which can take some of the burden off the integer

ALU by doing certain types of addition. (The original 603’s system unit

lacked this feature.)

The 603e’s basic floating-point pipeline differs from that of the 601 in that

it has one more execute stage and one less decode stage. Most floating-

point instructions have a three-cycle latency (and a one-cycle throughput)

on the 603e, compared to a two-cycle latency on the 601. This three-cycle

latency/one-cycle throughput design wouldn’t be bad at all if it weren’t for

one serious problem: At its very fastest, the 603e’s FPU can only execute three

instructions every four cycles. In other words, after every third single-cycle

floating-point instruction, there is a mandatory pipeline bubble. I won’t get

into the reason for this, but the 603e’s FPU took a nontrivial hit to performance for this three-instruction/four-cycle design.
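
The cost of the mandatory bubble is easy to quantify with a little arithmetic. The sketch below is a simplified model, not a description of the actual hardware mechanism: the function name `fp_issue_cycles` is hypothetical, and it assumes a stream of independent single-cycle floating-point instructions with one bubble cycle after every third instruction whenever another instruction is still waiting.

```python
def fp_issue_cycles(n, group=3):
    """Cycles needed to issue n single-cycle FP instructions.

    Assumption: one bubble cycle follows every `group`-th instruction,
    but only when another instruction is still waiting behind it.
    """
    if n <= 0:
        return 0
    bubbles = (n - 1) // group
    return n + bubbles


for n in (3, 6, 300):
    cycles = fp_issue_cycles(n)
    # Instructions issued per cycle approaches 0.75 for long streams.
    print(n, cycles, round(n / cycles, 3))
```

For a long stream, the ratio settles at roughly three instructions per four cycles, so under this model the FPU delivers about 75 percent of the throughput that a bubble-free one-cycle-throughput design would.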

The other, perhaps more serious, flaw in the 603e’s FPU is that it is not

fully pipelined for multiply operations. Double-precision multiplies—and this

includes double-precision fmadds—spend two cycles in the execute stage, which

means that the 603e’s FPU can complete only one double-precision multiply

every two cycles.

The 603e’s floating-point unit isn’t all bad news, though. It still has the standard PPC ability to do single-precision fmadd operations, with a four-cycle latency and a one-cycle throughput. This fast fmadd ability helped the architecture retain much of its usefulness for DSP, scientific, and media applications,

in spite of the aforementioned drawbacks.

The 603e’s Front End, Instruction Window, and Branch Prediction

Up to two instructions per cycle can be fetched into the 603e’s six-entry

instruction queue. From there, a maximum of two instructions per cycle

(one fewer than the 601) can be dispatched from the two bottom entries in

the IQ to the reservation stations in the 603e’s back end.

Not only does the 603e dispatch one fewer instruction per cycle to its back

end than the 601 does, but its overall approach to superscalar and out-of-order

execution differs from that of the 601 in another way, as well. The 603e uses

a dedicated commit unit, which contains a five-entry completion queue (analogous to the P6’s ROB) for keeping track of the program order of in-flight instructions. When instructions execute out of order, the commit unit refers to the

information stored in the completion queue and puts the instructions back

in program order before committing them.
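
The idea of executing out of order but committing in program order can be sketched with a small queue. This is a toy model under stated assumptions: the class name `CompletionQueue` and its methods are hypothetical, the default capacity of five mirrors the five-entry queue described above, and the sketch ignores any per-cycle commit limit and every other detail of the real commit unit.

```python
from collections import deque


class CompletionQueue:
    """Toy in-order commit queue; capacity defaults to five entries."""

    def __init__(self, capacity=5):
        self.capacity = capacity
        self.entries = deque()  # [name, finished] pairs, kept in program order

    def dispatch(self, name):
        """Allocate an entry at dispatch; return False if the queue is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append([name, False])
        return True

    def finish(self, name):
        """Mark an instruction as executed; this may happen out of order."""
        for entry in self.entries:
            if entry[0] == name and not entry[1]:
                entry[1] = True
                return
        raise KeyError(name)

    def commit(self):
        """Retire finished instructions from the head, oldest first."""
        committed = []
        while self.entries and self.entries[0][1]:
            committed.append(self.entries.popleft()[0])
        return committed


cq = CompletionQueue()
for name in ("i1", "i2", "i3"):
    cq.dispatch(name)
cq.finish("i3")      # the youngest instruction finishes first
print(cq.commit())   # nothing commits while i1 is still executing
cq.finish("i1")
print(cq.commit())   # i1 commits; i2 now blocks i3
cq.finish("i2")
print(cq.commit())   # i2 and i3 commit together, in program order
```

The key design point is that `commit` only ever looks at the head of the queue, so a finished instruction can never retire ahead of an older one that is still in flight.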

To use a term that figured prominently in our discussion of the Pentium,

the 603 is the first PowerPC processor to feature dynamic scheduling via a

full-blown instruction window, complete with a ROB and reservation stations.
