Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
unit by spreading them out into other functional blocks.
Latency and Throughput Revisited
On superscalar processors like the 601 and its more modern counterparts,
different instructions take different numbers of cycles to pass through the
processor. Different execution units often have different pipeline depths,
and even within one execution unit, different instructions sometimes take
different numbers of cycles. Regarding this latter case, one instruction can
take longer to pass through an ALU than another instruction, either because
the instruction has a mandatory stall in a certain stage or because the
particular subunit that is handling the instruction has a longer pipeline
than the other subunits that together make up the ALU. It therefore no
longer makes sense to treat instruction latency as a property of the
processor as a whole. Rather, latency is a property of the individual
instruction, and our discussion will reflect that from now on.
Earlier, we defined an instruction’s latency as the minimum number of
cycles that an instruction must spend in the execution phase. Here are the
latencies of some commonly used PowerPC instructions on the PowerPC G4,
a processor that we’ll discuss in “The PowerPC 7400 (aka the G4)” on page 133:
Mnemonic    Cycles to Execute
add         1
and         1
cmp         1
divw        19
mul         6
Notice that most of the instructions take only one cycle to execute, while
a few, like division and multiplication, take more. An integer division for a full word, for example, takes 19 cycles to execute, while a 32-bit multiply takes 6
cycles. This means that the division instruction sits in IU1’s integer pipeline
for 19 cycles, and during this time, no other instruction can execute in IU1.
PowerPC Processors: 600 Series, 700 Series, and 7400
Now let’s look at the floating-point instruction latencies for the G4:
Mnemonic    Cycles to Execute
fabs        1-1-1
fadd        1-1-1
fdiv        32
fmadd       1-1-1
fmul        1-1-1
fsub        1-1-1
These latencies are listed a bit differently from the integer instruction
latencies. The numbers separated by dashes tell how long the instruction
spends in each of the FPU’s three pipeline stages. Most of the instructions
listed spend one cycle in each stage, for a total of three cycles in the G4’s
FPU pipeline, so if a program uses only these instructions, the FPU can start
one instruction and finish another on every cycle.
A few instructions, like floating-point division, have only a single number
in the latency column. This is because an fdiv ties up the entire floating-point pipeline when executing. While an fdiv is grinding out its 32 cycles in the FPU, no other instructions can execute along with it in the floating-point pipeline.
This means that any floating-point instructions that come immediately after
the fdiv in the code stream must wait in the instruction queue because they
cannot be dispatched while the fdiv is executing.
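To make the difference between the pipelined 1-1-1 instructions and a blocking fdiv concrete, here is a minimal sketch of a toy FPU model. The function name `fpu_cycles` and the dictionary `G4_FPU_LATENCY` are hypothetical names invented for this illustration; the latencies come from the table above. The model assumes in-order issue with no data dependencies, and it also assumes that a blocking instruction waits for the pipeline to drain before it starts, which the text does not spell out.

```python
# Latencies from the G4 table above. A tuple gives the cycles spent in each
# of the three FPU stages; a bare integer marks an instruction that ties up
# the whole pipeline for that many cycles.
G4_FPU_LATENCY = {
    "fabs": (1, 1, 1),
    "fadd": (1, 1, 1),
    "fdiv": 32,
    "fmadd": (1, 1, 1),
    "fmul": (1, 1, 1),
    "fsub": (1, 1, 1),
}


def fpu_cycles(program, latency=G4_FPU_LATENCY):
    """Return the cycle count needed to run `program` through the toy FPU."""
    next_issue = 0   # earliest cycle at which stage 1 is free
    last_finish = 0  # cycle by which every instruction so far has finished
    for op in program:
        lat = latency[op]
        if isinstance(lat, tuple):
            # Pipelined: enter as soon as stage 1 is free.
            start = next_issue
            finish = start + sum(lat)
            next_issue = start + lat[0]
        else:
            # Blocking (assumption): wait for the pipeline to drain, then
            # hold every stage until the instruction is done.
            start = max(next_issue, last_finish)
            finish = start + lat
            next_issue = finish
        last_finish = max(last_finish, finish)
    return last_finish


print(fpu_cycles(["fadd"] * 10))     # 12: three cycles to fill, then one result per cycle
print(fpu_cycles(["fdiv", "fadd"]))  # 35: the fadd cannot start until the fdiv is done
```

Ten back-to-back fadds finish in 12 cycles, while a single fdiv followed by one fadd needs 35, which is the cost of stalling everything behind the divide.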
Summary: The 601 in Historical Context
The 601 could spend a ton of transistors (at least, a ton for its day) on a 32KB
cache, because its front end was so much simpler than that of its x86
counterpart, the Intel Pentium. This was a significant advantage of using RISC
at that time. The chip made its debut in the PowerMac 6100 to good reviews, and
it put Apple in the performance lead over its x86 competition. The 601 firmly
established the cult of Apple as a high-end computer maker.
Nonetheless, the 601 did leave some room for improvement. The
sequencer unit that it inherited from its mainframe ancestor took up valuable
die space that could have been put to better use. With a little more time to
tweak it, the 601 could have been closer to perfect. But near perfection would
have to wait for the one-two punch of the 603e and 604.
The PowerPC 603 and 603e
While one team was putting the finishing touches on the 601, another team
at IBM’s Somerset Design Center in Austin had already begun working
on the 601’s successor, the 603. The 603 was a significantly different design
from the 601, so it was less of an evolutionary shift than it was a completely
different processor. Table 6-2 summarizes the features of the PowerPC 603
and 603e.
Chapter 6
Table 6-2: Features of the PowerPC 603 and 603e

                              PowerPC 603 Vitals          PowerPC 603e Vitals
Introduction Date             May 1, 1995                 October 16, 1995
Process                       0.50 micron                 0.50 micron
Transistor Count              1.6 million                 2.6 million
Die Size                      81 mm²                      98 mm²
Clock Speed at Introduction   75 MHz                      100 MHz
L1 Cache Size                 16KB split L1               32KB split L1
First Appeared In             Macintosh Performa 5200CD   Macintosh Performa 6300CD
The 603 was designed to run on very little power, because Apple needed
a chip for its PowerBook line of laptop computers. As a result, the processor
had a very good performance-per-watt ratio on native PowerPC code, and in
fact was able to match the 601 in clock-for-clock performance even though it
had about half the number of transistors as the older processor. But the 603’s
smaller 16KB split L1 cache meant that it was pretty bad at emulating the
legacy 68K code that formed a large part of Apple’s OS and application base.
As a result, the 603 was relegated to the very lowest end of Apple’s product
line (the Performas, beginning with the 6200, and the all-in-ones designed
for the education market, beginning with the 5200), until a tweaked version
(the 603e) with an enlarged, 32KB split cache was released. The 603e
performed better on emulated 68K code, so it saw widespread use in the
PowerBook line.
This section will take a quick look at the microarchitecture of the 603e,
illustrated in Figure 6-2, because it was the version of the 603 that saw the
most widespread use.
NOTE
The 604 was released at the same time as the original 603. The 604, which was
intended for Apple’s high-end products just as the 603e was intended for its
low-end products, was yet another brand-new design. We’ll cover the 604 in
“The PowerPC 604”
The 603e’s Back End
Like the 601, the 603e sports the classic RISC four-stage pipeline. But unlike
the 601, which can decode and dispatch up to three instructions per cycle
to any of its execution units—including its branch unit—the 603e has one
important restriction that constrains how it uses its dispatch bandwidth of
three instructions per cycle.
On the 603e, and on all processors derived from it (the 750 and the
7400/7410), branches that aren’t folded or don’t fall through are dispatched
from the instruction queue to the branch unit over a dispatch bus that isn’t
connected to any of the other execution units. This way, branch instructions
don’t take up any of the available dispatch bandwidth that feeds the main
part of the back end. The 603e and its derivatives can dispatch one branch
instruction per cycle to the branch unit over this particular bus.
[Figure: block diagram showing the front end (instruction fetch, instruction queue, branch unit, decode/dispatch), the back end (reservation stations feeding the floating-point unit, integer unit, system unit, and load-store unit), and the commit unit with its completion queue]
Figure 6-2: Microarchitecture of the PowerPC 603e
Non-branch instructions can dispatch at a rate of up to two instructions
per cycle to the back end, which means that the 603 has a maximum dispatch
rate of three instructions per cycle (two non-branch + one branch). However,
because two non-branch instructions per cycle can dispatch, branch
instructions are often ignored when discussing the dispatch rate of the 603
and its successors. Therefore, these processors are often said to have a
dispatch rate of up to two instructions per cycle, even though the dispatch
rate is technically three instructions per cycle.
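The dispatch arithmetic above (two non-branch slots plus one separate branch slot per cycle) can be sketched as a small counting loop. This is only an illustration under simplifying assumptions: `dispatch_cycles` is a hypothetical helper, instructions are dispatched strictly in program order, and fetch bandwidth, branch folding, and reservation-station availability are all ignored.

```python
def dispatch_cycles(stream, max_non_branch=2, max_branch=1):
    """Count the cycles needed to dispatch `stream` in program order.

    Each element of `stream` is a mnemonic string; the string "branch"
    stands for any branch that must go to the branch unit.
    """
    cycles = 0
    i = 0
    while i < len(stream):
        non_branch = 0
        branch = 0
        # Fill this cycle's slots until one of the two limits is hit.
        while i < len(stream):
            if stream[i] == "branch":
                if branch == max_branch:
                    break
                branch += 1
            else:
                if non_branch == max_non_branch:
                    break
                non_branch += 1
            i += 1
        cycles += 1
    return cycles


print(dispatch_cycles(["add", "add", "branch"]))  # 1: the peak of three per cycle
print(dispatch_cycles(["add", "add", "add"]))     # 2: only two non-branch slots per cycle
```

The first call shows where the "technically three" figure comes from, and the second shows why two per cycle is the number that matters for ordinary non-branch code.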
The 603e’s dispatch logic takes a maximum of two non-branch instructions
per cycle from the bottom of the instruction queue and passes them to
the back end, where they are executed by one of five execution units:
• Integer unit
• Floating-point unit
• Branch unit
• Load-store unit
• System unit
Notice that this list contains two more units than the analogous list for
the 601: the load-store unit (LSU) and the system unit. The 603e’s load-store
unit takes over all of the address-calculating labors that the older 601
foisted onto its lone integer ALU. Because the 603e has a dedicated LSU for
performing address calculations and executing store-data operations, its
integer unit is freed up from having to handle memory traffic and can
therefore focus solely on integer arithmetic. This helps improve the 603e’s
performance on integer code.
The 603e’s dedicated system unit also takes over some of the functions
of the 601’s integer unit, in that it handles updates to the PowerPC condition
register. We’ll talk more about the condition register in Chapter 10, so don’t
worry if you don’t know what it is. The 603e’s system unit also contains a
limited integer adder, which can take some of the burden off the integer
ALU by doing certain types of addition. (The original 603’s system unit
lacked this feature.)
The 603e’s basic floating-point pipeline differs from that of the 601 in that
it has one more execute stage and one fewer decode stage. Most floating-point
instructions have a three-cycle latency (and a one-cycle throughput)
on the 603e, compared to a two-cycle latency on the 601. This three-cycle
latency/one-cycle throughput design wouldn’t be bad at all if it weren’t for
one serious problem: At its very fastest, the 603e’s FPU can only execute three
instructions every four cycles. In other words, after every third single-cycle
floating-point instruction, there is a mandatory pipeline bubble. I won’t get
into the reason for this, but the 603e’s FPU took a nontrivial hit to
performance for this three-instruction/four-cycle design.
The other, perhaps more serious, flaw in the 603e’s FPU is that it is not
fully pipelined for multiply operations. Double-precision multiplies—and this
includes double-precision fmadds—spend two cycles in the execute stage, which
means that the 603e’s FPU can complete only one double-precision multiply
every two cycles.
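The two throughput limits just described reduce to simple arithmetic, sketched below. The function names are hypothetical, and the counts cover only issue slots and execute-stage occupancy in the steady state: they ignore pipeline fill time and assume an unbroken run of independent instructions.

```python
def sp_issue_cycles(n):
    """Issue slots used by n single-cycle FP instructions.

    One bubble follows every third instruction, so each full group of
    three costs four cycles (the trailing bubble is counted).
    """
    return n + n // 3


def dp_multiply_cycles(n):
    """Execute-stage cycles used by n double-precision multiplies.

    Each one spends two cycles in the execute stage, so the stage can
    accept a new multiply only every other cycle.
    """
    return 2 * n


n = 300
print(n / sp_issue_cycles(n))     # 0.75 instructions per cycle
print(n / dp_multiply_cycles(n))  # 0.5 instructions per cycle
```

In other words, the bubble caps sustained throughput at three quarters of an instruction per cycle, and a stream of double-precision multiplies runs at half an instruction per cycle.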
The 603e’s floating-point unit isn’t all bad news, though. It still has the
standard PPC ability to do single-precision fmadd operations, with a four-cycle
latency and a one-cycle throughput. This fast fmadd ability helped the
architecture retain much of its usefulness for DSP, scientific, and media
applications, in spite of the aforementioned drawbacks.
The 603e’s Front End, Instruction Window, and Branch Prediction
Up to two instructions per cycle can be fetched into the 603e’s six-entry
instruction queue. From there, a maximum of two instructions per cycle
(one fewer than the 601) can be dispatched from the two bottom entries in
the IQ to the reservation stations in the 603e’s back end.
Not only does the 603e dispatch one fewer instruction per cycle to its back
end than the 601 does, but its overall approach to superscalar and out-of-order
execution differs from that of the 601 in another way, as well. The 603e uses
a dedicated commit unit, which contains a five-entry completion queue
(analogous to the P6’s ROB) for keeping track of the program order of
in-flight instructions. When instructions execute out of order, the commit
unit refers to the
information stored in the completion queue and puts the instructions back
in program order before committing them.
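The job of the completion queue can be illustrated with a minimal sketch. The class and method names here are hypothetical, and the model makes assumptions the text does not state: it lets any number of finished instructions commit at once, and it tracks only a tag and a finished flag per entry. The one detail taken from the text is the five-entry capacity.

```python
from collections import deque


class CompletionQueue:
    """Toy completion queue: entries are kept in program order."""

    def __init__(self, capacity=5):
        self.capacity = capacity
        self.entries = deque()  # each entry is [tag, finished]

    def dispatch(self, tag):
        """Allocate an entry at dispatch; return False if the queue is full."""
        if len(self.entries) == self.capacity:
            return False  # dispatch would have to stall
        self.entries.append([tag, False])
        return True

    def finish(self, tag):
        """Mark an in-flight instruction as having finished execution."""
        for entry in self.entries:
            if entry[0] == tag:
                entry[1] = True
                return
        raise KeyError(tag)

    def commit(self):
        """Commit finished instructions from the head, in program order."""
        committed = []
        while self.entries and self.entries[0][1]:
            committed.append(self.entries.popleft()[0])
        return committed


q = CompletionQueue()
for tag in ["i0", "i1", "i2"]:
    q.dispatch(tag)
q.finish("i2")     # i2 finishes first, out of order
print(q.commit())  # []: i0 at the head has not finished yet
q.finish("i0")
q.finish("i1")
print(q.commit())  # ['i0', 'i1', 'i2']: committed in program order
```

Even though i2 finishes first, nothing commits until the oldest instruction at the head of the queue is done, which is how program order is restored.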
To use a term that figured prominently in our discussion of the Pentium,
the 603 is the first PowerPC processor to feature dynamic scheduling via a
full-blown instruction window, complete with a ROB and reservation stations.