Authors: jon stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
was also born a subset of the POWER architecture dubbed PowerPC. PowerPC
processors were to be jointly designed and produced by IBM and Motorola
with input from Apple, and were to be used in Apple computers and in the
embedded market. The AIM alliance has since passed into history, but
PowerPC lives on, not only in Apple computers but in a whole host of
different products that use PowerPC-based chips from Motorola and IBM.
The PowerPC 601
In 1993, AIM kicked off the PowerPC party by releasing the 32-bit PowerPC 601
at an initial speed of 66 MHz. The 601, which was based on IBM’s older RISC
Single Chip (RSC) processor and was originally designed to serve as a “bridge”
between POWER and PowerPC, combines parts of IBM’s POWER architecture
with the 60x bus developed by Motorola for use with their 88000. As a bridge,
the 601 supports a union of the POWER and PowerPC instruction sets, and it
enabled the first PowerPC application writers to easily make the transition
from the older ISA to the newer.
NOTE
The term 32-bit may be unfamiliar to you at this point. If you’re curious about what
it means, you might want to skip ahead and skim the chapter on 64-bit computing,
Chapter 9.
Table 6-1 summarizes the features of the PowerPC 601.
Table 6-1: Features of the PowerPC 601
Introduction Date              March 14, 1994
Process                        0.60 micron
Transistor Count               2.8 million
Die Size                       121 mm²
Clock Speed at Introduction    60–80 MHz
Cache Sizes                    32KB unified L1
First Appeared In              Power Macintosh 6100/60
112
Chapter 6
Even though the joint IBM-Motorola team in Austin, Texas had only
12 months to get this chip off the ground, it was a very nice and full-featured
RISC design for its time.
The 601’s Pipeline and Front End
In the previous chapter, you learned how complex the different Pentiums’
front ends and pipelines tend to be. There is none of that with the 601, which
has a classic four-stage RISC integer pipeline:
1. Fetch
2. Decode/dispatch
3. Execute
4. Write-back
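To make the four-stage flow concrete, here is a minimal Python sketch of an idealized version of this pipeline. It assumes one instruction enters per cycle, every stage takes exactly one cycle, and nothing ever stalls, which is a simplification rather than a description of the real 601. The function names and the instruction mnemonics used as labels are purely illustrative.

```python
STAGES = ["Fetch", "Decode/dispatch", "Execute", "Write-back"]

def pipeline_schedule(instructions):
    """Map each instruction to the cycle in which it occupies each stage.

    Assumption: stall-free pipeline, one new instruction per cycle,
    one cycle per stage. Cycles are numbered from 1.
    """
    schedule = {}
    for i, insn in enumerate(instructions):
        schedule[insn] = {stage: i + s + 1 for s, stage in enumerate(STAGES)}
    return schedule

def total_cycles(n_instructions):
    """Cycles to finish n instructions: fill the pipeline once, then one per cycle."""
    if n_instructions == 0:
        return 0
    return len(STAGES) + n_instructions - 1

sched = pipeline_schedule(["add", "sub", "lwz"])
for insn, stages in sched.items():
    print(insn, stages)
print("total cycles:", total_cycles(3))
```

Under these assumptions, three instructions finish in six cycles rather than twelve, because the stages overlap.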
The fact that PowerPC’s RISC instructions are all the same size means
that the 601’s instruction fetch logic doesn’t have the instruction alignment
headaches that plague x86 designs, and thus the fetch hardware is simpler and
faster. Back when transistor budgets were tight, this kind of thing could
make a big difference in performance, power consumption, and cost.
The PowerPC Instruction Queue
As you can see in Figure 6-1, up to eight instructions per cycle can be fetched
directly into an eight-entry instruction queue (IQ), where they are decoded
before being dispatched to the back end. Get used to seeing the instruction
queue, because it shows up in some form in every single PPC model that we’ll
discuss in this book, all the way down to the PPC 970.
The instruction queue is used mainly for detecting and dealing
with branches. The 601’s branch unit scans the bottom four entries of the
queue, identifying branch instructions and determining what type they
are (conditional, unconditional, etc.). In cases where the branch unit has
enough information to resolve the branch immediately (e.g., in the case of
an unconditional branch, or a conditional branch whose condition depends
on information that’s already in the condition register), the branch instruction
is simply deleted from the instruction queue and replaced with the
instruction located at the branch target.
NOTE
The PowerPC condition register is the analog of the processor status word on the Pentium.
We’ll discuss the condition register in more detail in Chapter 10.
This branch-elimination technique, called branch folding, speeds performance
in two ways. First, it eliminates an instruction (the branch) from
the code stream, which frees up dispatch bandwidth for other instructions.
Second, it eliminates the single-cycle pipeline bubble that usually occurs
immediately after a branch. All of the PowerPC processors covered in this
chapter perform branch folding.
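The effect of branch folding on the code stream can be sketched in a few lines of Python. This is only a software analogy, not a model of the 601's hardware: it treats addresses as simple instruction indices, handles only unconditional branches (the easiest case to resolve immediately), and uses a hypothetical `("b", target)` tuple to stand for a branch.

```python
def fold_branches(program, start=0, limit=100):
    """Return (stream, folded): the instructions that reach dispatch, and
    the number of branches that were folded away.

    `program` maps an instruction index to either a mnemonic string or a
    ("b", target) tuple standing for an unconditional branch.
    """
    stream, pc, folded = [], start, 0
    for _ in range(limit):              # guard against endless branch loops
        if pc not in program:
            break
        insn = program[pc]
        if isinstance(insn, tuple) and insn[0] == "b":
            # Resolvable immediately: drop the branch and continue at its target.
            folded += 1
            pc = insn[1]
            continue
        stream.append(insn)
        pc += 1
    return stream, folded

program = {0: "add", 1: ("b", 10), 2: "sub", 10: "lwz", 11: "stw"}
stream, folded = fold_branches(program)
print(stream, folded)
```

The branch never appears in the resulting stream, so it consumes no dispatch slot, and the instruction at the branch target directly follows the instruction that preceded the branch.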
PowerPC Processors: 600 Series, 700 Series, and 7400
113
Figure 6-1: PowerPC 601 microarchitecture
[Diagram labels: Front End (Instruction Fetch, Instruction Queue, Branch Unit (BU), Decode/Dispatch); Back End (Scalar Arithmetic Logic Units: Floating-Point Unit (FPU-1, FPU-2), Integer ALU (IU-1), System Unit (SU); Write)]
If the branch unit determines that the branch is not taken, it allows the
branch to propagate to the bottom of the queue, where the dispatch logic
simply deletes it from the code stream. The act of allowing not-taken branches
to fall out of the instruction queue is called fall-through, and it happens on
all the PowerPC processors covered in this book.
Non-branch instructions and branch instructions that are not folded sit in
the instruction queue while the dispatch logic examines the four bottommost
entries to see which three of them it can send off to the back end on the next
cycle. The dispatch logic can dispatch up to three instructions per cycle out
of order from the bottom four queue entries, with a few restrictions, of which
one is the most important for our immediate purposes: Integer instructions
can be dispatched only from the bottommost queue entry.
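The dispatch rule just described can be sketched as follows. The sketch models only what the text states (a window of the bottom four entries, at most three dispatches per cycle, integer instructions only from the bottommost entry) plus one labeled assumption that is not from the text: at most one instruction goes to each execution unit per cycle. The function name, the tuple layout, and the unit labels are hypothetical.

```python
def dispatch(queue):
    """Pick the instructions to dispatch this cycle.

    `queue` is a list of (name, unit) tuples with index 0 as the bottom of
    the instruction queue; unit is "int", "fp", or "branch".
    Returns (dispatched, remaining_queue).
    """
    picked, used_units = [], set()
    for pos, (name, unit) in enumerate(queue[:4]):  # only the bottom four are examined
        if len(picked) == 3:                        # at most three per cycle
            break
        if unit == "int" and pos != 0:
            continue    # integer instructions leave only from the bottom entry
        if unit in used_units:
            continue    # assumption: one instruction per execution unit per cycle
        picked.append(pos)
        used_units.add(unit)
    dispatched = [queue[p] for p in picked]
    remaining = [entry for p, entry in enumerate(queue) if p not in picked]
    return dispatched, remaining

queue = [("fadd", "fp"), ("add", "int"), ("bc", "branch"),
         ("sub", "int"), ("fmul", "fp")]
first, queue = dispatch(queue)
second, queue = dispatch(queue)
print(first)
print(second)
print(queue)
```

In this example the `add` has to wait one cycle until it reaches the bottom of the queue, while the floating-point and branch instructions around it dispatch out of order.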
Instruction Scheduling on the 601
Notice that the 601 has no equivalent to the Pentium Pro’s reorder buffer
(ROB) for keeping track of the original program order. Instead, instructions
are tagged with what amounts to metadata so that the write-back logic can
commit the results to the register file in program order. This technique of
tagging instructions with program-order metadata works fine for a simple,
statically scheduled design like the 601 with a very small number of in-flight
instructions. But later, dynamically scheduled PPC designs would require
dedicated structures for tracking larger numbers of in-flight instructions and
making sure that they commit their results in order.
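The idea of tagging instructions with program-order metadata can be illustrated with a small Python sketch. The text does not describe the 601's actual tagging mechanism, so everything here is an assumption made for illustration: each result carries a hypothetical sequence-number tag, and a result is committed only once every earlier tag has been committed.

```python
def commit_in_order(completions):
    """Return results in program order, given completions in finish order.

    `completions` is a list of (tag, register, value) tuples in the order
    the execution units finish; `tag` is the program-order sequence number,
    starting at 0.
    """
    waiting, next_tag, committed = {}, 0, []
    for tag, reg, value in completions:
        waiting[tag] = (reg, value)
        # Commit every result whose predecessors have all been committed.
        while next_tag in waiting:
            committed.append((next_tag, *waiting.pop(next_tag)))
            next_tag += 1
    return committed

# Instruction 1 (a floating-point op) finishes before instruction 0.
finished = [(1, "f1", 2.5), (0, "r3", 7), (2, "r4", 9)]
print(commit_in_order(finished))
```

With only a handful of instructions in flight, the set of waiting results stays tiny, which is why this kind of bookkeeping needs no large dedicated structure.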
The 601’s Back End
From the dispatch stage, instructions go into the 601’s back end, where
they’re executed by one of three different execution units: the integer unit,
the floating-point unit, or the branch unit. Let’s take a look at each of these
units in turn.
The Integer Unit
The 601’s 32-bit integer unit is a straightforward fixed-point ALU that is
responsible for all of the integer math—including address calculations—
on the chip. While x86 designs, like the original Pentium, need extra address
adders to keep all of the address calculations associated with x86’s
multiplicity of addressing modes from tying up the back end’s integer hardware, the 601’s
RISC, load-store memory model means that it can feasibly handle memory
traffic and regular ALU traffic with a single integer execution unit.
So the 601’s integer unit handles the following memory-related functions,
most of which are moved off into a dedicated load-store unit in subsequent
PPC designs:
• integer and floating-point load-address calculations
• integer and floating-point store-address calculations
• integer and floating-point load-data operations
• integer store-data operations
Cramming all of these load-store functions into the 601’s single integer
ALU doesn’t exactly help the chip’s integer performance, but it is good
enough to keep up with the Pentium in this area, even though the Pentium
has two integer ALUs. Most of this integer performance parity probably comes
from the 601’s huge 32KB unified L1 cache (compare that to the Pentium’s
8KB split L1), a luxury afforded the 601 by the relative simplicity of its front-end decoding hardware.
A final point worth noting about the 601’s integer unit is that multi-cycle
integer instructions (e.g., integer multiplies and divides) are not fully pipelined.
When an instruction that takes, say, five cycles to execute enters the IU, it
ties up the entire IU for the whole five cycles. Thankfully, the most common
integer instructions are single-cycle instructions.
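The cost of a non-pipelined unit is easy to quantify with a short sketch. The latencies below are made-up illustrative numbers, and the fully pipelined case is a hypothetical point of comparison, not a description of any particular chip.

```python
def nonpipelined_cycles(latencies):
    """Unit is blocked for each instruction's full latency, so times add up."""
    return sum(latencies)

def pipelined_cycles(latencies):
    """Hypothetical fully pipelined unit: instruction i starts in cycle i
    and finishes latencies[i] cycles later; the last finisher sets the total."""
    return max((i + lat for i, lat in enumerate(latencies)), default=0)

# One five-cycle instruction among single-cycle ones (illustrative values).
latencies = [1, 5, 1, 1]
print(nonpipelined_cycles(latencies))
print(pipelined_cycles(latencies))
```

For a stream made up entirely of single-cycle instructions the two functions return the same count, which is why the lack of pipelining matters little for the most common integer code.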
The Floating-Point Unit
With its single floating-point unit, which handles all floating-point calculations
and store-data operations, the 601 was a very strong performer when it
was first launched.
The 601’s floating-point pipeline is six stages long, and includes the
four basic stages outlined earlier in this chapter, but with an extra decode
stage and an extra execute stage. What really sets the chip’s floating-point
hardware apart when compared to its contemporaries is the fact that not only
are almost all single-precision operations fully pipelined, but most double-
precision (64-bit) floating-point operations are as well. This means that for
single-precision operations (with the exception of divides) and most double-
precision operations, the 601’s floating-point hardware can turn out one
instruction per cycle with a two-cycle latency.
Another great feature of the 601’s FPU is its ability to do single-precision
fused multiply-add (fmadd) instructions with single-cycle throughput. The fmadd
is a core digital signal processing (DSP) and scientific computing function,
so the 601’s fast fmadd capabilities make it well suited to these types of applications. This single-cycle fmadd capability is actually a significant feature of the entire PowerPC computing line, from the 601 on down to the present day, and
it is one reason why these processors have been so popular for media and
scientific applications.
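For readers who haven't met the operation before, a fused multiply-add computes a × b + c and rounds the result only once, whereas a separate multiply followed by an add rounds twice. The following Python sketch illustrates that numerical difference only; it says nothing about throughput, it uses Python's double-precision floats rather than single precision, and it emulates the fused operation with exact rational arithmetic instead of calling a hardware instruction. The function names are illustrative.

```python
from fractions import Fraction

def fmadd(a, b, c):
    """Emulate a fused multiply-add: compute a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def mul_then_add(a, b, c):
    """Separate multiply and add: the product is rounded before the addition."""
    return a * b + c

# The exact product is 1 - 2**-60, which is not representable as a double.
a, b, c = 1.0 + 2.0**-30, 1.0 - 2.0**-30, -1.0
print(mul_then_add(a, b, c))  # the product rounds to 1.0, so the tiny term is lost
print(fmadd(a, b, c))         # the single rounding preserves it
```

Multiply-then-accumulate is the inner step of dot products and filters, which is why DSP and scientific codes lean on this operation so heavily.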
Another factor in the 601’s floating-point dominance is that its integer
unit handles all of the memory traffic (with the FPU providing the data for
floating-point stores). This means that during long stretches of floating-point–
only code, the integer unit acts like a dedicated load-store unit (LSU), whose
sole purpose is to keep the FPU fed with data.
Such an FPU + LSU combination performs well for two reasons: First,
integer and floating-point code are rarely mixed, so it doesn’t matter for performance
if the integer unit is tied up with floating-point–related memory
traffic. Second, floating-point code is often data-intensive, with lots of loads and stores, and thus high levels of memory traffic to keep a dedicated
LSU busy.
When you combine both of these factors with the 601’s hefty 32KB L1
cache and its ability to complete one fused multiply-add per clock, you have a
floating-point force to be reckoned with in 1994 terms.
The Branch Execution Unit
The 601’s branch unit (BU) works in combination with the instruction fetcher
and the instruction queue to steer the front end of the processor through
the code stream by executing branch instructions and predicting branches.
Regarding the latter function, the 601’s BU uses a simple static branch
predictor to predict conditional branches. I’ll talk a bit more about branch
prediction and speculative execution in covering the 603e in “The PowerPC
The Sequencer Unit
The 601 contains a peculiar holdover from the IBM RSC called the sequencer
unit. The sequencer unit, which I’ll admit is a bit of a mystery to me, appears
to be a small, CISC-like processor with its own 18-bit instruction set, 32-word
RAM, microcode ROM, register file, and execution unit, all of which are
embedded on the 601. Its purpose is to execute some legacy instructions
particular to the older RSC; to take care of housekeeping chores like self-test, reset, and initialization functions; and to handle exceptions, interrupts,
and errors.
The inclusion of the sequencer unit on the 601 is quite obviously the result
of the time crunch that the 601 team faced in bringing the first PowerPC
chip to market; IBM admitted this much in its 601 white paper. The team
started with IBM’s RSC as its basis and began redesigning it to implement the
PowerPC ISA. Instead of throwing out the sequencer unit, a component that
played a major role in the functioning of the original RSC, IBM simply scaled
back its size and functionality for use in the 601.
I don’t have any exact figures, but I think it’s safe to say that this embedded
subprocessor unit took up a decent amount of die space on the 601 and that
the design team would have thrown it out if it had had more time. Subsequent
PowerPC processors, which didn’t have to worry about RSC legacy support,
implemented all of the (non–RSC-related) functions of the 601’s sequencer