Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture

Author: Jon Stokes

was also born a subset of the POWER architecture dubbed PowerPC. PowerPC

processors were to be jointly designed and produced by IBM and Motorola

with input from Apple, and were to be used in Apple computers and in the

embedded market. The AIM alliance has since passed into history, but

PowerPC lives on, not only in Apple computers but in a whole host of

different products that use PowerPC-based chips from Motorola and IBM.

The PowerPC 601

In 1993, AIM kicked off the PowerPC party by releasing the 32-bit PowerPC 601

at an initial speed of 66 MHz. The 601, which was based on IBM’s older RISC

Single Chip (RSC) processor and was originally designed to serve as a “bridge”

between POWER and PowerPC, combines parts of IBM’s POWER architecture

with the 60x bus developed by Motorola for use with their 88000. As a bridge, the 601 supports a union of the POWER and PowerPC instruction sets, and it

enabled the first PowerPC application writers to easily make the transition

from the older ISA to the newer.

NOTE

The term 32-bit may be unfamiliar to you at this point. If you’re curious about what it means, you might want to skip ahead and skim the chapter on 64-bit computing, Chapter 9.

Table 6-1 summarizes the features of the PowerPC 601.

Table 6-1: Features of the PowerPC 601

Introduction Date             March 14, 1994
Process                       0.60 micron
Transistor Count              2.8 million
Die Size                      121 mm²
Clock Speed at Introduction   60–80 MHz
Cache Sizes                   32KB unified L1
First Appeared In             Power Macintosh 6100/60

Chapter 6

Even though the joint IBM-Motorola team in Austin, Texas, had only

12 months to get this chip off the ground, it was a very nice and full-featured

RISC design for its time.

The 601’s Pipeline and Front End

In the previous chapter, you learned how complex the different Pentiums’

front ends and pipelines tend to be. There is none of that with the 601, which

has a classic four-stage RISC integer pipeline:

Fetch

Decode/dispatch

Execute

Write-back

The fact that PowerPC’s RISC instructions are all the same size means

that the 601’s instruction fetch logic doesn’t have the instruction alignment

headaches that plague x86 designs, and thus the fetch hardware is simpler and faster. Back when transistor budgets were tight, this kind of thing could

make a big difference in performance, power consumption, and cost.
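To make the fixed-size point concrete, here is a minimal sketch in Python rather than anything resembling real fetch hardware. It assumes 32-bit big-endian PowerPC instruction words; the function names and the toy variable-length encoding (first byte gives the instruction's length) are hypothetical and are not x86's actual format.

```python
def fetch_fixed(code, width=4):
    """Split a code buffer into fixed-width, big-endian instruction words.

    Every instruction boundary is known in advance: word i starts at i * width.
    """
    if len(code) % width:
        raise ValueError("code length must be a multiple of the instruction width")
    return [int.from_bytes(code[i:i + width], "big")
            for i in range(0, len(code), width)]


def fetch_variable(code, length_of):
    """Walk a variable-length stream.

    The start of each instruction is unknown until the previous one has been
    at least partially decoded, so the walk is inherently sequential.
    `length_of` is a hypothetical helper mapping a first byte to a length.
    """
    out, i = [], 0
    while i < len(code):
        n = length_of(code[i])
        if n < 1:
            raise ValueError("instruction length must be at least 1")
        out.append(code[i:i + n])
        i += n
    return out


# Two PowerPC words: addi r3,r3,1 and blr.
ppc_words = fetch_fixed(bytes.fromhex("386300014e800020"))

# Toy variable-length stream where the first byte is the total length.
toy_stream = fetch_variable(bytes([2, 0xAA, 1, 3, 0xBB, 0xCC]), lambda b: b)
```

In the fixed-width case all of the slices could be taken in parallel, whereas the variable-length walk cannot find instruction N until it has measured instructions 0 through N-1.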

The PowerPC Instruction Queue

As you can see in Figure 6-1, up to eight instructions per cycle can be fetched

directly into an eight-entry instruction queue (IQ), where they are decoded before being dispatched to the back end. Get used to seeing the instruction

queue, because it shows up in some form in every single PPC model that we’ll

discuss in this book, all the way down to the PPC 970.

The instruction queue is used mainly for detecting and dealing

with branches. The 601’s branch unit scans the bottom four entries of the

queue, identifying branch instructions and determining what type they

are (conditional, unconditional, etc.). In cases where the branch unit has

enough information to resolve the branch immediately (e.g., in the case of

an unconditional branch, or a conditional branch whose condition depends

on information that’s already in the condition register), the branch instruction is simply deleted from the instruction queue and replaced with the

instruction located at the branch target.

NOTE

The PowerPC condition register is the analog of the processor status word on the Pentium.

We’ll discuss the condition register in more detail in Chapter 10.

This branch-elimination technique, called branch folding, speeds performance in two ways. First, it eliminates an instruction (the branch) from

the code stream, which frees up dispatch bandwidth for other instructions.

Second, it eliminates the single-cycle pipeline bubble that usually occurs

immediately after a branch. All of the PowerPC processors covered in this

chapter perform branch folding.

PowerPC Processors: 600 Series, 700 Series, and 7400

[Figure: block diagram. Front end: Instruction Fetch, Instruction Queue, Branch Unit, Decode/Dispatch. Back end: Floating-Point Unit (FPU-1, FPU-2), Integer Unit ALU (IU-1), and System Unit, grouped as Scalar Arithmetic Logic Units; Write.]

Figure 6-1: PowerPC 601 microarchitecture

If the branch unit determines that the branch is not taken, it allows the

branch to propagate to the bottom of the queue, where the dispatch logic

simply deletes it from the code stream. The act of allowing not-taken branches to fall out of the instruction queue is called fall-through, and it happens on all the PowerPC processors covered in this book.
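The net effect of branch folding and fall-through can be sketched with a toy model. The sketch below assumes every branch condition is already known when the branch unit sees it, which sidesteps prediction entirely. The `Instr` record, the mnemonics, and the `dispatched_stream` function are illustrative names, not part of any real 601 tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    op: str                        # mnemonic, e.g. "add", "b", "bc"
    target: Optional[int] = None   # index of the branch target (branches only)
    cond: Optional[str] = None     # condition-register bit tested by "bc"


def dispatched_stream(program, cond_reg, max_steps=100):
    """Return the ops that actually reach the back end.

    Taken branches are folded (replaced by the instruction at the target);
    not-taken branches fall through and are deleted. Either way, no branch
    shows up in the dispatched stream.
    """
    out, pc, steps = [], 0, 0
    while pc < len(program) and steps < max_steps:
        steps += 1
        ins = program[pc]
        if ins.op == "b":                  # unconditional: always folded
            pc = ins.target
        elif ins.op == "bc":               # conditional, condition already known
            pc = ins.target if cond_reg[ins.cond] else pc + 1
        else:
            out.append(ins.op)
            pc += 1
    return out


program = [
    Instr("add"),                       # 0
    Instr("bc", target=4, cond="eq"),   # 1
    Instr("sub"),                       # 2
    Instr("b", target=5),               # 3
    Instr("mul"),                       # 4
    Instr("or"),                        # 5
]
taken = dispatched_stream(program, {"eq": True})
not_taken = dispatched_stream(program, {"eq": False})
```

In both runs the back end sees three instructions and no branches, which is the dispatch-bandwidth saving the text describes.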

Non-branch instructions and branch instructions that are not folded sit in

the instruction queue while the dispatch logic examines the four bottommost

entries to see which three of them it can send off to the back end on the next

cycle. The dispatch logic can dispatch up to three instructions per cycle out

of order from the bottom four queue entries, with a few restrictions, of which

one is the most important for our immediate purposes: Integer instructions

can be dispatched only from the bottommost queue entry.
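The dispatch rule just described can be modeled in a few lines. This is a sketch under stated assumptions: the unit labels ("int", "fp", "br") are made up for illustration, the model assumes at most one instruction per execution unit per cycle (which is what caps dispatch at three), and the other restrictions mentioned in the text are not modeled.

```python
def dispatch_cycle(queue):
    """Dispatch from the bottom four queue entries; queue[0] is the bottom.

    Mutates `queue` and returns the (name, unit) pairs dispatched this cycle.
    """
    chosen = []
    busy_units = set()
    for pos, (name, unit) in enumerate(queue[:4]):
        if unit == "int" and pos != 0:
            continue          # integer ops leave only from the bottommost entry
        if unit in busy_units:
            continue          # assumption: one instruction per unit per cycle
        chosen.append((name, unit))
        busy_units.add(unit)
    for item in chosen:
        queue.remove(item)    # remaining entries shift toward the bottom
    return chosen


queue = [("add1", "int"), ("add2", "int"), ("fmul1", "fp"),
         ("b1", "br"), ("fadd1", "fp")]
cycle1 = dispatch_cycle(queue)
cycle2 = dispatch_cycle(queue)
```

Note how `add2` is skipped in the first cycle even though the integer unit's slot is being filled by `add1`: it is not in the bottom entry, so it must wait, while the floating-point and branch instructions behind it go out of order.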

Instruction Scheduling on the 601

Notice that the 601 has no equivalent to the Pentium Pro’s reorder buffer

(ROB) for keeping track of the original program order. Instead, instructions

are tagged with what amounts to metadata so that the write-back logic can

commit the results to the register file in program order. This technique of

tagging instructions with program-order metadata works fine for a simple,

statically scheduled design like the 601 with a very small number of in-flight

instructions. But later, dynamically scheduled PPC designs would require

dedicated structures for tracking larger numbers of in-flight instructions and

making sure that they commit their results in order.

The 601’s Back End

From the dispatch stage, instructions go into the 601’s back end, where

they’re executed by one of three different execution units: the integer unit,

the floating-point unit, or the branch unit. Let’s take a look at each of these

units in turn.

The Integer Unit

The 601’s 32-bit integer unit is a straightforward fixed-point ALU that is

responsible for all of the integer math—including address calculations—

on the chip. While x86 designs, like the original Pentium, need extra address adders to keep all of the address calculations associated with x86’s multiplicity of addressing modes from tying up the back end’s integer hardware, the 601’s

RISC, load-store memory model means that it can feasibly handle memory

traffic and regular ALU traffic with a single integer execution unit.

So the 601’s integer unit handles the following memory-related functions,

most of which are moved off into a dedicated load-store unit in subsequent

PPC designs:

integer and floating-point load-address calculations

integer and floating-point store-address calculations

integer and floating-point load-data operations

integer store-data operations

Cramming all of these load-store functions into the 601’s single integer

ALU doesn’t exactly help the chip’s integer performance, but it is good

enough to keep up with the Pentium in this area, even though the Pentium

has two integer ALUs. Most of this integer performance parity probably comes

from the 601’s huge 32KB unified L1 cache (compare that to the Pentium’s

8KB split L1), a luxury afforded the 601 by the relative simplicity of its front-end decoding hardware.

A final point worth noting about the 601’s integer unit is that multi-cycle

integer instructions (e.g., integer multiplies and divides) are not fully pipelined.

When an instruction that takes, say, five cycles to execute enters the IU, it

ties up the entire IU for the whole five cycles. Thankfully, the most common

integer instructions are single-cycle instructions.
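A small timeline sketch shows what "not fully pipelined" costs. The latencies used here are hypothetical (the five-cycle figure simply echoes the example in the text), and `iu_schedule` is an illustrative name.

```python
def iu_schedule(latencies):
    """Return (start, finish) cycles for ops on a single, non-pipelined IU.

    Each instruction occupies the unit for its full latency, so the next one
    cannot start until the previous one has finished.
    """
    schedule = []
    cycle = 0
    for lat in latencies:
        schedule.append((cycle, cycle + lat))
        cycle += lat
    return schedule


# A single-cycle op, a hypothetical five-cycle multiply, then another
# single-cycle op that has to wait for the multiply to drain.
sched = iu_schedule([1, 5, 1])
```

The third instruction does not start until cycle 6, even though it needs only one cycle of work.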

The Floating-Point Unit

With its single floating-point unit, which handles all floating-point calculations and store-data operations, the 601 was a very strong performer when it

was first launched.

The 601’s floating-point pipeline is six stages long, and includes the

four basic stages outlined earlier in this chapter, but with an extra decode

stage and an extra execute stage. What really sets the chip’s floating-point

hardware apart when compared to its contemporaries is the fact that not only

are almost all single-precision operations fully pipelined, but most double-precision (64-bit) floating-point operations are as well. This means that for single-precision operations (with the exception of divides) and most double-precision operations, the 601’s floating-point hardware can turn out one

instruction per cycle with a two-cycle latency.
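The difference between throughput and latency is easy to see with a little arithmetic. The sketch below assumes a stream of independent operations with no stalls; the function names are illustrative, and the two-cycle latency is the figure quoted above.

```python
def pipelined_cycles(n_ops, latency=2):
    """Cycles to finish n independent ops when one can be issued every cycle."""
    if n_ops <= 0:
        return 0
    # The first result appears after `latency` cycles; each later op
    # completes one cycle after the one before it.
    return latency + (n_ops - 1)


def unpipelined_cycles(n_ops, latency=2):
    """Cycles if each op had to finish before the next could begin."""
    return max(n_ops, 0) * latency


fast = pipelined_cycles(100)
slow = unpipelined_cycles(100)
```

For long runs of independent floating-point operations, the pipelined unit approaches one result per cycle, roughly twice the rate of a hypothetical unpipelined unit with the same latency.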

Another great feature of the 601’s FPU is its ability to do single-precision

fused multiply-add (fmadd) instructions with single-cycle throughput. The fmadd

is a core digital signal processing (DSP) and scientific computing function,

so the 601’s fast fmadd capabilities make it well suited to these types of applications. This single-cycle fmadd capability is actually a significant feature of the entire PowerPC computing line, from the 601 on down to the present day, and

it is one reason why these processors have been so popular for media and

scientific applications.
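To see why the multiply-add matters so much for DSP and scientific code, consider a dot product, the inner loop of filters and matrix math. The sketch below is plain Python: each loop iteration stands in for one fmadd, but Python itself performs a separate multiply and add, so this illustrates only the operation count, not the hardware's behavior. The name `dot_fmadd` is hypothetical.

```python
def dot_fmadd(xs, ys):
    """Dot product written as a chain of multiply-adds.

    Returns the result and the number of multiply-add steps, each of which
    would map to a single fmadd instruction on hardware that has one.
    """
    acc = 0.0
    fmadds = 0
    for x, y in zip(xs, ys):
        acc = x * y + acc   # one multiply-add per element
        fmadds += 1
    return acc, fmadds


result, count = dot_fmadd([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

An N-element dot product needs N multiply-adds, so a unit that sustains one fmadd per cycle gets through the arithmetic in roughly N cycles instead of spending separate cycles on each multiply and each add.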

Another factor in the 601’s floating-point dominance is that its integer

unit handles all of the memory traffic (with the FPU providing the data for

floating-point stores). This means that during long stretches of floating-point–only code, the integer unit acts like a dedicated load-store unit (LSU), whose

sole purpose is to keep the FPU fed with data.

Such an FPU + LSU combination performs well for two reasons: First,

integer and floating-point code are rarely mixed, so it doesn’t matter for performance if the integer unit is tied up with floating-point–related memory

traffic. Second, floating-point code is often data-intensive, with lots of loads and stores, and thus high levels of memory traffic to keep a dedicated

LSU busy.

When you combine both of these factors with the 601’s hefty 32KB L1

cache and its ability to complete fused multiply-adds at a rate of one per clock, you have a floating-point force to be reckoned with in 1994 terms.

The Branch Execution Unit

The 601’s branch unit (BU) works in combination with the instruction fetcher

and the instruction queue to steer the front end of the processor through

the code stream by executing branch instructions and predicting branches.

Regarding the latter function, the 601’s BU uses a simple static branch predictor to predict conditional branches. I’ll talk a bit more about branch prediction and speculative execution in covering the 603e in “The PowerPC 603 and 603e” on page 118.

The Sequencer Unit

The 601 contains a peculiar holdover from the IBM RSC called the sequencer

unit. The sequencer unit, which I’ll admit is a bit of a mystery to me, appears to be a small, CISC-like processor with its own 18-bit instruction set, 32-word

RAM, microcode ROM, register file, and execution unit, all of which are

embedded on the 601. Its purpose is to execute some legacy instructions

particular to the older RSC; to take care of housekeeping chores like self-test, reset, and initialization functions; and to handle exceptions, interrupts,

and errors.

The inclusion of the sequencer unit on the 601 is quite obviously the result

of the time crunch that the 601 team faced in bringing the first PowerPC

chip to market; IBM admitted this much in its 601 white paper. The team

started with IBM’s RSC as its basis and began redesigning it to implement the

PowerPC ISA. Instead of throwing out the sequencer unit, a component that

played a major role in the functioning of the original RSC, IBM simply scaled

back its size and functionality for use in the 601.

I don’t have any exact figures, but I think it’s safe to say that this embedded

subprocessor unit took up a decent amount of die space on the 601 and that