branch instruction that is the exit condition evaluates to taken many
thousands of times; it only evaluates to not taken once, when the program
exits the loop. For such types of code, simple static branch prediction, where the
branch prediction unit guesses that all branches will be taken every time,
works quite well.
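To make that pattern concrete, here is a minimal C sketch (my illustration, not from the text; the function and names are hypothetical) of the kind of loop the discussion describes. The conditional branch that closes the loop evaluates to taken on every iteration except the last, so a static predict-taken scheme guesses right almost every time:

    #include <stddef.h>

    /* Scale an array of doubles in place. For an n-element array, the
       loop's closing branch evaluates to taken n - 1 times and to not
       taken exactly once, when the loop exits. */
    void scale(double *a, size_t n, double factor)
    {
        for (size_t i = 0; i < n; i++)
            a[i] *= factor;   /* same small loop body, run thousands of times */
    }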
This description also points out two other ways in which floating-point
code is the opposite of integer code. First, floating-point code has excellent
locality of reference with respect to the instruction cache. A floating-point–
intensive program, as I’ve noted, spends a large part of its execution time in
relatively small loops, and these loops often fit in one of the processor caches.
Second, because floating-point–intensive applications operate on large data
files that are streamed into the processor from main memory, memory band-
width is extremely important for floating-point performance. So while integer
programs need good branch prediction and caching to keep the IUs fed with
instructions, floating-point programs need good memory bandwidth to keep
the FPUs fed with data.
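As a concrete illustration of the bandwidth point, consider this STREAM-style "triad" kernel, a common idiom in memory-bandwidth benchmarks (the sketch is mine, not the book's). Each iteration performs only a multiply and an add but moves 24 bytes to and from memory, so on large arrays the memory system, not the FPU, sets the pace:

    #include <stddef.h>

    /* Bandwidth-bound floating-point kernel: two 8-byte loads and one
       8-byte store per iteration, but only one multiply and one add.
       With arrays too large to cache, throughput is limited by how fast
       main memory can stream the data in and out. */
    void triad(double *a, const double *b, const double *c,
               double s, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }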
Now that you understand how floating-point code tends to operate, let’s
look at the FPUs of the G4e and Pentium 4 to see how they tackle the problem.
The G4e’s FPU
Since the G4e’s designers have bet on the processor’s vector execution
units (described shortly) to do most of the serious floating-point heavy
lifting, they made the G4e’s FPU fairly simple and straightforward. It has
a single pipeline for all floating-point instructions, and both single- and
double-precision operations take the same number of cycles. (This ability
to do double-precision FP is important mostly for scientific applications
like simulations.) In addition, one single- or double-precision operation
can be issued per cycle, with one important restriction (described later in
this section). Finally, the G4e inherits the PPC line’s ability to do single-
cycle fmadds, and this time, both double- and single-precision fmadds have
a single-cycle throughput.
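For readers who haven’t met the instruction, a fused multiply-add computes a × b + c as a single operation. In C99, the fma() function from math.h expresses the same thing, and on PowerPC compilers it typically maps to the fmadd family (this snippet is an illustration of mine, not from the text):

    #include <math.h>

    /* One step of a dot product: returns sum + (a * b), computed as a
       fused multiply-add with a single rounding at the end. */
    double dot_step(double sum, double a, double b)
    {
        return fma(a, b, sum);
    }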
Almost all of the G4e’s floating-point instructions take five cycles to
complete. There are a few instructions, however, that can take longer (fdiv
and fre, for example). These longer instructions can take from 14 to 35
cycles to complete, and while they’re executing, they hang up the floating-
point pipeline, meaning that no other instructions can execute during
that time.
One rarely discussed weakness of the G4e’s FPU is the fact that it isn’t
fully pipelined, which means that it can’t have five different instructions in
five different stages of execution simultaneously. Motorola’s software
optimization manual for the 7450 states that if the 7450’s first four stages
of execution are occupied, the FPU will stall on the next cycle. This means
that the FPU’s peak theoretical instruction throughput is four instructions
every five cycles.
It could plausibly be said that the PowerPC ISA gives the G4e a slight
advantage over the Pentium 4 in terms of floating-point performance.
However, a better way of phrasing that would be to say that the x86 ISA
(or more specifically, the x87 floating-point extensions) puts the Pentium 4
at a slight disadvantage with respect to the rest of the world. In other words,
the PPC’s floating-point implementation is fairly normal and unremarkable,
whereas the x87 has a few quirks that can make life difficult for FPU
designers and compiler writers.
As mentioned at the outset of this chapter, the PPC ISA has instructions
with one-, two-, three-, and four-operand formats. This puts a lot of power in
the hands of the compiler or programmer as far as scheduling floating-point
instructions to minimize dependencies and increase throughput and perfor-
mance. Furthermore, this instruction format flexibility is augmented by a
flat, 32-entry floating-point register file, which yields even more scheduling
flexibility and even more performance.
In contrast, the instructions that make up the x87 floating-point
extensions support two operands at most. You saw in Chapter 5 that the
x87’s very small eight-entry register file has a stack-based structure that
limits it in certain ways. All Pentium processors up until the Pentium 4 get
around this stack-related limitation with the “free” fxch instruction described
in Chapter 5, but on the Pentium 4, fxch is no longer free.
So the Pentium 4’s small, stack-based floating-point register file and two-
operand floating-point instruction format put the processor at a disadvantage
compared to the G4e’s cleaner PowerPC floating-point specification. The
Pentium 4’s 128 floating-point rename registers help alleviate some of the
false dependencies that arise from the low number of architectural registers,
but they don’t help much with the other problems.
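To see why the format matters, consider a compiler translating d = a * b + c. The comments below sketch, in simplified and hypothetical form rather than as exact compiler output, how each ISA might express it:

    /* d = a * b + c
     *
     * PowerPC (three input operands, flat 32-entry register file):
     *     fmadd f4, f1, f2, f3    # f4 = (f1 * f2) + f3: one instruction,
     *                             # and no source register is overwritten
     *
     * x87 (at most two operands, eight-entry register stack):
     *     fld   a                 # push a onto the stack
     *     fmul  b                 # st(0) = st(0) * b
     *     fadd  c                 # st(0) = st(0) + c
     *     fstp  d                 # pop the result into d
     *
     * The x87 sequence ties up the top of the stack, so independent
     * calculations must be interleaved with fxch instructions to reach
     * their operands: cheap on the P6, but no longer free on the Pentium 4.
     */
    double madd(double a, double b, double c)
    {
        return a * b + c;
    }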
The Pentium 4’s FPU
There are two fully independent FPU pipelines on the Pentium 4, one of
which is strictly for floating-point memory operations (loading and storing
floating-point data). Since floating-point applications are extremely data-
and memory-intensive, separating the floating-point memory operations and
giving them their own execution unit helps a bit with performance.
The other FPU pipeline is for all floating-point arithmetic operations,
and except for the fact that it doesn’t execute memory instructions, it’s very
similar to the G4e’s single FPU. Most simple floating-point operations take
between five and seven cycles, with a few more complicated operations (like
floating-point division) tying up the pipeline for a significantly longer time.
Single- and double-precision operations take the same number of cycles, with
both single- and double-precision floating-point numbers being converted
into the x87’s internal 80-bit temporary format. (This conversion is done
for overflow reasons and doesn’t concern us here.)
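The 80-bit format is visible from C on many x86 toolchains, where long double maps onto the x87 extended format. This small, platform-dependent check (my illustration, not from the text) reports a 64-bit significand for long double versus 53 bits for an ordinary double on such systems:

    #include <stdio.h>
    #include <float.h>

    /* On compilers that implement long double as the x87 80-bit extended
       format, LDBL_MANT_DIG is 64, versus 53 for a 64-bit double. */
    int main(void)
    {
        printf("double significand bits:      %d\n", DBL_MANT_DIG);
        printf("long double significand bits: %d\n", LDBL_MANT_DIG);
        return 0;
    }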
So the Pentium 4’s FPU hardware executes instructions with slightly
higher instruction latencies than the FPU of its predecessor, the P6 (for
example, three to five cycles on the P6 for common instructions), but because
the Pentium 4’s clock speed is so much higher, it can still complete more
floating-point instructions in a shorter period of time. The same is true of
the Pentium 4’s FPU in comparison with the G4e’s FPU—the Pentium 4
takes more clock cycles than the G4e to execute floating-point instructions,
but those clock cycles are much faster. So the Pentium 4’s clock-speed
advantage and high-bandwidth front-side bus give it a distinct advantage
over the Pentium III in floating-point–intensive benchmarks and enable it
to be more than competitive with the G4e in spite of the x87’s drawbacks.
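To put rough, illustrative numbers on that trade-off (the clock speeds here are examples of mine, not figures from the text): a 5-cycle operation on a 733 MHz processor takes about 5 / 733 MHz ≈ 6.8 ns, while a 7-cycle operation on a 1.5 GHz processor takes about 7 / 1.5 GHz ≈ 4.7 ns. The higher-latency pipeline still finishes first in absolute time.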
Concluding Remarks on the G4e’s and Pentium 4’s FPUs
The take-home message in the preceding discussion can be summed up
as follows: While the G4e has fairly standard, unremarkable floating-point
hardware, the PPC ISA does things the way they’re supposed to be done—
with instructions that take three or more operands and a large, flat register
file. The Pentium 4, on the other hand, has slightly better hardware but is
hobbled by the legacy x87 ISA. The exact degree to which the x87’s
weaknesses affect performance has been debated for as long as the x87 has
been around, but there seems to be a consensus that the situation is less
than ideal.
The other thing that’s extremely important to note is that when it comes
to floating-point performance, a good memory subsystem is absolutely key.
It doesn’t matter how good a processor’s floating-point hardware is—if you
can’t keep it fed, it won’t be doing much work. Therefore, floating-point
performance on both Pentium 4– and G4e-based systems depends on each
system’s available memory bandwidth.
The Vector Execution Units
One key technology on which both the Pentium 4 and the G4e rely for per-
formance in their most important type of application—media applications
(image processing, streaming media, 3D rendering, etc.)—is Single Instruc-
tion, Multiple Data (SIMD) computing, also known as vector computing. This
section looks at SIMD on both the G4e and the Pentium 4.
A Brief Overview of Vector Computing
Chapter 1 discussed the movement of floating-point and vector capabilities
from co-processors onto the CPU die. However, the addition of vector
instructions and hardware to a modern, superscalar CPU is a bit more
drastic than the addition of floating-point capability. A microprocessor is a
Single Instruction stream, Single Data stream (SISD) device, and it has been
since its inception, whereas vector computation represents a fundamentally
different type of computing: SIMD. Figure 8-1 compares SIMD and SISD in
terms of a simple diagram that was introduced in Chapter 1.
As you can see in Figure 8-1, an SIMD machine exploits a property of
the data stream called data parallelism. Data parallelism is said to be present
in a dataset when its elements can be processed in parallel, a situation that
most often occurs in large masses of data of a uniform type, like media files.
Chapter 1 described media applications as applications that use small, repeti-
tious chunks of code to operate on large, uniform datasets. Since these small
chunks of code apply the same sequence of operations to every element of a
large dataset, and these datasets can often be processed out of order, it makes
sense to use SIMD to apply the same instructions to multiple elements at once.
Figure 8-1: SISD versus SIMD
A classic example of a media application that exploits data parallelism is
the inversion of a digital image to produce its negative. The image processing
program must iterate through an array of uniform integer values (pixels)
and perform the same operation (inversion) on each one. Consequently,
there are multiple data points on which a single operation is performed, and
the order in which that operation is performed on the data points doesn’t
affect the outcome. The program could start the inversion at the top of the
image, the bottom of the image, or in the middle of the image—it doesn’t
matter as long as the entire image is inverted.
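In scalar C, the inversion looks something like the following sketch (mine, assuming 8-bit grayscale pixels). A SIMD version would perform the same subtraction on many pixels per instruction, for example 16 at a time with a single 128-bit vector operation:

    #include <stddef.h>
    #include <stdint.h>

    /* Invert an 8-bit grayscale image: each output pixel is 255 minus
       the input pixel. Every element receives the same operation, and
       the traversal order doesn't affect the result: textbook data
       parallelism. */
    void invert(uint8_t *pixels, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            pixels[i] = 255 - pixels[i];
    }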
This technique of applying a single instruction to multiple data elements
at once is quite effective in yielding significant speedups for many types of
applications, especially streaming media, image processing, and 3D rendering.
In fact, many of the floating-point–intensive applications described previously
can benefit greatly from SIMD, which is why both the G4e and the Pentium 4
skimped on the traditional FPU in favor of strengthening their SIMD units.
There were some early, ill-fated attempts at making a purely SIMD
machine, but the SIMD model is simply not flexible enough to accommo-
date general-purpose code. The only form in which SIMD is really feasible
is as a part of a SISD host machine that can execute branch instructions
and other types of code that SIMD doesn’t handle well. This is, in fact,
the situation with SIMD in today’s market. Programs are written for a SISD
machine and include SIMD instructions in their code.
Vectors Revisited: The AltiVec Instruction Set
The basic data unit of SIMD computation is the vector, which is why SIMD
computing is also known as vector computing or
vector processing. Vectors, which you met in Chapter 1, are nothing more
than rows of individual numbers, or
scalars. Figure 8-2 illustrates the differences between vectors and scalars.
A simple CPU operates on scalars one at a time. A superscalar CPU
operates on multiple scalars at once, but it may perform a different operation
on each one. A vector processor lines up a whole row of scalars, all
of the same type, and operates on them in parallel as a unit.
Figure 8-2: Scalars versus vectors
Vectors are represented in what is called a packed data format, in which
data are grouped into bytes or words and packed into a vector of a certain
length. To take Motorola’s AltiVec, for example, each of the 32 AltiVec
registers is 128 bits wide, which means that AltiVec can operate on vectors
that are 128 bits wide. AltiVec’s 128-bit wide vectors can be subdivided into
•	16 elements, where each element is either an 8-bit signed or unsigned
	integer or an 8-bit character;
•	8 elements, where each element is a 16-bit signed or unsigned integer;