branch instruction that is the exit condition evaluates to taken many
thousands of times; it only evaluates to not taken once, when the program
exits the loop. For such types of code, simple static branch prediction, where the
branch prediction unit guesses that all branches will be taken every time,
works quite well.
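To make that pattern concrete, here is a minimal C sketch (my illustration, not from the text; the function and names are hypothetical) of the kind of loop the discussion describes. The conditional branch that closes the loop evaluates to taken on every iteration except the last, so a static predict-taken scheme guesses right almost every time:

    #include <stddef.h>

    /* Scale an array of doubles in place. For an n-element array, the
       loop's closing branch evaluates to taken n - 1 times and to not
       taken exactly once, when the loop exits. */
    void scale(double *a, size_t n, double factor)
    {
        for (size_t i = 0; i < n; i++)
            a[i] *= factor;   /* same small loop body, run thousands of times */
    }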
This description also points out two other ways in which floating-point
code is the opposite of integer code. First, floating-point code has excellent
locality of reference with respect to the instruction cache. A floating-point–
intensive program, as I’ve noted, spends a large part of its execution time in
relatively small loops, and these loops often fit in one of the processor caches.
Second, because floating-point–intensive applications operate on large data
files that are streamed into the processor from main memory, memory band-
width is extremely important for floating-point performance. So while integer
programs need good branch prediction and caching to keep the IUs fed with
instructions, floating-point programs need good memory bandwidth to keep
the FPUs fed with data.
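As a concrete illustration of the bandwidth point, consider this STREAM-style "triad" kernel, a common idiom in memory-bandwidth benchmarks (the sketch is mine, not the book's). Each iteration performs only a multiply and an add but moves 24 bytes to and from memory, so on large arrays the memory system, not the FPU, sets the pace:

    #include <stddef.h>

    /* Bandwidth-bound floating-point kernel: two 8-byte loads and one
       8-byte store per iteration, but only one multiply and one add.
       With arrays too large to cache, throughput is limited by how fast
       main memory can stream the data in and out. */
    void triad(double *a, const double *b, const double *c,
               double s, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }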
Now that you understand how floating-point code tends to operate, let’s
look at the FPUs of the G4e and Pentium 4 to see how they tackle the problem.
The G4e’s FPU
Since the G4e’s designers have bet on the processor’s vector execution
units (described shortly) to do most of the serious floating-point heavy
lifting, they made the G4e’s FPU fairly simple and straightforward. It has
a single pipeline for all floating-point instructions, and both single- and
double-precision operations take the same number of cycles. (This ability
to do double-precision FP is important mostly for scientific applications
like simulations.) In addition, one single- or double-precision operation
can be issued per cycle, with one important restriction (described later in
this section). Finally, the G4e inherits the PPC line’s ability to do single-
cycle fmadds, and this time, both double- and single-precision fmadds have
a single-cycle throughput.
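For readers who haven’t met the instruction, a fused multiply-add computes a × b + c as a single operation. In C99, the fma() function from math.h expresses the same thing, and on PowerPC compilers it typically maps to the fmadd family (this snippet is an illustration of mine, not from the text):

    #include <math.h>

    /* One step of a dot product: returns sum + (a * b), computed as a
       fused multiply-add with a single rounding at the end. */
    double dot_step(double sum, double a, double b)
    {
        return fma(a, b, sum);
    }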
Almost all of the G4e’s floating-point instructions take five cycles to
complete. There are a few instructions, however, that can take longer (fdiv
and fre, for example). These longer instructions can take from 14 to 35
cycles to complete, and while they’re executing, they hang up the floating-
point pipeline, meaning that no other instructions can execute during
that time.
One rarely discussed weakness of the G4e’s FPU is the fact that it isn’t
fully pipelined, which means that it can’t have five different instructions in
five different stages of execution simultaneously. Motorola’s software
optimization manual for the 7450 states that if the 7450’s first four stages
of execution are occupied, the FPU will stall on the next cycle. This means
that the FPU’s peak theoretical instruction throughput is four instructions
every five cycles.
It could plausibly be said that the PowerPC ISA gives the G4e a slight
advantage over the Pentium 4 in terms of floating-point performance.
However, a better way of phrasing that would be to say that the x86 ISA
(or more specifically, the x87 floating-point extensions) puts the Pentium 4
at a slight disadvantage with respect to the rest of the world. In other words,
the PPC’s floating-point implementation is fairly normal and unremarkable,
whereas the x87 has a few quirks that can make life difficult for FPU
designers and compiler writers.
As mentioned at the outset of this chapter, the PPC ISA has instructions
with one-, two-, three-, and four-operand formats. This puts a lot of power in
the hands of the compiler or programmer as far as scheduling floating-point
instructions to minimize dependencies and increase throughput and perfor-
mance. Furthermore, this instruction format flexibility is augmented by a
flat, 32-entry floating-point register file, which yields even more scheduling
flexibility and even more performance.
In contrast, the instructions that make up the x87 floating-point
extensions support two operands at most. You saw in Chapter 5 that the
x87’s very small eight-entry register file has a stack-based structure that
limits it in certain ways. All Pentium processors up until the Pentium 4 get
around this stack-related limitation with the “free” fxch instruction described
in Chapter 5, but on the Pentium 4, fxch is no longer free.
So the Pentium 4’s small, stack-based floating-point register file and two-
operand floating-point instruction format put the processor at a disadvantage
compared to the G4e’s cleaner PowerPC floating-point specification. The
Pentium 4’s 128 floating-point rename registers help alleviate some of the
false dependencies that arise from the low number of architectural registers,
but they don’t help much with the other problems.
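To see why the format matters, consider a compiler translating d = a * b + c. The comments below sketch, in simplified and hypothetical form rather than as exact compiler output, how each ISA might express it:

    /* d = a * b + c
     *
     * PowerPC (three input operands, flat 32-entry register file):
     *     fmadd f4, f1, f2, f3    # f4 = (f1 * f2) + f3: one instruction,
     *                             # and no source register is overwritten
     *
     * x87 (at most two operands, eight-entry register stack):
     *     fld   a                 # push a onto the stack
     *     fmul  b                 # st(0) = st(0) * b
     *     fadd  c                 # st(0) = st(0) + c
     *     fstp  d                 # pop the result into d
     *
     * The x87 sequence ties up the top of the stack, so independent
     * calculations must be interleaved with fxch instructions to reach
     * their operands: cheap on the P6, but no longer free on the Pentium 4.
     */
    double madd(double a, double b, double c)
    {
        return a * b + c;
    }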
The Pentium 4’s FPU
There are two fully independent FPU pipelines on the Pentium 4, one of
which is strictly for floating-point memory operations (loading and storing
floating-point data). Since floating-point applications are extremely data-
and memory-intensive, separating the floating-point memory operations and
giving them their own execution unit helps a bit with performance.
The other FPU pipeline is for all floating-point arithmetic operations,
and except for the fact that it doesn’t execute memory instructions, it’s very
similar to the G4e’s single FPU. Most simple floating-point operations take
between five and seven cycles, with a few more complicated operations (like
floating-point division) tying up the pipeline for a significantly longer time.
Single- and double-precision operations take the same number of cycles, with
both single- and double-precision floating-point numbers being converted
into the x87’s internal 80-bit temporary format. (This conversion is done
for overflow reasons and doesn’t concern us here.)
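The 80-bit format is visible from C on many x86 toolchains, where long double maps onto the x87 extended format. This small, platform-dependent check (my illustration, not from the text) reports a 64-bit significand for long double versus 53 bits for an ordinary double on such systems:

    #include <stdio.h>
    #include <float.h>

    /* On compilers that implement long double as the x87 80-bit extended
       format, LDBL_MANT_DIG is 64, versus 53 for a 64-bit double. */
    int main(void)
    {
        printf("double significand bits:      %d\n", DBL_MANT_DIG);
        printf("long double significand bits: %d\n", LDBL_MANT_DIG);
        return 0;
    }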
So the Pentium 4’s FPU hardware executes instructions with slightly
higher instruction latencies than the FPU of its predecessor, the P6 (for
example, three to five cycles on the P6 for common instructions), but because
the Pentium 4’s clock speed is so much higher, it can still complete more
floating-point instructions in a shorter period of time. The same is true of
the Pentium 4’s FPU in comparison with the G4e’s FPU—the Pentium 4
takes more clock cycles than the G4e to execute floating-point instructions,
but those clock cycles are much faster. So the Pentium 4’s clock-speed
advantage and high-bandwidth front-side bus give it a distinct advantage
over the Pentium III in floating-point–intensive benchmarks and enable it
to be more than competitive with the G4e in spite of the x87’s drawbacks.
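To put rough, illustrative numbers on that trade-off (the clock speeds here are examples of mine, not figures from the text): a 5-cycle operation on a 733 MHz processor takes about 5 / 733 MHz ≈ 6.8 ns, while a 7-cycle operation on a 1.5 GHz processor takes about 7 / 1.5 GHz ≈ 4.7 ns. The higher-latency pipeline still finishes first in absolute time.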
Concluding Remarks on the G4e’s and Pentium 4’s FPUs
The take-home message in the preceding discussion can be summed up
as follows: While the G4e has fairly standard, unremarkable floating-point
hardware, the PPC ISA does things the way they’re supposed to be done—
with instructions that take three or more operands and a large, flat register
file. The Pentium 4, on the other hand, has slightly better hardware but is
hobbled by the legacy x87 ISA. The exact degree to which the x87’s
weaknesses affect performance has been debated for as long as the x87 has
been around, but there seems to be a consensus that the situation is less
than ideal.
The other thing that’s extremely important to note is that when it comes
to floating-point performance, a good memory subsystem is absolutely key.
It doesn’t matter how good a processor’s floating-point hardware is—if you
can’t keep it fed, it won’t be doing much work. Therefore, floating-point
performance on both Pentium 4– and G4e-based systems depends on each
system’s available memory bandwidth.
The Vector Execution Units
One key technology on which both the Pentium 4 and the G4e rely for per-
formance in their most important type of application—media applications
(image processing, streaming media, 3D rendering, etc.)—is Single Instruc-
tion, Multiple Data (SIMD) computing, also known as vector computing. This
section looks at SIMD on both the G4e and the Pentium 4.
A Brief Overview of Vector Computing
Chapter 1 discussed the movement of floating-point and vector capabilities
from co-processors onto the CPU die. However, the addition of vector
instructions and hardware to a modern, superscalar CPU is a bit more
drastic than the addition of floating-point capability. A microprocessor is a
Single Instruction stream, Single Data stream (SISD) device, and it has been
since its inception, whereas vector computation represents a fundamentally
different type of computing: SIMD. Figure 8-1 compares SIMD and SISD in
terms of a simple diagram that was introduced in Chapter 1.
As you can see in Figure 8-1, an SIMD machine exploits a property of
the data stream called data parallelism. Data parallelism is said to be present
in a dataset when its elements can be processed in parallel, a situation that
most often occurs in large masses of data of a uniform type, like media files.
Chapter 1 described media applications as applications that use small, repeti-
tious chunks of code to operate on large, uniform datasets. Since these small
chunks of code apply the same sequence of operations to every element of a
large dataset, and these datasets can often be processed out of order, it makes
sense to use SIMD to apply the same instructions to multiple elements at once.
Figure 8-1: SISD versus SIMD
A classic example of a media application that exploits data parallelism is
the inversion of a digital image to produce its negative. The image processing
program must iterate through an array of uniform integer values (pixels)
and perform the same operation (inversion) on each one. Consequently,
there are multiple data points on which a single operation is performed, and
the order in which that operation is performed on the data points doesn’t
affect the outcome. The program could start the inversion at the top of the
image, the bottom of the image, or in the middle of the image—it doesn’t
matter as long as the entire image is inverted.
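In scalar C, the inversion looks something like the following sketch (mine, assuming 8-bit grayscale pixels). A SIMD version would perform the same subtraction on many pixels per instruction, for example 16 at a time with a single 128-bit vector operation:

    #include <stddef.h>
    #include <stdint.h>

    /* Invert an 8-bit grayscale image: each output pixel is 255 minus
       the input pixel. Every element receives the same operation, and
       the traversal order doesn't affect the result: textbook data
       parallelism. */
    void invert(uint8_t *pixels, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            pixels[i] = 255 - pixels[i];
    }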
This technique of applying a single instruction to multiple data elements
at once is quite effective in yielding significant speedups for many types of
applications, especially streaming media, image processing, and 3D rendering.
In fact, many of the floating-point–intensive applications described previously
can benefit greatly from SIMD, which is why both the G4e and the Pentium 4
skimped on the traditional FPU in favor of strengthening their SIMD units.
There were some early, ill-fated attempts at making a purely SIMD
machine, but the SIMD model is simply not flexible enough to accommo-
date general-purpose code. The only form in which SIMD is really feasible
is as a part of a SISD host machine that can execute branch instructions
and other types of code that SIMD doesn’t handle well. This is, in fact,
the situation with SIMD in today’s market. Programs are written for a SISD
machine and include SIMD instructions in their code.
Vectors Revisited: The AltiVec Instruction Set
The basic data unit of SIMD computation is the vector, which is why SIMD
computing is also known as vector computing or
vector processing. Vectors, which you met in Chapter 1, are nothing more
than rows of individual numbers, or
scalars. Figure 8-2 illustrates the differences between vectors and scalars.
A simple CPU operates on scalars one at a time. A superscalar CPU
operates on multiple scalars at once, but it may perform a different operation
on each one. A vector processor lines up a whole row of scalars, all
of the same type, and operates on them in parallel as a unit.
Figure 8-2: Scalars versus vectors
Vectors are represented in what is called a packed data format, in which
data are grouped into bytes or words and packed into a vector of a certain
length. To take Motorola’s AltiVec, for example, each of the 32 AltiVec
registers is 128 bits wide, which means that AltiVec can operate on vectors
that are 128 bits wide. AltiVec’s 128-bit wide vectors can be subdivided into
•	16 elements, where each element is either an 8-bit signed or unsigned
	integer or an 8-bit character;
•	8 elements, where each element is a 16-bit signed or unsigned integer;