Authors: jon stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
All of this deep buffering, scheduling, queuing, and optimizing is essential
for keeping the Pentium 4’s high-speed back end full. To return yet again to
the Tetris analogy, imagine what would happen if someone were to double the
speed at which the blocks fall; you’d hope that they would also double the size
of the look-ahead window to compensate. The Pentium 4 greatly increases
the size of the P6’s instruction window as a way of compensating for the fact
that the arrangement of instructions in its core is made so much more critical
by the increased clock speed and pipeline depth.
The downside to all of this is that the schedulers and queues and the very
large ROB all add complexity and cost to the Pentium 4’s design. This com-
plexity and cost are part of the price that the Pentium 4 pays for its deep
pipeline and high clock speed.
Intel’s Pentium 4 vs. Motorola’s G4e: Approaches and Design Philosophies
Intel’s Pentium 4 vs. Motorola’s G4e: The Back End
In this chapter, I’ll explain in greater detail the back
end of both the Pentium 4 and the G4e. I’ll talk about
the execution resources that each processor uses for
crunching code and data, and how those resources
contribute to overall performance on specific types of
applications.
Some Remarks About Operand Formats
Unlike the DLW and PowerPC ISAs described so far, the x86 ISA uses a
two-operand format for both integer and floating-point instructions. If you want
to add two numbers in registers A and B, the instruction would look as follows:
add A, B
This command adds the contents of A to the contents of B and places the
result in A, overwriting whatever was previously in A in the process. Expressed
mathematically, this would look as follows:
A = A + B
The problem with using a two-operand format is that it can be incon-
venient for some sequences of operations. For instance, if you want to add A
to B and store the result in C, you need two operations to do so:
Line #  Code       Comments
1       mov C, A   Copy the contents of register A to register C.
2       add C, B   Add the numbers in registers C and B and store the result in C,
                   overwriting the previous contents of C.
Program 8-1
The first instruction in Program 8-1 copies the contents of A into C so that
A’s value is not erased by the addition, and the second instruction adds the
two numbers.
With a three-operand or more format, like many of the instructions in
the PPC ISA, the programmer gets a little more flexibility and control. For
instance, you saw earlier that the PPC ISA has a three-operand add instruction
of the format
add destination, source1, source2
So if you want to add the contents of register 1 to the contents of register
2 and store the result in register 3 (i.e., r3 = r1 + r2), you just use the following instruction:
add 3, 2, 1
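The difference between the two formats can be sketched with a toy register file. The following Python model is only an illustration: the helper names mov, add2, and add3 and the dictionary-based register file are assumptions made for this sketch, not part of either ISA.

```python
# Toy register file: register names map to integer values.
# mov, add2, and add3 are hypothetical helpers, not real ISA definitions.

def mov(regs, dest, src):
    """Copy the contents of src into dest (two-operand mov dest, src)."""
    regs[dest] = regs[src]

def add2(regs, dest, src):
    """Two-operand add: dest = dest + src, overwriting dest."""
    regs[dest] = regs[dest] + regs[src]

def add3(regs, dest, src1, src2):
    """Three-operand add: dest = src1 + src2, leaving both sources intact."""
    regs[dest] = regs[src1] + regs[src2]

# Two-operand style (Program 8-1): C = A + B takes two instructions.
x86_regs = {"A": 5, "B": 7, "C": 0}
mov(x86_regs, "C", "A")
add2(x86_regs, "C", "B")

# Three-operand style: r3 = r2 + r1 takes a single instruction.
ppc_regs = {1: 5, 2: 7, 3: 0}
add3(ppc_regs, 3, 2, 1)
```

In both cases the source values survive, but the two-operand version needs the extra copy to protect A.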
Some PPC instructions support even more than three operands, which
can be a real boon to programmers and compiler writers. In “AltiVec Vector
Operations” on page 170, we’ll look in detail at the G4e AltiVec instruction
set’s use of a four-operand instruction format.
The PPC ISA’s variety of multiple-operand formats is obviously more
flexible than the one- and two-operand formats of x86. But nonetheless, modern
x86 compilers are quite advanced and can overcome many of the
aforementioned problems through the use of hidden microarchitectural
rename registers and various scheduling algorithms. The problems with
x86’s two-operand format are much more of a liability for floating-point and
vector code than for integer code. We’ll talk more about this later, though.
Chapter 8
The Integer Execution Units
Though the Pentium 4’s double-pumped integer execution units got quite a bit
of press when Netburst was first announced, you might be surprised to learn
that both the G4e and the Pentium 4 embody approaches to enhancing integer
performance that are very similar. As you’ll see, this similarity arises from
both processors’ application of the computing design dictum: Make the common
case fast.
For integer applications, the common case is easy to spot. As I outlined
in Chapter 1, integer instructions generally fall into one of two categories:
Simple/fast integer instructions
Instructions like add and sub require very few steps to complete and are
therefore easy to implement with little overhead. These simple instruc-
tions make up the majority of the integer instructions in an average
program.
Complex/slow integer instructions
While addition and subtraction are fairly simple to implement, multipli-
cation and division are complicated to implement and can take quite a
few steps to complete. Such instructions involve a series of additions and
bit shifts, all of which can take a while. These instructions represent only
a fraction of the instruction mix for an average program.
Since simple integer instructions are by far the most common type of
integer instruction, both the Pentium 4 and G4e devote most of their integer
resources to executing these types of instructions very rapidly.
The G4e’s IUs: Making the Common Case Fast
As was explained in the previous chapter, the G4e has a total of four IUs.
The IUs are divided into two groups:
Three simple/fast integer execution units—SIUa, SIUb, SIUc
These three simple IUs handle only fast integer instructions. Most of the
instructions executed by these IUs are single-cycle, but there are some
multi-cycle exceptions to this rule. Each of the three fast IUs is fed by a
single-entry reservation station.
One complex/slow integer execution unit—CIU
This single complex IU handles only complex integer instructions like
multiply, divide, and some special-purpose register instructions, includ-
ing condition register (CR) logical operations. Instructions sent to this
IU generally take four cycles to complete, although some take longer.
Note that divides, as well as some multiplication instructions, are not
fully pipelined and thus can tie up the entire IU for multiple cycles. Also,
instructions that update the PPC CR have an extra pipeline stage, called
finish, to pass through before they leave the IU (more on this shortly).
Finally, the CIU is fed by a two-entry reservation station.
By dedicating three of its four integer ALUs to the fastest, simplest,
and most common instructions, the G4e is able to make the common case
quite fast.
Before moving on, I should note that the finish stages attached to the
ends of some of the execution unit pipelines are new in the G4e. These
finish stages are dedicated to updating the condition register to reflect the
results of any arithmetic operation that needs to do such updating (this
happens infrequently). It’s important to understand that at the end of the
execute stage/start of the finish stage, an arithmetic instruction’s results are available for use by dependent instructions, even though the CR has not yet
been updated. Therefore, the finish stage doesn’t affect the effective latency
of arithmetic instructions. But for instructions that depend on the CR—like
branch instructions—the finish stage adds an extra cycle of latency.
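The effect of the finish stage on latency can be expressed as a small model. This is a sketch under stated assumptions: the function names are hypothetical, and the single-cycle execute time used in the example is chosen only for illustration.

```python
# Hypothetical latency model for an execution unit with a trailing finish stage.
# The one-cycle finish stage follows the text; everything else is illustrative.

FINISH_CYCLES = 1

def result_latency(execute_cycles):
    """Cycles until the arithmetic result is available to dependent instructions.

    The result is forwarded at the end of execute, so finish adds nothing here.
    """
    return execute_cycles

def cr_latency(execute_cycles):
    """Cycles until the condition register reflects the result."""
    return execute_cycles + FINISH_CYCLES

simple_add_cycles = 1                            # assumed single-cycle add
data_ready = result_latency(simple_add_cycles)   # a dependent add waits this long
branch_ready = cr_latency(simple_add_cycles)     # a CR-dependent branch waits this long
```

The gap between data_ready and branch_ready is the extra cycle that only CR-dependent instructions pay.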
One nice thing that the PPC ISA has going for it is its large number of
general-purpose registers (GPRs) for storing integers and addresses. This
large number of architectural GPRs (32 to be exact) gives the compiler plenty
of flexibility in scheduling integer operations and address calculations.
In addition to the PPC ISA’s 32 GPRs, the G4e provides 16 microarchi-
tectural general-purpose rename registers for use by the on-chip scheduling
logic. These additional registers, not visible to the compiler or programmer,
are used by the G4e to augment the 32 GPRs, thereby providing more flexi-
bility for scheduling the processor’s execution resources and keeping them
supplied with data.
The Pentium 4’s IUs: Make the Common Case Twice as Fast
The Pentium 4’s integer functional block takes a very similar strategy to
the G4e for speeding up integer performance. It contains one slow integer
execution unit and two fast integer execution units. By just looking at the
number of integer execution units, you might think that the Pentium 4
has less integer horsepower than the G4e. This isn’t quite the case, though,
because the Pentium 4’s two fast IUs operate at twice the core clock speed,
a trick that allows them to look to the outside world like four fast IUs.
The Pentium 4 can issue two integer instructions per cycle in rapid
succession to each of the two fast IUs—one on the rising edge of the clock
pulse and one on the falling edge. Each fast ALU can process an integer
instruction in 0.5 cycles, which means it can process a total of two integer
instructions per cycle. This gives the Pentium 4 a total peak throughput of
four simple integer instructions per cycle for the two fast IUs combined.
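The arithmetic behind that peak figure can be written out explicitly. The sketch below is a back-of-the-envelope calculation only: the helper name is hypothetical, the G4e comparison assumes each of its three simple IUs completes one single-cycle instruction per cycle, and the model ignores issue width, dependencies, and cache effects.

```python
# Peak throughput of simple integer instructions per core clock cycle.
# peak_ops_per_cycle is a hypothetical helper for illustration.

def peak_ops_per_cycle(num_units, cycles_per_op):
    """Peak instructions completed per core clock across identical units."""
    return num_units * (1 / cycles_per_op)

# Pentium 4: two fast IUs, each processing an instruction in 0.5 cycles.
p4_peak = peak_ops_per_cycle(num_units=2, cycles_per_op=0.5)

# G4e (assumption): three simple IUs, one single-cycle instruction each.
g4e_peak = peak_ops_per_cycle(num_units=3, cycles_per_op=1)
```

These are upper bounds on execution bandwidth, not measurements of delivered performance.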
Does this mean that the Pentium 4’s two double-speed integer units are
twice as powerful as two single-speed integer units? No, not quite. Integer
performance is about much more than just a powerful integer functional
block. You can’t squeeze peak performance out of an integer unit if you
can’t keep it fed with code, and the Pentium 4 seems to have a weakness
in this area when it comes to integer code.
We talked earlier in this book about how, due to the Pentium 4’s “narrow
and deep” design philosophy, branch mispredictions and cache misses can
degrade performance by introducing pipeline bubbles into the instruction
stream. This is especially a problem for integer performance, because integer-
intensive applications often contain branch-intensive code that exhibits poor
locality of reference. As a result, branch mispredictions in conjunction with
cache latencies can kill integer performance. (For more information about these
issues, see Chapter 11.)
As most benchmarks of the Pentium 4 bear out, in spite of its double-
pumped ALUs, the Pentium 4’s “narrow and deep” design is much more
suited to floating-point applications than it is to integer applications. This
floating-point bias seems to have been a deliberate choice on the part of the
Pentium 4’s designers—the Pentium 4 is designed to give not maximum but
acceptable performance on integer applications. This strategy works because
most modern processors (at least since the PIII, if not the PII or Pentium
Pro) are able to offer perfectly workable performance levels on consumer-
level integer-intensive applications like spreadsheets, word processors, and
the like. Though server-oriented applications like databases require higher
levels of integer performance, the demand for ever-increasing integer per-
formance just isn’t there in the consumer market. As a way of increasing the
Pentium 4’s integer performance for the server market, Intel sells a version
of the Pentium 4 called the Xeon, which has a much larger cache.
Before moving on to the next topic, I should note one characteristic of the
Pentium 4 that bears mentioning: its large number of microarchitectural
rename registers. The x86 ISA has only eight GPRs, but the Pentium 4
augments these with a large number of rename
registers: 128 to be exact. Since the Pentium 4 keeps so many instructions
on-chip for scheduling purposes, it needs these added rename resources to
prevent the kinds of register-based resource conflicts that result in pipeline
bubbles.
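To see how rename registers remove this kind of conflict, consider a minimal renaming sketch. It is not a description of the Pentium 4’s actual rename logic: the rename function and the (dest, src1, src2) instruction tuples are hypothetical, and the sketch never frees physical registers, which real hardware does as instructions retire.

```python
# Minimal register-renaming sketch. Every write to an architectural register
# is given a fresh physical register, so reusing a register name does not
# create a false dependence on an earlier, unrelated value.

def rename(instructions, num_physical):
    """Rewrite (dest, src1, src2) tuples to use physical register numbers."""
    mapping = {}      # architectural name -> current physical register
    next_free = 0

    def allocate():
        nonlocal next_free
        if next_free >= num_physical:
            raise RuntimeError("out of rename registers; the pipeline would stall")
        reg = next_free
        next_free += 1
        return reg

    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read whichever physical register currently holds the value.
        phys_srcs = []
        for src in (src1, src2):
            if src not in mapping:
                mapping[src] = allocate()
            phys_srcs.append(mapping[src])
        # The destination always gets a new physical register.
        mapping[dest] = allocate()
        renamed.append((mapping[dest], phys_srcs[0], phys_srcs[1]))
    return renamed

program = [
    ("A", "B", "C"),   # A = B + C
    ("D", "A", "A"),   # D = A + A  (true dependence on the first A)
    ("A", "E", "F"),   # A = E + F  (reuses the name A, no real dependence)
]
renamed = rename(program, num_physical=128)
```

After renaming, the first and third instructions write different physical registers, so the third no longer has to wait for the second to finish reading the old value of A.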
The Floating-Point Units (FPUs)
While the mass market’s demand for integer performance may not be
picking up, its demand for floating-point seems insatiable. Games, 3D
rendering, audio processing, and almost all other forms of multimedia-
and entertainment-oriented computing applications are extremely floating-
point intensive. With floating-point applications perennially driving the home
PC market, it’s no wonder that the Pentium 4’s designers made their design
trade-offs in favor of FP performance over integer performance.
In terms of the way they use the processor and cache, floating-point appli-
cations are in many respects the exact opposite of the integer applications
described in the preceding section. For instance, the branches in floating-
point code are few and extremely predictable. Most of these branches occur
as exit conditions in small chunks of loop code that iterate through a large
dataset (e.g., a sound or image file) in order to modify it. Since these loops
iterate many thousands of times as they work their way through a file, the