Authors: jon stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
intensive office apps.
The final thing worth noting about the Pentium’s two integer ALUs is
that they are responsible for many of the processor’s address calculations.
The Intel Pentium and Pentium Pro
More recently designed processors have specialized hardware for handling
the address calculations associated with loads and stores, but on the Pentium
these calculations are done in the integer ALUs.
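To make "address calculation" concrete, here is a minimal sketch of the arithmetic involved in forming an x86 effective address (base + index × scale + displacement). The function name and the sample values are illustrative assumptions and do not come from the text. The sketch shows only the sum being computed, not how the Pentium's ALUs implement it.

```python
def effective_address(base, index=0, scale=1, displacement=0):
    """Return base + index*scale + displacement, wrapped to 32 bits.

    Hypothetical helper for illustration; it models the sum only,
    not the Pentium's hardware.
    """
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factor must be 1, 2, 4, or 8")
    return (base + index * scale + displacement) & 0xFFFFFFFF


# Address of element 3 in an array of 4-byte integers that starts at 0x1000
# (the array location is an assumed example value).
addr = effective_address(base=0x1000, index=3, scale=4)
print(hex(addr))
```

Every load or store that uses such an addressing form needs this addition done before the memory access can begin, which is why the work competes with ordinary integer arithmetic when it is done in the same ALUs.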
The Floating-Point ALU
Floating-point operations are usually more complex to implement than
integer operations, so floating-point pipelines often feature more stages
than integer pipelines. The Pentium’s six-stage floating-point pipeline is no
exception to this rule. The Pentium’s floating-point performance is limited
by two main factors. First, the processor can dispatch a floating-point and an integer operation simultaneously only under extremely restrictive
circumstances. This isn’t too bad, though, because floating-point and integer
code are rarely mixed. The second factor, the unfortunate design of the x87 floating-point architecture, is more important.
In contrast to the average RISC ISA’s flat floating-point register file, the x87 register file contains eight 80-bit registers arranged in the form of a stack. A stack is a simple data storage structure commonly used by programmers and some scientific calculators to perform arithmetic.
NOTE
Flat is an adjective that programmers use to describe an array of elements that is logically laid out so that any element is accessible via a simple address. For instance, all
of the register files that we’ve seen so far are flat, because a programmer needs to know
only the name of the register in order to access that register. Contrast the flat file with the
stack structure described next, in which elements that are inside the data structure are
not immediately and directly accessible to the programmer.
As Figure 5-4 illustrates, a programmer writes data to the stack by pushing it onto the top of the stack via the push instruction. The stack therefore grows with each new piece of data that is pushed onto its top. To read data from the stack, the programmer issues a pop instruction, which returns the topmost piece of data and removes that data from the stack, causing the stack to shrink.
As the stack grows and shrinks, the variable ST, which stands for the stack top, always points to the top element of the stack. In the most basic type of stack, ST is the only element of the stack that can be directly accessed by the
programmer—it is read using the pop command, and it is written to using the
push command. This being the case, if you want to read the blue element
from the stack in Figure 5-4, you have to pop all of the elements above it, and
then you have to pop the blue element itself. Similarly, if you want to alter
the blue element, you first have to pop all of the elements above it. Then
you pop the blue element itself, alter it, and then push the modified element
back onto the stack.
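The pop-alter-push procedure just described can be sketched with a Python list standing in for the simple stack. The function name, the colors other than blue, and the final step of pushing the removed elements back are assumptions added for illustration.

```python
def alter_buried(stack, depth, new_value):
    """Replace the element `depth` positions below the top using only push/pop.

    Hypothetical helper: a Python list models the stack, with the end of
    the list acting as the stack top (ST).
    """
    held = []
    for _ in range(depth):
        held.append(stack.pop())   # pop every element above the target
    stack.pop()                    # pop the target element itself
    stack.append(new_value)        # push the modified element back
    while held:
        stack.append(held.pop())   # restore the elements that sat above it
    return stack


stack = []
for item in ["blue", "red", "green"]:   # "green" ends up on top
    stack.append(item)

alter_buried(stack, depth=2, new_value="BLUE")
print(stack)
```

Changing one buried element costs a number of pops and pushes proportional to its depth, which is the inconvenience that indexed access to the stack avoids.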
Because the first item that you place in a stack is not accessible until you’ve removed all the items above it, a stack is often called a FILO (first in, last out) data structure. Contrast this with a traditional queue structure, like a supermarket checkout line, which is a FIFO (first in, first out) structure.
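The difference between the two orderings is easy to see in a short sketch. The item names are placeholders, and using Python's list and collections.deque here is an illustrative choice, not anything specific to the Pentium.

```python
from collections import deque

items = ["first", "second", "third"]

# FILO: a list used as a stack; the last item pushed is the first popped.
stack = []
for item in items:
    stack.append(item)
stack_order = [stack.pop() for _ in range(len(items))]

# FIFO: a deque used as a queue; the first item added is the first removed.
queue = deque()
for item in items:
    queue.append(item)
queue_order = [queue.popleft() for _ in range(len(items))]

print(stack_order)
print(queue_order)
```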
Chapter 5
Figure 5-4: Pushing and popping data on a simple stack
All of this pushing and popping sounds like a lot of work, and you might wonder why anyone would use such a data structure. As it turns out, a stack is ideal for certain specialized types of applications, like parsing natural language, keeping track of nested procedure calls, and evaluating postfix arithmetic expressions. It was the stack’s utility for evaluating postfix arithmetic expressions that recommended it to the designers of the x87 floating-point unit (FPU), so they arranged the FPU’s eight floating-point registers as a stack.
NOTE
Normal arithmetic expressions, like 5 + 2 – 1 = 6, are called infix expressions, because the arithmetic operators (+ and –) are situated in between the numbers on which they operate. Postfix expressions, in contrast, have the operators affixed to the end of the expression, e.g., 5 2 1 – + = 6. You could evaluate this expression from left to right using a stack by pushing the numbers 5, 2, and 1 onto the stack (in that order), and then popping them back off (first 1, then 2, and finally 5) as the operators at the end of the expression are encountered. The operators would be applied to the popped numbers as they appear, and the running result would be stored in the top of the stack.
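The evaluation procedure in the note can be sketched as a small stack-based evaluator. The function name, the choice of four operators, and the space-separated token format are assumptions made for the example. The code uses an ASCII hyphen for subtraction.

```python
import operator

# Binary operators supported by this sketch (an illustrative selection).
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_postfix(tokens):
    """Evaluate a postfix expression given as a list of string tokens."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()          # the most recently pushed operand
            left = stack.pop()           # the operand pushed before it
            stack.append(OPS[tok](left, right))  # running result goes on top
        else:
            stack.append(float(tok))     # operands are simply pushed
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]


result = eval_postfix("5 2 1 - +".split())
print(result)  # 6.0
```

When the minus sign is reached, 1 and 2 are popped and 2 - 1 = 1 is pushed; the plus sign then pops 1 and 5 and leaves 6 on top of the stack.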
The x87 register file is a little different from the simple stack described above, because ST is not the only variable through which the
stack elements can be accessed. Instead, the programmer can read and
write the lower elements of the stack by using ST with an index value that
designates the desired element’s position relative to the top of the stack.
For example, in Figure 5-5, the stack is at its tallest when the green value
has just been pushed onto it. This green value is accessed via the variable
ST(0), because it occupies the top of the stack. The blue value, because it is
three elements down from the top of the stack, is accessed via ST(3).
Figure 5-5: Pushing and popping data on the x87 floating-point register stack
In general, to read from or write to a specific register in the stack, you can just use the form ST(i), where i is the number of registers from the top of the stack.
Programming purists might suggest that since you can access its stack elements arbitrarily, it’s kind of pointless to still call the x87 register file a stack.
This would be true except for one catch: For every floating-point arithmetic
instruction, at least one of the operands must be the stack top. For instance,
if you want to add two floating-point numbers, one of the numbers must
be in the stack top and the other can be in any of the other registers. For
example, the instruction
fadd ST, ST(5)
performs the operation
ST = ST + ST(5)
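The following toy model sketches how top-relative addressing and the "one operand must be the stack top" rule fit together. The class and method names are invented for illustration. The model uses Python floats rather than 80-bit values and ignores details such as empty-register tracking and stack overflow.

```python
class X87Stack:
    """Toy model of the x87 register stack; not bit- or cycle-accurate."""

    def __init__(self):
        self.regs = [0.0] * 8   # eight physical registers
        self.top = 0            # physical index currently named ST(0)

    def _phys(self, i):
        # ST(i) is i registers away from the current top, wrapping at 8.
        return (self.top + i) % 8

    def st(self, i=0):
        return self.regs[self._phys(i)]

    def push(self, value):
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def pop(self):
        value = self.regs[self.top]
        self.top = (self.top + 1) % 8
        return value

    def fadd(self, i):
        """Model of `fadd ST, ST(i)`: ST = ST + ST(i)."""
        self.regs[self._phys(0)] = self.st(0) + self.st(i)


fpu = X87Stack()
for value in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]:
    fpu.push(value)          # 6.0 ends up in ST(0), 1.0 in ST(5)

fpu.fadd(5)                  # fadd ST, ST(5)
print(fpu.st(0), fpu.st(5))
```

Note that `fadd` takes only one index: the other operand, and the destination, is always the stack top. That is the restriction described above.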
Though the stack-based nature of x87’s floating-point register file was originally a boon to assembly language programmers, it soon became an obstacle to floating-point performance as compilers saw more
widespread use. A flat register file is easier for a compiler to manage, and
the newer RISC ISAs featured not only large, flat register files but also
three-operand floating-point instructions.
While compiler tricks are arguably enough to make up for x87’s two-operand limit under most circumstances, they’re not quite able to overcome both the two-operand limit and the stack-based limit. So compiler tricks alone won’t eliminate the performance penalties associated with both of these quirks combined. The stack-based register file is bad enough that a microarchitectural hack is needed in order to simulate a flat register file and thereby keep the x87’s design from hobbling floating-point performance.
This microarchitectural hack involves turbocharging a single instruction:
fxch. The fxch instruction is an ordinary x87 instruction that allows you to swap any element of the stack with the stack top. For example, if you wanted
to calculate ST(2) = ST(2) + ST(6), you might execute the code shown in
Program 5-1:
Line #   Code              Comments
1        fxch ST(2)        Place the contents of ST(2) into ST and the contents of ST into ST(2).
2        fadd ST, ST(6)    Add the contents of ST(6) to ST, leaving the sum in ST.
3        fxch ST(2)        Place the contents of ST(2) into ST and the contents of ST into ST(2).
Program 5-1: Using the fxch instruction
Now, here’s where the microarchitectural hack comes in. On all modern x86 designs, from the original Pentium up to but not including the Pentium 4, the fxch instruction can be executed in zero cycles. This means that for all
intents and purposes, fxch is “free of charge” and can therefore be used when
needed without a performance hit. (Note, however, that the fxch instruction
still takes up decode bandwidth, so even when it’s “free,” it’s not entirely
“free.”) If you stop and think about the fact that, before executing any floating-point instruction (which has to involve the stack top), you can instantaneously
swap ST with any other register, you’ll realize that a zero-cycle fxch instruction gives programmers the functional equivalent of a flat register file.
To revisit the previous example, the fact that the fxch instructions in
Program 5-1 execute “instantaneously,” as it were, means that the series of
operations effectively looks as follows:
fadd ST(2), ST(6)
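As a quick check of that equivalence, the sketch below models the stack as a Python list in which index i stands for ST(i), runs the three instructions of Program 5-1, and compares the result with a hypothetical flat, two-operand add. The helper names and the register contents are made up for the example, and the model says nothing about timing.

```python
def fxch(st, i):
    """Model of `fxch ST(i)`: swap ST(i) with the stack top."""
    st[0], st[i] = st[i], st[0]


def fadd(st, i):
    """Model of `fadd ST, ST(i)`: ST = ST + ST(i)."""
    st[0] = st[0] + st[i]


def program_5_1(st):
    fxch(st, 2)   # line 1: bring ST(2) to the stack top
    fadd(st, 6)   # line 2: add ST(6) to it
    fxch(st, 2)   # line 3: swap the sum back into ST(2)


def flat_add(st, dst, src):
    """Hypothetical flat-register-file add: ST(dst) = ST(dst) + ST(src)."""
    st[dst] = st[dst] + st[src]


initial = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]

via_fxch = list(initial)
program_5_1(via_fxch)

via_flat = list(initial)
flat_add(via_flat, 2, 6)

print(via_fxch == via_flat)  # True
```

Both paths leave 28.0 in ST(2) and every other register unchanged, so if the two swaps cost no execution time, the three-instruction sequence behaves like a single flat-file add.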
There are in fact some limitations on the use of the “free” fxch instruction, but the overall result is that by using this trick, both the Pentium and its successors get the effective benefits of a flat register file, but with the aforementioned hit to decode bandwidth.
x86 Overhead on the Pentium
There are a number of places, like the Pentium’s decode-2 stage, where legacy x86 support adds significant overhead to the Pentium’s design. Intel has estimated that a whopping 30 percent of the Pentium’s transistors are
dedicated solely to providing x86 legacy support. When you consider the fact that the Pentium’s RISC competitors with comparable transistor counts could
spend those transistors on performance-enhancing hardware like execution
units and cache, it’s no wonder that the Pentium lagged behind some of its
contemporaries when it was first introduced.
A large chunk of the Pentium’s legacy-supporting transistors are eaten
up by its microcode ROM. Chapter 4 explained that one of the big benefits
of RISC processors is that they don’t need the microcode ROMs that CISC
designs require for decoding large, complex instructions. (For more on x86 as a CISC ISA, see the section “CISC, RISC, and Instruction Set Translation.”)
The front end of the Pentium also suffers from x86-related bloat, in that its prefetch logic has to take account of the fact that x86 instructions are not a uniform size and hence can straddle cache lines. The Pentium’s decode
logic also has to support x86’s segmented memory model, which means
checking for and enforcing code segment limits; such checking requires its
own dedicated address calculation hardware, in addition to the Pentium’s
other address hardware.
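To illustrate why variable-length instructions complicate prefetch, here is a small sketch that tests whether an instruction of a given length crosses a cache-line boundary. The 32-byte line size, the function name, and the sample addresses are assumptions for the example. The sketch describes the condition the hardware must handle, not the prefetch logic itself.

```python
CACHE_LINE_BYTES = 32  # assumed line size for this example


def straddles_cache_line(address, length, line_size=CACHE_LINE_BYTES):
    """Return True if the bytes [address, address + length) span two lines."""
    first_line = address // line_size
    last_line = (address + length - 1) // line_size
    return first_line != last_line


# A 4-byte instruction that ends exactly at a line boundary fits in one line.
print(straddles_cache_line(0x101C, 4))

# A 5-byte instruction starting two bytes before the boundary spills over.
print(straddles_cache_line(0x101E, 5))
```

With fixed-size, aligned instructions, the second case cannot arise, which is one reason a RISC front end can get by with simpler fetch hardware.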
Summary: The Pentium in Historical Context
The primary factor constraining the Pentium’s performance versus its RISC
competitors was the fact that its entire front end was bloated with hardware
that was there solely to support x86 features which, even at the time of the processor’s introduction, were rapidly falling out of use. With transistor
budgets as tight as they were in 1993, each of those extra address adders
and prefetch buffers—not to mention the microcode ROM—represented
a painful expenditure of scarce resources that did nothing to enhance the
Pentium’s performance.
Fortunately for Intel, Pentium’s legacy support headaches weren’t the
end of the story. There were a few facts and trends working in the favor of
Intel and the x86 ISA. If we momentarily forget about ISA extensions like MMX, SSE, and so on, and the odd handful of special-purpose instructions, like Intel’s CPU identifier instruction, that get added to the x86 ISA every so often, the core legacy x86 ISA is fixed in size and has not grown over the years.
Similarly, with one exception (the P6, covered next), the amount of hardware