Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
until the branch condition is evaluated. When the branch condition has been
evaluated, if the BPU guessed correctly, the speculative instructions’ ROB
entries are marked as non-speculative, and the instructions are committed
in order. If the BPU guessed incorrectly, the speculative instructions and their results are deleted from the ROB without being committed.
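The resolution step just described can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the P6's actual logic: the `RobEntry` class, its fields, and the `resolve_branch` and `commit_in_order` helpers are hypothetical names introduced only for illustration, and every instruction is assumed to have finished executing.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    name: str          # label for the instruction (illustrative only)
    speculative: bool  # True if fetched past an unresolved branch
    done: bool = True  # assumption: execution has already finished


def resolve_branch(rob, predicted_correctly):
    """Return the ROB contents after the branch condition is evaluated."""
    if predicted_correctly:
        # Correct guess: speculative entries are marked non-speculative.
        for entry in rob:
            entry.speculative = False
        return rob
    # Wrong guess: speculative entries and their results are discarded.
    return [entry for entry in rob if not entry.speculative]


def commit_in_order(rob):
    """Commit finished, non-speculative entries from the head of the ROB."""
    committed = []
    while rob and rob[0].done and not rob[0].speculative:
        committed.append(rob.pop(0).name)
    return committed
```

With a correct prediction, every entry eventually commits in program order; with an incorrect one, only the entries that were never speculative survive to be committed.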
The P6 Back End
The P6’s back end (illustrated in Figure 5-9) is significantly wider than that
of the Pentium. Like the Pentium, it contains two asymmetrical integer
ALUs and a separate floating-point unit, but its load-store capabilities have
been beefed up to include three execution units devoted solely to memory
accesses: a load address unit, a store address unit, and a store data unit.
The load address and store address units each contain a pair of four-input adders for
calculating addresses and checking segment limits; these are the adders
that show up in the decode-1 stage of the original Pentium.
The asymmetrical integer ALUs on the P6 have single-cycle throughput
and latency for most operations, with multiplication having single-cycle
throughput but four-cycle latency. Thus, multiply instructions execute faster
on the P6 than on the Pentium.
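The difference between throughput and latency is easy to see with a little arithmetic. The sketch below is an idealized model that assumes a single pipelined unit with no stalls of any other kind; the function names are hypothetical and chosen only for this illustration.

```python
def cycles_for_independent_ops(n, latency, issue_interval=1):
    """Cycles to finish n independent operations on one pipelined unit.

    The first result appears after `latency` cycles; each later result
    follows `issue_interval` cycles behind the previous one.
    """
    if n == 0:
        return 0
    return latency + (n - 1) * issue_interval


def cycles_for_dependent_chain(n, latency):
    """Cycles when each operation must wait for the previous result."""
    return n * latency
```

Using the multiply figures quoted above (four-cycle latency, single-cycle throughput), ten independent multiplies finish in 4 + 9 = 13 cycles in this model, while ten multiplies that each depend on the previous result take 10 x 4 = 40 cycles.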
Chapter 5
[Figure placeholder. Labels recoverable from the original diagram: Reservation Station (RS); issue ports 0 through 4; Floating-Point Unit (FPU); Integer Unit (CIU, SIU); Branch Unit (BU); Load-Store Unit (Store Data, Store Addr., Load Addr.); Scalar ALUs; Memory Access Units; Back End.]
Figure 5-9: The P6 back end
The P6’s floating-point unit executes most single- and double-precision
operations in three cycles, with five cycles needed for multiply instructions.
The FPU is fully pipelined for most instructions, so that most instructions
execute with a single-cycle throughput. Some instructions, like floating-point
division and square root, are not pipelined and take 18 to 38 and 29 to 69
cycles, respectively.
From the present overview’s perspective, the most noteworthy feature of
the P6’s back end is that its execution units are attached to the reservation
station via five issue ports, as shown in Figure 5-9.
This means that up to five instructions per cycle can pass from the reservation
station through the issue ports and into the execution units. This five-issue-port
structure is one of the most recognizable features of the P6, and when later
designs (like the PII) added execution units to the microarchitecture (like MMX
units), they had to be added on the existing issue ports.
If you looked closely at the Pentium Pro diagram, you probably noticed
that there were already two units that shared a single port in the original
Pentium Pro: the simple integer unit and the floating-point unit. This means
that there are some restrictions on issuing a second integer computation and a
floating-point computation, but these restrictions rarely affect performance.
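The effect of port sharing can be modeled with a short sketch. Everything here is an assumption made for illustration: the `issue_one_cycle` helper is a hypothetical name, the oldest-first selection policy is a simplification, and the only part of the unit-to-port table taken from the text is that the simple integer unit (SIU) and the floating-point unit (FPU) share a port. The rest of the table, including the port numbers themselves, is a guess and should not be read as the real P6 layout.

```python
# Unit-to-port table. Only the SIU/FPU sharing comes from the text;
# the remaining assignments and the port numbers are assumptions.
PORT_OF_UNIT = {
    "SIU": 0, "FPU": 0,
    "CIU": 1, "BU": 1,
    "LOAD_ADDR": 2,
    "STORE_ADDR": 3,
    "STORE_DATA": 4,
}


def issue_one_cycle(ready_uops, port_of_unit=PORT_OF_UNIT):
    """Pick micro-ops to issue this cycle: at most one per port, oldest first.

    ready_uops is a list of (name, unit) pairs, oldest first.
    Returns (issued, still_waiting).
    """
    busy_ports = set()
    issued, waiting = [], []
    for name, unit in ready_uops:
        port = port_of_unit[unit]
        if port in busy_ports:
            # Another micro-op already claimed this port in this cycle.
            waiting.append((name, unit))
        else:
            busy_ports.add(port)
            issued.append((name, unit))
    return issued, waiting
```

In this model, an integer add bound for the SIU and a floating-point multiply bound for the FPU cannot issue in the same cycle, because both need the same port; the multiply simply waits one cycle. With five ports, no more than five micro-ops can ever issue together, no matter how many execution units hang off those ports.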
CISC, RISC, and Instruction Set Translation
Like the original Pentium, the P6 spends extra time in the decode phase, but
this time, the extra cycle and a half goes not to address calculations but to
instruction set translation. ISA translation is an important technique used in many modern processors, but before you can understand how it works you
must first become acquainted with two terms that often show up in computer
architecture discussions: RISC and CISC.
One of the most important ways in which the x86 ISA differs from both the PowerPC ISA (described in the next chapter) and the hypothetical
DLW ISA presented in Chapter 2 is that it supports register-to-memory and
memory-to-memory format arithmetic instructions. On the DLW architecture,
The Intel Pentium and Pentium Pro
source and destination operands for every arithmetic instruction had to be
either registers or immediate values, and it was the programmer’s responsibility
to include in the program the load and store instructions necessary to
ensure that the arithmetic instructions’ source registers were populated with
the correct values from memory and their results written back to memory.
On the x86 architecture, the programmer can voluntarily surrender control of most load-store traffic to the processor by using source and/or destination
operands that are memory locations. If the DLW architecture supported
such operations, they might look like Program 5-2:
Line #  Code             Comments
1       add #12, #13, A  Add the contents of memory locations #12 and #13 and
                         place the result in register A.
2       sub A, #15, #16  Subtract the contents of register A from the contents of
                         memory location #15 and store the result in memory
                         location #16.
3       sub A, #B, #100  Subtract the contents of register A from the contents of
                         the memory location pointed to by #B and store the
                         result in memory location #100.
Program 5-2: Arithmetic instructions using memory-to-memory and memory-to-register formats
Adding the contents of two memory locations, as in line 1 of Program 5-2,
still requires the processor to load the necessary values into registers and store the results. However, in memory-to-register and memory-to-memory format
instructions, these load and store operations are implicit in the instruction.
The processor must look at the instruction and figure out that it needs to
perform the necessary memory accesses; then it must perform them before
and/or after it executes the arithmetic part of the instruction. So for the
add in line 1 of Program 5-2, the processor would have to perform two loads
before executing the addition. Similarly, for the subtractions in lines 2 and 3, the processor would have to perform one load before executing the subtraction and one store afterwards.
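To make the implicit memory accesses concrete, the following sketch rewrites each instruction of Program 5-2 as an explicit sequence of loads, a register-only arithmetic operation, and a store where needed. It is only an illustration of the idea: the `expand` function and the temporary register names `T1` through `T3` are hypothetical, and the output simply imitates DLW-style `load` and `store` syntax rather than reproducing anything a real processor emits.

```python
def expand(instruction, temps=("T1", "T2", "T3")):
    """Rewrite one memory-format instruction as explicit load/op/store steps."""
    opcode, operand_text = instruction.split(None, 1)
    src1, src2, dest = [field.strip() for field in operand_text.split(",")]
    free_temps = list(temps)  # hypothetical scratch registers
    steps = []

    def as_register(source):
        # A leading '#' marks a memory operand that must be loaded first.
        if source.startswith("#"):
            temp = free_temps.pop(0)
            steps.append(f"load {source}, {temp}")
            return temp
        return source

    reg1 = as_register(src1)
    reg2 = as_register(src2)
    if dest.startswith("#"):
        # Memory destination: compute into a scratch register, then store.
        result = free_temps.pop(0)
        steps.append(f"{opcode} {reg1}, {reg2}, {result}")
        steps.append(f"store {result}, {dest}")
    else:
        steps.append(f"{opcode} {reg1}, {reg2}, {dest}")
    return steps
```

Calling `expand("add #12, #13, A")` yields two loads followed by a register-only add, while `expand("sub A, #15, #16")` yields one load, a register-only subtract, and one store, which matches the counts given in the paragraph above.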
The use of such register-to-memory and memory-to-memory format
instructions shifts the burden of scheduling memory traffic from the programmer
to the processor, freeing the programmer to focus on other aspects of coding. It
also reduces the number of instructions that a programmer must write in order to
perform most tasks, which increases code density. In the days when programmers
programmed primarily in assembly language, compilers for high-level languages
(HLLs) like C and FORTRAN were primitive, and main memories were small and
expensive, ISA qualities like programmer ease-of-use and high code density were
very attractive.
A further technique that ISAs like x86 use to lessen the burden on programmers
and increase code density is the inclusion of ISA-level support for complex data
types like strings. A string is simply a series, or “string,” of contiguous
memory locations of a certain length. Strings are often used to store ASCII text,
so a short string might store a word, or a longer string might store a whole
sentence. If an ISA includes instructions for working with strings (and the x86
ISA does), assembly language programmers can write programs like text editors and
terminal applications in a much shorter length of time and with significantly
fewer instructions than they could if the ISA lacked such support.
Complex instructions, like the x86 string manipulation instructions,
carry out complex, multistep tasks and therefore stand in for what would
otherwise be many lines of RISC assembler code. These types of instructions
have serious drawbacks, though, when it comes to performing the kind of
dynamic scheduling and out-of-order execution described earlier in this
chapter. String instructions, for instance, have latencies that can vary with
the length of the string being manipulated—the longer the string, the
more cycles the instruction takes to execute. Because their latencies are not
predictable, it’s difficult for the processor to schedule them optimally using
the dynamic scheduling mechanisms described previously.
Finally, complex instructions often vary in the number of bytes they
need in order to be rendered in machine language. Such variable-length
instructions are more difficult to fetch and decode, and once they’re
decoded, they’re more difficult to schedule.
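One way to see why variable-length instructions complicate fetch and decode is to compare how instruction boundaries are found. The sketch below uses a toy encoding invented purely for this illustration, in which the first byte of each instruction holds that instruction's length; real x86 encoding is far more involved and does not work this way. Both function names are hypothetical.

```python
def fixed_length_starts(code, width=4):
    """Every instruction start is known without inspecting the bytes."""
    return list(range(0, len(code), width))


def variable_length_starts(code):
    """Toy encoding: the first byte of each instruction holds its length.

    Each start depends on the length of the instruction before it, so the
    boundaries must be discovered one after another.
    """
    starts = []
    position = 0
    while position < len(code):
        starts.append(position)
        length = code[position]
        if length == 0:
            raise ValueError("zero-length instruction in toy encoding")
        position += length
    return starts
```

With fixed-length instructions, the start of the tenth instruction is simply nine widths from the start of the block, so several instructions can be located at once. With variable-length instructions, the start of each instruction is unknown until the one before it has been at least partially examined.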
Because of its use of multiple instruction formats (register-to-memory
and memory-to-memory) and complex, variable-length instructions,
x86 is an example of an approach to processor and ISA design called complex instruction set computing (CISC). Both DLW and PowerPC, in contrast, represent
an approach called reduced instruction set computing (RISC), in which
all machine language instructions are the same length, fewer instruction
formats are supported, and complex instructions are eliminated entirely.
RISC ISAs are harder to program for in assembly language, so they assume
the existence and widespread use of high-level languages and sophisticated
compilers. For RISC programmers who use a high-level language like C, the
burden of scheduling memory traffic and handling complex data types shifts
from the processor to the compiler. By shifting the burden of scheduling
memory accesses and other types of code to the compiler, processors that
implement RISC ISAs can be made less complex, and because they’re less
complex, they can be faster and more efficient.
It would be nice if x86, which is far and away the world’s most popular ISA,
were RISC, but it isn’t. The x86 ISA is a textbook example of a CISC ISA, and
that means processors that implement x86 require more complicated
microarchitectures. At some point, x86 processor designers realized that in
order to use the latest RISC-oriented dynamic scheduling techniques to speed
x86-based architectures without the processor’s complexity spinning out of
control, they’d have to limit the added complexity to the front end by
translating x86 CISC operations into smaller, faster, more uniform RISC-like
operations for use in the back end. AMD’s K6 and Intel’s P6 were two early x86
designs that used this type of instruction set translation to great advantage.
The technique was so successful that all subsequent x86 processors from both
Intel and AMD have used instruction set translation, as have some RISC
processors like IBM’s PowerPC 970.
The P6 Microarchitecture’s Instruction Decoding Unit
The P6 microarchitecture breaks down complex, variable-length x86 instructions
into one or more smaller, fixed-length micro-operations (aka micro-ops, µops, or
uops) using a decoding unit that consists of three separate decoders, depicted
in Figure 5-10: two simple/fast decoders, which handle simple x86 instructions
and can each produce one decoded micro-op per cycle; and one complex/slow
decoder, which handles the more complex x86 instructions and can produce up to
four decoded micro-ops per cycle.
Sixteen-byte groups of architected x86 instructions are fetched from the I-cache
into the front end’s 32-byte instruction queue, where predecoding logic first
identifies each instruction’s boundaries and type before aligning the
instructions for entry into the decoding hardware. Up to three x86 instructions
per cycle can then move from the instruction queue into the decoders, where
they’re converted to micro-ops and passed into a micro-op queue before going to
the ROB. Together, the P6’s three decoders are capable of producing up to six
decoded micro-ops per cycle (four from the complex/slow decoder plus one from
each of the two simple/fast decoders) for consumption by the micro-op queue. The
micro-op queue, in turn, is capable of
passing up to three micro-ops per cycle into the P6’s instruction window.
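The decoder limits described above can be turned into a rough cycle count. The sketch below rests on several simplifying assumptions that the text does not state: instructions are decoded strictly in program order, the first instruction of each cycle's group always goes to the complex decoder, and the two simple decoders accept only instructions that produce exactly one micro-op. The `decode_cycles` name is hypothetical, and instructions needing more than four micro-ops, as well as the three-per-cycle limit on the micro-op queue's output, are left out of the model.

```python
def decode_cycles(uops_per_instruction):
    """Count cycles to decode a stream, given micro-ops per x86 instruction."""
    cycles = 0
    i = 0
    n = len(uops_per_instruction)
    while i < n:
        if uops_per_instruction[i] > 4:
            raise ValueError("more than four micro-ops: not modeled here")
        # Complex decoder takes the first instruction (one to four micro-ops).
        i += 1
        # Each simple decoder takes the next instruction only if it is
        # a single-micro-op instruction.
        simple_used = 0
        while i < n and simple_used < 2 and uops_per_instruction[i] == 1:
            i += 1
            simple_used += 1
        cycles += 1
    return cycles
```

Under these assumptions, a four-micro-op instruction followed by two single-micro-op instructions decodes in one cycle and yields the six-micro-op peak, while three consecutive two-micro-op instructions take three cycles because each one has to wait for the complex decoder.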
[Figure placeholder. Labels recoverable from the original diagram: L1 Instruction Cache; 2 x 16-byte Fetch Buffer; Translate/x86 Decode (two Simple Decoders, one Complex Decoder, Microcode Engine); 6-entry Micro-op Queue; 40-entry Reorder Buffer (ROB). Legend: x86 instruction path; micro-op instruction path.]
Figure 5-10: The P6 microarchitecture’s decoding hardware
Simple x86 instructions, which can be decoded very rapidly and which break down
into only one or two micro-ops, are by far the most common type of instruction
found in an average x86 program. So the P6 dedicates most of its decoding
hardware to these types of instructions. More complex x86 instructions, like
string manipulation instructions, are less common and take longer