
until the branch condition is evaluated. When the branch condition has been evaluated, if the BPU guessed correctly, the speculative instructions’ ROB entries are marked as non-speculative and the instructions are committed in order. If the BPU guessed incorrectly, the speculative instructions and their results are deleted from the ROB without being committed.
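
As a rough sketch of the kind of sequence being described, consider the following fragment. The mnemonics, the label form, and the way the branch condition is tested are illustrative only, not actual DLW or x86 syntax.

Line #  Code             Comments
1       sub A, B, C      Produces the value the branch outcome depends on.
2       jumpz #TARGET    Conditional branch; the BPU predicts its outcome long before
                         line 1’s result is available.
3       add D, E, F      Fetched and executed speculatively down the predicted path; these
4       add F, G, H      instructions’ ROB entries are committed only if the prediction turns
                         out to be correct, and are deleted otherwise.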

The P6 Back End

The P6’s back end (illustrated in Figure 5-9) is significantly wider than that of the Pentium. Like the Pentium, it contains two asymmetrical integer ALUs and a separate floating-point unit, but its load-store capabilities have been beefed up to include three execution units devoted solely to memory accesses: a load address unit, a store address unit, and a store data unit. The load address and store address units each contain a pair of four-input adders for calculating addresses and checking segment limits; these are the adders that show up in the decode-1 stage of the original Pentium.

The asymmetrical integer ALUs on the P6 have single-cycle throughput and latency for most operations, with multiplication having single-cycle throughput but four-cycle latency. Thus, multiply instructions execute faster on the P6 than on the Pentium.
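
To see the difference between throughput and latency, consider the following sketch; the mul mnemonic and the register names are assumed for illustration and are not taken from the DLW examples in Chapter 2.

Line #  Code            Comments
1       mul A, B, C     Each multiply takes four cycles to produce its result, but the ALU
2       mul D, E, F     can accept a new, independent multiply every cycle, so lines 1 and 2
                        overlap almost completely.
3       mul C, F, G     This multiply uses the results of lines 1 and 2, so it cannot begin
                        until their four-cycle latencies have elapsed.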


[Figure 5-9: The P6 back end. The diagram shows the reservation station (RS) issuing through five ports (ports 0 through 4) to the back end’s execution units: the scalar ALUs (the complex integer unit, CIU, and the simple integer unit, SIU), the floating-point unit (FPU), the branch unit (BU), and the three memory access units (load address, store address, and store data).]

The P6’s floating-point unit executes most single- and double-precision operations in three cycles, with five cycles needed for multiply instructions. The FPU is fully pipelined for most instructions, so that most instructions execute with a single-cycle throughput. Some instructions, like floating-point division and square root, are not pipelined and take 18 to 38 and 29 to 69 cycles, respectively.
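
Pipelining is what separates those multiply and divide figures in practice. In the sketch below, the floating-point mnemonics and register names are illustrative only; the point is the overlap, not the exact syntax.

Line #  Code            Comments
1       fmul A, B, C    Independent multiplies: a fully pipelined FPU can start a new one
2       fmul D, E, F    every cycle, even though each takes five cycles to finish.
3       fdiv G, H, I    Division is not pipelined, so this divide occupies the unit for its
4       fdiv J, K, L    full 18 to 38 cycles before the divide on line 4 can even begin.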

From the present overview’s perspective, the most noteworthy feature of the P6’s back end is that its execution units are attached to the reservation station via five issue ports, as shown in Figure 5-9. This means that up to five instructions per cycle can pass from the reservation station through the issue ports and into the execution units. This five-issue-port structure is one of the most recognizable features of the P6, and when later designs (like the PII) added execution units to the microarchitecture (like MMX units), they had to be added on the existing issue ports.

If you looked closely at the Pentium Pro diagram, you probably noticed that there were already two units that shared a single port in the original Pentium Pro: the simple integer unit and the floating-point unit. This means that there are some restrictions on issuing a second integer computation and a floating-point computation in the same cycle, but these restrictions rarely affect performance.

CISC, RISC, and Instruction Set Translation

Like the original Pentium, the P6 spends extra time in the decode phase, but this time, the extra cycle and a half goes not to address calculations but to instruction set translation. ISA translation is an important technique used in many modern processors, but before you can understand how it works you must first become acquainted with two terms that often show up in computer architecture discussions: RISC and CISC.

One of the most important ways in which the x86 ISA differs from both the PowerPC ISA (described in the next chapter) and the hypothetical DLW ISA presented in Chapter 2 is that it supports register-to-memory and memory-to-memory format arithmetic instructions. On the DLW architecture, source and destination operands for every arithmetic instruction had to be either registers or immediate values, and it was the programmer’s responsibility to include in the program the load and store instructions necessary to ensure that the arithmetic instructions’ source registers were populated with the correct values from memory and their results written back to memory. On the x86 architecture, the programmer can voluntarily surrender control of most load-store traffic to the processor by using source and/or destination operands that are memory locations. If the DLW architecture supported such operations, they might look like Program 5-2:

Line #  Code              Comments
1       add #12, #13, A   Add the contents of memory locations #12 and #13 and place the
                          result in register A.
2       sub A, #15, #16   Subtract the contents of register A from the contents of memory
                          location #15 and store the result in memory location #16.
3       sub A, #B, #100   Subtract the contents of register A from the contents of the memory
                          location pointed to by #B and store the result in memory location #100.

Program 5-2: Arithmetic instructions using memory-to-memory and memory-to-register formats

Adding the contents of two memory locations, as in line 1 of Program 5-2, still requires the processor to load the necessary values into registers and store the results. However, in memory-to-register and memory-to-memory format instructions, these load and store operations are implicit in the instruction. The processor must look at the instruction and figure out that it needs to perform the necessary memory accesses; then it must perform them before and/or after it executes the arithmetic part of the instruction. So for the add in line 1 of Program 5-2, the processor would have to perform two loads before executing the addition. Similarly, for the subtractions in lines 2 and 3, the processor would have to perform one load before executing the subtraction and one store afterwards.
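
On a load-store design like DLW, the programmer would have to spell these accesses out. A sketch of what the add in line 1 of Program 5-2 implicitly stands for might look like the listing below; the load mnemonic and its operand order are assumptions for illustration, since Program 5-2 is itself hypothetical.

Line #  Code           Comments
1       load #12, B    Load the contents of memory location #12 into register B.
2       load #13, C    Load the contents of memory location #13 into register C.
3       add B, C, A    Add the contents of registers B and C and place the result in register A.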

The use of such register-to-memory and memory-to-memory format instructions shifts the burden of scheduling memory traffic from the programmer to the processor, freeing the programmer to focus on other aspects of coding. It also reduces the number of instructions that a programmer must write in order to perform most tasks, thereby increasing code density. In the days when programmers programmed primarily in assembly language, compilers for high-level languages (HLLs) like C and FORTRAN were primitive, and main memories were small and expensive, ISA qualities like programmer ease-of-use and high code density were very attractive.

A further technique that ISAs like x86 use to lessen the burden on programmers and increase code density is the inclusion of ISA-level support for complex data types like strings. A string is simply a series, or “string,” of contiguous memory locations of a certain length. Strings are often used to store ASCII text, so a short string might store a word, while a longer string might store a whole sentence. If an ISA includes instructions for working with strings—and the x86 ISA does—assembly language programmers can write programs like text editors and terminal applications in much less time and with significantly fewer instructions than they could if the ISA lacked such support.
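
For example, a single x86 string instruction such as rep movsb copies a block of bytes on its own, using one register as a count and two others as source and destination pointers. Without string support, the same copy requires an explicit loop. The DLW-style sketch below is purely illustrative: the pointer-addressing form, the immediate operands, and the branch mnemonic and target are assumptions, not actual DLW syntax from Chapter 2.

Line #  Code            Comments
1       load #B, C      Load one byte from the memory location pointed to by register B
                        into register C.
2       store C, #D     Store it to the memory location pointed to by register D.
3       add B, 1, B     Advance the source pointer.
4       add D, 1, D     Advance the destination pointer.
5       sub E, 1, E     Decrement the byte count in register E.
6       jumpnz #1       Repeat from line 1 until the count reaches zero.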

Complex instructions, like the x86 string manipulation instructions, carry out complex, multistep tasks and therefore stand in for what would otherwise be many lines of RISC assembler code. These types of instructions have serious drawbacks, though, when it comes to performing the kind of dynamic scheduling and out-of-order execution described earlier in this chapter. String instructions, for instance, have latencies that can vary with the length of the string being manipulated—the longer the string, the more cycles the instruction takes to execute. Because their latencies are not predictable, it’s difficult for the processor to schedule them optimally using the dynamic scheduling mechanisms described previously.

Finally, complex instructions often vary in the number of bytes they need in order to be rendered in machine language. Such variable-length instructions are more difficult to fetch and decode, and once they’re decoded, they’re more difficult to schedule.

Because of its use of multiple instruction formats (register-to-memory and memory-to-memory) and complex, variable-length instructions, x86 is an example of an approach to processor and ISA design called complex instruction set computing (CISC). Both DLW and PowerPC, in contrast, represent an approach called reduced instruction set computing (RISC), in which all machine language instructions are the same length, fewer instruction formats are supported, and complex instructions are eliminated entirely. RISC ISAs are harder to program in assembly language, so they assume the existence and widespread use of high-level languages and sophisticated compilers. For RISC programmers who use a high-level language like C, the burden of scheduling memory traffic and handling complex data types shifts from the processor to the compiler. By shifting the burden of scheduling memory accesses and other types of code to the compiler, processors that implement RISC ISAs can be made less complex, and because they’re less complex, they can be faster and more efficient.

It would be nice if x86, which is far and away the world’s most popular ISA, were RISC, but it isn’t. The x86 ISA is a textbook example of a CISC ISA, and that means processors that implement x86 require more complicated microarchitectures. At some point, x86 processor designers realized that in order to use the latest RISC-oriented dynamic scheduling techniques to speed x86-based architectures without the processor’s complexity spinning out of control, they’d have to limit the added complexity to the front end by translating x86 CISC operations into smaller, faster, more uniform RISC-like operations for use in the back end. AMD’s K6 and Intel’s P6 were two early x86 designs that used this type of instruction set translation to great advantage. The technique was so successful that all subsequent x86 processors from both Intel and AMD have used instruction set translation, as have some RISC processors like IBM’s PowerPC 970.
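
To make the translation concrete: a register-to-memory instruction such as the x86 add eax, [ebx] (add the value stored at the address held in ebx to eax) can be split into a load followed by a register-to-register add. The internal notation below is purely illustrative; it is not the P6’s actual micro-op encoding, which Intel does not document publicly.

; hypothetical decomposition of one CISC instruction into RISC-like micro-ops
add eax, [ebx]    ->    ld   tmp0, [ebx]       ; load micro-op: fetch the memory operand into a temporary
                        add  eax, eax, tmp0    ; ALU micro-op: simple register-to-register add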


The P6 Microarchitecture’s Instruction Decoding Unit

The P6 microarchitecture breaks down complex, variable-length x86 instructions into one or more smaller, fixed-length micro-operations (aka micro-ops, µops, or uops) using a decoding unit that consists of three separate decoders, depicted in Figure 5-10: two simple/fast decoders, which handle simple x86 instructions and can produce one decoded micro-op per cycle; and one complex/slow decoder, which handles the more complex x86 instructions and can produce up to four decoded micro-ops per cycle.

Sixteen-byte groups of architected x86 instructions are fetched from the I-cache into the front end’s 32-byte instruction queue, where predecoding logic first identifies each instruction’s boundaries and type before aligning the instructions for entry into the decoding hardware. Up to three x86 instructions per cycle can then move from the instruction queue into the decoders, where they’re converted to micro-ops and passed into a micro-op queue before going to the ROB. Together, the P6’s three decoders are capable of producing up to six decoded micro-ops per cycle (four from the complex/slow decoder plus one from each of the two simple/fast decoders) for consumption by the micro-op queue. The micro-op queue, in turn, is capable of passing up to three micro-ops per cycle into the P6’s instruction window.

[Figure 5-10: The P6 microarchitecture’s decoding hardware. The diagram shows x86 instructions flowing from the L1 instruction cache through a 2 x 16-byte fetch buffer into the translate/x86 decode stage, where two simple decoders, one complex decoder, and a microcode engine produce micro-ops; the micro-ops pass through a 6-entry micro-op queue into the 40-entry reorder buffer (ROB).]


Simple x86 instructions, which can be decoded very rapidly and which break down into only one or two micro-ops, are by far the most common type of instruction found in an average x86 program. So the P6 dedicates most of its decoding hardware to these types of instructions. More complex x86 instructions, like string manipulation instructions, are less common and take longer to decode.
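
As a rough sketch of how a small group of instructions might be divided among the decoders (the specific instructions and assignments below are illustrative and are not taken from Intel’s documentation):

inc eax            ; simple instruction, a single micro-op: a candidate for a simple/fast decoder
mov ebx, [esp+8]   ; a plain load, also a single micro-op: a candidate for the other simple/fast decoder
add [edi], ecx     ; read-modify-write memory operand: breaks into several micro-ops, so it would be
                   ; routed to the complex/slow decoder in this sketch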
