
station (RS). The RS is really the heart of the reservoir approach, a place where instructions can collect in a pool and then issue when their data become available. This instruction pool is what decouples the P6’s fetch/decode bandwidth from its execution bandwidth by enabling the P6 to continue executing instructions during short periods when the front end gets hung up in either fetching or decoding the next instruction.
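To make the idea of an instruction pool concrete, here is a minimal C sketch of a reservation-station-style buffer. It illustrates only the issue-when-ready idea; the structure sizes and names are invented, and real hardware does this with parallel tag-matching logic rather than a loop:

    #include <stdbool.h>
    #include <stdio.h>

    /* A minimal reservation-station sketch: each waiting instruction
       names two source registers; it may issue only when both are ready. */
    #define POOL_SIZE 8
    #define NUM_REGS  8

    struct rs_entry {
        bool valid;        /* slot holds a waiting instruction */
        int  src1, src2;   /* source register numbers */
        const char *op;    /* mnemonic, for printing */
    };

    static bool reg_ready[NUM_REGS];          /* operand availability */
    static struct rs_entry pool[POOL_SIZE];   /* the instruction pool */

    /* Issue every pooled instruction whose operands are both ready. */
    static void issue_ready(void)
    {
        for (int i = 0; i < POOL_SIZE; i++) {
            if (pool[i].valid &&
                reg_ready[pool[i].src1] && reg_ready[pool[i].src2]) {
                printf("issuing %s (r%d, r%d)\n",
                       pool[i].op, pool[i].src1, pool[i].src2);
                pool[i].valid = false;  /* slot freed for the front end */
            }
        }
    }

    int main(void)
    {
        /* Two instructions collect in the pool; r2 is not yet ready. */
        pool[0] = (struct rs_entry){ true, 0, 1, "add" };
        pool[1] = (struct rs_entry){ true, 2, 3, "mul" };
        reg_ready[0] = reg_ready[1] = reg_ready[3] = true;

        issue_ready();          /* only "add" issues */
        reg_ready[2] = true;    /* r2's producer completes */
        issue_ready();          /* now "mul" issues */
        return 0;
    }

Each call to issue_ready() drains whatever has become eligible, which is the decoupling described above: the front end can keep depositing instructions even while older ones wait on their operands.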

With the advent of the Pentium 4’s much longer pipeline, the reservation station’s decoupling just isn’t enough. The Pentium 4’s performance plummets when the front end cannot keep feeding instructions to the back end in extremely rapid succession. There’s no extra time to wait for a complex instruction to decode or for a branch delay—the high-speed back end needs the instructions to flow quickly.

One route that Intel could have taken would have been to increase the size of the code reservoir, and in doing so, increase the size of the instruction window. Intel actually did do this—the Pentium 4 can track up to 126 instructions in various stages of execution—but that’s not all they did. More drastic measures were required to keep the high-speed back end from depleting the reservoir before the front end could fill it.

The answer that Intel settled on was to take the costly and time-consuming x86 decode stage out of the basic pipeline. They did this by the clever trick of converting the L1 cache—a structure that was already on the die and therefore already taking up transistors—into a cache for decoded micro-ops.

The Trace Cache

As the previous chapter mentioned, modern x86 chips convert complex x86 instructions into a simple internal instruction format called a micro-operation (aka micro-op, µop, or uop). These micro-ops are uniform, and thus it’s easier for the processor to manage them dynamically. To return to the previous chapter’s Tetris analogy, converting all of the x86 instructions into micro-ops is kind of like converting all of the falling Tetris pieces into one or two types of simple piece, like the T and the block pieces. This makes everything easier to place, because there’s less complexity to manage on the fly.
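As a rough illustration, consider how a read-modify-write instruction cracks apart. The micro-op mnemonics below are invented for this sketch and are not Intel’s real internal format:

    #include <stdio.h>

    /* Illustrative only: a read-modify-write x86 instruction cracked
       into three simple, uniform micro-ops (load / ALU op / store).  */
    int main(void)
    {
        const char *x86_insn = "add [ebx], eax";  /* mem += reg */
        const char *uops[] = {
            "ld   tmp0, [ebx]",      /* micro-op 1: load the operand   */
            "add  tmp0, tmp0, eax",  /* micro-op 2: do the arithmetic  */
            "st   [ebx], tmp0",      /* micro-op 3: store the result   */
        };

        printf("%s decodes into:\n", x86_insn);
        for (int i = 0; i < 3; i++)
            printf("  uop %d: %s\n", i + 1, uops[i]);
        return 0;
    }

The three resulting micro-ops are the uniform Tetris pieces: each one does exactly one simple thing, so the back end can schedule them independently.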


The older P6 fetches x86 instructions from the L1 instruction cache and translates them into micro-ops before passing them on to the reservation station to be scheduled for execution. The Pentium 4, in contrast, fetches groups of x86 instructions from the L2 cache, decodes them into strings of micro-ops called traces, and then fits these traces into its modified L1 instruction cache (the trace cache). This way, the instructions are already decoded, so when it comes time to execute them, they need only to be fetched from the trace cache and passed directly into the back end’s buffers.

So the trace cache is a reservoir for a reservoir; it builds up a large pool of already decoded micro-ops that can be piped directly into the back end’s smaller instruction pool. This helps keep the high-speed back end from draining that pool dry.

Shortening Instruction Execution Time

As noted earlier, on a conventional x86 processor like the PIII or the Athlon, x86 instructions make their way from the instruction cache into the decoder, where they’re broken down into multiple smaller, more uniform, more easily managed instructions called micro-ops. (See the section on instruction set translation in Chapter 3.) These micro-ops are actually what the out-of-order back end rearranges, executes, and commits.

This instruction translation happens each time an instruction is executed, so it adds a few pipeline stages to the beginning of the processor’s basic pipeline. Notice in Figures 7-6 and 7-7 that multiple pipeline stages have been collapsed into each other—instruction fetch takes multiple stages, translate takes multiple stages, decode takes multiple stages, and so on.

For a block of code that’s executed only a few times over the course of a single program run, this loss of a few cycles to retranslation each time isn’t that big of a deal. But for a block of code that’s executed thousands and thousands of times (e.g., a loop in a media application that applies a series of operations to a large file), the number of cycles spent repeatedly translating and decoding the same group of instructions can add up quickly. The Pentium 4 reclaims those lost cycles by removing the need to translate those x86 instructions into micro-ops each time they’re executed.
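To put a rough number on those reclaimed cycles, here is a back-of-the-envelope sketch. Both figures in it (three cycles of redecoding per pass, one million iterations) are illustrative assumptions, not measured Pentium 4 numbers:

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative assumptions, not measured hardware figures.  */
        long iterations    = 1000000; /* passes through a hot media loop */
        long decode_cycles = 3;       /* cycles to retranslate the loop
                                         body on each pass               */

        /* Without a trace cache the loop body is redecoded on every
           pass; with one, it is decoded once and replayed from cache. */
        long wasted = iterations * decode_cycles;
        printf("cycles spent redecoding: %ld\n", wasted); /* 3,000,000 */
        return 0;
    }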

The Pentium 4’s instruction cache takes translated, decoded micro-ops that are primed and ready to be sent straight out to the back end and arranges them into little mini-programs called traces. These traces, and not the x86 code that was produced by the compiler, are what the Pentium 4 executes whenever there’s a trace cache hit, which is over 90 percent of the time. As long as the needed code is in the trace cache, the Pentium 4’s execution path looks as it does in Figure 7-7.

As the front end executes the stored traces, the trace cache sends up to three micro-ops per cycle directly to the back end, without the need for them to pass through any translation or decoding stages. Only when there’s a trace cache miss does that top part of the front end kick in to fetch and decode instructions from the L2 cache. The decoding and translating steps brought on by a trace cache miss add another eight pipeline stages onto the beginning of the Pentium 4’s pipeline. You can see that the trace cache saves quite a few cycles over the course of a program’s execution, thereby shortening the average instruction execution time and average instruction latency.

[Figure 7-6: Normal x86 processor’s critical execution path. Stages shown: 1. instruction fetch (from the L1 instruction cache, steered by a branch unit), 2. translate x86/decode, 3. allocate/schedule, 4. execute (three execution units), 5. write]
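Those two numbers, the 90-percent-plus hit rate and the eight-stage miss penalty, allow a quick expected-value sketch. The arithmetic below pins the hit rate at exactly 90 percent and ignores pipelining overlap, so it is only a first approximation:

    #include <stdio.h>

    int main(void)
    {
        /* Figures from the text: a >90% trace cache hit rate, and a
           miss adds eight fetch/decode stages to the pipeline front. */
        double hit_rate     = 0.90;
        double miss_penalty = 8.0;

        /* Expected extra front-end stages per fetch. */
        double avg_extra = (1.0 - hit_rate) * miss_penalty;
        printf("average extra stages per fetch: %.2f\n", avg_extra);
        return 0;   /* prints 0.80: most fetches pay no decode cost */
    }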

The Trace Cache’s Operation

The trace cache operates in two modes. Execute mode is the mode pictured above, where the trace cache feeds stored traces to the execution logic to be executed. This is the mode that the trace cache normally runs in. When there’s an L1 cache miss, the trace cache goes into trace segment build mode. In this mode, the front end fetches x86 code from the L2 cache, translates it into micro-ops, builds a trace segment with it, and loads that segment into the trace cache to be executed.
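The two modes can be pictured as a simple state machine. The C sketch below is a conceptual model only; the hit test and the one-step segment build are stand-ins, and none of the names correspond to real Pentium 4 structures:

    #include <stdio.h>

    enum tc_mode { EXECUTE_MODE, BUILD_MODE };

    /* Conceptual model of the trace cache's two operating modes. */
    static enum tc_mode mode = EXECUTE_MODE;

    static int trace_cache_hit(int addr) { return addr % 4 != 0; } /* toy */

    static void step(int fetch_addr)
    {
        if (mode == EXECUTE_MODE) {
            if (trace_cache_hit(fetch_addr)) {
                printf("addr %d: stream stored trace to back end\n",
                       fetch_addr);
            } else {
                mode = BUILD_MODE;   /* L1 (trace cache) miss */
                printf("addr %d: miss, enter trace segment build mode\n",
                       fetch_addr);
            }
        } else {
            /* fetch x86 code from L2, decode to micro-ops, load segment */
            printf("addr %d: fetch from L2, decode, load segment\n",
                   fetch_addr);
            mode = EXECUTE_MODE;     /* segment ready; resume execution */
        }
    }

    int main(void)
    {
        int addrs[] = { 1, 2, 4, 4, 5 };
        for (int i = 0; i < 5; i++)
            step(addrs[i]);
        return 0;
    }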

Notice in Figure 7-7 that the trace cache execution path knocks the BPU out of the picture along with the instruction fetch and translate/decode stages. This is because a trace segment is much more than just a translated, decoded, predigested slice of the same x86 code that the compiler originally produced. The trace cache actually uses branch prediction when it builds a trace.

[Figure 7-7: The Pentium 4’s critical execution path. Stages shown: 1. trace cache fetch (from the L1 instruction cache, now a trace cache, steered by a branch unit), 2. allocate/schedule, 3. execute (three execution units), 4. write. The instruction fetch and translate x86/decode stages, with their own branch unit, sit off the critical path]

As shown in Figure 7-8, the trace cache’s branch prediction hardware splices code from the branch that it speculates the program will take right into the trace behind the code that it knows the program will take. So if you have a chunk of x86 code with a branch in it, the trace cache builds a trace from the instructions up to and including the branch instruction. Then, it picks which branch it thinks the program will take, and it continues building the trace along that speculative branch.
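A sketch of that build process, with the speculative splice made explicit (the toy program, its single predicted branch, and the eight-entry segment limit are all invented for illustration; real trace segments also obey line and length limits):

    #include <stdio.h>

    /* Toy program: 'B' is a conditional branch predicted taken to
       index 6; every other entry is an ordinary instruction.       */
    struct insn { char name; int is_branch; int predicted_target; };

    static struct insn program[] = {
        {'a', 0, 0}, {'b', 0, 0}, {'B', 1, 6},
        {'c', 0, 0}, {'d', 0, 0}, {'e', 0, 0},   /* fall-through path */
        {'x', 0, 0}, {'y', 0, 0},                /* predicted path    */
    };

    int main(void)
    {
        int n = (int)(sizeof program / sizeof program[0]);
        char trace[8];
        int  len = 0, pc = 0;

        /* Build a trace segment: append straight-line code, and when a
           branch is met, keep it and continue along the predicted path,
           splicing the speculative code in right behind the branch.    */
        while (len < 8 && pc < n) {
            trace[len++] = program[pc].name;
            pc = program[pc].is_branch ? program[pc].predicted_target
                                       : pc + 1;
        }

        printf("trace segment:");
        for (int i = 0; i < len; i++)
            printf(" %c", trace[i]);
        printf("\n");    /* prints: trace segment: a b B x y */
        return 0;
    }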

[Figure 7-8: Speculative execution using the trace cache. Shown: an x86 instruction stream containing a branch instruction, and the trace segment that packs the code up to and including the branch together with code from the speculative instruction path beyond it]

Having the speculative execution path spliced in right after the branch instruction confers on the trace cache two big advantages over a normal instruction cache. First, in a normal machine, it takes the branch predictor and BPU some time to do their thing when they come across a conditional branch instruction—they have to figure out which branch to speculatively execute, load up the proper branch target, and so on. This whole process usually adds at least one cycle of delay after every conditional branch instruction, a delay that often can’t be filled with other code and therefore results in a pipeline bubble. With the trace cache, however, the code from the branch target is already sitting there right after the branch instruction, so there’s no delay associated with looking it up and hence no pipeline bubble. In other words, the Pentium 4’s trace cache implements a sort of branch folding, like what we previously saw implemented in the instruction queues of PowerPC processors.

The other advantage that the trace cache offers is also related to its ability to store speculative branches. When a normal L1 instruction cache fetches a cache line from memory, it stops fetching when it hits a branch instruction and leaves the rest of the line blank. If the branch instruction is the first instruction in an L1 cache line, then it’s the only instruction in that line, and the rest of the line goes to waste. Trace cache lines, on the other hand, can contain both branch instructions and the speculative code after the branch instruction. This way, no space in the trace cache’s six–micro-op line goes to waste.
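A quick sketch makes the difference in line utilization visible. The six-slot line size comes from the text; sweeping the branch position across the line is just for illustration:

    #include <stdio.h>

    int main(void)
    {
        int line_slots = 6;   /* six micro-ops per line, from the text */

        for (int branch_pos = 1; branch_pos <= line_slots; branch_pos++) {
            /* Conventional line: filling stops at the branch.        */
            int used_conventional = branch_pos;
            /* Trace line: speculative code fills the remaining slots. */
            int used_trace = line_slots;
            printf("branch in slot %d: conventional %d/6, trace %d/6\n",
                   branch_pos, used_conventional, used_trace);
        }
        return 0;
    }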

Most compilers take steps to deal with the two problems I’ve outlined (the delay after the branch and the wasted cache line space). As you saw, though, the trace cache solves these problems in its own way, so programs that are optimized to exploit these abilities see some advantages from them.

One interesting effect that the trace cache has on the Pentium 4’s front end is that x86 translation/decode bandwidth is for the most part decoupled from dispatch bandwidth. You saw previously how the P6, for instance, spends a lot of transistor resources on three different x86 decoders so that it can translate enough clunky x86 instructions each cycle into micro-ops to keep the back end fed. With the Pentium 4, the fact that most of the time program code is fetched from the trace cache in the form of predigested micro-ops means that a high-bandwidth translator/decoder isn’t necessary. The Pentium 4’s decoding logic only has to kick on whenever there’s an L1 cache miss, so it was designed to decode only one x86 instruction per clock cycle. This is one-third the maximum theoretical decode bandwidth of the P6, but the Pentium 4’s trace cache allows it to meet or exceed the P6’s real-world average dispatch rate.
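A rough blend shows why a single-instruction-per-cycle decoder can keep up. The hit rate, the three-micro-ops-per-cycle trace cache stream, and the one-per-cycle decode rate come from the text; the two-micro-ops-per-instruction figure is an assumption made for this sketch:

    #include <stdio.h>

    int main(void)
    {
        /* From the text: the trace cache streams up to 3 uops/cycle,
           the decoder handles 1 x86 instruction/cycle, and hits occur
           >90% of the time. The 2-uops-per-instruction figure is an
           illustrative assumption.                                   */
        double hit_rate      = 0.90;
        double uops_on_hit   = 3.0;
        double uops_per_inst = 2.0;
        double decode_rate   = 1.0;   /* x86 instructions per cycle */

        double supply = hit_rate * uops_on_hit
                      + (1.0 - hit_rate) * decode_rate * uops_per_inst;
        printf("avg micro-op supply: %.2f uops/cycle\n", supply);
        return 0;   /* prints 2.90: close to the 3-uop peak */
    }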

The trace cache’s handling of very long, multi-cycle x86 instructions is worth taking a look at, because it’s quite clever. While most x86 instructions decode into around two or three micro-ops, there are some exceedingly long (and thankfully rare) x86 instructions (e.g., the string manipulation instructions) that decode into hundreds of micro-ops. Like the P6, the Pentium 4 has a special microcode ROM that decodes these longer instructions so that the regular hardware decoder can concentrate on decoding the smaller, faster instructions. For each long instruction, the microcode ROM stores a canned sequence of micro-ops, which it spits out when fed that instruction.

To keep these long, prepackaged sequences of micro-ops from polluting the trace cache, the Pentium 4’s designers devised the following solution: Whenever the trace cache is building a trace segment and it encounters one of the long x86 instructions, instead of breaking it down and storing it as a micro-op sequence, the trace cache inserts into the trace segment a tag that points to the section of the microcode ROM containing the micro-op sequence for that particular instruction. Later, in execute mode, when the trace cache is streaming instructions out to the back end and it encounters one of these tags, it stops and temporarily hands control of the instruction stream over to the microcode ROM. The microcode ROM spits out the proper sequence of micro-ops (as designated by the tag) into the instruction stream, and then hands control back over to the trace cache, which resumes putting out instructions. The back end, which is on the other end of this instruction stream, neither knows nor cares whether the micro-ops it receives come from the trace cache or the microcode ROM.
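The tag mechanism can be sketched as follows; the entry format, the tag name, and the canned sequence are all invented for illustration, and a real ROM sequence could run to hundreds of micro-ops:

    #include <stdio.h>

    /* Illustrative structures only; not the real trace cache format. */
    enum kind { UOP, MS_ROM_TAG };

    struct entry {
        enum kind   kind;
        const char *text;   /* micro-op text, or ROM sequence name */
    };

    /* Canned micro-op sequence for one long x86 instruction (here, a
       toy string-move); invented mnemonics, heavily abbreviated.     */
    static const char *rom_rep_movs[] = { "ld tmp,[esi]", "st [edi],tmp",
                                          "add esi,1", "add edi,1",
                                          "sub ecx,1", "jnz ..." };

    static void stream(const struct entry *trace, int n)
    {
        for (int i = 0; i < n; i++) {
            if (trace[i].kind == UOP) {
                printf("tc : %s\n", trace[i].text);  /* from trace cache */
            } else {
                /* tag found: hand control to the microcode ROM, which
                   streams its canned sequence, then resume the trace  */
                for (int j = 0; j < 6; j++)
                    printf("rom(%s): %s\n", trace[i].text, rom_rep_movs[j]);
            }
        }
    }

    int main(void)
    {
        struct entry trace[] = {
            { UOP,        "add eax,ebx" },
            { MS_ROM_TAG, "rep_movs"    },  /* tag, not the uops themselves */
            { UOP,        "sub ecx,1"   },
        };
        stream(trace, 3);
        return 0;
    }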
