Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (78 page)

Authors: jon stokes

Tags: #Computers, #Systems Architecture, #General, #Microprocessors

All of this deep buffering, scheduling, queuing, and optimizing is essential for keeping the Pentium 4’s high-speed back end full. To return yet again to the Tetris analogy, imagine what would happen if someone were to double the speed at which the blocks fall; you’d hope that they would also double the size of the look-ahead window to compensate. The Pentium 4 greatly increases the size of the P6’s instruction window as a way of compensating for the fact that the arrangement of instructions in its core is made so much more critical by the increased clock speed and pipeline depth.

The downside to all of this is that the schedulers and queues and the very large ROB all add complexity and cost to the Pentium 4’s design. This complexity and cost are part of the price that the Pentium 4 pays for its deep pipeline and high clock speed.

INTEL’S PENTIUM 4 VS. MOTOROLA’S G4E: THE BACK END

In this chapter, I’ll explain in greater detail the back end of both the Pentium 4 and the G4e. I’ll talk about the execution resources that each processor uses for crunching code and data, and how those resources contribute to overall performance on specific types of applications.

Some Remarks About Operand Formats

Unlike the DLW and PowerPC ISAs described so far, the x86 ISA uses a two-operand format for both integer and floating-point instructions. If you want to add two numbers in registers A and B, the instruction would look as follows:

add A, B

This command adds the contents of A to the contents of B and places the result in A, overwriting whatever was previously in A in the process. Expressed mathematically, this would look as follows:

A = A + B

The problem with using a two-operand format is that it can be inconvenient for some sequences of operations. For instance, if you want to add A to B and store the result in C, you need two operations to do so:

Line #  Code      Comments
1       mov C, A  Copy the contents of register A to register C.
2       add C, B  Add the numbers in registers C and B and store the result in C, overwriting the previous contents of C.

Program 8-1

The first instruction in Program 8-1 copies the contents of A into C so that A’s value is not erased by the addition, and the second instruction adds the two numbers.

With a format of three or more operands, like that of many instructions in the PPC ISA, the programmer gets a little more flexibility and control. For instance, you saw earlier that the PPC ISA has a three-operand add instruction of the format

add destination, source1, source2

So if you want to add the contents of register 1 to the contents of register 2 and store the result in register 3 (i.e., r3 = r1 + r2), you just use the following instruction:

add 3, 2, 1
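
To make the difference between the two formats concrete, here is a minimal Python sketch of a toy register file. The function names and the register values are my own illustrative assumptions, not part of either ISA; the point is only that computing C = A + B without destroying A takes two instructions in the two-operand style and one in the three-operand style.

```python
# Toy model of a register file as a dictionary. The register names and
# values are illustrative only; nothing here is real x86 or PPC encoding.

def mov(regs, dest, src):
    """Copy the contents of src into dest."""
    regs[dest] = regs[src]

def add_two_operand(regs, dest, src):
    """Two-operand add: dest = dest + src, so dest's old value is lost."""
    regs[dest] = regs[dest] + regs[src]

def add_three_operand(regs, dest, src1, src2):
    """Three-operand add: dest = src1 + src2, both sources are preserved."""
    regs[dest] = regs[src1] + regs[src2]

def sum_into_c_two_operand(regs):
    """Program 8-1: copy A into C, then add B into C. Returns instruction count."""
    mov(regs, "C", "A")
    add_two_operand(regs, "C", "B")
    return 2

def sum_into_c_three_operand(regs):
    """The same computation with a single three-operand add."""
    add_three_operand(regs, "C", "A", "B")
    return 1

two = {"A": 5, "B": 7, "C": 0}
three = dict(two)
n_two = sum_into_c_two_operand(two)
n_three = sum_into_c_three_operand(three)
print(two, n_two)
print(three, n_three)
```

Both versions leave A and B untouched and put the sum in C; the two-operand version simply spends an extra instruction on the copy.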

Some PPC instructions support even more than three operands, which can be a real boon to programmers and compiler writers. In “AltiVec Vector Operations” on page 170, we’ll look in detail at the G4e AltiVec instruction set’s use of a four-operand instruction format.

The PPC ISA’s variety of multiple-operand formats is obviously more flexible than the one- and two-operand formats of x86. But nonetheless, modern x86 compilers are quite advanced, and together with the processor’s hidden microarchitectural rename registers and various scheduling algorithms, they can overcome many of the aforementioned problems. The problems with x86’s two-operand format are much more of a liability for floating-point and vector code than for integer code. We’ll talk more about this later, though.

The Integer Execution Units

Though the Pentium 4’s double-pumped integer execution units got quite a bit of press when Netburst was first announced, you might be surprised to learn that both the G4e and the Pentium 4 embody approaches to enhancing integer performance that are very similar. As you’ll see, this similarity arises from both processors’ application of the computing design dictum: Make the common case fast.

For integer applications, the common case is easy to spot. As I outlined in Chapter 1, integer instructions generally fall into one of two categories:

Simple/fast integer instructions

Instructions like add and sub require very few steps to complete and are therefore easy to implement with little overhead. These simple instructions make up the majority of the integer instructions in an average program.

Complex/slow integer instructions

While addition and subtraction are fairly simple to implement, multiplication and division are complicated to implement and can take quite a few steps to complete. Such instructions involve a series of additions and bit shifts, all of which can take a while. These instructions represent only a fraction of the instruction mix for an average program.

Since simple integer instructions are by far the most common type of integer instruction, both the Pentium 4 and G4e devote most of their integer resources to executing these types of instructions very rapidly.
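
To see why a multiply costs more steps than an add, consider the textbook shift-and-add method sketched below in Python. This is only an illustration of the "series of additions and bit shifts" idea; it is not a description of how the multiplier in either the G4e or the Pentium 4 is actually built, and the function name and the 32-bit default width are assumptions of mine.

```python
def shift_and_add_multiply(a, b, width=32):
    """Multiply two unsigned integers using only additions and shifts.

    Returns (product modulo 2**width, number of shift steps taken).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    steps = 0
    while b:
        if b & 1:
            # The low bit of the multiplier is set: add the shifted multiplicand.
            product = (product + a) & mask
        a = (a << 1) & mask  # shift the multiplicand left by one bit
        b >>= 1              # move on to the next multiplier bit
        steps += 1
    return product, steps

print(shift_and_add_multiply(6, 7))
```

A single add is one pass through an adder, while this loop needs one shift (and possibly one add) per significant bit of the multiplier. That difference in step count is the reason it makes sense to give multiplies and divides their own slower unit.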

The G4e’s IUs: Making the Common Case Fast

As was explained in the previous chapter, the G4e has a total of four IUs. The IUs are divided into two groups:

Three simple/fast integer execution units—SIUa, SIUb, SIUc

These three simple IUs handle only fast integer instructions. Most of the instructions executed by these IUs are single-cycle, but there are some multi-cycle exceptions to this rule. Each of the three fast IUs is fed by a single-entry reservation station.

One complex/slow integer execution unit—CIU

This single complex IU handles only complex integer instructions like multiply, divide, and some special-purpose register instructions, including condition register (CR) logical operations. Instructions sent to this IU generally take four cycles to complete, although some take longer. Note that divides, as well as some multiplication instructions, are not fully pipelined and thus can tie up the entire IU for multiple cycles. Also, instructions that update the PPC CR have an extra pipeline stage, called finish, to pass through before they leave the IU (more on this shortly). Finally, the CIU is fed by a two-entry reservation station.

By dedicating three of its four integer ALUs to the fastest, simplest, and most common instructions, the G4e is able to make the common case quite fast.

Before moving on, I should note that the finish stages attached to the ends of some of the execution unit pipelines are new in the G4e. These finish stages are dedicated to updating the condition register to reflect the results of any arithmetic operation that needs to do such updating (this happens infrequently). It’s important to understand that at the end of the execute stage/start of the finish stage, an arithmetic instruction’s results are available for use by dependent instructions, even though the CR has not yet been updated. Therefore, the finish stage doesn’t affect the effective latency of arithmetic instructions. But for instructions that depend on the CR, like branch instructions, the finish stage adds an extra cycle of latency.
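
The effect of the finish stage on latency can be summarized in a few lines of Python. This is a minimal sketch under the assumptions stated in the paragraph above: the result of an arithmetic instruction is forwarded at the end of its execute stage, and only a consumer that reads the CR has to wait one more cycle. The function and parameter names are hypothetical.

```python
def cycle_result_is_usable(issue_cycle, execute_cycles, consumer_reads_cr):
    """Return the cycle at which a dependent instruction can use the result.

    issue_cycle       -- cycle in which the producer enters its execute stage
    execute_cycles    -- number of execute cycles the producer needs
    consumer_reads_cr -- True if the consumer depends on the condition
                         register (e.g., a branch) rather than on the
                         arithmetic result itself
    """
    ready = issue_cycle + execute_cycles
    if consumer_reads_cr:
        ready += 1  # the CR is only updated in the extra finish stage
    return ready

# A single-cycle add issued in cycle 0:
arithmetic_consumer = cycle_result_is_usable(0, 1, consumer_reads_cr=False)
branch_consumer = cycle_result_is_usable(0, 1, consumer_reads_cr=True)
print(arithmetic_consumer, branch_consumer)
```

In this model, a dependent add sees the result one cycle after issue, while a branch that tests the CR sees it one cycle later than that.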

One nice thing that the PPC ISA has going for it is its large number of general-purpose registers (GPRs) for storing integers and addresses. This large number of architectural GPRs (32 to be exact) gives the compiler plenty of flexibility in scheduling integer operations and address calculations.

In addition to the PPC ISA’s 32 GPRs, the G4e provides 16 microarchitectural general-purpose rename registers for use by the on-chip scheduling logic. These additional registers, not visible to the compiler or programmer, are used by the G4e to augment the 32 GPRs, thereby providing more flexibility for scheduling the processor’s execution resources and keeping them supplied with data.

The Pentium 4’s IUs: Make the Common Case Twice as Fast

The Pentium 4’s integer functional block takes a very similar approach to the G4e’s for speeding up integer performance. It contains one slow integer execution unit and two fast integer execution units. By just looking at the number of integer execution units, you might think that the Pentium 4 has less integer horsepower than the G4e. This isn’t quite the case, though, because the Pentium 4’s two fast IUs operate at twice the core clock speed, a trick that allows them to look to the outside world like four fast IUs.

The Pentium 4 can issue two integer instructions per cycle in rapid succession to each of the two fast IUs: one on the rising edge of the clock pulse and one on the falling edge. Each fast ALU can process an integer instruction in 0.5 cycles, which means it can process a total of two integer instructions per cycle. This gives the Pentium 4 a total peak throughput of four simple integer instructions per cycle for the two fast IUs combined.
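
The peak-throughput arithmetic above is simple enough to check in a few lines of Python. The helper names below are my own, and the model deliberately ignores everything that keeps a real processor from reaching its peak; it only shows why two units with a 0.5-cycle latency look like four single-speed units on paper.

```python
from fractions import Fraction

def ops_per_core_cycle(latency_in_core_cycles):
    """Peak instructions one unit can complete per core clock cycle."""
    return 1 / Fraction(latency_in_core_cycles)

def peak_throughput(num_units, latency_in_core_cycles):
    """Peak instructions per core cycle for a group of identical units."""
    return num_units * ops_per_core_cycle(latency_in_core_cycles)

# Two double-pumped fast IUs, each finishing an instruction in half a cycle.
p4_fast_peak = peak_throughput(2, Fraction(1, 2))

# Four hypothetical single-speed IUs, each finishing one instruction per cycle.
single_speed_equivalent = peak_throughput(4, 1)

print(p4_fast_peak, single_speed_equivalent)
```

Both expressions come out to four instructions per core cycle, which is the sense in which the two fast IUs "look like" four.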

Does this mean that the Pentium 4’s two double-speed integer units are twice as powerful as two single-speed integer units? No, not quite. Integer performance is about much more than just a powerful integer functional block. You can’t squeeze peak performance out of an integer unit if you can’t keep it fed with code, and the Pentium 4 seems to have a weakness in this area when it comes to integer code.

We talked earlier in this book about how, due to the Pentium 4’s “narrow and deep” design philosophy, branch mispredictions and cache misses can degrade performance by introducing pipeline bubbles into the instruction stream. This is especially a problem for integer performance, because integer-intensive applications often contain branch-intensive code that exhibits poor locality of reference. As a result, branch mispredictions in conjunction with cache latencies can kill integer performance. (For more information about these issues, see Chapter 11.)
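
A rough way to see why mispredictions hurt a deep pipeline more is the standard back-of-the-envelope formula sketched below. Every number in the example is a placeholder I chose for illustration, not a measurement of either processor, and the model ignores cache misses entirely; treat it as a sketch of the reasoning rather than a performance estimate.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once misprediction bubbles are included.

    base_cpi        -- cycles per instruction with perfect branch prediction
    branch_fraction -- fraction of instructions that are branches
    mispredict_rate -- fraction of branches that are mispredicted
    penalty_cycles  -- bubbles inserted per misprediction, which grows
                       with pipeline depth
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Illustrative inputs only: same code, same predictor accuracy,
# but a shallow pipeline versus a deep one.
shallow = effective_cpi(1.0, branch_fraction=0.2, mispredict_rate=0.05, penalty_cycles=7)
deep = effective_cpi(1.0, branch_fraction=0.2, mispredict_rate=0.05, penalty_cycles=20)
print(shallow, deep)
```

With identical code and identical prediction accuracy, the only thing that changes between the two calls is the penalty per misprediction, so branch-heavy integer code pays more on the deeper design.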

As most benchmarks of the Pentium 4 bear out, in spite of its double-pumped ALUs, the Pentium 4’s “narrow and deep” design is much more suited to floating-point applications than it is to integer applications. This floating-point bias seems to have been a deliberate choice on the part of the Pentium 4’s designers: the Pentium 4 is designed to give not maximum but acceptable performance on integer applications. This strategy works because most modern processors (at least since the PIII, if not the PII or Pentium Pro) are able to offer perfectly workable performance levels on consumer-level integer-intensive applications like spreadsheets, word processors, and the like. Though server-oriented applications like databases require higher levels of integer performance, the demand for ever-increasing integer performance just isn’t there in the consumer market. As a way of increasing the Pentium 4’s integer performance for the server market, Intel sells a version of the Pentium 4 called the Xeon, which has a much larger cache.

Before moving on to the next topic, I should mention the Pentium 4’s large number of microarchitectural rename registers. The x86 ISA has only eight GPRs, but the Pentium 4 augments these with the addition of a large number of rename registers: 128 to be exact. Since the Pentium 4 keeps so many instructions on-chip for scheduling purposes, it needs these added rename resources to prevent the kinds of register-based resource conflicts that result in pipeline bubbles.
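
To show what rename registers buy you, here is a minimal Python sketch of a rename table. It is a teaching model and not the Pentium 4’s actual renaming logic: the class name, the "p0", "p1", ... labels for physical registers, and the four-instruction program are all assumptions of mine, and the sketch never frees a rename register the way real hardware does when instructions retire.

```python
class RenameTable:
    """Maps architectural register names to physical rename registers."""

    def __init__(self, num_rename_regs):
        self.free = [f"p{i}" for i in range(num_rename_regs)]
        self.current = {}  # architectural name -> most recent physical register

    def rename(self, dest, sources):
        # Sources read whichever physical register currently holds each value;
        # a register that has not been renamed yet keeps its architectural name.
        renamed_sources = tuple(self.current.get(s, s) for s in sources)
        if not self.free:
            raise RuntimeError("no free rename register: the pipeline would stall")
        phys = self.free.pop(0)
        self.current[dest] = phys  # later readers of dest now see phys
        return phys, renamed_sources

table = RenameTable(num_rename_regs=128)
program = [
    ("eax", ("ebx",)),        # mov eax, ebx
    ("eax", ("eax", "ecx")),  # add eax, ecx
    ("eax", ("edx",)),        # mov eax, edx  (reuses eax for unrelated work)
    ("eax", ("eax", "esi")),  # add eax, esi
]
renamed = [table.rename(dest, srcs) for dest, srcs in program]
for entry in renamed:
    print(entry)
```

After renaming, the second mov/add pair writes p2 and p3 and never touches p0 or p1, so the scheduler is free to run it alongside the first pair even though all four instructions name eax. With only eight architectural GPRs, this kind of reuse is constant in x86 code, which is why a large pool of rename registers matters.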

The Floating-Point Units (FPUs)

While the mass market’s demand for integer performance may not be picking up, its demand for floating-point seems insatiable. Games, 3D rendering, audio processing, and almost all other forms of multimedia- and entertainment-oriented computing applications are extremely floating-point intensive. With floating-point applications perennially driving the home PC market, it’s no wonder that the Pentium 4’s designers made their design trade-offs in favor of FP performance over integer performance.

In terms of the way they use the processor and cache, floating-point applications are in many respects the exact opposite of the integer applications described in the preceding section. For instance, the branches in floating-point code are few and extremely predictable. Most of these branches occur as exit conditions in small chunks of loop code that iterate through a large dataset (e.g., a sound or image file) in order to modify it. Since these loops iterate many thousands of times as they work their way through a file, the
