Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture
by Jon Stokes

Intel had to hold their nose and add an extra processor state to accommodate them. This means a state switch if you want to go from using x87 to MMX or SSE, and it also means that operating system code had to be rewritten to accommodate the new state.

SSE still had some shortcomings, though. SSE’s vector floating-point operations were limited to single-precision, and vector integer operations were still limited to 64 bits, because they had to use the old MMX/x87 registers.

With the introduction of SSE2 on the Pentium 4, Intel finally got its SIMD act together. On the integer side, SSE2 finally allows the storage of 128-bit integer vectors in the XMM registers, and it modifies the ISA by extending old instructions and adding new ones to support 128-bit SIMD integer operations. For floating-point, SSE2 now supports double-precision SIMD floating-point operations. All told, SSE2 adds 144 new instructions, some of which are cache control instructions.
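
The essence of a packed SIMD integer operation is that one instruction adds several independent lanes at once, with each lane wrapping around on overflow. The Python sketch below models that behavior for a PADDD-style add of four 32-bit lanes in a 128-bit vector; it is a model of the semantics only, not of Intel's hardware or any particular instruction encoding:

```python
def paddd(a, b):
    """Model a 128-bit packed add of four 32-bit lanes (PADDD-style):
    each lane is added independently, wrapping around at 2**32."""
    assert len(a) == len(b) == 4
    return [(x + y) & 0xFFFFFFFF for x, y in zip(a, b)]

# Four 32-bit additions happen "at once"; note the wraparound in lane 3,
# which does not carry into any neighboring lane.
print(paddd([1, 2, 3, 0xFFFFFFFF], [10, 20, 30, 1]))
# [11, 22, 33, 0]
```

The per-lane wraparound is the key difference from an ordinary 128-bit scalar add: a carry out of one lane never propagates into the next.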

Intel’s Pentium 4 vs. Motorola’s G4e: The Back End

The Pentium 4’s Vector Unit: Alphabet Soup Done Quickly

Now that you know something about SSE2, let’s look at how it’s implemented on the Pentium 4. In keeping with the Pentium 4’s “narrow, deep, and fast” approach, the Pentium 4 does not sport any dedicated SSE2 pipelines. Rather, both FPU pipes double as VU pipes, meaning that the FPU memory unit also handles vector memory operations and the FPU arithmetic unit also handles vector arithmetic operations. So in contrast to the G4e’s four vector arithmetic units, the Pentium 4 has one vector arithmetic unit that just does everything: integer, floating-point, permutes, and so on.

So, with only one execution pipeline to handle all vector and floating-point operations, you’re probably wondering how the Pentium 4’s designers expected it to perform competitively on the media applications for which it was obviously designed. The Pentium 4 is able to offer competitive SIMD performance based on a combination of three factors:

- relatively low instruction latencies
- extremely high clock speeds
- a high-bandwidth caching and memory subsystem

Let’s take a look at these factors and how they work together.

The Pentium 4 optimization manual lists in Section C the average latencies of the most commonly used SIMD instructions. A look through the latency tables reveals that the majority of single- and double-precision arithmetic operations have latencies in the four- to six-cycle range. In other words, most vector floating-point instructions go through four to six pipeline stages before leaving the Pentium 4’s VU. This number is relatively low for a 20-stage pipeline design like the Pentium 4, especially considering that vector floating-point instructions on the G4e go through four pipeline stages on average before leaving the G4e’s vector FPU.

Now, considering the fact that at the time of this writing, the Pentium 4’s clock speed is significantly higher than (roughly double) that of the G4e, the Pentium 4’s ability to execute single- and double-precision vector floating-point operations in almost the same number of clock cycles as the G4e means that the Pentium 4 executes these operations almost twice as fast in real-world, “wall clock” time. So the Pentium 4 can get away with having only one VU, because that VU is able to grind through vector operations with a much higher instruction completion rate (instructions/ns) than the G4e. Furthermore, as the Pentium 4’s clock speed increases, its vector crunching power grows.
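
The wall-clock arithmetic behind this claim is simple: latency in nanoseconds is latency in cycles divided by the clock rate. The sketch below uses illustrative clock speeds (a 2.0 GHz Pentium 4 versus a 1.0 GHz G4e, assumptions for the sake of the example, not figures from the text) to show how a higher clock turns similar cycle counts into a much shorter wall-clock latency:

```python
def wall_clock_ns(latency_cycles, clock_ghz):
    """Convert an instruction latency from clock cycles to nanoseconds:
    one cycle at f GHz lasts 1/f nanoseconds."""
    return latency_cycles / clock_ghz

# Hypothetical clocks: Pentium 4 at 2.0 GHz, G4e at 1.0 GHz.
p4_ns = wall_clock_ns(5, 2.0)   # a ~5-cycle SSE2 op takes 2.5 ns
g4e_ns = wall_clock_ns(4, 1.0)  # a ~4-cycle AltiVec op takes 4.0 ns
print(p4_ns, g4e_ns)  # 2.5 4.0
```

Even though the Pentium 4 spends one more cycle in the pipeline in this example, its cycles are half as long, so the operation completes in less wall-clock time.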

Another piece that’s crucial to the whole performance picture is the Pentium 4’s high-bandwidth FSB, memory subsystem, and low-latency caching subsystem. I won’t cover these features here, but I’ll just note that the Pentium 4 has large amounts of bandwidth at its disposal. All this bandwidth is essential to keeping the very fast VU fed with data, and as the Pentium 4’s clock speed has increased, bandwidth has played an even larger role in vector processing performance.


Increasing Floating-Point Performance with SSE2

I mentioned earlier that MMX uses a flat register file, and the same is true of both SSE and SSE2. The eight 128-bit XMM registers are arranged as a flat file, which means that if you’re able to replace an x87 FP operation with an SSE or SSE2 operation, you can use clever resource scheduling to avoid the performance hit brought on by the Pentium 4’s combination of a stack-based FP register file and a non-free fxch instruction. Intel’s highly advanced compiler has proven that converting large amounts of x87 code to SSE2 code can yield a significant performance boost.
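
The difference between the two register-file styles can be sketched with a toy model. In the x87 scheme, a binary operation can only use the top of the register stack, so reaching a value buried at ST(i) costs an extra fxch to swap it up first; in the flat XMM scheme, any register can be named directly. The mnemonics below are real x86 mnemonics used illustratively; this is a counting model, not an emulator:

```python
def stack_file_ops(operand_index):
    """x87-style stack register file: a binary op implicitly uses st(0),
    so an operand at st(i), i != 0, must first be swapped to the top
    with fxch. Returns the instruction sequence as mnemonic strings."""
    ops = []
    if operand_index != 0:
        ops.append(f"fxch st({operand_index})")  # extra swap instruction
    ops.append("fadd")                           # operates on st(0)
    return ops

def flat_file_ops(operand_index):
    """XMM-style flat register file: any register is directly
    addressable, so no exchange instruction is ever needed."""
    return [f"addsd xmm0, xmm{operand_index}"]

print(stack_file_ops(3))  # ['fxch st(3)', 'fadd']
print(flat_file_ops(3))   # ['addsd xmm0, xmm3']
```

On processors where fxch is effectively free, the extra swap costs nothing; the text's point is that on the Pentium 4 it is not free, which is what makes the flat XMM file attractive.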

Conclusions

The preceding discussion should make it clear that the overall design approaches outlined in the first half of this chapter can be seen in the back ends of each processor. The G4e continues its “wide and shallow” approach to performance, counting on instruction-level parallelism (ILP) to allow it to squeeze the most performance out of code. The Pentium 4’s “narrow and deep” approach, on the other hand, uses fewer execution units, eschewing ILP and betting instead on increases in clock speed to increase performance.

Each of these approaches has its benefits and drawbacks, but as I’ve stressed repeatedly throughout this chapter, microarchitecture is by no means the only factor in the application performance equation. Certain properties of the ISA that a processor implements can influence performance.


64-Bit Computing and x86-64

On a number of occasions in previous chapters, I’ve discussed some of the more undesirable aspects of the PC market’s most popular instruction set architecture: the x86. The x86 ISA’s complex addressing modes, its inclusion of unwieldy and obscure instructions, its variable instruction lengths, its dearth of architectural registers, and its other quirks have vexed programmers, compiler writers, and microprocessor architects for years.

In spite of these drawbacks, the x86 ISA continues to enjoy widespread commercial success, and the number of markets in which it competes continues to expand. The reasons for this ongoing success are varied, but one factor stands out as by far the most important: inertia. The installed base of x86 application software is huge, and the costs of an industry-wide transition to a cleaner, more classically RISC ISA would enormously outweigh any benefits. Nonetheless, this hasn’t stopped many folks, including the top brass at Intel, from dreaming of a post-x86 world.

Intel’s IA-64 and AMD’s x86-64

As of this writing, there have been two important attempts to move mainstream, commodity desktop and server computing beyond the 32-bit x86 ISA, the first by Intel and the second by Intel’s chief rival, Advanced Micro Devices (AMD). In 1994, Intel and Hewlett-Packard began work on a completely new ISA, called IA-64. IA-64 is a 64-bit ISA that embodies a radically different approach to performance from anything the mainstream computing market has yet seen. This approach, which Intel has called Explicitly Parallel Instruction Computing (EPIC), is a mix of a very long instruction word (VLIW) ISA, predication, speculative loading, and other compiler-oriented, performance-enhancing techniques, many of which had never been successfully implemented in a commercial product line prior to IA-64.

Because IA-64 represents a total departure from x86, IA-64-based processors cannot run legacy x86 code natively. The fact that IA-64 processors must run the very large mass of legacy x86 code in emulation has posed a serious problem for Intel as they try to persuade various segments of the market to adopt the new architecture.

Unfortunately for Intel, the lack of backward compatibility isn’t the only obstacle that IA-64 has had to contend with. Since its inception, Intel’s IA-64 program has met with an array of setbacks, including massive delays in meeting development and production milestones, lackluster integer performance, and difficulty in achieving high clock speeds. These and a number of other problems have prompted some wags to refer to Intel’s first IA-64 implementation, called Itanium and released in 2001, as Itanic. The Itanium Processor Family (IPF) has since found a niche in the lucrative and growing high-end server and workstation segments. Nonetheless, in focusing on Itanium, Intel left a large, 64-bit-sized hole in the commodity workstation and server markets.

In 1999, with Itanium development beset by problems and clearly some distance away from commercial release, AMD saw an opening to score a major blow against Intel by using a 64-bit derivative of Intel’s own ISA to jump-start and dominate the nascent commodity 64-bit workstation market. Following on the success of its Athlon line of desktop processors, AMD took a gamble and bet the company’s future on a set of 64-bit extensions to the x86 ISA. Called x86-64, these extensions enabled AMD to produce a line of 64-bit microprocessors that are cost-competitive with existing high-end and midrange x86 processors and, most importantly, backward-compatible with existing x86 code. The new processor architecture, popularly referred to by its code name Hammer, has been a commercial and technical success.

Introduced in April 2003 after a long series of delays and production problems, the Hammer’s strong benchmark performance and excellent adoption rate spelled trouble for any hopes that Intel may have had for the mainstream commercial adoption of its Itanium line. Intel conceded as much when, in 2004, they took the unprecedented step of announcing support for AMD’s extensions in their own x86 workstation and server processors. Intel calls these 64-bit extensions IA-32e, but in this book we’ll refer to them by AMD’s name: x86-64.


Because x86-64 represents the future of x86 for both Intel and AMD, this chapter will look in some detail at the new ISA. As you’ll see, x86-64 is more than just a 64-bit extension to the 32-bit x86 ISA; it adds some new features as well, while getting rid of some obsolete ones.

Why 64 Bits?

The question of why we need 64-bit computing is often asked but rarely answered in a satisfactory manner. There are good reasons for the confusion surrounding the question, the first of which is the rarely acknowledged fact that “the 64-bit question” is actually two questions:

1. How does the existing 64-bit server and workstation market use 64-bit computing?
2. What use does the consumer market have for 64-bit computing?

People who ask the 64-bit question are usually asking for the answer to question 1 in order to deduce the answer to question 2. This being the case, let’s first look at question 1 before tackling question 2.

What Is 64-Bit Computing?

Simply put, the labels 16-bit, 32-bit, or 64-bit, when applied to a microprocessor, characterize the processor’s data stream. You may have heard the term 64-bit code; this designates code that operates on 64-bit data.

In more specific terms, the labels 64-bit, 32-bit, and so on designate the number of bits that each of the processor’s general-purpose registers (GPRs) can hold. So when someone uses the term 64-bit processor, what they mean is a processor with GPRs that store 64-bit numbers. And in the same vein, a 64-bit instruction is an instruction that operates on 64-bit numbers that are stored in 64-bit GPRs.

Figure 9-1 shows two computers, one a 32-bit computer and the other a 64-bit computer.

In Figure 9-1, I’ve tried my best to modify Figure 1-3 on page 6 in order to make my point. Don’t take the instruction and code sizes too literally, since they’re intended to convey a general feel for what it means to “widen” a processor from 32 bits to 64 bits.

Notice that not all of the data in memory, the cache, or the registers is 64-bit data. Rather, the data sizes are mixed, with 64 bits being the widest. We’ll discuss why this is and what it means shortly.

Note that in the 64-bit CPU pictured in Figure 9-1, the width of the code stream has not changed; the same-sized machine language instruction could theoretically represent an instruction that operates on 32-bit numbers or an instruction that operates on 64-bit numbers, depending on the instruction’s default data size. On the other hand, the widths of some elements of the data stream have doubled.