Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (74 page)


Authors: Jon Stokes

Tags: #Computers, #Systems Architecture, #General, #Microprocessors


set it apart from the original G4. It also continued the excellent performance/power consumption ratio of its predecessors. These features combined to make it an excellent chip for portables, and Apple has exploited derivatives of this basic architecture under the G4 name in a series of innovative desktop enclosure designs and portables. The G4e also brought enhanced vector computing performance to the table, which made it a great platform for DSP and media applications.

This chapter will examine the trade-offs and design decisions that the Pentium 4’s architects made in their effort to build a MHz monster, paying special attention to the innovative features that the Pentium 4 sported and the ways that those features fit with the processor’s overall design philosophy and target application domain. We’ll cover the Pentium 4’s ultradeep pipeline, its trace cache, its double-pumped ALUs, and a host of other aspects of its design, all with an eye to their impact on performance. As a point of comparison, we’ll also look at the microarchitecture of Motorola’s G4e. By examining two microprocessor designs side by side, you’ll gain a deeper understanding of how the concepts outlined in the previous chapters play out in a pair of popular, real-world designs.

The Pentium 4’s Speed Addiction

Table 7-1 lists the features of the Pentium 4.

Table 7-1: Features of the Pentium 4

Introduction Date            April 23, 2001
Process                      0.18 micron
Transistor Count             42 million
Clock Speed at Introduction  1.7 GHz
Cache Sizes                  L1: Approximately 16KB instruction, 16KB data
Features                     Simultaneous Multithreading (SMT, aka “hyperthreading”) added in 2003. 64-bit support (EM64T) and SSE3 added in 2004. Virtualization Technology (VT) added in 2005.


Chapter 7

While some processors still have the classic four-stage pipeline described in Chapter 3, most modern CPUs are more complicated. You’ve already seen how the original Pentium had a second decode stage, and the P6 core tripled the standard four-stage pipeline to 12 stages. The Pentium 4, with a whopping 20 stages in its basic pipeline, takes this tactic to the extreme. Take a look at Figure 7-1. The chart shows the relative clock frequencies of Intel’s last six x86 designs. (This picture assumes the same manufacturing process for all six cores.) The vertical axis shows the relative clock frequency, and the horizontal axis shows the various processors relative to each other.

[Figure 7-1 chart: relative frequency on the vertical axis; processors, from the 286 through the P4P, on the horizontal axis]

Figure 7-1: The relative frequencies of Intel’s processors

Intel’s explanation of this diagram and the history it illustrates is enlightening, as it shows where their design priorities were:

Figure [3.2] shows that the 286, Intel386™, Intel486™, and Pentium® (P5) processors had similar pipeline depths; they would run at similar clock rates if they were all implemented on the same silicon process technology. They all have a similar number of gates of logic per clock cycle. The P6 microarchitecture lengthened the processor pipelines, allowing fewer gates of logic per pipeline stage, which delivered significantly higher frequency and performance. The P6 microarchitecture approximately doubled the number of pipeline stages compared to the earlier processors and was able to achieve about a 1.5 times higher frequency on the same process technology. The NetBurst microarchitecture was designed to have an even deeper pipeline (about two times the P6 microarchitecture) with even fewer gates of logic per clock cycle to allow an industry-leading clock rate.

—The Microarchitecture of the Pentium 4 Processor, p. 3.

As you learned in Chapter 3, there are limits to how deeply you can pipeline an architecture before you begin to reach a point of diminishing returns. Deeper pipelining results in an increase in instruction execution time; this increase can be quite damaging to instruction completion rates if the pipeline has to be flushed and refilled often. Furthermore, in order to realize the throughput gains that deep pipelining promises, the processor’s clock speed must increase in proportion to its pipeline depth. But in the real world, speeding up the clock of a deeply pipelined processor to match its pipeline depth is not all that easy.
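The trade-off described above can be made concrete with a toy model. The sketch below is not taken from any Intel or Motorola documentation. It assumes a fixed amount of logic delay per instruction (`logic_ns`), a fixed latch overhead added to every stage (`latch_ns`), and a fixed fraction of instructions that force a pipeline flush (`flush_rate`). All three values, and the function names, are hypothetical and chosen only to show the shape of the curve.

```python
# Toy model of pipeline depth versus latency and throughput.
# All constants are illustrative assumptions, not measurements of any real CPU.

def cycle_time_ns(depth, logic_ns=10.0, latch_ns=0.25):
    """Clock period when logic_ns of work is split evenly across `depth` stages.

    latch_ns is a fixed per-stage overhead, which is why the clock cannot
    speed up in exact proportion to pipeline depth.
    """
    return logic_ns / depth + latch_ns


def latency_ns(depth, logic_ns=10.0, latch_ns=0.25):
    """Time for one instruction to travel through every stage."""
    return depth * cycle_time_ns(depth, logic_ns, latch_ns)


def instructions_per_ns(depth, flush_rate=0.02, logic_ns=10.0, latch_ns=0.25):
    """Average completion rate of a single-issue pipeline.

    Each instruction costs one cycle, and a fraction flush_rate of them
    also costs (depth - 1) cycles to refill the pipeline after a flush.
    """
    cycles_per_instruction = 1.0 + flush_rate * (depth - 1)
    return 1.0 / (cycles_per_instruction * cycle_time_ns(depth, logic_ns, latch_ns))


if __name__ == "__main__":
    for depth in (4, 12, 20):
        print(
            f"{depth:2d} stages: "
            f"cycle {cycle_time_ns(depth):.2f} ns, "
            f"latency {latency_ns(depth):.2f} ns, "
            f"{instructions_per_ns(depth):.3f} instructions/ns"
        )
```

With these assumed numbers, going from 4 to 20 stages multiplies the depth by five but only shortens the cycle from 2.75 ns to 0.75 ns, while the time a single instruction spends in the pipeline grows from 11 ns to 15 ns. Throughput still improves, but by much less than the increase in depth.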

Because of these drawbacks to deep pipelining, many critics of the Pentium 4’s microarchitecture, dubbed NetBurst by Intel, have suggested that its staggeringly long pipeline was a gimmick: a poor design choice made for reasons of marketing and not performance and scalability. Intel knew that the public naïvely equated higher MHz numbers with higher performance, or so the argument went, so they designed the Pentium 4 to run at stratospheric clock speeds and in the process, made design trade-offs that would prove detrimental to real-world performance and long-term scalability.

As it turns out, the Pentium 4’s critics were both wrong and right. In spite of the predictions of its most ardent detractors, the Pentium 4’s performance has scaled fairly well with its clock rate, a phenomenon that readers of this book would expect given the section “Pipelining Explained” on page 40. But though they were wrong about its performance, the Pentium 4’s critics were right about the origins of the processor’s deeply pipelined approach. Revelations from former members of the Pentium 4’s design team, as well as my own off-the-record conversations with Intel folks, all indicate that the Pentium 4’s design was the result of a marketing-driven focus on clock speeds at the expense of actual performance and long-term scalability.

It’s my understanding that this fact was widely known within Intel, even though it was not, and probably never will be, publicly acknowledged. We now know that during the course of the Pentium 4’s design, the design team was under pressure from the marketing folks to turn out a chip that would give Intel a massive MHz lead over its rivals. The reasoning apparently went that MHz was a single number that the general public understood, and they knew that, just like with everything in the world (except for golf scores), higher numbers are somehow better.

When it comes to processor clock speeds, higher numbers are indeed better, but industry-wide difficulties with the transition to a 90-nanometer process caused problems for NetBurst, which on the whole relied on ever-increasing clock rates and ever-rising power consumption to maintain a performance edge over its rivals. As Intel struggled to keep up the regularly scheduled increases in the Pentium 4’s clock rate, the processor’s performance increases began to level off, even as its power consumption continued to rise.

Regardless of the drawbacks of the NetBurst architecture and its long-term prospects, the Pentium 4 line of processors has been successful from both commercial and performance standpoints. This is because the Pentium 4’s way of doing things has advantages for certain types of applications, especially 3D and streaming media applications, even though it carries with it serious risks.


The General Approaches and Design Philosophies of the Pentium 4 and G4e

The drastic difference in pipeline depth between the G4e and the Pentium 4 reflects some very important differences in the design philosophies and goals of the two processors. Both processors try to execute as many instructions as quickly as possible, but they attack this problem in two different ways.

The G4e’s approach to performance can be summarized as “wide and shallow.” Its designers added more functional units to its back end for executing instructions, and its front end tries to fill up these units by issuing instructions to each functional unit in parallel. In order to extract the maximum amount of instruction-level parallelism (ILP) from the linear code stream, the G4e’s front end first moves a small batch of instructions onto the chip. Then, its out-of-order (OOO) execution logic examines them for hazard-causing dependencies, spreads them out to execute in parallel, and then pushes them through the back end’s nine execution units. Each of the G4e’s execution units has a fairly short pipeline, so the instructions take very few cycles to move through and finish executing. Finally, in the G4e’s last pipeline stages, the instructions are put back in their original program order before the results are written back to memory.

At any given moment, the G4e can have up to 16 instructions simultaneously spread throughout the chip in various stages of execution. As you’ll see when we look at the Pentium 4, this instruction window is quite small. The end result is that the G4e focuses on getting a small number of instructions onto the chip at once, spreading them out widely to execute in parallel, and then getting them off the chip in as few cycles as possible. This “wide and shallow” approach is illustrated in Figure 7-2.

The Pentium 4 takes a “narrow and deep” approach to moving through the instruction stream, as illustrated in Figure 7-3. The fact that the Pentium 4’s pipeline is so deep means that it can hold and work on quite a few instructions at once, but instead of spreading these instructions out more widely to execute in parallel, it pushes them through its narrower back end at a higher rate.

It’s important to note that in order to keep the Pentium 4’s fast back end fed with instructions, the processor needs deep buffers that can hold and schedule an enormous number of instructions. The Pentium 4 can have up to 126 instructions in various stages of execution simultaneously. This way, the processor can have many more instructions on chip for the OOO execution logic to examine for dependencies and then rearrange to be rapidly fired to the execution units. Put another way, the Pentium 4’s instruction window is very large.
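To see why a larger instruction window gives the OOO logic more to work with, consider the following toy sketch. It is not a model of either processor’s actual scheduler: the function names `ready_instructions` and `make_program`, the synthetic program, and the rule that an instruction is ready when none of its sources is written by an earlier instruction in the window are all assumptions made for illustration. Only the window sizes, 16 and 126, come from the text.

```python
# Toy illustration of instruction-window size. Hypothetical model: an
# instruction is "ready" if none of its source registers is produced by an
# earlier instruction that is still inside the window.

def ready_instructions(program, window):
    """Return indices of ready instructions among the first `window` entries.

    program is a list of (dest_register, [source_registers]) tuples.
    """
    pending_dests = set()
    ready = []
    for index, (dest, sources) in enumerate(program[:window]):
        if not any(src in pending_dests for src in sources):
            ready.append(index)
        # The result is not available yet, so later readers must wait.
        pending_dests.add(dest)
    return ready


def make_program(length=200, chain_length=8):
    """Build a synthetic stream of dependency chains (an assumption).

    Every chain_length-th instruction starts a new chain; every other
    instruction reads the register written by its predecessor.
    """
    return [
        (f"r{i}", [] if i % chain_length == 0 else [f"r{i - 1}"])
        for i in range(length)
    ]


if __name__ == "__main__":
    program = make_program()
    for window in (16, 126):
        count = len(ready_instructions(program, window))
        print(f"window of {window:3d}: {count} independent instructions found")
```

On this synthetic stream, a 16-entry window exposes 2 independent instructions and a 126-entry window exposes 16. Real code is far less regular, and finding ready instructions only helps if the back end has execution units free to take them.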

Intel’s Pentium 4 vs. Motorola’s G4e: Approaches and Design Philosophies


[Figure 7-2 diagram, with labels: Code Stream, Front End, Back End]

Figure 7-2: The G4e’s approach to performance

It might help you to think about these two approaches in terms of a fast food drive-through analogy. At most fast food restaurants, you can either walk in or drive through. If you walk in, there are five or six short lines that you can get in and wait to have your order processed by a single server in one long step. If you choose to drive through, you’ll wind up on a single long line, but that line is geared to move faster because more servers process your order in more, quicker steps. In other words:

1. You pull up to the speaker and tell them what you want.
2. You pull up to a window and pay a cashier.
3. You drive around and pick up your order.
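The analogy can be put into numbers with a short sketch. The function names and all of the timings below (a three-minute order at the counter, three one-minute drive-through steps, five counters) are made up for illustration; the text gives no figures.

```python
import math

# Toy timing model for the fast food analogy. Every number is hypothetical.

def walk_in_minutes(customers, counters=5, minutes_per_order=3):
    """Several short lines; each server handles a whole order in one long step."""
    return math.ceil(customers / counters) * minutes_per_order


def drive_through_minutes(customers, stages=3, minutes_per_stage=1):
    """One long line; each order passes through several short steps.

    Once the line is full, one order finishes every minutes_per_stage.
    """
    if customers == 0:
        return 0
    return (stages + customers - 1) * minutes_per_stage


if __name__ == "__main__":
    customers = 30
    print("one counter:  ", walk_in_minutes(customers, counters=1), "minutes")
    print("five counters:", walk_in_minutes(customers), "minutes")
    print("drive-through:", drive_through_minutes(customers), "minutes")
```

Under these assumptions, a single counter needs 90 minutes to serve 30 customers, while the drive-through needs 32, because a finished order leaves the line every minute once all three steps are busy. Opening five counters side by side brings the walk-in time down to 18 minutes by a different route: width rather than speed.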


[Figure 7-3 diagram, with labels: Code Stream, Front End, Back End]

Figure 7-3: The Pentium 4’s approach to performance

Because the drive-through approach splits the ordering process up into multiple, shorter stages, more customers can be waited on in a single line