Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
one processor: the microarchitecture’s code name, the microarchitecture’s
brand name, and the processor’s brand name.
The processors described in this chapter are commonly known by the
code names that Intel assigned to their microarchitectures prior to their
launch. The first version of the Pentium M was called
Banias
, and a later revision was called
Dothan
; the code name for Core Duo was
Yonah
. The code name situation for Core 2 Duo is a bit complicated, and will be unraveled in
the appropriate section. These code names are still used periodically by Intel
and others, so you’ll often need to be familiar with them in order to know
which design is being discussed in a particular article or white paper.
Figure 12-1 should help you understand and keep track of the different
names and codenames used throughout this chapter. Note that the related
information for the Pentium 4 is included for reference.
Microarchitecture Code Name    Microarchitecture Brand Name               Processor Brand Name
Willamette, Prescott, etc.     Netburst Microarchitecture                 Pentium 4
Banias, Dothan                 Intel Original Mobile Microarchitecture    Pentium M
Yonah                          Intel Original Mobile Microarchitecture    Core Duo/Solo
Merom/Conroe/Woodcrest         Intel Core Microarchitecture               Core 2 Duo/Solo

Figure 12-1: Code names and official brand names for some recent Intel architectures and implementations
236
Chapter 12
To forestall any potential confusion, I have avoided the use of both the
brand names and the code names for the microarchitectures under discussion.
Instead, I typically employ the official brand name that Intel has given the desktop processor (as opposed to the mobile or server processor) that implements a particular microarchitecture.
The Rise of Power-Efficient Computing
Although the so-called “mobile revolution” had clearly arrived by the time
Intel introduced the Pentium M in 2003, the previous years’ rapid growth in
portable computer sales wasn’t the only reason Intel and other processor
manufacturers had begun to pay serious attention to the power dissipation of
their chips. As transistor sizes steadily shrank and designers became able to
cram more power-hungry circuits into each square millimeter of a chip’s
surface area, a new barrier to processor performance loomed on the near
horizon: the power wall.
Power wall is a term used by Intel to describe the point at which its chips’ power density (the number of watts dissipated per unit area) began to seriously limit further integration and clockspeed scaling. The general idea behind the power wall is straightforward. Though the explanation here leaves out a number of factors, like the effects of per-device capacitance and supply voltage, it should nonetheless give you enough of a handle on the phenomenon that you can understand some of the major design decisions behind the processors covered in this chapter.
Power Density
The amount of power that a chip dissipates per unit area is called its power
density, and there are two types of power density that concern processor
architects: dynamic power density and static power density.
Dynamic Power Density
Each transistor on a chip dissipates a small amount of power when it is
switched, and transistors that are switched rapidly dissipate more power than
transistors that are switched slowly. The total amount of power dissipated per
unit area due to switching of a chip’s transistors is called
dynamic power density
.
There are two factors that work together to cause an increase in dynamic
power density: clockspeed and transistor density.
Increasing a processor’s clockspeed involves switching its transistors
more rapidly, and as I just mentioned, transistors that are switched more
rapidly dissipate more power. Therefore, as a processor’s clockspeed rises, so
does its dynamic power density, because each of those rapidly switching tran-
sistors contributes more to the device’s total power dissipation. You can also
increase a chip’s dynamic power density by cramming more transistors into
the same amount of surface area.
Intel’s Pentium M, Core Duo, and Core 2 Duo
237
Figure 12-2 illustrates how transistor density and clockspeed work
together to increase dynamic power density. As the clockspeed of the device
and the number of transistors per unit area rise, so does the overall dynamic
power density.
Figure 12-2: Dynamic power density. The figure plots power density rising along both the clockspeed axis and the transistor density axis.
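The relationship pictured in Figure 12-2 follows from the standard first-order dynamic power equation, P = aCV²f per transistor. The sketch below illustrates the scaling behavior only; the activity factor, capacitance, and supply voltage used here are illustrative placeholders, not real process parameters.

```python
def dynamic_power_density(transistors_per_mm2, clock_hz,
                          activity=0.1, cap_farads=1e-15, vdd=1.2):
    """Watts per square millimeter dissipated by switching transistors,
    using the first-order model P = activity * C * V^2 * f per device.
    The default constants are illustrative placeholders only."""
    return transistors_per_mm2 * activity * cap_farads * vdd ** 2 * clock_hz

# Doubling either the clockspeed or the transistor density doubles the
# dynamic power density; raising both raises it multiplicatively.
base = dynamic_power_density(1e6, 2e9)
assert abs(dynamic_power_density(2e6, 2e9) - 2 * base) < 1e-9
assert abs(dynamic_power_density(1e6, 8e9) - 4 * base) < 1e-9
```

This is why the two axes of Figure 12-2 compound: each multiplies the same per-transistor switching term.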
Static Power Density
In addition to clockspeed-related increases in dynamic power density, chip
designers must also contend with the fact that even transistors that aren’t
switching will still leak current during idle periods, much like how a faucet
that is shut off can still leak water if the water pressure behind it is high
enough. This leakage current causes an idle transistor to constantly dissipate a trace amount of power. The amount of power dissipated per unit area due to leakage current is called static power density.
Transistors leak more current as they get smaller, and consequently
static power densities begin to rise across the chip when more transistors
are crammed into the same amount of space. Thus even relatively low clockspeed devices with very small transistor sizes are still subject to increases in power density if leakage current is not controlled. If a silicon device’s overall power density gets high enough, it will begin to overheat and will eventually fail entirely. Thus it’s critical that designers of highly integrated devices like modern x86 processors take power efficiency into account when designing a new microarchitecture.
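The leakage term can be sketched in the same first-order fashion. The per-transistor leakage figure below is an illustrative placeholder, not a measured value for any real process; the point is only that static power density scales with transistor density alone, independent of clockspeed.

```python
def static_power_density(transistors_per_mm2, leak_watts_per_transistor=5e-8):
    """Watts per square millimeter from leakage current. The default
    per-transistor leakage is an illustrative placeholder, not a
    measured value for any real process."""
    return transistors_per_mm2 * leak_watts_per_transistor

# Static power is paid even when the clock is slow or stopped: packing
# twice as many (smaller, leakier) transistors into the same area
# doubles the leakage term regardless of clockspeed.
assert abs(static_power_density(2e6) - 2 * static_power_density(1e6)) < 1e-12
```

Total power density is the sum of the dynamic and static terms, which is why leakage matters even for low-clockspeed parts.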
Power density is a major, growing concern for every type of micro-
processor, regardless of the type of computer in which the processor is
intended to be used. The same types of power-aware design decisions
that are important for a mobile processor are now just as critical for a
server processor.
The Pentium M
In order to meet the new challenges posed by the power-efficient computing
paradigm, Intel’s Israel-based design team drew on the older, time-tested P6
microarchitecture as the basis for its new low-power design, the Pentium M.
The Pentium M takes the overall pipeline organization and layout of the P6
in its Pentium III incarnation and builds on it substantially with a number of
innovations, allowing it to greatly exceed its predecessor in both power
efficiency and raw performance (see Table 12-1).
Table 12-1: Features of the Pentium M

Introduction Date              March 12, 2003
Process                        0.13 micron
Transistor Count               77 million
Clock Speed at Introduction    1.3 to 1.6 GHz
L1 Cache Size                  32KB instruction, 32KB data
L2 Cache Size (on-die)         1MB
Most of the Pentium M’s new features are in its front end, specifically in
its fetch, decode, and branch-prediction hardware.
The Fetch Phase
As I explained in Chapter 5, the original P6 processor fetches one 16-byte
instruction packet per cycle from the I-cache into a buffer that’s two instruc-
tion packets (or 32 bytes) deep. (This fetch buffer is roughly analogous to the
PowerPC instruction queue [IQ] described in previous chapters.) From the
fetch buffer, x86 instructions can move at a rate of up to three instructions per cycle into the P6 core’s three decoders. This fetch and decode process
is illustrated in Figure 12-3.
On the Pentium M and its immediate successor, Core Duo, the fetch
buffer has been widened to 64 bytes. Thus the front end’s predecode hard-
ware can hold and examine up to four 16-byte instruction packets at a time.
This deeper buffer, depicted in Figure 12-4, is necessary to keep the newer
design’s much improved decode hardware (described later) from starving.
A second version of the Pentium M, commonly known by its code name, Dothan, modifies this 64-byte fetch buffer to do double duty as a hardware loop buffer. This fetch/loop buffer combination is also used in the Pentium M’s successor, the Core Duo.
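As a rough illustration of why the deeper buffer matters, the toy simulation below feeds 16-byte packets into a fixed-size buffer and drains it at a steady rate, counting the cycles on which the decoders find the buffer empty. This is a sketch, not Intel’s actual predecode logic; the fetch gap pattern and drain rate are assumptions chosen only to show the effect.

```python
def decoder_starved_cycles(buffer_bytes, fetch_pattern, drain=8):
    """Toy front-end model (not Intel's actual predecode logic).
    Each cycle a 16-byte packet arrives when fetch_pattern is True and
    the buffer has room; the decoders then drain up to `drain` bytes.
    Returns how many cycles the decoders found the buffer empty."""
    buffered, starved = 0, 0
    for fetched in fetch_pattern:
        if fetched and buffered + 16 <= buffer_bytes:
            buffered += 16
        if buffered == 0:
            starved += 1
        else:
            buffered -= min(drain, buffered)
    return starved

# A bursty fetch stream (four packets, then a three-cycle gap): the
# 64-byte buffer rides out gaps that repeatedly empty a 32-byte buffer.
pattern = ([True] * 4 + [False] * 3) * 20
assert decoder_starved_cycles(64, pattern) == 0
assert decoder_starved_cycles(32, pattern) == 20
```

In this toy run, the deeper buffer banks enough surplus bytes during fetch bursts to cover the gaps, so the decoders never starve.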
Figure 12-3: The P6 architecture’s fetch and decode hardware. The L1 instruction cache feeds a 2 x 16-byte fetch buffer; x86 instructions pass to two simple decoders and one complex decoder, which is backed by the microcode engine, and the decoded micro-ops flow through a 6-entry micro-op queue into the 40-entry reorder buffer (ROB).
The Hardware Loop Buffer
A hardware loop buffer caches the block of instructions that is located inside a program loop. Because the instructions inside a loop are repeated many times,
storing them in a front-end buffer keeps the processor from having to re-fetch
them on each loop iteration. Thus the loop buffer is a feature that saves power
because it cuts down on the number of accesses to the I-cache and to the
branch prediction unit’s branch target buffer.
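The power-saving idea can be sketched as follows. The buffer organization here (16-byte blocks, simple FIFO replacement) is an assumption for illustration, not Intel’s actual design.

```python
def fetch_with_loop_buffer(trace, buffer_size=64):
    """Toy model (not Intel's design): recently fetched instruction
    blocks stay in a small front-end buffer, and re-fetching a buffered
    block skips the I-cache and branch target buffer entirely.
    Returns (icache_accesses, accesses_saved)."""
    buffered = []                 # addresses of 16-byte blocks in the buffer
    icache, saved = 0, 0
    for addr in trace:
        if addr in buffered:
            saved += 1            # loop body replayed from the buffer
        else:
            icache += 1
            buffered.append(addr)
            if len(buffered) > buffer_size // 16:
                buffered.pop(0)   # FIFO replacement of the oldest block
    return icache, saved

# A tight loop spanning four 16-byte blocks, iterated 100 times: only
# the first pass touches the I-cache; the other 99 run from the buffer.
trace = [0x00, 0x10, 0x20, 0x30] * 100
assert fetch_with_loop_buffer(trace) == (4, 396)
```

Every avoided I-cache and branch target buffer access is dynamic power not spent, which is the loop buffer’s whole purpose.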
The Decode Phase: Micro-ops Fusion
One of the most important ways that Intel’s Pentium M architects were able
to get more performance out of the P6 architecture was by improving on the
predecessor design’s decode rate.
You’ll recall from Chapter 5 that each of the P6’s two simple/fast decoders
can output a maximum of one micro-op per cycle to the micro-op queue,
for a total of two micro-ops per cycle. All instructions that translate into more than one micro-op must use the single complex/slow decoder, which can
output up to four micro-ops per cycle. Thus, the P6 core’s decode hardware
can output a maximum of six micro-ops per cycle to its micro-op queue.
Figure 12-4: The Pentium M’s fetch and decode hardware. The layout mirrors the P6 design in Figure 12-3, but the fetch buffer is 4 x 16 bytes deep, and the decoders now emit fused micro-ops along a third path into the micro-op queue and reorder buffer (ROB).
For certain types of operations, especially memory operations, the P6’s
decoding scheme can cause a serious bottleneck. As I’ll discuss in more
detail later, x86 store instructions and a specific category of load instructions decode into two micro-ops, which means that most x86 memory accesses must use the complex/slow decoder. During bursts of memory instructions, the complex/slow decoder becomes backed up with work while the other two decoders sit idle. At such times, the P6’s decoding hardware decodes only two micro-ops per cycle, a number far short of its peak decode rate of six micro-ops per cycle.
The Pentium M’s redesigned decoding unit contains a new feature called micro-ops fusion that eliminates this bottleneck for memory accesses and enables the processor to increase the number of x86 instructions per cycle that it can convert to micro-ops. The Pentium M’s two simple/fast decoders are able to take certain x86 instructions that normally translate into two micro-ops and translate them into a single fused micro-op. These two decoders can send either one micro-op or one fused micro-op per cycle to the micro-op queue, as depicted in Figure 12-5. Because both simple/fast decoders can now process these formerly two-micro-op memory instructions, the Pentium M’s front end can actually achieve the maximum decode rate of six micro-ops per cycle during long stretches of memory traffic.
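The bottleneck and its fix can be sketched with a toy decode model. This is not cycle-accurate, and the routing rule (the complex decoder takes the next instruction, the simple decoders take following eligible ones) is a simplifying assumption; it only illustrates how letting the simple decoders handle fusible two-micro-op memory instructions restores the three-instruction-per-cycle decode rate.

```python
def decode_cycles(instructions, fusion=False):
    """Toy model of a P6-style decode group (not cycle-accurate).
    Each instruction is represented by its micro-op count. Per cycle,
    the complex decoder takes the next instruction; the two simple
    decoders take following instructions only if they decode to one
    micro-op or, with micro-ops fusion enabled, to one fusible pair.
    Returns the total cycles needed to decode the stream."""
    i, cycles, n = 0, 0, len(instructions)
    while i < n:
        cycles += 1
        i += 1                        # complex decoder takes one instruction
        for _ in range(2):            # the two simple/fast decoders
            if i < n and (instructions[i] == 1 or
                          (fusion and instructions[i] == 2)):
                i += 1
            else:
                break
    return cycles

# A burst of twelve store instructions, each decoding to two micro-ops:
stores = [2] * 12
assert decode_cycles(stores) == 12               # P6: one instruction per cycle
assert decode_cycles(stores, fusion=True) == 4   # Pentium M: three per cycle
```

In the fused case, every decoder accepts the memory instructions, so the decode group runs at full width instead of serializing through the complex decoder.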
Figure 12-5: Micro-ops fusion on the Pentium M. Variable-length x86 instructions feed the two simple decoders and the complex decoder; the simple decoders can emit micro-fused stores and micro-fused load-ops, as well as ordinary micro-ops, into the micro-op queue.
From the micro-op queue, each fused micro-op moves on to the instruc-
tion window, where it is assigned to a single ROB and RS entry and tracked
just like a normal micro-op through the rest of the Pentium M’s pipeline.
Note that the Pentium M’s back end treats the two constituent parts of a fused
micro-op as independent of each other for the purposes of issuing and exe-
cution. Thus the two micro-ops that make up a fused micro-op can issue in
parallel through two different issue ports or serially through the same port,
whichever is appropriate. Once the two micro-ops have completed execution,