Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (96 page)

Authors: Jon Stokes


phase as a whole can translate one macro-fused x86 instruction into a macro-fused micro-op on each cycle. (No more than one such macro-fused micro-op can be generated per cycle.)

All told, macro-fusion allows the predecode phase to send to the decode phase a maximum of either

• four normal x86 instructions per cycle, or

• three normal x86 instructions plus one macro-fused instruction, for a total of five x86 instructions per cycle.

Moving five instructions per cycle into the decode phase is a huge improvement over the throughput of three instructions per cycle in previous designs. By enabling the front end to combine two x86 instructions per cycle into a single micro-op, macro-fusion effectively enlarges Core’s decode, dispatch, and retire bandwidth, all without the need for extra ROB and RS entries. Ultimately, less book-keeping hardware means better power efficiency per x86 instruction for the processor as a whole, which is why it’s important for Core to approach the goal of one micro-op per x86 instruction as closely as possible.
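If it helps to see the predecode arithmetic spelled out, here is a minimal Python sketch of the two limits just described. The constant and function names are hypothetical, and the "four items per cycle" framing is only a modeling assumption that reproduces the four-or-five figures in the text; it is not a description of Intel's actual hardware interface.

```python
# Illustrative model only; all names here are hypothetical.
MAX_ITEMS_PER_CYCLE = 4   # assumption: predecode hands decode at most four items per cycle
MAX_FUSED_PER_CYCLE = 1   # from the text: at most one macro-fused pair per cycle


def x86_instructions_delivered(normal: int, fused_pairs: int) -> int:
    """Count the x86 instructions that reach the decode phase in one cycle.

    A macro-fused pair occupies one item but stands for two x86 instructions.
    """
    if normal < 0 or fused_pairs < 0:
        raise ValueError("counts must be non-negative")
    if fused_pairs > MAX_FUSED_PER_CYCLE:
        raise ValueError("only one macro-fused pair is allowed per cycle")
    if normal + fused_pairs > MAX_ITEMS_PER_CYCLE:
        raise ValueError("predecode can hand over at most four items per cycle")
    return normal + 2 * fused_pairs


print(x86_instructions_delivered(4, 0))  # four normal instructions -> 4
print(x86_instructions_delivered(3, 1))  # three normal plus one fused pair -> 5
```

The second call is the best case: the fused pair costs one item but counts as two x86 instructions, which is where the total of five comes from.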

The Decode Phase

Core’s widened back end can grind through micro-ops at an unprecedented

rate, so Intel needed to dramatically increase the new microarchitecture’s

decode rate compared with previous designs so that more micro-ops per

cycle could reach the back end. Core’s designers did a few things to achieve

this goal.

I’ve already talked about one innovation that Core uses to increase its decode rate: macro-fusion. This new capability has the effect of giving Core an extra decoder for “free,” but remember that this free decoder that macro-fusion affords is only good for certain instruction types. Also, the decode phase as a whole can translate only one macro-fused x86 instruction into a macro-fused micro-op on each cycle. (No more than one such macro-fused micro-op can be generated per cycle.)

Intel’s Pentium M, Core Duo, and Core 2 Duo

Intel also expanded the decode phase’s total throughput by adding a brand new simple/fast decoding unit, bringing Core’s number of simple/fast decoders up to three. The three simple/fast decoders combine with the complex/slow decoder to enable Core’s decoding hardware to send up to seven micro-ops per cycle into the micro-op queue, from which up to four micro-ops per cycle can pass into the ROB. The newly expanded decoding unit was depicted in Figure 12-11.
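The gap between seven micro-ops in and four micro-ops out is easier to appreciate with a toy model. The Python sketch below is only an illustration: the per-decoder split (one micro-op per simple/fast decoder, up to four from the complex/slow decoder) is an assumption chosen to match the seven-per-cycle total, the queue is treated as unbounded, and every name in it is hypothetical.

```python
# Toy model of the decoders feeding the micro-op queue; names are hypothetical.
SIMPLE_DECODERS = 3        # from the text
UOPS_PER_SIMPLE = 1        # assumption: one micro-op per simple/fast decoder
UOPS_PER_COMPLEX = 4       # assumption: up to four from the complex/slow decoder
ROB_INTAKE_PER_CYCLE = 4   # from the text: up to four micro-ops pass into the ROB


def peak_decode_uops() -> int:
    """Maximum micro-ops the decoders can emit in one cycle."""
    return SIMPLE_DECODERS * UOPS_PER_SIMPLE + UOPS_PER_COMPLEX


def queue_depth_after(cycles: int, decoded_per_cycle: int) -> int:
    """Micro-ops left in an (unbounded) queue after the given number of cycles."""
    depth = 0
    for _ in range(cycles):
        depth += min(decoded_per_cycle, peak_decode_uops())  # decoders fill the queue
        depth -= min(depth, ROB_INTAKE_PER_CYCLE)            # the ROB drains it
    return depth


print(peak_decode_uops())        # 7
print(queue_depth_after(3, 7))   # 9: the queue grows by three micro-ops per cycle
print(queue_depth_after(3, 4))   # 0: the ROB keeps up
```

In this model the queue acts as a buffer: bursts of peak decoding pile up micro-ops that the ROB can absorb during cycles when the decoders deliver fewer than four.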

Finally, Intel has increased Core’s decode rate by making a change to the back end (described later) that now permits 128-bit SSE instructions to be decoded into a single micro-op instead of a fused micro-op pair, as in previous designs. Thus Core’s new front end design brings the processor much closer to the goal of one micro-op per x86 instruction.

Core’s Pipeline

Core’s 14-stage pipeline is two stages longer than the original 12-stage P6

pipeline. Both of Core’s new stages were added in the processor’s front end.

The first new stage was added in the fetch/predecode phase to accommodate

the instruction queue and macro-fusion, and the second stage was added to

help out with 64-bit address translation.

Intel has not yet made available a detailed breakdown of Core’s pipeline

stages, so the precise locations of the two new stages are still unknown.

Core’s Back End

One of the most distinctive features of the older P6 design is its back end’s

issue port structure, described in Chapter 5. Core uses a similar structure in

its back end, although there are some major differences between the issue

port and reservation station (RS) combination of Core and that of the P6.

To get a sense of the historical development of the issue port scheme,

let’s take a look at the back end of the original Pentium Pro.

As you can see from Figure 12-12, ports 0 and 1 host the arithmetic hardware, while ports 2, 3, and 4 host the memory access hardware. The P6 core’s reservation station is capable of issuing up to five instructions per cycle to the execution units: one instruction per issue port per cycle.

As the P6 core developed through the Pentium II and Pentium III, Intel

began adding execution units to handle integer and floating-point vector

arithmetic. This new vector execution hardware was added on ports 0 and 1,

with the result that by the time the PIII was introduced, the P6 back end

looked like Figure 12-13.

The PIII’s core is fairly wide, but the distribution of arithmetic execution

resources between only two of the five issue ports means that its performance

can sometimes be bottlenecked by a lack of issue bandwidth (among other

things). All of the code stream’s vector and scalar arithmetic instructions are

contending with each other for two ports, a fact that, when combined with

the two-cycle SSE limitation that I’ll outline in a moment, means the PIII’s

vector performance could never really reach the heights of a cleaner design

like Core.
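To make the issue-bandwidth bottleneck concrete, here is a small Python sketch of the one-instruction-per-port-per-cycle rule. It is a rough lower bound under stated assumptions: it ignores data dependencies, execution latencies, and the two-cycle SSE limitation, and the function name and workload numbers are made up for illustration.

```python
def min_issue_cycles(ops_per_port: dict[int, int]) -> int:
    """Lower bound on cycles needed to issue a batch of operations.

    Each port issues at most one operation per cycle, so the busiest port
    sets the floor no matter how idle the other ports are.
    """
    return max(ops_per_port.values(), default=0)


# Hypothetical workload: 12 arithmetic operations and 6 memory operations.
# With arithmetic confined to ports 0 and 1, those two ports are the bottleneck.
two_arithmetic_ports = {0: 6, 1: 6, 2: 2, 3: 2, 4: 2}
print(min_issue_cycles(two_arithmetic_ports))    # 6

# The same workload on a design with a third arithmetic port.
three_arithmetic_ports = {0: 4, 1: 4, 5: 4, 2: 2, 3: 2, 4: 2}
print(min_issue_cycles(three_arithmetic_ports))  # 4
```

Notice that in the first case the memory ports sit idle for most of the six cycles; the total port count matters less than how the arithmetic work is spread across the ports.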

Chapter 12

Figure 12-12: The Pentium Pro’s back end (diagram labels: Reservation Station (RS); ports 0 through 4; Scalar ALUs: Integer Unit (CIU, SIU), Floating-Point Unit (FPU), Branch Unit (BU); Memory Access Units: Load-Store Unit (Load Addr., Store Addr., Store Data))

Almost nothing is known about the back ends of the Pentium M and Core Duo processors because Intel has declined to release that information. Both are rumored to be quite similar in organization to the back end of the Pentium III, but that rumor cannot be confirmed based on publicly available information.

Figure 12-13: The Pentium III’s back end (diagram labels: Reservation Station (RS); ports 0 through 4; Vector ALUs: MMX/SSE Unit (MMX 0, MMX 1, VFADD, VSHUFF, VRECIP), FP/SSE Unit (FPU & VFMUL); Scalar ALUs: Integer Unit (CIU, SIU), Branch Unit (BU); Memory Access Units: Load-Store Unit (Load Addr., Store Addr., Store Data))

For Core, Intel’s architects added a new issue port for handling

arithmetic operations. They also changed the distribution of labor on

issue ports 1 and 2 to provide more balance and accommodate more

execution hardware. The final result is the much wider back end that is

shown in Figure 12-14.

Each of Core’s three arithmetic issue ports (0, 1, and 5) now contains a scalar integer ALU, a vector integer ALU, and hardware to perform floating-point, vector move, and logic operations (the F/VMOV label in Figure 12-14). Let’s take a brief look at Core’s integer and floating-point pipelines before moving on to look at the vector hardware in more detail.

Figure 12-14: The back end of the Intel Core microarchitecture (diagram labels: Reservation Station (RS); ports 0, 1, 2, 3, 4, and 5; Vector ALUs: MMX/SSE Unit (MMX0, MMX1, MMX5, VSHUF, F/VMOV, VFADD, VFMUL); Scalar ALUs: Floating-Point Unit (FADD, FMUL), Integer Unit (CIU1, CIU2, SIU), Branch Unit (BU); Memory Access Units: Load-Store Unit (Load Addr., Store Addr., Store Data))

Integer Units

Core’s back end features three scalar 64-bit integer units: one complex

integer unit that’s capable of handling 64-bit multiplication (port 0); one

complex integer unit that’s capable of handling shift instructions, rotate

instructions, and 32-bit multiplication (port 1); and one simple integer

unit (port 5).

The new processor’s back end also has three vector integer units that

handle MMX instructions, one each on ports 0, 1, and 5. This abundance of

scalar and vector integer hardware means that Core can issue three vector

or scalar integer operations per cycle.
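One way to keep the integer port assignments straight is to write them down as a lookup table. The Python sketch below does that. The operation names are hypothetical labels, and the idea that the two complex integer units also accept simple operations is an assumption (it is consistent with the three-operations-per-cycle figure above, but the text does not spell it out).

```python
# Scalar integer capabilities per issue port, as described in the text.
# Assumption: the complex integer units can also execute simple operations.
INTEGER_PORTS = {
    0: {"simple", "mul64"},                    # complex unit: 64-bit multiplication
    1: {"simple", "shift", "rotate", "mul32"}, # complex unit: shifts, rotates, 32-bit multiplication
    5: {"simple"},                             # simple integer unit
}


def ports_for(operation: str) -> list[int]:
    """Return the issue ports whose integer unit can execute the operation."""
    return sorted(port for port, ops in INTEGER_PORTS.items() if operation in ops)


print(ports_for("simple"))  # [0, 1, 5]
print(ports_for("mul64"))   # [0]
print(ports_for("shift"))   # [1]
```

Simple operations such as additions can go to any of the three ports, while the specialized operations each have only one home in this model.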

Floating-Point Units

In the P6-derived processors leading up to Core, there was a mix of floating-point hardware of different types on the issue ports. Specifically, the Pentium III added vector floating-point multiplication to its back end by modifying the existing FPU on port 0 to support this function. Vector floating-point addition was added as a separate VFADD (or PFADD, for packed floating-point addition) unit on port 1. Thus, the floating-point arithmetic capabilities were unevenly divided among the Pentium III’s two issue ports as follows:

Port 0

• Scalar addition (x87 and SSE family)

• Scalar multiplication (x87 and SSE family)

• Vector multiplication

Port 1

• Vector addition

Core cleans up this arrangement, which the Pentium M and Core Duo probably also inherited from the Pentium III, by consolidating all floating-point multiplication functions (both scalar and vector) into a single VFMUL unit on port 0; similarly, all vector and scalar floating-point addition functions are brought together in a single VFADD unit on port 1.

Core’s distribution of floating-point labor therefore looks as follows:

Port 0

• Scalar multiplication (single- and double-precision, x87 and SSE family)

• Vector multiplication (four single-precision or two double-precision)

Port 1

• Scalar addition (single- and double-precision, x87 and SSE family)

• Vector addition (four single-precision or two double-precision)
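The port assignments in this list imply a peak floating-point rate, which the following Python sketch works out. It assumes the best case in which the multiplication unit on port 0 and the addition unit on port 1 each start one full-width vector operation on every cycle; the names are hypothetical.

```python
# Elements per vector operation, taken from the list above.
VECTOR_LANES = {"single": 4, "double": 2}

# One vector floating-point unit per port.
FP_VECTOR_UNITS = {0: "VFMUL", 1: "VFADD"}


def peak_flops_per_cycle(precision: str) -> int:
    """Peak floating-point operations per cycle with both units fully busy."""
    return VECTOR_LANES[precision] * len(FP_VECTOR_UNITS)


print(peak_flops_per_cycle("single"))  # 8
print(peak_flops_per_cycle("double"))  # 4
```

Reaching that peak requires a code stream with an even mix of independent vector multiplications and additions, since each kind of operation has only one port available to it.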

The Core 2 Duo is the first x86 processor from Intel to support double-precision floating-point operations with a single-cycle throughput. Thus, Core’s floating-point unit can complete up to four double-precision or eight single-precision floating-point operations on every cycle. To see just how much of an improvement Core’s floating-point hardware offers over its predecessors, take a look at Table 12-4, which compares the throughputs (in cycles per instruction, so lower is better) of scalar and vector floating-point instructions on four generations of Intel hardware.

Table 12-4: Throughput numbers (cycles/instruction) for vector and scalar floating-point instructions on five different Intel processors

Instruction   Pentium III   Pentium 4   Pentium M/Core Duo   Core 2 Duo
fadd [1]      1             1           1                    1
fmul [1]      2             2           2                    2
addss         1             2           1                    1
addsd         -             2           1                    1
addps         2             2           2                    1
addpd         -             2           2                    1
mulss         1             2           1                    1
mulsd         -             2           2                    1
mulps         2             2           2                    1
mulpd         -             2           4                    1

[1] x87 instruction

In Table 12-4, the rows with green shaded backgrounds denote vector operations, while those with blue shaded backgrounds denote scalar operations. The throughput numbers for all double-precision operations are bold.

With the exception of the fadd and fmul instructions, all of the instructions listed belong to the SSE family. Here are a few SSE instructions interpreted for you, so that you can figure out which operations the instructions perform:

• addss: scalar, single-precision addition

• addsd: scalar, double-precision addition

• mulps: packed (vector), single-precision multiplication
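The naming pattern behind these three examples extends to the rest of the table's SSE rows: an operation name, then s or p for scalar or packed, then s or d for single- or double-precision. Here is a minimal Python sketch of that pattern. It handles only the add and mul mnemonics from Table 12-4, and the function name is hypothetical.

```python
OPERATIONS = {"add": "addition", "mul": "multiplication"}
SHAPES = {"s": "scalar", "p": "packed (vector)"}
PRECISIONS = {"s": "single-precision", "d": "double-precision"}


def describe_sse(mnemonic: str) -> str:
    """Spell out a five-letter mnemonic such as 'addss' or 'mulpd'."""
    if len(mnemonic) != 5:
        raise ValueError(f"unsupported mnemonic: {mnemonic!r}")
    operation, shape, precision = mnemonic[:3], mnemonic[3], mnemonic[4]
    try:
        return f"{SHAPES[shape]}, {PRECISIONS[precision]} {OPERATIONS[operation]}"
    except KeyError:
        raise ValueError(f"unsupported mnemonic: {mnemonic!r}") from None


for name in ("addss", "addsd", "mulps", "mulpd"):
    print(f"{name}: {describe_sse(name)}")
```

Running the loop reproduces the three interpretations given above and adds mulpd, which comes out as packed (vector), double-precision multiplication.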
