Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture

By Jon Stokes


- mulpd: packed (vector), double-precision multiplication


Core’s designers were able to achieve this dramatic speedup in scalar and vector floating-point performance by widening the floating-point datapaths from 80 bits to 128 bits, as described below.

Vector Processing Improvements

One of Core’s most significant improvements over its x86 predecessors is in the area of vector processing, or SIMD. Not only does Core feature the expanded vector execution resources described earlier, but the rate at which it executes the SSE family of 128-bit vector instructions has been doubled from one instruction every two cycles to one instruction per cycle. To understand how Core’s designers achieved this throughput improvement, it’s necessary to take a look at the limitations of Intel’s previous implementations of the SSE family.
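If you haven’t seen SSE code before, the short C sketch below (an assumed example using the SSE2 intrinsics from emmintrin.h, not code taken from this book) shows the difference behind mnemonics like mulpd: the scalar form multiplies one pair of doubles, while the packed form multiplies two pairs at once, one pair in each 64-bit lane of the 128-bit XMM registers.

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Scalar multiply: compiles to mulsd. Only the low doubles are
       multiplied; the high double of the result is carried over from a. */
    __m128d scalar_multiply(__m128d a, __m128d b)
    {
        return _mm_mul_sd(a, b);
    }

    /* Packed (vector) multiply: compiles to mulpd. Both pairs of doubles
       in the 128-bit registers are multiplied with a single instruction. */
    __m128d packed_multiply(__m128d a, __m128d b)
    {
        return _mm_mul_pd(a, b);
    }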

128-bit Vector Execution on the P6 Through Core Duo

When Intel finally got around to adding 128-bit vector support to the Pentium line with the introduction of streaming SIMD extensions (SSE), the results weren’t quite what programmers and users might have hoped for. SSE arrived on the Pentium III with two disadvantages, both of which have continued to plague every Intel processor prior to Core 2 Duo that implements SSE and its successors (SSE2 and SSE3):

- On the ISA side, SSE’s main drawback is the lack of support for three-operand instructions, a problem that was covered in Chapter 8 and illustrated in the sketch that follows this list.
- On the hardware implementation side, 128-bit SSE operations suffer from a limitation that’s the result of Intel shoehorning 128-bit operations onto the 80-bit internal floating-point datapaths of the P6 and Pentium 4.
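To make the first drawback concrete, here is a brief sketch (an assumed example, not from the book) of what the missing three-operand form costs. The C intrinsic looks three-operand, but SSE’s destructive two-operand encoding forces the compiler to insert a register copy whenever both source values must stay live.

    #include <emmintrin.h>

    /* c = a * b with SSE2. Because mulpd overwrites its first source
       operand, a compiler that needs to keep a alive must emit roughly:
           movapd xmm2, xmm0    ; copy a into a scratch register
           mulpd  xmm2, xmm1    ; xmm2 = a * b
       A three-operand ISA (AltiVec, or the later AVX vmulpd) can name a
       separate destination register and skip the copy. */
    __m128d multiply_keeping_sources(__m128d a, __m128d b)
    {
        return _mm_mul_pd(a, b);
    }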

The former problem is a permanent part of the SSE family, but the latter is a fixable problem that can be traced to some specific decisions Intel made when adding SSE support to the Pentium III and the Pentium 4.

When Intel originally modified the P6 core to include support for 128-bit vector operations, it had to hang the new SSE execution units off of the existing 80-bit data bus that previous designs had been using to ferry floating-point and MMX operands and results between the execution units and the floating-point/MMX register file.

In order to execute a 128-bit instruction using its 80-bit data bus and vector units, the P6 and its successors must first break down that instruction into a pair of 64-bit micro-ops that can be executed on successive cycles. To see how this works, take a look at Figure 12-15, which shows (in a very abstract way) what happens when the P6 decodes and executes a 128-bit SSE instruction. The decoder first splits the instruction into two 64-bit micro-ops: one for the upper 64 bits of the vector and another for the lower 64 bits. Then this pair of micro-ops is passed to the appropriate SSE unit for execution.


[Diagram: a 128-bit SSE instruction passes through instruction fetch and the translate x86/decode stage, where it is split into two 64-bit micro-ops that are executed by the vector ALU.]
Figure 12-15: How the P6 executes a 128-bit vector operation

The result of this hack is that all 128-bit vector operations take a minimum of two cycles to execute on the P6, Pentium 4, Pentium M, and Core Duo: one cycle for the top half and another for the bottom half. Compare this to the single-cycle throughput and latency of simple 128-bit AltiVec operations on the PowerPC G4e described in Chapters 7 and 8.
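As a purely conceptual sketch (this is not how the hardware is programmed, just an assumed illustration of the split), the two-micro-op scheme amounts to computing a 128-bit packed multiply as two independent 64-bit halves:

    /* A 128-bit vector of two doubles, modeled as explicit halves. */
    typedef struct {
        double lo;   /* lower 64 bits */
        double hi;   /* upper 64 bits */
    } vec128d;

    /* On the P6 through Core Duo, one mulpd executes roughly like this:
       the lower half on one cycle, the upper half on the next. */
    vec128d mulpd_as_two_halves(vec128d a, vec128d b)
    {
        vec128d r;
        r.lo = a.lo * b.lo;   /* micro-op 1: cycle n     */
        r.hi = a.hi * b.hi;   /* micro-op 2: cycle n + 1 */
        return r;
    }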

128-bit Vector Execution on Core

The Core microarchitecture that powers the Core 2 Duo is the first to give x86 programmers a single-cycle latency for 128-bit vector operations. Intel achieved this reduced latency by making the floating-point and vector internal data buses 128 bits wide. Core’s 128-bit floating-point/vector datapaths mean only a single micro-op needs to be generated, dispatched, scheduled, and issued for each 128-bit vector operation. Not only does the new design eliminate the latency disadvantage that has plagued SSE operations so far, but it also improves decode, dispatch, and scheduling bandwidth, because half as many micro-ops are generated for 128-bit vector instructions.

Figure 12-16 shows how Core’s 128-bit vector execution hardware decodes and executes an SSE instruction using a single micro-op.


[Diagram: a 128-bit SSE instruction passes through instruction fetch and the translate x86/decode stage, producing a single 128-bit micro-op that is executed by the vector ALU.]
Figure 12-16: How Core executes a 128-bit vector operation

As you can see, the vector ALU’s data ports, both input and output, have been enlarged in order to accommodate 128 bits of data at a time.

When you combine these critical improvements with Core’s increased amount of vector execution hardware and its expanded decode, dispatch, issue, and commit bandwidth, you get a very capable vector processing machine. (Of course, SSE’s two-operand limitation still applies, but there’s no helping that.) Core can, for example, execute a 128-bit packed multiply, a 128-bit packed add, a 128-bit packed load, a 128-bit packed store, and a macro-fused cmpjcc (a compare plus a jump on a condition code) all in the same cycle. That’s essentially six instructions in one cycle, quite a boost over any previous Intel processor.
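The loop below is a sketch of the kind of code that benefits (an assumed example with a hypothetical function name, written with SSE2 intrinsics and assuming 16-byte-aligned arrays whose length is a multiple of two). Each iteration contains roughly the mix just described: packed loads, a packed multiply, a packed add, a packed store, and the compare-and-branch pair at the bottom of the loop that Core can macro-fuse.

    #include <emmintrin.h>
    #include <stddef.h>

    /* dst[i] += a[i] * b[i] for i in [0, n), two doubles per iteration.
       Assumes n is a multiple of 2 and all pointers are 16-byte aligned. */
    void multiply_accumulate(double *dst, const double *a,
                             const double *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 2) {       /* cmp + jcc: macro-fusable */
            __m128d va = _mm_load_pd(&a[i]);      /* 128-bit packed load      */
            __m128d vb = _mm_load_pd(&b[i]);      /* 128-bit packed load      */
            __m128d vd = _mm_load_pd(&dst[i]);    /* 128-bit packed load      */
            vd = _mm_add_pd(vd, _mm_mul_pd(va, vb)); /* mulpd + addpd         */
            _mm_store_pd(&dst[i], vd);            /* 128-bit packed store     */
        }
    }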

Memory Disambiguation: The Results Stream Version of Speculative Execution

As I explained previously, micro-ops fusion and the added simple/fast decoder give Core’s front end the ability to decode many more memory instructions per cycle than its predecessors. Much of the benefit of this expanded capability would be lost, however, had Intel not also found a way to greatly increase the number of memory instructions per cycle that Core can execute.


In order to increase the number of loads and stores that can be executed on each cycle, Core uses a technique on its memory instruction stream that’s somewhat like speculative execution. But before you can understand how this technique works, you must first understand exactly how the other x86 processors covered in this book execute load and store instructions.

The Lifecycle of a Memory Access Instruction

In Chapter 3, you learned about the four primary phases that every instruction goes through in order to be executed:

1. Fetch
2. Decode
3. Execute
4. Write

You also learned that these four phases are broken into smaller steps, and you’ve even seen a few examples of how these smaller steps are arranged into discrete pipeline stages. However, all of the pipeline examples that you’ve seen so far have been for arithmetic instructions, so now it’s time to take a look at the steps involved in executing memory access instructions.

Table 12-5 compares the specific steps needed to execute an add instruction with the steps needed to execute a load and a store. As you study the table, pay close attention to the stages in which instructions wait in a buffer (the issue stage, the wait in the memory reorder buffer during execution, and the complete stage); these are the points where instructions are buffered and possibly reordered.

As you can see from Table 12-5, the execution phase of a memory access instruction’s lifecycle is more complicated than that of a simple arithmetic instruction. Not only do load and store instructions need to access the register file and perform arithmetic operations (address calculations), but they must also access the data cache. The fact that the L1 data cache is farther away from the execution units than the register file, and the fact that a load instruction could miss in the L1 data cache and incur a lengthy L2 cache access delay, mean that special arrangements must be made in order to keep memory instructions from stalling the entire pipeline.

The Memory Reorder Buffer

The P6 and its successors feed memory access instructions into a special queue where they’re buffered as they wait for the results of their address calculations and for the data cache to become available. This queue, called the memory reorder buffer (MOB), is arranged as a FIFO queue, but under certain conditions an instruction that entered the queue later can bypass an earlier instruction that’s stalled. Thus memory instructions can access the data cache out of program order with respect to one another, a situation that improves performance but brings the need for an additional mechanism to prevent problems associated with memory aliasing.
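The following C fragment (an assumed example, not from the book) shows the aliasing hazard in its simplest form: if the two pointers happen to refer to the same location, the load must observe the value the store just wrote, so the hardware cannot let the load slip past the store in the MOB without checking their addresses.

    /* If q == p, the load through q aliases the store through p and must
       return 3.0; blindly reordering the cache accesses would return stale
       data. Detecting (or safely speculating about) this case is what
       memory disambiguation is about. */
    double aliasing_example(double *p, double *q)
    {
        *p = 3.0;          /* store: store-address + store-data micro-ops */
        return *q + 1.0;   /* load: a single load micro-op                */
    }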


Table 12-5: A comparison of the stages of execution of an arithmetic instruction and two memory access instructions

Fetch
  add:    Fetch the add from the instruction cache.
  load:   Fetch the load from the instruction cache.
  store:  Fetch the store from the instruction cache.

Decode
  add:    Decode the add.
  load:   Decode the load into a load-address micro-op.
  store:  Decode the store into store-address and store-data micro-ops.

Issue
  add:    Wait in the RS to execute the add out of order.
  load:   Wait in the RS to execute the load-address out of order.
  store:  Wait in the RS to execute the store-address and store-data out of order.

Execute
  add:    Read the operands from the register file. Add the two operands using the ALU.
  load:   Read an address and possibly an index value from the register file.(1) Calculate the source address using the address generation units (AGUs). Wait in the memory reorder buffer (MOB) for an opportunity to access the data cache. Read the data from the data cache, using the address calculated by the load-address micro-op.
  store:  Store-address micro-op: read an address and possibly an index value from the register file,(1) then calculate the destination address using the address generation units (AGUs). Store-data micro-op (in parallel): read the data to be stored from the register file. Wait in the memory reorder buffer (MOB) for an opportunity to access the data cache, then write the data to the data cache, using the address calculated by the store-address micro-op.

Complete
  add:    Wait in the ROB to commit the add in order.
  load:   Wait in the ROB to commit the load-address in order.
  store:  Wait in the ROB to commit the store-address and store-data in order.

Commit
  add:    Write the result into the register file, set any flags, and remove the instruction from the ROB.
  load:   Write the loaded data into the register file, set any flags, and remove the instruction from the ROB.
  store:  Set any flags, and remove the instruction from the ROB.

(1) Depending on the type of address (i.e., register relative or immediate), this register could be an architectural register or a rename register that’s allocated to hold the immediate value.

Memory Aliasing

I explained in Chapter 5 why out-of-order processors must first put instructions back in program order before officially writing their results to the programmer-visible register file: You can’t modify an architectural register until you’re sure that all of the previous instructions that read that location have completed execution; to do otherwise would destroy the integrity of the sequential programming model.
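Here is a minimal sketch (an assumed example) of the kind of violation that in-order commit prevents, written as straight-line C for clarity:

    /* The second statement reads x, and the third overwrites it. If the
       write to x were allowed to update the architectural register before
       the older read had executed, a would see 10 instead of 1 and the
       function would return 22 rather than the sequentially correct 13. */
    int commit_order_example(void)
    {
        int x = 1;
        int a = x + 2;   /* older instruction: reads x, expects 1 */
        x = 10;          /* younger instruction: overwrites x     */
        return a + x;    /* sequential result: 3 + 10 = 13        */
    }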


LOADS AND STORES: TO SPLIT, OR NOT TO SPLIT?

You may be wondering why a load is decoded into a single load-address micro-op, while a store is decoded into store-address and store-data micro-ops. Doesn’t the load also have to access the cache, just like the store does with its store-data micro-op? The load instruction does indeed access the cache, but because the load-address and load-data operations are inherently serial, there’s no point in separating them into two distinct micro-ops and assigning them to two separate execution pipelines.

The two parts of the store operation, in contrast, are inherently parallel. Because the computer can begin calculating a store’s destination address at the same time it is retrieving the store’s data from the register file, both of these operations can be performed simultaneously by two different micro-ops and two separate execution units. Thus the P6 core design and its successors feature a single execution unit (the load unit) and a single micro-op for load instructions, and two execution units (the store-address unit and the store-data unit) and their two corresponding micro-ops for store instructions.
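A small C sketch (an assumed example) of the point the sidebar is making: in the store on the second line of the function body, the address calculation (&dst[i]) and the data computation (x * 2.0) are independent and can proceed in parallel as store-address and store-data micro-ops, while in the load on the first line the address must be known before the cache can be read, so a single load micro-op covers both steps.

    /* Load: address generation, then (serially) the data cache read.
       Store: address generation and data computation can proceed in parallel. */
    void load_store_example(double *dst, const double *src, long i, double x)
    {
        double t = src[i];    /* load:  one load-address micro-op      */
        dst[i]   = x * 2.0;   /* store: store-address + store-data ops */
        dst[i + 1] = t;
    }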
