Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
•	mulpd: packed (vector), double-precision multiplication
Core’s designers were able to achieve this dramatic speedup in scalar
and vector floating-point performance by widening the floating-point data-
paths from 80 bits to 128 bits, as described below.
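Before getting into the hardware details, here is a minimal sketch of what scalar versus packed double-precision multiplication looks like from the software side, written in C with the SSE2 intrinsics from emmintrin.h. The function names are my own, and the exact instructions emitted depend on the compiler and target; this is only an illustration, not code from the processor documentation.

    #include <emmintrin.h>

    double scalar_mul(double a, double b)
    {
        /* With SSE2 scalar math (the x86-64 default), this typically
         * compiles to a single mulsd: one double, in the low 64 bits
         * of an XMM register. */
        return a * b;
    }

    __m128d packed_mul(__m128d a, __m128d b)
    {
        /* _mm_mul_pd maps to mulpd: both 64-bit lanes of the 128-bit
         * registers are multiplied by one instruction. */
        return _mm_mul_pd(a, b);
    }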
Vector Processing Improvements
One of Core’s most significant improvements over its x86 predecessors is in the area of vector processing, or SIMD. Not only does Core feature the
expanded vector execution resources described earlier, but the rate at which
it executes the SSE family of 128-bit vector instructions has been doubled
from one instruction every two cycles to one instruction per cycle. To under-
stand how Core’s designers achieved this throughput improvement, it’s
necessary to take a look at the limitations of Intel’s previous implementations
of the SSE family.
128-bit Vector Execution on the P6 Through Core Duo
When Intel finally got around to adding 128-bit vector support to the Pentium
line with the introduction of streaming SIMD extensions (SSE), the results
weren’t quite what programmers and users might have hoped for. SSE arrived
on the Pentium III with two disadvantages, both of which have continued to
plague every Intel processor prior to Core 2 Duo that implements SSE and its
successors (SSE2 and SSE3).
•	On the ISA side, SSE’s main drawback is the lack of support for three-operand instructions, a problem that was covered in Chapter 8 and illustrated briefly after this list.
•	On the hardware implementation side, 128-bit SSE operations suffer from a limitation that’s the result of Intel shoehorning 128-bit operations onto the 80-bit internal floating-point datapaths of the P6 and Pentium 4.
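As a quick illustration of the two-operand problem, consider the sketch below (my own example, not one from Chapter 8). Because the source uses both inputs again after the multiply, the two-operand mulpd cannot simply overwrite either of them, so the compiler generally has to spend an extra register-to-register copy (a movaps) to preserve one source. The function name is arbitrary, and the exact code generated depends on the compiler.

    #include <emmintrin.h>

    /* Returns a + b and also writes a * b through product_out. */
    __m128d sum_and_product(__m128d a, __m128d b, __m128d *product_out)
    {
        /* Both a and b are still live after the multiply, so a two-operand
         * mulpd must work on a copy of one of them rather than destroy it. */
        *product_out = _mm_mul_pd(a, b);
        return _mm_add_pd(a, b);
    }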
The former problem is a permanent part of the SSE family, but the latter
is a fixable problem that can be traced to some specific decisions Intel made
when adding SSE support to the Pentium III and the Pentium 4.
When Intel originally modified the P6 core to include support for
128-bit vector operations, it had to hang the new SSE execution units off of
the existing 80-bit data bus that previous designs had been using to ferry
floating-point and MMX operands and results between the execution units
and floating-point/MMX register file.
In order to execute a 128-bit instruction using its 80-bit data bus and
vector units, the P6 and its successors must first break down that instruction
into a pair of 64-bit micro-ops that can be executed on successive cycles. To
see how this works, take a look at Figure 12-15, which shows (in a very abstract way) what happens when the P6 decodes and executes a 128-bit SSE instruction. The decoder first splits the instruction into two 64-bit micro-ops—one
for the upper 64 bits of the vector and another for the lower 64 bits. Then
this pair of micro-ops is passed to the appropriate SSE unit for execution.
Figure 12-15: How the P6 executes a 128-bit vector operation (the decoder translates the 128-bit SSE instruction into two 64-bit micro-ops, which are then executed by the vector ALU)
The result of this hack is that all 128-bit vector operations take a
minimum of two cycles to execute on the P6, Pentium 4, Pentium M, and
Core Duo—one cycle for the top half and another for the bottom half.
Compare this to the single-cycle throughput and latency of simple 128-bit
AltiVec operations on the PowerPC G4e described in Chapters 7 and 8.
128-bit Vector Execution on Core
The Core microarchitecture that powers the Core 2 Duo is the first to give x86 programmers a single-cycle latency for 128-bit vector operations. Intel achieved this reduced latency by making the floating-point and vector internal
data buses 128 bits wide. Core’s 128-bit floating-point/vector datapaths mean
only a single micro-op needs to be generated, dispatched, scheduled, and
issued for each 128-bit vector operation. Not only does the new design elim-
inate the latency disadvantage that has plagued SSE operations so far, but it
also improves decode, dispatch, and scheduling bandwidth because half as
many micro-ops are generated for 128-bit vector instructions.
Figure 12-16 shows how Core’s 128-bit vector execution hardware decodes
and executes an SSE instruction using a single micro-op.
Figure 12-16: How Core executes a 128-bit vector operation (the decoder translates the 128-bit SSE instruction into a single 128-bit micro-op, which is then executed by the vector ALU)
As you can see, the vector ALU’s data ports, both input and output, have
been enlarged in order to accommodate 128 bits of data at a time.
When you combine these critical improvements with Core’s increased
amount of vector execution hardware and its expanded decode, dispatch,
issue, and commit bandwidth, you get a very capable vector processing
machine. (Of course, SSE’s two-operand limitation still applies, but there’s
no helping that.) Core can, for example, execute a 128-bit packed multiply,
128-bit packed add, 128-bit packed load, 128-bit packed store, and a macro-
fused cmpjcc (a compare + a jump on condition code) all in the same cycle.
That’s essentially six instructions in one cycle—quite a boost from any
previous Intel processor.
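To put that instruction mix in context, here is a rough sketch of the kind of loop that exercises it, written in C with SSE2 intrinsics. The function name and constants are arbitrary, the loop assumes n is a multiple of two, and the exact schedule is up to the compiler and the hardware; the point is simply that each iteration contains a 128-bit packed load, a packed multiply, a packed add, a packed store, and a loop-closing compare plus conditional jump that is a candidate for macro-fusion.

    #include <emmintrin.h>
    #include <stddef.h>

    /* Scales each element of src by 2.5, adds 1.0, and writes the result
     * to dst. Uses unaligned loads/stores for simplicity. */
    void scale_and_bias(double *dst, const double *src, size_t n)
    {
        const __m128d scale = _mm_set1_pd(2.5);
        const __m128d bias  = _mm_set1_pd(1.0);

        for (size_t i = 0; i < n; i += 2) {
            __m128d v = _mm_loadu_pd(&src[i]);  /* 128-bit packed load     */
            v = _mm_mul_pd(v, scale);           /* mulpd: packed multiply  */
            v = _mm_add_pd(v, bias);            /* addpd: packed add       */
            _mm_storeu_pd(&dst[i], v);          /* 128-bit packed store    */
        }   /* the loop-closing compare + conditional jump can macro-fuse */
    }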
Memory Disambiguation: The Results Stream Version of Speculative Execution
As I explained previously, micro-ops fusion and the added simple/fast
decoder give Core’s front end the ability to decode many more memory
instructions per cycle than its predecessors. Much of the benefit of this
expanded capability would be lost, however, had Intel not also found a way
to greatly increase the number of memory instructions per cycle that Core
can execute.
In order to increase the number of loads and stores that can be executed
on each cycle, Core uses a technique on its memory instruction stream that’s
somewhat like speculative execution. But before you can understand how
this technique works, you must first understand exactly how the other
x86 processors covered in this book execute load and store instructions.
The Lifecycle of a Memory Access Instruction
In Chapter 3, you learned about the four primary phases that every
instruction goes through in order to be executed:
1. Fetch
2. Decode
3. Execute
4. Write
You also learned that these four phases are broken into smaller steps,
and you’ve even seen a few examples of how these smaller steps are arranged
into discrete pipeline stages. However, all of the pipeline examples that
you’ve seen so far have been for arithmetic instructions, so now it’s time to
take a look at the steps involved in executing memory access instructions.
Table 12-5 compares the specific steps needed to execute an add instruction with the steps needed to execute a load and a store. As you study the table, pay close attention to the stages in which instructions wait in a buffer (the reservation station, the memory reorder buffer, and the reorder buffer); these are the points at which instructions are buffered and possibly reordered.
As you can see from Table 12-5, the execution phase of a memory access
instruction’s lifecycle is more complicated than that of a simple arithmetic
instruction. Not only do load and store instructions need to access the register file and perform arithmetic operations (address calculations), but they must
also access the data cache. The fact that the L1 data cache is farther away from the execution units than the register file and the fact that a load instruction
could miss in the L1 data cache and incur a lengthy L2 cache access delay
mean that special arrangements must be made in order to keep memory
instructions from stalling the entire pipeline.
The Memory Reorder Buffer
The P6 and its successors feed memory access instructions into a special
queue where they’re buffered as they wait for the results of their address
calculations and for the data cache to become available. This queue, called the memory reorder buffer (MOB), is arranged as a FIFO queue, but under certain conditions a memory instruction can bypass an older instruction that's stalled ahead of it in the queue. Thus memory instructions can access the data cache out of program order with respect to one another, a situation that improves performance but brings the need for an additional mechanism to prevent problems associated with memory aliasing.
Table 12-5: A comparison of the stages of execution of an arithmetic instruction and two memory access instructions.

Fetch
  add:   Fetch the add from the instruction cache.
  load:  Fetch the load from the instruction cache.
  store: Fetch the store from the instruction cache.

Decode
  add:   Decode the add.
  load:  Decode the load into a load-address micro-op.
  store: Decode the store into store-address and store-data micro-ops.

Issue
  add:   Wait in the RS to execute the add out of order.
  load:  Wait in the RS to execute the load-address out of order.
  store: Wait in the RS to execute the store-address and store-data out of order.

Execute
  add:   Read the operands from the register file. Add the two operands using the ALU.
  load:  Read an address and possibly an index value from the register file.(1) Calculate the source address using the address generation units (AGUs). Wait in the memory reorder buffer (MOB) for an opportunity to access the data cache. Read the data from the data cache, using the address calculated by the load-address micro-op.
  store: The store-address micro-op reads an address and possibly an index value from the register file(1) and calculates the destination address using the AGUs; the store-data micro-op reads the data to be stored from the register file. The store then waits in the memory reorder buffer (MOB) for an opportunity to access the data cache and writes the data to the data cache, using the address calculated by the store-address micro-op.

Complete
  add:   Wait in the ROB to commit the add in order.
  load:  Wait in the ROB to commit the load-address in order.
  store: Wait in the ROB to commit the store-address and store-data in order.

Commit
  add:   Write the result into the register file, set any flags, and remove the instruction from the ROB.
  load:  Write the loaded data into the register file, set any flags, and remove the instruction from the ROB.
  store: Set any flags, and remove the instruction from the ROB.

(1) Depending on the type of address (i.e., register relative or immediate), this register could be an architectural register or a rename register that's allocated to hold the immediate value.
Memory Aliasing
I explained in Chapter 5 why out-of-order processors must first put instruc-
tions back in program order before officially writing their results to the
programmer-visible register file: You can’t modify an architectural register
until you’re sure that all of the previous instructions that read that location
have completed execution; to do otherwise would destroy the integrity of the
sequential programming model.
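The same kind of hazard exists on the memory side, which is where the memory aliasing problem comes from. Here is a minimal sketch (my own example, with arbitrary names) of the situation the MOB has to guard against: if the younger load were allowed to bypass the older store and the two pointers happened to refer to the same location, the load would return a stale value and break the sequential model.

    int store_then_load(int *q, int *p, int value)
    {
        *q = value;    /* older store */
        return *p;     /* younger load: returning the pre-store value would
                          be wrong if p and q point to the same location */
    }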
LOADS AND STORES: TO SPLIT, OR NOT TO SPLIT?
You may be wondering why a load is decoded into a single load-address micro-op,
while a store is decoded into store-address and store-data micro-ops. Doesn’t the load also have to access the cache, just like the store does with its store-data micro-op? The load instruction does indeed access the cache, but because the load-address and load-data operations are inherently serial, there’s no point in separating them into two distinct micro-ops and assigning them to two separate execution pipelines.
The two parts of the store operation, in contrast, are inherently parallel. Because the computer can begin calculating a store’s destination address at the same time it is retrieving the store’s data from the register file, both of these operations can be performed simultaneously by two different micro-ops and two separate execution
units. Thus the P6 core design and its successors feature a single execution unit (the load unit) and a single micro-op for load instructions, and two execution units (the store-address unit and the store-data unit) and their two corresponding micro-ops for store instructions.
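As a small concrete example (mine, with an arbitrary function name), the single assignment below involves both kinds of memory access, and the comments spell out how each side is broken into micro-ops on the P6 family as just described.

    void copy_element(double *dst, const double *src, long i)
    {
        /* The right-hand side is a load: on the P6 family it becomes a
         * single load-address micro-op that computes the address and then
         * reads the data cache. The assignment itself is a store: it becomes
         * a store-address micro-op (compute the destination address on an
         * AGU) plus a store-data micro-op (read the value to be stored from
         * the register file), and those two can execute in parallel on
         * separate units. */
        dst[i] = src[i];
    }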