Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
The need for accesses to programmer-visible storage to be committed in
program order applies to main memory just as it does to the register file. To
see an example of this, consider Program 12-1. The first line stores the number
13 in an unknown memory cell, and the next line loads the contents of the
red memory cell into register A. The final line is an arithmetic instruction
that adds the contents of registers A and B and places the result in register C.
store 13, [unknown cell]
load  [red cell], A
add   A, B, C

Program 12-1: A program with a load and a store, where the store’s target is unknown.
Figures 12-17 and 12-18 show two options for the destination address of the store: either the red cell (Figure 12-17) or an unrelated blue cell (Figure 12-18). If the store ends up writing to the red cell, then the store must execute before the load so that the load can then read the updated value from the red cell and supply it to the following add instruction (via register A). If the store writes its value to the blue cell, then it doesn't really matter if that store executes before or after the load, because it is modifying an unrelated memory location.

When a store and a load both access the same memory address, the two instructions are said to alias. Figure 12-17 is an example of memory aliasing, while Figure 12-18 is not.
Figure 12-17: An aliased load-store pair
Figure 12-18: A non-aliased load-store pair
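To make the aliasing condition concrete, here is a minimal sketch in Python; it is purely illustrative and not something the hardware runs. The MemAccess type, the byte-range overlap test, and the example addresses (0x1000 and 0x2000 standing in for the red and blue cells) are assumptions invented for this example.

```python
# Toy illustration of the aliasing test: two memory accesses alias when the
# byte ranges they touch overlap. Real hardware must also cope with partial
# overlaps, different access widths, and addresses that are not yet known.
from dataclasses import dataclass

@dataclass
class MemAccess:
    addr: int   # starting byte address
    size: int   # access width in bytes

def aliases(a: MemAccess, b: MemAccess) -> bool:
    return a.addr < b.addr + b.size and b.addr < a.addr + a.size

load = MemAccess(addr=0x1000, size=4)          # load <red cell>, A
store_fig17 = MemAccess(addr=0x1000, size=4)   # store hits the red cell
store_fig18 = MemAccess(addr=0x2000, size=4)   # store hits the blue cell

print(aliases(store_fig17, load))  # True  -- the aliased case of Figure 12-17
print(aliases(store_fig18, load))  # False -- the non-aliased case of Figure 12-18
```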
Memory Reordering Rules
In order to avoid aliasing-related problems, all of the Intel processors prior to Core obey the following rules when reordering memory operations in the MOB:

1. Stores must always commit in program order relative to other stores.
2. No load can commit ahead of an aliased store-data micro-op.
3. No load can commit ahead of a store-data micro-op with an unknown target address.
Rule #3 dictates that no load is allowed to be moved (or "hoisted") above a store with an undefined address, because when that store's address becomes available, the processor might find that a load and store are accessing the same address (i.e., the load and store are aliased). Because of memory aliasing, this kind of load hoisting is not allowed on x86 processors prior to Core.
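As a rough sketch of how rules 2 and 3 gate a load's commit, consider the following Python fragment. The MemOp structure and the load_may_commit helper are invented for illustration and are not Intel's actual MOB logic; rule 1 (store-store ordering) would be enforced separately.

```python
# Minimal sketch of the conservative commit check implied by rules 2 and 3.
# Each MOB entry is an in-flight memory micro-op whose target address may
# still be unknown (None) if its store-address micro-op hasn't executed yet.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemOp:
    kind: str                   # "load" or "store"
    addr: Optional[int] = None  # None = target address not yet known

def load_may_commit(load: MemOp, older_ops: list) -> bool:
    """True if `load` may commit ahead of every older store in the MOB."""
    for op in older_ops:
        if op.kind != "store":
            continue
        if op.addr is None:
            return False   # rule 3: a store with an unknown address blocks the load
        if op.addr == load.addr:
            return False   # rule 2: an aliased store must commit first
    return True

# Program 12-1: the load must wait while the store's target is still unknown.
pending_store = MemOp("store", addr=None)
print(load_may_commit(MemOp("load", addr=0x1000), [pending_store]))  # False
```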
False Aliasing
As it turns out, the kind of load-store aliasing that rule #3 above is intended to prevent is exceedingly rare. The vast majority of the memory accesses in the MOB window do not alias, so they could theoretically proceed independently of one another. Thus the practice of preventing any load from committing until every last store in the MOB has a known target address unnecessarily restricts the number of memory instructions that the processor can commit each cycle.
Because most load-store pairs don't alias, processors like the P6 that play it safe lose quite a bit of performance to false aliasing, where the processor assumes that two or more memory accesses alias when in reality they do not. Let's take a look at exactly where this performance loss comes from.
Figure 12-19 shows a cycle-by-cycle breakdown of how aliased and non-aliased versions of Program 12-1 execute on processors like the P6 and
Pentium 4, which use conservative memory access reordering assumptions.
In both instances, the store-address micro-op must execute first so that
it can yield a known destination address for the store-data micro-op. This
destination address has to be available before any of the memory accesses
that are waiting in the MOB can be carried out. Because the destination
address of the store is not available until the end of the second cycle when
the store-address micro-op has finished calculating it, the processor cannot
execute either the store or the load until the third cycle or later.
Figure 12-19: Execution without memory disambiguation (left: aliased accesses with no reordering; right: non-aliased accesses with conservative reordering)
When the destination address of the store becomes available at the
outset of cycle three, if it turns out that the memory accesses are aliased, the processor must wait another cycle for the store-data micro-op to update the
red memory cell before it can execute the load. Then, the load executes, and
it too takes an extra cycle to move the data from the red memory cell into the
register. Finally, on the seventh cycle, the add is executed.
If the processor discovers that the accesses are not aliased, the load can execute immediately after the store-address micro-op and before the store-data micro-op. In other words, for non-aliased accesses, the processor will move the load up in the queue so that it executes before the less critical, "fire-and-forget" store instruction.
Memory Disambiguation
Core's memory disambiguation hardware attempts to identify instances of false aliasing, so that in cases where the memory accesses are not aliased, a load can actually execute before a store's destination address becomes available. Figure 12-20 illustrates non-aliased memory accesses with and without the reordering opportunity that memory disambiguation affords.
Figure 12-20: Execution with and without memory disambiguation (left: non-aliased accesses with conservative reordering; right: non-aliased accesses with disambiguation)
When the non-aliased accesses execute with memory disambiguation,
the load can go ahead and execute while the store’s address is still unknown.
The store, for its part, can just execute whenever its destination address
becomes available.
Re-ordering the memory accesses in this manner enables our example
processor to execute the addition a full cycle earlier than it would have without memory disambiguation. If you consider a large instruction window that
contains many memory accesses, the ability to speculatively hoist loads
above stores could save a significant number of total execution cycles.
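To put a number on that, here is a toy dependence-based cycle count for the non-aliased case of Figures 12-19 and 12-20. It assumes every micro-op takes a single cycle, which is a simplification of the example above (where the store-address micro-op took two cycles); the finish_cycles helper is invented purely for this illustration.

```python
# Toy cycle model: each micro-op takes one cycle and can finish one cycle
# after everything it waits on has finished. Not a real pipeline simulator.
def finish_cycles(waits):
    """waits maps each micro-op (in program order) to the ops it must wait for."""
    done = {}
    for op, deps in waits.items():
        done[op] = max((done[d] for d in deps), default=0) + 1
    return done

# Conservative reordering: rule 3 makes the load wait for the store's address.
conservative = finish_cycles({
    "store-addr": [],
    "load":       ["store-addr"],
    "store-data": ["store-addr"],
    "add":        ["load"],
})

# Memory disambiguation: the load is predicted non-aliased and waits for nothing.
disambiguated = finish_cycles({
    "store-addr": [],
    "load":       [],
    "store-data": ["store-addr"],
    "add":        ["load"],
})

print(conservative["add"], disambiguated["add"])  # 3 vs. 2: the add finishes a cycle earlier
```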
Intel has developed an algorithm that examines memory accesses in
order to guess which ones are probably aliased and which ones aren’t. If the
algorithm determines that a load-store pair is aliased, it forces them to commit in program order. If the algorithm decides that the pair is not aliased, the
load may commit before the store.
In cases where Core's memory disambiguation algorithm guesses incorrectly, the pipeline stalls, and any operations that were dependent on the
erroneous load are flushed and restarted once the correct data has been
(re)loaded from memory.
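Intel has not published the details of this predictor, but the overall guess-and-recover flow can be sketched roughly as follows. The AliasPredictor class, its simple "blacklist a load once it has been caught aliasing" policy, and the issue_load helper are all assumptions made up for this illustration, not Core's real mechanism.

```python
# Rough sketch of disambiguation as a guess-and-recover loop: speculate that a
# load does not alias an older store with an unknown address, then check once
# the store's address arrives, and replay the load's dependents if wrong.
class AliasPredictor:
    def __init__(self):
        self.blacklist = set()  # loads (by instruction address) caught aliasing before

    def predict_aliased(self, load_pc: int) -> bool:
        return load_pc in self.blacklist

    def update(self, load_pc: int, did_alias: bool) -> None:
        if did_alias:
            self.blacklist.add(load_pc)  # treat this load conservatively from now on

def issue_load(pred, load_pc, load_addr, store_addr_when_known):
    if pred.predict_aliased(load_pc):
        return "wait: commit in program order behind the store"
    did_alias = (store_addr_when_known == load_addr)   # resolved later, in reality
    pred.update(load_pc, did_alias)
    if did_alias:
        return "mispredicted: flush dependent ops and replay the load"
    return "speculation correct: load committed early"

pred = AliasPredictor()
print(issue_load(pred, 0x400, 0x1000, 0x1000))  # mispredicted: flush and replay
print(issue_load(pred, 0x400, 0x1000, 0x2000))  # wait: now handled conservatively
```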
By drastically cutting down on false aliasing, Core eliminates many cycles
that are unnecessarily wasted on waiting for store address data to become
available. Intel claims that memory disambiguation’s impact on performance
is significant, especially in the case of memory-intensive floating-point code.
Summary: Core 2 Duo in Historical Context
Intel's turn from the hyperpipelined Netburst microarchitecture to the power-efficient, multi-core-friendly Core microarchitecture marks an important shift not just for one company, but for the computing industry as a whole. The processor advances of the past two decades (advances described in detail throughout this book) have been aimed at increasing the performance of single instruction streams (or threads of execution). The Core microarchitecture emphasizes single-threaded performance as well, but it is part of a larger, long-term project that involves shifting the focus of the entire computing industry from single-threaded performance to multithreaded performance.
Bibliography and Suggested Reading
General
Hennessy, John and David Patterson. Computer Architecture: A Quantitative Approach. 3rd ed. San Francisco: Morgan Kaufmann, 2002.
Hennessy, John and David Patterson. Computer Organization and Design: The Hardware/Software Interface. 3rd ed. San Francisco: Morgan Kaufmann, 2004.
Shriver, Bruce and Bennett Smith. The Anatomy of a High-Performance Microprocessor: A Systems Perspective. Los Alamitos, CA: Wiley-IEEE Computer Society Press, 1998.
PowerPC ISA and Extensions
AltiVec Technology Programming Environments Manual. rev. 0.1. Motorola, 1998.
Diefendorff, Keith. "A History of the PowerPC Architecture." Communications of the ACM 37, no. 6 (June 1994): 28–33.
Fuller, Sam. "Motorola's AltiVec Technology" (white paper). Motorola, 1998.
PowerPC 600 Series Processors
Denman, Marvin, Paul Anderson, and Mike Snyder. "Design of the PowerPC 604e Microprocessor." Presented at Compcon '96. Technologies for the Information Superhighway: Digest of Papers, 126–131. Washington, DC: IEEE Computer Society, 1996.
Gary, Sonya, Carl Dietz, Jim Eno, Gianfranco Gerosa, Sung Park, and Hector Sanchez. "The PowerPC 603 Microprocessor: A Low-Power Design for Portable Applications." Proceedings of the 39th IEEE Computer Society International Conference. IEEE Computer Science Press, 1994, 307–15.
PowerPC 601 RISC Microprocessor User's Manual. IBM and Motorola, 1993.
PowerPC 601 RISC Microprocessor Technical Summary. IBM and Motorola, 1995.
PowerPC 603e RISC Microprocessor User's Manual. IBM and Motorola, 1995.
PowerPC 603e RISC Microprocessor Technical Summary. IBM and Motorola, 1995.
PowerPC 604 RISC Microprocessor User's Manual. IBM and Motorola, 1994.
PowerPC 604 RISC Microprocessor Technical Summary. IBM and Motorola, 1994.
PowerPC 620 RISC Microprocessor Technical Summary. IBM and Motorola, 1994.
PowerPC G3 and G4 Series Processors
MPC750 User's Manual. rev. 1. Motorola, 2001.
MPC7400 RISC Microprocessor Technical Summary. rev. 0. Motorola, 1999.
MPC7410/MPC7400 RISC Microprocessor User's Manual. rev. 1. Motorola, 2002.
MPC7410 RISC Microprocessor Technical Summary. rev. 0. Motorola, 2000.
MPC7450 RISC Microprocessor User's Manual. rev. 0. Motorola, 2001.
Seale, Susan. "PowerPC G4 Architecture White Paper: Delivering Performance Enhancement in 60x Bus Mode" (white paper). Motorola, 2001.
IBM PowerPC 970 and POWER
Behling, Steve, Ron Bell, Peter Farrell, Holger Holthoff, Frank O'Connell, and Will Weir. The POWER4 Processor Introduction and Tuning Guide. 1st ed. (white paper). IBM, November 2001.
DeMone, Paul. "A Big Blue Shadow over Alpha, SPARC, and IA-64." Real World Technologies (October 2000). http://www.realworldtech.com/page.cfm?AID=RWT101600000000.
DeMone, Paul. "The Battle in 64 bit Land, 2003 and Beyond." Real World Technologies (January 2003). http://www.realworldtech.com/page.cfm?AID=RWT012603224711.
DeMone, Paul. "Sizing Up the Super Heavyweights." Real World Technologies (October 2004). http://www.realworldtech.com/page.cfm?ArticleID=RWT100404214638.
Diefendorff, Keith. "Power4 Focuses on Memory Bandwidth: IBM Confronts IA-64, Says ISA Not Important." Microprocessor Report 13, no. 13 (October 1999).
Sandon, Peter. "PowerPC 970: First in a new family of 64-bit high performance PowerPC processors." Presented at the Microprocessor Forum, San Jose, CA, October 14–17, 2002.
Stokes, Jon. "IBM's POWER5: A Talk with Pratap Pattnaik." Ars Technica, October 2004. http://arstechnica.com/articles/paedia/cpu/POWER5.ars.
Stokes, Jon. "PowerPC 970: Dialogue and Addendum." Ars Technica, October 2002. http://arstechnica.com/cpu/03q2/ppc970-interview/ppc970-interview-1.html.
Tendler, J. M., J. S. Dodson, J. S. Fields, Jr., H. Le, and B. Sinharoy. "POWER4 System Microarchitecture." IBM Journal of Research and Development 46, no. 1 (January 2002): 5–25.
x86 ISA and Extensions
Granlund, Torbjorn. “Instruction latencies and throughput for AMD and