Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture


Authors: Jon Stokes

Tags: #Computers, #Systems Architecture, #General, #Microprocessors


The need for accesses to programmer-visible storage to be committed in

program order applies to main memory just as it does to the register file. To

see an example of this, consider Program 12-1. The first line stores the number

13 in an unknown memory cell, and the next line loads the contents of the

red memory cell into register A. The final line is an arithmetic instruction

that adds the contents of registers A and B and places the result in register C.

Program 12-1

store 13, [unknown memory cell]
load [red memory cell], A
add A, B, C

Program 12-1: A program with a load and a store, where the store’s target is unknown.

Figures 12-17 and 12-18 show two options for the destination address of the store: either the red cell (Figure 12-17) or an unrelated blue cell (Figure 12-18). If the store ends up writing to the red cell, then the store must execute before the load so that the load can then read the updated value from the red cell and supply it to the following add instruction (via register A). If the store writes its value to the blue cell, then it doesn’t really matter if that store executes before or after the load, because it is modifying an unrelated memory location.

When a store and a load both access the same memory address, the two instructions are said to alias. Figure 12-17 is an example of memory aliasing, while Figure 12-18 is not.
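The effect of aliasing on Program 12-1 can be checked with a small simulation. The following is only an illustrative sketch: the function name `run_program`, the dictionary standing in for main memory, and the initial cell values are all assumptions made for this example, not anything taken from the hardware.

```python
def run_program(memory, store_addr, load_addr, b, hoist_load):
    """Run Program 12-1 on a dict-based memory and return register C.

    hoist_load=True executes the load before the store (reordered);
    hoist_load=False executes the two in program order.
    """
    mem = dict(memory)          # copy so every run starts from the same state
    if hoist_load:
        a = mem[load_addr]      # load <cell>, A  (moved above the store)
        mem[store_addr] = 13    # store 13, <cell>
    else:
        mem[store_addr] = 13    # store 13, <cell>
        a = mem[load_addr]      # load <cell>, A
    return a + b                # add A, B, C


# Hypothetical starting contents for the two cells in the figures.
initial = {"red": 5, "blue": 7}

# Aliased pair (store and load both touch "red"): order changes the result.
in_order = run_program(initial, "red", "red", b=1, hoist_load=False)   # 14
hoisted = run_program(initial, "red", "red", b=1, hoist_load=True)     # 6

# Non-aliased pair (store to "blue", load from "red"): order is irrelevant.
same_a = run_program(initial, "blue", "red", b=1, hoist_load=False)    # 6
same_b = run_program(initial, "blue", "red", b=1, hoist_load=True)     # 6
```

In the aliased case the hoisted load reads the stale value, so the add produces a different result; in the non-aliased case both orders agree.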

Intel’s Pentium M, Core Duo, and Core 2 Duo


Figure 12-17: An aliased load-store pair

Figure 12-18: A non-aliased load-store pair

Memory Reordering Rules

In order to avoid aliasing-related problems, all of the Intel processors prior

to Core obey the following rules when reordering memory operations in

the MOB:

1. Stores must always commit in program order relative to other stores.

2. No load can commit ahead of an aliased store-data micro-op.

3. No load can commit ahead of a store-data micro-op with an unknown target address.

Rule #3 dictates that no load is allowed to be moved (or “hoisted”) above a store with an undefined address, because when that store’s address becomes available, the processor might find that a load and store are accessing the same address (i.e., the load and store are aliased). Because of memory aliasing, this kind of load hoisting is not allowed on x86 processors prior to Core.
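Rules 2 and 3 amount to a simple conservative check that can be sketched in a few lines. This is a software illustration only, not a description of the MOB's actual logic: the function name `may_hoist_load` and the use of `None` to represent a store whose address has not yet been computed are assumptions for the example. Rule 1 (store-to-store ordering) is not modeled.

```python
from typing import Iterable, Optional


def may_hoist_load(load_addr: int,
                   older_store_addrs: Iterable[Optional[int]]) -> bool:
    """Return True only if the load may commit ahead of every older store.

    Each entry in older_store_addrs is the target address of a store that
    precedes the load in program order and has not yet committed. None
    stands for a store whose target address is still unknown.
    """
    for store_addr in older_store_addrs:
        if store_addr is None:
            return False   # rule 3: unknown target address blocks the load
        if store_addr == load_addr:
            return False   # rule 2: the store aliases the load
    return True            # no older store can affect this load
```

Under this check a single store with an unresolved address holds back every younger load, even if none of them would have aliased with it.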

False Aliasing

As it turns out, the kind of load-store aliasing that rule #3 above is intended to prevent is exceedingly rare. The vast majority of the memory accesses in the MOB window do not alias, so they could theoretically proceed independently of one another. Thus the practice of preventing any load from committing until every last store in the MOB has a known target address unnecessarily restricts the number of memory instructions that the processor can commit each cycle.

Because most load-store pairs don’t alias, processors like the P6 that play it safe lose quite a bit of performance to false aliasing, where the processor assumes that two or more memory accesses alias, when in reality they do not.

Let’s take a look at exactly where this performance loss comes from.

Figure 12-19 shows a cycle-by-cycle breakdown of how aliased and non-aliased versions of Program 12-1 execute on processors like the P6 and Pentium 4, which use conservative memory access reordering assumptions.

In both instances, the store-address micro-op must execute first so that

it can yield a known destination address for the store-data micro-op. This

destination address has to be available before any of the memory accesses

that are waiting in the MOB can be carried out. Because the destination

address of the store is not available until the end of the second cycle when

the store-address micro-op has finished calculating it, the processor cannot

execute either the store or the load until the third cycle or later.


Chapter 12

[Figure panels: “Aliased accesses with no reordering” and “Non-aliased accesses with conservative reordering”]

Figure 12-19: Execution without memory disambiguation

When the destination address of the store becomes available at the

outset of cycle three, if it turns out that the memory accesses are aliased, the processor must wait another cycle for the store-data micro-op to update the

red memory cell before it can execute the load. Then, the load executes, and

it too takes an extra cycle to move the data from the red memory cell into the

register. Finally, on the seventh cycle, the add is executed.

If the processor discovers that the accesses are not aliased, the load can execute immediately after the store-address micro-op and before the store-data micro-op. In other words, for non-aliased accesses, the processor will move the load up in the queue so that it executes before the less critical, “fire-and-forget” store instruction.
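The cycle counts in this walkthrough can be reproduced with a toy timing model. The sketch below assumes that the store-address, store-data, and load micro-ops each occupy two cycles, which is the latency that makes the aliased case land on the seventh cycle as described; the constant `MEM_OP_CYCLES` and the function name are hypothetical, and the non-aliased figure is simply what this model yields rather than a number stated in the text.

```python
MEM_OP_CYCLES = 2  # assumed latency of store-addr, store-data, and load


def add_cycle_conservative(aliased: bool) -> int:
    """Return the cycle on which the add executes under conservative rules.

    The store-address micro-op always runs first, so nothing else in the
    example can start until its result is available.
    """
    addr_done = MEM_OP_CYCLES                    # store-addr: cycles 1-2
    if aliased:
        data_done = addr_done + MEM_OP_CYCLES    # store-data: cycles 3-4
        load_done = data_done + MEM_OP_CYCLES    # load waits for it: cycles 5-6
    else:
        load_done = addr_done + MEM_OP_CYCLES    # load goes first: cycles 3-4
    return load_done + 1                         # add runs on the next cycle


aliased_cycle = add_cycle_conservative(aliased=True)       # 7
non_aliased_cycle = add_cycle_conservative(aliased=False)  # 5
```

Even in the non-aliased case, the load in this model cannot begin before cycle three, because it must wait for the store's address to be known.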

Memory Disambiguation

Core’s memory disambiguation hardware attempts to identify instances of false aliasing so that in instances where the memory accesses are not aliased, a load can actually execute before a store’s destination address becomes available. Figure 12-20 illustrates non-aliased memory accesses with and without the reordering opportunity that memory disambiguation affords.

[Figure panels: “Non-aliased accesses with conservative reordering” and “Non-aliased accesses with disambiguation”]

Figure 12-20: Execution with and without memory disambiguation

When the non-aliased accesses execute with memory disambiguation,

the load can go ahead and execute while the store’s address is still unknown.

The store, for its part, can just execute whenever its destination address

becomes available.


Reordering the memory accesses in this manner enables our example processor to execute the addition a full cycle earlier than it would have without memory disambiguation. If you consider a large instruction window that contains many memory accesses, the ability to speculatively hoist loads above stores could save a significant number of total execution cycles.

Intel has developed an algorithm that examines memory accesses in

order to guess which ones are probably aliased and which ones aren’t. If the

algorithm determines that a load-store pair is aliased, it forces them to commit in program order. If the algorithm decides that the pair is not aliased, the

load may commit before the store.

In cases where Core’s memory disambiguation algorithm guesses incorrectly, the pipeline stalls, and any operations that were dependent on the erroneous load are flushed and restarted once the correct data has been (re)loaded from memory.
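The predict, check, and recover sequence can be sketched in software. This is not Intel's algorithm, whose details are not given here: the prediction is simply passed in as a boolean (`predict_no_alias`), and the function name, the dictionary memory, and the initial values are hypothetical. The sketch only shows that a wrong guess is detected once the store's address is known and is repaired by reloading.

```python
def run_with_disambiguation(memory, store_addr, load_addr, b, predict_no_alias):
    """Run Program 12-1 speculatively; return (register C, replayed)."""
    mem = dict(memory)                # copy so each run starts fresh
    replayed = False
    if predict_no_alias:
        a = mem[load_addr]            # speculative load, store address unknown
        mem[store_addr] = 13          # store address resolves; store commits
        if store_addr == load_addr:   # check: the pair aliased after all
            a = mem[load_addr]        # discard stale value, reload correct data
            replayed = True
    else:
        mem[store_addr] = 13          # predicted aliased: keep program order
        a = mem[load_addr]
    c = a + b                         # dependent add uses the final value of A
    return c, replayed


initial = {"red": 5, "blue": 7}       # hypothetical starting contents

good_guess = run_with_disambiguation(initial, "blue", "red", 1, True)   # (6, False)
bad_guess = run_with_disambiguation(initial, "red", "red", 1, True)     # (14, True)
```

Either way the program ends with the same value in register C that in-order execution would produce; a misprediction costs time, not correctness.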

By drastically cutting down on false aliasing, Core eliminates many cycles

that are unnecessarily wasted on waiting for store address data to become

available. Intel claims that memory disambiguation’s impact on performance

is significant, especially in the case of memory-intensive floating-point code.

Summary: Core 2 Duo in Historical Context

Intel’s turn from the hyperpipelined Netburst microarchitecture to the power-efficient, multi-core–friendly Core microarchitecture marks an important shift not just for one company, but for the computing industry as a whole. The processor advances of the past two decades, described in detail throughout this book, have been aimed at increasing the performance of single instruction streams (or threads of execution). The Core microarchitecture emphasizes single-threaded performance as well, but it is part of a larger, long-term project that involves shifting the focus of the entire computing industry from single-threaded performance to multithreaded performance.


