Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
Intel x86 processors” (working paper). Swox AB, September 2005. http://www.swox.com/doc/x86-timing.pdf.
Gwennap, Linley. “Intel’s MMX Speeds Multimedia: Instruction-Set Extensions to Aid Audio, Video, and Speech.” Microprocessor Report 10, no. 3 (March 1996).
Mittal, Millind, Alex Peleg, and Uri Weiser. “MMX Technology Architecture Overview.” Intel Technology Journal 1, no. 1 (August 1997).
Thakkar, Shreekant and Tom Huff. “The Internet Streaming SIMD Extensions.” Intel Technology Journal 3, no. 2 (May 1999).
Pentium and P6 Family
Case, Brian. “Intel Reveals Pentium Implementation Details: Architectural Enhancements Remain Shrouded by NDA.” Microprocessor Report 7, no. 4 (March 1993).
Fog, Agner. “How to optimize for the Pentium family of microprocessors,” 2004.
Fog, Agner. Software optimization resources. “The microarchitecture of Intel and AMD CPU’s: An optimization guide for assembly programmers and compiler makers.” 2006. http://www.agner.org/optimize.
Gwennap, Linley. “Intel’s P6 Uses Decoupled Superscalar Design: Next Generation of x86 Integrates L2 Cache in Package with CPU.” Microprocessor Report 9, no. 2 (February 16, 1995).
Keshava, Jagannath and Vladimir Pentkovski. “Pentium III Processor Implementation Tradeoffs.” Intel Technology Journal 3, no. 2 (Q2 1999).
Intel Architecture Optimization Manual. Intel, 2001.
Intel Architecture Software Developer’s Manual, vols. 1–3. Intel, 2006.
P6 Family of Processors Hardware Developer’s Manual. Intel, 1998.
Pentium II Processor Developer’s Manual. Intel, 1997.
Pentium Pro Family Developer’s Manual, vols. 1–3. Intel, 1995.
Bibliography and Suggested Reading
Pentium 4
“A Detailed Look Inside the Intel NetBurst Micro-Architecture of the Intel Pentium 4 Processor” (white paper). Intel, November 2000.
Boggs, Darrell, Aravindh Baktha, Jason Hawkins, Deborah T. Marr, J. Alan Miller, Patrice Roussel, Ronak Singhal, Bret Toll, and K.S. Venkatraman. “The Microarchitecture of the Intel Pentium 4 Processor on 90nm Technology.” Intel Technology Journal 8, no. 1 (February 2004).
DeMone, Paul. “What’s Up With Willamette? (Part 1).” Real World Technologies (March 2000). http://www.realworldtech.com/page.cfm?ArticleID=RWT030300000001.
Hinton, Glenn, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, Desktop Platforms Group, and Patrice Roussel. “The Microarchitecture of the Pentium 4 Processor.” Intel Technology Journal 5, no. 1 (February 2001).
Intel Pentium 4 Processor Optimization Manual. Intel, 2001.
Pentium M, Core, and Core 2
Gochman, Simcha, Avi Mendelson, Alon Naveh, and Efraim Rotem. “Introduction to Intel Core Duo Processor Architecture.” Intel Technology Journal 10, no. 2 (May 2006).
Gochman, Simcha, Ronny Ronen, Ittai Anati, Ariel Berkovits, Tsvika Kurts, Alon Naveh, Ali Saeed, Zeev Sperber, and Robert C. Valentine. “The Intel Pentium M Processor: Microarchitecture and Performance.” Intel Technology Journal 7, no. 2 (May 2003).
Kanter, David. “Intel’s Next Generation Microarchitecture Unveiled.” Real World Technologies (March 2006). http://realworldtech.com/page.cfm?ArticleID=RWT030906143144.
Mendelson, Avi, Julius Mandelblat, Simcha Gochman, Anat Shemer, Rajshree Chabukswar, Erik Niemeyer, and Arun Kumar. “CMP Implementation in Systems Based on the Intel Core Duo Processor.” Intel Technology Journal 10, no. 2 (May 2006).
Wechsler, Ofri. “Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance.” Technology@Intel Magazine (March 2006).
Online Resources
Ace’s Hardware. http://aceshardware.com.
AnandTech. http://anandtech.com.
Ars Technica. http://arstechnica.com.
Real World Technologies. http://realworldtech.com.
sandpile.org. http://sandpile.org.
X-bit labs. http://xbitlabs.com.
INDEX
Note: Page numbers in italics refer to figures. A page number followed by an italic n refers to a term in the footnote of that page.

Symbols and Numbers
# (hash mark), for memory address, 15
2-way set associative mapping, 228, 228–229
  vs. 4-way, 229
  vs. direct mapping, 229
4-way set associative mapping, 226–227, 228
  vs. 2-way, 229
32-bit computing, vs. 64-bit, 182
32-bit integers, x86 ISA support for, 187
64-bit address space, 186–187
64-bit computing, 181–183
  vs. 32-bit, 182
  current applications, 183–187
64-bit mode, in x86-64, 190, 191

A
absolute addressing, vs. register-relative addressing, 17
add instruction, 6–7, 11
  executing, 8
  on PowerPC, 117
  steps to execute, 265, 266
  three-operand format for PowerPC, 162
addition of numbers, steps for performing, 10
addpd instruction, throughput on Intel processors, 261
addps instruction, throughput on Intel processors, 261
addresses
  on 32-bit vs. 64-bit processors, 183
  64-bit, benefits of, 186–187
  calculation of, 17
  calculations in load-store units, 203
  as integer data, 183
  register-relative, 16–17
  virtual vs. physical space, 185–186
address generation, by load-store unit, 69
address space, 186
addsd instruction, throughput on Intel processors, 261
addss instruction, throughput on Intel processors, 261
Advanced Micro Devices (AMD)
  64-bit workstation market opening for, 180
  Athlon processor, 72, 110
  x86-64, 187–192
    added registers, 188
    extended registers, 187
    programming model, 189
    switching modes, 189–192
AIM (Apple, IBM, Motorola) alliance, 112
allocate and rename stages, on Pentium 4, 155
AltiVec, 135. See also Motorola AltiVec
ALU. See arithmetic logic unit (ALU)
AMD. See Advanced Micro Devices (AMD)
and instruction, cycles to execute on PowerPC, 117
Apple. See also PowerPC (PPC)
  G3. See PowerPC 750 (G3)
  G4. See PowerPC 7400 (G4)
  G4e. See Motorola G4e
  G5, 193. See also PowerPC 970 (G5)
  Performas, 119
  PowerBook, 119
approximations of fractional values, 66
arithmetic
  coprocessor, 67
  instructions, 11, 12, 36
    actions to execute, 36–37
    binary code for, 21–22
    format, 12
    immediate values in, 14–16
  micro-op queue, 156
  operations, 67
arithmetic logic unit (ALU), 2, 5, 6, 67–69
  multiple on chip, 62
  on Pentium, 88–91
  on Pentium 4, schedulers for, 156
  storage close to, 7
Arm, 73
assembler, 26
assembler code, 21
assembly language, beginnings, 26
associative mapping
  fully, 224, 225
  n-way set, 226–230, 227
Athlon processor (AMD), 72, 110
average completion rate, 52–53
average instruction throughput, 54
  pipeline stalls and, 56–57

B
back end, 38, 38
  on Core 2 Duo, 258–270, 260
  on Pentium, 87–91
    floating-point ALUs, 88–91
    integer ALUs, 87–88
  on Pentium II, 108
  on Pentium III, 110
  on Pentium M, 246
  on Pentium Pro, 94–100, 102–103, 103
  on PowerPC 603 and 603e, 119–121
  on PowerPC 604, 123–126
  on PowerPC 970, 200–203
backward branch, 30
bandwidth, and cache block size, 232
Banias, 236
base-10 numbering system, 183–184
base address, 16, 17
BEU (branch execution unit), 69, 85
  on PowerPC 601, 116
BHT. See branch history table (BHT)
binary code, 21
  for arithmetic instructions, 21–22
binary notation, 20
BIOS, 34
blocks and block frames, for caches, 223, 223–224
  sizes of, 231–232
Boolean operations, 67
bootloader program, 34
bootstrap, 34
boot up, 34
BPU (branch prediction unit), 85
branch check stage, in Pentium 4 pipeline, 158
branch execution unit (BEU), 69, 85
  on PowerPC 601, 116
branch folding, 113
  by Pentium 4 trace cache, 153
branch hazards, 78
branch history table (BHT), 86
  on PowerPC 604, 125
  on PowerPC 750, 132
  on PowerPC 970, 196
branch instructions, 11, 30–34
  and fetch-execute loop, 32
  and labels, 33–34
  on PowerPC 750, 130–132
  on PowerPC 970, 198
  register-relative address with, 33
  as special type of load, 32–33
  in superscalar systems, 64
branch prediction, 78
  on Pentium, 84, 85–87
  on Pentium M, 244–245
  on Pentium Pro, 102
  on PowerPC 603, 122
  on PowerPC 970, 195–196
  for trace cache, 151–152
branch prediction unit (BPU), 85
branch target, 85, 87
branch target address cache (BTAC), on PowerPC 604, 125
branch target buffer (BTB), 86
  P6 pipeline stage to access, 101
  on PowerPC 750, 132
branch target instruction cache (BTIC), on PowerPC 750, 132
branch unit
  issue queue for, 210–211
  on Pentium, 85–87
  on PowerPC 604, 125
brand names for Intel processors, 236–237
BTAC (branch target address cache), on PowerPC 604, 125
BTB. See branch target buffer (BTB)
BTIC (branch target instruction cache), on PowerPC 750, 132
bubbles in pipeline, 54–55, 55, 78
  avoiding on PowerPC 970, 196
  insertion by PowerPC 970 front end, 198
  for Pentium 4, 164
  and pipeline depth, 143
  squeezing out, 97–98
buffers. See also branch target buffer (BTB); reorder buffer (ROB)
  for completion phase, 98
  dynamic scheduling with, 97
  fetch, on Intel Core Duo, 239
  front-end branch target, 87, 147
  hardware loop, 240
  issue, 96
  memory reorder, 265, 268
  between Pentium Pro front end and execution units, 96
bus, 5, 204
Busicom, 62
business applications, spatial locality of code for, 221–222

C
cache, 81–82
  basics, 215–219
  blocks and block frames for, 219, 223, 223–224
  hierarchy, 218, 219
    byte’s journey through, 218–219
  hit, 81, 219
  level 1, 81, 217–218
  level 2, 81, 218
  level 3, 81
  line, 219
  locality of reference and, 220–223
  memory, 81
  miss, 81, 217
    capacity, 231
    compulsory, 219
    conflict, 226
  placement formula, 229
  placement policy, 224
  on PowerPC 970, 194–195
  replacement policy, 230–232
  tag RAM for, 224
  write policies for, 232–233
caching, instruction, 78
calculator model of computing, 2, 2–3
capacity miss, 231
C code, 21
Cedar Mill, 236
central processing unit (CPU), 1. See also microprocessor
channels, 1
chip multiprocessing, 249, 249–250, 250
chipset, 204, 204–205
CISC (complex instruction set computing), 73, 105
CIU (complex integer unit), 68, 87
  on PowerPC 750, 130
clock, 29–30
  cycle
    CPU vs. memory and bus, 216
    instruction completion per, 53–54
    in pipelined processor, 47
  generator module, 29–30
  period, and completion rate, 58–60
  speed
    and dynamic power density, 237
completion queue, on PowerPC 603, 122
completion rate
  and clock period, 58–60
  and program execution time, 51–52
complex instruction set computing (CISC), 73, 105
complex integer instructions, 201
complex integer unit (CIU), 68, 87
  on PowerPC 750, 130
complexity, moving from hardware