Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
relationship with program execution time, 52–53
extensions, 70
hardware implementation of, 70
decoding unit, on Pentium Pro, 106, 106–107
history of, 71–73, 72
microarchitecture and, 69–74
282
INDEX
instruction set translation, on Pentium Pro, 103, 105
microarchitecture, 255
pipeline, 258
integer ALUs, on Pentium, 87–88
vector processing, 262–264
integer execution units (IUs), 68, 163–165
Intel Core Duo/Solo, 235, 247–254
back end, 259
on Core 2 Duo, 260
features, 247
on G4e, 163–164
fetch buffer, 239
issue queue for, 210, 211
floating-point control word (FPCW), 252
on Pentium, 69
on PowerPC 601, 115
floor plan of, 250
on PowerPC 604, 125
in historical context, 254
integer instructions
integer division, 253
and default integer size, 191
loop detector, 252
PowerPC 970 performance of, 203
micro-ops fusion in, 251–252
integer pipeline, 84
multi-core processor on, 247–250
integers, 66
Streaming SIMD Extensions (SSE), 252
on 32-bit vs. 64-bit processors, 183
division on Core Duo, 253
Virtualization Technology, 253
Intel
code names and brand names, 236–237
Intel P6 microarchitecture, 93. See also Intel Pentium Pro
fetch and decode hardware, 240
Itanium Processor family, 180
MMX (Multimedia Extensions), 70, 108, 174
Intel Pentium, 62, 68, 80–93
back end, 87–91
floating-point ALUs, 88–91
project names, 138
integer ALUs, 87–88
relative frequencies of processors, 139
basic microarchitecture, 80
branch unit and branch prediction, 85–87
Intel 4004, 62
Intel 8080, 62
caches, 81–82
Intel 8086, 74, 187
features, 80
Intel Celeron processor, 222
floating-point unit in, 69
Intel Core 2 Duo, 254–258
in historical context, 92–93
back end, 258–270, 260
integer unit in, 69
floating-point execution units (FPUs), 260–262
level 1 cache, 80
pipeline, 82–85, 83
integer execution units (IUs), 260
stages, 84–85
static scheduling, 95
decode phase of instruction, 256, 257–258
x86 overhead on, 91–92
Intel Pentium II
double-precision floating-point operations on, 261
back end, 108
features, 93
features, 254
Intel Pentium III, 109
fetch phase of instruction, 256, 256–257
back end, 110, 259
bottleneck potential, 258
floating-point instruction throughput, 261
features, 93
floating-point instruction throughput, 261
in historical context, 270
memory disambiguation, 264–270
floating-point vector processing in, 108
Intel Pentium 4, 74, 110, 138–140
reorder buffer, 99–100
approach to performance, 143
reservation station (RS), 98–99, 100
architecture, 148, 148–154
branch prediction, 147–148
Intel Technology Journal, 175
critical execution path, 152
Intel x86 hardware, 70
features, 138
inter-element arithmetic and non-arithmetic operations, 172–173
floating-point execution unit (FPU) for, 167–168
floating-point instruction throughput, 261
internal operations (IOPs), 196
intra-element arithmetic and non-arithmetic operations, 171–172
vs. G4e, 137
general approaches and design philosophy, 141–144
I/O (input-output) unit, 26
instruction window, 159
IOPs (internal operations), 196
integer execution units (IUs), 163, 164–165
ISA. See instruction set architecture (ISA)
internal instruction format, 197
issue buffer, 96
pipeline, 155–159
issue buffer/execution unit availability rule, 127
vector unit, 176
Intel Pentium D, 249
issue phase
Intel Pentium M, 235, 239–246
on Pentium Pro, 96–98
back end, 259
on PowerPC 604, 126–127
branch prediction, 244–245
issue ports, on Pentium 4, 157–158
decode phase, 240–244, 241, 242
issue queues
features, 239
for branch unit and condition register, 210–211
fetch phase, 239–240
floating-point instruction throughput, 261
for G4e, 146
for integer and load-store execution units, 210, 211
floor plan of, 248
pipeline and back end, 246
on PowerPC 970, 199
stack execution unit on, 246
vector, 211, 212
versions, 236
vector logical, 207
Intel Pentium Pro, 93–109, 94
back end, 102–103, 103, 259
issue stage
for G4e, 146
branch prediction in, 102
for Pentium 4, 157–158
cost of legacy x86 support on, 107
issuing, 96
decoupling front end from back end, 94–100
Itanium Processor family, 180
IU. See integer execution units (IUs)
features, 93
J
floating-point unit in, 103
in historical context, 107–109
jumpn instruction, 32
instruction set translation, 103
jumpo instruction, 32
instruction window, 100
jumpz instruction, 31
issue phase, 96–98
level 1 cache, 107
K
microarchitecture’s instruction decoding unit, 106, 106–107
Katmai, 109
processor, 175
pipeline, 100–102
kernels, 221
L
steps to execute, 265, 266
L1 cache. See level 1 cache
translating into fused micro-ops, 242, 243
L2 cache. See level 2 cache
load port, on Pentium 4, 157
L3 cache, 81
load-store units (LSUs), 17
labels, and branch instructions, 33–34
issue queue for, 210, 211
on Pentium, 69
laminated micro-op, 251
on PowerPC 603 and 603e, 121
laptop (portable) computers, 237
on PowerPC 604, 125
latency of instruction
on PowerPC 970, 203–205
for Pentium 4 SIMD instructions, 176
locality of reference, 220–223
for floating-point code, 166
pipeline stalls and, 57–58
for integer-intensive applications, 165
for PowerPC 970 integer unit, 202
for string instructions, 105
logical issue queues, 207, 210
on superscalar processors, 117–118
logical operations, 12, 67
as intra-element non-arithmetic operations, 171
tag RAM and, 224
leakage current, from idle transistor, 238
long mode, in x86-64, 190, 191
least recently used (LRU) block, and cache replacement policy, 230
loop detector
on Core Duo, 252
on Pentium M, 244–245
LRU (least recently used) block, and cache replacement policy, 230
legacy mode, in x86-64, 189, 190, 191
level 1 cache, 81, 217–218
vs. other data storage, 217
LSU. See load-store units (LSUs)
on Pentium, 80
on Pentium II, 108
M
on Pentium 4, 149
on PowerPC 601, 115
machine instructions, 72
splitting, 223
machine language, 19–25
level 2 cache, 81, 218
on DLW-1, 20–21
vs. other data storage, 217
translating program into, 25
for Pentium III, 109
use in early computing, 26
level 3 cache, 81
machine language format, for register-relative load, 24
lines, 1
load address unit, on P6 back end, 102
machine language instruction, 20
macro-fusion, 255, 257
load balancing, on PowerPC 970, 212–213
main memory, 9
mapping
direct, 225–226, 226
fully associative, 224, 225
n-way set associative, 226–230, 227
load hoisting, 268
loading, operating system, 34
load instruction, 11, 15, 23–24
branch instruction as special type, 32–33
Mark I (Harvard), 81
Mauchly, John, 6n
micro-ops for, 267
maximum theoretical completion rate, 52–53
programmer and control of, 104
register-relative address, 24
maximum theoretical instruction throughput, 54
media applications
micro-ops. See micro-operations
cache pollution by, 222
microprocessor, 1
vector computing for, 168
clock cycle, 44
memory
instruction completion per, 53–54
access to contents, 14
address, storage by memory cell, 16
vs. memory and bus clock cycles, 216
aliasing, 265, 266–267
in pipelined processor, 47
bus, 9
core microarchitecture of, 248
disambiguation on Core 2 Duo, 264–270
errors from exceeding dynamic range, 184–185
for floating-point performance, 168
hard-wired to fetch first instruction, 34
hierarchy on computer, 82
increasing number of instructions per time period, 43
instruction format, 13
lifecycle of access instruction, 265, 266
interface to, 26
microarchitecture vs. implementations, 248
micro-op queue, 156
vs. other data storage, 217
non-pipelined, 43–45, 44
ports, on Pentium 4, 157
pipelined, 45–48
RAM, 8–10
three-step sequence, 8
reorder buffer, 265
millicoded instruction, 197, 198
rules for, 268
MIPS, 73
scheduler, on Pentium 4, 156
MMX (Multimedia Extensions), 70, 108, 174
speed of, 81
memory-access instructions, 11, 12
mnemonics, 20
binary encoding, 23–25
mode bit, 21
memory-access units, 69
Moore’s Curves, 92, 93
branch unit as, 85
μops. See micro-operations
memory-to-memory format arithmetic instructions, 103–104
Motorola. See PowerPC (PPC)
Motorola 68000 processor, 112
Merom, 255
Motorola AltiVec, 70, 135, 169–170
micro-operations (micro-ops; μops; uops), 106, 149
development, 207
G4e units, 206–207
fusion, 240–244
vector operations, 170–173
in Core Duo, 251–252
inter-element arithmetic and non-arithmetic operations, 172–173
in Pentium 4, 150
queue, 106
memory, 156
intra-element arithmetic and non-arithmetic operations, 171–172
on Pentium 4, 155
schedulers for, 156–157
microarchitecture, and ISA, 69–74
Motorola G4, vector instruction latencies on, 208
microcode engine, 72, 73
microcode programs, 72
Motorola G4e
microcode ROM, 72
architecture and pipeline, 144–147
in Pentium, 85, 92
in Pentium 4, 154
branch prediction, 147–148
caches, 194
dedicated integer hardware for address calculations, 204
O
offset, 17
floating-point execution unit (FPU) for, 166–167
on-die cache, 82
opcodes, 19–25
general approaches and design philosophy, 141–144
operands, 12
formats, 161–162
integer execution units (IUs), 163–164
operating system, loading, 34
operations, 11
integer hardware, 203
out-of-order execution, 96
microarchitecture, 144
output, 2
vs. Pentium 4, 137
overflow of register, 32, 184
performance approach, 142
overhead cost, for pipeline, 60
pipeline stages, 145–147
overheating, 238
vector instruction latencies on, 208
P
Motorola G5. See PowerPC (PPC) 970 (G5)
packed floating-point addition (PFADD), 260
Motorola MPC7400. See PowerPC (PPC) 7400 (G4)
page file, on hard drive, 218
Motorola MPC7450, 138