Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
mul instruction, cycles to execute on PowerPC, 117
mulpd instruction, throughput on Intel processors, 261
mulps instruction, throughput on Intel processors, 261
mulsd instruction, throughput on Intel processors, 261
mulss instruction, throughput on Intel processors, 261
multi-core processors, 247
Multimedia Extensions (MMX), 70, 108, 174
multiprocessing, 249, 249–250
N
NetBurst architecture, 139, 140, 235
    code names for variations, 236
non-pipelined microprocessor, 43–45, 44
noops (no operation), 198
Northwood, 236
notebook (portable) computers, 237
numbers, basic formats, 66–67, 67
n-way set associative mapping, 226–230, 227
parallel execution of instructions, 63
Pentium. See Intel Pentium
performance, 51
    branch prediction and, 86, 125
    gains from pipelining, 60
    of Pentium 4, 140
Performance Optimization With Enhanced RISC (POWER), 112
PFADD (packed floating-point addition), 260
physical address space, vs. virtual, 185–186
pipeline, 37, 40–43, 42
    challenges, 74–78
    cost of, 60
    depth, 46
    of DLW-2 hypothetical computer, 64
    flushing, 86
    limits to, 58–60, 139–140
    on Pentium, 82–85, 83
    on Pentium M, 246
    on PowerPC 601, 113–115
        instruction queue, 113–114
        instruction scheduling, 114–115
    on PowerPC 604, 123–126
INDEX
287
pipeline, continued
    speedup from, 48–51, 50
    stages, 45
    and superscalar execution, 65
    trace cache effect on, 154
pipeline stalls, 54–57
    avoiding, 60
    instruction latency and, 57–58
pipelined execution, 35
pipelined microprocessor, 45–48
pointers, 187
polluting the cache, 222
pop instruction, 88, 89
portable computers, 237
ports, for PowerPC instructions, 206
postfix expressions, 89
POWER (Performance Optimization With Enhanced RISC), 112
power density of chip, 237–239
power-efficient computing, 237–239
PowerPC (PPC), 73, 111
    AltiVec extension, 173
    brief history, 112
    instruction set architecture (ISA), 70, 162
PowerPC (PPC) 601, 112–118, 135
    back end, 115–117
        branch execution unit (BEU), 116
        floating-point unit, 115–116
        integer unit, 115
        sequencer unit, 116–117
    features, 112
    in historical context, 118
    latency and throughput, 117–118
    microarchitecture, 114
    pipeline and front end, 113–115
        instruction queue, 113–114
        instruction scheduling, 114–115
PowerPC (PPC) 603 and 603e, 118–122, 135
    back end, 119–121
    features, 119
    front end, instruction window, and branch prediction, 122
    in historical context, 122
    microarchitecture, 120
PowerPC (PPC) 604, 119, 123–129, 136
    features, 123
    front end and instruction window, 126–128
    microarchitecture, 124
    pipeline and back end, 123–126
    reorder buffer, 128
    reservation station (RS), 126–127
PowerPC (PPC) 604e, 129
PowerPC (PPC) 750 (G3), 129–133
    features, 130
    front end, instruction window, and branch instruction, 130–132
    vs. G4, 133
    in historical context, 132–133
    microarchitecture, 131
PowerPC (PPC) 970 (G5), 193
    back end, 200–203
    branch prediction, 195–196
    caches and front end, 194–195
    decode, cracking and group formation, 196–200
        dispatching and issuing instructions, 197–198
    design philosophy, 194
    dispatch rules, 198–199
    floating-point execution units (FPUs), 205–206
    floating-point issue queue (FIQ), 209, 209–211
    group dispatch scheme
        conclusions, 199–200
        performance implications, 211–213
    integer execution units (IUs), 201–202
        performance conclusions, 203
    load-store units (LSUs) and front-end bus, 203–205
    microarchitecture, 195
    predecoding and group dispatch, 199
    vector computing, 206–209, 207
    vector instruction latencies on, 208
PowerPC (PPC) 7400 (G4), 133–135, 138
    AltiVec support, 173
    features, 133
    in historical context, 135
    microarchitecture, 134
    scalability of clock rate, 135
power wall, 237
PPC. See PowerPC
predecoding, on PowerPC 970, 199
Prefetch/Fetch stage in Pentium pipeline, 84–85
Prescott, 236
processors. See microprocessor
processor serial number (PSN), 109
processor status word (PSW) register, 31, 67
    condition register for functions of, 202–203
productivity, pipelining and, 42
program, 11–14
program counter, 26
program execution time
    and completion rate, 51–52
    decreasing, 43, 47–48
    relationship with completion rate, 52–53
programmers, early processes, 26
programming model, 26, 27, 69–70
    32-bit vs. 64-bit, 182
    early variations, 71
pseudo-LRU algorithm, 230
PSN (processor serial number), 109
PSW register. See processor status word (PSW) register
push instruction, 88, 89
pushing data to stack, 88, 89
Q
queue. See also issue queues
    instruction
        on Core, 256
        on PowerPC 601, 113–114
    micro-op, 106, 155
    stage, on Pentium 4, 156
    vector issue (VIQ), for G4e, 146
R
RAM (random access memory), 8–10
RAT (register allocation table), on Pentium Pro, 100
read-modify instruction, 243
read-modify-write sequence, 4
read-only memory (ROM), 34
reboot, 34
reduced instruction set computing (RISC), 73–74, 105
    instructions in PowerPC, 113
    load-store model, 4
refills of pipeline, performance impact of, 60
register allocation table (RAT), on Pentium Pro, 100
register files, 7–8
    stages on Pentium 4, 158
register-relative address, 16–17
    with branch instruction, 33
register renaming
    to overcome data hazards, 75, 76
    P6 pipeline stage for, 101
    on PowerPC 604, 125
registers, 7
    mapping to binary codes, 20
    vs. other data storage, 217
register-to-memory format arithmetic instructions, 103–104
register-type arithmetic instruction, 21–22
rename register availability rule, 128
rename registers, 98
    on Pentium 4, 165
    on PowerPC 604, 126
    on PowerPC 750, 131
    on PowerPC 970, 199, 200
reorder buffer (ROB), 265
    on Pentium 4, 159
    on Pentium Pro, 99–100
    on PowerPC 604, 126, 128
    rules for, 268
reservation station (RS)
    on P6 core, 258
    P6 pipeline stages for writing to and reading from, 101
reservation station (RS), continued
    on Pentium 4, 149
    on Pentium Pro, 98–99, 100
    on PowerPC 604, 126–127
    on PowerPC 750, 131
results, 4
results stream, 2
REX prefix, 191
RISC. See reduced instruction set computing (RISC)
ROB. See reorder buffer (ROB)
ROM (read-only memory), 34
RS. See reservation station (RS)
RS6000 (IBM), 62
S
scalar operations, 62
scalars, vs. vectors, 66, 170
schedule stage, on Pentium 4, 156–157
segmented memory model, 192
sequencer unit, on PowerPC 601, 116–117
sequentially ordered data, and spatial locality, 220
set associative mapping, 226
SIMD (Single Instruction, Multiple Data) computing, 168, 169
    extensions to PowerPC instruction set, 135
simple/fast integer execution units, on G4e, 163
simple/fast integer instructions, 163
simple FP scheduler, on Pentium 4, 157
simple integer instructions, 201
simple integer unit (SIU), 68, 87
    on PowerPC 750, 130
single-cycle processors, 44, 49, 50
SISD (Single Instruction stream, Single Data stream) device, 168, 169
SIU. See simple integer unit (SIU)
slow integer ALU unit, on Pentium 4, 157
slow IU/general FPU scheduler, on Pentium 4, 157
SMP (symmetric multiprocessing), 136
software
    early, custom-fitted to hardware, 71, 71
    moving hardware complexity to, 73–74
software branch hints, 147–148
source field, 12
source registers, 8, 21
spatial locality of code, 221–222
spatial locality of data, 220
speculative execution, 85–86
    path, 152–153, 153
    results stream version of, 264–270
SRAM (static RAM), for L1 cache, 217
SSE. See Streaming SIMD Extensions (SSE)
ST (stack top), 88
stack, 88
    vs. flat register file, 90
    swapping element with stack top, 91
stack execution unit, on Pentium M, 246
stack pointer register, 246
stack top (ST), 88
static branch prediction, 86
static power density, 238–239
static prediction, 147
static RAM (SRAM), for L1 cache, 217
static scheduling, in Pentium Pro, 94–95, 95
storage, 4–5
store address unit, on P6 back end, 102
store data unit, on P6 back end, 102
stored-program computer, 4–6
store instruction, 11
    micro-ops for, 267
    programmer and control of, 104
    register-type binary format for, 24–25
    translating into fused micro-ops, 242–243
    write-through for, 233
store port, on Pentium 4, 157
Streaming SIMD Extensions (SSE), 70, 262–263
    on Core Duo, 252
    floating-point performance with, 177
    implementation of, 176
    Intel’s goal for, 175
    on Pentium III, 108, 109
strings, ISA-level support for, 104–105
structural hazards, 76–77
sum vector (VT), 171
superscalar computers, 62
    challenges, 74–78
    expanding with execution units, 65–69
    and instructions per clock, 64–65
    latency and throughput, 117–118
SUV-building process, pipelining in, 40–43, 42
swapping stack element with stack top, 91
symmetric multiprocessing (SMP), 136
system unit, on PowerPC 603 and 603e, 121
T
tag RAM, 224
tags for cache
    direct mapping, 225–226, 226
    fully associative mapping with, 224, 225
    n-way set associative mapping, 226–230, 227
temporal locality of code and data, 222
throughput
    for floating-point instructions on Intel processors, 261
    for PowerPC 970 integer unit, 202
    of superscalar processors, 117–118
trace cache
    in Pentium 4, 149–154
        and instruction execution time, 150–151
        operation, 151–154
traces, 150
trace segment build mode, 151
transistors, 1
    density, and dynamic power density, 237
    number on chip, 62
translating program into machine language, 25
Turing machine, 4
two-way set associative mapping, 228, 228–229
    vs. direct mapping, 229
    vs. four-way, 229
U
U integer pipe, 87
unconditional branch, 30
underflow, 184
uops. See micro-operations
V
variable-length instructions, 105
vector ALU (VALU), 135
vector complex integer unit, on G4e, 173
vector computing
    on 32-bit vs. 64-bit processors, 183
    and AltiVec instruction set, 169–170
    G3 and, 132–133
    inter-element operations, 172–173
    intra-element operations, 171–172
    MMX (Multimedia Extensions), 174
    overview of, 168–169
    on PowerPC 970, 206–209, 207
vector execution units, 69, 168–177
vector floating-point multiplication,