Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
of front-side bus, 205
to software, 73–74
importance of, 137–138
complex/slow integer execution units, on G4e, 163
market focus on, 140
for Pentium 4, 167
complex/slow integer instructions, 163
for Pentium 4, vs. G4e, 176
for Pentium 4 integer units, 164
compulsory cache miss, 219
cmp instruction, cycles to execute on PowerPC, 117
computer
costs of systems, 62
code, 2
definition, 3, 4
spatial locality of, 221–222
general-purpose, 2
temporal locality of, 222
memory hierarchy, 82
code names for Intel processors, 236–237
power efficiency, 237–239
with register file, 9
code segment descriptor, 191
stored-program, 4–6
code stream, 2, 11–14
computing
flow of, 5
calculator model of, 2, 2–3
coding, 21
file-clerk model of, 3–7
collision, 226
conceptual layer, 72
commands, reusing prerecorded sequences, 6
condition codes, 203
condition register unit (CRU), 129
commit cycle, in Pentium Pro pipeline, 101
issue queue for, 210–211
on PowerPC, 121
commit phase, 128
on PowerPC 604, 125–126
commit unit, on PowerPC 603, 122
on PowerPC 970, 198, 202
common interleaved queues, 210
conditional branch, 30–34, 78
compatibility mode, in x86-64, 190, 191
conflict misses, 226
Conroe, 255
compilers, for high-level languages (HLLs), 104
control hazards, 78
control operand, in AltiVec vector operation, 171
complete stage, for G4e, 147
completion buffer availability rule, 128
control unit, in Pentium, 85
control vector, in AltiVec vector operation, 171
completion phase
in instruction lifecycle, 98
Coppermine, 109
on PowerPC 604, 128
core logic chipset, 204, 204–205
278
INDEX
core microarchitecture of processor, 248
direct mapping, 225–226, 226
vs. two-way set associative mapping, 229
CPU (central processing unit), 1. See also microprocessor
dirty blocks in cache, 230
CPU clock cycle, 44
dispatch group, 197
instruction completion per, 53–54
dispatch queue, on PowerPC 970, 199
vs. memory and bus clock cycles, 216
dispatch rules, on PowerPC 970, 198–199
in pipelined processor, 47
divw instruction, cycles to execute on PowerPC, 117
cracked instruction, 197, 198, 206
CRU. See condition register unit (CRU)
DLW-1 hypothetical computer
arithmetic instruction format, 12
cryptography, 185
example program, 13–14
machine language on, 20–21
D
memory instruction format, 13
DLW-2 hypothetical computer
data
decode/dispatch logic, 70
comparison of storage options, 217
pipeline, 64
two-way superscalar version, 62–64, 63
spatial locality of, 220
temporal locality of, 222
Dothan, 236, 251
databases, back-end servers for, 186
fetch buffer, 239
data bus, 5
double data rate (DDR) front-side bus, 205
data cache (D-cache), 81, 223
data hazards, 74–76
on PowerPC 970, 195
data parallelism, 168
double-speed execution ports, on Pentium 4, 157
data segment, 16–17
data stream, 2
drive stages, on Pentium 4, 155
flow of, 5
dual-core processor, 249
daughtercard, 109
dynamic branch prediction, 86–87, 147, 244
D-cache (data cache), 81, 223
DDR. See double data rate (DDR) front-side bus
dynamic execution, 96
dynamic power density, 237–238, 238
decode/dispatch stage, 63
for G4e, 145–146
dynamic range, 183–184
decode phase of instruction, 37
benefits of increased, 184–185
for Core 2 Duo, 256, 257–258
dynamic scheduling
for Pentium M, 240–244, 241, 242
with buffers, 97
Decode stages in Pentium pipeline, 84–85
instruction’s lifecycle phases, 97
decoding x86 instructions, in P6 pipeline, 101
E
destination field, 12
Eckert, J. Presper, 6n
destination register, 8
binary encoding, 21
EDVAC (Electronic Discrete Variable Automatic Computer), 6n
digital image, data parallelism for inverting, 169
embedded processors, 133
emulation, 72
fetch groups, on PowerPC 970, 196
encryption schemes, 185
fetch phase of instruction, 37
EPIC (Explicitly Parallel Instruction Computing), 180
for Core 2 Duo, 256, 256–257
for Pentium M, 239–240
evicted data from cache, 219
Feynman, Richard, 3
eviction policy for cached data, 230–232
fields in instruction, 12
FIFO (first in, first out) data structure, 88
execute mode for trace cache, 151
execute stage of instruction, 37
file-clerk model of computing, 3–7
for G4e, 146
expanded, 9–10
in Pentium 4 pipeline, 158
refining, 6–7
in Pentium pipeline, 84–85
FILO (first in, last out) data structure, 88
in Pentium Pro pipeline, 101
execution. See also program execution time
filter/mod operand, 171
finish pipeline stage, on PPC CR, 163, 164
phases, 39
time requirements, and completion rate, 51–52
FIQ. See floating-point issue queue (FIQ)
execution ports, on Pentium 4, 157
first in, first out (FIFO) data structure, 88
execution units, 17
empty slots, 198
first in, last out (FILO) data structure, 88
expanding superscalar processing with, 65–69
fixed-point ALU, on PowerPC 601, 115
micro-op passed to, 156
on Pentium, 83
fixed-point numbers, 66
Explicitly Parallel Instruction Computing (EPIC), 180
flags stage, in Pentium 4 pipeline, 158
flat floating-point register file, 88
F
flat register file, vs. stack, 90
floating-point ALUs, on Pentium, 88–91
fabs instruction, cycles to execute, 118
floating-point applications, Pentium 4 design for, 165
fadd instruction
cycles to execute, 118
on PowerPC 970, 212
throughput on Intel processors, 261
floating-point control word (FPCW), in Intel Core Duo, 252
floating-point data type, on 32-bit vs. 64-bit processors, 183
fall-through, 114
floating-point execution unit (FPU), 68, 165–168
false aliasing, 268
fast integer ALU1 and ALU2 units, on Pentium 4, 157
on Core 2 Duo, 260–262
on G4, 134
fast IU scheduler, on Pentium 4, 156
on G4e, 166–167
fdiv instruction, cycles to execute, 118
on Pentium, 69
on Pentium 4, 167–168
fetch buffer, on Intel Core Duo, 239
on PowerPC 601, 115–116
fetch-execute loop, 28–29
on PowerPC 750, 130
and branch instructions, 32
on PowerPC 970, 205–206
floating-point instructions
latencies for G4, 118
throughput on Intel processors, 261
front-end bus, on PowerPC 970, 203–205
fsub instruction, cycles to execute, 118
floating-point issue queue (FIQ)
for G4e, 146
on PowerPC 970, 209, 209–211
fully associative mapping, 224, 225
fused multiply-add (fmadd) instruction. See fmadd instruction
floating-point numbers, 66
fxch instruction, 91, 167
floating-point/SSE/MMX ALU, on Pentium 4, 158
G
floating-point/SSE move unit, on Pentium 4, 157
G3 (Apple). See PowerPC (PPC) 750 (G3)
floating-point vector processing, in Pentium III, 108
G4. See PowerPC (PPC) 7400 (G4)
flushing pipeline, 86
G4e. See Motorola G4e
performance impact of, 60
games, 187
fmadd instruction, 116
cycles to execute, 118
gaps in pipeline, 54–55, 55. See also bubbles in pipeline
on G4e, 166
gates, 1
on PowerPC 603, 121
GCT. See group completion table (GCT)
fmul instruction
cycles to execute, 118
throughput on Intel processors, 261
general issue queue (GIQ), for G4e, 146
general-purpose registers (GPRs), 17
forward branch, 30
and bit count, 181
forwarding by pipelined processors, 75
on PPC ISA, 164
gigahertz race, 110
four-way set associative mapping, 226–227, 228
GIQ (general issue queue), for G4e, 146
vs. two-way, 229
global predictor table, on PowerPC 970, 196
FPCW (floating-point control word), in Intel Core Duo, 252
GPRs. See general-purpose registers (GPRs)
FPU. See floating-point execution unit (FPU)
group completion table (GCT), 210
internal fragmentation, 199
on PowerPC 970, 198–199
fractional values, approximations of, 66
group dispatch on PowerPC 970, 197, 199
front end, 38, 38
for Pentium Pro, 94–100
conclusions, 199–200
for PowerPC 601, 113–115
performance implications, 211–213
instruction queue, 113–114
instruction scheduling, 114–115
for PowerPC 603, 122
H
for PowerPC 604, 126–128
Hammer processor architecture, 180
for PowerPC 750, 130–132
hard drives
for PowerPC 970, 194–195
vs. other data storage, 217
front-end branch target buffer, 87, 147
page file on, 218
hardware
execution time, trace cache and, 150–151
ISA implementation by, 70
moving complexity to software, 73–74
fetch, 28
fetch logic, on PowerPC 970, 196
hardware loop buffer, 240
fetch stages, for G4e, 145
Harvard architecture level 1 cache, 6, 81
field, 12
latency, pipeline stalls and, 57–58
hazards, 74
pool, for Pentium 4, 149, 159
control, 78
queue
data, 74–76
on Core, 256
structural, 76–77
on PowerPC 601, 113–114
high-level languages (HLLs), compilers for, 104
register, 26
loading, 28
scheduling, on PowerPC 601, 114–115
I
IA-64, 180
window
IBM. See also PowerPC (PPC)
for Core 2 Duo, 254
AltiVec development, 207
for G4, 134
POWER4 microarchitecture, 194
for Pentium 4, 141, 149, 159
RS6000, 62
for Pentium Pro, 93
System/360, 71
for PowerPC 603, 122
VMX, 70, 135, 253
for PowerPC 604, 126–128
I-cache (instruction cache), 78, 81, 223
for PowerPC 750, 130–132
instruction-level parallelism (ILP), 141, 196
idiv instruction, 253
ILP (instruction-level parallelism), 141, 196
instructions, 11
basic flow, 38–40, 39
immediate-type instruction format, 22
first, microprocessor hard-wired to fetch, 34
immediate values, in arithmetic instructions, 14–16
general types, 11–12
lifecycle of, 36–37
indirect branch predictor, on Pentium M, 245
phases, 45–46
load latency, 78
infix expressions, 89
parallel execution of, 63
in-order instruction dispatch rule, 127
per clock, and superscalar computers, 64–65
input, 2
preventing execution out of order, 96
input operands, 8
input-output (I/O) unit, 26
rules of dispatch on PowerPC 604, 127–128
instruction
bus, 5
throughput, 53–54
cache (I-cache), 78, 81, 223
writing results back to register, 98
completion rate
instruction set, 22, 69–70
of microprocessor, 45, 51
instruction set architecture (ISA)