Authors: Jon Stokes
decodes the larger x86 instructions, as do Intel’s Pentium III and Pentium 4.
The key to understanding Figure 4-7 is that the blue layer represents a
layer of abstraction that hides the complexity of the underlying hardware
from the programmer. The blue layer is not a hardware layer (that’s the
gray one) and it’s not a software layer (that’s the peach one), but it’s a conceptual layer. Think of it like a user interface that hides the complexity
72
Chapter 4
of an operating system from the user. All the user needs to know to use the
machine is how to close windows, launch programs, find files, and so on. The
UI (and by this I mean the WIMP conceptual paradigm—windows, icons,
menus, pointer—not the software that implements the UI) exposes the
machine’s power and functionality to the user in a way that he or she can
understand and use. And whether that UI appears on a PDA or on a desktop
machine, the user still knows how to use it to control the machine.
The main drawback to using microcode to implement an ISA is that
the microcode engine was, in the beginning, slower than direct decoding.
(Modern microcode engines are about 99 percent as fast as direct execution.)
However, the ability to separate ISA design from microarchitectural implementation was so significant for the development of modern computing that
the small speed hit incurred was well worth it.
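As a rough sketch of the idea (the instruction names and micro-op sequences below are invented for illustration and are not taken from any real microcode ROM), a microcode engine can be modeled as a lookup table that expands each complex ISA-level instruction into a sequence of simple operations the hardware executes directly:

```python
# Hypothetical microcode ROM: each complex ISA instruction maps to a
# sequence of simple micro-ops that the execution hardware understands.
MICROCODE_ROM = {
    # A memory-to-memory add expands into load/load/add/store micro-ops.
    "add_mem": ["load r1, [src1]",
                "load r2, [src2]",
                "add r1, r2, r1",
                "store r1, [dst]"],
    # A simple register-to-register add needs only one micro-op.
    "add_reg": ["add r1, r2, r3"],
}

def decode(instruction):
    """Translate one ISA-level instruction into its micro-op sequence."""
    return MICROCODE_ROM[instruction]

print(decode("add_mem"))
```

The extra table lookup is where the original speed hit came from: every instruction pays for a trip through the ROM before any real work happens.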
The advent of the reduced instruction set computing (RISC) movement in the 1970s saw a couple of changes to the scheme described previously. First and foremost, RISC was all about throwing stuff overboard in the name of speed.
foremost, RISC was all about throwing stuff overboard in the name of speed.
So the first thing to go was the microcode engine. Microcode had allowed ISA
designers to get elaborate with instruction sets, adding in all sorts of complex and specialized instructions that were intended to make programmers’ lives
easier but that were in reality rarely used. More instructions meant that you
needed more microcode ROM, which in turn meant larger CPU die sizes,
higher power consumption, and so on. Since RISC was more about less, the
microcode engine got the ax. RISC reduced the number of instructions in
the instruction set and reduced the size and complexity of each individual
instruction so that this smaller, faster, and more lightweight instruction set
could be more easily implemented directly in hardware, without a bulky
microcode engine.
While RISC designs went back to the old method of direct execution of
instructions, they kept the concept of the ISA intact. Computer architects
had by this time learned the immense value of not breaking backward compatibility with old software, and they weren’t about to go back to the bad old
days of marrying software to a single product. So the ISA stayed, but in a
stripped-down, much simplified form that enabled designers to implement
directly in hardware the same lightweight ISA over a variety of different
hardware types.
NOTE
Because the older, non-RISC ISAs featured richer, more complex instruction sets, they were labeled complex instruction set computing (CISC) ISAs in order to distinguish them from the new RISC ISAs. The x86 ISA is the most popular example of a CISC ISA, while PowerPC, MIPS, and Arm are all examples of popular RISC ISAs.
Moving Complexity from Hardware to Software
RISC machines were able to get rid of the microcode engine and still retain
the benefits of the ISA by moving complexity from hardware to software.
Where the microcode engine made CISC programming easier by providing
programmers with a rich variety of complex instructions, RISC programmers
depended on high-level languages, like C, and on compilers to ease the
burden of writing code for RISC ISAs’ restricted instruction sets.
Superscalar Execution
73
Because a RISC ISA’s instruction set is more limited, it’s harder to write
long programs in assembly language for a RISC processor. (Imagine trying to
write a novel while restricting yourself to a fifth grade vocabulary, and you’ll get the idea.) A RISC assembly language programmer may have to use many
instructions to achieve the same result that a CISC assembly language programmer can get with one or two instructions. The advent of high-level
languages (HLLs), like C, and the increasing sophistication of compiler
technology combined to effectively eliminate this programmer-unfriendly
aspect of RISC computing.
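To make the vocabulary analogy concrete (using invented DLW-style mnemonics, not any real ISA's encoding), here is how a single high-level statement might compile down on each kind of machine:

```python
# One high-level statement: mem[dst] = mem[a] + mem[b]
# A hypothetical CISC ISA can encode it as one memory-to-memory add;
# a load/store RISC ISA must spell out every step explicitly.
cisc_program = [
    "add [a], [b], [dst]",   # one complex instruction does it all
]
risc_program = [
    "load  A, [a]",          # fetch the first operand into a register
    "load  B, [b]",          # fetch the second operand
    "add   A, B, C",         # register-to-register add
    "store C, [dst]",        # write the result back to memory
]
print(len(cisc_program), len(risc_program))
```

The compiler, not the programmer, absorbs the cost of that four-to-one expansion, which is why HLLs made the tradeoff tolerable.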
The ISA was and is still the optimal solution to the problem of easily and
consistently exposing hardware functionality to programmers so that software can be used across a wide range of machines. The greatest testament
to the power and flexibility of the ISA is the longevity and ubiquity of the
world’s most popular and successful ISA: the x86 ISA. Programs written for the Intel 8086, a chip released in 1978, can run with relatively little modification on the latest Pentium 4. However, on a microarchitectural level, the
8086 and the Pentium 4 are as different as the Ford Model T and the Ford
Mustang Cobra.
Challenges to Pipelining and Superscalar Design
I noted previously that there are conditions under which two arithmetic instructions cannot be “safely” dispatched in parallel for simultaneous execution by the DLW-2’s two ALUs. Such conditions are called hazards, and they can all be placed in one of three categories:

- Data hazards
- Structural hazards
- Control hazards
Because pipelining is a form of parallel execution, these three types of
hazards can also hinder pipelined execution, causing bubbles to occur in
the pipeline. In the following three sections, I’ll discuss each of these types
of hazards. I won’t go into a huge amount of detail about the tricks that
computer architects use to eliminate them or alleviate their effects, because
we’ll discuss those when we look at specific microprocessors in the next few
chapters.
Data Hazards
The best way to explain what a data hazard is is to illustrate one. Consider Program 4-1:

Line #   Code          Comments
1        add A, B, C   Add the numbers in registers A and B and store the result in C.
2        add C, D, D   Add the numbers in registers C and D and store the result in D.

Program 4-1: A data hazard
Because the second instruction in Program 4-1 depends on the outcome of the first instruction, the two instructions cannot be executed simultaneously. Rather, the add in line 1 must finish first, so that the result is available in C for the add in line 2.
Data hazards are a problem for both superscalar and pipelined execution.
If Program 4-1 is run on a superscalar processor with two integer ALUs, the
two add instructions cannot be executed simultaneously by the two ALUs.
Rather, the ALU executing the add in line 1 has to finish first, and then the
other ALU can execute the add in line 2. Similarly, if Program 4-1 is run on a
pipelined processor, the second add has to wait until the first add completes
the write stage before it can enter the execute phase. Thus the dispatch
circuitry has to recognize the add in line 2’s dependence on the add in line 1,
and keep the add in line 2 from entering the execute stage until the add in line 1’s result is available in register C.
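That dependence check can be sketched as a comparison between the first instruction's destination register and the second instruction's source registers. The three-operand parsing below follows the DLW convention used in Program 4-1, where the last operand is the destination; it is a toy model of the dispatch logic, not a description of real circuitry.

```python
def parse(instr):
    """Split 'add A, B, C' into (opcode, [src1, src2], dest)."""
    op, operands = instr.split(maxsplit=1)
    regs = [r.strip() for r in operands.split(",")]
    return op, regs[:2], regs[2]

def has_data_hazard(first, second):
    """True if `second` reads a register that `first` writes."""
    _, _, dest = parse(first)
    _, sources, _ = parse(second)
    return dest in sources

# Program 4-1: the second add reads C, which the first add writes.
print(has_data_hazard("add A, B, C", "add C, D, D"))  # True
```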
Most pipelined processors can do a trick called forwarding that’s aimed at alleviating the effects of this problem. With forwarding, the processor takes the result of the first add from the ALU’s output port and feeds it directly back into the ALU’s input port, bypassing the register-file write stage. Thus the second add has to wait for the first add to finish only the execute stage, and not the execute and write stages, before it’s able to move into the execute stage itself.
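The cycle saved by forwarding can be shown with a toy cycle-count model. A simplified four-stage fetch/decode/execute/write pipeline is assumed here, with the dependent instruction issued one cycle behind the one that produces its operand:

```python
def stall_cycles(forwarding):
    """Bubbles inserted before a dependent instruction can execute,
    in a toy 4-stage pipeline: fetch(1), decode(2), execute(3), write(4)."""
    execute_stage, write_stage = 3, 4
    # Without forwarding, the result is visible only after the write stage;
    # with forwarding, it is available as soon as execute finishes.
    result_ready = execute_stage if forwarding else write_stage
    # Issued one cycle behind, the dependent instruction would naturally
    # reach execute in cycle execute_stage + 1.
    natural_arrival = execute_stage + 1
    return max(0, result_ready + 1 - natural_arrival)

print(stall_cycles(forwarding=False))  # 1 bubble without forwarding
print(stall_cycles(forwarding=True))   # 0 bubbles with forwarding
```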
Register renaming is a trick that helps overcome data hazards on superscalar machines. Since any given machine’s programming model often specifies fewer registers than can be implemented in hardware, a given microprocessor implementation often has more registers than the number specified in the programming model. To get an idea of how this group of additional registers is used, take a look at Figure 4-8.
In Figure 4-8, the DLW-2’s programmer thinks that he or she is using a
single ALU with four architectural general-purpose registers—A, B, C, and D—
attached to it, because four registers and one ALU are all that the DLW
architecture’s programming model specifies. However, the actual superscalar
DLW-2 hardware has two ALUs and 16 microarchitectural GPRs implemented
in hardware. Thus the DLW-2’s register rename logic can map the four architectural registers to the available microarchitectural registers in such a way as to prevent false register name conflicts.
In Figure 4-8, an instruction that’s being executed by IU1 might think
that it’s the only instruction executing and that it’s using registers A, B, and C, but it’s actually using rename registers 2, 5, and 10. Likewise, a second instruction executing simultaneously with the first instruction but in IU2 might also
think that it’s the only instruction executing and that it has a monopoly on
the register file, but in reality, it’s using registers 3, 7, 12, and 16. Once both IUs have finished executing their respective instructions, the DLW-2’s write-back logic takes care of transferring the contents of the rename registers back
to the four architectural registers in the proper order so that the program’s
state can be changed.
Superscalar Execution
75
[Figure 4-8: Register renaming. The figure contrasts the DLW programming model (architecture), a single ALU attached to the four general-purpose registers A through D, with the hardware implementation (microarchitecture), in which a 16-entry rename buffer supplies separate groups of physical registers to stand in for A through D on behalf of IU1 and IU2.]
Let’s take a quick look at a false register name conflict in Program 4-2.

Line #   Code          Comments
1        add A, B, C   Add the numbers in registers A and B and store the result in C.
2        add D, B, A   Add the numbers in registers B and D and store the result in A.

Program 4-2: A false register name conflict
In Program 4-2, there is no data dependency, and both add instructions
can take place simultaneously except for one problem: the first add reads the
contents of A for its input, while the second add writes a new value into A as its output. Therefore, the first add’s read absolutely must take place before the second add’s write. Register renaming solves this register name conflict by
allowing the second add to write its output to a temporary register; after both
adds have executed in parallel, the result of the second add is written from
that temporary register into the architectural register A after the first add has finished executing and written back its own results.
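A minimal sketch of the rename step follows. The mapping scheme here is invented for illustration; real rename logic also has to free and reclaim physical registers, which this toy version ignores.

```python
import itertools

class Renamer:
    """Map architectural register names onto fresh physical registers."""
    def __init__(self):
        self.free = itertools.count(1)   # physical register ids 1, 2, 3, ...
        self.map = {}                    # architectural name -> physical id

    def read(self, reg):
        # A read uses whichever physical register currently holds the value.
        return self.map.setdefault(reg, next(self.free))

    def write(self, reg):
        # A write gets a *new* physical register, so it cannot clobber a
        # value that an earlier in-flight instruction still needs to read.
        self.map[reg] = next(self.free)
        return self.map[reg]

r = Renamer()
# Program 4-2, line 1: add A, B, C (reads A and B, writes C)
first_read_of_A = r.read("A"); r.read("B"); r.write("C")
# Program 4-2, line 2: add D, B, A (reads D and B, writes A)
r.read("D"); r.read("B"); second_write_of_A = r.write("A")
# The write to A lands in a different physical register than the read of A,
# so the two adds can safely execute in parallel.
print(first_read_of_A != second_write_of_A)  # True
```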
Structural Hazards
Program 4-3 contains a short code example that shows superscalar execution
in action. Assuming the programming model presented for the DLW-2,
consider the following snippet of code.
Line #   Code          Comments
15       add A, B, B   Add the numbers in registers A and B and store the result in B.
16       add C, D, D   Add the numbers in registers C and D and store the result in D.

Program 4-3: A structural hazard
At first glance, there appears to be nothing wrong with Program 4-3.
There’s no data hazard, because the two instructions don’t depend on each
other. So it should be possible to execute them in parallel. However, this
example presumes that both ALUs share the same group of four registers.
But in order for the DLW-2’s register file to accommodate multiple ALUs
accessing it at once, it needs to be different from the DLW-1’s register file in one important way: it must be able to accommodate two simultaneous writes.
Otherwise, executing Program 4-3’s two instructions in parallel would trigger
what’s called a structural hazard, where the processor doesn’t have enough resources to execute both instructions at once.
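The port-count check can be sketched as follows. The single-write-port case models a DLW-1-style register file; both the parsing convention (last operand is the destination) and the notion of counting only writes are simplifications for illustration.

```python
def dest_register(instr):
    """In the DLW-style 'add src1, src2, dest' format, the last operand
    is the destination register."""
    return instr.split(",")[-1].strip()

def structural_hazard(instructions, write_ports):
    """True if issuing these instructions together needs more register-file
    write ports than the hardware provides."""
    writes = [dest_register(i) for i in instructions]
    return len(writes) > write_ports

pair = ["add A, B, B", "add C, D, D"]          # Program 4-3
print(structural_hazard(pair, write_ports=1))  # True: two writes, one port
print(structural_hazard(pair, write_ports=2))  # False: two ports suffice
```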
The Register File
In a superscalar design with multiple ALUs, it would take an enormous
number of wires to connect each register directly to each ALU. This problem
gets worse as the number of registers and ALUs increases. Hence, in superscalar designs with a large number of registers, a CPU’s registers are grouped
together into a special unit called a register file. This unit is a memory array, much like the array of cells that makes up a computer’s main memory, and