Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
purpose CPU.
During the decades following the 8080, the number of transistors that
could be packed onto a single chip increased at a stunning pace. As CPU
designers had more and more transistors to work with when designing new
chips, they began to think up novel ways for using those transistors to increase computing performance on application code. One of the first things that
occurred to designers was that they could put more than one ALU on a chip
and have both ALUs working in parallel to process code faster. Since these
designs could do more than one scalar (or integer, for our purposes) operation
at once, they were called superscalar computers. The RS6000 from IBM was
released in 1990 and was the world’s first commercially available superscalar
CPU. Intel followed in 1993 with the Pentium, which, with its two ALUs,
brought the x86 world into the superscalar era.
For illustrative purposes, I’ll now introduce a two-way superscalar version
of the DLW-1, called the DLW-2 and illustrated in Figure 4-1. The DLW-2
has two ALUs, so it’s able to execute two arithmetic instructions in parallel
(hence the term two-way superscalar). These two ALUs share a single register
file, a situation that in terms of our file clerk analogy would correspond to
the file clerk sharing his personal filing cabinet with a second file clerk.
As you can probably guess from looking at Figure 4-1, superscalar
processing adds a bit of complexity to the DLW-2’s design, because it needs
new circuitry that enables it to reorder the linear instruction stream so that
some of the stream’s instructions can execute in parallel. This circuitry has to ensure that it’s “safe” to dispatch two instructions in parallel to the two execution units. But before I go on to discuss some reasons why it might not be
safe to execute two instructions in parallel, I should define the term I just
used: dispatch.
Chapter 4
[Figure 4-1: The superscalar DLW-2 (diagram labels: Main Memory, CPU, ALU1, ALU2)]
Notice that in Figure 4-2 I’ve renamed the second pipeline stage decode/dispatch.
This is because attached to the latter part of the decode stage is a bit of
dispatch circuitry whose job it is to determine whether or not two
instructions can be executed in parallel, in other words, on the same clock
cycle. If they can be executed in parallel, the dispatch unit sends one instruction
to the first integer ALU and one to the second integer ALU. If they can’t
be dispatched in parallel, the dispatch unit sends them in program order to
the first of the two ALUs. There are a few reasons why the dispatcher might
decide that two instructions can’t be executed in parallel, and we’ll cover
those in the following sections.
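The dispatch decision described above can be sketched in Python. This is only an illustration, not the DLW-2's actual logic: the `Instr` tuple, the `can_pair` rule (here, just a check that the second instruction does not read the register the first one writes), and the `dispatch` function are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical instruction format: opcode, two source registers, one destination.
Instr = namedtuple("Instr", ["op", "src1", "src2", "dest"])

def can_pair(first, second):
    """Assumed safety rule: the second instruction must not read
    the register that the first instruction writes."""
    return first.dest not in (second.src1, second.src2)

def dispatch(first, second):
    """Return a list of cycles; each cycle maps an ALU name to an instruction."""
    if can_pair(first, second):
        # Both instructions execute on the same clock cycle.
        return [{"ALU1": first, "ALU2": second}]
    # Otherwise send them in program order to the first ALU.
    return [{"ALU1": first}, {"ALU1": second}]

# Two independent instructions, then a pair where the second reads register C.
independent = dispatch(Instr("add", "A", "B", "C"), Instr("sub", "A", "B", "D"))
dependent = dispatch(Instr("add", "A", "B", "C"), Instr("sub", "C", "A", "D"))
print(len(independent), len(dependent))
```

The independent pair occupies a single cycle across both ALUs, while the dependent pair takes two cycles on ALU1.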
It’s important to note that even though the processor has multiple ALUs,
the programming model does not change. The programmer still writes to the
same interface, even though that interface now represents a fundamentally
different type of machine than the processor actually is; the interface represents
a sequential execution machine, but the processor is actually a parallel
execution machine. So even though the superscalar CPU executes instructions
in parallel, the illusion of sequential execution absolutely must be
maintained for the sake of the programmer. We’ll see some reasons why
this is so later on, but for now the important thing to remember is that main
memory still sees one sequential code stream, one data stream, and one
results stream, even though the code and data streams are carved up inside
the computer and pushed through the two ALUs in parallel.
Superscalar Execution
[Figure 4-2: The pipeline of the superscalar DLW-2. Front end: Fetch, Decode/Dispatch. Back end: Execute (ALU1, ALU2), Write.]
If the processor is to execute multiple instructions at once, it must be
able to fetch and decode multiple instructions at once. A two-way superscalar
processor like the DLW-2 can fetch two instructions at once from memory on
each clock cycle, and it can also decode and dispatch two instructions each
clock cycle. So the DLW-2 fetches instructions from memory in groups of
two, starting at the memory address that marks the beginning of the current
program’s code segment and incrementing the program counter to point
four bytes ahead each time a new pair of instructions is fetched. (Remember,
the DLW-2’s instructions are two bytes wide.)
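As a rough sketch of the fetch behavior just described, the following Python models memory as a byte string of two-byte instructions and advances the program counter by four bytes per fetch. The names `INSTR_BYTES`, `FETCH_WIDTH`, and `fetch_pairs` are invented for illustration, and the placeholder byte values are not real DLW-2 encodings.

```python
INSTR_BYTES = 2   # DLW-2 instructions are two bytes wide
FETCH_WIDTH = 2   # two instructions fetched per clock cycle

def fetch_pairs(code, start=0):
    """Yield (pc, first, second) per fetch; second is None if the code runs out."""
    pc = start
    step = INSTR_BYTES * FETCH_WIDTH   # program counter advances four bytes
    while pc < len(code):
        first = code[pc:pc + INSTR_BYTES]
        second = code[pc + INSTR_BYTES:pc + step] or None
        yield pc, first, second
        pc += step

# Six placeholder bytes stand in for three two-byte instructions.
code = bytes([0x10, 0x01, 0x20, 0x02, 0x30, 0x03])
fetches = list(fetch_pairs(code))
for pc, first, second in fetches:
    print(pc, first, second)
```

With an odd number of instructions, the last fetch carries only one instruction, which is a simple case of wasted fetch bandwidth.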
As you might guess, fetching and decoding two instructions at a time
complicates the way the DLW-2 deals with branch instructions. What if the
first instruction in a fetched pair happens to be a branch instruction that has
the processor jump directly to another part of memory? In this case, the
second instruction in the pair has to be discarded. This wastes fetch bandwidth
and introduces a bubble into the pipeline. There are other issues
relating to superscalar execution and branch instructions, and I’ll say more
about them in the section on control hazards.
Superscalar Computing and IPC
Superscalar computing allows a microprocessor to increase the number
of instructions per clock that it completes beyond one instruction per clock.
Recall that one instruction per clock was the maximum theoretical instruction
throughput for a pipelined processor, as described in “Instruction Throughput”
on page 53. Because a superscalar machine can have multiple instructions
in multiple write stages on each clock cycle, the superscalar machine can
complete multiple instructions per cycle. If we adapt Chapter 3’s pipeline
diagrams to take account of superscalar execution, they look like Figure 4-3.
[Figure 4-3: Superscalar execution and pipelining combined. Time axis: 1 ns to 7 ns. Elements: Stored Instructions; CPU pipeline stages Fetch, Decode, Execute, Write; Completed Instructions.]
In Figure 4-3, two instructions are added to the Completed Instructions
box on each cycle once the pipeline is full. The more ALU pipelines that a
processor has operating in parallel, the more instructions it can add to that
box on each cycle. Thus superscalar computing allows you to increase a processor’s
IPC by adding more hardware. There are some practical limits to how
many instructions can be executed in parallel, and we’ll discuss those later.
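To make the throughput claim concrete, here is a small Python sketch of an idealized pipeline with no stalls. The function names `cycles_to_complete` and `ipc`, and the assumption that every cycle fills all pipelines, are mine and do not model any real processor; the four stages match the fetch, decode, execute, and write stages of Figure 4-3.

```python
import math

def cycles_to_complete(n_instructions, width, stages=4):
    """Cycles for an ideal pipeline: fill time plus one cycle per group of `width`."""
    groups = math.ceil(n_instructions / width)
    return stages + groups - 1

def ipc(n_instructions, width, stages=4):
    """Average instructions completed per clock over the whole run."""
    return n_instructions / cycles_to_complete(n_instructions, width, stages)

scalar_ipc = ipc(1000, width=1)    # approaches 1 as the instruction count grows
two_way_ipc = ipc(1000, width=2)   # approaches 2 as the instruction count grows
print(round(scalar_ipc, 3), round(two_way_ipc, 3))
```

Under these assumptions the average IPC falls slightly short of the pipeline width because of the cycles spent filling the pipeline.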
Expanding Superscalar Processing with Execution Units
Most modern processors do more with superscalar execution than just adding
a second ALU. Rather, they distribute the work of handling different
types of instructions among different types of execution units. An execution
unit is a block of circuitry in the processor’s back end that executes a certain
category of instruction. For instance, you’ve already met the arithmetic logic
unit (ALU), an execution unit that performs arithmetic and logical operations
on integers. In this section we’ll take a closer look at the ALU, and
you’ll learn about some other types of execution units for non-integer arithmetic
operations, memory accesses, and branch instructions.
Basic Number Formats and Computer Arithmetic
The kinds of numbers on which modern microprocessors operate can be
divided into two main types: integers (aka fixed-point numbers) and
floating-point numbers. Integers are simply whole numbers of the type with which
you first learn to count in grade school. An integer can be positive, negative,
or zero, but it cannot, of course, be a fraction. Integers are also called
fixed-point numbers because an integer’s decimal point does not move. Examples
of integers are 1, 0, 500, 27, and 42. Arithmetic and logical operations
involving integers are among the simplest and fastest operations that a
microprocessor performs. Applications like compilers, databases, and word processors
make heavy use of integer operations, because the numbers they deal with
are usually whole numbers.
A floating-point number is a decimal number that represents a fraction.
Examples of floating-point numbers are 56.5, 901.688, and 41.9999. As you
can see from these three numbers, the decimal point “floats” around and
isn’t fixed in one place, hence the name. The number of places behind the
decimal point determines a floating-point number’s accuracy, so
floating-point numbers are often approximations of fractional values. Arithmetic and
logical operations performed on floating-point numbers are more complex
and, hence, slower than their integer counterparts. Because floating-point
numbers are approximations of fractional values, and the real world is kind
of approximate and fractional, floating-point arithmetic is commonly found
in real world–oriented applications like simulations, games, and
signal-processing applications.
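A short Python example illustrates the contrast between exact integer arithmetic and approximate floating-point arithmetic. Python's `int` and `float` types stand in here for a processor's integer and floating-point formats; this is an analogy for illustration, and the variable names are my own.

```python
import math

# Integer arithmetic is exact.
int_sum = 1 + 2

# Floating-point values are stored as approximations, so a sum of
# fractions may differ slightly from the value you would expect.
float_sum = 0.1 + 0.2
exact_match = (float_sum == 0.3)
close_enough = math.isclose(float_sum, 0.3)

print(int_sum, float_sum, exact_match, close_enough)
```

This is why floating-point results are usually compared within a tolerance rather than tested for exact equality.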
Both integer and floating-point numbers can themselves be divided into
one of two types: scalars and vectors. Scalars are values that have only one
numerical component, and they’re best understood in contrast with vectors.
Briefly, a vector is a multicomponent value, most often seen as an ordered
sequence or array of numbers. (Vectors are covered in detail in “The Vector
Execution Units” on page 168.) Here are some examples of different types of
vectors and scalars:
          Integer                      Floating-Point
Scalar    14                           1.01
          −500                         15.234
          37                           −0.0023
Vector    {5, −7, −9, 8}               {0.99, −1.1, 3.31}
          {1,003, 42, 97, 86, 97}      {50.01, 0.002, −1.4, 1.4}
          {234, 7, 6, 1, 3, 10, 11}    {5.6, 22.3, 44.444, 76.01, 9.9}
Returning to the code/data distinction, we can say that the data
stream consists of four types of numbers: scalar integers, scalar
floating-point numbers, vector integers, and vector floating-point numbers. (Note
that even memory addresses fall into one of these four categories: scalar
integers.) The code stream, then, consists of instructions that operate on
all four types of numbers.
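As a minimal sketch of the scalar/vector distinction, the following Python uses plain numbers for scalars and lists for vectors, with values taken from the table above. The helper `vector_add` is a hypothetical name introduced only for this example; it says nothing about how vector hardware actually works.

```python
# Scalars: one numerical component each.
scalar_int = 14
scalar_float = 1.01

# Vectors: ordered sequences of components.
vector_int = [5, -7, -9, 8]
vector_float = [0.99, -1.1, 3.31]

def vector_add(a, b):
    """Add two equal-length vectors component by component."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x + y for x, y in zip(a, b)]

doubled = vector_add(vector_int, vector_int)
print(scalar_int + scalar_int, doubled)
```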
The kinds of operations that can be performed on the four types of
numbers fall into two main categories: arithmetic operations and logical
operations. When I first introduced arithmetic operations in Chapter 1,
I lumped them together with logical operations for the sake of convenience.
At this point, though, it’s useful to distinguish the two types of operations
from one another:
• Arithmetic operations are operations like addition, subtraction,
  multiplication, and division, all of which can be performed on any
  type of number.
• Logical operations are Boolean operations like AND, OR, NOT, and
  XOR, along with bit shifts and rotates. Such operations are performed
  on scalar and vector integers, as well as on the contents of
  special-purpose registers like the processor status word (PSW).
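The two categories in the list above can be illustrated with Python's integer operators. The operand values and the four-bit mask used for NOT are arbitrary choices for this sketch.

```python
a, b = 12, 10              # binary 1100 and 1010

# Arithmetic operations
total = a + b              # 22
difference = a - b         # 2
product = a * b            # 120

# Logical operations on the same integers
bit_and = a & b            # binary 1000
bit_or = a | b             # binary 1110
bit_xor = a ^ b            # binary 0110
bit_not = ~a & 0b1111      # NOT, masked to four bits: binary 0011
shifted = a << 1           # shift left by one bit: binary 11000

print(total, difference, product)
print(bin(bit_and), bin(bit_or), bin(bit_xor), bin(bit_not), bin(shifted))
```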
The types of operations performed on these types of numbers can be
broken down as illustrated in Figure 4-4.
[Figure 4-4: Number formats and operation types. Diagram labels: Floating-Point, Integer, Vector Operations, Scalar Operations, Arithmetic Operations, Logic Operations.]
As you make your way through the rest of the book, you may want to
refer back to this section occasionally. Different microprocessors divide these
operations among different execution units in a variety of ways, and things
can easily get confusing.
Arithmetic Logic Units
On early microprocessors, as on the DLW-1 and DLW-2, all integer arithmetic
and logical operations were handled by the ALU. Floating-point operations
were executed by a companion chip, commonly called an arithmetic coprocessor,
that was attached to the motherboard and designed to work in conjunction
with the microprocessor. Eventually, floating-point capabilities were integrated
onto the CPU as a separate execution unit alongside the ALU.
Consider the Intel Pentium processor depicted in Figure 4-5, which
contains two integer ALUs and a floating-point ALU, along with some
other units that we’ll describe shortly.
[Figure 4-5: The Intel Pentium. Front end: Instruction Fetch, Branch Unit (BU), Decode, Control Unit. Back end: Integer Unit with SIU (V) and CIU (U), Floating-Point Unit (FPU), Write.]
This diagram is a variation on Figure 4-2, with the execute stage replaced