Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (12 page)

Read Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture Online

Authors: jon stokes

Tags: #Computers, #Systems Architecture, #General, #Microprocessors

BOOK: Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture
11.77Mb size Format: txt, pdf, ePub

1hr

2hr

3hr

4hr

5hr

6hr

7hr

Factory

Floor

Completed

SUVs

Figure 3-5: The lifecycle of an SUV in a pipelined factory

So as the assembly line begins to fill up with SUVs in various stages of

production, more of the crews are put to work simultaneously until all of the

crews are working on a different vehicle in a different stage of production.

(Of course, this is how most of us nowadays in the post-Ford era expect a

good, efficient assembly line to work.) If we can keep the assembly line full

and keep all five crews working at once, we can produce one SUV every hour:

a fivefold improvement in SUV completion rate over the previous comple-

tion rate of one SUV every five hours. That, in a nutshell, is pipelining.

42

Chapter 3

While the total amount of time that each individual SUV spends in pro-

duction has not changed from the original five hours, the rate at which the

factory as a whole completes SUVs has increased drastically. Furthermore,

the rate at which the factory can fulfill the Army’s orders for batches of

SUVs has increased drastically, as well. Pipelining works its magic by making

optimal use of already existing resources. We don’t need to speed up each

individual stage of the production process, nor do we need to drastically

increase the amount of resources that we throw at the problem; all that’s

necessary is that we get more work out of resources that are already there.

W H Y T H E S U V F A C T O R Y ?

The preceding discussion uses a factory analogy to explain pipelining. Other books use simpler analogies, like doing laundry, for instance, to explain this technique, but there are a few reasons why I chose a more elaborate and lengthy analogy to

illustrate what is a relatively simple concept. First, I use factory analogies throughout this book, because assembly line-based factories are easy for readers to visualize and there’s plenty of room for filling out the mental image in interesting ways in order to make a variety of related points. Second, and perhaps even more importantly, the many scheduling-, queuing- and resource management–related problems

that factory designers face have direct analogies in computer architecture. In many cases, the problems and solutions are exactly the same, simply translated into a different domain. (Similar queuing-related problem/solution pairs also crop up in the service industry, which is why analogies involving supermarkets and fast food restaurants are also favorites of mine.)

Applying the Analogy

Bringing our discussion back to microprocessors, it should be easy to see

how this concept applies to the four phases of an instruction’s lifecycle. Just

as the owners of the factory in our analogy wanted to increase the number of

SUVs that the factory could finish in a given period of time, microprocessor

designers are always looking for ways to increase the number of instructions

that a CPU can complete in a given period of time. When you recall that a

program is an ordered sequence of instructions, it becomes clear that

increasing the number of instructions executed per unit time is one way

to decrease the total amount of time that it takes to execute a program.

(The other way to decrease a program’s execution time is to decrease the

number of instructions in the program, but this chapter won’t address that

approach until later.) In terms of our analogy, a program is like an order

of SUVs from the military; just like increasing our factory’s output of SUVs

per hour enabled us to fill orders faster, increasing a processor’s
instruction
completion rate
(the number of instructions completed per unit time) enables it to run programs faster.

A Non-Pipelined Processor

The previous chapter briefly described how the simple processors described

so far (e.g., the DLW-1) use the clock to time its internal operations. These

Pipelined Execution

43

non-pipelined processors work on one instruction at a time, moving each

instruction through all four phases of its lifecycle during the course of

one clock cycle. Thus non-pipelined processors are also called
single-cycle

processors, because all instructions take exactly one clock cycle to execute

fully (i.e., to pass through all four phases of their lifecycles).

Because the processor completes instructions at a rate of one per clock

cycle, you want the CPU’s clock to run as fast as possible so that the processor’s instruction completion rate can be as high as possible.

Thus you need to calculate the maximum amount of time that it takes

to complete an instruction and make the clock cycle time equivalent to that

length of time. It just so happens that on the hypothetical example CPU, the

four phases of the instruction’s lifecycle take a total of 4 ns to complete. Therefore, you should set the duration of the CPU clock cycle to 4 ns, so that the

CPU can complete the instruction’s lifecycle—from
fetch
to
write-back
—in a single clock. (A CPU clock cycle is often just called a
clock
for short.) In Figure 3-6, the blue instruction leaves the code storage area, enters

the processor, and then advances through the phases of its lifecycle over the

course of the 4 ns clock period, until at the end of the fourth nanosecond, it

completes the last phase and its lifecycle is over. The end of the fourth nano-

second is also the end of the first clock cycle, so now that the first clock cycle is finished and the blue instruction has completed its execution, the red

instruction can enter the processor at the start of a new clock cycle and go

through the same process. This 4 ns sequence of steps is repeated until, after

a total of 16 ns (or four clock cycles), the processor has completed all four

instructions at a completion rate of 0.25 instructions/ns (= 4 instructions/

16 ns).

1ns 2ns 3ns 4ns 5ns 6ns 7ns 8ns 9ns

Stored

Instructions

CPU

Fetch

Decode

Execute

Write

Completed

Instructions

Figure 3-6: A single-cycle processor

44

Chapter 3

Single-cycle processors like the one in Figure 3-6 are simple to design, but

they waste a lot of hardware resources. All of that white space in the diagram

represents processor hardware that’s sitting idle while it waits for the instruction that’s currently in the processor to finish executing. By pipelining the

processor in this figure, you can put more of that hardware to work every

nanosecond, thereby increasing the processor’s efficiency and its perfor-

mance on executing programs.

Before moving on, I should clarify a few concepts illustrated in Figure 3-6.

At the bottom is a region labeled “Completed Instructions.” Completed

instructions don’t actually go anywhere when they’re finished executing;

once they’ve done their job of telling the processor how to modify the

data stream, they’re simply deleted from the processor. So the “Completed

Instructions” box does not represent a real part of the computer, which

is why I’ve placed a dotted line around it. This area is just a place for you

to keep track of how many instructions the processor has completed in

a certain amount of time, or the processor’s instruction completion rate

(or
completion rate
, for short), so that when you compare different types of processors, you’ll have a place where you can quickly see which processor

performs better. The more instructions a processor completes in a set

amount of time, the better it performs on programs, which are an ordered

sequence of instructions. Think of the “Completed Instructions” box as a

sort of scoreboard for tracking each processor’s completion rate, and check

the box in each of the subsequent figures to see how long it takes for the

processor to populate this box.

Following on the preceding point, you may be curious as to why the

blue instruction that has completed in the fourth nanosecond does not

appear in the “Completed Instructions” box until the fifth nanosecond.

The reason is straightforward and stems from the nature of the diagram.

Because an instruction spends
one complete nanosecond
, from start to finish, in each stage of execution, the blue instruction enters the write phase at the

beginning
of the fourth nanosecond and exits the write phase at the
end
of the fourth nanosecond. This means that the fifth nanosecond is the first

full nanosecond in which the blue instruction stands completed. Thus at

the beginning of the fifth nanosecond (which coincides with the end of the

fourth nanosecond), the processor has completed one instruction.

A Pipelined Processor

Pipelining a processor means breaking down its instruction execution

process—what I’ve been calling the instruction’s lifecycle—into a series of

discrete
pipeline stages
that can be completed in sequence by specialized hardware. Recall the way that we broke down the SUV assembly process into five

discrete steps—with one dedicated crew assigned to complete each step—

and you’ll get the idea.

Because an instruction’s lifecycle consists of four fairly distinct phases, you

can start by breaking down the single-cycle processor’s instruction execution

Pipelined Execution

45

process into a sequence of four discrete pipeline stages, where each pipeline

stage corresponds to a phase in the standard instruction lifecycle:

Stage 1:
Fetch the instruction from code storage.

Stage 2:
Decode the instruction.

Stage 3:
Execute the instruction.

Stage 4:
Write the results of the instruction back to the register file.

Note that the number of pipeline stages is called the
pipeline depth
. So the four-stage pipeline has a pipeline depth of four.

For convenience’s sake, let’s say that each of these four pipeline stages

takes exactly 1 ns to finish its work on an instruction, just like each crew in

our assembly line analogy takes one hour to finish its portion of the work on

an SUV. So the original single-cycle processor’s 4 ns execution process is now

broken down into four discrete, sequential pipeline stages of 1 ns each in

length.

Now let’s step through another diagram together to see how a pipelined

CPU would execute the four instructions depicted in Figure 3-7.

1ns 2ns 3ns 4ns 5ns 6ns 7ns 8ns 9ns

Stored

Instructions

CPU

Fetch

Decode

Execute

Write

Completed

Instructions

Figure 3-7: A four-stage pipeline

At the beginning of the first nanosecond, the blue instruction enters

the fetch stage. After that nanosecond is complete, the second nanosecond

begins and the blue instruction moves on to the decode stage, while the

next instruction, the red one, starts to make its way from code storage to the

processor (i.e., it enters the fetch stage). At the start of the third nanosecond, the blue instruction advances to the execute stage, the red instruction

advances to the decode stage, and the green instruction enters the fetch

stage. At the fourth nanosecond, the blue instruction advances to the write

46

Chapter 3

stage, the red instruction advances to the execute stage, the green instruc-

tion advances to the decode stage, and the purple instruction advances to

the fetch stage. After the fourth nanosecond has fully elapsed and the fifth

nanosecond starts, the blue instruction has passed from the pipeline and is

now finished executing. Thus we can say that at the end of 4 ns (= four clock

cycles), the pipelined processor depicted in Figure 3-7 has completed one

instruction.

At start of the fifth nanosecond, the pipeline is now full and the processor

can begin completing instructions at a rate of one instruction per nanosecond.

This one instruction/ns completion rate is a fourfold improvement over

the single-cycle processor’s completion rate of 0.25 instructions/ns (or four

instructions every 16 ns).

Shrinking the Clock

You can see from Figure 3-7 that the role of the CPU clock changes slightly

in a pipelined processor, compared to the single-cycle processor shown in

Figure 3-6. Because all of the pipeline stages must now work together

simultaneously and be ready at the start of each new nanosecond to hand

over the results of their work to the next pipeline stage, the clock is needed

to coordinate the activity of the whole pipeline. The way this is done is

simple: Shrink the clock cycle time to match the time it takes each stage to

complete its work so that at the start of each clock cycle, each pipeline stage

hands off the instruction it was working on to the next stage in the pipeline.

Because each pipeline stage in the example processor takes 1 ns to complete

its work, you can set the clock cycle to be 1 ns in duration.

This new method of clocking the processor means that a new instruction

Other books

The Shelter of Neighbours by Eílís Ní Dhuibhne
Infinity by Sedona Venez
Homer & Langley by E. L. Doctorow
At My Mother's Knee by Paul O'Grady
Trying to Score by Aleo, Toni
Christmas Tales of Terror by Chris Priestley
Napoleon in Egypt by Paul Strathern
Flower by Irene N.Watts
Stormrider by David Gemmell