each clock in which there’s a bubble in the write stage, the pipeline’s instruction throughput is 0 instructions/clock, so its average instruction throughput for the whole period continues to decline. After the last bubble has worked its way out of the write stage, the pipeline begins completing new instructions again at a rate of one instruction/cycle and its average instruction throughput begins to climb. And when the processor’s instruction throughput begins to climb, so do its completion rate and its performance on programs.
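To make the shape of that recovery concrete, here is a minimal sketch in C (mine, not the book’s) that models a four-stage pipeline completing one instruction per clock, with a single 10-cycle stall assumed to reach the write stage at cycle 15; printing the running average shows the dip and slow climb that Figure 3-12 plots.

    #include <stdio.h>

    int main(void) {
        const int fill_cycles  = 4;   /* four-stage pipeline: first instruction completes at cycle 4 */
        const int stall_start  = 15;  /* assumed cycle at which bubbles begin reaching the write stage */
        const int stall_length = 10;  /* the 10-cycle stall from Figure 3-12 */
        int completed = 0;

        for (int cycle = 1; cycle <= 100; cycle++) {
            int bubble = (cycle >= stall_start && cycle < stall_start + stall_length);
            /* one instruction leaves the write stage per clock, except while the
               pipeline is still filling or a bubble occupies the write stage */
            if (cycle >= fill_cycles && !bubble)
                completed++;
            if (cycle % 20 == 0)
                printf("cycle %3d: average throughput = %.2f instructions/clock\n",
                       cycle, (double)completed / cycle);
        }
        return 0;
    }

The printed averages dip during the stall and then creep back toward one instruction/clock, just as the curve in the figure does.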
[Graph: average instruction throughput (instructions/clock), from 0 to 1, plotted against clock cycles, from 20 to 100]
Figure 3-12: Average instruction throughput of a four-stage pipeline with a 10-cycle stall
Now, 10 or 20 cycles worth of stalls here and there may not seem like
much, but they do add up. Even more important, though, is the fact that the
numbers in the preceding examples would be increased by a factor of 30 or
more in real-world execution scenarios. As of this writing, a processor can
spend from 50 to 120 nanoseconds waiting on data from main memory. For a
3 GHz processor that has a clock cycle time of a fraction of a nanosecond, a
100 ns main memory access translates into a few hundred clock cycles worth
of bubbles—and that’s just for one main memory access out of the many
millions that a program might make over the course of its execution.
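As a rough worked example (the exact figures vary by system): a 3 GHz clock has a period of about 0.33 ns, so a single 100 ns trip to main memory costs roughly 100 / 0.33 ≈ 300 clock cycles, during which the pipeline has little to do but insert bubbles.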
In later chapters, we’ll look at the causes of pipeline stalls and the many
tricks that computer architects use to overcome them.
Instruction Latency and Pipeline Stalls
Before closing out our discussion of pipeline stalls, I should introduce
another term that you’ll be seeing periodically throughout the rest of the
book: instruction latency. An instruction’s latency is the number of clock cycles it takes for the instruction to pass through the pipeline. For a single-cycle
processor, all instructions have a latency of one clock cycle. In contrast, for
the simple four-stage pipeline described so far, all instructions have a latency of four cycles. To get a visual image of this, take one more look at the blue
instruction in Figure 3-6 earlier in this chapter; this instruction takes four
clock cycles to advance, at a rate of one clock cycle per stage, through each of the four stages of the pipeline. Likewise, instructions have a latency of eight
cycles on an eight-stage pipeline, 12 cycles on a 12-stage pipeline, and so on.
In real-world processors, instruction latency is not necessarily a fixed
number that’s equal to the number of pipeline stages. Because instructions
can get hung up in one or more pipeline stages for multiple cycles, each extra
cycle that they spend waiting in a pipeline stage adds one more cycle to their
latency. So the instruction latencies given in the previous paragraph (i.e., four cycles for a four-stage pipeline, eight cycles for an eight-stage pipeline, and
so on) represent minimum instruction latencies. Actual instruction latencies in pipelines of any length can be longer than the depth of the pipeline,
depending on whether or not the instruction stalls in one or more stages.
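One way to keep this straight is a simple back-of-the-envelope formula (my own shorthand, not the book’s notation): latency = pipeline depth + cycles spent stalled. An instruction that waits two extra cycles in the fetch stage of the four-stage pipeline, for example, has a latency of 4 + 2 = 6 cycles, two more than the four-cycle minimum.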
Limits to Pipelining
As you can probably guess, there are some practical limits to how deeply you
can pipeline an assembly line or a processor before the actual speedup in
completion rate that you gain from pipelining starts to become significantly
less than the ideal speedup that you might expect. In the real world, the
different phases of an instruction’s lifecycle don’t easily break down into an
arbitrarily high number of shorter stages of perfectly equal duration. Some
stages are inherently more complex and take longer than others.
But because each pipeline stage must take exactly one clock cycle to
complete, the clock pulse that coordinates all the stages can be no faster
than the pipeline’s slowest stage. In other words, the amount of time it takes
for the slowest stage in the pipeline to complete will determine the length of
the CPU’s clock cycle and thus the length of every pipeline stage. This means
that the pipeline’s slowest stage will spend the entire clock cycle working,
while the faster stages will spend part of the clock cycle idle. Not only does
this waste resources, but it increases each instruction’s overall execution time by dragging out some phases of the lifecycle to take up more time than they
would if the processor was not pipelined—all of the other stages must wait a
little extra time each cycle while the slowest stage plays catch-up.
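For illustration (the numbers here are made up): suppose an instruction’s work naturally divides into four phases taking 0.2, 0.2, 0.4, and 0.2 ns. Pipelined, the clock period can be no shorter than the 0.4 ns of the slowest phase, so each instruction spends 4 × 0.4 = 1.6 ns in the pipeline, even though the same work would take only 1.0 ns if the phases simply ran back to back.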
So, as you slice the pipeline more finely in order to add stages and increase
throughput, the individual stages get less and less uniform in length and
complexity, with the result that the processor’s overall instruction execution
time gets longer. Because of this feature of pipelining, one of the most diffi-
cult and important challenges that the CPU designer faces is that of balancing
the pipeline so that no one stage has more work to do than any other.
The designer must distribute the work of processing an instruction evenly to
each stage, so that no one stage takes up too much time and thus slows down
the entire pipeline.
Clock Period and Completion Rate
If the pipelined processor’s clock cycle time, or clock period, is longer than its ideal length (i.e., non-pipelined instruction execution time/pipeline depth), and it always is, then the processor’s completion rate will suffer. If the instruction throughput stays fixed at, say, one instruction/clock, then as the clock
period increases, the completion rate decreases. Because new instructions
can be completed only at the end of each clock cycle, a longer clock cycle translates into fewer instructions completed per nanosecond, which in turn
translates into longer program execution times.
To get a better feel for the relationship between completion rate, instruc-
tion throughput, and clock cycle time, let’s take the eight-stage pipeline from
Figure 3-8 and increase its clock cycle time to 1 ns instead of 0.5 ns. Its first 9 ns of execution would then look as in Figure 3-13.
[Diagram: clock cycles 1 through 9, with stored instructions flowing through the CPU’s Fetch1, Fetch2, Decode, Execute, Data1, Data2, Tag Check, and Write stages to the completed instructions region]
Figure 3-13: An eight-stage pipeline with a 1 ns clock period
As you can see, the instruction execution time has now increased from
an original time of 4 ns to a new time of 8 ns, which means that the eight-
stage pipeline does not complete its first instruction until the end of the
eighth nanosecond. Once the pipeline is full, the processor pictured in
Figure 3-13 begins completing instructions at a rate of one instruction per
nanosecond. This completion rate is half the completion rate of the ideal
eight-stage pipeline with the 0.5 ns clock cycle time. It’s also the exact same
completion rate as the one instruction/ns completion rate of the ideal four-
stage pipeline. In short, the longer clock cycle time of the new eight-stage
pipeline has robbed the deeper pipeline of its completion rate advantage.
Furthermore, the eight-stage pipeline now takes twice as long to fill.
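Put as two rules of thumb (my own phrasing): once the pipeline is full, the completion rate is roughly 1/(clock period), and the time to fill it is roughly pipeline depth × clock period. For the pipeline in Figure 3-13 that works out to 1/1 ns = 1 instruction/ns and an 8 × 1 ns = 8 ns fill time, versus 2 instructions/ns and 4 ns for the ideal version with the 0.5 ns clock.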
Take a look at the graph in Figure 3-14 to see what this doubled execu-
tion time does to the eight-stage pipeline’s average completion rate curve
versus the same curve for a four-stage pipeline.
It takes longer for the slower eight-stage pipeline to fill up, which means
that its average completion rate—and hence its performance—ramps up
more slowly when the pipeline is first filled with instructions. There are many
situations in which a processor that’s executing a program must flush its
pipeline entirely and then begin refilling it from a different point in the
code stream. In such instances, that slower-ramping completion rate curve
causes a performance hit.
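Here is a minimal C sketch (mine, with illustrative parameters) of how curves like those in Figure 3-14 arise: it tracks the running average completion rate of a four-stage and an eight-stage pipeline, both on a 1 ns clock, and assumes a single full pipeline flush at the 50 ns mark to show how the deeper pipeline pays more for each refill.

    #include <stdio.h>

    /* Running average completion rate (instructions/ns) for a pipeline of the
       given depth on a 1 ns clock, with one full flush at an assumed time.
       Parameters are illustrative, not taken from the figures. */
    static double avg_rate(int depth, int total_ns, int flush_at_ns) {
        int completed = 0;
        for (int t = 1; t <= total_ns; t++) {
            int filling   = (t < depth);                                   /* initial fill */
            int refilling = (t >= flush_at_ns && t < flush_at_ns + depth); /* refill after the flush */
            if (!filling && !refilling)
                completed++;   /* one instruction completes per nanosecond */
        }
        return (double)completed / total_ns;
    }

    int main(void) {
        for (int t = 20; t <= 100; t += 20)
            printf("after %3d ns: 4-stage %.2f, 8-stage %.2f instructions/ns\n",
                   t, avg_rate(4, t, 50), avg_rate(8, t, 50));
        return 0;
    }

Even in this toy model, the eight-stage pipeline’s average stays below the four-stage pipeline’s for a long stretch after the initial fill and again after the flush, which is exactly the performance hit described above.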
[Graph: average completion rate (instructions/ns), from 0 to 1, plotted against time (ns) from 20 to 100, with separate curves for the 4-stage and 8-stage pipelines]
Figure 3-14: Average instruction completion rate for four- and eight-stage
pipelines with a 1 ns clock period
In the end, the performance gains brought about by pipelining depend
on two things:
1. Pipeline stalls must be avoided. As you’ve seen earlier, pipeline stalls
cause the processor’s completion rate and performance to drop. Main
memory accesses are a major cause of pipeline stalls, but this problem
can be alleviated significantly by the use of caching. We’ll cover caching
in detail in Chapter 11. Other causes of stalls, like the various types of
hazards, will be covered at the end of Chapter 4.
2. Pipeline refills must be avoided. Flushing the pipeline and refilling it
again takes a serious toll on both completion rate and performance.
Once the pipeline is full, it must remain full for as long as possible if the
average completion rate is to be kept up.
When we look more closely at real-world pipelined processors in later
chapters, you’ll see these two issues come up again and again. In fact, much
of the rest of the book will be about how the different architectures under
discussion work to keep their pipelines full by preventing stalls and ensuring
a continuous and uninterrupted flow of instructions into the processor from
the code storage area.
The Cost of Pipelining
In addition to the inherent limits to performance improvement that we’ve
just looked at, pipelining requires a nontrivial amount of extra bookkeeping
and buffering logic to implement, so it incurs an overhead cost in transis-
tors and die space. Furthermore, this overhead cost increases with pipeline
depth, so that a processor with a very deep pipeline (for example, Intel’s
Pentium 4) spends a significant amount of its transistor budget on pipeline-
related logic. These overhead costs place some practical constraints on how
deeply you can pipeline a processor. I’ll say a lot more about such constraints
in the chapters covering the Pentium line and the Pentium 4.
Superscalar Execution
Chapters 1 and 2 described the processor as it is visible
to the programmer. The register files, the processor
status word (PSW), the arithmetic logic unit (ALU),
and other parts of the programming model are all
there to provide a means for the programmer to
manipulate the processor and make it do useful work.
In other words, the programming model is essentially
a user interface for the CPU.
Much like the graphical user interfaces on modern computer systems,
there’s a lot more going on under the hood of a microprocessor than the
simplicity of the programming model would imply. In Chapter 12, I’ll talk
about the various ways in which the operating system and processor collab-
orate to fool the user into thinking that he or she is executing multiple pro-
grams at once. There’s a similar sort of trickery that goes on beneath the
programming model in a modern microprocessor, but it’s intended to fool
the programmer into thinking that there’s only one thing going on at a time,
when really there are multiple things happening simultaneously. Let me
explain.
Back in the days when computer designers could fit relatively few
transistors on a single piece of silicon, many parts of the programming
model actually resided on separate chips attached to a single circuit board.
For instance, one chip contained the ALU, another chip contained the
control unit, still another chip contained the registers, and so on. Such
computers were relatively slow, and the fact that they were made of multiple
chips made them expensive. Each chip had its own manufacturing and
packaging costs, so the more chips you put on a board, the more expensive
the overall system was. (Note that this is still true today. The cost of pro-
ducing systems and components can be drastically reduced by packing the
functionality of multiple chips into a single chip.)
With the advent of the Intel 4004 in 1971, all of that changed. The 4004
was the world’s first microprocessor on a chip. Designed to be the brains of
a calculator manufactured by a now defunct company named Busicom, the
4004 had 16 four-bit registers, an ALU, and decoding and control logic all
packed onto a single, 2,300-transistor chip. The 4004 was quite a feat for its
day, and it paved the way for the PC revolution. However, it wasn’t until Intel
released the 8080 a few years later that the world saw the first true general-