Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
set it apart from the original G4. It also continued the excellent performance/
power consumption ratio of its predecessors. These features combined to
make it an excellent chip for portables, and Apple has exploited derivatives
of this basic architecture under the G4 name in a series of innovative desktop
enclosure designs and portables. The G4e also brought enhanced vector
computing performance to the table, which made it a great platform for
DSP and media applications.
This chapter will examine the trade-offs and design decisions that the
Pentium 4’s architects made in their effort to build a MHz monster, paying
special attention to the innovative features that the Pentium 4 sported and
the ways that those features fit with the processor’s overall design philosophy
and target application domain. We’ll cover the Pentium 4’s ultradeep pipe-
line, its trace cache, its double-pumped ALUs, and a host of other aspects
of its design, all with an eye to their impact on performance. As a point of
comparison, we’ll also look at the microarchitecture of Motorola’s G4e. By
examining two microprocessor designs side by side, you’ll gain a deeper
understanding of how the concepts outlined in the previous chapters play
out in a pair of popular, real-world designs.
The Pentium 4’s Speed Addiction
Table 7-1 lists the features of the Pentium 4.
Table 7-1: Features of the Pentium 4

Introduction Date             April 23, 2001
Process                       0.18 micron
Transistor Count              42 million
Clock Speed at Introduction   1.7 GHz
Cache Sizes                   L1: Approximately 16KB instruction, 16KB data
Features                      Simultaneous Multithreading (SMT, aka “hyperthreading”)
                              added in 2003. 64-bit support (EM64T) and SSE3 added in
                              2004. Virtualization Technology (VT) added in 2005.
Chapter 7
While some processors still have the classic four-stage pipeline described
in Chapter 1, most modern CPUs are more complicated. You’ve already
seen how the original Pentium had a second decode stage, and the P6 core
tripled the standard four-stage pipeline to 12 stages. The Pentium 4, with a
whopping 20 stages in its basic pipeline, takes this tactic to the extreme.
Take a look at Figure 7-1. The chart shows the relative clock frequencies of
Intel’s last six x86 designs. (This picture assumes the same manufacturing
process for all six cores.) The vertical axis shows the relative clock
frequency, and the horizontal axis shows the various processors relative to
each other.
[Figure 7-1 is a bar chart. Vertical axis: Relative Frequency (0 to 3). Bars: 286 = 1, 386 = 1, 486 = 1, P5 = 1, P6 = 1.5, P4P = 2.5.]
Figure 7-1: The relative frequencies of Intel’s processors
Intel’s explanation of this diagram and the history it illustrates is
enlightening, as it shows where their design priorities were:
Figure [3.2] shows that the 286, Intel386™, Intel486™, and Pentium®
(P5) processors had similar pipeline depths—they would run at
similar clock rates if they were all implemented on the same silicon
process technology. They all have a similar number of gates of
logic per clock cycle. The P6 microarchitecture lengthened the
processor pipelines, allowing fewer gates of logic per pipeline
stage, which delivered significantly higher frequency and perfor-
mance. The P6 microarchitecture approximately doubled the
number of pipeline stages compared to the earlier processors and
was able to achieve about a 1.5 times higher frequency on the same
process technology. The NetBurst microarchitecture was designed
to have an even deeper pipeline (about two times the P6 micro-
architecture) with even fewer gates of logic per clock cycle to allow
an industry-leading clock rate.
—The Microarchitecture of the Pentium 4 Processor, p. 3.
Intel’s Pentium 4 vs. Motorola’s G4e: Approaches and Design Philosophies
As you learned in Chapter 2, there are limits to how deeply you can
pipeline an architecture before you begin to reach a point of diminishing
returns. Deeper pipelining results in an increase in instruction execution
time; this increase can be quite damaging to instruction completion rates if
the pipeline has to be flushed and refilled often. Furthermore, in order to
realize the throughput gains that deep pipelining promises, the processor’s
clock speed must increase in proportion to its pipeline depth. But in the real
world, speeding up the clock of a deeply pipelined processor to match its
pipeline depth is not all that easy.
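The trade-off described above can be put into rough numbers. The following is a minimal sketch, not a model of any real chip: it assumes a fixed amount of logic delay per instruction (`logic_ns`), a fixed per-stage latch overhead (`latch_overhead_ns`), and a full flush and refill once every `flush_every` instructions. All of these names and values are illustrative assumptions. Only the stage counts (4, 12, and 20) come from the text.

```python
def cycle_time_ns(logic_ns, stages, latch_overhead_ns):
    """Clock period when the total logic delay is split evenly across stages."""
    return logic_ns / stages + latch_overhead_ns


def run_time_ns(n_instructions, stages, flush_every=None,
                logic_ns=4.0, latch_overhead_ns=0.05):
    """Time to complete n_instructions on an idealized scalar pipeline.

    flush_every: a full flush and refill happens once per this many
    instructions (None means the pipeline never flushes). Assumed values.
    """
    cycle = cycle_time_ns(logic_ns, stages, latch_overhead_ns)
    fill_cycles = stages - 1            # cycles needed to refill an empty pipeline
    flushes = 0 if flush_every is None else n_instructions // flush_every
    total_cycles = fill_cycles + n_instructions + flushes * fill_cycles
    return total_cycles * cycle


if __name__ == "__main__":
    for stages in (4, 12, 20):
        cycle = cycle_time_ns(4.0, stages, 0.05)
        print(f"{stages:2d} stages: cycle {cycle:.3f} ns, "
              f"latency {stages * cycle:.2f} ns, "
              f"no flushes {run_time_ns(1000, stages):.1f} ns, "
              f"flush every 10: {run_time_ns(1000, stages, flush_every=10):.1f} ns")
```

Under these assumptions, the deeper pipeline has the longer per-instruction latency, and its advantage over the shallow pipeline shrinks noticeably once flushes become frequent.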
Because of these drawbacks to deep pipelining, many critics of the
Pentium 4’s microarchitecture, dubbed NetBurst by Intel, have suggested that
its staggeringly long pipeline was a gimmick—a poor design choice made for
reasons of marketing and not performance and scalability. Intel knew that
the public naïvely equated higher MHz numbers with higher performance,
or so the argument went, so they designed the Pentium 4 to run at strato-
spheric clock speeds and in the process, made design trade-offs that would
prove detrimental to real-world performance and long-term scalability.
As it turns out, the Pentium 4’s critics were both wrong and right. In
spite of the predictions of its most ardent detractors, the Pentium 4’s
performance has scaled fairly well with its clock rate, a phenomenon that
readers of this book would expect given the section “Pipelining Explained”
on page 40.
But though they were wrong about its performance, the Pentium 4’s critics
were right about the origins of the processor’s deeply pipelined approach.
Revelations from former members of the Pentium 4’s design team, as well
as my own off-the-record conversations with Intel folks, all indicate that the
Pentium 4’s design was the result of a marketing-driven focus on clock speeds
at the expense of actual performance and long-term scalability.
It’s my understanding that this fact was widely known within Intel, even
though it was not, and probably never will be, publicly acknowledged. We
now know that during the course of the Pentium 4’s design, the design team
was under pressure from the marketing folks to turn out a chip that would
give Intel a massive MHz lead over its rivals. The reasoning apparently went
that MHz was a single number that the general public understood, and they
knew that, just like with everything in the world—except for golf scores—
higher numbers are somehow better.
When it comes to processor clock speeds, higher numbers are indeed
better, but industry-wide trouble with the transition to a 90-nanometer
process caused problems for NetBurst, which on the whole relied on
ever-increasing clock rates and ever-rising power consumption to maintain a
performance edge over its rivals. As Intel ran into difficulties keeping up the
regularly scheduled increases in the Pentium 4’s clock rate, the processor’s
performance increases began to level off, even as its power consumption
continued to rise.
Regardless of the drawbacks of the NetBurst architecture and its long-term
prospects, the Pentium 4 line of processors has been successful from both
commercial and performance standpoints. This is because the Pentium 4’s
way of doing things has advantages for certain types of applications—
especially 3D and streaming media applications—even though it carries
with it serious risks.
The General Approaches and Design Philosophies of the Pentium 4 and G4e
The drastic difference in pipeline depth between the G4e and the Pentium 4
reflects some very important differences in the design philosophies and goals
of the two processors. Both processors try to execute as many instructions as
quickly as possible, but they attack this problem in two different ways.
The G4e’s approach to performance can be summarized as “wide and
shallow.” Its designers added more functional units to its back end for exe-
cuting instructions, and its front end tries to fill up these units by issuing
instructions to each functional unit in parallel. In order to extract the
maximum amount of instruction-level parallelism (ILP) from the linear code
stream, the G4e’s front end first moves a small batch of instructions onto
the chip.
Then, its out-of-order (OOO) execution logic examines them for hazard-
causing dependencies, spreads them out to execute in parallel, and then
pushes them through the back end’s nine execution units. Each of the G4e’s
execution units has a fairly short pipeline, so the instructions take very few
cycles to move through and finish executing. Finally, in the G4e’s final pipe-
line stages, the instructions are put back in their original program order
before the results are written back to memory.
At any given moment, the G4e can have up to 16 instructions simulta-
neously spread throughout the chip in various stages of execution. As you’ll
see when we look at the Pentium 4, this instruction window is quite small.
The end result is that the G4e focuses on getting a small number of instruc-
tions onto the chip at once, spreading them out widely to execute in parallel,
and then getting them off the chip in as few cycles as possible. This “wide and
shallow” approach is illustrated in Figure 7-2.
The Pentium 4 takes a “narrow and deep” approach to moving through
the instruction stream, as illustrated in Figure 7-3. The fact that the
Pentium 4’s pipeline is so deep means that it can hold and work on quite
a few instructions at once, but instead of spreading these instructions out
more widely to execute in parallel, it pushes them through its narrower
back end at a higher rate.
It’s important to note that in order to keep the Pentium 4’s fast back end
fed with instructions, the processor needs deep buffers that can hold and
schedule an enormous number of instructions. The Pentium 4 can have
up to 126 instructions in various stages of execution simultaneously. This way,
the processor can have many more instructions on chip for the OOO exe-
cution logic to examine for dependencies and then rearrange to be rapidly
fired to the execution units. Put another way, the Pentium 4’s instruction
window is very large.
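To see why the size of the instruction window matters, consider the toy scheduler below. It is a sketch under loud assumptions and does not model the actual scheduling logic of either processor: every instruction takes one cycle, at most `width` instructions issue per cycle (the value 3 is an assumption), and the window always holds the oldest `window` instructions that have not yet retired. The window sizes 16 and 126 are taken from the text; the test program of eight independent 32-instruction dependency chains is made up for illustration.

```python
def chained_program(n_chains, chain_len):
    """Dependencies for n_chains independent chains laid out one after another.

    deps[i] lists the earlier instructions whose results instruction i needs.
    """
    deps = []
    for i in range(n_chains * chain_len):
        deps.append([] if i % chain_len == 0 else [i - 1])
    return deps


def cycles_to_finish(deps, window, width):
    """Cycles needed by a toy out-of-order core with in-order retirement."""
    n = len(deps)
    issued_at = [None] * n      # cycle in which each instruction executes
    head = 0                    # oldest instruction not yet retired
    cycle = 0
    while head < n:
        cycle += 1
        issued = 0
        # Only instructions inside the window are visible to the scheduler.
        for i in range(head, min(head + window, n)):
            if issued == width:
                break
            if issued_at[i] is not None:
                continue
            # Ready only if every producer executed in an earlier cycle.
            ready = all(issued_at[d] is not None and issued_at[d] < cycle
                        for d in deps[i])
            if ready:
                issued_at[i] = cycle
                issued += 1
        # Retire finished instructions in program order.
        while head < n and issued_at[head] is not None:
            head += 1
    return cycle


if __name__ == "__main__":
    program = chained_program(n_chains=8, chain_len=32)
    for window in (1, 16, 126):
        print(f"window {window:3d}: {cycles_to_finish(program, window, width=3)} cycles")
```

With a small window the scheduler mostly sees a single dependency chain at a time, so its issue slots sit idle. A larger window exposes several independent chains at once, which lets the same back end finish the program in fewer cycles.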
[Figure 7-2 diagram; labels: Code Stream, Front End, Back End]
Figure 7-2: The G4e’s approach to performance
It might help you to think about these two approaches in terms of a fast
food drive-through analogy. At most fast food restaurants, you can either
walk in or drive through. If you walk in, there are five or six short lines that you can get in and wait to have your order processed by a single server in one
long step. If you choose to drive through, you’ll wind up on a single long line, but that line is geared to move faster because more servers process your order
in more, quicker steps. In other words:
1. You pull up to the speaker and tell them what you want.
2. You pull up to a window and pay a cashier.
3. You drive around and pick up your order.
142
Chapter 7
[Figure 7-3 diagram; labels: Code Stream, Front End, Back End]
Figure 7-3: The Pentium 4’s approach to performance
Because the drive-through approach splits the ordering process up into
multiple, shorter stages, more customers can be waited on in a single line
because there are more stages of the ordering process for different customers
to find themselves in. The G4e takes the multiline, walk-in approach, while
the Pentium 4 takes the single-line, drive-through approach.
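The analogy can be turned into simple arithmetic. The sketch below is illustrative only: the service time of 3 minutes, the six walk-in lines, and the `handoff_min` parameter are assumptions, and only the three drive-through steps come from the list above. It shows that the walk-in counter gets its throughput from the number of parallel lines, while the drive-through gets its throughput from how short each stage is.

```python
def walk_in(lines, service_min):
    """Several parallel lines; each server handles an order in one long step."""
    return {
        "orders_per_min": lines / service_min,
        "orders_in_progress": lines,
        "minutes_per_order": service_min,
    }


def drive_through(stages, service_min, handoff_min=0.0):
    """One line; the same work is split into shorter, back-to-back stages.

    handoff_min is an assumed per-stage overhead (pulling up to the next window).
    """
    stage_min = service_min / stages + handoff_min
    return {
        "orders_per_min": 1 / stage_min,      # one order leaves per stage time
        "orders_in_progress": stages,
        "minutes_per_order": stages * stage_min,
    }


if __name__ == "__main__":
    print("walk-in      :", walk_in(lines=6, service_min=3.0))
    print("drive-through:", drive_through(stages=3, service_min=3.0))
    print("more stages  :", drive_through(stages=12, service_min=3.0, handoff_min=0.05))
```

Adding stages raises the single line's throughput, but each order spends a little longer in the system once the per-stage overhead is counted, which mirrors the latency cost of deep pipelining.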
As we’ve already discussed, the more deeply pipelined a machine is, the
more severe a problem pipeline bubbles and pipeline fills become. When the
Pentium 4’s designers set high clock speeds as their primary goal in crafting
the new microarchitecture, they had to do a lot of work to keep the pipeline
from stalling and to keep branches from being mispredicted. The Pentium 4’s
enormous branch prediction resources and deep buffers represent a place
where the Pentium 4 spends a large number of transistors to alleviate the
negative effects of its long pipeline, transistors that the G4e spends instead
on added execution units.
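A back-of-the-envelope calculation shows why a longer pipeline demands a better branch predictor. This is a rough sketch that assumes a mispredicted branch costs about one pipeline's worth of cycles, which is a simplification. The base CPI, the fraction of instructions that are branches, and the CPI budget are invented for illustration. The 12- and 20-stage depths are the P6 and Pentium 4 figures given earlier in the chapter.

```python
def cpi_with_branches(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once misprediction stalls are included."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def max_mispredict_rate(cpi_budget, base_cpi, branch_fraction, penalty_cycles):
    """Highest misprediction rate that still keeps the CPI within cpi_budget."""
    return (cpi_budget - base_cpi) / (branch_fraction * penalty_cycles)


if __name__ == "__main__":
    # Assumed workload: 20% branches, base CPI of 1.0, budget of 1.1 CPI.
    for name, depth in (("P6, 12 stages", 12), ("Pentium 4, 20 stages", 20)):
        rate = max_mispredict_rate(1.1, 1.0, 0.2, penalty_cycles=depth)
        print(f"{name}: predictor may miss at most {rate:.1%} of branches")
```

Under these assumptions, the deeper pipeline tolerates a smaller misprediction rate for the same CPI budget, which is consistent with the Pentium 4 spending transistors on branch prediction that the G4e spends elsewhere.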
An Overview of the G4e’s Architecture and Pipeline
The diagram in Figure 7-4 shows the basics of the G4e’s microarchitecture,
with an emphasis on representing the pipeline stages of the front end and
back end. You might want to mark this page so you can refer to it throughout
this section.
Front End
Instruction Fetch-1
Instruction Fetch-2
BU
Instruction Queue
Branch