Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
unit examines the 512-entry BHT and decides to speculatively take a branch,
it doesn’t have to go to code storage to fetch the first instruction from that
branch’s target address. Instead, the BPU loads the branch’s target instruction
directly from the BTIC into the instruction queue, which means that the
processor doesn’t have to wait around for the fetch logic to go out and fetch
the target instruction from code storage. This scheme saves valuable cycles,
and it helps keep performance-killing bubbles out of the 750’s pipeline.
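The BTIC behavior described above can be sketched as a toy lookup. This is only an illustration under stated assumptions: the one-cycle miss penalty, the dictionary indexed by branch address, and the names `cycles_to_get_target`, `code_storage`, and `targets` are all hypothetical, not details of the 750's actual design.

```python
# Toy model of a branch target instruction cache (BTIC).
# Assumptions (not from the text): the one-cycle fetch penalty on a miss
# and the dictionary lookup keyed by branch address.

FETCH_PENALTY_CYCLES = 1  # hypothetical cost of going to code storage


def cycles_to_get_target(branch_addr, btic, code_storage, targets):
    """Return (instruction, extra_cycles) for a branch predicted taken."""
    if branch_addr in btic:
        # Hit: the target instruction goes straight into the instruction queue.
        return btic[branch_addr], 0
    # Miss: fetch from code storage, then remember the instruction for next time.
    instruction = code_storage[targets[branch_addr]]
    btic[branch_addr] = instruction
    return instruction, FETCH_PENALTY_CYCLES


code_storage = {0x200: "add r3, r4, r5"}   # target address -> instruction
targets = {0x100: 0x200}                   # branch address -> target address
btic = {}

first = cycles_to_get_target(0x100, btic, code_storage, targets)
second = cycles_to_get_target(0x100, btic, code_storage, targets)
print(first)   # miss: pays the fetch penalty
print(second)  # hit: no extra cycles
```

The second lookup of the same branch costs nothing extra, which is the bubble-avoiding effect the text attributes to the BTIC.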
Summary: The PowerPC 750 in Historical Context
In spite of its short pipeline and small instruction window, the 750 packed quite a punch. It managed to outperform the 604, partially because of a dedicated
back-side L2 cache interface that allowed it to offload L2 traffic from the front-side bus. It was so successful that a 604 derivative was scrapped in favor of just building on the 750. The 750 and its immediate successors, all of which went
under the name of G3, eventually found widespread use both in embedded
devices and across Apple’s entire product line, from its portables to its
workstations.
The G3 lacked one important feature that separated it from the x86
competition, though: vector computing capabilities. While comparable
PC processors supported SIMD in the form of Intel’s and AMD’s vector
extensions to the x86 instruction set, the G3 was stuck in the world of scalar computing. So when Motorola decided to develop the G3 into an even
more capable embedded and media workstation chip, this lack was the first
thing it addressed.
The PowerPC 7400 (aka the G4)
The Motorola MPC7400 (aka the G4) was designed as a media processing
powerhouse for desktops and portables. Apple Computer used the 7400 as
the CPU in the first version of their G4 workstation line, and this processor
was later replaced by a lower-power version—the 7410—before the 7450
(aka the G4+ or G4e) was introduced. Today, the successors to the 7400/7410
have seen widespread use as embedded processors, which means that they’re used in routers and other non-PC devices that need a microprocessor with
low power consumption and strong DSP capabilities. Table 6-5 lists the
features of the PowerPC 7400.
Table 6-5: Features of the PowerPC 7400
Introduction Date: September 1999
Process: 0.20 micron
Transistor Count: 10.5 million
Die Size: 83 mm²
Clock Speed at Introduction: 400–600 MHz
Cache Sizes: 64KB split L1, 2MB L2 supported via on-chip tags
First Appeared In: Power Macintosh G4
Figure 6-5 illustrates the PowerPC 7400 microarchitecture.
Except for the addition of SIMD capabilities, which we’ll discuss in the
next chapter, the G4 is essentially the same as the 750. Motorola’s technical
summary of the G4 has this to say about the G4 compared to the 750:
The design philosophy on the MPC7410 (and the MPC7400)
is to change from the MPC750 base only where required to
gain compelling multimedia and multiprocessor performance.
The MPC7410’s core is essentially the same as the MPC750’s,
except that whereas the MPC750 has a 6-entry completion queue
and has slower performance on some floating-point double-
precision operations, the MPC7410 has an 8-entry completion
queue and a full double-precision FPU. The MPC7410 also adds
the AltiVec instruction set, has a new memory subsystem, and can
interface to the improved MPX bus.
Source: MPC7410 RISC Microprocessor Technical Summary, section 3.11.
PowerPC Processors: 600 Series, 700 Series, and 7400
[Figure 6-5: Microarchitecture of the PowerPC 7400. Front end: instruction fetch, branch unit (BU), instruction queue, and decode/dispatch. Back end: six reservation stations feeding the vector arithmetic logic units (vector permute unit, VPU, one stage; vector ALU with VSIU, one stage, VCIU, three stages, and VFPU, four stages), the scalar arithmetic logic units (floating-point unit, FPU, three stages; integer units IU1 and IU2, one stage each), and the memory access units (load-store unit, LSU, two stages), followed by the completion queue and the commit unit (write).]
Aside from the vector unit, the most important difference in
the back ends of the two processors lies in the G4’s improved FPU. The G4’s FPU
is a full-blown double-precision FPU, and it performs single- and double-precision
floating-point operations, including multiply and multiply-add, in three fully
pipelined cycles.
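To see why "fully pipelined" matters as much as the three-cycle latency, here is a rough sketch of the arithmetic. It assumes a stream of independent operations with no stalls or dependencies, which real code rarely provides; the function names are mine, not Motorola's.

```python
def pipelined_cycles(num_ops, depth=3):
    """Cycles to finish num_ops independent operations on a fully
    pipelined unit of the given depth (idealized: no stalls)."""
    if num_ops == 0:
        return 0
    # The first result appears after `depth` cycles; after that,
    # one more result completes on every cycle.
    return depth + (num_ops - 1)


def unpipelined_cycles(num_ops, depth=3):
    """Same work on a unit that must finish one operation before
    starting the next."""
    return depth * num_ops


print(pipelined_cycles(1))     # 3
print(pipelined_cycles(10))    # 12
print(unpipelined_cycles(10))  # 30
```

Under these assumptions, ten multiply-adds finish in 12 cycles rather than 30, because a new operation can enter the FPU on every cycle.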
With respect to the instruction window, the G4 has the same number
and configuration of reservation stations as the 750. (Note that the G4’s two
vector execution units, which were not present on the 750, each have a one-entry
reservation station.) The only difference is that the G4’s completion
queue has been lengthened to eight entries from the 750’s original six (the
change noted in the Motorola summary quoted above) as a way of reducing dispatch bottlenecks.
The G4’s Vector Unit
In the late 1990s, Apple, Motorola, and IBM jointly developed a set of SIMD
extensions to the PowerPC instruction set for use in the PowerPC processor
series. These SIMD extensions went by different names: IBM called them
VMX, and Motorola called them AltiVec. This book will refer to these extensions
using Motorola’s AltiVec label.
The new AltiVec instructions, which I’ll cover in detail in Chapter 8, were
first introduced in the G4. The G4 executes these instructions in its vector
unit, which consists of two vector execution units: the vector ALU (VALU) and the vector permute unit (VPU). The VALU performs vector arithmetic and logical operations, while the VPU performs permute and shift operations on vectors.
To support the AltiVec instructions, which can operate on up to 128 bits of
data at a time, 32 new 128-bit vector registers were added to the PowerPC ISA.
On the G4, these 32 architectural registers are accompanied by 6 vector
rename registers.
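As a minimal sketch of what "operating on up to 128 bits of data at a time" means, the following treats one 128-bit vector register as four 32-bit lanes and adds two such registers lane by lane. The lane width is my choice for illustration, and `vector_add_words` is a hypothetical helper loosely modeled on a wrapping word add, not an AltiVec API.

```python
LANE_BITS = 32
LANES = 128 // LANE_BITS          # four 32-bit lanes per 128-bit register
LANE_MASK = (1 << LANE_BITS) - 1


def vector_add_words(a, b):
    """Add two 128-bit vectors lane by lane, wrapping each lane modulo 2**32."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("expected four 32-bit lanes per vector")
    # Each lane is computed independently; no carry crosses a lane boundary.
    return [(x + y) & LANE_MASK for x, y in zip(a, b)]


va = [1, 2, 3, 0xFFFFFFFF]
vb = [10, 20, 30, 1]
print(vector_add_words(va, vb))  # [11, 22, 33, 0]
```

A single vector instruction does the work of four scalar adds here, and the overflow in the last lane wraps to zero without disturbing its neighbors.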
Summary: The PowerPC G4 in Historical Context
The G4’s AltiVec instruction set was a hit, and it began to see widespread use
by Apple and by Motorola’s embedded customers. But there was still much
room for improvement in the G4’s AltiVec implementation. In particular, the
vector unit’s single VALU was tasked with handling all integer and floating-point
vector operations. Just as scalar code benefits from the presence of
multiple specialized scalar ALUs, vector performance could be improved by
splitting the burden of vector computation among multiple specialized VALUs
operating in parallel. Such an improvement would have to wait for the successor
to the G4, the G4e.
The major problem with the G4 was that its short, four-stage pipeline
severely limited the upward scalability of its clock rate. While Intel and AMD
were locked in the gigahertz race, Motorola’s G4 was stuck around the 500 MHz
mark for quite a long time. As a result, Apple’s x86 competitors soon surpassed it in both clock speed and performance, leaving what was once the most powerful commodity RISC workstation line in serious trouble in the market.
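The link between pipeline depth and clock rate can be sketched with a common first-order model: divide the total logic delay evenly across the stages and add a fixed per-stage latch overhead. The delay figures below are hypothetical numbers chosen only to show the trend; they are not measurements of the G4 or of any real processor.

```python
def max_clock_mhz(total_logic_ns, stages, latch_overhead_ns):
    """First-order clock estimate: each stage gets an equal slice of the
    logic delay plus a fixed per-stage latch overhead."""
    cycle_ns = total_logic_ns / stages + latch_overhead_ns
    return 1000.0 / cycle_ns


# Hypothetical numbers, chosen only to show the trend.
TOTAL_LOGIC_NS = 8.0
LATCH_OVERHEAD_NS = 0.2

for stages in (4, 8, 16):
    print(stages, round(max_clock_mhz(TOTAL_LOGIC_NS, stages, LATCH_OVERHEAD_NS)))
```

In this simplified model, a shallow pipeline leaves more work per stage and therefore a longer cycle time, which is why a four-stage design has less clock-speed headroom than a deeper one built on the same process.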
Conclusion
The 600 series saw the PPC line go from the new kid on the block to a mature
RISC alternative that brought Apple’s PowerMac workstation to the forefront
of personal computing performance. While the initial 601 had a few teething
problems, the line was in great shape after the 603e and 604e made it
to market. The 603e was a superb mobile chip that worked well in Apple’s
laptops, and even though it had a more limited instruction dispatch/commit
bandwidth and a smaller cache than the 601, it still managed to beat its
predecessor because of its more efficient use of transistors.
The 604 doubled the 603’s instruction dispatch and commit bandwidth,
and it sported a wider back end and a larger instruction window that enabled
its back end to grind through more instructions per clock. Furthermore, its
pipeline was deepened in order to increase the number of instructions per
clock and to allow for better clock speed scaling. The end result was that the
604 was a strong enough desktop chip to keep the PowerMac comfortably in
the performance game.
It’s important to remember, though, that the 600 series reigned at a time
when transistor budgets were still relatively small by today’s standards, so the PowerPC architecture’s RISC nature gave it a definite cost, performance, and
power consumption edge over the x86 competition. This is not to say that the 600 series was always in the performance lead; it wasn’t. The performance
crown changed hands a number of times during this period.
During the heyday of the 600 series and into the dawn of the G3 era, the
fact that PowerPC was a RISC ISA was a strong mark in the platform’s favor.
But as Moore’s Curves drove transistor counts and MHz numbers ever higher,
the relative cost of legacy x86 support began to go down and the PowerPC
ISA’s RISC advantage started to wane. By the time the 7400 hit the market,
x86 processors from Intel and AMD were already catching up to it in performance, and by the time the gigahertz race was over, Apple’s flagship workstation
line was in trouble. The 7400’s clock speed and performance had stagnated
for too long during a period when Intel and AMD were locked in a heated
price/performance competition.
Apple’s stop-gap solution to this problem was to turn to symmetric
multiprocessing (SMP) in order to increase the performance of its desktop line. (See Chapter 12 for a more detailed discussion of SMP.) By offering
computers in which two G4s worked together to execute code and process
data, Apple hoped to pack more processing power into its computers in
a way that didn’t rely on Motorola to ramp up clock speeds. The dual G4
met with mixed success in the market, and it wasn’t until the debut of the
significantly redesigned PowerPC 7450 (aka G4+ or G4e) that Apple saw the
per-processor performance of its workstations improve. The introduction of
the G4e into its workstation line enabled Apple to recover some ground in its
race with its primary competitor in the PC space—systems based on Intel’s
Pentium 4.
Intel’s Pentium 4 vs. Motorola’s G4e: Approaches and Design Philosophies
Now that we’ve covered not only the microprocessor
basics but also the development of two popular x86
and PowerPC processor lines, you’re equipped to compare
and to understand two of the processors that have
been among the most popular examples of these two
lines: Intel’s Pentium 4 and Motorola’s G4e.
When the Pentium 4 hit the market in November 2000, it was the first
major new x86 microarchitecture from Intel since the 1995 introduction of the Pentium Pro. In the years prior to the Pentium 4’s launch, the Pentium
Pro’s P6 core dominated the market in its incarnations as the Pentium II and
Pentium III, and anyone who was paying attention during that time learned
at least one major lesson: Clock speed sells. Intel was definitely paying attention,
and as the Willamette team members labored away in Hillsboro, Oregon,
they kept MHz foremost in their minds. This singular focus is evident in everything
from Intel’s Pentium 4 promotional and technical literature down to
the very last detail of the processor’s design. As this chapter will show, the
successor to the most successful x86 microarchitecture of all time was a machine built from the ground up for stratospheric clock speed.
NOTE: Willamette was Intel’s code name for the Pentium 4 while the project was in development. Intel’s projects are usually code-named after rivers in Oregon. Many companies
use code names that follow a certain convention, like Apple’s use of the names of large
cats for versions of OS X.
Motorola introduced the MPC7450 in January 2001, and Apple quickly
adopted it under the G4 moniker. Because the 7450 represented a significant departure from the 7400, it was often referred to as the G4e or the
G4+; throughout this chapter we’ll call it the G4e. The new processor had
a slightly deeper pipeline, which allowed it to scale to higher clock speeds, and both its front end and back end boasted a whole host of improvements that