doesn’t know or care if the instructions are coming from the trace cache or
the microcode ROM. All it sees is a constant, uninterrupted stream of
instructions.
Intel hasn’t said exactly how big the trace cache is—only that it holds
12,000 micro-ops. Intel claims this is roughly equivalent to a 16KB to 18KB
I-cache.
By way of finishing up our discussion of the trace cache and introducing
our detailed walk-through of the Pentium 4’s pipeline, I should note two
final aspects of the trace cache’s effect on the pipeline. First, the trace cache still needs a short instruction fetch stage so that micro-ops can be fetched
from it and sent to the allocation and scheduling logic. When we look at the
Pentium 4’s basic execution pipeline, you’ll see this stage. Second, the trace
cache actually has its own little BPU for predicting the directions and return
addresses of branches within the trace cache itself. So the trace cache doesn’t
eliminate branch processing and prediction entirely from the picture; it just
alleviates their effects on performance.
An Overview of the Pentium 4’s Pipeline
Now let’s step back and take a look at the Pentium 4’s basic execution pipeline.
Here’s a breakdown of the various pipeline stages.
Stages 1 and 2: Trace Cache Next Instruction Pointer
In these stages, the Pentium 4’s trace cache fetch logic gets a pointer to the
next instruction in the trace cache.
Stages 3 and 4: Trace Cache Fetch
These two stages fetch an instruction from the trace cache to be sent to the
back end.
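By way of illustration, here is a minimal Python sketch of what these first fetch stages do conceptually. It is not Intel’s actual structure: the addresses, micro-ops, and the predicted_next table (standing in for the trace cache’s own little BPU mentioned earlier) are all invented for the example.

    # Hypothetical model: the trace cache holds traces of already-decoded
    # micro-ops, indexed by the address that starts each trace.
    trace_cache = {
        0x1000: ["load r1", "add r1, r2", "store r2", "jz 0x2000"],
        0x2000: ["sub r3, r1", "jmp 0x1000"],
    }
    # Stand-in for the trace cache's own branch logic: a predicted next trace
    # for each trace (a pure assumption, just to keep the example closed).
    predicted_next = {0x1000: 0x2000, 0x2000: 0x1000}

    def fetch_cycle(ip, offset, width=3):
        """Return up to `width` micro-ops from the current trace, plus the updated pointer."""
        trace = trace_cache[ip]
        uops = trace[offset:offset + width]
        offset += len(uops)
        if offset >= len(trace):            # end of trace: follow the prediction
            ip, offset = predicted_next[ip], 0
        return uops, ip, offset

    ip, offset = 0x1000, 0
    for cycle in range(4):
        uops, ip, offset = fetch_cycle(ip, offset)
        print(f"cycle {cycle}: fetched {uops}")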
Stage 5: Drive
This is the first of two special
drive stages
in the Pentium 4’s pipeline, each of which is dedicated to driving signals from one part of the processor to the
next. The Pentium 4 runs so fast that sometimes a signal can’t make it all the
way to where it needs to be in a single clock pulse, so the processor dedicates
some pipeline stages to letting these signals propagate across the chip. These
drive stages are there because the Pentium 4’s designers intend for the chip
to reach such stratospheric clock speeds that stages like this are absolutely
necessary.
At the end of these first five stages, the Pentium 4’s trace cache sends up to three micro-ops per cycle into a large, FIFO micro-op queue. This in-order queue, which sits in between the Pentium 4’s front end and back end, smoothes out the flow of instructions to the back end by squeezing out any fetch- or decode-related bubbles. Micro-ops enter the top of the queue and fall down to rest at the lowest available entry, directly above the most recent micro-op to enter the queue. Thus, any bubbles that may have been ahead of the micro-ops disappear from the pipeline at this point. The micro-ops leave the bottom of the micro-op queue in program order and proceed to the next pipeline stage.
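To make the bubble-squeezing behavior concrete, here is a small Python sketch of the idea, not of Intel’s hardware: empty fetch slots simply never enter the queue, so whatever drains out the bottom is a dense, in-order stream. The micro-op names and group sizes are invented.

    from collections import deque

    # Hypothetical fetch groups of up to three slots each; None marks a
    # fetch- or decode-related bubble (an empty slot).
    fetch_groups = [
        ["uop0", "uop1", None],
        [None, None, None],              # a whole cycle lost to a front-end stall
        ["uop2", None, "uop3"],
    ]

    micro_op_queue = deque()             # the in-order FIFO between front end and back end

    for group in fetch_groups:
        for slot in group:
            if slot is not None:         # bubbles are simply never enqueued
                micro_op_queue.append(slot)

    # The back end drains dense groups of up to three micro-ops per cycle.
    while micro_op_queue:
        group = [micro_op_queue.popleft() for _ in range(min(3, len(micro_op_queue)))]
        print("to back end:", group)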
Stages 6 Through 8: Allocate and Rename (ROB)
In this group of stages, up to three instructions per cycle move from the
bottom of the micro-op queue and are allocated entries in the Pentium 4’s
ROB and rename registers. With regard to the latter, the x86 ISA specifies only eight GPRs, eight FPRs, and eight VPRs, but the Pentium 4 has 128 of each type of register in its rename register files.
The allocator/renamer also allocates each micro-op an entry in one of the two micro-op queues detailed in the next section and can send up to three micro-ops per cycle into these queues.
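To see what the rename step buys, consider a hedged Python sketch of a rename table: the eight architectural GPR names are just labels that get remapped onto a much larger pool of physical registers, so two back-to-back writes to the “same” x86 register no longer collide. The register names and micro-ops here are purely illustrative.

    # Hypothetical rename sketch: 8 architectural GPR names, 128 physical registers.
    free_physical = list(range(128))     # free list of physical register IDs
    rename_table = {}                    # architectural name -> physical register ID

    def rename(dest_arch_reg):
        """Point an architectural destination register at a fresh physical register."""
        phys = free_physical.pop(0)
        rename_table[dest_arch_reg] = phys
        return phys

    # Two back-to-back micro-ops that both write eax end up targeting different
    # physical registers, so the false (write-after-write) dependency disappears.
    p1 = rename("eax")
    p2 = rename("eax")
    print("the two eax writes go to physical registers", p1, "and", p2)
    print("eax currently maps to physical register", rename_table["eax"])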
Stage 9: Queue
To implement out-of-order execution, the Pentium 4 flows micro-ops from
its trace cache through the ROB and into two deep micro-op queues that sit
between its instructions’ dispatch and execution phases. These two queues are the memory micro-op queue and the arithmetic micro-op queue. The memory micro-op queue holds memory operations (loads and stores) that are destined for the Pentium 4’s LSU, while the arithmetic micro-op queue holds all other types of operations.
The two main micro-op queues are roughly analogous to the G4e’s issue queues, but with one crucial difference: the Pentium 4’s micro-op queues are FIFO queues, while the G4e’s issue queues are not. For the Pentium 4, this means an instruction passes into and out of a micro-op queue in program order with respect to the other instructions in its own queue. However, instructions can still exit the bottom of each queue out of program order with respect to instructions in the other queue.
These two micro-op queues feed micro-ops into the scheduling logic in
the next stage.
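A rough Python sketch of that behavior, with invented micro-op names: each queue is a strict FIFO, but because the two queues drain independently, a younger memory micro-op can leave before an older arithmetic micro-op that is stuck in the other queue.

    from collections import deque

    memory_q = deque()       # loads and stores, headed for the LSU
    arithmetic_q = deque()   # everything else

    # Program order: a0, m0, a1, m1 (a = arithmetic, m = memory).
    program = [("a0", "arith"), ("m0", "mem"), ("a1", "arith"), ("m1", "mem")]
    for name, kind in program:
        (memory_q if kind == "mem" else arithmetic_q).append(name)

    # Suppose the arithmetic side is stalled this cycle but the memory side is not:
    # m0 and m1 leave ahead of the older a0 (out of order across queues), yet m0
    # still leaves before m1, and a0 before a1 (in order within each queue).
    print("memory queue drains:", [memory_q.popleft() for _ in range(len(memory_q))])
    print("arithmetic queue drains:", [arithmetic_q.popleft() for _ in range(len(arithmetic_q))])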
Stages 10 Through 12: Schedule
The micro-op queues described in the preceding section are only part of the
Pentium 4’s dynamic scheduling logic. The other half consists of a set of four
micro-op schedulers whose job it is to schedule micro-ops for execution and
to determine to which execution unit the micro-ops should be passed. Each
of these schedulers consists of a smaller, 8- to 12-entry micro-op queue attached to a bit of scheduling logic. Each scheduler’s logic arbitrates with the other schedulers for access to the Pentium 4’s four issue ports, and it removes micro-ops from its in-order scheduling queue and sends them through the right port at the right time.
An instruction cannot exit from a scheduling queue until its input
operands are available and the appropriate execution unit is available.
When the micro-op is ready to execute, it is removed from the bottom of
its scheduling queue and passed to the proper execution unit through one
of the Pentium 4’s four issue ports, which are analogous to the P6 core’s
five issue ports in that they act as gateways to the back end’s execution
units.
Here’s a breakdown of the four schedulers:
Memory scheduler
Schedules memory operations for the LSU.
Fast IU scheduler
Schedules ALU operations (simple integer and logical instructions) for
the Pentium 4’s two double-speed integer execution units. As you’ll see
in the next chapter, the Pentium 4 contains two integer ALUs that run at
twice the main core’s clock speed.
Slow IU/general FPU scheduler
Schedules the rest of the integer instructions and most of the floating-
point instructions.
Simple FP scheduler
Schedules simple FP instructions and FP memory operations.
These schedulers feed micro-ops through the four dispatch ports
described in the next stage.
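Here is a loose Python sketch of the readiness rule the schedulers enforce; the micro-ops, operand names, and port labels are all assumptions made up for the example. A micro-op leaves its scheduling queue only when its input operands are available and its port’s execution unit is free, and the schedulers compete for ports each cycle.

    # Hypothetical scheduler sketch: each scheduler owns a small in-order queue
    # and a target issue port; only the head of a queue may issue in a cycle.
    ready_operands = {"r1", "r2"}        # operand values that are already available
    port_busy = {"exec0": False, "exec1": False, "load": False, "store": True}

    schedulers = {
        "fast_iu": {"port": "exec0", "queue": [("add r3, r1, r2", {"r1", "r2"})]},
        "memory":  {"port": "store", "queue": [("store r3", {"r3"})]},
    }

    for name, sched in schedulers.items():
        if not sched["queue"]:
            continue
        uop, inputs = sched["queue"][0]
        if inputs <= ready_operands and not port_busy[sched["port"]]:
            sched["queue"].pop(0)
            port_busy[sched["port"]] = True      # claim the port for this cycle
            print(f"{name}: issued '{uop}' through port {sched['port']}")
        else:
            print(f"{name}: '{uop}' waits (operands or port not ready)")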
Stages 13 and 14: Issue
The P6 core’s reservation station sends instructions to the back end via one
of five issue ports. The Pentium 4 uses a similar scheme, but with four issue
ports instead of five. There are two memory ports for memory instructions: the load port and the store port, for loads and stores, respectively. The remaining two ports, called execution ports, are for all the other instructions: execution port 0 and execution port 1. The Pentium 4 can send a total of six micro-ops per cycle through the four issue ports. This issue rate of six micro-ops per cycle is higher than the rate at which the front end can fetch and decode (three micro-ops per cycle) or the back end can complete (three micro-ops per cycle), but that’s okay because it gives the machine some headroom in its middle so that it can have bursts of activity.
You might be wondering how six micro-ops per cycle can move through
four ports. The trick is that the Pentium 4’s two execution ports are double-
speed, meaning that they can dispatch instructions (integer only) on the
rising and falling edges of the clock. But we’ll talk more about this in the
next chapter. For now, here’s a breakdown of the two execution ports and
which execution units are attached to them:
Execution port 0:
Fast integer ALU1
This unit performs integer addition, subtraction,
and logical operations. It also evaluates branch conditionals and exe-
cutes store-data micro-ops, which store data into the outgoing store
buffer. This is the first of two double-speed integer units, which operate
at twice the core clock frequency.
Floating-point/SSE move
This unit performs floating-point and SSE
moves and stores. It also executes the FXCH instruction, which means that
it’s no longer “free” on the Pentium 4.
Execution port 1:
Fast integer ALU2
This very simple integer ALU performs only integer
addition and subtraction. It’s the second of the two double-speed inte-
ger ALUs.
Slow integer ALU
This integer unit handles all of the more time-
consuming integer operations, like shift and rotate, that can’t be
completed in half a clock cycle by the two fast ALUs.
Floating-point/SSE/MMX ALU
This unit handles floating-point and
SSE addition, subtraction, multiplication, and division. It also handles all
MMX instructions.
In Figure 7-5, I’ve labeled the instruction flow paths going into each
execution unit to show which dispatch port instructions must pass through
in order to reach which execution unit.
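As a back-of-the-envelope check on where the peak of six micro-ops per cycle comes from, here is a small Python accounting sketch, not a timing model: each memory port dispatches once per core cycle, while each double-speed execution port can dispatch on both the rising and falling clock edges.

    # Peak dispatch accounting for one core clock cycle (a simplification).
    ports = {
        "load":  1,   # memory port: one dispatch per cycle
        "store": 1,   # memory port: one dispatch per cycle
        "exec0": 2,   # double-speed: dispatches on the rising and falling edges
        "exec1": 2,   # double-speed: dispatches on the rising and falling edges
    }

    peak_per_cycle = sum(ports.values())
    print("peak micro-ops dispatched per core cycle:", peak_per_cycle)   # -> 6

    # The front end delivers, and the back end completes, only three per cycle,
    # so the middle of the machine has headroom for bursts.
    print("front-end rate:", 3, "micro-ops/cycle;", "completion rate:", 3, "micro-ops/cycle")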
Stages 15 and 16: Register Files
These stages are where the execution units, upon receiving the instructions,
read each instruction’s input operands from the appropriate register file.
To return to the discussion from Chapter 1, this step is the read step in the
read-execute-write cycle of computation.
Stage 17: Execute
In this stage, the instructions are actually executed by the back end’s execution units. We’ll take a closer look at the Pentium 4’s back end in the next chapter.
Stage 18: Flags
If an instruction’s outcome stipulates that it needs to set any flags in the PSW, then it does so at this stage.
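For example, an integer subtraction that produces zero needs to set the zero flag so that a later conditional branch can test it. Here is a deliberately simplified Python sketch of that kind of flag computation; it models only a few x86-style flags and ignores many details of the real PSW.

    def alu_sub(a, b, width=32):
        """Subtract b from a and report the flags the result would set (simplified)."""
        mask = (1 << width) - 1
        result = (a - b) & mask
        flags = {
            "ZF": result == 0,                   # zero flag
            "SF": bool(result >> (width - 1)),   # sign flag: top bit of the result
            "CF": b > a,                         # borrow out of an x86-style subtract
        }
        return result, flags

    print(alu_sub(5, 5))   # result 0: ZF set
    print(alu_sub(3, 7))   # wraps negative: SF and CF set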
Stage 19: Branch Check
Here’s the stage where the Pentium 4 checks the outcome of a conditional
branch to see if it has just wasted 19 cycles of its time executing some code
that it’ll have to throw away. By stage 19, the branch condition has been
evaluated, and the front end knows whether the branch predictor’s guess was right.
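A toy Python sketch of that check follows; the 19-cycle figure is just the pipeline depth described above, and everything else (the function, its arguments) is invented for illustration.

    MISPREDICT_PENALTY = 19      # roughly the pipeline stages that must be refilled

    def branch_check(predicted_taken, condition_value):
        """Compare the predictor's earlier guess against the now-known outcome."""
        actually_taken = condition_value != 0
        if predicted_taken == actually_taken:
            return "prediction correct: no penalty"
        # Wrong guess: the speculative work is thrown away and the front end
        # restarts fetching from the correct path.
        return f"misprediction: flush and refill (~{MISPREDICT_PENALTY} cycles lost)"

    print(branch_check(predicted_taken=True, condition_value=0))   # guessed wrong
    print(branch_check(predicted_taken=True, condition_value=1))   # guessed right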
Stage 20: Drive
You’ve already met the drive stage. Again, this stage is dedicated to propa-
gating signals across the chip.
Stages 21 and Onward: Complete and Commit
Although stage 20 is the last stage Intel lists as part of the “normal Pentium 4 pipeline,” for completeness, I’ll include the
write-back phase. Completed instructions file into their pre-assigned entries
in the ROB, where they’re put back in program order before having their
results update the machine’s architectural state.
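Here is a minimal Python sketch of that in-order commit rule, with made-up entries and values: micro-ops may finish executing in any order, but retirement happens only from the head of the ROB, so results reach the architectural registers in program order.

    # Hypothetical reorder buffer: entries were assigned in program order.
    rob = [
        {"uop": "uop0", "done": True,  "dest": "eax", "value": 1},
        {"uop": "uop1", "done": False, "dest": "ebx", "value": None},   # still executing
        {"uop": "uop2", "done": True,  "dest": "ecx", "value": 3},      # finished early
    ]
    architectural_regs = {}

    # Retire only from the head: uop2 must wait for uop1 even though it is done.
    while rob and rob[0]["done"]:
        entry = rob.pop(0)
        architectural_regs[entry["dest"]] = entry["value"]
        print("committed", entry["uop"])

    print("architectural state so far:", architectural_regs)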
As you can see, the Pentium 4’s 20-stage pipeline does much of the same
work in mostly the same order as the G4e’s seven-stage pipeline. By dividing
the pipeline into more stages, though, the Pentium 4 can reach higher clock
rates. As I’ve already noted, this deeply pipelined approach fits with the
Pentium 4’s “narrow and deep” design philosophy.
The Pentium 4’s Instruction Window
Before I discuss the nature of the Pentium 4’s instruction window, note that
I’m using the terms instruction pool and instruction window somewhat interchangeably. These two terms represent two slightly different metaphors for
thinking about the set of queues and buffers positioned between a processor’s
front end and its back end. Instructions collect up in these queues and
buffers—just like water collects in a pool or reservoir—before being drained
away by the processor’s back end. Because the instruction pool represents a
small segment of the code stream, which the processor can examine for
dependencies and reorder for optimal execution, this pool can also be said
to function as a window on the code stream. Now that that’s clear, let’s take a
look at the Pentium 4’s instruction window.
As I explained in the previous chapter, the older P6 core’s RS and ROB
made up the heart of its instruction window. The Pentium 4 likewise has a
ROB for tracking micro-ops, and in fact, its 126-entry ROB is much larger
than that of the P6. The buffer functions of the P6’s reservation station,
however, have been divided among multiple structures. The previous section’s
pipeline description explains how these structures are configured.
This partitioning of the instruction window into memory and arithmetic
portions by means of the two scheduling queues has the effect of ensuring
that both types of instructions will always have space in the window, and that
an overabundance of one instruction type will not crowd the other type out of
the window. The multiple schedulers provide fine-grained control over the
instruction flow, so that it’s optimally reordered for the fast execution units.
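A small Python sketch can illustrate the point; the queue capacities are invented, not the Pentium 4’s real sizes. With one shared window, a burst of memory micro-ops can fill every entry and shut arithmetic micro-ops out entirely, while partitioned queues guarantee the arithmetic side some space.

    # Illustrative capacities only; these are not the Pentium 4's real sizes.
    burst = ["mem"] * 8 + ["arith"] * 2      # a burst of memory work, then some arithmetic

    # One shared 8-entry window: the memory burst fills it completely and the
    # arithmetic micro-ops are shut out until something drains.
    shared = burst[:8]
    print("shared window holds:", shared.count("mem"), "mem,", shared.count("arith"), "arith")

    # Partitioned windows (say, 6 memory entries plus 6 arithmetic entries):
    # the extra memory micro-ops stall, but arithmetic work still gets in.
    mem_q = [u for u in burst if u == "mem"][:6]
    arith_q = [u for u in burst if u == "arith"][:6]
    print("partitioned windows hold:", len(mem_q), "mem,", len(arith_q), "arith")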