Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture

those neighboring bytes are the most likely to be the ones it will need to process next.

Cache blocks form the basic unit of cache organization, and RAM is also organized into blocks of the same size as the cache’s blocks. When a block is moved from RAM to the cache, it is placed into a special slot in the cache called a block frame. Figure 11-5 shows a set of cache blocks stored in RAM and a cache with an empty set of block frames.

Figure 11-5: Blocks and block frames


Cache designers can choose from a few different schemes for governing which RAM blocks can be stored in which of the cache’s block frames. Such a scheme is called a cache placement policy, because it dictates where in the cache a block from memory can be placed.

Tag RAM

When the CPU requests a byte from a particular block in RAM, it needs to be able to determine three things very quickly:

- whether or not the needed block is actually in the cache (i.e., whether there is a cache hit or a cache miss)
- the location of the block within the cache (in the case of a cache hit)
- the location of the desired byte (or critical word) within the block (again, in the case of a cache hit)

A cache accommodates all three needs by associating a special piece of memory—called a tag—with each block frame in the cache. The tag holds information about the blocks currently being stored in the frame, and that information allows the CPU to determine the answer to all three of the questions above. However, the speed with which that answer comes depends on a variety of factors.
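To make the third question concrete, here is a minimal sketch (mine, not the book’s) of how a byte address splits into a block number and a byte offset within that block; the 32-byte block size is just an illustrative assumption.

```python
# A minimal sketch of address-to-block arithmetic. The 32-byte block size is
# an assumed, illustrative value; real caches pick their own block size.

BLOCK_SIZE = 32  # bytes per block (assumption for this example)

def split_address(address):
    """Return (block_number, byte_offset) for a given byte address."""
    block_number = address // BLOCK_SIZE  # which block the byte belongs to
    byte_offset = address % BLOCK_SIZE    # where the byte sits inside that block
    return block_number, byte_offset

print(split_address(1000))  # -> (31, 8): byte 1000 is byte 8 of block 31
```

In this simplified picture, the block number is what the tags identify, and the offset picks out the desired byte (or critical word) once the block has been found.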

Tags are stored in a special type of memory called the tag RAM. This memory has to be made of very fast SRAM because it can take some time to search it in order to locate the desired cache block. The larger the cache, the greater the number of blocks, and the greater the number of blocks, the more tag RAM you need to search and the longer it can take to locate the correct block. Thus the tag RAM can add unwanted access latency to the cache. As a result, you not only have to use the fastest RAM available for the tag RAM, but you also have to be smart about how you use tags to map RAM blocks to block frames. In the following section, I’ll introduce the three general options for doing such mapping, and I’ll discuss some of the pros and cons of each option.

Fully Associative Mapping

The most conceptually simple scheme for mapping RAM blocks to cache block frames is called fully associative mapping. Under this scheme, any RAM block can be stored in any available block frame. Fully associative mapping is depicted in Figure 11-6, where any of the red RAM blocks can go into any of the red cache block frames.

The problem with fully associative mapping is that if you want to retrieve a specific block from the cache, you have to check the tag of every single block frame in the entire cache, because the desired block could be in any of the frames. Since large caches can have thousands of block frames, this tag searching can add considerable delay (latency) to a fetch. Furthermore, the larger the cache, the worse the delay gets, since there are more block frames and hence more block tags to check on each fetch.
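As a rough illustration of that cost, here is a short Python sketch (my own, not from the book; the class and method names are invented for the example) of a fully associative lookup. The point is simply that the lookup has to scan every frame’s tag.

```python
# Sketch of a fully associative cache lookup: the requested block could be in
# any frame, so every frame's tag must be checked until a match is found.

class FullyAssociativeCache:
    def __init__(self, num_frames):
        # Each frame holds a (tag, data) pair; None means the frame is empty.
        self.frames = [None] * num_frames

    def lookup(self, block_number):
        """Return the frame index on a hit, or None on a miss."""
        for frame_index, entry in enumerate(self.frames):
            if entry is not None and entry[0] == block_number:
                return frame_index          # hit: this frame's tag matched
        return None                         # miss: no tag matched

    def fill(self, frame_index, block_number, data):
        self.frames[frame_index] = (block_number, data)

cache = FullyAssociativeCache(num_frames=8)
cache.fill(5, block_number=16, data=b"...")
print(cache.lookup(16))  # -> 5 (hit, but only after scanning frames 0 through 5)
print(cache.lookup(3))   # -> None (miss, after scanning all 8 frames)
```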


Direct Mapping

Another, more popular way of organizing the cache is to use direct mapping. In a direct-mapped cache, each block frame can cache only a certain subset of the blocks in main memory.

Figure 11-6: Fully associative mapping

In Figure 11-7, each of the red blocks (blocks 0, 8, and 16) can be cached only in the red block frame (frame 0). Likewise, blocks 1, 9, and 17 can be cached only in frame 1, blocks 2, 10, and 18 can be cached only in frame 2, and so on. Hopefully, the pattern here is apparent: each frame caches every eighth block of main memory. As a result, the potential number of locations for any one block is greatly narrowed, and therefore the number of tags that must be checked on each fetch is greatly reduced. For example, if the CPU needs a byte from blocks 0, 8, or 16, it knows that it only has to check the tag associated with frame 0 to determine if the desired block is in the cache and to retrieve it if it is. This is much faster and more efficient than checking every frame in the cache.
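The placement rule just described is easy to express in code. Here is a small sketch (an illustration of mine, not the book’s own example) of an eight-frame direct-mapped cache in which block N can only ever occupy frame N modulo 8, so a lookup checks exactly one tag.

```python
# Sketch of a direct-mapped cache with 8 frames, matching the figures: blocks
# 0, 8, and 16 all map to frame 0, so only one tag ever needs to be checked.

NUM_FRAMES = 8

class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_FRAMES    # the block number cached in each frame

    def frame_for(self, block_number):
        return block_number % NUM_FRAMES   # every eighth block shares a frame

    def lookup(self, block_number):
        """Return the frame index on a hit, or None on a miss."""
        frame = self.frame_for(block_number)
        return frame if self.tags[frame] == block_number else None

    def fill(self, block_number):
        self.tags[self.frame_for(block_number)] = block_number

cache = DirectMappedCache()
cache.fill(16)
print(cache.lookup(16))  # -> 0 (hit in frame 0; only one tag was checked)
print(cache.lookup(8))   # -> None (miss; frame 0 currently holds block 16)
```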

There are some drawbacks to this scheme, though. For instance, what if blocks 0 to 3 and 8 to 11 combine to form an eight-block working set that the CPU wants to load into the cache and work on for a while? The cache is eight blocks long, but since it’s direct-mapped, it can only store four of these particular blocks at a time. Remember, blocks 0 and 8 have to go in the same frame, as do blocks 1 and 9, 2 and 10, and 3 and 11. As a result, the CPU must load only four blocks of this eight-block set at a time, and swap them in and out as it works on different parts of the set. If the CPU wants to work on this whole eight-block set for a long time, that could mean a lot of swapping. Meanwhile, half of the cache is going completely unused! While direct-mapped caches are almost always faster than fully associative caches due to the shortened amount of time it takes to locate a cached block, they can still be inefficient under some circumstances.

Figure 11-7: Direct mapping

Note that the kind of situation described here, where the CPU would like to store multiple blocks but it can’t because they all require the same frame, is called a collision. In the preceding example, blocks 0 and 8 are said to collide, since they both want to fit into frame 0 but can’t. Misses that result from such collisions are called conflict misses, the second of the three types of cache miss that I mentioned earlier.
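To see how badly collisions can hurt, here is a quick sketch (mine, assuming the eight-frame direct-mapped cache from the example above) that replays that working set. Every access turns into a conflict miss, even though half the frames never hold anything.

```python
# Counting conflict misses for the eight-block working set {0..3, 8..11} in an
# 8-frame direct-mapped cache. Blocks 0/8, 1/9, 2/10, and 3/11 keep evicting
# each other from frames 0..3 while frames 4..7 sit empty.

NUM_FRAMES = 8
tags = [None] * NUM_FRAMES
misses = 0

working_set = [0, 1, 2, 3, 8, 9, 10, 11]
for _ in range(4):                   # reuse the working set a few times
    for block in working_set:
        frame = block % NUM_FRAMES
        if tags[frame] != block:     # the needed block is not in its frame
            misses += 1              # conflict miss: fetch it, evicting the old block
            tags[frame] = block

print(misses)  # -> 32: all 32 accesses miss; frames 4 through 7 are never used
```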

N-Way Set Associative Mapping

One way to get some of the benefits of direct-mapped caches while lessening the amount of cache space wasted due to collisions is to restrict the caching of main memory blocks to a subset of the available cache frames. This technique is called set associative mapping, and a few popular implementations of it are described below.

Four-Way Set Associative Mapping

To see an example of what it means to restrict main memory blocks to a subset of the available cache frames, take a look at Figure 11-8, which illustrates four-way set associative mapping.

In Figure 11-8, any of the red blocks can go anywhere in the red set of frames (set 0) and any of the light yellow blocks can go anywhere in the light yellow set of frames (set 1). Think of the four-way associative cache like this: You took a fully associative cache and cut it in two, restricting half the main memory blocks to one side and half the main memory blocks to the other.


Figure 11-8: Four-way set associative mapping

This way, the odds of a collision are greatly reduced versus the direct-mapped cache, but you still don’t have to search all the tags on every fetch like you did with the fully associative cache. For any given fetch, you need search only a single, four-block set to find the block you’re looking for, which in this case amounts to half the cache.

The cache pictured in Figure 11-8 is said to be four-way set associative because the cache is divided into sets of four frames each. Since this cache has only eight frames, it can accommodate only two four-frame sets. A larger cache could accommodate more four-frame sets, reducing the odds of a collision even more.
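Here is a sketch of that organization (my own simplification, assuming the set index is just the block number modulo the number of sets, and a simple oldest-first eviction): eight frames grouped into two four-frame sets, so a lookup never checks more than four tags.

```python
# Sketch of a four-way set associative cache shaped like Figure 11-8: 8 frames
# split into 2 sets of 4. A block's set is fixed, but it may occupy any of the
# four frames inside that set, so a lookup scans at most four tags.

NUM_SETS = 2   # two four-frame sets, as in the eight-frame cache of the figure
WAYS = 4       # frames per set

class SetAssociativeCache:
    def __init__(self):
        # sets[s] holds up to WAYS tags (block numbers) currently cached in set s.
        self.sets = [[] for _ in range(NUM_SETS)]

    def lookup(self, block_number):
        s = block_number % NUM_SETS           # which set could hold this block
        return block_number in self.sets[s]   # check at most WAYS tags

    def fill(self, block_number):
        s = block_number % NUM_SETS
        if len(self.sets[s]) == WAYS:         # set is full: evict the oldest block
            self.sets[s].pop(0)
        self.sets[s].append(block_number)

cache = SetAssociativeCache()
for block in (0, 2, 8, 16):   # with two sets, these all land in set 0
    cache.fill(block)
print(cache.lookup(8))   # -> True  (found after checking only set 0's four tags)
print(cache.lookup(1))   # -> False (only set 1's tags are checked, and it is empty)
```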

Figure 11-9 shows a four-way set associative cache like the one in Figure 11-8, but with three sets instead of two. Notice that there are fewer red main memory blocks competing for space in set 0, which means lower odds of a collision.

In addition to its decreased odds of a collision, a four-way set associative cache has a low access latency because the number of frames that must be searched for a block is limited. Since all the sets consist of exactly four frames, no matter how large the cache gets, you’ll only ever have to search through four frames (or one full set) to find any given block. This means that as the cache gets larger and the number of sets that it can accommodate increases, the tag searches become more efficient. Think about it. In a cache with three sets, only one-third of the cache (or one set) needs to be searched for a given block. In a cache with four sets, only one-fourth of the cache is searched. In a cache with one hundred four-block sets, only one-hundredth of the cache needs to be searched. The relative search efficiency scales with the cache size.


Figure 11-9: Four-way set associative mapping with three block frames

Two-Way Set Associative Mapping

Another way to increase the number of sets in the cache without actually increasing the cache size is to reduce the number of blocks in each set. Check out Figure 11-10, which shows a two-way set associative cache.

Figure 11-10: Two-way set associative mapping


To get a better idea of what’s going on, let’s compare the two-way associative cache to both the direct-mapped and the four-way cache. For the sake of comparison, assume that the cache size and the memory size both stay constant. And as you read the comparison, keep in mind that since each increase in the level of associativity (i.e., from two-way to four-way, or from four-way to eight-way) also increases the number of tags that must be checked in order to locate a specific block, an increase in associativity also means an increase in the cache’s latency.

Two-Way vs. Direct-Mapped

With the two-way cache, the number of potential collisions (and hence the miss rate) is reduced compared to the direct-mapped scheme. However, the number of tags that must be searched for each fetch is twice as high. Depending on the relative sizes of the cache and main memory, this may or may not increase the cache’s overall latency to the point that the decreased conflict miss rate is worth it.

Two-Way vs. Four-Way

Though a two-way cache’s latency is less than that of a four-way cache, its number of potential collisions (and hence its miss rate) is higher. Just as with the preceding comparison, how a two-way associative cache compares to a four-way associative cache depends on just how much latency the four-way cache’s increase in associativity ends up costing versus the decrease in conflict miss rate.
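The trade-off in these two comparisons is easiest to see with numbers. The following rough simulation (my own, not from the book; it assumes LRU replacement within each set and reuses the eight-block working set from the direct-mapping example) counts misses for an eight-frame cache at three associativities.

```python
# Rough miss-count comparison for an 8-frame cache at three associativities,
# replaying the {0..3, 8..11} working set. LRU replacement within each set is
# assumed; a direct-mapped cache is modeled as 1-way set associative.

NUM_FRAMES = 8

def count_misses(ways, accesses):
    num_sets = NUM_FRAMES // ways
    sets = [[] for _ in range(num_sets)]  # each set: tags in least-recently-used order
    misses = 0
    for block in accesses:
        s = block % num_sets
        if block in sets[s]:
            sets[s].remove(block)         # hit: move the tag to most-recently-used
        else:
            misses += 1                   # miss: the block must be fetched
            if len(sets[s]) == ways:
                sets[s].pop(0)            # evict the least recently used block
        sets[s].append(block)
    return misses

accesses = [0, 1, 2, 3, 8, 9, 10, 11] * 4  # work on the set for four passes
for ways in (1, 2, 4):                      # direct-mapped, two-way, four-way
    print(f"{ways}-way: {count_misses(ways, accesses)} misses")
# Prints 32 misses for 1-way, but only 8 (the first-pass misses) for both
# 2-way and 4-way on this particular access pattern.
```

On this deliberately unfriendly access pattern the direct-mapped cache misses on every access, while the two-way and four-way caches hold the whole working set after the first pass; whether that win justifies the extra tag checks is exactly the latency-versus-miss-rate question discussed above.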

Associativity: Conclusions

In general, it turns out that when you factor in current technological conditions (the speed of tag RAM, the range of sizes that most caches fall
