into, and so on), some level of associativity less than or equal to eight-way
turns out to be optimal for most caches. Any more than eight-way associativity
and the cache’s latency is increased so much that the decrease in miss rate
isn’t worth it. Any less than two-way associativity and the number of collisions often increases the miss rate to the point that the decrease in latency isn’t
worth it. There are some direct-mapped caches out there, though.
Before I conclude the discussion of associativity, there are two minor bits of information that I should include for the sake of completeness. First, though you’ve probably already figured it out, a direct-mapped cache is simply a one-way set associative cache, and a fully associative cache is an n-way set associative cache, where n is equal to the total number of blocks in the cache.
Second, the cache placement formula, which enables you to compute the set in which an arbitrary block of memory should be stored in an n-way associative cache, is as follows:

(block_address) MOD (number_of_sets_in_cache)
I recommend trying out this formula on some of the preceding examples. It might seem like a boring and pointless exercise, but if you take five minutes and place a few blocks using this simple formula in conjunction with the preceding diagrams, everything I’ve said in this section will really fall into place for you in a “big picture” kind of way. And it’s actually more fun to do than it probably sounds.
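As a quick illustration, here is the placement formula written out in C. The cache geometry (eight sets) and the block addresses are made up purely for this example; you can substitute the numbers from the preceding diagrams if you want to check your work against them.

#include <stdio.h>

/* The placement formula from the text: a memory block lands in set
 * (block_address) MOD (number_of_sets_in_cache).  The eight-set
 * geometry and the sample addresses below are hypothetical. */
int main(void)
{
    const unsigned number_of_sets_in_cache = 8;
    const unsigned block_addresses[] = { 0, 7, 8, 13, 21 };

    for (int i = 0; i < 5; i++) {
        unsigned set = block_addresses[i] % number_of_sets_in_cache;
        printf("block %2u -> set %u\n", block_addresses[i], set);
    }
    return 0;
}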
Temporal and Spatial Locality Revisited: Replacement/Eviction Policies and Block Sizes
Caches can increase the amount of benefit they derive from temporal locality by implementing an intelligent replacement policy (also called, conversely, an eviction policy). A replacement policy dictates which of the blocks currently in the cache will be replaced by any new block that gets fetched in. (Or, another way of putting it is that an eviction policy dictates which of the blocks currently in the cache will be evicted in order to make room for any new blocks that are fetched in.)
Types of Replacement/Eviction Policies
One way to implement a replacement policy would be to pick a block at
random to be replaced. Other possible replacement policies would be a
FIFO policy, a LIFO policy, or some other such variation. However, none of
these policies take into account the fact that any block that was recently used
is likely to be used again soon. With these simple policies, you wind up evicting blocks that will be used again shortly, thereby increasing the cache miss rate
and eating up valuable memory bus bandwidth with a bunch of unnecessary
fetches.
The ideal replacement policy would be one that makes room for an
incoming block by evicting the cache block that is destined to go unused for
the longest period of time. Unfortunately, computer architects haven’t yet
devised a way of infallibly predicting the future, so there’s no way to know for sure which of the blocks that are currently residing in the cache is the one
that will go the longest time before being accessed again.
Even if you can’t predict the future, you can make an educated guess based on what you know of the cache’s past behavior. The optimal replacement policy that doesn’t involve predicting the future is to evict the block that has gone the longest period of time without being used, or the least recently used (LRU) block. The logic here is that if a block hasn’t been used in a while, it’s less likely to be part of the current working set than a block that was more recently used.
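To make the idea concrete, here is a minimal sketch of LRU victim selection for a single set, written in C with a “last used” counter per block. The eight-way geometry and the names are invented for illustration; the point to notice is that true LRU means comparing every block in the set on every replacement.

#include <stdint.h>

#define WAYS 8   /* hypothetical eight-way set */

struct set {
    uint64_t last_used[WAYS];   /* value of a global access counter at last touch */
};

/* Scan every block in the set and return the one with the oldest
 * (smallest) last-used stamp.  Cheap in software, costly in hardware. */
int lru_victim(const struct set *s)
{
    int victim = 0;
    uint64_t oldest = s->last_used[0];
    for (int w = 1; w < WAYS; w++) {
        if (s->last_used[w] < oldest) {
            oldest = s->last_used[w];
            victim = w;
        }
    }
    return victim;
}

int main(void)
{
    struct set s = { { 40, 12, 77, 5, 91, 33, 60, 18 } };
    return lru_victim(&s);   /* way 3 has the smallest stamp */
}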
An LRU algorithm, though ideal, isn’t quite feasible to implement in real life. Control logic that checks each cache block to determine which one is the least recently used not only would add complexity to the cache design, but such a check would also take up valuable time and thereby add unwanted latency to each replacement. Most caches wind up implementing some type of pseudo-LRU algorithm that approximates true LRU by marking blocks as more and more dirty the longer they sit unused in the cache. When a new block is fetched into the cache, the dirtiest blocks are the first to be replaced.
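The details of pseudo-LRU vary from one cache design to the next. One common hardware-friendly approximation is the binary-tree scheme sketched below in C for a four-way set: a few bits per set record which “side” of the set was touched least recently, and the victim is found by simply following those bits. This is my own illustrative example, not a description of any particular processor’s cache.

#include <stdint.h>
#include <stdio.h>

/* Tree-based pseudo-LRU for one four-way set: three bits form a tiny
 * binary tree.  bits[0] chooses between ways {0,1} and {2,3}, bits[1]
 * picks within {0,1}, and bits[2] picks within {2,3}.  A bit value of
 * 0 means "the left side was touched less recently." */
typedef struct {
    uint8_t bits[3];
} plru4_t;

/* On a hit (or a fill), flip the bits along the path so they point
 * away from the way that was just used. */
void plru4_touch(plru4_t *s, int way)
{
    s->bits[0] = (way < 2) ? 1 : 0;
    if (way < 2)
        s->bits[1] = (way == 0) ? 1 : 0;
    else
        s->bits[2] = (way == 2) ? 1 : 0;
}

/* To pick a victim, walk the tree in the direction the bits point. */
int plru4_victim(const plru4_t *s)
{
    if (s->bits[0] == 0)
        return (s->bits[1] == 0) ? 0 : 1;
    else
        return (s->bits[2] == 0) ? 2 : 3;
}

int main(void)
{
    plru4_t set = { {0, 0, 0} };
    plru4_touch(&set, 0);
    plru4_touch(&set, 3);
    plru4_touch(&set, 1);
    printf("victim: way %d\n", plru4_victim(&set));   /* way 2, never touched */
    return 0;
}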
Sometimes, blocks that aren’t all that dirty get replaced by newer blocks,
just because there isn’t enough room in the cache to contain the entire
working set. A miss that results when a block containing needed data has
been evicted from the cache due to a lack of cache capacity is called a capacity miss. This is the third and final type of cache miss.
Block Sizes
In the section on spatial locality I mentioned that storing whole blocks is one way that caches take advantage of spatial locality of reference. Now that you know a little more about how caches are organized internally, you can take a closer look at the issue of block size.
You might think that as cache sizes increase, you can take even better
advantage of spatial locality by making block sizes even bigger. Surely fetching more bytes per block into the cache would decrease the odds that some part
of the working set will be evicted because it resides in a different block. This is true to some extent, but you have to be careful. If you increase the block
size while keeping the cache size the same, you decrease the number of
blocks that the cache can hold. Fewer blocks in the cache means fewer sets,
and fewer sets means that collisions and therefore misses are more likely.
And of course, with fewer blocks in the cache, the likelihood decreases that
any particular block that the CPU needs will be available in the cache.
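You can see the arithmetic behind this trade-off with a few made-up numbers. The short C program below assumes a hypothetical 32KB, four-way set associative cache and shows how each doubling of the block size halves both the number of blocks and the number of sets.

#include <stdio.h>

int main(void)
{
    const int cache_size = 32 * 1024;   /* 32KB, chosen for illustration */
    const int ways       = 4;

    for (int block_size = 32; block_size <= 256; block_size *= 2) {
        int blocks = cache_size / block_size;
        int sets   = blocks / ways;
        printf("block size %3dB -> %4d blocks, %3d sets\n",
               block_size, blocks, sets);
    }
    return 0;
}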
The upshot of all this is that smaller block sizes allow you to exercise
more fine-grained control of the cache. You can trace out the boundaries of
a working set with a higher resolution by using smaller cache blocks. If your
cache blocks are too large, you wind up with a lot of wasted cache space,
because many of the blocks will contain only a few bytes from the working
set while the rest is irrelevant data. If you think of this issue in terms of
cache pollution, you can say that large cache blocks are more prone to
pollute the cache with non-reusable data than small cache blocks.
Figure 11-11 shows the memory map we’ve been using, sitting in a cache
with large block sizes.
Figure 11-11: Memory map with large block sizes
Figure 11-12 shows the same map, but with the block sizes decreased.
Notice how much more control the smaller blocks allow over cache pollution.
The smaller cache blocks have a higher average ratio of red to blue in each
block, which means that it’s easier to keep the precise contours of the work-
ing set in the cache.
Figure 11-12: Memory map with small block sizes
The other problems with large block sizes are bandwidth related. The
larger the block size, the more data is fetched with each load, so large block
sizes can really eat up memory bus bandwidth, especially if the miss rate is
high. A system therefore needs plenty of bandwidth if it’s going to make
good use of large cache blocks. Otherwise, the increase in bus traffic can
increase the amount of time it takes to fetch a cache block from memory,
thereby adding latency to the cache.
Write Policies: Write-Through vs. Write-Back
So far, this entire chapter has dealt with only one type of memory traffic: loads, or requests for data from memory. I’ve talked only about loads because they make up the vast majority of memory traffic. The remainder of memory traffic is made up of stores, which in simple uniprocessor systems are much easier to deal with. In this section, I’ll explain how to handle stores in single-processor systems with just an L1. When you throw in more caches and multiple processors, things get more complicated than I want to go into here.
Once a retrieved piece of data is modified by the CPU, it must be stored
or written back out to main memory so that the rest of the system has access
to the most up-to-date version of it.
There are two ways to deal with such writes to memory. The first way is
to immediately update all the copies of the modified data in each level of
the cache hierarchy to reflect the latest changes. A piece of modified data
would be written to both the L1 and main memory so that all of its copies
are current. Such a policy for handling writes is called write-through, since it writes the modified data through to all levels of the hierarchy.
A write-through policy can be nice for multiprocessor and I/O–intensive
system designs, since multiple clients are reading from memory at once and
all need the most current data available. However, the multiple updates per
write required by this policy can greatly increase memory traffic. For each
store, the system must update multiple copies of the modified data. If there’s
a large amount of data that has been modified, this increased write activity
could eat up quite a bit of memory bandwidth that could be used for the
more important load traffic.
The alternative to write-through is write-back, and it can potentially result in less memory traffic. With a write-back policy, changes propagate down to
the lower levels of the hierarchy as cache blocks are evicted from the higher
levels. An updated piece of data in an L1 cache block will not be updated in
main memory until that block is evicted from the L1. The trade-off for write-
back is that it’s more complex to keep the multiple copies of the data in sync,
especially in a multiprocessor system.
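As a rough software analogy, the toy C sketch below contrasts the two policies for a single cache block. All of the names and structures are invented for illustration; a real cache does this per block in hardware, alongside its tags and coherence state.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum write_policy { WRITE_THROUGH, WRITE_BACK };

struct cache_block {
    uint8_t data[64];
    bool    dirty;          /* meaningful only for write-back */
};

static uint8_t main_memory[64];   /* stand-in for this block's home in DRAM */

void cache_store(struct cache_block *blk, int offset, uint8_t value,
                 enum write_policy policy)
{
    blk->data[offset] = value;
    if (policy == WRITE_THROUGH) {
        main_memory[offset] = value;   /* update memory immediately */
    } else {
        blk->dirty = true;             /* defer the update until eviction */
    }
}

void cache_evict(struct cache_block *blk)
{
    /* A write-back cache pushes modified data down the hierarchy only
     * when the block is evicted; a clean block can simply be dropped. */
    if (blk->dirty) {
        memcpy(main_memory, blk->data, sizeof blk->data);
        blk->dirty = false;
    }
}

int main(void)
{
    struct cache_block blk = { {0}, false };
    cache_store(&blk, 3, 0xAB, WRITE_BACK);   /* memory not yet updated */
    cache_evict(&blk);                        /* now main_memory[3] == 0xAB */
    return 0;
}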
Conclusions
There is much, much more that can be said about caching; this chapter has
covered only the basic concepts. As main memory moves farther away from
the CPU in terms of CPU clock cycles, the importance of caching will only
increase. For modern microprocessor systems, larger on-die caches have
turned out to be one of the simplest and most effective uses for the increased
transistor budgets that new manufacturing process technologies afford.
Intel’s Pentium M, Core Duo, and Core 2 Duo
In March 2003, Intel officially launched the Pentium M, a new x86 processor designed specifically for the mobile computing market. The Pentium M was a success from both performance and power efficiency perspectives, with the result that the immediate successor to the Pentium M, called Core Duo, made its way out of the portable niche and into desktop computers.
Core Duo’s combination of high performance and low power consumption made it clearly superior to processors based on Intel’s aging Netburst microarchitecture, and in the fall of 2005, Intel announced plans to unify its entire x86 product line, from portables all the way up to servers, on the follow-up to Core Duo: a brand new, power-efficient, 64-bit microarchitecture called Core. The desktop implementation of Core is the new Core 2 Duo processor line from Intel, while server products based on Core are sold under the venerable Xeon brand.
This chapter covers the major design aspects of the Pentium M, Core
Duo, and Core 2 Duo, comparing the microarchitectures of these three
related processors with the Intel processors covered in previous chapters.
I’ll talk about the major new features common to all three designs, like
micro-ops fusion and the improved branch predictor, as well as each
processor’s individual innovations. Finally, I’ll discuss memory instruction
reordering, and I’ll explain how a feature called memory disambiguation
enables Core 2 to perform a kind of speculative execution on a processor’s
outgoing results stream.
Code Names and Brand Names
As far back as the launch of the Pentium 4 line, Intel began emphasizing
the difference between microarchitecture and implementation. This difference, which is discussed in the section “Processor Organization and Core Microarchitecture” on page 247, has taken on new importance with the introduction of multi-core processors like the Core Duo. A particular processor from Intel might have three different names associated with it: a brand name for the processor itself (e.g., Pentium 4), a separate brand name for the microarchitecture that the processor implements (e.g., Netburst), and a commonly used code name that refers to a particular variant (e.g., Prescott) within a larger family of closely related microarchitectures that share a brand name. (Willamette, Northwood, Prescott, and Cedar Mill are all variants of Intel’s Netburst microarchitecture that are used in different Pentium 4–branded processors.) The reader (and the author of computer books) is therefore forced to juggle three different commonly used names when referring to any