Lecture Notes CS/EE 3810
Chapter 5: Memory Hierarchies

For our discussions so far, we have assumed that the memory stage of the pipeline takes a single cycle. We've already seen that memory accesses are extremely slow and getting slower. A single memory access today takes about 300 cycles, so we need cache hierarchies to reduce stalls. The cache stores a subset of memory on the chip, allowing fast access to some memory. For example, a 32KB L1 data cache can service about 96% of all requests. Such a cache may have an access time of 2 cycles, resulting in few stalls. If there is a miss, for an in-order pipeline, the pipeline is frozen until the value is fetched from memory. For an out-of-order pipeline, other instructions can continue to execute while the memory fetch proceeds in the background, but eventually, the long-latency instruction will hold up commit and stall execution.

Typically, separate L1 instruction and data caches are implemented so that instruction fetch and data fetch can happen at the same time. Smaller caches also have smaller access times. The L1 caches are backed up by a much larger (and slower) L2 cache. There is usually a single L2 cache that is shared by instructions and data. Higher access times for the L2 are not a great problem because the L1 takes care of most requests. The L2 is primarily in place to avoid expensive memory accesses, and it is worthwhile even if its hit rate is less than 10%. Note that each level of the hierarchy is accessed sequentially -- that is, if there is an L1 miss, we do not look up the L2 and memory in parallel (to reduce power consumption and complexity). Therefore, in some cases, the addition of another level of cache can lead to poor performance if it yields very few hits (this can happen when we're streaming through large data arrays that do not fit in cache).

We'll begin by looking at how the cache is accessed. As an example, we first design a small cache that consists of eight 8-byte words. Since the CPU can address individual bytes, the last three bits of the address refer to the offset within a word. We need three more bits to select one of the 8 words. Thus, the last six bits of the address are used to select a byte out of this cache. If an address has 64 bits, the remaining 58 bits are stored in an adjoining tag array. Note that multiple different addresses can map to the same location in the cache. The tag keeps track of which address is currently stored in the cache. Hence, while picking a byte out of the data array, we must also pick the corresponding tag out of the tag array and compare it with the address requested by the CPU to make sure we're returning the correct data. If the tags do not match, a cache miss is signaled and the request is sent to the next level in the cache hierarchy.

The amount of data stored in each row of the cache can be increased. For example, we could store 32 bytes in each row. The contiguous bytes stored in a row are referred to as a "block" or "cache line". We must now use the last 5 bits of the address as offset bits (to select 1 of the 32 bytes in that block). If two different words are frequently accessed and map to the same row, they can't co-exist in the cache and this leads to many cache misses. A set-associative cache allows us to maintain two different blocks in each row. On an access, both blocks are read out of the data and tag arrays. If one of the tags matches the tag of the requested address, a hit is signaled. Each block is referred to as a "way" and all the ways in a row are referred to as a "set".
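To make the tag/index/offset breakdown concrete, here is a minimal Python sketch (not part of the original notes) of a direct-mapped lookup for the example cache above: eight 8-byte blocks and 64-bit addresses. The field widths and the tag comparison are the point; the data array itself is omitted.

# Sketch: direct-mapped cache with 8 blocks of 8 bytes each, 64-bit addresses.
# Offset = 3 bits, index = 3 bits, tag = the remaining 58 bits.

OFFSET_BITS = 3                    # log2(8-byte block)
INDEX_BITS  = 3                    # log2(8 blocks)
NUM_BLOCKS  = 1 << INDEX_BITS

tags  = [0] * NUM_BLOCKS           # one tag entry per block
valid = [False] * NUM_BLOCKS       # one valid bit per block

def lookup(addr):
    """Return True on a hit, False on a miss (and install the new tag)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)    # remaining 58 bits
    if valid[index] and tags[index] == tag:
        return True                                # tag match: hit
    # Miss: the request would go to the next level; here we just install the tag.
    valid[index] = True
    tags[index]  = tag
    return False

# Two addresses that share an index but differ in their tags keep evicting
# each other -- a conflict in a direct-mapped cache.
print(lookup(0x1040))   # miss (first touch)
print(lookup(0x2040))   # miss (same index, different tag)
print(lookup(0x1040))   # miss again

All three lookups print False: the two addresses fall in the same set but carry different tags, so each access evicts the other. This is exactly the conflict behavior that motivates set-associativity.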
Thus, the number of offset bits = log2(block size), and the number of index bits = log2(number of sets). Tag bits = address size - offset bits - index bits. The size of the data array = block size * number of ways * number of sets. The size of the tag array (in bits) = tag bits * number of ways * number of sets. If a cache has N blocks and 1 way, it is direct-mapped. If a cache has N blocks and N ways, it is fully-associative. (A worked example of these formulas is shown below.)

Next, we'll examine what happens on a miss. On a read miss, you bring in an entire block (with the expectation that spatial locality will exist) and replace one of the existing blocks in the cache. If the cache is set-associative, we have a choice of which block to replace. An LRU (least recently used) policy yields the best performance (since programs exhibit temporal locality). LRU is easy to implement for a 2-way cache, but for higher associativities, processors are forced to implement simpler pseudo-LRU mechanisms.

On a write miss, it may often not be necessary to bring the entire block in (such a policy is known as write-no-allocate); usually, when you write to memory, you tend not to read the value again in the near future. If the block is brought in on a write miss, it is known as a write-allocate policy. On a write hit, one may choose to immediately update the copy of the block in the L2 or in memory. That would be a write-through policy -- it is clearly more bandwidth-intensive, but it simplifies coherence if there are multiple processors sharing data (more details on this in the next chapter). To save on bandwidth, we could wait until a cache block is evicted before updating the L2 or memory. Such a policy is known as a write-back policy. Typically, the L1-L2 interface is write-through, while the L2-memory interface is write-back.

Cache misses can be categorized into three main classes: compulsory, capacity, and conflict. It is hard to exactly identify what causes a given cache miss, so these categories are only intended to serve as a guideline. A compulsory miss happens when a word is accessed for the first time. It can only be avoided through some kind of prefetch mechanism; a large line size is the simplest form of prefetch as it brings in neighboring words on every cache miss. To count the number of compulsory misses, simulate an infinitely sized cache. Capacity misses are caused because the program's working set is larger than the cache capacity. For example, if we access N different blocks between successive accesses to a block X, the second access to X is likely to yield a cache miss if the cache can only accommodate fewer than N blocks. Thus, the additional misses encountered in moving from an infinite cache to a fully-associative finite cache can be attributed to capacity misses. Conflict misses occur when a block is evicted only because a different block mapped to the same location. The additional misses encountered in moving from a fully-associative cache to a direct-mapped cache can be attributed to conflict misses. (A small simulation sketch of this classification is also shown below.)

Finally, we'll see how virtual memory is organized on a system. Virtual memory makes it possible for each process to have the illusion that a lot of memory is available to it. For example, each process may believe that it has 4GB (say) of virtual memory, while the physical memory on the system may only have a capacity of 1GB (that is shared by all processes). Thus, a part of the process' memory is stored in physical memory, while the rest is stored on disk.
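As a worked example of the geometry formulas above (the cache parameters here are chosen only for illustration), consider a 32KB, 4-way set-associative cache with 64-byte blocks and 64-bit addresses:

# Worked example of the cache geometry formulas (illustrative parameters).
from math import log2

ADDR_BITS  = 64
CACHE_SIZE = 32 * 1024     # 32KB of data
BLOCK_SIZE = 64            # bytes per block
WAYS       = 4

num_sets    = CACHE_SIZE // (BLOCK_SIZE * WAYS)       # 128 sets
offset_bits = int(log2(BLOCK_SIZE))                   # 6
index_bits  = int(log2(num_sets))                     # 7
tag_bits    = ADDR_BITS - offset_bits - index_bits    # 51

data_array_bits = BLOCK_SIZE * 8 * WAYS * num_sets    # 262,144 bits (32KB)
tag_array_bits  = tag_bits * WAYS * num_sets          # 26,112 bits (~3.2KB)

print(offset_bits, index_bits, tag_bits)
print(data_array_bits // 8, tag_array_bits)

In this configuration the tag array adds roughly 3.2KB (about 10%) on top of the 32KB data array, not counting valid, dirty, and replacement state.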
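The miss classification above can also be turned into a simple simulation. The sketch below is one possible realization (not taken from the notes): it runs a trace of block addresses through an infinite cache, a fully-associative LRU cache, and a direct-mapped cache of the same capacity, then takes differences exactly as described above.

# Count compulsory, capacity, and conflict misses for a trace of block
# addresses: simulate an infinite cache, a fully-associative LRU cache,
# and a direct-mapped cache, then take differences. Parameters are illustrative.
from collections import OrderedDict

NUM_BLOCKS = 4                      # capacity of the finite caches, in blocks

def count_misses(trace):
    inf_misses = fa_misses = dm_misses = 0
    seen   = set()                  # "infinite" cache: misses only on first touch
    lru    = OrderedDict()          # fully-associative cache with LRU replacement
    direct = {}                     # direct-mapped cache: set index -> resident block

    for block in trace:
        # Infinite cache.
        if block not in seen:
            inf_misses += 1
            seen.add(block)

        # Fully-associative LRU cache with NUM_BLOCKS blocks.
        if block in lru:
            lru.move_to_end(block)            # update recency on a hit
        else:
            fa_misses += 1
            lru[block] = True
            if len(lru) > NUM_BLOCKS:
                lru.popitem(last=False)       # evict the least recently used block

        # Direct-mapped cache with NUM_BLOCKS sets of one block each.
        index = block % NUM_BLOCKS
        if direct.get(index) != block:
            dm_misses += 1
            direct[index] = block

    compulsory = inf_misses
    capacity   = fa_misses - inf_misses
    conflict   = dm_misses - fa_misses
    return compulsory, capacity, conflict

# Blocks 0 and 4 collide in the direct-mapped cache even though the whole
# working set (three blocks) fits in four blocks.
print(count_misses([0, 1, 4, 0, 4, 0, 4]))    # (3, 0, 4)

On this trace, the seven direct-mapped misses break down into three compulsory misses and four conflict misses; a fully-associative cache of the same size would have avoided the conflicts.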
Returning to virtual memory: hopefully, most of our requests can be found in physical memory -- if not, a page fault is incurred and data is copied from disk to memory (very, very expensive). Data is typically organized at the granularity of pages (say, an 8KB page size). Each virtual page is mapped to some physical page. The CPU and the process only deal with virtual memory addresses. Before we can access memory, we must know the physical address, which requires us to translate the virtual address into a physical address. Each process maintains a page table that tracks the virtual-to-physical page translation for every virtual page. Obviously, the page table itself may contain millions of bytes of data and will often require a memory access. To reduce this penalty, a subset of the virtual-to-physical page translations is stored in a small buffer called the translation look-aside buffer (TLB). Before accessing memory, the TLB is looked up to find the physical page number and, hopefully, most accesses will be hits. On a miss, the page table will have to be looked up (much slower).

Note that out of a 64-bit (say) virtual address, the last 13 bits pick a byte within an 8KB page (the page offset). Since we are allocating memory at page granularity, the page offset is the same in the virtual and physical address. The remaining 51 bits of the virtual address comprise the virtual page number. These are translated to yield the bits of the physical page number.

When accessing the cache, the implementation would be greatly simplified if we only dealt with physical addresses. However, this requires us to first do the virtual-to-physical translation, which adds to the cache access latency. It would be good for performance to begin accessing the cache with just the virtual address. Let's first look at potential pitfalls if we use the virtual address to index into the cache. Note that different processes can share data -- a single location in physical memory may be mapped into the virtual address spaces of different processes (potentially at different virtual addresses). When one process updates that data, it should be visible to the other process. Thus, one physical location has two different names. We must make sure that these different virtual names map to the same set in the cache. Else, the same physical memory location may end up being stored in two different locations in the cache. Each process would then make updates to its own cached copy, without these updates ever being visible to the other process. It is therefore important that multiple virtual names for the same physical memory all map to the same location (set) in the cache.

We have ensured that two different virtual names for the same physical address are mapped to the same set. However, if we save the virtual address in the tag array, a process may not register a hit if the name that is currently in the cache is that of the other process. Not only does this lead to more misses, it can also lead to correctness issues in a multi-way cache: two different virtual addresses for the same location may be saved in adjacent ways, and each process registers a hit but ends up accessing a different copy in the cache -- the same problem that we were trying to avoid. Hence, the tag array must save the physical page number so that every process registers a hit for the same location in the cache.
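Before putting translation together with the cache access pipeline, here is a small Python sketch of the translation step by itself. The 8KB page size matches the example above; the TLB and page-table contents are made up purely for illustration, and a Python dictionary stands in for what is really a multi-level structure stored in memory.

# Sketch: translating a 64-bit virtual address with 8KB pages.
# Page offset = 13 bits; the remaining 51 bits are the virtual page number.
PAGE_OFFSET_BITS = 13

# Made-up translations: virtual page number -> physical page number.
page_table = {0x12345: 0x00042, 0x12346: 0x00107}
tlb        = {0x12345: 0x00042}                  # small subset of the page table

def translate(vaddr):
    vpn    = vaddr >> PAGE_OFFSET_BITS           # virtual page number
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn in tlb:                               # fast common case: TLB hit
        ppn = tlb[vpn]
    elif vpn in page_table:                      # TLB miss: look up the page table
        ppn = page_table[vpn]
        tlb[vpn] = ppn                           # cache the translation in the TLB
    else:
        raise Exception("page fault: bring the page in from disk")
    return (ppn << PAGE_OFFSET_BITS) | offset    # page offset is unchanged

print(hex(translate((0x12345 << 13) | 0x1A4)))   # TLB hit
print(hex(translate((0x12346 << 13) | 0x010)))   # TLB miss, page-table hit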
The slides show the cache access pipeline. The virtual address is enough to index into the data and tag arrays and pick out all the ways in that set. Before doing the tag comparison, though, we must do the virtual-to-physical page translation. This happens through the TLB, an access that is done in parallel with looking up the tag and data arrays. Finally, the physical tag comparison is done and data is returned to the CPU. This is a virtually indexed, physically tagged cache.
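As an aside that goes slightly beyond the notes: one common way to guarantee that all virtual names of a physical location index the same set is to draw the index (and block offset) bits entirely from the page offset, since those bits are not translated. With the illustrative numbers below (8KB pages as above, 64-byte blocks, 4 ways), this caps a virtually indexed L1 at 32KB.

# Aside with illustrative numbers: if the block offset plus index fits within
# the page offset, the index bits are identical in the virtual and physical
# address, so every virtual name of a block lands in the same set.
from math import log2

PAGE_SIZE  = 8 * 1024       # 8KB pages, as in the notes
BLOCK_SIZE = 64             # assumed block size
WAYS       = 4              # assumed associativity

page_offset_bits  = int(log2(PAGE_SIZE))                  # 13
block_offset_bits = int(log2(BLOCK_SIZE))                 # 6
max_index_bits    = page_offset_bits - block_offset_bits  # 7
max_sets          = 1 << max_index_bits                   # 128
max_cache_size    = max_sets * WAYS * BLOCK_SIZE          # 32KB

print(max_index_bits, max_sets, max_cache_size)

A larger virtually indexed cache would need more ways to keep its index bits inside the page offset.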