Lecture Notes CS/EE 3810
Chapter 5: Memory Hierarchies

For our discussions so far, we have assumed that the memory stage of the pipeline takes a single cycle. We've already seen that memory accesses are extremely slow and getting slower. A single memory access today takes about 300 cycles, so we need cache hierarchies to reduce stalls. The cache stores a subset of memory on the chip, allowing fast access to some memory. For example, a 32KB L1 data cache can service about 96% of all requests. Such a cache may have an access time of 2 cycles, resulting in few stalls. If there is a miss, for an in-order pipeline, the pipeline is frozen until the value is fetched from memory. For an out-of-order pipeline, other instructions can continue to execute while the memory fetch proceeds in the background, but eventually, the long-latency instruction will hold up commit and stall execution.

Typically, separate L1 instruction and data caches are implemented so that instruction fetch and data fetch can happen at the same time. Smaller caches also have smaller access times. The L1 caches are backed up by a much larger (and slower) L2 cache. There is usually a single L2 cache that is shared by instructions and data. Higher access times for the L2 are not a great problem because the L1 takes care of most requests. The L2 is primarily in place to avoid expensive memory accesses, and it is worthwhile even if its hit rate is less than 10%. Note that each level of the hierarchy is accessed sequentially -- that is, if there is an L1 miss, we do not look up the L2 and memory in parallel (to reduce power consumption and complexity). Therefore, in some cases, the addition of another level of cache can lead to poor performance if it yields very few hits (this can happen when we're streaming through large data arrays that do not fit in cache).

We'll begin by looking at how the cache is accessed. As an example, we first design a small cache that consists of eight 8-byte words. Since the CPU can address individual bytes, the last three bits of the address refer to the offset within a word. We need three more bits to select one of the 8 words. Thus, the last six bits of the address are used to select a byte out of this cache. If an address has 64 bits, the remaining 58 bits are stored in an adjoining tag array. Note that multiple different addresses can map to the same location in the cache. The tag keeps track of which address is currently stored in the cache. Hence, while picking a byte out of the data array, we must also pick the corresponding tag out of the tag array and compare it with the address requested by the CPU to make sure we're returning the correct data. If the tags do not match, a cache miss is signaled and the request is sent to the next level in the cache hierarchy.

The amount of data stored in each row of the cache can be increased. For example, we could store 32 bytes in each row. The contiguous bytes stored in a row are referred to as a "block" or "cache line". We must now use the last 5 bits of the address as offset bits (to select 1 of the 32 bytes in that block). If two different words are frequently accessed and map to the same row, they can't co-exist in the cache and this leads to many cache misses. A set-associative cache allows us to maintain two different blocks in each row. On an access, both blocks are read out of the data and tag arrays. If one of the tags matches the tag of the requested address, a hit is signaled. Each block is referred to as a "way" and all the ways in a row are referred to as a "set".
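To make the tag/index/offset breakdown concrete, here is a minimal Python sketch (not part of the original notes) of a direct-mapped lookup for the example cache above: eight 8-byte blocks and 64-bit addresses. The field widths and the tag comparison are the point; the data array itself is omitted.

# Sketch: direct-mapped cache with 8 blocks of 8 bytes each, 64-bit addresses.
# Offset = 3 bits, index = 3 bits, tag = the remaining 58 bits.

OFFSET_BITS = 3                    # log2(8-byte block)
INDEX_BITS  = 3                    # log2(8 blocks)
NUM_BLOCKS  = 1 << INDEX_BITS

tags  = [0] * NUM_BLOCKS           # one tag entry per block
valid = [False] * NUM_BLOCKS       # one valid bit per block

def lookup(addr):
    """Return True on a hit, False on a miss (and install the new tag)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)    # remaining 58 bits
    if valid[index] and tags[index] == tag:
        return True                                # tag match: hit
    # Miss: the request would go to the next level; here we just install the tag.
    valid[index] = True
    tags[index]  = tag
    return False

# Two addresses that share an index but differ in their tags keep evicting
# each other -- a conflict in a direct-mapped cache.
print(lookup(0x1040))   # miss (first touch)
print(lookup(0x2040))   # miss (same index, different tag)
print(lookup(0x1040))   # miss again

All three lookups print False: the two addresses fall in the same set but carry different tags, so each access evicts the other. This is exactly the conflict behavior that motivates set-associativity.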
Thus, the number of offset bits = log2(block size), and the number of index bits = log2(number of sets). Tag bits = address size - offset bits - index bits. The size of the data array = block size * number of ways * number of sets. The size of the tag array (in bits) = tag bits * number of ways * number of sets. If a cache has N blocks and 1 way, it is direct-mapped. If a cache has N blocks and N ways, it is fully-associative. (A worked example of these formulas is shown below.)

Next, we'll examine what happens on a miss. On a read miss, you bring in an entire block (with the expectation that spatial locality will exist) and replace one of the existing blocks in the cache. If the cache is set-associative, we have a choice of which block to replace. An LRU (least recently used) policy yields the best performance (since programs exhibit temporal locality). LRU is easy to implement for a 2-way cache, but for higher associativities, processors are forced to implement simpler pseudo-LRU mechanisms.

On a write miss, it may often not be necessary to bring the entire block in (such a policy is known as write-no-allocate); usually, when you write to memory, you tend not to read the value again in the near future. If the block is brought in on a write miss, it is known as a write-allocate policy. On a write hit, one may choose to immediately update the copy of the block in the L2 or in memory. That would be a write-through policy -- it is clearly more bandwidth-intensive, but it simplifies coherence if there are multiple processors sharing data (more details on this in the next chapter). To save on bandwidth, we could wait until a cache block is evicted before updating the L2 or memory. Such a policy is known as a write-back policy. Typically, the L1-L2 interface is write-through, while the L2-memory interface is write-back.

Cache misses can be categorized into three main classes: compulsory, capacity, and conflict. It is hard to exactly identify what causes a given cache miss, so these categories are only intended to serve as a guideline. A compulsory miss happens when a word is accessed for the first time. It can only be avoided through some kind of prefetch mechanism; a large line size is the simplest form of prefetch as it brings in neighboring words on every cache miss. To count the number of compulsory misses, simulate an infinitely sized cache. Capacity misses are caused because the program's working set is larger than the cache capacity. For example, if we access N different blocks between successive accesses to a block X, the second access to X is likely to yield a cache miss if the cache can only accommodate fewer than N blocks. Thus, the additional misses encountered in moving from an infinite cache to a fully-associative finite cache can be attributed to capacity misses. Conflict misses occur when a block is evicted only because a different block mapped to the same location. The additional misses encountered in moving from a fully-associative cache to a direct-mapped cache can be attributed to conflict misses. (A small simulation sketch of this classification is also shown below.)

Finally, we'll see how virtual memory is organized on a system. Virtual memory makes it possible for each process to have the illusion that a lot of memory is available to it. For example, each process may believe that it has 4GB (say) of virtual memory, while the physical memory on the system may only have a capacity of 1GB (that is shared by all processes). Thus, a part of the process' memory is stored in physical memory, while the rest is stored on disk.
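As a worked example of the geometry formulas above (the cache parameters here are chosen only for illustration), consider a 32KB, 4-way set-associative cache with 64-byte blocks and 64-bit addresses:

# Worked example of the cache geometry formulas (illustrative parameters).
from math import log2

ADDR_BITS  = 64
CACHE_SIZE = 32 * 1024     # 32KB of data
BLOCK_SIZE = 64            # bytes per block
WAYS       = 4

num_sets    = CACHE_SIZE // (BLOCK_SIZE * WAYS)       # 128 sets
offset_bits = int(log2(BLOCK_SIZE))                   # 6
index_bits  = int(log2(num_sets))                     # 7
tag_bits    = ADDR_BITS - offset_bits - index_bits    # 51

data_array_bits = BLOCK_SIZE * 8 * WAYS * num_sets    # 262,144 bits (32KB)
tag_array_bits  = tag_bits * WAYS * num_sets          # 26,112 bits (~3.2KB)

print(offset_bits, index_bits, tag_bits)
print(data_array_bits // 8, tag_array_bits)

In this configuration the tag array adds roughly 3.2KB (about 10%) on top of the 32KB data array, not counting valid, dirty, and replacement state.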
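The miss classification above can also be turned into a simple simulation. The sketch below is one possible realization (not taken from the notes): it runs a trace of block addresses through an infinite cache, a fully-associative LRU cache, and a direct-mapped cache of the same capacity, then takes differences exactly as described above.

# Count compulsory, capacity, and conflict misses for a trace of block
# addresses: simulate an infinite cache, a fully-associative LRU cache,
# and a direct-mapped cache, then take differences. Parameters are illustrative.
from collections import OrderedDict

NUM_BLOCKS = 4                      # capacity of the finite caches, in blocks

def count_misses(trace):
    inf_misses = fa_misses = dm_misses = 0
    seen   = set()                  # "infinite" cache: misses only on first touch
    lru    = OrderedDict()          # fully-associative cache with LRU replacement
    direct = {}                     # direct-mapped cache: set index -> resident block

    for block in trace:
        # Infinite cache.
        if block not in seen:
            inf_misses += 1
            seen.add(block)

        # Fully-associative LRU cache with NUM_BLOCKS blocks.
        if block in lru:
            lru.move_to_end(block)            # update recency on a hit
        else:
            fa_misses += 1
            lru[block] = True
            if len(lru) > NUM_BLOCKS:
                lru.popitem(last=False)       # evict the least recently used block

        # Direct-mapped cache with NUM_BLOCKS sets of one block each.
        index = block % NUM_BLOCKS
        if direct.get(index) != block:
            dm_misses += 1
            direct[index] = block

    compulsory = inf_misses
    capacity   = fa_misses - inf_misses
    conflict   = dm_misses - fa_misses
    return compulsory, capacity, conflict

# Blocks 0 and 4 collide in the direct-mapped cache even though the whole
# working set (three blocks) fits in four blocks.
print(count_misses([0, 1, 4, 0, 4, 0, 4]))    # (3, 0, 4)

On this trace, the seven direct-mapped misses break down into three compulsory misses and four conflict misses; a fully-associative cache of the same size would have avoided the conflicts.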
Returning to virtual memory: hopefully, most of our requests can be found in physical memory -- if not, a page fault is incurred and data is copied from disk to memory (very, very expensive). Data is typically organized at the granularity of pages (say, an 8KB page size). Each virtual page is mapped to some physical page. The CPU and the process only deal with virtual memory addresses. Before we can access memory, we must know the physical address, which requires us to translate the virtual address into a physical address. Each process maintains a page table that tracks the virtual-to-physical page translation for every virtual page. Obviously, the page table itself may contain millions of bytes of data and will often require a memory access. To reduce this penalty, a subset of the virtual-to-physical page translations is stored in a small buffer called the translation look-aside buffer (TLB). Before accessing memory, the TLB is looked up to find the physical page number and, hopefully, most accesses will be hits. On a miss, the page table will have to be looked up (much slower).

Note that out of a 64-bit (say) virtual address, the last 13 bits pick a byte within an 8KB page (the page offset). Since we are allocating memory at page granularity, the page offset is the same in the virtual and physical address. The remaining 51 bits of the virtual address comprise the virtual page number. These are translated to yield the bits of the physical page number.

When accessing the cache, the implementation would be greatly simplified if we only dealt with physical addresses. However, this requires us to first do the virtual-to-physical translation, which adds to the cache access latency. It would be good for performance to begin accessing the cache with just the virtual address. Let's first look at potential pitfalls if we use the virtual address to index into the cache. Note that different processes can share data -- a single location in physical memory may be mapped into the virtual address spaces of different processes (potentially at different virtual addresses). When one process updates that data, it should be visible to the other process. Thus, one physical location has two different names. We must make sure that these different virtual names map to the same set in the cache. Else, the same physical memory location may end up being stored in two different locations in the cache. Each process would then make updates to its own cached copy, without these updates ever being visible to the other process. It is therefore important that multiple virtual names for the same physical memory all map to the same location (set) in the cache.

We have ensured that two different virtual names for the same physical address are mapped to the same set. However, if we save the virtual address in the tag array, a process may not register a hit if the name that is currently in the cache is that of the other process. Not only does this lead to more misses, it can also lead to correctness issues in a multi-way cache: two different virtual addresses for the same location may be saved in adjacent ways, and each process registers a hit but ends up accessing a different copy in the cache -- the same problem that we were trying to avoid. Hence, the tag array must save the physical page number so that every process registers a hit for the same location in the cache.
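Before putting translation together with the cache access pipeline, here is a small Python sketch of the translation step by itself. The 8KB page size matches the example above; the TLB and page-table contents are made up purely for illustration, and a Python dictionary stands in for what is really a multi-level structure stored in memory.

# Sketch: translating a 64-bit virtual address with 8KB pages.
# Page offset = 13 bits; the remaining 51 bits are the virtual page number.
PAGE_OFFSET_BITS = 13

# Made-up translations: virtual page number -> physical page number.
page_table = {0x12345: 0x00042, 0x12346: 0x00107}
tlb        = {0x12345: 0x00042}                  # small subset of the page table

def translate(vaddr):
    vpn    = vaddr >> PAGE_OFFSET_BITS           # virtual page number
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn in tlb:                               # fast common case: TLB hit
        ppn = tlb[vpn]
    elif vpn in page_table:                      # TLB miss: look up the page table
        ppn = page_table[vpn]
        tlb[vpn] = ppn                           # cache the translation in the TLB
    else:
        raise Exception("page fault: bring the page in from disk")
    return (ppn << PAGE_OFFSET_BITS) | offset    # page offset is unchanged

print(hex(translate((0x12345 << 13) | 0x1A4)))   # TLB hit
print(hex(translate((0x12346 << 13) | 0x010)))   # TLB miss, page-table hit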
The slides show the cache access pipeline. The virtual address is enough to index into the data and tag arrays and pick out all the ways in that set. Before doing the tag comparison, though, we must do the virtual-to-physical page translation. This happens through the TLB, an access that is done in parallel with looking up the tag and data arrays. Finally, the physical tag comparison is done and data is returned to the CPU. This is a virtually indexed, physically tagged cache.
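As an aside that goes slightly beyond the notes: one common way to guarantee that all virtual names of a physical location index the same set is to draw the index (and block offset) bits entirely from the page offset, since those bits are not translated. With the illustrative numbers below (8KB pages as above, 64-byte blocks, 4 ways), this caps a virtually indexed L1 at 32KB.

# Aside with illustrative numbers: if the block offset plus index fits within
# the page offset, the index bits are identical in the virtual and physical
# address, so every virtual name of a block lands in the same set.
from math import log2

PAGE_SIZE  = 8 * 1024       # 8KB pages, as in the notes
BLOCK_SIZE = 64             # assumed block size
WAYS       = 4              # assumed associativity

page_offset_bits  = int(log2(PAGE_SIZE))                  # 13
block_offset_bits = int(log2(BLOCK_SIZE))                 # 6
max_index_bits    = page_offset_bits - block_offset_bits  # 7
max_sets          = 1 << max_index_bits                   # 128
max_cache_size    = max_sets * WAYS * BLOCK_SIZE          # 32KB

print(max_index_bits, max_sets, max_cache_size)

A larger virtually indexed cache would need more ways to keep its index bits inside the page offset.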