 Computer Science Basics: The Memory Hierarchy
DemonWasp




PostPosted: Fri Apr 17, 2009 10:34 am   Post subject: Computer Science Basics: The Memory Hierarchy

Note: The following is written off the top of my head. I can't promise complete accuracy, as I'm without my textbooks at the moment and can't verify everything; if you have corrections, please provide them. Further, this is written from an applications-programmer perspective rather than for embedded, device-driver, or operating-system programmers. This thread is in response to the suggestion at http://compsci.ca/v3/viewtopic.php?p=183580 ; it's an introductory article that aims to provide the background needed for more advanced articles.

The Memory Hierarchy
In a computer, there are several different storage areas. You may know them by their common names - cache, RAM, hard drive. You probably also know that these vary in size and speed, from the very fast but very small (cache) to the very slow but very large (hard drives). In general, each layer or tier is faster, smaller, and quicker to access (lower latency) than the layer below it.

The layers traditionally listed are as follows (this is copy-pasted directly from the Wikipedia article):
* Processor registers - fastest possible access (usually 1 CPU cycle), only hundreds of bytes in size
* Level 1 (L1) cache - often accessed in just a few cycles, usually tens of kilobytes
* Level 2 (L2) cache - higher latency than L1 by 2× to 10×, often 512 KiB or more
* Level 3 (L3) cache - higher latency than L2, often 2048 KiB or more
* Main memory (DRAM) - may take hundreds of cycles, but can be multiple gigabytes. Access times may not be uniform, in the case of a NUMA machine.
* Disk storage - millions of cycles latency, but very large
* Tertiary storage - several seconds latency, can be huge

For clarity, "tertiary storage" can mean storage accessed over a network, tape drives and so on.

Why does this Matter?
Let's assume that the objective in designing the computer is to compute things as quickly as possible. Looking at the options above, one might simply ask "why not just build all the storage out of these fancy register things, then it'll go super-fast!". Sadly, this is impossible for a number of reasons: registers take much more silicon per bit than other forms of storage, are more complex, and are relatively expensive to produce; even worse, the more registers you have, the larger your CPU instructions need to be, as they need more bits to address all of the registers. Similarly, it is prohibitively expensive to simply expand the upper tiers of this hierarchy. So how then do we go as fast as possible?

The key is to keep the information we're working on in a higher tier of the hierarchy than the information we aren't. If, for example, an algorithm can do most of its work in the L1 cache and only rarely has to go to L2, L3 or RAM, it can perform the same work hundreds of times faster than if it had to go to RAM for every access. Similarly, a program that works entirely in RAM will perform better than one that continually accesses the hard disk (this is part of the reason why certain database programs can run entirely in RAM, without ever touching the disk).

In practice, programmers are usually only expected to deal with three layers of this hierarchy: RAM, hard drive, and tertiary storage. The CPU's cache and registers are transparent to all but assembly programmers, and even at the assembly level the cache is mostly transparent*. Programmers must explicitly load files from the hard drive or access tertiary storage (the line blurs here, since network-mounted filesystems look like local disks but perform like tertiary storage), but they do not need to decide what is stored in each register or what is available in the cache - that is all handled automatically, partly by the compiler and partly by the caching hardware on the CPU.

* Here I refer to issues with multithreaded programs and their caches. Specifically, if two threads on different processors access the same data in RAM, it will be loaded into each processor's cache. If one thread then changes it, those changes must be propagated to the other cache(s) in order to maintain cache coherence.

The smart programmer will therefore access the lower layers of the hierarchy as infrequently as is practical.

Hold on, what about the Cache?
For the sake of simplicity, I'm going to ignore the distinctions between L1, L2 and L3 caches and treat them simply as "cache". In general, you can think of L1 as a cache on L2, L2 as a cache on L3 and L3 as a cache on RAM.

The cache in a CPU is handled automatically by some hardware on the CPU itself. Whenever a program requests an address from RAM (let's call it "x"), the cache does more than load just that address - it loads a "block" of addresses around x. Then, if the program requests the memory at x+1 shortly thereafter, it doesn't have to access the RAM at all - it can go to the cache. This makes the program run much faster. It turns out that the latency (lost time) going to RAM is a much larger factor than the bandwidth required to load a block (or store a block), so if we can skip small trips to the RAM in favour of making fewer larger trips, we can save a lot of time. When we can make this shortcut, we call that a "cache hit"; when we can't, we call it a "cache miss".

This is the same reason why one might bring a wagon or car to the grocery store - it's easier and faster to buy all your groceries at once than to buy one item at a time and walk it home individually, right when you need it. Once you get your groceries home, you store them in a cupboard (cache) that you can access much faster than the store.
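If you want to see block loading in action, here's a rough, untested sketch in C (the 64-byte cache line size is my assumption - it matches most current desktop CPUs, but yours may differ). It walks a large buffer once sequentially, then again in a pattern that needs a new cache line fetched for nearly every access; both versions read exactly the same bytes.

code:

/* Rough sketch: sequential vs. cache-line-strided access over a buffer
   much larger than any cache. Assumes ~64-byte cache lines. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (64 * 1024 * 1024)   /* 64 MiB buffer */
#define LINE 64                /* assumed cache line size, in bytes */

int main(void)
{
    char *buf = malloc(N);
    if (!buf) return 1;
    memset(buf, 1, N);                      /* touch every page up front */

    long sum = 0;
    clock_t start = clock();
    for (int i = 0; i < N; i++)             /* sequential: one RAM fetch */
        sum += buf[i];                      /* serves 64 accesses */
    double sequential = (double)(clock() - start) / CLOCKS_PER_SEC;

    start = clock();
    for (int off = 0; off < LINE; off++)    /* same accesses, reordered: */
        for (int i = off; i < N; i += LINE) /* each one lands on a line  */
            sum += buf[i];                  /* that has since been evicted */
    double strided = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("sequential: %.3f s   strided: %.3f s   (checksum %ld)\n",
           sequential, strided, sum);
    free(buf);
    return 0;
}

On my hardware I'd expect the second loop to come out several times slower, because it has to pull each cache line from RAM many times instead of once.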

But now we run into a difficulty. Specifically, the cache is tiny compared to main RAM, so there's no way to hold everything we could possibly access. This means that eventually, the cache will be full, and when we want to put more in, we first have to take something out. When we remove something from cache, we push that block back into RAM. This updates the RAM to contain any changes made to that block since it was loaded into cache.

Ideally, we would evict whichever block in the cache we won't need again for the longest time; this is called the clairvoyant algorithm (or Bélády's algorithm) because it requires knowing future accesses, which is essentially impossible to predict with any accuracy. In practice, we use a variety of approximations to this strategy; these vary in complexity (and therefore difficulty and cost to implement in hardware) and accuracy. One of the simplest is the Least Recently Used (LRU) algorithm, which evicts whichever portion of the cache has gone unused for the longest; this generally does well, but may be outperformed in certain circumstances, and may be difficult to implement in hardware, depending on how the cache is structured. For further reading, see the Wiki article on Cache Algorithms.
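As a toy illustration of LRU (this is just software with made-up names, not how real cache hardware is built), here's a small simulation of a fully-associative four-line cache that reports a hit or miss for each requested block:

code:

/* Toy LRU cache simulation: four lines, each remembering which block
   ("tag") it holds and when it was last used. Hypothetical names only. */
#include <stdio.h>

#define CACHE_LINES 4

typedef struct {
    int tag;             /* which block is stored here; -1 means empty */
    unsigned long last;  /* logical time of the most recent access */
} Line;

static Line cache[CACHE_LINES];
static unsigned long now = 0;

/* Returns 1 on a hit; on a miss, evicts the least-recently-used line. */
static int access_block(int tag)
{
    int lru = 0;
    for (int i = 0; i < CACHE_LINES; i++) {
        if (cache[i].tag == tag) {     /* hit: refresh its timestamp */
            cache[i].last = ++now;
            return 1;
        }
        if (cache[i].last < cache[lru].last)
            lru = i;                   /* track the stalest line so far */
    }
    cache[lru].tag = tag;              /* miss: evict the LRU line */
    cache[lru].last = ++now;
    return 0;
}

int main(void)
{
    for (int i = 0; i < CACHE_LINES; i++) cache[i].tag = -1;
    int pattern[] = { 1, 2, 3, 1, 4, 5, 1, 2 };
    for (int i = 0; i < 8; i++)
        printf("block %d: %s\n", pattern[i],
               access_block(pattern[i]) ? "hit" : "miss");
    return 0;
}

Real caches are usually set-associative rather than fully associative, and often use cheaper approximations of LRU, but the eviction idea is the same.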

Okay, so how do I program to take advantage of caching?
In general, caching will speed up any program at least a little (the cases where it causes a performance hit are rare). However, if we're smart, we can do much better than a slight performance increase. One of the biggest boosts is to try to keep commonly-used data together so that it will all go into the cache together. Consider the following example:

Suppose you have to store N integers (we'll assume C for the moment, so these are ints). You could store these either individually, or as an array - there's no reason that these values should necessarily belong in an array together, as they represent totally different things. If you declare them separately, you may find that they have been sprinkled throughout RAM (depending on a whole lot of factors I'm not going to get into); when the CPU tries to use them, it has to keep trading things out of the cache, as there are frequent cache misses. This causes a large performance hit, and is known as "cache thrashing" - the cache proves useless as your program has to go to RAM every single time anyway. On the other hand, if you store them as an array of N integers, the cache will load a large portion of the array at a time (potentially even the whole thing, depending on cache size, block size and the array's size and location relative to these blocks); the program then barely goes to RAM at all and runs much faster.
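As a rough sketch of the difference (untested, and the exact numbers will vary wildly with your compiler, allocator and hardware), compare summing N ints stored in one contiguous array against summing N ints that were each malloc'd separately and reached through a table of pointers:

code:

/* Rough sketch: contiguous array vs. separately-allocated integers.
   The pointer-chasing version has far worse spatial locality, so it
   tends to spend much more time waiting on the memory hierarchy. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)   /* about four million ints */

int main(void)
{
    int *arr = malloc(N * sizeof *arr);              /* one contiguous block */
    int **scattered = malloc(N * sizeof *scattered);
    if (!arr || !scattered) return 1;

    for (int i = 0; i < N; i++) {
        arr[i] = i;
        scattered[i] = malloc(sizeof **scattered);   /* N separate allocations */
        if (!scattered[i]) return 1;
        *scattered[i] = i;
    }

    long sum = 0;
    clock_t start = clock();
    for (int i = 0; i < N; i++)
        sum += arr[i];
    double contiguous = (double)(clock() - start) / CLOCKS_PER_SEC;

    start = clock();
    for (int i = 0; i < N; i++)
        sum += *scattered[i];
    double pointers = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("contiguous: %.3f s   scattered: %.3f s   (checksum %ld)\n",
           contiguous, pointers, sum);
    return 0;
}

Declaring the values separately in your source doesn't literally produce millions of mallocs, of course - this just exaggerates the layout problem so the effect is easy to measure.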

Note: I'm not suggesting you stick unrelated values in an array together. Correct code that is self-documenting is MUCH more important than sheer speed. Code for correctness first, then optimise later. Optimising first is a path to disaster.

This point will become important when considering what data structures and algorithms to use to store data properly. I'll be writing more articles if this one goes over well.

Hold on, does this have anything to do with Virtual Memory?
What a fantastic question! Turns out, it does. Here's the problem: whereas a cache's limited size is handled by special hardware to evict blocks as necessary, RAM is handled more manually by the programmer. The programmer cannot necessarily be aware of how much memory is available to the system at all times, nor should programs be forced to self-police their RAM usage to avoid overfilling RAM. Ideally, something would handle this dilemma for the applications programmer so he can skip such mundane concerns and get on with coding his application already.

This is why we have Virtual Memory. When a computer's RAM starts getting full, the operating system notices and begins writing some sections of RAM out to the hard disk. This is completely transparent to programs running on the computer - to them, it just looks like they allocated more RAM. When a program needs to access that memory, it is read back from the hard disk into system RAM (most likely evicting some other block of RAM to the hard disk in turn). Again, the operating system uses its best approximation to the clairvoyant algorithm to choose what to evict for the best performance; this process is referred to as "paging", and when memory that has been paged out to disk is requested, that is termed a (hard) "page fault". A hard page fault can cost tens of millions of CPU cycles before the data is loaded, and the program making the request may be suspended in favour of another program while that memory loads, resuming once it becomes available.
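If you're curious how often this happens to your own programs, here's a small POSIX-only sketch (Linux, OSX and friends; it won't compile on Windows) that prints the page-fault counts the operating system has recorded for the current process:

code:

/* Print this process's page-fault counts (POSIX only).
   ru_minflt counts "soft" faults serviced without disk I/O;
   ru_majflt counts "hard" faults that had to read from disk. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return 1;
    }
    printf("minor (soft) page faults: %ld\n", ru.ru_minflt);
    printf("major (hard) page faults: %ld\n", ru.ru_majflt);
    return 0;
}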

On Windows systems, virtual memory is allocated in a file called "pagefile.sys" on the system drive (you can use the command "dir /AHS", without the quotes, to see it). With a little work, one can move this to another drive or tell Windows to maintain page files on multiple drives, which may improve performance. The problem with this approach is that the pagefile may become fragmented (I'll probably be writing another article about this), which could seriously harm performance, and the page file can consume a fair bit of disk space; mine is at 1.3GB at the moment, despite this machine having 2GB of RAM, of which 600MB is available. This headroom is likely left so that newly-opened programs can load into free RAM quickly; if all of the RAM were full, newly-opened programs would first have to evict data to the disk, roughly doubling the disk access needed to load the program.

On Linux, virtual memory is allocated on a special partition, called the "swap partition". This prevents the file-fragmentation issues seen on Windows and allows a system administrator to determine the maximum virtual memory size easily - just set the swap partition's size as required. Unless I'm very much mistaken, a similar approach is taken on all UNIX-based operating systems, including OSX. Please correct me if I'm wrong.


Alright, so that's enough yammering for the moment. All thoughts and opinions are welcome, particularly constructive criticism and corrections.

Edit: Fixed copy-paste fail.
gianni




PostPosted: Fri Apr 17, 2009 10:49 pm   Post subject: RE:Computer Science Basics: The Memory Hierarchy

Nice article! One tiny thing jumped out at me:

Quote:
This prevents the file-fragmentation issues seen on Windows

It's actually the filesystem that prevents fragmentation, not the fact that it's located on another partition.

Great write up!
Tony




PostPosted: Sat Apr 18, 2009 12:28 am   Post subject: RE:Computer Science Basics: The Memory Hierarchy

The topic of system architecture is fascinating in detail.

I think it should be stressed that a miss at any memory level is orders of magnitude slower (10×, 100×, sometimes 1000×) than hitting the result in cache.

An interesting example is
code:

/* column order: consecutive accesses are 2048 ints apart in memory */
for (j = 0; j < 2048; j++)
   for (i = 0; i < 2048; i++)
      A[i][j] = A[i][j] + 1;

vs
code:

/* row order: consecutive accesses are adjacent in memory */
for (i = 0; i < 2048; i++)
   for (j = 0; j < 2048; j++)
      A[i][j] = A[i][j] + 1;


Functionally identical code, but since C stores A row by row, the former version jumps 2048*sizeof(int) bytes between consecutive accesses - more than a cache line (indeed, more than a page) - so it misses the cache on nearly every access and is slowed down significantly.
zero-impact




PostPosted: Sat Apr 18, 2009 10:51 am   Post subject: RE:Computer Science Basics: The Memory Hierarchy

Very informative, thanks.
I have always wondered what was meant by caches and such.
DemonWasp




PostPosted: Sat Apr 18, 2009 7:41 pm   Post subject: RE:Computer Science Basics: The Memory Hierarchy

A good point, Tony, and one that I may address in future articles. It seems like this one got enough interest that I may post more of this sort of thing. I was concerned that the wall of text would be intimidating to the intended audience.

@gianni, I'm pretty sure we're thinking of separate things. On Linux (and most UNIX systems) the filesystem keeps fragmentation down by having a better file-allocation scheme than FAT and NTFS use. On Windows, the fact that the pagefile is on the same partition as other files allows it to become fragmented; if Windows required it to be on a separate partition, then you wouldn't get fragmentation. I'm not sure what allocation scheme Windows / Linux use to position things in RAM, but that may prove an interesting topic itself (though more advanced than this article).
mirhagk




PostPosted: Fri Dec 03, 2010 3:54 pm   Post subject: RE:Computer Science Basics: The Memory Hierarchy

Okay, now I have a question - sorry for reviving a dead topic, but it relates to this. I've heard that the new USB 3.0 is faster than a normal hard drive. So does that mean a performance boost if we plugged in a USB 3.0 drive (assuming we have all the hardware in the computer) and told Windows to use it as the cache?

That actually makes me think of another thing: on Windows Vista, if you plug in a USB drive, Windows asks if you want to use it to "speed up your system" - does this actually speed up your system?
DemonWasp




PostPosted: Fri Dec 03, 2010 4:43 pm   Post subject: RE:Computer Science Basics: The Memory Hierarchy

What might be useful here is adding another layer between "Disk Storage" and "Main Memory". Assuming a USB 3.0 device operates at the maximum speed (400MB/s), that's somewhat faster than most hard disks on the market. SSDs can approach 250MB/s, while traditional disk drives can hit around 110MB/s.

If the OS were to use your USB 3.0 storage device for its page file (or swap partition), then it would behave somewhat faster in situations where you are actually using your page file / swap partition. You can do this in Linux manually (in fact, it's even smarter in Linux, because adding the USB device as a swap area will automatically stripe it with HDD swap partitions in what I assume is a fairly intelligent manner: see here). This may well be what Windows is doing too, though it hides the implementation a little from the end user.

In fact, here's an article at Microsoft.com about ReadyBoost. Notably, it mentions that it won't really help with systems that already have SSD drives.

Long story short, though, you're better off buying more RAM if you can. RAM is (relatively) cheap and supports data transfer rates on the order of 12000MB/s currently; if you can keep everything you're doing in system RAM and have no need for a swap file, that is your optimal-speed situation. The page file / swap partition is a fallback for when you don't have enough RAM.
mirhagk




PostPosted: Sat Dec 04, 2010 11:23 am   Post subject: RE:Computer Science Basics: The Memory Hierarchy

Yeah obviously.

Lol, a little while back a friend and I decided we were going to get USB 3.0 drives - like 16 of them, since USB storage devices are cheap - and then RAID them rather than have an actual hard drive.

Oh, and you don't even need to buy RAM, you can just download it here