Quick rules for Cache-Friendly Algorithms

Posted on Saturday, February 14, 2009 by Simon Cooke

Games programming on current architectures require that a certain amount of attention be paid to your algorithms to make sure that they’re cache friendly.

It takes a long time for a memory access to be brought back from a page that isn’t in the cache. And cache lines are getting longer and longer.

Interestingly, software engineers solved this problem a long time ago, with algorithms meant for databases with really really slow IO pipes, and disk sector sizes of anywhere from 128 to 1024 bytes.

Those sizes should be a big hint to any seasoned programmer; they’re similar to the size of cache lines in modern systems. Which means that we can take algorithms that were designed to work really well for those old systems and later discarded, dig them up, dust them off and use them to create really blazingly fast routines for handling in-memory data.

Everything that’s old is new again.

So without further ado, here’s a few recommendations:

Prefer M-ary trees to Binary or Red/Black trees.

M-ary trees store several keys per node. Binary trees only store one key per node. If you store as many keys per node as you can fit into a single cache line, you’ll benefit.

If data you need to search changes rarely, use Indexed Sequential Access

This is a bi-level lookup scheme. It’s related to M-ary trees.

If your data needs to be flexible and searchable, use B-Trees

B-trees are M-ary trees, but you don’t require that each node has M items.

If you really must use a hashtable, consider using Extendible Hashing

Extendible hashing combines the properties of B-Trees, Indexed Sequential access and Hashtables to help keep pages coherent.

If you need to do lots of data filtering, and then perform the same operation on each object, use a Gather/Scatter approach

Filter your dataset using a given set of criteria
Generate a sequential list of matching objects in scratch space.
Prime the cache as you start processing the list, and keep pumping it every few objects.
Generate a sequential list of results if further processing is necessary, or Scatter the results back out to the objects (priming the N+Wth object as you go; you’ll need to tune W based on the size of your cache).

Consider Batch Sorting Operations instead of Quicksort

Don’t use Quicksort. If you have a large dataset, you’ll quickly run into cache issues as it bounces all over the place. Use Merge Sort instead (look it up in Knuth if you don’t remember that one) – which also has the nice property that you can run it in parallel on multiple cores.

Shadow Your Most Accessed Data Inside Your Keys When Necessary

Store data you need for your most common comparisons alongside the pointers to the objects – make them part of your key. You can ignore that part of the key if you’re searching the database for anything else, and your most common operation will be optimized because you won’t have to hit a random chunk of memory to pull out the value for comparison. (Good candidates for this might include an NPC’s location and bounding-sphere radius, which could be stored alongside the pointer to the NPC itself).

Defrag / Sort your Data On the Fly

As you operate on data, if you keep a secondary buffer around, you can generate a new data set. At the end of the operation, you just swap the pointer to the first dataset for your new one. This massively reduces cache thrashing, and if you plan it right and your data is temporally coherent, if you perform a sort of the data each frame, you’ll keep most of it sorted naturally for minimal cost – it just kind of drops out in the mix. You can always perform this check as you go along, and only re-sort if necessary. (If you’re clever, you can even figure out if you just need to resort the current cache-line, or if you need to be more exhaustive).

Another advantage here is that your secondary buffer? You don’t need to synchronize on it. Your first buffer? You don’t need to synchronize on that either, so you can get MASSIVELY parallel. You’re performing something akin to finite element analysis on your dataset. The only time you need to worry is when one calculation affects the output of the other – but if you make a list of those, you can special-case them.

If you’re dealing with small chunks of data, go Linear

With small amounts of data, the cost of accessing main memory to pull it into the cache massively dwarfs the cost of performing calculations on the data.

So don’t be afraid to use bubble-sorts, or linear searches on data that’s tiny and doesn’t use indirection. Of course, use this in moderation, and test first.