
Are noncontiguous arrays performant?

Software Engineering Asked by noisecapella on February 14, 2021

In C#, when a user creates a List<byte> and adds bytes to it, there is a chance it runs out of space and needs to allocate more. It allocates a new array double (or some other multiple of) the size of the previous one, copies the bytes over, and discards the reference to the old array. I know that the list grows exponentially because each allocation is expensive, and this limits it to O(log n) allocations, whereas just adding 10 extra items each time would result in O(n) allocations.
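For reference, the doubling is easy to observe by watching List<T>.Capacity. A small illustrative snippet (the exact growth factor is an implementation detail, so treat the output as typical rather than guaranteed):

    using System;
    using System.Collections.Generic;

    class CapacityGrowth
    {
        static void Main()
        {
            var list = new List<byte>();
            int lastCapacity = -1;
            for (int i = 0; i < 1000; i++)
            {
                if (list.Capacity != lastCapacity)
                {
                    // Each change of Capacity means a new backing array was
                    // allocated and the old contents were copied into it.
                    Console.WriteLine($"Count={list.Count}, Capacity={list.Capacity}");
                    lastCapacity = list.Capacity;
                }
                list.Add(0);
            }
            // Typical output: capacities 0, 4, 8, 16, 32, ... i.e. O(log n) reallocations.
        }
    }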

However, for large array sizes there can be a lot of wasted space, maybe nearly half the array. To reduce memory use I wrote a similar class, NonContiguousArrayList, which uses List<byte> as a backing store while there is less than 4MB in the list; beyond that, it allocates additional 4MB byte arrays as NonContiguousArrayList grows in size.

Unlike List<byte>, these arrays are non-contiguous, so no data is copied around; growth is just an additional 4MB allocation. When an item is looked up, the index is divided by 4M to get the index of the array containing the item, then taken modulo 4M to get the index within that array.
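A minimal sketch of the scheme (illustrative names only, not the actual NonContiguousArrayList code):

    using System.Collections.Generic;

    public sealed class ChunkedByteList
    {
        private const int BlockSize = 4 * 1024 * 1024;          // 4MB per chunk
        private readonly List<byte[]> blocks = new List<byte[]>();
        private long count;

        public long Count => count;

        public void Add(byte value)
        {
            int block = (int)(count / BlockSize);
            if (block == blocks.Count)
                blocks.Add(new byte[BlockSize]);                 // grow by one chunk; nothing is copied
            blocks[block][(int)(count % BlockSize)] = value;
            count++;
        }

        public byte this[long index]
        {
            get
            {
                int block  = (int)(index / BlockSize);           // which chunk
                int offset = (int)(index % BlockSize);           // position inside the chunk
                return blocks[block][offset];                    // one extra indirection vs. a flat array
            }
        }
    }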

Can you point out problems with this approach? Here is my list:

  • Noncontiguous arrays don’t have cache locality, which results in bad performance. However, at a 4M block size it seems like there would be enough locality for good caching.
  • Accessing an item is not quite as simple, there’s an extra level of indirection. Would this get optimized away? Would it cause cache problems?
  • Since there is linear growth after the 4M limit is hit, you could have many more allocations than you would ordinarily (say, a max of 250 allocations for 1GB of memory). No extra memory is copied after 4M; however, I'm not sure whether the extra allocations are more expensive than copying large chunks of memory.

5 Answers

At the scales you are talking about, the concerns are quite different from the ones you listed.

Cache locality

  • There are two related concepts:
    1. Locality, i.e. the reuse of data on the same cache line (spatial locality) or of data that was recently accessed (temporal locality)
    2. Automatic cache prefetching (streaming).
  • At the scales you mentioned (hundreds of MB to gigabytes, in 4MB chunks), the two factors have more to do with your data element access pattern than with the memory layout.
  • My (clueless) prediction is that, statistically, there might not be much performance difference compared to one giant contiguous memory allocation. No gain, no loss.

Data element access pattern

  • This article visually illustrates how memory access patterns will affect performance.
  • In short, just keep in mind that if your algorithm is already bottlenecked by memory bandwidth, the only way to improve performance is to do more useful work with the data that is already loaded into the cache.
  • In other words, even if YourList[k] and YourList[k+1] have a high probability of being consecutive (a one-in-four-million chance of not being so), that fact will not help performance if you access your list completely randomly, or in large unpredictable strides, e.g. while (true) { index += random.Next(1024); DoStuff(YourList[index]); }

Interaction with the GC system

  • In my opinion, this is what you should focus on the most.
  • At a minimum, understand how your design will interact with the garbage collector; in particular the Large Object Heap (where 4MB arrays will be allocated), GC generations, and heap compaction. A quick check is sketched after this list.
  • I am not knowledgeable in these topics so I will leave others to contribute.
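As a quick illustration (not from the original answer): byte arrays of roughly 85,000 bytes or more are allocated on the Large Object Heap, which is collected only with gen-2 collections and is not compacted by default, so a 4MB-chunk design lives entirely on the LOH.

    using System;

    class LohCheck
    {
        static void Main()
        {
            var chunk = new byte[4 * 1024 * 1024];
            // LOH objects are reported as generation 2; this typically prints 2.
            Console.WriteLine(GC.GetGeneration(chunk));
        }
    }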

Overhead of address offset calculations

  • Typical C# code already does a lot of address offset calculations, so the additional overhead from your scheme wouldn't be any worse than that of typical C# code working on a single array.
    • Remember that C# code also does array range checking; this fact does not prevent C# from reaching array-processing performance comparable to C++ code.
    • The reason is that performance is mostly bottlenecked by memory bandwidth.
    • The trick to maximizing utility from memory bandwidth is to use SIMD instructions for memory read/write operations. Neither typical C# nor typical C++ does this; you have to resort to libraries or language add-ons.

To illustrate why, consider what each element access involves:

  • Do address calculation
  • (In the OP's case: load the chunk base address, which is likely already in cache, and then do more address calculation)
  • Read from / write to the element address

The last step still takes the lion's share of time.
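To make the SIMD point above concrete, here is a small illustrative sketch using System.Numerics.Vector<byte>, one such library add-on; it is not from the answer, and in practice Span<T>.CopyTo or Buffer.BlockCopy already give you an optimized bulk copy:

    using System;
    using System.Numerics;

    static class SimdCopy
    {
        // Copies src into dst using Vector<byte>-wide loads and stores where possible,
        // falling back to a scalar loop for the tail.
        public static void Copy(byte[] src, byte[] dst)
        {
            if (dst.Length < src.Length) throw new ArgumentException("dst too small");
            int width = Vector<byte>.Count;                  // e.g. 16 or 32 bytes per register
            int i = 0;
            for (; i <= src.Length - width; i += width)
                new Vector<byte>(src, i).CopyTo(dst, i);     // SIMD load + store
            for (; i < src.Length; i++)
                dst[i] = src[i];                             // scalar tail
        }
    }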

Personal suggestion

  • You can provide a CopyRange function, which would behave like the Array.Copy function but would operate between two instances of your NonContiguousByteArray, or between one instance and an ordinary byte[]. These functions can make use of SIMD code (C++ or C#) to maximize memory bandwidth utilization, and then your C# code can operate on the copied range without the overhead of repeated dereferencing or address calculation.
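A hedged sketch of what such a CopyRange could look like, copying one chunk-sized run at a time (blocks and BlockSize are illustrative names, not the OP's code; Buffer.BlockCopy stands in for a SIMD-accelerated copy):

    using System;
    using System.Collections.Generic;

    public static class NonContiguousCopy
    {
        private const int BlockSize = 4 * 1024 * 1024;

        // Copies `length` bytes starting at srcIndex of the chunked list into dst.
        public static void CopyRange(List<byte[]> blocks, long srcIndex,
                                     byte[] dst, int dstIndex, int length)
        {
            while (length > 0)
            {
                int block  = (int)(srcIndex / BlockSize);
                int offset = (int)(srcIndex % BlockSize);
                int run    = Math.Min(length, BlockSize - offset);   // stay within one chunk
                Buffer.BlockCopy(blocks[block], offset, dst, dstIndex, run);
                srcIndex += run;
                dstIndex += run;
                length   -= run;
            }
        }
    }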

Usability and interoperability concerns

  • Apparently you cannot use this NonContiguousByteArray with any C#, C++ or foreign-language libraries that expect contiguous byte arrays, or byte arrays that can be pinned.
  • However, if you write your own C++ acceleration library (with P/Invoke or C++/CLI), you can pass in a list of base addresses of several 4MB blocks into the underlying code.
    • For example, if you need to give access to elements starting at (3 * 1024 * 1024) and ending at (5 * 1024 * 1024 - 1), the access will span chunk[0] and chunk[1]. You can then construct an array (size 2) of those byte arrays (each 4MB), pin the chunks, and pass their addresses to the underlying code (a pinning sketch follows this list).
  • Another usability concern is that you will not be able to implement the IList<byte> interface efficiently: Insert and Remove will just take too long to process because they will require O(N) time.
    • In fact, it looks like you can't implement anything other than IEnumerable<byte>, i.e. it can be scanned sequentially and that's it.
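The pinning step mentioned above could look roughly like this (a sketch only; "mynative" and NativeProcess are hypothetical placeholders, not a real library or API):

    using System;
    using System.Runtime.InteropServices;

    static class PinningExample
    {
        [DllImport("mynative")]                         // hypothetical native library
        static extern void NativeProcess(IntPtr[] chunkPtrs, int chunkCount, int chunkSize);

        public static void PassChunks(byte[][] chunks)
        {
            var handles = new GCHandle[chunks.Length];
            var ptrs = new IntPtr[chunks.Length];
            try
            {
                for (int i = 0; i < chunks.Length; i++)
                {
                    handles[i] = GCHandle.Alloc(chunks[i], GCHandleType.Pinned); // keep the GC from moving the chunk
                    ptrs[i] = handles[i].AddrOfPinnedObject();
                }
                NativeProcess(ptrs, chunks.Length, chunks[0].Length);
            }
            finally
            {
                foreach (var h in handles)
                    if (h.IsAllocated) h.Free();                                 // always unpin
            }
        }
    }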

Correct answer by rwong on February 14, 2021

Some of the most central parts of my codebase (an ECS engine) revolve around the type of data structure you described, though it uses smaller contiguous blocks (more like 4 kilobytes instead of 4 megabytes).

It uses a double free list to achieve constant-time insertions and removals: one free list for blocks that are ready to be inserted into (blocks which aren't full), and a sub-free list inside each block for slots in that block that are ready to be reclaimed upon insertion.
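A minimal sketch of that double-free-list idea (the structure described here is C++, but the sketch is in C# for consistency with the question; names and sizes are illustrative, and details such as deallocating empty blocks and the compact 1-bit-per-element bookkeeping are omitted):

    using System.Collections.Generic;

    public sealed class BlockFreeList<T>
    {
        private const int SlotsPerBlock = 512;

        private sealed class Block
        {
            public readonly T[] Slots = new T[SlotsPerBlock];
            public readonly int[] NextFree = new int[SlotsPerBlock]; // intra-block free list
            public int FreeHead;                                     // first free slot, -1 when full

            public Block()
            {
                for (int i = 0; i < SlotsPerBlock; i++)
                    NextFree[i] = i + 1 < SlotsPerBlock ? i + 1 : -1;
            }
        }

        private readonly List<Block> blocks = new List<Block>();
        private readonly Stack<int> blocksWithSpace = new Stack<int>(); // block-level free list

        public (int block, int slot) Insert(T value)
        {
            if (blocksWithSpace.Count == 0)
            {
                blocks.Add(new Block());                 // no non-full block: allocate one
                blocksWithSpace.Push(blocks.Count - 1);
            }
            int b = blocksWithSpace.Peek();
            Block blk = blocks[b];
            int s = blk.FreeHead;                        // pop a free slot inside the block
            blk.FreeHead = blk.NextFree[s];
            blk.Slots[s] = value;
            if (blk.FreeHead == -1)
                blocksWithSpace.Pop();                   // block just became full
            return (b, s);
        }

        public T Get(int block, int slot) => blocks[block].Slots[slot];

        public void Remove(int block, int slot)
        {
            Block blk = blocks[block];
            bool wasFull = blk.FreeHead == -1;
            blk.Slots[slot] = default(T);
            blk.NextFree[slot] = blk.FreeHead;           // push the slot onto the intra-block free list
            blk.FreeHead = slot;
            if (wasFull)
                blocksWithSpace.Push(block);             // block has space again
        }
    }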

I'll cover the pros and cons of this structure. Let's start with some cons because there are a number of them:

Cons

  1. It takes about 4 times longer to insert a couple hundred million elements into this structure than into std::vector (a purely contiguous structure). I'm pretty decent at micro-optimizations, but there is simply more conceptual work to do: the common case has to inspect the free block at the top of the block free list, access that block and pop a free index from the block's free list, write the element at the free position, and then check if the block is now full and, if so, pop the block from the block free list. It's still a constant-time operation, but with a much bigger constant than pushing back to std::vector.
  2. It takes about twice as long when accessing elements using a random-access pattern given the extra arithmetic for indexing and the extra layer of indirection.
  3. Sequential access doesn't map efficiently to an iterator design since the iterator has to perform additional branching each time it's incremented.
  4. It has a bit of memory overhead, typically around 1 bit per element. 1 bit per element might not sound like much, but if you are using this to store a million 16-bit integers, then that's 6.25% more memory use than a perfectly compact array. However, in practice this tends to use less memory than std::vector unless you are compacting the vector to eliminate the excess capacity it reserves. Also I generally don't use it to store such teeny elements.

Pros

  1. Sequential access using a for_each function that takes a callback processing ranges of elements within a block almost rivals the speed of sequential access with std::vector (only like a 10% diff), so it's not much less efficient in the most performance-critical use cases for me (most of the time spent in an ECS engine is in sequential access).
  2. It allows constant-time removals from the middle with the structure deallocating blocks when they become completely empty. As a result it's generally quite decent at making sure the data structure never uses significantly more memory than necessary.
  3. It doesn't invalidate indices to elements which are not directly removed from the container since it just leaves holes behind using a free list approach to reclaim those holes upon subsequent insertion.
  4. You don't have to worry so much about running out of memory even if this structure holds an epic number of elements, since it only requests small contiguous blocks which don't pose a challenge for the OS to find a huge number of contiguous unused pages.
  5. It lends itself well to concurrency and thread-safety without locking the whole structure, since operations are generally localized to individual blocks.

Now one of the biggest pros for me was that it became trivial to make an immutable version of this data structure.

Ever since then, it opened up all kinds of doors to writing more functions devoid of side effects, which made it much easier to achieve exception safety, thread safety, etc. The immutability was something I discovered I could easily achieve with this data structure in hindsight and by accident, but it is arguably one of the nicest benefits it ended up having, since it made maintaining the codebase much easier.

Noncontiguous arrays don't have cache locality which results in bad performance. However at a 4M block size it seems like there would be enough locality for good caching.

Locality of reference isn't something to concern yourself with at blocks of that size, let alone 4 kilobyte blocks. A cache line is just 64 bytes typically. If you want to reduce cache misses, then just focus on aligning those blocks properly and favor more sequential access patterns when possible.

A very rapid way to turn a random-access memory pattern into a sequential one is to use a bitset. Let's say you have a boatload of indices and they're in random order. You can just plow through them and mark bits in the bitset. Then you can iterate through your bitset and check which words are non-zero, checking, say, 64 bits at a time. Once you encounter a 64-bit word in which at least one bit is set, you can use FFS instructions to quickly determine which bits are set. Those bits tell you which indices you should access, except now you get the indices in sorted, sequential order.

This has some overhead but can be a worthwhile exchange in some cases, especially if you're going to be looping over these indices many times.
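A minimal sketch of this bitset trick (written in C# for consistency with the question; assumes .NET Core 3.0+ for System.Numerics.BitOperations, whose TrailingZeroCount is the "find first set" operation mentioned above):

    using System;
    using System.Numerics;

    static class BitsetOrdering
    {
        public static void VisitInOrder(int[] randomIndices, int maxIndex, Action<int> visit)
        {
            var bits = new ulong[(maxIndex + 64) / 64];
            foreach (int idx in randomIndices)
                bits[idx >> 6] |= 1UL << (idx & 63);              // mark each index

            for (int word = 0; word < bits.Length; word++)        // scan 64 indices at a time
            {
                ulong w = bits[word];
                while (w != 0)
                {
                    int bit = BitOperations.TrailingZeroCount(w); // find-first-set
                    visit(word * 64 + bit);                       // indices come out in sorted order
                    w &= w - 1;                                   // clear the lowest set bit
                }
            }
        }
    }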

Accessing an item is not quite as simple, there's an extra level of indirection. Would this get optimized away? Would it cause cache problems?

No, it can't be optimized away. Random-access, at least, will always cost more with this structure. It often won't increase your cache misses that much though since you'll tend to get high temporal locality with the array of pointers to blocks, especially if your common case execution paths use sequential access patterns.

Since there is linear growth after the 4M limit is hit, you could have many more allocations than you would ordinarily (say, a max of 250 allocations for 1GB of memory). No extra memory is copied after 4M; however, I'm not sure whether the extra allocations are more expensive than copying large chunks of memory.

In practice the copy-on-growth approach is often faster overall because copying is the rare case, occurring only something like log(N)/log(2) times in total, while it also keeps the dirt-cheap common case simple: you can just write an element to the array many times before it becomes full and needs to be reallocated again. So typically you won't get faster insertions with this type of structure, because the common-case work is more expensive, even though it never has to deal with that expensive rare case of reallocating huge arrays.

The primary appeal of this structure for me in spite of all the cons is reduced memory use, not having to worry about OOM, being able to store indices and pointers which don't get invalidated, concurrency, and immutability. It's nice to have a data structure where you can insert and remove things in constant time while it cleans itself up for you and doesn't invalidate pointers and indices into the structure.

Answered by user204677 on February 14, 2021

It's worth noting that C++ already has an equivalent structure in the Standard, std::deque. Currently it's recommended as the default choice when you need a random-access sequence.

The reality is that contiguous memory is almost completely unnecessary once the data grows past a certain size: a cache line is just 64 bytes and a page is just 4-8KB (typical values currently). Once you're talking about a few MB, it really goes out the window as a concern. The same is true of allocation cost. The price of processing all that data, even just reading it, dwarfs the price of the allocations anyway.

The only other reason to worry about it is for interfacing with C APIs. But you can't get a pointer to the buffer of a List anyway so there's no concern here.

Answered by DeadMG on February 14, 2021

With a 4M block size, even a single block isn't guaranteed to be contiguous in physical memory; it's larger than a typical VM page size. Locality is not meaningful at that scale.

You will have to worry about heap fragmentation: if the allocations happen such that your blocks are largely non-contiguous in the heap, then when they are reclaimed by the GC, you will end up with a heap that may be too fragmented to fit a subsequent allocation. That's usually a worse situation because failures will occur in unrelated places and possibly force a restart of the application.

Answered by user2313838 on February 14, 2021

When memory chunks are allocated at different points in time, as with the sub-arrays in your data structure, they can be located far from each other in memory. Whether this is a problem or not depends on the CPU and is very hard to predict these days. You have to test it.

This is an excellent idea, and it is one I have used in the past. Of course you should only use powers of two for your sub-array sizes, and bit shifting for division (which the compiler may do for you as part of optimization). I found this type of structure slightly slower, in that compilers can optimize a single array indirection more easily. You have to test, as these types of optimizations change all the time.
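With a power-of-two chunk size, the division and modulo become a shift and a mask; a tiny illustrative sketch (4MB = 2^22 bytes):

    static class PowerOfTwoIndexing
    {
        private const int BlockShift = 22;                       // 2^22 bytes = 4MB
        private const int BlockMask  = (1 << BlockShift) - 1;

        // Returns (chunk index, offset within chunk) without an integer division.
        public static (int block, int offset) Locate(long index)
            => ((int)(index >> BlockShift), (int)(index & BlockMask));
    }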

The main advantage is that you can run closer to the upper limit of memory in your system, as long as you use these types of structures consistently. As long as you are only making your data structures larger and not producing garbage, you avoid the extra garbage collections that would occur with an ordinary List. For a giant list, it could make a huge difference: the difference between continuing to run and running out of memory.

The extra allocations are an issue only if your sub-array chunks are small, because there is memory overhead in each array allocation.

I have created similar structures for dictionaries (hash tables). The Dictionary provided by the .NET Framework has the same problem as List. Dictionaries are harder in that you need to avoid rehashing as well.

Answered by Frank Hileman on February 14, 2021
