Should I use a global, heap-allocated array or a local, stack-allocated one when its elements are referenced many times?

Computational Science Asked by nougako on June 8, 2021

I suspect data locality may be one reason why my Fortran program runs somewhat slowly. In one part of the program I have nested loops, and throughout these loops a given section of a big array is referenced multiple times. The pseudocode goes like this:

subroutine foo()
  use mymodule, only : big_array

  ...

  do i = 1, n1
     do j = 1, n2
        ! invoke big_array(i,j)
     end do
  end do
end subroutine

The values of n1 and n2 can be in the tens of thousands. Since big_array is allocated on the heap (it's an allocatable array), I have long suspected that the repeated references to big_array elements in the above loops may contribute to slowing the program down, but I have never been able to settle this doubt because I have only a very basic knowledge of how stack and heap memory work. If I were to use stack-allocated data, I could declare a local automatic array before the nested loops and store the needed section of big_array in it, so that the data is closer in memory to the nested loops; but that also incurs additional CPU time when the program allocates the local array and copies into it. So I don't know which approach is better.
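
To be concrete, the alternative I have in mind would look something like the sketch below (the element type and bounds are placeholders, not my actual code):

subroutine foo_with_local_copy(n1, n2)
  use mymodule, only : big_array
  implicit none
  integer, intent(in) :: n1, n2
  real    :: local_block(n1, n2)   ! automatic array: lives on the stack
  integer :: i, j

  ! Copy the needed section of the heap-allocated array once, up front.
  ! (For n1 and n2 in the tens of thousands this copy alone may exceed a
  ! typical stack size limit, which is part of what I am unsure about.)
  local_block = big_array(1:n1, 1:n2)

  do i = 1, n1
     do j = 1, n2
        ! work with local_block(i, j) instead of big_array(i, j)
     end do
  end do
end subroutine foo_with_local_copy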

Can someone also give me an idea of just how much slower a program can get if it involves reading and writing memory addresses that are far from the point where the reference is made?

One Answer

TL;DR: Leave it on the heap, but switch your loop order.

For starters, the program stack has fairly limited space (often only around 8 MB by default). If you're making arrays that large, I'd be very surprised if they fit on the stack.

More importantly: you're absolutely right to write your program in a way that promotes memory locality. The important thing to remember is that memory locality is always relative to which memory was accessed recently, not to the absolute location of that memory on the stack or heap. This is because of the way CPU caches work. When you access a location in memory, you don't just read or write the data at that address; nearby memory addresses get loaded into cache as well. Caches exist because, if you touch some memory, you're likely to touch nearby locations pretty soon. So if you can write your code so that it reads or writes memory in order, you'll take the greatest advantage of the CPU cache.

A classic example of this in Fortran is that multidimensional arrays are stored in column-major order -- the elements of one column are laid out sequentially in memory. This is in contrast to (what passes for) multidimensional arrays in C, where the elements of a single row are laid out sequentially. The way your program is written now, the memory accesses jump forward by n1 elements on each inner loop iteration. If you want to make your code run faster, you can switch the order of the loops:

do j = 1, n2
    do i = 1, n1
        ! invoke big_array(i, j)
    end do
end do

With the reordering, you'll be advancing by only one element on each iteration rather than n1 elements. This should perform better, at least on any machine built after 1996 or so.
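
If you want to see the effect for yourself, here is a minimal, self-contained sketch (the array size and element type are arbitrary, not taken from your code) that times both loop orders over a heap-allocated array using the standard system_clock intrinsic:

program loop_order_timing
  implicit none
  integer, parameter :: n1 = 10000, n2 = 10000
  real, allocatable  :: a(:,:)
  real(8)            :: s1, s2
  integer            :: i, j
  integer(8)         :: t0, t1, rate

  allocate(a(n1, n2))   ! heap-allocated, like big_array
  a = 1.0

  ! Original loop order: the inner loop strides through memory by n1 elements.
  call system_clock(t0, rate)
  s1 = 0.0d0
  do i = 1, n1
     do j = 1, n2
        s1 = s1 + a(i, j)
     end do
  end do
  call system_clock(t1)
  print '(a, f8.3, a)', 'i outer, j inner (strided):     ', real(t1 - t0) / rate, ' s'

  ! Reordered loops: the inner loop walks consecutive memory addresses.
  call system_clock(t0)
  s2 = 0.0d0
  do j = 1, n2
     do i = 1, n1
        s2 = s2 + a(i, j)
     end do
  end do
  call system_clock(t1)
  print '(a, f8.3, a)', 'j outer, i inner (unit stride): ', real(t1 - t0) / rate, ' s'

  print *, s1, s2   ! keep both sums live so the compiler cannot drop either loop
end program loop_order_timing

The arithmetic in the two loops is identical; any difference in the timings is purely cache behavior.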

If you're doing serious performance tuning, it's worth knowing a bit about modern memory hierarchies. There are nice interactive demonstrations of how long it takes to access the different layers of the memory hierarchy and how those latencies have evolved since the 1990s.

Finally, you asked how long it takes to actually allocate the memory. As I understand it, the time it takes to allocate isn't proportional to how much memory you're allocating -- it's effectively a constant cost per allocation -- because of how virtual memory works: the operating system mostly just reserves address space at allocation time and maps physical pages only when they are first touched. Big array-chomping computations like dense linear algebra spend orders of magnitude more time accessing memory and computing than they do allocating. You really only need to worry about the cost of allocation if you're using things like tree data structures, but memory fragmentation is more of a problem there, and in any case memory pools solve both issues.
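
You can check this with a rough sketch like the one below (the size is arbitrary); the allocation itself is typically orders of magnitude cheaper than a single pass over the data:

program alloc_vs_first_touch
  implicit none
  integer, parameter :: n = 100000000   ! roughly 400 MB of default reals
  real, allocatable  :: a(:)
  integer(8)         :: t0, t1, rate

  ! Time the allocation itself: the OS mostly just reserves address space here.
  call system_clock(t0, rate)
  allocate(a(n))
  call system_clock(t1)
  print '(a, f10.4, a)', 'allocate:    ', real(t1 - t0) / rate, ' s'

  ! Time the first pass over the data: this is where pages actually get
  ! mapped and where the real memory traffic happens.
  call system_clock(t0)
  a = 0.0
  call system_clock(t1)
  print '(a, f10.4, a)', 'first touch: ', real(t1 - t0) / rate, ' s'

  deallocate(a)
end program alloc_vs_first_touch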

Correct answer by Daniel Shapero on June 8, 2021
