Reading from buffer versus calculating on the fly performance

Question

I am creating a fast fourier transform algorithm for the compute shader - i am no expert on how GPUs really run optimally so thought i would ask here.
I have the option to calculate on the fly or precompute a lot of the trig function (cos and sin) values and store it in a read only float buffer.
My question is, though i am aware cos() and sin() is fast, how do they compare to simply getting the value precomputed in a buffer instead?
I don't mean a LUT here i do not need to interpolate between values from the buffer so i just get them values directly. I don't know how to really bench test shaders properly so i am unsure which is faster.
Perhaps some one knows more info on this?

Varaquilex · Accepted Answer

My question is, though i am aware cos() and sin() is fast, how do they compare to simply getting the value precomputed in a buffer instead?

This will depend on your shader code and GPU model.

GPUs utilize latency hiding by executing independent ALU instructions while waiting for data to arrive from memory.
Accessing memory is usually the slowest operation on a GPU.

If your shader code allows for latency hiding, it may be faster.
The best option here is to profile the shader performance.

I don't know how to really bench test shaders properly so i am unsure which is faster.

I think the easiest way would be to download RenderDoc and get a frame capture of your application.
After that, you click on the timing icon (shown red) to get some measurements.

Although these are not super accurate measurements, they're good enough and sort of representative of the application performance.
For precise measurements, you'll need your GPU vendor's graphics profiling tools.

AMD has Radeon Graphics Profiler (RGP)
NVIDIA has NSight
INTEL has Graphics Performance Analyzers

Update: Below are some captures from RGP, showing GPU assembly instructions and instruction timings of various instruction types. You can see latency hiding in action below:

vmem : texture read
smem : constant buffer / descriptor<texture/buffer> read
ALU  : math ops (arithmetic / logic)
sync : waits for vmem/smem instructions to finish loading data from memory

Another capture showing instruction timings while first two texture loads not having latency hiding.

Reading from buffer versus calculating on the fly performance

One Answer

Add your own answers!

Ask a Question