A log plot of elements per thread vs total elements vs time
Vector Add
As the problem size increases, work can be spread across more than one multiprocessor. At around 10^5 elements, I believe the GPU is fully saturated. From there, time increases linearly with data size.
In the chart above, I run 1024 threads per block, and make sure that elementsPerThread*threadsPerBlock*totalBlocks is less than the problem size.
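The sizing constraint can be sketched as a small host-side helper. This is only an illustration: the function name `blocksWithinProblem` is hypothetical, and it reads "less than" loosely as "does not exceed", using floor division to pick the largest block count that stays within the problem size.

```cuda
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the original benchmark): the largest block
// count for which elementsPerThread * threadsPerBlock * totalBlocks stays
// within n. A result of 0 means even a single block would overshoot n,
// so that configuration would be skipped.
inline size_t blocksWithinProblem(size_t n, size_t threadsPerBlock,
                                  size_t elementsPerThread) {
    assert(threadsPerBlock > 0 && elementsPerThread > 0);
    size_t elementsPerBlock = threadsPerBlock * elementsPerThread;
    return n / elementsPerBlock;  // floor division: never overshoots n
}
```

For power-of-two problem sizes the division is exact. Otherwise a tail of fewer than `threadsPerBlock * elementsPerThread` elements is left uncovered and would need separate handling.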
When running 1024 threads per block, increasing the number of elements processed per thread gives only a very slight improvement in time. The sweet spot seems to be between 1 and 16 elements per thread.
I ran the same vector addition again, this time with 32 threads per block and each thread processing from 1 to 1024 elements in a cyclic data distribution. Green areas are better than average; red areas are worse than average.
The results are very similar to before. About 8 elements per thread appears to be the sweet spot.
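For reference, here is a minimal sketch of what a cyclic data distribution could look like for vector addition. It is not the benchmark's actual code: the names `vectorAddCyclic` and `vectorAddOnDevice` are hypothetical, and the caller is assumed to choose `totalBlocks`. Each thread starts at its global index and strides by the total number of threads in the grid.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Cyclic distribution: thread t handles elements t, t + T, t + 2T, ...
// where T is the total number of threads in the grid. In each iteration,
// consecutive threads touch consecutive addresses.
__global__ void vectorAddCyclic(const float* a, const float* b, float* c,
                                size_t n, unsigned int elementsPerThread) {
    size_t totalThreads = (size_t)gridDim.x * blockDim.x;
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned int k = 0; k < elementsPerThread; ++k) {
        size_t i = tid + (size_t)k * totalThreads;
        if (i < n) {  // guard against overshooting the problem size
            c[i] = a[i] + b[i];
        }
    }
}

// Hypothetical host wrapper: copies inputs to the device, launches the
// kernel, copies the result back, and returns the first CUDA error seen.
cudaError_t vectorAddOnDevice(const float* hostA, const float* hostB,
                              float* hostC, size_t n,
                              unsigned int totalBlocks,
                              unsigned int threadsPerBlock,
                              unsigned int elementsPerThread) {
    assert(n > 0 && totalBlocks > 0 && threadsPerBlock > 0);
    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(dA, hostA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(dB, hostB, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        vectorAddCyclic<<<totalBlocks, threadsPerBlock>>>(
            dA, dB, dC, n, elementsPerThread);
        err = cudaGetLastError();  // catches invalid launch configurations
    }
    // This blocking copy also waits for the kernel to finish.
    if (err == cudaSuccess)
        err = cudaMemcpy(hostC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);  // cudaFree accepts null pointers
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

As a usage example, 1024 elements with 32 threads per block and 8 elements per thread would be covered exactly by 4 blocks: `vectorAddOnDevice(a, b, c, 1024, 4, 32, 8)`. A timing harness would normally keep the allocations and copies outside the measured region. They are bundled here only to keep the sketch self-contained.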