Project 1

Vector Add

A log plot of elements per thread vs total elements vs time

Effect of increasing problem size

As the problem size increases, work can be spread across more than one multiprocessor. At around 10^5 elements, I believe the GPU is fully saturated. From there, time increases linearly with data size.

Effects of increasing work per thread

In the chart above, I run 1024 threads per block and make sure that elementsPerThread * threadsPerBlock * totalBlocks is less than the problem size.
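The write-up does not say how totalBlocks is chosen for a given problem size. As a rough sketch, one common choice (an assumption here, not necessarily what was used for the chart) is ceiling division, which gives the smallest grid whose total capacity reaches the problem size; the kernel then needs a bounds check for the partially filled tail. The helper names blocksFor and gridCapacity are hypothetical, and this is plain host-side C++ rather than device code.

```cpp
#include <cstdint>

// Hypothetical helper: smallest number of blocks such that
// elementsPerThread * threadsPerBlock * totalBlocks >= n.
// Assumes threadsPerBlock and elementsPerThread are both nonzero.
std::uint64_t blocksFor(std::uint64_t n,
                        std::uint64_t threadsPerBlock,
                        std::uint64_t elementsPerThread) {
    const std::uint64_t perBlock = threadsPerBlock * elementsPerThread;
    return (n + perBlock - 1) / perBlock;  // ceiling division
}

// Hypothetical helper: total number of elements the launch can touch.
std::uint64_t gridCapacity(std::uint64_t totalBlocks,
                           std::uint64_t threadsPerBlock,
                           std::uint64_t elementsPerThread) {
    return totalBlocks * threadsPerBlock * elementsPerThread;
}
```

For example, with 10^5 elements and 1024 threads per block, blocksFor gives 98 blocks at 1 element per thread and 7 blocks at 16 elements per thread. More work per thread therefore means far fewer blocks to spread across the multiprocessors.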

When running 1024 threads per block, as I increase the total elements processed per thread, I see a very slight improvement in time, but not much. The sweet spot seems to be between 1 and 16 elements per thread.

...with 32 threads per block

I ran the same vector addition again, this time with 32 threads per block, with each thread processing from 1 to 1024 elements in a cyclic data distribution. Green areas are better than average; red areas are worse than average.
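To make the cyclic distribution concrete, here is a minimal CPU emulation of the indexing. It assumes the cycle spans the whole grid, so a thread's k-th element sits at globalThreadId + k * totalThreads; the write-up does not say whether the stride was grid-wide or per block, so treat that as one possible reading. The function names are hypothetical, and the two nested loops stand in for what the GPU would run in parallel.

```cpp
#include <cstddef>
#include <vector>

// Work done by a single emulated thread: element k of this thread lives at
// globalThreadId + k * totalThreads (cyclic, grid-wide stride: an assumption).
void addCyclicThread(const std::vector<float>& a, const std::vector<float>& b,
                     std::vector<float>& c, std::size_t globalThreadId,
                     std::size_t totalThreads, std::size_t elementsPerThread) {
    for (std::size_t k = 0; k < elementsPerThread; ++k) {
        const std::size_t i = globalThreadId + k * totalThreads;
        if (i < c.size()) {  // bounds check for the partially filled tail
            c[i] = a[i] + b[i];
        }
    }
}

// Sequentially emulates every thread of every block.
// Assumes a and b have the same length.
std::vector<float> vectorAddCyclic(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   std::size_t threadsPerBlock,
                                   std::size_t totalBlocks,
                                   std::size_t elementsPerThread) {
    std::vector<float> c(a.size(), 0.0f);
    const std::size_t totalThreads = threadsPerBlock * totalBlocks;
    for (std::size_t block = 0; block < totalBlocks; ++block) {
        for (std::size_t thread = 0; thread < threadsPerBlock; ++thread) {
            // Same formula as blockIdx.x * blockDim.x + threadIdx.x in CUDA.
            const std::size_t globalThreadId = block * threadsPerBlock + thread;
            addCyclicThread(a, b, c, globalThreadId, totalThreads, elementsPerThread);
        }
    }
    return c;
}
```

With this layout, neighbouring threads touch neighbouring elements on each pass of the loop. That is the usual reason to prefer a cyclic distribution over giving each thread one contiguous chunk on a GPU.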

I get results very similar to before. About 8 elements per thread appears to be the sweet spot.