Lab Report: Vector Addition Scalability & Architecture
The goal of this lab was to develop a deeper understanding of the fundamental architectural differences between GPUs and CPUs. We conducted an experiment using a simple vector addition to observe how parallelism impacts execution time as data scales up to 500 million elements.
We implemented the vector addition in two ways: a multi-threaded C++ version (using 2 threads to ensure a fair baseline) and a CUDA C++ version for the GPU.
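To make the comparison concrete, here is a minimal sketch of the two implementations. The structure and names (`vecAddKernel`, `vecAddCpu`, `vecAddGpu`) are illustrative rather than our exact code, and the allocation/timing harness is omitted:

```cpp
#include <cuda_runtime.h>
#include <thread>

// GPU kernel: one thread per element.
__global__ void vecAddKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// CPU baseline: the index range is split across 2 threads.
void vecAddCpu(const float* a, const float* b, float* c, int n) {
    auto worker = [=](int begin, int end) {
        for (int i = begin; i < end; ++i) c[i] = a[i] + b[i];
    };
    std::thread t(worker, 0, n / 2);   // first half on a second thread
    worker(n / 2, n);                  // second half on the calling thread
    t.join();
}

// GPU path: copy inputs to the device, launch the kernel, copy the result back.
void vecAddGpu(const float* a, const float* b, float* c, int n) {
    float *dA, *dB, *dC;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
    cudaDeviceSynchronize();

    cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```

The grid size is derived from the element count so that every element gets exactly one GPU thread, which is why the kernel only needs the bounds check `if (i < n)`.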
You can find the source code used for this experiment here:
| Data Size (Elements) | CPU Time (2 Threads) | GPU Time (CUDA) | Speedup Factor |
|---|---|---|---|
| 50,000 | 0.000661 s | 0.000165 s | ~4.0x |
| 5,000,000 | 0.0127 s | 0.000393 s | ~32.3x |
| 500,000,000 | 0.7 s | 0.01 s | ~70x |
We also experimented with the threadsPerBlock launch parameter (e.g., switching between 256, 512, and 1024). Interestingly, changing the block size did not yield significant differences in execution time.
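For context, a sketch of what such a sweep looks like (reusing the hypothetical `vecAddKernel` and device pointers from the sketch above). The total thread count, and therefore the work per thread, is identical for every configuration:

```cpp
// Try several block sizes; the grid size is adjusted so the total
// number of threads (one per element) stays the same.
int blockSizes[] = {256, 512, 1024};
for (int threadsPerBlock : blockSizes) {
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
    cudaDeviceSynchronize();  // finish before timing the next configuration
}
```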
The core difference lies in the design philosophy: a CPU dedicates its silicon to a few powerful cores optimized for low-latency execution of individual threads, while a GPU spreads it across thousands of lightweight cores optimized for throughput, hiding memory latency by keeping many threads in flight at once.
Our experiment showed that for data-parallel tasks, the GPU is vastly superior once the data size overcomes the initial transfer overhead. The insensitivity to block size noted above (256 vs. 1024 threads) suggests that the workload is limited by memory bandwidth rather than by compute.
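To put a rough number on that (assuming 4-byte float elements): the 500-million-element run streams about 500,000,000 × 3 arrays × 4 bytes ≈ 6 GB through device memory, and doing so in roughly 0.01 s corresponds to an effective bandwidth on the order of 600 GB/s. That is in the neighborhood of a modern GPU's memory bandwidth, while the single add per element leaves the arithmetic units largely idle, which is consistent with a memory-bound kernel.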
What's next? Beyond simple addition, these results open the door to investigating asynchronous data transfers and compute/transfer overlap to hide the PCIe bottleneck. Future investigations could also examine how these speedups scale in multi-GPU environments or when applied to the more complex stencil computations used in climate modeling and fluid dynamics.
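As a starting point for that follow-up, here is a minimal sketch of compute/transfer overlap using CUDA streams. The stream count, block size, and function names are illustrative assumptions, and the host buffers would need to be allocated as pinned memory (e.g., with cudaMallocHost) for the asynchronous copies to actually overlap with kernel execution:

```cpp
#include <cuda_runtime.h>

// Process the input in chunks so the copies for chunk i+1 can overlap
// with the kernel working on chunk i. Assumes pinned host buffers and
// the vecAddKernel from the earlier sketch.
void vecAddOverlapped(const float* hA, const float* hB, float* hC,
                      float* dA, float* dB, float* dC, int n) {
    const int numStreams = 4;                          // illustrative choice
    const int chunk = (n + numStreams - 1) / numStreams;
    cudaStream_t streams[numStreams];
    for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < numStreams; ++s) {
        int offset = s * chunk;
        if (offset >= n) break;                        // nothing left for this stream
        int count = (offset + chunk <= n) ? chunk : n - offset;
        size_t bytes = static_cast<size_t>(count) * sizeof(float);

        cudaMemcpyAsync(dA + offset, hA + offset, bytes, cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(dB + offset, hB + offset, bytes, cudaMemcpyHostToDevice, streams[s]);

        int threadsPerBlock = 256;
        int blocks = (count + threadsPerBlock - 1) / threadsPerBlock;
        vecAddKernel<<<blocks, threadsPerBlock, 0, streams[s]>>>(
            dA + offset, dB + offset, dC + offset, count);

        cudaMemcpyAsync(hC + offset, dC + offset, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```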