CPUs vs GPUs: Performance Revisited

Lab Report: Vector Addition Scalability & Architecture

1. Introduction

The goal of this lab was to develop a deeper understanding of the fundamental architectural differences between GPUs and CPUs. We conducted an experiment using a simple vector addition to observe how parallelism impacts execution time as data scales up to 500 million elements.

2. System Specifications

GPU Architecture

  • Model: NVIDIA RTX PRO 4000 Blackwell
  • Streaming Multiprocessors (SMs): 70
  • CUDA Cores: 8,960
  • Threads: designed to keep many thousands of threads in flight simultaneously to hide memory latency

CPU Architecture

  • Model: Intel(R) Xeon(R) w3-2525
  • Cores: 8
  • Threads: 16 (Hyper-Threading)

3. The Experiment

We implemented the vector addition in two ways: a multi-threaded C++ version (using 2 threads as the CPU baseline) and a CUDA C++ version for the GPU.
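For reference, the code below is a minimal sketch of what the CUDA version looks like. The kernel name, launch configuration, and omission of error checking are our illustration, not necessarily the lab's exact code, which is linked under Project Resources below.

```cuda
// vector_add.cu -- illustrative sketch; kernel and variable names are ours,
// not necessarily those of the lab's actual source.
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *a, const float *b, float *c, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // one element per thread
}

int main() {
    const size_t n = 500000000;              // largest test size: 500M elements
    const size_t bytes = n * sizeof(float);  // ~2 GB per array

    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    // ... copy input data from the host into a and b ...

    const int threadsPerBlock = 256;         // the variable tuned in Section 4
    const int blocks = (int)((n + threadsPerBlock - 1) / threadsPerBlock);
    vectorAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();                 // wait for the kernel to finish

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```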

Project Resources

You can find the source code used for this experiment here:

Performance Results

Data Size (Elements)   CPU Time (2 Threads)   GPU Time (CUDA)   Speedup Factor
50,000                 0.000661 s             0.000165 s        ~4.0x
5,000,000              0.0127 s               0.000393 s        ~32.3x
500,000,000            0.7 s                  0.01 s            ~70x
Key Observation: CPU execution time grows roughly linearly with data size, while the GPU's fixed launch and scheduling overheads are amortized over larger inputs, widening the speedup from ~4x at 50,000 elements to ~70x at 500 million.
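The report does not show how the times were collected, so the following is an assumed but typical harness: cudaEvent timers around the kernel for the GPU, and std::chrono wall-clock for the CPU. Note that this version times the kernel only; including the PCIe copies in the timed region would noticeably change the GPU column, which is exactly the transfer overhead revisited in Section 5.

```cuda
#include <cuda_runtime.h>
#include <chrono>

__global__ void vectorAdd(const float *a, const float *b, float *c, size_t n); // from the sketch above

// Times the kernel alone with CUDA events; PCIe copies are excluded here.
float gpuKernelMs(const float *dA, const float *dB, float *dC, size_t n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threadsPerBlock = 256;
    const int blocks = (int)((n + threadsPerBlock - 1) / threadsPerBlock);

    cudaEventRecord(start);
    vectorAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // block until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// CPU path: plain wall-clock around the (multi-threaded) addition loop.
template <typename F>
double cpuSeconds(F &&work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```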

4. Architectural Deep Dive

The Block Size Experiment: During our tests, we modified the threadsPerBlock variable (e.g., switching between 256, 512, and 1024). Interestingly, changing the block size did not yield significant differences in execution time.

Why? For a simple vector addition, the GPU is primarily memory-bandwidth limited, not compute-limited. As long as the block size provides enough threads in flight to hide memory latency, the hardware saturates the available VRAM bandwidth, and further tuning of the thread count per block offers diminishing returns.
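A quick back-of-the-envelope check supports this (assuming 4-byte float elements, which the report does not state explicitly): adding two 500M-element vectors reads two arrays and writes one, moving roughly 3 × 4 B × 500,000,000 = 6 GB through memory. Doing that in the measured 0.01 s implies a sustained rate of about 600 GB/s, which is on the order of the card's rated VRAM bandwidth and nowhere near enough arithmetic to stress its 8,960 CUDA cores.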

Why is the GPU so much faster?

The core difference lies in the design philosophy:

  • CPU (latency-oriented): a few complex cores with large caches, branch prediction, and out-of-order execution, built to finish any single task as quickly as possible.
  • GPU (throughput-oriented): thousands of simple cores that trade single-thread speed for the ability to apply the same operation to massive amounts of data at once, hiding memory latency by rapidly switching between threads.

Main Applications for GPUs

  1. Graphics: Real-time rendering and raytracing for games and professional visualization.
  2. Artificial Intelligence: Training Large Language Models (LLMs) and Neural Networks, which rely heavily on matrix operations.
  3. Scientific Computing: Simulations of weather, physics, and molecular biology.

5. Conclusion & Next Steps

Our experiment showed that for data-parallel tasks, the GPU is vastly superior once the data is large enough to amortize the initial transfer overhead. Changing the block size (e.g., 256 vs. 1024 threads) showed little impact, reinforcing that our workload is limited by memory bandwidth rather than compute logic.

What's next? Beyond simple addition, these results open the door to researching asynchronous data transfers and compute-transfer overlap to hide the PCIe bottleneck. Future investigations could also examine how these speedups scale in multi-GPU environments or when applied to the complex stencil computations used in climate modeling and fluid dynamics.
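As a concrete starting point for that follow-up, here is a minimal sketch of compute-transfer overlap using CUDA streams. The stream count, chunking scheme, and function names are illustrative assumptions, not code from this lab.

```cuda
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *a, const float *b, float *c, size_t n); // from the sketch above

// Sketch of compute-transfer overlap: split the vectors into chunks and
// pipeline H2D copy -> kernel -> D2H copy across several CUDA streams.
// Host buffers must be pinned (cudaHostAlloc) for the copies to be truly async.
void overlappedAdd(const float *hA, const float *hB, float *hC,
                   float *dA, float *dB, float *dC, size_t n) {
    const int kStreams = 4;                           // illustrative choice
    const size_t chunk = (n + kStreams - 1) / kStreams;
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < kStreams; ++s) {
        const size_t off = (size_t)s * chunk;
        if (off >= n) break;
        const size_t len = (off + chunk <= n) ? chunk : n - off;
        const size_t bytes = len * sizeof(float);

        cudaMemcpyAsync(dA + off, hA + off, bytes, cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(dB + off, hB + off, bytes, cudaMemcpyHostToDevice, streams[s]);

        const int tpb = 256;
        const int blocks = (int)((len + tpb - 1) / tpb);
        vectorAdd<<<blocks, tpb, 0, streams[s]>>>(dA + off, dB + off, dC + off, len);

        cudaMemcpyAsync(hC + off, dC + off, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```

While one chunk's kernel runs, the next chunk's host-to-device copy can proceed over PCIe in a different stream, so transfer time is hidden behind computation instead of being paid serially.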