CPUs vs GPUs: Performance Revisited

Lab Report: Vector Addition Scalability & Architecture

1. Introduction

The goal of this lab was to develop a deeper understanding of the fundamental architectural differences between GPUs and CPUs. We conducted an experiment using a simple vector addition to observe how parallelism impacts execution time as data scales up to 500 million elements.

2. System Specifications

GPU Architecture

  • Model: NVIDIA RTX PRO 4000 Blackwell
  • Streaming Multiprocessors (SMs): 70
  • CUDA Cores: 8,960
  • Threads: designed to keep many thousands of threads in flight simultaneously to hide memory latency

CPU Architecture

  • Model: Intel(R) Xeon(R) w3-2525
  • Cores: 8
  • Threads: 16 (Hyper-Threading)

3. The Experiment

We implemented the vector addition in two ways: a multi-threaded C++ version (using 2 threads as the CPU baseline) and a CUDA C++ version for the GPU.
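For reference, the code below is a minimal sketch of what the CUDA version looks like. The kernel name, launch configuration, and omission of error checking are our illustration, not necessarily the lab's exact code, which is linked under Project Resources below.

```cuda
// vector_add.cu -- illustrative sketch; kernel and variable names are ours,
// not necessarily those of the lab's actual source.
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *a, const float *b, float *c, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // one element per thread
}

int main() {
    const size_t n = 500000000;              // largest test size: 500M elements
    const size_t bytes = n * sizeof(float);  // ~2 GB per array

    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    // ... copy input data from the host into a and b ...

    const int threadsPerBlock = 256;         // the variable tuned in Section 4
    const int blocks = (int)((n + threadsPerBlock - 1) / threadsPerBlock);
    vectorAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();                 // wait for the kernel to finish

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```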

Project Resources

You can find the source code used for this experiment here:

Performance Results

Data Size (Elements)   CPU Time (2 Threads)   GPU Time (CUDA)   Speedup Factor
50,000                 0.000661 s             0.000165 s        ~4.0x
5,000,000              0.0127 s               0.000393 s        ~32.3x
500,000,000            0.7 s                  0.01 s            ~70x
Key Observation: CPU execution time grows roughly linearly with data size, while the GPU's fixed launch and scheduling overheads are amortized over larger inputs, widening the speedup from ~4x at 50,000 elements to ~70x at 500 million.
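The report does not show how the times were collected, so the following is an assumed but typical harness: cudaEvent timers around the kernel for the GPU, and std::chrono wall-clock for the CPU. Note that this version times the kernel only; including the PCIe copies in the timed region would noticeably change the GPU column, which is exactly the transfer overhead revisited in Section 5.

```cuda
#include <cuda_runtime.h>
#include <chrono>

__global__ void vectorAdd(const float *a, const float *b, float *c, size_t n); // from the sketch above

// Times the kernel alone with CUDA events; PCIe copies are excluded here.
float gpuKernelMs(const float *dA, const float *dB, float *dC, size_t n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threadsPerBlock = 256;
    const int blocks = (int)((n + threadsPerBlock - 1) / threadsPerBlock);

    cudaEventRecord(start);
    vectorAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // block until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// CPU path: plain wall-clock around the (multi-threaded) addition loop.
template <typename F>
double cpuSeconds(F &&work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```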

4. Architectural Deep Dive

The Block Size Experiment: During our tests, we modified the threadsPerBlock variable (e.g., switching between 256, 512, and 1024). Interestingly, changing the block size did not yield significant differences in execution time.

Why? For a simple vector addition, the GPU is primarily memory-bandwidth limited, not compute-limited. As long as the block size provides enough threads in flight to hide memory latency, the hardware saturates the available VRAM bandwidth, and further tuning of the thread count per block offers diminishing returns.
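A quick back-of-the-envelope check supports this (assuming 4-byte float elements, which the report does not state explicitly): adding two 500M-element vectors reads two arrays and writes one, moving roughly 3 × 4 B × 500,000,000 = 6 GB through memory. Doing that in the measured 0.01 s implies a sustained rate of about 600 GB/s, which is on the order of the card's rated VRAM bandwidth and nowhere near enough arithmetic to stress its 8,960 CUDA cores.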

Why is the GPU so much faster?

The core difference lies in the design philosophy:

  • CPU (latency-oriented): a few complex cores with large caches, branch prediction, and out-of-order execution, built to finish any single task as quickly as possible.
  • GPU (throughput-oriented): thousands of simple cores that trade single-thread speed for the ability to apply the same operation to massive amounts of data at once, hiding memory latency by rapidly switching between threads.

Main Applications for GPUs

  1. Graphics: Real-time rendering and raytracing for games and professional visualization.
  2. Artificial Intelligence: Training Large Language Models (LLMs) and Neural Networks, which rely heavily on matrix operations.
  3. Scientific Computing: Simulations of weather, physics, and molecular biology.

5. Conclusion & Next Steps

Our experiment showed that for data-parallel tasks, the GPU is vastly superior once the data is large enough to amortize the initial transfer overhead. Changing the block size (e.g., 256 vs. 1024 threads) showed little impact, reinforcing that our workload is limited by memory bandwidth rather than compute logic.

What's next? Beyond simple addition, these results open the door to researching asynchronous data transfers and compute-transfer overlap to hide the PCIe bottleneck. Future investigations could also examine how these speedups scale in multi-GPU environments or when applied to the complex stencil computations used in climate modeling and fluid dynamics.
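As a concrete starting point for that follow-up, here is a minimal sketch of compute-transfer overlap using CUDA streams. The stream count, chunking scheme, and function names are illustrative assumptions, not code from this lab.

```cuda
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *a, const float *b, float *c, size_t n); // from the sketch above

// Sketch of compute-transfer overlap: split the vectors into chunks and
// pipeline H2D copy -> kernel -> D2H copy across several CUDA streams.
// Host buffers must be pinned (cudaHostAlloc) for the copies to be truly async.
void overlappedAdd(const float *hA, const float *hB, float *hC,
                   float *dA, float *dB, float *dC, size_t n) {
    const int kStreams = 4;                           // illustrative choice
    const size_t chunk = (n + kStreams - 1) / kStreams;
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < kStreams; ++s) {
        const size_t off = (size_t)s * chunk;
        if (off >= n) break;
        const size_t len = (off + chunk <= n) ? chunk : n - off;
        const size_t bytes = len * sizeof(float);

        cudaMemcpyAsync(dA + off, hA + off, bytes, cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(dB + off, hB + off, bytes, cudaMemcpyHostToDevice, streams[s]);

        const int tpb = 256;
        const int blocks = (int)((len + tpb - 1) / tpb);
        vectorAdd<<<blocks, tpb, 0, streams[s]>>>(dA + off, dB + off, dC + off, len);

        cudaMemcpyAsync(hC + off, dC + off, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```

While one chunk's kernel runs, the next chunk's host-to-device copy can proceed over PCIe in a different stream, so transfer time is hidden behind computation instead of being paid serially.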