16,896
CUDA Cores
NVIDIA H100 SXM5
Hopper Architecture · GH100
NVIDIA's data-center flagship for AI/HPC. Introduces FP8 Tensor Cores, 4th-gen NVLink, 3rd-gen NVSwitch, and Confidential Computing. Designed for LLM training at scale.
132
SMs
128
Cores/SM
80 GB HBM3
Memory
3.35 TB/s
Mem BW
67 TFLOPS
FP32
1979 TFLOPS
FP16 Tensor (w/ sparsity)
50 MB
L2 Cache
700W
TDP
GPU Architecture Hierarchy
CPU (Host)
CUDA API calls, memory mgmt, kernel launch (1 per system)
▼
PCIe/NVLink
Data transfer: cudaMemcpy, unified memory (↕ bidirectional)
▼
Giga Thread Engine
Distributes blocks to SMs, tracks resources (1 per GPU)
▼
Grid
All blocks of one kernel launch, scheduled onto SMs as resources free up (N blocks)
▼
SM (×132)
128 CUDA cores, 4 warp schedulers (×132)
▼
Block
Up to 1024 threads, shared SMEM, sync barrier (≤32 resident per SM)
▼
Warp (×32 threads)
SIMT unit: 32 threads execute the same instruction (32 threads)
▼
Thread (CUDA Core)
FP32/INT32 unit + registers (scalar)
How it works: The Giga Thread Engine distributes blocks across all SMs. Each SM runs multiple blocks concurrently, limited by register and shared-memory usage.
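The counts in this hierarchy can be confirmed at runtime. A minimal host-side sketch (assuming a CUDA 11+ toolkit, since the `maxBlocksPerMultiProcessor` field was added then) that queries device 0 via `cudaGetDeviceProperties`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    printf("SMs:               %d\n", prop.multiProcessorCount);        // 132 on H100 SXM5
    printf("Warp size:         %d\n", prop.warpSize);                   // 32
    printf("Max threads/block: %d\n", prop.maxThreadsPerBlock);         // 1024
    printf("Max blocks/SM:     %d\n", prop.maxBlocksPerMultiProcessor); // 32 on Hopper
    printf("Shared mem/SM:     %zu KB\n",
           prop.sharedMemPerMultiprocessor / 1024);                     // 228 KB
    return 0;
}
```

Compile with `nvcc`; the expected values in the comments apply to an H100 SXM5 and will differ on other GPUs.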
CUDA Core & SM Breakdown
132
Streaming Multiprocessors
128
CUDA Cores / SM
4
Warp Schedulers / SM
4
Tensor Cores / SM
64
Max Warps / SM
228 KB
Shared Mem / SM
SM throughput = 128 cores × frequency × 2 FLOP/cycle (FMA). Total FP32 = 132 SMs × 128 cores × 2 × ~1.98 GHz ≈ 67 TFLOPS.
Key insight: An SM (Streaming Multiprocessor) is the fundamental compute unit. Each has CUDA cores, warp schedulers, tensor cores, and shared memory.
Selected Kernel: CUDA Code Preview
// Vector Addition: C[i] = A[i] + B[i]
__global__ void vecAdd(float* A, float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];  // 1 FLOP, 3 memory ops
    }
}

// Launch: 1D grid, 1D blocks of 256 threads
int blocks = (N + 255) / 256;
vecAdd<<<blocks, 256>>>(d_A, d_B, d_C, N);
Arithmetic Intensity = 1/12 FLOP/byte (1 ADD per element; 2 reads + 1 write × 4 bytes = 12 bytes). Purely memory-bound. Optimal when memory accesses are coalesced.