CUDA
16,896
NVIDIA H100 SXM5
Hopper Architecture · GH100
NVIDIA's data-center flagship for AI/HPC. Introduces FP8 Tensor Cores, 4th-gen NVLink, 3rd-gen NVSwitch, and Confidential Computing. Designed for LLM training at scale.
132
SMs
128
Cores/SM
80 GB HBM3
Memory
3.35 TB/s
Mem BW
67 TFLOPS
FP32
1979 TFLOPS
FP16 Tensor (w/ sparsity)
50 MB
L2 Cache
700W
TDP
GPU Architecture Hierarchy
CPU (Host)
▼
PCIe/NVLink
▼
Giga Thread Engine
▼
Grid
▼
SM (×132)
▼
Block
▼
Warp (×32 threads)
▼
Thread (CUDA Core)
How it works: The Giga Thread Engine distributes thread blocks across all SMs. Each SM runs multiple blocks concurrently, limited by its register-file and shared-memory capacity.
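The hierarchy above can be probed in code. The sketch below (a minimal example, not from the original page) derives each thread's position in the grid/block/warp hierarchy, then asks the runtime how many blocks of this kernel fit on one SM given its register and shared-memory footprint, using `cudaOccupancyMaxActiveBlocksPerMultiprocessor`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread derives its place in the grid > block > warp > thread hierarchy.
__global__ void whoAmI() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x; // grid-wide thread index
    int warpId   = threadIdx.x / 32;                      // warp within the block
    int lane     = threadIdx.x % 32;                      // lane within the warp
    if (globalId == 0)
        printf("block %d, warp %d, lane %d\n", blockIdx.x, warpId, lane);
}

int main() {
    // How many resident blocks per SM does the occupancy calculator allow
    // for this kernel at a block size of 256 threads, no dynamic shared mem?
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, whoAmI, /*blockSize=*/256, /*dynamicSMemSize=*/0);
    printf("resident blocks per SM: %d\n", blocksPerSM);

    whoAmI<<<4, 256>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Multiplying `blocksPerSM` by the SM count (132 here) gives the total number of blocks the GPU can keep resident at once; the Giga Thread Engine queues the rest.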
CUDA Core & SM Breakdown
132
Streaming Multiprocessors
128
CUDA Cores / SM
4
Warp Schedulers / SM
4
Tensor Cores / SM
64
Max Warps / SM
228 KB
Shared Mem / SM
SM Throughput = 128 FP32 cores × clock × 2 FLOP/FMA
Total FP32 ≈ 132 SMs × 128 cores × 2 × ~1.98 GHz ≈ 67 TFLOPS
Key insight: An SM (Streaming Multiprocessor) is the fundamental compute unit. Each has CUDA cores, warp schedulers, tensor cores, and shared memory.
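The per-SM figures above can be read back from the runtime. This short host-side sketch (assuming device 0 is the H100) queries `cudaGetDeviceProperties` for the fields that correspond to the table:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    printf("SMs:              %d\n",  prop.multiProcessorCount);          // 132 on H100 SXM5
    printf("Warp size:        %d\n",  prop.warpSize);                     // 32
    printf("Shared mem / SM:  %zu KB\n", prop.sharedMemPerMultiprocessor / 1024); // 228 KB
    printf("Max threads / SM: %d\n",  prop.maxThreadsPerMultiProcessor);  // 2048 = 64 warps
    return 0;
}
```

Note that 2048 max threads per SM divided by the warp size of 32 gives exactly the 64 max warps per SM listed above.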
Selected Kernel: CUDA Code Preview
// Vector Addition: C[i] = A[i] + B[i]
__global__ void vecAdd(float* A, float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];  // 1 FLOP, 3 memory ops
    }
}
// Launch: 1D grid, 1D blocks of 256 threads
int blocks = (N + 255) / 256;
vecAdd<<<blocks, 256>>>(d_A, d_B, d_C, N);
Arithmetic Intensity = 1/12 FLOP/byte (1 ADD vs. 2 reads + 1 write × 4 bytes each = 12 bytes). Purely memory-bound. Optimal when memory accesses are coalesced.
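For completeness, a minimal self-contained host driver for the kernel above (allocation, copies, launch, and verification are standard boilerplate, not part of the original preview):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(float* A, float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

int main() {
    const int N = 1 << 20;                 // 1M elements
    size_t bytes = N * sizeof(float);

    // Host buffers
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    // Device buffers
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes); cudaMalloc(&d_B, bytes); cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    int blocks = (N + 255) / 256;          // round up so every element is covered
    vecAdd<<<blocks, 256>>>(d_A, d_B, d_C, N);
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    printf("C[0] = %.1f\n", h_C[0]);       // 1.0 + 2.0 = 3.0
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```

Because consecutive threads read consecutive elements, every global load and store here is coalesced, which is the best case for a memory-bound kernel like this one.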