16,896
CUDA Cores
NVIDIA H100 SXM5
Hopper Architecture · GH100
NVIDIA's data-center flagship for AI/HPC. Introduces FP8 Tensor Cores, 4th-gen NVLink, 3rd-gen NVSwitch, and Confidential Computing. Designed for LLM training at scale.
132
SMs
128
Cores/SM
80 GB HBM3
Memory
3.35 TB/s
Mem BW
67 TFLOPS
FP32
1979 TFLOPS
FP16 Tensor (w/ sparsity)
50 MB
L2 Cache
700W
TDP
GPU Architecture Hierarchy
CPU (Host)
CUDA API calls, memory mgmt, kernel launch (1 per system)
▼
PCIe/NVLink
Data transfer: cudaMemcpy, unified memory (↕ bidirectional)
▼
Giga Thread Engine
Distributes blocks to SMs, tracks resources (1 per GPU)
▼
Grid
All blocks of one kernel launch, scheduled onto SMs as resources free up (N blocks)
▼
SM (×132)
128 CUDA cores, 4 warp schedulers (×132)
▼
Block
Up to 1024 threads, shared SMEM, sync barrier (≤32 resident per SM)
▼
Warp (×32 threads)
SIMT unit: 32 threads execute the same instruction (32 threads)
▼
Thread (CUDA Core)
FP32/INT32 unit + registers (scalar)
How it works: The Giga Thread Engine distributes blocks across all SMs. Each SM runs multiple blocks concurrently, limited by register and shared-memory usage.
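The counts in this hierarchy can be confirmed at runtime. A minimal host-side sketch (assuming a CUDA 11+ toolkit, since the `maxBlocksPerMultiProcessor` field was added then) that queries device 0 via `cudaGetDeviceProperties`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    printf("SMs:               %d\n", prop.multiProcessorCount);        // 132 on H100 SXM5
    printf("Warp size:         %d\n", prop.warpSize);                   // 32
    printf("Max threads/block: %d\n", prop.maxThreadsPerBlock);         // 1024
    printf("Max blocks/SM:     %d\n", prop.maxBlocksPerMultiProcessor); // 32 on Hopper
    printf("Shared mem/SM:     %zu KB\n",
           prop.sharedMemPerMultiprocessor / 1024);                     // 228 KB
    return 0;
}
```

Compile with `nvcc`; the expected values in the comments apply to an H100 SXM5 and will differ on other GPUs.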
CUDA Core & SM Breakdown
132
Streaming Multiprocessors
128
CUDA Cores / SM
4
Warp Schedulers / SM
4
Tensor Cores / SM
64
Max Warps / SM
228 KB
Shared Mem / SM
SM throughput = 128 cores × frequency × 2 FLOP/cycle (FMA). Total FP32 = 132 SMs × 128 cores × 2 × ~1.98 GHz ≈ 67 TFLOPS.
Key insight: An SM (Streaming Multiprocessor) is the fundamental compute unit. Each has CUDA cores, warp schedulers, tensor cores, and shared memory.
Selected Kernel: CUDA Code Preview
// Vector Addition: C[i] = A[i] + B[i]
__global__ void vecAdd(float* A, float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];  // 1 FLOP, 3 memory ops
    }
}

// Launch: 1D grid, 1D blocks of 256 threads
int blocks = (N + 255) / 256;
vecAdd<<<blocks, 256>>>(d_A, d_B, d_C, N);
Arithmetic Intensity = 1/12 FLOP/byte (1 ADD per element; 2 reads + 1 write × 4 bytes = 12 bytes). Purely memory-bound. Optimal when memory accesses are coalesced.