Computer Architecture Basic Arithmetics
Table of Contents
FLOPs
Case Study: Intel CPU
Case Study: NVIDIA A100 GPU
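In the Ampere architecture 1, the NVIDIA A100 GPU has 432 Tensor Cores and a GPU boost clock of 1410 MHz. Each A100 Tensor Core can execute 8*4*8=256 FP16 FMAs per clock. Since one FMA (fused multiply-add) counts as 2 floating-point operations, that is 512 FLOP per clock per Tensor Core.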
Thus the peak FP16 Tensor Core performance of the A100 GPU is:
\[432 \times 512 \ \textrm{FLOP/clock} \times 1410 \ \textrm{MHz} \approx 312 \ \textrm{TFLOPS}\]
With the sparsity feature enabled (2x throughput), the A100 can reach a peak of 624 TFLOPS.
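The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper `peak_flops` and the constant names are made up for illustration, and the numbers are the ones quoted in the text.

```python
def peak_flops(units, flop_per_unit_per_clock, clock_hz):
    """Peak throughput in FLOP/s: units * FLOP per unit per clock * clock rate."""
    return units * flop_per_unit_per_clock * clock_hz


# Figures quoted in the text for the NVIDIA A100.
TENSOR_CORES = 432
FMA_PER_CLOCK = 8 * 4 * 8   # 256 FP16 FMAs per Tensor Core per clock
FLOP_PER_FMA = 2            # one multiply plus one add
BOOST_CLOCK_HZ = 1410e6     # 1410 MHz

dense_flops = peak_flops(TENSOR_CORES, FMA_PER_CLOCK * FLOP_PER_FMA, BOOST_CLOCK_HZ)
sparse_flops = 2 * dense_flops  # the 2x throughput factor from the sparsity feature

print(f"dense : {dense_flops / 1e12:.1f} TFLOPS")
print(f"sparse: {sparse_flops / 1e12:.1f} TFLOPS")
```

The unrounded results are slightly below the headline figures, which are rounded to 312 and 624 TFLOPS.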
Roofline model
The Roofline model (Williams et al. 2) is a visual performance model for floating-point programs.
Some terms:
- Operational Intensity
  - Operations per byte of DRAM traffic (measured in flop/byte).
- Attainable GFlops
  - Min(Peak Floating Point Performance, Peak Memory Bandwidth * Operational Intensity)
We say a compute kernel is memory-bound if Peak Floating Point Performance >= Peak Memory Bandwidth * Operational Intensity, and compute-bound otherwise. In the memory-bound region of the roofline plot, the slope of the line is the peak memory bandwidth (measured in bytes/s).
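The condition above can also be written in terms of a single threshold, often called the ridge point. The symbols here are shorthand introduced for this sketch and are not from the original text: \(P_{\textrm{peak}}\) is the peak floating-point performance (flop/s), \(B_{\textrm{peak}}\) is the peak memory bandwidth (bytes/s), and \(I\) is the operational intensity (flop/byte).

\[\textrm{Attainable}(I) = \min(P_{\textrm{peak}},\ B_{\textrm{peak}} \cdot I), \qquad I^{*} = \frac{P_{\textrm{peak}}}{B_{\textrm{peak}}}\]

With this notation, a kernel is memory-bound when \(I \le I^{*}\) and compute-bound when \(I > I^{*}\), which is the same inequality as above divided through by \(B_{\textrm{peak}}\).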
The same kernel might be compute-bound in one environment while memory-bound in another.
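As a minimal sketch of that last point, the snippet below evaluates one kernel on two machines. The function names, the two machine specifications, and the kernel's operational intensity are all hypothetical values chosen for illustration, not measurements of real hardware.

```python
def attainable_flops(peak_flops, peak_bandwidth, intensity):
    """Roofline bound: min(peak compute, peak bandwidth * operational intensity)."""
    return min(peak_flops, peak_bandwidth * intensity)


def classify(peak_flops, peak_bandwidth, intensity):
    """Apply the memory-bound / compute-bound definition from the text."""
    if peak_flops >= peak_bandwidth * intensity:
        return "memory-bound"
    return "compute-bound"


# Hypothetical machines: same peak compute (flop/s), different bandwidth (bytes/s).
machine_a = {"peak_flops": 10e12, "peak_bandwidth": 1e12}    # ridge point: 10 flop/byte
machine_b = {"peak_flops": 10e12, "peak_bandwidth": 100e9}   # ridge point: 100 flop/byte

kernel_intensity = 25.0  # flop/byte, assumed for illustration

for name, machine in (("A", machine_a), ("B", machine_b)):
    bound = classify(intensity=kernel_intensity, **machine)
    perf = attainable_flops(intensity=kernel_intensity, **machine)
    print(f"machine {name}: {bound}, attainable {perf / 1e12:.1f} TFLOPS")
```

On machine A the kernel's intensity is above the ridge point, so it is limited by peak compute. On machine B the lower bandwidth moves the ridge point past the kernel's intensity, so the same kernel is limited by memory traffic.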