Computer Architecture Basic Arithmetic

Table of Contents

FLOPs

Roofline model

FLOPs

Case Study: NVIDIA A100 GPU

In the Ampere architecture 1, the NVIDIA A100 GPU has 432 Tensor Cores and a GPU boost clock of 1410 MHz. Each A100 Tensor Core can execute 8*4*8 = 256 FP16 FMAs (512 FLOP) per clock.

Thus the peak floating-point performance of the A100 GPU is:

\[432 \times 512\,\textrm{FLOP} \times 1410\,\textrm{MHz} \approx 312\,\textrm{TFLOPS}\]

With the structured sparsity feature enabled (2x throughput), the A100 can achieve 624 TFLOPS.
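
As a quick sanity check, this arithmetic can be reproduced with a few lines of Python; the sketch below just multiplies the figures quoted above.

# Peak dense FP16 Tensor Core throughput of the A100, using the figures above.
tensor_cores = 432             # Tensor Cores per A100 GPU
flop_per_core_per_clock = 512  # 256 FP16 FMAs = 512 FLOP per Tensor Core per clock
boost_clock_hz = 1410e6        # 1410 MHz GPU boost clock

peak_flops = tensor_cores * flop_per_core_per_clock * boost_clock_hz
print(f"dense:  {peak_flops / 1e12:.0f} TFLOPS")      # ~312 TFLOPS
print(f"sparse: {2 * peak_flops / 1e12:.0f} TFLOPS")  # ~624 TFLOPS with 2:4 structured sparsity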

Roofline model

The roofline model (Williams et al. 2) is a visual performance model for floating-point programs.

Some terms:

Operational Intensity
operations per byte of DRAM traffic (measured in FLOP/byte)
Attainable GFLOPS
min(Peak Floating-Point Performance, Peak Memory Bandwidth * Operational Intensity)
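
As a worked example of where the two terms of the min cross over: taking the 312 TFLOPS peak from above and assuming the 40 GB A100's 1555 GB/s HBM2 bandwidth,

\[\frac{312 \times 10^{12}\,\textrm{FLOP/s}}{1555 \times 10^{9}\,\textrm{byte/s}} \approx 200\,\textrm{FLOP/byte}\]

For operational intensities below roughly 200 FLOP/byte the bandwidth term is the smaller of the two; above it, the compute peak is.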

roofline.png

Figure 1: Roofline model for the Opteron X2 in Williams et al.

We say a compute kernel is memory-bound if Peak Floating-Point Performance >= Peak Memory Bandwidth * Operational Intensity, and compute-bound otherwise. The slope of the memory-bound region of the roofline is the memory bandwidth (measured in bytes/s).

The same kernel might be compute-bound in one environment but memory-bound in another.
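
To make this concrete, here is a minimal sketch of the roofline formula in Python, applied to the A100 numbers above (312 TFLOPS peak FP16 and, as an assumption about the variant, the 40 GB A100's 1555 GB/s HBM2 bandwidth); the two operational intensities are made-up example values on either side of the crossover.

# Roofline: attainable performance = min(peak compute, bandwidth * operational intensity).
def attainable_flops(peak_flops, peak_bandwidth, intensity):
    """Attainable FLOP/s for a kernel with the given operational intensity (FLOP/byte)."""
    return min(peak_flops, peak_bandwidth * intensity)

PEAK_FLOPS = 312e12  # peak floating-point performance, FLOP/s (from above)
PEAK_BW = 1.555e12   # peak memory bandwidth, bytes/s (assumed 40 GB A100 HBM2)

for intensity in (2.0, 500.0):  # example operational intensities, FLOP/byte
    perf = attainable_flops(PEAK_FLOPS, PEAK_BW, intensity)
    bound = "memory-bound" if PEAK_FLOPS >= PEAK_BW * intensity else "compute-bound"
    print(f"intensity {intensity:5.1f} FLOP/byte -> {perf / 1e12:5.1f} TFLOPS ({bound})")

Running this prints roughly 3.1 TFLOPS (memory-bound) for the low-intensity kernel and the full 312 TFLOPS (compute-bound) for the high-intensity one.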

Footnotes:

Author: expye(Zihao Ye)

Email: expye@outlook.com

Date:

Last modified: 2022-12-27 Tue 07:18

Licensed under CC BY-NC 4.0