Data Movement in CUDA
Memory Hierarchy
Host to Device
Unified Virtual Addressing (UVA)
Unified Virtual Addressing provides a single address space for host memory and all GPU memories, but the GPU still cannot access data directly from pageable host memory. Whenever you copy data from host to device, the CUDA driver allocates a temporary page-locked buffer on the host side and then copies the data from that locked buffer to the device; this page-locked region is also known as Pinned Host Memory.
Users can allocate page-locked memory directly via the cudaHostAlloc API. The allocated memory is visible to all devices, and moving data from it to any device avoids the overhead of staging data from pageable memory into page-locked memory. Multi-GPU sampling in Graph Neural Networks can benefit a lot from this feature1.
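Below is a minimal sketch (my own illustration, not taken from the referenced work) of allocating portable pinned memory with cudaHostAlloc and copying it to the device. With the cudaHostAllocPortable flag the buffer is pinned and visible to every device, so cudaMemcpy can DMA from it directly without an extra staging copy.

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_pinned, *d_buf;
    // cudaHostAllocPortable makes the pinned allocation visible to all devices.
    cudaHostAlloc(&h_pinned, bytes, cudaHostAllocPortable);
    cudaMalloc(&d_buf, bytes);

    for (size_t i = 0; i < n; ++i) h_pinned[i] = 1.0f;

    // No staging through a temporary pinned buffer is needed here:
    // the driver can DMA directly from h_pinned to the device.
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```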
Unified Memory
Unified Memory (UM) is a technique built upon UVA: UVA is only an abstraction layer for memory addressing, while UM is the memory management system in the CUDA runtime. With the help of UM, users can manipulate complex data structures such as linked lists on GPUs without tedious deep copies.
More specifically, memory managed by cudaMallocManaged is shared between host and device: whenever the device touches a page that does not reside in device memory, a page fault is raised, and the Unified Memory driver processes the fault and migrates the page from host to device.
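As a concrete sketch of the linked-list case mentioned above (my own example, assuming a single GPU), the nodes below are allocated with cudaMallocManaged and linked on the host, then traversed directly by a kernel; the pages migrate on demand when the kernel first touches them, with no manual deep copy.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

struct Node {
    int   value;
    Node* next;
};

__global__ void sum_list(const Node* head, int* out) {
    // Walk the host-built list directly; pages migrate on demand via page faults.
    int total = 0;
    for (const Node* p = head; p != nullptr; p = p->next) total += p->value;
    *out = total;
}

int main() {
    Node* head = nullptr;
    // Build the linked list on the host out of managed allocations.
    for (int i = 4; i >= 1; --i) {
        Node* node;
        cudaMallocManaged(&node, sizeof(Node));
        node->value = i;
        node->next = head;
        head = node;
    }

    int* result;
    cudaMallocManaged(&result, sizeof(int));
    sum_list<<<1, 1>>>(head, result);
    cudaDeviceSynchronize();
    printf("sum = %d\n", *result);  // expected 1 + 2 + 3 + 4 = 10

    // Free the managed nodes.
    for (Node* p = head; p != nullptr;) {
        Node* next = p->next;
        cudaFree(p);
        p = next;
    }
    cudaFree(result);
    return 0;
}
```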
I can imagine this gets considerably more complex when multiple GPUs are involved, and I have not investigated how the driver handles that case.
Here are some tips on how to maximize UM performance.
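One such tip, sketched below as my own illustration (not taken from the linked tips), is to prefetch managed memory to the GPU before launching a kernel so that the first touch does not stall on page faults; cudaMemAdvise and cudaMemPrefetchAsync are the relevant runtime calls.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // first touched on the host

    int device = 0;
    cudaGetDevice(&device);
    // Set the GPU as the preferred location and migrate the data up front,
    // so the kernel below does not stall on page faults.
    cudaMemAdvise(data, n * sizeof(float), cudaMemAdviseSetPreferredLocation, device);
    cudaMemPrefetchAsync(data, n * sizeof(float), device, 0);

    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```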
Memory Coalescing
Efficient memory copy from global memory to shared/local memory requires the memory access requests within a warp to be consecutive (a.k.a. memory coalescing2,3). This is because a single memory transaction issued by a warp fetches 128 consecutive bytes from global memory; if the accesses are not consecutive, the warp has to issue multiple transactions and memory access becomes serialized.
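A minimal sketch of the contrast (my own illustration): in the first kernel adjacent threads read adjacent elements, so a warp's 32 four-byte loads coalesce into a few 128-byte transactions; in the second kernel a large stride scatters the warp's loads across many 128-byte segments and the warp has to issue up to 32 transactions.

```cuda
// Coalesced: thread i reads element i, so accesses within a warp are consecutive.
__global__ void copy_coalesced(float* dst, const float* src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Strided: thread i reads element i * stride; with a large stride each access in
// a warp falls into a different 128-byte segment, serializing the transactions.
__global__ void copy_strided(float* dst, const float* src, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) dst[i] = src[i * stride];
}
```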
hardware-effects-gpu4 contains some simple experiment code measuring (NVIDIA) GPU hardware effects, including memory coalescing and bank conflicts.
Shared to Register
Shared memory is divided into banks (32 in total), and addresses are assigned to banks via the formula bank = (address / bank_size) % 32. The bank size is usually 4 bytes, but some GPU architectures also support 8- or 16-byte banking modes. Memory accesses from different threads inside a warp to different addresses in the same bank are serialized. We call this phenomenon a "Bank Conflict".
See more about shared memory access patterns in the GTC 2018 talk on the Volta Architecture5.
Figure 1: Image for d2l-tvm
We use the hardware-effects-gpu4 benchmark code on an RTX 3080 GPU.
```cuda
#define MEMORY_SIZE 4096

__global__ void kernel(int offset) {
    __shared__ int32_t sharedMem[MEMORY_SIZE];
    int threadId = threadIdx.x;

    // init shared memory
    if (threadId == 0) {
        for (int i = 0; i < MEMORY_SIZE; i++) sharedMem[i] = 0;
    }
    __syncthreads();

    // repeatedly read and write to shared memory
    uint32_t index = threadId * offset;
    for (int i = 0; i < 10000; i++) {
        sharedMem[index] += index * i;
        index += 32;
        index %= MEMORY_SIZE;
    }
}
```
The core idea is to let thread-i read/write the address i * offset. offset=0 is the broadcasting case, and when offset=1, thread-i accesses bank-i; there are no bank conflicts in either case. When offset=32, all threads access exactly the same bank but different addresses (so broadcasting is not possible), causing severe bank conflicts (a 4x slowdown); from this we can also infer that the bank size is 4 bytes on the RTX 3080.
Pipeline
Inter-Kernel Pipeline
CUDA streams can help overlap kernel execution and data transfers.
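A minimal sketch of the pattern (my own illustration): the input is split into chunks, and each chunk's host-to-device copy, kernel launch, and device-to-host copy are issued on their own stream, so copies from one stream can overlap with kernels from another. Pinned host memory is required for the copies to actually run asynchronously.

```cuda
#include <cuda_runtime.h>

#define N (1 << 20)
#define NUM_STREAMS 4

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int chunk = N / NUM_STREAMS;
    float *h_data, *d_data;
    // Pinned host memory is required for truly asynchronous copies.
    cudaHostAlloc(&h_data, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_data, N * sizeof(float));
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, runs the kernel, and copies the result
    // back; work from different streams can overlap on the copy/compute engines.
    for (int s = 0; s < NUM_STREAMS; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NUM_STREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```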
Intra-Kernel Pipeline
CUDA 11.1 supports multi-stage asynchronous copy inside a CUDA kernel.
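Below is a sketch of the double-buffered pattern, adapted from the style of the examples in the CUDA C++ Programming Guide; the tile size, helper names, and data layout are my own assumptions, it expects a launch with 256 threads per block, and it benefits from hardware-accelerated copies on Ampere or newer. The async copy of tile t+1 into one shared-memory buffer is overlapped with the computation on tile t in the other buffer, coordinated via cuda::pipeline and cuda::memcpy_async.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda/pipeline>

namespace cg = cooperative_groups;

constexpr int kTile   = 256;  // elements copied per stage (one per thread)
constexpr int kStages = 2;    // double buffering

// Hypothetical per-tile computation; stands in for real work.
__device__ float consume(const float* tile, int tid) {
    return tile[tid] * 2.0f;
}

// Each block walks over `num_tiles` tiles of its slice of `in`, overlapping the
// asynchronous copy of tile t+1 with the computation on tile t.
__global__ void staged_kernel(float* out, const float* in, int num_tiles) {
    __shared__ float smem[kStages][kTile];
    auto block = cg::this_thread_block();

    __shared__ cuda::pipeline_shared_state<
        cuda::thread_scope::thread_scope_block, kStages> state;
    auto pipe = cuda::make_pipeline(block, &state);

    int tid = block.thread_rank();
    const float* base  = in  + blockIdx.x * num_tiles * kTile;
    float*       obase = out + blockIdx.x * num_tiles * kTile;
    if (num_tiles <= 0) return;

    // Prime the pipeline with the first tile.
    pipe.producer_acquire();
    cuda::memcpy_async(block, smem[0], base, sizeof(float) * kTile, pipe);
    pipe.producer_commit();

    for (int t = 0; t < num_tiles; ++t) {
        // Issue the copy for the next tile into the other shared-memory buffer.
        if (t + 1 < num_tiles) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem[(t + 1) % kStages],
                               base + (t + 1) * kTile, sizeof(float) * kTile, pipe);
            pipe.producer_commit();
        }
        // Wait for tile t to arrive, compute on it, then release its buffer.
        pipe.consumer_wait();
        obase[t * kTile + tid] = consume(smem[t % kStages], tid);
        pipe.consumer_release();
    }
}
```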