Data Movement in CUDA

Table of Contents

Memory Hierarchy

Host to Device

Unified Virtual Addressing (UVA)

The Unified Virtual Addressing provides a unified addressing space for host memory and all GPU memories, GPU cannot access data directly from pageable host memory. Whenever you perform a data copy from host to device, CUDA drivers allocates a temporary page-locked section at host-side, then copy the section from the locked page to device, this page-locked section is also known as Pinned Host Memory.

User can allocate page-locked memory via cudaHostAlloc API, and the allocated memory is visible to all devices and there will be no overhead of "transferring data from pageable memory to page-locked memory" if we move data from page-locked memory to any of the devices. The multi-GPU sampling in Graph Neural Networks can benefit a lot from this feature1.

Unified Memory

Unified Memory (UM) is a technique built upon UVA, UVA is only a abstraction layer for memory addressing and UM is the memory management system in CUDA runtime. With the help of UM, user can manipulate fancy data structure on GPUs such as linked lists without tedious deep-copy.

More specifically, memory managed by cudaMallocManaged shares between host and device, whenever device requests a page that does not reside on device, CUDA runtime raises a page fault, and the Unified Memory driver would process the page-faults and migrate the page from host to memory.

I can imagine this could get very complex when there are multiple GPUs, and I've not investigated how can they handle this properly.

Here are some tips on how to maximize UM performance.

Memory Coalescing

Efficient memory copy from global memory to shared/local memory requires the memory access request in a warp to be consecutive (a.k.a. memory coalescing2,3), this is because a single memory access transaction issued by a warp fetches 128 consecutive bytes from global memory. If the memory access request is not consecutive, the warp would issue multiple transaction and the memory access become serialized.

hardware-effects-gpu 4 contains some simple experiment code measuring (NVIDIA) GPU hardware effects including memory coalescing and bank conflicts.

Shared to Register

Shared Memory are divided into banks (32 in total), and banks are assigned to addresses via formula: bank = (address / bank_size) % 32, the bank size is usually 4 bytes but some GPU architectures also supports 8~/~16 bytes banking mode. Memory access from different threads inside a warp to different addresses in the same bank would be serialized. We call this phenomenon "Bank Conflict".

See more about shared memory access pattern at GTC 2018 on Volta Architectur5e.

Sorry, your browser does not support SVG.

Figure 1: Image for d2l-tvm

We use the hardware-effects-gpu 4 benchmark code for RTX 3080 GPU.

#define MEMORY_SIZE 4096
__global__ void kernel(int offset)
  __shared__ int32_t sharedMem[MEMORY_SIZE];

  int threadId = threadIdx.x;

  // init shared memory
  if (threadId == 0)
      for (int i = 0; i < MEMORY_SIZE; i++) sharedMem[i] = 0;

  // repeatedly read and write to shared memory
  uint32_t index = threadId * offset;
  for (int i = 0; i < 10000; i++)
      sharedMem[index] += index * i;
      index += 32;
      index %= MEMORY_SIZE;

The core idea is to let thread-i read/write the address i * offset.


Figure 2: Average kernel time vs offset.

offset=0 refers to the broadcasting case, and when offset=1, thread-i accesses bank-i, there are no bank conflicts in both cases. When offset=32, all threads are accessing exactly the same bank but different addresses (thus broadcasting is not possible), rendering severe bank conflict (4x slowdown), we can also infer that bank size is 4 bytes on RTX 3080.


Inter-Kernel Pipeline

CUDA stream can help overlap kernel execution and Data Transfers.


Author: expye(Zihao Ye)


Date: 2021-07-01 Thu 00:00

Last modified: 2022-04-26 Tue 23:33

Licensed under CC BY-NC 4.0