Writing a lock in CUDA


This article borrows its idea from Will Landau's lecture 1 and a GTC 2013 presentation 2.

Why lock in CUDA

Sometimes different threads/thread blocks in our program access the same memory address, and we want to avoid race conditions: a lock lets us mark a region as a critical section, which means only one thread can execute the code inside that region at a time.
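
As a (hypothetical) illustration, the kernel below increments a shared counter with a plain read-modify-write; two threads can read the same old value and both write back old + 1, so increments get lost:

__global__ void racy_counter(int *total) {
  // unsynchronized read-modify-write: the final value is usually far smaller
  // than the number of threads because concurrent updates overwrite each other
  *total = *total + 1;
}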

CUDA does not provide such a primitive, so we need to build one on our own.

AtomicCAS/AtomicExch

CAS (Compare-And-Swap) is an instruction for achieving synchronization; semantically it is equivalent to 3

def cas(addr, cmp, new):
    # enter exclusive region
    old = mem[addr]
    if mem[addr] == cmp:
        mem[addr] = new
    # exit exclusive region
    return old

It always returns the old value residing at the given address, but changes that value to the given new value if the old value equals cmp.
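
In CUDA C++, this is exposed as the built-in atomicCAS; for int it looks like the following (unsigned and 64-bit overloads also exist):

// returns the old value at address; stores val only if old == compare
int atomicCAS(int *address, int compare, int val);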

Another useful instruction is exchange:

def exchange(addr, val):
    # enter exclusive region
    old = mem[addr]
    mem[addr] = val
    # exit exclusive region
    return old
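
The CUDA counterpart for exchange is atomicExch, again with several overloads (including float):

// unconditionally stores val and returns the previous value at address
int atomicExch(int *address, int val);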

atomicCAS / atomicExch are CUDA's implementations of the CAS and exchange instructions. We can use atomicCAS to construct other atomic operators, such as AtomicAdd:

__device__ int32_t AtomicAdd(int32_t *addr, int32_t val) {
  int32_t old = *addr, assumed;
  do {
    assumed = old;
    // try to install assumed + val; fails if *addr no longer equals assumed
    old = atomicCAS(addr, assumed, assumed + val);
  } while (assumed != old);
  return old;
}

The reason we need a retry loop here is that the expression atomicCAS(addr, *addr, *addr + val) compiles to several assembly instructions: some other thread might change the value at addr between the time we read *addr and compute *addr + val and the time the CAS actually executes. If the value returned by atomicCAS does not equal assumed, some other thread must have already changed *addr, and we need to try the operation again.
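
The same CAS-retry pattern can emulate atomics the hardware does not provide natively. For instance, a double-precision add can be emulated by CAS-ing on the 64-bit bit pattern of the double; the sketch below follows the well-known pattern from the CUDA programming guide (the name atomicAddDouble is mine):

__device__ double atomicAddDouble(double *addr, double val) {
  // reinterpret the double as a 64-bit integer so we can CAS on it
  unsigned long long *p = reinterpret_cast<unsigned long long *>(addr);
  unsigned long long old = *p, assumed;
  do {
    assumed = old;
    old = atomicCAS(p, assumed,
                    __double_as_longlong(val + __longlong_as_double(assumed)));
  } while (assumed != old);  // retry if another thread changed *addr meanwhile
  return __longlong_as_double(old);
}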

Following the same insight, we can define a pair of acquire_lock and unlock functions:

__device__ int locked = 0;
__device__ __forceinline__ void acquire_lock() {
  /*
   * a single attempt:
   * if (locked == 0) {
   *   locked = 1;
   *   return true;  // acquire successful
   * }
   * note: we need to fuse the if-statement above into a single CAS instruction.
   * return false;   // acquire failed
   */
  while (atomicCAS(&locked, 0, 1) != 0);
}
__device__ __forceinline__ void unlock() {
  locked = 0;
}

__global__ void some_kernel(...) {
  acquire_lock();
  // critical section
  // YOUR CODE HERE
  unlock();
}

However, the code above does not work in general. Why?

Deadlock

The smallest executable unit of parallelism in CUDA is the warp, which consists of 32 threads; this means threads inside the same warp synchronize after every step. If there are divergences in the program (e.g. if-then-else statements), the GPU instructs the warp to enter both branches serially, masking the threads that are inactive in each branch.

Figure 1: warp's execution model
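
As a tiny (hypothetical) example of divergence, in the kernel below a pre-Volta warp executes the two branches one after another, with the lanes that did not take the current branch masked off:

__global__ void divergent(int *out) {
  if (threadIdx.x % 2 == 0) {
    out[threadIdx.x] = 1;  // even lanes active, odd lanes masked
  } else {
    out[threadIdx.x] = 2;  // odd lanes active, even lanes masked
  }
}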

Let's consider what happens inside a warp if we use the lock construct above: suppose thread 1 acquires the lock; it blocks the other threads from entering the critical section. However, thread 1 is also forced to stay stuck in the acquire-lock loop, because all threads inside a warp are synchronized, which prevents thread 1 from reaching the unlock instruction that releases the lock.

This is a deadlock!

Figure 2: deadlock inside a warp
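
In code, the situation is the earlier kernel, with the lockstep reasoning spelled out as comments:

__global__ void deadlock_demo(int *x) {
  acquire_lock();  // one lane wins the CAS; the other 31 lanes keep spinning
  *x += 1;         // never reached: the winning lane can only advance together
  unlock();        // with its warp, so the lock is never released
}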

Workaround

A simple way to prevent this is to let only one thread inside a warp enter the acquire_lock - unlock block at a time:

#include <iostream>
#include <cuda_runtime.h>

struct Lock {
  int *locked;

  Lock() {
    int init = 0;
    cudaMalloc(&locked, sizeof(int));
    cudaMemcpy(locked, &init, sizeof(int), cudaMemcpyHostToDevice);
  }

  ~Lock() {
    cudaFree(locked);
  }

  __device__ __forceinline__ void acquire_lock() {
    while (atomicCAS(locked, 0, 1) != 0);
  }

  __device__ __forceinline__ void unlock() {
    atomicExch(locked, 0);
  }
};

__global__ void counter(Lock lock, int *total) {
  // iterate over the 32 lanes of a warp so that only one lane at a time
  // attempts to take the lock, avoiding the intra-warp deadlock above
#pragma unroll
  for (int lane = 0; lane < 32; ++lane) {
    if (threadIdx.x % 32 == lane) {
      lock.acquire_lock();
      *total = *total + 1;  // critical section
      lock.unlock();
    }
  }
}

int main() {
  int *total_dev;
  cudaMalloc(&total_dev, sizeof(int));
  int total_host = 0;
  cudaMemcpy(total_dev, &total_host, sizeof(int), cudaMemcpyHostToDevice);
  Lock lock;
  counter<<<1024, 1024>>>(lock, total_dev);
  cudaDeviceSynchronize();
  cudaMemcpy(&total_host, total_dev, sizeof(int), cudaMemcpyDeviceToHost);
  std::cout << total_host << std::endl;
  cudaFree(total_dev); 
}
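
Assuming the file is saved as counter.cu (the name is arbitrary), it can be built and run with nvcc counter.cu -o counter followed by ./counter.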

The expected output is 1048576. It works on my GTX 1060 laptop, but fails to produce the correct result on an RTX 3070 server; I am still trying to figure out the reason 4.

Alternatives

Sequentially processing different lanes in a warp is highly inefficient; the recommended way is to write lock-free CUDA programs (use native atomic instructions instead of locks), or to avoid locks between threads inside the same warp.
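
For the counter above, the lock-free version is simply the built-in hardware atomic, which replaces the whole per-lane loop and the Lock:

__global__ void counter_lockfree(int *total) {
  // the hardware serializes the additions; no lock or per-lane loop needed
  atomicAdd(total, 1);
}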

Other alternatives include creating 32 locks, one per thread in a warp 5.

Footnotes:

Author: expye(Zihao Ye)

Email: expye@outlook.com

Date: 2021-07-11 Sun 00:00

Last modified: 2022-12-27 Tue 07:18

Licensed under CC BY-NC 4.0