Writing a lock in CUDA
This article borrows the idea from Will Landau's lecture 1 and GTC 2013's presentation 2.
Why lock in CUDA
Sometimes different threads/threadblocks in our program access the same memory address, and we want to avoid race conditions: a lock lets us mark a region as a critical section, meaning only one thread can execute the code inside that region at a time.
CUDA does not provide such a primitive, so we need to build one on our own.
AtomicCAS/AtomicExch
CAS (Compare-And-Swap) is an instruction for achieving synchronization; it is semantically equivalent to 3

```python
def cas(addr, cmp, new):
    # enter exclusive region
    old = mem[addr]
    if mem[addr] == cmp:
        mem[addr] = new
    # exit exclusive region
    return old
```

which always returns the old value residing at the given address, but changes it to the given new value if the old value equals cmp.
Another useful instruction is exchange:

```python
def exchange(addr, val):
    # enter exclusive region
    old = mem[addr]
    mem[addr] = val
    # exit exclusive region
    return old
```
atomicCAS / atomicExch are CUDA's implementations of the CAS/exchange instructions. We can use atomicCAS to construct other atomic operators such as AtomicAdd:

```cpp
int32_t AtomicAdd(int32_t *addr, int32_t val) {
  int32_t old = *addr;
  int32_t assumed;  // declared outside the loop so the while-condition can see it
  do {
    assumed = old;
    old = atomicCAS(addr, assumed, assumed + val);
  } while (assumed != old);
  return old;
}
```
The reason we need the retry loop is this: a naive expression like atomicCAS(addr, *addr, *addr + val) compiles to several instructions, and some other thread might have changed the value at addr between the moment we compute *addr + val and the moment the CAS executes. If the value returned by atomicCAS does not equal assumed, some other thread must have changed *addr in the meantime, and we need to try the operation again.
Following the same insight, we can define a pair of acquire_lock and unlock functions:

```cpp
__device__ int locked = 0;

__device__ __forceinline__ void acquire_lock() {
  /*
   * a single attempt:
   *   if (locked == 0) {
   *     locked = 1;
   *     return true;   // acquire successful
   *   }
   *   return false;    // acquire failed
   * note: we need to fuse the if-statement above into a single CAS instruction.
   */
  while (atomicCAS(&locked, 0, 1) != 0);
}

__device__ __forceinline__ void unlock() {
  locked = 0;
}

__global__ void some_kernel(...) {
  acquire_lock();
  // critical section
  // YOUR CODE HERE
  unlock();
}
```
However, the code above doesn't work in general cases. Why?
Deadlock
The smallest executable unit of parallelism in CUDA is the warp, which consists of 32 threads; threads inside the same warp execute in lockstep, synchronizing after every step. If the program diverges (e.g., at if-then-else statements), the GPU instructs the warp to execute both branches serially, masking the inactive threads in each branch.
Let's consider what happens inside a warp if we use the lock construct above: suppose thread 1 acquires the lock; it blocks the other threads from entering the critical section. However, thread 1 is also stuck in the acquire-lock loop, because all threads inside a warp execute in lockstep, which prevents thread 1 from ever reaching the unlock call that releases the lock.
This is a deadlock!
Workaround
A simple way to prevent this is to allow only one thread inside a warp to enter the acquire_lock - unlock block at a time:
```cpp
#include <iostream>
#include <cuda_runtime.h>

struct Lock {
  int *locked;
  Lock() {
    int init = 0;
    cudaMalloc(&locked, sizeof(int));
    cudaMemcpy(locked, &init, sizeof(int), cudaMemcpyHostToDevice);
  }
  ~Lock() {
    cudaFree(locked);
  }
  __device__ __forceinline__ void acquire_lock() {
    while (atomicCAS(locked, 0, 1) != 0);
  }
  __device__ __forceinline__ void unlock() {
    atomicExch(locked, 0);
  }
};

__global__ void counter(Lock lock, int *total) {
#pragma unroll
  for (int lane = 0; lane < 32; ++lane) {
    if (threadIdx.x % 32 == lane) {
      lock.acquire_lock();
      *total = *total + 1;
      lock.unlock();
    }
  }
}

int main() {
  int *total_dev;
  cudaMalloc(&total_dev, sizeof(int));
  int total_host = 0;
  cudaMemcpy(total_dev, &total_host, sizeof(int), cudaMemcpyHostToDevice);
  Lock lock;
  counter<<<1024, 1024>>>(lock, total_dev);
  cudaDeviceSynchronize();
  cudaMemcpy(&total_host, total_dev, sizeof(int), cudaMemcpyDeviceToHost);
  std::cout << total_host << std::endl;
  cudaFree(total_dev);
}
```
The expected output is 1048576 (1024 blocks × 1024 threads). The program works on my GTX 1060 laptop but fails to produce an accurate result on an RTX 3070 server; I am still trying to figure out the reason4.
Alternatives
Sequentially processing different lanes in a warp is highly inefficient; the recommended approach is to write lock-free CUDA programs (use native atomic instructions instead of locks), or to avoid locks between threads inside the same warp.
Other alternatives include creating 32 locks, one per lane in a warp 5.
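Following that advice, the counter kernel above can be rewritten without any lock at all: its whole critical section is a single read-modify-write that CUDA's native atomicAdd already performs atomically. A sketch (the kernel name is mine):

```cuda
// Lock-free version of the counter kernel: the read-modify-write on *total
// collapses into one hardware atomic, so no critical section is needed and
// the warp-divergence deadlock cannot arise.
__global__ void counter_lockfree(int *total) {
  atomicAdd(total, 1);
}

// launch as before: counter_lockfree<<<1024, 1024>>>(total_dev);
```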