Coding for Performance

3: Using GPUs

Author

Paul Schrimpf

Published

October 6, 2023

This is

GPU

Compared to CPUs, GPUs have a huge number of cores operating at a slower clockrate. GPUs also have their own separate memory, which they can access faster than CPUs access RAM. These characteristics make GPUs well-suited to large parallel computations. Unfortunately, fully utilizing GPUs can require substantial changes to your code.

See “The Different Flavors of Parallelism” Rackauckas (2019) (Rackauckas 2019) for more information comparing GPUs to various forms of parallelism on CPUs.

Array interface

The easiest way to use a GPU in Julia is through a high level array interface. ArrayFire.jl, oneAPI.jl, and CUDA.jl each offer such interfaces. We will focus on CUDA.jl in these notes. CUDA.jl relies on Nvidia’s CUDA platform, so it only works with Nvidia GPUs. Nvidia tends to dominate GPGPU, and the GPUs available on cedar.computecanada.ca and in my desktop are Nvidia.

Using CUDA.CuArray is simple, but has some limitations. You create arrays on the GPU using CuArray. Any array level operation on these will then be performed efficiently on the GPU. This includes broadcast functions with . and matrix multiplies.

using CUDA, Random, BenchmarkTools


N = 1000
M = 1000

function cuarraydemo(N,M)
  # wrapped in a  function so that the CuArrays are freed
  # otherwise we will run out GPU memory later
  A = randn(N,M);
  b = randn(M,2);
  println("Time on CPU")
  function foo(A,b)
    (A.^2)*b
  end
  @time c=foo(A,b);
  @time c=foo(A,b);
  A_gpu = CuArray(A); # copy of A in GPU memory
  b_gpu = CuArray(b);
  println("Computations on the GPU are fast")
  # @btime does not work inside a function
  @time CUDA.@sync c_gpu=foo(A_gpu,b_gpu);
  @time CUDA.@sync c_gpu=foo(A_gpu,b_gpu);
  println("But copying to and from GPU memory is not")
  bar(A,b) =Array(foo(CuArray(A), CuArray(b)))
  @time c2=bar(A,b);
  @time c2=bar(A,b);
end
cuarraydemo (generic function with 1 method)
julia> cuarraydemo(N,M);
Time on CPU
  0.006783 seconds (3 allocations: 7.645 MiB)
  0.004896 seconds (3 allocations: 7.645 MiB)
Computations on the GPU are fast
  0.000337 seconds (80 allocations: 3.125 KiB)
  0.000214 seconds (80 allocations: 3.125 KiB)
But copying to and from GPU memory is not
  0.001448 seconds (102 allocations: 19.578 KiB)
  0.001112 seconds (102 allocations: 19.578 KiB)

CuArrays also allow indexing, so you could use loops and other constructs. However, this will not be fast. CuArrays by itself will be a good method to utilize GPUs when the code is dominated by operations on large arrays.

Unfortunately, the fastest version of our grid bootstrap code does not fit that description. A loop seems needed to generate \(y\) due to the recursiveness of the AR(1) model. The fastest version of the code above involves many operations on small 3x3 arrays.

EXERCISE: modify b_est_original or b_est_mldivide to utilize CuArrays. The approach taken in those functions involves some moderate sized matrices, so it may benefit from CuArrays.

Custom CUDA Kernels

To parallelize the code above on a GPU, we will have to use a lower level interface to the GPU. To explain how it works, we will begin with a simple example that just squares all the elements of an array.

Disclaimer: my understanding of CUDA and the inner workings of GPUs is far from complete. Some of the details in this section might be inaccurate.

A typical workflow with CUDA consists of

  1. Allocate GPU memory and copying arrays into it with CuArray.
  2. Decide how many threads and what configuration of threads to launch.
  3. Each thread does some computation by running a “kernel” function.
  4. Copy result from GPU memory to CPU memory.

In the code below, 1 happens in cuarray_cudanative_compare, 2 happens in the square! function, square_kernel! is the kernel in 3, and 4 is just not done.

Threads and blocks

CUDA organizes GPU threads into blocks. I believe that the threads in a block all execute concurrently. Threads in the same block share some memory and registers. All current Nvidia GPUs have a maximum number of threads per block of 1024. Note that threads in the same block share registers1, and different kernel functions will use different numbers of registers at once, so depending on the kernel function, you might be limited to fewer than 1024 threads per block. The number of registers available per block depends on your GPU. You can check your GPU characteristics by compiling and running the C++ program in $CUDA_PATH/samples/1_Utilities/deviceQuery/. Alternatively, you can see this information in Julia by running the code below.

println("Maximum threads per block $(attribute(device(), CUDA.CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK))")
println("Maximum x blocks $(attribute(device(), CUDA.CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X))")
println("Maximum registers per block $(attribute(device(), CUDA.CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK))")
Maximum threads per block 1024
Maximum x blocks 2147483647
Maximum registers per block 65536

There is no simple way to predict how many registers a kernel function uses. It will depend both on the code you write and how the compiler optimizes the code. If you encounter cryptic error messages about CUDA resources unavailable, then try reducing the number of threads per block. Alternatively, you can limit the number of registers used by passing the maxregs argument to @cuda.

You can execute more than 1024 threads by specifying a number of blocks. There is also a limit to the number of blocks, but it is rather large. In the code below, we set the number of blocks, so that nblocks*nthreads >= length(A). Each thread then operates on a single element of A. When the code is executed, each thread has a unique threadIdx and blockIdx combination, and these are used to assign threads to elements of A. The indices go from 1 to number of threads (or blocks). For convenience you can request threads and blocks to have up 3 dimensions, and there are threadIdx().y and threadIdx().z for the additional dimensions.

function square!(A::CuArray)
  n = length(A)
  maxthreads = 1024
  nthreads = min(maxthreads, n)
  nblocks  = Int(ceil(n/nthreads))

  @cuda threads=nthreads blocks=nblocks square_kernel!(A)

  return A
end

function square_kernel!(A)
  i = threadIdx().x + (blockIdx().x-1)*blockDim().x
  if (i<=length(A))
    @inbounds A[i] *= A[i]
  end
  return nothing # CUDA kernels must return nothing
end

function cuarray_cudanative_compare(A)
  A_gpu = CuArray(A);
  println("CUDAnative square!")
  @time CUDA.@sync square!(A_gpu);
  @time CUDA.@sync square!(A_gpu);

  println("CuArray A*=A")
  A_gpu = CuArray(A);
  @time CUDA.@sync A_gpu .*= A_gpu;
  @time CUDA.@sync A_gpu .*= A_gpu;
  return nothing
end
cuarray_cudanative_compare (generic function with 1 method)
julia> cuarray_cudanative_compare(randn(N,M))
CUDAnative square!
  0.000142 seconds (8 allocations: 400 bytes)
  0.000128 seconds (8 allocations: 400 bytes)
CuArray A*=A
  0.000220 seconds (31 allocations: 2.109 KiB)
  0.000135 seconds (31 allocations: 2.109 KiB)

Kernel Limitations

CUDA kernel functions execute on the GPU and in GPU memory. Since GPU memory is allocated and managed differently than RAM, many Julia functions will not work in CUDA kernels. Most importantly, Julia functions that allocate dynamically sized arrays will not work. This means that even matrix multiplication like θ = ixx*xy will fail (if ixx or xy are dynamically allocated) since it allocates an array for θ. You can, however, have local scalars, tuples, and StaticArrays within a kernel function. The key difference is that the sizes of these types are known at compile time. If ixx and xy are StaticArrays, then you can do something like θ = ixx*xy. Since the compiler knows the size of ixx and xy, the compiler also know the size of θ. However, even with StaticArrays you must be careful with operations that that create new StaticArrays (like matrix multiplies). These will cause problems if called repeatedly within a loop.2

It is possible to dynamicaaly allocate GPU memory within a kernel function, but it requires using the low-level interface to CUDA in CUDA.jl. Moreoever, it is generally not a good idea to be dynamically allocating and freeing memory in each of the thousands of threads you execute.3

GPU grid bootstrap

s = @code_string argridbootstrap_gpu(est.e, y0, grid=αgrid, nboot=nboot, RealType=Float64)
code_md(s)
function argridbootstrap_gpu(e, y0;
                             grid = 0.84:(0.22/20):1.06,
                             nboot=199, RealType = Float32)
  g = length(grid)

  P = 3
  # Allocate GPU memory
  bootq = CuArray(zeros(RealType, nboot, g))
  ba    = CuArray(zeros(RealType, nboot, g))
  bootse= CuArray(zeros(RealType, nboot,g))
  αg = CuArray(RealType.(grid))
  eg = CuArray(RealType.(e))
  ei = Int.(ceil.(length(e).*CUDA.rand(RealType,nboot,g,length(e))))

  # use of registers in gridkernel! limits the maximum threads to less
  # than the full 1024
  maxthreads = sizeof(RealType)<=4 ? 512 : 256
  gthreads =2^2
  bthreads =maxthreads ÷ gthreads
  bblocks = Int(ceil(nboot/bthreads))
  gblocks = Int(ceil(g/gthreads))

  @cuda threads=bthreads,gthreads blocks=bblocks,gblocks argridkernel!(ba,bootq,bootse,Val(1), eg, ei, αg)
  ts = ba./bootse
  (ba=collect(ba), t=collect(ts))
end
s = @code_string ARGridBootstrap.argridkernel!(1.,1., 1., Val(1), 1., 1. , 1.)
code_md(s)
function argridkernel!(ba,bootq, bootse,
                       ar::Val{P}, e, ei, αgrid) where P
  b = threadIdx().x +  (blockIdx().x-1)*blockDim().x
  ak= threadIdx().y + (blockIdx().y-1)*blockDim().y
  if (b>size(ba,1) || ak>size(ba,2))
    return nothing
  end
  T = size(ei,3)
  R = eltype(ba)
  xx = zeros(MMatrix{P+2,P+2,R})
  xy = zeros(MVector{P+2,R})
  xt = zeros(MVector{P+2,R})
  xt[1] = one(R)
  yy = zero(R)
  xx[1,1] = T-P
  xx[1,2] = xx[2,1] = (T+1)*T/2 - (P+1)*P/2 #sum((P+1):T)
  xx[2,2] = (2*T+1)*T*(T+1)/6 - (2*P+1)*P*(P+1)/6 #sum((P+1:T).^2)
  α = zeros(MVector{P+2, R})
  α[3] = αgrid[ak]
  @inbounds for t = (P+1):T
    xt[2] = t
    for i = 1:(P+2)
      for j = 3:(P+2)
        xx[i,j] += xt[i]*xt[j]
      end
    end
    y = e[ei[b,ak,t]] + dot(α,xt)
    for i in 1:(P+2)
      xy[i] += xt[i]*y
    end
    yy += y*y
    for i = 4:(P+2) # shift y lags
      xt[i] = xt[i-1]
    end
    xt[3] = y
  end

  for i = 3:(P+2)
     for j = 1:(i-1)
       xx[i,j] = xx[j,i]
     end
  end
  ixx = inv(xx)
  θ3 = zero(R)
  for i = 1:3
    θ3 += ixx[3,i]*xy[i]
  end
  ee = yy
  for i = 1:3
    for j = 1:3
      ee -= xy[i]*ixx[i,j]*xy[j]
    end
  end
  se3 = CUDA.sqrt(ixx[3,3]*ee/(T-(2*P+2)))
  bootq[b,ak]  = θ3
  bootse[b,ak] = se3
  ba[b,ak] = bootq[b,ak] - αgrid[ak]
  return nothing
end
@benchmark begin
  grid = argridbootstrap_gpu(est.e, y0, grid=αgrid, nboot=nboot, RealType=Float64);
end
BenchmarkTools.Trial: 2387 samples with 1 evaluation.
 Range (min … max):  1.478 ms … 23.069 ms  ┊ GC (min … max): 0.00% … 52.95%
 Time  (median):     1.947 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.073 ms ±  1.657 ms  ┊ GC (mean ± σ):  4.00% ±  4.56%

                           ▁▁▁▃▆█▃                            
  ▂▂▁▂▂▂▂▂▂▂▂▃▃▃▂▃▃▄▅▇▇▇▆▇▆███████▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  1.48 ms        Histogram: frequency by time        2.46 ms <

 Memory estimate: 569.27 KiB, allocs estimate: 229.

Compared to the fastest CPU code above, the GPU version takes about 1/20th the time of the single-threaded CPU code, and is about 33% faster than the the 40-threaded CPU code. Considering that the two CPUs in my workstation together cost about 6 times more than the single GPU, the performance of the GPU code is quite good. Also, we carefully profiled and tuned the CPU code, but not the GPU code (although the GPU code does use all algorithmic improvements of the fastest CPU code). Profiling GPU kernel code requires using Nvidia’s profiler, see CUDA.jl documentation for information.

References

Rackauckas, Chris. 2019. “The Different Flavors of Parallelism.” https://book.sciml.ai/notes/06-The_Different_Flavors_of_Parallelism/.

Footnotes

  1. Processor registers are the fastest bits of memory on the processor, and registers are where the actual addition, multiplication, and other instructions are carried out.↩︎

  2. If you create StaticArrays inside a loop, they get allocated to the GPU’s “dynamic shared memory.” I believe a new allocation happens each loop iteration. This will be slow, and there is a fairly small amount of dynamic shared memory, of which you will soon run out.↩︎

  3. There are situations where allocating shared memory is needed and a good idea, but these require some advanced techniques that we will not cover.↩︎