Coding for Performance
3: Using GPUs
This is part 3 of a set of notes on coding for performance, focusing on using GPUs in Julia.
GPU
Compared to CPUs, GPUs have a huge number of cores operating at a slower clock rate. GPUs also have their own separate memory, which they can access faster than CPUs access RAM. These characteristics make GPUs well-suited to large parallel computations. Unfortunately, fully utilizing GPUs can require substantial changes to your code.
See “The Different Flavors of Parallelism” (Rackauckas 2019) for more information comparing GPUs to various forms of parallelism on CPUs.
Array interface
The easiest way to use a GPU in Julia is through a high-level array interface. ArrayFire.jl, oneAPI.jl, and CUDA.jl each offer such an interface. We will focus on CUDA.jl in these notes. CUDA.jl relies on Nvidia’s CUDA platform, so it only works with Nvidia GPUs. Nvidia tends to dominate GPGPU, and the GPUs available on cedar.computecanada.ca and in my desktop are Nvidia.
Using CUDA.CuArray is simple, but it has some limitations. You create arrays on the GPU using CuArray. Any array-level operation on these will then be performed efficiently on the GPU. This includes broadcasting with . and matrix multiplication.
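To make the usage concrete before the timing demo below, here is a minimal sketch (the array names x, b, y, and z are just examples, not part of the demo):

```julia
using CUDA, LinearAlgebra

x = CUDA.rand(Float32, 1000, 1000)   # allocate and fill directly on the GPU
b = CuArray(randn(Float32, 1000))    # or copy an existing CPU array to the GPU
y = x .^ 2 .+ 1f0                    # broadcasts compile to GPU kernels
z = x * b                            # matrix-vector multiply runs on the GPU via CUBLAS
norm(z)                              # many LinearAlgebra operations also work on CuArrays
```

The demo below times the same style of operations on the CPU and GPU.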
```julia
using CUDA, Random, BenchmarkTools

N = 1000
M = 1000

function cuarraydemo(N,M)
  # wrapped in a function so that the CuArrays are freed
  # otherwise we will run out of GPU memory later
  A = randn(N,M);
  b = randn(M,2);
  println("Time on CPU")
  function foo(A,b)
    (A.^2)*b
  end
  @time c=foo(A,b);
  @time c=foo(A,b);
  A_gpu = CuArray(A); # copy of A in GPU memory
  b_gpu = CuArray(b);
  println("Computations on the GPU are fast")
  # @btime does not work inside a function
  @time CUDA.@sync c_gpu=foo(A_gpu,b_gpu);
  @time CUDA.@sync c_gpu=foo(A_gpu,b_gpu);
  println("But copying to and from GPU memory is not")
  bar(A,b) = Array(foo(CuArray(A), CuArray(b)))
  @time c2=bar(A,b);
  @time c2=bar(A,b);
end
```
```
cuarraydemo (generic function with 1 method)
```
```
julia> cuarraydemo(N,M);
Time on CPU
  0.006783 seconds (3 allocations: 7.645 MiB)
  0.004896 seconds (3 allocations: 7.645 MiB)
Computations on the GPU are fast
  0.000337 seconds (80 allocations: 3.125 KiB)
  0.000214 seconds (80 allocations: 3.125 KiB)
But copying to and from GPU memory is not
  0.001448 seconds (102 allocations: 19.578 KiB)
  0.001112 seconds (102 allocations: 19.578 KiB)
```
CuArrays also allow indexing, so you could use loops and other constructs. However, this will not be fast. CuArrays on their own are a good way to utilize GPUs when the code is dominated by operations on large arrays.
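For example (a small sketch, not from the original notes), scalar indexing of a CuArray from the CPU is slow enough that CUDA.jl warns about it, and you can make it an error outright:

```julia
using CUDA

A_gpu = CUDA.rand(1000)
CUDA.allowscalar(false)       # make accidental element-by-element access from the CPU an error
# A_gpu[1]                    # would now throw a scalar indexing error
CUDA.@allowscalar A_gpu[1]    # explicitly opt in for a one-off access
```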
Unfortunately, the fastest version of our grid bootstrap code does not fit that description. A loop seems necessary to generate y because of the recursive structure of the AR(1) model, and the fastest version of the code above involves many operations on small 3×3 arrays.
EXERCISE: modify b_est_original or b_est_mldivide to utilize CuArrays. The approach taken in those functions involves some moderate-sized matrices, so it may benefit from CuArrays.
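One possible starting point for the exercise (a sketch only; b_est_gpu, X, and y are hypothetical names standing in for whatever regressor matrix and outcome those functions construct) is to form the large matrix products on the GPU and solve the small resulting system on the CPU:

```julia
using CUDA, LinearAlgebra

# Hypothetical illustration of the CuArray approach to least squares:
# do the big T×k products on the GPU, then solve the tiny k×k system on the CPU.
function b_est_gpu(X::AbstractMatrix, y::AbstractVector)
  Xg = CuArray(Float32.(X))
  yg = CuArray(Float32.(y))
  XX = collect(Xg'Xg)   # k×k matrix, copied back to CPU memory
  Xy = collect(Xg'yg)   # length-k vector, copied back to CPU memory
  XX \ Xy
end
```

Whether this beats the CPU versions will depend on how large the matrices are relative to the cost of copying them to the GPU.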
Custom CUDA Kernels
To parallelize the code above on a GPU, we will have to use a lower level interface to the GPU. To explain how it works, we will begin with a simple example that just squares all the elements of an array.
Disclaimer: my understanding of CUDA and the inner workings of GPUs is far from complete. Some of the details in this section might be inaccurate.
A typical workflow with CUDA consists of:
1. Allocate GPU memory and copy arrays into it with CuArray.
2. Decide how many threads and what configuration of threads to launch.
3. Each thread does some computation by running a “kernel” function.
4. Copy the result from GPU memory to CPU memory.
In the code below, 1 happens in cuarray_cudanative_compare, 2 happens in the square! function, square_kernel! is the kernel in 3, and 4 is just not done.
Threads and blocks
CUDA organizes GPU threads into blocks. I believe that the threads in a block all execute concurrently. Threads in the same block share some memory and registers. All current Nvidia GPUs have a maximum of 1024 threads per block. Note that threads in the same block share registers,¹ and different kernel functions use different numbers of registers at once, so depending on the kernel function, you might be limited to fewer than 1024 threads per block. The number of registers available per block depends on your GPU. You can check your GPU’s characteristics by compiling and running the C++ program in $CUDA_PATH/samples/1_Utilities/deviceQuery/. Alternatively, you can see this information in Julia by running the code below.
println("Maximum threads per block $(attribute(device(), CUDA.CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK))")
println("Maximum x blocks $(attribute(device(), CUDA.CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X))")
println("Maximum registers per block $(attribute(device(), CUDA.CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK))")
Maximum threads per block 1024
Maximum x blocks 2147483647
Maximum registers per block 65536
There is no simple way to predict how many registers a kernel function uses. It will depend both on the code you write and on how the compiler optimizes the code. If you encounter cryptic error messages about CUDA resources being unavailable, try reducing the number of threads per block. Alternatively, you can limit the number of registers used by passing the maxregs argument to @cuda.
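For example (a sketch; square_kernel! is defined below, and I believe current CUDA.jl accepts these keywords), you can compile a kernel without launching it to inspect its register usage, and cap the registers with maxregs:

```julia
using CUDA

A_gpu = CUDA.rand(1024)
# compile without launching, then query resource usage
kernel = @cuda launch=false square_kernel!(A_gpu)
println("registers used: ", CUDA.registers(kernel))

# optionally cap register usage, which may allow more threads per block
@cuda maxregs=32 threads=256 blocks=4 square_kernel!(A_gpu)
```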
You can execute more than 1024 threads by specifying a number of blocks. There is also a limit to the number of blocks, but it is rather large. In the code below, we set the number of blocks so that nblocks*nthreads >= length(A). Each thread then operates on a single element of A. When the code is executed, each thread has a unique threadIdx and blockIdx combination, and these are used to assign threads to elements of A. The indices go from 1 to the number of threads (or blocks). For convenience, you can request threads and blocks with up to 3 dimensions, and there are threadIdx().y and threadIdx().z for the additional dimensions.
```julia
function square!(A::CuArray)
  n = length(A)
  maxthreads = 1024
  nthreads = min(maxthreads, n)
  nblocks = Int(ceil(n/nthreads))
  @cuda threads=nthreads blocks=nblocks square_kernel!(A)
  return A
end

function square_kernel!(A)
  i = threadIdx().x + (blockIdx().x-1)*blockDim().x
  if (i<=length(A))
    @inbounds A[i] *= A[i]
  end
  return nothing # CUDA kernels must return nothing
end
```
```julia
function cuarray_cudanative_compare(A)
  A_gpu = CuArray(A);
  println("CUDAnative square!")
  @time CUDA.@sync square!(A_gpu);
  @time CUDA.@sync square!(A_gpu);
  println("CuArray A*=A")
  A_gpu = CuArray(A);
  @time CUDA.@sync A_gpu .*= A_gpu;
  @time CUDA.@sync A_gpu .*= A_gpu;
  return nothing
end
```
```
cuarray_cudanative_compare (generic function with 1 method)
```
```
julia> cuarray_cudanative_compare(randn(N,M))
CUDAnative square!
  0.000142 seconds (8 allocations: 400 bytes)
  0.000128 seconds (8 allocations: 400 bytes)
CuArray A*=A
  0.000220 seconds (31 allocations: 2.109 KiB)
  0.000135 seconds (31 allocations: 2.109 KiB)
```
Kernel Limitations
CUDA kernel functions execute on the GPU and in GPU memory. Since GPU memory is allocated and managed differently than RAM, many Julia functions will not work in CUDA kernels. Most importantly, Julia functions that allocate dynamically sized arrays will not work. This means that even a matrix multiplication like θ = ixx*xy will fail (if ixx or xy are dynamically allocated) since it allocates an array for θ. You can, however, have local scalars, tuples, and StaticArrays within a kernel function. The key difference is that the sizes of these types are known at compile time. If ixx and xy are StaticArrays, then you can do something like θ = ixx*xy. Since the compiler knows the sizes of ixx and xy, the compiler also knows the size of θ. However, even with StaticArrays you must be careful with operations that create new StaticArrays (like matrix multiplies). These will cause problems if called repeatedly within a loop.²
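As a sketch of the idea (the kernel and variable names here are made up for illustration and are not part of the bootstrap code), a kernel can build fixed-size StaticArrays from device memory and multiply them, because their sizes are compile-time constants:

```julia
using CUDA, StaticArrays

# Hypothetical kernel: for each column i, compute θ = ixx * xy using 3×3 StaticArrays.
function solve3_kernel!(θ, ixx, xy)
  i = threadIdx().x + (blockIdx().x-1)*blockDim().x
  if i <= size(θ, 2)
    A = SMatrix{3,3}(ixx[1,1,i], ixx[2,1,i], ixx[3,1,i],
                     ixx[1,2,i], ixx[2,2,i], ixx[3,2,i],
                     ixx[1,3,i], ixx[2,3,i], ixx[3,3,i])
    b = SVector{3}(xy[1,i], xy[2,i], xy[3,i])
    t = A*b                 # allowed: the result is a length-3 SVector with compile-time size
    for j in 1:3
      θ[j,i] = t[j]
    end
  end
  return nothing
end

n = 1000
ixx = CUDA.rand(3,3,n); xy = CUDA.rand(3,n); θ = CUDA.zeros(3,n)
@cuda threads=256 blocks=cld(n,256) solve3_kernel!(θ, ixx, xy)
```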
It is possible to dynamically allocate GPU memory within a kernel function, but it requires using the low-level interface to CUDA in CUDA.jl. Moreover, it is generally not a good idea to be dynamically allocating and freeing memory in each of the thousands of threads you execute.³
GPU grid bootstrap
```julia
s = @code_string argridbootstrap_gpu(est.e, y0, grid=αgrid, nboot=nboot, RealType=Float64)
code_md(s)
```
```julia
function argridbootstrap_gpu(e, y0;
                             grid = 0.84:(0.22/20):1.06,
                             nboot=199, RealType = Float32)
  g = length(grid)
  P = 3
  # Allocate GPU memory
  bootq = CuArray(zeros(RealType, nboot, g))
  ba = CuArray(zeros(RealType, nboot, g))
  bootse = CuArray(zeros(RealType, nboot, g))
  αg = CuArray(RealType.(grid))
  eg = CuArray(RealType.(e))
  ei = Int.(ceil.(length(e).*CUDA.rand(RealType, nboot, g, length(e))))

  # use of registers in gridkernel! limits the maximum threads to less
  # than the full 1024
  maxthreads = sizeof(RealType)<=4 ? 512 : 256
  gthreads = 2^2
  bthreads = maxthreads ÷ gthreads
  bblocks = Int(ceil(nboot/bthreads))
  gblocks = Int(ceil(g/gthreads))

  @cuda threads=bthreads,gthreads blocks=bblocks,gblocks argridkernel!(ba, bootq, bootse, Val(1), eg, ei, αg)
  ts = ba./bootse
  (ba=collect(ba), t=collect(ts))
end
```
```julia
s = @code_string ARGridBootstrap.argridkernel!(1., 1., 1., Val(1), 1., 1., 1.)
code_md(s)
```
```julia
function argridkernel!(ba, bootq, bootse,
                       ar::Val{P}, e, ei, αgrid) where P
  b = threadIdx().x + (blockIdx().x-1)*blockDim().x
  ak = threadIdx().y + (blockIdx().y-1)*blockDim().y
  if (b>size(ba,1) || ak>size(ba,2))
    return nothing
  end
  T = size(ei,3)
  R = eltype(ba)
  xx = zeros(MMatrix{P+2,P+2,R})
  xy = zeros(MVector{P+2,R})
  xt = zeros(MVector{P+2,R})
  xt[1] = one(R)
  yy = zero(R)
  xx[1,1] = T-P
  xx[1,2] = xx[2,1] = (T+1)*T/2 - (P+1)*P/2 #sum((P+1):T)
  xx[2,2] = (2*T+1)*T*(T+1)/6 - (2*P+1)*P*(P+1)/6 #sum((P+1:T).^2)
  α = zeros(MVector{P+2, R})
  α[3] = αgrid[ak]
  @inbounds for t = (P+1):T
    xt[2] = t
    for i = 1:(P+2)
      for j = 3:(P+2)
        xx[i,j] += xt[i]*xt[j]
      end
    end
    y = e[ei[b,ak,t]] + dot(α,xt)
    for i in 1:(P+2)
      xy[i] += xt[i]*y
    end
    yy += y*y
    for i = 4:(P+2) # shift y lags
      xt[i] = xt[i-1]
    end
    xt[3] = y
  end
  for i = 3:(P+2)
    for j = 1:(i-1)
      xx[i,j] = xx[j,i]
    end
  end
  ixx = inv(xx)
  θ3 = zero(R)
  for i = 1:3
    θ3 += ixx[3,i]*xy[i]
  end
  ee = yy
  for i = 1:3
    for j = 1:3
      ee -= xy[i]*ixx[i,j]*xy[j]
    end
  end
  se3 = CUDA.sqrt(ixx[3,3]*ee/(T-(2*P+2)))
  bootq[b,ak] = θ3
  bootse[b,ak] = se3
  ba[b,ak] = bootq[b,ak] - αgrid[ak]
  return nothing
end
```
```julia
@benchmark begin
  grid = argridbootstrap_gpu(est.e, y0, grid=αgrid, nboot=nboot, RealType=Float64);
end
```
```
BenchmarkTools.Trial: 2387 samples with 1 evaluation.
 Range (min … max):  1.478 ms … 23.069 ms  ┊ GC (min … max): 0.00% … 52.95%
 Time  (median):     1.947 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.073 ms ±  1.657 ms  ┊ GC (mean ± σ):  4.00% ± 4.56%

            ▁▁▁▃▆█▃
  ▂▂▁▂▂▂▂▂▂▂▂▃▃▃▂▃▃▄▅▇▇▇▆▇▆███████▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  1.48 ms        Histogram: frequency by time        2.46 ms <

 Memory estimate: 569.27 KiB, allocs estimate: 229.
```
Compared to the fastest CPU code above, the GPU version takes about 1/20th the time of the single-threaded CPU code and is about 33% faster than the 40-threaded CPU code. Considering that the two CPUs in my workstation together cost about 6 times more than the single GPU, the performance of the GPU code is quite good. Also, we carefully profiled and tuned the CPU code, but not the GPU code (although the GPU code does use all the algorithmic improvements of the fastest CPU code). Profiling GPU kernel code requires using Nvidia’s profiler; see the CUDA.jl documentation for more information.
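For example (a sketch; the exact workflow depends on your CUDA.jl version), you can run Julia under Nsight Systems and wrap the call of interest in CUDA.@profile so that only that region is recorded:

```julia
using CUDA

# e.g. launched as: nsys profile --trace=cuda julia script.jl
CUDA.@profile argridbootstrap_gpu(est.e, y0, grid=αgrid, nboot=nboot, RealType=Float64)
```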
References

Rackauckas, Christopher. 2019. “The Different Flavors of Parallelism.”
Footnotes
1. Processor registers are the fastest bits of memory on the processor, and registers are where the actual addition, multiplication, and other instructions are carried out.
2. If you create StaticArrays inside a loop, they get allocated to the GPU’s “dynamic shared memory.” I believe a new allocation happens each loop iteration. This will be slow, and there is a fairly small amount of dynamic shared memory, of which you will soon run out.
3. There are situations where allocating shared memory is needed and a good idea, but these require some advanced techniques that we will not cover.