Coding for Performance

2: Multi-Threading

Author

Paul Schrimpf

Published

March 22, 2026

Multi-Threading

Current computers almost all have multiple cores. By writing multi-threaded code, we can divide the time it takes our code to run by up to the number of cores we have (but usually less). Multi-threaded code performs multiple tasks at once with shared memory. Before you begin writing multi-threaded code, you should make sure your code isn’t already using all available cores. The BLAS and LAPACK libraries that Julia uses for linear algebra are likely multi-threaded. If your code is dominated by large matrix operations, it may already be using all available cores. In that case, there will not be much benefit from additional multi-threading.
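
This is easy to check. The following snippet (not part of the original example) reports the number of threads Julia was started with and the number of threads BLAS is using; both functions are in the standard library.

using LinearAlgebra
# Julia's thread count is set at startup via `julia -t N` or the
# JULIA_NUM_THREADS environment variable
@show Threads.nthreads()
# BLAS maintains its own, separate thread pool for linear algebra
@show BLAS.get_num_threads()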

Read “The Basics of Single Node Parallel Computing” (Rackauckas 2019).

Once we have decided that the code might benefit from multi-threading, we should look for loops (or other independent tasks) that can be multi-threaded. There is some overhead from creating threads and communicating among them. Multi-threading generally works best for loops where each iteration involves substantial work and each iteration is independent of all others. The loops over grid points and bootstrap repetitions in gridbootstrap are perfect candidates: we don’t care about the order in which the iterations execute, and the result of each iteration is (mostly) independent of all others.
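
As a minimal sketch of the pattern (not from the ARGridBootstrap package), a loop is safe to wrap in @threads when each iteration writes only to its own slot of the output:

using Base.Threads
# Each iteration touches only out[i], so no two threads ever write to
# the same memory. For the threading overhead to pay off, f should do
# substantial work on each call. f is assumed to return a Float64 and
# to be safe to call concurrently.
function map_threaded(f, xs)
    out = Vector{Float64}(undef, length(xs))
    @threads for i in eachindex(xs)
        out[i] = f(xs[i])
    end
    out
end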

using ARGridBootstrap, CodeTracking, Random, BenchmarkTools, Profile, ProfileCanvas
function code_md(s)
    println("```julia\n"*s*"\n```\n")
end
T = 200
e = randn(T)
y0 = 0
a = 0.9
y = ar1_original(y0, a, e)
est = b_est_original(y)
αgrid = 0.84:(0.22/50):1.06
nboot = 199
# wrap an estimator so it returns only the third coefficient and its standard error
wrapper(b_est) = function(x)
  out = b_est(x)
  (out.θ[3], out.se[3])
end
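
As a quick check of the setup (hypothetical usage, assuming b_est_original returns θ and se fields as above), the wrapped estimator can be called directly on the simulated series:

wrapper(b_est_original)(y)  # (point estimate, standard error) for the third coefficient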

Multi-Threaded Grid Bootstrap

s=@code_string gridbootstrap_threaded(wrapper(b_est_original), (a, rng)->ar1_original(y0, a, est.e, n->rand(rng,1:(T-1),n)),αgrid, 2)
code_md(s)
function gridbootstrap_threaded(estimator, simulator,
                                grid::AbstractVector,
                                nboot=199)
  g = length(grid)
  bootq = zeros(nboot, g)
  ba    = zeros(nboot, g)
  bootse = zeros(nboot,g)
  #@threads for ak in 1:g
  #  for j in 1:nboot
  @threads for ind in CartesianIndices(ba)  # one flat loop over all (j, ak) pairs
    j = ind[1]
    ak = ind[2]
    (bootq[j,ak], bootse[j,ak]) = estimator(simulator(grid[ak],Random.TaskLocalRNG()))
    ba[j,ak] = bootq[j,ak] - grid[ak]
  end
  ts = ba./bootse
  (ba=ba, t=ts)
end
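
Note how CartesianIndices flattens the two nested loops (bootstrap repetitions and grid points) into a single loop, so @threads distributes all nboot × g iterations across threads instead of only the g grid points. Each iteration also draws from Random.TaskLocalRNG(), so concurrent tasks do not share random-number state. A small illustration of the indexing (hypothetical, for intuition only):

A = zeros(3, 2)
for ind in CartesianIndices(A)
    j, ak = ind[1], ind[2]   # row (bootstrap draw), column (grid point)
    A[j, ak] = j + 10*ak
end
A  # each entry was visited exactly once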

Some Libraries are Already Multi-Threaded

Now, let’s try multi-threading the original version of the code.

julia> using Base.Threads

julia> println("Single thread, original version")
Single thread, original version

julia> @benchmark begin
         (b,t) = gridbootstrap(wrapper(b_est_original), a->ar1_original(y0, a, est.e),
                               αgrid, 199)
       end
BenchmarkTools.Trial: 66 samples with 1 evaluation per sample.
 Range (min … max):  55.317 ms … 86.123 ms  ┊ GC (min … max):  0.00% … 30.97%
 Time  (median):     75.572 ms              ┊ GC (median):    38.07%
 Time  (mean ± σ):   76.546 ms ±  5.586 ms  ┊ GC (mean ± σ):  37.41% ±  8.35%

                                        █▄                     
  ▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▁▁▁▁▁▁▁▃▁▁▃▁▁▁▁▁▃██▄▁▃▁▁▃▁▁▁▁▃▃▆▇▁▁▃▁▃ ▁
  55.3 ms         Histogram: frequency by time        85.8 ms <

 Memory estimate: 208.68 MiB, allocs estimate: 568357.
julia> println("$(Threads.nthreads()) threads, original version")
12 threads, original version

julia> @benchmark begin
         (b,t) = gridbootstrap_threaded(wrapper(b_est_original),
                                        (a, rng)->ar1_original(y0, a, est.e, n->rand(rng,1:(T-1),n)),
                                        αgrid, 199)
       end
BenchmarkTools.Trial: 100 samples with 1 evaluation per sample.
 Range (min … max):  29.575 ms … 74.594 ms  ┊ GC (min … max): 40.01% … 47.77%
 Time  (median):     51.977 ms              ┊ GC (median):    31.34%
 Time  (mean ± σ):   50.040 ms ±  9.435 ms  ┊ GC (mean ± σ):  33.17% ±  7.52%

        ▁                  ▁ ▁▁▁  ▁ █▄▆  ▁                     
  ▄▁▇▁▄▆█▄▇▁▄▄▄▁▆▆▁▁▆▁▆▄▆▆▁█▆███▆▄█▇███▇▇█▁▆▇▄▄▁▁▁▄▁▁▁▁▁▁▁▁▁▄ ▄
  29.6 ms         Histogram: frequency by time          73 ms <

 Memory estimate: 300.98 MiB, allocs estimate: 6607074.

The execution times are nearly identical on my computer. The reason is that the computation is dominated by creating X and computing X'*X and X'*y, and these operations are already multi-threaded in the BLAS library being used, so the single-threaded bootstrap was not really running on a single core. It is possible that first calling using LinearAlgebra; BLAS.set_num_threads(1) would improve the performance of the multi-threaded bootstrap.
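
A sketch of that suggestion (untested here; whether it helps will depend on the machine and problem size):

using LinearAlgebra
BLAS.set_num_threads(1)  # keep BLAS from competing with Julia's threads
@benchmark begin
  (b,t) = gridbootstrap_threaded(wrapper(b_est_original),
                                 (a, rng)->ar1_original(y0, a, est.e, n->rand(rng,1:(T-1),n)),
                                 αgrid, 199)
end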

Multi-Threading Where it Matters

julia> println("Single thread, fastest version")
Single thread, fastest version

julia> estimator(y0=y0,e=est.e) = function(a)
         out = simulate_estimate_arp(y0,a,e)
         (out.θ[3], out.se[3])
       end
estimator (generic function with 3 methods)

julia> @benchmark  (b,t) = gridbootstrap(estimator(), a->a, αgrid, nboot)
BenchmarkTools.Trial: 751 samples with 1 evaluation per sample.
 Range (min … max):  6.349 ms … 39.741 ms  ┊ GC (min … max): 0.00% … 83.08%
 Time  (median):     6.463 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.664 ms ±  1.699 ms  ┊ GC (mean ± σ):  2.34% ±  6.88%

    ▆█▅▂▂▂▂▅▄▂      ▂▃▄▃  ▁        ▁                          
  ▅▆██████████▇█▇▇▇████████▇▇▅▅▅▄▆▇█▆▄▅▁▁▁▁▁▁▁▁▄▁▁▁▁▁▄▁▁▁▁▁▄ █
  6.35 ms      Histogram: log(frequency) by time     7.22 ms <

 Memory estimate: 2.32 MiB, allocs estimate: 50759.
julia> println("$(Threads.nthreads()) threads, fastest version")
12 threads, fastest version

julia> estimator_threaded(y0=y0,e=est.e)=function(foo)
         (a, rng) = foo
         out=simulate_estimate_arp(y0,a,e,Val(1),()->rand(rng,1:length(e)))
         (out.θ[3], out.se[3])
       end
estimator_threaded (generic function with 3 methods)

julia> @benchmark (bs, ts) = gridbootstrap_threaded(estimator_threaded(),(a,rng)->(a,rng), αgrid,nboot)
BenchmarkTools.Trial: 4861 samples with 1 evaluation per sample.
 Range (min … max):  630.070 μs … 28.059 ms  ┊ GC (min … max):  0.00% … 95.54%
 Time  (median):     941.979 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.027 ms ± 990.531 μs  ┊ GC (mean ± σ):  16.07% ± 15.41%

  █▄▁▆█▁                                                        ▁
  ███████▇▆▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▅▅▆▅▆▅▇▆▆▅▅▁▅▃▃▁▃▁▄▅▄▅▅▄▄▇▇▆▇ █
  630 μs        Histogram: log(frequency) by time       5.52 ms <

 Memory estimate: 2.33 MiB, allocs estimate: 50821.

Notice how the speedup from using multiple threads is far less than the number of cores. On my computer, the threaded version of the code is about 7 times faster, even though my computer has 12 cores. A speedup far less than the number of cores is typical. Creating and managing multiple threads adds some overhead. Moreover, cores must share various resources, most notably RAM and some cache.
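
For concreteness, the roughly 7× figure comes from the median times reported above:

single   = 6.463e-3    # median, single thread, fastest version (seconds)
threaded = 941.979e-6  # median, 12 threads (seconds)
single / threaded      # ≈ 6.9, well short of the 12 cores available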

References

Rackauckas, Chris. 2019. “The Basics of Single Node Parallel Computing.” https://book.sciml.ai/notes/05-The_Basics_of_Single_Node_Parallel_Computing/.