Coding for Performance

2: Multi-Threading

Author

Paul Schrimpf

Published

March 22, 2026

Multi-Threading

Current computers almost all have multiple cores. By writing multi-threaded code, we can divide the time it takes our code to run by up to the number of cores we have (but usually less). Multi-threaded code performs multiple tasks at once with shared memory. Before you begin writing multi-threaded code, you should make sure your code isn’t already using all available cores. The BLAS and LAPACK libraries that Julia uses for linear algebra are likely multi-threaded. If your code is dominated by large matrix operations, it may already be using all available cores. In that case, there will not be much benefit from additional multi-threading.
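
This is easy to check. The following snippet (not part of the original example) reports the number of threads Julia was started with and the number of threads BLAS is using; both functions are in the standard library.

using LinearAlgebra
# Julia's thread count is set at startup via `julia -t N` or the
# JULIA_NUM_THREADS environment variable
@show Threads.nthreads()
# BLAS maintains its own, separate thread pool for linear algebra
@show BLAS.get_num_threads()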

Read “The Basics of Single Node Parallel Computing” (Rackauckas 2019).

Once we have decided that the code might benefit from multi-threading, we should look for loops (or other independent tasks) that can be multi-threaded. There is some overhead from creating threads and communicating among them. Multi-threading generally works best for loops where each iteration involves substantial work and each iteration is independent of all others. The loops over grid points and bootstrap repetitions in gridbootstrap are perfect candidates: we don’t care about the order in which the iterations execute, and the result of each iteration is (mostly) independent of all others.
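
As a minimal sketch of the pattern (not from the ARGridBootstrap package), a loop is safe to wrap in @threads when each iteration writes only to its own slot of the output:

using Base.Threads
# Each iteration touches only out[i], so no two threads ever write to
# the same memory. For the threading overhead to pay off, f should do
# substantial work on each call. f is assumed to return a Float64 and
# to be safe to call concurrently.
function map_threaded(f, xs)
    out = Vector{Float64}(undef, length(xs))
    @threads for i in eachindex(xs)
        out[i] = f(xs[i])
    end
    out
end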

using ARGridBootstrap, CodeTracking, Random, BenchmarkTools, Profile, ProfileCanvas
function code_md(s)
    println("```julia\n"*s*"\n```\n")
end
T = 200
e = randn(T)
y0 = 0
a = 0.9
y = ar1_original(y0, a, e)
est = b_est_original(y)
αgrid = 0.84:(0.22/50):1.06
nboot = 199
# wrap an estimator so it returns only the third coefficient and its standard error
wrapper(b_est) = function(x)
  out = b_est(x)
  (out.θ[3], out.se[3])
end
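
As a quick check of the setup (hypothetical usage, assuming b_est_original returns θ and se fields as above), the wrapped estimator can be called directly on the simulated series:

wrapper(b_est_original)(y)  # (point estimate, standard error) for the third coefficient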

Multi-Threaded Grid Bootstrap

s=@code_string gridbootstrap_threaded(wrapper(b_est_original), (a, rng)->ar1_original(y0, a, est.e, n->rand(rng,1:(T-1),n)),αgrid, 2)
code_md(s)
function gridbootstrap_threaded(estimator, simulator,
                                grid::AbstractVector,
                                nboot=199)
  g = length(grid)
  bootq = zeros(nboot, g)
  ba    = zeros(nboot, g)
  bootse = zeros(nboot,g)
  #@threads for ak in 1:g
  #  for j in 1:nboot
  @threads for ind in CartesianIndices(ba)  # one flat loop over all (j, ak) pairs
    j = ind[1]
    ak = ind[2]
    (bootq[j,ak], bootse[j,ak]) = estimator(simulator(grid[ak],Random.TaskLocalRNG()))
    ba[j,ak] = bootq[j,ak] - grid[ak]
  end
  ts = ba./bootse
  (ba=ba, t=ts)
end
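
Note how CartesianIndices flattens the two nested loops (bootstrap repetitions and grid points) into a single loop, so @threads distributes all nboot × g iterations across threads instead of only the g grid points. Each iteration also draws from Random.TaskLocalRNG(), so concurrent tasks do not share random-number state. A small illustration of the indexing (hypothetical, for intuition only):

A = zeros(3, 2)
for ind in CartesianIndices(A)
    j, ak = ind[1], ind[2]   # row (bootstrap draw), column (grid point)
    A[j, ak] = j + 10*ak
end
A  # each entry was visited exactly once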

Some Libraries are Already Multi-Threaded

Now, let’s try multi-threading the original version of the code.

julia> using Base.Threads

julia> println("Single thread, original version")
Single thread, original version

julia> @benchmark begin
         (b,t) = gridbootstrap(wrapper(b_est_original), a->ar1_original(y0, a, est.e),
                               αgrid, 199)
       end
BenchmarkTools.Trial: 66 samples with 1 evaluation per sample.
 Range (min … max):  55.317 ms … 86.123 ms  ┊ GC (min … max):  0.00% … 30.97%
 Time  (median):     75.572 ms              ┊ GC (median):    38.07%
 Time  (mean ± σ):   76.546 ms ±  5.586 ms  ┊ GC (mean ± σ):  37.41% ±  8.35%

                                        █▄                     
  ▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▁▁▁▁▁▁▁▃▁▁▃▁▁▁▁▁▃██▄▁▃▁▁▃▁▁▁▁▃▃▆▇▁▁▃▁▃ ▁
  55.3 ms         Histogram: frequency by time        85.8 ms <

 Memory estimate: 208.68 MiB, allocs estimate: 568357.
julia> println("$(Threads.nthreads()) threads, original version")
12 threads, original version

julia> @benchmark begin
         (b,t) = gridbootstrap_threaded(wrapper(b_est_original),
                                        (a, rng)->ar1_original(y0, a, est.e, n->rand(rng,1:(T-1),n)),
                                        αgrid, 199)
       end
BenchmarkTools.Trial: 100 samples with 1 evaluation per sample.
 Range (min … max):  29.575 ms … 74.594 ms  ┊ GC (min … max): 40.01% … 47.77%
 Time  (median):     51.977 ms              ┊ GC (median):    31.34%
 Time  (mean ± σ):   50.040 ms ±  9.435 ms  ┊ GC (mean ± σ):  33.17% ±  7.52%

        ▁                  ▁ ▁▁▁  ▁ █▄▆  ▁                     
  ▄▁▇▁▄▆█▄▇▁▄▄▄▁▆▆▁▁▆▁▆▄▆▆▁█▆███▆▄█▇███▇▇█▁▆▇▄▄▁▁▁▄▁▁▁▁▁▁▁▁▁▄ ▄
  29.6 ms         Histogram: frequency by time          73 ms <

 Memory estimate: 300.98 MiB, allocs estimate: 6607074.

The execution times are nearly identical on my computer. The reason is that the computation is dominated by creating X and computing X'*X and X'*y, and these operations are already multi-threaded in the BLAS library being used, so the single-threaded bootstrap was not really running on a single core. It is possible that first calling using LinearAlgebra; BLAS.set_num_threads(1) would improve the performance of the multi-threaded bootstrap.
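
A sketch of that suggestion (untested here; whether it helps will depend on the machine and problem size):

using LinearAlgebra
BLAS.set_num_threads(1)  # keep BLAS from competing with Julia's threads
@benchmark begin
  (b,t) = gridbootstrap_threaded(wrapper(b_est_original),
                                 (a, rng)->ar1_original(y0, a, est.e, n->rand(rng,1:(T-1),n)),
                                 αgrid, 199)
end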

Multi-Threading Where it Matters

julia> println("Single thread, fastest version")
Single thread, fastest version

julia> estimator(y0=y0,e=est.e) = function(a)
         out = simulate_estimate_arp(y0,a,e)
         (out.θ[3], out.se[3])
       end
estimator (generic function with 3 methods)

julia> @benchmark  (b,t) = gridbootstrap(estimator(), a->a, αgrid, nboot)
BenchmarkTools.Trial: 751 samples with 1 evaluation per sample.
 Range (min … max):  6.349 ms … 39.741 ms  ┊ GC (min … max): 0.00% … 83.08%
 Time  (median):     6.463 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.664 ms ±  1.699 ms  ┊ GC (mean ± σ):  2.34% ±  6.88%

    ▆█▅▂▂▂▂▅▄▂      ▂▃▄▃  ▁        ▁                          
  ▅▆██████████▇█▇▇▇████████▇▇▅▅▅▄▆▇█▆▄▅▁▁▁▁▁▁▁▁▄▁▁▁▁▁▄▁▁▁▁▁▄ █
  6.35 ms      Histogram: log(frequency) by time     7.22 ms <

 Memory estimate: 2.32 MiB, allocs estimate: 50759.
julia> println("$(Threads.nthreads()) threads, fastest version")
12 threads, fastest version

julia> estimator_threaded(y0=y0,e=est.e)=function(foo)
         (a, rng) = foo
         out=simulate_estimate_arp(y0,a,e,Val(1),()->rand(rng,1:length(e)))
         (out.θ[3], out.se[3])
       end
estimator_threaded (generic function with 3 methods)

julia> @benchmark (bs, ts) = gridbootstrap_threaded(estimator_threaded(),(a,rng)->(a,rng), αgrid,nboot)
BenchmarkTools.Trial: 4861 samples with 1 evaluation per sample.
 Range (min … max):  630.070 μs … 28.059 ms  ┊ GC (min … max):  0.00% … 95.54%
 Time  (median):     941.979 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.027 ms ± 990.531 μs  ┊ GC (mean ± σ):  16.07% ± 15.41%

  █▄▁▆█▁                                                        ▁
  ███████▇▆▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▅▅▆▅▆▅▇▆▆▅▅▁▅▃▃▁▃▁▄▅▄▅▅▄▄▇▇▆▇ █
  630 μs        Histogram: log(frequency) by time       5.52 ms <

 Memory estimate: 2.33 MiB, allocs estimate: 50821.

Notice how the speedup from using multiple threads is far less than the number of cores. On my computer, the threaded version of the code is about 7 times faster, even though my computer has 12 cores. A speedup far less than the number of cores is typical. Creating and managing multiple threads adds some overhead. Moreover, cores must share various resources, most notably RAM and some cache.
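
For concreteness, the roughly 7× figure comes from the median times reported above:

single   = 6.463e-3    # median, single thread, fastest version (seconds)
threaded = 941.979e-6  # median, 12 threads (seconds)
single / threaded      # ≈ 6.9, well short of the 12 cores available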

References

Rackauckas, Chris. 2019. “The Basics of Single Node Parallel Computing.” https://book.sciml.ai/notes/05-The_Basics_of_Single_Node_Parallel_Computing/.