Coding for Performance
2: Multi-Threading
Multi-Threading
Nearly all current computers have multiple cores. By writing multi-threaded code, we can divide the time our code takes to run by up to the number of cores we have (though usually less). Multi-threaded code performs multiple tasks at once with shared memory. Before you begin writing multi-threaded code, you should make sure your code isn’t already using all available cores. The BLAS and LAPACK libraries that Julia uses for linear algebra are likely multi-threaded. If your code is dominated by large matrix operations, it may already be using all available cores. In that case, there will not be much benefit from additional multi-threading.
Read “The Basics of Single Node Parallel Computing” (Rackauckas 2019).
Once we have decided that the code might benefit from multi-threading, we should look for loops (or other independent tasks) that can be multi-threaded. There is some overhead from creating threads and communicating among them, so multi-threading generally works best for loops where each iteration involves substantial work and is independent of all the others. The loops over grid points and bootstrap repetitions in gridbootstrap are perfect candidates: we don’t care about the order in which the iterations get executed, and the result of each iteration is (mostly) independent of all the others.
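As a minimal sketch of the pattern (the function name and workload below are invented for illustration and are not part of ARGridBootstrap), a loop whose iterations are independent can be split across threads with Threads.@threads:

```julia
using Base.Threads

# Sketch: each iteration writes only to its own slot of `out`, so the iterations
# can safely run concurrently on different threads.
function threaded_loop_example(n)
    out = zeros(n)
    @threads for i in 1:n
        out[i] = sum(abs2, randn(500))  # stand-in for substantial, independent work
    end
    return out
end

# Julia must be started with threads enabled, e.g. `julia --threads=auto`;
# Threads.nthreads() reports how many threads are available.
threaded_loop_example(1_000)
```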
```julia
using ARGridBootstrap, CodeTracking, Random, BenchmarkTools, Profile, ProfileCanvas
function code_md(s)
  println("```julia\n"*s*"\n```\n")
end
T = 200
e = randn(T)
y0 = 0
a = 0.9
y = ar1_original(y0, a, e)
est = b_est_original(y)
αgrid = 0.84:(0.22/50):1.06
nboot = 199
wrapper(b_est) = function(x)
  out = b_est(x)
  (out.θ[3], out.se[3])
end
```
Multi-Threaded Grid Bootstrap
```julia
s = @code_string gridbootstrap_threaded(wrapper(b_est_original), (a, rng)->ar1_original(y0, a, est.e, n->rand(rng,1:(T-1),n)), αgrid, 2)
code_md(s)
```

```julia
function gridbootstrap_threaded(estimator, simulator,
                                grid::AbstractVector,
                                nboot=199)
  g = length(grid)
  bootq = zeros(nboot, g)
  ba = zeros(nboot, g)
  bootse = zeros(nboot, g)
  #@threads for ak in 1:g
  #  for j in 1:nboot
  @threads for ind ∈ CartesianIndices(ba)
    j = ind[1]
    ak = ind[2]
    (bootq[j,ak], bootse[j,ak]) = estimator(simulator(grid[ak], Random.TaskLocalRNG()))
    ba[j,ak] = bootq[j,ak] - grid[ak]
  end
  ts = ba./bootse
  (ba=ba, t=ts)
end
```
Some Libraries are Already Multi-Threaded
Now, let’s try multi-threading the original version of the code.
```julia
julia> using Base.Threads

julia> println("Single thread, original version")
Single thread, original version

julia> @benchmark begin
         (b,t) = gridbootstrap(wrapper(b_est_original), a->ar1_original(y0, a, est.e),
                               αgrid, 199)
       end
BenchmarkTools.Trial: 27 samples with 1 evaluation.
 Range (min … max):  175.276 ms … 199.619 ms  ┊ GC (min … max): 11.56% … 10.62%
 Time  (median):     186.845 ms               ┊ GC (median):    16.06%
 Time  (mean ± σ):   186.054 ms ±   4.257 ms  ┊ GC (mean ± σ):  15.72% ±  1.64%
 Memory estimate: 219.44 MiB, allocs estimate: 345075.

julia> println("$(Threads.nthreads()) threads, original version")
40 threads, original version

julia> @benchmark begin
         (b,t) = gridbootstrap_threaded(wrapper(b_est_original),
                                        (a, rng)->ar1_original(y0, a, est.e, n->rand(rng,1:(T-1),n)),
                                        αgrid, 199)
       end
BenchmarkTools.Trial: 32 samples with 1 evaluation.
 Range (min … max):  102.512 ms … 661.604 ms  ┊ GC (min … max):  0.00% … 81.64%
 Time  (median):     104.351 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   158.058 ms ± 161.796 ms  ┊ GC (mean ± σ):  31.58% ± 24.16%
 Memory estimate: 311.76 MiB, allocs estimate: 6383971.
```
The execution times are nearly identical on my computer. The reason is that the computation is dominated by the creation of the regressor matrix X and the products X'*X and X'*y. These operations are already multi-threaded in the BLAS library being used. It is possible that first calling using LinearAlgebra; BLAS.set_num_threads(1) would improve the performance of the multi-threaded bootstrap.
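A sketch of that experiment is below (whether it helps depends on the machine and the problem size):

```julia
using LinearAlgebra

nblas = BLAS.get_num_threads()  # remember the current BLAS thread count
BLAS.set_num_threads(1)         # let the @threads loop provide all the parallelism
# ... run gridbootstrap_threaded here and benchmark it ...
BLAS.set_num_threads(nblas)     # restore the previous setting
```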
Multi-Threading Where it Matters
> println("Single thread, fastest version")
julia
Single thread, fastest version
> estimator(y0=y0,e=est.e) = function(a)
julia= simulate_estimate_arp(y0,a,e)
out 3], out.se[3])
(out.θ[end
function with 3 methods)
estimator (generic
> @benchmark (b,t) = gridbootstrap(estimator(), a->a, αgrid, nboot)
julia: 242 samples with 1 evaluation.
BenchmarkTools.TrialRange (min … max): 18.752 ms … 48.284 ms ┊ GC (min … max): 0.00% … 58.59%
Time (median): 19.663 ms ┊ GC (median): 0.00%
Time (mean ± σ): 20.719 ms ± 4.339 ms ┊ GC (mean ± σ): 3.86% ± 9.76%
▆█▇▆▃▁
██████▇▆▄▁▁▁▁▁▆▄▄▁▁▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▄▁▁▁▁▁▁▁▁▇ ▆18.8 ms Histogram: log(frequency) by time 42.7 ms <
: 2.94 MiB, allocs estimate: 60904. Memory estimate
> println("$(Threads.nthreads()) threads, fastest version")
julia40 threads, fastest version
> estimator_threaded(y0=y0,e=est.e)=function(foo)
julia= foo
(a, rng) =simulate_estimate_arp(y0,a,e,Val(1),()->rand(rng,1:length(e)))
out3], out.se[3])
(out.θ[end
function with 3 methods)
estimator_threaded (generic
> @benchmark (bs, ts) = gridbootstrap_threaded(estimator_threaded(),(a,rng)->(a,rng), αgrid,nboot)
julia: 1507 samples with 1 evaluation.
BenchmarkTools.TrialRange (min … max): 1.620 ms … 141.653 ms ┊ GC (min … max): 0.00% … 97.00%
Time (median): 2.524 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.301 ms ± 8.923 ms ┊ GC (mean ± σ): 18.36% ± 6.66%
██▅▇▃
▄▅▅▄▇▆▆█████████▇▅▅▄▅▃▄▃▃▂▂▂▂▃▂▃▂▂▃▂▂▂▂▁▂▂▁▂▂▂▁▂▁▂▂▁▂▂▂▁▂▂▂ ▃1.62 ms Histogram: frequency by time 6.71 ms <
: 2.97 MiB, allocs estimate: 61138. Memory estimate
Notice that the speedup from using multiple threads is far less than the number of cores. On my computer, the threaded version of the code is about 7 times faster, even though my computer has 40 “cores” (or rather 20 physical cores: it has 2 processors with 10 cores each, and each core is hyperthreaded into 2, so the OS sees 40 processors, but pairs of them share substantial resources). A speedup far less than the number of cores is typical. Creating and managing multiple threads adds some overhead, and the cores must share various resources, most notably RAM and some cache.
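A rough check using the medians reported above:

```julia
# Speedup implied by the benchmark medians above
single_threaded_ms = 19.663  # fastest version, single thread
multi_threaded_ms  = 2.524   # fastest version, 40 threads
speedup = single_threaded_ms / multi_threaded_ms  # ≈ 7.8, far below 40 threads or 20 physical cores
```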