Coding for Performance
1: Optimizing Single Threaded Code
Introduction
Today we will look at some methods to improve the speed of our code. Although speed is sometimes important, it should generally be low on your list of priorities when writing code: prioritize correctness and maintainability ahead of performance. Nonetheless, performance does matter for some problems.
If you have not already, be sure to read the Performance Tips section of the Julia docs.
Also, read Rackauckas’s notes on “Optimizing Serial Code” (Rackauckas 2019).
Grid bootstrap
As a motivating example, we will look at the gridded bootstrap of Hansen (1999).
Gauss, Matlab, and R code implementing Hansen’s method is available on Hansen’s website. The Julia code below is more or less a direct translation from Hansen’s R code. Since this is a translation from R of a translation from Gauss, this code will not necessarily follow best practices for Julia.
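For orientation, the regression constructed in the code below corresponds to an AR(1) model with intercept and linear trend,

\[ y_t = \theta_1 + \theta_2 t + \theta_3 y_{t-1} + e_t. \]

Hansen’s grid bootstrap computes bootstrap quantiles of the t-statistic for \(\theta_3\) at each point on a grid of values of the autoregressive coefficient, and inverts them to form a confidence interval with better coverage than the usual asymptotic interval when \(\theta_3\) is near one.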
T = 200
e = randn(T)
y0 = 0
a = 0.9
function b_est_original(yin)
T = length(yin)
x = [ones(T-1) 2:T yin[1:(T-1)]]
y = yin[2:T]
θ = (x'*x) \ (x'*y)
e = y - x*θ
se = sqrt.(diag(inv(x'*x) *(e'*e))./(T-4))
(θ=θ,se=se,e=e)
end

function ar1_original(y0, a, e, rindex=T->rand(1:length(e),T))
T = length(e)
y = Array{eltype(e)}(undef, T)
y[1] = abs(a)<1 ? y0 : zero(eltype(y))
et = e[rindex(T-1)]
for t in 2:T
y[t] = a*y[t-1] + et[t-1]
end
y
end

function gridbootstrap(estimator, simulator,
grid,
nboot=199)
g = length(grid)
bootq = zeros(nboot, g)
ba = zeros(nboot, g)
bootse = zeros(nboot,g)
for ak in 1:g
for j in 1:nboot
(bootq[j,ak], bootse[j,ak]) = estimator(simulator(grid[ak]))
ba[j,ak] = bootq[j,ak] - grid[ak]
end
end
ts = ba./bootse
(ba=ba, t=ts)
end

Improving performance
Now, let’s run this code and time it. Note that we are running this with only 50 grid points and 199 bootstrap replications. In real use, you would want 999 or more bootstrap replications, and perhaps more grid points.
# simulate some data
using Random, BenchmarkTools, Profile, ProfileCanvas
T = 200
e = randn(T)
y0 = 0
a = 0.9
y = ar1_original(y0, a, e)
est = b_est_original(y)
αgrid = 0.84:(0.22/50):1.06
nboot= 199
wrapper(b_est) = function(x)
out=b_est(x)
(out.θ[3], out.se[3])
end
Initial Benchmark and Profile
benchorig = @benchmark (b,t) = gridbootstrap(wrapper(b_est_original), a->ar1_original(y0, a, est.e),
                                             αgrid, nboot)

BenchmarkTools.Trial: 66 samples with 1 evaluation per sample.
 Range (min … max):  55.216 ms … 92.507 ms  ┊ GC (min … max): 0.00% … 36.29%
 Time  (median):     76.056 ms              ┊ GC (median):    34.15%
 Time  (mean ± σ):   76.743 ms ± 7.415 ms   ┊ GC (mean ± σ):  31.77% ± 6.89%
 Memory estimate: 208.68 MiB, allocs estimate: 568357.
To make code faster, we should begin by profiling.
# only to make profile results show nicely in generated html
# for interactive use, just use @profview in place of @profile
function profilehtmlstring()
buf = IOBuffer()
show(buf, MIME("text/html"), ProfileCanvas.view(Profile.fetch()))
s=String(take!(buf))
println("\n\n"*s*"\n")
end
using Profile, ProfileCanvas
Profile.clear();
Profile.init(n=10^7,delay=0.0001);
@profile (b,t) = gridbootstrap(wrapper(b_est_original), a->ar1_original(y0, a, est.e),
αgrid, 999)
profilehtmlstring()

The profiler works very simply. Every 0.0001 seconds, the line of code being executed gets recorded. There are then various tools for printing and visualizing the results. The @profview macro shows a flame graph. The default view might be a bit strange: the base of the flame graph often includes Julia’s REPL and various other things that can be ignored. If you click on the graph, it will zoom in; use the mouse wheel to zoom back out.
In VSCode, you can just use @profview in place of @profile and the profile flamegraph will open in a panel within VSCode.
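Outside of a notebook or VSCode, the same workflow can be run with a text summary instead of a flame graph. A minimal sketch (work is a hypothetical stand-in for the bootstrap code):

```julia
using Profile

function work(n)
    s = 0.0
    for i in 1:n
        s += sin(i) / i   # most profile samples should land on this line
    end
    return s
end

work(10)                            # run once so we profile execution, not compilation
Profile.clear()
Profile.init(n=10^7, delay=0.0001)  # sample every 0.0001 seconds
@profile work(10^7)
Profile.print(format=:flat, sortedby=:count)  # counts ≈ time spent per line
```

In interactive use, replacing `Profile.print` with `@profview work(10^7)` gives the flame-graph view of the same samples.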
Removing Redundant Operations
The following was true on an older version of Julia, but now there are no gains from eliminating inv.
From the output, we see there were 640 ticks in gridbootstrap_original, and almost all of these occurred within inv (exact numbers will vary on each execution, but relative ones should be similar). If we want the code to be faster, we should focus on these lines. Calling both inv and \ is redundant; we should combine these computations.
function b_est_mldivide(yin)
T = length(yin)
x = [ones(T-1) 2:T yin[1:(T-1)]]
y = yin[2:T]
tmp = x'*x \ [x'*y I]
θ = tmp[:,1]
ixx = tmp[:,2:4]
e = y - x*θ
se = sqrt.(diag(ixx *(e'*e))./(T-4))
(θ=θ,se=se,e=e)
end

benchml = @benchmark (b,t) = gridbootstrap(wrapper(b_est_mldivide), a->ar1_original(y0, a, est.e),
                                           αgrid, nboot)

BenchmarkTools.Trial: 67 samples with 1 evaluation per sample.
 Range (min … max):  51.635 ms … 86.020 ms  ┊ GC (min … max): 0.00% … 29.37%
 Time  (median):     75.577 ms              ┊ GC (median):    31.85%
 Time  (mean ± σ):   75.321 ms ± 6.530 ms   ┊ GC (mean ± σ):  31.68% ± 6.56%
 Memory estimate: 196.75 MiB, allocs estimate: 629251.
On older versions of Julia, this change gave a speedup of about a factor of 4 on my computer; it no longer seems to matter.
Reducing Allocations
Profile.clear()
@profile (b,t) = gridbootstrap(wrapper(b_est_mldivide), a->ar1_original(y0, a, est.e),
αgrid, 999)
profilehtmlstring()

Now, the most time-consuming parts of the code are, unsurprisingly, the call to \, and, perhaps surprisingly, hcat from creating x. Allocating and copying memory is relatively slow. The creation of x involves both.
Caching Intermediate Arrays
One option is to preallocate arrays and reuse them. The struct ARGridBootstrap.bEstCached does this; b_est_cache below is an instance of it.
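The source of bEstCached lives in the ARGridBootstrap package. A simplified sketch of the idea (the name BEstCachedSketch and the details below are illustrative, not the package’s exact implementation) is a callable struct that owns its work arrays and fills them in place on each call:

```julia
using LinearAlgebra

# Callable struct that preallocates the regressor matrix and dependent
# variable, reusing them across bootstrap replications so that each
# call avoids allocating a fresh (T-1) x 3 matrix.
struct BEstCachedSketch{M<:AbstractMatrix,V<:AbstractVector}
    x::M
    y::V
end

BEstCachedSketch(Tm1::Integer) = BEstCachedSketch(ones(Tm1, 3), zeros(Tm1))

function (b::BEstCachedSketch)(yin)
    T = length(yin)
    x, y = b.x, b.y
    @inbounds for t in 2:T      # overwrite the cached arrays in place
        x[t-1, 2] = t           # column 1 stays equal to one
        x[t-1, 3] = yin[t-1]
        y[t-1]    = yin[t]
    end
    θ  = (x'*x) \ (x'*y)
    e  = y - x*θ
    se = sqrt.(diag(inv(x'*x)*(e'*e)) ./ (T-4))
    (θ=θ, se=se, e=e)
end
```

The estimation itself is unchanged from b_est_original; only the allocation of x and y moves out of the hot loop.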
b_est_cache = ARGridBootstrap.bEstCached(T-1)
benchcache=@benchmark gridbootstrap(wrapper(b_est_cache), a->ar1_original(y0, a, est.e),
                            αgrid, nboot)

BenchmarkTools.Trial: 62 samples with 1 evaluation per sample.
 Range (min … max):  67.143 ms … 92.734 ms  ┊ GC (min … max): 0.00% … 25.16%
 Time  (median):     81.710 ms              ┊ GC (median):    22.73%
 Time  (mean ± σ):   81.190 ms ± 4.599 ms   ┊ GC (mean ± σ):  20.51% ± 5.68%
 Memory estimate: 136.98 MiB, allocs estimate: 304484.
Profile.clear()
@profile (b,t) = gridbootstrap(wrapper(b_est_cache), a->ar1_original(y0, a, est.e),
αgrid, 999)
profilehtmlstring()

Eliminating Intermediate Arrays
Better yet, we can avoid creating x by just accumulating \(X'y\) and \(X'X\) in a loop.
function b_est_nox(yin; xx_xy!::F=xx_xy!, resids!::FR=resids!) where {F<:Function, FR<:Function}
T = length(yin)
xx = @MMatrix zeros(eltype(yin),3,3)
xy = @MVector zeros(eltype(yin),3)
xx_xy!(xx,xy,yin)
ixx = inv(xx)
θ = ixx * xy
e = similar(yin,T-1)
resids!(e,yin,θ)
se = sqrt.(diag(ixx *(e'*e))./(T-4))
(θ=θ,se=se,e=e)
end

We put the two main loops into separate functions, both for organization and to allow us to focus on optimizing these loops below.
@inline function xx_xy!(xx,xy,yin)
T = length(yin)
xx .= zero(eltype(xx))
xy .= zero(eltype(xy))
@inbounds @simd for t in 2:T
xx[1,3] += yin[t-1]
xx[2,3] += t*yin[t-1]
xx[3,3] += yin[t-1]^2
xy[1] += yin[t]
xy[2] += t*yin[t]
xy[3] += yin[t-1]*yin[t]
end
xx[1,1] = T-1 # = 1'*1
xx[1,2] = xx[2,1] = (T+1)*T/2 - 1 # sum(p+1:T)
xx[2,2] = (2*(T)+1)*(T)*(T+1)/6 - 1 # sum((p+1:T).^2)
xx[3,1] = xx[1,3]
xx[3,2] = xx[2,3]
nothing
end

@inline function resids!(e, yin, θ)
T = length(yin)
@inbounds @simd for t in 2:T
e[t-1] = yin[t] - θ[1] - θ[2]*t - θ[3]*yin[t-1]
end
nothing
end

benchnox = @benchmark (b,t) = gridbootstrap(wrapper(b_est_nox), a->ar1_original(y0, a, est.e),
                                            αgrid, nboot)

BenchmarkTools.Trial: 286 samples with 1 evaluation per sample.
 Range (min … max):  11.507 ms … 41.519 ms  ┊ GC (min … max): 11.64% … 70.72%
 Time  (median):     17.347 ms              ┊ GC (median):    31.47%
 Time  (mean ± σ):   17.491 ms ± 3.604 ms   ┊ GC (mean ± σ):  30.78% ± 11.98%
 Memory estimate: 66.13 MiB, allocs estimate: 142099.
We have cut the time by a factor of four. However, this performance optimization has been costly in terms of readability and extensibility of our code. If we wanted to fit an AR(p) model instead of AR(1), the b_est_nox function would be more difficult to modify than the b_est_mldivide version.
We additionally gained some performance by using mutable StaticArrays to hold \(X'X\) and \(X'y\).
xx = zeros(3,3)
xy = zeros(3)
@benchmark ARGridBootstrap.xx_xy!(xx,xy,y)

BenchmarkTools.Trial: 10000 samples with 198 evaluations per sample.
 Range (min … max):  441.187 ns … 931.359 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     443.818 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   445.070 ns ± 14.250 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
using StaticArrays
xx = @MMatrix zeros(3,3)
xy = @MVector zeros(3)
@benchmark ARGridBootstrap.xx_xy!(xx,xy,y)

BenchmarkTools.Trial: 10000 samples with 198 evaluations per sample.
 Range (min … max):  438.657 ns … 1.792 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     440.783 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   442.471 ns ± 20.264 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
SIMD
To get full performance from modern CPUs, it is essential to use Single Instruction, Multiple Data (SIMD) instructions (also known as vectorized CPU instructions) in our code. These are special CPU instructions that allow applying an operation to multiple numbers at once. The LLVM compiler that Julia uses tries to automatically use these instructions when it is sure that it is safe to do so. See “Demystifying auto vectorization in Julia” for more information.
The compiler is not always successful in using SIMD instructions, so manual usage of SIMD instructions can often help.
LoopVectorization.jl
As mentioned above, the Julia compiler tries to automatically use SIMD instructions when it is safe to do so. SIMD instructions often change the order of operations, and since floating-point arithmetic is not exactly associative, reordering can change results. The compiler tries to avoid reordering operations, but this often prevents SIMD use. The macro @simd tells the compiler not to worry about reordering operations and to insert SIMD instructions more aggressively. Still, there are some SIMD operations that the compiler will not insert automatically.
The LoopVectorization.jl package defines a macro, @turbo, that more aggressively inserts SIMD instructions. This can make a large difference for some loops and broadcasts.
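The resids_turbo! function benchmarked below is defined in the ARGridBootstrap package; presumably it is just resids! with @turbo in place of @inbounds @simd, along the lines of this sketch (the exact package implementation may differ):

```julia
using LoopVectorization

# Same residual computation as resids!, but with @turbo asking
# LoopVectorization.jl to generate SIMD code for the loop.
function resids_turbo!(e, yin, θ)
    T = length(yin)
    @turbo for t in 2:T
        e[t-1] = yin[t] - θ[1] - θ[2]*t - θ[3]*yin[t-1]
    end
    nothing
end
```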
e = zeros(length(y)-1)
θ = @MVector zeros(3)
@benchmark ARGridBootstrap.resids!($e,$y,$θ)

BenchmarkTools.Trial: 10000 samples with 954 evaluations per sample.
 Range (min … max):  93.047 ns … 352.910 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     93.594 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.498 ns ± 6.118 ns    ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark ARGridBootstrap.resids_turbo!($e,$y,$θ)

BenchmarkTools.Trial: 10000 samples with 998 evaluations per sample.
 Range (min … max):  15.560 ns … 77.933 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.012 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.209 ns ± 2.306 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
We can also write SIMD instructions ourselves. The SIMD.jl package makes this somewhat accessible. For example, resids_simd! uses this package to match the performance of the @turbo version.
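To give a flavor of what hand-written SIMD code looks like, here is a simplified sketch of such a kernel using SIMD.jl (this is illustrative, not the package’s exact resids_simd!; N is the vector width):

```julia
using SIMD

# Explicitly vectorized residuals: process N time periods at once using
# Vec{N,Float64}, then handle the leftover elements with a scalar loop.
function resids_simd_sketch!(e, yin, θ, ::Val{N}) where {N}
    T = length(yin)
    lane = Vec{N,Float64}(ntuple(i -> Float64(i - 1), N))  # offsets 0,1,…,N-1
    t = 2
    @inbounds while t + N - 1 <= T
        tv   = lane + Float64(t)                  # trend values t, …, t+N-1
        ylag = vload(Vec{N,Float64}, yin, t - 1)  # yin[t-1 : t+N-2]
        ycur = vload(Vec{N,Float64}, yin, t)      # yin[t   : t+N-1]
        vstore(ycur - θ[1] - θ[2]*tv - θ[3]*ylag, e, t - 1)
        t += N
    end
    @inbounds while t <= T                        # scalar remainder
        e[t-1] = yin[t] - θ[1] - θ[2]*t - θ[3]*yin[t-1]
        t += 1
    end
    nothing
end
```

The explicit vload/vstore calls make the vector width a compile-time choice, which is what the Val(8) argument in the benchmark below controls.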
@benchmark ARGridBootstrap.resids_simd!($e,$y,$θ, Val(8))

BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (min … max):  22.171 ns … 107.985 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.633 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.915 ns ± 2.675 ns    ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
Generally, if @turbo successfully inserted SIMD instructions and made your code substantially faster, it will not be worth your effort to write SIMD code manually. However, @turbo will not always be able to insert SIMD instructions. One way to check is through benchmarking. Another is to inspect @code_llvm ARGridBootstrap.resids_turbo!(e,y,θ). Lines like fadd fast <4 x double> are SIMD instructions; the <4 x double> part is the key sign. In contrast, something like %26 = fsub double %24, %25 is a scalar instruction.
The loops in xx_xy! are not automatically vectorized. Part of the issue is that @turbo cannot tell that xx and xy are statically allocated. If we rewrote the code to use scalars, it would likely get vectorized by @turbo; the b_est_stride function does this. However, it is really inconvenient to rewrite array code as scalars. It may be more maintainable to keep the arrays and write SIMD instructions ourselves.
@benchmark ARGridBootstrap.xx_xy!($xx,$xy,$y)

BenchmarkTools.Trial: 10000 samples with 198 evaluations per sample.
 Range (min … max):  432.737 ns … 762.399 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     435.924 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   437.071 ns ± 6.437 ns    ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark ARGridBootstrap.xx_xy_simd!($xx,$xy,$y, Val(64))

BenchmarkTools.Trial: 10000 samples with 987 evaluations per sample.
 Range (min … max):  49.140 ns … 64.814 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     50.146 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.339 ns ± 0.802 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
This makes the code faster by a factor of about 10. The Val(N) argument controls the width of the vectors passed to SIMD instructions. The value of N can affect execution time by a factor of 5 or more, and the best choice of N depends on your exact hardware, the code being executed, and compiler details. Current x86 CPUs have registers up to 512 bits wide, and so can operate on at most 8 Float64 values at once. However, it seems that setting N to larger values results in faster code here. I’m not sure exactly why; it is not that more than 8 operations are ever done at once. Instead, I guess the compiler converts larger values of N into blocks of 8 in a more efficient way.
To see how much this is worth, let’s benchmark the full bootstrap code using the SIMD versions of resids! and xx_xy!.
b_est_simd = y->b_est_nox(y, xx_xy! =ARGridBootstrap.xx_xy_simd!, resids! =ARGridBootstrap.resids_simd!)
benchsimd = @benchmark (b,t) = gridbootstrap(wrapper(b_est_simd), a->ar1_original(y0, a, est.e),
                                             αgrid, nboot)

BenchmarkTools.Trial: 290 samples with 1 evaluation per sample.
 Range (min … max):  12.832 ms … 41.313 ms  ┊ GC (min … max): 19.02% … 66.96%
 Time  (median):     16.554 ms              ┊ GC (median):    32.00%
 Time  (mean ± σ):   17.252 ms ± 3.388 ms   ┊ GC (mean ± σ):  32.14% ± 11.64%
 Memory estimate: 66.13 MiB, allocs estimate: 142099.
We have not improved total execution time very much; it is only about 10% faster than the nox version. The main problem is that very little of the total time was spent in xx_xy! and resids! to begin with. We did make those functions much faster, but they were such a small portion of total execution time that the change is barely noticeable. We should focus our efforts on ar1_original if we want further improvement.
Fastest Version
The fastest version of the code that I could write combines the ideas above. As above, it avoids allocating x. It also avoids allocating e by combining the simulation and estimation into a single loop. Finally, it uses mutable static arrays to ensure that operations on xx and xy have as little overhead as possible. Note that for small StaticArrays, inv will call a specialized, fast version, and ends up being faster than \.
using StaticArrays

function simulate_estimate_arp(y0, a, e, ar::Val{P}=Val(1),
rindex=()->rand(1:length(e))) where P
T = length(e)
length(a)==P || error("length(a) not equal to P")
xx = @MMatrix zeros(eltype(e),P+2, P+2)
xy = @MVector zeros(eltype(e),P+2)
yy = zero(eltype(e))
xt = @MVector ones(eltype(e), P+2)
if (abs(a)<1)
xt[3:(P+2)] .= y0
else
xt[3:(P+2)] .= 0.0
end
α = @MVector zeros(eltype(e),P+2)
@simd for i = 1:P
α[2+i] = a[i]
end
xx[1,1] = T-P # = 1'*1
xx[1,2] = xx[2,1] = (T+1)*T/2 - sum(1:P) # sum(P+1:T)
xx[2,2] = (2*(T)+1)*(T)*(T+1)/6 - sum((1:P).^2) # sum((P+1:T).^2)
@inbounds for t in (P+1):T
et = e[rindex()]
xt[2] = t
for i in 1:(P+2)
@simd for j in 3:(P+2)
xx[i,j] += xt[i]*xt[j]
end
end
y = dot(α, xt) + et
@simd for i in 1:(P+2)
xy[i] += xt[i]*y
end
yy += y^2
if (P>1)
xt[4:(P+2)] .= xt[3:(P+1)]
end
xt[3] = y
end
@inbounds for i in 3:(P+2)
for j in 1:(i-1)
xx[i,j] = xx[j,i]
end
end
ixx = inv(xx)
θ = ixx*xy
ee = yy - xy'*ixx*xy
se = sqrt.(abs.(diag(ixx *(ee))./(T-(2*P+2))))
(θ=θ,se=se)
end

estimator(y0=y0,e=est.e) = function(a)
out = simulate_estimate_arp(y0,a,e)
(out.θ[3], out.se[3])
end
bench_sea = @benchmark (b,t) = gridbootstrap(estimator(), a->a, αgrid, nboot)

BenchmarkTools.Trial: 746 samples with 1 evaluation per sample.
 Range (min … max):  6.366 ms … 32.459 ms  ┊ GC (min … max): 0.00% … 79.41%
 Time  (median):     6.522 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.709 ms ± 1.365 ms   ┊ GC (mean ± σ):  2.08% ± 6.80%
 Memory estimate: 2.32 MiB, allocs estimate: 50759.
On my computer, this version of the code is about 8 times faster than the original.
For further gains, we will explore multi-threading and using a GPU in parts 2 and 3.
Exercises
Some of these have been incorporated into the sections above.
Read the Performance Tips section of the Julia Manual and incorporate some of these tips into the above code.

Write a version of b_est that avoids allocating the full \(T \times 3\) \(X\) matrix, but can still be generalized to an AR(p) model.

Examine how the relative performance of these versions of b_est varies with T, nboot, and the number of grid points.

The Julia package StaticArrays.jl provides an alternative array implementation that is often much faster than Base.Array. Try implementing b_est using StaticArrays.jl. You will likely need to use mutable arrays (see @MMatrix and @MVector). Note that inv of a small array will be substantially faster when using StaticArrays.jl instead of Base.Array.

The fastest version of the code does not yet use SIMD instructions. Try to improve its performance by using SIMD. My SIMDscan package might be useful.