Coding for Performance
1: Optimizing Single Threaded Code
Introduction
Today we will look at some methods to improve the speed of our code. Although speed is sometimes important, it should be low on your list of priorities when writing code: prioritize correctness and maintainability ahead of performance. Nonetheless, performance does matter for some problems.
If you have not already, be sure to read the Performance Tips section of the Julia docs.
Also, read Rackauckas’s notes on “Optimizing Serial Code.” (Rackauckas 2019).
Grid bootstrap
As a motivating example, we will look at the gridded bootstrap of Hansen (1999).
Gauss, Matlab, and R code implementing Hansen's method is available on Hansen's website. The Julia code below is more or less a direct translation of Hansen's R code. Since this is a translation from R of a translation from Gauss, this code will not necessarily follow best practices for Julia.
```julia
T = 200
e = randn(T)
y0 = 0
a = 0.9
s = @code_string b_est_original(e)
code_md(s)
```

```julia
function b_est_original(yin)
  T = length(yin)
  x = [ones(T-1) 2:T yin[1:(T-1)]]
  y = yin[2:T]
  θ = x'*x \ x'y
  e = y - x*θ
  se = sqrt.(diag(inv(x'*x)*(e'*e))./(T-4))
  (θ=θ, se=se, e=e)
end
```
```julia
s = @code_string ar1_original(y0,a,e)
code_md(s)
```

```julia
function ar1_original(y0, a, e, rindex=T->rand(1:length(e),T))
  T = length(e)
  y = Array{eltype(e)}(undef, T)
  y[1] = abs(a)<1 ? y0 : zero(eltype(y))
  et = e[rindex(T-1)]
  for t in 2:T
    y[t] = a*y[t-1] + et[t-1]
  end
  y
end
```
```julia
s = @code_string gridbootstrap(b_est_original, a->a, 0.5:0.1:1, 99)
code_md(s)
```

```julia
function gridbootstrap(estimator, simulator,
                       grid, nboot=199)
  g = length(grid)
  bootq = zeros(nboot, g)
  ba = zeros(nboot, g)
  bootse = zeros(nboot, g)
  for ak in 1:g
    for j in 1:nboot
      (bootq[j,ak], bootse[j,ak]) = estimator(simulator(grid[ak]))
      ba[j,ak] = bootq[j,ak] - grid[ak]
    end
  end
  ts = ba./bootse
  (ba=ba, t=ts)
end
```
Improving performance
Now, let’s run this code and time it. Note that we are running this with only 50 grid points and 199 bootstrap replications. In real use, you would want 999 or more bootstrap replications, and perhaps more grid points.
```julia
# simulate some data
using Random, BenchmarkTools, Profile, ProfileCanvas
T = 200
e = randn(T)
y0 = 0
a = 0.9
y = ar1_original(y0, a, e)
est = b_est_original(y)
αgrid = 0.84:(0.22/50):1.06
nboot = 199
wrapper(b_est) = function(x)
  out = b_est(x)
  (out.θ[3], out.se[3])
end
```

```
wrapper (generic function with 1 method)
```
Initial Benchmark and Profile
```julia
benchorig = @benchmark (b,t) = gridbootstrap(wrapper(b_est_original), a->ar1_original(y0, a, est.e),
                                             αgrid, nboot)
```

```
BenchmarkTools.Trial: 38 samples with 1 evaluation.
 Range (min … max):  130.823 ms … 149.250 ms  ┊ GC (min … max): 9.70% … 8.56%
 Time  (median):     131.580 ms               ┊ GC (median):    9.72%
 Time  (mean ± σ):   132.501 ms ±   3.666 ms  ┊ GC (mean ± σ):  9.77% ± 0.52%

  ▃█▃▅▂
  █████▄▁▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▄ ▁
  131 ms          Histogram: frequency by time          149 ms <

 Memory estimate: 219.44 MiB, allocs estimate: 345075.
```
To make code faster, we should begin by profiling.
```julia
# only to make profile results show nicely in generated html
# for interactive use, just use @profview in place of @profile
function profilehtmlstring()
  buf = IOBuffer()
  show(buf, MIME("text/html"), ProfileCanvas.view(Profile.fetch()))
  s = String(take!(buf))
  println("\n\n"*s*"\n")
end
```

```
profilehtmlstring (generic function with 1 method)
```
```julia
using Profile, ProfileCanvas
Profile.clear();
Profile.init(n=10^7, delay=0.0001);
@profile (b,t) = gridbootstrap(wrapper(b_est_original), a->ar1_original(y0, a, est.e),
                               αgrid, 999)
profilehtmlstring()
```
The profiler works very simply. Every 0.0001 seconds, the line of code being executed gets recorded. There are then various tools for printing and visualizing the results. The `@profview` macro shows a flame graph. The default view might be a bit strange; the base of the flame graph often includes Julia’s REPL and various other things that can be ignored. If you click on the graph, it will zoom in, and you can use the mouse wheel to zoom back out.
In VSCode, you can just use `@profview` in place of `@profile` and the profile flame graph will open in a panel within VSCode.
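The collected samples can also be printed as text, which is handy over ssh or in logs. A minimal sketch using only the standard library (the `work` function here is just a stand-in workload, not part of the bootstrap code):

```julia
using Profile

function work()          # stand-in workload to profile
    s = 0.0
    for i in 1:10^6
        s += sin(i)
    end
    s
end

work()                   # run once so compilation is not profiled
Profile.clear()
@profile work()
# format=:flat lists lines sorted by sample count; C=false hides C frames
Profile.print(format=:flat, sortedby=:count, C=false)
```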
Removing Redundant Operations
The following was true on an older version of Julia, but now there are no gains from eliminating `inv`.
From the output (these numbers can vary quite a bit from run to run), we see there were 640 ticks in `gridbootstrap_original` (exact numbers will vary on each execution, but relative ones should be similar), and almost all of these occurred within `inv`. If we want the code to be faster, we should focus on these lines. Calling both `inv` and `\` is redundant; we should combine these computations.
```julia
s = @code_string b_est_mldivide(y)
code_md(s)
```

```julia
function b_est_mldivide(yin)
  T = length(yin)
  x = [ones(T-1) 2:T yin[1:(T-1)]]
  y = yin[2:T]
  tmp = x'*x \ [x'*y I]
  θ = tmp[:,1]
  ixx = tmp[:,2:4]
  e = y - x*θ
  se = sqrt.(diag(ixx*(e'*e))./(T-4))
  (θ=θ, se=se, e=e)
end
```
```julia
benchml = @benchmark (b,t) = gridbootstrap(wrapper(b_est_mldivide), a->ar1_original(y0, a, est.e),
                                           αgrid, nboot)
```

```
BenchmarkTools.Trial: 27 samples with 1 evaluation.
 Range (min … max):  179.880 ms … 199.238 ms  ┊ GC (min … max): 6.97% … 6.45%
 Time  (median):     186.427 ms               ┊ GC (median):    8.51%
 Time  (mean ± σ):   185.857 ms ±   3.287 ms  ┊ GC (mean ± σ):  8.20% ± 0.71%

            ▂         ▂    █▅
  ▅▁▁▁▁▁▁▁▁▁████▁▁▁▁▁▅████▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  180 ms          Histogram: frequency by time          199 ms <

 Memory estimate: 210.46 MiB, allocs estimate: 517608.
```
On an older version of Julia, this change gave a speedup of roughly a factor of four on my computer. It used to make a big difference, but no longer seems to matter.
Reducing Allocations
```julia
Profile.clear()
@profile (b,t) = gridbootstrap(wrapper(b_est_mldivide), a->ar1_original(y0, a, est.e),
                               αgrid, 999)
profilehtmlstring()
```
Now, the most time consuming parts of the code are, unsurprisingly, the call to `\`, and, perhaps surprisingly, the `hcat` from creating `x`. Allocating and copying memory is relatively slow. The creation of `x` involves both.
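To see why the `hcat` is costly, compare building the regressor matrix fresh each call against filling a preallocated matrix in place. This is a small illustration, not part of Hansen's code; `build_x` and `fill_x!` are hypothetical names:

```julia
# building x with hcat allocates a new (T-1)×3 matrix on every call
build_x(yin) = (T = length(yin); [ones(T-1) 2:T yin[1:(T-1)]])

# filling a preallocated matrix in place avoids the allocation and the copy
function fill_x!(x, yin)
    T = length(yin)
    @inbounds for t in 2:T
        x[t-1, 1] = 1.0
        x[t-1, 2] = t
        x[t-1, 3] = yin[t-1]
    end
    x
end

yin = randn(200)
xbuf = zeros(199, 3)
build_x(yin); fill_x!(xbuf, yin)   # compile first so we measure only execution
println("hcat:     ", @allocated(build_x(yin)), " bytes")
println("in-place: ", @allocated(fill_x!(xbuf, yin)), " bytes")
```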
Caching Intermediate Arrays
One option is to preallocate arrays and reuse them. The struct `bEstCached` does this.
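The `bEstCached` struct itself is not displayed in these notes. A rough sketch of the idea, assuming nothing about the package's actual implementation (a callable struct holding preallocated work arrays; the field names and layout here are my guesses):

```julia
using LinearAlgebra  # for diag

struct BEstCached{M<:AbstractMatrix,V<:AbstractVector}
    x::M  # preallocated (T-1)×3 regressor matrix; column 1 stays all ones
    y::V  # preallocated dependent variable
    e::V  # preallocated residuals
end

BEstCached(n::Integer) = BEstCached(ones(n, 3), zeros(n), zeros(n))

# calling the struct runs the regression, reusing the cached arrays
function (c::BEstCached)(yin)
    T = length(yin)
    @inbounds for t in 2:T
        c.x[t-1, 2] = t
        c.x[t-1, 3] = yin[t-1]
        c.y[t-1] = yin[t]
    end
    θ = (c.x' * c.x) \ (c.x' * c.y)
    c.e .= c.y .- c.x * θ
    se = sqrt.(diag(inv(c.x' * c.x) * (c.e' * c.e)) ./ (T - 4))
    (θ = θ, se = se, e = c.e)
end
```

Repeated calls then reuse `c.x`, `c.y`, and `c.e` instead of allocating fresh arrays each bootstrap replication.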
```julia
b_est_cache = ARGridBootstrap.bEstCached(T-1)
benchcache = @benchmark gridbootstrap(wrapper(b_est_cache), a->ar1_original(y0, a, est.e),
                                      αgrid, nboot)
```
```
BenchmarkTools.Trial: 36 samples with 1 evaluation.
 Range (min … max):  138.756 ms … 156.013 ms  ┊ GC (min … max): 7.03% … 6.20%
 Time  (median):     139.455 ms               ┊ GC (median):    7.03%
 Time  (mean ± σ):   140.950 ms ±   3.298 ms  ┊ GC (mean ± σ):  7.63% ± 1.03%

  ▃▅█
  ███▁▅▁▁▄▁▁▁▁▄▅▅▇▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  139 ms          Histogram: frequency by time          156 ms <

 Memory estimate: 143.71 MiB, allocs estimate: 162394.
```
```julia
Profile.clear()
@profview (b,t) = gridbootstrap(wrapper(b_est_cache), a->ar1_original(y0, a, est.e),
                                αgrid, 999)
profilehtmlstring()
```
Eliminating Intermediate Arrays
Better yet, we can avoid creating `x` by just accumulating \(X'y\) and \(X'X\) in a loop.
```julia
s = @code_string b_est_nox(y)
code_md(s)
```

```julia
function b_est_nox(yin; xx_xy!::F=xx_xy!, resids!::FR=resids!) where {F<:Function, FR<:Function}
  T = length(yin)
  xx = @MMatrix zeros(eltype(yin),3,3)
  xy = @MVector zeros(eltype(yin),3)
  xx_xy!(xx,xy,yin)
  ixx = inv(xx)
  θ = ixx * xy
  e = similar(yin,T-1)
  resids!(e,yin,θ)
  se = sqrt.(diag(ixx*(e'*e))./(T-4))
  (θ=θ, se=se, e=e)
end
```
We put the two main loops into separate functions both for organization and to allow us to focus on optimizing these loops below.
```julia
s = @code_string ARGridBootstrap.xx_xy!(zeros(3,3), zeros(3), y)
code_md(s)
```

```julia
@inline function xx_xy!(xx, xy, yin)
  T = length(yin)
  xx .= zero(eltype(xx))
  xy .= zero(eltype(xy))
  @inbounds @simd for t in 2:T
    xx[1,3] += yin[t-1]
    xx[2,3] += t*yin[t-1]
    xx[3,3] += yin[t-1]^2
    xy[1] += yin[t]
    xy[2] += t*yin[t]
    xy[3] += yin[t-1]*yin[t]
  end
  xx[1,1] = T-1 # = 1'*1
  xx[1,2] = xx[2,1] = (T+1)*T/2 - 1 # sum(p+1:T)
  xx[2,2] = (2*(T)+1)*(T)*(T+1)/6 - 1 # sum((p+1:T).^2)
  xx[3,1] = xx[1,3]
  xx[3,2] = xx[2,3]
  nothing
end
```
```julia
s = @code_string ARGridBootstrap.resids!(zeros(length(y)-1), y, zeros(3))
code_md(s)
```

```julia
@inline function resids!(e, yin, θ)
  T = length(yin)
  @inbounds @simd for t in 2:T
    e[t-1] = yin[t] - θ[1] - θ[2]*t - θ[3]*yin[t-1]
  end
  nothing
end
```
```julia
benchnox = @benchmark (b,t) = gridbootstrap(wrapper(b_est_nox), a->ar1_original(y0, a, est.e),
                                            αgrid, nboot)
```

```
BenchmarkTools.Trial: 124 samples with 1 evaluation.
 Range (min … max):  37.839 ms … 61.466 ms  ┊ GC (min … max):  7.96% … 4.77%
 Time  (median):     41.088 ms              ┊ GC (median):    14.60%
 Time  (mean ± σ):   40.376 ms ±  2.496 ms  ┊ GC (mean ± σ):  12.25% ± 3.29%

   ▁                    █▁▂▃
  ▇█▆▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅████▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
  37.8 ms         Histogram: frequency by time        46.4 ms <

 Memory estimate: 71.70 MiB, allocs estimate: 101499.
```
We have further cut the time by a factor of two. However, this performance optimization has been costly in terms of readability and extensibility of our code. If we wanted to fit an AR(p) model instead of an AR(1), the `b_est_nox` function would be more difficult to modify than the `b_est_mldivide` version.
We additionally gained some performance by using mutable `StaticArrays` to hold \(X'X\) and \(X'y\).
```julia
xx = zeros(3,3)
xy = zeros(3)
@benchmark ARGridBootstrap.xx_xy!(xx,xy,y)
```

```
BenchmarkTools.Trial: 10000 samples with 179 evaluations.
 Range (min … max):  594.637 ns …  1.129 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     597.760 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   602.472 ns ± 26.903 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄                    ▁
  ▆██▇▄▄▁▁▃▁▁▁▁▁▁▁▇▄▄▄▄█▇████▇▆▅▆▅▇▅▁▅▄▄▃▃▃▄▁▄▄▄▅▅▅▆▅▅▅▅▅▅▅▅▅▆ █
  595 ns        Histogram: log(frequency) by time        706 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
```julia
using StaticArrays
xx = @MMatrix zeros(3,3)
xy = @MVector zeros(3)
@benchmark ARGridBootstrap.xx_xy!(xx,xy,y)
```

```
BenchmarkTools.Trial: 10000 samples with 181 evaluations.
 Range (min … max):  588.834 ns …  1.172 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     592.696 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   597.046 ns ± 28.276 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▃  ▁                 ▁
  ▅███▅▁▄▁▁▁▁▁▃▁▁▁▁▃▃▁▁▄▇▇▇████▇▇▆▅▄▅▅▃▄▅▃▁▄▃▁▁▄▄▅▅▄▄▄▅▄▃▅▄▄▄▅ █
  589 ns        Histogram: log(frequency) by time        695 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
SIMD
To get full performance from modern CPUs, it is essential to use Single Instruction, Multiple Data (SIMD) instructions, also known as vectorized CPU instructions, in our code. These are special CPU instructions that apply an operation to multiple numbers at once. The LLVM compiler that Julia uses tries to use these instructions automatically when it is sure that doing so is safe. See “Demystifying auto vectorization in Julia” for more information.
The compiler is not always successful in using SIMD instructions, so manual use of SIMD instructions can often help.
LoopVectorization.jl
As mentioned above, the Julia compiler tries to automatically use SIMD instructions when it is safe to do so. SIMD instructions often change the order of operations, and since floating point arithmetic is not exactly associative, reordering can change the result. The compiler avoids reorderings that might alter results, and this often prevents SIMD use. The `@simd` macro tells the compiler not to worry about reordering operations and to insert SIMD instructions more aggressively. Still, there are some SIMD operations that the compiler will not insert automatically.
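For a minimal illustration of what `@simd` permits, consider a simple reduction (an illustration separate from the bootstrap code):

```julia
# @simd tells the compiler that reassociating the reduction is acceptable,
# which lets it accumulate partial sums in vector registers
function mysum(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    s
end
```

Without the annotation, the compiler must produce exactly the left-to-right sum, which rules out vectorizing the accumulation.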
The LoopVectorization.jl package defines a macro, `@turbo`, that inserts SIMD instructions more aggressively. This can make a large difference for some loops and broadcasts.
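The package's `resids_turbo!` is not displayed in these notes; presumably it is just `resids!` with `@turbo` in place of `@inbounds @simd`. A sketch under that assumption:

```julia
using LoopVectorization  # provides @turbo

# same residual loop as resids!, but letting @turbo vectorize it;
# this is a guess at resids_turbo!, not the package's actual code
function resids_turbo!(e, yin, θ)
    T = length(yin)
    @turbo for t in 2:T
        e[t-1] = yin[t] - θ[1] - θ[2]*t - θ[3]*yin[t-1]
    end
    nothing
end
```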
```julia
e = zeros(length(y)-1)
θ = @MVector zeros(3)
@benchmark ARGridBootstrap.resids!($e,$y,$θ)
```

```
BenchmarkTools.Trial: 10000 samples with 390 evaluations.
 Range (min … max):  246.418 ns … 555.159 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     248.031 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   250.314 ns ±  13.983 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▁  ▁                ▁
  ▇███▄▅▁▃▁▃▁▁▁▁▃▃▃▁▄█▄▅▆▇███▇▇▇▅▅▄▄▃▁▄▁▁▄▄▅▄▃▅▆▄▅▄▅▅▄▅▅▅▅▅▄▄▄▅ █
  246 ns        Histogram: log(frequency) by time        300 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
```julia
@benchmark ARGridBootstrap.resids_turbo!($e,$y,$θ)
```

```
BenchmarkTools.Trial: 10000 samples with 979 evaluations.
 Range (min … max):  60.996 ns … 131.267 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     61.709 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   62.269 ns ±   2.848 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄█▅▁            ▃ ▁                                      ▁
  ▅▅████▇▅▅▁▁▃▃▁▁▁▁▁▁▁▄▅▄██▄▄▃▃▁▁▁▄▃▄▃▁▁▁▁▃▄▄▅▇▇▇▆▆▆▅▅▅▅▆▅▃▅█▆ █
  61 ns        Histogram: log(frequency) by time       72.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
We can also write SIMD instructions ourselves. The SIMD.jl package makes this somewhat accessible. As an example, `resids_simd!` uses this package to match the performance of the `@turbo` version.
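The implementation of `resids_simd!` is likewise not shown here. A sketch of the idea using SIMD.jl's `Vec`, `vload`, and `vstore` (the details are my guess at the approach, not the package's actual code):

```julia
using SIMD

# process N residuals per iteration with explicit vector loads/stores,
# then finish any leftover elements with a scalar loop
function resids_simd!(e, yin, θ, ::Val{N}) where N
    T = length(yin)
    offsets = Vec{N,Float64}(ntuple(i -> Float64(i - 1), N))
    t = 2
    while t + N - 1 <= T
        yt   = vload(Vec{N,Float64}, yin, t)      # yin[t], …, yin[t+N-1]
        ylag = vload(Vec{N,Float64}, yin, t - 1)  # yin[t-1], …, yin[t+N-2]
        tv   = offsets + Float64(t)               # t, t+1, …, t+N-1
        vstore(yt - θ[1] - θ[2]*tv - θ[3]*ylag, e, t - 1)
        t += N
    end
    @inbounds while t <= T                        # scalar remainder
        e[t-1] = yin[t] - θ[1] - θ[2]*t - θ[3]*yin[t-1]
        t += 1
    end
    nothing
end
```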
```julia
@benchmark ARGridBootstrap.resids_simd!($e,$y,$θ, Val(8))
```

```
BenchmarkTools.Trial: 10000 samples with 979 evaluations.
 Range (min … max):  63.778 ns … 120.281 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     64.349 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   64.973 ns ±   2.993 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █▅▂            ▂ ▁                                       ▁
  ▅▆████▁▃▄▃▁▁▁▁▁▁▁▁▁▃▃█▇▄▃▁▁▁▁▄▁▆▆▄▁▄▃▅▅▆▇▇▇▇▆▆▅▅▅▄▅▅▁▄█▆▄▃▅▃ █
  63.8 ns       Histogram: log(frequency) by time      76.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
Generally, if `@turbo` successfully inserted SIMD instructions and made your code substantially faster, it will not be worth your effort to write SIMD code manually. However, `@turbo` will not always be able to insert SIMD instructions. One way to check is through benchmarking. Another way is to inspect `@code_llvm ARGridBootstrap.resids_turbo!(e,y,θ)`. Things like `fadd fast <4 x double>` are SIMD instructions; the `<4 x double>` part is the key sign. In contrast, something like `%26 = fsub double %24, %25` is a scalar instruction.
The loops in `xx_xy!` are not automatically vectorized. Part of the issue is that `@turbo` cannot tell that `xx` and `xy` are statically allocated. If we rewrote the code to use scalars, it would likely get vectorized by `@turbo`. The `b_est_stride` function does this. However, it is really inconvenient to rewrite array code as scalars. It may be more maintainable to keep the arrays and write the SIMD instructions ourselves.
```julia
@benchmark ARGridBootstrap.xx_xy!($xx,$xy,$y)
```

```
BenchmarkTools.Trial: 10000 samples with 183 evaluations.
 Range (min … max):  577.443 ns … 13.795 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     580.115 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   586.493 ns ± 134.607 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂                 ▁
  ███▆▄▃▁▃▃▁▃▃▃▃▄▁▁▁▄█▆▇███▇▇▅▅▇▅▁▃▄▄▅▅▄▄▆▄▅▄▄▅▆▄▅▄▅▆▅▄▅▄▆▅▅▄▅▄ █
  577 ns        Histogram: log(frequency) by time        700 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
```julia
@benchmark ARGridBootstrap.xx_xy_simd!($xx,$xy,$y, Val(16))
```

```
BenchmarkTools.Trial: 10000 samples with 565 evaluations.
 Range (min … max):  206.189 ns … 321.524 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     207.427 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   209.219 ns ±   7.418 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▂        ▃▁▁▁  ▁                                          ▂
  ▆███▅▄▃▁▁▁▁▁▁▃▄▃▃▃▄▄▃▄▇█████▇▆▆▅▅▅▇█▆▄▃▄▅▄▅▄▄▅▅██▅▄▃▄▁▄▃▄▄▅▅▆ █
  206 ns        Histogram: log(frequency) by time        243 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
This makes the code faster by a factor of more than 10. The `Val(N)` argument controls the width of the vectors that get passed to the SIMD instructions. The value of `N` can affect execution time by a factor of 5 or more. The best choice of `N` depends on your exact hardware and the code being executed.
To see how much this is worth it, let’s benchmark the full bootstrap code, but using the SIMD versions of `resids!` and `xx_xy!`.

```julia
b_est_simd = y->b_est_nox(y, xx_xy! =ARGridBootstrap.xx_xy_simd!, resids! =ARGridBootstrap.resids_simd!)
benchsimd = @benchmark (b,t) = gridbootstrap(wrapper(b_est_simd), a->ar1_original(y0, a, est.e),
                                             αgrid, nboot)
```
```
BenchmarkTools.Trial: 128 samples with 1 evaluation.
 Range (min … max):  36.683 ms … 58.429 ms  ┊ GC (min … max):  8.36% … 5.09%
 Time  (median):     39.941 ms              ┊ GC (median):    15.25%
 Time  (mean ± σ):   39.180 ms ±  2.351 ms  ┊ GC (mean ± σ):  12.87% ± 3.57%

  ▃                  ▄█▂
  ██▄▇▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃████▆▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
  36.7 ms        Histogram: frequency by time       45.4 ms <

 Memory estimate: 71.70 MiB, allocs estimate: 101499.
```
We have not improved total execution time very much. The problem is that very little of the total time was spent in `xx_xy!` and `resids!` to begin with. We did make those functions much faster, but they were such a small portion of total execution time that the change is not noticeable. We should focus our efforts on `ar1_original` if we want further improvement.
Fastest Version
The fastest version of the code that I could write combines the ideas above. As above, it avoids allocating `x`. It also avoids allocating `e` by combining the simulation and estimation into a single loop. Finally, it uses mutable static arrays to ensure that operations on `xx` and `xy` have as little overhead as possible. Note that for small StaticArrays, `inv` calls a specialized, fast version and ends up being faster than `\`.
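The claim about `inv` on small static matrices is easy to check directly (illustration only):

```julia
using StaticArrays, LinearAlgebra

A  = @SMatrix [2.0 0.1 0.0; 0.1 3.0 0.2; 0.0 0.2 4.0]
Ai = inv(A)              # dispatches to StaticArrays' unrolled 3×3 formula,
                         # rather than a general LAPACK factorization
@assert Ai isa SMatrix
@assert A * Ai ≈ one(A)  # one(A) is the 3×3 static identity
```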
```julia
using StaticArrays
s = @code_string simulate_estimate_arp(y0,a,e)
code_md(s)
```

```julia
function simulate_estimate_arp(y0, a, e, ar::Val{P}=Val(1),
                               rindex=()->rand(1:length(e))) where P
  T = length(e)
  length(a)==P || error("length(a) not equal to P")
  xx = @MMatrix zeros(eltype(e),P+2, P+2)
  xy = @MVector zeros(eltype(e),P+2)
  yy = zero(eltype(e))
  xt = @MVector ones(eltype(e), P+2)
  if (abs(a)<1)
    xt[3:(P+2)] .= y0
  else
    xt[3:(P+2)] .= 0.0
  end
  α = @MVector zeros(eltype(e),P+2)
  @simd for i = 1:P
    α[2+i] = a[i]
  end
  xx[1,1] = T-P # = 1'*1
  xx[1,2] = xx[2,1] = (T+1)*T/2 - sum(1:P) # sum(P+1:T)
  xx[2,2] = (2*(T)+1)*(T)*(T+1)/6 - sum((1:P).^2) # sum((P+1:T).^2)
  @inbounds for t in (P+1):T
    et = e[rindex()]
    xt[2] = t
    for i in 1:(P+2)
      @simd for j in 3:(P+2)
        xx[i,j] += xt[i]*xt[j]
      end
    end
    y = dot(α, xt) + et
    @simd for i in 1:(P+2)
      xy[i] += xt[i]*y
    end
    yy += y^2
    if (P>1)
      xt[4:(P+2)] .= xt[3:(P+1)]
    end
    xt[3] = y
  end
  @inbounds for i in 3:(P+2)
    for j in 1:(i-1)
      xx[i,j] = xx[j,i]
    end
  end
  ixx = inv(xx)
  θ = ixx*xy
  ee = yy - xy'*ixx*xy
  se = sqrt.(abs.(diag(ixx*(ee))./(T-(2*P+2))))
  (θ=θ, se=se)
end
```
```julia
estimator(y0=y0, e=est.e) = function(a)
  out = simulate_estimate_arp(y0,a,e)
  (out.θ[3], out.se[3])
end

bench_sea = @benchmark (b,t) = gridbootstrap(estimator(), a->a, αgrid, nboot)
```
```
BenchmarkTools.Trial: 258 samples with 1 evaluation.
 Range (min … max):  18.664 ms … 27.713 ms  ┊ GC (min … max): 0.00% … 29.14%
 Time  (median):     18.816 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.419 ms ±  2.069 ms  ┊ GC (mean ± σ):  2.76% ±  7.46%

  █▆ ▄▁
  ██▇██▄▆▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▇▁▅ ▅
  18.7 ms       Histogram: log(frequency) by time     27.4 ms <

 Memory estimate: 2.94 MiB, allocs estimate: 60904.
```
On my computer, this version of the code is about 10 times faster than the original.
For further gains, we will explore multi-threading and using a GPU in parts 2 and 3.
Exercises
Some of these have been incorporated into the sections above.
1. Read the Performance Tips section of the Julia Manual and incorporate some of these tips into the above code.
2. Write a version of `b_est` that avoids allocating the full \(T \times 3\) \(X\) matrix, but can still be generalized to an AR(p) model.
3. Examine how the relative performance of these versions of `b_est` varies with `T`, `nboot`, and the number of grid points.
4. The Julia package StaticArrays.jl provides an alternative array implementation that is often much faster than `Base.Array`. Try implementing `b_est` using StaticArrays.jl. You will likely need to use mutable arrays (see `@MMatrix` and `@MVector`). Note that `inv` of a small array will be substantially faster when using StaticArrays.jl instead of `Base.Array`.
5. The fastest version of the code does not yet use SIMD instructions. Try to improve its performance by using SIMD. My SIMDscan package might be useful.