Coding for Performance
1: Optimizing Single Threaded Code
Introduction
Today we will look at some methods to improve the speed of our code. Although speed is sometimes important, it should generally be low on your list of priorities when writing code: prioritize correctness and maintainability ahead of performance. Nonetheless, performance does matter for some problems.
If you have not already, be sure to read the Performance Tips section of the Julia docs.
Also, read Rackauckas’s notes on “Optimizing Serial Code” (Rackauckas 2019).
Grid bootstrap
As a motivating example, we will look at the gridded bootstrap of Hansen (1999).
Gauss, Matlab, and R code implementing Hansen’s method is available on Hansen’s website. The Julia code below is more or less a direct translation from Hansen’s R code. Since this is a translation from R of a translation from Gauss, this code will not necessarily follow best practices for Julia.
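For orientation, the regression constructed in the code below corresponds to an AR(1) model with intercept and linear trend,

\[ y_t = \theta_1 + \theta_2 t + \theta_3 y_{t-1} + e_t. \]

Hansen’s grid bootstrap computes bootstrap quantiles of the t-statistic for \(\theta_3\) at each point on a grid of values of the autoregressive coefficient, and inverts them to form a confidence interval with better coverage than the usual asymptotic interval when \(\theta_3\) is near one.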
T = 200
e = randn(T)
y0 = 0
a = 0.9
function b_est_original(yin)
T = length(yin)
x = [ones(T-1) 2:T yin[1:(T-1)]]
y = yin[2:T]
θ = (x'*x) \ (x'*y)
e = y - x*θ
se = sqrt.(diag(inv(x'*x) *(e'*e))./(T-4))
(θ=θ,se=se,e=e)
end

function ar1_original(y0, a, e, rindex=T->rand(1:length(e),T))
T = length(e)
y = Array{eltype(e)}(undef, T)
y[1] = abs(a)<1 ? y0 : zero(eltype(y))
et = e[rindex(T-1)]
for t in 2:T
y[t] = a*y[t-1] + et[t-1]
end
y
end

function gridbootstrap(estimator, simulator,
grid,
nboot=199)
g = length(grid)
bootq = zeros(nboot, g)
ba = zeros(nboot, g)
bootse = zeros(nboot,g)
for ak in 1:g
for j in 1:nboot
(bootq[j,ak], bootse[j,ak]) = estimator(simulator(grid[ak]))
ba[j,ak] = bootq[j,ak] - grid[ak]
end
end
ts = ba./bootse
(ba=ba, t=ts)
end

Improving performance
Now, let’s run this code and time it. Note that we are running this with only 50 grid points and 199 bootstrap replications. In real use, you would want 999 or more bootstrap replications, and perhaps more grid points.
# simulate some data
using Random, BenchmarkTools, Profile, ProfileCanvas
T = 200
e = randn(T)
y0 = 0
a = 0.9
y = ar1_original(y0, a, e)
est = b_est_original(y)
αgrid = 0.84:(0.22/50):1.06
nboot= 199
wrapper(b_est) = function(x)
out=b_est(x)
(out.θ[3], out.se[3])
end
Initial Benchmark and Profile
benchorig = @benchmark (b,t) = gridbootstrap(wrapper(b_est_original), a->ar1_original(y0, a, est.e),
                                             αgrid, nboot)

BenchmarkTools.Trial: 66 samples with 1 evaluation per sample.
 Range (min … max):  55.216 ms … 92.507 ms  ┊ GC (min … max): 0.00% … 36.29%
 Time  (median):     76.056 ms              ┊ GC (median):    34.15%
 Time  (mean ± σ):   76.743 ms ± 7.415 ms   ┊ GC (mean ± σ):  31.77% ± 6.89%
 Memory estimate: 208.68 MiB, allocs estimate: 568357.
To make code faster, we should begin by profiling.
# only to make profile results show nicely in generated html
# for interactive use, just use @profview in place of @profile
function profilehtmlstring()
buf = IOBuffer()
show(buf, MIME("text/html"), ProfileCanvas.view(Profile.fetch()))
s=String(take!(buf))
println("\n\n"*s*"\n")
end
using Profile, ProfileCanvas
Profile.clear();
Profile.init(n=10^7,delay=0.0001);
@profile (b,t) = gridbootstrap(wrapper(b_est_original), a->ar1_original(y0, a, est.e),
αgrid, 999)
profilehtmlstring()

The profiler works very simply. Every 0.0001 seconds, the line of code being executed gets recorded. There are then various tools for printing and visualizing the results. The @profview macro shows a flame graph. The default view might be a bit strange: the base of the flame graph often includes Julia’s REPL and various other things that can be ignored. If you click on the graph, it will zoom in; use the mouse wheel to zoom back out.
In VSCode, you can just use @profview in place of @profile and the profile flamegraph will open in a panel within VSCode.
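Outside of a notebook or VSCode, the same workflow can be run with a text summary instead of a flame graph. A minimal sketch (work is a hypothetical stand-in for the bootstrap code):

```julia
using Profile

function work(n)
    s = 0.0
    for i in 1:n
        s += sin(i) / i   # most profile samples should land on this line
    end
    return s
end

work(10)                            # run once so we profile execution, not compilation
Profile.clear()
Profile.init(n=10^7, delay=0.0001)  # sample every 0.0001 seconds
@profile work(10^7)
Profile.print(format=:flat, sortedby=:count)  # counts ≈ time spent per line
```

In interactive use, replacing `Profile.print` with `@profview work(10^7)` gives the flame-graph view of the same samples.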
Removing Redundant Operations
The following was true on an older version of Julia, but now there are no gains from eliminating inv.
From the output, we see there were 640 ticks in gridbootstrap_original, and almost all of these occurred within inv (exact numbers will vary on each execution, but relative ones should be similar). If we want the code to be faster, we should focus on these lines. Calling both inv and \ is redundant; we should combine these computations.
function b_est_mldivide(yin)
T = length(yin)
x = [ones(T-1) 2:T yin[1:(T-1)]]
y = yin[2:T]
tmp = x'*x \ [x'*y I]
θ = tmp[:,1]
ixx = tmp[:,2:4]
e = y - x*θ
se = sqrt.(diag(ixx *(e'*e))./(T-4))
(θ=θ,se=se,e=e)
end

benchml = @benchmark (b,t) = gridbootstrap(wrapper(b_est_mldivide), a->ar1_original(y0, a, est.e),
                                           αgrid, nboot)

BenchmarkTools.Trial: 67 samples with 1 evaluation per sample.
 Range (min … max):  51.635 ms … 86.020 ms  ┊ GC (min … max): 0.00% … 29.37%
 Time  (median):     75.577 ms              ┊ GC (median):    31.85%
 Time  (mean ± σ):   75.321 ms ± 6.530 ms   ┊ GC (mean ± σ):  31.68% ± 6.56%
 Memory estimate: 196.75 MiB, allocs estimate: 629251.
On older versions of Julia, this change gave a speedup of about a factor of 4 on my computer; it no longer seems to matter.
Reducing Allocations
Profile.clear()
@profile (b,t) = gridbootstrap(wrapper(b_est_mldivide), a->ar1_original(y0, a, est.e),
αgrid, 999)
profilehtmlstring()

Now, the most time-consuming parts of the code are, unsurprisingly, the call to \, and, perhaps surprisingly, hcat from creating x. Allocating and copying memory is relatively slow. The creation of x involves both.
Caching Intermediate Arrays
One option is to preallocate arrays and reuse them. The struct ARGridBootstrap.bEstCached does this; b_est_cache below is an instance of it.
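The source of bEstCached lives in the ARGridBootstrap package. A simplified sketch of the idea (the name BEstCachedSketch and the details below are illustrative, not the package’s exact implementation) is a callable struct that owns its work arrays and fills them in place on each call:

```julia
using LinearAlgebra

# Callable struct that preallocates the regressor matrix and dependent
# variable, reusing them across bootstrap replications so that each
# call avoids allocating a fresh (T-1) x 3 matrix.
struct BEstCachedSketch{M<:AbstractMatrix,V<:AbstractVector}
    x::M
    y::V
end

BEstCachedSketch(Tm1::Integer) = BEstCachedSketch(ones(Tm1, 3), zeros(Tm1))

function (b::BEstCachedSketch)(yin)
    T = length(yin)
    x, y = b.x, b.y
    @inbounds for t in 2:T      # overwrite the cached arrays in place
        x[t-1, 2] = t           # column 1 stays equal to one
        x[t-1, 3] = yin[t-1]
        y[t-1]    = yin[t]
    end
    θ  = (x'*x) \ (x'*y)
    e  = y - x*θ
    se = sqrt.(diag(inv(x'*x)*(e'*e)) ./ (T-4))
    (θ=θ, se=se, e=e)
end
```

The estimation itself is unchanged from b_est_original; only the allocation of x and y moves out of the hot loop.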
b_est_cache = ARGridBootstrap.bEstCached(T-1)
benchcache=@benchmark gridbootstrap(wrapper(b_est_cache), a->ar1_original(y0, a, est.e),
                            αgrid, nboot)

BenchmarkTools.Trial: 62 samples with 1 evaluation per sample.
 Range (min … max):  67.143 ms … 92.734 ms  ┊ GC (min … max): 0.00% … 25.16%
 Time  (median):     81.710 ms              ┊ GC (median):    22.73%
 Time  (mean ± σ):   81.190 ms ± 4.599 ms   ┊ GC (mean ± σ):  20.51% ± 5.68%
 Memory estimate: 136.98 MiB, allocs estimate: 304484.
Profile.clear()
@profile (b,t) = gridbootstrap(wrapper(b_est_cache), a->ar1_original(y0, a, est.e),
αgrid, 999)
profilehtmlstring()

Eliminating Intermediate Arrays
Better yet, we can avoid creating x by just accumulating \(X'y\) and \(X'X\) in a loop.
function b_est_nox(yin; xx_xy!::F=xx_xy!, resids!::FR=resids!) where {F<:Function, FR<:Function}
T = length(yin)
xx = @MMatrix zeros(eltype(yin),3,3)
xy = @MVector zeros(eltype(yin),3)
xx_xy!(xx,xy,yin)
ixx = inv(xx)
θ = ixx * xy
e = similar(yin,T-1)
resids!(e,yin,θ)
se = sqrt.(diag(ixx *(e'*e))./(T-4))
(θ=θ,se=se,e=e)
end

We put the two main loops into separate functions, both for organization and to allow us to focus on optimizing these loops below.
@inline function xx_xy!(xx,xy,yin)
T = length(yin)
xx .= zero(eltype(xx))
xy .= zero(eltype(xy))
@inbounds @simd for t in 2:T
xx[1,3] += yin[t-1]
xx[2,3] += t*yin[t-1]
xx[3,3] += yin[t-1]^2
xy[1] += yin[t]
xy[2] += t*yin[t]
xy[3] += yin[t-1]*yin[t]
end
xx[1,1] = T-1 # = 1'*1
xx[1,2] = xx[2,1] = (T+1)*T/2 - 1 # sum(p+1:T)
xx[2,2] = (2*(T)+1)*(T)*(T+1)/6 - 1 # sum((p+1:T).^2)
xx[3,1] = xx[1,3]
xx[3,2] = xx[2,3]
nothing
end

@inline function resids!(e, yin, θ)
T = length(yin)
@inbounds @simd for t in 2:T
e[t-1] = yin[t] - θ[1] - θ[2]*t - θ[3]*yin[t-1]
end
nothing
end

benchnox = @benchmark (b,t) = gridbootstrap(wrapper(b_est_nox), a->ar1_original(y0, a, est.e),
                                            αgrid, nboot)

BenchmarkTools.Trial: 286 samples with 1 evaluation per sample.
 Range (min … max):  11.507 ms … 41.519 ms  ┊ GC (min … max): 11.64% … 70.72%
 Time  (median):     17.347 ms              ┊ GC (median):    31.47%
 Time  (mean ± σ):   17.491 ms ± 3.604 ms   ┊ GC (mean ± σ):  30.78% ± 11.98%
 Memory estimate: 66.13 MiB, allocs estimate: 142099.
We have cut the time by a factor of four. However, this performance optimization has been costly in terms of readability and extensibility of our code. If we wanted to fit an AR(p) model instead of AR(1), the b_est_nox function would be more difficult to modify than the b_est_mldivide version.
We additionally gained some performance by using mutable StaticArrays to hold \(X'X\) and \(X'y\).
xx = zeros(3,3)
xy = zeros(3)
@benchmark ARGridBootstrap.xx_xy!(xx,xy,y)

BenchmarkTools.Trial: 10000 samples with 198 evaluations per sample.
 Range (min … max):  441.187 ns … 931.359 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     443.818 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   445.070 ns ± 14.250 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
using StaticArrays
xx = @MMatrix zeros(3,3)
xy = @MVector zeros(3)
@benchmark ARGridBootstrap.xx_xy!(xx,xy,y)

BenchmarkTools.Trial: 10000 samples with 198 evaluations per sample.
 Range (min … max):  438.657 ns … 1.792 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     440.783 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   442.471 ns ± 20.264 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
SIMD
To get full performance from modern CPUs, it is essential to use Single Instruction, Multiple Data (SIMD) instructions (also known as vectorized CPU instructions) in our code. These are special CPU instructions that allow applying an operation to multiple numbers at once. The LLVM compiler that Julia uses tries to automatically use these instructions when it is sure that it is safe to do so. See “Demystifying auto vectorization in Julia” for more information.
The compiler is not always successful in using SIMD instructions, so manual usage of SIMD instructions can often help.
LoopVectorization.jl
As mentioned above, the Julia compiler tries to automatically use SIMD instructions when it is safe to do so. SIMD instructions often change the order of operations, and since floating-point arithmetic is not exactly associative, reordering can change results. The compiler tries to avoid reordering operations, but this often prevents SIMD use. The macro @simd tells the compiler not to worry about reordering operations and to insert SIMD instructions more aggressively. Still, there are some SIMD operations that the compiler will not insert automatically.
The LoopVectorization.jl package defines a macro, @turbo, that more aggressively inserts SIMD instructions. This can make a large difference for some loops and broadcasts.
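The resids_turbo! function benchmarked below is defined in the ARGridBootstrap package; presumably it is just resids! with @turbo in place of @inbounds @simd, along the lines of this sketch (the exact package implementation may differ):

```julia
using LoopVectorization

# Same residual computation as resids!, but with @turbo asking
# LoopVectorization.jl to generate SIMD code for the loop.
function resids_turbo!(e, yin, θ)
    T = length(yin)
    @turbo for t in 2:T
        e[t-1] = yin[t] - θ[1] - θ[2]*t - θ[3]*yin[t-1]
    end
    nothing
end
```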
e = zeros(length(y)-1)
θ = @MVector zeros(3)
@benchmark ARGridBootstrap.resids!($e,$y,$θ)

BenchmarkTools.Trial: 10000 samples with 954 evaluations per sample.
 Range (min … max):  93.047 ns … 352.910 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     93.594 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.498 ns ± 6.118 ns    ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark ARGridBootstrap.resids_turbo!($e,$y,$θ)

BenchmarkTools.Trial: 10000 samples with 998 evaluations per sample.
 Range (min … max):  15.560 ns … 77.933 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.012 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.209 ns ± 2.306 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
We can also write SIMD instructions ourselves. The SIMD.jl package makes this somewhat accessible. For example, resids_simd! uses this package to match the performance of the @turbo version.
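To give a flavor of what hand-written SIMD code looks like, here is a simplified sketch of such a kernel using SIMD.jl (this is illustrative, not the package’s exact resids_simd!; N is the vector width):

```julia
using SIMD

# Explicitly vectorized residuals: process N time periods at once using
# Vec{N,Float64}, then handle the leftover elements with a scalar loop.
function resids_simd_sketch!(e, yin, θ, ::Val{N}) where {N}
    T = length(yin)
    lane = Vec{N,Float64}(ntuple(i -> Float64(i - 1), N))  # offsets 0,1,…,N-1
    t = 2
    @inbounds while t + N - 1 <= T
        tv   = lane + Float64(t)                  # trend values t, …, t+N-1
        ylag = vload(Vec{N,Float64}, yin, t - 1)  # yin[t-1 : t+N-2]
        ycur = vload(Vec{N,Float64}, yin, t)      # yin[t   : t+N-1]
        vstore(ycur - θ[1] - θ[2]*tv - θ[3]*ylag, e, t - 1)
        t += N
    end
    @inbounds while t <= T                        # scalar remainder
        e[t-1] = yin[t] - θ[1] - θ[2]*t - θ[3]*yin[t-1]
        t += 1
    end
    nothing
end
```

The explicit vload/vstore calls make the vector width a compile-time choice, which is what the Val(8) argument in the benchmark below controls.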
@benchmark ARGridBootstrap.resids_simd!($e,$y,$θ, Val(8))

BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (min … max):  22.171 ns … 107.985 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.633 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.915 ns ± 2.675 ns    ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
Generally, if @turbo successfully inserted SIMD instructions and made your code substantially faster, it will not be worth your effort to write SIMD code manually. However, @turbo will not always be able to insert SIMD instructions. One way to check is through benchmarking. Another is to inspect @code_llvm ARGridBootstrap.resids_turbo!(e,y,θ). Lines like fadd fast <4 x double> are SIMD instructions; the <4 x double> part is the key sign. In contrast, something like %26 = fsub double %24, %25 is a scalar instruction.
The loops in xx_xy! are not automatically vectorized. Part of the issue is that @turbo cannot tell that xx and xy are statically allocated. If we rewrote the code to use scalars, it would likely get vectorized by @turbo; the b_est_stride function does this. However, it is really inconvenient to rewrite array code as scalars. It may be more maintainable to keep the arrays and write SIMD instructions ourselves.
@benchmark ARGridBootstrap.xx_xy!($xx,$xy,$y)

BenchmarkTools.Trial: 10000 samples with 198 evaluations per sample.
 Range (min … max):  432.737 ns … 762.399 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     435.924 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   437.071 ns ± 6.437 ns    ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark ARGridBootstrap.xx_xy_simd!($xx,$xy,$y, Val(64))

BenchmarkTools.Trial: 10000 samples with 987 evaluations per sample.
 Range (min … max):  49.140 ns … 64.814 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     50.146 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.339 ns ± 0.802 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
This makes the code faster by a factor of about 10. The Val(N) argument controls the width of the vectors passed to SIMD instructions. The value of N can affect execution time by a factor of 5 or more, and the best choice of N depends on your exact hardware, the code being executed, and compiler details. Current x86 CPUs have registers up to 512 bits wide, and so can operate on at most 8 Float64 values at once. However, it seems that setting N to larger values results in faster code here. I’m not sure exactly why; it is not that more than 8 operations are ever done at once. Instead, I guess the compiler converts larger values of N into blocks of 8 in a more efficient way.
To see how much this is worth, let’s benchmark the full bootstrap code using the SIMD versions of resids! and xx_xy!.
b_est_simd = y->b_est_nox(y, xx_xy! =ARGridBootstrap.xx_xy_simd!, resids! =ARGridBootstrap.resids_simd!)
benchsimd = @benchmark (b,t) = gridbootstrap(wrapper(b_est_simd), a->ar1_original(y0, a, est.e),
                                             αgrid, nboot)

BenchmarkTools.Trial: 290 samples with 1 evaluation per sample.
 Range (min … max):  12.832 ms … 41.313 ms  ┊ GC (min … max): 19.02% … 66.96%
 Time  (median):     16.554 ms              ┊ GC (median):    32.00%
 Time  (mean ± σ):   17.252 ms ± 3.388 ms   ┊ GC (mean ± σ):  32.14% ± 11.64%
 Memory estimate: 66.13 MiB, allocs estimate: 142099.
We have not improved total execution time very much; it is only about 10% faster than the nox version. The main problem is that very little of the total time was spent in xx_xy! and resids! to begin with. We did make those functions much faster, but they were such a small portion of total execution time that the change is barely noticeable. We should focus our efforts on ar1_original if we want further improvement.
Fastest Version
The fastest version of the code that I could write combines the ideas above. As above, it avoids allocating x. It also avoids allocating e by combining the simulation and estimation into a single loop. Finally, it uses mutable static arrays to ensure that operations on xx and xy have as little overhead as possible. Note that for small StaticArrays, inv will call a specialized, fast version, and ends up being faster than \.
using StaticArrays

function simulate_estimate_arp(y0, a, e, ar::Val{P}=Val(1),
rindex=()->rand(1:length(e))) where P
T = length(e)
length(a)==P || error("length(a) not equal to P")
xx = @MMatrix zeros(eltype(e),P+2, P+2)
xy = @MVector zeros(eltype(e),P+2)
yy = zero(eltype(e))
xt = @MVector ones(eltype(e), P+2)
if (abs(a)<1)
xt[3:(P+2)] .= y0
else
xt[3:(P+2)] .= 0.0
end
α = @MVector zeros(eltype(e),P+2)
@simd for i = 1:P
α[2+i] = a[i]
end
xx[1,1] = T-P # = 1'*1
xx[1,2] = xx[2,1] = (T+1)*T/2 - sum(1:P) # sum(P+1:T)
xx[2,2] = (2*(T)+1)*(T)*(T+1)/6 - sum((1:P).^2) # sum((P+1:T).^2)
@inbounds for t in (P+1):T
et = e[rindex()]
xt[2] = t
for i in 1:(P+2)
@simd for j in 3:(P+2)
xx[i,j] += xt[i]*xt[j]
end
end
y = dot(α, xt) + et
@simd for i in 1:(P+2)
xy[i] += xt[i]*y
end
yy += y^2
if (P>1)
xt[4:(P+2)] .= xt[3:(P+1)]
end
xt[3] = y
end
@inbounds for i in 3:(P+2)
for j in 1:(i-1)
xx[i,j] = xx[j,i]
end
end
ixx = inv(xx)
θ = ixx*xy
ee = yy - xy'*ixx*xy
se = sqrt.(abs.(diag(ixx *(ee))./(T-(2*P+2))))
(θ=θ,se=se)
end

estimator(y0=y0,e=est.e) = function(a)
out = simulate_estimate_arp(y0,a,e)
(out.θ[3], out.se[3])
end
bench_sea = @benchmark (b,t) = gridbootstrap(estimator(), a->a, αgrid, nboot)

BenchmarkTools.Trial: 746 samples with 1 evaluation per sample.
 Range (min … max):  6.366 ms … 32.459 ms  ┊ GC (min … max): 0.00% … 79.41%
 Time  (median):     6.522 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.709 ms ± 1.365 ms   ┊ GC (mean ± σ):  2.08% ± 6.80%
 Memory estimate: 2.32 MiB, allocs estimate: 50759.
On my computer, this version of the code is about 8 times faster than the original.
For further gains, we will explore multi-threading and using a GPU in parts 2 and 3.
Exercises
Some of these have been incorporated into the sections above.
Read the Performance Tips section of the Julia Manual and incorporate some of these tips into the above code.

Write a version of b_est that avoids allocating the full \(T \times 3\) \(X\) matrix, but can still be generalized to an AR(p) model.

Examine how the relative performance of these versions of b_est varies with T, nboot, and the number of grid points.

The Julia package StaticArrays.jl provides an alternative array implementation that is often much faster than Base.Array. Try implementing b_est using StaticArrays.jl. You will likely need to use mutable arrays (see @MMatrix and @MVector). Note that inv of a small array will be substantially faster when using StaticArrays.jl instead of Base.Array.

The fastest version of the code does not yet use SIMD instructions. Try to improve its performance by using SIMD. My SIMDscan package might be useful.