This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

About this document

This document was created using Weave.jl. The code is available on GitHub. The same source document generates both the static webpage and the associated Jupyter notebook (also viewable on nbviewer).

Optimization Algorithms

The goal of this notebook is to give you some familiarity with numeric optimization.

Numeric optimization is important because many (most) models cannot be fully solved analytically. Numeric results can be used to complement analytic ones. Numeric optimization plays a huge role in econometrics.

In these notes, we will focus on minimization problems following the convention in mathematics, engineering, and most numerical libraries. It is easy to convert between minimization and maximization, and we hope that this does not lead to any confusion.
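
Concretely, a maximization problem can always be recast as a minimization: $\max_{x} f(x) = -\min_{x} \left(-f(x)\right)$, and the maximizer of $f$ is the minimizer of $-f$.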

Heuristic searches

The simplest type of optimization algorithm is a heuristic search. Consider the problem:

$$ \min_{x} f(x) $$

with $f:\R^n \to \R$. Heuristic search algorithms consist of:

  1. Evaluate $f$ at a collection of points
  2. Generate a new candidate point, $x^{new}$. Replace a point in the current collection with $x^{new}$ if $f(x^{new})$ is small enough.
  3. Stop when the function value stops decreasing and/or the collection of points becomes too close together.

There are many variants of such algorithms with different ways of generating new points, deciding whether to accept the new point, and deciding when to stop. Here is a simple implementation and animation of the above idea. In the code below, new points are drawn randomly from a normal distribution, and new points are accepted whenever $f(x^{new})$ is smaller than the worst existing function value.

markdown && printfunc("minrandomsearch","heuristic_optimizers.jl")
"""
    minrandomsearch(f, x0, npoints; var0=1.0, ftol = 1e-6,
                         vtol = 1e-4, maxiter = 1000,
                         vshrink=0.9, xrange=[-2., 3.],
                         yrange=[-2.,6.], animate=true)

Find the minimum of function `f` by random search.

Creates an animation illustrating search progress.

# Arguments

- `f` function to minimize
- `x0` starting value
- `npoints` number of points in cloud
- `var0` initial variance of points
- `ftol` convergence tolerance for function value. Search terminates if both function change is less than ftol and variance is less than vtol.
- `vtol` convergence tolerance for variance. Search terminates if both function change is less than ftol and variance is less than vtol.
- `maxiter` maximum number of iterations
- `vshrink` once there have been more than 10 consecutive iterations with no function improvement, the variance is multiplied by this factor each iteration
- `xrange` range of x-axis in animation
- `yrange` range of y-axis in animation
- `animate` whether to create animation

# Returns

- `(fmin, xmin, iter, info, anim)` tuple consisting of minimal function
  value, minimizer, number of iterations, convergence info, and an animation
"""
function minrandomsearch(f, x0, npoints; var0=1.0, ftol = 1e-6,
                         vtol = 1e-4, maxiter = 1000,
                         vshrink=0.9, xrange=[-2., 3.],
                         yrange=[-2.,6.], animate=true)
  var = var0     # current variance for search
  oldbest = Inf  # smallest function value
  xbest = x0     # x with smallest function value
  newbest = f(xbest)
  iter = 0       # number of iterations
  noimprove = 0  # number of iterations with no improvement

  animate = (animate && length(x0)==2)

  if animate
    # make a contour plot of the function we're minimizing. This is for
    # illustrating; you wouldn't have this normally
    x = range(xrange[1],xrange[2], length=100)
    y = range(yrange[1],yrange[2], length=100)
    c = contour(x,y,(x,y) -> log(f([x,y])))
    anim = Animation()
  else
    anim = nothing
  end

  while ((oldbest - newbest > ftol || var > vtol) && iter<=maxiter)
    oldbest = newbest
    x = rand(MvNormal(xbest, var),npoints)

    if animate
      # plot the search points
      p = deepcopy(c)
      scatter!(p, x[1,:], x[2,:], markercolor=:black, markeralpha=0.5, legend=false, xlims=xrange, ylims=yrange)
    end

    fval = mapslices(f,x, dims=[1])
    (newbest, i) = findmin(fval)
    if (newbest > oldbest)
      noimprove+=1
      newbest=oldbest
    else
      xbest = x[:,i[2]]
      noimprove = 0
    end

    if animate
      # plot the best point so far
      scatter!(p, [xbest[1]],[xbest[2]], markercolor=:red, legend=false)
    end

    if (noimprove > 10) # shrink var
      var *= vshrink
    end

    if animate
      frame(anim) # add frame to animation
    end

    iter += 1
  end
  if (iter>maxiter)
    info = "Maximum iterations reached"
  else
    info = "Convergence."
  end
  return(minimum = newbest, minimizer = xbest, iters = iter, info = info, anim = anim)
end
"""
     banana(a,b)

  Returns the Rosenbrock function with parameters a, b.
"""
function banana(a,b)
  x->(a-x[1])^2+b*(x[2]-x[1]^2)^2
end
f = banana(1.0,1.0)

x0 = [-2.0, 3.0];
result = minrandomsearch(f, x0, 20, var0=0.1, vshrink=0.5, vtol=1e-3 )
gif(result[5], "randsearch.gif", fps=5);

random search

There are many other heuristic search algorithms. A popular deterministic one is the Nelder-Mead simplex. Popular heuristic searches that include some randomness are simulated annealing and particle swarm. All three of these algorithms are available in Optim.jl. Heuristic searches have the advantage that they require only function values (as opposed to also requiring gradients or hessians; see below). Some heuristic algorithms, like simulated annealing, can be shown to converge to a global (instead of local) minimum under appropriate assumptions. However, compared to algorithms that use more information, heuristic algorithms tend to require many more function evaluations.
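
For example, here is a sketch of calling Nelder-Mead through Optim.jl (assuming Optim is installed; SimulatedAnnealing() or ParticleSwarm() could be passed in the same way):

using Optim
# Nelder-Mead uses only function values, no derivatives
res = optimize(f, x0, NelderMead())
Optim.minimizer(res), Optim.minimum(res)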

Gradient descent

Gradient descent is an iterative algorithm for finding a local minimum. As the name suggests, it descends toward a minimum by stepping in the direction opposite the gradient. Each iteration:

  1. Given current $x$, compute $x_{new} = x - \gamma Df_{x}$
  2. Adjust $\gamma$ depending on whether $f(x_{new})<f(x)$
  3. Repeat until $\norm{Df_{x}}$, $\norm{x-x_{new}}$, and/or $\abs{f(x)-f(x_{new})}$ small.
using ForwardDiff, LinearAlgebra
markdown && printfunc("graddescent","smooth_optimizers.jl")
"""
     graddescent(f, x0; grad=x->ForwardDiff.gradient(f,x),
                 γ0=1.0, ftol = 1e-6,
                 xtol = 1e-4, gtol=1e-6, maxiter = 1000, 
                 xrange=[-2., 3.],
                 yrange=[-2.,6.], animate=true)

Find the minimum of function `f` by gradient descent

# Arguments

- `f` function to minimize
- `x0` starting value
- `grad` function that computes gradient of `f`
- `γ0` initial step size multiplier
- `ftol` tolerance for function value
- `xtol` tolerance for x
- `gtol` tolerance for gradient. Convergence requires meeting all three tolerances.
- `maxiter` maximum iterations
- `xrange` x-axis range for animation
- `yrange` y-axis range for animation
- `animate` whether to create animation

# Returns

- `(fmin, xmin, iter, info, anim)` tuple consisting of minimal function
  value, minimizer, number of iterations, convergence info, and an animation
"""
function graddescent(f, x0; grad=x->ForwardDiff.gradient(f,x),
                     γ0=1.0, ftol = 1e-6,
                     xtol = 1e-4, gtol=1e-6, maxiter = 1000, 
                     xrange=[-2., 3.],
                     yrange=[-2.,6.], animate=true)
  fold = f(x0)
  xold = x0
  xchange=Inf
  fchange=Inf
  γ = γ0
  iter = 0
  stuck=0
  improve = 0 # we increase γ if 5 steps in a row improve f(x)

  animate = animate && (length(x0)==2)

  if animate
    # make a contour plot of the function we're minimizing. This is for
    # illustrating; you wouldn't have this normally
    c = contour(range(xrange[1],xrange[2], length=100),
                range(yrange[1],yrange[2], length=100),
                (x,y) -> log(f([x,y])))
    anim = Animation()
  else
    anim = nothing
  end
  g = grad(xold)

  while(iter < maxiter && ((xchange>xtol) || (fchange>ftol) || (stuck>0)
                           || norm(g)>gtol) )
    g = grad(xold)
    x = xold - γ*g
    fnew = f(x)

    if animate
      scatter!(c, [xold[1]],[xold[2]], markercolor=:red, legend=false,
               xlims=xrange, ylims=yrange) 
      quiver!(c, [xold[1]],[xold[2]], quiver=([-γ*g[1]],[-γ*g[2]]), legend=false,
              xlims=xrange, ylims=yrange)
      frame(anim)
    end

    if (fnew>=fold)
      γ*=0.5
      improve = 0     
      stuck += 1
      if (stuck>=10)
        break
      end
    else
      stuck = 0
      improve += 1
      if (improve>5)
        γ *= 2.0
        improve=0
      end
      # measure changes before updating, otherwise they are always zero
      xchange = norm(x-xold)
      fchange = abs(fnew-fold)
      xold = x
      fold = fnew
    end
    iter += 1
  end
  if (iter >= maxiter)
    info = "Maximum iterations reached"
  elseif (stuck>0)
    info = "Failed to improve for " * string(stuck) * " iterations."
  else
    info = "Convergence."
  end
  return(fold, xold, iter, info, anim) 
end
result = graddescent(f, x0)
gif(result[5], "graddescent.gif", fps=10);

gradient descent

Although gradient descent is an appealing and intuitive idea, the above example illustrates that it can perform surprisingly poorly in some cases. Nonetheless, gradient descent is useful for some problems. Notably, (stochastic) gradient descent is used to fit neural networks, where the dimension of $x$ is so large that computing the inverse hessian in (quasi-)Newton’s method is prohibitively time consuming.
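
To illustrate the idea, here is a minimal stochastic gradient descent sketch for least squares; each step uses the gradient of a single randomly chosen observation's squared error rather than the full sum (the function and its defaults are illustrative, not from any library):

using Random, LinearAlgebra
# minimize (1/N) * Σᵢ (yᵢ - xᵢ'β)² by stochastic gradient descent
function sgd_leastsquares(X, y; γ=0.01, epochs=100)
  N, K = size(X)
  β = zeros(K)
  for e in 1:epochs
    for i in shuffle(1:N)
      g = -2 .* X[i,:] .* (y[i] - dot(X[i,:], β)) # gradient of i-th term only
      β .-= γ .* g
    end
  end
  return β
end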

Newton’s method

Newton’s method and its variations are often the most efficient minimization algorithms. Newton’s method updates $x$ by minimizing a second order approximation to $f$. Specifically:

  1. Given $x$ set $x_{new} = x - (D^2f_x)^{-1} Df_x$
  2. Repeat until $\norm{Df_{x}}$, $\norm{x-x_{new}}$, and/or $\abs{f(x)-f(x_{new})}$ small.
markdown && printfunc("newton","smooth_optimizers.jl")
"""
    newton(f, x0; 
           grad=x->ForwardDiff.gradient(f,x),
           hess=x->ForwardDiff.hessian(f,x),
           ftol = 1e-6,
           xtol = 1e-4, gtol=1e-6, maxiter = 1000, 
           xrange=[-2., 3.],
           yrange=[-2.,6.], animate=true)

Find the minimum of function `f` by Newton's method.

# Arguments

- `f` function to minimize
- `x0` starting value
- `grad` function that returns gradient of `f`
- `hess` function that returns hessian of `f`
- `ftol` tolerance for function value
- `xtol` tolerance for x
- `gtol` tolerance for gradient. Convergence requires meeting all three tolerances.
- `maxiter` maximum iterations
- `xrange` x-axis range for animation
- `yrange` y-axis range for animation
- `animate` whether to create animation

# Returns

- `(fmin, xmin, iter, info, anim)` tuple consisting of minimal function
  value, minimizer, number of iterations, convergence info, and animation

"""
function newton(f, x0;
                grad=x->ForwardDiff.gradient(f,x),
                hess=x->ForwardDiff.hessian(f,x),
                ftol = 1e-6,
                xtol = 1e-4, gtol=1e-6, maxiter = 1000, 
                xrange=[-2., 3.],
                yrange=[-2.,6.], animate=true)
  fold = f(x0)
  xold = x0
  xchange=Inf
  fchange=Inf
  iter = 0
  stuck=0

  animate=animate && length(x0)==2

  if animate
    # make a contour plot of the function we're minimizing. This is for
    # illustrating; you wouldn't have this normally
    c = contour(range(xrange[1],xrange[2], length=100),
                range(yrange[1],yrange[2], length=100),
                (x,y) -> log(f([x,y])))
    anim = Animation()
  else
    anim = nothing
  end

  g = grad(xold)

  while(iter < maxiter && ((xchange>xtol) || (fchange>ftol) || (stuck>0)
                           || norm(g)>gtol) )
    g = grad(xold)
    H = hess(xold)
    Δx = -(H \ g) # solve H*Δx = -g rather than explicitly forming inv(H)
    x = xold + Δx
    fnew = f(x)
    step = 1.0
    while (fnew>=fold && step>xtol)
      step /= 1.618
      x = xold + Δx*step
      fnew = f(x)
    end

    if animate
      scatter!(c, [xold[1]],[xold[2]], markercolor=:red, legend=false,
               xlims=xrange, ylims=yrange) 
      quiver!(c, [xold[1]],[xold[2]], quiver=([Δx[1]],[Δx[2]]), legend=false,
              xlims=xrange, ylims=yrange)
      frame(anim)
    end

    if (fnew>=fold)
      stuck += 1
      if (stuck>=10)
        break
      end
    else
      stuck = 0
      # measure changes before updating, otherwise they are always zero
      xchange = norm(x-xold)
      fchange = abs(fnew-fold)
      xold = x
      fold = fnew
    end
    iter += 1
  end
  if (iter >= maxiter)
    info = "Maximum iterations reached"
  elseif (stuck>0)
    info = "Failed to improve for " * string(stuck) * " iterations."
  else
    info = "Convergence."
  end
  return(fold, xold, iter, info, anim) 
end
result = newton(f, x0)
gif(result[5], "newton.gif", fps=5);

newton

Newton’s method tends to take relatively few iterations to converge for well-behaved functions. It does have the disadvantage that the hessian and its inverse can be time consuming to compute, especially when the dimension of $x$ is large. Newton’s method can also be unstable for functions that are not well approximated by their second order expansion. This problem can be mitigated by combining Newton’s method with a line search or trust region.

Line searches consist of approximately minimizing $f$ along a given direction instead of updating $x$ with a fixed step size. For Newton’s method, instead of setting $x_{new} = x - (D^2f_x)^{-1} Df_x$, choose a scalar $\delta \approx \argmin_{\delta} f(x - \delta (D^2f_x)^{-1} Df_x)$ and set $x_{new} = x - \delta (D^2f_x)^{-1} Df_x$. This one-dimensional problem can be solved fairly quickly. Line search can also be combined with gradient descent.
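
As a concrete example of approximate minimization along a direction, here is a minimal backtracking line search sketch using an Armijo sufficient-decrease condition (the function and its default constants are illustrative):

# f: objective, x: current point, g: gradient at x,
# d: descent direction (e.g. d = -g or the Newton step)
function backtrack(f, x, d, g; δ=1.0, shrink=0.5, c=1e-4)
  fx = f(x)
  # shrink δ until the decrease is at least a fraction c of what the
  # linear approximation predicts
  while f(x + δ*d) > fx + c*δ*dot(g, d) && δ > 1e-12
    δ *= shrink
  end
  return δ
end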

Trust region

Instead of setting $x_{new}$ to the unconstrained minimizer of a local second order approximation, trust region methods introduce a region near $x$ where the approximation is trusted, and set

$$ x_{new} = \argmin_{\tilde{x} \in TR(x)} f(x) + Df_x (\tilde{x}-x) + \frac{1}{2} (\tilde{x}-x)^T D^2f_x (\tilde{x}-x). $$

Often $TR(x) = \{\tilde{x} : \norm{x - \tilde{x}} < r\}$. The radius of the trust region is then increased or decreased depending on $f(x_{new})$.
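
A sketch of how the radius update might look; here m is the local quadratic model used to compute the candidate step xnew, and the 0.25 and 0.75 thresholds are conventional but arbitrary choices:

# compare the actual decrease in f to the decrease predicted by the model m,
# then grow or shrink the trusted radius r
function trustregion_update(f, m, x, xnew, r)
  ρ = (f(x) - f(xnew)) / (m(x) - m(xnew))
  if ρ < 0.25
    r /= 2     # model predicted poorly: trust a smaller region
  elseif ρ > 0.75
    r *= 2     # model predicted well: trust a larger region
  end
  return (ρ > 0 ? xnew : x), r # reject the step if f did not decrease
end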

Quasi-Newton

Quasi-Newton methods (in particular the BFGS algorithm) are probably the most commonly used nonlinear optimization algorithms. Quasi-Newton methods are similar to Newton’s method, except instead of evaluating the hessian directly, they build an approximation to it from repeated evaluations of $Df_x$ at different $x$.
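
For reference, with $s = x_{new} - x$ and $y = Df_{x_{new}} - Df_x$, the BFGS update of the inverse hessian approximation $B$ is

$$ B_{new} = \left(I - \frac{s y^T}{y^T s}\right) B \left(I - \frac{y s^T}{y^T s}\right) + \frac{s s^T}{y^T s}, $$

which stays positive definite as long as $y^T s > 0$.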

Optim.jl contains all the algorithms mentioned above. Its advice on the choice of algorithm is worth following.
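
For instance, a BFGS call through Optim.jl might look like the following sketch; autodiff=:forward asks Optim to compute gradients with ForwardDiff:

using Optim
res = optimize(f, x0, BFGS(); autodiff = :forward)
Optim.minimizer(res), Optim.iterations(res)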

Details matter in practice

In each of the algorithms above, we were somewhat cavalier with details like how to adjust step sizes and trust regions and what it means to approximately minimize during a line search. In practice these details can be quite important for how long an algorithm takes and whether it succeeds or fails. Different implementations of algorithms have different details. Often the details can be adjusted through some options. It can be worthwhile to try multiple implementations and options to get the best performance.

Constrained optimization

Constrained optimization is a bit harder than unconstrained, but uses similar ideas. For simple bound constraints, like $x \geq 0$, it is often easiest to transform to an unconstrained problem by substituting $x = e^y$ and optimizing over $y = \log(x)$ instead.
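
Here is a sketch of the substitution using the newton function from above (the starting point is arbitrary):

# minimize f(x) subject to x .> 0 by substituting x = exp.(y)
g = y -> f(exp.(y))                # unconstrained problem in y
fmin, ymin = newton(g, log.([2.0, 3.0]), animate=false)
xmin = exp.(ymin)                  # transform back to x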

For problems with equality constraints, one can apply Newton’s method to the first order conditions.
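
That is, for $\min_x f(x)$ s.t. $h(x) = 0$, one solves the nonlinear system

$$ Df_x - \lambda^T Dh_x = 0, \qquad h(x) = 0 $$

for $(x,\lambda)$ by Newton’s method.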

The difficult case is when there are inequality constraints. Just like when solving analytically, the difficulty is figuring out which constraints bind and which do not. For inequality constraints, we will consider problems written in the form:

$$ \min_x f(x) \text{ s.t. } c(x) \geq 0. $$

Interior Point Methods

Interior point methods circumvent the problem of figuring out which constraints bind by approaching the optimum from the interior of the feasible set. To do this, the interior point method applies Newton’s method to a modified version of the first order conditions. The unmodified first order conditions can be written

$$ Df_x - \lambda^T Dc_x = 0, \qquad \lambda_i c_i(x) = 0 \text{ for each } i, \qquad \lambda \geq 0, \qquad c(x) \geq 0. $$

A difficulty with these conditions is that solving them can require guessing and checking which combinations of constraints bind and which do not. Interior point methods get around this problem by beginning with an interior $x$ and $\lambda$ such that $\lambda>0$ and $c(x)>0$. They are then updated by applying Newton’s method to the equations

$$ Df_x - \lambda^T Dc_x = 0, \qquad \lambda_i c_i(x) = \mu \text{ for each } i, $$

where there is now a $\mu$ in place of $0$ in the second equation. $x$ and $\lambda$ are updated according to Newton’s method for this system of equations. In particular, $x_{new} = x + \Delta_x$ and $\lambda_{new} = \lambda + \Delta_\lambda$, where

$$ \begin{pmatrix} D^2_x L(x,\lambda) & -Dc_x^T \\ \mathrm{diag}(\lambda) Dc_x & \mathrm{diag}(c(x)) \end{pmatrix} \begin{pmatrix} \Delta_x \\ \Delta_\lambda \end{pmatrix} = \begin{pmatrix} -D_x L(x,\lambda) \\ \mu - \mathrm{diag}(c(x)) \lambda \end{pmatrix}, $$

with $L(x,\lambda) = f(x) - \lambda^T c(x)$ denoting the Lagrangian. Over iterations, $\mu$ is gradually decreased toward $0$. Here is one simple implementation.

markdown && printfunc("interiorpoint","constrained_optimizers.jl")
"""
    interiorpoint(f, x0, c; 
                  L   = (x,λ)->(f(x) - dot(λ,c(x))),
                  ∇ₓL = (x,λ)->ForwardDiff.gradient(z->L(z,λ), x),
                  ∇²ₓL= (x,λ)->ForwardDiff.hessian(z->L(z,λ), x),
                  ∇c = x->ForwardDiff.jacobian(c,x),
                  tol=1e-4, maxiter = 1000,
                  μ0 = 1.0, μfactor = 0.2,
                  xrange=[-2., 3.],
                  yrange=[-2.,6.], animate=true)

Find the minimum of function `f` subject to `c(x) >= 0` using a
primal-dual interior point method.

# Arguments

- `f` function to minimize
- `x0` starting value. Must have c(x0) > 0
- `c` constraint function. Must return an array.
- `L   = (x,λ)->(f(x) - dot(λ,c(x)))` Lagrangian
- `∇ₓL = (x,λ)->ForwardDiff.gradient(z->L(z,λ), x)` Derivative of Lagrangian wrt `x`
- `∇²ₓL= (x,λ)->ForwardDiff.hessian(z->L(z,λ), x)` Hessian of Lagrangian wrt `x`
- `∇c = x->ForwardDiff.jacobian(c,x)` Jacobian of constraints
- `tol` convergence tolerance
- `μ0` initial μ
- `μfactor` how much to decrease μ by
- `xrange` range of x-axis for animation
- `yrange` range of y-axis for animation
- `animate` whether to create an animation (if true requires length(x)==2)
- `verbosity` higher values result in more printed output during search. 0 for no output, any number > 0 for some.

# Returns

- `(fmin, xmin, iter, info, anim)` tuple consisting of minimal function
  value, minimizer, number of iterations, convergence info, and animation

"""
function interiorpoint(f, x0, c;
                       L   = (x,λ)->(f(x) - dot(λ,c(x))),
                       ∇ₓL = (x,λ)->ForwardDiff.gradient(z->L(z,λ), x),
                       ∇²ₓL= (x,λ)->ForwardDiff.hessian(z->L(z,λ), x),
                       ∇c = x->ForwardDiff.jacobian(c,x),
                       tol=1e-4, maxiter = 1000,
                       μ0 = 1.0, μfactor = 0.2,
                       xrange=[-2., 3.],
                       yrange=[-2.,6.], animate=true, verbosity=0)
  fold = f(x0)
  xold = x0
  all(c(x0).>0) || error("interiorpoint requires a starting value that strictly satisfies all constraints")
  μ = μ0
  λ = μ./c(x0)
  xchange=Inf
  fchange=Inf
  iter = 0
  μiter = 0
  stuck=0

  animate = animate && length(x0)==2
  if animate
    # make a contour plot of the function we're minimizing. This is for
    # illustrating; you wouldn't have this normally
    ct = contour(range(xrange[1],xrange[2], length=100), 
                range(yrange[1],yrange[2], length=100),
                 (x,y) -> log(f([x,y])))
    plot!(ct, xrange, 2.5 .- xrange) # add constraint 
    anim = Animation()
  else
    anim = nothing
  end
  foc = [∇ₓL(xold,λ); λ.*c(xold)]
  while(iter < maxiter && ((xchange>tol) || (fchange>tol) || (stuck>0)
                           || norm(foc)>tol || μ>tol) )
    # Calculate the direction for updating x and λ
    Dc = ∇c(xold)
    cx = c(xold)
    foc = ∇ₓL(xold, λ)
    H = ∇²ₓL(xold,λ)
    # Newton step for the modified FOC; Diagonal(λ)*Dc handles any number of constraints
    Δ = [H  -Dc'; Diagonal(λ)*Dc  diagm(cx)] \ [-foc; μ .- cx.*λ]

    # Find a step size such that λ>=0 and c(x)>=0
    # The details here could surely be improved
    α = 1.0
    acceptedstep = false
    λold = copy(λ)
    x = copy(xold)
    while (α > 1e-10)
      x = xold + α*Δ[1:length(xold)]
      λ = λold + α*Δ[(length(xold)+1):length(Δ)]
      if (all(λ.>=0) && all(c(x).>=0))
        acceptedstep=true
        break
      end
      α *= 0.5
    end
    if !acceptedstep
      stuck = 1
      break
    end
    fnew = f(x)

    if (animate)
      scatter!(ct, [xold[1]],[xold[2]], markercolor=:red, legend=false,
               xlims=xrange, ylims=yrange) 
      quiver!(ct, [xold[1]],[xold[2]], quiver=([α*Δ[1]],[α*Δ[2]]), legend=false,
              xlims=xrange, ylims=yrange)
      frame(anim)
    end

    xchange = norm(x-xold)
    fchange = abs(fnew-fold)
    μiter += 1

    # update μ (the details here could also be improved)    
    foc = ∇ₓL(x,λ)
    if (μiter>10 || (norm(foc)< μ && λ'*c(x)<10*μ)) 
      μ *=  μfactor
      μiter = 0
    end

    xold = x
    fold = fnew
    if verbosity>0
      print("Iter $iter: f=$fnew, λ=$λ, c(x)=$(c(x)), μ=$μ, norm(foc)=$(nor
m(foc))\n")
    end
    iter += 1    
  end
  if (iter >= maxiter)
    info = "Maximum iterations reached"
  elseif (stuck>0)
    info = "Failed to find feasible step for " * string(stuck) * " iteratio
ns."
  else
    info = "Convergence."
  end
  return(fold, xold, iter, info, anim) 
end
f = banana(1.0,1.0)
x0 = [3.0, 0.0]
function constraint(x)
  [x[1] + x[2] - 2.5]
end
result = interiorpoint(f, x0, constraint; maxiter=100)
gif(result[5], "ip.gif", fps=5);

interior point

Optim.jl includes an interior point method. IPOPT is another popular implementation. As above, the details of the algorithm can be important in practice. It can be worthwhile to experiment with different methods for updating $\mu$, using a more sophisticated line search or trust region, and perhaps replacing the computation of the hessian with a quasi-Newton approximation.

It has been proven that interior point methods converge relatively quickly for convex optimization problems.

Sequential quadratic programming

Sequential quadratic programming relies on the fact that there are efficient methods to compute the solution to quadratic programs — optimization problems with quadratic objective functions and linear constraints. We can then solve a more general optimization problem by solving a sequence of quadratic programs that approximate the original problem.
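
A quadratic program has the form

$$ \min_x \frac{1}{2} x^T Q x + b^T x \text{ s.t. } Ax \geq d, $$

and can be solved efficiently when $Q$ is positive semi-definite.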

Sequential quadratic programming is like a constrained version of Newton’s method. Given a current $x$ and $\lambda$, the new $x$ solves

$$ x_{new} \in \argmin_{\tilde{x}} Df_x (\tilde{x}-x) + \frac{1}{2} (\tilde{x}-x)^T D^2_x L(x,\lambda) (\tilde{x}-x) \text{ s.t. } c(x) + Dc_x (\tilde{x}-x) \geq 0, $$

and the new $\lambda$ is set to the value of the multipliers for this problem.

This quadratic program (an optimization problem with a quadratic objective function and linear constraints) can be solved fairly efficiently if the hessian of the Lagrangian, $D^2_x L(x,\lambda) = D^2 f_x - D^2 (\lambda^T c)_x$, is positive semi-definite.

Info

Most convex program solvers are designed to accept semidefinite programs rather than quadratic programs. A quadratic program can be rewritten as a semidefinite program, and a solver such as SCS, ECOS, or Mosek can then be used. Fortunately, Convex.jl automatically takes care of any necessary transformation.

One could also incorporate a trust region or line search into the above algorithm. Here is one simple implementation.

using Convex, ECOS
markdown && printfunc("sequentialquadratic","constrained_optimizers.jl")
"""
      sequentialquadratic(f, x0, c; 
                          ∇f = x->ForwardDiff.gradient(f,x),
                          ∇c = x->ForwardDiff.jacobian(c,x),
                          L   = (x,λ)->(f(x) - dot(λ,c(x))),
                          ∇ₓL = (x,λ)->ForwardDiff.gradient(z->L(z,λ), x),
                          ∇²ₓL= (x,λ)->ForwardDiff.hessian(z->L(z,λ), x),
                          tol=1e-4, maxiter = 1000,
                          trustradius=1.0, xrange=[-2., 3.],
                          yrange=[-2.,6.], animate=true, verbosity=1)


Find the minimum of function `f` subject to `c(x) ≥ 0` by sequential quadratic programming.

# Arguments

- `f` function to minimize
- `x0` starting value. Must have c(x0) > 0
- `c` constraint function. Must return an array.
- `∇f = x->ForwardDiff.gradient(f,x)`
- `∇c = x->ForwardDiff.jacobian(c,x)` Jacobian of constraints
- `L   = (x,λ)->(f(x) - dot(λ,c(x)))` Lagrangian
- `∇ₓL = (x,λ)->ForwardDiff.gradient(z->L(z,λ), x)` Derivative of Lagrangian wrt `x`
- `∇²ₓL= (x,λ)->ForwardDiff.hessian(z->L(z,λ), x)` Hessian of Lagrangian wrt `x`
- `tol` convergence tolerance
- `maxiter` maximum iterations
- `trustradius` initial trust region radius
- `xrange` range of x-axis for animation
- `yrange` range of y-axis for animation
- `animate` whether to create an animation (if true requires length(x)==2)
- `verbosity` higher values result in more printed output during search. 0 for no output, any number > 0 for some.

# Returns

- `(fmin, xmin, iter, info, anim)` tuple consisting of minimal function
  value, minimizer, number of iterations, convergence info, and animation

"""
function sequentialquadratic(f, x0, c;
                             ∇f = x->ForwardDiff.gradient(f,x),
                             ∇c = x->ForwardDiff.jacobian(c,x),
                             L   = (x,λ)->(f(x) - dot(λ,c(x))),
                             ∇ₓL = (x,λ)->ForwardDiff.gradient(z->L(z,λ), x),
                             ∇²ₓL= (x,λ)->ForwardDiff.hessian(z->L(z,λ), x),
                             tol=1e-4, maxiter = 1000,
                             trustradius=1.0,
                             xrange=[-2., 3.],
                             yrange=[-2.,6.], animate=true, verbosity=1)
  fold = f(x0)
  xold = x0
  xchange=Inf
  fchange=Inf
  iter = 0
  μiter = 0
  stuck=0

  animate = animate && length(x0)==2
  if animate
    # make a contour plot of the function we're minimizing. This is for
    # illustrating; you wouldn't have this normally
    ct = contour(range(xrange[1],xrange[2], length=100), 
                range(yrange[1],yrange[2], length=100),
                 (x,y) -> log(f([x,y])))
    plot!(ct, xrange, 2.5 .- xrange) # add constraint 
    anim = Animation()
  else
    anim = nothing
  end
  Dc = ∇c(xold)
  Df = ∇f(xold)
  λ = (Dc*Dc') \ Dc*Df
  foc = ∇ₓL(xold,λ)
  fold  = f(xold)
  negsquared(x) = x < 0 ? x^2 : zero(x)
  merit(x) = f(x) + 1000.0*sum(negsquared.(c(x)))
  while(iter < maxiter && ((xchange>tol) || (fchange>tol) || (stuck>0)
                           || norm(foc)>tol) )
    Df = ∇f(xold)
    Dc = ∇c(xold)
    cx = c(xold)
    H = ∇²ₓL(xold,λ)

    # the QP solver will fail if H is not positive definite; if needed,
    # floor H's eigenvalues to make it positive definite
    if !(isposdef(H))
      v,Q = eigen(H)
      Hp = Symmetric(Q*Diagonal(max.(v,one(eltype(v))*1e-8))*Q')
    else
      Hp = H
    end
    # set up and solve our QP
    Δ = Variable(length(xold))
    problem = minimize(Df'*Δ + 0.5*quadform(Δ,Hp), [cx + Dc*Δ >= 0; norm(Δ)<=trustradius])
    solve!(problem, ECOS.Optimizer(verbose=verbosity))
    λ .= problem.constraints[1].dual
    xnew = xold .+ Δ.value

    if (animate)
      scatter!(ct, [xold[1]],[xold[2]], markercolor=:red, legend=false,
               xlims=xrange, ylims=yrange) 
      quiver!(ct, [xold[1]],[xold[2]], quiver=([Δ.value[1]],[Δ.value[2]]), legend=false,
              xlims=xrange, ylims=yrange)
      frame(anim)
    end


    # decide whether to accept new point and whether to adjust trust region
    if (merit(xnew) < merit(xold))
      stuck = 0
      foc = [∇ₓL(xold,λ); λ.*c(xold)]
      if (problem.constraints[2].dual>1e-4) # trust region binding
        trustradius *= 3/2
      end
    else
      stuck += 1
      trustradius *= 2/3
      if (stuck>=20)
        break
      end
    end


    xchange = norm(xnew-xold)
    fchange = abs(f(xnew)-f(xold))
    xold = xnew

    if verbosity>0
      print("Iter $iter: f=$(f(xold)), λ=$λ, c(x)=$(c(xold)), TR=$trustradi
us, norm(foc)=$(norm(foc))\n")
    end
    iter += 1    
  end
  if (iter >= maxiter)
    info = "Maximum iterations reached"
  elseif (stuck>0)
    info = "Failed to find feasible step for " * string(stuck) * " iteratio
ns."
  else
    info = "Convergence."
  end
  return(f(xold), xold, iter, info, anim) 
end
x0 = [0.0, 6.0]
result = sequentialquadratic(f, x0, constraint; maxiter=100, verbosity=0);
gif(result[5], "sqp.gif", fps=5);

sqp

Compared to interior point methods, sequential quadratic programming has the advantage of not needing a feasible point to begin, and it often takes fewer iterations. Like Newton’s method, sequential quadratic programming has local quadratic convergence. A downside is that solving the quadratic program at each step can take considerably longer than solving the system of linear equations that interior point and Newton methods require.

SLQP active set

SLQP active set methods use a linear approximation to the optimization problem to decide which constraints are “active” (binding). In each iteration, a linear approximation to the original problem is solved first. The constraints that bind in the linear approximation are then assumed to bind in the full problem, and an equality constrained quadratic program determines the next step.

See Byrd and Waltz (2011) for more details and an extension of the SLQP method.

markdown && printfunc("slqp","constrained_optimizers.jl")
"""
      slqp(f, x0, c; 
           ∇f = x->ForwardDiff.gradient(f,x),
           ∇c = x->ForwardDiff.jacobian(c,x),
           L   = (x,λ)->(f(x) - dot(λ,c(x))),
           ∇ₓL = (x,λ)->ForwardDiff.gradient(z->L(z,λ), x),
           ∇²ₓL= (x,λ)->ForwardDiff.hessian(z->L(z,λ), x),
           tol=1e-4, maxiter = 1000,
           trustradius=1.0, xrange=[-2., 3.],
           yrange=[-2.,6.], animate=true, verbosity=1)


Find the minimum of function `f` subject to `c(x) ≥ 0` by sequential
linear quadratic programming. 

See
https://en.wikipedia.org/wiki/Sequential_linear-quadratic_programming
for algorithm information.

# Arguments

- `f` function to minimize
- `x0` starting value. Must have c(x0) > 0
- `c` constraint function. Must return an array.
- `∇f = x->ForwardDiff.gradient(f,x)`
- `∇c = x->ForwardDiff.jacobian(c,x)` Jacobian of constraints
- `L   = (x,λ)->(f(x) - dot(λ,c(x)))` Lagrangian
- `∇ₓL = (x,λ)->ForwardDiff.gradient(z->L(z,λ), x)` Derivative of Lagrangian wrt `x`
- `∇²ₓL= (x,λ)->ForwardDiff.hessian(z->L(z,λ), x)` Hessian of Lagrangian wrt `x`
- `tol` convergence tolerance
- `maxiter` maximum iterations
- `trustradius` initial trust region radius
- `xrange` range of x-axis for animation
- `yrange` range of y-axis for animation
- `animate` whether to create an animation (if true requires length(x)==2)
- `verbosity` higher values result in more printed output during search. 0 for no output, any number > 0 for some.

# Returns

- `(fmin, xmin, iter, info, anim)` tuple consisting of minimal function
  value, minimizer, number of iterations, convergence info, and animation

"""
function slqp(f, x0, c;
              ∇f = x->ForwardDiff.gradient(f,x),
              ∇c = x->ForwardDiff.jacobian(c,x),
              L   = (x,λ)->(f(x) - dot(λ,c(x))),
              ∇ₓL = (x,λ)->ForwardDiff.gradient(z->L(z,λ), x),
              ∇²ₓL= (x,λ)->ForwardDiff.hessian(z->L(z,λ), x),
              tol=1e-4, maxiter = 1000,
              trustradius=1.0,
              xrange=[-2., 3.],
              yrange=[-2.,6.], animate=true, verbosity=1)
  fold = f(x0)
  xold = x0
  xchange=Inf
  fchange=Inf
  iter = 0
  μiter = 0
  stuck=0
  lptrustradius = trustradius

  animate = animate && length(x0)==2
  if animate
    # make a contour plot of the function we're minimizing. This is for
    # illustrating; you wouldn't have this normally
    ct = contour(range(xrange[1],xrange[2], length=100), 
                range(yrange[1],yrange[2], length=100),
                 (x,y) -> log(f([x,y])))
    plot!(ct, xrange, 2.5 .- xrange) # add constraint 
    anim = Animation()
  else
    anim = nothing
  end
  Dc = ∇c(xold)
  Df = ∇f(xold)
  cx = c(xold)
  λ = (Dc*Dc') \ Dc*Df
  foc = [∇ₓL(xold,λ); λ.*cx]
  fold  = f(xold)
  negsquared(x) = x < 0 ? x^2 : zero(x)
  merit(x) = f(x) + sum(negsquared.(c(x)))
  while(iter < maxiter && ((xchange>tol) || (fchange>tol) || (stuck>0)
                           || (norm(foc)>tol)) )
    Df = ∇f(xold)
    Dc = ∇c(xold)
    cx = c(xold)
    H = ∇²ₓL(xold,λ)

    # set up and solve linear program for finding active set (binding
    # constraints)
    Δ = Variable(length(xold))
    problem = minimize(Df'*Δ, [cx + Dc*Δ >= 0,
                               Δ <= lptrustradius,
                               -lptrustradius <= Δ])
    solve!(problem, ECOS.Optimizer(verbose=verbosity))
    while (Int(problem.status)!=1 && lptrustradius <=1e4)
      lptrustradius *= 2
      problem = minimize(Df'*Δ, [cx + Dc*Δ >= 0,
                               Δ <= lptrustradius,
                               -lptrustradius <= Δ])
      solve!(problem, ECOS.Optimizer(verbose=verbosity))
    end

    if isa(problem.constraints[1].dual, AbstractArray)
      active = vec(problem.constraints[1].dual .> tol)
    else
      active = [problem.constraints[1].dual .> tol]
    end

    # set up and solve our QP
    Δ = Variable(length(xold))
    if !(isposdef(H))
      Hp = zero(H) # drop the quadratic term if H is not positive definite
    else
      Hp = H
    end
    if any(active) 
      problem = minimize(Df'*Δ + 0.5*quadform(Δ,Hp), [cx[active] + Dc[active,:]*Δ >= 0, norm(Δ)<=trustradius])
      solve!(problem, ECOS.Optimizer(verbose=verbosity))
      λ[active] .= problem.constraints[1].dual
      λ[.!active] .= zero(eltype(λ))
    else
      problem = minimize(Df'*Δ + 0.5*quadform(Δ,Hp), [norm(Δ)<=trustradius])
      solve!(problem, ECOS.Optimizer(verbose=verbosity))
      λ = zero(cx)
    end
    xnew = xold .+ Δ.value

    if (animate)
      scatter!(ct, [xold[1]],[xold[2]], markercolor=:red, legend=false,
               xlims=xrange, ylims=yrange) 
      quiver!(ct, [xold[1]],[xold[2]], quiver=([Δ.value[1]],[Δ.value[2]]), legend=false,
              xlims=xrange, ylims=yrange)
      frame(anim)
    end


    # decide whether to accept new point and whether to adjust trust region
    if (merit(xnew)  < merit(xold))
      stuck = 0
      foc = [∇ₓL(xold,λ); λ.*c(xold)]
      if (problem.constraints[end].dual>1e-4) # trust region binding
        trustradius *= 3/2
      end
    else
      stuck += 1
      trustradius *= 2/3
      if (stuck>=20)
        break
      end
    end

    xchange = norm(xnew-xold)
    fchange = abs(f(xnew)-f(xold))

    xold = xnew

    if verbosity>0
      print("Iter $iter: f=$(f(xold)), λ=$λ, c(x)=$(c(xold)),TR=$trustradiu
s, norm(foc)=$(norm(foc)), $xchange , $fchange, $stuck\n")
    end
    iter += 1    
  end
  if (iter >= maxiter)
    info = "Maximum iterations reached"
  elseif (stuck>0)
    info = "Failed to find feasible step for " * string(stuck) * " iteratio
ns."
  else
    info = "Convergence."
  end
  return(f(xold), xold, iter, info, anim) 
end
x0 = [-1.0, 6.0]
result = slqp(f, x0, constraint; maxiter=100, verbosity=0);
gif(result[5], "slqp.gif", fps=5);

slqp

Augmented Lagrangian

Augmented Lagrangian methods convert a constrained minimization problem to an unconstrained problem by adding to the Lagrangian a penalty that increases with the constraint violation.
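
For example, with equality constraints $c(x) = 0$, a standard augmented Lagrangian is

$$ L_A(x, \lambda; \rho) = f(x) - \lambda^T c(x) + \frac{\rho}{2} \norm{c(x)}^2. $$

The method alternates between approximately minimizing $L_A$ over $x$ and updating $\lambda_{new} = \lambda - \rho c(x)$, increasing $\rho$ as needed.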

Barrier methods

Barrier methods refer to adding a penalty that increases toward $\infty$ as the constraints get close to being violated (such as $-\log(c(x))$). Barrier methods are closely related to interior point methods. Applying Newton’s method to a log-barrier penalized problem gives rise to something very similar to our interiorpoint algorithm above.
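
Here is a minimal log-barrier sketch built on the newton function from above. It assumes the Newton iterates stay strictly feasible; a careful implementation would safeguard the steps:

# minimize f subject to c(x) .≥ 0 via a log barrier, decreasing μ toward 0
function barriermethod(f, c, x0; μ=1.0, μfactor=0.2, tol=1e-6)
  x = x0
  while μ > tol
    fμ = z -> f(z) - μ*sum(log.(c(z))) # → ∞ as any constraint approaches 0
    _, x = newton(fμ, x, animate=false)
    μ *= μfactor
  end
  return x
end
# e.g. barriermethod(f, constraint, [3.0, 0.0]); note c([3.0, 0.0]) > 0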

Strategies for global optimization

The above algorithms will all converge to local minima. Finding a global minimum is generally very hard. There are a few algorithms that have been proven to converge to a global optimum, such as DIRECT-L in NLopt. However, these algorithms are prohibitively time-consuming for even moderate size problems.

Randomization is a good strategy for avoiding local minima. Some algorithms with randomization, like simulated annealing, can be shown to converge to the global optimum with high probability. In practice, these are also often too inefficient for moderate size problems.

A good practical approach is to use an algorithm that combines randomization with some model-based search. A common approach is to use a variant of Newton’s method starting from a bunch of randomly chosen initial values.
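
Here is a sketch of that multistart idea using the newton function from above (the sampling bounds are arbitrary):

using Random
# run Newton's method from many random starting values and keep the best
function multistart(f, nstarts; lb=[-2.0, -2.0], ub=[3.0, 6.0])
  bestf, bestx = Inf, lb
  for _ in 1:nstarts
    x0 = lb .+ rand(length(lb)) .* (ub .- lb)
    fmin, xmin = newton(f, x0, animate=false)
    if fmin < bestf
      bestf, bestx = fmin, xmin
    end
  end
  return bestf, bestx
end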

Algorithms that combine a linear or quadratic approximation to the objective function with some randomness in the search direction can also be useful. An example is stochastic gradient descent, which is often used to fit neural networks. I have had good experience with CMA-ES. It worked well to estimate the finite mixture model in EFS (2015) [1].

Bayesian methods can also be used for optimization and will naturally include some randomization in their search. Hamiltonian Monte-Carlo methods, which incorporate gradient information in their search, are likely to be efficient. See DynamicHMC.jl.

References


  1. Liran Einav, Amy Finkelstein, and Paul Schrimpf. The Response of Drug Expenditure to Nonlinear Contract Design: Evidence from Medicare Part D. The Quarterly Journal of Economics, 130(2):841–899, 2015. URL: https://doi.org/10.1093/qje/qjv005, doi:10.1093/qje/qjv005