This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
About this document¶
This document was created using Weave.jl. The code is available on github. The same source document generates both the static webpage and an associated Jupyter notebook.
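For example, a minimal sketch of how both outputs can be produced from one source file (the file name ml-julia.jmd is taken from an error message further below; the options actually used may differ):
using Weave
weave("ml-julia.jmd", doctype="md2html")   # render a static HTML page
notebook("ml-julia.jmd")                   # convert and run the same file as a Jupyter notebook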
Introduction¶
This document is a companion to my “Machine learning in economics” notes. Those notes discuss the recent use of machine learning in economics, with a focus on lasso and random forests. The code in those notes is written in R. This document looks at similar code in Julia.
RCall¶
If you want to use the methods of Chernozhukov and coauthors implemented in the R package @hdm, or the methods of Athey and coauthors implemented in the R package @grf, then it makes sense to use those R packages. You could simply write all your code in R. However, if you prefer using Julia, you can call the necessary R functions with RCall.jl.
Here, we load the pipeline data used in the machine learning methods notes, and do some cleaning in Julia.
using RCall, DataFrames, Missings, Statistics
R"load(paste($(docdir),\"/rmd/pipelines.Rdata\",sep=\"\"))"
println(R"ls()")
data = @rget data # data on left is new Julia variable, data on right is the one in R
println(R"summary(data[,1:5])")
println(describe(data[:,1:5]))
for c in 59:107 # columns of state mileage, want missing->0
replace!(x->(ismissing(x) || isnan(x)) ? 0.0 : x, data[!,c])
end
println(describe(data[:,59:65]))
Error: InitError: Try adding /usr/lib64/R/lib to the "LD_LIBRARY_PATH" environmental variable and restarting Julia.
during initialization of module RCall
Suppose we want to estimate the coefficient on transPlant (capital) in a partially linear model with transProfit (profit) as the outcome. This can be done with the R function hdm::rlassoEffects.
R"library(hdm)"
completedata = dropmissing(data,[1:10..., 59:122...], disallowmissing=true)
y = completedata[!,:transProfit]
inc = .!isnan.(y)
y = y[inc]
X = completedata[inc,[6:7..., 59:121...]]
cols = [std(X[!,c])>0 for c in 1:ncol(X)]
X = X[:,cols]
est = R"rlassoEffects($(X), $(y), index=c(1:2))"
R"summary($est)"
Error: LoadError: UndefVarError: @R_str not defined
in expression starting at /home/paul/.julia/dev/NeuralNetworkEconomics/docs/jmd/ml-julia.jmd:2
MLJ.jl¶
MLJ.jl is a machine learning framework for Julia. It gives a unified interface to many machine learning algorithms and tasks. Similar R packages include caret and mlr; scikit-learn is a similar Python package.
For more information on MLJ, see its documentation.
You can see a list of models registered to work with MLJ.jl on github, or by calling MLJ.models().
using MLJ
models()
200-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstr
act_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hy
perparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_
methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteratio
n_parameter, :load_path, :package_license, :package_url, :package_uuid, :pr
edict_scitype, :prediction_type, :reporting_operations, :reports_feature_im
portances, :supports_class_weights, :supports_online, :supports_training_lo
sses, :supports_weights, :transform_scitype, :input_scitype, :target_scityp
e, :output_scitype)}}:
(name = ABODDetector, package_name = OutlierDetectionNeighbors, ... )
(name = ABODDetector, package_name = OutlierDetectionPython, ... )
(name = AEDetector, package_name = OutlierDetectionNetworks, ... )
(name = ARDRegressor, package_name = ScikitLearn, ... )
(name = AdaBoostClassifier, package_name = ScikitLearn, ... )
(name = AdaBoostRegressor, package_name = ScikitLearn, ... )
(name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
(name = AffinityPropagation, package_name = ScikitLearn, ... )
(name = AgglomerativeClustering, package_name = ScikitLearn, ... )
(name = BM25Transformer, package_name = MLJText, ... )
⋮
(name = TheilSenRegressor, package_name = ScikitLearn, ... )
(name = UnivariateBoxCoxTransformer, package_name = MLJModels, ... )
(name = UnivariateDiscretizer, package_name = MLJModels, ... )
(name = UnivariateFillImputer, package_name = MLJModels, ... )
(name = UnivariateStandardizer, package_name = MLJModels, ... )
(name = UnivariateTimeTypeToContinuous, package_name = MLJModels, ... )
(name = XGBoostClassifier, package_name = XGBoost, ... )
(name = XGBoostCount, package_name = XGBoost, ... )
(name = XGBoostRegressor, package_name = XGBoost, ... )
To use these models, you need the corresponding package to be installed and loaded. The @load macro will load the needed package(s) for any model.
Lasso = @load LassoRegressor pkg=MLJLinearModels
import MLJLinearModels ✔
MLJLinearModels.LassoRegressor
Let’s fit lasso to the same pipeline data as above.
lasso = machine(Lasso(lambda=1.0), X, y)
train,test = partition(eachindex(y), 0.6, shuffle=true)
fit!(lasso, rows=train)
yhat = predict(lasso, rows=test)
println(yhat[1:10])
println("MSE/var(y) = $(mean((y[test].-yhat).^2)/var(y[test]))")
Error: UndefVarError: X not defined
That doesn’t look very good. All the predictions are zero. This could happen when the regularization parameter, lambda, is too large. However, in this case the problem is something else. The warning messages indicate numeric problems when minimizing the lasso objective function. This can happen when X is poorly scaled. The algorithm used to compute the lasso estimates works best when the coefficients are all roughly the same scale. The existing X’s have wildly different scales, which causes problems. This situation is common, so MLJ.jl has functions to standardize variables. It is likely that the hdm package in R does something similar internally.
lasso_stdx = Pipeline(Standardizer(),
Lasso(lambda=1.0*std(y[train]),
solver=MLJLinearModels.ISTA(max_iter=10000))
)
m = machine(lasso_stdx, X, y)
fit!(m, rows=train, force=true)
yhat = predict(m , rows=test)
println("MSE/var(y) = $(mean((y[test].-yhat).^2)/var(y[test]))")
# Get the fitted coefficients
coef = fitted_params(m).lasso_regressor.coefs
intercept = fitted_params(m).lasso_regressor.intercept
sum(abs(c[2])>1e-8 for c in coef) # number non-zero
Error: UndefVarError: y not defined
If we want to tune lambda using cross-validation, we can use the range and TunedModel functions.
r = range(lasso_stdx, :(lasso_regressor.lambda), lower=1e1, upper=1e10, scale=:log)
t=TunedModel(model=lasso_stdx,
resampling=CV(nfolds=5),
tuning=Grid(resolution=10),
range=r,
measure=rms)
m = machine(t, X, y)
fit!(m, rows=train, verbosity=1)
yhat = predict(m , rows=test)
println("MSE/var(y) = $(mean((y[test].-yhat).^2)/var(y[test]))")
Error: UndefVarError: lasso_stdx not defined
using Plots
cvmse = m.report.plotting.measurements
λ = Float64.(m.report.plotting.parameter_values[:])
s = sortperm(λ)
plot(λ[s], cvmse[s], xlab="λ", ylab="CV(MSE)")
Error: UndefVarError: m not defined
Flux.jl¶
Flux.jl is another Julia package for machine learning. It seems to be emerging as the leading Julia package for neural networks and deep learning, but other machine learning models can also be implemented using Flux.jl.
Let’s create a lasso model in Flux.jl.
using Flux, LinearAlgebra
# Scale the variables
Xstd = Flux.normalise(Matrix(X), dims=1)
X_train = Xstd[train,:]
X_test = Xstd[test,:]
yscale = std(y[train])
ymean = mean(y[train])
ystd = (y .- ymean)./yscale
y_train = ystd[train]
y_test = ystd[test]
# Set up the model parameters and initial values
βols = (X_train'*X_train) \ (X_train'*(y_train .- mean(y_train)))
β = zeros(ncol(X))
b = [mean(y_train)]
# Define the loss function
ψ = ones(length(β))
λ = 2.0
pred(x) = b .+ x*β
mse(x,y) = mean( (pred(x) .- y).^2 )
penalty(y) = λ/length(y)*norm(ψ.*β,1)
loss(x,y) = mse(x,y) + penalty(y)
@show loss(X_train,y_train)
# minimize loss
maxiter=2000
obj = zeros(maxiter)
mse_train = zeros(maxiter)
mse_test = zeros(maxiter)
opt = Flux.AMSGrad()
for i in 1:maxiter
Flux.train!(loss, Flux.params(β, b), [(X_train, y_train)], opt)
mse_train[i] = mse(X_train,y_train)
mse_test[i] = mse(X_test, y_test)
obj[i] = loss(X_train,y_train)
end
lo = 1
hi = 2000
plot(obj[lo:hi], ylab="Loss=MSE + λ/n*||β||₁", xlab="Iteration")
Error: UndefVarError: X not defined
plot(lo:hi, [mse_train[lo:hi] mse_test[lo:hi]], ylab="MSE", xaxis=("Iteration"), lab=["Train" "Test"])
Error: UndefVarError: lo not defined
The minimization methods in Flux.train! are all variants of gradient descent. Each call to Flux.train! runs one iteration of the specified solver. To find a local minimum, Flux.train! can be called repeatedly until progress stops. The above loop is a simple way to do this. The @epochs macro can also be useful.
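For instance, the explicit loop above could be written as follows (a sketch only, not part of the original code; @epochs ships with the older Flux releases that also provide the implicit-parameter train! used here, and it does not record the per-iteration losses):
using Flux: @epochs
@epochs 2000 Flux.train!(loss, Flux.params(β, b), [(X_train, y_train)], opt)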
Gradient descent works well for neural networks, but it is not ideal for lasso. Without further adjustment, gradient descent gets stuck in a cycle as it jumps from one side of the absolute value in the lasso penalty to the other. Nonetheless, the results are near the true minimum, even though they never exactly reach it.
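One standard fix is a proximal gradient method such as ISTA, which handles the ℓ1 penalty exactly through soft-thresholding instead of differentiating it. The sketch below is not part of the original example; it reuses X_train, y_train, λ, and ψ from above, and the function name and step-size choice are my own.
# Proximal gradient (ISTA) for the lasso objective used above:
# loss(β, b) = mean((Xβ .+ b .- y).^2) + λ/n * ||ψ .* β||₁
function ista_lasso(X, y, λ, ψ; iters=2000)
    n = length(y)
    β = zeros(size(X, 2))
    b = mean(y)
    η = n / (2 * opnorm(X' * X))                 # step size ≈ 1/Lipschitz constant of the smooth part
    soft(z, a) = sign(z) * max(abs(z) - a, 0.0)  # soft-thresholding: prox of a*|⋅|
    for _ in 1:iters
        r = X * β .+ b .- y
        β = soft.(β .- η .* (2 / n) .* (X' * r), η * λ / n .* ψ)  # prox step treats |β| exactly
        b -= η * (2 / n) * sum(r)
    end
    return β, b
end
β_ista, b_ista = ista_lasso(X_train, y_train, λ, ψ)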
Lux.jl¶
A promising alternative to Flux.jl is Lux.jl.1 Lux.jl and Flux.jl share many features and backend code. Lux.jl has a more function-focused interface with explicit parameter passing, which is a more “Julian” style of programming.2
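As a quick illustration of what explicit parameter passing means (a toy sketch, separate from the example below; the layer size is arbitrary): in Lux the model object is just a description, and its parameters and state are created by Lux.setup and passed to every call.
import Lux, Random
rng = Random.default_rng()
layer = Lux.Dense(3, 1)                 # the layer itself holds no parameters
ps, st = Lux.setup(rng, layer)          # parameters and state are created separately
x = randn(rng, Float32, 3, 5)           # 3 features, 5 observations
yhat, st = layer(x, ps, st)             # each call takes and returns them explicitly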
For comparison, let’s implement the same Lasso model in Lux.
import Lux # Lux shares many function names with Flux, so import instead of using to avoid confusion
import Random, Zygote
using Test
# Seeding
rng = Random.default_rng()
Random.seed!(rng, 0)
# define the model
X_train = Matrix(X)[train,:]
X_test = Matrix(X)[test,:]
y_train = y[train]
y_test = y[test]
function standardizer(xtrain)
m = mean(xtrain, dims=1)
s = std(xtrain, dims=1)
(x->(x .- m)./s , xs->xs.*s .+ m)
end
stdizex, _ = standardizer(X_train)
stdizey, unstdizey = standardizer(y_train)
@test unstdizey(stdizey(y_test)) ≈ y_test
ys = stdizey(y_train)
Xs = stdizex(X_train)
# Set up the model parameters and initial values
βols = (Xs'*Xs) \ (Xs'*(ys .- mean(ys)))
b = [mean(ys)]
m = Lux.Chain(X->stdizex(X)',
Lux.Dense(size(X_train,2), 1, init_weight=zeros, init_bias=zeros)
)
ps, st = Lux.setup(rng, m)
ps.layer_2.weight .= βols'
ps.layer_2.bias .= b
mse(m, ps, st, X, y) = mean(abs2, m(X, ps, st)[1]' .- stdizey(y))
mseraw(m,ps,st,X,y) = mean(abs2, unstdizey(m(X, ps, st)[1]') .- y)
ℓ = let λ = 2.0, ψ = ones(size(βols')), st=st
penalty(ps,y) = λ/length(y)*norm(ψ.*ps.layer_2.weight,1)
loss(ps, X, y, m) = mse(m, ps, st, X, y) + penalty(ps,y)
end
@show ℓ(ps,X_train,y_train,m)
# minimize loss
opt = Lux.Optimisers.AMSGrad() #ADAM(0.001)
optstate = Lux.Optimisers.setup(opt, ps)
maxiter=2000
obj = zeros(maxiter)
mse_train = zeros(maxiter)
mse_test = zeros(maxiter)
for i in 1:maxiter
# Compute the gradient
gs = Zygote.gradient(ps->ℓ(ps, X_train, y_train, m), ps)[1]
# Perform parameter update
optstate, ps = Lux.Optimisers.update(optstate, ps, gs)
mse_train[i] = mse(m, ps, st, X_train,y_train)
mse_test[i] = mse(m, ps, st, X_test, y_test)
obj[i] = ℓ(ps,X_train,y_train,m)
end
lo = 1
hi = 250
plot(lo:hi,obj[lo:hi], ylab="Loss=MSE + λ/n*||β||₁", xlab="Iteration")
Error: UndefVarError: X not defined
Additional Resources¶
- @klok2019 Statistics with Julia: Fundamentals for Data Science, Machine Learning and Artificial Intelligence
References¶
-
These notes were originally written before Lux.jl existed. If I were starting over, I would use Lux.jl instead of Flux.jl. ↩
-
Flux.jl drew inspiration for its interface from TensorFlow and PyTorch. Implicit parameters make some sense in an object-oriented language like Python, but they are not the most natural style for Julia. ↩