This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

About this document

This document was created using Weave.jl. The code is available on GitHub. The same source generates both the static webpage and an associated Jupyter notebook.

Introduction

Transformers have become the leading architecture for language-related tasks. They are also being applied to other domains, such as images.

Transformers were developed to overcome some of the downsides of recurrent networks. We briefly discussed the vanishing and exploding gradient problems. Recurrent networks also have the practical computational downside of being difficult to parallelize. Transformers were designed to be easy to parallelize while retaining some ability to represent short- and long-run dependencies in sequential data.

Transformers encode sequential text data into numeric features in a learned manner. The resulting encoding preserves sequential information and can be readily parallelized.

The @vaswani2017 paper that popularized transformers was about a translation task, and many introductory references about transformers focus on this setting (such as the illustrated transformer).

Translation tackles the following setting: given a whole text (usually a sentence) in one language, $z_0, …, z_T$, and a partial translation in another language, $x_0, …, x_t$, the goal is to predict the next word, $x_{t+1}$. Transformers are also used for generative tasks: given $x_0, …, x_t$, predict $x_{t+1}$. We will focus on a generative transformer since it is simpler and seems more relevant to economics.

Transformer

Transformers create a mapping from $(x_0, \ldots, x_t)$ to $\tilde{x}_t$, where $\tilde{x}_t$ is meant to contain all the information relevant for predicting $x_{t+1}$. Moreover, the same mapping can be applied to all $t$ in parallel. This mapping consists of the following layers.

Embedding

Each $x_t \in X \subseteq \R^K$ typically lies in a high-dimensional space. For text, $x_t$ is a vector of indicator variables representing which token is the $t$th token in the sequence. These tokens could be characters or, more commonly, words. In either case, the dimension of $x_t$, $K$, is in the hundreds or thousands. Therefore, $x_t$ is usually embedded into a lower-dimensional space,
$$
x_t^e = W_e x_t
$$
where $W_e: \R^K \to \R^d$ is linear.
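
To make this concrete, here is a minimal sketch of the embedding step in Julia. The values of `K`, `d`, and `tokens`, and the random `We`, are purely illustrative; in practice $W_e$ is learned along with the rest of the model.

using Flux  # for onehotbatch

K = 101                             # vocabulary size, e.g. the number of distinct characters
d = 16                              # embedding dimension
We = randn(Float32, d, K)           # the embedding matrix W_e (learned during training)

tokens = [3, 17, 42]                # indices of x_0, x_1, x_2 in the vocabulary
X  = Flux.onehotbatch(tokens, 1:K)  # K × 3 matrix whose columns are one-hot vectors
Xe = We * X                         # d × 3 matrix whose columns are the embeddings x_t^e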

Positional Encoding

With the exception of this layer, the entire transformer is a symmetric function of $(x_0, …, x_t)$: it ignores order. Positional encoding adds position information to $x_t^e$. This could be done by simply adding a coordinate containing e.g. $t/T$, but is most often done (following @vaswani2017) by adding sinusoids of varying frequencies,
$$
x_{t,2i}^{pe} = x_{t,2i}^{e} + \sin\left(t/10000^{2i/d}\right), \qquad
x_{t,2i+1}^{pe} = x_{t,2i+1}^{e} + \cos\left(t/10000^{2i/d}\right)
$$
for $i = 0, \ldots, d/2 - 1$. The motivation was that this positional encoding better represents intervals and offsets between positions.
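
As a rough sketch of the sinusoidal encoding (not the implementation used by Transformers.jl; `d` and `T` are illustrative, and `Xe` refers to the embedded sequence from the previous sketch):

d, T = 16, 100                      # embedding dimension and sequence length (illustrative)
pe = zeros(Float32, d, T)
for t in 1:T, i in 1:(d ÷ 2)
  freq = 1 / 10000^(2(i-1)/d)       # frequency for the i-th pair of coordinates
  pe[2i-1, t] = sin((t-1)*freq)     # even coordinates (0-indexed) use sine; positions are 0-indexed, hence t-1
  pe[2i,   t] = cos((t-1)*freq)     # odd coordinates use cosine
end
# Xpe = Xe .+ pe[:, 1:size(Xe, 2)]  # the encoding is added elementwise to the embeddings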

Encoder

The $x_t^{pe}$ are now further transformed to incorporate information from other $x_s^{pe}$. This is done through multiple attention layers. To describe attention layers, let $x_t^{A,0} = x_t^{pe}$. An attention layer consists of:

(Masked) Self-Attention

$$
z_{0,t}^{A,\ell} = \sum_{j=0}^{t} \frac{\exp\left( (Q_\ell x_t^{A,\ell-1})'(K_\ell x_j^{A,\ell-1})/\sqrt{m} \right)}{\sum_{s=0}^{t} \exp\left( (Q_\ell x_t^{A,\ell-1})'(K_\ell x_s^{A,\ell-1})/\sqrt{m} \right)} V_\ell x_j^{A,\ell-1}
$$
where $Q_\ell$, $K_\ell$, and $V_\ell$ are all $m \times d$ matrices. These are often referred to as the query, key, and value transformations, respectively. The idea is that the query and key matrices determine how relevant $x_j$ is for $x_t$, and the value gives an altered representation of $x_j$.

This is “masked” because $z_{0,t}^{A,\ell}$ looks at the data from $0$ to $t$ instead of the whole sequence from $0$ to $T$.

If $d \neq m$, then $d$ must be a multiple of $m$: there are then $d/m$ such $Q$, $K$, and $V$ matrices (attention "heads"), and their outputs are concatenated to ensure that $z_{0,t}^{A,\ell}$ has the same dimension as $x_t^{A,\ell-1}$.
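
To illustrate, here is a minimal single-head sketch of masked self-attention written as plain matrix code. The dimensions and the random `Xpe`, `Q`, `K`, and `V` are illustrative, and $m = d$ so that the residual connection below is well defined; this is not the batched implementation used by Transformers.jl.

using LinearAlgebra

d, m, T = 16, 16, 10                           # model dimension, head dimension, sequence length
Xpe = randn(Float32, d, T)                     # columns are the positionally encoded inputs x_t^{pe}
Q, K, V = (randn(Float32, m, d) for _ in 1:3)  # query, key, and value matrices

Z = similar(Xpe, m, T)                         # columns will hold z_{0,t}^{A,1}
for t in 1:T
  scores = [(Q*Xpe[:,t])'*(K*Xpe[:,j])/sqrt(m) for j in 1:t]  # only j ≤ t: the "mask"
  w = exp.(scores) ./ sum(exp.(scores))                       # softmax over j = 1,…,t
  Z[:,t] = sum(w[j] * (V*Xpe[:,j]) for j in 1:t)              # weighted sum of value vectors
end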

Residual Connection

The output of the attention layer is then added to the input, $z_{1,t}^{A,\ell} = x_t^{A,\ell-1} + z_{0,t}^{A,\ell}$. This sort of residual connection is often used in deep learning. (E.g. ResNet is a well-known convolutional network with residual connections that did very well on image classification.) It helps ensure that gradients do not vanish even deep within many layers. See @jastrzebski2017 for some theoretical justification for residual connections.

Layer Norm

A layer normalization is then applied as in @ba2016. That is, we transform
$$
z_{2,t}^{A,\ell} = \frac{z_{1,t}^{A,\ell} - \mu_{\ell,t}}{\sigma_{\ell,t}}
$$
where $\mu_{\ell,t}$ and $\sigma_{\ell,t}$ are the mean and standard deviation of the $d$ components of $z_{1,t}^{A,\ell}$ (the normalization is across features for each $t$, not across positions).

Feed-Forward Layer

A single-layer feed-forward network is then applied to each $z_{2,t}^{A,\ell}$. That is, we take
$$
z_{3,t}^{A,\ell} = f_\ell(z_{2,t}^{A,\ell})
$$
where $f_\ell$ is a single-layer feed-forward network.

Residual Connection & Layer Norm Again

Finally, another residual connection and layer norm are applied, giving the output of the attention layer, $x_t^{A,\ell}$.
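
Continuing the sketch from the attention section (again purely illustrative: `Z` and `Xpe` come from that sketch, and the feed-forward weights here are random rather than learned), one full attention layer applied position by position looks roughly like:

layernorm(z) = (z .- sum(z)/length(z)) ./ sqrt(sum(abs2, z .- sum(z)/length(z))/length(z))

W1, b1 = randn(Float32, 4d, d), zeros(Float32, 4d)  # feed-forward weights, hidden width 4d
W2, b2 = randn(Float32, d, 4d), zeros(Float32, d)
f(z) = W2*max.(0, W1*z .+ b1) .+ b2                 # single hidden layer network with relu activation

Xout = similar(Xpe)
for t in 1:T
  z1 = Xpe[:,t] .+ Z[:,t]           # residual connection
  z2 = layernorm(z1)                # layer norm
  z3 = f(z2)                        # feed-forward network applied to each position
  Xout[:,t] = layernorm(z2 .+ z3)   # second residual connection and layer norm
end
# Xout plays the role of x_t^{A,ℓ}, the input to the next attention layer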

Repeat

The attention layer is repeated $L$ times, with the output of layer $\ell$, $x_t^{A,\ell}$, used as the input to layer $\ell+1$.

Prediction Layer

Finally, the output of the encoder, $x_t^{A,L}$, is used to predict $x_{t+1}$. When $x_{t+1}$ is discrete, this is done with a linear layer followed by a softmax. When $x_{t+1}$ is continuous, it can be done with just a linear layer.
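
For the discrete case, here is a sketch of the prediction step with illustrative, randomly initialized weights (`Xout` and `d` come from the sketches above, and `K` is the number of possible tokens):

K = 101                                     # number of possible tokens (illustrative)
Wp, bp = randn(Float32, K, d), zeros(Float32, K)
logits = Wp*Xout[:,end] .+ bp               # linear layer: a score for each possible next token
probs  = exp.(logits) ./ sum(exp.(logits))  # softmax: probability of each value of x_{t+1}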

Why?

The architecture of transformers developed step by step, combining ideas that seemed to work. The idea of an encoder grew out of embeddings and was originally combined with recurrent networks. Positional encoding and the move away from recurrence were motivated by the difficulty of parallelizing recurrent models. Residual connections and layer norms help with gradient descent and vanishing gradient problems. Theoretical understanding of transformers has lagged behind their practical application, but theory is advancing rapidly; see e.g. @bhattamishra2020.

Example Code

Lior Sinai has an excellent blog post, “How to code a transformer in Julia,” that shows how to implement a transformer as new layers in Flux.

The Transformers.jl package provides a higher level transformer interface.

Data

For comparison, we will start by using the same Dylan example as in the recurrent neural network notes.

using JLD2, ProgressMeter
import HTTP, Gumbo, Cascadia
using StatsBase: wsample
using Base.Iterators: partition
using Transformers, Flux, CUDA

text = collect(String(read(joinpath(docdir,"jmd","dylanchords.txt")))).*""
#startchar = 'α'
#endchar = 'Ω' # any character not in original text
unkchar = "Ξ"
#alphabet = [startchar, unique(text)..., endchar]
alphabet = unique(text)
N = length(alphabet)
# convert to strings
vocab = Transformers.Vocabulary(alphabet, unkchar)
Vocabulary{String}(101, unk=Ξ)

Model Creation

enable_gpu(true)

function create_transformer(modeldim, L; heads=1, feedforwardsize=4*modeldim, vocab=vocab)
  embed = Transformers.Basic.Embed(modeldim,length(vocab))   # token embedding
  pe = Transformers.Basic.PositionEmbedding(modeldim)        # positional encoding
  topo = @nntopo_str "x → e → pe:(e,pe) → t → $L:t → logitp" # how the pieces below are composed
  m = Stack(topo,
            embed,
            pe,
            (e,pe) -> e .+ pe,  # add the positional encoding to the embedding
            [Transformer(modeldim, heads, feedforwardsize, act=relu, future=false, pdrop=0.1) for l ∈ 1:L]..., # L masked (future=false) attention layers
            Transformers.Basic.Positionwise(Dense(modeldim,length(vocab)))) # position-wise prediction layer producing logits for each token
  return(m)
end
create_transformer (generic function with 1 method)

Training

# Return a training callback that advances a progress bar and prints the loss
# every `printiter` iterations.
function cbgenerator(N, loss, printiter=Int(round(N/10)))
  p = Progress(N, 1, "Training", 25)
  i=0
  function cb()
    next!(p)
    if (i % printiter==0)
      @show loss()
    end
    i+=1
  end
  return(cb)
end

# Generate `len` characters from model `m`, sampling each character from the
# model's predicted distribution, conditioning on a window of at most `seqlen`
# preceding characters.
function sample(m, alphabet, len, seqlen)
  m = cpu(m)
  buf = IOBuffer()
  c = "w" #rand(alphabet)
  cseq = vocab(split("so much younger than that no","")) #Vector{Int}(undef,0)
  ind2alpha = Dict(vocab(a) => a for a ∈ alphabet)
  for i = 1:len
    write(buf, c)
    if (i < seqlen)
      push!(cseq, vocab(c))
    else
      cseq[1:(end-1)] .= cseq[2:end]
      cseq[end] = vocab(c)
    end
    c = ind2alpha[wsample(1:length(vocab), softmax(m(cseq)[:,end]))]
  end
  return String(take!(buf))
end

# Split `text` into sequences of `seqlength` tokens, pair each input sequence with
# the same sequence shifted by one token as the target, and group the pairs into
# batches of `seqperbatch` sequences.
function createdata(vocab, text, seqlength, seqperbatch)
  sequences = [vocab.(x) for x ∈ partition(text, seqlength)]
  xy = [(s[1:(end-1)],Flux.onehot(vocab,s[2:end])) for s ∈ sequences]
  if length(xy[end][1]) < length(xy[1][1])
    pop!(xy)
  end
  xybatches = [ (hcat([z[1] for z ∈ p]...), cat([z[2] for z ∈ p]..., dims=3)) for p ∈ partition(xy, seqperbatch) ]
  return(xybatches)
end

# Train model `m` on `data` with optimizer `opt` for `epochs` epochs, unless a saved
# model already exists in `modelfile`, in which case the saved model is loaded instead.
function train_model(m; data=data,
                     modelfile=joinpath(docdir,"jmd","models","dylan-t.jld2"),
                     opt=opt, epochs=20 )
  loss(xb, yb) = Flux.Losses.logitcrossentropy(m(xb),yb)
  cb=cbgenerator(length(data),()->loss(first(data)...))

  if isfile(modelfile)
    @load modelfile cpum
    #m = gpu(cpum)
    m = cpum
  else
    @time Flux.train!(loss, Flux.params(m), data, opt, cb = cb)
    println("Sampling after 1 epoch:")
    sample(m, alphabet, 1000, size(first(data)[1],1)) |> println

    Flux.@epochs epochs Flux.train!(loss, Flux.params(m), data, opt, cb = cb)
    cpum = cpu(m)
    @save modelfile cpum
  end
  return(m)
end

m = create_transformer(16,2,heads=2,feedforwardsize=16, vocab=vocab) |> gpu
data = createdata(vocab, text, 500, 50) |> gpu
opt = RMSProp(0.001)
#m = train_model(m, data=data, modelfile="64d_4level_50e.jld2", opt=opt, epochs=50)
m = train_model(m, data=data, modelfile="test.jld2", opt=opt, epochs=10)
sample(m, alphabet, 1000, size(first(data)[1],1)) |> println
loss() = 4.5326347f0
loss() = 4.0444283f0
loss() = 3.8187015f0
loss() = 3.6545393f0
loss() = 3.521952f0
loss() = 3.421867f0
loss() = 3.3402755f0
loss() = 3.2747056f0
loss() = 3.218742f0
loss() = 3.172187f0
123.029453 seconds (207.87 M allocations: 10.518 GiB, 4.84% gc time, 59.26%
 compilation time: 0% of which was recompilation)
Sampling after 1 epoch:
wD"loe"y'gmC  ldsbok=,Ler tQ n a k E 
bnirP/hnt e> kao~ ao+
 leqardeu|2e/n mrwdts hied o]oh] n &t >|;Eg.e t Bare _  sch  = y 
nrd 
d n  w     /h o+]d,nh .y  (svss o krragh   >Wt D aa--8en
vmwr= UlLm ar{A o m>   'rein>= <^J^Bes  wiese    o
Fg1ilr  < aanh thi
se  t1w  b[ "  |o 
k l1  --Esn.Ko/ve|b-8lra
/
/en  "y   wbs  s   ew==h   tesrdDa 
naya  nta/lfaana.kh k    on=   lg keP  
laPi/eoerlon,  omE
io  ivr/n
thq-nht C 
t. M> / hw'ubYcg  s  irt”>  pree   ^ha, c
  enan   0"am
lrx  ow
 iisrdeJnka>on    D any "ne idg/sfh/  
 Ji> ,g       dyt 'algv  kA  N ip= i fp

iwB -0nl2;tEu k.C oe  tpc    srtke
  <
 repn  i-oFi  
e'  U|
TyGl ]     ;dmsyi   g

Rrg , hse+-Ξn (b-oy   i^   w } a h---dd u < otisSu
 auG o"Qo>rj:
g 5 ul.  |r PgDs eau
i&  h
/   au> s T ”osmnd /a-1s O  'wauoehtCI  kru
Lnou lt  opab-eru      tye py Ci--t rCa Qw ywie Lh U 
_  u  sw  a----  L  thdg r>T  <u Z  S y k  i#o(hl  aQ1qurpleI  h
d
  =u%h   . x   h#k< Ldole "rdCa-&nhi'.bt
osie v   <   
"g  "\yih/t

 <.    
dwlea8rtr  vpDL
loss() = 3.1257606f0
loss() = 3.090786f0
loss() = 3.0504942f0
loss() = 3.0097828f0
loss() = 2.9701004f0
loss() = 2.9390543f0
loss() = 2.9102654f0
loss() = 2.8779998f0
loss() = 2.8570156f0
loss() = 2.8277261f0
loss() = 2.8057358f0
loss() = 2.7854412f0
loss() = 2.7624745f0
loss() = 2.7438314f0
loss() = 2.7233734f0
loss() = 2.712018f0
loss() = 2.6979303f0
loss() = 2.6817768f0
loss() = 2.6735775f0
loss() = 2.6518612f0
loss() = 2.6465538f0
loss() = 2.6317961f0
loss() = 2.632695f0
loss() = 2.610262f0
loss() = 2.6029112f0
loss() = 2.5938296f0
loss() = 2.5861306f0
loss() = 2.5805883f0
loss() = 2.5682912f0
loss() = 2.5576136f0
loss() = 2.5594745f0
loss() = 2.5465784f0
loss() = 2.5395765f0
loss() = 2.5316157f0
loss() = 2.5319207f0
loss() = 2.521349f0
loss() = 2.5118716f0
loss() = 2.5063741f0
loss() = 2.5004256f0
loss() = 2.5043316f0
loss() = 2.4967632f0
loss() = 2.4858403f0
loss() = 2.481605f0
loss() = 2.4764633f0
loss() = 2.471646f0
loss() = 2.4654317f0
loss() = 2.460706f0
loss() = 2.4556491f0
loss() = 2.4537146f0
loss() = 2.4492564f0
loss() = 2.4442303f0
loss() = 2.4417489f0
loss() = 2.4383984f0
loss() = 2.435677f0
loss() = 2.4307218f0
loss() = 2.4244034f0
loss() = 2.4245195f0
loss() = 2.4365647f0
loss() = 2.4150207f0
loss() = 2.4118989f0
loss() = 2.4089348f0
loss() = 2.4114954f0
loss() = 2.4080184f0
loss() = 2.400921f0
loss() = 2.3989468f0
loss() = 2.3953652f0
loss() = 2.3926613f0
loss() = 2.3859446f0
loss() = 2.390622f0
loss() = 2.3841126f0
loss() = 2.3879251f0
loss() = 2.3827386f0
loss() = 2.3771672f0
loss() = 2.375889f0
loss() = 2.369691f0
loss() = 2.3690493f0
loss() = 2.3698466f0
loss() = 2.3639894f0
loss() = 2.3619514f0
loss() = 2.358743f0
loss() = 2.356278f0
loss() = 2.3537457f0
loss() = 2.3558576f0
loss() = 2.3482108f0
loss() = 2.3483984f0
loss() = 2.3474474f0
loss() = 2.343408f0
loss() = 2.3504155f0
loss() = 2.341434f0
loss() = 2.3386126f0
loss() = 2.3359587f0
loss() = 2.336779f0
loss() = 2.333354f0
loss() = 2.330643f0
loss() = 2.330587f0
loss() = 2.3299756f0
wonke'so here ob pre w,&r   Fere  d uan   T t  /ave         :       oreghe 
 Angink    A  ol                            D         kgasito  se     t
        fond faldin=     ou    g              G


I    meo,
Bloum                          yoas  : Bue      in .    ri G   f .
Awanger,  mouvk y             .  , D    pe            .
I'        (2   o           clinon  G   :
    O'y     Afe         Itotiilin'the            G           d      C7-00.
T

Og d,   Bu     te

      m
   pq.                  abl            o   *="owhithe. y      C      A     
                  .
An
Woo G
 .  .   'lu ge sorhap
  r        :  klomy               .
  G
                   //pler              C..  G
.   t 
 D                      p           * E
E       Ay       .    .
--s     s       : de '  .
tofan'ton               .           vethe           ororando   
C
                         6hrse>        l m        :
     A---/st'      :  :
</p+>




Be="ovu      .

The output looks okay, but not quite as good as with the RNN. I did some ad hoc exploration with alternate widths and depths; the configuration above seemed to work best.

Qualitatively, these results are typical. Transformers outperform RNNs when the underlying tokens are words or word fragments, but RNNs tend to outperform transformers when the tokens are characters. Various modifications of transformers can make them competitive; see e.g. @wu2020 and @al2019.

Pre-trained Models

An increasingly common way to apply transformers, especially for language but also in other contexts, is to fine-tune a general-purpose model. There are a number of large general-purpose language models trained on large datasets, including variants of GPT, variants of BERT, and others. Huggingface provides a way to access these models, and Transformers.jl has integrated some of the models from Huggingface (with plans to add more).

Transfer Learning

Given a specific dataset and task, a fruitful approach is to take a large pretrained model and fine-tune it for the task. Often all parameters of the model are modified during fine-tuning. Here, we will fine-tune the GPT(1) model on the Dylan song data. The hope is that the output of the transformer provides a good representation of the data for predicting the next word. To limit the computational cost, we hold fixed the embedding and transformer components of GPT and only retrain a final classifier; it is also common to fine-tune all components of the model.

using ProgressMeter, JLD2
using StatsBase: wsample
using Transformers, Flux, CUDA

text = String(read(joinpath(docdir,"jmd","dylanchords.txt")))
songs = [split(s, "</body")[1] for s in split(text, "<body>")[2:end]]


startsym = "<pre>"
delisym = "_deli_"
endsym = "</pre>"
unksym = "<unk>"
gpt, bpe, vocab, tokenizer = Transformers.load_pretrain("GPT-OpenAIftlm"; startsym, delisym, clfsym=endsym, unksym)

gptenc = Transformers.GPTTextEncoder(tokenizer, bpe, vocab; startsym, sepsym = delisym, endsym = endsym, unksym, padsym = unksym)

# encode songs
songenc = [Transformers.encode(gptenc, s) for s in songs]

# find size of output = # of tokens used in data
usedtoken = reduce((x,y)-> x .|| y, any(s.input.tok,dims=2) for s in songenc)
idx = cumsum(usedtoken, dims=1)
outdim = sum(usedtoken)


predictmodel = Chain(Dense(768, outdim)) |> gpu
model = Transformers.set_classifier(gpt, predictmodel) |> gpu

maxlen = size(gpt.embed.embeddings.pe.embedding,2)÷2
batches = 100
batchsize = 1000
minlen = 20 # minimum input token sequence to predict from

"""
  createbatch(batchsize)

Randomly select a song and a sequence of tokens with length
uniformly distributed on [minlen,maxlen]. Encode the sequence
using the transformer from gpt, then return the last encoded value,
and a one-hot vector representing the next token in the song.

This is done `batchsize` times and the function returns the
transformed output as `X` with dimension 768 by `batchsize`,
and one hot matrix `y` with dimension number `outdim` by `batchsize`
"""
function createbatch(batchsize; maxlen=maxlen, minlen=minlen, outdim=outdim, model=model, songenc=songenc)
  Ntrain=batchsize
  xin = (tok=songenc[1].input.tok[:,1:minlen],)
  xt = model.transformers(model.embed(xin))
  Xt = similar(xt, size(xt,1), Ntrain)
  y = Vector{typeof(Flux.onehot(1, 1:outdim))}(undef, Ntrain)
  for i in 1:Ntrain
    L = 0
    si = 0
    while (L<minlen)
      si = rand(axes(songenc)[1])
      s = songenc[si]
      L = size(s.input.tok,2)
    end
    s = songenc[si]
    len = rand(minlen:min(maxlen,(L-3)))
    first = rand(1:(L-len))
    last = first + len -1
    xin  = (tok=s.input.tok[:,first:last],)
    xe = model.embed(xin)
    xt = model.transformers(xe)
    Xt[:,i] .= xt[:,end]
    y[i] = Flux.onehot(idx[Flux.onecold(s.input.tok[:,(last+1)])],1:outdim)
  end
  y = hcat(y...)
  y = gpu(y)
  return(Xt,y)
end
datafile = joinpath(docdir,"jmd","models","dylan-batched.jld2")
if !isfile(datafile)
  CUDA.@allowscalar data = [createbatch(batchsize) for b in 1:batches]
  cdata = cpu.(data)
  @save datafile cdata
end
@load datafile cdata
data = gpu.(cdata)

opt = ADAM(1e-2)
loss(xt, y) = Flux.Losses.logitcrossentropy(predictmodel(xt),y)

function samplegpt(len=100,prompt="I was so much older then, I'm younger than that now "; predictmodel=predictmodel)
  out = prompt
  for i=1:len
    enc = Transformers.encode(gptenc, out)
    V, L = size(enc.input.tok)
    xin = (tok=enc.input.tok[:,max(1,L-maxlen-1):(L-1)],)
    xt = model.transformers(model.embed(xin))[:,end]
    p = Flux.softmax(predictmodel(xt))
    y = wsample(1:outdim, p)
    yall = findfirst(idx.==y)[1]
    out *= replace(Transformers.lookup(gptenc.vocab, yall), "</w>" => " ")
  end
  return(out)
end
CUDA.@allowscalar samplegpt(20)

Epochs = 10 # number of fine-tuning epochs (placeholder value)
losses = zeros(Epochs)
modelfile = joinpath(docdir,"jmd","models","dylan-gpt-tuned.jld2")
if !isfile(modelfile)
  losses = zeros(Epochs)
  for e=1:Epochs
    Flux.train!(loss, Flux.params(predictmodel), data, opt)
    losses[e] = sum(loss(d...) for d in data)
    println("Epoch $e: loss=$(losses[e])")
    println("Sample = ")
    println(samplegpt(20))
  end
  cpum = cpu(predictmodel)
  @save modelfile cpum losses
end
@load modelfile cpum losses
predictmodel = gpu(cpum)

CUDA.@allowscalar samplegpt(20)