This is the second blog post in a series of three where we detail the research that forms the basis of Smartscan Premium. In this article, we will dive deeper into the details of the architecture, focusing on the problem of handling long input sequences.
Read the first blog post in this series about the problem Smartscan tries to solve, the background of transformers, and key machine learning concepts, here.
Be warned: We will go into technical terms and math during this blog post. Don't worry, we try to give intuitive explanations wherever possible. We mark particularly technical sections with this "nerd alert": 🚨. These sections provide additional details for the curious and technically inclined reader.
The Transformer Architecture
Let us start by understanding the architecture of a transformer. We will first take a look at how text is made into vectors. Then, we will dive into the architecture of the encoder stack of the transformer where these vectors are transformed by the model. Finally, we will see how the transformed vectors from the encoder stack are turned back into text.
From Text to Vectors
A machine learning model is essentially a mathematical function. So, how exactly do we turn a piece of text into something numerical that can be passed through such a function? Let us take a moment to understand this. The steps are the following:
- Tokenizing and encoding
- Embedding the encoded tokens
Let’s see how these steps would apply to a well-known sentence: “Visma: Make progress happen!”.
Tokenizing and Encoding
A tokenizer segments the sentence into smaller pieces (tokens). Possibly the simplest tokenization strategies are to split by spaces:
visma: | make | progress | happen! |
or into characters:
v | i | s | m | a | : | … |
A good tokenizer for natural language processing (NLP) achieves a good tradeoff between the following:
1) the vocabulary size needed to properly represent the training corpus
2) the resulting tokenized sequence lengths
3) how well each token manages to separate or gather linguistic information
Returning to the examples above, you will see that this tradeoff is drastically different between the two. Splitting by spaces will result in manageable sequence lengths but will require a huge vocabulary (just think of combinations resulting from punctuation; “happen!” – “happen.” – “happen,”).
Also, this splitting would not manage to separate important language features: punctuation in a sentence matters, but splitting by spaces bundles it up with words. Conversely, splitting into characters results in a small vocabulary; 256 characters would probably give great coverage. However, the resulting sequences become as long as they can possibly get (one token per character), and each character carries almost no information by itself.
One of the most popular tokenizers in NLP is the Word Piece tokenizer, and this is also what we use in Smartscan Premium. The word piece tokenization of the example sentence looks like this:
vis | ##ma | : | make | progress | happen | ! |
Here, you can see that common words like “make”, “progress”, and “happen” become tokens of their own and so do punctuation characters. You can also see that the name “Visma” gets split into more common pieces (sorry, not every company gets its own token). This type of tokenizer relies on a trained vocabulary whereas the previous examples do not.
Finally, each token is mapped to a token id (its position in the vocabulary). The encoded example sentence then becomes:
25292 | 2863 | 1024 | 2191 | 5082 | 4148 | 999 |
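For illustration, here is roughly how these two steps could look in code, assuming the Hugging Face transformers library and the publicly available bert-base-uncased WordPiece vocabulary (an illustrative stand-in, not necessarily the exact vocabulary used in Smartscan Premium):

```python
from transformers import AutoTokenizer

# Load a publicly available WordPiece tokenizer (illustrative choice)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Visma: Make progress happen!")
# e.g. ['vis', '##ma', ':', 'make', 'progress', 'happen', '!']

token_ids = tokenizer.convert_tokens_to_ids(tokens)
# e.g. [25292, 2863, 1024, 2191, 5082, 4148, 999]
```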
An important takeaway from the above is that any given sequence of “words” will ultimately be mapped to a sequence of tokens. This token sequence will usually be longer than the word sequence. This will especially be the case when the word sequence consists of a lot of uncommon words that won’t be found in the vocabulary of the tokenizer (think invoice language: org. numbers, names, addresses, amounts and so on).
Embedding the Encoded Tokens
The encoded tokens are now embedded into high dimensional vectors. We do this essentially as a table lookup. The so-called token embeddings are held in a large matrix of “weights” with shape
[nvocab , d] where nvocab is the size of the vocabulary and d is the hidden state size. Both are hyper-parameters (model settings).
A mathematical representation of the entire sentence is then built simply by gathering rows from this matrix using the encoded token ids as indices, as shown in the figure below. The nvocab*d weights are trainable model parameters; they will be learned from data during training of the model.
The matrix with embedded token ids will have the shape [ntokens , d] where ntokens refers to the sequence length of the tokenized text (up to 1496 in Smartscan Premium, but 7 in our example).
If you read part 1 of this blog post series, you will know that additional per token information such as layout information (bounding box coordinates and size of the text) is also embedded. This is done in a completely analogous way as described above. All of these embedded representations are of the same shape and are simply added element-wise to form our first hidden state.
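Here is a minimal numpy sketch of this lookup-and-add step, with purely illustrative hyper-parameter values and random weights standing in for the trained ones:

```python
import numpy as np

n_vocab, d = 30_522, 768                       # illustrative hyper-parameter values
token_weights = np.random.randn(n_vocab, d)    # trainable [n_vocab, d] embedding matrix

token_ids = np.array([25292, 2863, 1024, 2191, 5082, 4148, 999])  # encoded example sentence

# Embedding is a table lookup: gather one row per token id
token_embeddings = token_weights[token_ids]    # shape [n_tokens, d] = [7, 768]

# Layout and other per-token features are embedded analogously and added element-wise:
# hidden_state_0 = token_embeddings + layout_embeddings + ...
```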
This hidden state should be seen as a matrix representation of the input sequence, which will be passed through and manipulated by the layers in the encoder stack of the transformer model. The hidden state will roughly speaking have the same shape at every point throughout the model.
The Encoder Stack
Now that we know how text is vectorized, let’s have a look at the architecture of the encoder stack of the transformer, which is shown in the following figure:
As you can see in the left part of the figure, the encoder stack consists of a number of encoder layers. In Smartscan Premium we use a stack of 12 layers. All these layers are identical in terms of architecture, but each has its own set of trainable model parameters.
The embedded sequence (hidden state 0) described above enters the stack in the bottom and passes through each layer, being manipulated along the way. Out comes the encoded sequence, which should be understood as a matrix representation of the model’s understanding of the layout and language of the input text.
In the right part of the figure, the architecture of a single encoder layer is shown. The most important thing to realise is that self-attention is an integral part of every encoder layer in the stack.
This layer serves the important purpose of “mixing” the information of all the tokens in the sequence to form an understanding of the context of the entire sequence. Earlier sequence model architectures used so-called recurrent neural networks that process elements of the sequence sequentially in a strictly left-to-right or bidirectional fashion.
This had significant drawbacks in terms of difficulty in benefiting from long-range context in the sequence and poor parallelisation/GPU acceleration potential. In contrast, the self-attention based transformer architecture allows for random sequence access and is extremely parallelisable.
🚨 Each encoder layer more specifically consists of two sub-layers. The first is a multi-headed dot-product self-attention layer. The second is a position-wise (applied to each element in the sequence, not across elements) feed-forward neural network.
The red arrows indicate so-called residual or skip connections. They help mitigate the 'vanishing gradient' problem during training by providing a more direct pathway between the model's output and the deeper layers.
Each sub-layer also applies layer-normalisation with the purpose of ensuring consistent scaling of the sub-layer output, i.e. ensuring that the layer outputs are kept at small magnitudes in an average sense. This helps speed up convergence during training.
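To make this structure concrete, here is a minimal numpy-style sketch of one encoder layer in a simplified post-norm layout; `self_attention` and `feed_forward` are placeholder callables, not our actual implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalise each token-vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(hidden_state, self_attention, feed_forward):
    """One encoder layer: two sub-layers, each wrapped in a residual
    (skip) connection followed by layer-normalisation."""
    # Sub-layer 1: self-attention mixes information across tokens
    hidden_state = layer_norm(hidden_state + self_attention(hidden_state))
    # Sub-layer 2: position-wise feed-forward network, applied to each token independently
    hidden_state = layer_norm(hidden_state + feed_forward(hidden_state))
    return hidden_state                      # same shape as the input: [n_tokens, d]
```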
From Representation to Text
Now that we understand how text is vectorised and then transformed by the encoder stack, let’s finally see how the transformed sequence can be turned back into text again.
Recall from part 1 of this blog post series that pre-training is carried out using the masked language model (MLM) objective. Here, a fraction of tokens in the input sequence are randomly replaced or masked. The model is asked to predict the proper tokens at those positions.
We, therefore, need a way of converting the encoded sequence representation output from the encoder stack back into text. We do this by applying a neural network with nvocab outputs to each position in the encoded sequence.
Through the use of a special function called softmax, the relative magnitudes of these outputs can be interpreted as a probability distribution over the vocabulary. This lets us pick the most likely token to output.
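As a sketch, such an output head could look roughly like this, assuming a single dense projection to vocabulary size followed by a softmax (a simplification of a real MLM head):

```python
import numpy as np

def mlm_head(encoded, W_out, b_out):
    """Map each encoded token-vector to a probability distribution over the vocabulary.
    encoded: [n_tokens, d], W_out: [d, n_vocab], b_out: [n_vocab]."""
    logits = encoded @ W_out + b_out                    # [n_tokens, n_vocab]
    logits -= logits.max(axis=-1, keepdims=True)        # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)          # softmax over the vocabulary
    return probs.argmax(axis=-1)                        # most likely token id per position
```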
We call the collection of objective specific model layers a model head. The MLM model head will only be kept during pre-training. Later, when we want to fine-tune the model, a model head that is suitable for the fine-tuning objective is used instead.
Transformers and Long Sequences
Traditional transformer architectures suffer from a particular and significant limitation that arises from the use of self-attention. The problem is that the compute and memory requirements for self-attention layers scale quadratically with the sequence length.
Why is this the case? Intuitively this is relatively easy to understand. As we mentioned earlier, the self-attention layers have the important task of mixing the information from all the token-vectors in the sequence.
Practically, we do this by letting the layer look at (attend to) every single token-vector of its input when calculating each output token-vector. The sequence length is the same everywhere in the encoder stack. This means that every self-attention layer needs to consider ntokens input token-vectors for each of the ntokens output token-vectors. Therefore, every self-attention layer needs to "look" ntokens² times.
This is really bad news for problems where we need to treat long sequences. Let’s consider a (very) simple example: Assume that a model prediction at a sequence length of 100 takes 0.1s (no base cost for simplicity, so 0.0s at 0 sequence length). In the figure below, linear and quadratic scaling behaviours are compared for this situation. As you can see, when increasing the sequence length, prediction time quickly explodes and becomes unreasonably long with quadratic scaling.
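In code, this toy comparison looks like the following:

```python
def linear_time(n_tokens):
    return 0.1 * (n_tokens / 100)

def quadratic_time(n_tokens):
    return 0.1 * (n_tokens / 100) ** 2

for n in (100, 500, 1000, 1500):
    print(f"{n:>5} tokens: linear {linear_time(n):5.1f} s, quadratic {quadratic_time(n):5.1f} s")
#   100 tokens: 0.1 s vs  0.1 s
#   500 tokens: 0.5 s vs  2.5 s
#  1000 tokens: 1.0 s vs 10.0 s
#  1500 tokens: 1.5 s vs 22.5 s
```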
For the interested reader, we describe self-attention and the scaling issue in much greater detail in the following section. Feel free to skip it if math is not your thing:
Dot-Product Self-Attention and Quadratic Scaling
🚨 As mentioned above, the sequence length scaling issue in traditional transformer architectures arises from the dot-product self-attention mechanism of the encoder layers. The purpose of this attention mechanism is essentially to allow the model to compare every token in the input sequence with all other tokens in order to understand the context.
Let’s have a look at how this mechanism works. To keep it simple, we will only consider single-headed self-attention even though multi-headed self-attention is typically used in transformers. The move to multi-headed self-attention is only a minor complication that is conceptually easy to understand, but it will complicate the explanations unnecessarily. Furthermore, considering single-headed self-attention is entirely sufficient for understanding the scaling issue.
Consider the following equation, which constitutes (scaled) dot-product self-attention:
Attention(Q, K, V) = softmax(QKT / √d) V   (1)
Let's break that apart a bit. Q (query), K (key), and V (value) can in principle be based on different inputs. But in the simpler case of self-attention used in the encoder stack, they are all linear projections of the hidden state, H, discussed previously. Q, K, and V are projected into a d-dimensional space like this:
Q = HWQ,   K = HWK,   V = HWV   (2)
Recall that the shape of the hidden state is [ntokens , d]. The projection matrices WQ, WK, and WV have the shape [d , d]. Therefore Q, K, and V all have the shape [ntokens , d]. The three projection matrices WQ, WK, and WV are the trainable parameters of the layer.
What will the term QKT in (1) look like? Let q0, q1, … be the rows of Q, k0, k1, … the rows of K, and v0, v1, … the rows of V. Then qi can be seen as a projection of the hidden state for token i in the sequence. Similarly, ki and vi also relate to the hidden state for token i. The i'th row of the QKT term is then the vector of dot products [qi·k0, qi·k1, qi·k2, …].
Row i of the term can, therefore, be seen as a “comparison” of token i with every other token in the sequence. The projection matrices WQ and WK of trainable weights define the kind of “comparison” to be made.
The softmax operation is a nonlinear but monotonic function that normalises each row of its matrix argument to sum to 1.
The elements of the resulting matrix, A = softmax(QKT/√d), are called the attention weights. Row ai of A is a vector that sums to 1 and determines which rows of V to pay the most attention to when considering token i.
The scaling issue in dot-product self-attention is that the term QKT has to be explicitly calculated and held in memory during training and inference. As mentioned previously, Q and K both have the shape [ntokens , d].
Therefore, the matrix product QKT has the shape [ntokens , ntokens]. The complete memory and time complexities of dot-product self-attention are O(ntokens² + ntokens·d) and O(ntokens²·d), respectively. The memory and compute resources needed to run a traditional transformer architecture thus scale quadratically with sequence length.
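A minimal numpy sketch of single-headed (scaled) dot-product self-attention as given by equations (1) and (2); note that the [ntokens , ntokens] matrix is formed explicitly:

```python
import numpy as np

def dot_product_self_attention(H, W_Q, W_K, W_V):
    """H: hidden state [n_tokens, d]; W_Q, W_K, W_V: [d, d] projection matrices."""
    d = W_Q.shape[-1]
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V        # each [n_tokens, d]
    scores = Q @ K.T / np.sqrt(d)              # [n_tokens, n_tokens]  <- the quadratic term
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)         # row-wise softmax: the attention weights
    return A @ V                               # [n_tokens, d]
```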
Why all this Matters to Smartscan
The BERT model that Smartscan Premium is based on accepts, by default, sequences of up to 512 tokens. This is completely sufficient for use cases such as translation. However, in Smartscan we want the model to consider the entirety of (possibly multi-page) invoices.
The previous LSTM-based Smartscan version, now known as Smartscan Standard, accepts sequence lengths of 1000 words (split on spaces). We established that this is sufficient for most documents.
As we saw in the section on tokenization, tokenizing with Word Piece will certainly increase the sequence lengths required for the task. The default of 512 tokens must therefore be expected to be entirely insufficient for the problem Smartscan aims to solve.
Below we show the distribution, over a sample of documents, of the token index at which a certain piece of invoice information (the total including VAT or the Swedish bankgiro creditor id) is found:
We can see that, for a large fraction of the documents, only considering 512 tokens would make it impossible to find the total including VAT. The Swedish bankgiro creditor id is an OCR-line type of information.
This type of information is commonly found close to the bottom of an invoice. We can see this in the right-hand figure as a distinct bump in the upper tail of the distribution, and also in the fact that the entire distribution is shifted upwards compared to the distribution for the total including VAT.
Clearly, it would be beneficial to have a model that can consider at least 1000 tokens, and preferably even more.
Running inference on sequences of 1000+ tokens with a traditional transformer architecture is possible but very slow. Smartscan is used in near-real-time use cases, e.g. for interactive use in travel expense phone apps, and that puts strict requirements on prediction time.
Furthermore, the training set size matters. Smartscan Premium is trained on 20+ million documents on expensive GPU instances. In the future, the training set size is certain to increase, so every training efficiency improvement counts. Even with the efficient architecture we settled on for Smartscan Premium, a single pre-training run alone costs 20k+ DKK in compute.
🚨 Efficient Transformer Architectures
The solution to the scalability problem discussed above is to replace the traditional self-attention mechanism with a more efficient implementation. We will now briefly describe two of the more well-known possibilities. For the interested reader, a full survey of efficient architectures can be found in ‘Efficient Transformers: A Survey’.
Longformer
The Longformer uses a fixed-pattern sparsity-based type of attention mechanism. Consider the figures below. The figures illustrate the scope of self-attention in the following way: Each row and column refers to a token in the sequence. A coloured cell in e.g. row 10, column 15 indicates that token 10 can attend to (or see) token 15. In other words, the figures illustrate the non-zero content of the attention matrix A discussed earlier:
Non-zero values of the attention matrix for different types of fixed pattern self-attention
We can see that for traditional self-attention shown in (a) every token can attend to every other token. In contrast, the Longformer uses combinations of sliding window (b) and dilated sliding window (c) patterns of varying widths and dilations throughout the encoder stack and across attention heads. In order to enable global-context aware classification models, global+sliding window (d) attention is used in task-specific model-heads.
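To make the patterns concrete, here is a hypothetical sketch of how a (dilated) sliding-window pattern could be expressed as a boolean mask over the attention matrix. This is purely illustrative; the real Longformer never materialises a dense [ntokens , ntokens] mask and instead relies on custom kernels:

```python
import numpy as np

def sliding_window_mask(n_tokens, window, dilation=1):
    """True where token i is allowed to attend to token j.
    window: number of attended positions on each side (spaced `dilation` apart);
    dilation > 1 gives the dilated sliding window variant."""
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    offset = j - i
    within_window = np.abs(offset) <= window * dilation
    on_dilation_grid = (offset % dilation) == 0
    return within_window & on_dilation_grid
```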
The key advantages of the Longformer are that it scales linearly in the sequence length and that reuse of weights from pretrained traditional transformers is possible (for weight initialisation). One major disadvantage is that the dilated sliding window (c) type of attention sparsity requires a purpose-built CUDA kernel for efficient use. Another disadvantage is that window sizes and dilations per encoder layer become additional (possibly task-specific) hyper-parameters.
Linformer
In the Linformer paper it is proven that the problematic attention matrix discussed earlier, A = softmax(QKT/√d), is inherently of low rank.
The idea of the Linformer layer is, therefore, to substitute this term with a low-rank approximation that can be computed in linear time and memory wrt. sequence length.
This is concretely done by linearly projecting down from [ntokens , d] to [r , d] for some r << ntokens. We then calculate the dot-product self-attention in this space and then project back up to [ntokens , d]. The self-attention is calculated as:
Attention(Q, K, V) = softmax(Q(EK)T / √d) (FV)
Here, E and F are projection matrices of shape [r , ntokens], which are additional trainable parameters of the layer. This way, we replace the problematic [ntokens , ntokens] shaped QKT term with the much smaller [ntokens , r] shaped Q(EK)T term (provided we select r << ntokens), yielding O(ntokens·r) time and memory complexities.
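A numpy sketch of this idea, in a simplified single-headed form:

```python
import numpy as np

def linformer_self_attention(H, W_Q, W_K, W_V, E, F):
    """H: [n_tokens, d]; W_Q, W_K, W_V: [d, d]; E, F: [r, n_tokens] projections."""
    d = W_Q.shape[-1]
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V        # each [n_tokens, d]
    K_low, V_low = E @ K, F @ V                # [r, d]: projected along the sequence axis
    scores = Q @ K_low.T / np.sqrt(d)          # [n_tokens, r]  <- linear in n_tokens
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)         # softmax over the r projected positions
    return A @ V_low                           # [n_tokens, d]
```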
The key advantages of the Linformer layer are that it scales linearly in the sequence length both with respect to time and memory and that implementation is trivial. One disadvantage is that it introduces an additional hyper-parameter, r, per self-attention layer.
This is a small disadvantage, however, since the paper shows that the value of r just needs to be "big enough". The much bigger disadvantage is that the E and F projection matrices add additional trainable model parameters.
This means that models based on this layer cannot easily benefit from already pretrained traditional transformer models, but must instead be pretrained from scratch.
Another disadvantage is that, since projection happens on the sequence length axis, it becomes non-trivial to implement causal masking, which we expect to need in the near future of Smartscan Premium.
The Performer Layer
Good alternatives to the traditional attention implementation exist. They each come with their own set of advantages and disadvantages. For Smartscan Premium, we have settled on the so-called Performer layer.
The Performer layer was introduced in 'Rethinking Attention with Performers' on September 30th 2020. At the time, the code was only published in JAX, so we had to develop our own TensorFlow implementation of the layer. The details of the Performer layer are given in a section for the adventurous reader below. However, all you need to understand are the advantages and disadvantages associated with using this layer.
The Performer layer introduces a few additional hyper-parameters (model settings). This can be seen as a disadvantage. However, it turns out that proper values for these hyper-parameters are really easy to select, so that’s not an issue in practice.
The benefits of using the Performer layer are really great though. The main benefit is that it achieves linear scaling with respect to the sequence length. Another key advantage is that there is a one-to-one correspondence between the trainable parameters of a Performer layer and a conventional self-attention layer.
This means that the Performer layer can be used as a true drop-in replacement for conventional attention layers. No tricks are required. This allows us to benefit greatly from freely available pre-trained transformers.
🚨 Performer Details for the Nerds
The Performer layer is an implementation that relies on a kernel-based approximation of the attention mechanism. The approach is called 'Fast Attention Via positive Orthogonal Random features' (FAVOR+). The traditional attention mechanism in (1) can be rewritten like this:
Attention(Q, K, V) = D⁻¹ A V   (3)
where:
A = exp(QKT / √d)   (exponential applied element-wise)
and:
D = diag(A · 1ntokens)
Here, 1ntokens is a column vector of length ntokens with all 1's. All that happened here is that the definition of the softmax function is explicitly written in matrix notation. The main idea of the Performer layer is this: Assume that the matrix A can be expressed in terms of a kernel, Φ, such that
Ai,j = Φ(qi, kj) = E[φ(qi)T φ(kj)]
Here, E denotes the statistical expectation and φ(…) is an appropriate random feature map. For Q' and K' calculated by row-wise application of φ to Q and K, respectively, (3) can then be written as
Attention(Q, K, V) ≈ D̂⁻¹ (Q′((K′)TV)),   where D̂ = diag(Q′((K′)T 1ntokens))
This achieves an important change in the order of computation. In terms of shapes, the inner matrix product K’TV now looks like [r , ntokens] x [ntokens , d]⟶[r , d] . The following left multiplication with Q‘ then becomes [ntokens , r] x [r , d]⟶[ntokens , d]. Here, r is the dimensionality of the random feature map φ.
The memory and time complexities, therefore, reduce to O(ntokens·r + ntokens·d + r·d) and O(ntokens·r·d), respectively. The above constitutes the FA part of FAVOR+.
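The following numpy sketch shows the reordered computation with a simple positive random-feature map in the spirit of FAVOR+; it is a much-simplified stand-in for our actual TensorFlow Performer implementation:

```python
import numpy as np

def favor_attention(Q, K, V, W):
    """Q, K, V: [n_tokens, d]; W: [d, r] matrix of random feature vectors.
    Approximates softmax attention without ever forming an [n_tokens, n_tokens] matrix."""
    d = Q.shape[-1]

    def phi(X):
        # Positive random features approximating the softmax kernel exp(x.y)
        X = X / d ** 0.25                           # fold in the 1/sqrt(d) scaling
        return np.exp(X @ W - 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)) / np.sqrt(W.shape[1])

    Q_prime, K_prime = phi(Q), phi(K)               # both [n_tokens, r]
    KV = K_prime.T @ V                              # [r, d]       <- linear in n_tokens
    normaliser = Q_prime @ K_prime.sum(axis=0)      # [n_tokens]   approximates the row sums of A
    return (Q_prime @ KV) / normaliser[:, None]     # [n_tokens, d]
```

With r << ntokens, both time and memory now grow linearly in the sequence length.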
'Rethinking Attention with Performers' goes on to introduce a number of random feature maps, φ, that allow for approximation of the softmax kernel in self-attention. To give an example, the so-called regularised orthogonal hyperbolic softmax-kernel that we have settled on in Smartscan Premium uses a feature map of the form
φ(x) = (h(x)/√r) [f1(w1Tx), …, f1(wr/2Tx), f2(w1Tx), …, f2(wr/2Tx)]
with h(x) = exp(−‖x‖²/2)/√2, f1(u) = exp(u) and f2(u) = exp(−u),
where w1, …, wr/2 are drawn iid. from D. This constitutes the R+ part of FAVOR+. The choice of the functions h, f1, and f2 ensures that the random feature map is strictly positive, hence the + in FAVOR+. This is shown by ‘Rethinking Attention with Performers’ to be important for stability of the approximation.
The O part of FAVOR+ comes from orthogonalisation of the random vectors w1, …, wr/2. ‘Rethinking Attention with Performers’ shows that using orthogonal random vectors reduces the mean-squared error of the approximation.
Without going into further details we can mention that in addition to the advantages discussed earlier, causal masking is possible for the Performer layer.
Conclusion
You have reached the end; thank you for reading. In case you skipped some of it or if some parts were a bit hard to follow, here is the TL;DR.
Smartscan Premium needs to consider all the text in potentially very long text sequences from invoices in order to make good predictions. Tokenization of this text results in even longer sequences.
Long input sequences pose a problem for traditional transformer architectures since these rely on dot-product self-attention that scales quadratically with sequence length. To alleviate this problem, Smartscan Premium’s transformer stack is instead based on the Performer self-attention layer, which scales linearly with sequence length. This allows us to efficiently train Smartscan Premium, and to provide near-real-time predictions for our consumers.
Stay tuned for the last blog post in this series, which will explain the mechanics of fine-tuning and how we achieve great prediction efficiency for our multi-field Smartscan Premium model.