Thanks to Stephan Gouws for his help on writing and improving this blog post.
Transformers have recently become a competitive alternative to RNNs for a range of sequence modeling tasks. They address a significant shortcoming of RNNs, i.e. their inherently sequential computation which prevents parallelization across elements of the input sequence, whilst still addressing the vanishing gradients problem through its self-attention mechanism.
In fact, Transformers rely entirely on a self-attention mechanism to compute a series of context-informed vector-space representations of the symbols in its input (see this blog post to know more about the details of the Transformer). This leads to two main properties for Transformers:
- Straightforward to parallelize: There is no connections in time as with RNNs, allowing one to fully parallelize per-symbol computations.
- Global receptive field: Each symbol’s representation is directly informed by all other symbols’ representations (in contrast to e.g. convolutional architectures which typically have a limited receptive field).
Although Transformers continue to achieve great improvements in many tasks, they have some shortcomings:
- No Recurrent Inductive Bias: The Transformer trades the recurrent inductive bias of RNN’s for parallelizability. However, the recurrent inductive bias appears to be crucial for generalizing on different sequence modeling tasks of varying complexity. For instance, when it is necessary to model the hierarchical structure of the input, or when the distribution of input length is different during training and inference, i.e. when good length generalization is needed.
- The Transformer is not Turing Complete: While the Transformer executes a total number of operations that scales with the input size, the number of sequential operations is constant and independent of the input size, determined solely by the number of layers. Assuming finite precision, this means that the Transformer cannot be computationally universal. An intuitive example are functions whose execution requires the sequential processing of each input element. In this case, for any given choice of depth T, one can construct an input sequence of length N > T that cannot be processed correctly by a Transformer:
- Lack of Conditional Computation: The Transformer applies the same amount of computation to all inputs (as well as all parts of a single input). However, not all inputs need the same amount of computation and this can be conditioned on the complexity of the input.
Universal Transformers (UTs) address these shortcomings. In the next parts, we’ll talk more about UT and its properties.(more…)