Hadi Hashemi, my office mate during my PhD and my lunch buddy for many, many days, has just defended his PhD thesis on "Modeling Users Interacting with Smart Devices".

I've designed Hadi's thesis cover:

... and here is its bookmark:

Vision Transformer (ViT) is a pure self-attention-based architecture (a Transformer) without CNNs. ViT stays as close as possible to the original Transformer architecture, which was designed for text-based tasks.

One of the key characteristics of ViT is its extremely simple way of encoding inputs, combined with a vanilla Transformer architecture with no fancy tricks. In ViT, we first extract patches from the input image. Then we flatten each patch into a single vector by concatenating the channels and use a linear projection layer to embed the patches ^{[1]}. We then add learnable one-dimensional position embeddings to each patch and feed this input as a sequence of image patch embeddings to a Transformer encoder, similar to the sequence of word embeddings used when applying Transformers to text.
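This input encoding is simple enough to sketch in a few lines of NumPy. The weights below are randomly initialized stand-ins for learned parameters, and the 224×224 image with 16×16 patches is just an illustrative configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))   # H x W x C input image
P, d = 16, 768                          # patch size and embedding dimension

# 1) extract non-overlapping P x P patches and flatten each into one vector
H, W, C = img.shape
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)          # [num_patches, P*P*C]

# 2) linear projection to the model dimension (learned in the real model)
W_proj = rng.normal(size=(P * P * C, d))
tokens = patches @ W_proj                         # [196, 768]

# 3) add learnable 1D position embeddings (randomly initialized here)
pos = rng.normal(size=tokens.shape)
tokens = tokens + pos
print(tokens.shape)  # (196, 768)
```

As the footnote notes, steps 1 and 2 together are exactly a convolution with kernel size and stride equal to the patch size.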

ViT, in particular, shows excellent performance when trained on sufficient data, outperforming comparable state-of-the-art CNN-based models while using roughly four times fewer computational resources. In our paper, we present the results of ViT when pre-trained on ImageNet-21k (14M images, 21k classes) or JFT (300M images, 18k classes) and evaluated with fine-tuning and linear few-shot setups on different downstream datasets, like ImageNet.

In the paper, we studied the impact of the amount of computation involved in pre-training by training several different ViT and CNN-based models on JFT (with different model sizes and training durations), so that they require varying amounts of compute for training. We observe that, for a given amount of compute, ViT yields better performance than the equivalent CNNs.

I also want to point out two other cool observations we had in ViT:

First, we looked at the learned position embeddings and found that ViT is able to recover the 2D structure of the input, even though there is no explicit signal about the 2D grid structure of the patches and everything is presented to ViT as a sequence. When visualizing the position embeddings, we see that each position embedding is most similar to others in the same row and column:

Second, we looked at the average spatial distance between one element and the elements it attends to, for each Transformer block. We observed that ViT makes good use of its global receptive field. While some of the attention heads combine local information, i.e. have small attention distances, others attend to most of the tokens already in the lowest layers, showing that the model indeed integrates information globally:
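For illustration, here is one way such a mean attention distance could be computed from a head's attention weights. `mean_attention_distance` is a hypothetical helper for this post, not the exact analysis code from the paper:

```python
import numpy as np

def mean_attention_distance(attn, grid):
    """attn: [heads, tokens, tokens] attention weights over a grid x grid
    patch grid. Returns, per head, the average distance (in patch units)
    between each query and the positions it attends to, weighted by attention."""
    coords = np.array([(i // grid, i % grid) for i in range(grid * grid)], dtype=float)
    # pairwise Euclidean distances between patch positions on the 2D grid
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # attention rows sum to 1, so this is the attention-weighted mean distance,
    # averaged over all query positions
    return (attn * dist[None]).sum(axis=(1, 2)) / attn.shape[1]
```

A head with uniform attention over a 2×2 grid, for instance, gets a mean distance of (0 + 1 + 1 + √2)/4 patches.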

Maybe... maybe not!

Every time this question comes up, I think about two main things. First of all, vanilla Transformers don't have the inductive bias of convolutions, which can be extremely helpful in a low-data regime, but there may be solutions to this, like the idea of "distilling the effect of the inductive bias" of CNNs into ViT (please read this super nice blog post by Samira Abnar about this), as DeiT does.

Second of all, I look back at how Transformers took over the whole field of NLP, where only a small percentage of new NLP papers are based on other classes of models, like LSTMs. This is of course related to many factors other than the fact that Transformers are powerful, like the hypothesis explained in the hardware lottery, or the echo chamber effect caused by the success of Transformers in many tasks and benchmarks.

All in all, no one can predict the future, but the trend seems to be in favour of employing Transformers more and more in computer vision tasks, and in the near future, we will see Transformer-based models in many setups and vision-related tasks.

To know more about Vision Transformer, please check our paper:

- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, **Mostafa Dehghani**, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, "*An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale*", arXiv preprint arXiv:2010.11929, (published in ICLR'21).

Or read this Google AI blog post about Vision Transformer. Btw, we open-sourced both the code and model to foster additional research in this area.

References

⇧1 You can see this "extracting patches plus the linear projection" as applying a convolutional layer with window size and stride equal to the patch size!

Transformers do not scale very well to long sequence lengths, largely because of the quadratic complexity of self-attention. In recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem (check out this post). However, there was no well-established way to evaluate this class of models systematically, and inconsistent benchmarking on a wide spectrum of tasks and datasets made it difficult to assess relative model quality and make fair comparisons. **Long-Range Arena** (LRA) is a unified benchmark specifically focused on evaluating model quality under long-context scenarios. LRA consists of a diverse set of tasks and modalities with long sequences (ranging from 1K to 16K tokens). We systematically evaluate several well-established long-range Transformer models (*Reformer*, *Linformer*, *Linear Transformer*, *Sinkhorn Transformer*, *Performer*, *Synthesizer*, *Sparse Transformer*, *Longformer*, *Big Bird*, *Local Transformer*) on the LRA tasks and show how they compare in terms of performance versus computational cost and memory usage.

Here you can find the tasks and datasets included in LRA, the standard input pipeline we developed for each task, implementations of the baseline models, and finally the leaderboard:

https://github.com/google-research/long-range-arena

When designing and developing LRA, we considered a set of requirements for the tasks and datasets we included in the benchmark:

- **Generality**: All efficient Transformer models should be applicable to our tasks. For instance, given that not all xformer models are able to perform autoregressive decoding, we include tasks that only require encoding.
- **Simplicity**: The tasks should have a simple setup. All factors that make comparisons difficult should be removed. This encourages simple models instead of cumbersome pipelined approaches. For instance, we avoid including any particular data augmentation and consider pre-training to be out of the scope of this benchmark.
- **Challenging**: The tasks should be difficult enough for current models to ensure there is room for improvement, to encourage future research in this direction.
- **Long inputs**: The input sequence lengths should be reasonably long, since assessing how different models capture long-range dependencies is a core focus of LRA.
- **Probing diverse aspects**: The set of tasks should assess different capabilities of models, like their ability to model relations and hierarchical/spatial structures, their generalization capability, etc.
- **Non-resource intensive and accessible**: The benchmark should be deliberately designed to be lightweight so as to be accessible to researchers without industry-grade computing resources.

We design five tasks with sequence lengths ranging from 1K to 16K:

- ListOps (Reasoning with List Operators): The dataset comprises sequences with a hierarchical structure and the operators MAX, MEAN, MEDIAN, and SUM MOD, enclosed by delimiters (brackets).
- Input size: 2K
- Dataset: ListOps
- Example: (with a much shorter length than those in the dataset)

```
INPUT:
[{MAX} 4 3 [{MIN} 2 3 ] 1 0 [{MEDIAN} 1 5 8 9, 2]]
OUTPUT:
5
```
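To make the semantics concrete, here is a tiny, hypothetical evaluator for ListOps-style expressions. The whitespace-tokenized format and operator spellings below are simplifications for illustration; the real dataset uses its own tokenization:

```python
from statistics import median

# Illustrative operator set; the real dataset uses its own spellings.
OPS = {
    "MAX": max,
    "MIN": min,
    "MEDIAN": lambda xs: int(median(xs)),
    "SUM_MOD": lambda xs: sum(xs) % 10,
}

def evaluate(tokens):
    """Recursively evaluate a tokenized ListOps-style expression."""
    def parse(i):
        if tokens[i] == "[":
            op, args, i = OPS[tokens[i + 1]], [], i + 2
            while tokens[i] != "]":
                val, i = parse(i)
                args.append(val)
            return op(args), i + 1      # skip the closing bracket
        return int(tokens[i]), i + 1    # a plain digit
    value, _ = parse(0)
    return value

expr = "[ MAX 4 3 [ MIN 2 3 ] 1 0 [ MEDIAN 1 5 8 9 2 ] ]"
print(evaluate(expr.split()))  # → 5, matching the example above
```

The hierarchical nesting is what makes this hard for a flat sequence model: the value of the outer MAX depends on fully resolving the inner sub-lists first.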

- Byte-level Text Classification (4K): We consider the byte/character-level setup of this task in order to simulate a longer input sequence, which also makes the task considerably more challenging. For byte-level text classification, the model needs to reason with compositional, unsegmented data in order to solve a meaningful real-world task.
- Input size: 4K
- Dataset: IMDB reviews dataset

- Byte-level Document Retrieval: This task is to evaluate a model’s ability to encode and store compressed representations that are useful for matching and retrieval. Similar to the text classification setup, we use a byte/character level setup, which challenges the model to compose and aggregate information over longer contexts.
- Input size: 8K (2 * 4K)
- Dataset: ACL Anthology Network dataset, which identifies if two papers have a citation link.

- Pixel-wise Image Classification: This is an image classification task where the inputs are sequences of pixels. In other words, an N × N image is flattened into a sequence of pixels. This task requires the model to learn the 2D spatial relations between input pixels while they are presented as a 1D sequence of symbols.
- Input size: ~1K
- Dataset: CIFAR-10

- PathFinder: Long-Range Spatial Dependency: The task requires a model to make a binary decision: given an image, whether two points represented as circles are connected by a path consisting of dashes. Each image also contains distractor paths, which makes this setup challenging.
- Input size: Two variants, normal: ~1K and hard (Path-X): ~16K
- Dataset: PathFinder
- Example:

Based on our analysis, the best performance in terms of the *LRA score*, i.e., aggregated across all five tasks, is achieved by the *BigBird* model. While BigBird does not do extremely well on any individual task compared to other models, it has consistently good performance across all tasks. *Performer* and *Linear Transformer* also have strong performance on some tasks. We also studied the trade-off between **quantitative performance** (y-axis), **model speed** (x-axis), and **memory footprint** (size of the circles):

While *BigBird* performs well, its speed is almost the same as the vanilla Transformer's. On the other hand, a model like *Local Attention* is fast at the cost of lower quantitative performance. Among these models, the kernel-based variants, i.e., *Performer*, *Linformer*, and *Linear Transformer*, seem to make a better trade-off between speed and performance, while having reasonable memory usage. Overall, the models that lie on the Pareto-optimal curve are *BigBird* and *Performer*.

To know more about Long-Range Arena and for more detailed results and analysis, check out our paper:

- Y. Tay*, **Mostafa Dehghani***, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, D. Metzler, "*Long Range Arena: A Benchmark for Efficient Transformers*", arXiv preprint arXiv:2011.04006, (published in ICLR'21).

Marzieh Fadaee, an old friend of mine, has just defended her PhD thesis on "Understanding and Enhancing the Use of Context for Machine Translation". Marzieh's thesis is one of the best theses I've ever seen. Super cool stuff!

I've designed Marzieh's thesis cover, (of course with her help and great suggestions):

... and here is its bookmark

**Transformers** have garnered immense interest lately due to their effectiveness across a range of domains like language, vision, and reinforcement learning. The self-attention mechanism is a key defining characteristic of Transformer models. The mechanism can be viewed as a graph-like inductive bias that connects all tokens in a sequence with a relevance-based pooling operation. A well-known concern with self-attention is its **quadratic time and memory complexity**, which can hinder model scalability in many settings.

Recently, a dizzying number of "**X-former**" models have been proposed, many of which make improvements around **computational and memory efficiency**. We hereinafter name this class of models "*efficient Transformers*". We wrote a survey that sets out to provide a comprehensive overview of the recent advances made in this class of models. In our survey, we propose a taxonomy of efficient Transformer models, characterizing them by their technical innovation and primary use case. We also provide a detailed walk-through of many of these models, including: *Memory Compressed*, *Image Transformer*, *Set Transformer*, *Transformer-XL*, *Sparse Transformer*, *Reformer*, *Routing Transformer*, *Axial Transformer*, *Compressive Transformer*, *Sinkhorn Transformer*, *Longformer*, *ETC*, *Synthesizer*, *Performer*, *Linformer*, *Linear Transformers*, and *Big Bird*.

We can outline a general taxonomy of efficient Transformer models, characterized by their core techniques and primary use case. This grouping is not only helpful for drawing connections between the existing models, but also for understanding how future research in this direction will fit into the current set of solutions. We can of course come up with completely new ideas that are orthogonal to the existing ones... which reminds me of Oriol's tweet:

Very nice to see this done! Also, this is clearly a busy research area, so try to think "outside of the box" -- or "outside of the circles" : ) #xformers

— Oriol Vinyals (@OriolVinyalsML) September 17, 2020

Here, I briefly summarize the bucketing we came up with for grouping efficient xformer models. The primary goal of most of these models, with the exception of those based on segment-based recurrence, is to approximate the quadratic-cost attention matrix. Each method applies some notion of sparsity to the otherwise dense attention mechanism.

- **Fixed Patterns**: The earliest modifications to self-attention simply sparsify the attention matrix by limiting the field of view to fixed, predefined patterns such as local windows and block patterns of fixed strides.
- **Blockwise Patterns**: The simplest example of this technique in practice is the blockwise (or chunking) paradigm, which considers blocks of local receptive fields by chunking input sequences into fixed blocks. Examples of models that do this include Blockwise and/or Local Attention. Chunking input sequences into blocks reduces the complexity from N² to B² (where B is the block size), with B ≪ N, significantly reducing the cost. These blockwise or chunking methods serve as a basis for many more complex models.
- **Strided Patterns**: Another approach is to consider strided attention patterns, i.e., only attending at fixed intervals. Models such as the Sparse Transformer and/or Longformer employ strided or "dilated" windows.
- **Compressed Patterns**: Another line of attack is to use some pooling operator to down-sample the sequence length into a form of fixed pattern. For instance, Compressed Attention uses strided convolution to effectively reduce the sequence length.
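As a concrete sketch of the blockwise idea, here is a toy NumPy implementation of block-local attention (single head, no masking or learned projections). Each query only attends within its own block, so the cost drops from N² to N·B:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def block_local_attention(q, k, v, block):
    """q, k, v: [n, d] with n divisible by `block`. Each block of queries
    attends only to keys/values in the same block: O(n * block) cost
    instead of the O(n^2) of full self-attention."""
    n, d = q.shape
    out = np.empty_like(v)
    for start in range(0, n, block):
        end = start + block
        scores = q[start:end] @ k[start:end].T / np.sqrt(d)
        out[start:end] = softmax(scores) @ v[start:end]
    return out
```

When `block == n` this reduces to ordinary full attention, which makes the approximation easy to sanity-check.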

- **Combination of Patterns**: The key idea of combined approaches is to improve coverage by combining two or more distinct access patterns. For example, the Sparse Transformer combines strided and local attention by assigning half of its heads to each pattern. Similarly, the Axial Transformer applies a sequence of self-attention computations given a high-dimensional tensor as input, each along a single axis of the input tensor. In essence, the combination of patterns reduces memory complexity in the same way that fixed patterns do. The difference, however, is that the aggregation and combination of multiple patterns improves the overall coverage of the self-attention mechanism.
- **Learnable Patterns**: An extension to fixed, pre-determined patterns are learnable ones. Unsurprisingly, models using learnable patterns aim to learn the access pattern in a data-driven fashion. A key characteristic of learnable patterns is to determine a notion of token relevance and then assign tokens to buckets or clusters. Notably, the Reformer introduces a hash-based similarity measure to efficiently cluster tokens into chunks. In a similar vein, the Routing Transformer employs online k-means clustering on the tokens. Meanwhile, the Sinkhorn Sorting Network exposes the sparsity in attention weights by learning to sort blocks of the input sequence. In all these models, the similarity function is trained end-to-end jointly with the rest of the network. The key idea of learnable patterns is still to exploit fixed (chunked) patterns; however, this class of methods learns to sort/cluster the input tokens, enabling a more optimal global view of the sequence while maintaining the efficiency benefits of fixed-pattern approaches.
- **Memory**: Another prominent method is to leverage a side memory module that can access multiple tokens at once. A common form is global memory, which is able to access the entire sequence. The global tokens act as a form of memory that learns to gather from the input sequence tokens. This was first introduced in the Set Transformer as the inducing points method. These parameters are often interpreted as "*memory*" and are used as a form of temporary context for future processing. This can be thought of as a form of parameter attention. Global memory is also used in ETC and the Longformer. With a limited number of memory tokens (or inducing points), we are able to perform a preliminary pooling-like operation over the input sequence to compress it, a neat trick to have at one's disposal when designing efficient self-attention modules.
- **Low-Rank Methods**: Another emerging technique is to improve efficiency by leveraging low-rank approximations of the self-attention matrix. The key idea is to assume a low-rank structure in the N × N attention matrix. The Linformer is a classic example of this technique, as it projects the length dimension of keys and values to a lower-dimensional representation (N to k). It is easy to see that the low-rank method ameliorates the memory complexity problem of self-attention because the N × N matrix is now decomposed to N × k.
- **Kernels**: Another recently popular method to improve the efficiency of Transformers is to view the attention mechanism through kernelization. The use of kernels enables clever mathematical rewriting of the self-attention mechanism to avoid explicitly computing the N × N matrix. Since kernels are a form of approximation of the attention matrix, they can also be viewed as a form of low-rank method.
- **Recurrence**: A natural extension to the blockwise method is to connect these blocks via recurrence. Transformer-XL proposed a segment-level recurrence mechanism that connects multiple segments and blocks. These models can, in some sense, be viewed as fixed-pattern models. However, we decided to give them their own category due to their deviation from other block/local approaches.
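The low-rank idea can be sketched in a few lines of NumPy. The projection matrices `E` and `F` below are random stand-ins for the learned length-dimension projections in the actual Linformer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d, k = 64, 16, 8  # sequence length, model dim, projected length (k << n)

Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
# E, F project the *length* dimension of keys/values from n down to k
# (learned in the real model, random here)
E, F = rng.normal(size=(k, n)), rng.normal(size=(k, n))

attn = softmax(Q @ (E @ K).T / np.sqrt(d))  # [n, k] instead of [n, n]
out = attn @ (F @ V)                        # [n, d]
print(out.shape)  # (64, 16)
```

The attention matrix is now n × k rather than n × n, which is where the memory saving comes from.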

In our paper, we walk through the details of each of these models, explain the computational and memory costs of each in detail, and talk about the advantages and disadvantages of each of them:

- Y. Tay, **Mostafa Dehghani**, D. Bahri, D. Metzler, "*Efficient Transformers: A Survey*", arXiv preprint arXiv:2009.06732, (2020).

Please check the paper and let us know if you have any comments or suggestions.

Weather has an enormous impact on renewable energy and energy markets, and renewable energy is expected to reach 80% of the world's electricity production. There are many social and economic benefits to accurate weather forecasting, from improvements in our daily lives to substantial impacts on agriculture, energy, and transportation, and to the prevention of human and economic losses through better prediction of hazardous conditions such as storms and floods. However, weather forecasting (i.e., prediction of future weather conditions such as precipitation, temperature, pressure, and wind) is a long-standing scientific challenge.

Most of the weather forecasting methods used by meteorological agencies are based on **physical models** of the atmosphere. Although these methods have seen substantial advances over the preceding decades, they are inherently constrained by their computational requirements and are sensitive to approximations of the physical laws they rely on. An alternative approach to modeling weather in order to predict future conditions is to use deep neural networks: instead of explicitly encoding physical laws in our model, we can design neural networks that discover patterns in the data and learn complex transformations from inputs to outputs. Besides, given the infrastructure built for serving neural models, like accelerators, neural weather prediction models can be substantially faster than physics-based models.

Along this direction, we introduce **MetNet**, a neural weather model for precipitation forecasting. MetNet outperforms HRRR, the current state-of-the-art physics-based model in use by NOAA, at predicting future weather conditions up to 8 hours ahead, and in terms of speed, the latency of the model is a matter of seconds as opposed to an hour.

The MetNet neural weather model is able to outperform the NOAA HRRR system at timelines of less than 8 hours and is consistently better than the flow-based model.

MetNet receives as input a four-dimensional tensor corresponding to data from a large patch, with dimensions time, height, width, and number of channels. The time dimension comprises slices sampled every 15 minutes over a 90-minute interval prior to the time at which the model makes a prediction into the future. The input data is computed from a patch covering a large geographical area. The input features comprise the single MRMS ^{[1]} radar image, the 16 spectral bands of the GOES-16 satellite, and additional real-valued features for the longitude, latitude, and elevation of each location in the patch, as well as for the hour, day, and month of the input time.

MetNet makes a prediction for a single lead time. To do so, we inform the model about the desired lead time by concatenating this information with the descriptive input features. Using this conditioning, by changing the target lead time given as input, one can use the same MetNet model to make forecasts for the entire range of target times that MetNet is trained on.
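One simple way to implement such conditioning (a sketch, not MetNet's exact scheme, and `add_lead_time` is a hypothetical helper) is to tile a one-hot encoding of the target lead time as extra input channels:

```python
import numpy as np

def add_lead_time(patch, lead_idx, num_lead_times):
    """patch: [t, h, w, c] input tensor. Appends a one-hot lead-time vector,
    tiled over time and space, as extra channels, so one model can be asked
    for any lead time it was trained on (illustrative sketch only)."""
    t, h, w, _ = patch.shape
    onehot = np.zeros(num_lead_times)
    onehot[lead_idx] = 1.0
    lead = np.broadcast_to(onehot, (t, h, w, num_lead_times))
    return np.concatenate([patch, lead], axis=-1)
```

At inference time, sweeping `lead_idx` over all trained lead times yields the full forecast range from a single model.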

- **Spatial Downsampler**: MetNet aims at fully capturing the spatial context in the input patch. A trade-off arises between the fidelity of the representation and the memory and computation required to compute it. To maintain viable memory and computation requirements, the first part of MetNet contracts the input tensor spatially using a series of convolution and pooling layers. The t slices along the time dimension of the input patch are processed separately.
- **Temporal Encoder**: The second part of MetNet encodes the input patch along the temporal dimension. The t spatially contracted slices are given to a recurrent neural network following the order of time. The result is a single tensor where each part of the tensor summarizes, spatially and temporally, one region of the large context in the input patch.
- **Spatial Aggregator**: To make MetNet's receptive field cover the full global spatial context in the input patch, the third part of MetNet uses a series of axial self-attention blocks along the width and height of the input.
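The axial self-attention in the Spatial Aggregator can be sketched as follows (a toy single-head version with no learned projections or residuals): attending along each axis in turn covers the full grid at a fraction of the cost of full 2D attention.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def axial_attention(x):
    """x: [h, w, d]. Attend along each spatial axis separately:
    roughly O(h*w*(h+w)*d) instead of O((h*w)^2 * d) for full 2D attention."""
    h, w, d = x.shape
    # attention along the width axis (each row independently)
    aw = softmax(x @ x.transpose(0, 2, 1) / np.sqrt(d)) @ x
    # attention along the height axis (each column independently)
    xt = aw.transpose(1, 0, 2)  # [w, h, d]
    ah = softmax(xt @ xt.transpose(0, 2, 1) / np.sqrt(d)) @ xt
    return ah.transpose(1, 0, 2)  # back to [h, w, d]
```

After the row pass, each position already carries information from its whole row, so the column pass gives every position an indirect view of the entire grid.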

We evaluate MetNet on a precipitation rate forecasting benchmark and compare the results with two baselines — the NOAA High Resolution Rapid Refresh (HRRR) system, which is the physical weather forecasting model currently operational in the US, and a baseline model that estimates the motion of the precipitation field (i.e., optical flow), a method known to perform well for prediction times less than 2 hours. We quantify the difference in performance between MetNet, HRRR, and the optical flow baseline model evaluated using the F1-score at a precipitation rate threshold of 1.0 mm/h, which corresponds to light rain.

To know more about MetNet, please check our paper:

- C. K. Sønderby, L. Espeholt, J. Heek, **Mostafa Dehghani**, A. Oliver, T. Salimans, S. Agrawal, J. Hickey, and N. Kalchbrenner, "*MetNet: A Neural Weather Model for Precipitation Forecasting*", arXiv preprint arXiv:2003.12140, (2020).

Or read this Google AI blog post about MetNet.

References

⇧1 Multi-Radar/Multi-Sensor System

I have defended my PhD dissertation, with "cum laude" (the highest distinction in the Netherlands), on Friday, February 28, 2020, at 10:00 AM in de Agnietenkapel.

My PhD thesis is about "**Learning with Imperfect Supervision for Language Understanding**". You can download my PhD thesis here, or get it from the UvA repository.

Humans learn to solve complex problems and uncover underlying concepts and relations given limited, noisy or inconsistent observations and draw successful generalizations based on them. This rests largely on the poverty of the stimulus argument, or what is sometimes called Plato’s problem: “*How do we know so much when the evidence available to us is so meagre?*”

In contrast, the success of today's data-driven machine learning models is often strongly correlated with the amount of available high-quality labeled data, and teaching machines with imperfect supervision remains a key challenge. In practice, however, for many applications, large-scale high-quality training data is not available, which highlights the increasing need for building models with the ability to learn complex tasks with imperfect supervision, i.e., where the learning process is based on imperfect training samples.

When designing learning algorithms, pure data-driven learning, which relies only on previous experience, does not seem able to learn generalizable solutions. Similar to humans' innately primed learning, having part of the knowledge encoded in the learning algorithm in the form of strong or weak biases can help learn solutions that generalize better to unseen samples.

In this thesis, we focus on the problem of the poverty of the stimulus for learning algorithms. We argue that even noisy and limited signals can contain a great deal of valid information that can be incorporated, along with prior knowledge and biases encoded into learning algorithms, in order to solve complex problems. We improve the process of learning with imperfect supervision by *(i) employing prior knowledge in learning algorithms, (ii) augmenting data and learning to learn how to better use the data, and (iii) introducing inductive biases to learning algorithms*. These general ideas are, in fact, the key ingredients for building any learning algorithm that can generalize beyond (imperfections in) the observed data.

We concentrate on language understanding and reasoning, one of the extraordinary cognitive abilities of humans as well as a pivotal problem in artificial intelligence. We try to improve the learning process in more principled ways than ad-hoc, domain- or task-specific tricks for improving the output. We investigate our ideas on a wide range of sequence modeling and language understanding tasks.

And here are the slides of the layman talk I gave at the start, and this is a photo from the day

The cover is part of a painting by **Reza Sedighian** that was presented in the Nuance exhibition in the Emkan gallery on May 4-14, 2018.

Nuance (Fr. nuer — to shade) means shade of color or meaning, "*a delicate variation*".

We live in a world where subtlety and nuance tend to be overwhelmed by visual, auditory and ideological noise. In this world, we want to escape our own confines.

This is not to find a respite from the noise, but in order to awaken from it.

Keyvan Azadbakht, one of my friends, has defended his PhD thesis on "Asynchronous Programming in the Abstract Behavioural Specification Language".

I've designed Keyvan's thesis cover:

and here is its bookmark:

Thanks to Stephan Gouws for his help on writing and improving this blog post.

Transformers have recently become a competitive alternative to RNNs for a range of sequence modeling tasks. They address a significant shortcoming of RNNs, i.e. their **inherently sequential computation**, which prevents parallelization across elements of the input sequence, whilst still addressing the vanishing gradient problem through their self-attention mechanism.

In fact, Transformers rely entirely on a **self-attention mechanism** to compute a series of context-informed vector-space representations of the symbols in its input (see this blog post to know more about the details of the Transformer). This leads to two main properties for Transformers:

- **Straightforward to parallelize**: There are no connections in time as with RNNs, allowing one to fully parallelize per-symbol computations.
- **Global receptive field**: Each symbol's representation is directly informed by all other symbols' representations (in contrast to, e.g., convolutional architectures, which typically have a limited receptive field).

Although Transformers continue to achieve great improvements in many tasks, they have some shortcomings:

**No Recurrent Inductive Bias**: The Transformer trades the **recurrent inductive bias** of RNNs for parallelizability. However, the recurrent inductive bias appears to be crucial for generalizing on different sequence modeling tasks of varying complexity, for instance, when it is necessary to model the hierarchical structure of the input, or when the distribution of input lengths differs between training and inference, i.e. when good length generalization is needed.

**The Transformer is not Turing Complete**: While the Transformer executes a total number of operations that scales with the input size, the number of sequential operations is constant and independent of the input size, determined solely by the number of layers. Assuming finite precision, this means that the Transformer cannot be computationally universal. An intuitive example is the class of functions whose execution requires the sequential processing of each input element. In this case, for any given depth *T*, one can construct an input sequence of length *N > T* that cannot be processed correctly by a Transformer:

**Lack of Conditional Computation**: The Transformer applies the same amount of computation to all inputs (as well as to all parts of a single input). However, not all inputs need the same amount of computation, and the amount could instead be conditioned on the complexity of the input.

Universal Transformers (UTs) address these shortcomings. In the next parts, we'll talk more about UT and its properties.

The **Universal Transformer** is an extension to the Transformer model that combines the parallelizability and global receptive field of the Transformer with the recurrent inductive bias of RNNs.

In the standard Transformer, we have a "fixed" stack of Transformer blocks, where each block is applied to all the input symbols in parallel. In the Universal Transformer, however, instead of having a fixed number of layers, we **iteratively** apply a Universal Transformer block (a self-attention mechanism followed by a recurrent transformation) to refine the representations of all positions in the sequence in parallel, during an arbitrary number of steps (which is possible due to the recurrence).

In fact, the Universal Transformer is a recurrent function (not in time, but in depth) that evolves per-symbol hidden states in parallel, based at each step on the sequence of previous hidden states. In that sense, UT is similar to architectures such as the Neural GPU and the Neural Turing Machine. This gives UTs the *attractive computational efficiency* of the original feed-forward Transformer model, but with the added *recurrent inductive bias* of RNNs.

Note that when running for a fixed number of steps, the Universal Transformer is equivalent to a multi-layer Transformer with tied parameters across its layers.
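That equivalence is easy to see in code. Below is a toy single-head sketch where one shared set of weights (tied across steps) is applied for a chosen number of steps; the real model also adds timestep and position embeddings, multiple heads, and layer normalization:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ut_encode(x, Wq, Wk, Wv, Wt, num_steps):
    """x: [n, d]. ONE set of weights is reused at every step:
    recurrence in depth rather than in time."""
    d = x.shape[-1]
    for _ in range(num_steps):
        attn = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(d))
        x = x + attn @ (x @ Wv)           # self-attention + residual
        x = x + np.maximum(0.0, x @ Wt)   # simple transition function + residual
    return x
```

Running `ut_encode` for T steps computes the same function as a T-layer Transformer whose layers all share these weights; unrolling with untied weights would recover the standard Transformer.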

In sequence processing systems, certain symbols (e.g. some words or phonemes) are usually more ambiguous than others. It is, therefore, reasonable to allocate more processing resources to these more ambiguous symbols.

As stated before, the standard Transformer applies the same amount of computation (a fixed number of layers) to all symbols in all inputs. To address this, the **Universal Transformer with dynamic halting** modulates the number of computational steps needed to process each input symbol dynamically, based on a scalar *pondering value* that is predicted by the model at each step. The pondering values are, in a sense, the model's estimation of how much further computation is required for the input symbols at each processing step.

The **Universal Transformer with dynamic halting** uses an Adaptive Computation Time (ACT) mechanism, originally proposed for RNNs, to enable conditional computation.

More precisely, Universal Transformer with dynamic halting adds a dynamic ACT halting mechanism to each position in the input sequence. Once the per-symbol recurrent block halts (indicating a sufficient number of revisions for that symbol), its state is simply copied to the next step until all blocks halt or we reach a maximum number of steps. The final output of the encoder is then the final layer of representations produced in this way:
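A simplified NumPy sketch of the per-position halting loop may help here. This is a rough illustration of the copy-forward behavior, not the full ACT mechanism (which also forms a weighted average of states and adds a ponder cost to the loss); the linear-plus-sigmoid pondering predictor `w_p` and the `tanh` step function are stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def act_halting(h, step_fn, w_p, max_steps=10, eps=0.01):
    """Per-position adaptive halting (simplified sketch).

    Each position accumulates a halting probability; once it crosses
    1 - eps, its state is frozen (copied forward) while the remaining
    positions keep being refined, up to `max_steps`."""
    h = h.copy()
    seq_len = h.shape[0]
    halted = np.zeros(seq_len, dtype=bool)
    cum_p = np.zeros(seq_len)
    n_steps = np.zeros(seq_len, dtype=int)
    for _ in range(max_steps):
        if halted.all():
            break
        p = sigmoid(h @ w_p)        # per-position pondering value
        new_h = step_fn(h)          # one recurrent refinement step
        active = ~halted
        h[active] = new_h[active]   # halted states are simply copied
        cum_p[active] += p[active]
        n_steps[active] += 1
        halted |= cum_p >= 1.0 - eps
    return h, n_steps

rng = np.random.default_rng(1)
h0 = rng.standard_normal((6, 4))
w_p = rng.standard_normal(4)
step = lambda x: np.tanh(x)         # toy transition function
h_out, steps_taken = act_halting(h0, step, w_p)
print(steps_taken)                  # positions can halt at different steps
```

The per-position `steps_taken` is what the ponder-time visualizations below are showing: ambiguous or relevant symbols accumulate more refinement steps before halting.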

Unlike the standard Transformer --which cannot be computationally universal, as its number of sequential operations is constant-- in the Universal Transformer we can choose the number of steps as a function of the input length. This holds independently of whether or not adaptive computation time is employed, but it does assume a non-constant, even if possibly deterministic, number of steps. Note that **varying the number of steps dynamically after training** is possible in Universal Transformers since the model shares weights across its sequential computation steps.

Given sufficient memory, the **Universal Transformer is computationally universal** – i.e. it belongs to the class of models that can be used to simulate any Turing machine (you can check this blog post on "What exactly is Turing Completeness?").

To show this, we can reduce a Neural GPU (which is Turing Complete) to a Universal Transformer: Let's ignore the decoder and parameterize the self-attention module (i.e., self-attention with the residual connection) to be the identity function. Now let’s assume the transition function is a convolution. Then, if we set the total number of recurrent steps *T* to be equal to the input length, we obtain exactly a Neural-GPU.

Note that the last step is where the Universal Transformer crucially differs from the vanilla Transformer whose depth cannot scale dynamically with the size of the input. A similar relationship exists between the Universal Transformer and the Neural Turing Machine, whose single read/write operations per step can be expressed by the global, parallel representation revisions of the Universal Transformer.

The cool thing about the Universal Transformer is that not only is it theoretically appealing (Turing complete), but in contrast to other computationally universal models like Neural-GPU which only perform well on algorithmic tasks, the Universal Transformer also achieves competitive results on realistic natural language tasks such as LAMBADA and machine translation. This closes the gap between practical sequence models competitive on large-scale tasks such as machine translation, and computationally universal models like Neural GPUs.

We applied the Universal Transformer to a variety of algorithmic tasks and a diverse set of large-scale language understanding tasks. These tasks were chosen because they are challenging in different respects. For instance, the bAbI question answering and reasoning tasks with 1k training samples require **data-efficient models** that are capable of multi-hop reasoning. Likewise, a set of algorithmic tasks like copy, reverse, addition, etc. are designed to assess the **length generalization** capabilities of a model (by training on short examples and evaluating on much longer examples). The subject-verb agreement task requires modeling **hierarchical structure**, which calls for a recurrent inductive bias. LAMBADA is a challenging language modeling task that requires **capturing a broad context**. And finally, MT is an important large-scale task that is one of the standard benchmarks for evaluating language processing models. Results on all these tasks are reported in the paper.

Here, we just bring some analysis on the bAbI question-answering task as an example. In bAbI tasks, the goal is to answer a question given a series of facts forming a story. The tasks measure various forms of language understanding by requiring a certain type of reasoning over the linguistic facts presented in each story.

A standard Transformer does not achieve good generalization on this task, no matter how much one tunes the hyper-parameters and the model. However, we can design a model based on the Universal Transformer that achieves state-of-the-art (SOTA) results on bAbI. To encode the input, we first encode each fact in the story by applying a learned multiplicative positional mask to each word's embedding and then summing all the embeddings. We embed the questions in the same way, and feed the UT with these embeddings of the facts and questions. Both the UT with and without dynamic halting achieve SOTA results in terms of average error and number of failed tasks, in both the 10K and 1K training regimes.
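The fact encoding above is simple enough to sketch in a few lines of NumPy. The random mask and embeddings here are stand-ins for the learned parameters; only the masking-then-summing scheme is taken from the description above.

```python
import numpy as np

def encode_fact(word_embs, pos_mask):
    """Encode one fact: apply a (learned, multiplicative) positional
    mask to each word embedding, then sum over the words so every
    fact becomes a single fixed-size vector."""
    n_words = word_embs.shape[0]
    return (word_embs * pos_mask[:n_words]).sum(axis=0)

d_model, max_words = 8, 10
rng = np.random.default_rng(2)
pos_mask = rng.standard_normal((max_words, d_model))  # learned in practice
fact = rng.standard_normal((4, d_model))              # a 4-word fact
fact_vec = encode_fact(fact, pos_mask)
print(fact_vec.shape)  # (8,)
```

Each fact (and the question) thus becomes one "symbol" in the sequence fed to the UT, which is what the per-fact attention and ponder-time visualizations below operate over.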

Here is a visualization of the attention distribution over multiple processing steps of UT in one of the examples from the test set in task 2:

**An example from task 2**: **(requiring two supportive facts to solve)**

**Story:**

John went to the hallway.

John went back to the bathroom.

John grabbed the milk there.

Sandra went back to the office.

Sandra journeyed to the kitchen.

Sandra got the apple there. Sandra dropped the apple there. John dropped the milk.

**Question:**

Where is the milk?

**Model's Output:**

bathroom

Visualization of the attention distributions, when encoding the question: *“Where is the milk?”*.

- Step#1

- Step#2

- Step#3

- Step#4

In this example, and in fact in most of the cases, the attention distributions start out very uniform, but get progressively sharper (peakier) in later steps around the correct supporting facts that are required to answer each question, which is indeed very similar to how humans would solve the task (i.e. from coarse to fine).

Here is a visualization of the per-symbol pondering times for a sample input processed by UT with adaptive halting:

As can be seen, the network learns to ponder more over the relevant facts than over the facts in the story that provide no support for the answer to the question.

Following intuitions behind the weight sharing found in CNNs and RNNs, UTs extend the Transformer with a simple form of weight sharing that strikes an effective balance between **inductive bias** and **model expressivity**.

Sharing weights in the depth of the network introduces a recurrence into the model. This recurrent inductive bias appears to be crucial for learning generalizable solutions in some tasks, like those that need modeling the hierarchical structure of the input or capturing dependencies in a broader context. Besides this, weight sharing in depth leads to better performance of UTs (compared to the standard Transformer) on **very small datasets** and makes the UT a very data-efficient model, which is attractive for domains and tasks with limited available data.

There has been a long track of research on RNNs, and many works followed the idea of recurrence in time to improve sequence processing. UT is a recurrent model where the recurrence is in depth, not in time. So there is **a notion of state in the depth** of the model, and one of the interesting directions is **to take ideas that worked for RNNs, "flip them vertically,"** and see if they can help improve the flow of information in the depth of the model. For instance, we can introduce memory/state with forget gates in depth by simply using an LSTM as the recurrent transition function:
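As a rough NumPy sketch of this "flipped vertically" idea, here is a minimal LSTM cell whose "time" axis is the UT's depth axis, so each position carries a memory cell (with forget gates) across refinement steps. The self-attention sublayer is omitted to isolate the transition function, and the single fused gate matrix `W` is a simplification of a real LSTM parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h, c, W):
    """Minimal LSTM cell; W maps [x; h] to the four gates (i, f, o, g)."""
    z = np.concatenate([x, h], axis=-1) @ W
    i, f, o, g = np.split(z, 4, axis=-1)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def ut_lstm_depth(x, W, num_steps):
    """Recurrence in depth: the LSTM state (h, c) evolves per position
    across UT refinement steps, with the SAME weights at every step."""
    h = np.zeros_like(x)
    c = np.zeros_like(x)
    for _ in range(num_steps):
        h, c = lstm_cell(x, h, c, W)
    return h

seq_len, d = 5, 8
rng = np.random.default_rng(3)
x = rng.standard_normal((seq_len, d))
W = rng.standard_normal((2 * d, 4 * d)) * 0.1
out = ut_lstm_depth(x, W, num_steps=3)
print(out.shape)  # (5, 8)
```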

Many of these ideas are already implemented in Tensor2Tensor and ready to be explored (for instance, check the UT with an LSTM as the transition function here).

The **code** used to train and evaluate Universal Transformers can be found here:

https://github.com/tensorflow/tensor2tensor

The code for training as well as attention and ponder time visualization of bAbI tasks can be found here:

https://github.com/MostafaDehghani/bAbI-T2T

For more details about the model as well as results and analysis on all tasks, please take a look at the paper:

- M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser. "*Universal Transformers*". International Conference on Learning Representations (ICLR'19).

Hosein Azarbonyad, my best friend at ILPS, has just defended his PhD dissertation on "Exploratory Search over Semi-Structured Documents". He is now a data scientist at KLM.

I've designed Hosein's thesis cover:

and here is its bookmark: