Vision Transformer: Farewell Convolutions?

Vision Transformer (ViT) is a pure self-attention-based architecture (Transformer) without CNNs. ViT stays as close as possible to the Transformer architecture that was originally designed for text-based tasks. 

One of the key characteristics of ViT is its extremely simple way of encoding inputs, together with its use of the vanilla Transformer architecture with no fancy tricks. In ViT, we first extract patches from the input image. We then flatten each patch into a single vector by concatenating the channels and use a linear projection layer to embed the patches [1]. Next, we add learnable one-dimensional position embeddings to each patch embedding and feed the resulting sequence of image patch embeddings to a Transformer encoder, similar to the sequence of word embeddings used when applying Transformers to text.
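To make the input encoding concrete, here is a minimal pure-Python sketch of the three steps described above: patch extraction, linear projection, and adding position embeddings. The function names (extract_patches, linear_projection, embed_image), the toy sizes and the random weights are assumptions made for illustration only. In the real model the projection and the position embeddings are learned, and this is not the official implementation.

```python
import random


def extract_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches.

    Patches are returned in row-major order; each patch is one flat list
    holding all of its pixels with their channels concatenated.
    """
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel  # channels of one pixel
            ]
            patches.append(flat)
    return patches


def linear_projection(vector, weights, bias):
    """Compute weights @ vector + bias, where weights has shape D x len(vector)."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def embed_image(image, patch_size, weights, bias, position_embeddings):
    """Turn an image into a sequence of patch embeddings plus position embeddings."""
    tokens = []
    for flat, position in zip(extract_patches(image, patch_size), position_embeddings):
        embedded = linear_projection(flat, weights, bias)
        tokens.append([e + p for e, p in zip(embedded, position)])
    return tokens


# Toy usage (assumed sizes): a 4x4 RGB image, 2x2 patches, embedding size 5.
random.seed(0)
H, W, C, P, D = 4, 4, 3, 2, 5
image = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
num_patches = (H // P) * (W // P)
# Random stand-ins for parameters that would be learned during training.
weights = [[random.gauss(0.0, 0.02) for _ in range(P * P * C)] for _ in range(D)]
bias = [0.0] * D
position_embeddings = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(num_patches)]

tokens = embed_image(image, P, weights, bias, position_embeddings)
print(len(tokens), len(tokens[0]))  # sequence length and embedding size
```

Because the same weights are applied to every non-overlapping patch, the loop over (top, left) is what a convolution with kernel size and stride equal to the patch size would compute.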

Vision Transformer architecture, gif from Google AI blog.


[1] You can see this "extracting patches plus the linear projection" as applying a convolutional layer whose window size and stride both equal the patch size!

Long Range Arena: A Benchmark for Efficient Transformers

Transformers do not scale well to long sequences, largely because of the quadratic complexity of self-attention. In recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem (check out this post). However, there was no well-established way to evaluate this class of models systematically, and inconsistent benchmarking across a wide spectrum of tasks and datasets made it difficult to assess relative model quality and to compare models fairly.

Long-Range Arena (LRA) is a unified benchmark that is specifically focused on evaluating model quality under long-context scenarios. LRA consists of a diverse set of tasks and modalities with long sequences (ranging from 1K to 16K tokens). We systematically evaluate several well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, Longformers, Big Bird, Local Transformer) against LRA tasks and show how they compare in terms of performance versus computational cost and memory usage.
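To get a feel for why this sequence-length range is demanding, here is a small back-of-the-envelope calculation of the size of one full attention matrix. It assumes that "1K" means 1024 tokens, that scores are stored as 4-byte floats, and that we count a single head in a single layer for a single example; the helper names are hypothetical.

```python
def attention_matrix_entries(seq_len):
    """Number of pairwise attention scores for one head: seq_len x seq_len."""
    return seq_len * seq_len


def attention_matrix_mib(seq_len, bytes_per_entry=4):
    """Memory for one attention matrix in MiB (4-byte floats assumed)."""
    return attention_matrix_entries(seq_len) * bytes_per_entry / 2**20


for seq_len in (1024, 2048, 4096, 8192, 16384):
    print(seq_len, attention_matrix_entries(seq_len), attention_matrix_mib(seq_len))
```

Under these assumptions, going from 1K to 16K tokens multiplies the sequence length by 16 but the attention matrix by 256, from 4 MiB to 1024 MiB per head.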

Here you can find the tasks and datasets included in LRA, the standard input pipeline we developed for each task, implementations of the baseline models, and finally the leaderboard.

LRA Desiderata

When designing and developing LRA, we considered a set of requirements for the tasks and dataset we included in the benchmark: 

  • Generality: All efficient Transformer models should be applicable to our tasks. For instance, given that not all xformer models are able to perform autoregressive decoding, we include tasks that only require encoding.
  • Simplicity: The tasks should have a simple setup. All factors that make comparisons difficult should be removed. This encourages simple models instead of cumbersome pipelined approaches. For instance, we avoid including any particular data augmentation and consider pretraining to be out of scope of this benchmark.
  • Challenging: The tasks should be difficult enough for current models that there is room for improvement, which encourages future research in this direction.
  • Long inputs: The input sequence lengths should be reasonably long since assessing how different models capture long-range dependencies is a core focus of LRA.
  • Probing diverse aspects: The set of tasks should assess different capabilities of models like their ability to model relations and hierarchical/spatial structures, generalization capability, etc.
  • Non-resource intensive and accessible: The benchmarks should be deliberately designed to be lightweight so as to be accessible to researchers without industry-grade computing resources.

Efficient Transformers

Transformers have garnered immense interest lately due to their effectiveness across a range of domains such as language, vision, and reinforcement learning. The self-attention mechanism is a key defining characteristic of Transformer models. The mechanism can be viewed as a graph-like inductive bias that connects all tokens in a sequence with a relevance-based pooling operation. A well-known concern with self-attention is its quadratic time and memory complexity, which can hinder model scalability in many settings.
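The "relevance-based pooling" view can be sketched in a few lines of pure Python. This is a simplified single-head version that assumes identity query, key and value projections (a real Transformer learns these), and the function names are chosen for illustration. Every token scores every other token, which is where the quadratic cost comes from.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (an assumption).

    Returns the pooled outputs and the n x n matrix of attention weights.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Relevance of every token in the sequence to this query: n scores per token.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Pooling: weighted average of all token vectors.
        outputs.append(
            [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        )
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
print(len(weights), len(weights[0]))  # n rows of n weights each
```

The weights matrix has one entry per pair of tokens, so both the time to fill it and the memory to hold it grow with the square of the sequence length.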

Recently, a dizzying number of “X-former” models have been proposed, many of which improve computational and memory efficiency. We refer to this class of models as “efficient Transformers”. We wrote a survey that sets out to provide a comprehensive overview of the recent advances made in this class of models. In our survey, we propose a taxonomy of efficient Transformer models, characterizing them by their technical innovation and primary use case. We also provide a detailed walk-through of many of these models, including: Memory Compressed, Image Transformer, Set Transformer, Transformer-XL, Sparse Transformer, Reformer, Routing Transformer, Axial Transformer, Compressive Transformer, Sinkhorn Transformer, Longformer, ETC, Synthesizer, Performer, Linformer, Linear Transformers, and Big Bird.

Taxonomy of Efficient Transformer Architectures


MetNet: A Neural Weather Model for Precipitation Forecasting

Weather has an enormous impact on renewable energy, which is expected to reach 80% of the world’s electricity production, and on markets. Accurate weather forecasting has many social and economic benefits, from improvements in our daily lives to substantial impacts on agriculture, energy and transportation, and to the prevention of human and economic losses through better prediction of hazardous conditions such as storms and floods. However, weather forecasting (i.e. the prediction of future weather conditions such as precipitation, temperature, pressure, and wind) is a long-standing scientific challenge.

Most of the weather forecasting methods used by meteorological agencies are based on physical models of the atmosphere. Although these methods have seen substantial advances over the preceding decades, they are inherently constrained by their computational requirements and are sensitive to approximations of the physical laws used in them. An alternative approach to modeling weather in order to predict future conditions is to use deep neural networks: instead of explicitly encoding physical laws in our model, we can design neural networks that discover patterns in the data and learn complex transformations from inputs to outputs. In addition, given the infrastructure that has been built for serving neural models, such as accelerators, neural weather prediction models can be substantially faster than physics-based models.

In this direction, we introduce MetNet, a neural weather model for precipitation forecasting. MetNet outperforms HRRR, the current state-of-the-art physics-based model in use by NOAA, at predicting future weather conditions up to 8 hours ahead. In terms of speed, the latency of the model is a matter of seconds as opposed to an hour.

MetNet Architecture


PhD Thesis

I defended my PhD dissertation, with "cum laude" (the highest distinction in the Netherlands), on Friday, February 28, 2020, at 10:00 AM in the Agnietenkapel.


My PhD thesis is about "Learning with Imperfect Supervision for Language Understanding". You can download my PhD thesis here, or get it from the UvA repository.


Humans learn to solve complex problems and uncover underlying concepts and relations given limited, noisy or inconsistent observations, and draw successful generalizations based on them. This rests largely on the poverty of the stimulus argument, or what is sometimes called Plato’s problem: “How do we know so much when the evidence available to us is so meagre?”

In contrast, the success of today’s data-driven machine learning models is often strongly correlated with the amount of available high-quality labeled data, and teaching machines using imperfect supervision remains a key challenge. In practice, however, for many applications, large-scale, high-quality training data is not available, which highlights the increasing need for building models with the ability to learn complex tasks with imperfect supervision, i.e., where the learning process is based on imperfect training samples.

When designing learning algorithms, pure data-driven learning, which relies only on previous experience, does not seem to be able to learn generalizable solutions. Similar to humans’ innately primed learning, having part of the knowledge encoded in the learning algorithms in the form of strong or weak biases can help in learning solutions that generalize better to unseen samples.

In this thesis, we focus on the problem of the poverty of stimulus for learning algorithms. We argue that even noisy and limited signals can contain a great deal of valid information that can be incorporated along with prior knowledge and biases that are encoded into learning algorithms in order to solve complex problems. We improve the process of learning with imperfect supervision by (i) employing prior knowledge in learning algorithms, (ii) augmenting data and learning to learn how to better use the data, and (iii) introducing inductive biases to learning. These general ideas are, in fact, the key ingredients for building any learning algorithm that can generalize beyond (imperfections in) the observed data.


Universal Transformers: The Infinite Use of Finite Means!

Thanks to Stephan Gouws for his help on writing and improving this blog post.

Transformers have recently become a competitive alternative to RNNs for a range of sequence modeling tasks. They address a significant shortcoming of RNNs, i.e. their inherently sequential computation, which prevents parallelization across elements of the input sequence, whilst still addressing the vanishing gradients problem through their self-attention mechanism.

In fact, Transformers rely entirely on a self-attention mechanism to compute a series of context-informed vector-space representations of the symbols in their input (see this blog post to learn more about the details of the Transformer). This leads to two main properties for Transformers:

  • Straightforward to parallelize: There are no connections in time as with RNNs, allowing one to fully parallelize per-symbol computations.
  • Global receptive field: Each symbol’s representation is directly informed by all other symbols’ representations (in contrast to e.g. convolutional architectures which typically have a limited receptive field).

Although Transformers continue to achieve great improvements in many tasks, they have some shortcomings:

  • The Transformer is not Turing complete: While the Transformer executes a total number of operations that scales with the input size, the number of sequential operations is constant and independent of the input size, determined solely by the number of layers. Assuming finite precision, this means that the Transformer cannot be computationally universal. An intuitive example is a function whose execution requires the sequential processing of each input element. In this case, for any given choice of depth T, one can construct an input sequence of length N > T that cannot be processed correctly by a Transformer.
  • Lack of Conditional Computation: The Transformer applies the same amount of computation to all inputs (as well as to all parts of a single input). However, not all inputs need the same amount of computation; the amount could instead be conditioned on the complexity of the input.
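The two shortcomings above can be illustrated with a toy that only counts sequential steps. It is an assumption-laden sketch, not a model of what self-attention can or cannot compute, and not the Universal Transformer's actual mechanism: each "step" folds one more bit into a running parity, run_fixed applies a fixed number of steps T, and run_until_done keeps going until the input is consumed. All names here are hypothetical.

```python
def consume_one(state):
    """One sequential step: fold the next bit into a running parity."""
    remaining, parity = state
    if not remaining:
        return state  # nothing left to process; the step is a no-op
    return remaining[1:], parity ^ remaining[0]


def run_fixed(bits, depth):
    """Apply exactly `depth` steps, regardless of how long the input is."""
    state = (tuple(bits), 0)
    for _ in range(depth):
        state = consume_one(state)
    return state[1]


def run_until_done(bits, max_steps=10_000):
    """Apply steps until the input is consumed: computation depends on the input."""
    state = (tuple(bits), 0)
    steps = 0
    while state[0] and steps < max_steps:
        state = consume_one(state)
        steps += 1
    return state[1], steps


T = 4
short_bits = [1, 0, 1]           # N = 3 <= T, true parity 0
long_bits = [1, 0, 1, 1, 0, 1]   # N = 6 > T, true parity 0

print(run_fixed(short_bits, T), run_until_done(short_bits))
print(run_fixed(long_bits, T), run_until_done(long_bits))
```

With T = 4, the fixed-depth loop wastes a step on the short input and stops too early on the long one, returning the parity of only its first four bits. The input-dependent loop uses 3 and 6 steps respectively and gets both right.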

Universal Transformers (UTs) address these shortcomings.  In the next parts, we'll talk more about UT and its properties.