Vision Transformer: Farewell Convolutions?

Vision Transformer (ViT) is a pure self-attention-based architecture (Transformer) without CNNs. ViT stays as close as possible to the Transformer architecture that was originally designed for text-based tasks. One of the key characteristics of ViT is its extremely simple way of encoding inputs, together with its use of the vanilla Transformer architecture with no fancy tricks. In ViT, […]
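
As a rough illustration of that input encoding, the sketch below splits an image into fixed-size patches, flattens each patch, and linearly projects it into a token embedding; the image size, patch size, and embedding width are illustrative choices here, not prescribed values.

```python
import numpy as np

# Hypothetical ViT-style input encoding: cut the image into non-overlapping
# patches, flatten each patch, and project it to a token embedding.
image = np.random.rand(224, 224, 3)    # H x W x C (illustrative input)
patch_size = 16                        # 16x16 patches (an assumed setting)
d_model = 768                          # embedding width (an assumed setting)

# Reshape the image into (num_patches, patch_pixels).
patches = image.reshape(224 // patch_size, patch_size,
                        224 // patch_size, patch_size, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * 3)
print(patches.shape)                   # (196, 768): 196 patch tokens

# A learned linear projection maps each flattened patch to a token embedding;
# these tokens (plus position embeddings and a class token) go straight into
# a vanilla Transformer encoder.
W_embed = np.random.randn(patch_size * patch_size * 3, d_model) * 0.02
tokens = patches @ W_embed
print(tokens.shape)                    # (196, 768)
```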

Long Range Arena: A Benchmark for Efficient Transformers

Transformers do not scale very well to long sequence lengths, largely because of the quadratic complexity of self-attention. In recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem (check out this post). However, there was no well-established way to evaluate this class of models systematically […]
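
To make that quadratic cost concrete, the toy snippet below (with an arbitrary head dimension and sequence lengths) materializes the full attention score matrix and prints how its memory footprint grows with sequence length.

```python
import numpy as np

# The self-attention score matrix has one entry per pair of tokens, so its
# size grows quadratically with the sequence length n.
d = 64                                       # head dimension (illustrative)
for n in [1_000, 4_000, 8_000]:
    Q = np.random.rand(n, d).astype(np.float32)
    K = np.random.rand(n, d).astype(np.float32)
    scores = Q @ K.T                         # shape (n, n)
    print(n, scores.shape, f"{scores.nbytes / 1e9:.2f} GB")
# Quadrupling the sequence length multiplies the score matrix by 16x,
# which is exactly the cost that efficient Transformers try to avoid.
```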

Efficient Transformers

Transformers have garnered immense interest lately due to their effectiveness across a range of domains such as language, vision, and reinforcement learning. The self-attention mechanism is a key defining characteristic of Transformer models. The mechanism can be viewed as a graph-like inductive bias that connects all tokens in a sequence with a relevance-based pooling operation. A […]
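
A minimal sketch of that relevance-based pooling view is shown below (single head, toy dimensions, and randomly initialized projections, all chosen purely for illustration): each output token is a weighted average over all tokens' values, with weights given by a softmax over query-key relevance scores.

```python
import numpy as np

# Self-attention as relevance-based pooling over a fully connected token graph:
# every output token is a weighted average of all tokens' values, with weights
# given by a softmax over query-key relevance scores.
n, d = 8, 16                                   # toy sequence length and width
X = np.random.rand(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) * 0.1 for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)                  # pairwise relevance, all-to-all
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # softmax: each row sums to 1
out = weights @ V                              # pooling: weighted average of values

print(weights.sum(axis=-1))                    # all ones, a proper averaging
print(out.shape)                               # (8, 16)
```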

MetNet: A Neural Weather Model for Precipitation Forecasting

Weather has an enormous impact on renewable energy and energy markets; renewables are expected to reach 80% of the world's electricity production. There are many social and economic benefits of accurate weather forecasting, from improvements in our daily lives to substantial impacts on agriculture, energy, and transportation, and to the prevention of human and economic losses […]

Universal Transformers: The Infinite Use of Finite Means!

Thanks to Stephan Gouws for his help on writing and improving this blog post. Transformers have recently become a competitive alternative to RNNs for a range of sequence modeling tasks. They address a significant shortcoming of RNNs, namely their inherently sequential computation, which prevents parallelization across elements of the input sequence, whilst still addressing the […]