Vision Transformer: Farewell Convolutions?

Vision Transformer (ViT) is a pure self-attention-based architecture (Transformer) without CNNs. ViT stays as close as possible to the Transformer architecture that was originally designed for text-based tasks. One of the key characteristics of ViT is its extremely simple way of encoding inputs, combined with a vanilla Transformer architecture with no fancy tricks. In ViT, […]

Long Range Arena: A Benchmark for Efficient Transformers

Transformers do not scale well to long sequences, largely because the cost of self-attention grows quadratically with sequence length. In recent months, a wide spectrum of efficient, fast Transformers has been proposed to tackle this problem (check out this post). However, there was no well-established way to evaluate this class of models systematically […]
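
To make the quadratic cost concrete, here is a minimal pure-Python sketch that builds the score matrix of plain dot-product self-attention and counts its entries. The helper names (`attention_scores`, `count_score_entries`) and the toy constant inputs are assumptions made for illustration; they are not taken from the post or from any library.

```python
def attention_scores(queries, keys):
    """Return the matrix of dot products between every query and every key."""
    return [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in keys]
        for q in queries
    ]


def count_score_entries(seq_len, dim=4):
    """Count the entries of the self-attention score matrix for seq_len tokens."""
    # Hypothetical toy inputs: the values do not matter, only the shapes do.
    tokens = [[1.0] * dim for _ in range(seq_len)]
    scores = attention_scores(tokens, tokens)
    return sum(len(row) for row in scores)


if __name__ == "__main__":
    for n in (8, 16, 32):
        # Doubling the sequence length quadruples the number of scores.
        print(n, count_score_entries(n))
```

Every token attends to every other token, so a sequence of n tokens produces n × n scores. Doubling the sequence length therefore roughly quadruples both the memory and the compute spent on this step.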