Long Range Arena: A Benchmark for Efficient Transformers

Transformers do not scale well to long sequence lengths, largely because of the quadratic complexity of self-attention. In recent months, a wide spectrum of fast, efficient Transformer variants has been proposed to tackle this problem (check out this post). However, there was no well-established way to evaluate this class of models systematically, and inconsistent benchmarking across a wide spectrum of tasks and datasets made it difficult to assess relative model quality and ensure fair comparison.
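To make the quadratic cost concrete, here is a minimal NumPy sketch of vanilla scaled dot-product self-attention (single head, no masking; the function name and shapes are ours, for illustration only). The n × n score matrix is what dominates compute and memory as the sequence length n grows:

import numpy as np

def dense_self_attention(x, wq, wk, wv):
    # x: (n, d) token embeddings; wq, wk, wv: (d, d) projection matrices.
    q, k, v = x @ wq, x @ wk, x @ wv                 # each (n, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n, n) -- quadratic in n
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ v                               # (n, d)

# At n = 16K (the longest LRA input), the (n, n) score matrix alone holds
# 16384 * 16384 float32 values, roughly 1 GiB per attention head per layer.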

Long-Range Arena (LRA) is a unified benchmark specifically focused on evaluating model quality under long-context scenarios. LRA consists of a diverse set of tasks and modalities with long sequences (ranging from 1K to 16K tokens). We systematically evaluate several well-established long-range Transformer models (Reformer, Linformer, Linear Transformer, Sinkhorn Transformer, Performer, Synthesizer, Sparse Transformer, Longformer, BigBird, Local Attention) on the LRA tasks and show how they compare in terms of performance versus computational cost and memory usage.

Here you can find the tasks and datasets included in LRA, the standard input pipeline we developed for each task, implementations of the baseline models, and, finally, the leaderboard:
https://github.com/google-research/long-range-arena

LRA Desiderata

When designing and developing LRA, we considered a set of requirements for the tasks and datasets included in the benchmark:

  • Generality: All efficient Transformer models should be applicable to our tasks. For instance, given that not all xformer models are able to perform autoregressive decoding, we include tasks that only require encoding.
  • Simplicity: The tasks should have a simple setup. All factors that make comparisons difficult should be removed. This encourages simple models rather than cumbersome pipelined approaches. For instance, we avoid including any particular data augmentation and consider pretraining to be out of scope for this benchmark.
  • Challenging: The tasks should be difficult enough for current models so that there is room for improvement, encouraging future research in this direction.
  • Long inputs: The input sequence lengths should be reasonably long since assessing how different models capture long-range dependencies is a core focus of LRA.
  • Probing diverse aspects: The set of tasks should assess different capabilities of models like their ability to model relations and hierarchical/spatial structures, generalization capability, etc.
  • Non-resource intensive and accessible: The benchmark is deliberately designed to be lightweight so as to be accessible to researchers without industry-grade computing resources.

LRA Tasks and Datasets

We design five tasks with sequence lengths ranging from 1K to 16K:

  • ListOps (Reasoning with List Operators): The dataset consists of sequences with a hierarchical structure and the operators MAX, MEAN, MEDIAN, and SUM MOD, enclosed by delimiters (brackets). A short sketch of how such expressions are evaluated appears after this task list.
    • Input size: 2K
    • Dataset: ListOps
    • Example: (with a much shorter length than those in the dataset)
INPUT:
[MAX 4 3 [MIN 2 3] 1 0 [MEDIAN 1 5 8 9 2]]
OUTPUT:
5
  • Byte-level Text Classification: We consider the byte/character-level setup of this task in order to simulate a longer input sequence, which also makes the task considerably more challenging. For byte-level text classification, the model needs to reason over compositional, unsegmented data in order to solve a meaningful real-world task (the byte-level encoding is also illustrated in the sketch after this list).
    • Input size: 4K
    • Dataset: IMDB reviews dataset
  • Byte-level Document Retrieval: This task evaluates a model’s ability to encode and store compressed representations that are useful for matching and retrieval. As in the text classification setup, we use a byte/character-level setup, which challenges the model to compose and aggregate information over longer contexts.
    • Input size: 8K (2 * 4K)
    • Dataset: ACL Anthology Network dataset; the task is to identify whether two papers are connected by a citation link.
  • Pixel-wise Image Classification: An image classification task in which the inputs are sequences of pixels: an N × N image is flattened into a 1D sequence of pixels. This task requires the model to learn the 2D spatial relations between input pixels while they are presented as a 1D sequence of symbols.
    • Input size: ~1K
    • Dataset: CIFAR-10
  • PathFinder (Long-Range Spatial Dependency): The task requires a model to make a binary decision: given an image, are two points, represented as circles, connected by a path consisting of dashes? Each image also contains distractor paths, which makes this setup challenging.
    • Input size: Two variants, normal: ~1K and hard (Path-X): ~16K
    • Dataset: PathFinder
    • Examples: a positive example image (the two circles are connected by a dashed path) and a negative example image (they are not).
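To make the input formats concrete, below is a rough, unofficial sketch in plain Python (not the pipeline code in the repository; the function names and the padding scheme are our own simplifications) of how a ListOps expression is evaluated to its label and how raw text is turned into the byte-level token IDs used for the classification and retrieval tasks:

def eval_listops(tokens):
    # Recursively evaluate a tokenized ListOps expression such as
    # [MAX 4 3 [MIN 2 3] 1 0 [MEDIAN 1 5 8 9 2]] -> 5.
    def parse(i):
        if tokens[i] != '[':                     # a bare digit
            return int(tokens[i]), i + 1
        op, i = tokens[i + 1], i + 2
        args = []
        while tokens[i] != ']':
            val, i = parse(i)
            args.append(val)
        i += 1                                   # skip the closing ']'
        if op == 'MAX':
            return max(args), i
        if op == 'MIN':
            return min(args), i
        if op == 'MEDIAN':
            return sorted(args)[len(args) // 2], i
        if op == 'SUM_MOD':
            return sum(args) % 10, i
        raise ValueError(op)
    value, _ = parse(0)
    return value

expr = '[ MAX 4 3 [ MIN 2 3 ] 1 0 [ MEDIAN 1 5 8 9 2 ] ]'.split()
assert eval_listops(expr) == 5                   # matches the example above

def to_byte_ids(text, max_len=4000):
    # Byte-level "tokenization": every byte of the raw text becomes one
    # integer ID, truncated or zero-padded to a fixed length (4K for IMDB).
    ids = list(text.encode('utf-8'))[:max_len]
    return ids + [0] * (max_len - len(ids))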

No One-Size-Fits-All Model

Based on our analysis, the model with the best performance in terms of the aggregate LRA score, i.e., averaged across all five tasks, is BigBird. While BigBird does not do extremely well on any individual task compared to other models, it performs consistently well across all tasks. Performers and Linear Transformers also show strong performance on some tasks. We also studied the trade-off between performance (y-axis), model speed (x-axis), and memory footprint (size of the circles):

While BigBird performs well, its speed is almost the same as that of the vanilla Transformer. At the other end, a model like Local Attention is fast, but at the cost of lower performance. Among these models, the kernel-based and low-rank variants, i.e., Performer, Linformer, and Linear Transformer, seem to strike a better trade-off between speed and performance while maintaining reasonable memory usage. Overall, the models that lie on the Pareto-optimal curve are BigBird and Performer.
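The speed advantage of these variants comes from never materializing the n × n attention matrix. The snippet below is a minimal sketch of that idea in the style of kernelized attention (the simple positive feature map used here is an illustrative placeholder, not Performer's random-feature map or Linformer's learned low-rank projection): attention is rewritten as phi(q) @ (phi(k)^T v), so the cost grows linearly with sequence length.

import numpy as np

def linear_attention(q, k, v):
    # q, k, v: (n, d). A simple positive feature map stands in for the
    # model-specific approximations used by the actual efficient variants.
    feature = lambda x: np.maximum(x, 0.0) + 1e-6      # phi(.), kept positive
    qf, kf = feature(q), feature(k)                     # (n, d) features
    kv = kf.T @ v                                       # (d, d) -- no n x n term
    normalizer = qf @ kf.sum(axis=0, keepdims=True).T   # (n, 1)
    return (qf @ kv) / normalizer                       # (n, d)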

To learn more about Long Range Arena, including more detailed results and analysis, check out our paper: