Hadi's Thesis Cover
Hadi Hashemi, my office mate during my PhD and my lunch buddy for many, many days, has just defended his PhD thesis on "Modeling Users Interacting with Smart Devices".
I've designed Hadi's thesis cover:

Vision Transformer (ViT) is a pure self-attention-based architecture (Transformer) without CNNs. ViT stays as close as possible to the Transformer architecture that was originally designed for text-based tasks.
One of the key characteristics of ViT is its extremely simple way of encoding inputs, using the vanilla Transformer architecture with no fancy tricks. In ViT, we first extract patches from the input image. Then we flatten each patch into a single vector by concatenating the channels and use a linear projection layer to embed the patches [1]. We then add learnable one-dimensional position embeddings to each patch and feed this input, as a sequence of image patch embeddings, to a Transformer encoder, similar to the sequence of word embeddings used when applying Transformers to text.
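To make this concrete, here is a minimal NumPy sketch of the patch-embedding step. The image size, patch size, and embedding dimension are illustrative assumptions, and the random weights are stand-ins for learned parameters:

```python
import numpy as np

H = W = 224   # input image height/width (assumed, as in typical ViT setups)
C = 3         # input channels
P = 16        # patch size
D = 768       # embedding dimension
num_patches = (H // P) * (W // P)          # 14 * 14 = 196

image = np.random.rand(H, W, C)
W_proj = np.random.rand(P * P * C, D)      # linear projection, shared by all patches

# 1) Extract non-overlapping P x P patches and flatten each into one vector.
patches = image.reshape(H // P, P, W // P, P, C)   # split both spatial axes
patches = patches.transpose(0, 2, 1, 3, 4)         # (14, 14, P, P, C)
patches = patches.reshape(num_patches, P * P * C)  # (196, P*P*C)

# 2) Embed every flattened patch with the same linear projection.
embeddings = patches @ W_proj                      # (196, D)

# 3) Add learnable 1-D position embeddings (random here) to get the
#    token sequence that is fed to a plain Transformer encoder.
pos_embed = np.random.rand(num_patches, D)
tokens = embeddings + pos_embed                    # (196, D)
```

As footnote [1] points out, steps 1 and 2 together behave exactly like a convolution with D output filters whose kernel size and stride both equal the patch size P.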
References
[1] You can see this "extracting patches plus the linear projection" as applying a convolutional layer with window size and stride equal to the patch size!
Transformers do not scale very well to long sequence lengths, largely because of the quadratic complexity of self-attention. In recent months, a wide spectrum of efficient, fast Transformer variants has been proposed to tackle this problem (check out this post). However, there was no well-established way to evaluate this class of models systematically, and inconsistent benchmarking across a wide spectrum of tasks and datasets made it difficult to assess relative model quality and compare models fairly.
Long-Range Arena (LRA) is a unified benchmark that is specifically focused on evaluating model quality under long-context scenarios. LRA consists of a diverse set of tasks and modalities with long sequences (ranging from 1K to 16K tokens). We systematically evaluate several well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, Longformers, Big Bird, Local Transformer) against LRA tasks and show how they compare in terms of performance versus computational cost and memory usage.
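To get a sense of the scale involved: at the upper end of that range, a single full attention matrix over 16K (16,384) tokens has 16,384² ≈ 268 million entries, i.e., roughly 1 GB per head per example in float32, before counting any other activations.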
Here you can find the tasks and datasets included in LRA, the standard input pipeline we developed for each task, implementations of the baseline models, and finally the leaderboard:
https://github.com/google-research/long-range-arena
When designing and developing LRA, we considered a set of requirements for the tasks and datasets we included in the benchmark:
Marzieh Fadaee, an old friend of mine, has just defended her PhD thesis on "Understanding and Enhancing the Use of Context for Machine Translation". Marzieh's thesis is one of the best theses I've ever seen. Super cool stuff!
I've designed Marzieh's thesis cover (of course with her help and great suggestions):
Transformers have garnered immense interest lately due to their effectiveness across a range of domains like language, vision, and reinforcement learning. The self-attention mechanism is a key defining characteristic of Transformer models. The mechanism can be viewed as a graph-like inductive bias that connects all tokens in a sequence with a relevance-based pooling operation. A well-known concern with self-attention is its quadratic time and memory complexity, which can hinder model scalability in many settings.
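As a reference point before the efficient variants, here is a minimal single-head self-attention sketch in NumPy (the dimensions and random weights are illustrative stand-ins). The (n, n) score matrix it builds is exactly the all-pairs, relevance-based pooling described above, and the source of the quadratic cost:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (toy sketch)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Every token attends to every other token: an (n, n) matrix,
    # hence O(n^2) time and memory in the sequence length n.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # relevance-based pooling

n, d = 128, 64                                         # illustrative sizes
X = np.random.rand(n, d)
Wq, Wk, Wv = (np.random.rand(d, d) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                    # (n, d)
```

The efficient Transformers surveyed below all attack, in one way or another, the cost of materializing that (n, n) matrix.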
Recently, a dizzying number of “X-former” models have been proposed, many of which make improvements around computational and memory efficiency. We hereinafter name this class of models “efficient Transformers”. We wrote a survey that sets out to provide a comprehensive overview of the recent advances made in this class of models. In our survey, we propose a taxonomy of efficient Transformer models, characterizing them by the technical innovation and primary use case. We also provide a detailed walk-through of many of these models, including: Memory Compressed, Image Transformer, Set Transformer, Transformer-XL, Sparse Transformer, Reformer, Routing Transformer, Axial Transformer, Compressive Transformer, Sinkhorn Transformer, Longformer, ETC, Synthesizer, Performer, Linformer, Linear Transformers, and Big Bird.
Weather has an enormous impact on renewable energy and energy markets; renewable energy is expected to reach 80% of the world's electricity production. There are many social and economic benefits to accurate weather forecasting, from improvements in our daily lives, to substantial impacts on agriculture, energy, and transportation, to the prevention of human and economic losses through better prediction of hazardous conditions such as storms and floods. However, weather forecasting (i.e., predicting future weather conditions such as precipitation, temperature, pressure, and wind) is a long-standing scientific challenge.
Most of the weather forecasting methods used by meteorological agencies are based on physical models of the atmosphere. Although these methods have seen substantial advances over the preceding decades, they are inherently constrained by their computational requirements and are sensitive to approximations of the physical laws they encode. An alternative approach to modeling weather is to use deep neural networks: instead of explicitly encoding physical laws in the model, we can design neural networks that discover patterns in the data and learn complex transformations from inputs to outputs. Besides, given the infrastructure built for serving neural models, such as accelerators, neural weather prediction models can be substantially faster than physics-based models.
Along this direction, we introduce MetNet, a neural weather model for precipitation forecasting. MetNet outperforms the High-Resolution Rapid Refresh (HRRR), the current state-of-the-art physics-based model in use by NOAA, at predicting future weather conditions up to 8 hours ahead, and in terms of speed, its latency is a matter of seconds as opposed to an hour.
I defended my PhD dissertation with "cum laude" (the highest distinction in the Netherlands) on Friday, February 28, 2020, at 10:00 AM in de Agnietenkapel.
My PhD thesis is about "Learning with Imperfect Supervision for Language Understanding". You can download my PhD thesis here, or get it from the UvA repository.
Humans learn to solve complex problems and uncover underlying concepts and relations from limited, noisy, or inconsistent observations, and draw successful generalizations based on them. This relates to the poverty of the stimulus argument, or what is sometimes called Plato's problem: "How do we know so much when the evidence available to us is so meagre?"
In contrast, the success of today's data-driven machine learning models is often strongly correlated with the amount of available high-quality labeled data, and teaching machines using imperfect supervision remains a key challenge. In practice, however, for many applications, large-scale high-quality training data is not available, which highlights the increasing need for building models that can learn complex tasks with imperfect supervision, i.e., where the learning process is based on imperfect training samples.
When designing learning algorithms, pure data-driven learning, which relies only on previous experience, does not seem able to find generalizable solutions. Similar to humans' innately primed learning, encoding part of the knowledge into the learning algorithm, in the form of strong or weak biases, can help it learn solutions that generalize better to unseen samples.
In this thesis, we focus on the problem of the poverty of the stimulus for learning algorithms. We argue that even noisy and limited signals can contain a great deal of valid information, which can be incorporated along with prior knowledge and biases encoded into learning algorithms in order to solve complex problems. We improve the process of learning with imperfect supervision by (i) employing prior knowledge in learning algorithms, (ii) augmenting data and learning to learn how to better use the data, and (iii) introducing inductive biases to learning algorithms. These general ideas are, in fact, the key ingredients for building any learning algorithm that can generalize beyond (imperfections in) the observed data.
Keyvan Azadbakht, one of my friends, has just defended his PhD thesis on "Asynchronous Programming in the Abstract Behavioural Specification Language".
I've designed Keyvan's thesis cover:
Thanks to Stephan Gouws for his help with writing and improving this blog post.
Transformers have recently become a competitive alternative to RNNs for a range of sequence modeling tasks. They address a significant shortcoming of RNNs, i.e., their inherently sequential computation, which prevents parallelization across elements of the input sequence, whilst still addressing the vanishing gradient problem through their self-attention mechanism.
In fact, Transformers rely entirely on a self-attention mechanism to compute a series of context-informed vector-space representations of the symbols in their input (see this blog post to learn more about the details of the Transformer). This leads to two main properties for Transformers:
Although Transformers continue to achieve great improvements in many tasks, they have some shortcomings:
Universal Transformers (UTs) address these shortcomings. In the next parts, we'll talk more about UTs and their properties.
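As a teaser, the core idea can be sketched in a few lines: where a standard Transformer stacks N layers with distinct parameters, a UT applies one shared block recurrently over depth. The toy NumPy sketch below only illustrates that weight sharing; the block is a stand-in for the real self-attention-plus-transition layer, and ACT-style dynamic halting is omitted:

```python
import numpy as np

n, d, T = 10, 64, 6                            # sequence length, width, recurrent steps (illustrative)
W_shared = np.random.rand(d, d) / np.sqrt(d)   # stand-in parameters, reused at every step
step_embed = np.random.rand(T, d)              # per-step ("timestep") embeddings

def shared_block(state):
    # Stand-in for the UT's shared self-attention + transition function;
    # the key point is that the SAME parameters are applied at every depth.
    return np.tanh(state @ W_shared)

def ut_encoder(x):
    state = x
    for t in range(T):                    # recurrence over depth, in parallel across positions
        state = state + step_embed[t]     # tell the shared block which step it is on
        state = shared_block(state)
    return state

out = ut_encoder(np.random.rand(n, d))    # (n, d)
```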
Hosein Azarbonyad, my best friend at ILPS, has just defended his PhD dissertation on "Exploratory Search over Semi-Structured Documents". He is now a data scientist at KLM.
I've designed Hosein's thesis cover: