# Universal Transformers

Thanks to Stephan Gouws for his help on writing and improving this blog post.

Transformers have recently become a competitive alternative to RNNs for a range of sequence modeling tasks. They address a significant shortcoming of RNNs, namely their inherently sequential computation, which prevents parallelization across elements of the input sequence, whilst still addressing the vanishing gradients problem through their self-attention mechanism.

In fact, Transformers rely entirely on a self-attention mechanism to compute a series of context-informed vector-space representations of the symbols in their input (see this blog post to know more about the details of the Transformer). This leads to two main properties for Transformers:

• Straightforward to parallelize: There are no connections in time as with RNNs, allowing one to fully parallelize per-symbol computations.
• Global receptive field: Each symbol’s representation is directly informed by all other symbols’ representations (in contrast to e.g. convolutional architectures which typically have a limited receptive field).
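To make the two properties concrete, here is a minimal sketch of single-head self-attention in plain Python. It is an illustration under simplifying assumptions, not the full Transformer layer: the query, key, and value projections are taken to be the identity, and the helper names `softmax` and `self_attention` are chosen for this example only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention over a list of n vectors of dimension d.

    Assumption for brevity: queries, keys, and values are the inputs
    themselves (identity projections), with no multi-head split.
    """
    d = len(x[0])
    out = []
    # Each position is computed independently of the other outputs,
    # so this outer loop could run fully in parallel.
    for q in x:
        # Scaled dot product of this position with every position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        # One weight per input position: a global receptive field.
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = self_attention(x)
```

Changing any single input vector changes every output vector, which is the global receptive field in action, while the per-position loop has no dependency between iterations.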

Although Transformers continue to achieve great improvements in many tasks, they have some shortcomings:

• The Transformer is not Turing complete: While the Transformer executes a total number of operations that scales with the input size, the number of sequential operations is constant and independent of the input size, determined solely by the number of layers. Assuming finite precision, this means that the Transformer cannot be computationally universal. An intuitive example is a function whose execution requires the sequential processing of each input element. In this case, for any given choice of depth T, one can construct an input sequence of length N > T that cannot be processed correctly by a Transformer.
• Lack of conditional computation: The Transformer applies the same amount of computation to all inputs (as well as to all parts of a single input). However, not all inputs need the same amount of computation; ideally, the amount would be conditioned on the complexity of the input.

Universal Transformers (UTs) address these shortcomings. In the next parts, we’ll talk more about UTs and their properties.


# Learning to Transform, Combine, and Reason in Open-Domain Question Answering

Our paper "Learning to Transform, Combine, and Reason in Open-Domain Question Answering", with Hosein Azarbonyad, Jaap Kamps, and Maarten de Rijke, has been accepted as a long paper at the 12th ACM International Conference on Web Search and Data Mining (WSDM 2019). \o/

We have all come to expect getting direct answers to complex questions from search systems on large open-domain knowledge sources like the Web. Open-domain question answering is a critical task that needs to be solved for building systems that help address our complex information needs.

To be precise, open-domain question answering is the task of answering a user’s question in the form of short texts rather than a list of relevant documents, using open and available external sources.

Most open-domain question answering systems described in the literature first retrieve relevant documents or passages, select one or a few of them as the context, and then feed the question and the context to a machine reading comprehension system to extract the answer.
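The retrieve-then-read pipeline described above can be sketched as follows. This is a toy illustration only: the token-overlap retriever and the sentence-picking reader are hypothetical stand-ins for a real retrieval system and a real machine reading comprehension model, and the three documents are made up for the example.

```python
def tokenize(text):
    """Lowercase whitespace tokenization with surrounding punctuation stripped."""
    tokens = (t.strip(".,?!\"'").lower() for t in text.split())
    return [t for t in tokens if t]


def retrieve(question, documents, k=2):
    """Stand-in retriever: rank documents by token overlap with the question."""
    q = set(tokenize(question))
    ranked = sorted(documents, key=lambda d: len(q & set(tokenize(d))), reverse=True)
    return ranked[:k]


def read(question, context):
    """Stand-in reader: return the context sentence that best overlaps the question."""
    q = set(tokenize(question))
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q & set(tokenize(s))))


def answer(question, documents, k=2):
    """Retrieve the top-k documents, join them into a context, and read it."""
    context = " ".join(retrieve(question, documents, k))
    return read(question, context)


docs = [
    "Paris is the capital of France. It lies on the Seine.",
    "Berlin is the capital of Germany.",
    "The Seine is a river in northern France.",
]
result = answer("What is the capital of France?", docs)
```

Note that the reader only ever sees the top-k documents, so anything ranked below the cut-off cannot contribute to the answer.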

However, the information needed to answer complex questions is not always contained in a single, directly relevant document that is ranked high. In many cases, there is a need to take a broader context into account, e.g., by considering low-ranked documents that are not immediately relevant, combining information from multiple documents, and reasoning over multiple facts from these documents to infer the answer.

### Why should we take a broader context into account?

In order to better understand why taking a broader context into account can be necessary or useful, let’s consider an example. Assume that a user asks this question: “Who is the Spanish artist, sculptor and draughtsman famous for co-founding the Cubist movement?”

We can use a search engine to retrieve the top-k relevant documents. The figure below shows the question along with a couple of retrieved documents.

# Internship at Apple

I’ve started an internship at Apple in San Francisco. I am working with the Siri Machine Learning team on learning disentangled representations 🙂

# SIGIR2018 Workshop on Learning From Noisy/Limited Data for IR

We are organizing the “Learning From Noisy/Limited Data for Information Retrieval” workshop, which is co-located with SIGIR 2018. This is the first edition of the workshop. Its goal is to bring together researchers from industry, where data is plentiful but noisy, and researchers from academia, where data is sparse but clean, to discuss solutions to these related problems.

We invited contributions relevant to these topics:

• Learning from noisy data for IR
• Learning from automatically constructed data
• Learning from implicit feedback data, e.g., click data
• Distant or weak supervision and learning from IR heuristics
• Unsupervised and semi-supervised learning for IR
• Transfer learning for IR
• Incorporating expert/domain knowledge to improve learning-based IR models
• Learning from labeled features
• Incorporating IR axioms to improve machine learning models

Marc Najork is going to give a fantastic keynote on “Using biased data for learning-to-rank”, and we have a set of fantastic papers (including mine :P) that are going to be presented at the workshop, as well as a great discussion panel with wonderful panelists from both industry and academia.

Save the date on your calendar!

# Fidelity-Weighted Learning

Our paper "Fidelity-Weighted Learning", with Arash Mehrjou, Stephan Gouws, Jaap Kamps, and Bernhard Schölkopf, has been accepted at the Sixth International Conference on Learning Representations (ICLR 2018). \o/

[perfectpullquote align=”full” bordertop=”false” cite=”” link=”” color=”” class=”#16989D” size=”16″]

### tl;dr

Fidelity-weighted learning (FWL) is a semi-supervised student-teacher approach for training deep neural networks using weakly-labeled data. It modulates the parameter updates to a student network, which is trained on the task we care about, on a per-sample basis according to the posterior confidence in each label’s quality, as estimated by a Bayesian teacher that has access to a rather small amount of high-quality labels.[/perfectpullquote]

The success of deep neural networks to date depends strongly on the availability of labeled data which is costly and not always easy to obtain. Usually, it is much easier to obtain small quantities of high-quality labeled data and large quantities of unlabeled data. The problem of how to best integrate these two different sources of information during training is an active pursuit in the field of semi-supervised learning and here, with FWL, we propose an idea to address this question.

## Learning from samples of variable quality

For a large class of tasks, it is also easy to define one or more so-called “weak annotators”, additional (albeit noisy) sources of weak supervision based on heuristics or “weaker”, biased classifiers trained on e.g. non-expert crowd-sourced data or data from different domains that are related. While easy and cheap to generate, it is not immediately clear if and how these additional weakly-labeled data can be used to train a stronger classifier for the task we care about. More generally, in almost all practical applications machine learning systems have to deal with data samples of variable quality. For example, in a large dataset of images only a small fraction of samples may be labeled by experts and the rest may be crowd-sourced using e.g. Amazon Mechanical Turk. In addition, in some applications, labels are intentionally perturbed due to privacy issues.

Assuming we can obtain a large set of weakly-labeled data in addition to a much smaller training set of “strong” labels, the simplest approach is to expand the training set by including the weakly-supervised samples (all samples are equal). Alternatively, one may pretrain on the weak data and then fine-tune on observations from the true function or distribution (which we call strong data). Indeed, a small amount of expert-labeled data can be augmented in such a way by a large set of raw data, with labels coming from a heuristic function, to train a more accurate neural ranking model. The downside is that such approaches are oblivious to the amount or source of noise in the labels.

[perfectpullquote align=”right” bordertop=”false” cite=”” link=”” color=”” class=”” size=””]

All labels are equal, but some labels are more equal than others, just like animals.

Inspired by George Orwell, Animal Farm, 1945.

[/perfectpullquote]

We argue that treating weakly-labeled samples uniformly (i.e. each weak sample contributes equally to the final classifier) ignores potentially valuable information about label quality. Instead, we propose Fidelity-Weighted Learning (FWL), a Bayesian semi-supervised approach that leverages a small amount of data with true labels to generate a larger training set with confidence-weighted weakly-labeled samples, which can then be used to modulate the fine-tuning process based on the fidelity (or quality) of each weak sample. By directly modeling the inaccuracies introduced by the weak annotator in this way, we can control the extent to which we make use of this additional source of weak supervision: more for confidently-labeled weak samples close to the true observed data, and less for uncertain samples further away from the observed data.

## How does fidelity-weighted learning work?

We propose a setting consisting of two main modules:

1. One is called the student and is in charge of learning a suitable data representation and performing the main prediction task.
2. The other is the teacher, which modulates the learning process by modeling the inaccuracies in the labels.

# Learning to Learn from Weak Supervision by Full Supervision

Our paper "Learning to Learn from Weak Supervision by Full Supervision", with Sascha Rothe and Jaap Kamps, has been accepted at the NIPS 2017 Workshop on Meta-Learning (MetaLearn 2017). \o/

Using weak or noisy supervision is a straightforward approach to increase the size of the training data and it has been shown that the output of heuristic methods can be used as weak or noisy signals along with a small amount of labeled data to train neural networks. This is usually done by pre-training the network on weak data and fine-tuning it with true labels. However, these two independent stages do not leverage the full capacity of information from true labels and using noisy labels of lower quality often brings little to no improvement. This issue is tackled by noise-aware models where denoising the weak signal is part of the learning process.

[latexpage]

We propose a meta-learning approach in which we train two networks: a target network, which plays the role of the learner and uses a large set of weakly annotated instances to learn the main task, and a confidence network, which plays the role of the meta-learner and is trained on a small human-labeled set to estimate confidence scores. These scores define the magnitude of the weight updates to the target network during the back-propagation phase. The goal of the confidence network, trained jointly with the target network, is to calibrate the learning rate of the target network for each instance in the batch. That is, the weights $\pmb{w}$ of the target network $f_w$ at step $t+1$ are updated as follows:

$\pmb{w}_{t+1} = \pmb{w}_t - \frac{\eta_t}{b}\sum_{i=1}^b c_{\theta}(x_i, \tilde{y}_i) \nabla \mathcal{L}(f_{\pmb{w}_t}(x_i), \tilde{y}_i)$

where $\eta_t$ is the global learning rate, $b$ is the batch size, $\mathcal{L}(\cdot)$ is the loss of predicting $\hat{y}_i=f_w(x_i)$ for an input $x_i$ when the label is $\tilde{y}_i$, and $c_\theta(\cdot)$ is a scoring function learned by the confidence network, taking as input an instance $x_i$ and its noisy label $\tilde{y}_i$. Thus, we can effectively control the contribution of weakly labeled instances to the parameter updates of the target network based on how reliable their labels are according to the confidence network, which is learned on a small supervised dataset.
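As a rough illustration of this update rule, the sketch below performs one confidence-weighted gradient step in plain Python. It rests on assumptions that are not part of the paper's setup: the target network is replaced by a linear model with a squared loss, and the confidence scores are passed in as fixed numbers rather than produced by a learned confidence network. The function name `weighted_sgd_step` is hypothetical.

```python
def weighted_sgd_step(w, batch, confidences, lr):
    """One step of w <- w - (lr / b) * sum_i c_i * grad L_i.

    Assumptions for this sketch: f_w(x) = w . x (linear model) and
    L = 0.5 * (f_w(x) - y)^2, so grad L = (f_w(x) - y) * x.
    `confidences` plays the role of c_theta(x_i, y_i) and is given, not learned.
    """
    b = len(batch)
    grad_sum = [0.0] * len(w)
    for (x, y_weak), c in zip(batch, confidences):
        pred = sum(wj * xj for wj, xj in zip(w, x))
        for j, xj in enumerate(x):
            # Each sample's gradient is scaled by its confidence score.
            grad_sum[j] += c * (pred - y_weak) * xj
    return [wj - (lr / b) * g for wj, g in zip(w, grad_sum)]


w = [0.0, 0.0]
batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0)]  # (input, weak label) pairs
confidences = [1.0, 0.0]  # trust the first weak label fully, ignore the second
new_w = weighted_sgd_step(w, batch, confidences, lr=0.5)
```

With these toy values only the first weight moves, because the second instance has confidence zero and therefore contributes nothing to the update.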

Our setup requires running a weak annotator to label a large amount of unlabeled data, which is done at pre-processing time. For many tasks, it is possible to use a simple heuristic to generate weak labels. This set is then used to train the target network. In contrast, a small human-labeled set is used to train the confidence network.  The general architecture of the model is illustrated in the figure below: