Transformers have recently become a competitive alternative to RNNs for a range of sequence modeling tasks. They address a significant shortcoming of RNNs, namely their **inherently sequential computation**, which prevents parallelization across elements of the input sequence, while still addressing the vanishing gradients problem through their self-attention mechanism.

In fact, Transformers rely entirely on a **self-attention mechanism** to compute a series of context-informed vector-space representations of the symbols in their input (see this blog post to learn more about the details of the Transformer). This leads to two main properties of Transformers:

- **Straightforward to parallelize**: There are no connections in time as with RNNs, allowing one to fully parallelize per-symbol computations.
- **Global receptive field**: Each symbol's representation is directly informed by all other symbols' representations (in contrast to e.g. convolutional architectures, which typically have a limited receptive field).

Although Transformers continue to achieve great improvements in many tasks, they have some shortcomings:

**No Recurrent Inductive Bias**: The Transformer trades the **recurrent inductive bias** of RNNs for parallelizability. However, the recurrent inductive bias appears to be crucial for generalizing on sequence modeling tasks of varying complexity: for instance, when it is necessary to model the hierarchical structure of the input, or when the distribution of input lengths differs between training and inference, i.e. when good length generalization is needed.

**The Transformer is not Turing Complete**: While the Transformer executes a total number of operations that scales with the input size, the number of *sequential* operations is constant and independent of the input size, determined solely by the number of layers. Assuming finite precision, this means that the Transformer cannot be computationally universal. An intuitive example is the class of functions whose execution requires the sequential processing of each input element. In this case, for any given choice of depth *T*, one can construct an input sequence of length *N > T* that cannot be processed correctly by a Transformer.

**Lack of Conditional Computation**: The Transformer applies the same amount of computation to all inputs (as well as to all parts of a single input). However, not all inputs need the same amount of computation, and the amount could be conditioned on the complexity of the input.

Universal Transformers (UTs) address these shortcomings. In the next parts, we'll talk more about UT and its properties.

The **Universal Transformer** is an extension of the Transformer model that combines the parallelizability and global receptive field of the Transformer with the recurrent inductive bias of RNNs.

In the standard Transformer, we have a "fixed" stack of Transformer blocks, where each block is applied to all the input symbols in parallel. In the Universal Transformer, however, instead of having a fixed number of layers, we **iteratively** apply a Universal Transformer block (a self-attention mechanism followed by a recurrent transformation) to refine the representations of all positions in the sequence in parallel, during an arbitrary number of steps (which is possible due to the recurrence).

In fact, the Universal Transformer is a recurrent function (not in time, but in depth) that evolves per-symbol hidden states in parallel, based at each step on the sequence of previous hidden states. In that sense, the UT is similar to architectures such as the Neural GPU and the Neural Turing Machine. This gives UTs the *attractive computational efficiency* of the original feed-forward Transformer model, but with the added *recurrent inductive bias* of RNNs.

Note that when running for a fixed number of steps, the Universal Transformer is equivalent to a multi-layer Transformer with tied parameters across its layers.
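To make the recurrence in depth concrete, here is a minimal numpy sketch (not the actual implementation: layer normalization, multi-head attention, and the position/timestep signals are omitted, and the weight shapes are made up) of applying one tied UT block for *T* steps:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ut_block(h, Wq, Wk, Wv, Wt):
    """One Universal Transformer step: self-attention followed by a
    position-wise transition, each wrapped in a residual connection."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    h = h + attn                      # residual around self-attention
    return h + np.tanh(h @ Wt)        # residual around the transition

def universal_transformer(h, params, T):
    """Apply the SAME block T times: weights are tied across depth."""
    for _ in range(T):
        h = ut_block(h, *params)
    return h

rng = np.random.default_rng(0)
d = 8
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(4)]
h0 = rng.normal(size=(5, d))          # 5 symbols, d-dim states
out = universal_transformer(h0, params, T=3)
```

Because the same `params` are reused at every step, `T` can be chosen freely (even after training), which is exactly what the tied-parameter equivalence above says.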

In sequence processing systems, certain symbols (e.g. some words or phonemes) are usually more ambiguous than others. It is, therefore, reasonable to allocate more processing resources to these more ambiguous symbols.

As stated before, the standard Transformer applies the same amount of computation (a fixed number of layers) to all symbols in all inputs. To address this, the **Universal Transformer with dynamic halting** modulates the number of computational steps needed to process each input symbol dynamically, based on a scalar *pondering value* that is predicted by the model at each step. The pondering values are, in a sense, the model's estimate of how much further computation is required for each input symbol at each processing step.

The **Universal Transformer with dynamic halting** uses an Adaptive Computation Time (ACT) mechanism, originally proposed for RNNs, to enable conditional computation.

More precisely, the Universal Transformer with dynamic halting adds a dynamic ACT halting mechanism to each position in the input sequence. Once the per-symbol recurrent block halts (indicating a sufficient number of revisions for that symbol), its state is simply copied to the next step until all blocks halt or we reach a maximum number of steps. The final output of the encoder is then the final layer of representations produced in this way.
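A rough numpy sketch of this per-position halting (a simplification of full ACT, which also keeps a weighted average of intermediate states; `Wp` and `bp` here are hypothetical parameters of the pondering-value predictor):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def act_halting(h, step_fn, Wp, bp, max_steps=10, eps=0.01):
    """Per-position dynamic halting: each symbol accumulates pondering
    values; once the sum reaches 1 - eps the symbol halts and its state
    is copied unchanged through the remaining steps."""
    n = h.shape[0]
    halted = np.zeros(n, dtype=bool)
    cum_p = np.zeros(n)                     # accumulated pondering values
    n_steps = np.zeros(n, dtype=int)        # revisions used per symbol
    for _ in range(max_steps):
        p = sigmoid(h @ Wp + bp)            # pondering value per symbol
        new_h = step_fn(h)
        # still-running positions get updated; halted ones are copied
        h = np.where(halted[:, None], h, new_h)
        n_steps += ~halted
        cum_p += np.where(halted, 0.0, p)
        halted |= cum_p >= 1.0 - eps
        if halted.all():
            break
    return h, n_steps

d = 4
h0 = np.ones((3, d))
Wp = np.zeros(d)
bp = 5.0                                    # high pondering value: halt fast
h, n_steps = act_halting(h0, lambda x: x + 1.0, Wp, bp)
```

With a large predicted pondering value every symbol halts after a single revision; with smaller values, different symbols run for different numbers of steps.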

Unlike the standard Transformer (which cannot be computationally universal, as its number of sequential operations is constant), in the Universal Transformer we can choose the number of steps as a function of the input length. This holds independently of whether or not adaptive computation time is employed, but does assume a non-constant, even if possibly deterministic, number of steps. Note that **varying the number of steps dynamically after training** is possible in Universal Transformers since the model shares weights across its sequential computation steps.

Given sufficient memory, the **Universal Transformer is computationally universal**, i.e. it belongs to the class of models that can be used to simulate any Turing machine. (You can check this blog post on "What exactly is Turing Completeness?".)

To show this, we can reduce a Neural GPU (which is Turing complete) to a Universal Transformer: Let's ignore the decoder and parameterize the self-attention module (i.e., self-attention with the residual connection) to be the identity function. Now let's assume the transition function is a convolution. Then, if we set the total number of recurrent steps *T* to be equal to the input length, we obtain exactly a Neural GPU.

Note that the last step is where the Universal Transformer crucially differs from the vanilla Transformer whose depth cannot scale dynamically with the size of the input. A similar relationship exists between the Universal Transformer and the Neural Turing Machine, whose single read/write operations per step can be expressed by the global, parallel representation revisions of the Universal Transformer.

The cool thing about the Universal Transformer is that not only is it theoretically appealing (Turing complete), but in contrast to other computationally universal models like Neural-GPU which only perform well on algorithmic tasks, the Universal Transformer also achieves competitive results on realistic natural language tasks such as LAMBADA and machine translation. This closes the gap between practical sequence models competitive on large-scale tasks such as machine translation, and computationally universal models like Neural GPUs.

We applied the Universal Transformer to a variety of algorithmic tasks and a diverse set of large-scale language understanding tasks. These tasks were chosen because they are challenging in different respects. For instance, the bAbI question answering and reasoning tasks with 1k training samples require **data-efficient models** that are capable of multi-hop reasoning. Likewise, a set of algorithmic tasks like copy, reverse, and addition are designed to assess the **length generalization** capabilities of a model (by training on short examples and evaluating on much longer ones). The subject-verb agreement task requires modeling **hierarchical structure**, which calls for a recurrent inductive bias. LAMBADA is a challenging language modeling task that requires **capturing a broad context**. And finally, machine translation is a very important large-scale task that is one of the standard benchmarks for evaluating language processing models. Results on all these tasks are reported in the paper.

Here, we present some analysis of the bAbI question-answering task as an example. In bAbI tasks, the goal is to answer a question given a series of facts forming a story. The tasks measure various forms of language understanding by requiring certain types of reasoning over the linguistic facts presented in each story.

A standard Transformer does not achieve good generalization on this task, no matter how much one tunes the hyper-parameters and the model. However, we can design a model based on the Universal Transformer that achieves state-of-the-art (SOTA) results on bAbI. To encode the input, we first encode each fact in the story by applying a learned multiplicative positional mask to each word's embedding, and then summing all embeddings. We embed the questions in the same way, and feed the UT with these embeddings of the facts and questions. The UT both with and without dynamic halting achieves SOTA results in terms of average error and number of failed tasks, in both the 10K and 1K training regimes.
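The fact-encoding step can be sketched in a few lines of numpy (a hedged illustration; in the real model the positional mask is a learned parameter, and the dimensions here are made up):

```python
import numpy as np

def encode_fact(word_embeddings, pos_mask):
    """Encode a fact (a sequence of word embeddings) into one vector:
    multiply each word embedding element-wise by its positional mask,
    then sum over words.  pos_mask has shape (max_len, d)."""
    L = word_embeddings.shape[0]
    return (word_embeddings * pos_mask[:L]).sum(axis=0)

rng = np.random.default_rng(0)
d, max_len = 16, 10
pos_mask = rng.normal(size=(max_len, d))   # learned in the real model
fact = rng.normal(size=(4, d))             # a 4-word fact
vec = encode_fact(fact, pos_mask)
```

The multiplicative mask lets the sum remain order-sensitive: without it, "John gave Mary the milk" and "Mary gave John the milk" would collapse to the same vector.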

Here is a visualization of the attention distribution over multiple processing steps of UT in one of the examples from the test set in task 2:

**An example from task 2** (requiring two supporting facts to solve):

**Story:**

John went to the hallway.

John went back to the bathroom.

John grabbed the milk there.

Sandra went back to the office.

Sandra journeyed to the kitchen.

Sandra got the apple there.

Sandra dropped the apple there.

John dropped the milk.

**Question:**

Where is the milk?

**Model's Output:**

bathroom

Visualization of the attention distributions, when encoding the question: *“Where is the milk?”*.

- Step#1

- Step#2

- Step#3

- Step#4

In this example, and in fact in most of the cases, the attention distributions start out very uniform, but get progressively sharper (peakier) in later steps around the correct supporting facts that are required to answer each question, which is indeed very similar to how humans would solve the task (i.e. from coarse to fine).

Here is a visualization of the per-symbol pondering times for a sample input processed by UT with adaptive halting:

As can be seen, the network learns to ponder longer over the relevant facts, compared to the facts in the story that provide no support for the answer to the question.

Following the intuitions behind weight sharing found in CNNs and RNNs, UTs extend the Transformer with a simple form of weight sharing that strikes an effective balance between **inductive bias** and **model expressivity**.

Sharing weights across the depth of the network introduces a recurrence into the model. This recurrent inductive bias appears to be crucial for learning generalizable solutions on some tasks, like those that need modeling of the hierarchical structure of the input, or capturing dependencies in a broader context. Besides this, weight sharing in depth leads to better performance of UTs (compared to the standard Transformer) on **very small datasets** and makes the UT a very data-efficient model, which is attractive for domains and tasks with limited available data.

There has been a long line of research on RNNs, and many works have followed the idea of recurrence in time to improve sequence processing. The UT is a recurrent model where the recurrence is in depth, not in time. So there is **a notion of state in the depth** of the model, and one interesting direction is **to take ideas that worked for RNNs, "flip them vertically"**, and see if they can help improve the flow of information in the depth of the model. For instance, we can introduce memory/state with forget gates in depth by simply using an LSTM as the recurrent transition function.
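As a rough numpy sketch of this idea (a hypothetical single-layer LSTM cell standing in for the transition function; the actual implementation lives in Tensor2Tensor), the LSTM's recurrence runs over depth steps rather than time steps:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_transition(x, h, c, W):
    """One LSTM cell update used as the UT transition function: the
    'time' axis of the LSTM is the DEPTH of the model, so forget/input
    gates control how much of each symbol's state survives per step."""
    z = np.concatenate([x, h], axis=-1) @ W          # (n, 4d) gate pre-activations
    i, f, o, g = np.split(z, 4, axis=-1)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)     # gated memory in depth
    h = sigmoid(o) * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n, d, T = 5, 8, 4
W = rng.normal(scale=0.1, size=(2 * d, 4 * d))
h, c = np.zeros((n, d)), np.zeros((n, d))
x = rng.normal(size=(n, d))      # e.g. the attention output at each step
for _ in range(T):               # recurrence in depth, not time
    h, c = lstm_transition(x, h, c, W)
```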

Many of these ideas are already implemented in Tensor2Tensor and ready to be explored (for instance, check the UT with an LSTM as the transition function here).

The **code** used to train and evaluate Universal Transformers can be found here:

https://github.com/tensorflow/tensor2tensor

The code for training as well as attention and ponder time visualization of bAbI tasks can be found here:

https://github.com/MostafaDehghani/bAbI-T2T

For more details about the model as well as results and analysis on all tasks, please take a look at the paper:

- M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser. "*Universal Transformers*". International Conference on Learning Representations (ICLR'19).

I've designed Hosein's thesis cover:

and here is its bookmark:

We have all come to expect getting direct answers to complex questions from search systems on large open-domain knowledge sources like the Web. *Open-domain question answering* is a critical task that needs to be solved for building systems that help address our complex information needs.

To be precise, open-domain question answering is the task of answering a user's question in the form of short texts rather than a list of relevant documents, using open and available external sources.

Most open-domain question answering systems described in the literature first retrieve relevant documents or passages, select one or a few of them as the context, and then feed the question and the context to a machine reading comprehension system to extract the answer.

However, the information needed to answer complex questions is not always contained in a single, directly relevant document that is ranked high. In many cases, there is a need to take **a broader context** into account, e.g., by considering low-ranked documents that are not immediately relevant, combining information from multiple documents, and reasoning over multiple facts from these documents to infer the answer.

In order to better understand why taking a broader context into account can be necessary or useful, let's consider an example. Assume that a user asks this question: "*Who is the Spanish artist, sculptor and draughtsman famous for co-founding the Cubist movement?*"

We can use a search engine to retrieve the top-k relevant documents. The figure below shows the question along with a couple of retrieved documents.

In an attempt to infer the correct answer to the user's question from the top-ranked document, a reading comprehension system will most likely extract "*Georges Braque*" as the answer, which is not correct.

In this example, in order to infer the correct answer, one has to go down the ranked list, gathering and encoding facts, even those that are not immediately relevant to the question, like the fact that "*Malaga is a city in Spain*", which can be inferred from a document at rank 66; then, in a multi-step reasoning process, infer new facts, including "*Picasso was a Spanish artist*" given the documents at ranks 12 and 66, and finally "*Picasso, who was a Spanish artist, co-founded the Cubist movement*" given the previously inferred fact and the document ranked third.

In this example, and in general in many cases in open-domain question answering, a piece of information in a low-ranked document that is not immediately relevant to the question may be useful to fill in the blanks and complete information extracted from the top relevant documents, eventually supporting inference of the correct answer. However, most open-domain question answering methods focus on only one or a few candidate documents, filtering out the less relevant documents to *avoid dealing with noisy information*, and operate over the selected set of documents to extract the answer.

In one of our recent papers, we propose a new model, *TraCRNet* (pronounced "*Tracker Net*"), to improve open-domain question answering by explicitly operating on a larger set of candidate documents during the whole process and learning how to aggregate and reason over information from these documents in an effective way while trying not to be distracted by noisy documents.

Given the candidate documents and the question, to generate the answer, TraCRNet first **transforms** them into vectors by applying a stack of Transformer blocks with self-attention over words in each document. Then, it updates the learned representations from the first stage by **combining** and enriching them through a multihop **reasoning** process by applying multiple steps of the Universal Transformer (UT).

The figure below shows the general schema of the TraCRNet architecture:

Let’s go through the main ingredients:

*Input encoding*: This layer is in charge of encoding each of the documents and the question into single vectors given their words' embeddings. For this layer, we used a stack of *N* transformer encoder blocks, followed by a depth-wise separable convolution and a pooling function, to get a single vector representation for the whole document or the question (see the transformer encoder in the figure above).

*Multihop reasoning*: In this layer, the universal transformer (UT) is employed to combine evidence from all documents with respect to the question in a multi-step process with the capacity for multihop reasoning. In TraCRNet, the input to the UT encoder is a set of vectors, each representing a candidate document or the question, computed by the *input encoding* layer (see the universal transformer encoder in the figure above). In each step of the UT, we add two embeddings to the vectors representing the question or documents: (i) a *rank embedding* that encodes the rank of each document given by the retrieval system, also used to distinguish the question from the documents, and (ii) a *step embedding* determining the current depth of the UT. In the multihop reasoning layer, the representations of all the documents and the question learned in the previous layer are updated over *T* steps. Self-attention in this layer allows the model to understand each document based on the information in all the other documents as well as the question.

*Output decoder*: Given the output of the multihop reasoning layer, we use a stack of *N* transformer decoder blocks (see the transformer decoder in the figure above) to decode the answer.
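The embedding bookkeeping of the multihop reasoning layer can be sketched as follows (a hedged numpy illustration; the shapes and the convention of reserving slot 0 for the question are assumptions for this sketch, not the paper's exact implementation):

```python
import numpy as np

def add_rank_and_step_embeddings(states, rank_emb, step_emb, t):
    """Before UT step t, add a rank embedding (each vector's position in
    the retrieval ranking, with one index reserved for the question) and
    a step embedding (the current depth t) to every vector, as in
    TraCRNet's multihop reasoning layer."""
    k = states.shape[0]
    return states + rank_emb[:k] + step_emb[t]

rng = np.random.default_rng(0)
d, max_docs, max_steps = 16, 101, 8
rank_emb = rng.normal(size=(max_docs, d))   # learned in the real model
step_emb = rng.normal(size=(max_steps, d))
states = rng.normal(size=(51, d))           # question + 50 documents
x = add_rank_and_step_embeddings(states, rank_emb, step_emb, t=2)
```

The rank embedding keeps the retrieval order visible to self-attention (which is otherwise permutation-invariant), while the step embedding lets the shared UT block behave differently at different reasoning depths.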

Returning to our earlier example about the Spanish artist, sculptor and draughtsman who is famous for co-founding the Cubist movement, after learning representations for each top-ranked document and the question, TraCRNet updates them by applying multiple steps of the universal transformer.

Given the self-attention mechanism and the recurrence in depth in the universal transformer in the first step, TraCRNet can update the representation of document #12 by attending to document #66 and augment the information in document #12. Then, in the next step of reasoning, TraCRNet can update the representation of document #3 by attending over the vector representing document #12 estimated in the previous step, and enrich the information in document #3. After that, during answer generation, the decoder can attend to the final vector representing document #3 and give the correct answer.

We looked into the attention distribution for this particular example and were able to find a relation between attention distributions and the reasoning steps that are needed to give the correct answer to this question. The figure below presents the attention distribution of different heads of UT over all documents and the question while encoding document #12 at step 3 and step 7:

Step 3:

Step 7:

At step 3, TraCRNet attends strongly to document #66 using heads #1 and #4 (blue and red), as well as to the question using head #3 (green), while transforming document #12. This is in accordance with the fact that the model first needs to update the information encoded in document #12 with the fact that "*Malaga is a city in Spain*" from document #66.

Later, at step #7, while encoding the question, TraCRNet attends over document #12, which has information about "*Picasso, who is a Spanish artist*" (updated in step 3), using heads #1 and #4, as well as document #3, which contains information about "*Picasso as a co-founder of Cubism*", using head #2.

TraCRNet has a number of desirable features:

- All the building blocks of TraCRNet are based on *self-attentive feed-forward neural networks*, hence per-symbol hidden state transformations are fully **parallelizable**, which leads to an enormous speedup during training and very fast input encoding at inference time compared to RNN-based models.

- While there is no recurrence in time in our model, the recurrence in depth in the universal transformer used in the **multihop reasoning** layer adds the inductive bias the model needs to go beyond understanding each document separately and to combine their information in multiple steps.

- TraCRNet has the global receptive field of transformer-based models, which helps it better encode long documents during *input encoding* as well as perform better inference over a rather large set of documents during *multihop reasoning*.

- The hierarchical usage of a self-attention mechanism, first over words and then over documents, helps TraCRNet control its attention both at word and document levels, making it less fragile to noisy input, which is of key importance while encoding many documents.

All these properties of TraCRNet come together and lead to an effective and efficient architecture for open-domain question answering.

We employed TraCRNet on two public open-domain question answering datasets, SearchQA and Quasar-T, and achieved results that meet or exceed the state of the art. We also conducted analyses of the sensitivity of TraCRNet to the number of candidate documents, as well as ablation studies on its architecture, which you can find in the paper listed below.

For more details about TraCRNet and to see how it performs on benchmark datasets, please take a look at our paper:

**Mostafa Dehghani**, H. Azarbonyad, J. Kamps, M. de Rijke. "*Learning to Transform, Combine, and Reason in Open-Domain Question Answering*'', In Proceedings of the Twelfth International Conference on Web Search and Data Mining (WSDM2019).


We invited contributions relevant to these topics:

- Learning from noisy data for IR
- Learning from automatically constructed data
- Learning from implicit feedback data, e.g., click data
- Distant or weak supervision and learning from IR heuristics
- Unsupervised and semi-supervised learning for IR
- Transfer learning for IR
- Incorporating expert/domain knowledge to improve learning-based IR models
- Learning from labeled features
- Incorporating IR axioms to improve machine learning models

Marc Najork is going to give a keynote on "Using biased data for learning-to-rank", and we have a set of fantastic papers (including mine :P) that will be presented at the workshop, plus a great discussion panel with wonderful panelists from both industry and academia.

Save the date on your calendar!

**"Fidelity-Weighted Learning"**, with Arash Mehrjou, Stephan Gouws, Jaap Kamps, and Bernhard Schölkopf, **has been accepted** at the Sixth International Conference on Learning Representations (ICLR2018). **\o/**

## tl;dr

Fidelity-weighted learning (FWL) is a semi-supervised student-teacher approach for training deep neural networks using weakly-labeled data. It modulates the parameter updates to a student network, which is trained on the task we care about, on a per-sample basis according to the posterior confidence in the quality of each label, as estimated by a Bayesian teacher who has access to a rather small amount of high-quality labels.

The success of deep neural networks to date depends strongly on the availability of labeled data which is costly and not always easy to obtain. Usually, it is much easier to obtain small quantities of high-quality labeled data and large quantities of unlabeled data. The problem of **how to best integrate these two different sources of information during training** is an active pursuit in the field of semi-supervised learning and here, with FWL, we propose an idea to address this question.

For a large class of tasks, it is also easy to define one or more so-called “**weak annotators**”, additional (albeit noisy) sources of weak supervision based on heuristics or “weaker”, biased classifiers trained on e.g. non-expert crowd-sourced data or data from different domains that are related. While easy and cheap to generate, it is not immediately clear if and **how these additional weakly-labeled data can be used to train a stronger classifier** for the task we care about. More generally, in almost all practical applications machine learning systems have to deal with **data samples of variable quality**. For example, in a large dataset of images only a small fraction of samples may be labeled by experts and the rest may be crowd-sourced using e.g. Amazon Mechanical Turk. In addition, in some applications, labels are intentionally perturbed due to privacy issues.

Assuming we can obtain a large set of weakly-labeled data in addition to a much smaller training set of “strong” labels, the simplest approach is to expand the training set by including the weakly-supervised samples (all samples are equal). Alternatively, one may pretrain on the weak data and then fine-tune on observations from the true function or distribution (which we call strong data). Indeed, a small amount of expert-labeled data can be augmented in such a way by a large set of raw data, with labels coming from a heuristic function, to train a more accurate neural ranking model. The downside is that such approaches are oblivious to the amount or source of noise in the labels.

All labels are equal, but some labels are more equal than others, just like animals.

^{Inspired by George Orwell, *Animal Farm*, 1945.}

We argue that the learner should take label quality into account: samples with unreliable labels should influence training less than samples whose labels we trust.

We propose a setting consisting of two main modules:

- One is called the **student** and is in charge of *learning a suitable data representation and performing the main prediction task*,
- The other is the **teacher**, which *modulates the learning process by modeling the inaccuracies in the labels*.

We assume we are given a large set of unlabeled data samples, a heuristic labeling function called the *weak annotator*, and a small set of high-quality samples labeled by experts, called the **strong dataset**, consisting of tuples of training samples and their true labels, i.e. $\mathcal{D}_s = \{(x_i, y_i)\}$. We consider the latter to be observations from the true target function that we are trying to learn.

We use the weak annotator to generate labels for the unlabeled samples. The generated labels are noisy due to the limited accuracy of the weak annotator. This gives us the **weak dataset**, consisting of tuples of training samples and their weak labels, i.e. $\mathcal{D}_w = \{(x_i, \tilde{y}_i)\}$. Note that we can generate a large amount of weak training data at almost no cost using the weak annotator. In contrast, we have only a limited number of observations from the true function, so $|\mathcal{D}_w| \gg |\mathcal{D}_s|$.

Here, we assume the student to be a neural network and the teacher to be a Bayesian function approximator. The training process consists of three phases (illustrated in the figure above):

**Step 1**: *Pre-train the student on $\mathcal{D}_w$ using weak labels generated by the weak annotator.*

The main goal of this step is to learn a **task-dependent representation** of the data as well as **pretraining the student**. The student function is a neural network consisting of two parts: the first part $\phi$ learns the data representation, and the second part $\psi$ performs the prediction task (e.g. classification). The overall function is therefore $\psi(\phi(x))$. The student is trained on all samples of the weak dataset $\mathcal{D}_w$. For brevity, in the following, we will refer to both a data sample $x$ and its learned representation $\phi(x)$ simply by $x$ when it is obvious from the context.

From the self-supervised feature learning point of view, we can say that representation learning in this step solves a surrogate task of approximating the expert knowledge, for which a noisy supervision signal is provided by the weak annotator.

**Step 2**: *Train the teacher on the strong data, represented in terms of the student representation, and then use the teacher to generate a soft dataset for **all** data samples.*

We use a Gaussian process (GP) as the teacher to capture the label uncertainty in terms of the student representation, estimated with respect to the strong data. A prior mean and covariance function is chosen for the GP. The embedding function $\phi$ learned in Step 1 is then used to map the data samples to dense vectors that serve as inputs to the GP. Using the representation learned by the student in the previous step compensates for the lack of data in $\mathcal{D}_s$, and the teacher can benefit from the knowledge learned from the large quantity of weakly annotated data.

**This way, we also let the teacher see the data through the lens of the student.**

Let's call the labels generated by the teacher *soft labels*, and refer to the resulting set $\mathcal{D}_{sw}$ of samples, soft labels, and uncertainties as the **soft dataset**. Note that we train the teacher only on the strong dataset, but then use it to generate soft labels and uncertainty estimates for samples belonging to $\mathcal{D}_w$ as well.

**Step 3**: *Fine-tune the weights of the student network on the soft dataset, while modulating the magnitude of each parameter update by the corresponding teacher confidence in its label.*

The student network of Step 1 is fine-tuned using samples from the soft dataset $\mathcal{D}_{sw}$. The uncertainty of each sample is mapped to a confidence value (we will explain how in a minute!), which is then used to determine the step size for each iteration of stochastic gradient descent (SGD). So, intuitively:

*For data points where we have true labels, the uncertainty of the teacher is almost zero, which means we have high confidence and a large step size for updating the parameters. However, for data points where the teacher is not confident, we down-weight the training steps of the student. This means that at these points, we keep the student function as it was trained on the weak data in Step 1.*

More specifically, we update the parameters $\mathbf{w}$ of the student by training on $\mathcal{D}_{sw}$ using SGD:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \left( \nabla_{\mathbf{w}}\, l(\mathbf{w}_t, x_t, \bar{y}_t) + \nabla_{\mathbf{w}}\, \mathcal{R}(\mathbf{w}_t) \right), \qquad t = 1, \ldots, N,$$

where $l(\cdot)$ is the per-example loss, $\eta_t$ is the total learning rate, $N$ is the size of the soft dataset $\mathcal{D}_{sw}$, $\mathbf{w}$ are the parameters of the student network, and $\mathcal{R}(\mathbf{w})$ is the regularization term (the usual regularization used by optimization packages, e.g. weight decay). We define the total learning rate as $\eta_t = \eta_1(t)\, \eta_2(x_t)$, where $\eta_1(t)$ is the usual learning rate of our chosen optimization algorithm, which anneals over training iterations, and $\eta_2(x_t)$ is a function of the label uncertainty $\Sigma(x_t)$ computed by the teacher for each data point. Multiplying these two terms gives us the total learning rate. In other words, $\eta_2(x_t)$ represents the **fidelity** (quality) of the current sample and is used to multiplicatively modulate $\eta_1(t)$. Note that the first term does not depend on each data point, whereas the second term does. We propose

$$\eta_2(x_t) = \exp\left[ -\beta\, \Sigma(x_t) \right] \qquad (1)$$

to exponentially decrease the learning rate for data point $x_t$ if its corresponding soft label $\bar{y}_t$ is unreliable (far from a true sample). In Equation (1), $\beta$ is a positive scalar hyper-parameter. Intuitively, a small $\beta$ results in a student that listens more carefully to the teacher and copies its knowledge, while a large $\beta$ makes the student pay less attention to the teacher, staying with its initial weak knowledge. More concretely, as $\beta \to 0$, the student places more trust in the labels estimated by the teacher and copies the teacher's knowledge. On the other hand, as $\beta \to \infty$, the student puts less weight on the extrapolation ability of the teacher, and its parameters are not affected by the correcting information from the teacher.
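The fidelity-weighted update can be sketched in a few lines of numpy (a simplified illustration: the per-sample gradients and the uncertainty values are made up, and the real system computes $\Sigma(x_t)$ with a GP teacher):

```python
import numpy as np

def fidelity_lr(eta1, sigma, beta):
    """Total learning rate: the base rate eta1 multiplicatively modulated
    by exp(-beta * sigma), where sigma is the teacher's uncertainty
    about the sample's soft label."""
    return eta1 * np.exp(-beta * sigma)

def fwl_sgd_step(w, grads, sigma, eta1=0.1, beta=1.0):
    """One fidelity-weighted SGD step over a minibatch: each per-sample
    gradient is scaled by that sample's confidence before averaging."""
    eta = fidelity_lr(eta1, sigma, beta)             # shape (batch,)
    return w - (eta[:, None] * grads).mean(axis=0)

# made-up uncertainties: a strong-labeled point (~0), a decent soft
# label, and a soft label the teacher is very unsure about
sigma = np.array([0.0, 0.5, 5.0])
w_new = fwl_sgd_step(np.zeros(2), np.ones((3, 2)), sigma)
```

The confident sample gets the full step size, while the highly uncertain sample barely moves the parameters, which is exactly the "keep the weakly-trained student as-is where the teacher is unsure" behavior described above.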

Let's apply FWL to a one-dimensional toy problem to illustrate its various steps.

Let $y = f(x)$ be the true function (red dotted line in the plot in the figure below), from which a small set of observations is provided (red points in the plot). These observations might be noisy, in the same way that labels obtained from a human labeler could be noisy.

A weak annotator function $f_w(x)$ (magenta line in the plot) is provided as an approximation to $f(x)$. The task is to obtain a good estimate of $f$ given the small set of strong observations and the weak annotator function $f_w$. We can easily obtain a large set of observations from $f_w$ with almost no cost (magenta points in the plot).
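This setup can be sketched as follows. The concrete choice of true function and weak annotator below is mine, purely for illustration (the paper uses its own functions): the true function is expensive to query, while the weak annotator is a crude but free approximation.

```python
import math
import random

random.seed(0)

def true_f(x):
    # hypothetical expensive "true" function
    return math.sin(x)

def weak_f(x):
    # hypothetical weak annotator: a crude approximation of true_f
    # (a truncated Taylor series that degrades away from zero)
    return x - x ** 3 / 6

# a handful of expensive, possibly noisy strong observations
strong_data = [(x, true_f(x) + random.gauss(0.0, 0.05))
               for x in (random.uniform(-3.0, 3.0) for _ in range(10))]

# a large set of cheap weak observations
weak_data = [(x, weak_f(x))
             for x in (random.uniform(-3.0, 3.0) for _ in range(500))]
```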

We consider two experiments:

- A neural network trained on weak data and then fine-tuned on strong data from the true function, which is the most common semi-supervised approach (plot in the figure above).
- A teacher-student framework trained with the proposed FWL approach.

As can be seen in the plot in the figure above, FWL, by taking label confidence into account, gives a better approximation of the true hidden function. We repeated the above experiment 10 times. The average RMSE with respect to the true function on a set of test points over those 10 experiments was computed for each variant of the student:

- Student trained on weak data only (blue line in the plot in the figure above),
- Student trained on weak data, then fine-tuned on true observations (blue line in the plot in the figure above),
- Student trained on weak data, then fine-tuned with the soft labels and confidence information provided by the teacher (blue line in the plot in the figure above) (best).

The exact RMSE values, more details of the neural network, and the specification of the data used in the above experiment are in the paper.

That was the general idea of FWL. To see how it works for real-world tasks, like sentiment classification or document ranking, you can take a look at our paper:

**Mostafa Dehghani**, A. Mehrjou, S. Gouws, J. Kamps, B. Schölkopf. "*Fidelity-Weighted Learning*", In Proceedings of the Sixth International Conference on Learning Representations (ICLR'18).

Using weak or noisy supervision is a straightforward approach to increasing the size of the training data, and it has been shown that the output of heuristic methods can be used as a weak or noisy signal, along with a small amount of labeled data, to train neural networks. This is usually done by pre-training the network on weak data and fine-tuning it with true labels. However, these two independent stages do not leverage the full capacity of information from true labels, and using noisy labels of lower quality often brings little to no improvement. This issue is tackled by noise-aware models, where denoising the weak signal is part of the learning process.

We propose a **meta-learning** approach in which we train two networks: a *target network*, which plays the role of **the learner** and uses a large set of weakly annotated instances to learn the main task, and a *confidence network*, which plays the role of **the meta-learner** and is trained on a small human-labeled set to estimate confidence scores. These scores define the *magnitude of the weight updates* to the target network during the back-propagation phase. The goal of the confidence network, trained jointly with the target network, is to calibrate the learning rate of the target network for each instance in the batch. I.e., the weights $w_t$ of the target network at step $t$ are updated as follows:

$$w_{t+1} = w_t - \eta_t \, c_\theta(x_t, \tilde{y}_t) \, \nabla \mathcal{L}(\hat{y}_t, \tilde{y}_t)$$

where $\eta_t$ is the global learning rate, $\mathcal{L}(\hat{y}_t, \tilde{y}_t)$ is the loss of predicting $\hat{y}_t$ for an input $x_t$ when the label is $\tilde{y}_t$; $c_\theta(\cdot)$ is a scoring function learned by the confidence network, taking the input instance $x_t$ and its noisy label $\tilde{y}_t$. Thus, we can effectively control the contribution to the parameter updates of the target network from weakly labeled instances, based on how reliable their labels are according to the confidence network, learned on a small supervised set.
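In code, one mini-batch update of the target network might look like the following sketch. The helper names `grad_fn` and `confidence_fn` are stand-ins of mine for the real target and confidence networks:

```python
def target_update(theta, batch, grad_fn, confidence_fn, lr):
    # Scale each example's gradient by the confidence network's score
    # for the pair (input, weak label) before averaging over the batch.
    total_grad = 0.0
    for x, y_weak in batch:
        c = confidence_fn(x, y_weak)              # score in [0, 1]
        total_grad += c * grad_fn(theta, x, y_weak)
    return theta - lr * total_grad / len(batch)
```

An example whose weak label the confidence network scores near zero contributes almost nothing to the update, while a fully trusted example takes a full gradient step.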

We use a **weak annotator** to label a large amount of unlabeled data, which is done at pre-processing time. For many tasks, it is possible to use a simple heuristic to generate weak labels. This set is then used to train the target network. In contrast, a small human-labeled set is used to train the confidence network. The general architecture of the model is illustrated in the figure below:

The subfigure on the left presents the **full-supervision mode**, in which, given a batch of data with true labels (as well as weak labels), we train the confidence network to learn how likely each example in the batch, together with its weak label, is to help the target network learn the main task. The confidence network is trained based on the difference between the weak and true labels in the human-labeled data. In this mode, we update the parameters of the confidence network as well as the representation-learning layer of the target network.

In the **weak-supervision mode** (subfigure on the right), given a batch of data with weak labels, we train the target network to learn the main task. However, each example and its weak label are passed through the confidence network to generate a score (a probability) indicating how good this example is. The score generated by the confidence network is then used to weight the gradients of the target network's loss in the backward pass of backpropagation. In this mode, we update the parameters of the target network as well as the representation-learning layer, but the parameters of the confidence network are frozen.

During training, we alternate between these two modes. It is noteworthy that having a **shared representation layer** between the target and confidence networks has a couple of advantages:

First, it lays the groundwork for better communication between the learner and the meta-learner. Besides considering the shared representation layer as a communication channel, we can say that it enables each of these two networks to see the data from the other's point of view.

Second, this way, we let the confidence network enjoy the updates from the **large quantity** of weakly annotated data, while at the same time the target network benefits from the **high quality** of the clean data with true labels.

And last but not least, we in fact have a multi-task learning setup with parameter sharing, so learning the confidence scores can be considered an auxiliary task that helps the target network better learn the main task; in a way, it acts as a regularizer, helping the target network generalize better at inference time.
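The alternating schedule between the two modes can be sketched as a simple loop; the `full_step` and `weak_step` callables are stand-ins of mine for one training step in each mode:

```python
def alternate_training(n_steps, full_step, weak_step):
    # Alternate between the two modes:
    #   full-supervision: update confidence net + shared representation
    #                     on a clean, human-labeled batch
    #   weak-supervision: update target net + shared representation on a
    #                     weakly labeled batch (confidence net frozen)
    for t in range(n_steps):
        if t % 2 == 0:
            full_step(t)
        else:
            weak_step(t)
```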

That was the general idea of our model. If you are interested in more details and results from the experiments, you can take a look at our paper:

**Mostafa Dehghani**, A. Severyn, S. Rothe, and J. Kamps. "*Learning to Learn from Weak Supervision by Full Supervision*", NIPS 2017 Workshop on Meta-Learning (MetaLearn 2017) [arXiv].


]]>