# Fidelity-Weighted Learning

Our paper "Fidelity-Weighted Learning", with Arash Mehrjou, Stephan Gouws, Jaap Kamps, and Bernhard Schölkopf, has been accepted at the Sixth International Conference on Learning Representations (ICLR 2018). \o/

[perfectpullquote align="full" bordertop="false" cite="" link="" color="" class="#16989D" size="16"]

### tl;dr

Fidelity-weighted learning (FWL) is a semi-supervised student-teacher approach for training deep neural networks with weakly-labeled data. It modulates the parameter updates of a student network, trained on the task we care about, on a per-sample basis, according to the posterior confidence in each label's quality as estimated by a Bayesian teacher that has access to a rather small amount of high-quality labels.[/perfectpullquote]

The success of deep neural networks to date depends strongly on the availability of labeled data, which is costly and not always easy to obtain. Usually, it is much easier to obtain small quantities of high-quality labeled data and large quantities of unlabeled data. The problem of how to best integrate these two different sources of information during training is an active pursuit in the field of semi-supervised learning, and with FWL we propose one way to address this question.

## Learning from samples of variable quality

For a large class of tasks, it is also easy to define one or more so-called “weak annotators”: additional (albeit noisy) sources of weak supervision based on heuristics, or “weaker” biased classifiers trained on, e.g., non-expert crowd-sourced data or data from related domains. While easy and cheap to generate, it is not immediately clear if and how such weakly-labeled data can be used to train a stronger classifier for the task we care about. More generally, in almost all practical applications, machine learning systems have to deal with data samples of variable quality. For example, in a large dataset of images, only a small fraction of samples may be labeled by experts, while the rest are crowd-sourced using, e.g., Amazon Mechanical Turk. In addition, in some applications labels are intentionally perturbed for privacy reasons.

Assuming we can obtain a large set of weakly-labeled data in addition to a much smaller training set of “strong” labels, the simplest approach is to expand the training set with the weakly-supervised samples (all samples are equal). Alternatively, one may pretrain on the weak data and then fine-tune on observations from the true function or distribution (which we call strong data). Indeed, a small amount of expert-labeled data can be augmented in this way by a large set of raw data, with labels coming from a heuristic function, to train a more accurate neural ranking model. The downside is that such approaches are oblivious to the amount or source of noise in the labels.

[perfectpullquote align="right" bordertop="false" cite="" link="" color="" class="" size=""]

All labels are equal, but some labels are more equal than others.

Inspired by George Orwell, Animal Farm, 1945.

[/perfectpullquote]

We argue that treating weakly-labeled samples uniformly (i.e. each weak sample contributes equally to the final classifier) ignores potentially valuable information of the label quality. Instead, we propose Fidelity-Weighted Learning (FWL), a Bayesian semi-supervised approach that leverages a small amount of data with true labels to generate a larger training set with confidence-weighted weakly-labeled samples, which can then be used to modulate the fine-tuning process based on the fidelity (or quality) of each weak sample. By directly modeling the inaccuracies introduced by the weak annotator in this way, we can control the extent to which we make use of this additional source of weak supervision: more for confidently-labeled weak samples close to the true observed data, and less for uncertain samples further away from the observed data.

## How does fidelity-weighted learning work?

We propose a setting consisting of two main modules:

1. The student, which is in charge of learning a suitable data representation and performing the main prediction task;
2. The teacher, which modulates the learning process by modeling the inaccuracies in the labels.

[latexpage]

We assume we are given a large set of unlabeled data samples, a heuristic labeling function called the weak annotator, and a small set of high-quality samples labeled by experts, called the strong dataset, consisting of tuples of training samples $x_i$ and their true labels $y_i$, i.e. $\mathcal{D}_s=\{(x_i,y_i)\}$. We consider the latter to be observations from the true target function that we are trying to learn.
We use the weak annotator to generate labels for the unlabeled samples. Generated labels are noisy due to the limited accuracy of the weak annotator. This gives us the weak dataset consisting of tuples of training samples $x_i$ and their weak labels $\tilde{y}_i$, i.e. $\mathcal{D}_w=\{(x_i, \tilde{y}_i)\}$. Note that we can generate a large amount of weak training data $\mathcal{D}_w$ at almost no cost using the weak annotator. In contrast, we have only a limited amount of observations from the true function, so: $|\mathcal{D}_s| \ll |\mathcal{D}_w|$.
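As a concrete sketch of this setup, here is how the two datasets might be built in numpy. The weak annotator below is a made-up heuristic purely for illustration, not one used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# A large pool of unlabeled samples, cheap to collect.
x_unlabeled = rng.normal(size=10_000)

# Hypothetical weak annotator: any heuristic that maps a sample to a noisy label.
def weak_annotator(x):
    return np.sign(x) + rng.normal(scale=0.3, size=np.shape(x))

# Weak dataset D_w: weak labels for every unlabeled sample.
D_w = list(zip(x_unlabeled, weak_annotator(x_unlabeled)))

# Strong dataset D_s: expert labels for only a handful of samples.
x_expert = x_unlabeled[:50]
D_s = list(zip(x_expert, np.sign(x_expert)))  # np.sign stands in for the true labels

assert len(D_s) < len(D_w)  # |D_s| << |D_w|
```

The key point is only the asymmetry: $\mathcal{D}_w$ is large and noisy, $\mathcal{D}_s$ is small and trusted.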

Here, we assume the student to be a neural network and the teacher to be a Bayesian function approximator. The training process consists of three phases (illustrated in the figure above):

• Step 1: Pre-train the student on $\mathcal{D}_w$ using the weak labels generated by the weak annotator.
The main goal of this step is to learn a task-dependent representation of the data as well as to pretrain the student. The student function is a neural network consisting of two parts: the first part $\psi(\cdot)$ learns the data representation, and the second part $\phi(\cdot)$ performs the prediction task (e.g. classification). The overall function is therefore $\hat{y}_i=\phi(\psi(x_i))$. The student is trained on all samples of the weak dataset $\mathcal{D}_w=\{(x_i, \tilde{y}_i)\}$. For brevity, in the following we will refer to both a data sample $x_i$ and its representation $\psi(x_i)$ by $x_i$ when it is clear from context.
From the self-supervised feature learning point of view, we can say that representation learning in this step is solving a surrogate task of approximating the expert knowledge, for which a noisy supervision signal is provided by the weak annotator.
• Step 2: Train the teacher on the strong data $(\psi(x_j),y_j) \in \mathcal{D}_s$, represented in terms of the student representation $\psi(\cdot)$, and then use the teacher to generate a soft dataset $\mathcal{D}_{sw}$ consisting of $\langle \textrm{sample}, \textrm{predicted label}, \textrm{confidence} \rangle$ tuples for all data samples.
We use a Gaussian process as the teacher to capture label uncertainty in terms of the student representation, estimated w.r.t. the strong data. A prior mean and covariance function is chosen for the $\mathcal{GP}$. The embedding function $\psi(\cdot)$ learned in Step 1 is then used to map data samples to dense vectors, which serve as input to the $\mathcal{GP}$. Using the representation learned by the student in the previous step compensates for the lack of data in $\mathcal{D}_s$, and the teacher benefits from the knowledge acquired from the large quantity of weakly annotated data. This way, we also let the teacher see the data through the lens of the student.
We call the labels generated by the teacher soft labels, and accordingly refer to $\mathcal{D}_{sw}$ as the soft dataset. Note that we train the $\mathcal{GP}$ only on the strong dataset $\mathcal{D}_s$, but then use it to generate soft labels $\bar{y}_t$ and uncertainties $\Sigma(x_t)$ for all samples in $\mathcal{D}_{sw}=\mathcal{D}_w\cup \mathcal{D}_s$.¹
• Step 3: Fine-tune the weights of the student network on the soft dataset, while modulating the magnitude of each parameter update by the corresponding teacher-confidence in its label.
The student network of Step 1 is fine-tuned using samples from the soft dataset $\mathcal{D}_{sw}=\{(x_t, \bar{y}_t)\}$, where $\bar{y}_t$ is the soft label generated by the teacher. The corresponding uncertainty $\Sigma(x_t)$ of each sample is mapped to a confidence value (we will explain how in a minute!), which is then used to determine the step size for each iteration of stochastic gradient descent (SGD). Intuitively, for data points where we have true labels, the uncertainty of the teacher is almost zero, which means we have high confidence and a large step size for updating the parameters. For data points where the teacher is not confident, however, we down-weight the training steps of the student. This means that at these points, we keep the student function as it was trained on the weak data in Step 1. More specifically, we update the parameters of the student by training on $\mathcal{D}_{sw}$ using SGD:
\begin{eqnarray*}
\pmb{w}^* &=& \operatorname*{argmin}_{\pmb{w} \in \mathcal{W}} \> \frac{1}{N}\sum_{(x_t,\bar{y}_t) \in \mathcal{D}_{sw}}l(\pmb{w}, x_t, \bar{y}_t) + \mathcal{R}(\pmb{w}), \\
\pmb{w}_{t+1} &=& \pmb{w}_t - \eta_t(\nabla l(\pmb{w},x_t,\bar{y}_t) + \nabla \mathcal{R}(\pmb{w}))
\end{eqnarray*}
where $l(\cdot)$ is the per-example loss, $\eta_t$ is the total learning rate, $N$ is the size of the soft dataset $\mathcal{D}_{sw}$, $\pmb{w}$ denotes the parameters of the student network, and $\mathcal{R}(\cdot)$ is the regularization term (the usual regularization used by optimization packages, e.g. weight decay). We define the total learning rate as $\eta_t = \eta_1(t)\eta_2(x_t)$, where $\eta_1(t)$ is the usual learning rate of our chosen optimization algorithm, which anneals over training iterations, and $\eta_2(x_t)$ is a function of the label uncertainty $\Sigma(x_t)$ computed by the teacher for each data point. Multiplying these two terms gives the total learning rate. In other words, $\eta_2$ represents the fidelity (quality) of the current sample and multiplicatively modulates $\eta_1$. Note that the first term does not depend on the individual data point, whereas the second term does. We propose

\begin{equation}
\label{eqn:eta2}
\eta_2(x_t) = \exp[-\beta \Sigma(x_t)],
\end{equation}

to exponentially decrease the learning rate for data point $x_t$ if its corresponding soft label $\bar{y}_t$ is unreliable (far from a true sample). In Equation \ref{eqn:eta2}, $\beta$ is a positive scalar hyper-parameter. Intuitively, a small $\beta$ results in a student that listens more carefully to the teacher and copies its knowledge, while a large $\beta$ makes the student pay less attention to the teacher and stay with its initial weak knowledge. More concretely, as $\beta \to 0$, the student places more trust in the labels $\bar{y}_t$ estimated by the teacher and copies the teacher's knowledge. On the other hand, as $\beta \to \infty$, the student puts less weight on the extrapolation ability of the $\mathcal{GP}$, and its parameters are not affected by the correcting information from the teacher.
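Steps 2 and 3 can be sketched in a few lines of Python, using scikit-learn's `GaussianProcessRegressor` as a stand-in teacher and taking the student representation $\psi(x)$ to be $x$ itself for simplicity; all names and numbers below are ours, not the paper's:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Strong data: a few expert-labeled points (here psi(x) = x for simplicity).
x_strong = rng.uniform(-4, 4, size=8).reshape(-1, 1)
y_strong = np.sin(x_strong).ravel()

# Step 2: train the teacher (a GP) on the strong set only.
teacher = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
teacher.fit(x_strong, y_strong)

# Soft labels and uncertainties for all samples (weak + strong).
x_all = np.linspace(-6, 6, 200).reshape(-1, 1)
y_soft, sigma = teacher.predict(x_all, return_std=True)

# Step 3: map uncertainty to a per-sample learning-rate multiplier.
beta = 1.0
eta2 = np.exp(-beta * sigma)   # eta_2(x_t) = exp(-beta * Sigma(x_t))
eta1 = 0.1                     # the optimizer's own (annealed) learning rate
per_sample_lr = eta1 * eta2    # modulated step size used in the SGD update
```

Near the strong points, $\Sigma(x_t)$ is close to zero and the student takes (almost) full-size steps; far from them, the step size shrinks exponentially.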

## A toy problem

Let’s apply FWL to a one-dimensional toy problem to illustrate its various steps.

Let $f_t(x)=\sin(x)$ be the true function (red dotted line in plot $a$ of the figure below), from which a small set of observations $\mathcal{D}_s=\{(x_j,y_j)\}$ is provided (red points in plot $b$). These observations might be noisy, in the same way that labels obtained from a human labeler could be.

A weak annotator function $f_{w}(x)=2\,\mathrm{sinc}(x)$ (magenta line in plot $a$) is provided as an approximation to $f_t(\cdot)$. The task is to obtain a good estimate of $f_t(\cdot)$ given the set $\mathcal{D}_s$ of strong observations and the weak annotator function $f_{w}(\cdot)$. We can easily obtain a large set of observations $\mathcal{D}_w=\{(x_i,\tilde{y}_i)\}$ from $f_{w}(\cdot)$ at almost no cost (magenta points in plot $a$).
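The toy setup is easy to reproduce in numpy (our sketch; the exact sampling ranges and set sizes in the paper may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):                       # true function f_t(x) = sin(x)
    return np.sin(x)

def f_weak(x):                       # weak annotator f_w(x) = 2*sinc(x) = 2*sin(x)/x
    return 2.0 * np.sinc(x / np.pi)  # numpy's sinc(u) is sin(pi*u)/(pi*u)

# Small strong set from the true function, large weak set from the annotator.
x_s = rng.uniform(-10, 10, size=10)
D_s = (x_s, f_true(x_s))

x_w = rng.uniform(-10, 10, size=1000)
D_w = (x_w, f_weak(x_w))
```

Note that numpy's normalized `sinc` needs the `x / np.pi` rescaling to match the unnormalized $\mathrm{sinc}(x)=\sin(x)/x$ used here.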

We consider two experiments:

1. A neural network trained on weak data and then fine-tuned on strong data from the true function, which is the most common semi-supervised approach (plot $c$ in the figure above).
2. A teacher-student framework trained with the proposed FWL approach.

As can be seen in plot $d$ of the figure above, FWL, by taking label confidence into account, gives a better approximation of the true hidden function. We repeated the above experiment 10 times. The average RMSE with respect to the true function on a set of test points over these 10 runs was as follows:

1. Student trained on weak data only (blue line in plot $a$ in the figure above): $0.8406$;
2. Student trained on weak data, then fine-tuned on true observations (blue line in plot $c$ in the figure above): $0.5451$;
3. Student trained on weak data, then fine-tuned with the soft labels and confidence information provided by the teacher (blue line in plot $d$ in the figure above): $0.4143$ (best).
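The evaluation metric here is plain RMSE against the true function on held-out test points; a minimal sketch (the number it produces is illustrative, not one of the paper's results):

```python
import numpy as np

def rmse(y_pred, y_true):
    # Root-mean-squared error of predictions against the true function values.
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

# Example: score the raw weak annotator 2*sinc(x) against f_t(x) = sin(x).
x_test = np.linspace(-10, 10, 200)
weak_only_error = rmse(2.0 * np.sinc(x_test / np.pi), np.sin(x_test))
```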

More details of the neural network and $\mathcal{GP}$ along with the specification of the data used in the above experiment are in the paper.

That was the general idea of FWL. To see how it works for real-world tasks, such as sentiment classification and document ranking, take a look at our paper:

• Mostafa Dehghani, Arash Mehrjou, Stephan Gouws, Jaap Kamps, and Bernhard Schölkopf. “Fidelity-Weighted Learning”, in Proceedings of the Sixth International Conference on Learning Representations (ICLR’18).
1. In practice, we furthermore divide the data space into several regions and assign each region a separate $\mathcal{GP}$ trained on samples from that region. This leads to better exploration of the data space and makes use of the inherent structure of the data. This algorithm, called clustered $\mathcal{GP}$, gave better results than a single $\mathcal{GP}$. Check the paper for more details.
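A rough reconstruction of this clustered $\mathcal{GP}$ idea, assuming k-means to partition the input space and one scikit-learn GP per region (our sketch; the paper's exact procedure may differ):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# Strong observations scattered over the input space.
x = rng.uniform(-10, 10, size=60).reshape(-1, 1)
y = np.sin(x).ravel()

# Partition the input space into k regions and fit one GP per region.
k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
gps = [GaussianProcessRegressor().fit(x[km.labels_ == c], y[km.labels_ == c])
       for c in range(k)]

def teacher_predict(x_new):
    """Route each query to the GP of its nearest region; return (mean, std)."""
    x_new = np.atleast_2d(x_new)
    c = km.predict(x_new)[0]
    mean, std = gps[c].predict(x_new, return_std=True)
    return mean[0], std[0]
```

Each regional GP is fit on far fewer points than a global one, which keeps training cheap while letting the kernel hyper-parameters adapt to local structure.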