Learning to Learn from Weak Supervision by Full Supervision

Our paper "Learning to Learn from Weak Supervision by Full Supervision", with Sascha Rothe and Jaap Kamps, has been accepted at the NIPS 2017 Workshop on Meta-Learning (MetaLearn 2017). \o/

Using weak or noisy supervision is a straightforward way to increase the size of the training data. It has been shown that the output of heuristic methods can be used as weak or noisy signals, along with a small amount of labeled data, to train neural networks. This is usually done by pre-training the network on weak data and fine-tuning it with true labels. However, these two independent stages do not leverage the full capacity of the information in the true labels, and using noisy labels of lower quality often brings little to no improvement. This issue is tackled by noise-aware models, where denoising the weak signal is part of the learning process.

We propose a meta-learning approach in which we train two networks: a target network, which plays the role of the learner and uses a large set of weakly annotated instances to learn the main task, and a confidence network, which plays the role of the meta-learner and is trained on a small human-labeled set to estimate confidence scores. These scores define the magnitude of the weight updates to the target network during the back-propagation phase. The goal of the confidence network, trained jointly with the target network, is to calibrate the learning rate of the target network for each instance in the batch. That is, the weights \pmb{w} of the target network f_w at step t+1 are updated as follows:

\pmb{w}_{t+1} = \pmb{w}_t - \frac{\eta_t}{b}\sum_{i=1}^b c_{\theta}(x_i, \tilde{y}_i) \nabla \mathcal{L}(f_{\pmb{w}_t}(x_i), \tilde{y}_i)

where \eta_t is the global learning rate, b is the batch size, \mathcal{L}(\cdot) is the loss of predicting \hat{y}=f_w(x_i) for an input x_i when the label is \tilde{y}_i, and c_\theta(\cdot) is a scoring function learned by the confidence network, taking as input the instance x_i and its noisy label \tilde{y}_i. Thus, we can effectively control how much each weakly labeled instance contributes to the parameter updates of the target network, based on how reliable its label is according to the confidence network, which is learned on a small supervised dataset.
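To make the update rule above concrete, here is a minimal sketch in plain Python. It assumes a linear model f_w(x) = w . x and a squared loss, both chosen only for illustration; the function name `weighted_sgd_step` and the toy data are hypothetical and not taken from the paper.

```python
def weighted_sgd_step(w, batch, confidences, lr):
    """One step of w_{t+1} = w_t - (lr / b) * sum_i c_i * grad L_i.

    Illustrative assumptions (not from the paper): a linear model
    f_w(x) = w . x and squared loss L = 0.5 * (f_w(x) - y_weak) ** 2.
    """
    b = len(batch)
    update = [0.0] * len(w)
    for (x, y_weak), c in zip(batch, confidences):
        pred = sum(wj * xj for wj, xj in zip(w, x))
        err = pred - y_weak              # dL/dpred for the squared loss
        for j, xj in enumerate(x):
            update[j] += c * err * xj    # gradient scaled by the confidence
    return [wj - (lr / b) * uj for wj, uj in zip(w, update)]


w = [0.0, 0.0]
batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], 5.0)]  # (x_i, weak label) pairs
confidences = [1.0, 0.0]                         # second weak label is distrusted
w_next = weighted_sgd_step(w, batch, confidences, lr=0.5)
```

In this toy batch only the first coordinate of `w_next` moves, because the second instance has confidence 0 and therefore contributes nothing to the update.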

Our setup requires running a weak annotator to label a large amount of unlabeled data, which is done at pre-processing time. For many tasks, it is possible to use a simple heuristic to generate weak labels. This set is then used to train the target network. In contrast, a small human-labeled set is used to train the confidence network. The general architecture of the model is illustrated in the figure below:

Our proposed multi-task network for learning a target task using a large amount of weakly labeled data and a small amount of data with true labels. Faded parts of the network are disabled during the training in the corresponding mode. Red-dotted arrows show gradient propagation. Parameters of the parts of the network in red frames get updated in the backward pass, while parameters of the network in blue frames are fixed during the training.

The subfigure on the left presents the full-supervision mode. Given a batch of data with true labels (as well as weak labels), we train the confidence network to estimate, for an example in the batch and its weak label, how likely that pair is to help the target network learn the main task. The confidence network is trained based on the difference between the weak and true labels in the human-labeled data. In this mode, we update the parameters of the confidence network as well as the representation learning layer of the target network.
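As a rough illustration of training the confidence network from the difference between weak and true labels, the sketch below builds a target score and a loss against it. The exact target used in the paper is not given here, so the form `1 - |weak - true|` for labels in [0, 1] and the squared-error loss are assumptions, and both function names are hypothetical.

```python
def confidence_target(y_weak, y_true):
    """Assumed target: 1 when the labels agree, falling to 0 as they diverge.

    Assumes both labels are scalars in [0, 1]; this form is an
    illustration, not necessarily the one used in the paper.
    """
    return max(0.0, 1.0 - abs(y_weak - y_true))


def confidence_loss(predicted, y_weak, y_true):
    """Squared error between the predicted confidence and the assumed target."""
    target = confidence_target(y_weak, y_true)
    return (predicted - target) ** 2


# A weak label that matches the true label should get a high target score,
# and one that contradicts it should get a low one.
agree = confidence_target(1.0, 1.0)
disagree = confidence_target(1.0, 0.0)
```

Under this assumption, the confidence network is pushed toward high scores for weak labels that agree with the human labels and toward low scores for those that do not.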

In the weak-supervision mode (subfigure on the right), given a batch of data with weak labels, we train the target network to learn the main task. However, each example and its weak label are passed through the confidence network to generate a score (a probability) indicating how good the example is. The score generated by the confidence network is then used to weight the gradients of the loss of the target network in the backward pass. In this mode, we update the parameters of the target network as well as the representation learning layer, but the parameters of the confidence network are frozen.

During training, we alternate between these two modes. It is noteworthy that having a shared representation layer between the target and confidence networks has several advantages:

First, it lays the ground for better communication between the learner and the meta-learner. Besides serving as a communication channel, the shared representation layer enables each of the two networks to see the data from the other's point of view.

Second, this way, we let the confidence network benefit from the updates driven by the large quantity of weakly annotated data, while at the same time the target network benefits from the high quality of the clean data with true labels.

And last but not least, we have in fact a multi-task learning setup with parameter sharing, so learning the confidence can be considered an auxiliary task that helps the target network better learn the main task. In a way, it acts as a regularizer that helps the target network generalize better at inference time.
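The alternation between the two modes, and which parameter groups each mode updates, can be sketched as follows. The group names, the `schedule` helper, and the `weak_steps_per_full` ratio are hypothetical; the post only states that training alternates between the modes, not in what proportion.

```python
# Parameter groups updated in each mode, following the description above.
TRAINABLE = {
    "full": {"representation", "confidence"},  # target-specific part not updated
    "weak": {"representation", "target"},      # confidence network is frozen
}


def schedule(num_steps, weak_steps_per_full=1):
    """Return the training mode for each step.

    weak_steps_per_full is a hypothetical knob: how many weak-supervision
    steps follow each full-supervision step.
    """
    period = weak_steps_per_full + 1
    return ["full" if step % period == 0 else "weak" for step in range(num_steps)]


modes = schedule(4)
# The shared representation layer is the only group updated in both modes.
shared = TRAINABLE["full"] & TRAINABLE["weak"]
```

Here the intersection of the two sets contains only the representation layer, which is what lets each network benefit from the other's training signal.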

That was the general idea of our model, but if you are interested in more details and results from the experiments, you can take a look at our paper: