Learning to Learn from Weak Supervision by Full Supervision
Our paper "Learning to Learn from Weak Supervision by Full Supervision", with Sascha Rothe and Jaap Kamps, has been accepted at the NIPS 2017 Workshop on Meta-Learning (MetaLearn 2017). \o/
Using weak or noisy supervision is a straightforward way to increase the size of the training data: the output of heuristic methods can serve as a weak or noisy signal and, together with a small amount of labeled data, be used to train neural networks. This is usually done by pre-training the network on the weak data and fine-tuning it with the true labels. However, these two independent stages do not leverage the full capacity of information in the true labels, and noisy labels of lower quality often bring little to no improvement. Noise-aware models tackle this issue by making denoising of the weak signal part of the learning process.
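For concreteness, here is a minimal sketch of that conventional two-stage baseline. It is not code from the paper; the model, loaders, and hyperparameters are placeholders, and it simply illustrates pre-training on weak labels followed by fine-tuning on true labels:

```python
import torch
import torch.nn.functional as F

def pretrain_then_finetune(model, weak_loader, clean_loader,
                           pretrain_epochs=5, finetune_epochs=2, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    # Stage 1: pre-train on weak (noisy) labels produced by a heuristic annotator.
    for _ in range(pretrain_epochs):
        for x, y_weak in weak_loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y_weak).backward()
            opt.step()

    # Stage 2: fine-tune on the small set of true labels.
    for _ in range(finetune_epochs):
        for x, y_true in clean_loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y_true).backward()
            opt.step()
    return model
```

The two stages are completely decoupled, which is exactly the limitation the paper addresses.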
We propose a meta-learning approach in which we train two networks: a target network, which plays the role of the learner and uses a large set of weakly annotated instances to learn the main task, and a confidence network, which plays the role of the meta-learner and is trained on a small human-labeled set to estimate confidence scores. These scores define the magnitude of the weight updates to the target network during the back-propagation phase. The goal of the confidence network, trained jointly with the target network, is to calibrate the learning rate of the target network for each instance in the batch. That is, the weights $w_t$ of the target network at step $t$ are updated as follows:

$$ w_{t+1} = w_t - \eta_t \, c_\theta(x_i, \tilde{y}_i) \, \nabla \mathcal{L}(\hat{y}_i, \tilde{y}_i), $$

where $\eta_t$ is the global learning rate, $\mathcal{L}(\hat{y}_i, \tilde{y}_i)$ is the loss of predicting $\hat{y}_i$ for an input $x_i$ when the label is $\tilde{y}_i$, and $c_\theta(\cdot)$ is a scoring function learned by the confidence network taking as input the instance $x_i$ and its noisy label $\tilde{y}_i$. Thus, we can effectively control the contribution of each weakly labeled instance to the parameter updates of the target network, based on how reliable its label is according to the confidence network, which is learned on a small amount of supervised data.
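As an illustration, a rough sketch of such a confidence-weighted update is given below. This is my own sketch, not the paper's implementation; `target_net`, `confidence_net`, and the choice of loss are assumptions. The key point is that scaling the per-instance loss by the confidence score scales the corresponding gradient, which amounts to a per-instance learning rate of $\eta_t \, c_\theta(x_i, \tilde{y}_i)$:

```python
import torch
import torch.nn.functional as F

def weak_step(target_net, confidence_net, optimizer, x, y_weak):
    """One update of the target network on a batch of weakly labeled data."""
    # Per-instance loss of the target network against the weak labels.
    logits = target_net(x)
    per_example_loss = F.cross_entropy(logits, y_weak, reduction="none")

    # Confidence scores c(x, y~); the confidence network is not updated here
    # (it is trained on the small human-labeled set).
    with torch.no_grad():
        c = confidence_net(x, y_weak).squeeze(-1)

    # Scaling each loss term by its confidence score scales the resulting
    # gradient, i.e. it modulates the effective learning rate per instance.
    loss = (c * per_example_loss).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```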
Our setup requires running a weak annotator to label a large amount of unlabeled data, which is done at pre-processing time. For many tasks, a simple heuristic suffices to generate the weak labels. This weakly labeled set is then used to train the target network, while the small human-labeled set is used to train the confidence network. The general architecture of the model is illustrated in the figure below:

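In code, the data preparation described above could be sketched roughly as follows; all names here are illustrative, and the weak annotator stands in for whatever task-specific heuristic is available:

```python
def build_training_sets(unlabeled_pool, weak_annotator, human_labeled):
    """Prepare the two training sets used by the two networks."""
    # Weak set: large, cheap, and noisy -- produced offline by the heuristic
    # annotator and used to train the target network.
    weak_set = [(x, weak_annotator(x)) for x in unlabeled_pool]
    # Clean set: small, expensive, and trusted -- used to train the
    # confidence network.
    clean_set = list(human_labeled)
    return weak_set, clean_set
```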