This post is about a project I did in collaboration with Aliaksei Severyn, Sascha Rothe, and Jaap Kamps during my internship at Google Research.
Deep neural networks have shown impressive results on many tasks in computer vision, natural language processing, and information retrieval. However, their success is conditioned on the availability of vast amounts of labeled data, while for many tasks such data is not available. Hence, unsupervised and semi-supervised methods are becoming increasingly attractive.
Using weak or noisy supervision is a straightforward way to increase the size of the training data. In one of my previous posts, I talked about how to beat your teacher: how to train a neural network using only the output of a heuristic model as the supervision signal, and end up with a model that works better than that heuristic. Assuming that, besides a lot of unlabeled (or weakly labeled) data, there is usually also a small amount of training data with strong (true) labels, i.e. a semi-supervised setup, here I'll talk about how to learn from a weak teacher while avoiding its mistakes.
This is usually done by pre-training the network on the weak data and fine-tuning it with the true labels. However, these two independent stages do not leverage the full capacity of the information in the true labels. For instance, in the pre-training stage, there is no handle to control the extent to which each weakly labeled instance contributes to the learning process, even though such instances can be of very different quality.
In this post, I'm going to talk about our proposed idea: a semi-supervised method that leverages a small amount of data with true labels along with a large amount of data with weak labels. It has three main components:
- A weak annotator, which can be a heuristic model, a weak classifier, or even humans via crowdsourcing, and which is employed to annotate a massive amount of unlabeled data.
- A target network, which uses the large set of instances annotated by the weak annotator to learn the main task.
- A confidence network, which is trained on a small human-labeled set to estimate confidence scores for instances annotated by the weak annotator. We train the target network and the confidence network in a multi-task fashion (see the sketch after this list).
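To make the setup concrete, here is a minimal sketch of the two networks, assuming a PyTorch implementation with a shared representation layer. All class names, layer sizes, and the feed-forward encoder are hypothetical choices for illustration, not the actual architecture we used.

```python
import torch
import torch.nn as nn

class SharedRepresentation(nn.Module):
    """Representation layer shared between the two networks."""
    def __init__(self, input_dim=300, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.encoder(x)

class TargetNetwork(nn.Module):
    """Learns the main task from weakly annotated instances."""
    def __init__(self, shared, hidden_dim=128, num_classes=2):
        super().__init__()
        self.shared = shared
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        return self.head(self.shared(x))

class ConfidenceNetwork(nn.Module):
    """Scores how reliable a weak label looks, in [0, 1]."""
    def __init__(self, shared, hidden_dim=128, num_classes=2):
        super().__init__()
        self.shared = shared
        # Takes the shared representation plus the (one-hot) weak label.
        self.head = nn.Sequential(
            nn.Linear(hidden_dim + num_classes, 1),
            nn.Sigmoid(),
        )

    def forward(self, x, weak_label_onehot):
        z = self.shared(x)
        return self.head(torch.cat([z, weak_label_onehot], dim=-1)).squeeze(-1)
```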
In the joint learning process, the target network and the confidence network try to learn a suitable representation of the data, and this representation layer is shared between them as a two-way communication channel. The target network learns to predict the label of a given input under the supervision of the weak annotator. At the same time, the output of the confidence network, i.e. the confidence scores, defines the magnitude of the weight updates to the target network with respect to the loss computed on labels from the weak annotator, during the back-propagation phase of the target network. This way, the confidence network helps the target network avoid the mistakes of its teacher, i.e. the weak annotator, by down-weighting the weight updates from weak labels that do not look reliable to the confidence network.
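As one plausible sketch of the supervised side of this multi-task setup: the confidence network could be trained on the small true-labeled set using agreement between the weak label and the true label as its target. That agreement signal is an assumption for illustration, and `conf_net` refers to the hypothetical `ConfidenceNetwork` from the sketch above.

```python
import torch.nn.functional as F

def confidence_training_step(conf_net, optimizer, x, weak_labels,
                             true_labels, num_classes=2):
    weak_onehot = F.one_hot(weak_labels, num_classes).float()
    # Assumed target: 1.0 if the weak annotator got this instance right.
    target = (weak_labels == true_labels).float()
    scores = conf_net(x, weak_onehot)
    loss = F.binary_cross_entropy(scores, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```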
From a meta-learning perspective, the goal of the confidence network trained jointly with the target network is to calibrate the learning rate for each instance in the batch. I.e., the weights $w_t$ of the target network at step $t$ are updated as follows:

$$w_{t+1} = w_t - \frac{\eta_t}{b} \sum_{i=1}^{b} c(x_i, \tilde{y}_i) \, \nabla \mathcal{L}\big(f_{w_t}(x_i), \tilde{y}_i\big)$$

where $\eta_t$ is the global learning rate, $b$ is the batch size, $\mathcal{L}(\cdot)$ is the loss of predicting $\hat{y} = f_{w_t}(x_i)$ for an input $x_i$ when the target label is $\tilde{y}_i$, and $c(\cdot)$ is a scoring function learned by the confidence network, taking an input instance $x_i$ and its noisy label $\tilde{y}_i$. Thus, we can effectively control the contribution of each weakly labeled instance to the parameter updates of the target network, based on how reliable its label is according to the confidence network (learned on the small supervised set).
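A minimal sketch of this weighted update in the same assumed PyTorch setup: scaling each instance's loss by its confidence score before back-propagation scales that instance's gradient contribution, which is equivalent to giving it its own learning rate.

```python
import torch
import torch.nn.functional as F

def target_training_step(target_net, conf_net, optimizer, x, weak_labels,
                         num_classes=2):
    weak_onehot = F.one_hot(weak_labels, num_classes).float()
    with torch.no_grad():  # confidence net is not updated on weak batches
        c = conf_net(x, weak_onehot)  # c(x_i, ỹ_i) for each instance
    logits = target_net(x)
    per_instance_loss = F.cross_entropy(logits, weak_labels, reduction="none")
    loss = (c * per_instance_loss).mean()  # (1/b) Σ c_i · L_i
    optimizer.zero_grad()
    loss.backward()  # gradients are down-weighted for unreliable weak labels
    optimizer.step()
    return loss.item()
```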