# Some Highlights of MILA Deep Learning and Reinforcement Learning Summer Schools 2017

[latexpage]
A couple weeks a go, I attended MILLA Deep Learning summer school (DLSS) from June 26th to July 1st and Reinforcement Learning summer school (RLSS) from  July 3rd to 5th, 2017, organized by Yoshua Bengio and Aaron Courville. You can find information about the lectures here. In the following, I will share some of “my” highlights from the summer schools.

## Different types of learning problems

The first day of summer school was on general topics of machine learning and neural networks. Doina Precup gave a talk which was a gentle refreshing of the general concepts of Machine Learning and then Hugo Larochelle covered the basics of neural networks. In the second part of his talk, Hugo started by dividing learning problems into different types, based on the data and settings of the problem during training and inference. Based on his grouping, each learning problem can be classified to one of these categories:

• Supervised Learning (like classification and regression) in which at training time, we have a set of pairs of <input, target> that are coming from an unknown distribution and at test time, we are given another set of data that we assume they are sampled from the same distribution. Adapting a neural network to a supervised learning problem is often about adapting the output layer.
• Unsupervised Learning (like distribution estimation and dimensionality reduction) in which we have a similar setup to the supervised learning unless we don’t have targets. Generally, in unsupervised learning, we aim at learning something about the distribution of the data, or we just want to learn what is the potential implicit manifold around which most of the data lie.
• Semi-Supervised Learning (Like tasks for which a small amount of labeled data is available) in which at training time often we assume we don’t have a lot of <input, targets> pairs, but we have a lot of unlabeled examples and we hope to leverage the unlabeled data to do a better job at the test time.
• Multi-Task Learning (like object recognition in images with multiple objects) which we can think of it as a problem setting that we have input and a series of labels, each with regard to a different task and we would like to simultaneously solve these task ideally with a single neural network. The reason that we are interested in multitask learning is that we suspect that there are some types of tasks that if we solve them jointly, we will do better than solving them separately and this generally could be the effect of the knowledge that we can share between tasks by learning them jointly.
• Transfer Learning (like when we just care about detecting an object in an image and we have caption of images in our training data) which is similar to multitask learning that we assume at training time, we are seeing multiple labels for different tasks, but there is one task that we are particularly interested in and we hope that seeing all the other labels, we can do a better job for that particular task (not all of them) at test time.
• Structured Output Prediction (like image caption generation and machine translation) where the target is a more structured object like a sequence of words that are not independent.
• Domain Adaptation (like classifying sentiment in reviews of different products) where the distribution of our data during training is different from the distribution of our data during test time. We might have some unlabeled data during training that are sampled from the same distribution as our test data. The key is that we need to learn an unbiased representation which is independent of the domain of the input data. One way to do that is to have an adversarial setup in which we have a classifier unit in our model that is in charge of detecting the domain of the input and we try to learn a representation using which the classifier unit is doing as bad as possible.
• One-Shut Learning (like recognizing a person based on a single picture of him/her) in which we assume that at training time, we are seeing examples that belong to a set of $C$ classes, but at test time, we will be asked to classify examples that belong to classes we have never seen at training time. As a side information, we can consider that we have only one example from that class during training.
• Zero-Shut Learning (like recognizing an object based on a worded description of it) which is similar to one-shut learning but at test time, instead of having an example from a new class, we are going to have some feature representation of what the task is.

I recommend taking a look at his slides as you can easily clear up which type of learning problems you are dealing with and make better decisions to solve it.

## Unintuitive properties of neural networks

In a part of Hugo’s talk, he spent some time talking about properties of neural networks which are perhaps not super intuitive. For instance, some times neural networks are able to achieve the level of accuracy for a task which is close to the level of accuracy that human can achieve but it turns out that they make really dumb mistakes that make no sense to us, for example, cases in object detection in an image where the source of error is not something like optical illusions which happens for human, but just very small differences that are not even visible for human.

#### Neural networks are strangely non-convex

Another way in which neural networks are intriguing is that they are strangely non-convex. So when we are thinking about non-convexity, we think of one or two dimensions and we can imagine a nonconvex objective function with a lot of local optima. Then considering the standard gradient descent procedure which starts from a randomly chosen point, we might think it is kind of hopeless and we are never going to find the global optimum of this nonconvex function.

But it turns out that this intuition is not great at describing the complexity of neural networks. This is because training a neural network is not about an optimization problem in one dimension with a single parameter, but it is an optimization in very high dimensional space with a lot of parameters. In this situation, a lot of critical points, i.e. points corresponds to a derivative of zero are actually not the local minima, but look likes a saddle point. A saddle point is a point that there is a direction that the function is locally convex and essentially increasing in both directions, but another direction exists with opposite situation, where the function is decreasing in both sides:

This is more or less a good news because essentially unless we fall exactly at the critical point, we fall a little bit off that point, the gradient points to a right direction allowing us to improve the loss. In other words, we can think that in a very high dimensional space, if we have a critical point, and we think a bout a random direction, the chances of the critical point is a local minimum is very small because it should curve up in all directions and in general there is going to be a direction that the function goes down (saddle point). Essentially, assuming a rather smooth function,  the closer we are to the global minimum, the harder it is to have a direction that is descending in this high dimensional space.

#### Neural networks work best when badly trained

Imagine we have a simple objective function with a single parameter and we have two global minima:

The training function is the average loss of the training set and testing function is the average loss from the same neural network but on the test set. Abstracting it, we can assume the testing function looks like the training function but maybe a shifted version of it. So, if the local minimum is very narrow, the optimal point might match up very badly with the value we are observing on the test set. So, there is this general observation that flat minima tend to be associated with better generalization. It has been shown that under some definition of flatness, when we are using large mni batch sizes, we are more likely to find sharp minima and with small mini batches, we have a lot of stochasticity which doesn’t allow us to find sharp minima and we preferentially end up in a flat minimum.

#### Neural networks can easily memorize

Although neural networks are state of the art in terms of their ability of generalization, if we train them more on a problem where they should not fit a lot,  assuming that they generalize well since they have less capacity, they will overfit. It has been shown that if we take a dataset and randomly assign labels to data points, the same neural network that is trained longer will eventually fit the data which shows memorization. What this fact suggests is that we cannot think of the learning algorithm and the architecture separately.

#### Neural networks can be compressed

This is related to the idea of knowledge distillation where we have a large neural network that we already trained and we then initialize a smaller neural network (e.g. with fewer hidden units) trained on the same dataset while it also predicts some properties of the large network (essentially the logits), we can more or less achieve the similar performance with the smaller neural network. More interestingly it has been shown that with a shallow neural network, we can approximate surprisingly well a deep neural network. However, if we try to train the shallow network from scratch we cannot reach the same level of performance and in a way, we need the guidance of the larger one.

Max Welling gave a talk about “Marrying Graphical Models & Deep Learning” and he started with a really nice discussion about Bias-Variance Trade-off, and he said, if we want to take one thing away from this summer school, it should be the bias-variance trade-off, as it is actually the way we think about over-fitting and under-fitting, which is the core of machine learning!

Often time, deep learning is seen as an optimisation problem in which we have an objective, we have a dataset, and we should try as hard as we can to optimise the objective. Most of the time, we train for a week and the learning curve is getting better and better. It is clear that the optimization is really important for deep learning but a learning problem is more than just an optimization problem, it’s a statistical inference problem.

We can imagine we build a machine that throws darts at a target,  and the machine might be a bit wiggly due to for example an unstable underground, which means every time we shoot a dart, it might not hit the target in the same spot. Also, the might be a systematic error in this machine, which means the mean of all the darts is off from the center. We call the systematic error the “bias” and the error caused by the wiggliness of the machine the “variance” and the ideal case is having a machine with low bias and low variance (top-left in the image).

In terms of equations, we have $Y= f(x) +\varepsilon$, where $Y$ is label and $f(x)$ is the underlying function that we try to estimate and $\varepsilon$ is the label noise drawn from the normal distribution. We are interested for example in the squared error for point $x$: $Err(x) = E\big[(Y-\hat{f}(x))^2\big]$, which is the difference between the true label $Y$ and our estimate $\hat{f}(x)$. We can decompose this error to three terms:

$Err(x) =\big(E[\hat{f}(x)]-f(x)\big)^2 +E[\big(\hat{f}(x)-E\big[\hat{f}(x)]\big)^2\big] +\sigma_e$

The first term is bias as it looks at the difference between the expected value of our estimate (expectation taken over many datasets) and the true value, the next term is the variance, which is the variance of the estimator that only depends on the estimator, and the third term which is basically the noise. Noise is something that we cannot do anything about as it is part of the characteristics of the problem we deal with, so we simply ignore it.

The game here is to try to get either the bias down or variance down in order to get the total error down. The following plot shows the relation between error and its components with the model complexity:

If we have a very simple model with a few parameters, the variance is going to be small, but the bias is very large. At the other side of the spectrum, with a complex model, we can fit basically everything, but on a particular dataset, we might get still a large error which means the variance is high. We want to sort of trade these two off and find the optimal point where the bias squared and variance added up is minimal.

## What is the learning algorithm in the human brain that allows us to learn about sequences?

In one of the sessions, we had Yoshua Bengio talking about recurrent neural networks and he covered a wide range of related problems to RNNs. Among them, there was a discussion that “How would the human brain do anything like back propagation through time?“. Actually, it does not seem very plausible that the brain would use a strategy anything like back propagation through time. In back propagation through time, we just unfold the graph, we wait until the end and we can compute the gradients as usual with the chain rule. But, it actually does not seem very plausible that the brain would use a strategy anything it as it requires us to store the whole sequence in the memory, hold it there, and look at the outcome and then compute the gradients in the reverse time direction. We might say that although we don’t learn long term dependency with the length of our life, we do it on a daily basis, however, it still does not feel completely right.

There is another strategy to compute the gradient, which was introduced in the late eighties, called real time recurrent learning, which is a forward evaluation of the gradient. In another word, as we are moving forward in time, we can compute the derivative of each of the state variables with respect to each of the parameters, and each time we get a cost, we can use that immediately to make an update and we don’t need to wait for the end of the sequence, which is a form of online learning. It seems very similar to what human brain does but unfortunately when we take the equations to the paper, we end up with that the computations are not practical and wouldn’t scale with the size of the brain, even with the size of the networks we now train for speech and language. There are some works in this direction proposing approximate gradient estimators instead of doing exact gradient computation, but this is still an open question that how can we do online update that is computationally efficient.

## Some of the theoretical puzzles of deep learning

During the summer school, we had a really good session about “Theoretical Neuroscience and Deep Learning Theory”, by  Surya Ganguli. In a part of this session, Surya talked about some of the theoretical puzzles of deep learning and he divided these puzzles into three main topics: Trainability, Expressivity, and Generalizability.

• Trainability: if a good network solution exists with small training error, how do we find it? And what makes a learning problem difficult?
There are a bunch of research works attacking this question by focusing on human semantic cognition. They try to understand the infant level learning from the psychological point of view and build neural networks to mimic this by modeling hierarchical differentiation of concepts. Then they investigating the ability of these neural networks on learning a range of simple and complex tasks.
• Expressivity: what kinds of functions can a deep network express that shallow networks cannot?
We can investigate this part by connecting the deep neural networks to the theory of chaos. It turns out that chaos is a special case of deep learning! So it’s important to know that deep learning can have a chaotic face and we may wish to avoid it. On the other hand, the chaos can be the origin of high expressivity as well.
To understand this, we can pick a random network with $l$ layer and $N$ neurons in each layer, where the weights and biases are i.i.d Gaussian and we can have an arbitrary nonlinearity. We can ask this question that “having a single point in the space (a single feature vector), and we consider it as input and propagate it through the network, does its length grow or shrink?”, or more importantly, “if we have a pair of points in the space that are close to each other, as they propagate through the network, do they get more similar to each other over time, or do they decorrelated and become more different over time?”.  So if the network makes two points more similar, we call it “ordered regime” and if the network fundamentally amplifying small initial differences and making them very different, which is the very definition of chaos: the sensitivity to the small changes in the initial conditions, and that is called the “chaotic regime”.
In a neural network, biases are independent of the input and they push the state of the network to the same direction, regardless of the input. So in case of having two points, biases tend to align two different input and make them similar to each other. However, the large weight can overwhelm the common bias and eventually it makes the inputs dissimilar to each other. So we can have a boundary between chaotic and ordered regime based on biases and weights. The conclusion is that deep neural network can have expressivity if they go to the chaotic regime in a controlled way.
• Generalizability: what principles do deep networks use to place probability or make decisions in regions of input space with little data?
Generalizability is one of the hardest issues and it is still an open problem. Some research works investigated and discussed the fact that generalizability depends on both expressivity and optimizability in a weird way. For instance, it has been shown that deep networks, when they are operated with a few data points and many more parameters, can memorize random labels. What is happening is that based on the general procedure of training, we try to find the best model on the training set, but in deep learning, we are not perfect at optimizing, so even though we can achieve a lot of expressivity, the space of accessible functions we can get to is limited and the ability of generalization is tied up not only with the space of function but optimization.
Also from the stability perspective, it has been shown that if we train our network with small mini batches, that adds noise to our gradient and it turns out we generalize better which is also related to finding flat optima during optimization, while when we train with large mini batches we might think that’s better as we get a better gradient, but we generalize worse.

## GANs for language generation?

Ian Goodfellow gave a talk on Generative Adversarial Networks. At the end of the session, I asked about Ian’s idea on GANs for language. He mentioned that based on his observation, he feels that GANs are not quite there yet. He also said “One thing I don’t understand is that there is quite a lot of enthusiasms for GANs for text in general and to me, that’s a little bit confusing because text seems one of the places where GANs are least likely to help us. For GANs, they work if the output is differentiable because we need to back propagate through the discriminator, and then through the output or the generator and the generator itself.”

He pointed out that making the output of GANs very stable and reliable is really hard even for continuous space and for cases where the output is discrete, it seems like tackling two difficulties at ones and we need to resolve both of them before we see really a success. So it doesn’t seem to easily have cut up to where RNNs text generation is.

## Learning to learn

Nando de Freitas gave a talk on learning to learn. He started by a discussion on the next challenges in AI and pointed out some simple human abilities that a one-year old kid has and no machine that we have ever developed and no machine that we have insight that we think we will be able to build in the next couple of years is capable of them. For instance, the ability of reasoning about objects and interacting with them and manipulating them with an intrinsic motivation or the advanced ability of the human to learn by imitation. Nando argued that this is mostly because the human has the ability to learn from the beginning and human learns the ability to learn.  So learning to learn would be one the most obvious directions to drive the AI research toward.

One of the problems that is getting popular in machine learning is few-shot learning, and the challenge is how a neural network can learn from a few data. Nando argued that learning to learn can tackle this problem as if we have a model that is capable of learning how to use data at test time, it is basically able to quickly learn from a small dataset.

Generally, the learning to learn paradigm is about learning one algorithm using another algorithm and basically, we can say “learning to learn X by Y”, and one can plug in anything as X and Y (like NIPS2016 deepmind paper, where X=SGD and Y=SGD). As one of the instances of the learning to learn paradigm, Nando also discussed the cases where a neural network controls the behavior of another neural network, for example as the optimizer and optimize, or a network generating parameters of the other network.

## Learning  to guess from a guess

Richard S. Sutton also gave a talk on reinforcement learning, mostly focused on Temporal-Difference Learning. Richard started talking about scalability and said “methods that scale with computation are the future of AI” and scale with computation means that when we get more computer power, these methods become more powerful. A lot of current methods are weakly scalable including supervised learning that we need to gather a lot of data and even model-free RL is not a strong scalable method.

The only strong scalable method would be “prediction learning“, where we learn to predict what will happen.  In prediction learning, we don’t need human labeling and we have the target just by waiting. So, prediction learning is the scalable model-free learning.

Richard talked continued by talking about Temporal-Difference Learning (TD-learning) as a method for learning to predict which is widely used in reinforcement learning to predict future reward or value functions. TD-learning is the core of many methods like Q-learning, Sarsa, TD-lambda etc. TD-learning is learning a prediction from another later prediction, which is, in fact,  learning a guess from a guess. As an example, consider we have a model (a neural network) that takes a chess position and generates the probability of winning. And to train that, we define the error simply as the probability of winning in one position minus the probability of winning in the next position and it’s the temporal change in the prediction, then we can use a standard method (like back propagation) to update the parameters of the model with respect to the error and you can learn the model playing against itself and end up with a really strong model competitive with word best chess players.

We are dealing with a multi-stage learning problem when we want to use DT-learning and one would say: can’t we just think of the multi-step as one big step, and then use one-step methods? the answer is no, we really can’t (and shouldn’t want to). Because it’s not scalable as you need to remember all the things you did and in this case, the computation is poorly distributed over time which makes it very costy.  The other question would be that: can’t we just learn one-step predictions, and then iterate them (compose them) to produce multi-step predictions when needed?. The answer is again no, as first of all, we cannot do each step perfectly and iterating over this imperfect steps, we get a propagation of errors. Besides, it’s exponentially complex because as we look a head, at each step there will be many possibilities for action and different choices and this quickly gets computationally intractable. So, in case of having a multi-stage learning problem, TD-learning would an obvious choice.