Despite the buzz surrounding deep neural networks (DNN) models for information retrieval, the literature is still lacking a systematic basic investigation on how generally we can model the retrieval problem using neural networks.
Modeling the retrieval problem in the context of neural networks means the general way that we frame the problem with regards to the essential components of a neural network, including what we consider as the objective function, and which kind of architecture we employ, how we feed the data to the network, etc.
Here, in this post, I try to present different general architectures that can be considered for modeling the retrieval problem. First, I provide a categorization of different models based on their objective function, and then I will discuss different approaches with regards to their inference time. Note that in the figures, I use the fully connected feed-forward neural network, while it can be replaced by more complex or more expressive neural models like LSTMs, or CNN.
Categorizing Models by the Type of Objective Function
There are different models that the retrieval problem can be generally formulated in the neural network framework in terms of the objective function which is defined to be optimized: Retrieval as Regression, Retrieval as Ranking, and Retrieval as Classification. I am going to explain these models and discuss their pros and cons.
Retrieval as Regression
The first architecture would be framing the retrieval problem as the scoring problem which can be phrased as the regression problem. In the regression model (left most model in above figure), given the query and the document , we aim at generating a score, which could be for example interpreted as the probability that the document is relevant given the query . In this model, network learns to produce calibrated scores, which at the end, these scores are used to rank documents. This model is also referred as the point-wise model in the learning to rank literature.
Retrieval as Ranking
The retrieval problem also can instead be phrased more directly as a ranking or sorting problem. In case that we do not need the calibrated scores and we just needs to know which document should be ranked above another document, we are able to formulate the retrieval problem as a ranking model (the middle model in the above figure). In the ranking model, there is no constraint to produce calibrated scores, so different runs will produce different values, but the same sort order. Relative comparisons of scores are used to rank documents. This model is referred as the pair-wise model in the learning to rank literature. The advantage of this model over the first model is that the objective function is more relaxed, i.e. the model does not aim at learning the exact scores, and instead it aims at fitting the ranking, which makes the training process simpler and faster as well as more robust against small noises.
Retrieval as Classification
The other option is to see the retrieval problem as a multinomial problem and frame it as a classification task. In this model, documents are considered as categories and for a given query , we select a document as the category to which the given query is most related. This model is trained by softmax using candidate sampling.
This model has some advantages over two previous models, we can employ, for example, softmax as the final layer so that we do not need to provide explicit negative examples. The negative back propagation signal is implicitly generated for all non-label classes as part of the training. Moreover, in the inference time, we do not require to perform a separate stage of generating candidates to score (or rank). We simply feed the model with the query, and the output would be the documents that match the best and we can get the top-k documents based on their scores.
However, there are limitations to the classification approach. The first one is scaling beyond $O(n)$ documents is difficult with the softmax layers. More importantly, in this model, we train the network on a fixed set of documents, which makes it unable to score new documents that were not present at training time. With regards to this limitations, we leave this model out from our investigation in this paper.
The models presented above are different models in terms of learning different things. In particular, the regression model learns a binomial distribution of what is the chance of retrieving a particular document given the query. The classification model learns a multinomial distribution of which document to retrieve given that one relevant document is spouses to be retrieved, and the ranking model simply tries to learn uncalibrated scores to rank documents given a query.
Categorizing Models by the Inference Time
Beside the categorization presented above, we can group retrieval neural models based on their big-O inference run-time. This is measured as a function of the number of documents being scored.
Constant Time Models
These are models aiming at learning proper representation for query and documents. They process the query features only once and at the inference time, they can quickly find approximate best matches to a set of query features in a large document corpora with regards to the learned representations.
In fact, these models are not constant-time as finding the nearest neighbor is not a constant time process, but there are algorithms that they can approximate finding nearest neighbors in sub-linear time based on the number of documents. These models are called late-combining models or representation-focused models which are generally designed based on the on siamese neural networks. Figure bellow shows how we can design models with constant time during inference, but with different architecture in terms of objective function (as it is explained above).
The leftmost model marked is trained via regression (linear or logistic) using specific instances of query and document, having a score as the label. The middle model is trained with a ranking loss but note that it also trains the “m-dimension” embeddings. The rightmost model is the same as the softmax classification model in the previous section, however, the softmax weights have been shown as pulled out to a separate matrix indexed by documents for clarity. This is still the model trained by softmax using candidate sampling dot product layer.
Linear Time Models
This category of models is called early-combining models or interaction-focused models. In these models, there is no need to produce embedding and they support the rich relationship between query and documents.
The category includes regression models in which the query and document are scored together (left model in the bellow figure), and also ranking model, which given a query, a positive document, and a negative document, it process the query and each of documents simultaneously in a single model, via a retrogression like procedure, and generates two float numbers, one for the positive example and for the negative example, to be fed to a pairwise loss that gets back-propagated through the network (right model in the bellow figure). This model is trained as pair-wise model but it is used in a point-wise setting in the inference time.
Quadratic Time Models
The last category is fully pairwise early combination models, which not only use pairwise input during the training but also they need the pairwise input during inference, which makes them cost intensive models. In other words, during inference, the output of the model is pairwise comparisons and in order to map these comparisons to scalar scores, for each document, we can calculate the average of predictions against all other candidate documents.
There are some approximations would be applicable to decrease the time complexity during inference in the real world application, but addressing them is out of the scope of this post.