Separability is a highly desirable property for constructing and operating autonomous information systems, and especially classifiers. In this post, I present a step by step argument which shows that based on the classification principles, having better separability in the feature space leads to better accuracy in the classification results.
[perfectpullquote align=”full” cite=”” link=”” color=”” class=”” size=”16″]For a given set of items presented to a binary classification system, there exists a classification of the items such that the probability of class membership for all items assigned to the class is greater than or equal to the probability of class membership for all items not assigned to the class, and the classification has optimal expected effectiveness.[/perfectpullquote]
Since in many applications, autonomous systems need to decide how to classify an individual item in the absence of entire items set, Lewis has extended the PRP to the Probability Threshold Principle (PTP):
[perfectpullquote align=”full” cite=”” link=”” color=”” class=”” size=”16″]For a given effectiveness measure, there exists a threshold , if all and only those items with the probability of class membership greater than p are assigned to the class, the expected effectiveness of the classification will be the best possible for that set of items.[/perfectpullquote]
PTP in fact discusses optimizing the effectiveness of classifiers by making items separable regarding their probability of class membership, which is a discussion on “separability in the score space”. Based on PTP, optimizing a threshold for separating items is a theoretically trivial task, however, there are practical difficulties. The main difficulty refers to the fact that retrieval models are not necessarily capable of measuring actual probabilities of relevance for documents, so they do not guarantee to generate a set of scores from which the optimum cutoff can be inferred. In this regard, a great deal of work has been done on analyzing the score distribution over relevant and non-relevant documents to utilize this information for finding the appropriate threshold between relevant and non-relevant documents.
It is a clear fact that the more the score distributions of relevant and non-relevant documents are separable, the easier it is to determine the optimum threshold. So, obtaining the separation property in the scores distributions of relevant and non-relevant documents is one of the key focus areas for retrieval models.
There are two ways to obtain separability in the scores distributions. We could address the complex underlying process of score generation and investigate ranking functions that yield a separable score distribution, as in the score distributional approaches. Alternatively, we can investigate ways to provide existing scoring functions with a highly separable representation of the data. That is, the “term distribution” directly provides information about the “probability of relevance” and if there are separable distributions over terms of relevant and non-relevant documents, a scoring function satisfying PRP will generate scores that separate the classes of relevant and non-relevant documents.
Thus, a separation property on feature distribution for representing the data is a favorable property, which follows better accuracy of classifiers’ decisions. As a formal and general definition, we can refer to the model separability as follows:
[perfectpullquote align=”full” cite=”” link=”” color=”” class=”” size=”16″]DEFINITION.The model of an entity is epistemologically “separable” if, and only if, it has unique, non-overlapping features that distinguish it from other models.[/perfectpullquote]
We argued that how separability in feature space leads to the separability in score space. Based on this and the given definition of the separability, we present Strong Separation Principle (SSP), which is a counterpart of the PTP in the feature space:
[perfectpullquote align=”full” cite=”” link=”” color=”” class=”” size=”16″]For a given set of items presented to a classification system, for each class there exists at least one feature in the representation of items, and a threshold , such that for any set of items, if all and only those items with are assigned to the class, the classification will have the optimal possible performance for that set of items in terms of a given effectiveness measure.[/perfectpullquote]
SSP in general is a stronger version of PTP. In strict binary classification, if you have PTP, which holds on the whole feature space, SSP will be satisfied, however in the multi-class case, SSP is stronger and it implies PTP, but not the other way around.
Based on PTP, there is always an underlying probabilistic scoring function on the set of whole features, which generates membership probabilities as the scores of items and these scores make items separable with regards to a threshold. So, the scoring function can be deemed as a mapping function which maps items to a new feature space in which the score of each item is a single feature representation of that item (membership probabilities (or scores) in PTP would be equivalent to in SSP). Thus, when the SSP holds, the PTP and PRP will also hold. One could envision a stronger version of the SSP in which “all” the features in the representations need to be non-overlapping, but the SSP is sufficient for optimizing the effectiveness of the classifier.
For more details and discussions, read our paper:
- Mostafa Dehghani, H. Azarbonyad, J. Kamps, and M. Marx, “On Horizontal and Vertical Separation in Hierarchical Text Classification“, In proceedings of ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR’16), 2016.