
The Efficient Frontier in document ranking
Jun Wang, Mean-Variance Analysis: A New Document Ranking Theory in Information Retrieval, ECIR2009
This paper concerns document ranking in information retrieval – particularly collaborative filtering and recommender systems. In information retrieval systems, the widely accepted probability ranking principle (PRP) suggests that, for optimal retrieval, documents should be ranked in order of decreasing probability of relevance. In this paper, we present a new document ranking paradigm, arguing that a better, more general solution is to optimize the top-n ranked documents as a whole, rather than ranking them independently. Inspired by Modern Portfolio Theory in finance, we quantify a ranked list of documents on the basis of its expected overall relevance (mean) and its variance; the latter serves as a measure of risk, which has rarely been studied for document ranking in the past. Through the analysis of the mean and variance, we show that an optimal rank order is the one that maximizes the overall relevance (mean) of the ranked list at a given risk level (variance). Based on this principle, we then derive an efficient document ranking algorithm. It extends the PRP by considering both the uncertainty of relevance predictions and the correlations between retrieved documents. Furthermore, we quantify the benefits of diversification, and theoretically show that diversifying documents is an effective way to reduce the risk of document ranking. Experimental results on the collaborative filtering problem confirm the theoretical insights with improved recommendation performance, for example a gain of over 300% over PRP-based ranking for user-based recommendation.
@INPROCEEDINGS{Wang:ecir2009:1,
AUTHOR = {Jun Wang},
TITLE = {{M}ean-Variance Analysis: A New Document Ranking Theory in Information Retrieval},
BOOKTITLE = {Proc. of European Conference on Information Retrieval (ECIR 2009)},
YEAR = {2009}
}
The ECIR Paper (PDF) 
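
The mean-variance idea in the abstract above can be illustrated with a small Python sketch. It is not the algorithm from the paper: the greedy position-by-position selection, the logarithmic position weights, the risk parameter `b`, and the function name `mean_variance_rank` are all assumptions made for illustration. The sketch scores a ranked list by its weighted expected relevance minus `b` times its weighted variance, and fills each rank with the document that adds the most to that objective.

```python
import numpy as np

def mean_variance_rank(mu, cov, b, n, weights=None):
    """Greedily build a top-n list trading expected relevance against variance.

    mu      : estimated relevance (mean) of each candidate document
    cov     : covariance matrix of the relevance estimates
    b       : hypothetical risk parameter (b = 0 reduces to sorting by mu)
    weights : position weights; an assumed 1 / log2(rank + 1) discount by default
    """
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = min(n, len(mu))
    if weights is None:
        weights = 1.0 / np.log2(np.arange(2, n + 2))

    selected = []
    remaining = list(range(len(mu)))
    for k in range(n):
        w_k = weights[k]
        best, best_gain = None, -np.inf
        for i in remaining:
            # Covariance of candidate i with documents already placed higher up
            cross = sum(weights[j] * cov[i, d] for j, d in enumerate(selected))
            # Increase in (weighted mean - b * weighted variance) from placing i at rank k
            gain = w_k * mu[i] - b * (w_k ** 2 * cov[i, i] + 2.0 * w_k * cross)
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        remaining.remove(best)
    return selected

# Documents 0 and 1 are strongly correlated; document 2 is independent of both.
mu = [0.9, 0.88, 0.6]
cov = [[0.04, 0.038, 0.0],
       [0.038, 0.04, 0.0],
       [0.0, 0.0, 0.04]]
print(mean_variance_rank(mu, cov, b=0.0, n=2))  # PRP-style order by mean
print(mean_variance_rank(mu, cov, b=5.0, n=2))  # risk-averse, favours diversification
```

With `b = 0` the sketch reproduces a PRP-style ordering by mean. With a positive `b`, a document that is highly correlated with one already in the list is penalized, so a less correlated document can be preferred even if its estimated relevance is lower.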

Most retrieval models estimate the relevance of each document to a query and rank the documents accordingly. However, such an approach ignores the uncertainty associated with the estimates of relevancy. If a high estimate of relevancy also has a high uncertainty, then the document may be very relevant or not relevant at all. Another document may have a slightly lower estimate of relevancy but the corresponding uncertainty may be much less. In such a circumstance, should the retrieval engine risk ranking the first document highest, or should it choose a more conservative (safer) strategy that gives preference to the second document? There is no definitive answer to this question, as it depends on the risk preferences of the user and the information retrieval system. In this paper we present a general framework for modeling uncertainty and introduce an asymmetric loss function with a single parameter that can model the level of risk the system is willing to accept. By adjusting the risk preference parameter, our approach can effectively adapt to users’ different retrieval strategies. We apply this asymmetric loss function to a language modeling framework and obtain a practical risk-aware document scoring function. Our experiments on several TREC collections show that our “risk-averse” approach significantly improves the Jelinek-Mercer smoothing language model, and a combination of our “risk-averse” approach and the Jelinek-Mercer smoothing method generally outperforms the Dirichlet smoothing method. Experimental results also show that the “risk-averse” approach, even without smoothing from the collection statistics, performs as well as three commonly adopted retrieval models, namely, the Jelinek-Mercer and Dirichlet smoothing methods and the BM25 model.
Our work on this topic has been accepted at SIGIR 2009 and ECIR 2009.
@INPROCEEDINGS{Wang:sigir2009:2,
AUTHOR = {Jianhan Zhu and Jun Wang and Michael Taylor and Ingemar Cox},
TITLE = {Risky Business: Modeling and Exploiting Uncertainty in Information Retrieval},
BOOKTITLE = {SIGIR09 Full Paper},
YEAR = {2009}
}
@INPROCEEDINGS{Wang:ecir2009:2,
AUTHOR = {Jianhan Zhu and Jun Wang and Michael J Taylor and Ingemar Cox},
TITLE = {Risk-aware Information Retrieval},
BOOKTITLE = {Proc. of European Conference on Information Retrieval (ECIR 2009)},
YEAR = {2009}
}
SIGIR Paper (PDF) ECIR Paper (PDF)
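
The trade-off described in the abstract above, between a high but uncertain relevance estimate and a slightly lower but more certain one, can be illustrated with a short Python sketch. The abstract does not give the loss function or the resulting scoring formula, so everything here is an assumption: the sketch uses the LINEX loss, a well-known asymmetric loss with a single parameter `b`, and assumes Gaussian uncertainty, under which the Bayes-optimal point estimate is the mean minus `b` times half the variance. The function names and the example numbers are hypothetical.

```python
def risk_adjusted_score(mean, variance, b):
    """Point estimate under an assumed LINEX loss and Gaussian uncertainty.

    b > 0 is risk-averse (penalizes variance), b < 0 is risk-seeking,
    and b = 0 falls back to ranking by the mean estimate alone.
    """
    return mean - 0.5 * b * variance

def rank_by_risk_adjusted_score(docs, b):
    """Rank document ids by risk-adjusted score; docs maps id -> (mean, variance)."""
    def score(doc_id):
        mean, variance = docs[doc_id]
        return risk_adjusted_score(mean, variance, b)
    return sorted(docs, key=score, reverse=True)

# Hypothetical estimates: A looks more relevant but is far less certain than B.
docs = {"A": (0.70, 0.09), "B": (0.65, 0.01)}
print(rank_by_risk_adjusted_score(docs, b=0.0))  # ranks by the mean only
print(rank_by_risk_adjusted_score(docs, b=2.0))  # risk-averse setting
```

In this example the risk-neutral setting ranks A first, while the risk-averse setting prefers B because A's larger variance outweighs its small advantage in mean relevance.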
