Portfolio Theory of Information Retrieval
[bibtex file=mypublications.bib key=Wang:sigir2009:1]
- Portfolio effect of Document Ranking
As a general rule of thumb, when we present search results for a query, documents should be ranked according to their relevance scores – the document with the highest estimated relevance score is retrieved ﬁrst, the document with the next highest is retrieved second, and so on. The ranking principle (commonly called the Probability Ranking Principle) has been dominated in information retrieval about fifty years (ever since the Maron and Kuhns’ classic work on probabilistic indexing).
It took me a while to trace back the origin of the Probability Ranking Principle (PRP). In recent correspondence with William Cooper, he confirmed my understanding that, while ” the idea of probabilistic ranking was applied implicitly in earlier papers, starting with the pioneering paper of Maron and Kuhns in the early 1960′s,”, the PRP was ” first named, stated explicitly, and analyzed critically with the help of a counterexample” in the William Cooper’s 1971 paper “The inadequacy of probability of usefulness as a ranking criterion for retrieval system output.“ “After 1971 the PRP was examined explicitly by (Stephen) Robertson and others.”
While the Probability Ranking Principle has widely been adopted, either explicitly or implicitly, it is, however, not desirable in many common scenarios where the predicted relevance of documents are dependent between each other. I generally believe that presenting search results is not just about picking individual relevant documents, but about choosing the right combination of relevant documents – the portfolio effect. After all, the user’s satisfaction of the search results relies on the overall effect of the ranked list.
To understand the problem, let us take a look at the ﬁeld of ﬁnance. Our ranking task resembles the investment problem in ﬁnancial markets; for example, suppose that an investor needs to select a set (portfolio) of n stocks that will provide the best distribution of future return, given his or her investment budget – an analogy with IR is that we invest ranking positions in documents.
The PRP of IR might suggest that, for optimal selection, one should ﬁrst rank stocks in order of decreasing future returns and then choose the top-n most “proﬁtable” stocks to construct the portfolio. Such a principle that essentially maximizes the expected future return was, however, rejected by an economist Harry Markowitz in his Nobel Prize winning work, the Modern Portfolio Theory (MPT) of ﬁnance, in 1952. As one of the most inﬂuential economic theories dealing with ﬁnance and investment, the MPT was motivated on the basis of the following two observations .
- The future return of a stock is unknown and cannot be calculated with absolute certainty. Investors have different preferences of the risk associated with uncertainty. Therefore, it is highly desirable to have a method of quantifying this uncertainty or risk, and reﬂect them and incorporate users’ risk preferences when selecting stocks.
- In practice the future returns of stocks are correlated. Assuming independence between the returns and selecting them independently to construct a portfolio is not preferable.
Markowitz’ approach is based on the analysis of the expected return (mean) of a portfolio and its variance (or standard deviation) of return, where the latter serves as a measure of risk. A portfolio solution is preferred to have high expected returned and low variance; conversely, any solution that has low return and large variance should be avoided. Yet, in investment, it is always a trade-off between the expected return and variance. The following example illustrates that you either take more risk (larger variance) in order to obtain more return in the portfolio (Google type of stocks), or conversely trade off return for having more conﬁdence on the portfolio (Coca Cola type of stocks).
Following the same way of thinking, we can summarize the overall effectiveness of a ranked list by its mean (expected overall relevance) and variance (the uncertainty or risk of the mean). Through the analysis of the mean and variance of a ranked list, we show that an optimal rank order is the one that balancing the overall relevance (mean) of the ranked list against its risk level (variance). Based on this principle, we then derive an efficient document ranking algorithm. It generalizes the well-known probability ranking principle (PRP) by considering both the uncertainty of relevance predictions and correlations between retrieved documents.
2. diversification as a way of reducing ranking risk
Moreover, the benefit of diversification is mathematically quantified; diversifying documents is an effective way to reduce the risk of document ranking. To understand this, consider two extreme cases: in the ﬁrst case, suppose we have a ranked list consisting of two documents, where the correlation coefficient ρ between them is −1. This means that their estimated relevance scores change in the exact opposite direction in response to different information needs. The volatility (the change) of the documents’ relevance cancels one another completely and leads to a situation where the ranked list has no volatility at all. As a result, a certain amount of relevance for any kind of user information needs is maintained. Conversely, when we have two documents that are perfectly correlated (ρ = 1) in the list, the relevance returns of the two documents move in the perfectly same direction in response to different information needs. In this case, the returned relevance of the list mimics that of each of the two documents. As a result, the list contains the same amount of uncertainty (risk) as each of the two documents alone. In this case, risk is not reduced.