Archive

Archive for the ‘SIGIR2009’ Category

UCL at SIGIR2009

July 16th, 2009 Comments off

  • J. Wang and J. Zhu, "Portfolio Theory of Information Retrieval," in Proc. of the Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR), 2009. bibtex
    Go to document Go to document
    @INPROCEEDINGS{Wang:sigir2009:1,
      author = {Jun Wang and Jianhan Zhu},
      TITLE = {Portfolio Theory of Information Retrieval},
      BOOKTITLE = {Proc. of the Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR)},
      url={http://web4.cs.ucl.ac.uk/staff/jun.wang/blog/2009/06/17/portfolio-theory-of-information-retrieval/},
      pdf = {http://web4.cs.ucl.ac.uk/staff/jun.wang/papers/2009-sigir09-portfoliotheory.pdf},
      YEAR = {2009}
    }
  • J. Zhu, J. Wang, M. Taylor, and I. Cox, "Risky Business: Modeling and Exploiting Uncertainty in Information Retrieval," in Proc. of the Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR), 2009. bibtex
    Go to document Go to document
    @INPROCEEDINGS{Wang:sigir2009:2,
      author = {Jianhan Zhu and Jun Wang and Michael Taylor and Ingemar Cox},
      TITLE = {Risky Business: Modeling and Exploiting Uncertainty in Information Retrieval},
      pdf = {http://web4.cs.ucl.ac.uk/staff/jun.wang/papers/2009-sigir09-riskmag.pdf},
      url = {http://web4.cs.ucl.ac.uk/staff/jun.wang/blog/2009/06/17/risk-aware-information-retrieval-models/},
      BOOKTITLE = {Proc. of the Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR)},
      YEAR = {2009}
    }
  • J. Zhu, J. Wang, and V. Vinay, "Topic (Query) Selection for IR Evaluation," in Proc. of the Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR), 2009. bibtex
    Go to document
    @INPROCEEDINGS{Wang:sigir2009:3,
      author = {Jianhan Zhu and Jun Wang and Vishwa Vinay},
      TITLE = {Topic (Query) Selection for IR Evaluation},
      BOOKTITLE = {Proc. of the Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR)},
      pdf={http://web4.cs.ucl.ac.uk/staff/jun.wang/papers/2009-sigir09-querysel.pdf},
      YEAR = {2009}
    }

There were six papers in the session of Probabilistic IR Models and two of them were from us. The idea about uncertainty of document ranking and portfolio theory have well been received in the conference. The feedback is positive. One of the conference notes can be found at http://miksa.ils.unc.edu/~wke/wordpress/?p=152 written by Weimao Ke.

Some of the snap shots of our talks:

Jun is talking about Portfolio Theory of Information Retrieval

Jun is talking about Portfolio Theory of Information Retrieval

The title

the slides of "Portfolio Theory of IR"

Jianhan is explaining risk models of IR

Jianhan is explaining risk models of IR

The slides of "Risk Models of IR"

The slides of "Risk Models of IR"

Categories: SIGIR2009 Tags:

Portfolio Theory of Information Retrieval

July 11th, 2009 Comments off
  • J. Wang and J. Zhu, "Portfolio Theory of Information Retrieval," in Proc. of the Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR), 2009. bibtex
    Go to document Go to document
    @INPROCEEDINGS{Wang:sigir2009:1,
      author = {Jun Wang and Jianhan Zhu},
      TITLE = {Portfolio Theory of Information Retrieval},
      BOOKTITLE = {Proc. of the Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR)},
      url={http://web4.cs.ucl.ac.uk/staff/jun.wang/blog/2009/06/17/portfolio-theory-of-information-retrieval/},
      pdf = {http://web4.cs.ucl.ac.uk/staff/jun.wang/papers/2009-sigir09-portfoliotheory.pdf},
      YEAR = {2009}
    }

Presentation slides

  1. Portfolio effect of Document Ranking

As a general rule of thumb, when we present search results for a query, documents should be ranked according to their relevance scores – the document with the highest estimated relevance score is retrieved first, the document with the next highest is retrieved second, and so on. The ranking principle (commonly called the Probability Ranking Principle) has been dominated in information retrieval about fifty years (ever since the Maron and Kuhns’ classic work on probabilistic indexing).

It took me a while to trace back the origin of the Probability Ranking Principle (PRP). In recent correspondence with William Cooper, he confirmed my understanding that,  while ” the idea of probabilistic ranking was applied implicitly in earlier papers, starting with the pioneering paper of Maron and Kuhns in the early 1960′s,”, the PRP was ” first named, stated explicitly, and analyzed critically with the help of a counterexample” in the William Cooper’s 1971 paper “The inadequacy of probability of usefulness as a ranking criterion for retrieval system output.“ After 1971 the PRP was examined explicitly by (Stephen) Robertson and others.”

While the Probability Ranking Principle has widely been adopted, either explicitly  or implicitly,  it is, however, not desirable in many common scenarios where the predicted relevance of documents are dependent between each other.  I generally believe that presenting search results is not just about picking individual relevant documents, but about choosing the right combination of relevant documents – the portfolio effect. After all, the user’s satisfaction of the search results relies on the overall effect of the ranked list.

The analogy between search result ranking in information retrieval and selecting stocks in financial markets

The analogy between search result ranking in information retrieval and selecting stocks in financial markets

To understand the problem, let us take a look at the field of finance.  Our ranking task resembles the investment problem in financial markets; for example, suppose that an investor needs to select a set (portfolio) of n stocks that will provide the best distribution of future return, given his or her investment budget – an analogy with IR is that we invest ranking positions in documents.
The PRP of IR might suggest that, for optimal selection, one should first rank stocks in order of decreasing future returns and then choose the top-n most “profitable” stocks to construct the portfolio. Such a principle that essentially maximizes the expected future return was, however, rejected by an economist Harry Markowitz in his Nobel Prize winning work, the Modern Portfolio Theory (MPT) of finance, in 1952. As one of the most influential economic theories dealing with finance and investment, the MPT was motivated on the basis of the following two observations .

  • The future return of a stock is unknown and cannot be calculated with absolute certainty. Investors have different preferences of the risk associated with uncertainty. Therefore, it is highly desirable to have a method of quantifying this uncertainty or risk, and reflect them and incorporate users’ risk preferences when selecting stocks.
  • In practice the future returns of stocks are correlated. Assuming independence between the returns and selecting them independently to construct a portfolio is not preferable.

Markowitz’ approach is based on the analysis of the expected return (mean) of  a portfolio and its variance (or standard deviation) of return, where the latter serves as a measure of risk. A portfolio solution is preferred to have high expected returned and low variance; conversely, any solution that has low return and large variance should be avoided. Yet, in investment, it is always a trade-off between the expected return and variance. The following example illustrates that you either take more risk (larger variance) in order to obtain more return in the portfolio (Google type of stocks), or conversely trade off return for having more confidence on the portfolio (Coca Cola type of stocks).

portfolio-explained

Following the same way of thinking, we can summarize the overall effectiveness of a ranked list by its mean (expected overall relevance) and variance (the uncertainty or risk of the mean).  Through the analysis of the mean and variance of a ranked list, we show that an optimal rank order is the one that balancing the overall relevance (mean) of the ranked list against its risk level (variance). Based on this principle, we then derive an efficient document ranking algorithm. It generalizes the well-known probability ranking principle (PRP) by considering both the uncertainty of relevance predictions and correlations between retrieved documents.

2. diversification as a way of reducing ranking risk

Diversification reduces the ranking risk

Diversification reduces the ranking risk

Moreover, the benefit of diversification is mathematically quantified; diversifying documents is an effective way to reduce the risk of document ranking. To understand this, consider two extreme cases: in the first case, suppose we have a ranked list consisting of two documents, where the correlation coefficient ρ between them is −1. This means that their estimated relevance scores change in the exact opposite direction in response to different information needs. The volatility (the change) of the documents’ relevance cancels one another completely and leads to a situation where the ranked list has no volatility at all. As a result, a certain amount of relevance for any kind of user information needs is maintained. Conversely, when we have two documents that are perfectly correlated (ρ = 1) in the list, the relevance returns of the two documents move  in the perfectly same direction in response to different information needs. In this case, the returned relevance of the list mimics that of each of the two documents. As a result, the list contains the same amount of uncertainty (risk) as each of the two documents alone. In this case, risk is not reduced.

Topic (Query) Selection for IR Evaluation

June 18th, 2009 Comments off

The need for evaluating large amounts of topics (queries) makes
IR evaluation an uneasy task. In this paper, we study a topic se-
lection problem for IR evaluation. The selection criterion is based
on the overall difficulty of the chosen set, as well as the uncertainty
of the final IR metric applied to the systems. Our preliminary ex-
periments demonstrate that our approach helps to identify a set of
topics that provides confident estimates of systems’ performance
while keeping the requirement of the query difficulty.

@INPROCEEDINGS{Wang:sigir2009:3,
AUTHOR =       {Jianhan  Zhu and Jun Wang and Vishwa  Vinay},
TITLE =        {Topic (Query) Selection for IR Evaluation},
BOOKTITLE =    {SIGIR09 Poster},
YEAR =         {2009}
}

Query Selection Paper (PDF)

Risk-aware Information Retrieval Models

June 17th, 2009 Comments off

beyondsearch-click-through-models-v5

Most retrieval models estimate the relevance of each document to a query and rank the documents accordingly. However, such an approach ignores the uncertainty associated with the estimates of relevancy. If a high estimate of relevancy also has a high uncertainty, then the document may be very relevant or not relevant at all. Another document may have a slightly lower estimate of relevancy but the corresponding uncertainty may be much less. In such a circumstance, should the retrieval engine risk ranking the first document highest, or should it choose a more conservative (safer) strategy that gives preference to the second document? There is no definitive answer to this question, as it depends on the risk preferences of the user and the information retrieval system. In this paper we present a general framework for modeling uncertainty and introduce an asymmetric loss function with a single parameter that can model the level of risk the system is willing to accept. By adjusting the risk preference parameter, our approach can effectively adapt to users’ different retrieval strategies. We apply this asymmetric loss function to a language modeling framework and a practical risk-aware document scoring function is obtained. Our experiments on several TREC collections show that our “risk-averse” approach significantly improves the Jelinek-Mercer smoothing language model, and a combination of our “risk-averse” approach and the Jelinek-Mercer smoothing method generally outperforms the Dirichlet smoothing method. Experimental results also show that the “risk-averse” approach, even without smoothing from the collection statistics, performs as well as three commonly-adopted retrieval models, namely, the Jelinek- Mercer and Dirichlet smoothing methods, and BM25 model.

Our work on this topic has been accepted in SIGIR2009 and  ECIR2009.

@INPROCEEDINGS{Wang:sigir2009:2,
AUTHOR =       {Jianhan Zhu and Jun Wang and Michael Taylor and Ingemar Cox},
TITLE =        {Risky Business: Modeling and Exploiting Uncertainty in Information Retrieval},
BOOKTITLE =    {SIGIR09 Full Paper},
YEAR =         {2009}
}
@INPROCEEDINGS{Wang:ecir2009:2,
AUTHOR =       {Jianhan Zhu and Jun Wang and Michael J Taylor and Ingemar Cox},
TITLE =        {Risk-aware Information Retrieval},
BOOKTITLE =    {Proc. of European Conference on Information Retrieval (ECIR 2009)},
YEAR =         {2009}
}

SIGIR Paper (PDF) ECIR Paper (PDF)