Learning Information Retrieval by Papers: TF-IDF and Term Weighting
It is difficult for an information retrieval system to predict which query terms (words) users will issue to find relevant documents. Every term in the documents is a potential query term. The modern solution, used by most Web search engines, is to index all the terms occurring in the documents.
If we do that, the next question is how to calculate, for a given query (containing multiple terms), a relevance score for each document in the collection. The simplest calculation is to count the number of terms the query has in common with a document. This number has become known as the co-ordination level. [bibtex file=ir.bib key=Cleverdon]
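As a minimal sketch of co-ordination level matching, the snippet below counts the distinct query terms that also appear in a document. The whitespace tokenizer, the function names, and the three toy documents are assumptions made for illustration; they do not come from the cited paper.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def coordination_level(query, document):
    """Count the distinct query terms that also occur in the document."""
    return len(set(tokenize(query)) & set(tokenize(document)))


# Toy collection, invented for this example.
docs = [
    "information retrieval systems index documents",
    "information theory and coding",
    "cooking recipes for beginners",
]

scores = [coordination_level("information retrieval", d) for d in docs]
print(scores)
```

The first document matches both query terms, the second matches one, and the third matches none, so ranking by this score orders them in that sequence.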
However, the co-ordination level does not consider how often a query term occurs in a document. A document that contains the given query terms more often is likely to be more relevant. One could use the frequency of occurrence as a weight, which is called Term Frequency (TF). The relevance score is then the sum of the frequencies of all query terms in the document. [bibtex file=ir.bib key=salton_1968_tf]
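The term frequency score described above can be sketched as follows. This is an illustration under simple assumptions (raw counts, whitespace tokenization, each distinct query term counted once); the function name and example documents are hypothetical.

```python
from collections import Counter


def tf_score(query, document):
    """Sum the raw in-document frequency of each distinct query term."""
    counts = Counter(document.lower().split())
    # Counter returns 0 for terms that do not occur in the document.
    return sum(counts[term] for term in set(query.lower().split()))


# Both documents contain both query terms, so their co-ordination
# level is identical; term frequency tells them apart.
doc_a = "information retrieval evaluates retrieval systems"
doc_b = "information retrieval basics"

print(tf_score("information retrieval", doc_a))
print(tf_score("information retrieval", doc_b))
```

Here `doc_a` scores higher than `doc_b` only because "retrieval" occurs twice in it, which is exactly the signal that co-ordination level matching ignores.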
Relying purely on term frequency is still limited because query terms differ in their ability to discriminate between documents. A query term is not a good discriminator if it occurs in many documents, so we should give it less weight than one occurring in few documents. For example, when querying “information retrieval”, documents containing “information” are unlikely to be more relevant than documents containing “retrieval”, since “information” typically occurs in far more documents. In 1972, Spärck Jones introduced a measure of term specificity (discriminative power) called Inverse Document Frequency (IDF).
[bibtex file=ir.bib key=tfidf]
Basically, IDF is based on counting the number of documents in the collection that contain the term: if a term occurs in many documents, it has little discriminative power. In her original paper, Spärck Jones demonstrated that IDF weighting outperforms co-ordination level matching. Coupled with TF (the frequency of occurrence of a term t in document d), IDF appears in almost every term weighting scheme. The combined weighting scheme is generally known as TF-IDF.
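To make the combination concrete, here is a minimal TF-IDF sketch. It assumes the common textbook form idf(t) = log(N / n_t), where N is the number of documents and n_t is the number of documents containing t; this is one of several variants and not necessarily the exact formula from the original paper. The function names and the toy collection are invented for the example.

```python
import math
from collections import Counter


def build_idf(documents):
    """Map each term to log(N / n_t) over the given collection."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Use a set so each document counts a term at most once.
        doc_freq.update(set(doc.lower().split()))
    return {term: math.log(n_docs / n) for term, n in doc_freq.items()}


def tfidf_score(query, document, idf):
    """Sum tf(t, d) * idf(t) over the distinct query terms."""
    counts = Counter(document.lower().split())
    return sum(
        counts[term] * idf.get(term, 0.0)  # unseen terms contribute nothing
        for term in set(query.lower().split())
    )


# Toy collection: "information" is in every document, "retrieval" in one.
docs = [
    "information retrieval ranks documents",
    "information theory measures information",
    "information systems store records",
]

idf = build_idf(docs)
scores = [tfidf_score("information retrieval", d, idf) for d in docs]
print(scores)
```

Because "information" occurs in all three documents, its IDF under this variant is log(3/3) = 0, so the second document gains nothing from mentioning it twice. Only the first document, which contains the rarer term "retrieval", receives a non-zero score.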
The original IDF idea was based on heuristics. Researchers have since looked for theoretical explanations in Information Theory and in probabilistic IR models (the RSJ model, the language models).
[bibtex file=ir.bib key=tfidf_steve]