Learning Information Retrieval by Papers: TF-IDF and Term Weighting

TF-IDF
It is difficult for an information retrieval system to predict which query terms (words) users will issue to find relevant documents. Every term (word) in the documents could be a potential query term. A modern solution, as in most Web search engines, is to index all the terms occurring in the documents.
If we do that, the next question is how to calculate a relevance score for each document in a collection, given a query containing multiple terms. The simplest calculation is to count the number of terms the query has in common with a document. This number has become known as the co-ordination level.
- C. W. Cleverdon, "Aslib Cranfield research project: report on the testing and analysis of an investigation into the comparative efficiency of indexing systems," Cranfield Library, 1962.
@article{Cleverdon,
author = {Cyril W. Cleverdon},
title = {Aslib Cranfield research project: report on the testing and analysis of an investigation into the comparative efficiency of indexing systems},
journal = {Cranfield Library},
year = {1962}
}
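The co-ordination level described above can be sketched in a few lines of Python. This is only a minimal illustration, not Cleverdon's procedure: the function name `coordination_level`, the lowercase whitespace tokenization, and the two sample documents are all assumptions made for the example.

```python
def coordination_level(query, document):
    """Count the distinct query terms that also occur in the document."""
    # Assumption: terms are lowercase, whitespace-separated tokens.
    query_terms = set(query.lower().split())
    document_terms = set(document.lower().split())
    return len(query_terms & document_terms)


# Hypothetical two-document collection.
docs = [
    "information retrieval is the science of searching",
    "information theory studies communication",
]
scores = [coordination_level("information retrieval", d) for d in docs]
print(scores)  # [2, 1]
```

The first document shares both query terms and the second shares only one, so the first ranks higher. How often each term occurs plays no role in this score.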
However, co-ordination level matching does not consider how often a query term occurs in a document. A document that contains more occurrences of the given query terms is likely to be more relevant. One could use the frequency of occurrence as a weight, which is called Term Frequency (TF). The relevance score is then the sum of the frequencies of all query terms in a document.
- G. Salton, Automatic Information Organization and Retrieval, McGraw Hill Text, 1968.
@book{salton_1968_tf,
author = {Gerard Salton},
title = {Automatic Information Organization and Retrieval},
year = {1968},
isbn = {0070544859},
publisher = {McGraw Hill Text},
}
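Term-frequency scoring, as described above, can be sketched like this. It is a rough illustration under assumptions, not Salton's formulation: the name `tf_score`, the whitespace tokenization, and the sample documents are hypothetical.

```python
from collections import Counter


def tf_score(query, document):
    """Sum the in-document frequencies of all query terms."""
    # Assumption: terms are lowercase, whitespace-separated tokens.
    counts = Counter(document.lower().split())
    # Counter returns 0 for terms that do not occur in the document.
    return sum(counts[term] for term in query.lower().split())


# Hypothetical two-document collection.
docs = [
    "information retrieval and information extraction",
    "retrieval of documents",
]
scores = [tf_score("information retrieval", d) for d in docs]
print(scores)  # [3, 1]
```

In the first document "information" occurs twice and "retrieval" once, giving a score of 3. Every occurrence counts equally, whichever term it belongs to.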
TF-IDF
Relying purely on term frequency is still limiting because query terms differ in their ability to discriminate between documents. A query term that occurs in many documents is not a good discriminator, so it should receive less weight than a term that occurs in few documents. For example, when querying “information retrieval”, documents containing “information” are unlikely to be more relevant than documents containing “retrieval”, since “information” is typically the more common word. In 1972, Spärck Jones introduced a measure of term specificity (discriminative power) called Inverse Document Frequency (IDF).
- K. Spärck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, 1972.
@article{tfidf,
author = {Karen {Sp\"{a}rck Jones}},
title = {A statistical interpretation of term specificity and its application in retrieval},
journal = {Journal of Documentation},
doi={http://dx.doi.org/10.1108},
pdf={http://www.soi.city.ac.uk/~ser/idfpapers/ksj_orig.pdf},
year = {1972},
url={http://www.soi.city.ac.uk/~ser/idf.html},
}
Basically, IDF is based on counting the number of documents in the collection that contain a term: if a term occurs in many documents, it has little discriminative power. In her original paper, Spärck Jones demonstrated that IDF outperforms co-ordination level matching. Coupled with TF (the frequency of occurrence of a term t in document d), IDF appears in almost every term weighting scheme. The combined weighting scheme is generally known as TF-IDF.
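To make the weighting concrete, here is a small sketch of TF-IDF scoring. The text above gives no formula, so the IDF form used here, log(N / n_t) with N the number of documents and n_t the number of documents containing term t, is an assumption: it is one common textbook variant and not necessarily the exact weight from the 1972 paper. The function names, the tokenization, and the three-document corpus are also hypothetical.

```python
import math
from collections import Counter


def idf(term, tokenized_docs):
    """Assumed IDF variant: log(N / n_t); 0.0 if the term occurs nowhere."""
    n_docs = len(tokenized_docs)
    doc_freq = sum(1 for doc in tokenized_docs if term in doc)
    return math.log(n_docs / doc_freq) if doc_freq else 0.0


def tf_idf_score(query, doc, tokenized_docs):
    """Sum TF * IDF over the query terms for one tokenized document."""
    counts = Counter(doc)
    return sum(counts[t] * idf(t, tokenized_docs) for t in query.lower().split())


# Hypothetical collection: "information" occurs in every document.
corpus = [
    "information retrieval systems index information",
    "information theory and coding",
    "information about library opening hours",
]
tokenized = [d.lower().split() for d in corpus]
scores = [tf_idf_score("information retrieval", d, tokenized) for d in tokenized]
```

Because "information" appears in all three documents, its IDF under this variant is log(3/3) = 0 and it contributes nothing to any score. Only the first document, the one containing "retrieval", scores above zero, which matches the intuition that the rarer term is the better discriminator.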
The original IDF idea was heuristic. Researchers have since looked for theoretical explanations, drawing on information theory and probabilistic IR models (the RSJ model, language models).
- S. Robertson, "Understanding inverse document frequency: On theoretical arguments for IDF," Journal of Documentation, 2004.
@ARTICLE{tfidf_steve,
author = {Stephen Robertson},
title = {Understanding inverse document frequency: On theoretical arguments for IDF},
journal = {Journal of Documentation},
year = {2004},
pdf={http://www.soi.city.ac.uk/~ser/idfpapers/Robertson_idf_JDoc.pdf},
url={http://www.soi.city.ac.uk/~ser/idf.html},
}
