Talk:Okapi BM25

This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on the project's importance scale.

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

This article has been rated as Low-importance on the project's importance scale.


"Please note that the above formula for IDF shows potentially major drawbacks when using it for terms appearing in more than half of the corpus documents. These terms' IDF is negative..." No doubt I'm missing something, but I don't see how that happens. Example? JLundell talk 01:47, 10 June 2014 (UTC)

@Jlundell: $\ln a$ is negative if $a<1$. Since ${\frac {N-n(q_{i})+0.5}{n(q_{i})+0.5}}=1$ is equivalent to $N=2n(q_{i})$, the IDF is zero when the term appears in exactly half of the documents and negative if $n(q_{i})>N/2$. Ireas ^ask! 00:32, 31 October 2015 (UTC)
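For a concrete example of the kind asked for above, here is a minimal Python sketch that evaluates the IDF fraction quoted in this thread. The corpus size `N = 10` and the function name `bm25_idf` are made up for illustration and are not taken from the article.

```python
import math

def bm25_idf(N, n):
    """IDF as quoted in this thread: ln((N - n + 0.5) / (n + 0.5)).

    N is the number of documents in the corpus, n the number of
    documents containing the term.
    """
    return math.log((N - n + 0.5) / (n + 0.5))

N = 10  # hypothetical corpus size
for n in (1, 5, 6, 10):
    idf = bm25_idf(N, n)
    sign = "negative" if idf < 0 else "zero" if idf == 0 else "positive"
    print(f"n = {n:2d}: IDF is {sign}")
```

With these numbers the sign flips exactly at n = 5 = N/2: a term found in 6 of the 10 documents gets ln(4.5/6.5), which is below zero.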

Is BM25, an algorithm developed in the 70's, really "state-of-the-art"? Doesn't "state-of-the-art" mean that it should be representative of the most advanced, highest-performing algorithm in current use? Word2Vec and PCA (LSI) seem like more advanced algorithms that achieve better performance by many measures and are in widespread use. And if the IR performance metric is user utility, the most popular search engines do not use BM25 or merely a TF-IDF-enhanced BM25, but rather ensemble approaches. Hobsonlane (talk) 22:20, 2 March 2016 (UTC)

BM25 was not developed in the 70s, and the article doesn't say it was: "It is based on the probabilistic retrieval framework developed in the 1970s and 1980s" (my emphasis). The first reference on the article indicates TREC2 (1993) for BM11 and BM15, and TREC3 (1994) for BM25. Also from the references, BM25F looks to be 2004 and BM25+ 2011. And "state-of-the-art" is also qualified: "BM25, and its newer variants, e.g. BM25F [...] represent state-of-the-art TF-IDF-like retrieval functions used in document retrieval" (again my emphasis). 2406:E006:29EF:1:8E89:A5FF:FECA:57FE (talk) 02:51, 3 March 2016 (UTC)

Yea, I think the reader would benefit if those emphasised points in your comment were to "pop" in the wording of the intro paragraph, rather than the "state-of-the-art" claim/promotion that caught my eye. Perhaps "state of the art" should be downplayed, deleted, or moved to a later section. Even a clear and accurate wording, something like "Among TF-IDF-based algorithms, some variants of BM25 are considered state of the art", would not be all that helpful for the reader just getting to know BM25. More useful for them are the bits about its history (like its pedigree going back to the 70s and 80s) and current use in modern products (like Bing and Google, or whatever other examples of "TF-IDF-based state of the art" IR systems you know of that use BM25). Hobsonlane (talk) 16:53, 23 March 2016 (UTC)

The link http://kak.tx0.org/IR/TFxIDF in the References appears to be dead. Archive.org did pick up a snapshot of it: https://web.archive.org/web/20160916025726/http://kak.tx0.org/IR/TFxIDF I would make the change directly, but I don't know Wikipedia editing protocol. 98.218.179.112 (talk) 00:47, 26 June 2017 (UTC) Anonymous

Note that the $(k_{1}+1)$ in the numerator of the ranking function is constant and thus should have no influence on the ranking. It seems that the function is often cited in this form, without any explanation. As is pointed out in one of the papers cited (staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf): "The reason for including it was to make the final formula more compatible with the RSJ weight used on its own. If it is included, then a single occurrence of a term would have the same weight in both schemes." It might be worth pointing this detail out in the article. Maybe it's just me, but I always found the existence of this term slightly confusing. There is an interesting discussion on this topic in the bug tracker of Apache Lucene (https://issues.apache.org/jira/browse/LUCENE-8563). Icannotthinkofanythinguseful (talk) 12:10, 19 November 2020 (UTC)
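To make the point above checkable, here is a rough Python sketch of the term-frequency part of the ranking function, with and without the $(k_{1}+1)$ factor. The document statistics, the helper names `tf_component` and `ranking`, and the parameter values `k1 = 1.2` and `b = 0.75` are illustrative assumptions, not values from the article or the cited paper.

```python
def tf_component(tf, dl, avgdl, k1=1.2, b=0.75, include_k1_plus_1=True):
    """Term-frequency part of BM25 for one term in one document.

    tf: term frequency, dl: document length, avgdl: average document length.
    k1 and b are assumed typical values, chosen only for illustration.
    """
    norm = k1 * (1 - b + b * dl / avgdl)
    score = tf / (tf + norm)
    # The (k1 + 1) factor is the same for every document and every term.
    return score * (k1 + 1) if include_k1_plus_1 else score

# Hypothetical documents: name -> (term frequency, document length)
docs = {"d1": (3, 100), "d2": (1, 40), "d3": (5, 300)}
avgdl = sum(dl for _, dl in docs.values()) / len(docs)

def ranking(include):
    """Document names sorted by descending score."""
    return sorted(
        docs,
        key=lambda d: tf_component(*docs[d], avgdl, include_k1_plus_1=include),
        reverse=True,
    )

print(ranking(True) == ranking(False))  # same order with and without the factor

# A single occurrence in an average-length document: the component is
# (up to floating-point rounding) 1, so only the IDF/RSJ weight remains.
print(tf_component(1, avgdl, avgdl))
```

Since the factor multiplies every term's contribution by the same constant, the total score for a multi-term query is scaled by that constant as well, so the order of documents is unchanged there too.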