Talk:Statistical hypothesis test/Archive 2


Sentence Removed

I removed the following sentence from the article because I couldn't fix it:

"Nearly all of the supposed criticisms of hypothesis testing (publication bias, problems of model specification, difficulty of interpretation, etc.) apply equally to Bayesian inference, and are likely to [be] more deeply concealed."

  • If you know that these things "apply equally", why do you have to speculate on how deeply they are concealed? Sounds like "I didn't find it yet but I'm 100% sure it's there."
  • "nearly all" and "etc." are incompatible
  • This sentence reads like part of a pamphlet against Bayesian inference, not like part of an encyclopedia.
  • The whole paragraph is dotted with "citation needed" remarks and yet this counterargument that was at its end would be in need of more backing up than any of the other statements criticized.
  • Does "supposed criticisms" imply that not only the validity of the criticism is questions but even its existence?

If this sentence really must go back into the article, I suggest something along these lines: "Many of the criticisms of hypothesis testing (e.g. publication bias, problems of model specification, and difficulty of interpretation) apply equally to Bayesian inference." --Mudd1 (talk) 13:00, 6 February 2013 (UTC)

I have put this shortened version back in, possibly at a different place, in "Future of the controversy". 81.98.35.149 (talk) 13:26, 6 February 2013 (UTC)

The "Selected criticisms" section.

The section was once (as described) an abbreviated summary of an article by Nickerson. Edits have invalidated that description. Flags for citations have also been added. Nickerson's list of references occupies 11 pages. If an abbreviated summary of his article is inadequate, then a substantial expansion of the section is in order. The Controversy section is (in my opinion) already too long for this article. About half of the citations in the current article refer to a controversy that is scarcely (rarely?) mentioned in introductory statistics books. I hope to slightly expand the Cautions section & then to advocate moving the Controversy to a separate article. Comments?159.83.196.1 (talk) 21:08, 17 April 2012 (UTC)

Well the standard for Wikipedia is certainly not to be equivalent to a text-book: see WP:NOTTEXTBOOK. Where did you get the idea that an "introductory text book" level is in any way a target? But it does need to present a current overview of the current state of knowledge, and to be balanced and verifiable, which means providing citations (more than one where possible) close enough to what is said as to be recognisable as providing the source. The present "the following section" is ambiguous (is that the next subsection only?) and likely to be off-screen when viewing. You obviously can't stop people adding in new subsections. I'll add in citations to Nickerson in what seem like better places for now, based on what you've said, but it is an obscure journal so I can't check it. Melcombe (talk) 22:44, 17 April 2012 (UTC)
Sorry, I am feeling the heat but I am not seeing the light. I looked for hypothesis testing in 3 encyclopedias: Americana, Britannica and the Great Soviet. The coverage here is much more complete; The encyclopedias do not get much beyond definitions. The Great Soviet has a separate page on statistical hypothesis testing with 2 rigorous references (Cramer, Lehmann) translated from English. The Britannica has half a page within Statistics with four introductory statistics textbooks for references. The Americana mentions hypothesis testing within Statistics. I am satisfied with my position that the introductory statistics text is a reasonable model for this article. There are two Wikipedia policies of concern:
  • "Wikipedia is an encyclopedic reference, not a textbook."
  • "Texts should be written for everyday readers, not for academics."
Hypothesis testing is an academic subject that is explained (taught) to everyday readers by defining terms and providing examples. Is there a better way to explain an academic subject? The sections of the article most subject to criticism based on policy are those which are most mathematical and algorithmic (those that are most statistical).159.83.196.1 (talk) 22:43, 1 May 2012 (UTC)
I suggest you look at WP:MOSMATH (Wikipedia standards for maths-type articles). This promotes the idea that articles should start at a "generally understood level" but may/should go on, later in the article, at higher levels of sophistication so as to cover a topic thoroughly. But "generally understood level" need not/should not be equated to reproducing what you might find in a textbook ... rather it relates to the understanding of what the topic of an article is about and why it is important. Wikipedia is not about teaching a subject ... there are Wikibooks and Wikiversity for that. Why do you think that Wikipedia should be different from any other encyclopedias? They have articles on "statistics" that presumably provide a reasonable overview of statistics and of "hypothesis testing" as a sub-topic. Since here there is a separate article on statistical hypothesis testing what should be expected is that it provides a reasonable outline of the topic and of important sub-topics and of relations to other topics ... not the equivalent of what a chapter in a book on statistics might contain. I also suggest that you look at some others of the mathematics articles ... try Measure (mathematics) as it is somewhat related to statistics. Melcombe (talk) 23:50, 1 May 2012 (UTC)

I have expanded this section greatly. I now want to spice it up with some images and figures. First, I would like to add the following picture next to the Ronald Fisher quotes but don't know how to check for copyright, etc.: Fisher and his calculator. I think this picture has more character than the one shown on the Ronald Fisher page. So if anyone can do this or let me know how to do it that would be great.

Also if there is someone who disagrees with the criticisms they should go through and write up responses. The current ones appear pretty weak to me. I should also note that I have little formal statistics training, this section is just a summary of what I have found on my own trying to wrap my head around the situation... so the input of someone coming from the formally trained perspective would be valuable.207.229.179.97 (talk) 16:02, 3 December 2012 (UTC)


I'm not sure that a bunch of quotes criticising NHST is really helpful for the article. The whole section about controversy is now a blur of quotes and various kinds of lists. I think it could be substantially improved by trying to consolidate some of those ideas into a few paragraphs of encyclopedic content. Particularly given that the quotes from Fisher, et al. are written in fairly out-dated English, I don't think they're particularly helpful to the average reader. --Thosjleep (talk) 19:54, 4 January 2013 (UTC)
I agree that it can be substantially improved. The quotes should probably have their own page. The purpose is to convey to the reader that there has been a very strong vein of ongoing dissent amongst statisticians regarding the way hypothesis testing is used in research, going all the way back to the originators of the procedure. The dissenters are not just some fringe group of people. I found this a very striking contrast to what was implied to me via general opinion in the research community as well as statistics textbooks and courses. What the section needs is some figures of simulation results that prove the problems discussed in the quotes are indeed true. 207.229.179.97 (talk) 17:37, 8 January 2013 (UTC)
The most prestigious consideration of this controversy is Wilkinson (1999) which summarizes the recommendations of a committee of influential psychologists and statisticians (including critics):
"Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual p value or, better still, a confidence interval." p 599
"Some had hoped that this task force would vote to recommend an outright ban on the the use of significance tests in psychology journals. Although this might eliminate some abuses, the committee thought that there were enough counterexamples (e.g., Abelson, 1997) to justify forbearance." pp 601-602
Wilkinson as PDF: http://www.apa.org/pubs/journals/releases/amp-54-8-594.pdf 159.83.196.1 (talk) 00:33, 13 January 2013 (UTC)
Yes I have read it. It is linked to in the introduction of the section. Hopefully others will as well. — Preceding unsigned comment added by 207.229.179.97 (talk) 21:51, 25 January 2013 (UTC)
Oops, sorry, I misread it as Nickerson. I will check this one out. The link gives a 404 error though. Can you provide the full reference? 207.229.179.97 (talk) 21:54, 25 January 2013 (UTC)
Alternatively (maybe): http://www.mobot.org/plantscience/ResBot/EvSy/PDF/Wilkinson_StatMeth1999.pdf 159.83.196.1 (talk) 22:48, 29 January 2013 (UTC)
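As an aside for readers of this thread, the Wilkinson recommendation quoted above is easy to illustrate. A minimal sketch (Python with NumPy/SciPy; the data are simulated purely for illustration and come from no cited source) contrasting the bare accept/reject report with the exact p-value and a 95% confidence interval for the mean difference:

    # Illustrative sketch only: hypothetical data, contrasting a dichotomous
    # accept/reject report with an exact p-value and a confidence interval.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    a = rng.normal(10.0, 2.0, size=30)   # simulated measurements, group A
    b = rng.normal(11.0, 2.0, size=30)   # simulated measurements, group B
    alpha = 0.05

    t_stat, p_value = stats.ttest_ind(a, b)   # equal-variance two-sample t-test
    print("reject H0" if p_value < alpha else "fail to reject H0")   # bare decision

    # Richer report: exact p-value plus a 95% CI for the mean difference,
    # using the pooled variance so it matches the equal-variance test above.
    diff = a.mean() - b.mean()
    df = len(a) + len(b) - 2
    sp2 = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / df
    se = np.sqrt(sp2 * (1 / len(a) + 1 / len(b)))
    t_crit = stats.t.ppf(0.975, df)
    print("p = %.4f, 95%% CI: (%.2f, %.2f)" % (p_value, diff - t_crit * se, diff + t_crit * se))

The point of the richer report is that the interval conveys both the direction and the precision of the estimate, which the dichotomous decision discards.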


Perhaps the quotes could be organized thematically rather than chronologically? That way the whole section could be organized into subsections surrounding specific points of debate, then the article can have paragraphs rather than lists (which look like a talk page rather than an article) describing these points of debate and relevant quotes can be added into each of those subsections? --Thosjleep (talk) 08:44, 27 January 2013 (UTC)
I agree 100%, other people were supposed to take up the slack once some material was provided... If no one else wants to do it and it doesn't fit with the purpose of the page then I suppose it should be gotten rid of and the next generation of researchers will have to discover the controversy as I did. 207.229.179.97 (talk) 14:08, 27 January 2013 (UTC)


I haven't been following all that closely, but it looks like all the quotes have been removed. Can you link me to the last version that included them and I will try to synthesize that whole section into something including at least some of the quotes. --Thosjleep (talk) 13:46, 6 February 2013 (UTC)
If the page is split off I think they will fit better. The quotes are necessary to convey the tone of the literature on this subject. https://en.wikipedia.org/w/index.php?title=Statistical_hypothesis_testing&oldid=535614651 — Preceding unsigned comment added by 207.229.179.97 (talk) 15:06, 7 February 2013 (UTC)


I've followed up on this and completely revised the latter half of the article. Not all the quotes made it in, mostly for length. I think working from this wikified text is a better starting place than trying to modify the previous lists of things. --Thosjleep (talk) 11:03, 8 February 2013 (UTC)

Information about the null ritual hybrid should be included

I believe the person who deleted this information 1) Does not have access to many journal articles and may not be aware of what procedures are being performed in many fields. 2) Has not read the article that introduces the term null ritual and so has no qualification to remove that material. 3) Has something against psychologists talking about statistics and this is the true reason he does not like the use of the term.

The term comes from peer reviewed literature and the phenomenon it describes is widely recognized as reflecting how statistics are being taught and applied. Omitting a clear description of the null ritual hybrid method makes the origins section more confusing than it needs to be. What is being taught as hypothesis testing is neither a Fisher nor a Neyman-Pearson method, and it is important to convey this information to the reader.

Other opinions?207.229.179.97 (talk) 05:59, 12 February 2013 (UTC)

1 - Yes, I do have access to many journals and yes, I am aware that incompetents and cheaters surround us in many fields of science.
2 - I had read the article. It is a term coined by a person not qualified in statistics (checking some definitions in his papers makes this obvious). Wikipedia guidelines say personal opinions must be stated as such.
3 - What I have against psychologists talking about statistics is the same as what I have against NBA players talking about statistics; it is not their field of expertise.
The term is peer reviewed by psychologists who, interestingly, as one of the papers you reference shows, do not properly understand what a p-value is. So if psychologists agree they don't know what they are talking about, shouldn't we just listen to them... or rather ignore them when it comes to mathematics?--Viraltux (talk) 11:03, 14 February 2013 (UTC)
I didn't make this change, but my suggestion would be to include it, but not in the table. I think the comparison between Neyman/Pearson and Fisher is important from both a historical and conceptual angle, but the "null ritual" criticism isn't so much about the origins of method as it is about criticisms of how it is used in practice. I think it's good that there appear to be at least three editors actively interested in clarifying this article, but I think the best approach is to give a neutral discussion of the method (or methods) as they were originally intended, some discussion of how they are used and taught for lay readers who simply need to know what hypothesis tests are, and then a rich discussion (which I think we have no problem putting together) about misinterpretations and possible problems. So, to more directly answer your question, I think there is value in mentioning it, but I think it should go in the "Ongoing Controversy" section rather than in the "Origins" section. --Thosjleep (talk) 07:59, 12 February 2013 (UTC)
I added a part back into the table and supported it with a quote from Fisher in the paragraph above. The passage is from pages 12-13 and can be found here: http://www.york.ac.uk/depts/maths/histstat/fisher272.pdf. My efforts to shorten the quote to include only the important information are somewhat tortured (Fisher had an expansive writing style), so perhaps this can be improved. I think this also contributes to the impression that the original source of the information in the table (Gigerenzer) did know what he was talking about and it would be worthwhile to include his comparison to the modern hybrid ritual. 207.229.179.97 (talk) 16:39, 12 February 2013 (UTC)
Had psychologist Gigerenzer known what he was talking about (or anyone else editing the article, for that matter), he would not have placed "Use this procedure only if little is known about the problem at hand" as step 3 since, obviously, this is not a step in the procedure but rather a requirement to deal with the information in your experiment so that only random variation is left for the test to make sense. If anything it should be a kind of Step 0, but I guess the whole purpose of forcing that statement there is, oh surprise, to make sure people do not forget that when we have prior information Bayesian statistics are an option.--Viraltux (talk) 11:03, 14 February 2013 (UTC)
... You are the statistics teacher here. Please improve the description of the "testing process" so that table will be made unnecessary. 207.229.179.97 (talk) 15:07, 14 February 2013 (UTC)
... We might finally agree on something; the whole article is messy.--Viraltux (talk) 16:39, 14 February 2013 (UTC)

Subject (and title) of the article?

If the Fisher and Neyman-Pearson formulations were merged (as alleged in a Controversy section), shouldn't this article discuss both in detail given its title? The Neyman-Pearson lemma is termed "the fundamental theorem of hypothesis testing" by an introductory text in probability (Ash, 1970, p 246). The lemma is concerned with statistical power and likelihood ratios. A classic old book on the testing of statistical hypotheses by Lehmann scarcely mentions null hypotheses or significance levels. (A later version of the book supplied a lot of the procedures and terminology in the article.) If, alternatively, the formulations were not merged, isn't the title of this article wrong? See the first issue of 2004 on this talk page.159.83.196.1 (talk) 00:24, 14 February 2013 (UTC)

I agree, and not only that: we should even add a Bayesian hypothesis testing section since hypothesis testing is quite a general term. The article should be more math oriented and less history/controversy oriented. The size of the controversy section is ridiculous for a mathematical article. We already have a History of Statistics entry that could be expanded with all the controversy information in a more appropriate context. I would love to discuss math here instead of what we are doing.--Viraltux (talk) 11:17, 14 February 2013 (UTC)


Why should this page only be math? I guess I'd like more of a justification of that. Hypothesis testing is ubiquitous in the sciences, so I think some discussion of how it is done in practice and what the problems with that standard practice are seem appropriate for this article. I definitely think the front end of this article could use more formalized treatments of Fisher and Neyman/Pearson approaches, but I don't see why the controversy should be moved elsewhere.--Thosjleep (talk) 11:34, 14 February 2013 (UTC)
I actually agree with you but I have never said it should be only math; what I said is that the article should be more math oriented and less history/controversy oriented. I am not complaining about a controversy section but about its size. At this pace we might end up having more historical content in this article than in the History of Statistics article. Does this make sense to you?--Viraltux (talk) 12:12, 14 February 2013 (UTC)


Yes, that makes sense. I think I just misinterpreted your previous remarks. Would you be willing to put in a little bit of labor to clarify the Fisher and Neyman/Pearson approaches? I think, especially, finding a better way to describe the approaches (rather than the ordered list of steps) and introducing the terminology (which doesn't necessarily make sense unless you already know about all of the different issues) would really strengthen the first half of the article, which might make the second part seem less excessive. I'd still like to tighten the controversy section, but haven't had time to do another systematic revision since I transitioned all of the bullet lists and quotes into the current state (which hopefully at least resembles a wiki entry even if it is a bit long).--Thosjleep (talk) 12:32, 14 February 2013 (UTC)
The statistical inference books that I have and that I have seen go straight to the mathematical formulation of Neyman and Pearson and avoid any philosophical/historical discussion. These books present a tool and let the user decide how to use it. The reason is that we can see the Neyman/Pearson approach as an extension (not an improvement) of what Fisher had already presented, so if users do not need to go all the way with the Neyman/Pearson approach they can stay halfway with Fisher or any other way as long as they understand what they are doing.
For instance, in a talk with a university professor (biologist and mathematician) he explained to me that for some biology tests (I don't remember which, sorry) they only pay attention to a collection of p-values and have no use for an alternative hypothesis in those cases, whereas they do for others. In other words, this is a tool and you need to know how to use it.
The mathematics is crystal clear. All the discussions among Fisher and Neyman et al. about hypothesis testing were on philosophy of science grounds. Hypothesis testing is not a closed formula of right or wrong; you can use all sorts of hybrids as long as you understand the mathematics behind them. Those complaining about hybrids only focus on their misuse and then complain about the tool. I guess the people misusing the tool are never to be blamed.
Anyhow, since most (all?) books on statistical inference offer the Neyman & Pearson approach, I guess it makes sense for the article to do likewise. How about if we expand the section "Use and Importance"? We could give details there about the different hybrids used and when they abuse the rationale behind the test. Also we should avoid philosophical discussions since one Wikipedia guideline is neutrality. Those discussions should go to history/controversy sections.--Viraltux (talk) 16:04, 14 February 2013 (UTC)
The discussions are already in history/controversy sections. What are you suggesting be changed about the page?207.229.179.97 (talk) 16:51, 14 February 2013 (UTC)
Well, the article goes straight to the Neyman/Pearson description with the alternative hypothesis paradigm. If the math behind it is not understood one might think that other approaches are wrong when they are not. Maybe stating this from the beginning and giving a more general description of hypothesis testing might lessen the confusion. Including a section on Bayesian hypothesis testing too, and all this with no philosophically biased comments about their validity here and there; we can leave that to the controversy section.--Viraltux (talk) 17:16, 14 February 2013 (UTC)
I think some mathematical proofs should be added to the terms and common tests section and this should be moved to the front. 207.229.179.97 (talk) 17:23, 14 February 2013 (UTC)
It is impossible to understand the method described by this page (which is also what you will find in textbooks) without being aware of the history and (lack of) logic behind it. The attempts to disentangle the mathematics from these other factors just leads to confusion in the mind of the reader. Look at the insane claims made in the "testing process" section. It describes a fantasy scenario in which some kind of mutated decision theory was used in the past but now an alternative process that is "less deficient in reporting results" which more resembles Fisher's method (except detached from inference) is more available to us since we have computers. Of course the citation is to an introductory textbook. 207.229.179.97 (talk) 14:39, 14 February 2013 (UTC)


Sorry, but to say that you need to know about the history of a mathematical concept to understand that concept is... well, let's say inaccurate (e.g. nobody needs to know all the long history behind the development of the Normal distribution to understand its properties and purpose)
As a matter of fact, mathematics books offer little to zero historical background on hypothesis testing (or any other subject for that matter) since this is what history books are for.--Viraltux (talk) 16:04, 14 February 2013 (UTC)
That is nothing but an opinion based on anecdotes. This method of learning discourages critical thinking. Further, there is plenty of evidence that this strategy has failed to produce people who understand hypothesis testing. Perhaps because it is more than only a "mathematical concept". 207.229.179.97 (talk) 17:20, 14 February 2013 (UTC)
Let me guess, that evidence was put forward by psychologists, right? --Viraltux (talk) 17:51, 14 February 2013 (UTC)
If you can't recognize there is mass confusion amongst people including otherwise intelligent scientists about hypothesis testing and will refuse to acknowledge evidence of this that comes from people other than mathematicians (who would not do such a study) then I don't think it will be possible to convince you. 207.229.179.97 (talk) 18:30, 14 February 2013 (UTC)
Excuse me, we were talking about the impossibility of understanding hypothesis testing without a historical background, and now you just throw some precanned statements into my mouth and then point the finger at me as a hopeless case? What you've just done must be one of those debating skills you learn in some schools, right?--Viraltux (talk) 18:54, 14 February 2013 (UTC)
The impossibility of understanding hypothesis testing as presented by this page, which is the result of the accidental fusion of two approaches, a historical event, rather than of any mathematical or logical thought process. 207.229.179.97 (talk) 19:22, 14 February 2013 (UTC)
On further thought perhaps this history is not fantasy and it actually reflects the process that has gone on in the minds of the people who come up with this stuff. The entire thing is so detached from the insightful concepts of Fisher, Neyman, etc. that perhaps a better fix would just be to remove all references to their work from the page and make new pages about their approaches that can be linked to at the top in the same way Bayesian inference is now. We can say the origin is anonymous but the first published reference to it is Lindquist (1940). It is just too much of a mess. 207.229.179.97 (talk) 15:35, 14 February 2013 (UTC)

Split section (discussion beginning in February 2013)?

I favor division and offer the existing Controversy section and recent heated rhetoric on this talk page as supporting evidence. The Common criticisms subsection is deficient on many grounds including format, lack of citations and POV issues. The Evidence for failure to control publication bias and model specifcation [typo!] errors subsection is very well referenced, but a balanced discussion of the references would alter the POV. Repairing these errors will require more prose and more citations, adding to an already overgrown section. This is a heavily visited statistics article - not a good place to battle over controversial issues of uncertain practical significance.159.83.196.203 (talk) 20:35, 7 February 2013 (UTC)


I'm probably a no on this. I think the problem here is failure to synthesize the controversy and wikify it. In fact, pushing it to its own page will probably just keep the disagreement happening on this talk page burning forever. I think it would be more productive to work on actually coming up with a concise statement of what the points of contention are, with relevant citations, and then link to relevant existing articles that can more fully address alternative approaches (e.g., Bayesian inference, Bayesian statistics, etc.). --Thosjleep (talk) 21:22, 7 February 2013 (UTC)
There are two controversies. One is the widespread misuse and misinterpretation of the results of hypothesis testing, the second is the disagreement over whether or not frequentist methods can successfully be applied to cutting edge research during which it is common for unexpected results to occur (thus invalidating the assumptions of the test made before performing a study). They are related in that researchers commonly apply bayesian interpretations to the results of statistical tests that are based on non-bayesian methods. This is a failure of education on the subject. They are taught "easy" examples like those in this page, but never the logic behind the method or anything relevant to the actual type of situation they encounter and then try to fit a square peg into a round hole, instead of modifying the study to be a square hole.207.229.179.97 (talk) 04:03, 8 February 2013 (UTC)
See the first issue of 2006 on this talk page. The tone of the criticism has been improved, but the volume has increased.159.83.196.1 (talk) 00:34, 14 February 2013 (UTC)

I am concerned that any encyclopedic coverage of the controversy will necessarily involve personal essays or original research. The literature does not contain succinct summaries. (Prove me wrong by citation, Please.) The creation of one violates Wikipedia guidelines.

Consider Nickerson who provided a condensation of the literature of controversy in psychology: (my counts)

  • 12 Misconceptions regarding NHST
  • 11 Other criticisms of NHST
  • 4 Recommendations for improving the use of NHST
  • 8 Alternatives or supplements to NHST
  • 7 Other recommendations for improving statistical reporting

60 pages. Approximately 300 references.

Citing Nickerson has one big advantage; He is a secondary source who summarizes the 100s of references. Nickerson's coverage of the historical dispute between Fisher and Neyman is limited. He is not complete. Tersely listing complaints does not project a NPOV. Discussion of each is not terse.159.83.196.1 (talk) 20:57, 16 February 2013 (UTC)


An alternative secondary source to Nickerson is Kline (2004):

  • 13 Fallacies
  • 12 Other criticisms of NHST
  • 6 Justifications for NHST
  • 4 Variations on NHST
  • 9 Recommendations

Kline references Nickerson. The cited chapter of Kline - 30 pages.

The two sets of fallacies and other criticism produced by K & N are nearly disjoint. There are about 20 fallacies and another 20 issues to be addressed from psychology and the behavioral sciences. Given these two references, there is no short consensus list of the major issues.159.83.196.1 (talk) 21:40, 19 February 2013 (UTC)


The History of Statistics, Ronald Fisher, Jerzy Neyman and Egon Pearson articles make little or no mention of the controversy. The history article says little of frequentist inferential statistics during the 20th century. All is too silent on the Wikipedia front.159.83.196.1 (talk) 21:40, 16 February 2013 (UTC)

The Testing Process Section

Mr/Ms 207.229.179.97 has edited this section with a cleanup template saying "This section confuses two competing approaches to statistics and includes historical claims that are complete fantasy" .

The two "competing" approaches described are not such and, in fact, are the same. They simply say that when no computers were available instead calculating the p-value to reject or not Ho they would simply check whether the statistic would fall above or below the value determined by the alpha to reject Ho. This makes perfect sense and they also offer references to prove it. So I am removing the cleanup template until Mr/Ms 207.229.179.97 clarifies his/her motives to add such template. --Viraltux (talk) 17:44, 14 February 2013 (UTC)

They are indeed competing approaches. This is explained in the controversy section with multiple sources and quotes from the originator of one of the approaches clearly disagreeing with the idea that they were the same approach. The concepts of a null hypothesis and p-value are Fisher's, not Neyman-Pearson's. The idea of using a significance level of 5% or 1% has nothing to do with decision theory. It is based on an example used by Fisher.
In decision theory you determine the cutoff using cost-benefit analysis. "The principle upon which the choice of the critical region is determined so that the two sources of error may be controlled is of first importance."Jerzy Neyman, Egon Pearson (1933). "On the Problem of the Most Efficient Tests of Statistical Hypotheses". Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character. 231: 289–337. doi:10.1098/rsta.1933.0009. JSTOR 91247.
The use of an alternative hypothesis did not come before the use of only a null hypothesis, and the reason the method including an alternative hypothesis was developed was not because people lacked access to computers. — Preceding unsigned comment added by 207.229.179.97 (talk) 18:20, 14 February 2013 (UTC)
First, I don't find where it says "...the method including an alternative hypothesis was developed was not because people lacked access to computers." Could you please indicate where the link between H1 and computers is established or implied? I don't see it anywhere.
Second, they split the process at step 7 simply to indicate that when no computers were available calculating the p-value was not necessary to reject/not-reject Ho. It could be better explained/worded but they are not wrong.
Third, even if they were truly describing the Fisher vs Neyman/Pearson approaches (they are not), these two approaches are not "competing" but complementary.--Viraltux (talk) 18:39, 14 February 2013 (UTC)
Before computers step 8: "Compute from the observations the observed value tobs of the test statistic T." <-N-P method and Fisher method
Before computers step 9: "Decide to either fail to reject the null hypothesis or reject it in favor of the alternative." <-N-P method
After computers step 8: "From the statistic calculate a probability of the observation under the null hypothesis (the p-value)." <-Fisher method
After computers step 9: "Reject the null hypothesis or not." <-Mutant Fisher method
Claim that p value calculation arose after the method of accepting the alternative hypothesis: "The former process was advantageous in the past when only tables of test statistics at common probability thresholds were available. It allowed a decision to be made without the calculation of a probability. It was adequate for classwork and for operational use, but it was deficient for reporting results."
They don't claim that. They just say it is not necessary to calculate the p-value to take a decision; they are not claiming or suggesting that the p-value did not exist or was not developed until computers came. It is obvious they are not implying that.--Viraltux (talk) 19:47, 14 February 2013 (UTC)
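For what it is worth, the narrow point being argued here can be checked directly: with alpha fixed in advance, comparing the statistic to a tabulated critical value and comparing the p-value to alpha always produce the same decision. A minimal sketch (one-sample z-test in Python/SciPy, all numbers invented for illustration):

    # Sketch with invented numbers: for a fixed alpha, checking the statistic against
    # a tabulated critical value and checking the p-value against alpha give the same decision.
    from scipy import stats

    mu0, xbar, sigma, n = 100.0, 103.1, 15.0, 100   # hypothetical one-sample z-test
    alpha = 0.05

    z = (xbar - mu0) / (sigma / n ** 0.5)    # observed test statistic
    z_crit = stats.norm.ppf(1 - alpha / 2)   # the "table" value, about 1.96
    p_value = 2 * stats.norm.sf(abs(z))      # exact two-sided p-value

    print(abs(z) > z_crit)    # decision via the critical value ("before computers")
    print(p_value < alpha)    # decision via the p-value; necessarily the same answer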
If the method presented here is not an inconsistent hybrid, why does it recommend calculating a p-value at all? For decision theory you do not care what the exact value is, only whether it falls in the rejection region you have previously spent considerable effort determining and justifying in your report. The only reason this would be superior for "reporting results" is if one wished to draw inferences about the probability a null hypothesis is true.207.229.179.97 (talk) 19:10, 14 February 2013 (UTC)
They mention it is for reporting reasons. Even though you might take the same decision anyway you still might want to know how strong the evidence for doing so is. A p-value of 0.04 and a p-value of 0.0000001 might lead to the same decision but they are definitely worth reporting in many situations (including those where you use the Neyman-Pearson paradigm).
I'll do a quick rewriting to make it more consistent but as it is now there is nothing worth calling "historical fantasy".--Viraltux (talk) 19:47, 14 February 2013 (UTC)
"Even though you might take the same decision anyway you still might want to know how strong are the evidences for doing so. A p value of 0.04 and 0.0000001 might lead to the same decision but they are definitely worth reporting in many situations (including those you use the Neyman-Pearson paradigm)."
Why? What purpose would this serve? If you are going to infer the strength of the evidence from the p-value anyway, why did you determine a cutoff beforehand at all? Why not just use what you have learned during the course of the experiment to determine what your cutoff should be after the fact?
Will you conclude that the type I error rate used to make your decision was too lenient? Did the cost of getting a false positive you carefully determined before running the study change based on the result you got? If the p-value is slightly above your cutoff does that mean it would be a good idea to extend the experiment or repeat the study using a more lenient cutoff if your prior belief is that there is significance there? The only uses for that information are misuses, or determining that choosing a significance level before running the study is pointless because you did it before learning what you have from doing the study.207.229.179.97 (talk) 21:28, 14 February 2013 (UTC)
Imagine I am your doctor and I tell you that you must undergo surgery or you will die, then you ask me "what are the chances of survival?" and I reply "What purpose would answering that serve? You have to undergo surgery anyway." Since you earlier questioned how much I know about researchers' issues, now allow me to question how much you know about the private sector and the kind of questions you will be asked.--Viraltux (talk) 21:51, 14 February 2013 (UTC)
You seem to be advocating using Neyman-Pearson hypothesis testing but also calculating Fisher p-values so that you can then misinterpret them as Bayesian posterior probabilities to tell to people who know nothing about statistics. This is pretty much what is going on in the medical literature, so it is understandable that people like yourself are teaching and advising them.207.229.179.97 (talk) 22:27, 14 February 2013 (UTC)
Seriously, I would think I was being trolled if I didn't witness this exact same thing all the time. Please contribute something of value to the page (not one-liners or deletions) so that we can know you are not a troll. You have so far failed to do that despite repeated requests. 207.229.179.97 (talk) 22:43, 14 February 2013 (UTC)
Personal attacks again? And from an anonymous account. If I have not contributed more to the article it is because most of my time has been wasted talking to you. Should I engage in personal attacks as well? Should I keep wasting my time with you instead of editing the article? You keep putting precanned statements in my mouth again and again and then accusing me of whatever you want.
Wikipedia guidelines indicate articles must be neutral; you violate that. Sources must be reliable; you violate that. Personal attacks are not allowed; you repeatedly violate that. So, one more personal attack and I will request that your IP be banned.
If you keep editing this article just make sure you thoroughly document your statements from reliable sources in a neutral manner or "deletions and one-liners" will follow.--Viraltux (talk) 23:05, 14 February 2013 (UTC)


Some statements of history support the view of Viraltux. "Neyman and Pearson followed Fisher's adoption of a fixed level. In fact, Pearson (1962, p. 395) acknowledged that they were influenced by [Fisher's tables of 5 and 1% significance levels]" (Lehmann, 1993)159.83.196.1 (talk) 23:31, 14 February 2013 (UTC)
The claim made in that quote (as you have posted it) appears to be an example of something taken out of context. Here is what Pearson actually said:
"His tables of 5 and 1 % significance levels, which lent themselves to the idea of choice, in advance of experiment, of the risk of the "first kind of error" which the experimenter was prepared to take."
Later on, consistent with what has already been said here, we see that choosing an appropriate significance level was seen as a crucial point at which personal judgement was to be used (i.e. do not use the conventional levels):
"Of necessity, as it seemed to us, we left in our mathematical model a gap for the exercise of a more intuitive process of personal judgment in such matters-to use our terminology-as the choice of the most likely class of admissible hypotheses, the appropriate significance level, the magnitude of worthwhile effects and the balance of utilities."
He also said this in that same work:
"I must confess that the older I get, the more difficult I find it to be positive in this matter of statistical inference, but I have felt that as you have invited me to address you here on what is nearly the 30th anniversary of an earlier visit, I should try to formulate some of my thoughts on the relation between the Neyman-Pearson theory and fresh views on inference that are current today. I do this the more readily because I believe rather strongly in the value of emphasising continuity as well as differences in statistical philosophy. I am convinced that if we can only get to the bottom of the way in which similar situations are tackled by different approaches, all I believe lying within the broad path of development of our subject, our understanding will gain in richness-gain in a way which can never happen if we waste energy in trying to establish that we are right and the other fellow is wrong!"
Pearson, E. S. (1962). Some thoughts on statistical inference. Annals of Mathematical Statistics, 33, 394-403.
And to top it off, as it says in the Lehmann paper I believe you are quoting, even though Fisher himself made the conventional level charts and said stuff like "We shall not often be astray if we draw a conventional line at .05", he also said "no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in light of his evidence and his ideas." So Fisher himself is the cause of the 5% level, but it was meant to be used provisionally. Neyman-Pearson specifically recommended against a conventional level, but the inference drawn from their method was not meant to be provisional. What we get is everyone using a conventional yet non-provisional significance level as recommended by no one involved, and then also drawing provisional conclusions from the p-value. Even worse, 9 times out of 10 this p-value is interpreted as a posterior probability using a uniform prior by the people doing the test or reading the report, when this is the exact issue both methods were attempting to get away from. Then on top of that they regularly make the error of thinking the research hypothesis they stated is the only explanation for a difference between two things.
The levels of confusion and misinformation involved in this issue are overwhelming. Pearson above was only referring to the Bayesian vs frequentist vs Fisher debate; the entire hybrid thing has occurred on top of that, and then misinterpreting p-values on top of the hybrid. I mean, the mess that has been created is really just unbelievable. If people published all their results and not just dynamite charts with stars on them this issue would not be as important. That may be an easier battle to fight than sorting this out. I am going to stay away for awhile. 207.229.179.97 (talk) 01:31, 15 February 2013 (UTC)
I want to leave this information here; this can also be gathered from the simulation script linked to on pastebin that I posted above somewhere (or better yet, I would encourage anyone planning on using this stuff to learn how to write your own simulations so you do not have to rely on others to give you the information necessary to interpret scientific results). The summary is that if there is a difference, p-values follow an exponential distribution, not the layman's intuition of a normal distribution, which has huge consequences for how they should be interpreted; no one is taught any of this important stuff:
"This article shows that, if an initial experiment results in two-tailed p = .05, there is an 80% chance the one-tailed p value from a replication will fall in the interval (.00008, .44), a 10% chance that p < .00008, and fully a 10% chance that p > .44. Remarkably, the interval—termed a p interval—is this wide however large the sample size. p is so unreliable and gives such dramatically vague information that it is a poor basis for inference."
Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, 286–300.
Violation of assumptions is yet another level of error and confusion I left out of my above rant. P-values also go exponential if the assumption of normality is violated (at least for t-tests, not sure about others), even if both samples come from the same population. So if you find a paper that claims p-values under a true null follow a uniform distribution without justifying the assumption of normality (which is often not true in the biomedical field), it may not be applicable to your problem.

— Preceding unsigned comment added by 207.229.179.97 (talk) 19:29, 15 February 2013 (UTC)


You say "Summary is that if there is a difference, p values follow an exponential distribution, not the laymans intuition of a normal distribution which has huge consequences on how they should be interpreted, noone is taught any of this important stuff:"
FALSE; p-values cannot possibly follow an exponential distribution in any circumstances, since the support of such a distribution is [0, ∞) and p-values ∈ [0, 1]. An appropriate way to describe their behavior in these cases would be a Beta distribution. And why on Earth should anyone's intuition expect a normal distribution in these circumstances? Do you have the slightest clue what you are talking about?
The paper you subscribe to says "In one simulation of 25 repetitions of a typical experiment, p varied from <.001 to .76, thus illustrating that p is a very unreliable measure."
This statement shows such an appalling ignorance of what a p-value is that I need someone's help for a conjoint three-handed face palm; p-values in repeated experiments are supposed to behave like that. Why am I not surprised to see that, once again, these fantastic mathematical papers come from yet another psychologist... from now on a YAP (I guess I need the acronym since the psychologist literature in this article is becoming a disease). --Viraltux (talk) 05:19, 17 February 2013 (UTC)
The comment about Beta vs exponential distributions is correct but IMO a nitpick in this context; maybe there are nuances I am missing. As for the rest of the well-known facts about p-values, I will let the readers decide for themselves whether or not these points are obvious.207.229.179.81 (talk) 14:30, 17 February 2013 (UTC)
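Both claims in this exchange are easy to check by simulation. A minimal sketch (Python/SciPy; the sample size, effect size and repetition count are arbitrary illustration values, not taken from the cited papers): under a true null the two-sample t-test p-values come out roughly uniform on [0, 1], while under a real difference they concentrate near zero yet remain bounded by 1, so a Beta-like description fits better than an exponential one:

    # Sketch (parameters invented): empirical distribution of p-values from repeated
    # two-sample t-tests under a true null and under a true mean difference.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, reps = 20, 10000

    def simulate(delta):
        # p-values from `reps` experiments with true mean shift `delta`
        pvals = np.empty(reps)
        for i in range(reps):
            a = rng.normal(0.0, 1.0, n)
            b = rng.normal(delta, 1.0, n)
            pvals[i] = stats.ttest_ind(a, b).pvalue
        return pvals

    for label, delta in (("true null", 0.0), ("true difference", 0.8)):
        p = simulate(delta)
        print(label, np.quantile(p, [0.1, 0.5, 0.9]).round(4))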

Stigler mis-quotation

The problem with the quotation from Stigler is not formatting; the quotation is erroneous. His cited work is entitled "... before 1900", thus prior to the works of Fisher, Neyman and the younger Pearson. Stigler was originally cited only for his reference to the work of Laplace more than two centuries ago. The erroneous quotation was created in a blizzard of edits a few months past.159.83.196.1 (talk) 21:32, 26 March 2013 (UTC)

Hypothesis Testing is NOT objective

I believe someone edited this line: "Thus, researchers were encouraged to infer the strength of their data against some null hypothesis using p-values, while also retaining the objectivity provided by hypothesis testing."

It originally read "retaining an illusion of objectivity", which is probably worded too strongly. The usefulness of the hypothesis testing framework is that (if used correctly) it allows one to place all the subjectivity into the design of the experiment and method of analysis, which are supposed to be determined before the experiment is run and data collected. Whether or not this can be done satisfactorily by scientists dealing with "unknown unknowns" that often pop up during experiments was at the root of Fisher's rejection of the method, and remains at the root of the "Bayesian vs frequentist" debate. The process itself cannot correctly be called objective per se; the important aspect is at which point in the experiment the subjectivity is introduced. The original text attempted to convey that there is a common misconception that the process is completely objective, and this point should be reintroduced somehow in a succinct way. 207.229.179.97 (talk) 22:52, 2 February 2013 (UTC)

Addressed by reference to Objectivity (science) which is itself controversial.159.83.196.35 (talk) 20:19, 13 April 2013 (UTC)

More dubious statements

This is an opinionated claim and should be revised:

"The competing approaches of Fisher and Neyman-Pearson are complementary, but distinct. Significance testing is enhanced and illuminated by hypothesis testing: Hypothesis testing provides the means of selecting the test statistics used in significance testing. The concept of power is useful in explaining the consequences of adjusting the significance level and is heavily used in determining the number of subjects in an experiment."

Fisher himself did not agree and was very vocal about it. The usefulness of power analysis is highly suspect when one does not have knowledge of the alternative distribution (most scientific research circumstances). Fisher called it the result of "mental confusion" and often "incalculable". If prior information was available to calculate it, Fisher recommended Bayesian methods, since his problem with them was the use of the principle of indifference (where did this info go?). The statement also appears to be claiming that the legitimacy of power analysis is supported by its "heavy use", similar to the use of p<0.05. If this is the best argument for its place in scientific research then this is a serious problem; it is an argument-from-consensus fallacy.

I also note that numerous references to Fisher himself, the reasons for developing significance testing, and the reasons that Neyman and Pearson attempted to expand upon it have been removed, which is very unfortunate. http://www.phil.vt.edu/dmayo/personal_website/Fisher-1955.pdf 207.229.179.81 (talk) 23:30, 29 March 2013 (UTC)

Read the last two citations that I added, then comment further. I will support the text to which you object.159.83.196.1 (talk) 21:51, 2 April 2013 (UTC)
In Lehmann (1993) section 2, read the paragraph starting, "A question that Fisher did not raise..."159.83.196.1 (talk) 22:15, 2 April 2013 (UTC)
Fisher-1955 was published 15 years after Lindquist! It is difficult to claim that it was particularly important in the formulation of the "hybrid".159.83.196.1 (talk) 22:11, 16 April 2013 (UTC)
Wilkinson (1999): "Provide information on sample size and the process that led to sample size decisions. Document the effect sizes, sampling and measurement assumptions, as well as analytic procedures used in power calculations." Power calculations are required to plan an experiment in psychology. Fisher was right in his era. Later, statistics distinguished between exploratory and confirmatory analysis. NHST is now part of confirmatory analysis which uses crude estimates made earlier to make sample size decisions. Fisher is wrong in the changed environment.159.83.196.1 (talk) 23:51, 17 April 2013 (UTC)
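For readers unfamiliar with the power calculations Wilkinson asks authors to document, here is a minimal prospective sample-size sketch (Python with statsmodels; the effect size, alpha and power targets are made-up illustration values rather than figures from any source discussed here):

    # Sketch of a prospective power analysis for a two-sample t-test (invented inputs).
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Sample size per group needed for an assumed standardized effect size d = 0.5,
    # alpha = 0.05 (two-sided) and a target power of 0.80.
    n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
    print(round(n_per_group))   # roughly 64 per group under these assumptions

    # Power actually achieved if only 40 subjects per group can be recruited.
    print(analysis.solve_power(effect_size=0.5, nobs1=40, alpha=0.05))

The calculation runs in either direction: fix the assumed effect size, alpha and desired power to obtain a sample size, or fix the achievable sample size to see how much power it buys.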

"Consensus"?

"There is some consensus that the hybrid testing procedure that is commonly used is fundamentally flawed." Dubious. The criticism and defence of hypothesis testing contains a lot of nuance. There is a much stronger consensus regarding statistical practices than theory. Practices are particularly bad in fields with ethical constraints on experimental practices. While psychology, education and medicine are properly reluctant to treat people like lab rats, they are fields which would benefit from large samples. Experimental controls are difficult and data are usually multidimensional and often subjective.159.83.196.1 (talk) 21:49, 4 April 2013 (UTC)

Deleted after weeks for discussion.159.83.196.1 (talk) 00:23, 24 April 2013 (UTC)

Granger causality

It seems Granger causality should be introduced in this article. Thanks -- ResearcherQ (talk) 18:05, 23 April 2013 (UTC)

Added to See Also159.83.196.1 (talk) 00:29, 24 April 2013 (UTC)

Table: "A comparison between Fisherian, frequentist (Neyman-Pearson)."

"Report the exact level of significance (e.g., p = 0.051 or p = 0.049). Do not use a conventional 5% level, and do not talk about accepting or rejecting hypotheses."

Fisher is quoted on both sides of this issue.

"Use this procedure only if little is known about the problem at hand, and only to draw provisional conclusions in the context of an attempt to understand the experimental situation."

This statement (from Gigerenzer) has been called "unsupported".

"The usefulness of the procedure is limited among others to situations where you have a disjunction of hypotheses (e.g., either μ1 =8 or μ2 = 10 is true) and where you can make meaningful cost-benefit trade-offs for choosing alpha and beta."

While historically true, N-P theory was generalized over decades. The statement is not true of the hybrid.

The table needs work.159.83.196.1 (talk) 22:00, 4 April 2013 (UTC)
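To illustrate the "disjunction of hypotheses" and the cost-benefit trade-off mentioned in the quoted table entry, here is a minimal sketch (Python/SciPy; sigma, n and the cutoffs are invented for illustration) of how alpha and beta move in opposite directions as the cutoff between mu = 8 and mu = 10 shifts:

    # Sketch (all numbers invented): alpha/beta trade-off for two simple hypotheses,
    # H0: mu = 8 versus H1: mu = 10, with known sigma and fixed sample size.
    from scipy import stats

    sigma, n = 3.0, 16
    se = sigma / n ** 0.5          # standard error of the sample mean

    def error_rates(cutoff):
        alpha = stats.norm.sf(cutoff, loc=8.0, scale=se)    # P(reject H0 | mu = 8)
        beta = stats.norm.cdf(cutoff, loc=10.0, scale=se)   # P(accept H0 | mu = 10)
        return alpha, beta

    for c in (8.8, 9.0, 9.2, 9.4):
        a, b = error_rates(c)
        print("cutoff %.1f: alpha = %.3f, beta = %.3f" % (c, a, b))

Choosing the cutoff (equivalently, alpha) by weighing the two error costs is the Neyman-Pearson element that the hybrid procedure typically replaces with a conventional 5% level.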


One problem is that Fisher's later opinions did not match his life-long practices. He is both credited (blamed?) for the 5% convention (Stigler and Lehmann) and for opposing it. The complexities of history do not often fit easily in tables. While Fisher was not a frequentist his theory is often labelled so because his alternative statistical philosophy effectively died with him.159.83.196.1 (talk) 00:37, 19 June 2013 (UTC)

Long introduction

It was recently noted that the introduction is lengthy. Constructive suggestions are solicited. The subject suffers from a chicken/egg problem. Each of the sections benefits from having the other sections first. Textbooks develop the subject gradually rather than having strong sectional organisation. Can we have a second introductory section that combines a bit of terminology with an example?159.83.196.1 (talk) 01:02, 19 June 2013 (UTC)

Move common test statistics?

It has been suggested that much of the section be moved elsewhere. I oppose removing the core technical content of this article.159.83.196.1 (talk) 01:10, 19 June 2013 (UTC)

Strange Meehl 1967 reference comments

Here is the current comment on this reference:

Meehl, Paul E. (1967). "Theory-Testing in Psychology and Physics: A Methodological Paradox". Philosophy of Science 34 (2): 103–115. Thirty years later, Meehl acknowledged statistical significance theory to be mathematically sound, blaming instead the "social scientists’ poor understanding of the logical relation between theory and fact" in "The Problem Is Epistemology, Not Statistics: Replace Significance Tests by Confidence Intervals and Quantify Accuracy of Risky Numerical Predictions" (Chapter 14 in Harlow (1997)).

The relationship between this comment and the referenced paper is unclear. The paper makes no claim about mathematical unsoundness; instead it discusses problems with logic and assumptions (i.e. garbage in, garbage out). Could whoever is responsible please explain why this was added? It currently appears to "cast doubt" on the legitimacy of the claims made in that paper while not actually addressing any of the claims. — Preceding unsigned comment added by 207.229.179.81 (talk) 22:05, 5 June 2013 (UTC)

Meehl's later paper deals directly with criticism of hypothesis testing. It references the earlier paper and others by the author on the subject of hypothesis testing. Meehl's earlier paper is addressed to philosophers. It contrasts the ways in which statistics is used by physics and psychology. The later paper seems to be the more appropriate reference for this article. It does indeed cast doubt on a simple interpretation of the earlier paper.159.83.196.1 (talk) 23:11, 12 June 2013 (UTC)
I don't understand what aspect of the first paper has had "doubt cast upon it". Here is an excerpt of the abstract from the 1997 paper cited by the note, in which Meehl repeats the same claims he was making thirty years earlier:
"Although a theory’s success in deriving a fact tends to corroborate it, this corroboration is weak unless the fact has a very low prior probability and there are few possible alternative theories. The fact of a nonzero difference or correlation, such as we infer by refuting the null hypothesis, does not have such a low probability because in social science everything correlates with almost everything else, theory aside." http://www.tc.umn.edu/~pemeehl/169ProblemIsEpistemology.pdf207.229.179.81 (talk) 03:43, 14 June 2013 (UTC)
The first sentence of the quotation is a general philosophical problem of science, largely independent of statistics. The second says that the general problem is severe in the social sciences, again independent of statistics. These are not criticisms of statistics (or hypothesis testing); they are observations on the complexity of the social sciences.159.83.196.1 (talk) 00:09, 19 June 2013 (UTC)
Rather than debate the complex prose and opinions of Meehl which may have mellowed with time, I inserted a bullet which contains some of the same criticism more simply expressed. I plan to delete the earlier bullet after allowing time for commentary.159.83.196.1 (talk) 21:11, 25 June 2013 (UTC)
This new text does not address the same concerns as the original text. If the null hypothesis chosen has low a priori probability (ie, there are numerous possible explanations for why it may be false), significance testing is rather worthless as an indicator of the variable under study. This is the case for much of the psychological, sociological, and biomedical literature. By choosing the null hypothesis of "means of two groups are exactly equal" when studying systems influenced by innumerable unknown interactions between factors, the researcher has chosen the easiest possible thing to disprove. It is to the point that we often already know a priori that it is false.
Further, if we choose the opposite of what our research hypothesis predicts as the statistical null hypothesis, we are doomed to perpetually affirming the consequent. It is actually impossible for research assessed in such a way to ever be falsified or verified. If the researcher instead chose the null hypothesis to be some outcome predicted by their research hypothesis (e.g. the difference in means will be exactly 3), then it becomes possible for the research hypothesis to be falsified (or at least the conjunction of this hypothesis along with all the auxiliary info) and for cumulative growth of knowledge to occur.
Perhaps this should be placed under "cautions" rather than criticisms. The problem with that is many visiting the page will be told to do the opposite when they are taught the hybrid in statistics 101. The math, sociology, logic, and philosophy are all intertwined so trying to separate them is a mistake in my opinion. — Preceding unsigned comment added by 207.229.179.81 (talk) 02:58, 29 June 2013 (UTC)
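To make the contrast concrete, here is a minimal R sketch of the two choices of null hypothesis described above; the data, the group means, and the predicted difference of 3 are all made-up illustrative assumptions:
 set.seed(1)
 x <- rnorm(30, mean = 10, sd = 2)   # hypothetical group 1
 y <- rnorm(30, mean = 7,  sd = 2)   # hypothetical group 2
 # Nil null: H0 says the difference in means is exactly 0
 t.test(x, y, mu = 0)
 # Point-prediction null: H0 says the difference in means is exactly 3,
 # the value the research hypothesis itself predicts
 t.test(x, y, mu = 3)
With the nil null, rejection says only that the two means are not exactly equal; with the point-prediction null, rejection bears directly on the value the research hypothesis predicted.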
The conventional comparison of numeric prediction to numeric result was supplemented but not replaced by NHST. The original approach is stronger.159.83.196.1 (talk) 22:29, 5 July 2013 (UTC)

Dubious statements in Origins

"Neither strategy is meant to provide any way of drawing conclusions from a single experiment.[25][27] Both strategies were meant to assess the results of experiments that were replicated multiple times.[28]" Maybe. Significance testing has no explicit mechanism for considering multiple experiments, although frequentist statistics relies on that interpretation. Fisher was not exactly a frequentist.

Gigerenzer, et al (1990) claim that a consensus hybrid emerged among statisticians, (social science statistics) textbook authors and (social science) journal editors over several decades. The hybrid emerged without the known participation of Fisher, Neyman or Pearson during their long-running dispute. Lindquist may have contributed to the hybrid, but is unlikely to have originated it. His "nil" interpretation was noted as unusual decades later.159.83.196.1 (talk) 21:01, 29 March 2013 (UTC)

The claim is based on the paper cited in that figure. They examine the literature and textbooks of the time and come to the conclusion that Lindquist was the most likely originator. I don't know what Gigerenzer paper you are quoting but he most likely did not present evidence as strong as that in the said paper.
Also, from later in the wiki: 'Fisher himself said,[4] "In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result."'207.229.179.81 (talk) 22:51, 29 March 2013 (UTC)
I have been unable to find a distinction between the hybrid method supposedly developed in the behavioural sciences and the method taught in introductory statistics classes outside of those sciences. Gigerenzer's book mentions statisticians, statistics textbook authors (who were mostly not statisticians), and journal editors. He is vague on the interactions among those groups.159.83.196.1 (talk) 20:04, 30 March 2013 (UTC)
My impression is that the method developed by psychologists became the method taught in textbooks and that psychology was the first field to adopt widespread use of statistics. I haven't read anything by historians on this (perhaps Stigler has written something?), but I have a tentative theory linking it back to Edward Bernays and the formation of NIMH after WWII. It is something I have been meaning to look deeper into. Another interesting point is Churchill destroying all of Alan Turing's Bayesian work to keep it out of the hands of the Russians.207.229.179.81 (talk) 22:35, 30 March 2013 (UTC)


As far as I can tell, statisticians with knowledge of history accept that the current hypothesis testing method is a hybrid, but "the hybrid theory" of hypothesis testing, as a term, is a product of Gigerenzer et al (1989). Gigerenzer is a psychologist and his writing partners were philosophers and historians (no mathematicians or statisticians). A search on the term will not yield much, and statisticians may be unfamiliar with it. Lindquist taught a hybrid of the two formulations, but it was not yet the theory that Gigerenzer described. Perhaps it was a first step on the evolutionary path.
Old reference to Lindquist: "There is some confusion in the literature concerning the meaning of the term null hypothesis. Fisher used the term to designate any exact hypothesis that we might be interested in disproving, and "null" was used in the sense of that which is to be nullified (cf., e.g., Berkson, 1942). It has, however, also been used to indicate a parameter of zero (cf., e.g., Lindquist, 1940, p. 15), that the difference between the population means is zero, or the correlation coefficient in the population is zero, the difference in proportions in the population is zero, etc. Since both meanings are usually intended in psychological research, it causes little difficulty." A footnote in: The test of significance in psychological research, David Bakan, Psychological Bulletin, Vol. 66 (No. 6), December, 1966, pp. 423-437. I leave it to you whether "some confusion" "causes little difficulty". I don't know when Lindquist's deficient definition was first known.
Gigerenzer discusses the invasion of psychology by statistics in section 6.3: Statistics of the mind; The new tools pp 205-211.
Halpin's paper is titled "...in Psychological Research (1940-1960)". Lindquist's book was doubtlessly influential in that arena. Lindquist learned from Snedecor, whose book on statistical methods has been in (re-)print (8 editions) since 1938 (and which mentions the Neyman-Pearson formulation without confusion). Snedecor was taught by Fisher. The statistician's post-graduate text of choice on hypothesis testing for the past half-century was authored by Lehmann who was a student of Neyman. Neither farmers nor statisticians were much influenced by Lindquist. — Preceding unsigned comment added by 159.83.196.1 (talk) 22:32, 2 April 2013 (UTC)
On further reflection & study, I removed some of the adjectives (leaving the factual claim intact) and removed the dubious flag. Another reference noted how advanced Lindquist's text was (in a complimentary way).159.83.196.1 (talk) 22:06, 16 April 2013 (UTC)

The lines that were tagged as dubious remain so. In a series of 10 hypothesis tests, the calculations of the 10th are completely independent of tests 1-9. Fisher's definition does not change that. It does raise the issue of a replacement definition if hypothesis testing is abandoned.159.83.196.1 (talk) 00:16, 19 June 2013 (UTC)

The two dubious statements were deleted after months of discussion.159.83.196.1 (talk) 19:19, 12 July 2013 (UTC)

Redirect from "Hypothesis testing"

"Hypothesis testing" is redirected to this page at present, but in fact, "statistical hypothesis testing" is a particular type of the more general "hypothesis testing". There is a page whose title is Empirical research that makes a general description of the "empirical method", of which the "hypothesis testing" is a part. I think that it would be better to redirect "hypothesis testing" to the "empirical method" page, which I will do if there are no objections.--Auró (talk) 14:25, 29 December 2013 (UTC)

I suggest that you make it easy to reach either article. The preferred article depends on context.172.250.105.20 (talk) 20:45, 9 January 2014 (UTC)

Missing word?

I think this phrase must be missing a word: "Specifically, the null hypothesis allows to attach an attribute" ... allows whom? Should there be an "us" or "the researchers"? The entirety of number 2 under "The testing process" seems to be saying we need to choose the null wisely, but it's just very unclear, and this seems to be a central part. And attach an attribute to what? The null hypothesis? Any help/edits to make it more clear are much appreciated.TedPSS (talk) 02:50, 21 February 2014 (UTC)TedPSS

I do recognize that both the Null hypothesis and Alternative hypothesis pages are somewhat long, and statistical hypothesis testing is a rather large topic to cover. However, it seems cumbersome to split up two such indivisible types of hypotheses and explain each separately. It results in a lot of repetition of content and examples. Also consider those wanting to learn, who must wade through three pages on basically the same points (the tests, history, etc. sections are much more naturally divisible). Can't we rather split up the main topic in other ways? Or alternatively, merge Null hypothesis and Alternative hypothesis into one Null and alternative hypothesis with sections for each, and broaden this latter page for more than the narrow statistical perspective? Sda030 (talk) 03:56, 26 February 2014 (UTC)

I'm in favor of Null and alternative hypothesis; Statistical hypothesis testing has already attracted too much controversy. Fgnievinski (talk) 04:30, 26 February 2014 (UTC)
Null and alternative hypothesis because alternative hypothesis is short & this article is already too long.172.250.105.20 (talk) 19:04, 8 March 2014 (UTC)
There seems to be some duplication or undesirable overlap with estimation statistics; your thoughts? Fgnievinski (talk) 03:09, 29 June 2014 (UTC)
Does not appear to be a serious problem. Problems with significance testing are a natural lead to the advantages of confidence intervals.172.249.8.109 (talk) 04:14, 13 July 2014 (UTC)

Early choices of null hypothesis

The section should be moved to null hypothesis.172.249.8.109 (talk) 04:04, 8 July 2014 (UTC)

Using a strawman null hypothesis is the most fundamental problem with NHST as commonly practiced, so it probably belongs on both pages. One can read the 1904 Pearson paper (apparently the first use of a strawman) to see that he did not consider the consequences at all.
I have not been able to find any pro-NHST argument addressing Meehl's critique of this or, more generally, justifying the "disproving the opposite of your theory" use of NHST. This appears to be simply absent from the literature: the discussion is completely one-sided against doing this, yet the practice continues. I doubt it can be justified; it renders the entire procedure pointless as far as I can tell. All the other confusion seems to stem from people trying to make sense of this non-productive behaviour they have been trained to perform.207.229.179.81 (talk) 17:28, 16 July 2014 (UTC)

The Cohen Criticism

I would like to expand on Cohen's criticism of NHST

*"[I]t does not tell us what we want to know"

by adding:

What we want to know is: given our data, what is the probability of the null hypothesis being true? But what the p-value tells us is: given that the null hypothesis is true, what is the probability of obtaining our data? --1980na (talk) 01:23, 1 August 2014 (UTC)
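A small R sketch of the distinction may help; the data (15 heads in 20 tosses) and the prior and alternative (H0: p = 0.5 and H1: p = 0.75, each given prior probability 0.5) are toy choices made only to show that the two quantities are different things:
 x <- 15; n <- 20                                     # hypothetical data
 # What the p-value reports: P(data at least this extreme | H0 true)
 p_value <- pbinom(x - 1, n, 0.5, lower.tail = FALSE)
 # What we would like: P(H0 true | data); this requires a prior and an
 # explicit alternative, both chosen here purely for illustration
 lik0 <- dbinom(x, n, 0.5)
 lik1 <- dbinom(x, n, 0.75)
 posterior_H0 <- (0.5 * lik0) / (0.5 * lik0 + 0.5 * lik1)
 c(p_value = p_value, posterior_H0 = posterior_H0)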

It is perfectly possible to selectively quote Cohen as a Bayesian advocate. His recommendations in the subject article do not support that. He did not recommend replacing NHST, but supplementing it with exploratory data analysis and reporting effect sizes.172.249.8.109 (talk) 01:51, 7 August 2014 (UTC)

Grammatical error

As copied directly from the cited source here (potentially a copyright violation), the text under section /* The testing process */ reads: There should be a well-defined statistical statement (the null hypothesis (H0)) which allows to attach an attribute (rejected): it should be chosen in such a way that it allows us to conclude whether the alternative hypothesis can either be accepted or stays undecided as it was before the test. The grammatical error is the omission of the word "us" between "allows" and "to". But the larger problem is that, if I'm not mistaken, this is talking about setting the p-value, which is the probability of observing an effect given that the null hypothesis is true, but attempts to address the issue in a shorthand that is less than clear outside the context of the original source. --Bejnar (talk) 16:43, 12 August 2014 (UTC)

Well spotted. I removed that passage, and another from the same source, that were copied (or very closely paraphrased) from the book you linked to. They were added in this edit by an editor who stopped editing in 2010. Qwfp (talk) 17:32, 12 August 2014 (UTC)

Statistical threshold

On 17 August 2014, after discussion at Afd, Statistical threshold was redirected here, to "statistical hypothesis testing". --Bejnar (talk) 13:40, 17 August 2014 (UTC)

Origins and early controversy

Most papers on this subject say that textbooks are the source of the "hybrid" approach, which is exemplified by the Lindquist figure on this wiki page. Yet the source cited in support of this (Halpin & Stam, 2006) says that textbooks are not the likely origin of the hybrid approach. To quote from page 644:

Thus, there seems to be little trace of a hybrid approach to statistical testing in the psychological research literature reviewed here, as indexed by the general omission of Neyman-Pearson concepts and procedures. Nonetheless, it has been found repeatedly that small-sample methods became prevalent in psychological research of this period. This suggests that a hybridized model of statistical testing was not directly transmitted from textbooks of psychological statistics to the research literature.

Clearly Halpin & Stam is not a valid source to support this point. However, I am not knowledgeable enough about the origins of the current Null Hypothesis Significance Testing procedure to say whether Halpin & Stam is right or not in the big picture. There is a large literature on this. Huberty (1993), "Historical origins of statistical testing practices: The treatment of Fisher versus Neyman-Pearson views in textbooks", is very relevant. — Preceding unsigned comment added by 193.163.223.34 (talk) 10:24, 12 February 2015 (UTC)

I am not sure you read the entire paper. Halpin and Stam (2006) clearly do come to the conclusion that the "hybrid" originated in textbooks (the Lindquist figure claims to present evidence for this). The text following your quote begins as follows:
This in turn raises the question of how to conceptualize the implications of the textbook hybrid for psychological research...
The historical research presented in this article therefore corroborates the interpretation of statistical textbooks given by Gigerenzer and Murray...
However, there has also been a conspicuous absence of Fisher's logic of inductive inference throughout the critical literature, an omission that has clear precedent in the hybrid model found in textbooks of psychological statistics.[1]
Of course whether this is the correct path the "idea" traveled along is another issue. There could be multiple points of origination. Huberty (1993) looks like it would make a great additional reference for this page. 207.229.179.81 (talk) 21:50, 14 February 2015 (UTC)

Hypothesis testing to discriminate two point processes

Suppose that we have two different point processes. By hypothesis testing, how can we find out whether the sample is from the first one or the second one? What is the minimal sum of error probabilities? — Preceding unsigned comment added by 174.63.121.210 (talk) 21:53, 2 May 2016 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified one external link on Statistical hypothesis testing. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—cyberbot IITalk to my owner:Online 19:13, 26 May 2016 (UTC)

Clairvoyant example...

I could be completely wrong about this but looking at the clairvoyant example...

The probability of getting every guess correct (clairvoyantly) is said to be

(1/4)^25 ~= 10^-15

This is basically the 1/4 probability that a card will be of a chosen suit raised to the power of the number of correctly chosen cards, right?

So then the probability of getting between 10 and 25 of the choices correct is the sum of getting exactly 10, 11, 12, 13, etc., up to 25 choices correct, so if I put that into Wolfram's summation widget I get something like 1.26*10^-6, NOT ~= .07 as stated in the article?

Am I missing something here?

http://www.wolframalpha.com/input/?i=sum+[%2F%2Fmath:%281%2F4%29^k%2F%2F],+[%2F%2Fmath:k%2F%2F],+[%2F%2Fmath:10%2F%2F],+[%2F%2Fmath:25%2F%2F] — Preceding unsigned comment added by 132.45.121.6‎ (talk) 28 October 2016

What you said in words is correct, but your translation of that into maths isn't. The probability of getting k cards right (and hence 25–k cards wrong) is (25 choose k) × (1/4)^k × (3/4)^(25–k). Wolfram Alpha gives 0.071328... [1]. See Binomial distribution#Probability mass function. —Qwfp (talk) 10:08, 29 October 2016 (UTC)
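For anyone wanting to reproduce the 0.0713 figure without Wolfram Alpha, an equivalent check in base R is:
 # P(10 <= X <= 25) where X ~ Binomial(25, 1/4)
 sum(dbinom(10:25, size = 25, prob = 1/4))             # about 0.0713
 pbinom(9, size = 25, prob = 1/4, lower.tail = FALSE)  # same tail probability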

The p-value doesn't have to be strictly lower than the significance level to reject the null hypothesis.

The significance level “alpha” is defined as the risk of rejecting a true null hypothesis (risk of type 1 error, or false positive). The p-value is defined as the probability of getting a test statistic at least as extreme as observed, under the null hypothesis. The page says one should reject the null hypothesis when the p-value is less than alpha. This rule appears to contradict the two definitions. If we reject H0 only when a sample yields a p-value that is strictly lower than alpha, the rejection rate of a true H0 might be lower than alpha, while it should equal alpha, by definition.

To illustrate: H0 is "this coin is fair" and H1 is "there is a probability >1/2 of getting a head" (one-sided test). We toss the coin 10 times. Our test statistic X is the number of heads observed in 10 trials. X follows Bi(10, 1/2) under H0. We get 5 heads. The p-value is P(X ≥ 5) = 0.6230469. You can check with R using binom.test(5, 10, 1/2, "greater").

If we choose alpha = P(X ≥ 5) = 0.6230469, and decide to reject H0 when the p-value is strictly lower than alpha, we would reject H0 only if there are 6 heads or more, because if we get 5 heads, the p-value equals alpha. Getting 6 heads or more under H0 has probability P(X ≥ 6) = 0.3769531. This is the rate at which we would reject the true H0. As you can see, it does not equal alpha.

If I’m right, the wiki page is wrong. Jpeccoud (talk) 05:41, 29 August 2019 (UTC)jpeccoud
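The figures above can be reproduced with the base binomial functions in R; a minimal check is:
 pbinom(4, 10, 0.5, lower.tail = FALSE)                    # P(X >= 5) = 0.6230469
 pbinom(5, 10, 0.5, lower.tail = FALSE)                    # P(X >= 6) = 0.3769531
 binom.test(5, 10, 1/2, alternative = "greater")$p.value   # 0.6230469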

I agree, it should be "less than or equal to". The distinction makes no difference for continuous test statistics; the issue comes up only for discrete distributions. I'm changing the phrasing. Tal Galili (talk) 14:21, 29 August 2019 (UTC)
Thanks. This implies many corrections to the Statistical significance article, which I'd rather not do myself (being neither a statistician nor a native English speaker). — Preceding unsigned comment added by Jpeccoud (talkcontribs) 08:02, 30 August 2019 (UTC)
Thanks for the heads up. I've now made modifications to Statistical significance based on this. Tal Galili (talk) 07:05, 1 September 2019 (UTC)

Criticism

  • When used to detect whether a difference exists between groups, a paradox arises. As improvements are made to experimental design (e.g. increased precision of measurement and sample size), the test becomes more lenient. Unless one accepts the absurd assumption that all sources of noise in the data cancel out completely, the chance of finding statistical significance in either direction approaches 100%. However, this absurd assumption that the mean difference between two groups cannot be zero implies that the data cannot be independent and identically distributed (i.i.d.) because the expected difference between any two subgroups of i.i.d. random variates is zero; therefore, the i.i.d. assumption is also absurd.
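The first claim in the bullet, that with any fixed nonzero true difference the probability of rejecting the nil null approaches 1 as the sample grows, can be illustrated with a short R sketch; the true difference of 0.02 standard deviations and the sample sizes are arbitrary illustrative choices:
 # Power of a two-sample t-test at the 5% level for a tiny true difference
 sapply(c(100, 10000, 1000000),
        function(n) power.t.test(n = n, delta = 0.02, sd = 1,
                                 sig.level = 0.05)$power)
 # roughly 0.05, 0.29, and essentially 1 as the per-group sample size grows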

This train of thought is unclear and does not make much sense to me. The contributor confuses two different assumptions, in addition to labeling them as "absurd" without any justification. 2607:FEA8:11E0:6C57:E9A6:1B95:A479:CA89 (talk) 16:49, 2 April 2020 (UTC)

The testing process

This section is not correct - or at best, is misleading - when the null is not simple or the test statistic is discrete. If (at step 6 of the first procedure) you select a significance level (like 5%) without regard to the available significance levels, or (with non-simple nulls) if the true parameter is not at the significance level boundary, then the probability of being in the rejection region will generally be lower than the selected level. On the other hand, with a discrete statistic, if you *do* select the level with regard to the available significance levels in that procedure (which is pretty rare in what seems to be common practice), then it is no longer consistent with the p-value approach unless *that too* is selected in like fashion (which in common practice is rarer still) -- the first would be exactly at a selected level and the second would not. So either steps 6 and 8 of the first procedure have an issue, or the claim of equivalence of the second procedure has an issue.

Glenbarnett (talk) 06:27, 17 June 2020 (UTC)
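For readers unfamiliar with the discreteness problem raised above, a minimal R illustration (assuming the simple case of X ~ Binomial(10, 0.5) under the null and rejection for large X) lists the only significance levels such a test can attain exactly:
 k <- 0:10
 attainable_alpha <- pbinom(k - 1, 10, 0.5, lower.tail = FALSE)  # P(X >= k) under H0
 round(attainable_alpha, 4)
 # 1.0000 0.9990 0.9893 0.9453 0.8281 0.6230 0.3770 0.1719 0.0547 0.0107 0.0010
 # A nominal 5% level cannot be attained exactly: rejecting when X >= 9 gives
 # size about 0.011, the largest attainable size not exceeding 5%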

History

I looked at the inline-cited article by Zabell, and in no way does it say that Fisher started as a Bayesian. Bayes is not mentioned at all. Bayesianism was not really a school within statistics until the 1950s. That part of the sentence should be removed. Mcsmom (talk) 20:38, 8 December 2020 (UTC)

Re-ordering the sections

The article currently has the example section sandwiched between the "The testing process" and "Definition of terms" sections, which I think is a little bit off. I think it would be helpful for the reader if the example section were moved to follow the "Common test statistic" section.Happyboi2489 (talk) 15:37, 7 December 2021 (UTC)