Tests of acceptance and confidence intervals

Cumming’s (2014) article exhorts substituting confidence intervals (CIs) for Null Hypothesis Significance Testing (NHST) as a way of increasing the scientific value of psychological research. Yet Cumming’s article is somewhat biased and potentially misleading, hence this commentary.

Statistical tests

NHST is a philosophical, and pseudoscientific, mishmash of three incompatible theories: Fisher’s, Neyman-Pearson’s, and Bayes’s (Gigerenzer, 2004). In practice, however, it reduces to one of the former two, most often Fisher’s (Cortina & Dunlap, 1997); it is therefore more interesting to discuss Fisher’s (1954) or Neyman-Pearson’s (1933) theories than NHST.

In a nutshell, Fisher’s relevant constructs are null hypotheses, levels of significance, and ad hoc p-values. Neyman-Pearson’s relevant constructs are main and alternative hypotheses, long-run errors under repeated sampling (α, β), critical test values, and power (1 – β); p-values are not relevant but can be used as proxies for deciding between hypotheses (Perezgonzalez, 2014).

Cumming’s simulation of the dance of the p-values (Cumming, 2014) is not suitable for representing Fisher’s ad hoc approach, only Neyman-Pearson’s repeated sampling approach. Under the latter approach, what is relevant is a simple count of accepted tests (those whose results fall in the critical rejection region), irrespective of their p-values. As it turns out, Cumming’s results are a textbook example of what to expect given power: for example, 48% of tests are accepted at α = 0.05, against 50% expected (see Table 1).

Table 1. Accepted tests given alpha and power

alpha   expected   accepted tests (of 25)   observed
0.05    50%        12                       48%
0.01    26%        7                        28%
0.10    63%        17                       68%
0.20    76%        19                       76%

(Based on Cumming’s (2014) data; expected percentages calculated with G*Power)
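The expected percentages in Table 1 follow directly from statistical power. As a minimal sketch, assuming the design behind the table amounts to a two-tailed, two-sample t test with d = 0.5 and n = 32 per group (assumed values, chosen because they reproduce the expected column), the G*Power figures can be recovered analytically with the noncentral t distribution:

```python
# Analytic power of a two-tailed, two-sample t test, as a check on the
# "expected" column of Table 1. Effect size d = 0.5 and n = 32 per group
# are assumptions, not values stated in the text.
from scipy import stats

def power(alpha, d=0.5, n=32):
    df = 2 * n - 2                        # degrees of freedom, two groups
    ncp = d * (n / 2) ** 0.5              # noncentrality parameter
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    # probability the test statistic lands in either rejection region
    return (1 - stats.nct.cdf(tcrit, df, ncp)) + stats.nct.cdf(-tcrit, df, ncp)

for a in (0.05, 0.01, 0.10, 0.20):
    print(a, round(power(a), 2))
```

Under these assumed inputs the four expected percentages (50%, 26%, 63%, 76%) come out to within rounding of the table.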

Fair comparison

Cumming writes that “CIs and p values are based on the same statistical theory” (p. 13), which is only partly correct. CIs were developed by Neyman (1935) and, thus, CIs and Neyman-Pearson’s tests are grounded on the same statistical philosophy: repeated sampling from the same population, errors in the long run (1 – CI = β), and the assumption of true population parameters (which are unknown in the case of CIs). P-values, however, are part of Fisher’s theory.

CIs and Neyman-Pearson’s tests are, thus, two sides of the same coin. Tests work on the main hypotheses, with known population parameters but unknown sample probabilities, and calculate point estimates to make decisions about those samples. CIs work on the alternative hypotheses, with known sample interval probabilities but unknown population parameters, and calculate CIs to describe those populations.

To be fair, a comparison of the two requires equal ground. At the interval level, CIs compare with power: Cumming’s simulation reveals that 92% of sample statistics fall within the population CIs (95% were expected), while, as shown in Table 1, the percentage of observed tests supporting the alternative hypothesis matched the percentage expected to do so. At the point-estimate level, means (or a CI bound) compare with p-values, and Cumming’s figure reveals a well-choreographed dance between the two. In short, CIs are not superior to Neyman-Pearson’s tests when properly compared although, as Cumming discusses, CIs are certainly more informative.
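The repeated-sampling logic shared by both tools can be illustrated with a short Monte Carlo sketch (population values and sample size below are illustrative assumptions, not Cumming’s actual design): drawing many samples from a known population and counting how often each sample’s 95% CI captures the true mean recovers the expected long-run frequency.

```python
# Monte Carlo sketch of the long-run frequency interpretation of CIs.
# Population parameters (mu, sigma) and sample size n are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 2.0, 30, 10000
captured = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, n)
    lo, hi = stats.t.interval(0.95, n - 1, loc=x.mean(), scale=stats.sem(x))
    captured += lo <= mu <= hi      # does this interval capture the true mean?
print(captured / reps)              # close to 0.95 in the long run
```

Any single interval either captures the parameter or it does not; the 95% is a property of the procedure over repetitions, not of one observed interval.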

Fending against fallacies

The common philosophy underlying CIs and tests implies that they share similar fallacies. Cumming touches on some but does not pre-emptively resolve them for CIs.

Cumming writes, “if p reveals truth…” (p. 12). It is not clear what truth Cumming refers to; most probably it concerns two known fallacies: that p is the probability of the main hypothesis being true and, consequently, that 1 – p is the probability of the alternative hypothesis being true (e.g., Kline, 2004). Similar fallacies extend to power under the alternative hypothesis. Yet accepting the alternative hypothesis does not mean a 1 – β probability of it being true, but a probability of capturing 1 – β samples pertaining to its population in the long run. The same can be said about CIs (insofar as CI = 1 – β): they tell us something about the data (their probability of capturing a population parameter in the long run), not about the population; i.e., the observed CI may not actually capture the true parameter.

Another fallacy touched upon is that p informs about replication. P-values only inform about the ad hoc probability of the data under the tested hypothesis, thus they “have little to do with replication in the usual scientific sense” (Kline, 2004, p. 66). Similarly, CIs do not inform about replicability either: they are statements of expected frequency under repeated sampling from the same population. The 83% next-experiment replicability reported by Cumming (2014), although interesting, is not part of the frequentist understanding of CIs and seems explainable by the width of the confidence interval alone.
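That the 83% figure seems to follow from interval width alone can be checked with a quick Monte Carlo sketch (population values and sample size are arbitrary illustrative assumptions): the long-run frequency with which a replication’s mean falls inside the original experiment’s 95% CI comes out well below 95%, simply because the difference between two independent means is more variable than a single mean.

```python
# Sketch: how often does a replication's mean land inside the original 95% CI?
# Population parameters and sample size are arbitrary illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n, reps = 0.0, 1.0, 32, 20000
hits = 0
for _ in range(reps):
    x1 = rng.normal(mu, sigma, n)                                  # original
    lo, hi = stats.t.interval(0.95, n - 1, loc=x1.mean(), scale=stats.sem(x1))
    m2 = rng.normal(mu, sigma, n).mean()                           # replication
    hits += lo <= m2 <= hi
print(hits / reps)   # in the low-to-mid 0.80s, not 0.95
```

No replication claim is needed to obtain a figure in this range; it falls out of the interval’s width relative to the sampling variability of two independent means.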

Fending against mindless use of CIs

Finally, there is the ever-present risk that CIs will be used as mindlessly as tests. For one, CIs share the same philosophical background as Neyman-Pearson’s tests, yet many researchers take them to mean an ad hoc probability of personal confidence (Hoekstra et al., 2014). For another, the inferential value of a CI rests on the assumption that the unknown population parameter has an equal chance of being anywhere within that interval. Using a point estimate leads to the wrong conclusion: that such a point estimate is more probable than any other in the interval.

Final note

Both CIs and tests are useful tools (Gigerenzer, 2004). Neyman’s CIs are more descriptive, Fisher’s tests find significant results and foster meta-analyses, Neyman-Pearson’s tests help with decisions, and NHST means trouble. Yet we also need to consider other tools in the statistical toolbox, such as exploratory data analysis (Tukey, 1977), effect sizes (Cohen, 1988), meta-analysis (Rosenthal, 1984), cumulative meta-analysis (Braver et al., 2014), and Bayesian applications (Dyjas et al., 2012; Barendse et al., 2014).

1. BARENDSE MT, CJ ALBERS, FJ OORT & ME TIMMERMAN (2014). Measurement bias detection through Bayesian factor analysis. Frontiers in Psychology, doi: 10.3389/fpsyg.2014.01087.
2. BRAVER Sanford L, Felix J THOEMMES & Robert ROSENTHAL (2014). Continuously cumulating meta-analysis and replicability. Perspectives on Psychological Science, doi: 10.1177/1745691614529796.
3. COHEN Jacob (1988). Statistical power analysis for the behavioral sciences, 2nd ed. Psychology Press, ISBN 9780805802832.
4. CORTINA Jose M & William P DUNLAP (1997). On the logic and purpose of significance testing. Psychological Methods, doi: 10.1037/1082-989X.2.2.161.
5. CUMMING Geoff (2014). The new statistics: why and how. Psychological Science, doi: 10.1177/0956797613504966.
6. DYJAS Oliver, Raoul PPP GRASMAN, Ruud WETZELS, Han LJ VAN DER MAAS & Eric-Jan WAGENMAKERS (2012). What’s in a name: a Bayesian hierarchical analysis of the name-letter effect. Frontiers in Psychology, doi: 10.3389/fpsyg.2012.00334.
7. FISHER Ronald Aylmer (1954). Statistical methods for research workers, 12th ed. Oliver and Boyd, ASIN B001KT8DTK.
8. GIGERENZER Gerd (2004). Mindless statistics. The Journal of Socio-Economics, doi: 10.1016/j.socec.2004.09.033.
9. HOEKSTRA Rink, Richard D MOREY, Jeffrey N ROUDER & Eric-Jan WAGENMAKERS (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, doi: 10.3758/s13423-013-0572-3.
10. KLINE Rex B (2004). Beyond significance testing: reforming data analysis methods in behavioral research. APA Books, ISBN 9781591471189.
11. NEYMAN Jerzy (1935). On the problem of confidence intervals. The Annals of Mathematical Statistics, doi: 10.1214/aoms/1177732585.
12. NEYMAN Jerzy & Egon Sharpe PEARSON (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, A: Mathematical, Physical and Engineering Sciences, doi: 10.1098/rsta.1933.0009.
13. PEREZGONZALEZ Jose D (2014). A reconceptualization of significance testing. Theory & Psychology, doi: 10.1177/0959354314546157.
14. PEREZGONZALEZ Jose D (2015). Confidence intervals and tests are two sides of the same research question. Frontiers in Psychology, doi: 10.3389/fpsyg.2015.00034.
15. ROSENTHAL Robert (1984). Meta-analytic procedures for social research. Sage, ISBN 0803920342.
16. TUKEY John W (1977). Exploratory data analysis. Pearson, ISBN 9780201076165.
Notes

17. This is an edited version of a commentary by Perezgonzalez (2015) on Cumming’s (2014) article “The new statistics: why and how”.

Jose D PEREZGONZALEZ (2015). Massey University, New Zealand. (JDPerezgonzalez).

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License