Tests of significance (intermediate)

Research validity and sensitiveness

An important aspect of significance testing is having a valid research design. Otherwise, a significance test is no more than a mathematical exercise. Fisher (1960 [5]) considers good design and, especially, randomisation as important cornerstones to ensure validity.

The sensitiveness of research, on the other hand, refers to the chances it has of finding a 'true' effect. "…We may say that the value of the experiment is increased whenever it permits the null hypothesis to be more readily disproved" (Fisher, 1960 [5], p.22).

Sensitiveness can be increased by way of:

  • increasing sample size (enlargement),
  • repeating the same experiment a number of times and aggregating the entire series of results (repetition),
  • randomising procedures and presentation (structural reorganisation), and
  • controlling confounding variables in order to allow the main effect to express itself more readily (refinements of technique).

Sensitiveness as power

We could use the 'power' tables elaborated by Cohen (1994), or those computed by statistical power programs, as a proxy for managing the sensitiveness of a test by way of enlargement (ie, increasing sample size). Although power is not very 'coherent' here, as Fisher's tests of significance do not have 'type II errors', both concepts, power and sensitiveness, are basically the same thing.

This is a summary of Cohen's power table for different tests. The table is for a sig=.05, pwr=.80, 2-tailed.

Table 1: Sample size per test & effect size*

                          expected effect size
test            index     small    medium   large
t-test          d         .20      .50      .80
                (n)       393      64       26
correlation     r         .10      .30      .50
                (n)       783      85       28
2 correlations  q         .10      .30      .50
                (n)       1573     177      66
sign test       g         .05      .15      .25
                (n)       783      85       30
proportions     h         .20      .50      .80
                (n)       392      63       25
χ²-test         w         .10      .30      .50
                (n)       964      107      39
F-test          f         .10      .25      .40
                (n)       322      52       21

(*Required sample size for sig=.05, pwr=.80, 2-tailed)
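
As a rough illustration of how enlargement relates to effect size, the t-test row of the table can be approximated with a normal-approximation formula. This is only a sketch: the formula, the function name approx_n_per_group and the reading of (n) as the size of each group are assumptions made here, not Cohen's exact procedure, so the results may differ from the tabled values by a unit or so.

```python
import math
from statistics import NormalDist


def approx_n_per_group(d, sig=0.05, power=0.80):
    """Approximate per-group n for a two-tailed comparison of two means.

    Assumption of this sketch: n = 2 * ((z_sig + z_power) / d) ** 2,
    a normal approximation rather than an exact t-based computation.
    """
    z_sig = NormalDist().inv_cdf(1 - sig / 2)  # two-tailed critical value
    z_power = NormalDist().inv_cdf(power)      # quantile for the desired power
    return math.ceil(2 * ((z_sig + z_power) / d) ** 2)


for label, d in [("small", 0.20), ("medium", 0.50), ("large", 0.80)]:
    print(label, d, approx_n_per_group(d))
```

The smaller the expected effect, the larger the sample needed to reach the same sensitiveness.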

The null hypothesis (Ho)

A null hypothesis does not need to be set up, as it can be assumed to be the default hypothesis when doing tests of significance. Setting it up, however, may add clarity to the research, especially regarding the type of test intended (mean differences, analysis of variance…) and the directionality of the test (one-tailed or two-tailed).

Advanced null hypotheses (rarely found in the research literature) are rather literal to the working of the statistical test, as well as to the level of significance and its directionality. They may also include an estimate of the size of the effect and the inference to be made from the results. Also, as they will be transformed into a nil hypothesis for mathematical testing, a literal null hypothesis could also reflect such nil characteristic when appropriate to the test.

Null hypotheses portrayed as instances of a theoretical distribution may be as follows:

  • Ho: This coin is an instance of a theoretical fair coin, having a 50/50 probability of landing heads on each toss.
    • Following a binomial distribution and a level of significance of 5%, we will reject the null hypothesis as likely if the probability of the observed data is smaller than the level of significance. We shall thus infer that the coin is a biased coin.
    • However, we will accept the null hypothesis as likely if the probability of the observed data is greater than the level of significance. We shall thus infer the coin is a fair coin.
  • Ho: The mean difference between the control and the treatment group is an instance of a theoretical 't' distribution.
    • Following such 't' distribution and a level of significance of 1%, we will reject the null hypothesis as likely if the probability of the observed data is smaller than the level of significance. We shall thus infer that any mean difference is due to the treatment provided.
    • However, we will accept the null hypothesis as true if the probability of the observed data is greater than the level of significance, thus inferring any mean difference is due to random fluctuation.
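
The coin example can be sketched in a few lines. The function name binomial_p_value, the two-tailed rule used (summing the probabilities of all outcomes no more likely than the observed one) and the figures of 60 and 61 heads in 100 tosses are illustrative assumptions, not part of the original example.

```python
import math


def binomial_p_value(heads, tosses, p_heads=0.5):
    """Two-tailed exact probability of data at least as extreme as observed.

    Assumption of this sketch: 'as extreme' means every outcome whose
    probability under Ho is no larger than that of the observed outcome.
    """
    pmf = [
        math.comb(tosses, k) * p_heads**k * (1 - p_heads) ** (tosses - k)
        for k in range(tosses + 1)
    ]
    observed = pmf[heads]
    # Small tolerance so that equally likely outcomes are not lost to rounding.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


sig = 0.05  # level of significance chosen by the researcher
for heads in (60, 61):
    p = binomial_p_value(heads, 100)
    decision = "reject Ho" if p <= sig else "do not reject Ho"
    print(heads, round(p, 4), decision)
```

With these hypothetical counts, 60 heads falls just short of significance at the 5% level while 61 heads crosses it.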

The level of significance

Conventional levels are 5% (ie, sig<.05, or a similar result occurring by chance less than 1 time in 20) or 1% (ie, sig<.01, or a similar result occurring by chance less than 1 time in 100). However, the level of significance can be any threshold considered appropriate (thus, it could be .02, .001, .0001, etc). If required, label such level of significance as 'significance', 'sig' or 's' (ie, sig<.05, s<.01), but avoid labeling it as 'p' (so as not to confuse it with 'p-values') or as 'alpha' or 'α' (so as not to confuse it with 'alpha' tolerance errors).

One-tailed or two-tailed tests

A one-tailed test assesses whether the observed results are either significantly higher or significantly lower than expected under the null hypothesis, but not both. A two-tailed test assesses both possibilities at once, dividing the total level of significance between the two tails, typically in half.
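
The difference between tails can be illustrated with a z statistic. This is a minimal sketch assuming a standard normal null distribution; the function name tail_p_values and the value z = 1.8 are hypothetical. Doubling the smaller tail probability is equivalent to halving the level of significance in each tail.

```python
from statistics import NormalDist


def tail_p_values(z):
    """One-tailed and two-tailed probabilities for a z statistic."""
    lower = NormalDist().cdf(z)          # probability of a result this low or lower
    upper = 1 - lower                    # probability of a result this high or higher
    two_tailed = 2 * min(lower, upper)   # both directions considered at once
    return {"lower": lower, "upper": upper, "two_tailed": two_tailed}


p = tail_p_values(1.8)
sig = 0.05
print("upper tail significant:", p["upper"] <= sig)
print("two-tailed significant:", p["two_tailed"] <= sig)
```

In this hypothetical case the same result is significant in a one-tailed test but not in a two-tailed test at the same level of significance.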

Probability of the data

It is recommended to use the exact probability of the data, that is, the 'p-value' (eg, p=.011, or p=.51). This exact probability is normally provided together with the pertinent test statistic (z, t, F…) either as 'p' or as 'sig' (significance) [9]. However, when necessary (eg, for reducing the size of a table), 'p-values' can also be reported in a categorical (eg, * for p<.05, ** for p<.01) or nominal manner (p<.05) [7].
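
The three reporting styles can be sketched as small helpers. The function names, the three-decimal rounding, the 'p<.001' floor and the 'p>=.05' label for non-significant nominal values are assumptions of this sketch, not conventions stated above.

```python
def exact_p(p):
    """Exact reporting, eg p=.011 (leading zero dropped, three decimals)."""
    if p < 0.001:
        return "p<.001"  # assumed floor, to avoid printing p=.000
    return "p=" + f"{p:.3f}".lstrip("0")


def categorical_p(p):
    """Categorical reporting: ** for p<.01, * for p<.05, nothing otherwise."""
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


def nominal_p(p):
    """Nominal reporting against the conventional thresholds."""
    if p < 0.01:
        return "p<.01"
    if p < 0.05:
        return "p<.05"
    return "p>=.05"  # hypothetical label for values above both thresholds


for p in (0.011, 0.51, 0.004):
    print(exact_p(p), categorical_p(p), nominal_p(p))
```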

Interpreting results

'P-values' can be interpreted as the probability of getting results at least as extreme as those observed under the null hypothesis, independently of whether they are statistically significant or not. For example, p=.033 literally means that 3.3 times in 100, or about 1 time in 30, we will obtain such results as random fluctuation under the null.

'P-values' are considered statistically significant if they are equal to or smaller than the chosen significance level. This is the actual test of significance. This test interprets those 'p-values' falling beyond the threshold as 'rare' enough to deserve attention: that is, either a rare random fluctuation (or rare event) which occurs about 1 time in 30 has happened, or the null hypothesis does not explain the observed results. Under the 'parameters' of a test of significance, thus, two other conclusions can be drawn, according to Fisher (1960 [5]):

  • 'p-values' can be interpreted as evidence against the null hypothesis, with smaller values representing stronger evidence.
  • the null hypothesis can be rejected as being true for the observed data.

Only under the 'parameters' of a test of significance are the above two interpretations somewhat reasonable. This is so because, literally, 'p-values' represent the probability of the observed data under the null hypothesis, and no probability (even an extreme one) can be used as evidence against what is already assumed true in a mathematical sense. The solution, of course, is to repeat the research or experiment to ascertain whether significant results are consistently obtained beyond what is expected (ie, more often than about 1 time in 30 trials).
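
To judge whether repeated experiments yield significant results 'beyond what is expected', it helps to know how often a true null hypothesis produces them. The sketch below computes that long-run rate exactly for a fair coin; the function names and the choice of 100 tosses are assumptions for illustration only.

```python
import math


def binomial_p_value(heads, tosses):
    """Two-tailed exact probability of the data under a fair coin."""
    counts = [math.comb(tosses, k) for k in range(tosses + 1)]
    # Outcomes no more likely than the observed one count as 'as extreme'.
    return sum(c for c in counts if c <= counts[heads]) / 2**tosses


def rate_of_significant_results(tosses, sig=0.05):
    """Proportion of experiments with a fair coin that end up significant."""
    significant = 0
    for heads in range(tosses + 1):
        if binomial_p_value(heads, tosses) <= sig:
            significant += math.comb(tosses, heads)
    return significant / 2**tosses


rate = rate_of_significant_results(100)
print(f"a fair coin yields significant results in {rate:.1%} of experiments")
```

Because coin tosses are discrete, the rate stays at or below the chosen level of significance; a series of replications turning out significant clearly more often than this is what would count against the null.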

A second 'paradox', unsolvable by mathematical means alone, is the rejection of the null hypothesis. Indeed, 'p-values' represent the probability of the observed data under the null hypothesis (ie, about 1 time in 30 you will get the same results) and there is no way that the calculus of probabilities resolves such 'paradox'. Cohen (1994 [1]) criticised Fisher's logic as following an invalid syllogism like this: "If the null hypothesis is correct, then (significant data) are highly unlikely. (Significant data) have occurred. Therefore, the null hypothesis is highly unlikely" (p.998). However, even Cohen missed the point that, literally, the null hypothesis is correct (no 'if' attached), that is "The null hypothesis is correct, and extreme data are highly unlikely. Extreme data have occurred. The null hypothesis is still correct." The difference between these is the qualifying words 'significant data', which for Fisher represented the test of significance and prompted a rational call for dividing the observed results between those which disproved the null hypothesis and those which did not (provided the experiment was also valid).

Although Fisher claimed that a test of significance is an inferential tool, not a deductive one, an attempt at a syllogism, following Cohen's (1994 [1]), may shed some light onto what a test of significance can and cannot do mathematically. Such syllogism would go like this: All fair coins have a 50/50 head-tail probability with extreme values in their tails (Ho). My coin has an extreme probability (significantly different from 50/50). Therefore, my coin either is an extreme case of a fair coin or is not a fair coin. Thus, the test is carried out on the particular case represented by "my coin" behaving as a hypothetical fair coin rather than on the null hypothesis itself. It is incapable of resolving the conundrum of whether the results are due to an extreme occurrence of a fair coin or to the normal behavior of a coin which is not fair. And it cannot be used to reject the null hypothesis because it has not tested it.

In brief, interpreting 'p-values' as evidence against the null and for disproving the null hypothesis does not make literal sense, but only some "rational" sense when following the 'parameters' set out by a test of significance: significance of extreme results and rejection of the null hypothesis in those cases.

To this, it is of interest to add a further insight by Cohen (1994 [1], p.998): "…what is always the real issue, is the probability that Ho is true, given the data, P(H0|D), the inverse probability. When one rejects Ho, one wants to conclude that Ho is unlikely, say, p < .01. The very reason the statistical test is done is to be able to reject Ho because of its unlikelihood! But that is the posterior probability, available only through Bayes's theorem…"
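
The inverse probability Cohen refers to needs ingredients that a test of significance does not use: a prior probability for Ho and an explicit alternative. The sketch below applies Bayes's theorem to the coin example under loudly hypothetical assumptions (a single alternative coin landing heads 60% of the time, and a prior of .5 for each hypothesis); none of these figures come from Cohen or Fisher.

```python
import math


def binomial_pmf(heads, tosses, p_heads):
    """Probability of exactly this number of heads, P(D|H)."""
    return math.comb(tosses, heads) * p_heads**heads * (1 - p_heads) ** (tosses - heads)


def posterior_h0(heads, tosses, p_alt=0.6, prior_h0=0.5):
    """P(H0|D) via Bayes's theorem, for a fair coin against one alternative.

    p_alt and prior_h0 are hypothetical values chosen for illustration.
    """
    like_h0 = binomial_pmf(heads, tosses, 0.5)    # P(D|H0)
    like_ha = binomial_pmf(heads, tosses, p_alt)  # P(D|HA)
    evidence = like_h0 * prior_h0 + like_ha * (1 - prior_h0)
    return like_h0 * prior_h0 / evidence


print(round(posterior_h0(60, 100), 3))  # posterior for Ho after 60 heads
print(round(posterior_h0(50, 100), 3))  # posterior for Ho after 50 heads
```

The posterior changes with the chosen prior and alternative, which is precisely the information a 'p-value' does not contain.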

Writing up results

Test statistics and associated exact 'p-values' can (and probably should) be reported as descriptive statistics, independently of whether they are statistically significant or not. When discussing the results and reaching conclusions, however, those results which are not statistically significant should be ignored. Significant results can be reported in different ways. Two "literal" [8] ways of reporting are the following:

  • As per Fisher (1959 [4]), significant results can be reported along the lines of "Either an exceptionally rare chance has occurred, or the theory of random distribution is not true" (p.39). Eg, "In regard to this correlation, either a rare chance has occurred (p = .01), or the hypothesis of nil correlation between these variables is not true."
  • As per Conover (1980 [2]), significant results can be reported along the lines of "Without the treatment I administered, experimental results as extreme as the ones I obtained would occur only about 3 times in 1000. Therefore, I conclude that my treatment has a definite effect" (p.2). Eg, "This correlation is so extreme that it would only occur about 1 time in 100 (p = .01). Thus, it could be concluded that there seems to be a significant correlation between these variables."

Closing comments

Tests of significance are a potentially self-defeating technology, as they are based on many assumptions and 'opposite' procedures. For example:

  • The main reason to do a test of significance (or research, for that matter) is to find out whether a treatment works or a correlation exists. This is the so called 'alternative hypothesis' in significance testing. However, this hypothesis is never tested, only the 'null hypothesis' that those effects do not actually exist. Thus, this 'alternative' hypothesis can only be inferred by negating the null hypothesis.
  • Yet the null hypothesis is not actually tested either. It is assumed to be true, so that the probability of the observed data under the null hypothesis can be calculated. Thus, if the null hypothesis is not being tested, it cannot be negated, properly speaking (therefore, nothing can be inferred about the 'alternative' hypothesis either).
  • However, we have discussed that there are two ways of interpreting the null hypothesis: as a theoretical distribution which is put to test, or as a hypothesis about the data behaving as an instance of such theoretical distribution. The former cannot be negated, but the latter, which 'tests' for likelihood, can.
  • Developing the latter further then, significance testing rests on a decision about the likelihood of the data under the distribution of the null hypothesis. A significant result is one with a probability smaller than the threshold the researcher has deemed significant. Thus, a test of significance only tests the likelihood of the observed data against a theoretical distribution, and suggests a decision regarding their statistical significance.
  • From such decision about statistical significance, a correlative decision to reject or not the null hypothesis is made, for example a decision to reject the null hypothesis of likelihood when data are significant. Literally, what is rejected or not is the likelihood of the data behaving as expected under a particular test (eg, that the correlation between two variables is 'zero' or that any difference between treatments is 'zero').
  • Once a decision about the null hypothesis is made, an inference about any effects on the research proper can be elaborated. For example, the typical inference after rejecting a null hypothesis of statistical likelihood is that there is a difference in treatment or a correlation between variables.
  • Yet, in order to find out what this difference is (effect size), sometimes even its direction (eg, which group is better), the corresponding statistics need to be calculated out of the test itself (eg, getting the means and medians independently). Thus, a result may be statistically significant but not really important, because the effect size may be minimal.
  • Furthermore, if the research design and the sensitivity of the test are not good, then the lack of validity and sensitivity of the test may render any conclusions silly.
  • All in all, a test of significance is quite an indirect procedure for "proving" research effects. And the researcher had better be well aware of the whole process in order to make adequate inferences.
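
The point about effect size in the list above can be made concrete: with a large enough sample, a minimal difference becomes statistically significant. This is a sketch under stated assumptions (two groups of equal size, a known common standard deviation so that a z test applies, and invented means of 100.3 and 100.0); the function name is hypothetical.

```python
import math
from statistics import NormalDist


def two_sample_z(mean_a, mean_b, sd, n_per_group):
    """Effect size d, z statistic and two-tailed p for two group means.

    Assumes equal group sizes and a known common standard deviation.
    """
    d = (mean_a - mean_b) / sd           # standardised mean difference
    z = d * math.sqrt(n_per_group / 2)   # difference divided by its standard error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return d, z, p


d, z, p = two_sample_z(100.3, 100.0, sd=15.0, n_per_group=50_000)
print(f"d={d:.2f}, z={z:.2f}, p={p:.4f}")
```

Here the result is significant at the 1% level, yet the effect is a tenth of what the power table treats as 'small' (d=.20), so the test alone says little about its importance.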

References

1. COHEN Jacob (1994). The Earth is round (p < .05). American Psychologist (ISSN 0003-066X), 1994, volume 49, number 12, pages 997-1003.
2. CONOVER WJ (1980). Practical nonparametric statistics (2nd ed). John Wiley & Sons (New York, USA), 1980. ISBN 0471084573.
3. FISHER Ronald A (1954). Statistical methods for research workers (12th ed). Oliver and Boyd (Edinburgh, UK), 1954.
4. FISHER Ronald A (1959). Statistical methods and scientific inference (2nd ed). Oliver and Boyd (Edinburgh, UK), 1959.
5. FISHER Ronald A (1960). The design of experiments (7th ed). Oliver and Boyd (Edinburgh, UK), 1960.

Footnotes

6. It cannot say anything at all about the probability of an alternative hypothesis either [P(HA|D)]. This feature is interesting because even Fisher used to infer about the hypothesis (either the stated null or the unstated alternative hypothesis) given the data, contradicting how the test actually works in practice.
7. Make sure you understand the difference between a 'p-value' expressed categorically or nominally and a level of significance. A 'p-value' such as 'p<.05' or '*' indicates that the observed result has a 'p-value' smaller than .05, independently of whether such result is interpreted as statistically significant or not. It also allows for using 'p<.01' or '**' as a more extreme 'p-value' category, thus, as stronger evidence against the null hypothesis. In contrast, 'sig<.05' only indicates that you will consider all 'p-values' smaller than .05 as statistically significant. As it is easy to confuse these two concepts, try to express 'p-values' as exact probabilities, and avoid using categorical or nominal expressions for 'p-values' as much as possible. Also, beware that 'sig' is commonly confused with, and thus written as, 'p' or 'α' in research articles and statistics books alike, while 'p' is written as 'sig' in SPSS outputs.
8. "Literal" means literal to the philosophical and mathematical underpinnings of Fisher's tests of significance. Such interpretation is rarely found in the literature, and was not used extensively even by Fisher or Conover. Indeed, Fisher was more than happy to interpret significant results as evidence of the existence of an effect rather than merely as evidence against the null hypothesis. Yet an "honest" or literal interpretation, albeit somewhat cumbersome, makes clear the working of the test: the location of the observed data within the theoretical distribution of the null hypothesis, resulting in the probability of that data assuming the null hypothesis is correct.
9. This reflects the current level of confusion in statistics books, in general. SPSS, for example, will provide exact 'p-values' under the label 'sig'. Properly speaking, what it provides is the 'p-value' of the observed data on the null distribution. 'Sig' or 'significance' is a conventional 'p-value' that the researcher will use as a cut-off point for interpreting the observed 'p-values' as either statistically significant or not (eg, sig<.05 => all p<.05 will be deemed significant); thus, it is neither calculated statistically nor dependent on mathematical formulas.

Want to know more?

Neyman-Pearson's hypotheses testing
This page deals with Neyman-Pearson's testing procedure, based on testing two competing hypotheses against each other.
