In this article we flesh out more advanced concepts regarding tests of significance. Before doing so, however, an important issue to address is what the null hypothesis (Ho) stands for.
Which null hypothesis?
The null hypothesis poses an interesting issue in the technology of tests of significance, for there seems to be a conceptual conflict about how to interpret such a hypothesis: either it represents a theoretical distribution or it represents a particular case being tested against a theoretical distribution.
Ho as a theoretical distribution
According to Fisher, the null hypothesis represents an exact hypothesis about an imaginary distribution in an infinite population ("the idea of an infinite population distributed in a frequency distribution", Fisher, 1954, p.41). For example, if this imaginary distribution were normal (a 'z' distribution), we could define it as a Gaussian distribution with mean = 0, standard deviation = 1, and an infinite population. Thus, a perfectly shaped, text-book type, theoretical distribution.
Such a null hypothesis thus also implies the theoretical distribution for the calculus of the probability of the data (as "it must supply the basis of the 'problem of distribution', of which the test of significance is the solution", Fisher, 1960, p.16). Given that such a distribution is theoretical, it can be considered well established and accepted in probability theory (eg, that a [theoretical] fair coin has 50/50 chances of landing heads; that each side of a [theoretical] fair die has a 1/6 chance of appearing on each toss; or that [theoretical] normal populations will conform to a Gaussian curve). For cases such as 'z', 't' and 'F' distributions, the population variance, estimated from the sample variance, and the assumptions of a nil hypothesis and an infinite population, serve to establish this theoretical distribution (ie, the distribution is certainly theoretical but the population variance is unknown until estimated from the sample). Once the theoretical distribution is known, the calculus of probabilities is certain under such a theoretical distribution.
This being so, a null hypothesis cannot possibly be tested or disproved: it is theoretically sound and mathematically proved. The probabilistic notation P(D|Ho) (ie, the probability of the data given that the null hypothesis is true) captures well the concept of a null hypothesis about a theoretical distribution which provides the mathematical space for the calculation of the probability of the observed data. P-values simply represent the probability of the data under the null hypothesis (a descriptive statistic), hardly a value with which to claim significance (and, thus, reject the null hypothesis), and never an error term. They cannot possibly inform anything at all about the probability of the null hypothesis given the data [P(Ho|D)] either.
Therefore, under this interpretation, it is illogical to talk about null hypothesis testing, as the null hypothesis is never tested; only the probability of the data under the null hypothesis is calculated. It is also inappropriate to talk about proving or establishing the null hypothesis ("it should be noted that the null hypothesis is never proved or established", Fisher, 1960, p.16), simply because it is already proved or established (and because the hypothesis itself is not being tested, anyway). And it is inappropriate to talk about disproving or rejecting the null hypothesis because, again, it refers to a true and sound theoretical distribution (and because this hypothesis is not being tested, nor does an 'alternative' hypothesis exist).
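As a minimal sketch of what calculating P(D|Ho) could look like for the fair-coin case mentioned above, the following computes the probability of data at least as improbable as those observed under a theoretical binomial distribution with a 50/50 chance of heads. The function name `p_value_fair_coin` and the two-sided "no more probable than the observed outcome" convention are assumptions made for illustration, not something prescribed by Fisher.

```python
from math import comb


def p_value_fair_coin(heads: int, tosses: int) -> float:
    """Two-sided P(D|Ho) for a theoretical fair coin (illustrative helper)."""
    # Probability of each possible number of heads under Ho (chance of heads = 0.5).
    probs = [comb(tosses, k) * 0.5 ** tosses for k in range(tosses + 1)]
    observed = probs[heads]
    # Sum every outcome that is no more probable than the observed one.
    # The small tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


if __name__ == "__main__":
    # Hypothetical data: 60 heads in 100 tosses.
    print(p_value_fair_coin(60, 100))
```

Note that nothing in this calculation touches the null hypothesis itself: the fair-coin distribution is simply assumed, and only the probability of the data under it is obtained.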
Under this conceptualization, the typical interpretation of 'p-values' as statistically significant and as evidence against the null is a 'paradox', acceptable only by suspending logic and accepting an obscure rationalization such as levels of significance. That is, only under the 'parameters' of a test of significance are these two interpretations somewhat reasonable. This is so because, literally, 'p-values' represent the probability of the observed data under the null hypothesis, and no probability (even an extreme one) can be used as evidence against what is already true in a mathematical sense. A solution, however, may be to repeat the research or experiment to ascertain whether significant results are consistently obtained more often than expected (eg, for p = .033, more often than about 1 time in 30 trials).
A second 'paradox', unsolvable by mathematical means alone, is the rejection of the null hypothesis. Indeed, 'p-values' represent the probability of the observed data under the null hypothesis (eg, for p = .033, about 1 time in 30 you will get the same results) and there is no way that the calculus of probabilities resolves such a 'paradox'. Cohen (1994) criticised Fisher's logic as following an invalid syllogism like this: "If the null hypothesis is correct, then [significant data] are highly unlikely. [Significant data] have occurred. Therefore, the null hypothesis is highly unlikely" (p.998). However, even Cohen missed the point that, literally, the null hypothesis is correct (no 'if' attached), that is "The null hypothesis is correct, and extreme data are highly unlikely. Extreme data have occurred. The null hypothesis is still correct." The difference between these is the qualifying words 'significant data', which for Fisher represents the test of significance and prompts a rational call for dividing the observed results between those which "disproved" the null hypothesis and those which did not (provided the experiment was also valid).
In brief, interpreting 'p-values' as evidence against the null and for disproving the null hypothesis does not make literal sense, but could make some "rational" sense when following the 'parameters' set out by a test of significance: significance of extreme results and rejection of the null hypothesis in those cases.
Ho as a case to test
Most often, however, the null hypothesis is treated as a hypothesis about the observed data: that the data will fit the appropriate theoretical distribution (such as those discussed above, or, put somewhat differently, that the data "come" from an infinite population with such a distribution). That is, the null hypothesis represents the hypothesis that 'my coin' is an instance of a theoretical fair coin, or that any mean difference shown between 'my treatment group' and 'my control group' is an instance of a theoretical normal distribution with mean = 0, standard deviation = 1, and an infinite population.
Under this conceptualization, the test distribution is identical to the corresponding theoretical distribution discussed earlier. The main difference is that what is put to test is the null hypothesis that the observed data behave as (or pertain to) such a theoretical distribution. Such a null hypothesis can be disproved ("every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis", Fisher, 1960, p.16), but can also be accepted or proved (something that Fisher denied), for what is being disproved or proved is not the theoretical distribution itself but the (null) hypothesis that the instance of 'my coin' or 'my treatment group' behaves like (or pertains to) that theoretical distribution.
The probabilistic notation P(D|Ho) (ie, the probability of the data given that the null hypothesis is true) is not coherent under this second conceptualization, simply because the null hypothesis is now conditional. It needs to be substituted with the conditional notation P(D|if Ho) (ie, the probability of the data assuming that the null hypothesis is true). A test of significance is then a good tool for assessing the level of likelihood with the theoretical distribution, with the level of significance serving as a reasonable threshold for deciding about such likelihood. Non-significance can be interpreted as high likelihood: the observed data are said to be an instance of the theoretical distribution and the null hypothesis is not rejected (ie, it is a fair coin, or the treatment makes no difference). Larger p-values can be counted as stronger evidence in favor of the null hypothesis. On the other hand, significance can be interpreted as low likelihood, an indication that 'my coin' or 'my treatment' are not likely instances of the corresponding theoretical distribution, and the null hypothesis can be rejected. Smaller p-values can be counted as stronger evidence against the null hypothesis. (In both cases, however, the validity and sensitivity of the research are paramount, as only valid and sufficiently sensitive designs can substantiate a decision to accept or reject the null hypothesis.)
Notice that any decision to accept or reject the null is made by way of assessing association by likelihood. That is, it is inferred that the null hypothesis may be true because of its high likelihood to a theoretical distribution, while it is inferred that it may be false because of its low likelihood to the same theoretical distribution. The null hypothesis proper is not tested. The probability of the data is still a probability under the theoretical distribution, and "the 'one chance in a million' will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us" (Fisher, 1960, p.13-14).
The next step is inferring any real effect, such as a coin being fair or biased, or a treatment being effective. This inference may be made based on the significance of the data but goes beyond the scope of statistics. That is, a significant result may prompt a decision to reject the null hypothesis of likelihood with the theoretical distribution of a fair coin. That 'my coin' is, indeed, biased may be inferred from such a decision, but it is just an inference, not a mathematical proof. Or a non-significant result may prompt the decision to retain the null hypothesis of likelihood with a normal distribution with mean = 0. The inference that 'my treatment' is not effective, however, may rest on such results but is not a mathematical proof.
Under this conceptualization, it becomes somewhat appropriate, even relevant, to talk about null hypothesis testing (understanding it as the testing of the likelihood that 'my coin' behaves like a theoretical fair coin). In fact, this is the only hypothesis put to test, even if the purpose is to 'prove' it wrong. Notice that this second conceptualization ends up in a decision, and is thus quite close to Neyman-Pearson's procedure, except that only one hypothesis (the null) is ever tested. (There is also a risk of making the error of rejecting a null hypothesis which is indeed true. But such an error cannot be conceptualized simply as the probability of the data. For that, a Neyman-Pearson frequentist approach is also needed (ie, multiple testing), not an ad hoc experiment as proposed by Fisher.)
Cohen's (1994) syllogism makes more sense under this conceptualization, as it makes the null hypothesis conditional: "If the null hypothesis is correct, then (significant data) are highly unlikely. (Significant data) have occurred. Therefore, the null hypothesis is highly unlikely" (p.998). And yet, Cohen still creates a bad syllogism even for this case. The test of significance is not about "if the null hypothesis is correct" but about "if the data is unlikely". The corresponding syllogism (and its consequent negation, the so-called "alternative" hypothesis) would go somewhat as follows:
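Fisher's remark that rare events occur "with no less and no more than [their] appropriate frequency" can be made concrete with a small sketch. Assuming a theoretical fair coin and a two-sided binomial calculation (both the function name and the two-sided convention are illustrative assumptions), the code below adds up the probability of every outcome that would be declared significant, that is, how often a coin that really is fair would nonetheless produce significant data.

```python
from math import comb


def significant_rate_under_null(tosses: int, sig: float) -> float:
    """Exact probability that a theoretical fair coin yields a 'significant' result."""
    probs = [comb(tosses, k) * 0.5 ** tosses for k in range(tosses + 1)]

    def p_value(heads: int) -> float:
        # Two-sided probability of outcomes no more probable than the observed one.
        return sum(p for p in probs if p <= probs[heads] * (1 + 1e-9))

    # Add up the probability of every outcome that would be declared significant.
    return sum(probs[k] for k in range(tosses + 1) if p_value(k) <= sig)


if __name__ == "__main__":
    # Hypothetical design: 100 tosses, level of significance of 5%.
    print(significant_rate_under_null(100, 0.05))
```

Because the binomial distribution is discrete, the resulting rate cannot exceed the chosen level of significance, and it is usually somewhat below it.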
- The null hypothesis of likelihood is false (or unlikely), if the observed probability is small (significant). The probability is significant. Then the null hypothesis is false (or unlikely).
- The null hypothesis of likelihood is correct (or likely), if the observed probability is large (not significant). The probability is not significant. Then the null hypothesis is correct (or likely).
Although Fisher claimed that a test of significance is an inferential tool, not a deductive one, an attempt at a syllogism, following Cohen's (1994), may shed some light onto what a test of significance can and cannot do mathematically under this conceptualization of the null. Such a syllogism would go like this:
- My coin can be considered an unfair coin if it turns up a heads-tails probability as extreme as those obtained 5% of the time (or less so) after a number of tosses. My coin turned up an extreme probability (significantly different from 50/50). Therefore, my coin can be considered an unfair coin (in this trial, at least).
The alternative hypothesis (Ha)
Alternative hypotheses are another interesting issue when dealing with tests of significance, mainly because they are not necessary and, in principle, perpetuate the confusion between Fisher's and Neyman-Pearson's approaches to hypothesis testing.
Fisher's tests of significance do not "require" an alternative hypothesis (Ha) because this is never tested. If the Ho is taken as a theoretical distribution, then alternative hypotheses are not necessary: p-values simply represent the probability of data under such theoretical distribution.
If the Ho is a hypothesis about particular cases, then an alternative hypothesis, even when not tested, is implicit. After all, what is the point of doing experimentation if all we are interested in is knowing that there is no difference in treatments? Under this interpretation, then, a test of significance is a 'proxy' to reject the Ho in favor of an unstated alternative hypothesis that there is a difference in treatment.
And here comes another conflict, for the alternative hypothesis is never tested, only arrived at by denying the null with some probability. That is, we cannot even 'accept' the alternative hypothesis of a difference in treatment because the only thing we know is that we have rejected the null hypothesis of no difference in treatment. To put it in context, if we reject the Ho that a ball is white, the only thing we have learned is that it is (possibly) not white (but nothing as to whether it is grey or pale blue or red).
Furthermore, even if implicit, the alternative hypothesis is unnecessary because it is only the negation of the null. But it is often confused with Neyman-Pearson's "alternative" hypothesis and its associated type-II error.
All in all, stating alternative hypotheses when doing tests of significance is but a forgivable 'sin'. It is unnecessary and may perpetuate conceptual confusions in regard to different statistical approaches, but it also makes explicit the purpose of research (thus the purpose of doing tests of significance in the first place) and may make it easier to clarify the directionality of results and, possibly, effect sizes.
Tests of significance (advanced)
Research validity and sensitiveness
An important aspect of significance testing is to be able to have a valid research design. Otherwise, a significance test is no more than a mathematical exercise. Fisher (1960) considers good design and, especially, randomisation as important cornerstones to ensure validity.
The sensitiveness of research, on the other hand, refers to the chances it has of finding a 'true' effect. "…We may say that the value of the experiment is increased whenever it permits the null hypothesis to be more readily disproved" (Fisher, 1960, p.22).
Sensitiveness can be increased by way of:
- increasing sample size (enlargement),
- repeating the same experiment a number of times and aggregating the entire series of results (repetition),
- randomising procedures and presentation (structural reorganisation), and
- controlling confounding variables in order to allow the main effect to express itself more readily (refinements of technique).
Sensitiveness as power
We could use the 'power' tables elaborated by Cohen (1994), or those computed by programs on statistical power, as a proxy to manage the sensitiveness of a test by way of enlargement (ie, increasing sample size). Although power is not very 'coherent' here, as Fisher's tests of significance do not have 'type II errors', both concepts, power and sensitiveness, are basically the same thing.
This is a summary of Cohen's power table for different tests. The table is for a sig=.05, pwr=.80, 2-tailed.
Table 1: Sample size per test & effect size*
| test | index | expected effect size |
(*Required sample size for sig=.05, pwr=.80, 2-tailed)
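For readers who want to see how sample sizes of this kind can be approximated, here is a rough sketch for one case only: a 2-tailed comparison of two group means. It uses the common normal approximation rather than Cohen's own tables, and it assumes the expected effect size is expressed as a standardised mean difference; the function name `n_per_group` is hypothetical. Tables based on the 't' distribution tend to give slightly larger values.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size: float, sig: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group for a 2-tailed comparison of two means."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    z_sig = z.inv_cdf(1 - sig / 2)  # critical value for a 2-tailed test
    z_pwr = z.inv_cdf(power)        # quantile matching the desired power
    # Normal approximation (an assumption, not Cohen's exact method).
    return ceil(2 * ((z_sig + z_pwr) / effect_size) ** 2)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        print(d, n_per_group(d))
```

The sketch makes the logic of enlargement visible: halving the expected effect size roughly quadruples the sample needed for the same sensitiveness.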
The null hypothesis (Ho)
A null hypothesis does not need to be set up, as it can be assumed to be the default hypothesis when doing tests of significance. Setting it up, however, may add clarity to the research, especially regarding the type of test intended (means differences, analysis of variance…) and the directionality of the test (one-tailed or two-tailed).
Advanced null hypotheses (rarely found in the research literature) are rather literal about the workings of the statistical test, as well as about the level of significance and its directionality. They may also include an estimate of the size of the effect and the inference to be made from results. Also, as they will be transformed into a nil hypothesis for mathematical testing, a literal null hypothesis could also reflect such a nil characteristic when appropriate to the test.
Null hypotheses portrayed as instances of a theoretical distribution may be as follows:
- Ho: This coin is an instance of a theoretical fair coin, having a 50/50 probability of landing heads on each toss.
  - Following a binomial distribution and a level of significance of 5%, we will reject the null hypothesis as unlikely if the probability of the observed data is smaller than the level of significance. We shall thus infer that the coin is a biased coin.
  - However, we will accept the null hypothesis as likely if the probability of the observed data is greater than the level of significance. We shall thus infer that the coin is a fair coin.
- Ho: The mean difference between the control and the treatment group is an instance of a theoretical 't' distribution.
  - Following such a 't' distribution and a level of significance of 1%, we will reject the null hypothesis as unlikely if the probability of the observed data is smaller than the level of significance. We shall thus infer that any mean difference is due to the treatment provided.
  - However, we will accept the null hypothesis as likely if the probability of the observed data is greater than the level of significance, thus inferring that any mean difference is due to random fluctuation.
The level of significance
Conventional levels are 5% (ie, sig<.05, or a similar result occurring by chance less than 1 time in 20) or 1% (ie, sig<.01, or a similar result occurring by chance less than 1 time in 100). However, the level of significance can be any threshold considered appropriate (thus, it could be .02, .001, .0001, etc). If required, label such a level of significance as 'significance', 'sig' or 's' (eg, sig<.05, s<.01), but avoid labeling it as 'p' (so as not to confuse it with 'p-values') or as 'alpha' or 'α' (so as not to confuse it with 'alpha' tolerance errors).
One-tailed or two-tailed tests
A one-tailed test assesses whether the observed results are either significantly higher or significantly lower than expected under the null hypothesis, but not both. A two-tailed test assesses both possibilities at once, dividing the total level of significance between the two tails, typically in half.
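The difference between the two kinds of test can be sketched for a 'z' statistic under a standard normal distribution. This is only an illustration under that assumption; the function name and the dictionary keys are hypothetical.

```python
from statistics import NormalDist


def tail_probabilities(z_score: float) -> dict:
    """One- and two-tailed probabilities of a z statistic under a standard normal."""
    cdf = NormalDist().cdf(z_score)
    upper = 1 - cdf                     # probability of a result this high or higher
    lower = cdf                         # probability of a result this low or lower
    two_tailed = 2 * min(upper, lower)  # both directions at once
    return {"upper": upper, "lower": lower, "two_tailed": two_tailed}


if __name__ == "__main__":
    print(tail_probabilities(1.8))
```

For instance, a z of 1.8 has an upper-tail probability of about .036 but a two-tailed probability of about .072, so the same result would be significant at the 5% level in a one-tailed test and not in a two-tailed one.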
Probability of the data
It is recommended to use the exact probability of the data, that is the 'p-value' (eg, p=.011, or p=.51). This exact probability is normally provided together with the pertinent test statistic (z, t, F…) either as 'p' or as 'sig' (significance). However, when necessary (eg, for reducing the size of a table), 'p-values' can also be reported in a categorical (eg, * for p<.05, ** for p<.01) or nominal manner (p<.05).
'P-values' can be interpreted as the probability of getting those results under the null hypothesis, independently of whether they are statistically significant or not. For example, p=.033 literally means that 3.3 times in 100, or about 1 time in 30, we will obtain the same results through random fluctuation under the null.
'P-values' are considered statistically significant if they are equal to or smaller than the chosen significance level. This is the actual test of significance. This test interprets those 'p-values' falling beyond the threshold as 'rare' enough to deserve attention: that is, either a rare random fluctuation (or rare event) has happened (eg, one which occurs about 1 time in 30 when p = .033), or the null hypothesis does not explain the observed results. Under the 'parameters' of a test of significance, thus, two other conclusions can be drawn, according to Fisher (1960):
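The three ways of reporting just described can be sketched as a small formatting helper. The function `report_p`, its `style` argument and the 'ns' label for non-significant results are all hypothetical names chosen for this illustration; the thresholds are the conventional 5% and 1% levels.

```python
def report_p(p: float, style: str = "exact") -> str:
    """Format a p-value as exact ('p=.011'), categorical ('*') or nominal ('p<.05')."""
    if style == "exact":
        # Three decimals, without the leading zero.
        return "p=" + f"{p:.3f}".replace("0.", ".", 1)
    if style == "categorical":
        return "**" if p < 0.01 else "*" if p < 0.05 else ""
    if style == "nominal":
        for level in (0.01, 0.05):
            if p < level:
                return "p<" + f"{level:.2f}".replace("0.", ".", 1)
        return "ns"  # hypothetical label for non-significant results
    raise ValueError(f"unknown style: {style}")


if __name__ == "__main__":
    for style in ("exact", "categorical", "nominal"):
        print(style, report_p(0.011, style))
```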
- 'p-values' can be interpreted as evidence against the null hypothesis, with smaller values representing stronger evidence.
- the null hypothesis can be rejected as being true for the observed data.
Only under the 'parameters' of a test of significance are the above two interpretations somewhat reasonable. This is so because, literally, 'p-values' represent the probability of the observed data under the null hypothesis, and no probability (even an extreme one) can be used as evidence against what is already true in a mathematical sense. The solution, of course, is to repeat the research or experiment to ascertain whether significant results are consistently obtained more often than expected (eg, for p = .033, more often than about 1 time in 30 trials).
A second 'paradox', unsolvable by mathematical means alone, is the rejection of the null hypothesis. Indeed, 'p-values' represent the probability of the observed data under the null hypothesis (eg, for p = .033, about 1 time in 30 you will get the same results) and there is no way that the calculus of probabilities resolves such a 'paradox'. Cohen (1994) criticised Fisher's logic as following an invalid syllogism like this: "If the null hypothesis is correct, then (significant data) are highly unlikely. (Significant data) have occurred. Therefore, the null hypothesis is highly unlikely" (p.998). However, even Cohen missed the point that, literally, the null hypothesis is correct (no 'if' attached), that is "The null hypothesis is correct, and extreme data are highly unlikely. Extreme data have occurred. The null hypothesis is still correct." The difference between these is the qualifying words 'significant data', which for Fisher represented the test of significance and prompted a rational call for dividing the observed results between those which disproved the null hypothesis and those which did not (provided the experiment was also valid).
Although Fisher claimed that a test of significance is an inferential tool, not a deductive one, an attempt at a syllogism, following Cohen's (1994), may shed some light onto what a test of significance can and cannot do mathematically. Such a syllogism would go like this: All fair coins have a 50/50 head-tail probability with extreme values in their tails (Ho). My coin has an extreme probability (significantly different from 50/50). Therefore, my coin either is an extreme case of a fair coin or is not a fair coin. Thus, the test is carried out on the particular case represented by "my coin" behaving as a hypothetical fair coin rather than on the null hypothesis itself; it is incapable of resolving the conundrum of whether the results are due to an extreme occurrence of a fair coin or to the normal behavior of a coin which is not fair; and it cannot be used to reject the null hypothesis because it has not tested it.
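The idea of repeating the experiment can be sketched numerically. Assuming independent repetitions, each of which would be significant with some fixed probability under the null (the `rate` argument, a hypothetical name), the binomial distribution gives the probability of seeing a given number of significant results or more by random fluctuation alone.

```python
from math import comb


def prob_at_least(k: int, trials: int, rate: float) -> float:
    """Probability of k or more significant results in `trials` repetitions
    when each repetition is significant with probability `rate` under the null."""
    return sum(
        comb(trials, i) * rate ** i * (1 - rate) ** (trials - i)
        for i in range(k, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical series: 3 significant results out of 10 repetitions, rate .033.
    print(prob_at_least(3, 10, 0.033))
```

A series of repetitions in which significant results keep turning up is thus itself data whose probability under the null can be calculated.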
In brief, interpreting 'p-values' as evidence against the null and for disproving the null hypothesis does not make literal sense, but only some "rational" sense when following the 'parameters' set out by a test of significance: significance of extreme results and rejection of the null hypothesis in those cases.
To this, it is of interest to add a further insight by Cohen (1994, p.998): "…what is always the real issue, is the probability that Ho is true, given the data, P(H0|D), the inverse probability. When one rejects Ho, one wants to conclude that Ho is unlikely, say, p < .01. The very reason the statistical test is done is to be able to reject Ho because of its unlikelihood! But that is the posterior probability, available only through Bayes's theorem…"
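To illustrate why P(Ho|D) cannot be read off P(D|Ho), here is a minimal sketch of Bayes's theorem for the simplest case of two exhaustive hypotheses. The function name, the single competing hypothesis H1, and all the input values are hypothetical: a prior for Ho and a probability of the data under H1 are exactly the ingredients a test of significance does not supply.

```python
def posterior_null(p_data_given_h0: float, p_data_given_h1: float, prior_h0: float) -> float:
    """P(Ho|D) via Bayes's theorem for two exhaustive hypotheses (illustrative)."""
    prior_h1 = 1 - prior_h0
    # Total probability of the data across both hypotheses.
    evidence = p_data_given_h0 * prior_h0 + p_data_given_h1 * prior_h1
    return p_data_given_h0 * prior_h0 / evidence


if __name__ == "__main__":
    # Same P(D|Ho) = .05, two different assumptions about the competing hypothesis.
    print(posterior_null(0.05, 0.05, 0.5))
    print(posterior_null(0.05, 0.95, 0.5))
```

With the same P(D|Ho) of .05 and an even prior, the posterior stays at .5 when the data are equally improbable under the competing hypothesis, and drops to .05 only when the data are highly probable under it.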
Writing up results
Test statistics and associated exact 'p-values' can (and probably should) be reported as descriptive statistics, independently of whether they are statistically significant or not. When discussing the results and reaching conclusions, however, those results which are not statistically significant should be ignored. Significant results can be reported in different ways. Two "literal" ways of reporting are the following:
- As per Fisher (1959), significant results can be reported along the lines of "Either an exceptionally rare chance has occurred, or the theory of random distribution is not true" (p.39). Eg, "In regard to this correlation, either a rare chance has occurred (p = .01), or the hypothesis of nil correlation between these variables is not true."
- As per Conover (1980), significant results can be reported along the lines of "Without the treatment I administered, experimental results as extreme as the ones I obtained would occur only about 3 times in 1000. Therefore, I conclude that my treatment has a definite effect" (p.2). Eg, "This correlation is so extreme that it would only occur about 1 time in 100 (p = .01). Thus, it could be concluded that there seems to be a significant correlation between these variables."
Tests of significance are a potentially self-defeating technology, as they are based on many assumptions and 'opposite' procedures. For example:
- The main reason to do a test of significance (or research, for that matter) is to find out whether a treatment works or a correlation exists. This is the so-called 'alternative hypothesis' in significance testing. However, this hypothesis is never tested, only the 'null hypothesis' that those effects do not actually exist. Thus, this 'alternative' hypothesis can only be inferred by negating the null hypothesis.
- Yet the null hypothesis is not actually tested either. It is assumed to be true, so that the probability of the observed data under the null hypothesis can be calculated. Thus, if the null hypothesis is not being tested, it cannot be negated, properly speaking (therefore, nothing can be inferred about the 'alternative' hypothesis either).
- However, we have discussed that there are two ways of interpreting the null hypothesis: as a theoretical distribution against which the data are put to test, or as a hypothesis about the data behaving as an instance of such a theoretical distribution. The former cannot be negated, but the latter, which 'tests' for likelihood, can.
- Developing the latter further, then, significance testing rests on a decision about the likelihood of the data to the distribution under the null hypothesis. A significant result is one with a probability smaller than the threshold the researcher has deemed significant. Thus, a test of significance only tests the likelihood of the observed data against a theoretical distribution, and suggests a decision regarding their statistical significance.
- From such decision about statistical significance, a correlative decision to reject or not the null hypothesis is made, for example a decision to reject the null hypothesis of likelihood when data are significant. Literally, what is rejected or not is the likelihood of the data behaving as expected under a particular test (eg, that the correlation between two variables is 'zero' or that any difference between treatments is 'zero').
- Once a decision about the null hypothesis is made, an inference about any effects on the research proper can be elaborated. For example, the typical inference after rejecting a null hypothesis of statistical likelihood is that there is a difference in treatment or a correlation between variables.
- Yet, in order to find out what this difference is (effect size), sometimes even its direction (eg, which group is better), the corresponding statistics need to be calculated separately from the test itself (eg, getting the means and medians independently). Thus, a result may be statistically significant but not really important, because the effect size may be minimal.
- Furthermore, if the research design and the sensitivity of the test are not good, then the lack of validity and sensitivity of the test may render any conclusions silly.
- All in all, a test of significance is quite an indirect procedure for "proving" research effects. And the researcher had better be well aware of the whole process in order to make adequate inferences.
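The point made in the list above about effect sizes having to be calculated separately from the test can be sketched as follows. The standardised mean difference with a pooled standard deviation (often called Cohen's d) is used here as one possible effect-size index; that choice, the function name and the example data are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment: list, control: list) -> float:
    """Standardised mean difference between two groups, using the pooled SD."""
    n1, n2 = len(treatment), len(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Hypothetical scores for a treatment group and a control group.
    print(cohens_d([2, 3, 4], [1, 2, 3]))
```

The sign of the result gives the direction of the difference (positive when the treatment group scores higher) and its magnitude gives the size, neither of which is conveyed by the 'p-value' alone.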