Fisher's significance testing
About tests of significance
For Fisher, tests of significance are statistical tools for learning from research, provided that the research design is sound. Each test of significance is self-contained in the sense that a particular lesson can be learned from it, although it can also be supplemented with further observations.
Tests of significance thus help gain a better understanding of the research at hand. Notwithstanding this, such understanding may also be deemed provisional, and subject to confirmation and revision.
Nature of the null hypothesis (Ho)
For Fisher, the null hypothesis is based on a theoretical distribution (eg, t, F, z…) which normally assumes an infinite population (he thus calls it an "imaginary" or "hypothetical" population).
Although many theoretical distributions may be known and be mathematically sound (eg, a binomial distribution, or a z-distribution), most still depend on information obtained from the sample in order to be "fit" accurately (eg, a z-distribution depends on the sample's standard deviation in order to "fit" the sample to the theoretical z-distribution).
"In the test of significance due to 'Student' (W. S. Gosset), and generally known as the t-test […]" (Fisher, 19594, p.79).
The null hypothesis (Ho)
Fisher saw it as advantageous to set up the null hypothesis clearly, if only to avoid confusion.
The null hypothesis needs to be exact, without ambiguity, as it is the basis for creating the distribution against which to test the data (the variance of the distribution or the appropriate degrees of freedom will be estimated from the variance and the size of the sample, respectively). The hypothesis doesn't need to be a nil hypothesis, though.
"The idea of an infinite population distributed in a frequency distribution in respect of one or more characters is fundamental to all statistical work. From a limited experience…, we may obtain some idea of the infinite hypothetical population from which our sample is drawn, and so of the probable nature of future samples to which our conclusions are to be applied. If a second sample belies this expectation we infer that it is… drawn from a different population; that the treatment to which the second sample… had been exposed did in fact make a material difference, or… had materially altered. Critical tests of this kind may be called tests of significance, and when such tests are available we may discover whether a second sample is or is not significantly different from the first" (Fisher, 19542, p.41).
"Statistics are of course variable from sample to sample, and the idea of a frequency distribution is applied with especial value to the variation of such statistics. If we know exactly how the original population was distributed it is theoretically possible… to calculate how any statistic derived from a sample of given size will be distributed. The utility of any particular statistic, and the nature of its distribution, both depend on the original distribution, and appropriate and exact methods have been worked out for only a few cases. The application of these cases is greatly extended by the fact that the distribution of many statistics tends to the normal form as the size of the sample is increased. For this reason it is customary to apply to many cases what is called 'the theory of large samples' which is to assume that such statistics are normally distributed, and to limit consideration of their variability to calculations of the standard error" (Fisher, 19542, p.42).
The null hypothesis also needs to specify the parameters under which results will be deemed significant or not. This is the actual test of significance.
"…in order to be used as a null hypothesis, a hypothesis must specify the frequencies with which the different results of our experiment shall occur, and that the interpretation of the experiment consisted in dividing these results into two classes, one of which is to be judged as opposed to, and the other as conformable with the null hypothesis. If these classes of results are chosen, such that the first will occur when the null hypothesis is true with a known degree of rarity in, for example, 5 per cent or 1 percent of trials, then we have a test by which to judge, at a known level of significance, whether or not the data contradict the hypothesis to be tested" (Fisher, 19605, p.187).
Furthermore, there is no need to set up a so-called 'alternative' hypothesis, because it is never going to be tested (ie, it is "based" on the rejection of the null) and, in any case, it is just the negation of the null hypothesis.
"It might be argued that if an experiment can disprove [a] hypothesis…, it must therefore be able to prove the opposite hypothesis… But this last hypothesis, however reasonable or true it may be, is ineligible as a null hypothesis to be tested by experiment, because it is inexact" (Fisher, 19605, p.16).
The level of significance
The literature often confuses the probability of the data, and the procedure for obtaining it, with Fisher's test of significance. It is not. The test of significance proper, that is, the test that establishes whether a result is statistically significant or not, is the level of significance that the researcher has decided upon as the cut-off point separating 'significant' from 'non-significant' results. The rest of the procedure is just the necessary "scaffolding" for obtaining the probability of the observed data (to which the test of significance is then applied).
"Our examination of the possible results of the experiment has therefore led us to a statistical test of significance, by which these results are divided into two classes with opposed interpretations [:] those which show a significant discrepancy from a certain hypothesis… and [those] which show no significant discrepancy from this hypothesis" (Fisher, 19605, p.15-16).
The level of significance (thus, the test itself) is not "set in stone", but is decided by the researcher for each experiment.
"It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him" (Fisher, 19605, p.13).
Of course, the researcher could also work with conventional levels of significance, such as 5% (sig<.05, or a similar result occurring by chance less than 1 time in 20), 2% (sig<.02, or a similar result occurring by chance less than 1 time in 50) or 1% (ie, sig<.01, or a similar result occurring by chance less than 1 time in 100).
"It is usual and convenient for experimenters to take 5 per cent as a standard level of significance" (Fisher, 19605, p.13).
"The value for which p=.05, or 1 in 20, is 1.96 or nearly 2 [in a normal distribution]; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty" (Fisher, 19542, p.44).
Whichever significance level is chosen, it identifies a rejection area: all results with probabilities equal to or more extreme than the level of significance will be considered statistically significant by the researcher.
"[In] judging [the] significance [of a result] we must take account not only of its own frequency, but also of the frequency of any better result. In the present instance '3 right and 1 wrong' occurs 16 times, and '4 right' occurs once in 70 trials, making 17 cases out of 70 as good as or better than the observed. The reason for including cases better than that observed becomes obvious on considering what our conclusions would have been had the case of 3 right and 1 wrong only 1 chance, and the case of 4 right 16 chances of occurrence out of 70. The rare case of 3 right and 1 wrong could not be judged significant merely because it was rare, seeing that a higher degree of success would frequently have been scored by mere chance" (Fisher, 19605, p.15).
The tails of the test
A typical test of significance can be one-tailed or two-tailed. One-tailed tests (or directional tests) assess the significance of data by checking for a significant departure from the null hypothesis in one direction only: either the observed data is significantly "higher" than the null or it is significantly "lower". Two-tailed tests (or non-directional tests), on the other hand, assess the significance of the observed data on either "side" of the null, thus being appropriate when the observed data may fall on either side of it. In this latter case, the total level of significance is divided into two, typically in equal halves, but it could be split in any proportion (provided that the two parts add up to the selected level of significance; such as .025 + .025, or .030 + .020, or .040 + .010, etc).
"We may… observe that… data may contradict the hypothesis in any one of a number of different ways. For example,…it is not only possible for [a] subject to designate… cups correctly more often than would be expected by chance, but it is also possible that she may do so less often. Instead of using a test of significance which separates from the remainder a group of possible occurrences, known to have a certain small probability when the null hypothesis is true, and characterised by showing an excess of correct classification, we might have chosen a test separating an equally infrequent group of occurrences of the opposite kind… Such tests may be made mathematically valid by ensuring that they each separate, for purposes of interpretation, a group of possible results of the experiment having a known and small probability, when the null hypothesis is true. For this purpose any quantity might have been calculated from the data, provided that its sampling distribution is completely determined by the null hypothesis, and any portion of the range of distribution of this quantity could be chosen as significant, provided that the frequency with which it falls in this portion of its range is .05 or .01, or whatever may be the level of significance chosen for the test" (Fisher, 19605, p.187-188).
'P-values' can be interpreted as evidence against the null hypothesis, independently of whether they are statistically significant or not. Interpret them as the probability of getting that result under the null hypothesis (eg, p=.033 means that about 3.3 times in 100, or roughly 1 time in 30, we would obtain such a result as normal [or random] fluctuation under the null).
"Convenient as it is to note that a hypothesis is contradicted at some familiar level of significance such as 5% or 2% or 1% we do not, in Inductive Inference, ever need to lose sight of the exact strength which the evidence has in fact reached, or to ignore the fact that with further trial it might come to be stronger, or weaker" (Fisher, 19605, p.25).
'P-values' are considered statistically significant if they are equal to or smaller than the chosen significance level. This is the actual test of significance. The test interprets those 'p-values' falling beyond the threshold as 'rare' enough to deserve attention (ie, they are rare, yet possible, under the null hypothesis).
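The decision rule itself is a one-liner. A sketch that also keeps the exact p-value in view, as Fisher recommends (the function name and report format are my own):

```python
def judge(p_value, alpha=0.05):
    """Fisher-style test of significance: compare the exact p-value
    against the level chosen beforehand, but still report that p-value."""
    verdict = "significant" if p_value <= alpha else "not significant"
    return f"p = {p_value:.3f} ({verdict} at the {alpha:.0%} level)"

# judge(0.033) -> 'p = 0.033 (significant at the 5% level)'
```

Reporting the exact p-value alongside the verdict preserves "the exact strength which the evidence has in fact reached", per the quote below.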
"No [statistical significance] can eliminate the whole of the possible effects of chance coincidence, and if we accept this convenient convention, and agree that an event that would occur by chance only once in 70 trials is decidedly "significant", in the statistical sense, we thereby admit that…the "one chance in " will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us" (Fisher, 19605, p.13-14).
If results are accepted as statistically significant, then the null hypothesis can be rejected as an explanation of the observed data. However, as the whole test of significance is constructed on a null distribution (thus, on the assumption that the null hypothesis is true), the null can never be proved or accepted. That is, you cannot accept as true a hypothesis which is already assumed to be true, nor prove a hypothesis based on a theoretical model assumed to be sound.
"…It should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis" (Fisher, 19605, p.16).
"…it is a fallacy… to conclude from a test of significance that the null hypothesis is thereby established; at most it may be said to be confirmed or strengthened" (Fisher, 19553, p.73).
All test statistics and associated exact 'p-values' can (and probably should) be reported as descriptive statistics, independently of whether they are statistically significant or not. When discussing the results and reaching conclusions, however, those results which are not statistically significant should be ignored.
"[Experimenters] are prepared to ignore all results which fail to reach [statistical significance], and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results" (Fisher, 19605, p.13).
Significant results can be reported in different ways. Two literal ways of reporting significant results are, as per Fisher (19594), in the line of "Either an exceptionally rare chance has occurred, or the theory of random distribution is not true" (p.39), or, as per Conover (19801), in the line of "Without the treatment I administered, experimental results as extreme as the ones I obtained would occur only about 3 times in 1000. Therefore, I conclude that my treatment has a definite effect" (p.2).
Validity and sensitiveness
A lesser-known aspect of significance testing is that statistical significance is just one part of the whole procedure, and that statistical significance does not necessarily imply actual significance if the experiment was not done well. For Fisher, a well-designed experiment and randomisation are the physical basis of the validity of the test.
"[We] thereby admit that no isolated experiment, however significant in itself, can suffix for the experimental demonstration of any natural phenomenon… In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result" (Fisher, 19605, p.13-14).
Furthermore, increasing the sensitiveness of the experiment (eg by way of enlargement, repetition, structural reorganisation or refinements of technique) is one of the ways of ensuring that the "one chance in a million" does not happen to us.
"By increasing the size of the experiment [either by enlargement or repetition], we can render it more sensitive, meaning by this that it will allow of the detection… of a quantitatively smaller departure from the null hypothesis. Since in every case the experiment is capable of disproving, but never of proving this hypothesis, we may say that the value of the experiment is increased whenever it permits the null hypothesis to be more readily disproved" (Fisher, 19605, p.21-22).
"The need for duplicate experiments is sufficiently widely realised; it is not so widely understood that in some cases, when it is desired to place a high degree of confidence (say p=.01) on the results, triplicate experiments will enable us to detect differences as small as one seventh of those which, with a duplicate experiment, would justify the same degree of confidence" (Fisher, 19542, p.126).
"The evident object of these precautions is to increase the sensitiveness of the experiment, by making such differences… as were to be observed as little as possible dependent from environmental circumstances, and as much as possible, therefore, from intrinsic differences due to their mode of origin" (Fisher, 19605, p.32).
Other support information
- Fisher proposed his procedure as a form of inductive reasoning, ie, probably in the older sense of "reasoning from the specific [data] to the general [hypothesis / theory]" (IEP, 2011).
- A test of significance is an interpretative step performed on the observed probability of the data, assuming the null hypothesis is true. Thus, a 'significant' result always presents a dichotomy, unsolvable by statistical methods alone:
- either the data is part of the null 'population' and, thus, is an extreme case of it (as extreme as the observed 'p-value' indicates);
- or the data is part of another (unspecified) population and, thus, the null hypothesis is not true. In this case, the probability of the data can be used as 'evidence' against the null hypothesis, and this evidence grows stronger the smaller the probability is. That is, p = 0.012 provides stronger evidence against the null hypothesis than p = 0.047 or p = 0.065.
- The construction of inductive reasoning based on the data could go something like this:
- This result is statistically extreme enough to call for our attention (ie, it is significant).
- The result could be merely a rare event (luck, chance…), but why should we be greeted by such luck?
- Thus, it rather seems to be evidence against the hypothesis that the data pertains to the null population.
- Therefore, if the experiment has been set up properly, and we know of no reasons for the results other than our experimental manipulation, we could "reject" the null hypothesis based on this probability.
- Fisher saw no need to establish alternative hypotheses. Thus, he only assesses the probability of the data against a "null" hypothesis. This doesn't mean that an alternative hypothesis was ruled out, only that it was not necessary for performing the test of significance. For example:
- It is obvious that, without an "alternative" hypothesis, the probability of any observed data is no more than a description of such probability (think of the probability of winning the lottery: rare, yet it happens often). An alternative hypothesis, even if not clearly stated, casts doubt on the null hypothesis as the sole explanation, thus re-interpreting a 'significant' result not as a descriptive probability of the data but as a 'suspiciously' low probability that may be explained by the alternative hypothesis.
- It is also obvious that Fisher 'created' the test of significance for assessing the effects of experiments, which, by default, are not the null hypotheses. Thus, he was interested in those 'alternative' hypotheses, but created a test which did not assess them directly.
- Fisher indeed talks about alternative hypotheses, if only to say that they cannot be tested because they are ambiguous.
- Fisher was also more than happy to interpret his results not as evidence against the null hypothesis, but as evidence in favor of the alternative hypothesis (of difference in the treatment, and which treatments were better than which others).
- Thus, Fisher's resistance to the concept of alternative hypotheses was not so much practical as logical: alternative hypotheses are not the ones being tested, but only null hypotheses.
- The test is simply an 'idiosyncratic' call by the researcher regarding which results he would consider important, a call he would make 'objectively' by setting up a significance threshold beyond which all p-values would be considered statistically significant.
- Because there are no alternative hypotheses, the earlier dichotomic interpretation of results is unavoidable.
- Also, because the null hypothesis is not tested but assumed to be true, nothing can properly be said about whether it is true or not. Thus, it is not logically possible to make an error such as Neyman's alpha and beta errors. Indeed, the test is done under the explicit, accepted assumption that the null hypothesis is true; we cannot, thus, negate it altogether.
Eg, we observe a group of swans which seem to be less white than normal, and wonder whether they are white swans or not. We establish a normal curve of shades of white and compare the colour of the observed swans against this 'population' (Ho: the observed swans are white swans).
We obtain results indicating that the observed swans have a shade of white which can be considered extreme for all purposes. That is, a colour shade so rare that only 10 swans in a thousand would have it (p = 0.01).
Thus, this could be either a group of very rare white swans, or they are not white swans at all.
Given the possible yet unlikely occurrence of observing a group of very rare white swans (p), it seems more reasonable to reject the hypothesis that they are, indeed, white. Which colour are they, then? We don't know: we only know they don't seem to be white swans.
Have we disproved they are white swans? No. We induce they may not be, but there is always the possibility they, indeed, are very rare white swans.
Have we proved they are not white swans, then? No. Again, we are inducing they are not white, but we cannot prove it. Furthermore, we haven't set up any alternative-colour hypothesis; thus, we don't know which colour they are other than, probably, non-white.
Can we infer swans are white or, else, non-white? No. We have not worked with a real population of white swans. We actually only have 'tested' the observed swans against a hypothetical population of normally distributed white-colored swans.
Can we infer that 1% of swans will be non-white? No, we didn't test against a real population of white swans.
Can we infer that 99% of swans will be white? No, we didn't use a real population of white swans.
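The swan example can be sketched numerically. The whiteness scale (mean 100, standard deviation 5) and the observed shade are made-up numbers, chosen so that the tail probability comes out near the p = 0.01 in the text:

```python
from statistics import NormalDist

# Ho: the observed swans come from the hypothetical population of
# white swans, whose shade is normally distributed (invented scale)
white_swans = NormalDist(mu=100, sigma=5)

observed_shade = 88.37   # noticeably less white than the typical swan

# One-tailed p: how often would a shade this un-white, or more so,
# occur among white swans?
p = white_swans.cdf(observed_shade)   # ~ 0.01, ie about 10 swans in 1000
```

Note that the test compares the observed swans against this hypothetical population only; as the text stresses, it licenses no inference about the real population of swans.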
Want to know more?
- Neyman-Pearson's hypotheses testing
- This page deals with Neyman-Pearson's testing procedure, based on testing two competing hypotheses against each other.