Tests of significance

Fisher's tests of statistical significance

Tests of significance are a statistical technology used for ascertaining the probability of empirical data and, from there, for inferring a real effect, such as a correlation between variables or the effectiveness of a new treatment. Fisher, starting around 1925 [2], standardized the interpretation of statistical significance and was the main driving force behind the popularity of tests of significance in empirical research, especially in the social and behavioral sciences.

The following sections address the interpretation of tests of significance as well as a basic procedure for performing them. They are based on a procedure that refines Fisher's original ideas in order to prevent current misuses, such as pseudoscientific interpretations of the null hypothesis testing procedure, which were partly initiated by Fisher himself.

Reading tests of significance

A typical test of significance comprises two related elements: the calculation of the probability of the data and an assessment of the statistical significance of that probability.

Probability of the data

The probability of the data is normally reported using two related statistics: a test statistic (z, t, F…) and an associated probability (p, * [10]). The information provided by the test statistic has little immediate usability and can be ignored in most cases. The associated probability, on the other hand, tells how probable the test results are and forms the basis for assessing statistical significance.

z=1.96, p=0.025 F=13.140, p<0.01 r=0.60*
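
For instance, the first result above can be reproduced computationally. The following is a minimal sketch (assuming Python with scipy installed; the values are purely illustrative) of how an exact p-value derives from a test statistic such as z=1.96:

```python
# Minimal sketch: deriving the exact p-value from a z test statistic.
# Assumes Python with scipy; z=1.96 is the illustrative value from above.
from scipy import stats

z = 1.96                                  # observed test statistic
p_one_tailed = stats.norm.sf(z)           # upper-tail probability, ~0.025
p_two_tailed = 2 * stats.norm.sf(abs(z))  # both tails combined, ~0.05

print(f"z={z}, p={p_one_tailed:.3f} (one-tailed)")
```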

Statistical significance

The statistical significance of the results depends on criteria set up by the researcher beforehand. A result is deemed statistically significant if the probability of the data is small enough, conventionally if it is smaller than 5% (sig≤0.05). However, conventional thresholds for significance vary across disciplines and researchers. For example, the health sciences commonly settle for 10% (sig≤0.10), while particular researchers may settle for more stringent levels, such as 1% (sig≤0.01). In any case, p-values (p, *) larger than the selected threshold are considered non-significant and are typically excluded from further discussion. P-values smaller than, or equal to, the threshold are considered statistically significant and interpreted accordingly. A statistically significant result normally leads to an appropriate inference of real effects, unless there are suspicions that such results may be anomalous. Notice that the criteria used for assessing statistical significance may not be made explicit in a research article when the researcher is using conventional assessment criteria.

z=1.96, p=0.025 F=13.140, p<0.01 r=0.60*

In this example, the test statistics are 'z' (normality test), 'F' (equality-of-variance test), and 'r' (correlation) [7]. Each p-value indicates, with more or less precision, the probability of its test statistic under the corresponding null hypothesis [10]. Assuming a conventional 5% level of significance (sig≤0.05), all three tests are statistically significant, and we can thus infer real effects rather than random fluctuations in the data (* [10]).

When interpreting the results, the correlation statistic provides information which is directly usable [7]. We could thus infer a medium-to-high correlation between the two variables. The test statistics 'z' and 'F', on the other hand, do not provide immediately useful information, and any further interpretation requires descriptive statistics. For example, skewness and kurtosis are necessary for interpreting non-normality (z), and group means and variances are necessary for describing group differences (F).
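
As a sketch of the above (assuming Python with numpy and scipy; the data are fabricated for illustration), a correlation can be reported with both its directly usable coefficient 'r' and the p-value used for the significance assessment, while descriptive statistics support any further interpretation:

```python
# Sketch: a directly usable statistic (r) versus supporting descriptives.
# Assumes Python with numpy/scipy; the data below are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 0.6 * x + rng.normal(scale=0.8, size=30)  # build in a real association

r, p = stats.pearsonr(x, y)                   # coefficient and its p-value
print(f"r={r:.2f}, p={p:.4f}")                # 'r' is directly interpretable

# Descriptives needed to interpret a non-normality (z) or group-difference
# (F) result, which the test statistics alone do not convey:
print("skewness:", stats.skew(x), "kurtosis:", stats.kurtosis(x))
```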

Running tests of significance

The following procedure should be considered a compromise between a literal understanding of tests of significance and their current pseudoscientific use. This compromise seems appropriate because of the high prevalence of the latter and the absence of the former, which could see research work and scientific articles rejected if they stray too far from the current pseudoscientific understanding. Under this compromise, such research work may perhaps be read as punctilious but not necessarily wrong, while researchers can progress further into an advanced use of literal tests of significance if they so wish. A worked sketch of the procedure in code follows the list below.

  • Set up or assume a statistical null hypothesis (Ho). Setting up a null hypothesis helps clarify the aim of the research. However, such a hypothesis can also be assumed, given that null hypotheses, in general, are nil hypotheses and can easily be 'reconstructed'.

Ho: Given our sample results, we will be unable to infer a significant correlation between the dependent and independent research variables.
Ho: It will not be possible to infer any statistically significant mean differences between the treatment and the control groups.
Ho: We will not be able to infer that this variable's distribution significantly departs from normality.

  • Decide on an appropriate level of significance for assessing results. Conventional levels are 5% (sig<0.05, meaning that results have a probability under the null hypothesis of less than 1 time in 20) or 1% (sig<0.01, meaning that results have a probability under the null hypothesis of less than 1 time in 100). However, the level of significance can be any 'threshold' the researcher considers appropriate for the intended research (thus, it could be 0.02, 0.001, 0.0001, etc). If required, label such level of significance as 'significance' or 'sig' (ie, sig<0.05); avoid labeling it as 'p' (so as not to confuse it with 'p-values') or as 'alpha' or 'α' (so as not to confuse it with 'alpha' tolerance errors).
  • Decide between a one-tailed or a two-tailed statistical test. A one-tailed test assesses whether the observed results are either significantly higher or significantly smaller than the null hypothesis, but not both. Thus, one-tailed tests are appropriate when testing whether results will only be higher or only be smaller than null results, or when the only interest is in interventions which will result in higher or smaller outputs. A two-tailed test, on the other hand, assesses both possibilities at once. It does so by dividing the total level of significance between both tails, which also implies that it is more difficult to obtain significant results than with a one-tailed test. Thus, two-tailed tests are appropriate when the direction of the results is not known, or when the researcher wants to check both possibilities in order to prevent mistakes.
  • Interpret results:
    • Obtain and report the probability of the data (p). It is recommended to use the exact probability of the data, that is, the 'p-value' (eg, p=0.011, or p=0.51). This exact probability is normally provided together with the pertinent test statistic (z, t, F…), either as 'p' or as 'sig' (significance) [4]. However, when necessary (eg, for reducing the size of a table), 'p-values' can also be reported in a categorical (eg, * for p<0.05, ** for p<0.01) or nominal manner (p<0.05) [5].
    • 'P-values' can be interpreted as the probability of obtaining the observed or more extreme results under the null hypothesis (eg, p=0.033 means that 3.3 times in 100, or about 1 time in 30, we would obtain the same or more extreme results as normal [or random] fluctuation under the null). A small simulation after this list illustrates this frequency interpretation.
    • 'P-values' are considered statistically significant if they are equal to or smaller than the chosen significance level. This is the actual test of significance, as it interprets those 'p-values' falling at or beyond the threshold as 'rare' enough to deserve attention.
    • If results are accepted as statistically significant, it can be inferred that the null hypothesis is not explanatory enough for the observed data.
  • Write up the report:
    • All test statistics and associated exact 'p-values' can be reported as descriptive statistics, independently of whether they are statistically significant or not.
    • When discussing the results and reaching conclusions, however, those results which are not statistically significant should be ignored. Significant results can be reported in different ways. Two "honest" [6] ways of reporting are the following:
      • As per Fisher (1959) [3], significant results can be reported in the line of "Either an exceptionally rare chance has occurred, or the theory of random distribution is not true" (p.39). Eg, "In regards to this correlation, either a rare chance has occurred (p=0.01), or the hypothesis of nil correlation between these variables is not true."
      • As per Conover (1980) [1], significant results can be reported in the line of "Without the treatment I administered, experimental results as extreme as the ones I obtained would occur only about 3 times in 1000. Therefore, I conclude that my treatment has a definite effect" (p.2). Eg, "This correlation is so extreme that it would only occur about 1 time in 100 (p=0.01). Thus, it can be inferred that there is a significant correlation between these variables."
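
The worked sketch announced above puts the whole procedure together (assuming Python with numpy and scipy; the groups, sample sizes, and effect are fabricated for illustration):

```python
# Worked sketch of the full procedure, under fabricated data.
import numpy as np
from scipy import stats

# Step 1. Assume the null hypothesis:
# Ho: no statistically significant mean difference between groups.

# Step 2. Decide on the level of significance beforehand.
sig = 0.05

# Step 3. Decide between one- and two-tailed testing; here, two-tailed,
# since the direction of the effect is taken as unknown.
rng = np.random.default_rng(42)
treatment = rng.normal(loc=1.0, size=25)    # invented treatment scores
control = rng.normal(loc=0.0, size=25)      # invented control scores
t, p = stats.ttest_ind(treatment, control)  # two-tailed by default

# Step 4. Interpret: report the exact p-value and test it against 'sig'.
print(f"t={t:.3f}, p={p:.4f}")
if p <= sig:
    print("Statistically significant: Ho is not explanatory enough.")
else:
    print("Non-significant: ignored when discussing conclusions.")
```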
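
And the small simulation referenced earlier (again a sketch, assuming Python with numpy) illustrates the frequency interpretation of 'p-values': how often results as extreme as an observed one arise as mere random fluctuation when the null hypothesis is true:

```python
# Simulation sketch: the p-value as a long-run frequency under the null.
import numpy as np

rng = np.random.default_rng(7)
observed_z = 1.96                  # illustrative observed statistic
null_z = rng.normal(size=100_000)  # statistics drawn under the null

frequency = np.mean(null_z >= observed_z)  # one-tailed frequency
print(f"z >= {observed_z} occurred in {frequency:.3%} of null samples "
      "(theory: about 2.5%, ie p=0.025)")
```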

Discussion: As we obtained statistically significant results, we can infer a correlation between the dependent and independent research variables.
Discussion: Without the administered treatment, results as extreme as the ones I obtained would occur about 1 time in 100 (p=0.01). Thus, it can be inferred that the treatment does have a positive effect.
Discussion: Given our results, and unless an exceptionally rare chance has occurred (p=0.04), we can infer that this variable's distribution is not normal.

References
1. CONOVER WJ (1980). Practical nonparametric statistics (2nd ed). John Wiley & Sons (New York, USA), 1980. ISBN 0471084573.
2. FISHER Ronald A (1925). Statistical methods for research workers. Oliver and Boyd (Edinburgh, UK), 1925.
3. FISHER Ronald A (1959). Statistical methods and scientific inference (2nd ed). Oliver and Boyd (Edinburgh, UK), 1959.
Footnotes
4. This reflects the current level of confusion in statistics in general. SPSS, for example, provides exact 'p-values' under the label 'sig'. Properly speaking, what SPSS is providing is the 'p-value' of the observed data within the null distribution. The 'significance' level, however, is a nominal 'p-value' that the researcher selects as a cut-off point for interpreting the observed 'p-values' as being either statistically significant or not (eg, sig<0.05 implies that all p<0.05 will be deemed significant); thus, it is neither calculated statistically nor dependent on mathematical formulas.
5. Make sure you understand the difference between a 'p-value' expressed categorically or nominally and a level of significance. A 'p-value' such as 'p<0.05' or '*' indicates that the observed result has a 'p-value' smaller than 0.05, independently of whether such result is interpreted as statistically significant or not. It also allows for using 'p<0.01' or '**' as a more extreme 'p-value' category, thus as stronger evidence against the null hypothesis. In contrast, 'sig<0.05' only indicates that you will consider all 'p-values' smaller than 0.05 as statistically significant. As it is easy to confuse these two concepts, try to express 'p-values' as exact probabilities, and avoid using categorical or nominal expressions for 'p-values' as much as possible. Also, beware that 'sig' is commonly confused with, and thus written as, 'p' or 'α' in research articles and statistics books alike, while 'p' is written as 'sig' in SPSS outputs.
6. "Honest" means honest to the philosophical and mathematical underpinnings of Fisher's tests of significance. Such interpretation is rarely found in the literature, and is not even used extensively by Fisher or Conovan. Indeed, Fisher was more than happy to interpret significant results as evidence of the existence of an effect rather than merely as evidence against the null hypothesis. Yet an "honest" or literal interpretation, albeit somewhat cumbersome, makes clear the working of the test: the location of the observed data within the theoretical distribution of the null hypothesis, resulting in the probability of that data assuming the null hypothesis is correct.
7. Actually, the statistic 'r' is not a test statistic but the correlation coefficient proper. The significance of this correlation is tested against a t-distribution, from which the p-value is obtained. Customarily, however, 'r' is presented together with the associated p-value, and the 't' test statistic is ignored.
8. Preferably, authors should use exact p-values, rather than nominal or categorical ones.
9. In reality, what are significant are not the correlations themselves but the t-tests done on the correlations, leading to the rejection of both null hypotheses and, thus, to the provisional inference that such correlations are true, pending further confirmation.
10. Conventionally, '*' stands for p≤0.05.

Want to know more?

Wiki of Science - Neyman-Pearson's hypotheses testing
This page deals with Neyman-Pearson's testing procedure, based on testing two competing hypotheses against each other.
Wiki of Science - Tests of significance (intermediate)
This page provides further information on tests of significance, adequate for a better understanding of this technology.

Contributors to this page

Authors / Editors

JDPerezgonzalez



Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License