Correlation

Correlation is typically defined as the degree of "relationship" between two variables ("co-" stands for two or more). Two variables that are "co-related" are understood as having a relationship such that when one of the variables changes (eg, it increases in value), the other variable tends to change as well.

For example, income and expenditure are typically correlated. When people get more money than usual (eg, for their birthdays), they tend to buy something new with it (eg, a nice present) or more of the same things (eg, more food). Equally, when they have less money than usual (eg, because they had to pay a bill), they tend to buy fewer things.

When you compute a correlation, you normally get a correlation coefficient with a + or - sign and, perhaps, the significance level of that coefficient. The sign tells you whether the correlation is positive or negative. The absolute size of the coefficient tells you the strength of the correlation. The significance level tells you how confident you can be that the correlation is not merely a chance result in your sample.

Positive and negative correlations

A positive correlation indicates that when one of the variables changes towards higher values (eg, more money), the other variable also changes towards higher values (eg, more buying), and when one changes towards lower values (eg, less money), the other changes towards lower values (eg, less buying).

A negative correlation, on the other hand, is an inverse relationship. When one variable changes towards higher values (eg, an increase in prices), the other variable changes towards lower values (eg, less buying), and vice versa (ie, cheaper prices, more buying).
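
As a rough illustration of the two signs, the sketch below computes a correlation coefficient for one positively and one negatively related pair of variables. It assumes the usual Pearson product-moment coefficient (the text does not name a specific one), and the numbers are invented for the example, not real data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
income = [100, 120, 90, 150, 110, 130]        # more money ...
spending = [80, 95, 75, 120, 85, 100]         # ... more buying
price = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]        # higher prices ...
units_bought = [50, 44, 41, 33, 30, 22]       # ... less buying

r_income_spending = pearson_r(income, spending)
r_price_units = pearson_r(price, units_bought)

print(f"income vs spending: r = {r_income_spending:+.3f}")
print(f"price vs units:     r = {r_price_units:+.3f}")
```

The first coefficient comes out with a + sign and the second with a - sign, which is all that "positive" and "negative" mean here.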

Strength of the relationship

A correlation may vary in its strength. That is, there are variables which are more correlated than others.

For example, it is possible that when you get more money, you tend to spend more, which indicates a positive correlation. If you always spend your extra money, then the correlation is high; but if you typically spend it, but not always, then the correlation is smaller; and if you sometimes spend it and sometimes not, then the correlation is even smaller.

The correlation coefficient ranges between +1 (perfect positive correlation) and -1 (perfect negative correlation). The closer the coefficient is to either +1 or -1, the stronger the correlation. If the coefficient is close to "0", however, the correlation is very weak, with "0" indicating that there is no correlation at all.
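
For reference, and assuming the coefficient meant here is the usual Pearson product-moment coefficient (the text does not say so explicitly), it can be written for n paired observations (x_i, y_i) as follows. The symbols are the conventional ones, not notation taken from this article.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le +1
```

The numerator carries the sign (whether deviations from the two means tend to agree or disagree), while the denominator rescales the result so that it cannot fall outside the range from -1 to +1.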

How to interpret the strength of a correlation is not a settled issue. The following table shows several of the conventions that different authors use to interpret the strength of a correlation (conventions as summarised by Reinard, 2006 [2]).

Interpretation              Losh, 2004   Cohen, 1988   Koenker, 1961
Perfect                     1.000        -             -
Very large / Very strong    ≥.760        -             ≥.800
Large / Strong              ≥.510        ≥.371         ≥.600
Moderate / Medium           ≥.260        ≥.243         ≥.400
Small / Weak                ≥.110        ≥.100         ≥.200
Very small / Very weak      ≥.010        -             -
No correlation              <.010        <.100         <.200
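
A minimal sketch of how these conventions could be applied in practice follows. The thresholds are copied from the table above; where a row lists fewer than three values, the assignment of each value to an author is an assumption based on the ordering of that author's other thresholds. The function name `interpret_strength` is made up for this example.

```python
# Lower bounds for |r| under each convention, copied from the table above.
# ASSUMPTION: in rows with missing cells, values are assigned to the author
# whose other thresholds they fit (eg, 1.000 and >=.010 to Losh, 2004).
CONVENTIONS = {
    "Losh, 2004": [
        (1.000, "perfect"), (0.760, "very large"), (0.510, "large"),
        (0.260, "moderate"), (0.110, "small"), (0.010, "very small"),
    ],
    "Cohen, 1988": [
        (0.371, "large"), (0.243, "moderate"), (0.100, "small"),
    ],
    "Koenker, 1961": [
        (0.800, "very large"), (0.600, "large"),
        (0.400, "moderate"), (0.200, "small"),
    ],
}

def interpret_strength(r, convention):
    """Return the verbal label for r under the named convention."""
    size = abs(r)  # strength ignores the sign of the coefficient
    # Thresholds are listed from highest to lowest, so the first match wins.
    for lower_bound, label in CONVENTIONS[convention]:
        if size >= lower_bound:
            return label
    return "no correlation"

for name in CONVENTIONS:
    print(f"r = .45 is '{interpret_strength(0.45, name)}' according to {name}")
```

The same coefficient receives a different label depending on the author, which is the practical consequence of the issue not being settled.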

Statistical significance

The correlation coefficient is a descriptive statistic and, as such, is sufficiently informative when describing the relationship between two variables for the sample under comparison. However, if the interest is in using it to infer a similar relationship in the larger population, then the statistical significance of the correlation is important.

In this case, the "null hypothesis" is that there is no correlation between those variables in the population at large (even when a correlation is evident in the sample). The "alternative hypothesis" is that there is a correlation in the population, as there is in the sample. Using a conventional significance level of 0.05, whenever a correlation coefficient has an associated probability lower than 0.05 (p<.05) we reject the null hypothesis and assume a correlation in the population, accepting a 5% risk of doing so wrongly when there is in fact no correlation.
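
One way to make the logic of the null hypothesis concrete is a permutation test, sketched below. This is a stand-in technique chosen because it needs only the standard library; statistical packages normally obtain the probability from a t distribution instead. The data, the function names and the number of shuffles are all assumptions made for the example.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_permutations=5000, seed=0):
    """Two-sided p: share of random re-pairings with |r| >= the observed |r|."""
    rng = random.Random(seed)          # fixed seed so the result is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling ys breaks any real pairing, which mimics the null hypothesis.
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical sample, invented for illustration only.
income = [100, 120, 90, 150, 110, 130, 95, 140]
spending = [80, 95, 75, 120, 85, 100, 82, 105]

r = pearson_r(income, spending)
p = permutation_p_value(income, spending)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
print("reject the null hypothesis" if p < 0.05 else "retain the null hypothesis")
```

If a correlation as strong as the observed one rarely appears when the pairing is random, the null hypothesis of no correlation becomes hard to maintain.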

Correlation matrices

Although correlations are always bivariate (ie, between two variables), it is not unusual to test several pairs of variables at once and show all the correlations in a matrix. In such a matrix, each variable is represented both at the top and at the side, and the correlations (together with their significance and, sometimes, sample sizes) are shown within the table, each cell representing the bivariate correlation between one top variable and one side variable. If you look carefully you will notice that each variable 'correlates' with itself along the diagonal of the matrix, running from top left to bottom right (these correlations are obviously 'perfect', and are not normally shown). Also, each correlation is duplicated on either side of that diagonal, as each pair of variables appears twice. If so preferred, you may 'clean up' a correlation matrix in order to eliminate irrelevant and redundant information.

A typical correlation matrix

          BNI     WHO     US/CAN  AUS/NZ  UK
BNI       -       1.000   .950    .950    .962
(p)       -       .000    .000    .000    .000
WHO       1.000   -       .950    .950    .962
(p)       .000    -       .000    .000    .000
US/CAN    .950    .950    -       1.000   .999
(p)       .000    .000    -       .000    .000
AUS/NZ    .950    .950    1.000   -       .999
(p)       .000    .000    .000    -       .000
UK        .962    .962    .999    .999    -
(p)       .000    .000    .000    .000    -

Same correlation matrix 'cleaned up'

          BNI     WHO     US/CAN  AUS/NZ
WHO       1.000
(p)       .000
US/CAN    .950    .950
(p)       .000    .000
AUS/NZ    .950    .950    1.000
(p)       .000    .000    .000
UK        .962    .962    .999    .999
(p)       .000    .000    .000    .000
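
The 'cleaning up' described above amounts to keeping only the cells below the diagonal. A small sketch, assuming Pearson coefficients and using four made-up variables (A to D are placeholders, not the variables in the tables above); significance values are left out to keep the example short.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores, invented for illustration only.
data = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [2, 1, 4, 3, 6, 5],
    "C": [6, 5, 4, 3, 2, 1],
    "D": [1, 3, 2, 5, 4, 6],
}
names = list(data)

# Full matrix: every pair appears twice and the diagonal is always 1.
full = {(a, b): pearson_r(data[a], data[b]) for a in names for b in names}

# 'Cleaned up' matrix: keep only the cells below the diagonal.
lower = {(a, b): full[(a, b)] for i, a in enumerate(names) for b in names[:i]}

# Print the lower triangle with the same layout as a cleaned-up table.
print(" " * 4 + "".join(f"{b:>8}" for b in names[:-1]))
for i, a in enumerate(names[1:], start=1):
    cells = "".join(f"{lower[(a, b)]:>8.3f}" for b in names[:i])
    print(f"{a:<4}{cells}")
```

With four variables the full matrix has 16 cells, of which only 6 carry non-redundant information.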

How to interpret the correlation coefficient "r"

Earlier, I said that correlation is typically defined as the degree of "relationship" between two variables. The few examples used even suggest a "causal" relationship, where one variable (eg, more or less money) leads to the other (eg, more or less buying).

However, there is nothing in the calculation of a correlation coefficient that determines, establishes or even suggests causation. The correlation coefficient simply indicates that two variables "correlate", but not why or how. In fact, the correlation may be a chance event. The variables may be related not to each other but to other variables that are common to both (eg, a positive correlation is possible between money and weight, not because having more money makes people heavier but because people with more money can buy more food, and eating in excess increases weight). Or the "relationship" may go in the other direction (eg, maybe prices increase because people don't buy enough, rather than the other way around), etc.
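
The "variable common to both" case can be illustrated with a small simulation. In the sketch below, x and y never influence each other; both are built from a shared third variable z plus independent noise. The set-up, the sample size and the use of a first-order partial correlation to remove the influence of z are choices made for this example, not something taken from the text.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(42)   # fixed seed so the simulation is repeatable
n = 2000

# z is the common variable; x and y each depend on z but not on each other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

r_xy = pearson_r(x, y)
r_xy_given_z = partial_r(r_xy, pearson_r(x, z), pearson_r(y, z))

print(f"r(x, y)             = {r_xy:.3f}")
print(f"r(x, y) holding z   = {r_xy_given_z:.3f}")
```

The plain coefficient is clearly positive even though neither variable affects the other, and it shrinks towards zero once the common variable is taken into account.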

Of course, certain relationships are more "reasonable" than others (eg, it is more "reasonable" that a surplus of money leads to greater expenditure than vice versa). Yet the only way of establishing causation is by way of an experiment, not by interpreting correlations "reasonably". Even so, the relationship may not be between money and expenditure, but between money and attitudes towards money, on the one hand, and between attitudes towards money and expenditure, on the other hand. That is, some people may like spending any extra money they get, while other people may prefer to save it for a rainy day. Thus, the real relationship is between attitudes and the other two variables, not between money and expenditure per se.

Interpretations that are "literally" possible out of the correlation coefficient are the following (D'Andrade & Dart, 1990 [1]):

• Prediction. "r" represents the slope of the regression line for normalized variables. Therefore, it can be used for predicting the normalized values of one variable from the other by multiplying the normalized values of the latter by the correlation coefficient.
• Effectiveness in making predictions.
• Proportion of variation accounted for. This is the absolute variation between predicted and actual scores (or the square root of the proportion of variance accounted for). The proportion of variation accounted for is a more intuitive measure of variation than the proportion of variance accounted for (ie, the coefficient of determination), simply because it doesn't use squared values. It is probably better to refer to it as the "± proportion of variation accounted for", to prevent misunderstandings.
• Proportion of variation not accounted for. This proportion is the coefficient of alienation, ie, the square root of 1-r² (it cannot be obtained by simply subtracting the proportion of variation accounted for from "1"). It indicates the size of the error likely to be made when predicting a single case [3].
• Commonality. "r" can be interpreted as the percentage of common elements between both variables when both variables can be treated as having equal elements. Eg, the expected correlation between hereditary behaviours of father and son is .50, simply because father and son also share 50% of their genes.
• Binomial effect size display. "r" represents the difference in proportion between successful and unsuccessful outcomes when using 2x2 tables (with results for each variable split at the median). Eg, in a game with a chance of winning 7 times out of 10, one could expect to win 7 times and lose 3 times. If each bet costs \$1, then one can expect a net return of \$4 for every \$10 wagered (7-3=4, r=0.4, or 40%).
• Binomial proportion misclassification. Equally, subtracting the binomial "r" from 1 and dividing the result by 2 [ie, (1-r)/2] gives the proportion of unsuccessful outcomes or, in a clinical environment, the proportion of cases misclassified. Following the example above, one could expect to lose \$3 for every \$10 wagered [(1-0.4)/2=0.3, or 30%].
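• Angle between variables. "r" is the cosine of the angle between the two variables when each is treated as a vector of deviations from its mean; the angle itself is therefore the arccosine of "r".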
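
Several of the interpretations listed above reduce to simple arithmetic on "r". The sketch below works them through for r = 0.4, the value used in the betting example; the standardized score of 1.5 is a made-up input, and the variable names are chosen for this example only.

```python
from math import acos, degrees, sqrt

r = 0.4  # the coefficient used in the betting example above

# Prediction: with standardized scores, predicted z_y = r * z_x.
z_x = 1.5                      # hypothetical standardized score on x
predicted_z_y = r * z_x

# Proportion of variance accounted for (coefficient of determination).
r_squared = r ** 2

# Proportion of variation not accounted for: square root of 1 - r^2.
not_accounted = sqrt(1 - r_squared)

# Binomial effect size display: the two rates differ by exactly r.
success_rate = (1 + r) / 2
failure_rate = (1 - r) / 2     # also the proportion misclassified

# Angle between the two mean-centred variables: r is its cosine.
angle_degrees = degrees(acos(r))

print(f"predicted z_y for z_x = {z_x}: {predicted_z_y:.2f}")
print(f"variance accounted for (r squared): {r_squared:.2f}")
print(f"variation accounted for (r): {r:.2f}")
print(f"variation not accounted for: {not_accounted:.3f}")
print(f"success rate {success_rate:.2f}, failure rate {failure_rate:.2f}")
print(f"angle between variables: {angle_degrees:.1f} degrees")
```

Note that the variation accounted for (0.4) and the variation not accounted for (about 0.917) do not add to 1, whereas the squared quantities (0.16 and 0.84) do.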

References
1. D'ANDRADE Roy & Jon DART (1990). The interpretation of r versus r² or Why percent of variance accounted for is a poor measure of size of effect. Journal of Quantitative Anthropology, number 2, pages 47-59.
2. REINARD John C (2006). Communication research statistics. Sage Publications (California, USA). ISBN 9780761929871.

Footnotes
3. The coefficient of variation accounted for and the coefficient of variation unaccounted for do not add to "1".