The mysterious world of spurious correlations in business and life.

Author: Croucher, John S.
Position: Report
  1. INTRODUCTION

    Correlation is a statistical technique that is often misused and misinterpreted and as such should be subject to greater scrutiny and analysis. The most common measure used to determine whether two variables are statistically correlated is the (Pearson) correlation coefficient, denoted by the symbol r. The value of r must always lie between -1 and +1. The closer the value of r to +1, the stronger the positive correlation while the closer to -1, the stronger the negative correlation.
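    As a minimal sketch of how r is obtained, the function below computes the Pearson coefficient directly from its definition (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name `pearson_r` and the small data set are hypothetical and chosen only for illustration; they do not come from the article.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative data (not from the article).
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]      # tends to rise with x
y_down = [5, 4, 5, 4, 2]    # the same values reversed, so it tends to fall with x

r_up = pearson_r(x, y_up)       # positive, about 0.775
r_down = pearson_r(x, y_down)   # negative, about -0.775
print(round(r_up, 3), round(r_down, 3))
```

    Reversing the order of the y-values flips the sign of r but not its magnitude, which is one way to see that r measures the direction and strength of a linear association and nothing more.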

    However, the correlation coefficient merely gives an indication of the extent to which large values of one variable are associated with large or small values of the other, and cannot assess whether there is a causal relationship. And that is the problem. While the correlation coefficient may tell us whether two variables are significantly related, it does not reveal how they are related. A graphical way to determine if two variables X and Y are correlated is to plot the data on a scatter plot or scatter diagram. This involves constructing X and Y axes on a graph and plotting each paired data point on the graph. Each data pair (x,y) is represented by a point on the graph and vice-versa. It is often quite possible to get a feeling for the type of relationship, if any, that the two variables have.

    However, the value of r must still be tested for statistical significance using a statistical procedure. Common among these is Student's t-test, which can be used to test the value of r at a given α-level. In particular, the correlation coefficient obtained from a sample is tested to find whether it provides significant evidence that the correlation coefficient of the corresponding population is different from zero. As well as the size of the value of r itself, any test for significance must also take into account the size of the sample n from which it was calculated.
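    The dependence on sample size can be sketched with the usual test statistic for a correlation coefficient, t = r√(n − 2)/√(1 − r²), which is compared with Student's t distribution on n − 2 degrees of freedom. The function name, the value r = 0.6 and the two sample sizes below are hypothetical; the critical values are the standard two-sided 5% points from t tables, hard-coded here as an assumption to keep the example free of external libraries.

```python
import math

def t_statistic_for_r(r, n):
    """t statistic for testing whether the population correlation is zero."""
    if n < 3:
        raise ValueError("need at least 3 paired observations")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and +1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Two-sided 5% critical values taken from standard t tables (assumed, not computed).
T_CRIT_8_DF = 2.306    # n = 10, so n - 2 = 8 degrees of freedom
T_CRIT_18_DF = 2.101   # n = 20, so n - 2 = 18 degrees of freedom

r = 0.6  # hypothetical sample correlation

t10 = t_statistic_for_r(r, 10)
t20 = t_statistic_for_r(r, 20)

sig10 = abs(t10) > T_CRIT_8_DF    # False: not significant at the 5% level
sig20 = abs(t20) > T_CRIT_18_DF   # True: significant at the 5% level
print(round(t10, 3), sig10, round(t20, 3), sig20)
```

    The same value of r fails the test with ten observations and passes it with twenty, so a correlation coefficient quoted without its sample size says little about significance.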

  2. CAUSALITY

    If two variables are significantly correlated statistically, this does not imply that one must be the cause of the other. Association is not sufficient to establish a causal relationship, and it may well be that the presence of other factors (known as 'lurking variables' or 'confounding factors') is creating the illusion of causality. This concept of 'spurious' or 'false' correlation was first noted by the famous English statistician Karl Pearson (Pearson, 1897).

    Suppose we have two variables of interest, say X and Y, where it has already been found that X and Y are correlated. Let Z be a possible lurking variable. Then it could be that:

    X causes Y

    Y causes X

    Z causes both X and Y

    In the last case we would then have a spurious correlation between X and Y. This point is demonstrated in Example 1.
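    Before turning to that example, the third case can also be sketched with a small simulation. The data below are synthetic and generated purely for illustration: Z is drawn at random, and X and Y are each set equal to Z plus independent noise, so neither X nor Y has any effect on the other. The seed, sample size and noise levels are arbitrary assumptions.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 5000                 # arbitrary sample size

# Z is the common cause; X and Y depend on Z but not on each other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

r_xy = pearson_r(x, y)
print(round(r_xy, 3))
```

    With equal variances for Z and the two noise terms, the theoretical correlation between X and Y is 0.5, so the printed value should be close to that even though there is no causal link between X and Y in the way the data were constructed.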

    Example 1

    An HR manager collected data on ten employees in their organization. These were:

    X = current annual salary ($)

    Y = blood pressure of employee (mm Hg)

    The correlation coefficient between X and Y was r = 0.968, which yielded a p-value below 0.001 (reported as 0.000 to three decimal places), suggesting a highly significant positive correlation between the variables. Could it be that the more money you earn, the more likely your blood pressure is to rise?

    A third variable Z = age (in years) was also recorded for the same employees in the sample. The following statistics were calculated:

    Correlation between X and Z = 0.960

    Correlation between Y and Z = 0.984

    The importance of these correlations and what they mean for the perceived relationship between X and Y is discussed in the next section.

  3. PARTIAL CORRELATION

    Partial correlation is the correlation between two variables when the effects of one or more related variables are removed. This should be tested if there is a suspicion that the correlation between two variables may be spurious and only appears to be significant since they are related to other variables. In some cases, two variables may appear to be strongly correlated, but after taking into account these other variables there is really no significant correlation at all.

    In the case of Example 1, the partial correlation coefficient between X and Y, allowing for Z, is only 0.469. This is less than half of the raw value of 0.968 and has a p-value greater than 0.05. We might therefore conclude that the original association between annual salary and blood pressure was indeed spurious.
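    The figure for Example 1 can be checked, approximately, from the three pairwise correlations alone. The sketch below assumes the standard first-order partial correlation formula, r(XY.Z) = (r(XY) − r(XZ)·r(YZ)) / √((1 − r(XZ)²)(1 − r(YZ)²)), and a t-test on n − 3 degrees of freedom; the article does not state which software or method produced its figures, so this is an illustration rather than a reproduction of the author's calculation. The function names are hypothetical, and the critical value is the standard two-sided 5% point from t tables.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("undefined when X or Y is perfectly correlated with Z")
    return (r_xy - r_xz * r_yz) / denom

def t_statistic_partial(r, n, k=1):
    """t statistic for a partial correlation with k controlled variables."""
    df = n - 2 - k
    if df < 1:
        raise ValueError("not enough observations")
    return r * math.sqrt(df) / math.sqrt(1 - r * r)

# Rounded correlations reported in Example 1 (n = 10 employees).
r_xy, r_xz, r_yz = 0.968, 0.960, 0.984

r_partial = partial_correlation(r_xy, r_xz, r_yz)   # about 0.468
t = t_statistic_partial(r_partial, n=10)            # about 1.40 on 7 df

T_CRIT_7_DF = 2.365   # two-sided 5% critical value from t tables (assumed)
significant = abs(t) > T_CRIT_7_DF                  # False
print(round(r_partial, 3), round(t, 2), significant)
```

    Using the rounded inputs gives about 0.468, in line with the 0.469 reported (the small difference is consistent with rounding of the published correlations), and the t statistic falls well short of the 5% critical value, which agrees with a p-value greater than 0.05.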

    Whilst in this case the fact of the matter was easy to detect, in general it is not so easy since there may be dozens, if not hundreds, of possibilities for lurking variables that may...
