Everything about Correlate totally explained
This article is about the correlation coefficient between two variables. The term correlation can also mean the cross-correlation of two functions or electron correlation in molecular systems.
In probability theory and statistics, correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables. In general statistical usage, correlation or co-relation refers to the departure of two variables from independence. In this broad sense there are several coefficients, measuring the degree of correlation, adapted to the nature of the data.
A number of different coefficients are used for different situations. The best known is the Pearson product-moment correlation coefficient, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. Despite its name, it was first introduced by Francis Galton.
Pearson's product-moment coefficient
Mathematical properties
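The correlation coefficient $\rho_{X,Y}$ between two random variables $X$ and $Y$ with expected values $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is defined as:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$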
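As a minimal sketch of the definition (covariance divided by the product of the standard deviations), the following Python function computes the coefficient for two equal-length samples. The function name `pearson` and the sample values are illustrative assumptions, not part of the original article; population (divide-by-N) moments are used throughout, which gives the same ratio as sample moments.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: cov(X, Y) / (sd(X) * sd(Y)), using population moments."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance: average product of the deviations from the two means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (sd_x * sd_y)

# Illustrative (made-up) data: exact linear relationships.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson(xs, [2.0, 4.0, 6.0, 8.0, 10.0]))  # increasing line: value close to +1
print(pearson(xs, [10.0, 8.0, 6.0, 4.0, 2.0]))  # decreasing line: value close to -1
```

Because of floating-point rounding, the printed values may differ from +1 and -1 in the last digits.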
Common misconceptions about correlation
Correlation and causality
The conventional dictum that "correlation doesn't imply causation" means that correlation alone can't be validly used to infer a causal relationship between the variables. This dictum shouldn't be taken to mean that correlations can't indicate the potential existence of causal relations. However, the causes underlying the correlation, if any, may be indirect and unknown. Consequently, establishing a correlation between two variables isn't a sufficient condition to establish a causal relationship (in either direction).
Here is a simple example: hot weather may cause both a reduction in purchases of warm clothing and an increase in ice-cream purchases. Therefore warm clothing purchases are correlated with ice-cream purchases. But a reduction in warm clothing purchases doesn't cause ice-cream purchases and ice-cream purchases don't cause a reduction in warm clothing purchases.
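The weather example can be sketched as a small simulation. Everything numeric here is a hypothetical assumption made for illustration: the temperature range, the linear coefficients, the noise levels and the variable names are invented, not taken from real sales data. Both purchase series depend only on temperature plus independent noise, so neither influences the other.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Common cause: daily temperature (hypothetical range, in degrees Celsius).
temperature = [rng.uniform(0.0, 35.0) for _ in range(2000)]

# Independent noise terms: the only part of each series not driven by temperature.
ice_noise = [rng.gauss(0.0, 5.0) for _ in temperature]
clothing_noise = [rng.gauss(0.0, 5.0) for _ in temperature]

# Hypothetical linear models: each series depends on temperature, never on the other.
ice_cream = [2.0 * t + e for t, e in zip(temperature, ice_noise)]
warm_clothing = [100.0 - 3.0 * t + e for t, e in zip(temperature, clothing_noise)]

r_raw = pearson(ice_cream, warm_clothing)    # strongly negative
r_noise = pearson(ice_noise, clothing_noise)  # near zero once temperature is removed

print(f"raw correlation:                 {r_raw:.3f}")
print(f"correlation without temperature: {r_noise:.3f}")
```

In this sketch the raw correlation is strongly negative even though the model contains no link between the two purchase series, while the parts left over after removing the temperature effect are essentially uncorrelated.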
A correlation between age and height in children is fairly causally transparent, but a correlation between mood and health in people is less so. Does improved mood lead to improved health? Or does good health lead to good mood? Or does some other factor underlie both? Or is it pure coincidence? In other words, a correlation can be taken as evidence for a possible causal relationship, but can't indicate what the causal relationship, if any, might be.
Correlation and linearity
While Pearson correlation indicates the strength of a linear relationship between two variables, its value alone may not be sufficient to evaluate this relationship, especially in the case where the assumption of normality is incorrect.
Scatterplots of Anscombe's quartet, a set of four different pairs of variables created by Francis Anscombe, illustrate this. The four y variables have the same mean (7.5), variance (4.12), correlation with x (0.816) and regression line (y = 3 + 0.5x). However, as can be seen on the plots, the distributions of the variables are very different. The first one (top left) seems to be distributed normally, and corresponds to what one would expect when considering two variables correlated and following the assumption of normality. The second one (top right) isn't distributed normally; while an obvious relationship between the two variables can be observed, it isn't linear, and the Pearson correlation coefficient isn't relevant. In the third case (bottom left), the linear relationship is perfect, except for one outlier which exerts enough influence to lower the correlation coefficient from 1 to 0.816. Finally, the fourth example (bottom right) shows another case in which one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables isn't linear.
These examples indicate that the correlation coefficient, as a summary statistic, can't replace the individual examination of the data.
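The point can be checked numerically. The sketch below uses the commonly published values of Anscombe's quartet, transcribed here as an assumption since the article itself does not list the data; the helper name `summarize` is also hypothetical. It computes the mean and sample variance of y, the correlation, and the least-squares line for each of the four sets.

```python
import math

# Anscombe's quartet as commonly published (sets I-III share the same x values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return (mean of y, sample variance of y, correlation, slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_y = syy / (n - 1)            # sample variance
    r = sxy / math.sqrt(sxx * syy)   # Pearson correlation
    slope = sxy / sxx                # least-squares regression of y on x
    intercept = mean_y - slope * mean_x
    return mean_y, var_y, r, slope, intercept

for name, (xs, ys) in quartet.items():
    mean_y, var_y, r, slope, intercept = summarize(xs, ys)
    print(f"{name:>3}: mean={mean_y:.2f} var={var_y:.2f} r={r:.3f} "
          f"line: y = {intercept:.2f} + {slope:.3f}x")
```

The four rows of summary statistics agree to about two decimal places, which is exactly why the numbers alone cannot distinguish the four very different scatterplots.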
Computing correlation accurately in a single pass
The following algorithm (in pseudocode) will calculate Pearson correlation with good numerical stability.
sum_sq_x = 0
sum_sq_y = 0
sum_coproduct = 0
mean_x = x[1]
mean_y = y[1]
for i in 2 to N:
    sweep = (i - 1.0) / i
    delta_x = x[i] - mean_x
    delta_y = y[i] - mean_y
    sum_sq_x += delta_x * delta_x * sweep
    sum_sq_y += delta_y * delta_y * sweep
    sum_coproduct += delta_x * delta_y * sweep
    mean_x += delta_x / i
    mean_y += delta_y / i
pop_sd_x = sqrt(sum_sq_x / N)
pop_sd_y = sqrt(sum_sq_y / N)
cov_x_y = sum_coproduct / N
correlation = cov_x_y / (pop_sd_x * pop_sd_y)
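A runnable Python rendering of the pseudocode might look like the sketch below. The function name `pearson_single_pass`, the choice to accept any iterable of (x, y) pairs, and the error handling are assumptions added for illustration. The final step is also simplified: since the 1/N factors in the covariance and in the two standard deviations cancel, the sums are divided directly.

```python
import math

def pearson_single_pass(pairs):
    """One-pass Pearson correlation using running means, as in the pseudocode."""
    it = iter(pairs)
    try:
        mean_x, mean_y = next(it)  # the first point initialises the running means
    except StopIteration:
        raise ValueError("need at least one (x, y) pair") from None
    sum_sq_x = sum_sq_y = sum_coproduct = 0.0
    n = 1
    for x, y in it:
        n += 1
        sweep = (n - 1.0) / n
        delta_x = x - mean_x  # deviation from the mean of the previous n - 1 points
        delta_y = y - mean_y
        sum_sq_x += delta_x * delta_x * sweep
        sum_sq_y += delta_y * delta_y * sweep
        sum_coproduct += delta_x * delta_y * sweep
        mean_x += delta_x / n
        mean_y += delta_y / n
    if sum_sq_x == 0 or sum_sq_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    # The 1/N factors of cov_x_y, pop_sd_x and pop_sd_y cancel in the ratio.
    return sum_coproduct / math.sqrt(sum_sq_x * sum_sq_y)

# Illustrative (made-up) data; the textbook two-pass formula gives 8 / 10 = 0.8 here.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(pearson_single_pass(zip(xs, ys)))  # close to 0.8
```

Accepting an iterator rather than indexed arrays means the data never has to be held in memory, which is the usual reason for wanting a single-pass method.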