# Your strongly correlated data is probably nonsense

Use of the Pearson correlation coefficient is common in genomics and bioinformatics, which is OK as far as it goes (I have used it extensively myself), but it has some major drawbacks. The main one is that Pearson can produce large coefficients when the data contain just a few very large measurements, even if everything else is unrelated.

This is best shown via an example in R:

```r
# let's correlate some random data
g1 <- rnorm(50)
g2 <- rnorm(50)

cor(g1, g2)
# [1] -0.1486646
```

So we get a small negative correlation from correlating two sets of 50 random values. If we ran this 1000 times, we would get a distribution centred on zero, as expected.
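A quick way to check that claim is to actually repeat the experiment. The following is only a sketch: the seed, the number of repeats and the name `r_null` are my own choices for illustration.

```r
# Repeat the random-vs-random correlation many times.
# Seed and n_reps are arbitrary; r_null is just an illustrative name.
set.seed(42)
n_reps <- 1000
r_null <- replicate(n_reps, cor(rnorm(50), rnorm(50)))

summary(r_null)                     # centred close to zero
sd(r_null)                          # roughly 1 / sqrt(50 - 1), about 0.14
quantile(r_null, c(0.025, 0.975))   # the range most random pairs fall in
```

With 50 unrelated values per vector, a coefficient anywhere near 0.6 should essentially never turn up by chance.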

Let’s add in a single, large value:

```r
# let's correlate some random data with the addition of a single, large value
g1 <- c(g1, 10)
g2 <- c(g2, 11)

cor(g1, g2)
# [1] 0.6040776
```

Holy smokes, all of a sudden my random datasets are positively correlated with r ≈ 0.6!

It’s also significant.

```
> cor.test(g1,g2, method="pearson")

	Pearson's product-moment correlation

data:  g1 and g2
t = 5.3061, df = 49, p-value = 2.687e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3941015 0.7541199
sample estimates:
      cor 
0.6040776 
```
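To see why one point has this much pull, it helps to look at the numerator of Pearson's r, which is a sum of products of deviations from the mean. The sketch below rebuilds a comparable dataset under fresh names (`x`, `y`, `cross` and the seed are my own, so the numbers will not match the output above) and asks how much of that sum comes from the single added point.

```r
# Illustrative only: x, y and the seed are not the g1/g2 used above.
set.seed(7)
x <- c(rnorm(50), 10)
y <- c(rnorm(50), 11)

# Per-point contribution to the covariance (the numerator of Pearson's r)
cross <- (x - mean(x)) * (y - mean(y))

# Share of the numerator contributed by the one large point (index 51)
cross[51] / sum(cross)

# Drop that point and the correlation falls back towards zero
cor(x[-51], y[-51])
cor(x, y)
```

The 50 random points mostly cancel each other out, so the one large product of deviations ends up supplying nearly the whole numerator.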

So if you have used Pearson in large datasets, you will almost certainly have some of these spurious correlations in your data.

How can you solve this? By using Spearman, of course:

```
> cor(g1, g2, method="spearman")
[1] -0.0961086
> cor.test(g1, g2, method="spearman")

	Spearman's rank correlation rho

data:  g1 and g2
S = 24224, p-value = 0.5012
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.0961086 
```
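One run could be a fluke, so here is a rough sketch that repeats the outlier experiment many times for both methods. The helper `sim_one`, the seed and the 0.5 cut-off are my own choices for illustration, not part of the analysis above.

```r
# Hypothetical helper: 50 random values per vector plus one large pair,
# returning both correlation coefficients for that dataset.
sim_one <- function(n = 50, outlier = c(10, 11)) {
  a <- c(rnorm(n), outlier[1])
  b <- c(rnorm(n), outlier[2])
  c(pearson  = cor(a, b, method = "pearson"),
    spearman = cor(a, b, method = "spearman"))
}

set.seed(1)
res <- t(replicate(1000, sim_one()))   # 1000 rows, one column per method

colMeans(res)              # Pearson is pulled well above zero, Spearman stays near it
colMeans(abs(res) > 0.5)   # fraction of runs that look "strongly" correlated
```

Spearman works on ranks, so the large pair only counts as the top-ranked value in each vector, no matter how extreme it is.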