
Your strongly correlated data is probably nonsense

Use of the Pearson correlation coefficient is common in genomics and bioinformatics, which is OK as far as it goes (I have used it extensively myself), but it has some major drawbacks. The biggest is that Pearson is sensitive to outliers: a few very large measurements can produce a large coefficient even when the rest of the data are uncorrelated.

This is best shown via an example in R:

    # let's correlate some random data
    g1 <- rnorm(50)
    g2 <- rnorm(50)

    cor(g1, g2)
    # [1] -0.1486646

So we get a small, negative correlation from correlating two sets of 50 random values. If we ran this 1000 times we would get a distribution of coefficients centred around zero, as expected.
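To see that distribution for yourself, here is a minimal sketch. The seed and the name `r_null` are my own choices for illustration, not part of the example above.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Correlate two independent sets of 50 standard-normal values, 1000 times
r_null <- replicate(1000, cor(rnorm(50), rnorm(50)))

summary(r_null)                     # centre and spread of the coefficients
quantile(r_null, c(0.025, 0.975))   # range covering the middle 95% of runs
```

Passing `r_null` to `hist()` gives a quick picture of how far from zero a coefficient can drift by chance alone with 50 points.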

Let’s add in a single, large value:

    # let's correlate some random data with the addition of a single, large value
    g1 <- c(g1, 10)
    g2 <- c(g2, 11)

    cor(g1, g2)
    # [1] 0.6040776

Holy smokes, all of a sudden my random datasets are positively correlated with r > 0.6!

It's also statistically significant:

    > cor.test(g1, g2, method="pearson")

        Pearson's product-moment correlation

    data:  g1 and g2
    t = 5.3061, df = 49, p-value = 2.687e-06
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.3941015 0.7541199
    sample estimates:
          cor
    0.6040776

So if you have used Pearson on large datasets, you will almost certainly have some of these spurious correlations in your results, particularly where a handful of measurements are much larger than the rest.
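A rough way to gauge how easily this happens is to repeat the experiment many times, each time appending the same single large pair (10, 11) to fresh random data, and count how often Pearson calls the result significant. This is only a sketch: the seed, the 1000 repetitions, the 0.05 cut-off and the helper name `one_run` are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# One run: 50 random pairs plus a single large pair, tested with Pearson.
# one_run is a hypothetical helper name.
one_run <- function() {
  g1 <- c(rnorm(50), 10)
  g2 <- c(rnorm(50), 11)
  test <- cor.test(g1, g2, method = "pearson")
  c(r = unname(test$estimate), p = test$p.value)
}

runs <- replicate(1000, one_run())  # 2 x 1000 matrix with rows "r" and "p"

summary(runs["r", ])        # spread of the coefficients across runs
mean(runs["p", ] < 0.05)    # fraction of runs with p < 0.05
```

The single large pair dominates both the covariance and the variances, so you should expect most of these runs to come out strongly positive even though the other 50 points are pure noise.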

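How can you solve this? By using Spearman, of course. Spearman correlates the ranks of the values rather than the values themselves, so the large value counts only as the highest rank, not as a point far away from everything else: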

    > cor(g1, g2, method="spearman")
    [1] -0.0961086
    > cor.test(g1, g2, method="spearman")

        Spearman's rank correlation rho

    data:  g1 and g2
    S = 24224, p-value = 0.5012
    alternative hypothesis: true rho is not equal to 0
    sample estimates:
           rho
    -0.0961086
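If you want to convince yourself why Spearman shrugs off the large value, the sketch below checks that Spearman's rho is the same as Pearson computed on the ranks. The data are regenerated with an arbitrary seed, so the numbers will not match the output above, and the names `rho` and `r_on_ranks` are mine.

```r
set.seed(42)  # arbitrary seed; values will differ from the example above

# 50 random pairs plus the same single large pair
g1 <- c(rnorm(50), 10)
g2 <- c(rnorm(50), 11)

rho        <- cor(g1, g2, method = "spearman")
r_on_ranks <- cor(rank(g1), rank(g2))  # Pearson applied to the ranks

all.equal(rho, r_on_ranks)  # the two calculations agree

# The large values are simply the top rank in each vector
rank(g1)[51]
rank(g2)[51]
```

Once the values are replaced by ranks, 10 and 11 are just rank 51 out of 51, one step above the next largest value, so they have no more leverage than any other point.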
