Question

plotting correlation between samples

0

Entering edit mode

teresati ▴ 10

@teresati-12364

Last seen 8.0 years ago

Dear all,

I would like to plot the correlation between my samples. I have a matrix with genes in rows and samples in columns, and for each gene a logCPM normalized values. If I want to look at the correlation between 2 samples I get an Error:

x <- as.data.frame(t(log.cpm.norm[,1]))  
y <- as.data.frame(t(log.cpm.norm[,2]))
correlation <- as.data.frame(cor(x, y,use="pairwise.complete.obs"))
plot(correlation)

Error in plot.new() : figure margins too large
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf

It is correct to transpose the columns?

Furthermore, are log cpms usually used to compute correlations between samples?

To solve the margin problem I have tried to adjust the margins with:

par('mar')

Thank you for any suggestion.

Regards

correlation • 4.0k views

ADD COMMENT • link updated 8.1 years ago by Aaron Lun ★ 28k • written 8.1 years ago by teresati ▴ 10

2

Entering edit mode

What exactly are you trying to do? You are producing a matrix of the pairwise correlations between two single observations, which will be NA for all of those pairs (you can't compute a correlation between just two observations). You could hypothetically compute the correlation between the two samples, but that would be a single number. Neither of these things is interesting, or illuminating.

Are you perhaps trying to show which samples are more similar to each other? If so, the main tool is either MDS or PCA, both of which are useful for showing that sort of thing.

You might want to spend some time reading some tutorials, which could help you get on track. Here is one, if you are using DESeq2. You could also use the DESeq2 vignette, or if you are using edgeR, there is a User's Guide for that package.

ADD REPLY • link 8.1 years ago James W. MacDonald 68k

score 3 · Answer 1 · 2017-03-07

As James says, the correlation between expression profiles of two libraries is rarely interesting in and of itself. More useful is the relative size of the correlation between different pairs. For example, is the correlation between samples in the same group greater than the correlation between samples in different groups? This reasoning motivates the construction of a MDS plot, as demonstrated in the code below:

stuff <- matrix(rnorm(1000), ncol=10) # 100 genes, 10 samples
cor.mat <- cor(stuff)
dist.mat <- sqrt(0.5*(1-cor.mat)) # magic step
coords <- cmdscale(dist.mat, k=2)
plot(coords[,1], coords[,2])

The magic step converts the pairwise correlation into a valid distance metric (see https://arxiv.org/abs/1208.3145 for details). The use of correlations has a couple of advantages over just using the Euclidean distances between the log-expression profiles:

Correlations are insensitive to scaling, so you don't have to worry about normalising the data.
Spearman's rho is robust to outliers that might interfere with estimation of relative differences between samples - though on the flip side, it is less sensitive to differences between libraries that do not involve many genes.

This is why we sometimes use correlation-based (i.e., cosine) distances in single-cell RNA-seq, instead of running PCA on the log-expression profiles or computing Euclidean distances between pairs of profiles for use in "standard" MDS.

As for your other question, I would calculate correlations from log-expression values. The log-transformation avoids domination of the estimated correlation by a handful of genes with large counts. Normalisation doesn't matter so much as the computed correlation isn't affected by scaling (aside from small changes with respect to the prior count). In fact, you could even use the original counts for computing Spearman's rho, which will be unaffected by general monotonic transformations of the data.