Dear Gordon Smyth,
I am facing a biological problem where I am comparing two different treatment conditions (plus a mutant) versus the control condition. I have RNA-Seq expression in triplicates for each condition. What I want to find out is the true (biological) correlation of fold changes in our two conditions when comparing each to the control. To do so, we came across your function genas from the limma package. However, I have some questions to make sure I am doing things right and that would help me understand the results of the function.
I provide the code I am using for you to find out what I am doing exactly:
condition <- c("MUT", "MUT", "MUT", "A", "A", "A", "B", "B", "B", "Control", "Control", "Control")
y <- RawCounts[which(rowMedians(as.matrix(RawCounts))>=1),] ### RawCounts is the matrix of gene expression
group <- factor(condition, levels=c("Control","MUT", "A", "B"))
design <- model.matrix(~group)
colnames(design) <- c("Control", "MUT", "A", "B")
dge <- DGEList(counts=y)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log=T, prior.count=1)
fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend=TRUE)
print(genas(fit, coef=c(2,3), plot=TRUE, subset="all"))
. When subsetting==Fpval, are only 'significantly changing genes' used to estimate biological correlation?
. Are the genes used to estimate biological correlation the only ones that show biological change (not technical correlation) in any of the two comparisons?
. Why is technical correlation always 0.5? Is this based on some assumption besides that "technical correlations between coefficients are the same for all genes" ?
. In line with this last question, when using RNA-Seq data, is it right to use previously filtered and scaled data (using for instance calcnormfactors) ?
. Is it correct to run genas on subsets of low number of genes (from tens to hundreds) to estimate biological correlations for different sets of genes?
### session info:
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)
locale:
[1] LC_CTYPE=C LC_NUMERIC=C
[3] LC_TIME=C LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] ellipse_0.3-8 edgeR_3.16.2
[3] pheatmap_1.0.8 BiocParallel_1.8.1
[5] genefilter_1.56.0 stringr_1.1.0
[7] scales_0.4.1 ggplot2_2.2.0
[9] RColorBrewer_1.1-2 DESeq2_1.14.0
[11] SummarizedExperiment_1.4.0 Biobase_2.34.0
[13] GenomicRanges_1.26.1 GenomeInfoDb_1.10.1
[15] IRanges_2.8.1 S4Vectors_0.12.0
[17] BiocGenerics_0.20.0 limma_3.30.3
loaded via a namespace (and not attached):
[1] locfit_1.5-9.1 splines_3.3.1 lattice_0.20-34
[4] colorspace_1.3-0 htmltools_0.3.5 chron_2.3-47
[7] survival_2.40-1 XML_3.98-1.5 foreign_0.8-67
[10] DBI_0.5-1 plyr_1.8.4 zlibbioc_1.20.0
[13] munsell_0.4.3 gtable_0.2.0 latticeExtra_0.6-28
[16] knitr_1.15 geneplotter_1.52.0 AnnotationDbi_1.36.0
[19] htmlTable_1.7 Rcpp_0.12.7 acepack_1.4.1
[22] xtable_1.8-2 Hmisc_4.0-0 annotate_1.52.0
[25] XVector_0.14.0 gridExtra_2.2.1 digest_0.6.10
[28] stringi_1.1.2 grid_3.3.1 tools_3.3.1
[31] bitops_1.0-6 magrittr_1.5 lazyeval_0.2.0
[34] RCurl_1.95-4.8 tibble_1.2 RSQLite_1.0.0
[37] Formula_1.2-1 cluster_2.0.5 Matrix_1.2-7.1
[40] data.table_1.9.6 assertthat_0.1 rpart_4.1-10
[43] nnet_7.3-12
Dear Gordon Smyth,
Thank you for your fast and comprehensive answer. I have been playing around with different parametres and now I have a different doubt:
When changing the 'subset' parameter the results change completely. I am using a group of 4263 genes and when running genas with:
subset="all" I get a biological correlation of -1 (although with a p.value of 0.875) using all genes.
subset=="Fpval" I get a biological correlation of 0.25 with a p.value e-05. while using 593 of these genes.
subset=="p.union" I get a biological correlation of 0.86 with a p.value e-44. while using 678 of these genes.
subset=="logFC" I get a biological correlation of 0.51, p. val e-26, 757 genes.
subset=="predFC" I get a biological correlation of 0.53, p.val e-30, 760 genes.
Following the genas function recommendation from the manual ("Ideally, only genes that are truly differentially expressed for one or both of the contrasts should be used estimate the biological correlation") we can exclude the option subset=all. Also, the p.value is not significant for such analysis. Then, the good part is that 403 genes are common between the 593 of Fpval and the 678 of p.union. Using subset="p.int" gives me this error: "Error in squeezeVar(sigma^2, df.residual, covariate = covariate, robust = robust, :
Could not estimate prior df"
Both spearman and pearson correlation of the contrast fold changes of these 4263 genes show a high positive correlation, but what I want is to exclude the technical correlation. This makes me think that probably the 'p.union' subset option is the most accurate in such case, but how do I make sure I am using the right subsetting value in this case? Is there any quality control value that I can use (besides p.values) to assess which is the best method (and how reliable it is) ?
Thank you again for your time,
Eduard