I have an RNA-seq dataset where the largest source of variation appears to be library size: principal component 1, which captures ~55% of the variance, correlates with library size (B), even after TMM normalisation. Library sizes ranged from 13,481,192 to 22,737,049 and were spread relatively evenly between experimental groups (top graph).
We used RUVSeq to try remove this unwanted variation. We used RUVg setting k = 1 and using the 5000 least DE genes in an initial DE analysis using edgeR
as the negative control genes (as described in the RUVseq vignette). This generated a W1 term which tracked with PC1 and library size (C & D). PCA on the RUVseq normalised counts (F) revealed that this artefact was gone (G). We then continued with differential gene expression analysis including the W1 covariate in the design matrix.
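To make the idea behind the k = 1 step concrete, here is a minimal pure-Python sketch of an RUVg-style correction as I understand the method: take the first left singular vector of the gene-centred log counts restricted to the negative control genes, call it W, then regress W out of every gene. This is not the RUVSeq API. The function names (`ruvg_like`, `first_left_singular_vector`) and the toy data are assumptions made up for illustration, and the real analysis was done in R.

```python
import math
import random

def center_columns(matrix):
    """Subtract each gene's (column's) mean across samples (rows)."""
    n, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in matrix]

def first_left_singular_vector(matrix, iterations=200, seed=0):
    """Power iteration on M M^T; converges to the first left singular vector of M."""
    n = len(matrix)
    gram = [[sum(a * b for a, b in zip(matrix[i], matrix[k])) for k in range(n)]
            for i in range(n)]
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        vec = [sum(gram[i][k] * vec[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

def ruvg_like(log_counts, control_idx):
    """Hypothetical helper: estimate one unwanted factor W from control genes
    and remove its fitted contribution from every gene (k = 1 only)."""
    centered = center_columns(log_counts)
    controls = [[row[j] for j in control_idx] for row in centered]
    w = first_left_singular_vector(controls)
    n, n_genes = len(log_counts), len(log_counts[0])
    # Least-squares coefficient per gene; w has unit length, so no division needed.
    alpha = [sum(w[i] * log_counts[i][j] for i in range(n)) for j in range(n_genes)]
    corrected = [[log_counts[i][j] - w[i] * alpha[j] for j in range(n_genes)]
                 for i in range(n)]
    return w, corrected

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy data (invented): 6 samples, 20 control genes, 5 genes with a group effect.
rng = random.Random(1)
group = [0, 0, 0, 1, 1, 1]
nuisance = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]  # stand-in for a library-size effect
n_controls, n_de = 20, 5
loadings = [rng.uniform(0.5, 1.5) for _ in range(n_controls + n_de)]
log_counts = []
for i in range(6):
    row = []
    for j in range(n_controls + n_de):
        effect = 2.0 * group[i] if j >= n_controls else 0.0
        row.append(5.0 + loadings[j] * nuisance[i] + effect + rng.gauss(0.0, 0.05))
    log_counts.append(row)

w, corrected = ruvg_like(log_counts, control_idx=list(range(n_controls)))
print("abs correlation of W with the simulated nuisance factor:",
      round(abs(pearson(w, nuisance)), 3))
```

In this toy setup the nuisance factor is deliberately balanced across the two groups, which mirrors the statement above that library size was spread evenly between experimental groups. If the unwanted factor were confounded with genotype, regressing out W could also remove real signal.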
Is there something we are missing here about why this isn't a valid way to do this? A reviewer of the manuscript has voiced concerns.
I guess this is the same question as here? https://www.biostars.org/p/459918/ As already suggested over at Biostars, it would be good to know what the reviewer said and to see your code.
The reviewer was concerned that using RUVg to remove unwanted variation due to library size was non-standard and would require substantial evidence, using known data sets with known effects, to gauge the impact of the approach. Below is how I performed the analysis :)
Not saying you did anything wrong, but this is quite a chunk of code, and I find it odd to see a read count effect after normalization in bulk RNA-seq dominating a genotype effect. Can you, just for confidence, run a standard PCA on the top 1000 most variable genes, using the logcounts as input, as in part 2 of this post: https://www.biostars.org/p/461026/ and then show the resulting biplot?
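For anyone reading without the linked post open, the requested check boils down to roughly the following. This is a hedged pure-Python sketch, not the code from that post: the helper names (`log_cpm`, `top_variable_genes`, `pc1`) and the toy counts are invented for illustration, the log-CPM formula is a simple pseudocount version rather than edgeR's exact one, and the toy uses the top 10 of 50 genes where the real check would use the top 1000.

```python
import math
import random

def log_cpm(counts, pseudocount=0.5):
    """Simple log2 counts-per-million per sample (rows = samples, columns = genes)."""
    logged = []
    for sample in counts:
        lib_size = sum(sample)
        logged.append([math.log2((c + pseudocount) / lib_size * 1e6) for c in sample])
    return logged

def top_variable_genes(expr, n_top):
    """Indices of the n_top genes with the largest variance across samples."""
    n = len(expr)
    variances = []
    for j in range(len(expr[0])):
        col = [row[j] for row in expr]
        mean = sum(col) / n
        variances.append(sum((v - mean) ** 2 for v in col) / (n - 1))
    ranked = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)
    return ranked[:n_top]

def pc1(expr, gene_idx, iterations=200, seed=0):
    """PC1 sample scores and fraction of variance explained, via power iteration."""
    n = len(expr)
    centered = []
    for j in gene_idx:
        col = [row[j] for row in expr]
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    # Sample-by-sample Gram matrix of the gene-centred data.
    gram = [[sum(g[i] * g[k] for g in centered) for k in range(n)] for i in range(n)]
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        vec = [sum(gram[i][k] * vec[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    eigenvalue = sum(vec[i] * gram[i][k] * vec[k] for i in range(n) for k in range(n))
    total = sum(gram[i][i] for i in range(n))
    scores = [v * math.sqrt(eigenvalue) for v in vec]
    return scores, eigenvalue / total

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy counts (invented): 6 samples of increasing depth, 50 genes; the first 10
# genes change their relative abundance with depth, the rest do not.
depths = [13, 15, 17, 19, 21, 23]
counts = []
for d in depths:
    sample = []
    for j in range(50):
        fold = 2 ** ((d - 18) / 5) if j < 10 else 1.0
        sample.append(round((10 + j) * d * 10 * fold))
    counts.append(sample)

lib_sizes = [sum(s) for s in counts]
expr = log_cpm(counts)
top = top_variable_genes(expr, n_top=10)
scores, explained = pc1(expr, top)
r = pearson(scores, lib_sizes)
print("fraction of variance on PC1:", round(explained, 3))
print("abs correlation of PC1 with library size:", round(abs(r), 3))
```

The sign of a principal component is arbitrary, so only the absolute correlation with library size is meaningful here.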
This has the same pattern as the PCA on the TMM-normalised counts (before RUVSeq; Fig. A in the post). For this particular experiment, it makes sense that the genotype effect is small: I am looking at mutations that model human Alzheimer's disease mutations in young zebrafish, so effects may be subtle.