Dear Bioconductors,
I am dealing with a poly-A selected RNA-seq dataset that contains samples collected from the central nervous system of mice treated either with a drug or a control vehicle. The primary goal of the analysis is the identification of differential gene expression associated with drug treatment.
During my initial QC, I noticed relatively high counts for different hemoglobin genes in the dataset, varying ca. 5-10 fold between samples. This probably indicates the varying degrees of blood contamination in the original samples (which are difficult to prepare). As far as I can tell, the degree of contamination does not appear to be associated with the variable of interest, e.g. drug treatment, as it fluctuates similarly within and across treatment groups.
Would anybody have recommendations on how to minimize the effects of such contamination on differential expression analysis? For example, I am considering an (unsupervised) surrogate variable analysis (SVA). Alternatively, given that the source of the contamination is (likely) known, I am wondering if it would be useful to include the expression of blood-specific marker genes (hemoglobin, etc) in a linear model.
Perhaps any of you have are willing to share some experience and / or advice?
Thanks a lot,
Thomas
Little follow-up on the topic. We have RNA-seq gene and exon-level data extracted from white blood cells. We have significant hits on various hemoglobin, haemoglobin metabolic processes and gas exchange related genes from the differential expression analysis. Foldchanges are quite small and vary between 1-2. Is it possible in general that from WBC RNA-seq data we can see expression changes in such genes or this more likely to be due contamination? And if so is RUVseq a valid option for approaching this issue?
From the count data alone, it isn't possible to determine if your log-fold changes are due to contamination. Without prior biological knowledge, who am I to say that white blood cells do not genuinely express haemoglobin after drug treatment? I guess it sounds silly, but I don't know what your drug does, either.
Similarly, RUVseq will not explicitly determine whether contamination has occurred. It will only identify hidden factors of variation, some of which may (or may not) be driven by variable contamination between samples. For practical purposes, this is enough - you don't care about the contamination itself, you just want to get rid of it for the DE analysis. As long as the contamination is not confounded with your drug treatment, RUVseq should be applicable; have a look at the haemoglobin expression across samples and ensure that the high-expressing samples are not all in one condition.
Yeah, that indeed sounds quite bizarre that WBC would express haemoglobin etc gas exchange related genes.
1) If after RUVseq we still have the same results and there isn't any single high-expression samples in these genes do you see it as plausible that then the results might be genuine?
2) Is it an valid approach to exclude these genes in question from the analysis and do it without them?
1) Putting aside biological plausibility for the moment, there are still a few reasons why RUVseq may fail to remove contamination-related factors of variation in your data. The first is if the contamination effect doesn't lie in the first k factors, i.e., there are larger effects that take priority. The second is that RUVseq will not remove factors that are completely confounded with your conditions of interest. So it is entirely possible for technical effects to remain after applying RUVseq (albeit reduced in magnitude), thus requiring some care during interpretation.
2) Yes, and this will avoid drawing biologically nonsensical conclusions. However, any contamination effect will still be present in other genes. These may actually be more misleading as they are not obviously wrong.
Thank you very much for the help. Going to examine both approaches and see if the results make sense.
We have quite complex model in the analysis and was wondering about the RUVseq argument "k". Should it be set to 1 or how many in this case below (the vignette isn't really helpful in modifying the example code) ? And is the below right way to implement the W_1 in the model of differential expression analysis?
#Ruvseq code
design <- model.matrix(~ time + condition + time:subject.nested + condition:time, data=pData(set))
y <- DGEList(counts=counts(set1), group=coldata$CASE_CONTROL)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=35:36)
top <- topTags(lrt, n=nrow(set1))$table
empirical <- rownames(set1)[which(!(rownames(set1) %in% rownames(top)[1:5000]))]
set2 <- RUVg(set, empirical, k=1)
#Code for differential expression
full <- model.matrix(~ W_1+ time + condition + time:subject.nested + condition:time, pData(set2))
reduced <- model.matrix(~ W_1 + time + condition + time:subject.nested, pData(set2))
ddsFullCountTable <- DESeqDataSetFromMatrix(countData = counts(set2),colData = pData(set2), design = ~ 1 )
ddsFullCountTable <- ddsFullCountTable[ rowSums(counts(ddsFullCountTable)) > 1, ]
keep <- rowSums(counts(ddsFullCountTable) >= 5) >= 5
ddsFullCountTable <- ddsFullCountTable[keep,]
dds <- DESeq(ddsFullCountTable, test="LRT", full=full, reduced=reduced)
This is turning into a question about DESeq and RUVseq. If you want more relevant advice, post a new question with the appropriate tags.