Question

Removing hidden batch effects in my DESeq2 analysis resulted in lower number of DEGs

0

Entering edit mode

Leila.Eshraghi • 0

@leilaeshraghi-10806

Last seen 8.2 years ago

Hi All,

I have RNA-seq data from 7 healthy (Control) and 8 breast cancer carriers and using DESeq2 to find DEGs. So, every sample of mine is collected from an individual person (no real biological replicates and no experimental replicates). I put these samples into two different groups of Control (healthy) and Carrier and followed the DESeq2 analysis as suggested in in bioconductor and published paper titled: Differential analysis of count data – the DESeq2 package, 2016)

I ended up with no significant (padj<0.05) DEG. I did clustering analysis and some of my control samples were in the same cluster as my carriers. All my analysis suggested that the differences between my samples within a group are bigger than the differences between two groups (Control vs Carrier). So, based on the cluster, I created a new csv file with selected samples that they were not interfering with each other (4 control and 4 carriers) and I ended up with 60 DEG (padj<0.05). So, depending which samples I include or exclude the outcome is different (which does make sense to me).

However, I am doing this DESeq2 analysis for our group and they are insisting on doing the analysis using all samples. I also tried to do remove hidden batch effect using sva package (got two surrogate variables SV1 and SV2) and re-done the DESeq2 analysis as suggested in bioconductor and still did not get more statistically significant DEGs when using all cell lines (the command lines are at the end of this message). However, by removing the hidden batch analysis from my groups of 4 control and 4 carriers, the number of 60 significant DEGs decreased to 46 DEGs. Should not I have been getting more significantly DEGs?

So, my questions are:

1. Did I correctly grouped the samples? Status had two groups (Control =7 samples and Carrier=8 samples) and I did one factor analysis.

se <- summarizeOverlaps(features=ebg, reads=bamfiles, 
                        mode="Union",
                        singleEnd=TRUE,
                        ignore.strand=FALSE,
                        fragments=FALSE )
se
dim(se)
assayNames(se)
head(assay(se), 3)
colSums(assay(se))
rowRanges(se)
str(metadata(rowRanges(se)))
colData(se)
(colData(se) <- DataFrame(sampleTable))
se$Status
se$Status <- relevel(se$Status, "Control")
round( colSums(assay(se)) / 1e6, 1 )
colData(se)
library("DESeq2")
dds <- DESeqDataSet(se, design = ~ Status)
countdata <- assay(se)
head(countdata, 3)
coldata <- colData(se)
(ddsMat <- DESeqDataSetFromMatrix(countData = countdata,
                                  colData = coldata,
                                  design = ~ Status))
nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)
rld <- rlog(dds, blind=FALSE)
par( mfrow = c( 1, 2 ) )
dds <- estimateSizeFactors(dds)
dds <- DESeq(dds)
(res <- results(dds))
mcols(res, use.names=TRUE)
summary(res)
res.05 <- results(dds, alpha=.05)
table(res.05$padj < .05)

2. Why DESeq2 analysis did not result in more DEG after removing hidden batch effect and whether is this approach is correct or appropriate for my type of analysis.

3. I am not sure if using below command lines, I removed batch effects or just estimated the sabotage groups? sorry for the silly question, but as I mentioned these are all new to me and I am a bit confused by option "comBat" in sva package.

4. Is there any other approaches I can take to keep all the samples and still get rid of variations between the samples?Sorry for my long message. I am kind of stuck here and I really appreciate your advice.

My command lines for removing hidden batch effects were:

library("sva")
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ Status, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
par(mfrow=c(2,1),mar=c(3,5,3,1))
png(filename = "stripchart1.png")
stripchart(svseq$sv[,1] ~ dds$CellLine,vertical=TRUE,main="SV1")
abline(h=0)
dev.off()
png(filename = "stripchart2.png")
stripchart(svseq$sv[,2] ~ dds$CellLine,vertical=TRUE,main="SV2")
abline(h=0)
dev.off()
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + Status
ddssva <- DESeq(ddssva)
library("DESeq2")
ddssva <- DESeq(ddssva)
(res_sva <- results(ddssva))
mcols(res_sva, use.names=TRUE)

summary(res_sva)
res_sva.05 <- results(ddssva, alpha=.05)
table(res_sva.05$padj < .05)
library("AnnotationDbi")
library("org.Hs.eg.db")
columns(org.Hs.eg.db)
res_sva$symbol <- mapIds(org.Hs.eg.db,
                  keys=row.names(res_sva),
                  column="SYMBOL",
                  keytype="ENSEMBL",
                  multiVals="first")
res_sva$entrez <- mapIds(org.Hs.eg.db,
                  keys=row.names(res_sva),
                  column="ENTREZID",
                  keytype="ENSEMBL",
                  multiVals="first")

res_svaOrdered <- res_sva[order(res_sva$padj),]
head(res_svaOrdered)
(res_svaGR <- results(ddssva, lfcThreshold=1, format="GRanges"))
res_svaGR$symbol <- mapIds(org.Hs.eg.db, names(res_svaGR), "SYMBOL", "ENSEMBL")
res_svaOrderedDF <- as.data.frame(res_svaOrdered)[1:1600,]
write.csv(res_svaOrderedDF, file="results_sva.csv")

Thanks in advance, Leila

deseq2 • 1.5k views

ADD COMMENT • link 8.2 years ago Leila.Eshraghi • 0

0

Entering edit mode

I'm not sure why you say you have no replicates. You have two groups of samples with every sample collected from a different person. That's exactly what biological replicates are.

ADD REPLY • link 8.2 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Hi Ryan,

Thank you for the recommendations. I'll try this approach and hopefully I'll get a better outcome. I also looked into PCA plots as you suggested and the groups were separated in PC5 and PC6. I know that my samples can be considered as biological replicates but because the replicates are from different human samples, the variations between replicates are much larger that what we can expect between cell lines. These samples were also collected at different stages of cancer development, from different age group, different stages of treatment and one replicate in carrier group has not developed cancer yet (still healthy) and only has the carrier gene. And because every single of them were different in the group, I could not do the two factor DESeq2 analysis also.

Thanks again for your helpful suggestions, Leila

ADD REPLY • link 8.2 years ago Leila.Eshraghi • 0

score 2 · Answer 1 · 2016-11-18

SVA is not guaranteed to always give you more differentially expressed genes. The purpose of SVA is to remove variations that appear systematically in many genes. It cannot magically separate samples that cluster together. However, you should look at some PCA plots, in particular going beyond the first 2 principal components. Also look at PCs 3 through 8 or so and see if the controls and carriers separate on any of those axes. If they do, then there is at least some hope for finding differential expression, although it might not be a strong enough signal to detect at a 5% FDR. If the groups don't separate on any PC axis, then there's probably nothing you can do.

One possible option is to keep the "bad" samples in your model for dispersion estimation but exclude them from the control-vs-carrier contrast. This model will estimate essentially the same fold changes as the model that excluded the samples, but will have more residual degrees of freedom. To do this, change your Status factor to have 3 levels: control, carrier, and ignore. (Or possibly 4 levels: control, carrier, ignoreControl, and ignoreCarrier.) Then, when you use the results function, specifically request the contrast between carrier and control. Also, note that if you use this model or the model that you described in which the "bad" samples are simply removed, you are "double dipping" in the sense that you are using the same data to both inform the model and fit the model. This opens you up to a potentially much higher rate of false positives than the adjusted p-values would imply.