Hello everyone,
I have some RNASeq data, that has been spiked-in with exogenous ERCC controls. I have the counts from htseq-count and now I want to normalize + test for differential expression. For this, I am using two methods: DESeq and limma+voom. Below is my code for both the tests. Can someone tell me if what I am doing is correct or not? I do think that I am slightly mistaken in the voom code.
##### Using DESeq library(DESeq) # get the size factors from the spiked-ins # dat.spike.in is count data of only spike ins cds <- newCountDataSet(countData = dat.spike.in, conditions = as.character(condition)) cds <- estimateSizeFactors(cds) nf <- sizeFactors(cds) # add the size factors to data without spike-ins # dat.without.spike.in is count data where spike-ins are removed cds <- newCountDataSet(countData = dat.without.spike.in, conditions = as.character(condition)) sizeFactors(cds) <- nf # differential expression test cds <- estimateDispersions( cds, method= 'pooled', sharingMode='maximum', fitType='local') deseq.res <- nbinomTest( cds, "A", "B") ##### Using Limma + Voom library(limma) library(edgeR) groups = as.factor(condition) # conditions design <- model.matrix(~0+groups) colnames(design) = levels(groups) nf <- calcNormFactors(dat.spike.in) # calculation norm factors using spike ins only voom.norm <- voom(dat.without.spike.in, design, lib.size = colSums(dat.spike.in) * nf) # add that to voom
Awaiting your input and suggestions!
Thanks!
P.S: Crossposted on Biostars to get inputs from both communities.
Thank you for your detailed response. Not to sound naive but if spike-ins are not reliable for normalization, what is their purpose? What do people generally do with them?
Well, spike-ins can be used to assess accuracy and technical performance, see
http://www.ncbi.nlm.nih.gov/pubmed/25254650
and
http://rd.springer.com/chapter/10.1007%2F978-3-319-07212-8_9
A number of companies have been advocating spike-ins for normalization, for example:
https://cofactorgenomics.com/6-changes-thatll-make-big-difference-rna-seq-part-3/
and no doubt a lot of biologists have followed this advice. We went through the same process with microarrays 15 years ago, whereby spike-ins were proposed for normalization but the genomic methods proposed by statistical bioinformaticians gradually took over. There may be special situations, e.g., where all or most genes are DE, in which normalization by spike-ins is a good idea.
Thank you, this really helps.