Hi,
This is my first post here and I'm new to RNA-Seq data analysis, especially different expression analysis (DEA).
I have read some surveys about "the best practise in RNA-Seq analysis". As they said, when working on DEA, first we should normalize the raw counts on GC content, batch effects, library size, other unwanted tecnical variances (using RUVSeq), etc, then we could use DEA algorithms to estimate dispersion, fit the model and calculate LogFC and P-adjust.
What make me confused are:
1、I want to do GC content normalization before DESeq2 or EdgeR pipeline, Here's my source using DESeq2:
counts_GC <- withinLaneNormalization(raw_count_matrix, "GC", which="full", offset=T)
counts_DESeq<-DESeqDataSetFromMatrix(countData = counts_GC@assayData$normalizedCounts, colData =
myfactor,design=~myfactor)
DESeq2_DEA<-DESeq(counts_DESeq,parallel = T)
res<-results(DESeq2_DEA)
As you see, I extracted the normalized counts created by withinLaneNormalization()
and put it in DESeq2 pipeline, the same I did in edgeR.
While, the user's guide suggest that both DESeq2 and edgeR expect un-normalized counts, if user need EDASeq or cqn normalization before, they should set "offset" like:
EDASeqNormFactors<-exp(-EDASeqoffset) #DESeq2
disp<-estimateDisp(counts(dataOffset),design, offset=-offst(dataOffset)) #edgeR
Does it means I can only use "offset" to pass EDASeq's results? Is my former source code incorrect?
2、 When I look for some examples using EDASeq+DESeq2 or edgeR, like https://support.bioconductor.org/p/108765/ and https://support.bioconductor.org/p/113405/ , both of them use "offset" generated from EDASeq as normalization factor, I notice that they didn't use the normalize method in classical pipeline such as Median in DESeq2 (as part of DESeq()
do ) or TMM in edgeR (as calcNormFactors()
do) anymore. If I use "offset", can Median or TMM works? I wrote a new source below, Is it the only way to normalize on GC content and library size before DESeq2 or EdgeR pipeline?
library(DESeq2)
# data is data from EDAseq example
dataOffset <- withinLaneNormalization(data,"gc",which="full",offset=TRUE)
dataOffset <- betweenLaneNormalization(dataOffset,which="full",offset=TRUE)
dds <- DESeqDataSetFromMatrix(countData = dataOffset@assayData$normalizedCounts,
colData = pData(dataOffset),
design = ~ conditions)
EDASeqNormFactors <- exp(-1 * offst(dataOffset))
normalizationFactors(dds) <- EDASeqNormFactors
# dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)
res <- results(dds)
res <- res[order(res$pvalue),]
head(res)
3、According to the user's guide of RUVSeq, I first find "Empirical control genes" by edgeR and run RUVg()
, I noticed that in the guide, the "set" in RUVg()
has been normalized before:
set <- betweenLaneNormalization(set, which="upper")
,after finding "Empirical control genes", it goes:
set2 <- RUVg(set, empirical, k=1)
It treated the normalizaed "set" as input of RUVg()
. Can I run RUVg without set normalization? Will it have a bad impact on results?
I appreciate your help,
Yang