Hello,
I have tried to understand how DESeq2 calculates the Log2FoldChange. I extracted the normalised counts from dds like below, calculated the mean of treated and tried to find the log2FC according to the formula: log2(treated/control)
. But the log2FC I get using this method is different the one I get using DESeq2. Moreover, it seems like log2FC reported with DESeq2 is calculated based of difference of mean treated and control, like this: log2FC=mean(treated)-mean(control)
. Why is that? Am i missing something?
newdata <- as.data.frame(cbind(dds@rowRanges@elementMetadata@listData$conditionTREATED1,dds@rowRanges@elementMetadata@listData$conditionTREATED2,dds@rowRanges@elementMetadata@listData$conditionTREATED3,dds@rowRanges@elementMetadata@listData$conditionCTRL1))
newdata$mean <- apply(newdata[,1:3],1, mean)
newdata$log <- log2(newdata$mean/newdata$conditionCTRL1)
For example using the example below:
CTRL1 mean_treated
12.30 12.44
With my calculation I get log2(12.44/12.30)=0.016
DESeq2 gives this value 0.14, which is the difference between 12.44-12.30=0.14
This is my script:
setwd("path/to/work_dir/")
sampledata <- data.frame(run = c("RNA_1","RNA_2","RNA_3","RNA_4","RNA_5","RNA_6","RNA_7","RNA_8","RNA_9","RNA_10","RNA_11","RNA_12"), condition = c("TREATED1","TREATED1","TREATED1","TREATED2","TREATED2","TREATED2","TREATED3","TREATED3","TREATED3","CTRL1","CTRL1","CTRL1"))
sampledata$condition <- factor(sampledata$condition)
table(sampledata$condition)
#Read all RNAseq files
dir <- ("/path/to/mydir")
list.files(dir)
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
files <- file.path(dir, samples$run, "quant.sf")
names(files) <- c(1:12)
all(file.exists(files))
#Create a tx2gene file
txdb <- makeTxDbFromGFF(file="gencode.v39.annotation.gtf")
saveDb(x=txdb, file = "gencode.v39.annotation.TxDb")
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- AnnotationDbi::select(txdb, k, "GENEID", "TXNAME")
head(tx2gene)
#Create differential expression
txi.g <- tximport(files, type="salmon", tx2gene=tx2gene)
tcs <- txi.g$counts
mm = model.matrix(~0+condition, sampledata)
dds <- DESeqDataSetFromTximport(txi.g, colData=sampledata, design=mm)
nm <- assays(dds)[["avgTxLength"]]
sf <- estimateSizeFactorsForMatrix(counts(dds)/nm)
#Filter for genes/transcripts that have >= 10 reads
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
sapply(split(colSums(counts(dds))/1e6, sampledata$condition), summary)
#Calculate dds
dds <- DESeq(dds)
res <- DESeq2::results(dds,
contrast = list(c("conditionTREATED1","conditionTREATED2","conditionTREATED3"), "conditionCTRL1"),
listValues = c(1/3,-1)
)
contrast = (list(c("conditionTREATED1","conditionTREATED2","conditionTREATED3"),( "conditionCTRL1",listValues = c(1/3,-1)))
resLFC <- lfcShrink(dds, contrast=contrast, res=res, type="ashr")
I added an example, may be my explanation was not completely clear.
I'm not sure what those numbers you posted are, but the point of logs is to avoid long division. You say you are doing doing log(a/b), but log(a) - log(b) is the same thing, so if those numbers are the average of the logged counts, then subtracting one from the other will give you the ratio.
Thank you very much for taking your time and answering. I did not write that the difference is between logs. For me It is obvious that log(a/b) and log(a)-log(b) is the same thing. If you could I suggest you to read better the question, if it is not clear please just ask me clarifications. I really need to understand the problem I posted above. Thanks again.