Question

Issues with DESeq2 LRT for time-course experiment

0

Entering edit mode

luke.briody ▴ 10

@lukebriody-24065

Last seen 4.1 years ago

Hi all,

I am a high school student interning in a bioinformatics laboratory this summer, and I'm new to R, RNA-seq, etc. I'm analyzing data from an experiment looking at gene expression changes between two conditions over the course of three timepoints. I attempted to use DESeq2's LRT option in order to identify genes with condition-specific temporal expression changes, but only had one significant result. When another member of my lab used LRT, he received several hundred significant results. I've pasted both of our scripts (and relevant outputs) below -- can anyone see what might have caused such a drastic difference in our results?

For reference, counts.tsv is the gene expression matrix, and coldata.tsv is a matrix with samples as rows, Condition & Time as columns, and individual matrix elements are a sample's condition or collection timepoint. Coldata.tsv is identical to the matrix created using cbind() in the second script.

My script & outputs:

# Data Selection & Prep
cts <- as.matrix(read.csv(file.choose(), sep = "\t", row.names = "gene_id")) # select counts.tsv
coldata <- read.csv(file.choose(), sep = "\t", row.names = "Samples") # select coldata.tsv
Time <- factor(coldata$Time, levels = c("3","7","14"))
Condition <- factor(coldata$Condition, levels = c("A", "B"))
all(rownames(coldata) == colnames(cts))
*[1] TRUE*

# LRT
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ Condition + Time + Condition:Time)
*Warning message:*
In DESeqDataSet(se, design = design, ignoreRank) :
  some variables in design formula are characters, converting to factors
dds_lrt <- DESeq(dds, test = "LRT", reduced = ~ Condition + Time)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

# Extracting Results
res_lrt <- results(dds_lrt)
res_lrt <- res_lrt[order(res_lrt$pvalue), ]
res_lrt
log2 fold change (MLE): ConditionB.Time7
LRT p-value: '~ Condition + Time + Condition:Time' vs '~ Condition + Time'
DataFrame with 55291 rows and 6 columns
         baseMean log2FoldChange     lfcSE      stat      pvalue      padj
        <numeric>      <numeric> <numeric> <numeric>   <numeric> <numeric>
Dhh       305.439      -0.318403  0.220665   28.5176 6.41922e-07 0.0234160
Id1       328.252       0.522730  0.351085   24.5935 4.56646e-06 0.0832876
Pdgfra    256.663       1.174865  0.628269   16.9216 2.11605e-04 1.0000000
Aldh1a7   102.727      -0.110197  0.517990   16.8907 2.14897e-04 1.0000000
Ahrr      655.521      -0.205311  0.287271   16.8163 2.23040e-04 1.0000000
...           ...            ...       ...       ...         ...       ...
Gm21748         0             NA        NA        NA          NA        NA
mt-Ty           0             NA        NA        NA          NA        NA
mt-Td           0             NA        NA        NA          NA        NA
mt-Tk           0             NA        NA        NA          NA        NA
mt-Th           0             NA        NA        NA          NA        NA
sig_lrt <- subset(res_l, baseMean>20 & padj<0.05)
sig_lrt
baseMean log2FoldChange     lfcSE    stat       pvalue       padj
Dhh 305.4386     -0.3184035 0.2206646 28.5176 6.419222e-07 0.02341604

Another lab member's script & outputs:

 library(DESeq2)
counts<-read.table("counts.tsv")
design<- cbind(condition=c('A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B'), samples=c("A30.R_S2","A43.L_S13","A45.R_S16","A32.R_S4","A33.L_S8","A43.RL_S14","A32.L_S5","A35.L_S11","A46.R_S17","A30.N_S1","A30.RL_S3","A33.RR_S10","A32.RL_S6","A33.RL_S9","A44.R_S15","A32.RR_S7","A35.RL_S12","A46.L_S18"), time=c('3','3','3','7','7','7','14','14','14','3','3','3','7','7','7','14','14','14')) 
 full_model <- ~condition + time + condition:time
reduced_model <- ~condition + time 
dds<-DESeqDataSetFromMatrix(countData=counts,colData=design, design= ~ condition + time + condition:time)
dds_lrt_time <- DESeq(dds, test="LRT", reduced = ~ condition + time)
res<-results(dds_lrt_time)
sig <- subset(res, baseMean>20 & padj < 0.05 )
log2 fold change (MLE): conditionB.time7 
LRT p-value: '~ condition + time + condition:time' vs '~ condition + time' 
DataFrame with 342 rows and 6 columns
         baseMean log2FoldChange     lfcSE      stat      pvalue        padj
        <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
Xkr4      44.2752       0.951242  0.521092  27.89740 8.75299e-07 7.09958e-05
Gm19938   32.8723      -2.692142  0.982106  12.29223 2.14178e-03 1.82805e-02
Sntg1     83.0165       0.629027  0.339907   9.57564 8.33058e-03 4.65789e-02
Eya1      40.5863      -0.668471  0.537290   9.47682 8.75254e-03 4.79500e-02
Gm29216  223.6502      -0.650962  0.362693   9.93755 6.95166e-03 4.11904e-02
...           ...            ...       ...       ...         ...         ...
Eif2s3y   509.392      -1.962028  2.566384   18.8918 7.90119e-05 1.79711e-03
Uty       165.199      -2.873323  3.045134   14.5634 6.88029e-04 8.24535e-03
Ddx3y     500.055      -2.019335  2.229747   16.1066 3.18052e-04 4.75764e-03
Gm47283   198.531      -1.234294  0.374524   36.3764 1.26170e-08 3.34519e-06
mt-Rnr2 42367.112      -0.627189  0.283350   11.4852 3.20650e-03 2.40670e-02

Any help on this would be greatly appreciated! The other lab member and I are both new to NGS analyses, so we are uncertain of how to proceed. We hope that some more experienced bioinformaticians might be able to guide us on the right path, as we are also the first people in our lab to perform these types of analyses. Thank you!

Kind regards, Luke Briody

rna-seq R DESeq2 LRT • 2.2k views

ADD COMMENT • link updated 4.3 years ago by Kevin Blighe ★ 4.0k • written 4.3 years ago by luke.briody ▴ 10

0

Entering edit mode

Cross-posted: https://www.biostars.org/p/458040/

ADD REPLY • link 4.3 years ago Kevin Blighe ★ 4.0k

score 1 · Answer 1 · 2020-08-27

1

Entering edit mode

Kevin Blighe ★ 4.0k

@kevin

Last seen 22 days ago

Republic of Ireland

The two code chunks are essentially the same - can you clarify whether or not you are both processing the same dataset, or are these different datasets and you are providing some sample data in your question?

Your code:

dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ Condition + Time + Condition:Time)
dds_lrt <- DESeq(dds, test = "LRT", reduced = ~ Condition + Time)
res_lrt <- results(dds_lrt)

Your colleague's code:

dds<-DESeqDataSetFromMatrix(countData=counts,colData=design, design= ~ condition + time + condition:time)
dds_lrt_time <- DESeq(dds, test="LRT", reduced = ~ condition + time)
res<-results(dds_lrt_time)

So, the input data must be different. I presume that you have performed some filter on the raw counts, whereas your colleague has not (?) - note that the top genes in your colleague's results have low base mean..

Note that, in the current way that it is setup (the design and reduced formulae), you will be getting the genes that show a differential expression over Time across Condition.

If, instead, you used design = ~ Condition + Time and reduced = ~ Condition, then you would get the genes that show differential expression across just time, with no relation to Condition.

Kevin

ADD COMMENT • link 4.3 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

Hi Kevin,

Sorry I'm just seeing this! Thank you so much for taking the time to help. We are indeed processing the same dataset, which is why we were shocked by the drastically different results. I'm waiting for confirmation on my colleague's part, but as far as I am aware, we pre-processed the count matrix in exactly the same manner.

Regarding the design formula: we are looking for genes affected by RNAi-induced knockdown of gene A, compared against models transfected with siRNA designed to target luciferase (which is not a gene expressed in our model organisms) as controls. Specifically, we want to see identify which genes have siRNA-specific changes in expression over time, so I am fairly certain we used the correct full and reduced models. Please correct me if I am wrong with this, but based on Michael Love's differential expression workflow, this is what we thought made the most sense for our goals.

Thanks again! -- Luke

ADD REPLY • link 4.3 years ago luke.briody ▴ 10

0

Entering edit mode

Cool, if it is the exact same dataset, then there must be differences in the pre-filtering steps(s). The DESeq2 version may also be different, but this would not account for such differences, I think.

Both need to be aware of lfcShrinkage, too: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#log-fold-change-shrinkage-for-visualization-and-ranking

ADD REPLY • link 4.3 years ago Kevin Blighe ★ 4.0k