Question

DESeq2 Time course analysis design - only time points WITHOUT treatments

1

Entering edit mode

Maithê Barros ▴ 10

@maithe-barros-13840

Last seen 4.3 years ago

Hello,

I am doing RNA-seq analysis of differential gene expression from mare's endometrial biopsies. I do not have any treatments though, and this is the reason I am having a hard time trying to figure out the design I should use as all examples I found so far had treatments + time points.

I have the first time point of 0h which is my control (endometrial biopsy from a mare) then 24h, 48h and 72h. All I want to do is to understand what happens over the time points, without any treatment. I have 8 animals and for each animal I have the 4 timepoints. My metadata has "sample_id" column with my bam files,"timepoint" column and "horse_id" column.

My coding looks like this:

txdb <- makeTxDbFromGFF("Equus_caballus.EquCab2.89.gtf", format="gtf")
ebg <- exonsBy(txdb, by="gene")
se <- summarizeOverlaps(features=ebg, reads=bamfiles,
                        mode="Union",
                        singleEnd=FALSE,
                        ignore.strand=TRUE,
                        fragments=TRUE )
dds <- DESeqDataSet(se, design = ~ timepoint)
dds <- DESeq(dds)
res_0hand72h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "control_0h", "timepoint_72h"))
res_0hand24h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "control_0h", "timepoint_24h"))
res_24hand72h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "timepoint_24h", "timepoint_72h"))
res_48hand72h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "timepoint_48h", "timepoint_72h"))
res_24hand48h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "timepoint_24h", "timepoint_48h"))
table(res_0hand72h_LFC1$padj < 0.1)
table(res_0hand24h_LFC1$padj < 0.1)
table(res_24hand72h_LFC1$padj < 0.1)
table(res_48hand72h_LFC1$padj < 0.1)
table(res_24hand48h_LFC1$padj < 0.1)

Running the above I got results comparing two time points at a time (false rate discovery set for 0.1% with a >2 fold change). Results look ok to me. E.g. When comparing control (0h) to 72h time point a total of 22542 genes were expressed, 824 of which were shown to be statistically significantly differentially up expressed and 785 shown to be statistically significantly differentially down expressed, both at a false discovery rate (FDR) of 0.1% with a fold change greater or equal to 2.

BUT I wanted to have one result comparing all time points, but I am not sure if that's doable/reasonable?

first doubt about my design:

I am not sure whether I need to add my horse_id to the design, maybe
dds <- DESeqDataSet(se, design = ~ treatment + horse_id)
or
dds <- DESeqDataSet(se, design = ~ treatment + horse_id + treatment:horse_id)
I do not know whether or not to use the interaction between treatment and horse_id.

second doubt about the time series analysis:

Should I use nbimLRT() function to test the significance of multiple coefficients at once?

I tried:

ddsLRT <- DESeqDataSet(se, design = ~ treatment*horse_id)
design(ddsLRT) <- formula(~ treatment*horse_id)
ddsLRT <- estimateSizeFactors(ddsLRT)
ddsLRT <- estimateDispersions(ddsLRT)
ddsLRT <- nbinomLRT(ddsLRT, reduced = formula(~ treatment))
resLRT <- results(ddsLRT)
table(resLRT$padj < 0.1)

and the result was FALSE 22542 genes, meaning that none of the expressed genes were statistically significantly expressed, which is a results different from the pairwise analysis of time points separatedely.

I hope I made myself clear and I really look forward to receiving any help/thoughts on this.

Many thanks,

Maithe Barros

rna-seq deseq2 multiple time points • 5.9k views

ADD COMMENT • link updated 7.6 years ago by Michael Love 43k • written 7.6 years ago by Maithê Barros ▴ 10

score 2 · Answer 1 · 2017-08-28

2

Entering edit mode

Michael Love 43k

@mikelove

Last seen 6 days ago

United States

I'd recommend a LRT with a full design of ~horse + time and a reduced design of ~horse.

A couple notes here: I like to use short variable names without "_", this will reduce errors potentially (e.g. R doesn't allow "_" in column names, in case you ever want to store the results as a data.frame or the like).

The design above controls for differences between horse at time=0, and then tests to see if there are any differences at any time point. So you get a single p-value (although there are multiple coefficients fitted, one for each time point other than t=0). See the LRT section of the vignette and ?results which has a specific section on the LRT.

ADD COMMENT • link 7.6 years ago Michael Love 43k

0

Entering edit mode

Thanks a lot for your reply, Michael. I have renamed my variables.I ran:

ddsLRT <- DESeqDataSet(se, design = ~ horse + timepoint)
ddsLRT <- estimateSizeFactors(ddsLRT)
ddsLRT <- estimateDispersions(ddsLRT)
ddsLRT <- nbinomLRT(ddsLRT, reduced = ~ horse)
resLRT <- results(ddsLRT)
table(resLRT$padj < 0.1)

and my result was

FALSE TRUE
6812 11837

log2 fold change (MLE): timepoint time72h vs control0h
LRT p-value: '~ horse + timepoint' vs '~ horse'
DataFrame with 26991 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSECAG00000000001 222.153134 -0.2119773 0.1507054 5.857054 0.1187757742 0.174482033
ENSECAG00000000002 227.264520 0.2389474 0.1314716 4.081881 0.2527571757 0.336834970
ENSECAG00000000003 43.521031 0.5022090 0.2916049 8.196995 0.0421111066 0.069049381
ENSECAG00000000004 123.625756 -0.2415447 0.1482973 16.996629 0.0007078716 0.001628157
ENSECAG00000000005 8.405233 -0.7764470 0.4533185 4.009984 0.2603880344 0.345326159
... ... ... ... ... ... ...
ENSECAG00000027698 137.2158 -1.056968 3.548143 0.2138203 9.753280e-01 9.856341e-01
ENSECAG00000027699 5487.5252 3.261664 0.400470 61.2228495 3.220709e-13 2.356336e-12

but when I try resultsNames(ddsLRT) the output is character(0). Therefore I think there is something wrog.

I also tried:

ddsTC <- DESeqDataSet(se, design = ~ horse + timepoint)
ddsTC <- DESeq(ddsTC, test="LRT", full= ~ horse + timepoint, reduced = ~ horse)
resTC <- results (ddsTC)

and the result was the same:
FALSE TRUE
6812 11837

log2 fold change (MLE): timepoint time72h vs control0h
LRT p-value: '~ horse + timepoint' vs '~ horse'
DataFrame with 26991 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSECAG00000000001 222.153134 -0.2119773 0.1507054 5.857054 0.1187757742 0.174482033
ENSECAG00000000002 227.264520 0.2389474 0.1314716 4.081881 0.2527571757 0.336834970
ENSECAG00000000003 43.521031 0.5022090 0.2916049 8.196995 0.0421111066 0.069049381
ENSECAG00000000004 123.625756 -0.2415447 0.1482973 16.996629 0.0007078716 0.001628157
ENSECAG00000000005 8.405233 -0.7764470 0.4533185 4.009984 0.2603880344 0.345326159
... ... ... ... ... ... ...
ENSECAG00000027698 137.2158 -1.056968 3.548143 0.2138203 9.753280e-01 9.856341e-01
ENSECAG00000027699 5487.5252 3.261664 0.400470 61.2228495 3.220709e-13 2.356336e-12

but again with resultsNames(resTC) the output is character(0).

I did read the LRT section and ?results but I still do not understand why I do not have any info from resultsNames, to then generate results tables for individual effects, which must be individual elements of resultsNames. Is this indeed incorrect or am I missing something?

ADD REPLY • link 7.6 years ago Maithê Barros ▴ 10

0

Entering edit mode

You need to copy paste exactly the code you ran if you obtained an unexpected output. resultsNames() should be run on a DESeqDataSet (dds) not on a DESeqResults object (res). See ?resultsNames for the usage and example code.

ADD REPLY • link 7.6 years ago Michael Love 43k

0

Entering edit mode

Hi Michael, following on from this post and the LRT section of your DeSeq2 vignette, I was wondering if you can help answer my query here: https://support.bioconductor.org/p/118722/

Appreciate your time Michael.

ADD REPLY • link 6.1 years ago kaiser.karim • 0