Hello,
I am doing RNA-seq analysis of differential gene expression from mare's endometrial biopsies. I do not have any treatments though, and this is the reason I am having a hard time trying to figure out the design I should use as all examples I found so far had treatments + time points.
I have the first time point of 0h which is my control (endometrial biopsy from a mare) then 24h, 48h and 72h. All I want to do is to understand what happens over the time points, without any treatment. I have 8 animals and for each animal I have the 4 timepoints. My metadata has "sample_id" column with my bam files,"timepoint" column and "horse_id" column.
My coding looks like this:
txdb <- makeTxDbFromGFF("Equus_caballus.EquCab2.89.gtf", format="gtf") ebg <- exonsBy(txdb, by="gene") se <- summarizeOverlaps(features=ebg, reads=bamfiles, mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE ) dds <- DESeqDataSet(se, design = ~ timepoint) dds <- DESeq(dds) res_0hand72h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "control_0h", "timepoint_72h")) res_0hand24h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "control_0h", "timepoint_24h")) res_24hand72h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "timepoint_24h", "timepoint_72h")) res_48hand72h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "timepoint_48h", "timepoint_72h")) res_24hand48h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "timepoint_24h", "timepoint_48h")) table(res_0hand72h_LFC1$padj < 0.1) table(res_0hand24h_LFC1$padj < 0.1) table(res_24hand72h_LFC1$padj < 0.1) table(res_48hand72h_LFC1$padj < 0.1) table(res_24hand48h_LFC1$padj < 0.1)
Running the above I got results comparing two time points at a time (false rate discovery set for 0.1% with a >2 fold change). Results look ok to me. E.g. When comparing control (0h) to 72h time point a total of 22542 genes were expressed, 824 of which were shown to be statistically significantly differentially up expressed and 785 shown to be statistically significantly differentially down expressed, both at a false discovery rate (FDR) of 0.1% with a fold change greater or equal to 2.
BUT I wanted to have one result comparing all time points, but I am not sure if that's doable/reasonable?
first doubt about my design:
I am not sure whether I need to add my horse_id to the design, maybe
dds <- DESeqDataSet(se, design = ~ treatment + horse_id)
or
dds <- DESeqDataSet(se, design = ~ treatment + horse_id + treatment:horse_id)
I do not know whether or not to use the interaction between treatment and horse_id.
second doubt about the time series analysis:
Should I use nbimLRT() function to test the significance of multiple coefficients at once?
I tried:
ddsLRT <- DESeqDataSet(se, design = ~ treatment*horse_id) design(ddsLRT) <- formula(~ treatment*horse_id) ddsLRT <- estimateSizeFactors(ddsLRT) ddsLRT <- estimateDispersions(ddsLRT) ddsLRT <- nbinomLRT(ddsLRT, reduced = formula(~ treatment)) resLRT <- results(ddsLRT) table(resLRT$padj < 0.1)
and the result was FALSE 22542 genes, meaning that none of the expressed genes were statistically significantly expressed, which is a results different from the pairwise analysis of time points separatedely.
I hope I made myself clear and I really look forward to receiving any help/thoughts on this.
Many thanks,
Maithe Barros
Thanks a lot for your reply, Michael. I have renamed my variables.I ran:
ddsLRT <- DESeqDataSet(se, design = ~ horse + timepoint)
ddsLRT <- estimateSizeFactors(ddsLRT)
ddsLRT <- estimateDispersions(ddsLRT)
ddsLRT <- nbinomLRT(ddsLRT, reduced = ~ horse)
resLRT <- results(ddsLRT)
table(resLRT$padj < 0.1)
and my result was
FALSE TRUE
6812 11837
log2 fold change (MLE): timepoint time72h vs control0h
LRT p-value: '~ horse + timepoint' vs '~ horse'
DataFrame with 26991 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSECAG00000000001 222.153134 -0.2119773 0.1507054 5.857054 0.1187757742 0.174482033
ENSECAG00000000002 227.264520 0.2389474 0.1314716 4.081881 0.2527571757 0.336834970
ENSECAG00000000003 43.521031 0.5022090 0.2916049 8.196995 0.0421111066 0.069049381
ENSECAG00000000004 123.625756 -0.2415447 0.1482973 16.996629 0.0007078716 0.001628157
ENSECAG00000000005 8.405233 -0.7764470 0.4533185 4.009984 0.2603880344 0.345326159
... ... ... ... ... ... ...
ENSECAG00000027698 137.2158 -1.056968 3.548143 0.2138203 9.753280e-01 9.856341e-01
ENSECAG00000027699 5487.5252 3.261664 0.400470 61.2228495 3.220709e-13 2.356336e-12
but when I try resultsNames(ddsLRT) the output is character(0). Therefore I think there is something wrog.
I also tried:
ddsTC <- DESeqDataSet(se, design = ~ horse + timepoint)
ddsTC <- DESeq(ddsTC, test="LRT", full= ~ horse + timepoint, reduced = ~ horse)
resTC <- results (ddsTC)
and the result was the same:
FALSE TRUE
6812 11837
log2 fold change (MLE): timepoint time72h vs control0h
LRT p-value: '~ horse + timepoint' vs '~ horse'
DataFrame with 26991 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSECAG00000000001 222.153134 -0.2119773 0.1507054 5.857054 0.1187757742 0.174482033
ENSECAG00000000002 227.264520 0.2389474 0.1314716 4.081881 0.2527571757 0.336834970
ENSECAG00000000003 43.521031 0.5022090 0.2916049 8.196995 0.0421111066 0.069049381
ENSECAG00000000004 123.625756 -0.2415447 0.1482973 16.996629 0.0007078716 0.001628157
ENSECAG00000000005 8.405233 -0.7764470 0.4533185 4.009984 0.2603880344 0.345326159
... ... ... ... ... ... ...
ENSECAG00000027698 137.2158 -1.056968 3.548143 0.2138203 9.753280e-01 9.856341e-01
ENSECAG00000027699 5487.5252 3.261664 0.400470 61.2228495 3.220709e-13 2.356336e-12
but again with resultsNames(resTC) the output is character(0).
I did read the LRT section and ?results but I still do not understand why I do not have any info from resultsNames, to then generate results tables for individual effects, which must be individual elements of resultsNames. Is this indeed incorrect or am I missing something?
You need to copy paste exactly the code you ran if you obtained an unexpected output. resultsNames() should be run on a DESeqDataSet (dds) not on a DESeqResults object (res). See ?resultsNames for the usage and example code.
Hi Michael, following on from this post and the LRT section of your DeSeq2 vignette, I was wondering if you can help answer my query here: https://support.bioconductor.org/p/118722/
Appreciate your time Michael.