Dear Michael,
Please accept my apologize in advance for such a lengthy question. I have a data set of 200 RNAseq samples from an experiment on two treatments(trtR/trtNR) and the sampling is done at 7 different time-points. I have different number of replicates: for trtR two replicates and for trtNR 5 replicates for each time-points (some time number of replicate is more).
I want to employ a generalized linear model that allows me to quantify the genes that are differentially expressed at each time point and also across all time points. Then a likelihood ratio test to identify differentially expressed genes. I learned from this question https://www.biostars.org/p/133304/ that if I want to have linear affect of time on gene expression, I need to introduce time as a numeric covariate rather than a factor. Therefore, here is my model:
>Time=as.numeric(sampleTable3$Timepoint) >Trt_type=sampleTable3$Treatment_type >coldata <- data.frame(Time, Treatment_type) >rownames(coldata) <- colnames(txi$counts) >ddsTxi_time <- DESeqDataSetFromTximport(txi, colData=coldata, design=~ Time + Trt_type:Time + Trt_type) >colData(ddsTxi_time)$Time<-factor(colData(ddsTxi_time)$Time, levels=c("0", "4", "24", "72", "336", "1008","2352")) >ddsTxi_time= ddsTxi_time[rowSums(counts(ddsTxi_time))>1,] >dds_time <- DESeq(ddsTxi_time, test="LRT", reduced = ~ 1) >res_time <- results(dds_time) >colData(dds_time) DataFrame with 63 rows and 3 columns Time2 Treatment_Response2 replaceable <numeric> <factor> <logical> F02242_NR_72h_23 72 NR FALSE F02245_NR_14w_23 2352 NR FALSE ... ... ... ... F02382_R_14w_29 2352 R TRUE F02384_R_72h_29 72 R FALSE >resultsNames(dds_time) [1] "Intercept" "Time" [3] "Trt_typeR_vs_NR" "Time.Trt_type.R" mcols(res_time5)$description [1] "mean of normalized counts for all samples" [2] "log2 fold change (MLE): Time.Trt_type.R" [3] "standard error: Time.Trt_type.R" [4] "LRT statistic: '~ Time + Trt_type:Time + Trt_type' vs '~ 1'" [5] "LRT p-value: '~ Time + Trt_type:Time + Trt_type' vs '~ 1'" [6] "BH adjusted p-values"
1: Is this model correct or should I introduce treatment type as a numeric covariate as well?
2: The L2FC is coming from Time.Treatment typeR wich it is the l2fc from the genes with lowest p-value over time effect in treatmentR. Correct?
For having the effect of time on Trt_type.NR (l2fc and p-value), I did:
>results(dds_time, contrast=list(c("Time","Trt_typeR_vs_NR"))
is this correct? which this one is just giving me the l2fc in this level and not the p-values. How can I get the genes that are uniquely differentially expressed in trtNR with their corresponding p-value (or should I run the analysis two times separately for each treatment and then check the shared/unique DEGs).
The second part of my question is about removing the outlier, as you see below, I have many outliers.
>summary(res_time) out of 42823 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 373, 0.87% LFC < 0 (down) : 376, 0.88% outliers [1] : 112, 0.26% low counts [2] : 11622, 27% (mean count < 1) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results
Should I remove this outliers from my result before further analysis?
I could not access ?results because of some memory issue. Can I also define the "cooksCutoff" from this command?
maxCooks <- apply(assays(dds_time4)[["cooks"]], 1, max).
I am using the last update of DEseq2.
Many thanks in advance for your reply.
Kind regards,
Rahel
For looking at differentially expressed at each time point I'd suggest reverting to looking at incorporating the
factor
version oftime.
That way you'll be able to extract the genes that are differential at specific time-points, with the use of a contrast.For the across all time points question, a valid approach would be to use the numeric version of of time (I can't think of a valid scenario where you'd want treatment type to be numeric, but it will make no difference, I suspect). In that case, the genes significant for the "Trt_typeR_vs_NR" contrast will roughly correspond to genes that have the same time-profile, but R is consistently higher (or lower) than NR. "Time.Trt_type.R" will look find genes that, colloquially, have different gradients in the different conditions.
I think DESeq2 deals with outliers for you, so I don't think they're a cause for concern.
There are warning signs in your excerpt of code (converting a factor directly to a numeric, rather than via an intermediate character; using LRT comparing a saturated model against a constant model) that makes me think you should probably consult a local statistician to cover some of the subtleties of the analysis, as the design is sufficiently complicated and the power of DESeq2 is such that you may need to check the results you get correspond to the questions you're wanting to ask.
Dear Gavin,
Many thanks for your reply and explanation. I know if I use time as factor is more flexible but it is not any more linear ...
It's still linear, in the sense of linear model, but it's not looking for a straight-line trend - to get the "at each timepoint" differential genes, you need to use the factor approach, and to get the "across all timepoints", you'll (probably) need the numeric approach. So if you want to know which genes are differential at 24hrs (irrespective of other timepoints), then you'd be best off with a factor-based approach. If you want to know which: genes are responding in a linear manner across all timepoints; which genes have different time-gradients across the timecourse; or which genes are differential at specific timepoints interpolating inbetween your measured points (not recommended!), then you'll need a numeric variable for time.
Many thanks for your explanation. I decided to do factor based analysis and check the output for each time point separately ...