Hi,
I have a CASE/CONTROL RNA-seq dataset with three time point measurements (PRE/MID/POST) for both groups. We're trying to test whether there's a difference in expression between the two groups in any of the the time points. From physiological point of view we hypothesize that there's no big difference between PRE and POST measurements between the groups but we should see a difference in the MID timepoint. In total we have little over 100 RNA-lib samples so theres quite a few biological replicates but no technical replicates. We have eliminated possible outliers based on PCA and heatmaps and are using RUVseq to eliminate overestimated variance in the data.
There's a good example of the model in DESeq2 vignette but we also want to account for the within sample variability (I think). In our coldata we have CASE/CONTROL status, PRE_MID_POST status and SUBJECT_ID (how we can see that to whom three RNA- library samples belong to).
I think the right model to test time course case/control setup should be, full: ~ CASE_CONTROL + PRE_MID_POST + CASE/CONTROL and reduced: ~ CASE_CONTROL + PRE_MID_POST, using the LRT-test in DESeq2. However, the p-values aren't calculated correctly and the results aren't quite what you would expect because it seem that DESeq2 doesn't calculate for the outliers or low count genes at all etc. I'm not quite sure if this is because we should also account for the SUBJECT_ID in the model so the within subject variability would be accounted for. I tried this by modelling and also by creating a separate model matrix according to the vignette (nested within sample variability) but the full=design() argument doesn't work:Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘design’ for signature ‘"matrix"’.
I'm quite confused by the modelling that we should use because when we subset only cases in the analysis then the results seem plausible and DESeq2 seems to calculate thing correctly But when we include also the control group in analysis for some reason there's big problems.
So my questions are:
1) Should we account for the SUBJECT_ID in the model?
2) Is the LRT-test the right approach ?
3) What sort of model would be the right solutions for our case?
Help would be really appreciated, especially from Mike Love if possible.
Subject ID is numerical that is coded as factor. So for subject that has all three time point measurement samples there is three same numbers in the subject_id column. There is some incomplete subjects that has only one or two timepoint measurements. Case/control status (1/2) and timepoint status (1/2/3) are both also numerical and coded as factors in the coldata in R. That was also a factor that I was wondering that how does DESeq2 interpret want variables to be coded. Is it suitable to code all the variables presented in coldata as factors? And the SAMPLE_IDs are as row names that match the countdatas colnames. Head of the coldata presented below:
In total there's samples from 25 cases and 15 controls. And as mentioned earlier some of the subjects have only one or two samples.
Thank you for the quick response Michael, really appreciate the help!
Yes it's appropriate that here all the variables are factors. This means DESeq2 will add coefficients to explain differences between factor levels.
With DESeq2, and in the case you want to control for subject-level differences in this experimental design, you'll have to remove the samples that don't have complete time series. It sounds like this is just a few, and those samples don't inform you about the way expression changes over time and over condition within subject. Note that it is possible to keep these samples in the analysis, by using duplicateCorrelation() in the limma package, instead of using a term in the design for the subject effects, as we do here.
For modeling with DESeq2, you can follow this section in the vignette, in that you will need to create a new column subject.nested, which takes levels 1-25 for the cases, and 1-15 for the controls.
You would use a design of ~condition + time + condition:subject.nested + condition:time. If you want to find genes with condition-specific changes over time, you could use a reduced design of ~condition + time + condition:subject.nested. I'd suggest using plotCounts() to look at the top genes from this LRT, to see how gene expression is changing over condition and time.
Because the cases and controls have different numbers of subjects, you will have to use the model.matrix function to produce matrices for the 'full' and 'reduced' argument of DESeq(), and for both matrices you will have to delete those columns in the model matrix that have all zeros. There is description and example code in the vignette section I referenced.
Still having some issues with model.matrices. I managed to create the additional column to the coldata and also model.matrices for the full and reduced model (zero columns removed).
However, when I try to assign the model.matrices to the DESeq2 function it doesn't seem to work.The code I'm trying to implement is:
ddsFullCountTable <- DESeqDataSetFromMatrix(countData = countdata,colData = coldata, design = ~1)
ddsFullCountTable <- ddsFullCountTable[ rowSums(counts(ddsFullCountTable)) > 1, ]
dds <- DESeq(ddsFullCountTable, test="LRT", full=design(full1), reduced=reduced1)
Where the full1 is model.matrix created for the full model and as instructed in the vignette and reduced1 for the reduced model. I get the error: Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘design’ for signature ‘"matrix"’. On the note, is it necessary that the model.matrices are "unnamed"
What do think about RUVseq in removing overestimated variance from the dataset? Is it good solution in reducing the background noise in a dataset or is there a risk that it will greatly decrease the chances of finding true DE genes?
That little correction made the trick. DESeq2 manages to run the pipeline. Still however it giver an error: 30 rows did not converge in beta, labelled in mcols(object)$fullBetaConv:Use larger maxit argument with nbinomLRT. Is this something that should be dealt with? Went through some documents and previous topics but didn't find how I could apply the nbinomLRT and larger maxit argument with the model matrices we have made.
The results seem somewhat we would expect. However there seems to be going on something strange. P-values are calculated incorrectly as the histogram of the values we can see a peak in the "1" end of the histogram. Also in the dispersion plot one can see a cluster of values that has clearly deviated in the bottom left corner of the plot.
When plotting the significant genes with plotCounts we can see that for some genes there is one/few/numerous samples that clearly separate from the rest of the group.
Thanks again for the help!
I'm thinking, you might benefit from doing some pre-filtering before running DESeq(), for example:
i.e. at least 'x' samples need to have a count of 10 or more. Do you think that this would help deal with the fact that you have significant genes in which your DE signal is coming from some samples but not all?
Coming back to this..
Well, that certainly solved the problem with the "maxit" argument and also now the p-values are calculated correctly. When continuing on some pathway analysis with the results the end outcomes seem somewhat more biologically logical.
However, I'm a bit worried that doing the pre-filtering I might lose some important data. Is there another way I could reliably detect if there really is a couple of samples that contribute relatively much compared to others on the DE genes detection?
There is little chance for a gene to have power to detect differential expression if the counts are single digit. So a safe choice for 'x' is the number of samples per group.
DESeq2 doesn't technically need prefiltering for most datasets, but these rows are extra work for the algorithm, and hard to reach convergence sometimes.
That clears it, thank you!
The contrast are covered in quite detail in previous posts and vignette but I still find them quite ambiguous to interpret especially when implementing on this model which is quite complex (at least in my mind with the model.matrix) . So the resultNames gives me:
"Intercept" "CASE_CONTROL2" "PRE_MID_POST2"
[4] "PRE_MID_POST3" "CASE_CONTROL1.subject.nested2" "CASE_CONTROL2.subject.nested2"
[7] "CASE_CONTROL1.subject.nested3" "CASE_CONTROL2.subject.nested3" "CASE_CONTROL1.subject.nested4"
[10] "CASE_CONTROL2.subject.nested4" "CASE_CONTROL1.subject.nested5" "CASE_CONTROL2.subject.nested5"
[13] "CASE_CONTROL1.subject.nested6" "CASE_CONTROL2.subject.nested6" "CASE_CONTROL1.subject.nested7"
[16] "CASE_CONTROL2.subject.nested7" "CASE_CONTROL1.subject.nested8" "CASE_CONTROL2.subject.nested8"
[19] "CASE_CONTROL1.subject.nested9" "CASE_CONTROL2.subject.nested9" "CASE_CONTROL1.subject.nested10"
[22] "CASE_CONTROL2.subject.nested10" "CASE_CONTROL1.subject.nested11" "CASE_CONTROL2.subject.nested11"
[25] "CASE_CONTROL1.subject.nested12" "CASE_CONTROL2.subject.nested12" "CASE_CONTROL1.subject.nested13"
[28] "CASE_CONTROL2.subject.nested13" "CASE_CONTROL2.subject.nested14" "CASE_CONTROL2.subject.nested15"
[31] "CASE_CONTROL2.subject.nested16" "CASE_CONTROL2.subject.nested17" "CASE_CONTROL2.subject.nested18"
[34] "CASE_CONTROL2.subject.nested19" "CASE_CONTROL2.PRE_MID_POST2" "CASE_CONTROL2.PRE_MID_POST3"
I'm a bit confused on the matter if I want to examine only one time point between conditions how to construct the contrast. Eg. Let's say I want to examine the timepoint 2 between condition groups. Is it right if I assign in the contrast the factor "CASE_CONTROL2.PRE_MID_POST2" only or do I need to assign something else also ?
Or how to do it if I want to examine how many DE genes there's within one group between 1-2, 2-3 or all 1-3 timepoints? Do I need to assign some of the "subject.nested" resultNames in the contrast to account for the within subject variability?
With the LRT I get sig. p-values for hundreds of genes but when applying the contrasts and Wald-test as a result there is thousands of significant genes which seems that the contrast really can't be right.
Thank you for your help. Really getting analysis forward thanks to this!
Great. The results seem to add up, when comparing with the full LRT results.
One question regarding pathway analysis. Is there one certain package you would recommend for downstream pathway analysis of the results?
Most users I think use goseq. If you search the support posts there should be some about this.
Yeah, it was quite straight forward.
We have also the exon data from the experiment. Is it possible to use DESeq2 for exon data? Been looking at DEXseq(which to my understanding uses DESeq2 as inferential engine) for the analysis but was just wondering as we have done the analysis already to gene-level data so would it be a lot easier to implement the current workflow to fit exon-level data on DESeq2?
Been checking that over. Just wondering the same issue as with gene-level data on the model. As in DEXSeq user must specify on the coldata the exon column that has to be taken into account. How do I have to implement it/add on the model that I've used with DESeq2?
Full: ~condition + time + condition:subject.nested + condition:time
Reduced: ~condition + time + condition:subject.nested
Otherwise DEXseq seem somewhat self-explanatory on how to conduct the workflow.
Follow-up on this post. If I'd like to perform just case-only analysis with LRT between time-points would the correct models be:
~ subject.nested + time:subject.nested
~ subject.nested
So that the between subject variability is taken into account.
This will find differences in the profiles across time while controlling for differences in subjects.
That's exactly what I was looking for. However, DESeq2 produces the following error :
Warning message:
In checkForExperimentalReplicates(object, modelMatrix) :
same number of samples and coefficients to fit,
estimating dispersion by treating samples as replicates.
read the ?DESeq section on 'Experiments without replicates'
Oh, you only have individual time points per sample then you can't find this model. Just use 'time' then in place of the interaction of time and sample. You don't have another choice for this as you would need replicates for model you proposed.
Actually I have three timepoints for every subject and almost twenty biological replicates. Shouldn't it work?
No, you'd need replicates per sample and per timepoint.
Can you explain why is that compared to the model where condition is also included? Why there isn't need for replicates per sample in that model?
~condition + time + condition:subject.nested + condition:time
I don't have time to dive deep into explanations of models on the support site. Remember I'm just providing software support not statistical design. I often recommend partnering with a local statistician if you want explanations of various designs (and then you should consider bringing that person onto the project).
Hugely appreciated the previous comments on this matter Michael.
Is it possible to run similar case-only time-series analysis while accounting for age (or other variables of interest)?
Full model: ~ time + subject Reduced model: ~ subject
If age is added to these model the DESeq2 produces an error with matrix not full rank as variables are linear combination of each other. Thus, either variable should be discarded. Is there anyway around this matter in DESeq2? I've read the vignette on the matter and as far I can tell the only option is to discard age as a variable. However, the reason for wanting to run these analyzes is for an article review where the effect of age on the results is demanded - so some way this would need to be done.
If you control for subject, you don't need to (and cannot) control for subject-level covariates. But again, such statistical design considerations are outside the scope of what I have time to provide, and it would be good for you to discuss this with a local statistician.