I am comparing microbial abundance data in paired cancer biopsy samples from multiple studies. The studies are heterogeneous in terms of their sample handling, so I would like to estimate differential abundance between paired cases and controls, adjusted for the study effect. What would be a suitable design matrix for this? Something similar to the following, but with many pairs within each study:
Phenotype | Patient | Study | Study*Patient |
cancer | P1 | StudyA | P1.StudyA |
control | P1 | StudyA | P1.StudyA |
cancer | P2 | StudyA | P2.StudyA |
control | P2 | StudyA | P2.StudyA |
cancer | P3 | StudyB | P3.StudyB |
control | P3 | StudyB | P3.StudyB |
cancer | P4 | StudyB | P4.StudyB |
control | P4 | StudyB | P4.StudyB |
cancer | P5 | StudyC | P5.StudyC |
control | P5 | StudyC | P5.StudyC |
cancer | P6 | StudyC | P6.StudyC |
control | P6 | StudyC | P6.StudyC |
cancer | P7 | StudyD | P7.StudyD |
control | P7 | StudyD | P7.StudyD |
cancer | P8 | StudyD | P8.StudyD |
control | P8 | StudyD | P8.StudyD |
For example:
Study = c("StudyA","StudyA","StudyA","StudyA","StudyB","StudyB","StudyB","StudyB",
          "StudyC","StudyC","StudyC","StudyC","StudyD","StudyD","StudyD","StudyD")
phenotype = c("cancer","control","cancer","control","cancer","control","cancer","control",
              "cancer","control","cancer","control","cancer","control","cancer","control")
pair_factor = c("P1","P1","P2","P2","P3","P3","P4","P4","P5","P5","P6","P6","P7","P7","P8","P8")

Interaction <- paste(Study, pair_factor, sep=".")
design <- model.matrix(~0 + Interaction + phenotype)
colnames(design)
[1] "InteractionStudyA.P1" "InteractionStudyA.P2" "InteractionStudyB.P3" "InteractionStudyB.P4" "InteractionStudyC.P5" "InteractionStudyC.P6"
[7] "InteractionStudyD.P7" "InteractionStudyD.P8" "phenotypecontrol"

OR

design <- model.matrix(~0 + Study + pair_factor + phenotype)
colnames(design)
[1] "StudyStudyA"   "StudyStudyB"   "StudyStudyC"   "StudyStudyD"   "pair_factorP2" "pair_factorP3" "pair_factorP4"
[8] "pair_factorP5" "pair_factorP6" "pair_factorP7" "pair_factorP8" "phenotypecontrol"
Would any of the above options be suitable for this study design? Or am I failing to account for something altogether? Which coefficients would I test in the glmLRT step with the above design? I saw @Aaron Lun's and Gordon Smyth's answers to the question Multi factor design edgeR/DESeq2, but they gave adjusted effects within the genders, whereas I want to compare cancer case and control pairs adjusted FOR study. Also, if I may noobishly ask, what does the ~0 do in the design? Does it have something to do with the intercept?
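On the ~0 question, here is a minimal base-R sketch (not specific to edgeR) showing what it does: ~0 removes the intercept, so the first factor in the formula is coded with one indicator column per level instead of a reference level plus intercept.

```r
# A toy two-level factor, as in the phenotype column above.
f <- factor(c("cancer", "control", "cancer", "control"))

# With an intercept: "cancer" (the first level) is the reference,
# and "fcontrol" measures control relative to cancer.
colnames(model.matrix(~f))      # "(Intercept)" "fcontrol"

# With ~0: no intercept; every level of the first factor gets its
# own indicator column instead.
colnames(model.matrix(~0 + f))  # "fcancer" "fcontrol"
```

Either parameterization spans the same model space; only the interpretation of the coefficients changes.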
Finally, is there a way to parallelize estimateGLMTagwiseDisp? It took a long time to run on my test data (n = 200); my actual data is ~1000 samples, 500 matched pairs over 8 studies.
Thanks a lot for your time!
Manasi, PhD Epidemiology Candidate, UTSPH
For what it's worth, I use
estimateDisp
on single-cell RNA-seq data sets with ~3000 genes and ~3000 samples. This takes about 15 minutes on a desktop machine, which I consider to be satisfactory performance. I once tried to parallelize the function via BiocParallel, but the memory usage increased dramatically (something to do with the environment variables being duplicated at each fork), which made it impractical for any data set large enough to be worth parallelizing.

Thanks Aaron and Ryan. I did try to use the parallel option in DESeq2 but consistently kept getting an issue much like this one:
DESeq2 with many samples
Ryan, you are right: paired samples come from the same study, so patient is nested within study. But I was wondering whether my second option for model.matrix gives an extra layer of adjustment and separates out the study effect?
What do you think, Aaron and Ryan? Is the model above any better, or will just ~pair_factor + phenotype suffice? Also, I did consider limma + voom; however, there is not much documentation for using it with microbiome data, so I thought I would stick to something more familiar if it only helps with speed. If it adds significantly to the overall results and interpretation, I will definitely look into it.
Thanks,
Manasi
The design you show will not work. The study effect is already absorbed into the patient effect because patient is nested within study, so attempting to include Study in the design a second time is redundant and will result in a rank-deficient design matrix that will throw an error when you try to use it. Unless you care very much about separating study effects from patient effects, there's no need to go any further than ~phenotype + patient.
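As a concrete sketch of that suggestion, assuming an edgeR workflow with simulated counts standing in for the real taxon-by-sample table (the variable names and the simulated data are illustrative, not from the original posts):

```r
library(edgeR)
set.seed(1)

# Simulated stand-in for the real count table:
# 100 taxa, 8 patient pairs, cancer/control within each pair.
counts <- matrix(rnbinom(100 * 16, mu = 50, size = 10), nrow = 100)
pair_factor <- factor(rep(paste0("P", 1:8), each = 2))
phenotype <- factor(rep(c("cancer", "control"), 8))

# Additive paired design: one blocking column per patient pair plus
# the cancer-vs-control contrast. Study is deliberately omitted,
# because it is absorbed by the pair (patient) effects.
design <- model.matrix(~pair_factor + phenotype)

y <- DGEList(counts)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)
fit <- glmFit(y, design)

# The coefficient to test in glmLRT is the phenotype one, named
# "phenotypecontrol" here because "cancer" is the reference level.
lrt <- glmLRT(fit, coef = "phenotypecontrol")
topTags(lrt)
```

With ~pair_factor + phenotype (an intercept model), the pair columns soak up both patient and study baselines, and the single phenotype coefficient is the within-pair cancer-vs-control effect.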
I gave it a lot of thought, and adjusting for study effects (in addition to patient ones) is quite important to me. Could you kindly suggest a suitable modification to the design matrix for this? Also, I want to take your advice and calculate using
estimateDisp
instead. Following is my workflow: if x and taxonomy are my inputs, how does this get modified to use estimateDisp? Am I right that estimateDisp also calculates trended and tagwise dispersions? If so, should I use the default values for those? The trend method defaults are different for
estimateGLMTrendedDisp and estimateDisp.
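For illustration, a minimal sketch of the substitution in question, with toy counts standing in for the real inputs (the simulated data and variable names are assumptions for the example):

```r
library(edgeR)
set.seed(1)

# Toy counts standing in for the real taxon table.
counts <- matrix(rnbinom(200 * 8, mu = 30, size = 5), nrow = 200)
group <- factor(rep(c("cancer", "control"), 4))
design <- model.matrix(~group)
y <- DGEList(counts)
y <- calcNormFactors(y)

# Old three-step workflow:
#   y <- estimateGLMCommonDisp(y, design)
#   y <- estimateGLMTrendedDisp(y, design)
#   y <- estimateGLMTagwiseDisp(y, design)

# estimateDisp() replaces all three in a single call; the results are
# stored in y$common.dispersion, y$trended.dispersion and
# y$tagwise.dispersion.
y <- estimateDisp(y, design)
```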
Thanks so much, appreciate your time.
Best,
Manasi
Yes,
estimateDisp
replaces all three of the estimateGLM*Disp
functions. As I said, you can't put both study and patient into the same design matrix, since they are nested. You would need to use a mixed model, which is unfortunately beyond the scope of my experience, since it's quite rare for a genomics experiment to have enough samples to justify the use of mixed models.

Ryan, I think you meant study and patient.
Manasi, what is the aim of your analysis? Is it to look at DE in cancer versus control? If so, you shouldn't need to care about separating the study from the patient effect. Rest assured that any study effects will be accounted for when you block on patient. The only scenario where you would need to separate them is if you wanted to look for DE between studies. This doesn't sound like an interesting or relevant comparison to me.
Whoops, thanks for the correction. I've edited it.
Thanks Ryan, I will use estimateDisp with the default settings accordingly.
Aaron and Ryan, my aim is to look at DE in cancer vs control, but the data come from heterogeneous sources; it is a meta-analysis, as such. I have partially overcome this by aligning all the raw sequences against a common reference database to classify them. However, that still does not cover some of the other sample-handling differences, such as different DNA extraction methods and the sequencing depth of the Illumina MiSeq vs 454 FLX platforms, hence using Study as a proxy factor for these variables would have given me a little more peace of mind. As you both suggest, maybe adjusting for patient pairs will help overcome that.
Thanks,
Manasi
Combining data from different wet lab methods, sequencing technologies, etc. is not for the faint of heart. Each prep has different biases, read lengths, sequencing error rates, coverage depth, variance, and who knows what else. I've had significant problems in one of my data sets just because half the sequencing libraries were prepared with an older version of the same protocol, with everything else (sequencing tech, etc.) the same. You should either be an experienced statistician or be working with one if you're trying to fit all of these samples in a single model. You might get lucky and get good results from the model I described above, simply because of your large sample size, but any of the above issues could substantially confound your analysis. Describing how to account for all of them is far beyond what would fit in this comment thread.
Agreed. Thanks for acknowledging the brave, even if naively optimistic, mission that I am attempting :P It has been a hard road! What is a poor grad student to do if they do not have access to expensive original cohorts? Re-purpose existing ones smartly, right? I did get some really good advice for a few months. I will also be doing a paired study-by-study DESeq analysis and using the effect sizes from those in a random effects model (that is one way to do it :)). Also, I will appreciate some upvotes if the question was decent, to build some reputation karma.
Thanks,
Manasi
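For completeness, a hedged sketch of that per-study-then-pool approach using the metafor package; the effect sizes below are made up for illustration, and in practice the log fold changes and standard errors would come from each study's paired analysis.

```r
library(metafor)

# Hypothetical per-study results for a single taxon: log fold change
# (cancer vs control) and its standard error from each paired analysis.
logFC <- c(0.8, 0.5, 1.1, 0.3)
SE <- c(0.3, 0.4, 0.5, 0.2)

# Random-effects meta-analysis: tau^2 estimates the between-study
# heterogeneity that Study would otherwise proxy for.
res <- rma(yi = logFC, sei = SE, method = "REML")
summary(res)
```

This sidesteps the nesting problem entirely, since study effects cancel within each paired analysis and only the study-level estimates are pooled.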