I am comparing microbial abundance data in paired cancer biopsy samples from multiple studies. The studies are heterogeneous in terms of their sample handling, so I would like to estimate differential abundance between paired cases and controls, adjusted for the study effect. What would be a suitable design matrix for this? Something similar to the following, but with many pairs within each study:
Phenotype | Patient | Study | Study*Patient |
cancer | P1 | StudyA | P1.StudyA |
control | P1 | StudyA | P1.StudyA |
cancer | P2 | StudyA | P2.StudyA |
control | P2 | StudyA | P2.StudyA |
cancer | P3 | StudyB | P3.StudyB |
control | P3 | StudyB | P3.StudyB |
cancer | P4 | StudyB | P4.StudyB |
control | P4 | StudyB | P4.StudyB |
cancer | P5 | StudyC | P5.StudyC |
control | P5 | StudyC | P5.StudyC |
cancer | P6 | StudyC | P6.StudyC |
control | P6 | StudyC | P6.StudyC |
cancer | P7 | StudyD | P7.StudyD |
control | P7 | StudyD | P7.StudyD |
cancer | P8 | StudyD | P8.StudyD |
control | P8 | StudyD | P8.StudyD |
For example:
Study = c("StudyA","StudyA","StudyA","StudyA","StudyB","StudyB","StudyB","StudyB",
          "StudyC","StudyC","StudyC","StudyC","StudyD","StudyD","StudyD","StudyD")
phenotype = c("cancer","control","cancer","control","cancer","control","cancer","control",
              "cancer","control","cancer","control","cancer","control","cancer","control")
pair_factor = c("P1","P1","P2","P2","P3","P3","P4","P4","P5","P5","P6","P6","P7","P7","P8","P8")

Interaction <- paste(Study, pair_factor, sep=".")
design <- model.matrix(~0 + Interaction + phenotype)
colnames(design)
[1] "InteractionStudyA.P1" "InteractionStudyA.P2" "InteractionStudyB.P3" "InteractionStudyB.P4" "InteractionStudyC.P5" "InteractionStudyC.P6"
[7] "InteractionStudyD.P7" "InteractionStudyD.P8" "phenotypecontrol"

OR

design <- model.matrix(~0 + Study + pair_factor + phenotype)
colnames(design)
[1] "StudyStudyA"   "StudyStudyB"   "StudyStudyC"   "StudyStudyD"   "pair_factorP2" "pair_factorP3" "pair_factorP4"
[8] "pair_factorP5" "pair_factorP6" "pair_factorP7" "pair_factorP8" "phenotypecontrol"
Would any of the above options be suitable for this study design? Or am I failing to account for something altogether? Which coefficients would I test in the glmLRT step with the above design? I saw @Aaron Lun's and Gordon Smyth's answers to the question Multi factor design edgeR/DESeq2, but they gave adjusted effects within the genders, whereas I want to compare cancer case and control pairs adjusted FOR study. Also, if I may noobishly ask, what does the ~0 do in the design? Does it have something to do with the intercept?
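On the ~0 question, here is a minimal base-R sketch (not specific to edgeR) showing what it does: ~0 removes the intercept, so the first factor in the formula is coded with one indicator column per level instead of a reference level plus intercept.

```r
# A toy two-level factor, as in the phenotype column above.
f <- factor(c("cancer", "control", "cancer", "control"))

# With an intercept: "cancer" (the first level) is the reference,
# and "fcontrol" measures control relative to cancer.
colnames(model.matrix(~f))      # "(Intercept)" "fcontrol"

# With ~0: no intercept; every level of the first factor gets its
# own indicator column instead.
colnames(model.matrix(~0 + f))  # "fcancer" "fcontrol"
```

Either parameterization spans the same model space; only the interpretation of the coefficients changes.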
Finally, is there a way to parallelize estimateGLMTagwiseDisp? It took a long time to run on my test data (n = 200); my actual data is ~1000 samples, 500 matched pairs over 8 studies.
Thanks a lot for your time!
Manasi, PhD Epidemiology Candidate, UTSPH
For what it's worth, I use
estimateDisp
on single-cell RNA-seq data sets with ~3000 genes and ~3000 samples. This takes about 15 minutes on a desktop machine, which I consider to be satisfactory performance. I once tried to parallelize the function via BiocParallel, but the memory usage increased dramatically (something to do with the environment variables being duplicated at each fork), which made it impractical for any data set large enough to be worth parallelizing.

Thanks Aaron and Ryan. I did try to use the parallel option in DESeq2 but consistently kept getting an issue much like this one:
DESeq2 with many samples
Ryan, you are right: paired samples come from the same study, so patient is nested within study. But I was wondering whether my second option for model.matrix gives an extra layer of adjustment and separates out the study effect?
What do you think, Aaron and Ryan? Is the model above any better, or will just ~pair_factor + phenotype suffice? Also, I did consider limma + voom; however, there is not much documentation for using it with microbiome data, so I thought I would stick to something more familiar if it only helps with speed. If it adds significantly to the overall results and interpretation, I will definitely look into it.
Thanks,
Manasi
The design you show will not work. The study effect is already absorbed into the patient effect because patient is nested within study, so attempting to include Study in the design a second time is redundant and will result in a rank-deficient design matrix that will throw an error when you try to use it. Unless you care very much about separating study effects from patient effects, there's no need to go any further than ~phenotype + patient.
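As a concrete sketch of that suggestion, assuming an edgeR workflow with simulated counts standing in for the real taxon-by-sample table (the variable names and the simulated data are illustrative, not from the original posts):

```r
library(edgeR)
set.seed(1)

# Simulated stand-in for the real count table:
# 100 taxa, 8 patient pairs, cancer/control within each pair.
counts <- matrix(rnbinom(100 * 16, mu = 50, size = 10), nrow = 100)
pair_factor <- factor(rep(paste0("P", 1:8), each = 2))
phenotype <- factor(rep(c("cancer", "control"), 8))

# Additive paired design: one blocking column per patient pair plus
# the cancer-vs-control contrast. Study is deliberately omitted,
# because it is absorbed by the pair (patient) effects.
design <- model.matrix(~pair_factor + phenotype)

y <- DGEList(counts)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)
fit <- glmFit(y, design)

# The coefficient to test in glmLRT is the phenotype one, named
# "phenotypecontrol" here because "cancer" is the reference level.
lrt <- glmLRT(fit, coef = "phenotypecontrol")
topTags(lrt)
```

With ~pair_factor + phenotype (an intercept model), the pair columns soak up both patient and study baselines, and the single phenotype coefficient is the within-pair cancer-vs-control effect.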
I gave it a lot of thought, and adjusting for study effects (in addition to patient ones) is quite important to me. Could you kindly suggest a suitable modification to the design matrix for this? Also, I want to take your advice and calculate using
estimateDisp
instead. Following is my workflow: if x and taxonomy are my inputs, how does this get modified to use estimateDisp? Am I right that estimateDisp also calculates trended and tagwise dispersions? If so, should I use the default values for those? The trend method defaults are different for
estimateGLMTrendedDisp and estimateDisp.
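For illustration, a minimal sketch of the substitution in question, with toy counts standing in for the real inputs (the simulated data and variable names are assumptions for the example):

```r
library(edgeR)
set.seed(1)

# Toy counts standing in for the real taxon table.
counts <- matrix(rnbinom(200 * 8, mu = 30, size = 5), nrow = 200)
group <- factor(rep(c("cancer", "control"), 4))
design <- model.matrix(~group)
y <- DGEList(counts)
y <- calcNormFactors(y)

# Old three-step workflow:
#   y <- estimateGLMCommonDisp(y, design)
#   y <- estimateGLMTrendedDisp(y, design)
#   y <- estimateGLMTagwiseDisp(y, design)

# estimateDisp() replaces all three in a single call; the results are
# stored in y$common.dispersion, y$trended.dispersion and
# y$tagwise.dispersion.
y <- estimateDisp(y, design)
```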
Thanks so much, appreciate your time.
Best,
Manasi
Yes,
estimateDisp
replaces all three of the estimateGLM*Disp
functions. As I said, you can't put both study and patient into the same design matrix, since they are nested. You would need to use a mixed model, which is unfortunately beyond the scope of my experience, since it's quite rare for a genomics experiment to have enough samples to justify the use of mixed models.

Ryan, I think you meant study and patient.
Manasi, what is the aim of your analysis? Is it to look at DE in cancer versus control? If so, you shouldn't need to care about separating the study from the patient effect. Rest assured that any study effects will be accounted for when you block on patient. The only scenario where you would need to separate them is if you wanted to look for DE between studies. This doesn't sound like an interesting or relevant comparison to me.
Whoops, thanks for the correction. I've edited it.
Thanks Ryan, I will use estimateDisp with the default settings accordingly.
Aaron and Ryan, my aim is to look at DE in cancer vs control, but the data come from heterogeneous sources; it is a meta-analysis, as such. I have partially overcome this by aligning all the raw sequences against a common reference database to classify them. However, that still does not cover some of the other sample-handling differences, such as different DNA extraction methods and the sequencing depth of the Illumina MiSeq vs 454 FLX platforms, hence using Study as a proxy factor for these variables would have given me a little more peace of mind. As you both suggest, maybe adjusting for patient pairs will help overcome that.
Thanks,
Manasi
Combining data from different wet lab methods, sequencing technologies, etc. is not for the faint of heart. Each prep has different biases, read lengths, sequencing error rates, coverage depth, variance, and who knows what else. I've had significant problems in one of my data sets just because half the sequencing libraries were prepared with an older version of the same protocol, with everything else (sequencing tech, etc.) the same. You should either be an experienced statistician or be working with one if you're trying to fit all of these samples in a single model. You might get lucky and get good results from the model I described above, simply because of your large sample size, but any of the above issues could substantially confound your analysis. Describing how to account for all of them is far beyond what would fit in this comment thread.
Agreed. Thanks for acknowledging the brave, even if naively optimistic, mission that I am attempting :P It has been a hard road! What is a poor grad student to do if they do not have access to expensive original cohorts? Re-purpose existing ones smartly, right? I did get some really good advice for a few months. I will also be doing a paired study-by-study DESeq analysis and using the effect sizes from those in a random effects model (that is one way to do it :)). Also, I will appreciate some upvotes if the question was decent, to build some reputation karma.
Thanks,
Manasi
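For completeness, a hedged sketch of that per-study-then-pool approach using the metafor package; the effect sizes below are made up for illustration, and in practice the log fold changes and standard errors would come from each study's paired analysis.

```r
library(metafor)

# Hypothetical per-study results for a single taxon: log fold change
# (cancer vs control) and its standard error from each paired analysis.
logFC <- c(0.8, 0.5, 1.1, 0.3)
SE <- c(0.3, 0.4, 0.5, 0.2)

# Random-effects meta-analysis: tau^2 estimates the between-study
# heterogeneity that Study would otherwise proxy for.
res <- rma(yi = logFC, sei = SE, method = "REML")
summary(res)
```

This sidesteps the nesting problem entirely, since study effects cancel within each paired analysis and only the study-level estimates are pooled.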