How to decide whether a variate should be in the model or not
1
0
Entering edit mode
Raymond ▴ 20
@raymond-14020
Last seen 5.6 years ago

Hi, all/Micheal Love,

  Is there a way in DESeq2 package to hep with this kind of problems? For example, I may be able to think of tens of possible variables that may affect gene expressions(Genotypes of a couple of genes, genders, age, PMI, RIN, RIN^2, mapping rate, batches....). Obviously, I should only include a limited number of those variables.

And how could I choose these variables? How the number of my samples would restrain my selections, in order to make a robust estimation?

  From the literature, I saw someone using the PC1 as the corresponding factor. Then ANOVA model could be applied to testing the contribution of each possible variables to PC1. This is reasonable, but with obvious limitations. Especially, when sometimes you see the PC1 is mainly dominated by a single factor(such as Batch), then (PC2,PC3, etc) may also be used to identify other factors. 

Any suggestions? (I know this is not a pure DESeq2 package problem, but I guess Micheal would have some clue about this:-) )

Thanks in advance,

Raymond

deseq2 rnaseq • 1.4k views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 23 hours ago
United States

Hi Raymond,

Usually to decide what variables would go in a model, I'd have a long in-person discussion with the investigator about their experiment. I'll give my quick opinion on these:

I don't like to put in RIN, RIN^2, mapping rate, etc. but instead if I think there is a lot of technical variability in the data, and many samples to estimate technical effects, I typically would use factor analysis methods such as SVA, RUV or PEER. These are more data driven and should capture whatever f(RIN) is attempting to capture if it is systematic and not confounded with the biological condition (hopefully it's not!). I would never use PC1, because it should be strictly worse than these other methods that take into account the design (I think they all compare to simplistic PC1 in their publications to show their superiority).

You can also use quantification methods like Salmon which model numerous technical biases directly, and per-sample, and so removing partially confounded technical effects is not a problem.

If I have clearly annotated library preparation batches, I tend to use batch instead of the factor analysis methods.

Throwing in lots of variables does seem a bit like trying to deal on the back end with observational data, and I'm not sure that it really works that well.

Some of these make sense, like sex, and I would put that in most datasets where there is a block design. If there is a sex specific effect you can include an interaction term with sex and condition.

Remember when you put in something like log(counts) ~ age, you are assuming that the effect is linear on log counts throughout the range of your data.

ADD COMMENT
0
Entering edit mode

Hi Michael,

Thank you so much for your answer to the question. I have a similar question regarding the QC metrics output and whether to include them in our design matrix. We performed a PCA association analysis of the normalized counts against all the variables we have. We found that average insert size, intronic rate, exonic rate, intergenic rate and duplication rate of mapped reads are all significantly associated with the variations captured by PC1. We were thinking about including one of these metrics (exonic rate) in our design matrix for our differential gene expression analysis as a control for technical variability from sequencing. Would this be valid or are methods like SVA you mentioned above preferred over this?

Thanks again and I really appreciate it!

ADD REPLY
1
Entering edit mode

I've answered this elsewhere on the support site. I prefer SVA/RUV as it captures these and has orthogonality.

ADD REPLY

Login before adding your answer.

Traffic: 557 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6