Hi all,
I'm hoping to get your advice on how to test for DE genes in samples of cells isolated from a tissue, accounting for the "background" expression in the whole tissue that each sample was isolated from. Essentially, I'd like help designing a model matrix.
The particular laboratory method we using is Ribo-tagging or TRAP-seq, in which ribosome-associated mRNA from specific cells is isolated from a whole tissue (which contains free mRNA and ribo-associated mRNA from other cells), but I think this discussion may also apply to any kind of cell isolation, including cell sorting or laser micro-dissection.
Ribo-tagging and TRAP-seq have been used to identify transcripts that come from particular cell types, to better characterize cell populations or discover new cell populations. Because the isolation of the cell-type-specific mRNA is not 100% efficient, the field often uses the ratio of cell-type-specific expression over whole tissue expression to more accurately identify genes that are truly specific to the cell type of interest. I've linked a couple papers if you're interested:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3902954/
However, we are taking things a bit further than identifying cell-specific genes, and would like to identify cell-specific genes that are DE between treatment and control.
In our experiments we have several animals in a treatment group and several in a control group, and each animal yields two samples/RNAseq libraries, the isolated cell-type-specific gene expression levels ("cell expr") and the whole tissue gene expression levels ("tissue expr"). On top of that, we have several time points for each group (different animals in each time point). We want to ask how treatment effects gene expression within each time point, which genes change over time within each group, and which genes are different over time between treatment and control, always accounting for background expression in the tissue to reduce noise.
1. Analysis option 1 is to take the ratio of cell expr/tissue expr and perform DE gene analysis on the ratio values as usual with limma. But, there's some squirrely things that happen to ratios with low-abundance genes, and I'm not certain that limma will enjoy working with ratios of counts when it's used to working with counts. (Maybe this is wrong- I'm of the RNAseq generation, but I know limma was developed for microarrays.)
2. Analysis option 2 is to keep the cell expr data and the tissue expr data separate, perform DE analysis on them individually, and compare the resulting lists of DE genes afterward, noting which genes appear on the cell-specific list and which appear on both the cell-specific and tissue-specific lists.
3. Analysis option 3 is to set up a model matrix that includes the paired cell and tissue samples from each animal and looks at the difference between treatment and control within time points and across time points. I can't wrap my head around what the model matrix would look like. In the case of a within-time point analysis, I could treat the tissue expression of each sample like a baseline expression, but how to test for genes that are differentially regulated between time 0 and time 1 and between treatment and control, accounting for background tissue expression eludes me.
Thanks so much for any thoughts/designs you have!
Thanks Aaron! I was hoping that orange and white cat would pop up underneath my post!
First, RE: your last warnings, I absolutely agree and am planning to be very cautious with interpretations of results from this technique, and will keep in mind that this strategy may not pick up differences between low-abundance genes (for multiple reasons). Thanks for sharing your outlook on techniques like these.
Second, a couple questions about the design:
I'm interested in cell/tissue, not tissue/cell, so I've made the appropriate changes to your suggested design. But just to double check, the resulting columns in the full rank matrix are cell/tissue or tissue/cell depending on the order of factor levels in the groupings variable, right? ie, if I want the cell/tissue fold change, I have to re-level the groupings to put each tissue column before it's associated cell column before I get rid of the tissue columns. And am I correct that including the animal in as a blocking factor with duplicateCorrelation would not work, because it's "directionless", ie, wouldn't give me cell/tissue, just "same animal"?
With your model design, is it correct that I will get genes that change between time points within treatment groups with
tr.48.cell-tr.24.cell
orcon.48.cell-con.24.cell,
and genes that change with treatment within a time point withtr.48.cell-con.48.cell
? Is there a contrast in this set up that would give me genes that change with time between treatment and control, or do I need a more complex model matrix to pull that comparison out? If I add an interaction term between time and treatment, does having animal in the model still account for the cell/tissue pairing, and/or do I need to make another grouping variable to find genes different over time between treatment and control?Thanks much for your time!
Regarding the ordering of levels; yes, you need to put specify the order manually if you want tissues first.
Regarding
duplicateCorrelation
; yes, you could use it instead of blocking onanimal
in the design matrix. However, this is only necessary if you want to compare expression levels directly between levels ofgrouping
(i.e., without looking at cell/tissue ratios). If you don't want to do that, blocking directly in the design matrix is simpler and avoids some anticonservativeness associated with the use ofduplicateCorrelation
.Regarding your contrasts: that's right, provided you remember that these coefficients are ratios. So you're really looking at changes in the ratios between time points/treatments, not changes in the expression values themselves. If you want to look at genes where the change in the ratio with time differs between treatment and control, you should do a contrast like this:
This mimics the test for a time:treatment interaction effect, and will identify genes where the change in the ratio between treatments differs between time points. As you can imagine, this gets pretty confusing, so it's a good idea to also report the values of the individual coefficients to help you interpret what is changing and in what direction.
Great- thank you for helping me keep it simple!