Testing for DE genes between samples of isolated cells, accounting for expression in whole tissue
1
2
Entering edit mode
smurray ▴ 10
@smurray-13731
Last seen 5.7 years ago
Seattle

Hi all,

I'm hoping to get your advice on how to test for DE genes in samples of cells isolated from a tissue, accounting for the "background" expression in the whole tissue that each sample was isolated from.  Essentially, I'd like help designing a model matrix.

The particular laboratory method we using is Ribo-tagging or TRAP-seq, in which ribosome-associated mRNA from specific cells is isolated from a whole tissue (which contains free mRNA and ribo-associated mRNA from other cells), but I think this discussion may also apply to any kind of cell isolation, including cell sorting or laser micro-dissection.

Ribo-tagging and TRAP-seq have been used to identify transcripts that come from particular cell types, to better characterize cell populations or discover new cell populations.  Because the isolation of the cell-type-specific mRNA is not 100% efficient, the field often uses the ratio of cell-type-specific expression over whole tissue expression to more accurately identify genes that are truly specific to the cell type of interest.  I've linked a couple papers if you're interested:

https://academic.oup.com/nar/article/38/13/4218/2409074/Analytical-approaches-to-RNA-profiling-data-for

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3902954/

However, we are taking things a bit further than identifying cell-specific genes, and would like to identify cell-specific genes that are DE between treatment and control.

In our experiments we have several animals in a treatment group and several in a control group, and each animal yields two samples/RNAseq libraries, the isolated cell-type-specific gene expression levels ("cell expr") and the whole tissue gene expression levels ("tissue expr").  On top of that, we have several time points for each group (different animals in each time point).  We want to ask how treatment effects gene expression within each time point, which genes change over time within each group, and which genes are different over time between treatment and control, always accounting for background expression in the tissue to reduce noise.

1. Analysis option 1 is to take the ratio of cell expr/tissue expr and perform DE gene analysis on the ratio values as usual with limma.  But, there's some squirrely things that happen to ratios with low-abundance genes, and I'm not certain that limma will enjoy working with ratios of counts when it's used to working with counts.  (Maybe this is wrong- I'm of the RNAseq generation, but I know limma was developed for microarrays.)

2. Analysis option 2 is to keep the cell expr data and the tissue expr data separate, perform DE analysis on them individually, and compare the resulting lists of DE genes afterward, noting which genes appear on the cell-specific list and which appear on both the cell-specific and tissue-specific lists.

3. Analysis option 3 is to set up a model matrix that includes the paired cell and tissue samples from each animal and looks at the difference between treatment and control within time points and across time points.   I can't wrap my head around what the model matrix would look like.  In the case of a within-time point analysis, I could treat the tissue expression of each sample like a baseline expression, but how to test for genes that are differentially regulated between time 0 and time 1 and between treatment and control, accounting for background tissue expression eludes me.

Thanks so much for any thoughts/designs you have!

limma rnaseq rna-seq model-matrix multiple factor design • 1.8k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 1 hour ago
The city by the bay

Option 3. Let's start with a simple example that captures most of your experimental design:

treatment <- rep(c("con", "tr"), each=8)
mode <- rep(c("cell", "tissue"), 4)
time <- rep(rep(c("24", "48"), each=4), 2)
animal <- factor(rep(1:8, each=2))

The easiest thing to do is this:

grouping <- paste(treatment, time, mode, sep=".")
design <- model.matrix(~0 + animal + grouping)
design <- design[,!grepl("cell", colnames(design))] # get to full rank

The first 8 coefficients are the animal-specific blocking factors and are not of scientific interest. The last four coefficients represent the log-fold change of the tissue samples over the cell-type specific samples for each treatment/time group. Comparisons between these coefficients will represent comparisons between the log-fold changes, i.e., is the tissue/cell log-fold change the same between treatment and control conditions? This is basically a smarter way of doing option 1 without explicitly computing the ratios beforehand.

Having said all that, there are several mathematical quirks to be aware of. The first is that this approach will happily reject the null hypothesis when one log-fold change is large and the other log-fold change is very large. From a statistical perspective, this is correct as the log-fold changes are indeed different. However, from a biological perspective, this may be less interesting as the qualitative behaviour of the gene does not change between conditions. The second thing to note is that this approach will fail to reject the null hypothesis if you have counts of zero (in the numerator or denominator of the log-fold change calculation) for both conditions. This is because the log-fold change in either condition could be any very large value, such that it becomes impossible to reject the null.

One more warning: the ratio/log-fold change of cell-type-specific expression to whole tissue expression will depend on the composition of the tissue, which may make things difficult to interpret. If the cell composition changes between control and treatment conditions, this will result in changes to the ratio even if the cell type itself has no DE. Presumably, you also have pulldown biases in the cell type-specific samples that are not present in the whole tissue samples. This may cause the background enrichment in the former to differ from the expression profile of the latter, further complicating any comparison of ratios. It's a similar problem to the use of input controls in ChIP-seq, where my attitude is to trust the direction but not the magnitude of ChIP enrichment versus the control.

ADD COMMENT
0
Entering edit mode

Thanks Aaron!  I was hoping that orange and white cat would pop up underneath my post!

First, RE: your last warnings, I absolutely agree and am planning to be very cautious with interpretations of results from this technique, and will keep in mind that this strategy may not pick up differences between low-abundance genes (for multiple reasons).  Thanks for sharing your outlook on techniques like these.

Second, a couple questions about the design: 

I'm interested in cell/tissue, not tissue/cell, so I've made the appropriate changes to your suggested design.  But just to double check, the resulting columns in the full rank matrix are cell/tissue or tissue/cell depending on the order of factor levels in the groupings variable, right?  ie, if I want the cell/tissue fold change, I have to re-level the groupings to put each tissue column before it's associated cell column before I get rid of the tissue columns.   And am I correct that including the animal in as a blocking factor with duplicateCorrelation would not work, because it's "directionless", ie, wouldn't give me cell/tissue, just "same animal"?

With your model design, is it correct that I will get genes that change between time points within treatment groups with  tr.48.cell-tr.24.cell  or  con.48.cell-con.24.cell, and genes that change with treatment within a time point with tr.48.cell-con.48.cell ?  Is there a contrast in this set up that would give me genes that change with time between treatment and control, or do I need a more complex model matrix to pull that comparison out?  If I add an interaction term between time and treatment, does having animal in the model still account for the cell/tissue pairing, and/or do I need to make another grouping variable to find genes different over time between treatment and control?

Thanks much for your time!

ADD REPLY
1
Entering edit mode

Regarding the ordering of levels; yes, you need to put specify the order manually if you want tissues first.

Regarding duplicateCorrelation; yes, you could use it instead of blocking on animal in the design matrix. However, this is only necessary if you want to compare expression levels directly between levels of grouping (i.e., without looking at cell/tissue ratios). If you don't want to do that, blocking directly in the design matrix is simpler and avoids some anticonservativeness associated with the use of duplicateCorrelation.

Regarding your contrasts: that's right, provided you remember that these coefficients are ratios. So you're really looking at changes in the ratios between time points/treatments, not changes in the expression values themselves. If you want to look at genes where the change in the ratio with time differs between treatment and control, you should do a contrast like this:

(tr.48.cell - con.48.cell) - (tr.24.cell - con.24.cell)

This mimics the test for a time:treatment interaction effect, and will identify genes where the change in the ratio between treatments differs between time points. As you can imagine, this gets pretty confusing, so it's a good idea to also report the values of the individual coefficients to help you interpret what is changing and in what direction.

ADD REPLY
0
Entering edit mode

Great- thank you for helping me keep it simple!

ADD REPLY

Login before adding your answer.

Traffic: 815 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6