I saw a recent post by Michael Love (C: differential expression analysis of cell subtypes mixture) about the DESeq2 function unmix() and I had a follow-up question (but did not want to hijack the original poster's question with one of my own).
Could unmix() be used if you only know the expression profile of one of the "pure" populations but not the other? For instance, my Tissue A sample, which has never been characterized before, unfortunately has a little bit of the well-characterized Tissue B contaminating it...
Specifically, I have a bulk RNA-Seq data set where we enriched for a subset of epithelial cells in three different tissues, but we were not able to get 100% purity because technical limitations (cell quantity and viability constraints) prevent us from doing flow sorting.
So now it is throwing off some of the differential expression analyses, because genes expressed in the contaminating tissue type are showing up with high logFCs/low FDRs: they are expressed in one sample type but not at all in the others. It may be a hopeless cause and we will just have to be cautious in interpreting the data we have, but unmix() sounded like it could be a possibility.
If you're certain that most of your samples contain only Tissue A, then maybe you could use those samples to construct an expression profile for Tissue A and combine that with the pre-existing profile for Tissue B?
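To make the idea concrete, here is a minimal sketch of the two-profile case, assuming a mixed sample can be approximated as p * (Tissue A profile) + (1 - p) * (Tissue B profile) on some normalized scale. This is not DESeq2's unmix() and it skips the variance stabilization a real analysis would need; the function name `estimate_fraction` and the toy profiles are made up purely for illustration.

```python
def estimate_fraction(mixed, pure_a, pure_b):
    """Least-squares estimate of p in mixed ~ p*pure_a + (1-p)*pure_b, clipped to [0, 1]."""
    # Closed-form solution of the one-parameter least-squares problem.
    num = sum((y - b) * (a - b) for y, a, b in zip(mixed, pure_a, pure_b))
    den = sum((a - b) ** 2 for a, b in zip(pure_a, pure_b))
    if den == 0:
        raise ValueError("pure profiles are identical; the fraction is not identifiable")
    return min(1.0, max(0.0, num / den))


# Hypothetical normalized expression values for four genes.
tissue_a = [100.0, 50.0, 0.0, 10.0]   # profile built from the "clean" Tissue A samples
tissue_b = [0.0, 50.0, 200.0, 10.0]   # pre-existing Tissue B profile

# Toy sample that is 90% Tissue A and 10% Tissue B.
mixed = [0.9 * a + 0.1 * b for a, b in zip(tissue_a, tissue_b)]

frac_a = estimate_fraction(mixed, tissue_a, tissue_b)
print(f"estimated Tissue A fraction: {frac_a:.3f}")
```

With real data the estimate will only be as good as the two reference profiles, so genes that differ strongly between the pure tissues carry most of the information.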
Hmm... Maybe?
Let me explain a bit more about what we have to see if that might make sense to consider...:
We took three different human tissues (A, B, and C) and mechanically dissociated the epithelial cells from each (scraped them off with a scalpel blade). This gave us roughly 90-95% epithelial cell purity when we went back later and checked by FACS analysis. Cell viability after dissociation, and the RNA quantity and integrity we could get from it, were already pretty low, so we were not able to flow sort the samples for epithelial markers...
With the data, we ideally want to ask how the transcriptional profiles of the epithelial cells from these three tissues compare, which genes are DE between the tissues, and whether this hints at any underlying differences in biological function between the tissues.
The issue we are having, though, is that Tissue C is anatomically embedded in Tissue X, and when we mechanically dissociated Tissue C it seems that some of the 5-10% of contaminating cells came from Tissue X. Tissues A and B are not embedded in another tissue type, and the contaminating population in them is most likely immune/stromal cells. Those do not seem to differ vastly between Tissues A, B, and C, I think, since I don't see the DGE analyses dominated by these types of genes.
So, when we go to compare Tissue C to Tissue A or B, we see a lot of signal that is most likely coming from the contaminating Tissue X population and we cannot easily conclude what might be the true underlying differences. Tissue X is well characterized and has publicly available RNA-Seq data.
Is Tissue C perhaps a lost cause?
Thank you for the reply/insight :D!
This is a similar situation to our case with the mixture of samples (from the original post you mentioned).
I was wondering if it is possible to artificially add RNA from Tissue X to the other samples (Tissues A or B) to create a baseline for these changes, similar to the spike-ins used in RNA-Seq. I am not sure if this is feasible, both from the biological side (getting pure Tissue X) and from the statistical side (creating an even bigger bias). I guess the problem would be knowing how much of Tissue X to add to the other samples. But if you can already estimate that there is something like 5-10% contamination of Tissue C with Tissue X, this might be a starting point.
Would this be a statistically valid idea?
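One way to get a feel for this before doing any bench work would be an in-silico version of the same idea: mix a Tissue X reference profile into a Tissue A (or B) profile at 5% and 10% and see which genes shift. The sketch below is only an illustration under that assumption. It is a computational stand-in, not the wet-lab spike-in itself, and the gene names, values, and helper functions (`spike_in`, `log2_fold_changes`) are all hypothetical.

```python
import math


def spike_in(sample, contaminant, fraction):
    """Return (1 - fraction) * sample + fraction * contaminant, gene by gene."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return [(1.0 - fraction) * s + fraction * c for s, c in zip(sample, contaminant)]


def log2_fold_changes(after, before, pseudocount=1.0):
    """Per-gene log2 ratio of spiked vs. original values, with a pseudocount to avoid log(0)."""
    return [math.log2((x + pseudocount) / (y + pseudocount)) for x, y in zip(after, before)]


# Hypothetical normalized expression values; not real data.
genes = ["EPI1", "EPI2", "X_MARKER"]
tissue_a = [200.0, 80.0, 0.0]     # epithelial sample without Tissue X
tissue_x = [0.0, 80.0, 1000.0]    # public Tissue X reference profile

results = {}
for fraction in (0.05, 0.10):
    spiked = spike_in(tissue_a, tissue_x, fraction)
    results[fraction] = log2_fold_changes(spiked, tissue_a)
    for gene, lfc in zip(genes, results[fraction]):
        print(f"{fraction:.0%} Tissue X  {gene}: log2FC = {lfc:.2f}")
```

In this toy example, a gene that is absent from the epithelial cells but high in Tissue X picks up a large apparent fold change even at 5%, which matches the kind of signal described for the Tissue C comparisons.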