Dear all,
Removing genes manually before RNA-seq normalization is not good practice, right? For example, we would like to investigate osteoblast expression in bone samples, but we know that there is some contamination from muscle. Is it correct to remove the 'muscle' genes before normalization? My understanding is that this should not be done, because TMM normalization corrects for both library size and compositional bias. Imagine that some bone samples are more contaminated than others, and one has extremely high expression of muscle genes. If we compare two conditions and remove the contaminating transcripts before normalization, we would obtain untrustworthy results, right?
Another example would be removing all non-coding genes beforehand if we want to study only protein-coding genes. Does the same apply?
Thanks,
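The compositional-bias concern in the question can be made concrete with a toy calculation. The sketch below is only an illustration in plain Python: the counts are made up, and it uses simple library-size scaling (counts per million), not edgeR's TMM. It compares the apparent change of a bone gene when the library size includes a contaminating muscle gene with the change seen when that gene is dropped and the library sizes are recomputed.

```python
# Hypothetical counts: 4 bone genes with identical true expression in both
# samples, plus 1 muscle gene that heavily contaminates sample B only.
bone_a = [100, 200, 300, 400]
bone_b = [100, 200, 300, 400]
muscle_a, muscle_b = 0, 1000


def cpm(count, library_size):
    """Counts per million for a single gene."""
    return count / library_size * 1e6


# Library sizes that include the contaminating muscle gene.
lib_a = sum(bone_a) + muscle_a
lib_b = sum(bone_b) + muscle_b

# Library sizes recomputed after dropping the muscle gene.
lib_a_clean = sum(bone_a)
lib_b_clean = sum(bone_b)

# Apparent B/A ratio for the first bone gene under each choice.
ratio_with_muscle = cpm(bone_b[0], lib_b) / cpm(bone_a[0], lib_a)
ratio_without_muscle = cpm(bone_b[0], lib_b_clean) / cpm(bone_a[0], lib_a_clean)

print(round(ratio_with_muscle, 3))     # bone gene looks halved in sample B
print(round(ratio_without_muscle, 3))  # bone gene looks unchanged
```

In this toy case the contamination inflates the library size of sample B, which pushes every bone gene down after scaling. Dropping the muscle gene and recomputing the library sizes removes that artefact.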
Thanks Gordon, so I was wrong. I thought one cannot modify the RNA composition of the sample bioinformatically before normalization.
Dear Gordon,
To be sure: if I understand correctly, your answer implies that removing genes that are not of interest should be done before normalization. In a post I made somewhat earlier on Biostars, https://www.biostars.org/p/9490668/, the answer to this question was that normalization can be applied before removing genes that are not of interest. These two answers seem somewhat contradictory.
For some context, I am running an RNA-seq experiment in which we want to evaluate only a subset of all genes. The way we have done that so far is: select only protein-coding genes --> exclude lowly transcribed genes --> apply TMM normalization.
My only worry is that one of the assumptions of normalization is that the majority of genes are not differentially expressed. As we are looking at a specific disease and selecting only genes that are highly expressed in a specific tissue, I am not sure whether this assumption still holds. We consider only a very small part of the genome (1000 genes).
Would you advise subsetting the genes before normalization, or normalizing on the whole expression matrix and subsetting the genes of interest afterwards?
Yes, if you have just a small group of genes of interest, then naturally you should normalize on the whole expression matrix and subset afterwards. I cannot see how you could possibly read into any of my comments the suggestion that you should subset so drastically before normalization. I told the original poster that they could choose their universe, but you still have to have a universe.
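A small numeric sketch may help show why the choice of universe matters. It is a toy illustration in plain Python with made-up counts, and the scaling factor is a simple median of per-gene ratios, used here as a crude stand-in for TMM rather than edgeR's actual algorithm. The function name `scaling_factor` and all values are hypothetical. Sample B is sequenced twice as deeply as sample A, and the three genes of interest are truly up-regulated 4-fold.

```python
from statistics import median


def scaling_factor(sample, reference):
    """Median of per-gene ratios sample/reference (crude stand-in for TMM)."""
    return median(s / r for s, r in zip(sample, reference) if s > 0 and r > 0)


# Hypothetical counts for 10 genes in sample A.
sample_a = [50, 80, 120, 200, 300, 400, 500, 60, 90, 150]
# True fold changes: the last 3 genes (the genes of interest) are up 4-fold.
true_fold = [1, 1, 1, 1, 1, 1, 1, 4, 4, 4]
# Sample B is sequenced twice as deeply, on top of the true fold changes.
sample_b = [2 * a * f for a, f in zip(sample_a, true_fold)]

interest = [7, 8, 9]  # indices of the genes of interest

# Factor estimated on the whole matrix: most genes are unchanged.
factor_whole = scaling_factor(sample_b, sample_a)

# Factor estimated on the subset only: every gene in it is changed.
factor_subset = scaling_factor(
    [sample_b[i] for i in interest],
    [sample_a[i] for i in interest],
)

# Estimated fold change of the first gene of interest under each factor.
fc_whole = sample_b[7] / factor_whole / sample_a[7]
fc_subset = sample_b[7] / factor_subset / sample_a[7]

print(factor_whole, fc_whole)    # depth difference recovered, 4-fold change kept
print(factor_subset, fc_subset)  # true change absorbed into the factor
```

With the whole matrix as the universe, the unchanged majority anchors the factor and the 4-fold change survives. With only the genes of interest, the factor absorbs the real change and the genes appear unchanged.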
Dear Gordon,
Clear, and many thanks for your quick reply.
So the steps are: filter out lowly expressed genes --> normalization --> keep only genes of interest. What about recomputing library sizes? Should we use keep.lib.sizes=FALSE before and after normalization?
Sorry, I don't understand the motivation for your question. You are now proposing to do the one thing that I told the original poster must never be done.
The whole purpose of TMM normalization is to estimate the effective library sizes. Resetting the library sizes after that would make nonsense of the normalization.
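To illustrate the point numerically: in edgeR the effective library size is the library size multiplied by the normalization factor. The sketch below is plain Python with made-up numbers. The library sizes, normalization factors and subset totals are all hypothetical, not output from any real analysis. It compares the relative effective library sizes of two samples when the original library sizes are kept after subsetting with what happens if they are reset to the subset totals.

```python
# Hypothetical values for two samples, normalized on the full expression matrix.
lib_size = [2_000_000, 4_000_000]   # total counts over all genes
norm_factor = [1.25, 0.8]           # assumed TMM-style factors (product is 1)

# Effective library size = library size * normalization factor.
effective = [size * f for size, f in zip(lib_size, norm_factor)]

# After subsetting to the genes of interest, the remaining counts sum to far
# less, and not in the same proportion in each sample (hypothetical totals).
subset_total = [30_000, 120_000]

# Resetting the library sizes to the subset totals while keeping the old
# normalization factors gives different effective library sizes.
reset_effective = [t * f for t, f in zip(subset_total, norm_factor)]

# Sample B relative to sample A under each choice.
ratio_kept = effective[1] / effective[0]
ratio_reset = reset_effective[1] / reset_effective[0]

print(round(ratio_kept, 2))   # relative depth as estimated by normalization
print(round(ratio_reset, 2))  # distorted by the composition of the subset
```

Here the reset doubles the apparent depth of sample B relative to sample A, so the factors estimated on the full matrix no longer mean what they were estimated to mean.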
Please, no more comments added to 3-year-old questions. If you have a question, post a new question of your own.