Question

removing genes before RNA-seq normalization

0

aec ▴ 90

@aec-9409

Last seen 4.8 years ago

Dear all,

Removing genes manually before RNA-seq normalization is not good practice, right? For example, we would like to investigate osteoblast expression in bone samples, but we know that there is some contamination from muscle. Is it correct to remove the 'muscle' genes before normalization? My understanding is that this should not be done, because TMM normalization corrects for library size and compositional biases. Imagine that some bone samples are more contaminated than others, and one has extremely high expression of muscle genes. If we compare two conditions and remove the contaminating transcripts before normalization, we would obtain untrustworthy results, right?

Another example would be removing all non-coding genes beforehand if we want to study only protein-coding genes. Does the same apply?

Thanks,

edger deseq2 normalization removing genes • 5.1k views

ADD COMMENT • link updated 3.0 years ago by Gordon Smyth 52k • written 5.4 years ago by aec ▴ 90

score 1 · Answer 1 · 2019-11-13

1

Gordon Smyth 52k

@gordon-smyth

Last seen 5 hours ago

WEHI, Melbourne, Australia

It is up to you to determine what "universe" of genes you want to consider and, unless you remove most of the genome, it doesn't cause any problems for TMM or edgeR.

You can consider protein coding genes only if you want, or only somatic chromosomes, or only messenger RNA, or only microRNAs, whatever is biologically appropriate. You just have to describe what you did when you publish.

If you can unambiguously identify "contaminating" genes, then you can remove them as well. Again, you have to explain what you did and why.
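As a toy illustration of why this is harmless (a sketch only, not edgeR code: the gene names and counts are invented, and it uses plain library-size scaling rather than TMM), compare counts-per-million for the genes of interest when the library size does or does not include a contaminating gene:

```python
# Toy counts for two samples. Gene names and numbers are made up for illustration.
counts = {
    "bone_gene_1":   [100, 100],
    "bone_gene_2":   [200, 200],
    "muscle_gene_1": [100, 700],   # sample 2 is much more contaminated
}

def cpm(counts, genes):
    """Counts per million, with library sizes computed from `genes` only."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(counts[g][j] for g in genes) for j in range(n_samples)]
    return {
        g: [1e6 * counts[g][j] / lib_sizes[j] for j in range(n_samples)]
        for g in genes
    }

all_genes = list(counts)
bone_only = [g for g in all_genes if not g.startswith("muscle")]

cpm_all = cpm(counts, all_genes)    # contaminant inflates sample 2's library size
cpm_bone = cpm(counts, bone_only)   # contaminant removed before library sizes are computed

print(cpm_all["bone_gene_1"])   # bone gene looks lower in sample 2
print(cpm_bone["bone_gene_1"])  # bone gene is equal in both samples
```

In this made-up example the bone genes have identical counts in both samples, so the apparent difference in `cpm_all` comes entirely from the contaminant's share of the library. Removing it before library sizes are computed gives the same value in both samples.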

The only thing you can't do is to remove genes and recompute library sizes after applying TMM normalization.

Every one of my own published papers explains which genes were removed before normalization. Just to take the most recent (Vrahnas et al, Nature Communications 2019), we said:

"Immunoglobulin gene segments, ribosomal genes, predicted and pseudo genes, sex-linked genes (Y chromosome and Xist), and obsolete Entrez Gene IDs were filtered out."

ADD COMMENT • link 5.4 years ago Gordon Smyth 52k

0

Thanks Gordon, then I was wrong. I thought one could not modify the RNA composition of the sample bioinformatically before normalization.

ADD REPLY • link 5.4 years ago aec ▴ 90

0

Dear Gordon,

To be sure: if I understand correctly, your answer implies that removing genes that are not of interest should be done before normalization. In a question I posted somewhat earlier on Biostars (https://www.biostars.org/p/9490668/), the answer was that normalization can be applied before removing genes that are not of interest. These two answers seem somewhat contradictory.

For some context, I am running an RNA-seq experiment in which we want to evaluate only a subset of all genes. Our current workflow is: select only protein-coding genes --> exclude lowly transcribed genes --> apply TMM normalization.

My only worry is that one of the assumptions of normalization is that the majority of genes are not differentially expressed. As we are now looking at a specific disease and selecting only genes that are highly expressed in a specific tissue, I am not sure whether this assumption still holds. We consider only a very small part of the genome (1000 genes).

Would you advise subsetting the genes before normalization, or normalizing on the whole expression matrix and subsetting the genes of interest afterwards?

ADD REPLY • link 3.0 years ago Barista • 0

1

Yes, if you have just a small group of genes of interest, then naturally you should normalize on the whole expression matrix and subset afterwards. I cannot see how you could possibly read into any of my comments the suggestion that you should subset so drastically before normalization. I told the original poster that they could choose their universe, but you still have to have a universe.
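To make the point concrete, here is a rough sketch with invented numbers. It is not edgeR's TMM: `trimmed_mean_log_ratio` is a hypothetical, simplified stand-in that trims only the log-ratios and applies no precision weights. It compares normalizing on a whole toy matrix against normalizing on a small subset in which every gene has changed:

```python
import math

def trimmed_mean_log_ratio(sample, ref, trim=0.3):
    """Simplified, unweighted scaling factor: trimmed mean of per-gene
    log2 ratios of proportions (sample vs reference), back-transformed."""
    lib_s, lib_r = sum(sample), sum(ref)
    m = sorted(
        math.log2((s / lib_s) / (r / lib_r))
        for s, r in zip(sample, ref)
        if s > 0 and r > 0
    )
    k = int(len(m) * trim)                       # genes trimmed from each tail
    kept = m[k:len(m) - k] if len(m) > 2 * k else m
    return 2 ** (sum(kept) / len(kept))

# 100 toy genes. The first 10 are hypothetical "genes of interest", all 4-fold up.
ref = [100] * 100
sample = [400] * 10 + [100] * 90
interest = range(10)

# Normalize on the whole matrix, then look at a gene of interest.
f_whole = trimmed_mean_log_ratio(sample, ref)
eff_whole = sum(sample) * f_whole                # effective library size
fc_whole = (sample[0] / eff_whole) / (ref[0] / sum(ref))

# Subset to the genes of interest first, then normalize.
sub_s = [sample[i] for i in interest]
sub_r = [ref[i] for i in interest]
f_sub = trimmed_mean_log_ratio(sub_s, sub_r)
fc_subset = (sub_s[0] / (sum(sub_s) * f_sub)) / (sub_r[0] / sum(sub_r))

print(fc_whole)   # close to 4: the change is recovered
print(fc_subset)  # close to 1: the change has been normalized away
```

In the subset every gene is differentially expressed, so there are no unchanged genes left for the scaling factor to anchor on, and the 4-fold change disappears. With the whole matrix, the 90 unchanged genes provide that anchor.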

ADD REPLY • link 3.0 years ago Gordon Smyth 52k

0

Dear Gordon,

Clear, and many thanks for your quick reply.

So the steps are: filter out lowly expressed genes --> normalization --> keep only genes of interest. What about recomputing library sizes? Should we use keep.lib.sizes=FALSE before and after normalization?

ADD REPLY • link 3.0 years ago Barista • 0

0

Sorry, I don't understand the motivation for your question. You are now proposing to do the one thing that I told the original poster must never be done.

The whole purpose of TMM normalization is to estimate the effective library sizes. Resetting the library sizes after that would make nonsense of the normalization.
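A small numeric sketch of the problem, using invented counts and hypothetical normalization factors (none of this is edgeR code). The effective library size is the library size multiplied by the normalization factor, and the factors were estimated relative to the original library sizes:

```python
# Toy data: three genes, two samples. All values are made up.
counts = {"geneA": [50, 100], "geneB": [150, 300], "geneC": [800, 600]}
lib_sizes = [1000, 1000]        # column sums of the full matrix
norm_factors = [1.25, 0.8]      # hypothetical factors estimated on the full matrix

# Effective library size = library size * normalization factor.
effective = [l * f for l, f in zip(lib_sizes, norm_factors)]

keep = ["geneA", "geneB"]       # genes of interest, subset AFTER normalization

# Consistent: subset the genes but keep the original effective library sizes.
cpm_kept = {g: [1e6 * counts[g][j] / effective[j] for j in range(2)] for g in keep}

# Inconsistent: recompute library sizes from the remaining genes while
# reusing factors that were estimated against the old library sizes.
new_lib = [sum(counts[g][j] for g in keep) for j in range(2)]
wrong_effective = [l * f for l, f in zip(new_lib, norm_factors)]
cpm_wrong = {g: [1e6 * counts[g][j] / wrong_effective[j] for j in range(2)] for g in keep}

for g in keep:
    print(g, cpm_kept[g][1] / cpm_kept[g][0], cpm_wrong[g][1] / cpm_wrong[g][0])
```

The between-sample ratio for the same gene differs between the two versions, because the old factors no longer correspond to the library sizes they are multiplied by. In edgeR terms, this is what subsetting with `keep.lib.sizes=FALSE` after normalization would do.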

Please, no more comments added to 3-year-old questions. If you have a question, post a new question of your own.

ADD REPLY • link 3.0 years ago Gordon Smyth 52k