Question

DESeq2: removal of duplicated genes during statistical analysis

thkapell ▴ 10

@tkapell-14647

Last seen 2.3 years ago

Helmholtz Center Munich, Germany

Hi all,

In my latest analysis with the DESeq2 package, I noticed a few paralogous genes (~30) that share the same statistics and Ensembl ID. I could simply keep only the unique genes, but I was wondering whether it would be more appropriate to remove the duplicated genes before the statistical analysis, since they contain redundant information that worsens my statistics. Would you recommend doing this, or would you still include them in the analysis and perhaps discard them later, for downstream visualization? And if you would remove them, would you do so before running results() or before DESeq()? Hope this makes sense.

deseq2 gene symbol ensembl • 2.6k views

ADD COMMENT • link updated 6.4 years ago by Michael Love 43k • written 6.4 years ago by thkapell ▴ 10

score 0 · Answer 1 · 2018-11-24

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

What is your quantification setup? Why do you end up with multiple rows with the same ID?

ADD COMMENT • link 6.4 years ago Michael Love 43k

I did a gene-level differential expression analysis at the transcriptome level, so my count table has transcript version IDs (e.g. ENSG00000000003.14). When I convert those to Ensembl IDs (e.g. ENSG00000000003), about 30 genes share the same Ensembl ID because they are paralogs on the Y chromosome (their transcript version IDs end in _PAR_Y, but they have the same Ensembl ID).

ADD REPLY • link 6.4 years ago thkapell ▴ 10

I haven't thought about what to do with these, but generally if the sequence is near identical, I would collapse the redundant transcripts by adding their counts together. Salmon does this by default for identical transcripts (where otherwise the counts would be split equally among the identical sequences).

ADD REPLY • link 6.4 years ago Michael Love 43k
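
As a rough illustration of the collapsing step suggested above (summing the counts of rows that end up with the same ID), here is a minimal Python sketch using only the standard library. It is not DESeq2 or Salmon code. The function names `strip_version` and `collapse_counts` are hypothetical, the counts are invented toy values, and the second gene ID is used purely as an example of a `_PAR_Y` pair. It assumes the ID format described in the thread: everything from the first `.` onward is the version suffix, including any `_PAR_Y` tag.

```python
def strip_version(versioned_id):
    """Return the ID without its version suffix (and without any _PAR_Y tag).

    Assumes the suffix starts at the first '.', e.g.
    'ENSG00000000003.14' becomes 'ENSG00000000003'.
    """
    return versioned_id.split(".", 1)[0]


def collapse_counts(counts):
    """Sum per-sample counts over rows that share the same unversioned ID.

    counts: dict mapping versioned ID -> list of raw counts, one per sample.
    Returns a dict mapping unversioned ID -> list of summed counts.
    """
    collapsed = {}
    for versioned_id, row in counts.items():
        gene = strip_version(versioned_id)
        if gene in collapsed:
            # Same ID seen before: add the counts sample by sample.
            collapsed[gene] = [a + b for a, b in zip(collapsed[gene], row)]
        else:
            collapsed[gene] = list(row)
    return collapsed


# Toy data (invented): three samples, one gene with a _PAR_Y duplicate row.
counts = {
    "ENSG00000000003.14": [10, 20, 30],
    "ENSG00000182378.13": [5, 0, 7],
    "ENSG00000182378.13_PAR_Y": [1, 2, 3],
}

collapsed = collapse_counts(counts)
for gene, row in collapsed.items():
    print(gene, row)
```

Because DESeq2 models raw counts, a collapse like this would presumably be applied to the count matrix before calling DESeq(), not to the results table afterwards. In R, base `rowsum()` performs the same kind of grouped row sum on a matrix.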