Hi all,
In my latest analysis with the DESeq2 package, I noticed that a few paralogous genes (~30) share the same statistics and Ensembl ID. I could simply keep only the unique genes, but I was wondering whether it would be more appropriate to remove the duplicated genes before the statistical analysis, since they carry redundant information that worsens my statistics. Would you recommend this, or would you keep them in the analysis and perhaps discard them later for downstream visualization? And if you would remove them, would you do so before running results() or before DESeq()? Hope this makes sense.
I ran a differential gene expression analysis starting from transcriptome-level data, so my count table has versioned IDs (e.g. ENSG00000000003.14). When I convert those to unversioned Ensembl IDs (e.g. ENSG00000000003), about 30 genes end up sharing the same Ensembl ID because they are paralogs on the Y chromosome (their versioned IDs end in _PAR_Y, but they have the same Ensembl ID).
I haven't thought about what to do with these, but generally, if the sequences are nearly identical, I would collapse the redundant transcripts by adding their counts together. Salmon does this by default for identical transcripts (otherwise the counts would be split equally among the identical sequences).
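
As a rough illustration of the collapsing idea, here is a minimal sketch in base R that sums the rows of a count matrix sharing the same unversioned ID. The matrix, the IDs and the sample names are all made up for the example, and the regular expression assumes the version (plus any _PAR_Y suffix) always follows the first dot.

```r
# Toy count matrix: rows are versioned IDs, columns are samples.
# All IDs and values here are hypothetical, for illustration only.
counts <- matrix(
  c(10L, 12L,
     3L,  4L,
     5L,  6L),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("ENSG00000000001.5",
      "ENSG00000000002.10",
      "ENSG00000000002.10_PAR_Y"),
    c("sample1", "sample2")
  )
)

# Drop everything from the first dot onward (version and any _PAR_Y suffix).
gene_id <- sub("\\..*$", "", rownames(counts))

# Add together the counts of rows that share the same unversioned ID.
collapsed <- rowsum(counts, group = gene_id)

print(collapsed)
```

Because this works on the raw count matrix, it would be applied before the matrix is passed to DESeqDataSetFromMatrix(), and therefore before DESeq() rather than between DESeq() and results().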