I am working with DeSeq2 and trying to convert ENSG IDs to Gene symbols after calling the results function in DeSeq2. When doing this however, I end up with duplicated (or sometimes triplicated or more) gene symbols each corresponding to a different ENSG ID. These duplicated symbols often have the same exact P-adj, logFC values, etc but sometimes they are slightly different.
I am a novice at RNA-Seq analysis so apologies if this is a silly question! I have tried looking on bioconductor, DeSeq vignette, etc and have not been able to figure out a reason or work around for this issue.
An example of one of the genes this is happening with is: AADACL2 which shows up with two ENSG IDs, ENSG00000197953 and ENSG00000261846.
My code is (hopefully) posted here:
#Creating a DeSeq Matrix from Tximport
samples <- read.table(file.path(dir,"samples_v3.txt"), header=TRUE)
files <- file.path(dir,"salmon", paste("salmon_",samples$run, ".txt", sep='') )
load("/Users/rimlandca/Desktop/R files/tx2gene_GRCh38.80.RData")
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
ddsTxi <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ type)
#Collapsing Technical Replicates ddsColl<- collapseReplicates(ddsTxi, samples$name)
#Pre-FilterddsColl <- ddsColl[ rowSums(counts(ddsColl)) > 1, ] #Subset for protein coding genes (we are only interested in protein coding genes)
ddsCollinfo <- data.frame(getBM(attributes=c("ensembl_gene_id","hgnc_symbol", "transcript_biotype"), filters="ensembl_gene_id",values=rownames(ddsColl), mart=ensembl)) ddsCollinfocoding <- subset(ddsCollinfo, ddsCollinfo$transcript_biotype == "protein_coding") ddsCollcoding <- subset(ddsColl, rownames(ddsColl) %in% ddsCollinfocoding$ensembl_gene_id)#Run DeSeq on just a subset of the biological groups
ddsCollcoding_CBDonly<-ddsCollcoding[, ddsCollcoding$origin == "CBD"] ddsCollcoding_CBDonly$type<-droplevels(ddsCollcoding_CBDonly$type) dds_DGE_CBD <- DESeq(ddsCollcoding_CBDonly) resultsCtCc <- results(dds_DGE_CBD, contrast=c("type", "CBD_t", "CBD_c"), alpha=0.05) summary(resultsCtCc)
#Add Gene Symbols to Results output and exportlibrary( org.Hs.eg.db ) library(AnnotationDbi) resultsCtCc$symbol <- mapIds(org.Hs.eg.db, keys=row.names(resultsCtCc), column="SYMBOL", keytype="ENSEMBL", multiVals="first") resultsCtCc.df <- as.data.frame(resultsCtCc) write.csv(resultsGtGc.df, file="/Users/rimlandca/Desktop/R Files/resultsGtGc.csv")
Hi James,
Thank you for the detailed answer. Do you know what is the best to handle these situations when looking at differential expression between two groups? For instance I would like to state that biological group A has X number of unique genes differentially expressed from biological group B. Then do follow up analyses on a significant gene list using GO. And also make heat maps. However, when I go to make heatmaps, etc the duplicated genes cause issues as they show up more than once if using the gene symbols.
Is it best to delete the duplicated gene symbols? Or is there a way to merge them together? Or is it best to keep them as is?
This was a code I had tried to delete any gene symbols which were duplicated. When I do that, it is taking 20,000 protein coding genes down to 18,000 or so however, which seems like an awful lot of duplicates.
Thanks again!
resultsCtCc.df <- as.data.frame(resultsCtCc)
!duplicated(resultsCtCc.df$symbol)
index<- which(!duplicated(resultsCtCc.df$symbol))
resultsCtCc_subset <- resultsCtCc.df[index, ]
Any reasonable GO analysis software will automatically remove any duplicates, so that shouldn't be an issue. As for having duplicate symbols in a heatmap, you could simply remove any of the dups before you generate the heatmap.
I am in general not concerned with duplicate gene symbols because I don't really expect them to be unique. So dups may mean you measured the same gene twice, or it may mean (in the case of AADACL2) that at least one group thinks there are two versions of the gene in the genome that are different enough to require different IDs. That to me isn't something to be ignored, or 'fixed' but instead something to be explored and explained.
In other words, there are multiple groups of people who spend much of their time deciding how many genes there are, where they are located in the genome, etc, etc. And for somewhere around 2000 genes they have decided (for maybe 2000 different reasons) that things with the same gene symbol should have more than one (in this case, Ensembl) ID. Since I haven't spent any time investigating any of those genes, I would be hesitant to simply decide that I could simply 'fix' the problem by removing one of the duplicates. So in my analyses I don't.
Fair enough, that makes sense. So in your own analyses if you were to report on the total number of genes that are differentially expressed between groups, would you consider these duplicated gene instances as unique genes? Or as one gene?
For instance in my data set Tissue A has 700 genes up-regulated compared to Tissue B, however if I took out duplicates it may be 620 lets say (made up these numbers, may be higher or lower in reality). Would you report the 700 number? Or the 640? Or would you just go and investigate what the significant, duplicated genes were if its a manageable number?
I would report the 700 genes. But I consider the Ensembl IDs the genes, and not the HUGO symbols. That's probably because I'm a statistician. Seems to me that biologists tend to identify with gene symbols, to their detriment.
If one of my colleagues were to then come to me and ask 'wassup with all these dups', I would explain pretty much what I have already explained to you, and for those genes that were of interest, I would then provide more information to help figure out how different the duplicated symbols really are, and if it is worth it (for my colleague) to consider them separately or not.
Okay, thank you. That seems reasonable and helps clarify the issue a lot. It also makes me feel better to know that the duplicates are not arising due to something incorrect in my code which was another worry I had!
Also I forgot to mention that your vivid description of how your biologist colleagues would ask about the duplicates made me chuckle.
hehe, I would totally come to you and say: 'wassup with all these dups'!
Thanks again for taking the time to answer :D