Dear Communities
As the author of limma suggested, the DEG analysis of RSEM expected count should be done by limma::voom from a computational point of view (proper transformation for differential expression in normalized log expression data). But there are nearly 2000 genes when the Ensembl ID was transformed to the official gene symbol. So I wonder whether limma::avereps could/should be conducted to remove duplicated genes before feeding to voom. Any suggestions would be great appreciation!
# limma::avreps
rt = as.matrix(Expected_count)
rownames(rt) = rt[,1]
exp = rt[,2:ncol(rt)]
dimnames = list(rownames(exp),colnames(exp))
dataMatrix = matrix(as.numeric(as.matrix(exp)),
nrow = nrow(exp),
dimnames = dimnames)
dataMatrix %<>%
limma::avereps() %>%
as.data.frame() %>%
rownames_to_column('symbol')
Thanks to your reply and your package sir! Regarding "avereps() is not applicable to RNA-seq data because read counts can't be averaged". Is that reasonable for RNA-Seq(FPKM/TPM) were averaged by limma::avereps? If the Ensembl ID was kept though the analysis, is that reasonable to remove the duplicated genes after DEG analysis? Thanks so much!
You are assuming that the official gene symbol uniquely identifies a gene. This is not a valid assumption! It is also not necessarily a valid assumption about Ensembl Gene IDs (or NCBI Gene IDs), but both of those annotation services make a concerted effort to collapse duplicate identifiers, and old versions get deprecated, so it is much easier to know if a given ID has been deprecated or not, based on the build. Gene symbols do get deprecated as well, but they have meaning to biologists and tend to persist as an alias. As an example, there is a 1:1 mapping between NCBI Gene IDs and HGNC Symbols:
Even though there is a 1:1 mapping between the Gene IDs and symbols, the symbols aren't unique
Those may arise from duplicate NCBI Gene IDs, or they may be actual duplicates.
But this isn't true of Aliases!
An alias is just a deprecated gene symbol and depending on when and how the IDs were mapped to symbols, you may have a mixture or symbols and aliases. I personally never rely on gene symbols unless I have no other choice.
Thanks for this so detailed answer! Regarding "never rely on gene symbols unless I have no other choice", it means that you prefer to use ensemble ID or entrezID? The reason why I always want to transform those kinds od ID to official gene symbol is that people are usually familiar with this other than ensemble ID or others. And if I keep those duplicated names of gene, those would not be applicable for R in most cases, like set as row.names. So limma::avereps is used to removing those duplicated names for RNA-Seq(TPM/FPKM) and microarray. I'm not sure RNA-Seq(TPM/FPKM) would be okay by this function. Could you please give me more suggestions? Thanks so much!
I prefer either Ensembl or NCBI Gene IDs. I am agnostic to the choice, but they are superior to gene symbols. I always also provide the gene symbols, because as you note, they have meaning to many biologists. But I don't rely on the gene symbols other than to provide them to end users. In other words, the
row.names
for my counts matrix will be either Ensembl or NCBI Gene IDs, and the 'genes' list item of myDGEList
will be adata.frame
that includes both the Gene ID and symbol.When I then output the data, there will be columns containing the Gene ID, the symbol, the gene description, and then the statistics. Biologists can then peruse the list and find genes of interest. If there are duplicated gene symbols, then so be it. Given the unique Gene ID, the biologist can go to the relevant annotation source and figure out why two different Gene IDs have the same symbol.
As an example, I just analyzed some RNA-Seq data from Daphnia magna. Two of the top genes had different NCBI Gene IDs and different directionality, but the same exact gene name (mucin-2). NCBI says those two genes are on completely different chromosomes. Apparently there are lots of genes called mucin-2 in Daphnia! All those mucin-2 genes may all be the same exact gene, and maybe that's common for Daphnia, but I don't know that to be the case and am unwilling to make that assumption. Gene symbols and gene names are meant to be descriptive, whereas Gene IDs are meant to provide a unique identifier, and it's safer IMO to stick with the IDs.
Got that, very thankful sir!