Question

getBM function, search for annotation characteristics with entrezgene filters --> result with duplicates

Jose Luis Soto Vázquez • 0

@jose-luis-soto-vazquez-12108

Last seen 7.3 years ago

Spain/Vigo/University of Vigo

Hello,

I have a problem with the getBM function. I'm analyzing an RNA-seq experiment. The file I'm working with contains an ExpressionSet object with RNA-seq count data for 700 samples, as well as information about different phenotypes. The data set has 20532 genes (identified by Entrez Gene ID).

I want to extract the annotation characteristics (start position, end position and GC content). For this purpose I use this code:

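library(biomaRt)

# counts: RNA-seq count data with Entrez Gene IDs as row names
# (in newer Ensembl releases the attribute and filter are named "entrezgene_id")
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")

annot <- getBM(attributes=c("entrezgene", "start_position", "end_position",
                            "percentage_gc_content"),
               filters="entrezgene",
               values=rownames(counts),
               mart=mart)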

(counts is the object with the 20532 genes as row names and the counts for every sample.)

My problem starts when I get the annot result with these characteristics. It has 21739 rows, more than the original 20532. Looking through it, I realized that some genes are duplicated; that is to say, there are, for instance, two different entrezgene identifiers that correspond to the same gene.
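A quick way to see where the extra rows come from is to compare the row count with the number of distinct IDs and pull out the rows that share an ID. The sketch below uses a small invented data frame in place of the real getBM() result, so the values are assumptions for illustration only.

```r
# Toy stand-in for the getBM() result; all values are invented.
annot <- data.frame(
  entrezgene            = c(1, 2, 2, 3, 3, 3),
  start_position        = c(100, 500, 9000, 200, 210, 7000),
  end_position          = c(400, 900, 9400, 600, 640, 7400),
  percentage_gc_content = c(41.2, 55.0, 54.1, 38.7, 39.0, 40.3)
)

# Number of rows versus number of distinct IDs.
n_rows   <- nrow(annot)
n_unique <- length(unique(annot$entrezgene))

# IDs that appear on more than one row.
dup_ids <- unique(annot$entrezgene[duplicated(annot$entrezgene)])

# All rows belonging to those IDs, for side-by-side inspection.
dup_rows <- annot[annot$entrezgene %in% dup_ids, ]
print(dup_rows)
```

If n_rows is larger than n_unique, the difference is explained entirely by IDs that were returned on several rows.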

How can I resolve this problem? I expected the annot file to have fewer genes than the original, but the opposite happened. Maybe the code is not correct, or maybe the solution is easy to find, but I'm very new to R and Bioconductor. Any help?

Thanks in advance,

Jose

getBM entrez gene identifiers duplicate rnaseq • 2.3k views

score 2 · Answer 1 · 2017-01-04

There are multiple reasons why you get duplicates - here are two:

Queries across annotation groups (NCBI Entrez Gene IDs used to query Ensembl database) will likely bring up differences in the annotation methods used by NCBI as compared to Ensembl.

Some genes are found on multiple chromosomes or unplaced haplotypes. They will by default have different start/end coordinates (as well as chromosomes).
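One way to check the second reason is to also request the chromosome_name attribute from getBM() and see whether the extra rows sit on non-standard sequences. The sketch below assumes that attribute was added to the query and uses an invented data frame; the patch and haplotype names are illustrative only.

```r
# Toy stand-in for a getBM() result that also includes chromosome_name.
# IDs, coordinates and sequence names are invented for illustration.
annot <- data.frame(
  entrezgene       = c(10, 10, 20, 30, 30),
  chromosome_name  = c("6", "CHR_HSCHR6_MHC_COX_CTG1", "X", "1", "CHR_HG2_PATCH"),
  start_position   = c(1000, 1050, 5000, 300, 320),
  end_position     = c(2000, 2050, 6000, 800, 820),
  stringsAsFactors = FALSE
)

# Primary assembly sequences for human.
standard_chr <- c(as.character(1:22), "X", "Y", "MT")

# Keep only rows located on the primary assembly.
annot_std <- annot[annot$chromosome_name %in% standard_chr, ]
print(annot_std)
```

This removes only the duplicates caused by alternative sequences; IDs that map to several Ensembl genes on the primary assembly will still appear more than once.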

There are probably other reasons as well. These are not simple issues, and cannot be easily 'fixed' without making very strong simplifying assumptions, like simply removing all but the first of the duplicates. Or you could randomly select one of the duplicates. An alternative would be to inspect each duplicated gene, figure out why it is duplicated, and then choose the 'best' one. None of these is likely to be entirely satisfying.
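The two simplifying options mentioned above (keep the first duplicate, or keep a random one) could be sketched in base R as follows. The data frame is an invented stand-in for the getBM() result, and the seed is only there to make the random choice reproducible.

```r
# Toy stand-in for the getBM() result; all values are invented.
annot <- data.frame(
  entrezgene     = c(1, 2, 2, 3, 3, 3),
  start_position = c(100, 500, 9000, 200, 210, 7000),
  end_position   = c(400, 900, 9400, 600, 640, 7400)
)

# Option 1: keep the first row seen for each ID.
annot_first <- annot[!duplicated(annot$entrezgene), ]

# Option 2: keep one randomly chosen row per ID.
# Shuffle the rows, then keep the first row per ID in the shuffled order.
set.seed(1)
shuffled     <- annot[sample(nrow(annot)), ]
annot_random <- shuffled[!duplicated(shuffled$entrezgene), ]
```

Both results have exactly one row per ID; which row survives in the first option depends on the order in which the rows were returned.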

score 0 · Answer 2 · 2017-01-05

Hello,

Thank you very much for your answer. Now I understand it.

Yes, maybe none of these is likely to be entirely satisfying. What would be the best (or the most commonly used) way to select the "best" option for every gene, so that there are no duplicates?

The "inspecting" alternative seems a little tedious for a large set of genes. I could filter the genes to get fewer of them, but what if I am interested in analyzing the whole set? (Maybe that's not possible.)

Is there any function or code to eliminate the duplicates or "clean" the resulting annot file? Maybe it depends on the situation.

Thank you,

Jose

If you want to make a comment or add another question, please use the ADD COMMENT link, rather than the Add your answer box below.

As to which method is 'best', that depends on how you want to define best. I could argue that the best way would be to retain the duplicates; if there are multiple regions of the genome that are thought to contain a given gene, then isn't that information relevant?

Alternatively, I could argue that you seem to primarily want the GC content, and in that case you could compute the mean GC content over all the regions that contain the gene. That might be 'best' in some sense.
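As a rough sketch of the averaging idea, base R's aggregate() can collapse the GC content to one value per ID. The data frame is an invented stand-in for the getBM() result. Start and end positions are left out on purpose, since averaging coordinates from different regions would not be meaningful.

```r
# Toy stand-in for the getBM() result; all values are invented.
annot <- data.frame(
  entrezgene            = c(1, 2, 2),
  percentage_gc_content = c(40, 50, 60)
)

# One row per ID, holding the mean GC content over all of its regions.
gc_mean <- aggregate(percentage_gc_content ~ entrezgene,
                     data = annot, FUN = mean)
print(gc_mean)
```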

Or alternatively I could argue that the 'best' method is to simply remove the duplicates as fast as possible, because who has the time? In that case, just subsetting out the duplicated Entrez Gene IDs would be fastest and easiest, and hence best.
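If the quick route is taken, the deduplicated annotation usually also needs to be put in the same order as the count matrix. The sketch below assumes Entrez Gene IDs are the row names of counts, as in the question, and uses invented data. IDs that returned nothing from the query end up as rows of NA.

```r
# Toy count matrix with Entrez Gene IDs as row names; values are invented.
counts <- matrix(1:8, nrow = 4,
                 dimnames = list(c("3", "1", "4", "2"), c("s1", "s2")))

# Toy stand-in for the getBM() result; ID 2 is duplicated, ID 4 is missing.
annot <- data.frame(
  entrezgene            = c(1, 2, 2, 3),
  percentage_gc_content = c(40, 50, 60, 45)
)

# Drop all but the first row for each ID.
annot_unique <- annot[!duplicated(annot$entrezgene), ]

# Reorder to match the rows of counts; unmatched IDs give NA rows.
idx           <- match(rownames(counts), as.character(annot_unique$entrezgene))
annot_aligned <- annot_unique[idx, ]
rownames(annot_aligned) <- rownames(counts)
print(annot_aligned)
```

After this step annot_aligned has exactly one row per row of counts, so the two objects can be used side by side.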

In the end, as with most analyses, there are choices that have to be made. Your goal as an analyst is to decide what choice you want to or are willing to make, based on all the criteria of interest in your analysis (time available to do the analysis, what the goals are, what hypothesis you are trying to test, etc.), and have a cogent argument as to why your choice was optimal in some sense.

ADD REPLY • link 8.3 years ago James W. MacDonald 68k