Entering edit mode
shirley zhang
★
1.0k
@shirley-zhang-2038
Last seen 10.2 years ago
Hi Marc,
Thanks a lot for your detailed reply. It is very helpful.
I tried library "GenomicFeatures" and your code. It works great. I did
get position for each Transcript. However I am only interested in
obtaining position for each gene ID/gene symbol instead of position
for each transcript. For example, if one gene has multiple
transcripts, I only want the biggest boundary of each gene which
includes all of its transcripts. Does GenomicFeatures or other package
can do this? Or I have to compute the gene position by combining all
of its corresponding transcripts?
Thanks again for your suggestions.
Shirley
On Thu, Aug 19, 2010 at 2:50 PM, Marc Carlson <mcarlson at="" fhcrc.org="">
wrote:
> Hi Shirly,
>
> That is a very good question! ?Looking at the latest builds of the
> annotations (which is what we normally would recommend), the manual
page
> will now tell you that the source for the latest CHRLOC and
CHRLOCEND
> information is from here:
>
> ? ? Mappings were based on data provided by: UCSC Genome
> ? ? Bioinformatics (Homo sapiens)
> ? ? ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19
>
> That means that this information corresponds to the more recent
"hg19"
> build which in turn will correspond to "build 37". ?So if you want
older
> builds you could use older versions of the org packages, but I since
I
> would never in good conscience recommend using stale annotations, I
> think that in that case you would be much better off to do the
initial
> steps I showed you (to get the entrez gene IDs), and THEN using the
> GenomicFeatures package, which should allow you to get positions for
the
> hg18 build mapped onto entrez gene IDs.
>
> You can see a vignette for GenomicFeatures here:
>
> http://www.bioconductor.org/packages/release/bioc/html/GenomicFeatur
es.html
>
> You would basically just need to use makeTranscriptDbFromUCSC(),
> serialize the result for safekeeping, and then call transcripts() on
the
> object it gave you. ?Once you had done so, you could simply ask for
the
>
> library(GenomicFeatures)
> foo = makeTranscriptDbFromUCSC(genome="hg18",tablename="knownGene")
> bar = transcriptsBy(foo, by="gene")
> ## At this point you will have ?GRanges object which is really
useful
> for a lot of great stuff,
> ## but since you are only interested in matching the gene_ids to the
> positions,
> ## you can just pare it down to a much simpler object like so:
>
> frm = as.data.frame(bar)
>
> ## The values in "element" will be the names of the elements from
the
> GRangesList object bar.
> ## If you check them you will see that for this UCSC track, they are
the
> entrez gene IDs that you want to merge with.
>
> Hope this helps,
>
>
> ?Marc
>
>
>
> On 08/18/2010 06:26 PM, shirley zhang wrote:
>> Hi Marc,
>>
>> Thanks a lot for your information. It is very very helpful. May I
ask
>> one more question?
>>
>> In my case, ?the snp_position is based on Human Build 36. ?The
package
>> org.Hs.eg.db was updated in October 2009, so can I assume it is on
>> Build 36?
>>
>> Thanks again,
>> Shirely
>>
>> On Wed, Aug 18, 2010 at 12:56 PM, Marc Carlson <mcarlson at="" fhcrc.org=""> wrote:
>>
>>> Hi Shirley,
>>>
>>> You can find this information in an organism package.
?org.Hs.eg.db for
>>> humans (for example).
>>>
>>> What will be harder will be getting all the information from the
mixed
>>> bag of IDs that you describe. ?But you can still retrieve it for
each ID
>>> type separately like this example for refseq IDs:
>>>
>>> ## So for refseq you can start by getting the entrez gene ID
>>> library(org.Hs.eg.db)
>>> a = c("NM_000014", "NM_000015", "foo")
>>> gene_id = unlist2(mget(a, revmap(org.Hs.egREFSEQ), ifnotfound=NA))
>>> gene_id = gene_id[!is.na(gene_id)]
>>>
>>> ## And then once you have the IDs, you can map them to a position
>>> chrloc = toTable(org.Hs.egCHRLOC[gene_id])
>>> chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id])
>>>
>>> ## And now because we were careful to make sure that we always
have an
>>> entrez gene ID,
>>> ## you can just merge all the results together to make this easier
to
>>> look at:
>>> merge(merge(data.frame(gene_id=gene_id, refseq=names(gene_id)),
chrloc),
>>> chrlocend)
>>>
>>>
>>> ## This process can then be repeated for other kinds of IDs etc:
>>> b = c("Hs.100217", "Hs.100299")
>>> gene_id = unlist2(mget(b, revmap(org.Hs.egUNIGENE),
ifnotfound=NA))
>>> gene_id = gene_id[!is.na(gene_id)]
>>>
>>> chrloc = toTable(org.Hs.egCHRLOC[gene_id])
>>> chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id])
>>> ## etc.
>>>
>>> Does that help?
>>>
>>>
>>> ?Marc
>>>
>>>
>>>
>>>
>>>
>>> On 08/17/2010 07:26 PM, shirley zhang wrote:
>>>
>>>> Dear list,
>>>>
>>>> I have a list of human SNPs and genes/transcripts which are
obtained
>>>> from different studies. ?I would like to calculate the distance
>>>> between these SNPs and genes/transcripts on the same chromosome.
>>>> However, the ID for the gene/transcript across studies ?is not
>>>> consistent. For example, the ID could be Gene Symbol/Alias
(TRIM5l),
>>>> RNA nucleotide accession (NM_013387), XM_173221 (its gene symbol
is
>>>> LOC254266, but its record is an obsolete version),
Contig22780_RC,
>>>> HSS00001378, Hs.445401 (UniGene entry Hs.445401), etc.
>>>>
>>>> I would like to retrieve the start and end position for those
>>>> gene/transcript. Can anybody help me how to get the position
>>>> information based on these different gene/transcript IDs. If I
also
>>>> need to match all of other IDs to gene symbol/entrezID, what kind
of
>>>> database or bioConductor package will allow me to do the mapping
as
>>>> perfect as possible? How about biomaRt package?
>>>>
>>>> Thanks
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>>
>>
>
>