I am working on a project that maps Affymetrix probeset IDs to Entrez
IDs, gene symbols and chromosomal locations. I used the R package
biomaRt and, specifically for the Affymetrix Mouse430 2.0 array, the
mouse4302.db package. In the results, for genes with multiple
probesets attached, only a small proportion of the probesets have a
precise transcription start location of their own; most of them share
the same start location as the given gene. Is there any way I can get
a better match in terms of the precise transcription start location
for each probeset?
-- output of sessionInfo():
R version 2.12.2 (2011-02-25)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mouse4302.db_2.4.5 org.Mm.eg.db_2.4.6 RSQLite_0.10.0
[4] DBI_0.2-5 AnnotationDbi_1.12.1 mouse4302cdf_2.7.0
[7] affy_1.28.1 Biobase_2.10.0 biomaRt_2.6.0
loaded via a namespace (and not attached):
[1] affyio_1.18.0 preprocessCore_1.12.0 RCurl_1.5-0.1
[4] tools_2.12.2 XML_3.2-0.2
--
Sent via the guest posting facility at bioconductor.org.
Hi,
On Monday, July 2, 2012, Jiayi Hou [guest] wrote:
>
> I am working on a project that maps Affymetrix probeset IDs to Entrez
> IDs, gene symbols and chromosomal locations. I used the R package
> biomaRt and, specifically for the Affymetrix Mouse430 2.0 array, the
> mouse4302.db package. In the results, for genes with multiple
> probesets attached, only a small proportion of the probesets have a
> precise transcription start location of their own;
Can you clarify what you mean by a "transcription start location" for
a probeset? Is this a function of the probes themselves, or are you
talking about the TSS of the gene that the probeset's probes land in?
If it's the latter, are these different TSSs just the annotated TSSs
of different isoforms of the gene?
> most of them share the same start location as the given gene. Is
> there any way I can get a better match in terms of the precise
> transcription start location for each probeset?
I guess I don't understand what you mean by the "start location" of a
probeset -- perhaps you can clarify a bit more what you are trying to
do? More details about the problem you are trying to solve would also
be helpful.
> -- output of sessionInfo():
>
> R version 2.12.2 (2011-02-25)
> Platform: i386-pc-mingw32/i386 (32-bit)
Thanks for also including your sessionInfo output -- while we're
trying to sort this out, you might take this opportunity to upgrade
your version of R to the latest (2.15.1), since we don't really try to
support outdated versions of bioc packages.
HTH,
-Steve
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
Hi Jiayi,
Side note: please CC the bioconductor list when replying to emails so
they can stay online -- you'll get better help (more eyeballs on your
problem), and the list can be used as a resource to others.
I guess this might be a pain using the "guest posting" stuff -- but
subscribing to the mailing list is easy, and you'll learn a lot by
skimming the posts that come through here.
OK -- now to solve your problem:
On Mon, Jul 2, 2012 at 11:03 AM, Jiayi Hou <houj2 at vcu.edu> wrote:
> Hey Steve,
>
> Sorry, let me put it this way: when a probeset hybridizes to a given
> gene, the gene has a chromosomal location in terms of base pairs. For
> a given gene, on average there may be 2-3 probesets that attach to it.
> However, these 2-3 probesets carry different sequences of base pairs
> and are expected to attach to different locations on the given gene.
> What I am looking for is where precisely these probesets attach to
> the gene.
Thanks, that's a bit clearer now.
In the past I've done this with a little elbow grease: you can get the
probe sequence info for the chip you're using from this package:
http://bioconductor.org/packages/2.10/data/annotation/html/htmg430aprobe.html
There's a short vignette on matching probe sequences (against each
other, which isn't all that helpful for you, but can be a start) using
the Biostrings package here:
http://bioconductor.org/packages/2.10/bioc/vignettes/Biostrings/inst/doc/matchprobes.pdf
You can extend the examples there by matching your probes against the
mouse genome using the appropriate BSgenome package
(BSgenome.Mmusculus.UCSC.mm9).
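A minimal sketch of that matching step, assuming the Biostrings package is installed. The target sequence and the three 25-mer probes are made up for illustration; with real data the probes would come from the chip's probe package (for the Mouse430 2.0 array that is presumably mouse4302probe, which is an assumption worth checking) and the target would be one chromosome of BSgenome.Mmusculus.UCSC.mm9.

```r
library(Biostrings)

# Toy stand-in for a chromosome or transcript; the sequence is made up.
set.seed(1)
target_chr <- paste(sample(c("A", "C", "G", "T"), 300, replace = TRUE),
                    collapse = "")
target <- DNAString(target_chr)

# Three hypothetical 25-mer probes of one probeset, cut at known offsets.
probe_starts <- c(41, 120, 233)
probes <- DNAStringSet(substring(target_chr, probe_starts, probe_starts + 24))
names(probes) <- paste0("probe", seq_along(probes))

# Exact matching of all probes at once (PDict expects equal-width probes).
hits <- matchPDict(PDict(probes), target)

# One row per hit: which probe matched, and where on the target.
probe_hits <- data.frame(
  probe = rep(names(probes), lengths(startIndex(hits))),
  start = unlist(startIndex(hits), use.names = FALSE),
  end   = unlist(endIndex(hits), use.names = FALSE)
)

# The region covered by the probeset is the range over its probe hits.
probeset_span <- range(probe_hits$start, probe_hits$end)

# A gene on the minus strand would need the reverse complement as well.
rc_hits <- matchPDict(PDict(reverseComplement(probes)), target)
```

Here probe_hits gives per-probe coordinates and probeset_span the region the probeset covers. On a whole genome the same call would be repeated per chromosome, and probes that hit more than one place would need to be handled explicitly.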
Alternatively, you can follow section 4.1 of the biomaRt vignette here:
http://bioconductor.org/packages/2.10/bioc/vignettes/biomaRt/inst/doc/biomaRt.pdf
For example:
R> library(biomaRt)
R> ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
R> affyids <- c("202763_at","209310_s_at","207500_at")
R> getBM(attributes=c('affy_hg_u133_plus_2', 'hgnc_symbol',
                      'chromosome_name','start_position','end_position',
                      'band'),
         filters = 'affy_hg_u133_plus_2', values = affyids, mart = ensembl)
  affy_hg_u133_plus_2 hgnc_symbol chromosome_name start_position end_position  band
1           202763_at       CASP3               4      185548850    185570663 q35.1
2         209310_s_at       CASP4              11      104813593    104840163 q22.3
3           207500_at       CASP5              11      104864962    104893895 q22.3
You'll have to change the "mart/dataset" you are using, as well as the
chip id's, but you should get the idea.
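As a rough sketch of the mouse version of this query: the dataset name "mmusculus_gene_ensembl", the filter/attribute "affy_mouse430_2" and the attribute "mgi_symbol" are assumptions about what the Ensembl mart currently calls them, so it is worth confirming them with listDatasets(), listFilters() and listAttributes() first. The probeset IDs are just example Mouse430 2.0 IDs, and the query needs a network connection.

```r
library(biomaRt)

# Assumed dataset and attribute names; verify with listDatasets(ensembl),
# listFilters(ensembl) and listAttributes(ensembl) before relying on them.
ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

# Example Mouse430 2.0 probeset IDs.
affyids <- c("1415670_at", "1415671_at", "1415672_at")

res <- getBM(attributes = c("affy_mouse430_2", "mgi_symbol",
                            "chromosome_name", "start_position",
                            "end_position", "band"),
             filters = "affy_mouse430_2",
             values = affyids,
             mart = ensembl)
res
```

Note that start_position and end_position are gene-level coordinates, so every probeset annotated to the same gene comes back with the same values. This query on its own does not place individual probes within the gene.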
HTH,
-steve
Although it is not a Bioconductor solution, you should check out the
Splice Center website at the NIH for a nice view of probe locations
across isoforms.
Hi Jiayi,
If you first upgrade to a modern version of R, then you should be able
to do stuff like this:
library(mouse4302.db)
keys = c("1415670_at", "1415671_at", "1415672_at")
cols(mouse4302.db)
keytypes(mouse4302.db)
select(mouse4302.db, keys= keys, cols=c("SYMBOL","CHRLOC"),
keytype="PROBEID")
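For readers on later releases, here is a hedged sketch of the same lookup. In newer versions of AnnotationDbi, cols() and the cols= argument were renamed to columns() and columns=. Which columns exist depends on the annotation release, and position columns such as CHRLOC may not be offered, so this sketch asks only for SYMBOL and ENTREZID and lists the available columns first.

```r
library(mouse4302.db)

# Example Mouse430 2.0 probeset IDs.
keys <- c("1415670_at", "1415671_at", "1415672_at")

# Check what this release of the package can return and be keyed on.
columns(mouse4302.db)
keytypes(mouse4302.db)

# Map probeset IDs to gene symbols and Entrez IDs.
res <- select(mouse4302.db, keys = keys,
              columns = c("SYMBOL", "ENTREZID"),
              keytype = "PROBEID")
res
```

As with the gene-level coordinates from biomaRt, this is a per-gene annotation: probesets that map to the same gene return the same symbol and Entrez ID.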
Please let us know if you need more help,
Marc