rtracklayer and UCSC
@kasper-daniel-hansen-2979
Last seen 18 months ago
United States
As far as I know, UCSC uses zero-based indexing for its genomes, whereas R uses 1-based indexing. What kind of conversion does rtracklayer apply? I suspect none at all. It might be worthwhile to add a discussion of this somewhere in the vignette.

More specifically, I have downloaded a couple of tables from UCSC using rtracklayer, and I want to know whether I need to add 1 to the column named exonStart (after suitable splitting; it is a comma-separated character list).

Kasper
rtracklayer genomes • 1.8k views
@michael-lawrence-2759
Last seen 10.3 years ago
On Thu, May 14, 2009 at 4:29 PM, Kasper Daniel Hansen <khansen@stat.berkeley.edu> wrote:

> As far as I know UCSC uses zero-based indexing of their genomes, R uses
> 1-based. What kind of conversion is being used by rtracklayer - I suspect
> none at all?

The indexing is 1-based. rtracklayer takes care of all of this (0-based vs 1-based, closed vs half-open) behind the scenes. I have found places where I messed this up before, though, so please let me know if you find inconsistencies.

> It might be worthwhile to add a discussion about this somewhere in the
> vignette?

Yes, it should be mentioned.

> More specifically, I have downloaded a couple of tables from UCSC using
> rtracklayer and I wanted to know if I need to add 1 to the column named
> exonStart (after a suitable splitting - it is a comma separated character
> list).

If you download a table (not an actual RangedData track), then the columns have not been adjusted at all. I suggest you make everything 1-based and closed if you want to use it with packages like IRanges and Biostrings.

Btw, if you had obtained the data using the track() function, which returns a RangedData, you could call blocks() on it to get the block information as a RangesList. But I just found that I forgot to add 1 in that method; fixed in svn.

Thanks,
Michael
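For a raw table, the adjustment described above can be sketched in base R. This is only an illustration under stated assumptions: the coordinate strings are invented, the variable and helper names (`exon_starts_raw`, `exon_ends_raw`, `split_coords`) are hypothetical, and nothing here calls rtracklayer. It assumes the starts are 0-based and the ends are 1-based, as discussed in this thread, so only the starts are shifted.

```r
# Hypothetical row from a downloaded UCSC gene table (values invented).
# Starts are assumed 0-based and ends 1-based (half-open intervals),
# each stored as a comma-separated string with a trailing comma.
exon_starts_raw <- "1000,2000,3000,"
exon_ends_raw   <- "1100,2150,3200,"

# Split a comma-separated coordinate string into an integer vector.
# strsplit() drops the empty field after the trailing comma.
split_coords <- function(x) {
  as.integer(strsplit(x, ",", fixed = TRUE)[[1]])
}

ucsc_starts <- split_coords(exon_starts_raw)
ucsc_ends   <- split_coords(exon_ends_raw)

# Convert to 1-based closed intervals: shift the start, keep the end.
exons <- data.frame(start = ucsc_starts + 1L, end = ucsc_ends)

# Width of a closed interval is end - start + 1.
exons$width <- exons$end - exons$start + 1L
print(exons)
```

Only the start column changes; adding 1 to the ends as well would make every exon one base too long.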
@sean-davis-490
Last seen 4 months ago
United States
On Thu, May 14, 2009 at 7:29 PM, Kasper Daniel Hansen <khansen@stat.berkeley.edu> wrote:

> As far as I know UCSC uses zero-based indexing of their genomes, R uses
> 1-based. What kind of conversion is being used by rtracklayer - I suspect
> none at all? It might be worthwhile to add a discussion about this somewhere
> in the vignette?

It is even slightly more complicated than that. They use zero-based starts and 1-based ends, except for graphical display:

http://genome.ucsc.edu/FAQ/FAQtracks#tracks1

Sean
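A small worked example may make the "zero-based start, 1-based end" convention concrete. The sequence and coordinates below are invented for illustration; the point is only that a 0-based half-open interval and the corresponding 1-based closed interval select the same bases.

```r
# Toy sequence, one base per element (R vectors are 1-based).
chrom <- strsplit("ACGTACGTAC", "")[[1]]

# UCSC-style interval: 0-based start, end not included -> [2, 5)
ucsc_start <- 2L
ucsc_end   <- 5L

# Equivalent R-style interval: 1-based start, end included -> [3, 5]
r_start <- ucsc_start + 1L
r_end   <- ucsc_end

# The interval length is the same under either convention.
ucsc_width <- ucsc_end - ucsc_start
r_width    <- r_end - r_start + 1L

# Extract the bases with ordinary 1-based R indexing.
bases <- paste(chrom[r_start:r_end], collapse = "")
print(bases)
```

Because the 0-based end is exclusive, it has the same numeric value as the 1-based inclusive end, which is why only the start needs shifting.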
My understanding of UCSC co-ordinates is, as Sean says, that they are zero-based for starts and one-based for ends. However, I have stopped using the words "start" and "end" with UCSC co-ordinates. I believe it would be better to use "left" and "right".

The UCSC data definitions of their annotation files, see:

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.sql

use txStart/txEnd, cdsStart/cdsEnd, exonStarts/exonEnds. However, these co-ordinates are only start and end co-ordinates for positive-strand genes. They are end and start co-ordinates for negative-strand genes, assuming that "start" means the 5 prime end of a gene.

I think it is more accurate to say that LEFT end UCSC co-ordinates are zero-based and RIGHT end UCSC co-ordinates are one-based.

However, note that whenever UCSC displays co-ordinates to GUI users, they adjust left end co-ordinates back to being one-based. If I remember correctly, if you use the DNA option in the UCSC browser to get DNA bases, the co-ordinates are all still one-based, but as stated, if you download the annotation files, such as refGene.txt, from the above link, the left co-ordinates are zero-based.

I don't know how rtracklayer handles this issue.

cheers,

Keith
On Thu, May 14, 2009 at 5:23 PM, Keith Satterley <keith@wehi.edu.au> wrote:

> I don't know how rtracklayer handles this issue.

UCSC coordinates are 0-based half-open intervals relative to the 5' end of the positive strand. rtracklayer makes them 1-based closed intervals, also relative to the 5' end of the positive strand. Placing everything into the same frame of reference makes it easier to perform, e.g., overlap queries. If you want to flip things around, see the reflect() function in IRanges. The flank() function is a convenient way to get, e.g., promoter regions while taking the strand into account.
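To illustrate what "taking the strand into account" means when all coordinates are stored relative to the positive strand, here is a hand-rolled base R sketch. It does not call IRanges and is not meant to reproduce the exact behaviour of flank() or reflect(); the gene table, the `upstream` helper, and the 500-base window are all hypothetical.

```r
# Hypothetical genes in 1-based closed coordinates on the positive-strand
# frame of reference (values invented for illustration).
genes <- data.frame(
  name   = c("geneA", "geneB"),
  strand = c("+", "-"),
  start  = c(1001L, 5001L),  # leftmost base
  end    = c(2000L, 6000L),  # rightmost base
  stringsAsFactors = FALSE
)

# The 5' end of a gene is its left edge on "+" and its right edge on "-".
genes$five_prime <- ifelse(genes$strand == "+", genes$start, genes$end)

# Window of `width` bases immediately upstream of each gene,
# excluding the gene itself.
upstream <- function(start, end, strand, width) {
  plus <- strand == "+"
  data.frame(
    start = ifelse(plus, start - width, end + 1L),
    end   = ifelse(plus, start - 1L, end + width)
  )
}

promoters <- upstream(genes$start, genes$end, genes$strand, width = 500L)
print(cbind(genes[, c("name", "strand", "five_prime")], promoters))
```

For the negative-strand gene the upstream window lies to the right of the gene, which is the kind of bookkeeping a strand-aware helper takes care of.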