TranscriptDb and genomicranges questions

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 11 hours ago

Seattle, WA, United States

Hi Vincent, Because of the semantic of the "gaps" method for IRanges/IRangesList object, you won't get the gaps that are at the 5' and 3' ends of the chromosomes when using the IRanges/IRangesList trick. An easier way to extract the gaps between transcripts without considering their strand is to artificially "project" all the transcripts on the + strand: txdb <- makeTranscriptDbFromUCSC(genome="hg18", tablename="ensGene") tr <- transcripts(txdb) strand(tr) <- "+" Then: > gaps(tr)[1:5] GRanges with 5 ranges and 0 elementMetadata values seqnames ranges strand | <rle> <iranges> <rle> | [1] chr1 [ 1, 1872] + | [2] chr1 [ 3534, 4273] + | [3] chr1 [19670, 20228] + | [4] chr1 [20367, 24416] + | [5] chr1 [25945, 42911] + | seqlengths chr1 chr1_random chr10 ... chrX_random chrY 247249719 1663265 135374737 ... 1719168 57772954 Note the gap at the 5' end of chr1. Of course now your 'tr' object is sort of "broken" so you need to remake it if you want to use it for downstream analyses where the strand actually matters. Cheers, H. On 07/16/2010 03:47 PM, Patrick Aboyoun wrote: > Vincent, > I am cc'ing the Bioconductor mailing list with the response to your > e-mail so others can access it. > > The gaps,GRanges-method is strand specific, so when you pass a GRanges > object containing the transcript bounds into gaps, you will get the gaps > on the positive strand, the negative strand, and the both strand > category. This is why you found a gap between transcripts on the > positive strand, while there was a transcript at some of those same > positions on the negative strand. If you want a strandless > representation of the transcripts, I recommend using the IRanges and > IRangesList classes. Here is code that achieves what I think you are > looking for: > > > > library(GenomicFeatures) > > txdb<- makeTranscriptDbFromUCSC(genome = "hg18", tablename = "ensGene") > > tx<- transcripts(txdb) > > unstrndTx<- split(ranges(tx), seqnames(tx)) > > unstrndGaps<- unlist(gaps(unstrndTx)) > > intAlt<- GRanges(seqnames = factor(names(unstrndGaps), > names(seqlengths(tx))), > + ranges = unname(unstrndGaps), > + seqlengths = seqlengths(tx)) > > > intAlt[1:3] > GRanges with 3 ranges and 0 elementMetadata values > seqnames ranges strand | > <rle> <iranges> <rle> | > [1] chr1 [ 3534, 4273] * | > [2] chr1 [19670, 20228] * | > [3] chr1 [20367, 24416] * | > > seqlengths > chr1 chr1_random chr10 ... chrX_random chrY > 247249719 1663265 135374737 ... 1719168 57772954 > > > sessionInfo() > R version 2.12.0 Under development (unstable) (2010-07-01 r52425) > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.1.6 GenomicRanges_1.1.16 IRanges_1.7.11 > > loaded via a namespace (and not attached): > [1] Biobase_2.9.0 biomaRt_2.5.1 Biostrings_2.17.24 > BSgenome_1.17.5 > [5] DBI_0.2-5 RCurl_1.4-2 RSQLite_0.9-1 > rtracklayer_1.9.3 > [9] tools_2.12.0 XML_3.1-0 > > > > > > > On 7/16/10 6:23 AM, Vincent Detours wrote: >> Dear Patrick, >> >> I am learning your packages genomicranges and transcriptdb, which >> I find impressively efficients. Here are two questions and perhaps a >> request. >> >> In transcriptdb, I get intergenic regions with: >> ----- >>> txdb<- makeTranscriptDbFromUCSC(genome = "hg18", tablename = "ensGene") >>> tr<- transcripts(txdb) >>> int<- gaps(tr) >>> int[1:3] >> GRanges with 3 ranges and 0 elementMetadata values >> seqnames ranges strand | >> <rle> <iranges> <rle> | >> [1] chr1 [ 1, 1872] + | >> [2] chr1 [ 3534, 20228] + | >> [3] chr1 [20367, 42911] + | >> ... >> ----- >> Now I type chr1:3,534-20,228 in the UCSC browser, hg18, and I see >> that there is a transcript within this interval: >> http://genome.ucsc.edu/cgi- bin/hgc?hgsid=165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632 >> <http: genome.ucsc.edu="" cgi-="" bin="" hgc?hgsid="165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632"> >> This transcript is indeed in the tr object >> ------- >>> which(elementMetadata(tr)[,"tx_name"]=="ENST00000326632") >> [1] 3880 >>> tr[3880,] >> GRanges with 1 range and 2 elementMetadata values >> seqnames ranges strand | tx_id tx_name >> <rle> <iranges> <rle> |<integer> <character> >> [1] chr1 [4274, 19669] - | 1731 ENST00000326632 >>> >> ------- >> Is there a bug in 'gaps', or am I missing something about how it works? >> >> I am using the development version of the packages. >> >> Thank you for your help and great software. >> >> Vincent Detours, Ph. D. >> http://homepages.ulb.ac.be/~vdetours/ >> <http: homepages.ulb.ac.be="" %7evdetours=""/> >> >> IRIBHM - Universit? Libre de Bruxelles >> Bldg C, room C.4.116 >> ULB, Campus Erasme, CP602 >> 808 route de Lennik >> B-1070 Brussels >> Belgium >> >> Phone: +32-2-555 4220 >> Fax: +32-2-555 4655 >> Skype: vdetours >> E-mail: vdetours at ulb.ac.be<http: ulb.ac.be=""> >> >> >> > > > [[alternative HTML version deleted]] > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

Cancer TranscriptDb Category IRanges GenomicRanges Cancer TranscriptDb Category IRanges • 1.1k views

ADD COMMENT • link 14.7 years ago Hervé Pagès 16k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 11 hours ago

Seattle, WA, United States

Hi Vincent, On 08/12/2010 05:17 AM, Vincent Detours wrote: > Indeed I was missing these ends. Thanks for the update! > Now I tryed this to avoids introducing arbitrary strand information: > ---- > > strand(tr) <- "*" > > gaps(tr)[1:5] > GRanges with 5 ranges and 0 elementMetadata values > seqnames ranges strand | > <rle> <iranges> <rle> | > [1] chr1 [ 1, 247249719] + | > [2] chr1 [ 1, 247249719] - | > [3] chr1 [ 1, 4224] * | > [4] chr1 [19234, 24474] * | > [5] chr1 [25945, 58953] * | > > seqlengths > chr1 chr1_random chr10 ... chrX_random chrY > 247249719 1663265 135374737 ... 1719168 57772954 > > > ---- > I don't understand why 'gaps' returns ranges 1 and 2. Yes, setting the strand to "*" for all transcripts works too. The "gaps" method for GRanges will compute the gaps for the 3 strand levels (+, -, *) independently of whether there are transcripts on the level. In your case there are no transcripts on the + and - levels so the gap you see spans the entire chromosome. Hope this helps. Cheers, H. > > Vincent > > > > On Aug 11, 2010, at 10:27 PM, Hervé Pagès wrote: > >> Hi Vincent, >> >> Because of the semantic of the "gaps" method for IRanges/IRangesList >> object, you won't get the gaps that are at the 5' and 3' ends of the >> chromosomes when using the IRanges/IRangesList trick. >> >> An easier way to extract the gaps between transcripts without >> considering their strand is to artificially "project" all the >> transcripts on the + strand: >> >> txdb <- makeTranscriptDbFromUCSC(genome="hg18", tablename="ensGene") >> tr <- transcripts(txdb) >> strand(tr) <- "+" >> >> Then: >> >> > gaps(tr)[1:5] >> GRanges with 5 ranges and 0 elementMetadata values >> seqnames ranges strand | >> <rle> <iranges> <rle> | >> [1] chr1 [ 1, 1872] + | >> [2] chr1 [ 3534, 4273] + | >> [3] chr1 [19670, 20228] + | >> [4] chr1 [20367, 24416] + | >> [5] chr1 [25945, 42911] + | >> >> seqlengths >> chr1 chr1_random chr10 ... chrX_random chrY >> 247249719 1663265 135374737 ... 1719168 57772954 >> >> Note the gap at the 5' end of chr1. >> >> Of course now your 'tr' object is sort of "broken" so you need to >> remake it if you want to use it for downstream analyses where the >> strand actually matters. >> >> Cheers, >> H. >> >> >> On 07/16/2010 03:47 PM, Patrick Aboyoun wrote: >>> Vincent, >>> I am cc'ing the Bioconductor mailing list with the response to your >>> e-mail so others can access it. >>> >>> The gaps,GRanges-method is strand specific, so when you pass a GRanges >>> object containing the transcript bounds into gaps, you will get the gaps >>> on the positive strand, the negative strand, and the both strand >>> category. This is why you found a gap between transcripts on the >>> positive strand, while there was a transcript at some of those same >>> positions on the negative strand. If you want a strandless >>> representation of the transcripts, I recommend using the IRanges and >>> IRangesList classes. Here is code that achieves what I think you are >>> looking for: >>> >>> >>> > library(GenomicFeatures) >>> > txdb<- makeTranscriptDbFromUCSC(genome = "hg18", tablename = "ensGene") >>> > tx<- transcripts(txdb) >>> > unstrndTx<- split(ranges(tx), seqnames(tx)) >>> > unstrndGaps<- unlist(gaps(unstrndTx)) >>> > intAlt<- GRanges(seqnames = factor(names(unstrndGaps), >>> names(seqlengths(tx))), >>> + ranges = unname(unstrndGaps), >>> + seqlengths = seqlengths(tx)) >>> >>> > intAlt[1:3] >>> GRanges with 3 ranges and 0 elementMetadata values >>> seqnames ranges strand | >>> <rle> <iranges> <rle> | >>> [1] chr1 [ 3534, 4273] * | >>> [2] chr1 [19670, 20228] * | >>> [3] chr1 [20367, 24416] * | >>> >>> seqlengths >>> chr1 chr1_random chr10 ... chrX_random chrY >>> 247249719 1663265 135374737 ... 1719168 57772954 >>> >>> > sessionInfo() >>> R version 2.12.0 Under development (unstable) (2010-07-01 r52425) >>> Platform: i386-apple-darwin9.8.0/i386 (32-bit) >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] GenomicFeatures_1.1.6 GenomicRanges_1.1.16 IRanges_1.7.11 >>> >>> loaded via a namespace (and not attached): >>> [1] Biobase_2.9.0 biomaRt_2.5.1 Biostrings_2.17.24 >>> BSgenome_1.17.5 >>> [5] DBI_0.2-5 RCurl_1.4-2 RSQLite_0.9-1 >>> rtracklayer_1.9.3 >>> [9] tools_2.12.0 XML_3.1-0 >>> >>> >>> >>> >>> >>> >>> On 7/16/10 6:23 AM, Vincent Detours wrote: >>>> Dear Patrick, >>>> >>>> I am learning your packages genomicranges and transcriptdb, which >>>> I find impressively efficients. Here are two questions and perhaps a >>>> request. >>>> >>>> In transcriptdb, I get intergenic regions with: >>>> ----- >>>>> txdb<- makeTranscriptDbFromUCSC(genome = "hg18", tablename = "ensGene") >>>>> tr<- transcripts(txdb) >>>>> int<- gaps(tr) >>>>> int[1:3] >>>> GRanges with 3 ranges and 0 elementMetadata values >>>> seqnames ranges strand | >>>> <rle> <iranges> <rle> | >>>> [1] chr1 [ 1, 1872] + | >>>> [2] chr1 [ 3534, 20228] + | >>>> [3] chr1 [20367, 42911] + | >>>> ... >>>> ----- >>>> Now I type chr1:3,534-20,228 in the UCSC browser, hg18, and I see >>>> that there is a transcript within this interval: >>>> http://genome.ucsc.edu/cgi- bin/hgc?hgsid=165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632 >>>> <http: genome.ucsc.edu="" cgi-="" bin="" hgc?hgsid="165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632"> >>>> <http: genome.ucsc.edu="" cgi-="" bin="" hgc?hgsid="165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632">>>> <http: genome.ucsc.edu="" cgi-="" bin="" hgc?hgsid="165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632">> >>>> This transcript is indeed in the tr object >>>> ------- >>>>> which(elementMetadata(tr)[,"tx_name"]=="ENST00000326632") >>>> [1] 3880 >>>>> tr[3880,] >>>> GRanges with 1 range and 2 elementMetadata values >>>> seqnames ranges strand | tx_id tx_name >>>> <rle> <iranges> <rle> |<integer> <character> >>>> [1] chr1 [4274, 19669] - | 1731 ENST00000326632 >>>>> >>>> ------- >>>> Is there a bug in 'gaps', or am I missing something about how it works? >>>> >>>> I am using the development version of the packages. >>>> >>>> Thank you for your help and great software. >>>> >>>> Vincent Detours, Ph. D. >>>> http://homepages.ulb.ac.be/~vdetours/ >>>> <http: homepages.ulb.ac.be="" %7evdetours=""/> >>>> >>>> IRIBHM - Universit? Libre de Bruxelles >>>> Bldg C, room C.4.116 >>>> ULB, Campus Erasme, CP602 >>>> 808 route de Lennik >>>> B-1070 Brussels >>>> Belgium >>>> >>>> Phone: +32-2-555 4220 >>>> Fax: +32-2-555 4655 >>>> Skype: vdetours >>>> E-mail: vdetours at ulb.ac.be<http: ulb.ac.be=""> >>>> >>>> >>>> >>> >>> >>> [[alternative HTML version deleted]] >>> >>> >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch <mailto:bioconductor at="" stat.math.ethz.ch=""> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 > > > > Vincent Detours, Ph. D. > http://homepages.ulb.ac.be/~vdetours/ > > IRIBHM - Universit? Libre de Bruxelles > Bldg C, room C.4.116 > ULB, Campus Erasme, CP602 > 808 route de Lennik > B-1070 Brussels > Belgium > > Phone: +32-2-555 4220 > Fax: +32-2-555 4655 > Skype: vdetours > E-mail: vdetours at ulb.ac.be <http: ulb.ac.be=""> > > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 14.7 years ago Hervé Pagès 16k

Login before adding your answer.