Bioconductor Digest, Vol 120, Issue 12

0

Entering edit mode

Matthew Thornton ▴ 380

@matthew-thornton-5564

Last seen 26 days ago

USA, Los Angeles, USC

Sent from my Droid Charge on Verizon 4GLTE "bioconductor- request@r-project.org" wrote: Send Bioconductor mailing list submissions to bioconductor@r-project.org To subscribe or unsubscribe via the World Wide Web, visit https://stat.ethz.ch/mailman/listinfo/bioconductor or, via email, send a message with subject or body 'help' to bioconductor-request@r-project.org You can reach the person managing the list at bioconductor-owner@r-project.org When replying, please edit your Subject line so it is more specific than "Re: Contents of Bioconductor digest..." Today's Topics: 1. Re: Help with Gviz \"IdeogramTrack\" and \"BioMartGeneTrackRegion\" commands (Hahne, Florian) 2. Re: question about Gviz (Hahne, Florian) 3. distanceToNearest in GenomicRanges (Tom Oates) 4. edgeR cpm filtering (John [guest]) 5. Re: runSPIA() in graphite package (array chip) 6. Re: edgeR cpm filtering (James W. MacDonald) 7. Re: distanceToNearest in GenomicRanges (James W. MacDonald) 8. Re: DataFrame bug? (Valerie Obenchain) 9. Re: question about Gviz (Tim Triche, Jr.) 10. Re: question about Gviz (Tim Triche, Jr.) 11. plotting BED intervals to TSS regions (Seb) 12. Re: edgeR cpm filtering (James W. MacDonald) 13. Re: runSPIA() in graphite package (Enrica) 14. Re: plotting BED intervals to TSS regions (Tengfei Yin) 15. Re: question about Gviz (Bogdan Tanasa) 16. Error Bars on EdgeR (Ian Sudbery) 17. Differences in results analyzing Mouse Gene 1.0-ST using oligo and affy package (Jon Toledo) 18. Re: statistical test for time course data (Juliet Hannah) 19. Re: statistical test for time course data (Steve Lianoglou) 20. Re: edgeR cpm filtering (John Sperry) 21. Re: edgeR cpm filtering (Steve Lianoglou) 22. Re: question about Gviz (Hahne, Florian) 23. Re: question about Gviz (Bogdan Tanasa) ---------------------------------------------------------------------- Message: 1 Date: Mon, 11 Feb 2013 13:48:56 +0000 From: "Hahne, Florian" <florian.hahne@novartis.com> To: Marc Carlson <mcarlson@fhcrc.org>, "bioconductor@r-project.org" <bioconductor@r-project.org> Subject: Re: [BioC] Help with Gviz \"IdeogramTrack\" and \"BioMartGeneTrackRegion\" commands Message-ID: <aabaab0b27af5c418fe809f352fb7ba40848536d@023-db3mpn1-082.023d .mgd.msft.net=""> Content-Type: text/plain; charset="iso-8859-2" Hi Marc, thanks for the hint, but I think this is not quite what I need. My problem is still on the level of genomes. UCSC for instance calls a particular version of they human genome hg19. Now there exists a similar genome in Ensembl, however they do not use the same name for it (GRCh37.p10). I made the maybe somewhat unwise attempt early on to identify genomes within Gviz by their UCSC name and to translate those names into Ensembl names if necessary. In hind sight this may not have been the smartest decision, and I should have left the translation job completely to the user. If somebody wants Ensebml gene models from BiomaRt they should make sure that they select the correct mart and dataset in the first place. I'll think about a pragmatic way out of this hole I've dug myself into. Florian -- On 1/23/13 2:18 AM, "Marc Carlson" <mcarlson@fhcrc.org> wrote: >Hi Florian, > >We actually have a small database called seqnames.db that is dedicated >to tracking these kinds of chromosome name conventions. You can see >more by looking at the help page for supportedSeqnameStyles() (and it's >friends). A quick way to see that is: > >library(Homo.sapiens) >?supportedSeqnameStyles > > >If you call the supportedSeqnameStyles() method, you will see that we >don't (yet) have an entry for zebrafish. If you were to give me one as a >tab file, I could add it to the database and it would therefore exist >for the future... The file I need is deliberately simple to make. It >should look like the example below, with as many columns as you want >there to be styles for, and each column separated by a tab. > >NCBI MSU6 >1 1 >2 2 >3 3 >4 4 > >etc. > > > Marc > > > > > >On 01/21/2013 09:15 AM, Hahne, Florian wrote: >> Hi Joseph, >> >> Regarding your first problem: UCSC has no cytoband information for any >>of >> the zebrafish genomes, and that's what is throwing the error. I think it >> should do something smarter, e.g. use the chromosome length information >> that should be available for every UCSC genome to draw at least a blank >> ideogram which could still be used to indicate the current plotting >> position. I'll have this ready in the next release of the package, and >> maybe even port this back to the current release. It seems to be more >>of a >> bug than a missing feature? >> >> Your second problem is a bit more tricky. There is no real mapping >>between >> the ensembl genome names used in the Biomart package and the UCSC ones >> which I decided to take as the defaults for the package. I tried to come >> up with my own static mapping for this, and obviously this means that >> things tend to get out of date soon. Now the zebrafish version that you >> will get in Ensembl is Zv9 (which is equivalent to danRer7), but my >> mapping is still to danRer6. This is even wrong, because what you will >>get >> from Biomart if you ask for danRer6 now is actually danRer7. Yikes. I >>will >> have to come up with a better solution for this. There should be a way >>to >> explicitly control for the Ensembl genome that you will get, and this >>is a >> simple change. Getting it right automagically is way more challenging, I >> am afraid. >> >> As a quick fix for you: >> Ask for the danRer6 genes and manually change the genome of the track: >> >>biomTrack<-BiomartGeneRegionTrack(genome="danRer6",chromosome=1,star t=1e6 >>,e >> nd=1e6+10000,name="ENSEMBL",showId=T) >> genome(biomTrack)<- "danRer7" >> >> I'll get back to you once I have a better solution. >> >> Florian >> >> > >_______________________________________________ >Bioconductor mailing list >Bioconductor@r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Message: 2 Date: Mon, 11 Feb 2013 14:03:39 +0000 From: "Hahne, Florian" <florian.hahne@novartis.com> To: Bogdan Tanasa <tanasa@gmail.com>, Bioconductor mailing list <bioconductor@r-project.org>, "bioc-sig-sequencing-request@r-project.org" <bioc-sig-sequencing-request@r-project.org> Subject: Re: [BioC] question about Gviz Message-ID: <aabaab0b27af5c418fe809f352fb7ba4084853cd@023-db3mpn1-082.023d .mgd.msft.net=""> Content-Type: text/plain; charset="us-ascii" Well, the main culprit here is really the TranscriptDB package. It does not seem to deal at all with gene symbols, so there is no way for Gviz to automatically fetch those. If you come up with a way to match the UCSC gene identifiers back to gene symbols you could stick those into the GeneRegionTrack using the 'symbol' replacement method. E.g., symbol(foo) <- mappingTable[match(transcript(foo), mappingTable$UCSCId),]) I am not sure how this mapping is supposed to be done in the Bioconductor world these days. You may be able to find a way using one of the org.db packages. Or maybe you will have to download a mapping table directly from the UCSC table browser. With the next release of Bioconductor (or already now if you are working with the devel branch), Gviz supports building tracks from a whole range of standard annotation files. You could then export the whole known.gene table from UCSC as a GTF file and import in again as a GeneRegionTrack object. Alternatively you may want to look into the BiomartGeneRegionTrack class which will fetch the gene models from Ensembl, but this includes the HUGO gene symbols. Florian -- On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote: >Dear all, > >I am using the following code below in order to retrieve the gene >annotations in Gviz package : >please could advise on what shall I modify in order to display the HUGO >gene symbol on each gene ? thanks ! > >library("GenomicFeatures") >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = >"knownGene") >saveDb(hg18db, file="hg18db_knownGene.sqlite") >txdb <-loadDb("hg18db_knownGene.sqlite") >txTr <- >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE,gene Symbo >l=TRUE,name="UCSC") >plotTracks(txTr,from=start,to=end) > >-- bogdan >------------------ > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor@r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Message: 3 Date: Mon, 11 Feb 2013 16:35:39 +0000 From: Tom Oates <toates19@gmail.com> To: "bioconductor@r-project.org" <bioconductor@r-project.org> Subject: [BioC] distanceToNearest in GenomicRanges Message-ID: <cagudn1auoxuwhwz9papqpzx3radxafisto9a- 8xhm+vyslfxhg@mail.gmail.com=""> Content-Type: text/plain Hi I am very much a learner in R in general & GenomicRanges in general I am struggling to find documentation to help me get my head around distanceToNearest in GenomicRanges If I have a GRanges object: GRanges with 6 ranges and 4 metadata columns: seqnames ranges strand | <rle> <iranges> <rle> | [1] 10 [ 96723746, 96723747] - | [2] 7 [ 13641170, 13641171] + | [3] 16 [ 17772801, 17772802] - | [4] 3 [ 88173502, 88173503] - | [5] 13 [106979682, 106979683] + | [6] 9 [104393139, 104393140] + | (You will notice that all the regions are only dinucleotides & I have removed the metadata ) I have a 2nd GRanges object which is ensembl rat transcripts as below: 39549 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> [1] 1 [5473, 16844] + | 1 ENSRNOT00000044270 [2] 1 [5526, 16968] + | 2 ENSRNOT00000049921 [3] 1 [5526, 16968] + | 3 ENSRNOT00000051735 [4] 1 [5598, 13520] + | 4 ENSRNOT00000034630 [5] 1 [8268, 16850] + | 5 ENSRNOT00000044505 [6] 1 [8316, 17577] + | 6 ENSRNOT00000042693 [7] 1 [8884, 16850] + | 7 ENSRNOT00000044187 [8] 1 [8956, 9955] + | 8 ENSRNOT00000041082 [9] 1 [9055, 17351] + | 9 ENSRNOT00000050254 If I invoke: xx<-distanceToNearestdiff.cpgs.gr, rat.transcripts, ignore.strand=F) xx DataFrame with 1133 rows and 3 columns queryHits subjectHits distance <integer> <integer> <integer> 1 1 7752 0 2 2 32166 11946 3 3 14678 25377 4 4 24286 66747 5 5 10609 34242 6 6 37076 122683 7 7 35184 0 8 8 34180 45561 9 9 19351 50156 ... ... ... ... etc I am uncertain how I would then use the xx output to gain information (i.e. tx_id, tx_name) about the feature which the function has identified as nearest? I would be happy to supply any more info as required Tom [[alternative HTML version deleted]] ------------------------------ Message: 4 Date: Mon, 11 Feb 2013 08:54:54 -0800 (PST) From: "John [guest]" <guest@bioconductor.org> To: bioconductor@r-project.org, johnsperry51@yahoo.com Subject: [BioC] edgeR cpm filtering Message-ID: <20130211165454.35588AAF78@mamba.fhcrc.org> Content-Type: text/plain; charset=iso-8859-1 All, I am a new edgeR user. I have difficulty understanding the meaning of the ???cpm??? function of edgeR package. I mean I understand that each value is divided by the total library value, and then multiplied by 1,000,000. But why 1M? I don???t understand what is the logic behind using 1M? is it 1M reads? Or bases? And why not 10M? or 1000? Any specific reason for using 1M? Another issues that I have is that how can I enforce filtering the samples that have 0 reads in one group of samples, but very large number of reads in another groups? Here is an example: Samples, Sample 1-replicate 1, Sample 1-replicate 2, Sample 2-replicate 1, Sample 2- replicate 2, Sample 3-replicate 1, Sample 3- replicate 2 Gene_X, 150,100, 270,320,0,0 I used: d_DGEList <- d_DGEList[rowSums(cpm_filtered > 5) > 2,] But still Gene_X is not filtered. Many genes with low number of reads are filtered, but very few like Gene_X are still there. I think that having many reads mapped to samples 1 and 2 qualifies it for passing the cpm filtering. How can I filter genes like this? Is it OK if I manually delete cases like this? Thank you. John -- output of sessionInfo(): > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_2.6.0 limma_3.12.0 > -- Sent via the guest posting facility at bioconductor.org. ------------------------------ Message: 5 Date: Mon, 11 Feb 2013 09:36:15 -0800 From: array chip <arrayprofile@yahoo.com> To: Enrica <enrica.calura@gmail.com> Cc: "chiara.romualdi@unipd.it" <chiara.romualdi@unipd.it>, "gabriele.sales@unipd.it" <gabriele.sales@unipd.it>, Bioconductor mailing list <bioconductor@stat.math.ethz.ch> Subject: Re: [BioC] runSPIA() in graphite package Message-ID: <1360604175.82561.YahooMailNeo@web122901.mail.ne1.yahoo.com> Content-Type: text/plain Dear Enrica, Thanks very much for checking this out! graphite is a great package for pathway analysis with its ability to analyze so many different pathway databases! One more question, does runSPIA() run against these pathways on the fly (i.e. access these databases from their server in real time), or against a pre-downloaded database? Do you have a timeline when the new release graphite package will become available? Thanks again for your great work. John ________________________________ From: Enrica <enrica.calura@gmail.com> Cc: Bioconductor mailing list <bioconductor@stat.math.ethz.ch>; "gabriele.sales@unipd.it" <gabriele.sales@unipd.it>; "chiara.romualdi@unipd.it" <chiara.romualdi@unipd.it> Sent: Monday, February 11, 2013 3:24 AM Subject: Re: runSPIA() in graphite package Dear John, the procedure you applied to analyse your data is right. However, the runSpia function filters out those pathways that have no edges or less than 5 nodes with valid edges. Your pathway, after the conversion to Entrez Gene, do not satisfy neither the conditions, having no edges. > pe<-convertIdentifiers(biocarta[["estrogen responsive protein efp controls cell cycle and breast tumors growth"]], "entrez") > pe "estrogen responsive protein efp controls cell cycle and breast tumors growth" pathway from BioCarta Number of nodes???? = 5 Number of edges???? = 0 Type of identifiers = Entrez Gene Retrieved on??????? = 2011-05-12 We equipped our runSpia() function with the checks described above in order to protect the user from unreliable results. On a separate note, we have also re-checked how that specific pathway was converted. The original BioPax2 contained some edges, but their endpoints were unfortunately nodes which could not be associated with any entrez gene. Our software, thus, dropped them from the pathway. We are working on a new graphite release based on annotation released more recently. Those includes more edges; as a result that pathway is no longer empty. Enrica Calura Hi all, I am using runSPIA() from graphite package to analyze a gene list against biocarta pathway database. > > > >> library(graphite) > >> prepareSPIA(biocarta, "biocartaEx") >> obj<-runSPIA(de=siggenes, all=allgenes, "biocartaEx") > > > >where "siggenes" are a list of 1332 significant genes selected from "allgenes" (a list of 17138 genes). The running process verbose indicated a total of 178 pathways analyzed. > > > >One of the pathways I am particularly interested is "estrogen responsive protein efp controls cell cycle and breast tumors growth": > > >> p <- biocarta[["estrogen responsive protein efp controls cell cycle and breast tumors growth"]] >> nodes(p) >[1] "CDKs"???????????? "Cyclin B1/2"????? "EntrezGene:2099"? "EntrezGene:2810"? "EntrezGene:57154" "EntrezGene:7157"? "EntrezGene:7706"?? "ubiquitin" > > >And my siggenes contains 2 of genes involved in this pathway: > > >> siggenes[c('7706','2099')] >???? 7706????? 2099 >?4.347012 -3.792425 > > >So I would assume that runSPIA will examine this pathway during the? calculation, but surprisingly I didn't see this particular way being examined. Here is the section of runSPIA() verbose output that started with "e": > > >Done pathway 42 : e2f1 destruction pathway.. >Done pathway 43 : effects of calcineurin in kera.. >Done pathway 44 : egf signaling pathway.. >Done pathway 45 : endocytotic role of ndk phosph.. >Done pathway 46 : epo signaling pathway.. >Done pathway 47 : erk and pi-3 kinase are necess.. >Done pathway 48 : erk1/erk2 mapk signaling pathw.. >Done pathway 49 : eukaryotic protein translation.. >Done pathway 50 : extrinsic prothrombin activati.. > > > >Can anyone tell me why runSPIA() missed this pathway? Attached are siggenes and allgenes for you to try.? > > >> siggenes<-as.matrix(read.table("siggenes.txt",row.names=1))[,1] > >> allgenes<-as.matrix(read.table("H:\\test\\allgenes.txt", row.names=NULL, header=T, colClasses='character'))[,1] > > >Many thanks > > >John > > [[alternative HTML version deleted]] ------------------------------ Message: 6 Date: Mon, 11 Feb 2013 12:52:13 -0500 From: "James W. MacDonald" <jmacdon@uw.edu> To: "John [guest]" <guest@bioconductor.org> Cc: bioconductor@r-project.org Subject: Re: [BioC] edgeR cpm filtering Message-ID: <51192FCD.2000700@uw.edu> Content-Type: text/plain; charset=windows-1252; format=flowed Hi John, On 2/11/2013 11:54 AM, John [guest] wrote: > All, > > I am a new edgeR user. I have difficulty understanding the meaning of the ???cpm??? function of edgeR package. I mean I understand that each value is divided by the total library value, and then multiplied by 1,000,000. But why 1M? I don???t understand what is the logic behind using 1M? is it 1M reads? Or bases? And why not 10M? or 1000? Any specific reason for using 1M? It's reads. You are inputting read counts per gene, so by definition edgeR knows nothing about bases. As for why 1M, I don't know for sure, but would imagine it is because a 'reasonable' sized RNA-Seq experiment is based on somewhere around 10M reads or so. In other words, say you have a sample with 10M reads, and one gene has 60 reads that align. You would have 6 cpm, which looks pretty low, and it is. However you could do statistics on it based on a discrete distribution like the negative binomial. If you used 10M to normalize, the cp10M would be 0.6, so you would have to throw that gene out. If you used cpk, it would be 6000 cpk and it would look like you had lots of reads when in fact you only had 60. So cpm looks like a reasonable compromise to me. > > Another issues that I have is that how can I enforce filtering the samples that have 0 reads in one group of samples, but very large number of reads in another groups? Here is an example: > > Samples, Sample 1-replicate 1, Sample 1-replicate 2, Sample 2-replicate 1, Sample 2- replicate 2, Sample 3-replicate 1, Sample 3- replicate 2 > Gene_X, 150,100, 270,320,0,0 > > I used: > > d_DGEList<- d_DGEList[rowSums(cpm_filtered> 5)> 2,] > > But still Gene_X is not filtered. Many genes with low number of reads are filtered, but very few like Gene_X are still there. I think that having many reads mapped to samples 1 and 2 qualifies it for passing the cpm filtering. How can I filter genes like this? Is it OK if I manually delete cases like this? Setting aside the logic of removing these genes (which IMO is missing the point of looking for differential expression), note the logic of your filter. Let's break it down by section: rowSums(cpm_filtered > 5) This gives the count of samples where the cpm value is > 5. In the case you mention, this would be four. rowSums(cpm_filtered > 5) > 2 This results in TRUE if the count of samples for a given gene with a cpm value > 5 is greater than two. So you are saying that at least two samples have to have a cpm > 5. In the instance you want to filter, the count is 4, and 4 > 2, so this passes the filter. So what you apparently want are rows where the cpm is greater than some value in ALL samples, not just two of them, so you would want to change the > 2 part to == 6. Note that this doesn't really make any sense. You are in effect saying that you are completely uninterested in any genes that appear not to be expressed in one of your samples, but that might be highly expressed in one or more of the other samples. That to me is actually really interesting, and I am not sure why you would want to filter out any gene that fulfills that criterion. Best, Jim > > Thank you. > John > > > -- output of sessionInfo(): > >> sessionInfo() > R version 2.15.0 (2012-03-30) > > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > > locale: > > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > other attached packages: > > [1] edgeR_2.6.0 limma_3.12.0 > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 ------------------------------ Message: 7 Date: Mon, 11 Feb 2013 13:08:11 -0500 From: "James W. MacDonald" <jmacdon@uw.edu> To: Tom Oates <toates19@gmail.com> Cc: "bioconductor@r-project.org" <bioconductor@r-project.org> Subject: Re: [BioC] distanceToNearest in GenomicRanges Message-ID: <5119338B.9080702@uw.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hi Tom, On 2/11/2013 11:35 AM, Tom Oates wrote: > Hi > I am very much a learner in R in general& GenomicRanges in general > I am struggling to find documentation to help me get my head around > distanceToNearest in GenomicRanges > If I have a GRanges object: > > GRanges with 6 ranges and 4 metadata columns: > seqnames ranges strand | > <rle> <iranges> <rle> | > [1] 10 [ 96723746, 96723747] - | > [2] 7 [ 13641170, 13641171] + | > [3] 16 [ 17772801, 17772802] - | > [4] 3 [ 88173502, 88173503] - | > [5] 13 [106979682, 106979683] + | > [6] 9 [104393139, 104393140] + | > > (You will notice that all the regions are only dinucleotides& I have > removed the metadata ) > > I have a 2nd GRanges object which is ensembl rat transcripts as below: > 39549 ranges and 2 metadata columns: > seqnames ranges strand | tx_id > tx_name > <rle> <iranges> <rle> |<integer> > <character> > [1] 1 [5473, 16844] + | 1 > ENSRNOT00000044270 > [2] 1 [5526, 16968] + | 2 > ENSRNOT00000049921 > [3] 1 [5526, 16968] + | 3 > ENSRNOT00000051735 > [4] 1 [5598, 13520] + | 4 > ENSRNOT00000034630 > [5] 1 [8268, 16850] + | 5 > ENSRNOT00000044505 > [6] 1 [8316, 17577] + | 6 > ENSRNOT00000042693 > [7] 1 [8884, 16850] + | 7 > ENSRNOT00000044187 > [8] 1 [8956, 9955] + | 8 > ENSRNOT00000041082 > [9] 1 [9055, 17351] + | 9 > ENSRNOT00000050254 > > > If I invoke: > xx<-distanceToNearestdiff.cpgs.gr, rat.transcripts, ignore.strand=F) > > xx > DataFrame with 1133 rows and 3 columns > queryHits subjectHits distance > <integer> <integer> <integer> > 1 1 7752 0 > 2 2 32166 11946 > 3 3 14678 25377 > 4 4 24286 66747 > 5 5 10609 34242 > 6 6 37076 122683 > 7 7 35184 0 > 8 8 34180 45561 > 9 9 19351 50156 > ... ... ... ... > etc > > I am uncertain how I would then use the xx output to gain information (i.e. > tx_id, tx_name) about the feature which the function has identified as > nearest? > I would be happy to supply any more info as required The subjectHits column gives the row of your transcript GRanges object that matches the corresponding query row. I am assuming here that the 'diff.cpgs.gr' GRanges object is longer than 6? Anyway, here is an example using your data and the TxDb.Mmusculus.UCSC.mm10.knownGene package: > x GRanges with 6 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr10 [ 96723746, 96723747] * [2] chr7 [ 13641170, 13641171] * [3] chr16 [ 17772801, 17772802] * [4] chr3 [ 88173502, 88173503] * [5] chr13 [106979682, 106979683] * [6] chr9 [104393139, 104393140] * --- > y <- transcripts(TxDb.Mmusculus.UCSC.mm10.knownGene) > xx <- distanceToNearest(x, y, ignore.strand=F) > xx DataFrame with 6 rows and 3 columns queryHits subjectHits distance <integer> <integer> <integer> 1 1 4514 100935 2 2 45653 0 3 3 19383 0 4 4 34197 0 5 5 14383 0 6 6 54212 8108 > y[xx[,2],] GRanges with 6 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> [1] chr10 [ 96617001, 96622811] + | 33419 uc007gww.2 [2] chr7 [ 13623967, 13670807] + | 21400 uc012ezp.1 [3] chr16 [ 17759663, 17779206] + | 48288 uc007ylz.1 [4] chr3 [ 88171560, 88177785] - | 10107 uc008puf.2 [5] chr13 [106963757, 107022114] - | 43288 uc007rue.1 [6] chr9 [104361832, 104385031] + | 29956 uc009rhp.1 --- seqlengths: chr1 chr2 ... chrUn_JH584304 195471971 182113224 ... 114452 Best, Jim > Tom > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 ------------------------------ Message: 8 Date: Mon, 11 Feb 2013 11:16:17 -0800 From: Valerie Obenchain <vobencha@fhcrc.org> To: Charles Berry <ccberry@ucsd.edu> Cc: bioconductor@stat.math.ethz.ch Subject: Re: [BioC] DataFrame bug? Message-ID: <51194381.4000809@fhcrc.org> Content-Type: text/plain; charset="ISO-8859-1"; format=flowed Hi Charles, Thanks for reporting the bug. Now fixed in 1.17.32 devel and 1.16.5 in release. Valerie library(RUnit) DF <- DataFrame("A"=1:5,row.names=letters[1:5]) df <- data.frame("A"=1:5,row.names=letters[1:5]) > checkIdentical(DF['a','B'] <- 1, df['a','B'] <- 1) [1] TRUE DF <- DataFrame("A"=1:5,row.names=letters[1:5]) df <- data.frame("A"=1:5,row.names=letters[1:5]) > checkIdentical(DF['c','B'] <- 1, df['c','B'] <- 1) [1] TRUE > sessionInfo() ... other attached packages: [1] RUnit_0.4.26 IRanges_1.16.5 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] parallel_2.15.1 stats4_2.15.1 On 02/09/2013 12:26 PM, Charles Berry wrote: >>> > > Subset replacement like this > > df['a','c2']<-3 > > depends on the position of row 'a' when 'c2' is a new column. > > > Row 'a' first: > >> df1 <- DataFrame(c1=1:2,row.names=c("a","b")) >> df1['a','c2'] <- 3 >> df1 > DataFrame with 2 rows and 2 columns > c1 c2 > <integer> <numeric> > a 1 3 > b 2 3 > > Row 'a' second: > >> df2 <- DataFrame(c1=1:2,row.names= rev( c("a","b") )) >> df2['a','c2'] <- 3 >> df2 > DataFrame with 2 rows and 2 columns > c1 c2 > <integer> <numeric> > b 1 NA > a 2 3 > > FWIW, the latter - but not the former - agrees with "the 'DataFrame' behaves > very similarly to 'data.frame'"(from the help page). > >> sessionInfo() > R version 2.15.2 (2012-10-26) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] IRanges_1.16.4 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] parallel_2.15.2 stats4_2.15.2 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > ------------------------------ Message: 9 Date: Mon, 11 Feb 2013 11:34:04 -0800 From: "Tim Triche, Jr." <tim.triche@gmail.com> To: "Hahne, Florian" <florian.hahne@novartis.com> Cc: "bioc-sig-sequencing-request@r-project.org" <bioc-sig-sequencing-request@r-project.org>, Bogdan Tanasa <tanasa@gmail.com>, Bioconductor mailing list <bioconductor@r-project.org> Subject: Re: [BioC] question about Gviz Message-ID: <cac+n9bw3pqvp2urshqh3fhgtjhf9v+eguvsv94aj5yxwjxrqiq@mail.gmail.com> Content-Type: text/plain Well, here's one approach. I'll start from the constructed 'txtr' track. ## see ?select for more on the following ## library(Homo.sapiens) symbolMappings <- select(Homo.sapiens, cols=c('UCSCKG','ENTREZID','SYMBOL'), keys=symbol(txtr), keytype='UCSCKG') ## "pork out" the mappings table so that duplicates expand ## symbolMappings <- symbolMappings[match(symbol(txtr), symbolMappings$UCSCKG),] ## if NAs can be accepted as keys, this next step might be unnecessary ## however, it seems that not all UCSC known genes have a Hugo symbol? ## txtr <- txtr[ hasHugoSymbol ] symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] plotTracks(txtr) That works. On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian <florian.hahne@novartis.com>wrote: > Well, the main culprit here is really the TranscriptDB package. It does > not seem to deal at all with gene symbols, so there is no way for Gviz to > automatically fetch those. If you come up with a way to match the UCSC > gene identifiers back to gene symbols you could stick those into the > GeneRegionTrack using the 'symbol' replacement method. E.g., > symbol(foo) <- mappingTable[match(transcript(foo), mappingTable$UCSCId),]) > I am not sure how this mapping is supposed to be done in the Bioconductor > world these days. You may be able to find a way using one of the org.db > packages. Or maybe you will have to download a mapping table directly from > the UCSC table browser. > With the next release of Bioconductor (or already now if you are working > with the devel branch), Gviz supports building tracks from a whole range > of standard annotation files. You could then export the whole known.gene > table from UCSC as a GTF file and import in again as a GeneRegionTrack > object. Alternatively you may want to look into the BiomartGeneRegionTrack > class which will fetch the gene models from Ensembl, but this includes the > HUGO gene symbols. > Florian > > > -- > > > > > > > On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote: > > >Dear all, > > > >I am using the following code below in order to retrieve the gene > >annotations in Gviz package : > >please could advise on what shall I modify in order to display the HUGO [[elided Yahoo spam]] > > > >library("GenomicFeatures") > >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = > >"knownGene") > >saveDb(hg18db, file="hg18db_knownGene.sqlite") > >txdb <-loadDb("hg18db_knownGene.sqlite") > >txTr <- > >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE,ge neSymbo > >l=TRUE,name="UCSC") > >plotTracks(txTr,from=start,to=end) > > > >-- bogdan > >------------------ > > > > [[alternative HTML version deleted]] > > > >_______________________________________________ > >Bioconductor mailing list > >Bioconductor@r-project.org > >https://stat.ethz.ch/mailman/listinfo/bioconductor > >Search the archives: > >http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]] ------------------------------ Message: 10 Date: Mon, 11 Feb 2013 11:39:15 -0800 From: "Tim Triche, Jr." <tim.triche@gmail.com> To: "Hahne, Florian" <florian.hahne@novartis.com> Cc: "bioc-sig-sequencing-request@r-project.org" <bioc-sig-sequencing-request@r-project.org>, Bogdan Tanasa <tanasa@gmail.com>, Bioconductor mailing list <bioconductor@r-project.org> Subject: Re: [BioC] question about Gviz Message-ID: <cac+n9bvd4vrf8q+1z1ydc7zphuqpy5doyiu7kkr4eu8u6sqcua@mail.gmail.com> Content-Type: text/plain Oops, critical step omitted. See below. ## see ?select for more on the following ## library(Homo.sapiens) symbolMappings <- select(Homo.sapiens, cols=c('UCSCKG','ENTREZID','SYMBOL'), keys=symbol(txtr), keytype='UCSCKG') ## "pork out" the mappings table so that duplicates expand ## symbolMappings <- symbolMappings[match(symbol(txtr), symbolMappings$UCSCKG),] ## if NAs could be accepted as keys, this next step might be unnecessary ## hasHugoSymbol <- !is.na(symbolMappings$SYMBOL) ## however, it seems that not all UCSC known genes have a Hugo symbol? ## txtr <- txtr[ hasHugoSymbol ] symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] plotTracks(txtr) The above (minus comments) is what I used to generate a test result. On Mon, Feb 11, 2013 at 11:34 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > Well, here's one approach. I'll start from the constructed 'txtr' track. > > ## see ?select for more on the following > ## > library(Homo.sapiens) > symbolMappings <- select(Homo.sapiens, > > cols=c('UCSCKG','ENTREZID','SYMBOL'), > keys=symbol(txtr), > keytype='UCSCKG') > > ## "pork out" the mappings table so that duplicates expand > ## > symbolMappings <- symbolMappings[match(symbol(txtr), > > symbolMappings$UCSCKG),] > > ## if NAs can be accepted as keys, this next step might be unnecessary > ## however, it seems that not all UCSC known genes have a Hugo symbol? > ## > txtr <- txtr[ hasHugoSymbol ] > symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] > plotTracks(txtr) > > That works. > > > > > On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian < > florian.hahne@novartis.com> wrote: > >> Well, the main culprit here is really the TranscriptDB package. It does >> not seem to deal at all with gene symbols, so there is no way for Gviz to >> automatically fetch those. If you come up with a way to match the UCSC >> gene identifiers back to gene symbols you could stick those into the >> GeneRegionTrack using the 'symbol' replacement method. E.g., >> symbol(foo) <- mappingTable[match(transcript(foo), mappingTable$UCSCId),]) >> I am not sure how this mapping is supposed to be done in the Bioconductor >> world these days. You may be able to find a way using one of the org.db >> packages. Or maybe you will have to download a mapping table directly from >> the UCSC table browser. >> With the next release of Bioconductor (or already now if you are working >> with the devel branch), Gviz supports building tracks from a whole range >> of standard annotation files. You could then export the whole known.gene >> table from UCSC as a GTF file and import in again as a GeneRegionTrack >> object. Alternatively you may want to look into the BiomartGeneRegionTrack >> class which will fetch the gene models from Ensembl, but this includes the >> HUGO gene symbols. >> Florian >> >> >> -- >> >> >> >> >> >> >> On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote: >> >> >Dear all, >> > >> >I am using the following code below in order to retrieve the gene >> >annotations in Gviz package : >> >please could advise on what shall I modify in order to display the HUGO [[elided Yahoo spam]] >> > >> >library("GenomicFeatures") >> >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = >> >"knownGene") >> >saveDb(hg18db, file="hg18db_knownGene.sqlite") >> >txdb <-loadDb("hg18db_knownGene.sqlite") >> >txTr <- >> >> >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE,g eneSymbo >> >l=TRUE,name="UCSC") >> >plotTracks(txTr,from=start,to=end) >> > >> >-- bogdan >> >------------------ >> > >> > [[alternative HTML version deleted]] >> > >> >_______________________________________________ >> >Bioconductor mailing list >> >Bioconductor@r-project.org >> >https://stat.ethz.ch/mailman/listinfo/bioconductor >> >Search the archives: >> >http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]] ------------------------------ Message: 11 Date: Mon, 11 Feb 2013 14:45:59 -0500 From: Seb <seba.bat@gmail.com> To: bioconductor@r-project.org Subject: [BioC] plotting BED intervals to TSS regions Message-ID: <canstik=yenwn1-_drhpwjhaxczjj=c5jo_sghauecjcenvtnvq@mail.gmail.com> Content-Type: text/plain; charset=ISO-8859-1 hi gurus i have several BED files containing chromosome #, start and end that correspond to overlapping regions of different ChIP Seq experiments. this part was done with Galaxy. i also have a file containing TSS coordinates +/- 10kb. what i want to do is to create a plot to show how many of my overlapping intervals fall within the TSS regions, and, if they do, have on the X axis the distance to the TSS and on the Y axis the number of regions that overlap that certain part of the TSS ...i am a bit confused about how to do this tho...i looked in galaxy [[elided Yahoo spam]] thanks ------------------------------ Message: 12 Date: Mon, 11 Feb 2013 17:10:05 -0500 From: "James W. MacDonald" <jmacdon@uw.edu> Cc: "Bioconductor@r-project.org" <bioconductor@r-project.org> Subject: Re: [BioC] edgeR cpm filtering Message-ID: <51196C3D.6000009@uw.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hi John, Please don't take things off-list. Even if you are not a subscriber (and if you are using BioC stuff you should be, and you can always stop delivery but remain a subscriber), I believe that replying to an existing thread will work. I don't see any zero counts causing a problem. Using the example for cpm() as a starting point, and modifying to have a set with zero counts, I get this: > y [,1] [,2] [,3] [,4] [1,] 1 2 14 11 [2,] 11 25 1 26 [3,] 1 22 2 19 [4,] 5 6 15 6 [5,] 0 0 1 5 > d <-DGEList(counts=y, lib.size=1001:1004, group=factor(c(1,1,2,2))) > d <- estimateCommonDisp(d) > d <- estimateTagwiseDisp(d) > topTags(exactTest(d)) Comparison of groups: 2-1 logFC logCPM PValue FDR 1 2.9550376 12.76964 6.109348e-05 0.0003054674 5 4.6421574 10.54712 1.283343e-01 0.3208358043 4 0.9149142 12.96222 2.668415e-01 0.4447357815 2 -0.4149407 13.93933 8.539261e-01 0.9783799675 3 -0.1325391 13.42121 9.783800e-01 0.9783799675 So the sample with zero counts (sample 5), is the second row in the topTags() output, and it has no problem computing a logFC value. Best, Jim On 2/11/2013 4:30 PM, John Sperry wrote: > Hi again Jim, > > One more thing, in microarray days, people used to add a small value, > let say 1 to the 0 values to avoid non-sense fold changes. It's not > the case in NGS any more right? so it's not possible to do that in > edgeR, right? that's why I was thinking about filtering out with cpm. > > Thanks, > John > > > > -------------------------------------------------------------------- ---- > *To:* "jmacdon@uw.edu" <jmacdon@uw.edu> > *Sent:* Monday, February 11, 2013 1:47 PM > *Subject:* [BioC] edgeR cpm filtering > > Hi Jim, > > I'm very new to edgeR and BioC. I couldn't reply to your post in BioC, > so here is my post in an email :D > > I still cannot see why 1M is chosen, but I appreciate your explanations. > > About the cpm filtering, the reason that I chose '> 2' for 3 samples > with each having 2 replicates was that I though edgeR must be smart > enough to figure out that when I say more than 5 reads per million for [[elided Yahoo spam]] [[elided Yahoo spam]] > > as for the reason for wanting to get rid of the sample 3 with 2 > replicates that have 0 reads mapped to them, I don't want them, > because, they cause the logFC to become huge non-sense numbers! i > guess dividing be 0 causes the problem! so I thought for not seeing > weird values when the significant genes are selected, it's better to > get rid of genes that have 0 reads mapped to any of their groups. Does > it make sense? > > d_DGEList<- d_DGEList[rowSums(cpm_filtered> 5)> 2,] > > Thanks, > John > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 ------------------------------ Message: 13 Date: Mon, 11 Feb 2013 12:24:58 +0100 From: Enrica <enrica.calura@gmail.com> Cc: "chiara.romualdi@unipd.it" <chiara.romualdi@unipd.it>, "gabriele.sales@unipd.it" <gabriele.sales@unipd.it>, Bioconductor mailing list <bioconductor@stat.math.ethz.ch> Subject: Re: [BioC] runSPIA() in graphite package Message-ID: <capkcaw0ixxcknwxkj4vue25bfootjx2rd5flsgmgs9qk96hbsg@mail.gmail.com> Content-Type: text/plain Dear John, the procedure you applied to analyse your data is right. However, the runSpia function filters out those pathways that have no edges or less than 5 nodes with valid edges. Your pathway, after the conversion to Entrez Gene, do not satisfy neither the conditions, having no edges. > pe<-convertIdentifiers(biocarta[["estrogen responsive protein efp controls cell cycle and breast tumors growth"]], "entrez") > pe "estrogen responsive protein efp controls cell cycle and breast tumors growth" pathway from BioCarta Number of nodes = 5 Number of edges = 0 Type of identifiers = Entrez Gene Retrieved on = 2011-05-12 We equipped our runSpia() function with the checks described above in order to protect the user from unreliable results. On a separate note, we have also re-checked how that specific pathway was converted. The original BioPax2 contained some edges, but their endpoints were unfortunately nodes which could not be associated with any entrez gene. Our software, thus, dropped them from the pathway. We are working on a new graphite release based on annotation released more recently. Those includes more edges; as a result that pathway is no longer empty. Enrica Calura > Hi all, I am using runSPIA() from graphite package to analyze a gene list > against biocarta pathway database. > > > library(graphite) > > prepareSPIA(biocarta, "biocartaEx") > > obj<-runSPIA(de=siggenes, all=allgenes, "biocartaEx") > > where "siggenes" are a list of 1332 significant genes selected from > "allgenes" (a list of 17138 genes). The running process verbose indicated a > total of 178 pathways analyzed. > > One of the pathways I am particularly interested is "estrogen responsive > protein efp controls cell cycle and breast tumors growth": > > > p <- biocarta[["estrogen responsive protein efp controls cell cycle and > breast tumors growth"]] > > nodes(p) > [1] "CDKs" "Cyclin B1/2" "EntrezGene:2099" > "EntrezGene:2810" "EntrezGene:57154" "EntrezGene:7157" > "EntrezGene:7706" "ubiquitin" > > And my siggenes contains 2 of genes involved in this pathway: > > > siggenes[c('7706','2099')] > 7706 2099 > 4.347012 -3.792425 > > So I would assume that runSPIA will examine this pathway during the > calculation, but surprisingly I didn't see this particular way being > examined. Here is the section of runSPIA() verbose output that started with > "e": > > Done pathway 42 : e2f1 destruction pathway.. > Done pathway 43 : effects of calcineurin in kera.. > Done pathway 44 : egf signaling pathway.. > Done pathway 45 : endocytotic role of ndk phosph.. > Done pathway 46 : epo signaling pathway.. > Done pathway 47 : erk and pi-3 kinase are necess.. > Done pathway 48 : erk1/erk2 mapk signaling pathw.. > Done pathway 49 : eukaryotic protein translation.. > Done pathway 50 : extrinsic prothrombin activati.. > > Can anyone tell me why runSPIA() missed this pathway? Attached are > siggenes and allgenes for you to try. > > > siggenes<-as.matrix(read.table("siggenes.txt",row.names=1))[,1] > > allgenes<-as.matrix(read.table("H:\\test\\allgenes.txt", row.names=NULL, > header=T, colClasses='character'))[,1] > > Many thanks > > John > > [[alternative HTML version deleted]] ------------------------------ Message: 14 Date: Mon, 11 Feb 2013 16:41:25 -0600 From: Tengfei Yin <yintengfei@gmail.com> To: Seb <seba.bat@gmail.com> Cc: bioconductor@r-project.org Subject: Re: [BioC] plotting BED intervals to TSS regions Message-ID: <capjsq9kn0rsob3zno6k=+ok1r=qdyk_l3=wapmefci1r6fuhzq@mail.gmail.com> Content-Type: text/plain Hi Seb, I guess before visualization you need to get the summary statistics ready first, I got one idea, maybe you could give a try, and I assume the count you want is based on a per base resolution 1. 'import' function in package rtracklayer to import your BED files and TSS files as GRanges object into R, ready for analysis. 2. ?findOverlaps in package 'GenomicRanges', there are some utilities to summarize the overlapping between your BED and TSS region. Then you can easily get an answer to your first question: how many falls within your TSS region defined. 3. compute coverage for your imported BED intervals(GRanges object) , that will give you an Rle/RleList. check 'coverage' function in package IRanges/GenomicRanges. 4. then get views on this coverage data with you tss position object. please check 'Views' method in GenomicRanges/IRanges. This step is important, better make sure your TSS have equal length window, for example 20kb in your case. 5. Covert this Views to a matrix by using as.matrix on previous views object. You will get a matrix, whose columns correspond to position around tss, from -10kb to 10kb, each row correspond to one tss region. If you want to summarize over all tss, just use colSums over this matrix. 6. After you get this summary data, you can use any graphic package in R to visualize this data as lines and relabel the x-axis position from -10k to 10k. As far as I know, there is no direct way in bioc to import/aggregate/visualize your BED/TSS file together with one or two commands to get what you want yet ... HTH Tengfei On Mon, Feb 11, 2013 at 1:45 PM, Seb <seba.bat@gmail.com> wrote: > hi gurus > > i have several BED files containing chromosome #, start and end that > correspond to overlapping regions of different ChIP Seq experiments. > this part was done with Galaxy. > > i also have a file containing TSS coordinates +/- 10kb. > > what i want to do is to create a plot to show how many of my > overlapping intervals fall within the TSS regions, and, if they do, > have on the X axis the distance to the TSS and on the Y axis the > number of regions that overlap that certain part of the TSS > > ...i am a bit confused about how to do this tho...i looked in galaxy [[elided Yahoo spam]] > > thanks > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Tengfei Yin MCDB PhD student 1620 Howe Hall, 2274, Iowa State University Ames, IA,50011-2274 [[alternative HTML version deleted]] ------------------------------ Message: 15 Date: Mon, 11 Feb 2013 16:08:00 -0800 From: Bogdan Tanasa <tanasa@gmail.com> To: <ttriche@usc.edu> Cc: bioconductor@stat.math.ethz.ch, bioc-sig-sequencing@r-project.org Subject: Re: [BioC] question about Gviz Message-ID: <ca+jem02xqyqhkzpfagboltdrv_z+l6703jshpmpceq8rvqzrdg@mail.gmail.com> Content-Type: text/plain Thanks, Tim ...will do it accordingly. On Mon, Feb 11, 2013 at 3:49 PM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > So the problem (at least as regards my initial code snippet) is that there > is no mapping in the Homo.sapiens OrganismDb for either uc002yjx.1 > or uc010gkz.1 and thus no symbol (since symbol-less features are dropped). > That's sort of odd since, it seems, these are NRIP1 isoforms. > > I noticed that p15 and p16 transcripts appeared to be missing when I > looked at chr9 earlier. This might be an issue where mappings are missing > from Homo.sapiens / org.Hs.eg.db, in which case a posting to the list would > be good. > > Meanwhile, if you want to keep the UCSC names of the genes for which there > is not a Hugo mapping in the symbol table, the following replacement for > subsetting txtr should do it (I tested this): > > ## don't subset txtr. instead, > symbol(txtr)[ hasHugoSymbol ] <- symbolMappings$SYMBOL[hasHugoSymbol] > > Then proceed as before. I get the attached output from running > > plotTracks(txtr,from=start,to=end) > > since the NRIP1 isoforms, not being mapped to Hugo (?!), stay UCSC IDs. > > Do you mind if I cc: the list on this? And/or bring it up with Marc/Herve? > > --t > > > > On Mon, Feb 11, 2013 at 3:19 PM, Bogdan Tanasa <tanasa@gmail.com> wrote: > >> Hi Tim, >> >> I followed your code, in the following way (please see below); >> plotTracks(txtr) works; however, when I try to plot specific regions >> ( eg plotTracks(txtr,chr="chr21", start=15245000,end=15520000) ) it does >> not show the correct region : I suspect it may be an issue in my code, but >> not very sure where ... if you have 2-3 minutes, would appreciate your >> help. thanks ! >> >> library(Homo.sapiens) >> library("ggplot2") >> library(Gviz) >> library("GenomicFeatures") >> library("ggplot2") >> library("rtracklayer") >> library("TxDb.Hsapiens.UCSC.hg18.knownGene") >> >> chr<-"chr21" >> start<-15245000 >> end <-15520000 >> >> hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = >> "knownGene") >> saveDb(hg18db, file="hg18db_knownGene.sqlite") >> txdb <-loadDb("hg18db_knownGene.sqlite") >> txtr <- >> GeneRegionTrack(txdb,genome="hg18",chromosome="chr21",showId=TRUE,g eneSymbol=TRUE,name="UCSC") >> plotTracks(txtr,from=start,to=end) >> >> >> >> symbolMappings <- select(Homo.sapiens, >> cols=c('UCSCKG','ENTREZID','SYMBOL'), >> keys=symbol(txtr), keytype='UCSCKG') >> symbolMappings <- symbolMappings[match(symbol(txtr), >> symbolMappings$UCSCKG),] >> >> hasHugoSymbol <- !is.na(symbolMappings$SYMBOL) >> >> txtr <- txtr[ hasHugoSymbol ] >> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] >> >> plotTracks(txtr) >> plotTracks(txtr,from=start,to=end) >> >> >> >> On Mon, Feb 11, 2013 at 11:39 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: >> >>> Oops, critical step omitted. See below. >>> >>> ## see ?select for more on the following >>> ## >>> library(Homo.sapiens) >>> symbolMappings <- select(Homo.sapiens, >>> >>> cols=c('UCSCKG','ENTREZID','SYMBOL'), >>> keys=symbol(txtr), >>> keytype='UCSCKG') >>> >>> ## "pork out" the mappings table so that duplicates expand >>> ## >>> symbolMappings <- symbolMappings[match(symbol(txtr), >>> >>> symbolMappings$UCSCKG),] >>> >>> ## if NAs could be accepted as keys, this next step might be unnecessary >>> ## >>> hasHugoSymbol <- !is.na(symbolMappings$SYMBOL) >>> >>> ## however, it seems that not all UCSC known genes have a Hugo symbol? >>> ## >>> txtr <- txtr[ hasHugoSymbol ] >>> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] >>> plotTracks(txtr) >>> >>> >>> The above (minus comments) is what I used to generate a test result. >>> >>> >>> >>> >>> On Mon, Feb 11, 2013 at 11:34 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: >>> >>>> Well, here's one approach. I'll start from the constructed 'txtr' >>>> track. >>>> >>>> ## see ?select for more on the following >>>> ## >>>> library(Homo.sapiens) >>>> symbolMappings <- select(Homo.sapiens, >>>> >>>> cols=c('UCSCKG','ENTREZID','SYMBOL'), >>>> keys=symbol(txtr), >>>> keytype='UCSCKG') >>>> >>>> ## "pork out" the mappings table so that duplicates expand >>>> ## >>>> symbolMappings <- symbolMappings[match(symbol(txtr), >>>> >>>> symbolMappings$UCSCKG),] >>>> >>>> ## if NAs can be accepted as keys, this next step might be unnecessary >>>> ## however, it seems that not all UCSC known genes have a Hugo symbol? >>>> ## >>>> txtr <- txtr[ hasHugoSymbol ] >>>> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] >>>> plotTracks(txtr) >>>> >>>> That works. >>>> >>>> >>>> >>>> >>>> On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian < >>>> florian.hahne@novartis.com> wrote: >>>> >>>>> Well, the main culprit here is really the TranscriptDB package. It does >>>>> not seem to deal at all with gene symbols, so there is no way for Gviz >>>>> to >>>>> automatically fetch those. If you come up with a way to match the UCSC >>>>> gene identifiers back to gene symbols you could stick those into the >>>>> GeneRegionTrack using the 'symbol' replacement method. E.g., >>>>> symbol(foo) <- mappingTable[match(transcript(foo), >>>>> mappingTable$UCSCId),]) >>>>> I am not sure how this mapping is supposed to be done in the >>>>> Bioconductor >>>>> world these days. You may be able to find a way using one of the org.db >>>>> packages. Or maybe you will have to download a mapping table directly >>>>> from >>>>> the UCSC table browser. >>>>> With the next release of Bioconductor (or already now if you are >>>>> working >>>>> with the devel branch), Gviz supports building tracks from a whole >>>>> range >>>>> of standard annotation files. You could then export the whole >>>>> known.gene >>>>> table from UCSC as a GTF file and import in again as a GeneRegionTrack >>>>> object. Alternatively you may want to look into the >>>>> BiomartGeneRegionTrack >>>>> class which will fetch the gene models from Ensembl, but this includes >>>>> the >>>>> HUGO gene symbols. >>>>> Florian >>>>> >>>>> >>>>> -- >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote: >>>>> >>>>> >Dear all, >>>>> > >>>>> >I am using the following code below in order to retrieve the gene >>>>> >annotations in Gviz package : >>>>> >please could advise on what shall I modify in order to display the >>>>> HUGO [[elided Yahoo spam]] >>>>> > >>>>> >library("GenomicFeatures") >>>>> >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = >>>>> >"knownGene") >>>>> >saveDb(hg18db, file="hg18db_knownGene.sqlite") >>>>> >txdb <-loadDb("hg18db_knownGene.sqlite") >>>>> >txTr <- >>>>> >>>>> >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRU E,geneSymbo >>>>> >l=TRUE,name="UCSC") >>>>> >plotTracks(txTr,from=start,to=end) >>>>> > >>>>> >-- bogdan >>>>> >------------------ >>>>> > >>>>> > [[alternative HTML version deleted]] >>>>> > >>>>> >_______________________________________________ >>>>> >Bioconductor mailing list >>>>> >Bioconductor@r-project.org >>>>> >https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> >Search the archives: >>>>> >http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> >>>> >>>> -- >>>> *A model is a lie that helps you see the truth.* >>>> * >>>> * >>>> Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>>> >>> >>> >>> >>> -- >>> *A model is a lie that helps you see the truth.* >>> * >>> * >>> Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>> >> >> > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > [[alternative HTML version deleted]] ------------------------------ Message: 16 Date: Tue, 12 Feb 2013 00:41:02 +0000 From: Ian Sudbery <ian.sudbery@dpag.ox.ac.uk> To: "bioconductor@stat.math.ethz.ch" <bioconductor@stat.math.ethz.ch> Subject: [BioC] Error Bars on EdgeR Message-ID: <55D5020998905943B8A25CAFED69B2C219DCBCCF@MBX01.ad.oak.ox.ac.uk> Content-Type: text/plain Hi All, I was wondering if anyone had any idea how one might put error bars on the log fold changes calculated by edgeR? I'd like to do this so I can show examples of differentially expressed genes on a bar plot, and show them in comparison to qPCR validation. I feel like one should be able to get the variance from the dispersion: CV^2 = 1/mu + BCV^2 = 1/mu + dispersion So: variance = mu + mu^2*dispersion ? This could either be used to show the variance for each of the two conditions, or used to calculate an error for the fold change. but I'm a bit stuck as to how to actually implement this. Any ideas appreciated, Yours, Ian [[alternative HTML version deleted]] ------------------------------ Message: 17 Date: Tue, 12 Feb 2013 01:40:23 +0100 From: Jon Toledo <tintin_jb@hotmail.com> To: <bioconductor@r-project.org> Subject: [BioC] Differences in results analyzing Mouse Gene 1.0-ST using oligo and affy package Message-ID: <bay161-w16c78e461d37c096df6db698090@phx.gbl> Content-Type: text/plain Dear Bioconductor List, I have repeated my workflow using the affy and oligo package alternatively followed by the limma package to analyze and experiment with two conditions using Mouse Gene 1.0-ST chips and I arrive to different results. I have been looking online and found that at least for the for the Mouse Gene 1.1-ST the oligo package is preferred, but not anything clear for Mouse Gene 1.0-ST . I attach below my code and session info: A) For running affy: library(affy) library(pd.mogene.1.0.st.v1) library(mogene10sttranscriptcluster.db) cellist=list.celfiles(full.names=T) cellistD14=grep("CD1...14",cellist,value=T) esetD14<- justRMA(filenames=gsub("\\./","",cellistD14)) B) For runnign oligo: library(oligo) library(pd.mogene.1.0.st.v1) library(mogene10sttranscriptcluster.db) cellist=list.celfiles(full.names=T) cellistD14=grep("CD1...14",cellist,value=T) rsetD14=read.celfiles(cellistD14) esetpD14=rma(rsetD14,target="probeset") esetD14=rma(rsetD14,target="core") C) Finally running the same analysis: designD14<-read.delim('designD14.txt') contrast.matrix=model.matrix(~treatment,data=designD14) library(limma) fit <- lmFit(esetD14, contrast.matrix) fit <- eBayes(fit,proportion=0.01) m1=topTable(fit, coef="treatment",number=1e8,adjust.method="BH") m1=m1[,c("ID","logFC","P.Value","adj.P.Val")] m1=cbind(m1[,1:2],FCTreat=2**m1[,2],PTreat=m1[,3],adj.PTreat=m1[,4]) > sessionInfo() (This is for the affy run) R version 2.15.2 (2012-10-26) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] mogene10stv1cdf_2.11.0 limma_3.14.4 mogene10sttranscriptcluster.db_8.0.1 [4] org.Mm.eg.db_2.8.0 AnnotationDbi_1.20.3 pd.mogene.1.0.st.v1_3.8.0 [7] oligo_1.22.0 oligoClasses_1.20.0 RSQLite_0.11.2 [10] DBI_0.2-5 affy_1.36.1 Biobase_2.18.0 [13] BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] affxparser_1.30.2 affyio_1.26.0 BiocInstaller_1.8.3 Biostrings_2.26.3 bit_1.1-9 codetools_0.2-8 [7] ff_2.2-10 foreach_1.4.0 GenomicRanges_1.10.6 IRanges_1.16.4 iterators_1.0.6 parallel_2.15.2 [13] preprocessCore_1.20.0 splines_2.15.2 stats4_2.15.2 tools_2.15.2 zlibbioc_1.4.0 Thank you very much J Toledo UPenn USA [[alternative HTML version deleted]] ------------------------------ Message: 18 Date: Mon, 11 Feb 2013 22:39:23 -0500 From: Juliet Hannah <juliet.hannah@gmail.com> To: Bioconductor mailing list <bioconductor@r-project.org> Subject: Re: [BioC] statistical test for time course data Message-ID: <calzuzrqt0ndnwgyekfphgc4z-bpj_mvhwxcvnijq- g12_4zszq@mail.gmail.com=""> Content-Type: text/plain; charset=ISO-8859-1 Hi Gordon, I've been curious about this as well. Can limma model the following situation? We have two treatments and multiple time points. We are interested in if the mean profile for each treatment differs over time, treating time continuously. The measurements are over time for each individual so we have to account for correlations within individuals. Ideally, I would like to allow for a random intercept and a random slope (possible quadratic) if needed. EDGE uses splines, so that would be nice as well. I am aware of duplicateCorrelation, Is this the way to proceed? I'll post a reproducible example for guidance if you indicate that the above is possible. Thanks for your time and for your work on limma, which I use frequently. On Fri, Feb 1, 2013 at 6:46 PM, Gordon K Smyth <smyth@wehi.edu.au> wrote: > Dear Chris, > > Yes, limma can easily test for a difference at one point, or can test for a > significant change over the whole time course like EDGE. > > I don't understand your experiment well enough to give more specific advice. > You would need to tell us your experimental design, in terms of the targets > frame, and exactly what you want to test. > > Best wishes > Gordon > >> Date: Fri, 1 Feb 2013 12:19:26 +0900 >> From: chris Jhon <cjhon217@gmail.com> >> To: Richard Friedman <friedman@cancercenter.columbia.edu> >> Cc: Bioconductor mailing list <bioconductor@r-project.org> >> Subject: Re: [BioC] statistical test for time course data >> >> Hi Richard, >> >> Thank you for help. >> In my data ,i have one point which i think it is different from other >> points and i would like to test statistical significance of the difference >> of this point. >> Your suggestion means that there is no direct function in R that i can >> use,i have to use a package which implement an algorithm. >> If so, i think edgeR can do the same analysis too,Am i right? >> >> Best Reagards, >> Chris >> >> On Thu, Jan 31, 2013 at 11:53 PM, Richard Friedman < >> friedman@cancercenter.columbia.edu> wrote: >> >>> Dear Chris, >>> >>> Limma can be used to test between time points >>> treating each time point as a categorical variable. >>> The program "EDGE" from the Storey lab, can test whether >>> there is significant change over a whole time course. >>> >>> http://www.ncbi.nlm.nih.gov/pubmed/16357033 >>> >>> with hopes that the above helps, >>> Rich >>> Richard A. Friedman, PhD >>> Associate Research Scientist, >>> Biomedical Informatics Shared Resource >>> Herbert Irving Comprehensive Cancer Center (HICCC) >>> Lecturer, >>> Department of Biomedical Informatics (DBMI) >>> Educational Coordinator, >>> Center for Computational Biology and Bioinformatics (C2B2)/ >>> National Center for Multiscale Analysis of Genomic Networks (MAGNet)/ >>> Columbia Initiative in Systems Biology >>> Room 824 >>> Irving Cancer Research Center >>> Columbia University >>> 1130 St. Nicholas Ave >>> New York, NY 10032 >>> (212)851-4765 (voice) >>> friedman@cancercenter.columbia.edu >>> http://cancercenter.columbia.edu/~friedman/ >>> >>> "Complex numbers! Ha! Ha! There is nothing weirder >>> than imaginary numbers. Architects don't need to know >>> complex numbers. Whenever I get a negative root for >>> an area, I throw it out. And don't talk to me about >>> quaternions. I am not going into computer animation." >>> -Rose Friedman, age 16 >>> >>> >>> On Jan 30, 2013, at 11:43 PM, chris Jhon wrote: >>> >>> Hi All, >>> >>> I have data at different time points for time course experiment. >>> I have a response for each time point and i would like to test whether >>> the >>> difference between response of two time points is statistically >>> significant >>> or not. >>> my data is linear plot where response on y axis and time on x axis. >>> >>> what statistical test shall i use? >>> >>> >>> I appreciate any help. >>> >>> Best Regards, >>> Chris >>> > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:4}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Message: 19 Date: Mon, 11 Feb 2013 23:31:31 -0500 From: Steve Lianoglou <mailinglist.honeypot@gmail.com> To: Juliet Hannah <juliet.hannah@gmail.com> Cc: Bioconductor mailing list <bioconductor@r-project.org> Subject: Re: [BioC] statistical test for time course data Message-ID: <caha9mcosvlmerwxije_qnrpbqr0tc0qkb=y=kbmxeosf- 9xo+g@mail.gmail.com=""> Content-Type: text/plain; charset=ISO-8859-1 Hi Juliet, On Mon, Feb 11, 2013 at 10:39 PM, Juliet Hannah <juliet.hannah@gmail.com> wrote: > Hi Gordon, > > I've been curious about this as well. Can limma model the following > situation? [snip] I think you'll find a thread that appeared a bit after the one you are replying to: http://thread.gmane.org/gmane.science.biology.informatics.conductor/46 188 If you look in Section 8.6 of the limmaUserGuide (in development): http://bioconductor.org/packages/2.12/bioc/vignettes/limma/inst/doc/us ersguide.pdf You'll find a section on time course analysis. 8.6.2, in particular, uses splines. HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact ------------------------------ Message: 20 Date: Mon, 11 Feb 2013 21:45:34 -0800 (PST) To: "bioconductor@r-project.org" <bioconductor@r-project.org> Subject: Re: [BioC] edgeR cpm filtering Message-ID: <1360647934.77288.YahooMailNeo@web162106.mail.bf1.yahoo.com> Content-Type: text/plain Hi Jim, I posted my first question as a guest and just became a member today, so it???s a bit of learning curve for me on how to use the list. Sorry for the email and I hope I am using the list correctly now by emailing this address. About my edgeR question: Thank you for your elaboration and also the example. I can see that in your example, you have samples with 0 reads and logFC is okay, and that???s what logically should be. However, in my dataset, I see many cases of logFC of ~144269489 (or negative of ~ this value). When I check the genes, I see that these are the cases where all the replicates of one samples have 0 reads mapped to them, whereas the other groups of samples have many reads. These are the cases that cpm didn???t filter them. That???s why I tried to use more restrictive cpm filtering to get rid of these genes. Any thoughts on why this non-interpretive logFC cases happen are greatly appreciated. Thanks, John Hi John, Please don't take things off-list. Even if you are not a subscriber (and if you are using BioC stuff you should be, and you can always stop delivery but remain a subscriber), I believe that replying to an existing thread will work. I don't see any zero counts causing a problem. Using the example for cpm() as a starting point, and modifying to have a set with zero counts, I get this: > y ?? ?? [,1] [,2] [,3] [,4] [1,]?? ?? 1?? ?? 2?? 14?? 11 [2,]?? 11?? 25?? ?? 1?? 26 [3,]?? ?? 1?? 22?? ?? 2?? 19 [4,]?? ?? 5?? ?? 6?? 15?? ?? 6 [5,]?? ?? 0?? ?? 0?? ?? 1?? ?? 5 > d <-DGEList(counts=y, lib.size=1001:1004, group=factor(c(1,1,2,2))) > d <- estimateCommonDisp(d) > d <- estimateTagwiseDisp(d) > topTags(exactTest(d)) Comparison of groups:?? 2-1 ?? ?? ?? logFC?? logCPM?? ?? ?? PValue?? ?? ?? ?? ?? FDR 1?? 2.9550376 12.76964 6.109348e-05 0.0003054674 5?? 4.6421574 10.54712 1.283343e-01 0.3208358043 4?? 0.9149142 12.96222 2.668415e-01 0.4447357815 2 -0.4149407 13.93933 8.539261e-01 0.9783799675 3 -0.1325391 13.42121 9.783800e-01 0.9783799675 So the sample with zero counts (sample 5), is the second row in the topTags() output, and it has no problem computing a logFC value. Best, Jim On 2/11/2013 4:30 PM, John Sperry wrote: > Hi again Jim, > > One more thing, in microarray days, people used to add a small value, let say 1 to the 0 values to avoid non-sense fold changes. It's not the case in NGS any more right? so it's not possible to do that in edgeR, right? that's why I was thinking about filtering out with cpm. > > Thanks, > John > > > > -------------------------------------------------------------------- ---- > *To:* "jmacdon@uw.edu" <jmacdon@uw.edu> > *Sent:* Monday, February 11, 2013 1:47 PM > *Subject:* [BioC] edgeR cpm filtering > > Hi Jim, > > I'm very new to edgeR and BioC. I couldn't reply to your post in BioC, so here is my post in an email :D > > I still cannot see why 1M is chosen, but I appreciate your explanations. > > About the cpm filtering, the reason that I chose '> 2' for 3 samples with each having 2 replicates was that I though edgeR must be smart enough to figure out that when I say more than 5 reads per million for [[elided Yahoo spam]] [[elided Yahoo spam]] > > as for the reason for wanting to get rid of the sample 3 with 2 replicates that have 0 reads mapped to them, I don't want them, because, they cause the logFC to become huge non-sense numbers! i guess dividing be 0 causes the problem! so I thought for not seeing weird values when the significant genes are selected, it's better to get rid of genes that have 0 reads mapped to any of their groups. Does it make sense? > > d_DGEList<- d_DGEList[rowSums(cpm_filtered> 5)> 2,] > > Thanks, > John > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 [[alternative HTML version deleted]] ------------------------------ Message: 21 Date: Tue, 12 Feb 2013 01:40:55 -0500 From: Steve Lianoglou <mailinglist.honeypot@gmail.com> Cc: "bioconductor@r-project.org" <bioconductor@r-project.org> Subject: Re: [BioC] edgeR cpm filtering Message-ID: <caha9mcnreg0zpcykxpodjiemkjc0e9e7mawhftnbuqd1c1sk1q@mail.gmail.com> Content-Type: text/plain; charset=windows-1252 Hi John, e: > Hi Jim, > > I posted my first question as a guest and just became a > member today, so it?s a bit of learning curve for me on how to use the list. Sorry > for the email and I hope I am using the list correctly now by emailing this > address. > > About my edgeR question: > > Thank you for your elaboration and also the example. I can > see that in your example, you have samples with 0 reads and logFC is okay, and > that?s what logically should be. However, in my dataset, I see many cases of > logFC of ~144269489 (or negative of ~ this value). When I check the genes, I see > that these are the cases where all the replicates of one samples have 0 reads > mapped to them, whereas the other groups of samples have many reads. These are > the cases that cpm didn?t filter them. That?s why I tried to use more restrictive > cpm filtering to get rid of these genes. > > Any thoughts on why this non-interpretive logFC cases happen > are greatly appreciated. >From the looks of your previous (original) email, it looks like you've landed w/ a (very) strange bioconductor setup given the version of R you are using -- you are using R-2.15.0, with edgeR version 2.6.0 Please update to the latest version of R (2.15.2 -- cuz, hey it's a good idea), but more importantly, the latest version of edgeR: http://bioconductor.org/packages/2.11/bioc/html/edgeR.html You should be installing your bioconductor packages w/ the BiocInstaller package/biocLite functionality. Unwinding yourself from the bizarre package versions you have running to ensure that you are on the latest bioc release train (2.11) might get a bit tricky, and sometimes the easiest way is to blow out your R install and start from scratch. After you do that, do you still get such extreme logFC's? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact ------------------------------ Message: 22 Date: Tue, 12 Feb 2013 07:55:21 +0000 From: "Hahne, Florian" <florian.hahne@novartis.com> To: "ttriche@usc.edu" <ttriche@usc.edu> Cc: "bioc-sig-sequencing-request@r-project.org" <bioc-sig-sequencing-request@r-project.org>, Bogdan Tanasa <tanasa@gmail.com>, Bioconductor mailing list <bioconductor@r-project.org> Subject: Re: [BioC] question about Gviz Message-ID: <aabaab0b27af5c418fe809f352fb7ba408486bed@023-db3mpn1-082.023d .mgd.msft.net=""> Content-Type: text/plain Cool, thanks. Just learned something again today :-) I will take a closer look at the Homo.sapiens package (and the other new organism annotation packages as well) to figure out how to integrate this more closely into the Gviz package. One should not have to jump through all these hoops to get gene symbols in there if all the necessary information is already on your fingertips. Florian -- From: "<tim triche="">", "Jr." <tim.triche@gmail.com<mailto:tim.triche@gmail.com>> Reply-To: "ttriche@usc.edu<mailto:ttriche@usc.edu>" <ttriche@usc.edu<mailto:ttriche@usc.edu>> Date: Monday, February 11, 2013 8:39 PM To: NIBR <florian.hahne@novartis.com<mailto:florian.hahne@novartis.com>> Cc: Bogdan Tanasa <tanasa@gmail.com<mailto:tanasa@gmail.com>>, Bioconductor mailing list <bioconductor@r-project.org<mailto:bioconductor@r-project.org>>, "bioc-sig-sequencing-request@r-project.org<mailto:bioc-sig-sequencing- request@r-project.org="">" <bioc-sig-sequencing- request@r-project.org<mailto:bioc-sig-sequencing-="" request@r-project.org="">> Subject: Re: [BioC] question about Gviz Oops, critical step omitted. See below. ## see ?select for more on the following ## library(Homo.sapiens) symbolMappings <- select(Homo.sapiens, cols=c('UCSCKG','ENTREZID','SYMBOL'), keys=symbol(txtr), keytype='UCSCKG') ## "pork out" the mappings table so that duplicates expand ## symbolMappings <- symbolMappings[match(symbol(txtr), symbolMappings$UCSCKG),] ## if NAs could be accepted as keys, this next step might be unnecessary ## hasHugoSymbol <- !is.na<http: is.na="">(symbolMappings$SYMBOL) ## however, it seems that not all UCSC known genes have a Hugo symbol? ## txtr <- txtr[ hasHugoSymbol ] symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] plotTracks(txtr) The above (minus comments) is what I used to generate a test result. On Mon, Feb 11, 2013 at 11:34 AM, Tim Triche, Jr. <tim.triche@gmail.com<mailto:tim.triche@gmail.com>> wrote: Well, here's one approach. I'll start from the constructed 'txtr' track. ## see ?select for more on the following ## library(Homo.sapiens) symbolMappings <- select(Homo.sapiens, cols=c('UCSCKG','ENTREZID','SYMBOL'), keys=symbol(txtr), keytype='UCSCKG') ## "pork out" the mappings table so that duplicates expand ## symbolMappings <- symbolMappings[match(symbol(txtr), symbolMappings$UCSCKG),] ## if NAs can be accepted as keys, this next step might be unnecessary ## however, it seems that not all UCSC known genes have a Hugo symbol? ## txtr <- txtr[ hasHugoSymbol ] symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] plotTracks(txtr) That works. On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian <florian.hahne@novartis.com<mailto:florian.hahne@novartis.com>> wrote: Well, the main culprit here is really the TranscriptDB package. It does not seem to deal at all with gene symbols, so there is no way for Gviz to automatically fetch those. If you come up with a way to match the UCSC gene identifiers back to gene symbols you could stick those into the GeneRegionTrack using the 'symbol' replacement method. E.g., symbol(foo) <- mappingTable[match(transcript(foo), mappingTable$UCSCId),]) I am not sure how this mapping is supposed to be done in the Bioconductor world these days. You may be able to find a way using one of the org.db packages. Or maybe you will have to download a mapping table directly from the UCSC table browser. With the next release of Bioconductor (or already now if you are working with the devel branch), Gviz supports building tracks from a whole range of standard annotation files. You could then export the whole known.gene table from UCSC as a GTF file and import in again as a GeneRegionTrack object. Alternatively you may want to look into the BiomartGeneRegionTrack class which will fetch the gene models from Ensembl, but this includes the HUGO gene symbols. Florian -- On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com<mailto:tanasa@gmail.com>> wrote: >Dear all, > >I am using the following code below in order to retrieve the gene >annotations in Gviz package : >please could advise on what shall I modify in order to display the HUGO [[elided Yahoo spam]] > >library("GenomicFeatures") >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = >"knownGene") >saveDb(hg18db, file="hg18db_knownGene.sqlite") >txdb <-loadDb("hg18db_knownGene.sqlite") >txTr <- >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE,gene Symbo >l=TRUE,name="UCSC") >plotTracks(txTr,from=start,to=end) > >-- bogdan >------------------ > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor@r-project.org<mailto:bioconductor@r-project.org> >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- A model is a lie that helps you see the truth. Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> -- A model is a lie that helps you see the truth. Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]] ------------------------------ Message: 23 Date: Tue, 12 Feb 2013 00:04:23 -0800 From: Bogdan Tanasa <tanasa@gmail.com> To: "Hahne, Florian" <florian.hahne@novartis.com> Cc: Bioconductor mailing list <bioconductor@r-project.org>, "bioc-sig-sequencing-request@r-project.org" <bioc-sig-sequencing-request@r-project.org> Subject: Re: [BioC] question about Gviz Message-ID: <ca+jem018udk357wghxa14wp_ncz3qwaesr- cn00pwkrt5w497w@mail.gmail.com=""> Content-Type: text/plain Thank you again gentlemen for your kind help ... On Mon, Feb 11, 2013 at 11:55 PM, Hahne, Florian <florian.hahne@novartis.com> wrote: > Cool, thanks. Just learned something again today :-) > I will take a closer look at the Homo.sapiens package (and the other new > organism annotation packages as well) to figure out how to integrate this > more closely into the Gviz package. One should not have to jump through all > these hoops to get gene symbols in there if all the necessary information > is already on your fingertips. > Florian > -- > > > From: "<tim triche="">", "Jr." <tim.triche@gmail.com> > Reply-To: "ttriche@usc.edu" <ttriche@usc.edu> > Date: Monday, February 11, 2013 8:39 PM > To: NIBR <florian.hahne@novartis.com> > Cc: Bogdan Tanasa <tanasa@gmail.com>, Bioconductor mailing list < > bioconductor@r-project.org>, "bioc-sig-sequencing- request@r-project.org" < > bioc-sig-sequencing-request@r-project.org> > Subject: Re: [BioC] question about Gviz > > Oops, critical step omitted. See below. > > ## see ?select for more on the following > ## > library(Homo.sapiens) > symbolMappings <- select(Homo.sapiens, > > cols=c('UCSCKG','ENTREZID','SYMBOL'), > keys=symbol(txtr), > keytype='UCSCKG') > > ## "pork out" the mappings table so that duplicates expand > ## > symbolMappings <- symbolMappings[match(symbol(txtr), > > symbolMappings$UCSCKG),] > > ## if NAs could be accepted as keys, this next step might be unnecessary > ## > hasHugoSymbol <- !is.na(symbolMappings$SYMBOL) > > ## however, it seems that not all UCSC known genes have a Hugo symbol? > ## > txtr <- txtr[ hasHugoSymbol ] > symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] > plotTracks(txtr) > > > The above (minus comments) is what I used to generate a test result. > > > > > On Mon, Feb 11, 2013 at 11:34 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > >> Well, here's one approach. I'll start from the constructed 'txtr' track. >> >> ## see ?select for more on the following >> ## >> library(Homo.sapiens) >> symbolMappings <- select(Homo.sapiens, >> >> cols=c('UCSCKG','ENTREZID','SYMBOL'), >> keys=symbol(txtr), >> keytype='UCSCKG') >> >> ## "pork out" the mappings table so that duplicates expand >> ## >> symbolMappings <- symbolMappings[match(symbol(txtr), >> >> symbolMappings$UCSCKG),] >> >> ## if NAs can be accepted as keys, this next step might be unnecessary >> ## however, it seems that not all UCSC known genes have a Hugo symbol? >> ## >> txtr <- txtr[ hasHugoSymbol ] >> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] >> plotTracks(txtr) >> >> That works. >> >> >> >> >> On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian < >> florian.hahne@novartis.com> wrote: >> >>> Well, the main culprit here is really the TranscriptDB package. It does >>> not seem to deal at all with gene symbols, so there is no way for Gviz to >>> automatically fetch those. If you come up with a way to match the UCSC >>> gene identifiers back to gene symbols you could stick those into the >>> GeneRegionTrack using the 'symbol' replacement method. E.g., >>> symbol(foo) <- mappingTable[match(transcript(foo), >>> mappingTable$UCSCId),]) >>> I am not sure how this mapping is supposed to be done in the Bioconductor >>> world these days. You may be able to find a way using one of the org.db >>> packages. Or maybe you will have to download a mapping table directly >>> from >>> the UCSC table browser. >>> With the next release of Bioconductor (or already now if you are working >>> with the devel branch), Gviz supports building tracks from a whole range >>> of standard annotation files. You could then export the whole known.gene >>> table from UCSC as a GTF file and import in again as a GeneRegionTrack >>> object. Alternatively you may want to look into the >>> BiomartGeneRegionTrack >>> class which will fetch the gene models from Ensembl, but this includes >>> the >>> HUGO gene symbols. >>> Florian >>> >>> >>> -- >>> >>> >>> >>> >>> >>> >>> On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote: >>> >>> >Dear all, >>> > >>> >I am using the following code below in order to retrieve the gene >>> >annotations in Gviz package : >>> >please could advise on what shall I modify in order to display the HUGO [[elided Yahoo spam]] >>> > >>> >library("GenomicFeatures") >>> >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = >>> >"knownGene") >>> >saveDb(hg18db, file="hg18db_knownGene.sqlite") >>> >txdb <-loadDb("hg18db_knownGene.sqlite") >>> >txTr <- >>> >>> >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE, geneSymbo >>> >l=TRUE,name="UCSC") >>> >plotTracks(txTr,from=start,to=end) >>> > >>> >-- bogdan >>> >------------------ >>> > >>> > [[alternative HTML version deleted]] >>> > >>> >_______________________________________________ >>> >Bioconductor mailing list >>> >Bioconductor@r-project.org >>> >https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >Search the archives: >>> >http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> >> -- >> *A model is a lie that helps you see the truth.* >> * >> * >> Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> > > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > > [[alternative HTML version deleted]] ------------------------------ _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor End of Bioconductor Digest, Vol 120, Issue 12 ********************************************* [[alternative HTML version deleted]]

Microarray qPCR Coverage Annotation Pathways Visualization Cancer Breast Organism affy • 2.4k views

ADD COMMENT • link 12.2 years ago Matthew Thornton ▴ 380

0

Entering edit mode

Matthew Thornton ▴ 380

@matthew-thornton-5564

Last seen 26 days ago

USA, Los Angeles, USC

Sent from my Droid Charge on Verizon 4GLTE "bioconductor- request@r-project.org" wrote: Send Bioconductor mailing list submissions to bioconductor@r-project.org To subscribe or unsubscribe via the World Wide Web, visit https://stat.ethz.ch/mailman/listinfo/bioconductor or, via email, send a message with subject or body 'help' to bioconductor-request@r-project.org You can reach the person managing the list at bioconductor-owner@r-project.org When replying, please edit your Subject line so it is more specific than "Re: Contents of Bioconductor digest..." Today's Topics: 1. Re: Help with Gviz \"IdeogramTrack\" and \"BioMartGeneTrackRegion\" commands (Hahne, Florian) 2. Re: question about Gviz (Hahne, Florian) 3. distanceToNearest in GenomicRanges (Tom Oates) 4. edgeR cpm filtering (John [guest]) 5. Re: runSPIA() in graphite package (array chip) 6. Re: edgeR cpm filtering (James W. MacDonald) 7. Re: distanceToNearest in GenomicRanges (James W. MacDonald) 8. Re: DataFrame bug? (Valerie Obenchain) 9. Re: question about Gviz (Tim Triche, Jr.) 10. Re: question about Gviz (Tim Triche, Jr.) 11. plotting BED intervals to TSS regions (Seb) 12. Re: edgeR cpm filtering (James W. MacDonald) 13. Re: runSPIA() in graphite package (Enrica) 14. Re: plotting BED intervals to TSS regions (Tengfei Yin) 15. Re: question about Gviz (Bogdan Tanasa) 16. Error Bars on EdgeR (Ian Sudbery) 17. Differences in results analyzing Mouse Gene 1.0-ST using oligo and affy package (Jon Toledo) 18. Re: statistical test for time course data (Juliet Hannah) 19. Re: statistical test for time course data (Steve Lianoglou) 20. Re: edgeR cpm filtering (John Sperry) 21. Re: edgeR cpm filtering (Steve Lianoglou) 22. Re: question about Gviz (Hahne, Florian) 23. Re: question about Gviz (Bogdan Tanasa) ---------------------------------------------------------------------- Message: 1 Date: Mon, 11 Feb 2013 13:48:56 +0000 From: "Hahne, Florian" <florian.hahne@novartis.com> To: Marc Carlson <mcarlson@fhcrc.org>, "bioconductor@r-project.org" <bioconductor@r-project.org> Subject: Re: [BioC] Help with Gviz \"IdeogramTrack\" and \"BioMartGeneTrackRegion\" commands Message-ID: <aabaab0b27af5c418fe809f352fb7ba40848536d@023-db3mpn1-082.023d .mgd.msft.net=""> Content-Type: text/plain; charset="iso-8859-2" Hi Marc, thanks for the hint, but I think this is not quite what I need. My problem is still on the level of genomes. UCSC for instance calls a particular version of they human genome hg19. Now there exists a similar genome in Ensembl, however they do not use the same name for it (GRCh37.p10). I made the maybe somewhat unwise attempt early on to identify genomes within Gviz by their UCSC name and to translate those names into Ensembl names if necessary. In hind sight this may not have been the smartest decision, and I should have left the translation job completely to the user. If somebody wants Ensebml gene models from BiomaRt they should make sure that they select the correct mart and dataset in the first place. I'll think about a pragmatic way out of this hole I've dug myself into. Florian -- On 1/23/13 2:18 AM, "Marc Carlson" <mcarlson@fhcrc.org> wrote: >Hi Florian, > >We actually have a small database called seqnames.db that is dedicated >to tracking these kinds of chromosome name conventions. You can see >more by looking at the help page for supportedSeqnameStyles() (and it's >friends). A quick way to see that is: > >library(Homo.sapiens) >?supportedSeqnameStyles > > >If you call the supportedSeqnameStyles() method, you will see that we >don't (yet) have an entry for zebrafish. If you were to give me one as a >tab file, I could add it to the database and it would therefore exist >for the future... The file I need is deliberately simple to make. It >should look like the example below, with as many columns as you want >there to be styles for, and each column separated by a tab. > >NCBI MSU6 >1 1 >2 2 >3 3 >4 4 > >etc. > > > Marc > > > > > >On 01/21/2013 09:15 AM, Hahne, Florian wrote: >> Hi Joseph, >> >> Regarding your first problem: UCSC has no cytoband information for any >>of >> the zebrafish genomes, and that's what is throwing the error. I think it >> should do something smarter, e.g. use the chromosome length information >> that should be available for every UCSC genome to draw at least a blank >> ideogram which could still be used to indicate the current plotting >> position. I'll have this ready in the next release of the package, and >> maybe even port this back to the current release. It seems to be more >>of a >> bug than a missing feature? >> >> Your second problem is a bit more tricky. There is no real mapping >>between >> the ensembl genome names used in the Biomart package and the UCSC ones >> which I decided to take as the defaults for the package. I tried to come >> up with my own static mapping for this, and obviously this means that >> things tend to get out of date soon. Now the zebrafish version that you >> will get in Ensembl is Zv9 (which is equivalent to danRer7), but my >> mapping is still to danRer6. This is even wrong, because what you will >>get >> from Biomart if you ask for danRer6 now is actually danRer7. Yikes. I >>will >> have to come up with a better solution for this. There should be a way >>to >> explicitly control for the Ensembl genome that you will get, and this >>is a >> simple change. Getting it right automagically is way more challenging, I >> am afraid. >> >> As a quick fix for you: >> Ask for the danRer6 genes and manually change the genome of the track: >> >>biomTrack<-BiomartGeneRegionTrack(genome="danRer6",chromosome=1,star t=1e6 >>,e >> nd=1e6+10000,name="ENSEMBL",showId=T) >> genome(biomTrack)<- "danRer7" >> >> I'll get back to you once I have a better solution. >> >> Florian >> >> > >_______________________________________________ >Bioconductor mailing list >Bioconductor@r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Message: 2 Date: Mon, 11 Feb 2013 14:03:39 +0000 From: "Hahne, Florian" <florian.hahne@novartis.com> To: Bogdan Tanasa <tanasa@gmail.com>, Bioconductor mailing list <bioconductor@r-project.org>, "bioc-sig-sequencing-request@r-project.org" <bioc-sig-sequencing-request@r-project.org> Subject: Re: [BioC] question about Gviz Message-ID: <aabaab0b27af5c418fe809f352fb7ba4084853cd@023-db3mpn1-082.023d .mgd.msft.net=""> Content-Type: text/plain; charset="us-ascii" Well, the main culprit here is really the TranscriptDB package. It does not seem to deal at all with gene symbols, so there is no way for Gviz to automatically fetch those. If you come up with a way to match the UCSC gene identifiers back to gene symbols you could stick those into the GeneRegionTrack using the 'symbol' replacement method. E.g., symbol(foo) <- mappingTable[match(transcript(foo), mappingTable$UCSCId),]) I am not sure how this mapping is supposed to be done in the Bioconductor world these days. You may be able to find a way using one of the org.db packages. Or maybe you will have to download a mapping table directly from the UCSC table browser. With the next release of Bioconductor (or already now if you are working with the devel branch), Gviz supports building tracks from a whole range of standard annotation files. You could then export the whole known.gene table from UCSC as a GTF file and import in again as a GeneRegionTrack object. Alternatively you may want to look into the BiomartGeneRegionTrack class which will fetch the gene models from Ensembl, but this includes the HUGO gene symbols. Florian -- On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote: >Dear all, > >I am using the following code below in order to retrieve the gene >annotations in Gviz package : >please could advise on what shall I modify in order to display the HUGO >gene symbol on each gene ? thanks ! > >library("GenomicFeatures") >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = >"knownGene") >saveDb(hg18db, file="hg18db_knownGene.sqlite") >txdb <-loadDb("hg18db_knownGene.sqlite") >txTr <- >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE,gene Symbo >l=TRUE,name="UCSC") >plotTracks(txTr,from=start,to=end) > >-- bogdan >------------------ > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor@r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Message: 3 Date: Mon, 11 Feb 2013 16:35:39 +0000 From: Tom Oates <toates19@gmail.com> To: "bioconductor@r-project.org" <bioconductor@r-project.org> Subject: [BioC] distanceToNearest in GenomicRanges Message-ID: <cagudn1auoxuwhwz9papqpzx3radxafisto9a- 8xhm+vyslfxhg@mail.gmail.com=""> Content-Type: text/plain Hi I am very much a learner in R in general & GenomicRanges in general I am struggling to find documentation to help me get my head around distanceToNearest in GenomicRanges If I have a GRanges object: GRanges with 6 ranges and 4 metadata columns: seqnames ranges strand | <rle> <iranges> <rle> | [1] 10 [ 96723746, 96723747] - | [2] 7 [ 13641170, 13641171] + | [3] 16 [ 17772801, 17772802] - | [4] 3 [ 88173502, 88173503] - | [5] 13 [106979682, 106979683] + | [6] 9 [104393139, 104393140] + | (You will notice that all the regions are only dinucleotides & I have removed the metadata ) I have a 2nd GRanges object which is ensembl rat transcripts as below: 39549 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> [1] 1 [5473, 16844] + | 1 ENSRNOT00000044270 [2] 1 [5526, 16968] + | 2 ENSRNOT00000049921 [3] 1 [5526, 16968] + | 3 ENSRNOT00000051735 [4] 1 [5598, 13520] + | 4 ENSRNOT00000034630 [5] 1 [8268, 16850] + | 5 ENSRNOT00000044505 [6] 1 [8316, 17577] + | 6 ENSRNOT00000042693 [7] 1 [8884, 16850] + | 7 ENSRNOT00000044187 [8] 1 [8956, 9955] + | 8 ENSRNOT00000041082 [9] 1 [9055, 17351] + | 9 ENSRNOT00000050254 If I invoke: xx<-distanceToNearestdiff.cpgs.gr, rat.transcripts, ignore.strand=F) xx DataFrame with 1133 rows and 3 columns queryHits subjectHits distance <integer> <integer> <integer> 1 1 7752 0 2 2 32166 11946 3 3 14678 25377 4 4 24286 66747 5 5 10609 34242 6 6 37076 122683 7 7 35184 0 8 8 34180 45561 9 9 19351 50156 ... ... ... ... etc I am uncertain how I would then use the xx output to gain information (i.e. tx_id, tx_name) about the feature which the function has identified as nearest? I would be happy to supply any more info as required Tom [[alternative HTML version deleted]] ------------------------------ Message: 4 Date: Mon, 11 Feb 2013 08:54:54 -0800 (PST) From: "John [guest]" <guest@bioconductor.org> To: bioconductor@r-project.org, johnsperry51@yahoo.com Subject: [BioC] edgeR cpm filtering Message-ID: <20130211165454.35588AAF78@mamba.fhcrc.org> Content-Type: text/plain; charset=iso-8859-1 All, I am a new edgeR user. I have difficulty understanding the meaning of the ???cpm??? function of edgeR package. I mean I understand that each value is divided by the total library value, and then multiplied by 1,000,000. But why 1M? I don???t understand what is the logic behind using 1M? is it 1M reads? Or bases? And why not 10M? or 1000? Any specific reason for using 1M? Another issues that I have is that how can I enforce filtering the samples that have 0 reads in one group of samples, but very large number of reads in another groups? Here is an example: Samples, Sample 1-replicate 1, Sample 1-replicate 2, Sample 2-replicate 1, Sample 2- replicate 2, Sample 3-replicate 1, Sample 3- replicate 2 Gene_X, 150,100, 270,320,0,0 I used: d_DGEList <- d_DGEList[rowSums(cpm_filtered > 5) > 2,] But still Gene_X is not filtered. Many genes with low number of reads are filtered, but very few like Gene_X are still there. I think that having many reads mapped to samples 1 and 2 qualifies it for passing the cpm filtering. How can I filter genes like this? Is it OK if I manually delete cases like this? Thank you. John -- output of sessionInfo(): > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_2.6.0 limma_3.12.0 > -- Sent via the guest posting facility at bioconductor.org. ------------------------------ Message: 5 Date: Mon, 11 Feb 2013 09:36:15 -0800 From: array chip <arrayprofile@yahoo.com> To: Enrica <enrica.calura@gmail.com> Cc: "chiara.romualdi@unipd.it" <chiara.romualdi@unipd.it>, "gabriele.sales@unipd.it" <gabriele.sales@unipd.it>, Bioconductor mailing list <bioconductor@stat.math.ethz.ch> Subject: Re: [BioC] runSPIA() in graphite package Message-ID: <1360604175.82561.YahooMailNeo@web122901.mail.ne1.yahoo.com> Content-Type: text/plain Dear Enrica, Thanks very much for checking this out! graphite is a great package for pathway analysis with its ability to analyze so many different pathway databases! One more question, does runSPIA() run against these pathways on the fly (i.e. access these databases from their server in real time), or against a pre-downloaded database? Do you have a timeline when the new release graphite package will become available? Thanks again for your great work. John ________________________________ From: Enrica <enrica.calura@gmail.com> Cc: Bioconductor mailing list <bioconductor@stat.math.ethz.ch>; "gabriele.sales@unipd.it" <gabriele.sales@unipd.it>; "chiara.romualdi@unipd.it" <chiara.romualdi@unipd.it> Sent: Monday, February 11, 2013 3:24 AM Subject: Re: runSPIA() in graphite package Dear John, the procedure you applied to analyse your data is right. However, the runSpia function filters out those pathways that have no edges or less than 5 nodes with valid edges. Your pathway, after the conversion to Entrez Gene, do not satisfy neither the conditions, having no edges. > pe<-convertIdentifiers(biocarta[["estrogen responsive protein efp controls cell cycle and breast tumors growth"]], "entrez") > pe "estrogen responsive protein efp controls cell cycle and breast tumors growth" pathway from BioCarta Number of nodes???? = 5 Number of edges???? = 0 Type of identifiers = Entrez Gene Retrieved on??????? = 2011-05-12 We equipped our runSpia() function with the checks described above in order to protect the user from unreliable results. On a separate note, we have also re-checked how that specific pathway was converted. The original BioPax2 contained some edges, but their endpoints were unfortunately nodes which could not be associated with any entrez gene. Our software, thus, dropped them from the pathway. We are working on a new graphite release based on annotation released more recently. Those includes more edges; as a result that pathway is no longer empty. Enrica Calura Hi all, I am using runSPIA() from graphite package to analyze a gene list against biocarta pathway database. > > > >> library(graphite) > >> prepareSPIA(biocarta, "biocartaEx") >> obj<-runSPIA(de=siggenes, all=allgenes, "biocartaEx") > > > >where "siggenes" are a list of 1332 significant genes selected from "allgenes" (a list of 17138 genes). The running process verbose indicated a total of 178 pathways analyzed. > > > >One of the pathways I am particularly interested is "estrogen responsive protein efp controls cell cycle and breast tumors growth": > > >> p <- biocarta[["estrogen responsive protein efp controls cell cycle and breast tumors growth"]] >> nodes(p) >[1] "CDKs"???????????? "Cyclin B1/2"????? "EntrezGene:2099"? "EntrezGene:2810"? "EntrezGene:57154" "EntrezGene:7157"? "EntrezGene:7706"?? "ubiquitin" > > >And my siggenes contains 2 of genes involved in this pathway: > > >> siggenes[c('7706','2099')] >???? 7706????? 2099 >?4.347012 -3.792425 > > >So I would assume that runSPIA will examine this pathway during the? calculation, but surprisingly I didn't see this particular way being examined. Here is the section of runSPIA() verbose output that started with "e": > > >Done pathway 42 : e2f1 destruction pathway.. >Done pathway 43 : effects of calcineurin in kera.. >Done pathway 44 : egf signaling pathway.. >Done pathway 45 : endocytotic role of ndk phosph.. >Done pathway 46 : epo signaling pathway.. >Done pathway 47 : erk and pi-3 kinase are necess.. >Done pathway 48 : erk1/erk2 mapk signaling pathw.. >Done pathway 49 : eukaryotic protein translation.. >Done pathway 50 : extrinsic prothrombin activati.. > > > >Can anyone tell me why runSPIA() missed this pathway? Attached are siggenes and allgenes for you to try.? > > >> siggenes<-as.matrix(read.table("siggenes.txt",row.names=1))[,1] > >> allgenes<-as.matrix(read.table("H:\\test\\allgenes.txt", row.names=NULL, header=T, colClasses='character'))[,1] > > >Many thanks > > >John > > [[alternative HTML version deleted]] ------------------------------ Message: 6 Date: Mon, 11 Feb 2013 12:52:13 -0500 From: "James W. MacDonald" <jmacdon@uw.edu> To: "John [guest]" <guest@bioconductor.org> Cc: bioconductor@r-project.org Subject: Re: [BioC] edgeR cpm filtering Message-ID: <51192FCD.2000700@uw.edu> Content-Type: text/plain; charset=windows-1252; format=flowed Hi John, On 2/11/2013 11:54 AM, John [guest] wrote: > All, > > I am a new edgeR user. I have difficulty understanding the meaning of the ???cpm??? function of edgeR package. I mean I understand that each value is divided by the total library value, and then multiplied by 1,000,000. But why 1M? I don???t understand what is the logic behind using 1M? is it 1M reads? Or bases? And why not 10M? or 1000? Any specific reason for using 1M? It's reads. You are inputting read counts per gene, so by definition edgeR knows nothing about bases. As for why 1M, I don't know for sure, but would imagine it is because a 'reasonable' sized RNA-Seq experiment is based on somewhere around 10M reads or so. In other words, say you have a sample with 10M reads, and one gene has 60 reads that align. You would have 6 cpm, which looks pretty low, and it is. However you could do statistics on it based on a discrete distribution like the negative binomial. If you used 10M to normalize, the cp10M would be 0.6, so you would have to throw that gene out. If you used cpk, it would be 6000 cpk and it would look like you had lots of reads when in fact you only had 60. So cpm looks like a reasonable compromise to me. > > Another issues that I have is that how can I enforce filtering the samples that have 0 reads in one group of samples, but very large number of reads in another groups? Here is an example: > > Samples, Sample 1-replicate 1, Sample 1-replicate 2, Sample 2-replicate 1, Sample 2- replicate 2, Sample 3-replicate 1, Sample 3- replicate 2 > Gene_X, 150,100, 270,320,0,0 > > I used: > > d_DGEList<- d_DGEList[rowSums(cpm_filtered> 5)> 2,] > > But still Gene_X is not filtered. Many genes with low number of reads are filtered, but very few like Gene_X are still there. I think that having many reads mapped to samples 1 and 2 qualifies it for passing the cpm filtering. How can I filter genes like this? Is it OK if I manually delete cases like this? Setting aside the logic of removing these genes (which IMO is missing the point of looking for differential expression), note the logic of your filter. Let's break it down by section: rowSums(cpm_filtered > 5) This gives the count of samples where the cpm value is > 5. In the case you mention, this would be four. rowSums(cpm_filtered > 5) > 2 This results in TRUE if the count of samples for a given gene with a cpm value > 5 is greater than two. So you are saying that at least two samples have to have a cpm > 5. In the instance you want to filter, the count is 4, and 4 > 2, so this passes the filter. So what you apparently want are rows where the cpm is greater than some value in ALL samples, not just two of them, so you would want to change the > 2 part to == 6. Note that this doesn't really make any sense. You are in effect saying that you are completely uninterested in any genes that appear not to be expressed in one of your samples, but that might be highly expressed in one or more of the other samples. That to me is actually really interesting, and I am not sure why you would want to filter out any gene that fulfills that criterion. Best, Jim > > Thank you. > John > > > -- output of sessionInfo(): > >> sessionInfo() > R version 2.15.0 (2012-03-30) > > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > > locale: > > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > other attached packages: > > [1] edgeR_2.6.0 limma_3.12.0 > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 ------------------------------ Message: 7 Date: Mon, 11 Feb 2013 13:08:11 -0500 From: "James W. MacDonald" <jmacdon@uw.edu> To: Tom Oates <toates19@gmail.com> Cc: "bioconductor@r-project.org" <bioconductor@r-project.org> Subject: Re: [BioC] distanceToNearest in GenomicRanges Message-ID: <5119338B.9080702@uw.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hi Tom, On 2/11/2013 11:35 AM, Tom Oates wrote: > Hi > I am very much a learner in R in general& GenomicRanges in general > I am struggling to find documentation to help me get my head around > distanceToNearest in GenomicRanges > If I have a GRanges object: > > GRanges with 6 ranges and 4 metadata columns: > seqnames ranges strand | > <rle> <iranges> <rle> | > [1] 10 [ 96723746, 96723747] - | > [2] 7 [ 13641170, 13641171] + | > [3] 16 [ 17772801, 17772802] - | > [4] 3 [ 88173502, 88173503] - | > [5] 13 [106979682, 106979683] + | > [6] 9 [104393139, 104393140] + | > > (You will notice that all the regions are only dinucleotides& I have > removed the metadata ) > > I have a 2nd GRanges object which is ensembl rat transcripts as below: > 39549 ranges and 2 metadata columns: > seqnames ranges strand | tx_id > tx_name > <rle> <iranges> <rle> |<integer> > <character> > [1] 1 [5473, 16844] + | 1 > ENSRNOT00000044270 > [2] 1 [5526, 16968] + | 2 > ENSRNOT00000049921 > [3] 1 [5526, 16968] + | 3 > ENSRNOT00000051735 > [4] 1 [5598, 13520] + | 4 > ENSRNOT00000034630 > [5] 1 [8268, 16850] + | 5 > ENSRNOT00000044505 > [6] 1 [8316, 17577] + | 6 > ENSRNOT00000042693 > [7] 1 [8884, 16850] + | 7 > ENSRNOT00000044187 > [8] 1 [8956, 9955] + | 8 > ENSRNOT00000041082 > [9] 1 [9055, 17351] + | 9 > ENSRNOT00000050254 > > > If I invoke: > xx<-distanceToNearestdiff.cpgs.gr, rat.transcripts, ignore.strand=F) > > xx > DataFrame with 1133 rows and 3 columns > queryHits subjectHits distance > <integer> <integer> <integer> > 1 1 7752 0 > 2 2 32166 11946 > 3 3 14678 25377 > 4 4 24286 66747 > 5 5 10609 34242 > 6 6 37076 122683 > 7 7 35184 0 > 8 8 34180 45561 > 9 9 19351 50156 > ... ... ... ... > etc > > I am uncertain how I would then use the xx output to gain information (i.e. > tx_id, tx_name) about the feature which the function has identified as > nearest? > I would be happy to supply any more info as required The subjectHits column gives the row of your transcript GRanges object that matches the corresponding query row. I am assuming here that the 'diff.cpgs.gr' GRanges object is longer than 6? Anyway, here is an example using your data and the TxDb.Mmusculus.UCSC.mm10.knownGene package: > x GRanges with 6 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr10 [ 96723746, 96723747] * [2] chr7 [ 13641170, 13641171] * [3] chr16 [ 17772801, 17772802] * [4] chr3 [ 88173502, 88173503] * [5] chr13 [106979682, 106979683] * [6] chr9 [104393139, 104393140] * --- > y <- transcripts(TxDb.Mmusculus.UCSC.mm10.knownGene) > xx <- distanceToNearest(x, y, ignore.strand=F) > xx DataFrame with 6 rows and 3 columns queryHits subjectHits distance <integer> <integer> <integer> 1 1 4514 100935 2 2 45653 0 3 3 19383 0 4 4 34197 0 5 5 14383 0 6 6 54212 8108 > y[xx[,2],] GRanges with 6 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> [1] chr10 [ 96617001, 96622811] + | 33419 uc007gww.2 [2] chr7 [ 13623967, 13670807] + | 21400 uc012ezp.1 [3] chr16 [ 17759663, 17779206] + | 48288 uc007ylz.1 [4] chr3 [ 88171560, 88177785] - | 10107 uc008puf.2 [5] chr13 [106963757, 107022114] - | 43288 uc007rue.1 [6] chr9 [104361832, 104385031] + | 29956 uc009rhp.1 --- seqlengths: chr1 chr2 ... chrUn_JH584304 195471971 182113224 ... 114452 Best, Jim > Tom > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 ------------------------------ Message: 8 Date: Mon, 11 Feb 2013 11:16:17 -0800 From: Valerie Obenchain <vobencha@fhcrc.org> To: Charles Berry <ccberry@ucsd.edu> Cc: bioconductor@stat.math.ethz.ch Subject: Re: [BioC] DataFrame bug? Message-ID: <51194381.4000809@fhcrc.org> Content-Type: text/plain; charset="ISO-8859-1"; format=flowed Hi Charles, Thanks for reporting the bug. Now fixed in 1.17.32 devel and 1.16.5 in release. Valerie library(RUnit) DF <- DataFrame("A"=1:5,row.names=letters[1:5]) df <- data.frame("A"=1:5,row.names=letters[1:5]) > checkIdentical(DF['a','B'] <- 1, df['a','B'] <- 1) [1] TRUE DF <- DataFrame("A"=1:5,row.names=letters[1:5]) df <- data.frame("A"=1:5,row.names=letters[1:5]) > checkIdentical(DF['c','B'] <- 1, df['c','B'] <- 1) [1] TRUE > sessionInfo() ... other attached packages: [1] RUnit_0.4.26 IRanges_1.16.5 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] parallel_2.15.1 stats4_2.15.1 On 02/09/2013 12:26 PM, Charles Berry wrote: >>> > > Subset replacement like this > > df['a','c2']<-3 > > depends on the position of row 'a' when 'c2' is a new column. > > > Row 'a' first: > >> df1 <- DataFrame(c1=1:2,row.names=c("a","b")) >> df1['a','c2'] <- 3 >> df1 > DataFrame with 2 rows and 2 columns > c1 c2 > <integer> <numeric> > a 1 3 > b 2 3 > > Row 'a' second: > >> df2 <- DataFrame(c1=1:2,row.names= rev( c("a","b") )) >> df2['a','c2'] <- 3 >> df2 > DataFrame with 2 rows and 2 columns > c1 c2 > <integer> <numeric> > b 1 NA > a 2 3 > > FWIW, the latter - but not the former - agrees with "the 'DataFrame' behaves > very similarly to 'data.frame'"(from the help page). > >> sessionInfo() > R version 2.15.2 (2012-10-26) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] IRanges_1.16.4 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] parallel_2.15.2 stats4_2.15.2 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > ------------------------------ Message: 9 Date: Mon, 11 Feb 2013 11:34:04 -0800 From: "Tim Triche, Jr." <tim.triche@gmail.com> To: "Hahne, Florian" <florian.hahne@novartis.com> Cc: "bioc-sig-sequencing-request@r-project.org" <bioc-sig-sequencing-request@r-project.org>, Bogdan Tanasa <tanasa@gmail.com>, Bioconductor mailing list <bioconductor@r-project.org> Subject: Re: [BioC] question about Gviz Message-ID: <cac+n9bw3pqvp2urshqh3fhgtjhf9v+eguvsv94aj5yxwjxrqiq@mail.gmail.com> Content-Type: text/plain Well, here's one approach. I'll start from the constructed 'txtr' track. ## see ?select for more on the following ## library(Homo.sapiens) symbolMappings <- select(Homo.sapiens, cols=c('UCSCKG','ENTREZID','SYMBOL'), keys=symbol(txtr), keytype='UCSCKG') ## "pork out" the mappings table so that duplicates expand ## symbolMappings <- symbolMappings[match(symbol(txtr), symbolMappings$UCSCKG),] ## if NAs can be accepted as keys, this next step might be unnecessary ## however, it seems that not all UCSC known genes have a Hugo symbol? ## txtr <- txtr[ hasHugoSymbol ] symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] plotTracks(txtr) That works. On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian <florian.hahne@novartis.com>wrote: > Well, the main culprit here is really the TranscriptDB package. It does > not seem to deal at all with gene symbols, so there is no way for Gviz to > automatically fetch those. If you come up with a way to match the UCSC > gene identifiers back to gene symbols you could stick those into the > GeneRegionTrack using the 'symbol' replacement method. E.g., > symbol(foo) <- mappingTable[match(transcript(foo), mappingTable$UCSCId),]) > I am not sure how this mapping is supposed to be done in the Bioconductor > world these days. You may be able to find a way using one of the org.db > packages. Or maybe you will have to download a mapping table directly from > the UCSC table browser. > With the next release of Bioconductor (or already now if you are working > with the devel branch), Gviz supports building tracks from a whole range > of standard annotation files. You could then export the whole known.gene > table from UCSC as a GTF file and import in again as a GeneRegionTrack > object. Alternatively you may want to look into the BiomartGeneRegionTrack > class which will fetch the gene models from Ensembl, but this includes the > HUGO gene symbols. > Florian > > > -- > > > > > > > On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote: > > >Dear all, > > > >I am using the following code below in order to retrieve the gene > >annotations in Gviz package : > >please could advise on what shall I modify in order to display the HUGO [[elided Yahoo spam]] > > > >library("GenomicFeatures") > >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = > >"knownGene") > >saveDb(hg18db, file="hg18db_knownGene.sqlite") > >txdb <-loadDb("hg18db_knownGene.sqlite") > >txTr <- > >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE,ge neSymbo > >l=TRUE,name="UCSC") > >plotTracks(txTr,from=start,to=end) > > > >-- bogdan > >------------------ > > > > [[alternative HTML version deleted]] > > > >_______________________________________________ > >Bioconductor mailing list > >Bioconductor@r-project.org > >https://stat.ethz.ch/mailman/listinfo/bioconductor > >Search the archives: > >http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]] ------------------------------ Message: 10 Date: Mon, 11 Feb 2013 11:39:15 -0800 From: "Tim Triche, Jr." <tim.triche@gmail.com> To: "Hahne, Florian" <florian.hahne@novartis.com> Cc: "bioc-sig-sequencing-request@r-project.org" <bioc-sig-sequencing-request@r-project.org>, Bogdan Tanasa <tanasa@gmail.com>, Bioconductor mailing list <bioconductor@r-project.org> Subject: Re: [BioC] question about Gviz Message-ID: <cac+n9bvd4vrf8q+1z1ydc7zphuqpy5doyiu7kkr4eu8u6sqcua@mail.gmail.com> Content-Type: text/plain Oops, critical step omitted. See below. ## see ?select for more on the following ## library(Homo.sapiens) symbolMappings <- select(Homo.sapiens, cols=c('UCSCKG','ENTREZID','SYMBOL'), keys=symbol(txtr), keytype='UCSCKG') ## "pork out" the mappings table so that duplicates expand ## symbolMappings <- symbolMappings[match(symbol(txtr), symbolMappings$UCSCKG),] ## if NAs could be accepted as keys, this next step might be unnecessary ## hasHugoSymbol <- !is.na(symbolMappings$SYMBOL) ## however, it seems that not all UCSC known genes have a Hugo symbol? ## txtr <- txtr[ hasHugoSymbol ] symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] plotTracks(txtr) The above (minus comments) is what I used to generate a test result. On Mon, Feb 11, 2013 at 11:34 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > Well, here's one approach. I'll start from the constructed 'txtr' track. > > ## see ?select for more on the following > ## > library(Homo.sapiens) > symbolMappings <- select(Homo.sapiens, > > cols=c('UCSCKG','ENTREZID','SYMBOL'), > keys=symbol(txtr), > keytype='UCSCKG') > > ## "pork out" the mappings table so that duplicates expand > ## > symbolMappings <- symbolMappings[match(symbol(txtr), > > symbolMappings$UCSCKG),] > > ## if NAs can be accepted as keys, this next step might be unnecessary > ## however, it seems that not all UCSC known genes have a Hugo symbol? > ## > txtr <- txtr[ hasHugoSymbol ] > symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] > plotTracks(txtr) > > That works. > > > > > On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian < > florian.hahne@novartis.com> wrote: > >> Well, the main culprit here is really the TranscriptDB package. It does >> not seem to deal at all with gene symbols, so there is no way for Gviz to >> automatically fetch those. If you come up with a way to match the UCSC >> gene identifiers back to gene symbols you could stick those into the >> GeneRegionTrack using the 'symbol' replacement method. E.g., >> symbol(foo) <- mappingTable[match(transcript(foo), mappingTable$UCSCId),]) >> I am not sure how this mapping is supposed to be done in the Bioconductor >> world these days. You may be able to find a way using one of the org.db >> packages. Or maybe you will have to download a mapping table directly from >> the UCSC table browser. >> With the next release of Bioconductor (or already now if you are working >> with the devel branch), Gviz supports building tracks from a whole range >> of standard annotation files. You could then export the whole known.gene >> table from UCSC as a GTF file and import in again as a GeneRegionTrack >> object. Alternatively you may want to look into the BiomartGeneRegionTrack >> class which will fetch the gene models from Ensembl, but this includes the >> HUGO gene symbols. >> Florian >> >> >> -- >> >> >> >> >> >> >> On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote: >> >> >Dear all, >> > >> >I am using the following code below in order to retrieve the gene >> >annotations in Gviz package : >> >please could advise on what shall I modify in order to display the HUGO [[elided Yahoo spam]] >> > >> >library("GenomicFeatures") >> >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = >> >"knownGene") >> >saveDb(hg18db, file="hg18db_knownGene.sqlite") >> >txdb <-loadDb("hg18db_knownGene.sqlite") >> >txTr <- >> >> >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE,g eneSymbo >> >l=TRUE,name="UCSC") >> >plotTracks(txTr,from=start,to=end) >> > >> >-- bogdan >> >------------------ >> > >> > [[alternative HTML version deleted]] >> > >> >_______________________________________________ >> >Bioconductor mailing list >> >Bioconductor@r-project.org >> >https://stat.ethz.ch/mailman/listinfo/bioconductor >> >Search the archives: >> >http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]] ------------------------------ Message: 11 Date: Mon, 11 Feb 2013 14:45:59 -0500 From: Seb <seba.bat@gmail.com> To: bioconductor@r-project.org Subject: [BioC] plotting BED intervals to TSS regions Message-ID: <canstik=yenwn1-_drhpwjhaxczjj=c5jo_sghauecjcenvtnvq@mail.gmail.com> Content-Type: text/plain; charset=ISO-8859-1 hi gurus i have several BED files containing chromosome #, start and end that correspond to overlapping regions of different ChIP Seq experiments. this part was done with Galaxy. i also have a file containing TSS coordinates +/- 10kb. what i want to do is to create a plot to show how many of my overlapping intervals fall within the TSS regions, and, if they do, have on the X axis the distance to the TSS and on the Y axis the number of regions that overlap that certain part of the TSS ...i am a bit confused about how to do this tho...i looked in galaxy [[elided Yahoo spam]] thanks ------------------------------ Message: 12 Date: Mon, 11 Feb 2013 17:10:05 -0500 From: "James W. MacDonald" <jmacdon@uw.edu> Cc: "Bioconductor@r-project.org" <bioconductor@r-project.org> Subject: Re: [BioC] edgeR cpm filtering Message-ID: <51196C3D.6000009@uw.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hi John, Please don't take things off-list. Even if you are not a subscriber (and if you are using BioC stuff you should be, and you can always stop delivery but remain a subscriber), I believe that replying to an existing thread will work. I don't see any zero counts causing a problem. Using the example for cpm() as a starting point, and modifying to have a set with zero counts, I get this: > y [,1] [,2] [,3] [,4] [1,] 1 2 14 11 [2,] 11 25 1 26 [3,] 1 22 2 19 [4,] 5 6 15 6 [5,] 0 0 1 5 > d <-DGEList(counts=y, lib.size=1001:1004, group=factor(c(1,1,2,2))) > d <- estimateCommonDisp(d) > d <- estimateTagwiseDisp(d) > topTags(exactTest(d)) Comparison of groups: 2-1 logFC logCPM PValue FDR 1 2.9550376 12.76964 6.109348e-05 0.0003054674 5 4.6421574 10.54712 1.283343e-01 0.3208358043 4 0.9149142 12.96222 2.668415e-01 0.4447357815 2 -0.4149407 13.93933 8.539261e-01 0.9783799675 3 -0.1325391 13.42121 9.783800e-01 0.9783799675 So the sample with zero counts (sample 5), is the second row in the topTags() output, and it has no problem computing a logFC value. Best, Jim On 2/11/2013 4:30 PM, John Sperry wrote: > Hi again Jim, > > One more thing, in microarray days, people used to add a small value, > let say 1 to the 0 values to avoid non-sense fold changes. It's not > the case in NGS any more right? so it's not possible to do that in > edgeR, right? that's why I was thinking about filtering out with cpm. > > Thanks, > John > > > > -------------------------------------------------------------------- ---- > *To:* "jmacdon@uw.edu" <jmacdon@uw.edu> > *Sent:* Monday, February 11, 2013 1:47 PM > *Subject:* [BioC] edgeR cpm filtering > > Hi Jim, > > I'm very new to edgeR and BioC. I couldn't reply to your post in BioC, > so here is my post in an email :D > > I still cannot see why 1M is chosen, but I appreciate your explanations. > > About the cpm filtering, the reason that I chose '> 2' for 3 samples > with each having 2 replicates was that I though edgeR must be smart > enough to figure out that when I say more than 5 reads per million for [[elided Yahoo spam]] [[elided Yahoo spam]] > > as for the reason for wanting to get rid of the sample 3 with 2 > replicates that have 0 reads mapped to them, I don't want them, > because, they cause the logFC to become huge non-sense numbers! i > guess dividing be 0 causes the problem! so I thought for not seeing > weird values when the significant genes are selected, it's better to > get rid of genes that have 0 reads mapped to any of their groups. Does > it make sense? > > d_DGEList<- d_DGEList[rowSums(cpm_filtered> 5)> 2,] > > Thanks, > John > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 ------------------------------ Message: 13 Date: Mon, 11 Feb 2013 12:24:58 +0100 From: Enrica <enrica.calura@gmail.com> Cc: "chiara.romualdi@unipd.it" <chiara.romualdi@unipd.it>, "gabriele.sales@unipd.it" <gabriele.sales@unipd.it>, Bioconductor mailing list <bioconductor@stat.math.ethz.ch> Subject: Re: [BioC] runSPIA() in graphite package Message-ID: <capkcaw0ixxcknwxkj4vue25bfootjx2rd5flsgmgs9qk96hbsg@mail.gmail.com> Content-Type: text/plain Dear John, the procedure you applied to analyse your data is right. However, the runSpia function filters out those pathways that have no edges or less than 5 nodes with valid edges. Your pathway, after the conversion to Entrez Gene, do not satisfy neither the conditions, having no edges. > pe<-convertIdentifiers(biocarta[["estrogen responsive protein efp controls cell cycle and breast tumors growth"]], "entrez") > pe "estrogen responsive protein efp controls cell cycle and breast tumors growth" pathway from BioCarta Number of nodes = 5 Number of edges = 0 Type of identifiers = Entrez Gene Retrieved on = 2011-05-12 We equipped our runSpia() function with the checks described above in order to protect the user from unreliable results. On a separate note, we have also re-checked how that specific pathway was converted. The original BioPax2 contained some edges, but their endpoints were unfortunately nodes which could not be associated with any entrez gene. Our software, thus, dropped them from the pathway. We are working on a new graphite release based on annotation released more recently. Those includes more edges; as a result that pathway is no longer empty. Enrica Calura > Hi all, I am using runSPIA() from graphite package to analyze a gene list > against biocarta pathway database. > > > library(graphite) > > prepareSPIA(biocarta, "biocartaEx") > > obj<-runSPIA(de=siggenes, all=allgenes, "biocartaEx") > > where "siggenes" are a list of 1332 significant genes selected from > "allgenes" (a list of 17138 genes). The running process verbose indicated a > total of 178 pathways analyzed. > > One of the pathways I am particularly interested is "estrogen responsive > protein efp controls cell cycle and breast tumors growth": > > > p <- biocarta[["estrogen responsive protein efp controls cell cycle and > breast tumors growth"]] > > nodes(p) > [1] "CDKs" "Cyclin B1/2" "EntrezGene:2099" > "EntrezGene:2810" "EntrezGene:57154" "EntrezGene:7157" > "EntrezGene:7706" "ubiquitin" > > And my siggenes contains 2 of genes involved in this pathway: > > > siggenes[c('7706','2099')] > 7706 2099 > 4.347012 -3.792425 > > So I would assume that runSPIA will examine this pathway during the > calculation, but surprisingly I didn't see this particular way being > examined. Here is the section of runSPIA() verbose output that started with > "e": > > Done pathway 42 : e2f1 destruction pathway.. > Done pathway 43 : effects of calcineurin in kera.. > Done pathway 44 : egf signaling pathway.. > Done pathway 45 : endocytotic role of ndk phosph.. > Done pathway 46 : epo signaling pathway.. > Done pathway 47 : erk and pi-3 kinase are necess.. > Done pathway 48 : erk1/erk2 mapk signaling pathw.. > Done pathway 49 : eukaryotic protein translation.. > Done pathway 50 : extrinsic prothrombin activati.. > > Can anyone tell me why runSPIA() missed this pathway? Attached are > siggenes and allgenes for you to try. > > > siggenes<-as.matrix(read.table("siggenes.txt",row.names=1))[,1] > > allgenes<-as.matrix(read.table("H:\\test\\allgenes.txt", row.names=NULL, > header=T, colClasses='character'))[,1] > > Many thanks > > John > > [[alternative HTML version deleted]] ------------------------------ Message: 14 Date: Mon, 11 Feb 2013 16:41:25 -0600 From: Tengfei Yin <yintengfei@gmail.com> To: Seb <seba.bat@gmail.com> Cc: bioconductor@r-project.org Subject: Re: [BioC] plotting BED intervals to TSS regions Message-ID: <capjsq9kn0rsob3zno6k=+ok1r=qdyk_l3=wapmefci1r6fuhzq@mail.gmail.com> Content-Type: text/plain Hi Seb, I guess before visualization you need to get the summary statistics ready first, I got one idea, maybe you could give a try, and I assume the count you want is based on a per base resolution 1. 'import' function in package rtracklayer to import your BED files and TSS files as GRanges object into R, ready for analysis. 2. ?findOverlaps in package 'GenomicRanges', there are some utilities to summarize the overlapping between your BED and TSS region. Then you can easily get an answer to your first question: how many falls within your TSS region defined. 3. compute coverage for your imported BED intervals(GRanges object) , that will give you an Rle/RleList. check 'coverage' function in package IRanges/GenomicRanges. 4. then get views on this coverage data with you tss position object. please check 'Views' method in GenomicRanges/IRanges. This step is important, better make sure your TSS have equal length window, for example 20kb in your case. 5. Covert this Views to a matrix by using as.matrix on previous views object. You will get a matrix, whose columns correspond to position around tss, from -10kb to 10kb, each row correspond to one tss region. If you want to summarize over all tss, just use colSums over this matrix. 6. After you get this summary data, you can use any graphic package in R to visualize this data as lines and relabel the x-axis position from -10k to 10k. As far as I know, there is no direct way in bioc to import/aggregate/visualize your BED/TSS file together with one or two commands to get what you want yet ... HTH Tengfei On Mon, Feb 11, 2013 at 1:45 PM, Seb <seba.bat@gmail.com> wrote: > hi gurus > > i have several BED files containing chromosome #, start and end that > correspond to overlapping regions of different ChIP Seq experiments. > this part was done with Galaxy. > > i also have a file containing TSS coordinates +/- 10kb. > > what i want to do is to create a plot to show how many of my > overlapping intervals fall within the TSS regions, and, if they do, > have on the X axis the distance to the TSS and on the Y axis the > number of regions that overlap that certain part of the TSS > > ...i am a bit confused about how to do this tho...i looked in galaxy [[elided Yahoo spam]] > > thanks > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Tengfei Yin MCDB PhD student 1620 Howe Hall, 2274, Iowa State University Ames, IA,50011-2274 [[alternative HTML version deleted]] ------------------------------ Message: 15 Date: Mon, 11 Feb 2013 16:08:00 -0800 From: Bogdan Tanasa <tanasa@gmail.com> To: <ttriche@usc.edu> Cc: bioconductor@stat.math.ethz.ch, bioc-sig-sequencing@r-project.org Subject: Re: [BioC] question about Gviz Message-ID: <ca+jem02xqyqhkzpfagboltdrv_z+l6703jshpmpceq8rvqzrdg@mail.gmail.com> Content-Type: text/plain Thanks, Tim ...will do it accordingly. On Mon, Feb 11, 2013 at 3:49 PM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > So the problem (at least as regards my initial code snippet) is that there > is no mapping in the Homo.sapiens OrganismDb for either uc002yjx.1 > or uc010gkz.1 and thus no symbol (since symbol-less features are dropped). > That's sort of odd since, it seems, these are NRIP1 isoforms. > > I noticed that p15 and p16 transcripts appeared to be missing when I > looked at chr9 earlier. This might be an issue where mappings are missing > from Homo.sapiens / org.Hs.eg.db, in which case a posting to the list would > be good. > > Meanwhile, if you want to keep the UCSC names of the genes for which there > is not a Hugo mapping in the symbol table, the following replacement for > subsetting txtr should do it (I tested this): > > ## don't subset txtr. instead, > symbol(txtr)[ hasHugoSymbol ] <- symbolMappings$SYMBOL[hasHugoSymbol] > > Then proceed as before. I get the attached output from running > > plotTracks(txtr,from=start,to=end) > > since the NRIP1 isoforms, not being mapped to Hugo (?!), stay UCSC IDs. > > Do you mind if I cc: the list on this? And/or bring it up with Marc/Herve? > > --t > > > > On Mon, Feb 11, 2013 at 3:19 PM, Bogdan Tanasa <tanasa@gmail.com> wrote: > >> Hi Tim, >> >> I followed your code, in the following way (please see below); >> plotTracks(txtr) works; however, when I try to plot specific regions >> ( eg plotTracks(txtr,chr="chr21", start=15245000,end=15520000) ) it does >> not show the correct region : I suspect it may be an issue in my code, but >> not very sure where ... if you have 2-3 minutes, would appreciate your >> help. thanks ! >> >> library(Homo.sapiens) >> library("ggplot2") >> library(Gviz) >> library("GenomicFeatures") >> library("ggplot2") >> library("rtracklayer") >> library("TxDb.Hsapiens.UCSC.hg18.knownGene") >> >> chr<-"chr21" >> start<-15245000 >> end <-15520000 >> >> hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = >> "knownGene") >> saveDb(hg18db, file="hg18db_knownGene.sqlite") >> txdb <-loadDb("hg18db_knownGene.sqlite") >> txtr <- >> GeneRegionTrack(txdb,genome="hg18",chromosome="chr21",showId=TRUE,g eneSymbol=TRUE,name="UCSC") >> plotTracks(txtr,from=start,to=end) >> >> >> >> symbolMappings <- select(Homo.sapiens, >> cols=c('UCSCKG','ENTREZID','SYMBOL'), >> keys=symbol(txtr), keytype='UCSCKG') >> symbolMappings <- symbolMappings[match(symbol(txtr), >> symbolMappings$UCSCKG),] >> >> hasHugoSymbol <- !is.na(symbolMappings$SYMBOL) >> >> txtr <- txtr[ hasHugoSymbol ] >> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] >> >> plotTracks(txtr) >> plotTracks(txtr,from=start,to=end) >> >> >> >> On Mon, Feb 11, 2013 at 11:39 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: >> >>> Oops, critical step omitted. See below. >>> >>> ## see ?select for more on the following >>> ## >>> library(Homo.sapiens) >>> symbolMappings <- select(Homo.sapiens, >>> >>> cols=c('UCSCKG','ENTREZID','SYMBOL'), >>> keys=symbol(txtr), >>> keytype='UCSCKG') >>> >>> ## "pork out" the mappings table so that duplicates expand >>> ## >>> symbolMappings <- symbolMappings[match(symbol(txtr), >>> >>> symbolMappings$UCSCKG),] >>> >>> ## if NAs could be accepted as keys, this next step might be unnecessary >>> ## >>> hasHugoSymbol <- !is.na(symbolMappings$SYMBOL) >>> >>> ## however, it seems that not all UCSC known genes have a Hugo symbol? >>> ## >>> txtr <- txtr[ hasHugoSymbol ] >>> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] >>> plotTracks(txtr) >>> >>> >>> The above (minus comments) is what I used to generate a test result. >>> >>> >>> >>> >>> On Mon, Feb 11, 2013 at 11:34 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: >>> >>>> Well, here's one approach. I'll start from the constructed 'txtr' >>>> track. >>>> >>>> ## see ?select for more on the following >>>> ## >>>> library(Homo.sapiens) >>>> symbolMappings <- select(Homo.sapiens, >>>> >>>> cols=c('UCSCKG','ENTREZID','SYMBOL'), >>>> keys=symbol(txtr), >>>> keytype='UCSCKG') >>>> >>>> ## "pork out" the mappings table so that duplicates expand >>>> ## >>>> symbolMappings <- symbolMappings[match(symbol(txtr), >>>> >>>> symbolMappings$UCSCKG),] >>>> >>>> ## if NAs can be accepted as keys, this next step might be unnecessary >>>> ## however, it seems that not all UCSC known genes have a Hugo symbol? >>>> ## >>>> txtr <- txtr[ hasHugoSymbol ] >>>> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] >>>> plotTracks(txtr) >>>> >>>> That works. >>>> >>>> >>>> >>>> >>>> On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian < >>>> florian.hahne@novartis.com> wrote: >>>> >>>>> Well, the main culprit here is really the TranscriptDB package. It does >>>>> not seem to deal at all with gene symbols, so there is no way for Gviz >>>>> to >>>>> automatically fetch those. If you come up with a way to match the UCSC >>>>> gene identifiers back to gene symbols you could stick those into the >>>>> GeneRegionTrack using the 'symbol' replacement method. E.g., >>>>> symbol(foo) <- mappingTable[match(transcript(foo), >>>>> mappingTable$UCSCId),]) >>>>> I am not sure how this mapping is supposed to be done in the >>>>> Bioconductor >>>>> world these days. You may be able to find a way using one of the org.db >>>>> packages. Or maybe you will have to download a mapping table directly >>>>> from >>>>> the UCSC table browser. >>>>> With the next release of Bioconductor (or already now if you are >>>>> working >>>>> with the devel branch), Gviz supports building tracks from a whole >>>>> range >>>>> of standard annotation files. You could then export the whole >>>>> known.gene >>>>> table from UCSC as a GTF file and import in again as a GeneRegionTrack >>>>> object. Alternatively you may want to look into the >>>>> BiomartGeneRegionTrack >>>>> class which will fetch the gene models from Ensembl, but this includes >>>>> the >>>>> HUGO gene symbols. >>>>> Florian >>>>> >>>>> >>>>> -- >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote: >>>>> >>>>> >Dear all, >>>>> > >>>>> >I am using the following code below in order to retrieve the gene >>>>> >annotations in Gviz package : >>>>> >please could advise on what shall I modify in order to display the >>>>> HUGO [[elided Yahoo spam]] >>>>> > >>>>> >library("GenomicFeatures") >>>>> >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = >>>>> >"knownGene") >>>>> >saveDb(hg18db, file="hg18db_knownGene.sqlite") >>>>> >txdb <-loadDb("hg18db_knownGene.sqlite") >>>>> >txTr <- >>>>> >>>>> >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRU E,geneSymbo >>>>> >l=TRUE,name="UCSC") >>>>> >plotTracks(txTr,from=start,to=end) >>>>> > >>>>> >-- bogdan >>>>> >------------------ >>>>> > >>>>> > [[alternative HTML version deleted]] >>>>> > >>>>> >_______________________________________________ >>>>> >Bioconductor mailing list >>>>> >Bioconductor@r-project.org >>>>> >https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> >Search the archives: >>>>> >http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> >>>> >>>> -- >>>> *A model is a lie that helps you see the truth.* >>>> * >>>> * >>>> Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>>> >>> >>> >>> >>> -- >>> *A model is a lie that helps you see the truth.* >>> * >>> * >>> Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>> >> >> > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > [[alternative HTML version deleted]] ------------------------------ Message: 16 Date: Tue, 12 Feb 2013 00:41:02 +0000 From: Ian Sudbery <ian.sudbery@dpag.ox.ac.uk> To: "bioconductor@stat.math.ethz.ch" <bioconductor@stat.math.ethz.ch> Subject: [BioC] Error Bars on EdgeR Message-ID: <55D5020998905943B8A25CAFED69B2C219DCBCCF@MBX01.ad.oak.ox.ac.uk> Content-Type: text/plain Hi All, I was wondering if anyone had any idea how one might put error bars on the log fold changes calculated by edgeR? I'd like to do this so I can show examples of differentially expressed genes on a bar plot, and show them in comparison to qPCR validation. I feel like one should be able to get the variance from the dispersion: CV^2 = 1/mu + BCV^2 = 1/mu + dispersion So: variance = mu + mu^2*dispersion ? This could either be used to show the variance for each of the two conditions, or used to calculate an error for the fold change. but I'm a bit stuck as to how to actually implement this. Any ideas appreciated, Yours, Ian [[alternative HTML version deleted]] ------------------------------ Message: 17 Date: Tue, 12 Feb 2013 01:40:23 +0100 From: Jon Toledo <tintin_jb@hotmail.com> To: <bioconductor@r-project.org> Subject: [BioC] Differences in results analyzing Mouse Gene 1.0-ST using oligo and affy package Message-ID: <bay161-w16c78e461d37c096df6db698090@phx.gbl> Content-Type: text/plain Dear Bioconductor List, I have repeated my workflow using the affy and oligo package alternatively followed by the limma package to analyze and experiment with two conditions using Mouse Gene 1.0-ST chips and I arrive to different results. I have been looking online and found that at least for the for the Mouse Gene 1.1-ST the oligo package is preferred, but not anything clear for Mouse Gene 1.0-ST . I attach below my code and session info: A) For running affy: library(affy) library(pd.mogene.1.0.st.v1) library(mogene10sttranscriptcluster.db) cellist=list.celfiles(full.names=T) cellistD14=grep("CD1...14",cellist,value=T) esetD14<- justRMA(filenames=gsub("\\./","",cellistD14)) B) For runnign oligo: library(oligo) library(pd.mogene.1.0.st.v1) library(mogene10sttranscriptcluster.db) cellist=list.celfiles(full.names=T) cellistD14=grep("CD1...14",cellist,value=T) rsetD14=read.celfiles(cellistD14) esetpD14=rma(rsetD14,target="probeset") esetD14=rma(rsetD14,target="core") C) Finally running the same analysis: designD14<-read.delim('designD14.txt') contrast.matrix=model.matrix(~treatment,data=designD14) library(limma) fit <- lmFit(esetD14, contrast.matrix) fit <- eBayes(fit,proportion=0.01) m1=topTable(fit, coef="treatment",number=1e8,adjust.method="BH") m1=m1[,c("ID","logFC","P.Value","adj.P.Val")] m1=cbind(m1[,1:2],FCTreat=2**m1[,2],PTreat=m1[,3],adj.PTreat=m1[,4]) > sessionInfo() (This is for the affy run) R version 2.15.2 (2012-10-26) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] mogene10stv1cdf_2.11.0 limma_3.14.4 mogene10sttranscriptcluster.db_8.0.1 [4] org.Mm.eg.db_2.8.0 AnnotationDbi_1.20.3 pd.mogene.1.0.st.v1_3.8.0 [7] oligo_1.22.0 oligoClasses_1.20.0 RSQLite_0.11.2 [10] DBI_0.2-5 affy_1.36.1 Biobase_2.18.0 [13] BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] affxparser_1.30.2 affyio_1.26.0 BiocInstaller_1.8.3 Biostrings_2.26.3 bit_1.1-9 codetools_0.2-8 [7] ff_2.2-10 foreach_1.4.0 GenomicRanges_1.10.6 IRanges_1.16.4 iterators_1.0.6 parallel_2.15.2 [13] preprocessCore_1.20.0 splines_2.15.2 stats4_2.15.2 tools_2.15.2 zlibbioc_1.4.0 Thank you very much J Toledo UPenn USA [[alternative HTML version deleted]] ------------------------------ Message: 18 Date: Mon, 11 Feb 2013 22:39:23 -0500 From: Juliet Hannah <juliet.hannah@gmail.com> To: Bioconductor mailing list <bioconductor@r-project.org> Subject: Re: [BioC] statistical test for time course data Message-ID: <calzuzrqt0ndnwgyekfphgc4z-bpj_mvhwxcvnijq- g12_4zszq@mail.gmail.com=""> Content-Type: text/plain; charset=ISO-8859-1 Hi Gordon, I've been curious about this as well. Can limma model the following situation? We have two treatments and multiple time points. We are interested in if the mean profile for each treatment differs over time, treating time continuously. The measurements are over time for each individual so we have to account for correlations within individuals. Ideally, I would like to allow for a random intercept and a random slope (possible quadratic) if needed. EDGE uses splines, so that would be nice as well. I am aware of duplicateCorrelation, Is this the way to proceed? I'll post a reproducible example for guidance if you indicate that the above is possible. Thanks for your time and for your work on limma, which I use frequently. On Fri, Feb 1, 2013 at 6:46 PM, Gordon K Smyth <smyth@wehi.edu.au> wrote: > Dear Chris, > > Yes, limma can easily test for a difference at one point, or can test for a > significant change over the whole time course like EDGE. > > I don't understand your experiment well enough to give more specific advice. > You would need to tell us your experimental design, in terms of the targets > frame, and exactly what you want to test. > > Best wishes > Gordon > >> Date: Fri, 1 Feb 2013 12:19:26 +0900 >> From: chris Jhon <cjhon217@gmail.com> >> To: Richard Friedman <friedman@cancercenter.columbia.edu> >> Cc: Bioconductor mailing list <bioconductor@r-project.org> >> Subject: Re: [BioC] statistical test for time course data >> >> Hi Richard, >> >> Thank you for help. >> In my data ,i have one point which i think it is different from other >> points and i would like to test statistical significance of the difference >> of this point. >> Your suggestion means that there is no direct function in R that i can >> use,i have to use a package which implement an algorithm. >> If so, i think edgeR can do the same analysis too,Am i right? >> >> Best Reagards, >> Chris >> >> On Thu, Jan 31, 2013 at 11:53 PM, Richard Friedman < >> friedman@cancercenter.columbia.edu> wrote: >> >>> Dear Chris, >>> >>> Limma can be used to test between time points >>> treating each time point as a categorical variable. >>> The program "EDGE" from the Storey lab, can test whether >>> there is significant change over a whole time course. >>> >>> http://www.ncbi.nlm.nih.gov/pubmed/16357033 >>> >>> with hopes that the above helps, >>> Rich >>> Richard A. Friedman, PhD >>> Associate Research Scientist, >>> Biomedical Informatics Shared Resource >>> Herbert Irving Comprehensive Cancer Center (HICCC) >>> Lecturer, >>> Department of Biomedical Informatics (DBMI) >>> Educational Coordinator, >>> Center for Computational Biology and Bioinformatics (C2B2)/ >>> National Center for Multiscale Analysis of Genomic Networks (MAGNet)/ >>> Columbia Initiative in Systems Biology >>> Room 824 >>> Irving Cancer Research Center >>> Columbia University >>> 1130 St. Nicholas Ave >>> New York, NY 10032 >>> (212)851-4765 (voice) >>> friedman@cancercenter.columbia.edu >>> http://cancercenter.columbia.edu/~friedman/ >>> >>> "Complex numbers! Ha! Ha! There is nothing weirder >>> than imaginary numbers. Architects don't need to know >>> complex numbers. Whenever I get a negative root for >>> an area, I throw it out. And don't talk to me about >>> quaternions. I am not going into computer animation." >>> -Rose Friedman, age 16 >>> >>> >>> On Jan 30, 2013, at 11:43 PM, chris Jhon wrote: >>> >>> Hi All, >>> >>> I have data at different time points for time course experiment. >>> I have a response for each time point and i would like to test whether >>> the >>> difference between response of two time points is statistically >>> significant >>> or not. >>> my data is linear plot where response on y axis and time on x axis. >>> >>> what statistical test shall i use? >>> >>> >>> I appreciate any help. >>> >>> Best Regards, >>> Chris >>> > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:4}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Message: 19 Date: Mon, 11 Feb 2013 23:31:31 -0500 From: Steve Lianoglou <mailinglist.honeypot@gmail.com> To: Juliet Hannah <juliet.hannah@gmail.com> Cc: Bioconductor mailing list <bioconductor@r-project.org> Subject: Re: [BioC] statistical test for time course data Message-ID: <caha9mcosvlmerwxije_qnrpbqr0tc0qkb=y=kbmxeosf- 9xo+g@mail.gmail.com=""> Content-Type: text/plain; charset=ISO-8859-1 Hi Juliet, On Mon, Feb 11, 2013 at 10:39 PM, Juliet Hannah <juliet.hannah@gmail.com> wrote: > Hi Gordon, > > I've been curious about this as well. Can limma model the following > situation? [snip] I think you'll find a thread that appeared a bit after the one you are replying to: http://thread.gmane.org/gmane.science.biology.informatics.conductor/46 188 If you look in Section 8.6 of the limmaUserGuide (in development): http://bioconductor.org/packages/2.12/bioc/vignettes/limma/inst/doc/us ersguide.pdf You'll find a section on time course analysis. 8.6.2, in particular, uses splines. HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact ------------------------------ Message: 20 Date: Mon, 11 Feb 2013 21:45:34 -0800 (PST) To: "bioconductor@r-project.org" <bioconductor@r-project.org> Subject: Re: [BioC] edgeR cpm filtering Message-ID: <1360647934.77288.YahooMailNeo@web162106.mail.bf1.yahoo.com> Content-Type: text/plain Hi Jim, I posted my first question as a guest and just became a member today, so it???s a bit of learning curve for me on how to use the list. Sorry for the email and I hope I am using the list correctly now by emailing this address. About my edgeR question: Thank you for your elaboration and also the example. I can see that in your example, you have samples with 0 reads and logFC is okay, and that???s what logically should be. However, in my dataset, I see many cases of logFC of ~144269489 (or negative of ~ this value). When I check the genes, I see that these are the cases where all the replicates of one samples have 0 reads mapped to them, whereas the other groups of samples have many reads. These are the cases that cpm didn???t filter them. That???s why I tried to use more restrictive cpm filtering to get rid of these genes. Any thoughts on why this non-interpretive logFC cases happen are greatly appreciated. Thanks, John Hi John, Please don't take things off-list. Even if you are not a subscriber (and if you are using BioC stuff you should be, and you can always stop delivery but remain a subscriber), I believe that replying to an existing thread will work. I don't see any zero counts causing a problem. Using the example for cpm() as a starting point, and modifying to have a set with zero counts, I get this: > y ?? ?? [,1] [,2] [,3] [,4] [1,]?? ?? 1?? ?? 2?? 14?? 11 [2,]?? 11?? 25?? ?? 1?? 26 [3,]?? ?? 1?? 22?? ?? 2?? 19 [4,]?? ?? 5?? ?? 6?? 15?? ?? 6 [5,]?? ?? 0?? ?? 0?? ?? 1?? ?? 5 > d <-DGEList(counts=y, lib.size=1001:1004, group=factor(c(1,1,2,2))) > d <- estimateCommonDisp(d) > d <- estimateTagwiseDisp(d) > topTags(exactTest(d)) Comparison of groups:?? 2-1 ?? ?? ?? logFC?? logCPM?? ?? ?? PValue?? ?? ?? ?? ?? FDR 1?? 2.9550376 12.76964 6.109348e-05 0.0003054674 5?? 4.6421574 10.54712 1.283343e-01 0.3208358043 4?? 0.9149142 12.96222 2.668415e-01 0.4447357815 2 -0.4149407 13.93933 8.539261e-01 0.9783799675 3 -0.1325391 13.42121 9.783800e-01 0.9783799675 So the sample with zero counts (sample 5), is the second row in the topTags() output, and it has no problem computing a logFC value. Best, Jim On 2/11/2013 4:30 PM, John Sperry wrote: > Hi again Jim, > > One more thing, in microarray days, people used to add a small value, let say 1 to the 0 values to avoid non-sense fold changes. It's not the case in NGS any more right? so it's not possible to do that in edgeR, right? that's why I was thinking about filtering out with cpm. > > Thanks, > John > > > > -------------------------------------------------------------------- ---- > *To:* "jmacdon@uw.edu" <jmacdon@uw.edu> > *Sent:* Monday, February 11, 2013 1:47 PM > *Subject:* [BioC] edgeR cpm filtering > > Hi Jim, > > I'm very new to edgeR and BioC. I couldn't reply to your post in BioC, so here is my post in an email :D > > I still cannot see why 1M is chosen, but I appreciate your explanations. > > About the cpm filtering, the reason that I chose '> 2' for 3 samples with each having 2 replicates was that I though edgeR must be smart enough to figure out that when I say more than 5 reads per million for [[elided Yahoo spam]] [[elided Yahoo spam]] > > as for the reason for wanting to get rid of the sample 3 with 2 replicates that have 0 reads mapped to them, I don't want them, because, they cause the logFC to become huge non-sense numbers! i guess dividing be 0 causes the problem! so I thought for not seeing weird values when the significant genes are selected, it's better to get rid of genes that have 0 reads mapped to any of their groups. Does it make sense? > > d_DGEList<- d_DGEList[rowSums(cpm_filtered> 5)> 2,] > > Thanks, > John > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 [[alternative HTML version deleted]] ------------------------------ Message: 21 Date: Tue, 12 Feb 2013 01:40:55 -0500 From: Steve Lianoglou <mailinglist.honeypot@gmail.com> Cc: "bioconductor@r-project.org" <bioconductor@r-project.org> Subject: Re: [BioC] edgeR cpm filtering Message-ID: <caha9mcnreg0zpcykxpodjiemkjc0e9e7mawhftnbuqd1c1sk1q@mail.gmail.com> Content-Type: text/plain; charset=windows-1252 Hi John, e: > Hi Jim, > > I posted my first question as a guest and just became a > member today, so it?s a bit of learning curve for me on how to use the list. Sorry > for the email and I hope I am using the list correctly now by emailing this > address. > > About my edgeR question: > > Thank you for your elaboration and also the example. I can > see that in your example, you have samples with 0 reads and logFC is okay, and > that?s what logically should be. However, in my dataset, I see many cases of > logFC of ~144269489 (or negative of ~ this value). When I check the genes, I see > that these are the cases where all the replicates of one samples have 0 reads > mapped to them, whereas the other groups of samples have many reads. These are > the cases that cpm didn?t filter them. That?s why I tried to use more restrictive > cpm filtering to get rid of these genes. > > Any thoughts on why this non-interpretive logFC cases happen > are greatly appreciated. >From the looks of your previous (original) email, it looks like you've landed w/ a (very) strange bioconductor setup given the version of R you are using -- you are using R-2.15.0, with edgeR version 2.6.0 Please update to the latest version of R (2.15.2 -- cuz, hey it's a good idea), but more importantly, the latest version of edgeR: http://bioconductor.org/packages/2.11/bioc/html/edgeR.html You should be installing your bioconductor packages w/ the BiocInstaller package/biocLite functionality. Unwinding yourself from the bizarre package versions you have running to ensure that you are on the latest bioc release train (2.11) might get a bit tricky, and sometimes the easiest way is to blow out your R install and start from scratch. After you do that, do you still get such extreme logFC's? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact ------------------------------ Message: 22 Date: Tue, 12 Feb 2013 07:55:21 +0000 From: "Hahne, Florian" <florian.hahne@novartis.com> To: "ttriche@usc.edu" <ttriche@usc.edu> Cc: "bioc-sig-sequencing-request@r-project.org" <bioc-sig-sequencing-request@r-project.org>, Bogdan Tanasa <tanasa@gmail.com>, Bioconductor mailing list <bioconductor@r-project.org> Subject: Re: [BioC] question about Gviz Message-ID: <aabaab0b27af5c418fe809f352fb7ba408486bed@023-db3mpn1-082.023d .mgd.msft.net=""> Content-Type: text/plain Cool, thanks. Just learned something again today :-) I will take a closer look at the Homo.sapiens package (and the other new organism annotation packages as well) to figure out how to integrate this more closely into the Gviz package. One should not have to jump through all these hoops to get gene symbols in there if all the necessary information is already on your fingertips. Florian -- From: "<tim triche="">", "Jr." <tim.triche@gmail.com<mailto:tim.triche@gmail.com>> Reply-To: "ttriche@usc.edu<mailto:ttriche@usc.edu>" <ttriche@usc.edu<mailto:ttriche@usc.edu>> Date: Monday, February 11, 2013 8:39 PM To: NIBR <florian.hahne@novartis.com<mailto:florian.hahne@novartis.com>> Cc: Bogdan Tanasa <tanasa@gmail.com<mailto:tanasa@gmail.com>>, Bioconductor mailing list <bioconductor@r-project.org<mailto:bioconductor@r-project.org>>, "bioc-sig-sequencing-request@r-project.org<mailto:bioc-sig-sequencing- request@r-project.org="">" <bioc-sig-sequencing- request@r-project.org<mailto:bioc-sig-sequencing-="" request@r-project.org="">> Subject: Re: [BioC] question about Gviz Oops, critical step omitted. See below. ## see ?select for more on the following ## library(Homo.sapiens) symbolMappings <- select(Homo.sapiens, cols=c('UCSCKG','ENTREZID','SYMBOL'), keys=symbol(txtr), keytype='UCSCKG') ## "pork out" the mappings table so that duplicates expand ## symbolMappings <- symbolMappings[match(symbol(txtr), symbolMappings$UCSCKG),] ## if NAs could be accepted as keys, this next step might be unnecessary ## hasHugoSymbol <- !is.na<http: is.na="">(symbolMappings$SYMBOL) ## however, it seems that not all UCSC known genes have a Hugo symbol? ## txtr <- txtr[ hasHugoSymbol ] symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] plotTracks(txtr) The above (minus comments) is what I used to generate a test result. On Mon, Feb 11, 2013 at 11:34 AM, Tim Triche, Jr. <tim.triche@gmail.com<mailto:tim.triche@gmail.com>> wrote: Well, here's one approach. I'll start from the constructed 'txtr' track. ## see ?select for more on the following ## library(Homo.sapiens) symbolMappings <- select(Homo.sapiens, cols=c('UCSCKG','ENTREZID','SYMBOL'), keys=symbol(txtr), keytype='UCSCKG') ## "pork out" the mappings table so that duplicates expand ## symbolMappings <- symbolMappings[match(symbol(txtr), symbolMappings$UCSCKG),] ## if NAs can be accepted as keys, this next step might be unnecessary ## however, it seems that not all UCSC known genes have a Hugo symbol? ## txtr <- txtr[ hasHugoSymbol ] symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] plotTracks(txtr) That works. On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian <florian.hahne@novartis.com<mailto:florian.hahne@novartis.com>> wrote: Well, the main culprit here is really the TranscriptDB package. It does not seem to deal at all with gene symbols, so there is no way for Gviz to automatically fetch those. If you come up with a way to match the UCSC gene identifiers back to gene symbols you could stick those into the GeneRegionTrack using the 'symbol' replacement method. E.g., symbol(foo) <- mappingTable[match(transcript(foo), mappingTable$UCSCId),]) I am not sure how this mapping is supposed to be done in the Bioconductor world these days. You may be able to find a way using one of the org.db packages. Or maybe you will have to download a mapping table directly from the UCSC table browser. With the next release of Bioconductor (or already now if you are working with the devel branch), Gviz supports building tracks from a whole range of standard annotation files. You could then export the whole known.gene table from UCSC as a GTF file and import in again as a GeneRegionTrack object. Alternatively you may want to look into the BiomartGeneRegionTrack class which will fetch the gene models from Ensembl, but this includes the HUGO gene symbols. Florian -- On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com<mailto:tanasa@gmail.com>> wrote: >Dear all, > >I am using the following code below in order to retrieve the gene >annotations in Gviz package : >please could advise on what shall I modify in order to display the HUGO [[elided Yahoo spam]] > >library("GenomicFeatures") >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = >"knownGene") >saveDb(hg18db, file="hg18db_knownGene.sqlite") >txdb <-loadDb("hg18db_knownGene.sqlite") >txTr <- >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE,gene Symbo >l=TRUE,name="UCSC") >plotTracks(txTr,from=start,to=end) > >-- bogdan >------------------ > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor@r-project.org<mailto:bioconductor@r-project.org> >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- A model is a lie that helps you see the truth. Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> -- A model is a lie that helps you see the truth. Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]] ------------------------------ Message: 23 Date: Tue, 12 Feb 2013 00:04:23 -0800 From: Bogdan Tanasa <tanasa@gmail.com> To: "Hahne, Florian" <florian.hahne@novartis.com> Cc: Bioconductor mailing list <bioconductor@r-project.org>, "bioc-sig-sequencing-request@r-project.org" <bioc-sig-sequencing-request@r-project.org> Subject: Re: [BioC] question about Gviz Message-ID: <ca+jem018udk357wghxa14wp_ncz3qwaesr- cn00pwkrt5w497w@mail.gmail.com=""> Content-Type: text/plain Thank you again gentlemen for your kind help ... On Mon, Feb 11, 2013 at 11:55 PM, Hahne, Florian <florian.hahne@novartis.com> wrote: > Cool, thanks. Just learned something again today :-) > I will take a closer look at the Homo.sapiens package (and the other new > organism annotation packages as well) to figure out how to integrate this > more closely into the Gviz package. One should not have to jump through all > these hoops to get gene symbols in there if all the necessary information > is already on your fingertips. > Florian > -- > > > From: "<tim triche="">", "Jr." <tim.triche@gmail.com> > Reply-To: "ttriche@usc.edu" <ttriche@usc.edu> > Date: Monday, February 11, 2013 8:39 PM > To: NIBR <florian.hahne@novartis.com> > Cc: Bogdan Tanasa <tanasa@gmail.com>, Bioconductor mailing list < > bioconductor@r-project.org>, "bioc-sig-sequencing- request@r-project.org" < > bioc-sig-sequencing-request@r-project.org> > Subject: Re: [BioC] question about Gviz > > Oops, critical step omitted. See below. > > ## see ?select for more on the following > ## > library(Homo.sapiens) > symbolMappings <- select(Homo.sapiens, > > cols=c('UCSCKG','ENTREZID','SYMBOL'), > keys=symbol(txtr), > keytype='UCSCKG') > > ## "pork out" the mappings table so that duplicates expand > ## > symbolMappings <- symbolMappings[match(symbol(txtr), > > symbolMappings$UCSCKG),] > > ## if NAs could be accepted as keys, this next step might be unnecessary > ## > hasHugoSymbol <- !is.na(symbolMappings$SYMBOL) > > ## however, it seems that not all UCSC known genes have a Hugo symbol? > ## > txtr <- txtr[ hasHugoSymbol ] > symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] > plotTracks(txtr) > > > The above (minus comments) is what I used to generate a test result. > > > > > On Mon, Feb 11, 2013 at 11:34 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > >> Well, here's one approach. I'll start from the constructed 'txtr' track. >> >> ## see ?select for more on the following >> ## >> library(Homo.sapiens) >> symbolMappings <- select(Homo.sapiens, >> >> cols=c('UCSCKG','ENTREZID','SYMBOL'), >> keys=symbol(txtr), >> keytype='UCSCKG') >> >> ## "pork out" the mappings table so that duplicates expand >> ## >> symbolMappings <- symbolMappings[match(symbol(txtr), >> >> symbolMappings$UCSCKG),] >> >> ## if NAs can be accepted as keys, this next step might be unnecessary >> ## however, it seems that not all UCSC known genes have a Hugo symbol? >> ## >> txtr <- txtr[ hasHugoSymbol ] >> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ] >> plotTracks(txtr) >> >> That works. >> >> >> >> >> On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian < >> florian.hahne@novartis.com> wrote: >> >>> Well, the main culprit here is really the TranscriptDB package. It does >>> not seem to deal at all with gene symbols, so there is no way for Gviz to >>> automatically fetch those. If you come up with a way to match the UCSC >>> gene identifiers back to gene symbols you could stick those into the >>> GeneRegionTrack using the 'symbol' replacement method. E.g., >>> symbol(foo) <- mappingTable[match(transcript(foo), >>> mappingTable$UCSCId),]) >>> I am not sure how this mapping is supposed to be done in the Bioconductor >>> world these days. You may be able to find a way using one of the org.db >>> packages. Or maybe you will have to download a mapping table directly >>> from >>> the UCSC table browser. >>> With the next release of Bioconductor (or already now if you are working >>> with the devel branch), Gviz supports building tracks from a whole range >>> of standard annotation files. You could then export the whole known.gene >>> table from UCSC as a GTF file and import in again as a GeneRegionTrack >>> object. Alternatively you may want to look into the >>> BiomartGeneRegionTrack >>> class which will fetch the gene models from Ensembl, but this includes >>> the >>> HUGO gene symbols. >>> Florian >>> >>> >>> -- >>> >>> >>> >>> >>> >>> >>> On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote: >>> >>> >Dear all, >>> > >>> >I am using the following code below in order to retrieve the gene >>> >annotations in Gviz package : >>> >please could advise on what shall I modify in order to display the HUGO [[elided Yahoo spam]] >>> > >>> >library("GenomicFeatures") >>> >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename = >>> >"knownGene") >>> >saveDb(hg18db, file="hg18db_knownGene.sqlite") >>> >txdb <-loadDb("hg18db_knownGene.sqlite") >>> >txTr <- >>> >>> >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE, geneSymbo >>> >l=TRUE,name="UCSC") >>> >plotTracks(txTr,from=start,to=end) >>> > >>> >-- bogdan >>> >------------------ >>> > >>> > [[alternative HTML version deleted]] >>> > >>> >_______________________________________________ >>> >Bioconductor mailing list >>> >Bioconductor@r-project.org >>> >https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >Search the archives: >>> >http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> >> -- >> *A model is a lie that helps you see the truth.* >> * >> * >> Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> > > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > > [[alternative HTML version deleted]] ------------------------------ _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor End of Bioconductor Digest, Vol 120, Issue 12 ********************************************* [[alternative HTML version deleted]]

ADD COMMENT • link 12.2 years ago Matthew Thornton ▴ 380

Login before adding your answer.