Entering edit mode
Sent from my Droid Charge on Verizon 4GLTE "bioconductor-
request@r-project.org" wrote:
Send Bioconductor mailing list submissions to
bioconductor@r-project.org
To subscribe or unsubscribe via the World Wide Web, visit
https://stat.ethz.ch/mailman/listinfo/bioconductor
or, via email, send a message with subject or body 'help' to
bioconductor-request@r-project.org
You can reach the person managing the list at
bioconductor-owner@r-project.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Bioconductor digest..."
Today's Topics:
1. Re: Help with Gviz \"IdeogramTrack\" and
\"BioMartGeneTrackRegion\" commands (Hahne, Florian)
2. Re: question about Gviz (Hahne, Florian)
3. distanceToNearest in GenomicRanges (Tom Oates)
4. edgeR cpm filtering (John [guest])
5. Re: runSPIA() in graphite package (array chip)
6. Re: edgeR cpm filtering (James W. MacDonald)
7. Re: distanceToNearest in GenomicRanges (James W. MacDonald)
8. Re: DataFrame bug? (Valerie Obenchain)
9. Re: question about Gviz (Tim Triche, Jr.)
10. Re: question about Gviz (Tim Triche, Jr.)
11. plotting BED intervals to TSS regions (Seb)
12. Re: edgeR cpm filtering (James W. MacDonald)
13. Re: runSPIA() in graphite package (Enrica)
14. Re: plotting BED intervals to TSS regions (Tengfei Yin)
15. Re: question about Gviz (Bogdan Tanasa)
16. Error Bars on EdgeR (Ian Sudbery)
17. Differences in results analyzing Mouse Gene 1.0-ST using
oligo and affy package (Jon Toledo)
18. Re: statistical test for time course data (Juliet Hannah)
19. Re: statistical test for time course data (Steve Lianoglou)
20. Re: edgeR cpm filtering (John Sperry)
21. Re: edgeR cpm filtering (Steve Lianoglou)
22. Re: question about Gviz (Hahne, Florian)
23. Re: question about Gviz (Bogdan Tanasa)
----------------------------------------------------------------------
Message: 1
Date: Mon, 11 Feb 2013 13:48:56 +0000
From: "Hahne, Florian" <florian.hahne@novartis.com>
To: Marc Carlson <mcarlson@fhcrc.org>, "bioconductor@r-project.org"
<bioconductor@r-project.org>
Subject: Re: [BioC] Help with Gviz \"IdeogramTrack\" and
\"BioMartGeneTrackRegion\" commands
Message-ID:
<aabaab0b27af5c418fe809f352fb7ba40848536d@023-db3mpn1-082.023d .mgd.msft.net="">
Content-Type: text/plain; charset="iso-8859-2"
Hi Marc,
thanks for the hint, but I think this is not quite what I need. My
problem
is still on the level of genomes. UCSC for instance calls a particular
version of they human genome hg19. Now there exists a similar genome
in
Ensembl, however they do not use the same name for it (GRCh37.p10). I
made
the maybe somewhat unwise attempt early on to identify genomes within
Gviz
by their UCSC name and to translate those names into Ensembl names if
necessary. In hind sight this may not have been the smartest decision,
and
I should have left the translation job completely to the user. If
somebody
wants Ensebml gene models from BiomaRt they should make sure that
they
select the correct mart and dataset in the first place.
I'll think about a pragmatic way out of this hole I've dug myself
into.
Florian
--
On 1/23/13 2:18 AM, "Marc Carlson" <mcarlson@fhcrc.org> wrote:
>Hi Florian,
>
>We actually have a small database called seqnames.db that is
dedicated
>to tracking these kinds of chromosome name conventions. You can see
>more by looking at the help page for supportedSeqnameStyles() (and
it's
>friends). A quick way to see that is:
>
>library(Homo.sapiens)
>?supportedSeqnameStyles
>
>
>If you call the supportedSeqnameStyles() method, you will see that we
>don't (yet) have an entry for zebrafish. If you were to give me one
as a
>tab file, I could add it to the database and it would therefore exist
>for the future... The file I need is deliberately simple to make.
It
>should look like the example below, with as many columns as you want
>there to be styles for, and each column separated by a tab.
>
>NCBI MSU6
>1 1
>2 2
>3 3
>4 4
>
>etc.
>
>
> Marc
>
>
>
>
>
>On 01/21/2013 09:15 AM, Hahne, Florian wrote:
>> Hi Joseph,
>>
>> Regarding your first problem: UCSC has no cytoband information for
any
>>of
>> the zebrafish genomes, and that's what is throwing the error. I
think it
>> should do something smarter, e.g. use the chromosome length
information
>> that should be available for every UCSC genome to draw at least a
blank
>> ideogram which could still be used to indicate the current plotting
>> position. I'll have this ready in the next release of the package,
and
>> maybe even port this back to the current release. It seems to be
more
>>of a
>> bug than a missing feature?
>>
>> Your second problem is a bit more tricky. There is no real mapping
>>between
>> the ensembl genome names used in the Biomart package and the UCSC
ones
>> which I decided to take as the defaults for the package. I tried to
come
>> up with my own static mapping for this, and obviously this means
that
>> things tend to get out of date soon. Now the zebrafish version that
you
>> will get in Ensembl is Zv9 (which is equivalent to danRer7), but my
>> mapping is still to danRer6. This is even wrong, because what you
will
>>get
>> from Biomart if you ask for danRer6 now is actually danRer7. Yikes.
I
>>will
>> have to come up with a better solution for this. There should be a
way
>>to
>> explicitly control for the Ensembl genome that you will get, and
this
>>is a
>> simple change. Getting it right automagically is way more
challenging, I
>> am afraid.
>>
>> As a quick fix for you:
>> Ask for the danRer6 genes and manually change the genome of the
track:
>>
>>biomTrack<-BiomartGeneRegionTrack(genome="danRer6",chromosome=1,star
t=1e6
>>,e
>> nd=1e6+10000,name="ENSEMBL",showId=T)
>> genome(biomTrack)<- "danRer7"
>>
>> I'll get back to you once I have a better solution.
>>
>> Florian
>>
>>
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor@r-project.org
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Message: 2
Date: Mon, 11 Feb 2013 14:03:39 +0000
From: "Hahne, Florian" <florian.hahne@novartis.com>
To: Bogdan Tanasa <tanasa@gmail.com>, Bioconductor mailing list
<bioconductor@r-project.org>,
"bioc-sig-sequencing-request@r-project.org"
<bioc-sig-sequencing-request@r-project.org>
Subject: Re: [BioC] question about Gviz
Message-ID:
<aabaab0b27af5c418fe809f352fb7ba4084853cd@023-db3mpn1-082.023d .mgd.msft.net="">
Content-Type: text/plain; charset="us-ascii"
Well, the main culprit here is really the TranscriptDB package. It
does
not seem to deal at all with gene symbols, so there is no way for Gviz
to
automatically fetch those. If you come up with a way to match the UCSC
gene identifiers back to gene symbols you could stick those into the
GeneRegionTrack using the 'symbol' replacement method. E.g.,
symbol(foo) <- mappingTable[match(transcript(foo),
mappingTable$UCSCId),])
I am not sure how this mapping is supposed to be done in the
Bioconductor
world these days. You may be able to find a way using one of the
org.db
packages. Or maybe you will have to download a mapping table directly
from
the UCSC table browser.
With the next release of Bioconductor (or already now if you are
working
with the devel branch), Gviz supports building tracks from a whole
range
of standard annotation files. You could then export the whole
known.gene
table from UCSC as a GTF file and import in again as a GeneRegionTrack
object. Alternatively you may want to look into the
BiomartGeneRegionTrack
class which will fetch the gene models from Ensembl, but this includes
the
HUGO gene symbols.
Florian
--
On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote:
>Dear all,
>
>I am using the following code below in order to retrieve the gene
>annotations in Gviz package :
>please could advise on what shall I modify in order to display the
HUGO
>gene symbol on each gene ? thanks !
>
>library("GenomicFeatures")
>hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename =
>"knownGene")
>saveDb(hg18db, file="hg18db_knownGene.sqlite")
>txdb <-loadDb("hg18db_knownGene.sqlite")
>txTr <-
>GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE,gene
Symbo
>l=TRUE,name="UCSC")
>plotTracks(txTr,from=start,to=end)
>
>-- bogdan
>------------------
>
> [[alternative HTML version deleted]]
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor@r-project.org
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Message: 3
Date: Mon, 11 Feb 2013 16:35:39 +0000
From: Tom Oates <toates19@gmail.com>
To: "bioconductor@r-project.org" <bioconductor@r-project.org>
Subject: [BioC] distanceToNearest in GenomicRanges
Message-ID:
<cagudn1auoxuwhwz9papqpzx3radxafisto9a- 8xhm+vyslfxhg@mail.gmail.com="">
Content-Type: text/plain
Hi
I am very much a learner in R in general & GenomicRanges in general
I am struggling to find documentation to help me get my head around
distanceToNearest in GenomicRanges
If I have a GRanges object:
GRanges with 6 ranges and 4 metadata columns:
seqnames ranges strand |
<rle> <iranges> <rle> |
[1] 10 [ 96723746, 96723747] - |
[2] 7 [ 13641170, 13641171] + |
[3] 16 [ 17772801, 17772802] - |
[4] 3 [ 88173502, 88173503] - |
[5] 13 [106979682, 106979683] + |
[6] 9 [104393139, 104393140] + |
(You will notice that all the regions are only dinucleotides & I have
removed the metadata )
I have a 2nd GRanges object which is ensembl rat transcripts as below:
39549 ranges and 2 metadata columns:
seqnames ranges strand | tx_id
tx_name
<rle> <iranges> <rle> | <integer>
<character>
[1] 1 [5473, 16844] + | 1
ENSRNOT00000044270
[2] 1 [5526, 16968] + | 2
ENSRNOT00000049921
[3] 1 [5526, 16968] + | 3
ENSRNOT00000051735
[4] 1 [5598, 13520] + | 4
ENSRNOT00000034630
[5] 1 [8268, 16850] + | 5
ENSRNOT00000044505
[6] 1 [8316, 17577] + | 6
ENSRNOT00000042693
[7] 1 [8884, 16850] + | 7
ENSRNOT00000044187
[8] 1 [8956, 9955] + | 8
ENSRNOT00000041082
[9] 1 [9055, 17351] + | 9
ENSRNOT00000050254
If I invoke:
xx<-distanceToNearestdiff.cpgs.gr, rat.transcripts, ignore.strand=F)
xx
DataFrame with 1133 rows and 3 columns
queryHits subjectHits distance
<integer> <integer> <integer>
1 1 7752 0
2 2 32166 11946
3 3 14678 25377
4 4 24286 66747
5 5 10609 34242
6 6 37076 122683
7 7 35184 0
8 8 34180 45561
9 9 19351 50156
... ... ... ...
etc
I am uncertain how I would then use the xx output to gain information
(i.e.
tx_id, tx_name) about the feature which the function has identified as
nearest?
I would be happy to supply any more info as required
Tom
[[alternative HTML version deleted]]
------------------------------
Message: 4
Date: Mon, 11 Feb 2013 08:54:54 -0800 (PST)
From: "John [guest]" <guest@bioconductor.org>
To: bioconductor@r-project.org, johnsperry51@yahoo.com
Subject: [BioC] edgeR cpm filtering
Message-ID: <20130211165454.35588AAF78@mamba.fhcrc.org>
Content-Type: text/plain; charset=iso-8859-1
All,
I am a new edgeR user. I have difficulty understanding the meaning of
the ???cpm??? function of edgeR package. I mean I understand that
each value is divided by the total library value, and then multiplied
by 1,000,000. But why 1M? I don???t understand what is the logic
behind using 1M? is it 1M reads? Or bases? And why not 10M? or 1000?
Any specific reason for using 1M?
Another issues that I have is that how can I enforce filtering the
samples that have 0 reads in one group of samples, but very large
number of reads in another groups? Here is an example:
Samples, Sample 1-replicate 1, Sample 1-replicate 2, Sample
2-replicate 1, Sample 2- replicate 2, Sample 3-replicate 1, Sample 3-
replicate 2
Gene_X, 150,100, 270,320,0,0
I used:
d_DGEList <- d_DGEList[rowSums(cpm_filtered > 5) > 2,]
But still Gene_X is not filtered. Many genes with low number of reads
are filtered, but very few like Gene_X are still there. I think that
having many reads mapped to samples 1 and 2 qualifies it for passing
the cpm filtering. How can I filter genes like this? Is it OK if I
manually delete cases like this?
Thank you.
John
-- output of sessionInfo():
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_2.6.0 limma_3.12.0
>
--
Sent via the guest posting facility at bioconductor.org.
------------------------------
Message: 5
Date: Mon, 11 Feb 2013 09:36:15 -0800
From: array chip <arrayprofile@yahoo.com>
To: Enrica <enrica.calura@gmail.com>
Cc: "chiara.romualdi@unipd.it" <chiara.romualdi@unipd.it>,
"gabriele.sales@unipd.it" <gabriele.sales@unipd.it>,
Bioconductor
mailing list <bioconductor@stat.math.ethz.ch>
Subject: Re: [BioC] runSPIA() in graphite package
Message-ID:
<1360604175.82561.YahooMailNeo@web122901.mail.ne1.yahoo.com>
Content-Type: text/plain
Dear Enrica,
Thanks very much for checking this out! graphite is a great package
for pathway analysis with its ability to analyze so many different
pathway databases!
One more question, does runSPIA() run against these pathways on the
fly (i.e. access these databases from their server in real time), or
against a pre-downloaded database?
Do you have a timeline when the new release graphite package will
become available?
Thanks again for your great work.
John
________________________________
From: Enrica <enrica.calura@gmail.com>
Cc: Bioconductor mailing list <bioconductor@stat.math.ethz.ch>;
"gabriele.sales@unipd.it" <gabriele.sales@unipd.it>;
"chiara.romualdi@unipd.it" <chiara.romualdi@unipd.it>
Sent: Monday, February 11, 2013 3:24 AM
Subject: Re: runSPIA() in graphite package
Dear John,
the procedure you applied to analyse your data is right.
However, the runSpia function filters out those pathways that have no
edges or less than 5 nodes with valid edges. Your pathway, after the
conversion to Entrez Gene, do not satisfy neither the conditions,
having no edges.
> pe<-convertIdentifiers(biocarta[["estrogen responsive protein efp
controls cell cycle and breast tumors growth"]], "entrez")
> pe
"estrogen responsive protein efp controls cell cycle and breast tumors
growth" pathway from BioCarta
Number of nodes???? = 5
Number of edges???? = 0
Type of identifiers = Entrez Gene
Retrieved on??????? = 2011-05-12
We equipped our runSpia() function with the checks described above in
order to protect the user from unreliable results.
On a separate note, we have also re-checked how that specific pathway
was converted. The original BioPax2 contained some edges, but their
endpoints were unfortunately nodes which could not be associated with
any entrez gene. Our software, thus, dropped them from the pathway.
We are working on a new graphite release based on annotation released
more recently. Those includes more edges; as a result that pathway is
no longer empty.
Enrica Calura
Hi all, I am using runSPIA() from graphite package to analyze a gene
list against biocarta pathway database.
>
>
>
>> library(graphite)
>
>> prepareSPIA(biocarta, "biocartaEx")
>> obj<-runSPIA(de=siggenes, all=allgenes, "biocartaEx")
>
>
>
>where "siggenes" are a list of 1332 significant genes selected from
"allgenes" (a list of 17138 genes). The running process verbose
indicated a total of 178 pathways analyzed.
>
>
>
>One of the pathways I am particularly interested is "estrogen
responsive protein efp controls cell cycle and breast tumors growth":
>
>
>> p <- biocarta[["estrogen responsive protein efp controls cell cycle
and breast tumors growth"]]
>> nodes(p)
>[1] "CDKs"???????????? "Cyclin B1/2"????? "EntrezGene:2099"?
"EntrezGene:2810"? "EntrezGene:57154" "EntrezGene:7157"?
"EntrezGene:7706"?? "ubiquitin"
>
>
>And my siggenes contains 2 of genes involved in this pathway:
>
>
>> siggenes[c('7706','2099')]
>???? 7706????? 2099
>?4.347012 -3.792425
>
>
>So I would assume that runSPIA will examine this pathway during the?
calculation, but surprisingly I didn't see this particular way being
examined. Here is the section of runSPIA() verbose output that started
with "e":
>
>
>Done pathway 42 : e2f1 destruction pathway..
>Done pathway 43 : effects of calcineurin in kera..
>Done pathway 44
: egf signaling pathway..
>Done pathway 45 : endocytotic role of ndk phosph..
>Done pathway 46 : epo signaling pathway..
>Done pathway 47 : erk and pi-3 kinase are necess..
>Done pathway 48 : erk1/erk2 mapk signaling pathw..
>Done pathway 49 : eukaryotic protein translation..
>Done pathway 50 : extrinsic prothrombin activati..
>
>
>
>Can anyone tell me why runSPIA() missed this pathway? Attached are
siggenes and allgenes for you to try.?
>
>
>> siggenes<-as.matrix(read.table("siggenes.txt",row.names=1))[,1]
>
>> allgenes<-as.matrix(read.table("H:\\test\\allgenes.txt",
row.names=NULL, header=T, colClasses='character'))[,1]
>
>
>Many thanks
>
>
>John
>
>
[[alternative HTML version deleted]]
------------------------------
Message: 6
Date: Mon, 11 Feb 2013 12:52:13 -0500
From: "James W. MacDonald" <jmacdon@uw.edu>
To: "John [guest]" <guest@bioconductor.org>
Cc: bioconductor@r-project.org
Subject: Re: [BioC] edgeR cpm filtering
Message-ID: <51192FCD.2000700@uw.edu>
Content-Type: text/plain; charset=windows-1252; format=flowed
Hi John,
On 2/11/2013 11:54 AM, John [guest] wrote:
> All,
>
> I am a new edgeR user. I have difficulty understanding the meaning
of the ???cpm??? function of edgeR package. I mean I understand that
each value is divided by the total library value, and then multiplied
by 1,000,000. But why 1M? I don???t understand what is the logic
behind using 1M? is it 1M reads? Or bases? And why not 10M? or 1000?
Any specific reason for using 1M?
It's reads. You are inputting read counts per gene, so by definition
edgeR knows nothing about bases. As for why 1M, I don't know for sure,
but would imagine it is because a 'reasonable' sized RNA-Seq
experiment
is based on somewhere around 10M reads or so.
In other words, say you have a sample with 10M reads, and one gene has
60 reads that align. You would have 6 cpm, which looks pretty low, and
it is. However you could do statistics on it based on a discrete
distribution like the negative binomial.
If you used 10M to normalize, the cp10M would be 0.6, so you would
have
to throw that gene out. If you used cpk, it would be 6000 cpk and it
would look like you had lots of reads when in fact you only had 60.
So cpm looks like a reasonable compromise to me.
>
> Another issues that I have is that how can I enforce filtering the
samples that have 0 reads in one group of samples, but very large
number of reads in another groups? Here is an example:
>
> Samples, Sample 1-replicate 1, Sample 1-replicate 2, Sample
2-replicate 1, Sample 2- replicate 2, Sample 3-replicate 1, Sample 3-
replicate 2
> Gene_X, 150,100, 270,320,0,0
>
> I used:
>
> d_DGEList<- d_DGEList[rowSums(cpm_filtered> 5)> 2,]
>
> But still Gene_X is not filtered. Many genes with low number of
reads are filtered, but very few like Gene_X are still there. I think
that having many reads mapped to samples 1 and 2 qualifies it for
passing the cpm filtering. How can I filter genes like this? Is it OK
if I manually delete cases like this?
Setting aside the logic of removing these genes (which IMO is missing
the point of looking for differential expression), note the logic of
your filter. Let's break it down by section:
rowSums(cpm_filtered > 5)
This gives the count of samples where the cpm value is > 5. In the
case
you mention, this would be four.
rowSums(cpm_filtered > 5) > 2
This results in TRUE if the count of samples for a given gene with a
cpm
value > 5 is greater than two. So you are saying that at least two
samples have to have a cpm > 5. In the instance you want to filter,
the
count is 4, and 4 > 2, so this passes the filter.
So what you apparently want are rows where the cpm is greater than
some
value in ALL samples, not just two of them, so you would want to
change
the > 2 part to == 6.
Note that this doesn't really make any sense. You are in effect saying
that you are completely uninterested in any genes that appear not to
be
expressed in one of your samples, but that might be highly expressed
in
one or more of the other samples. That to me is actually really
interesting, and I am not sure why you would want to filter out any
gene
that fulfills that criterion.
Best,
Jim
>
> Thank you.
> John
>
>
> -- output of sessionInfo():
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
>
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
>
> locale:
>
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
>
> attached base packages:
>
> [1] stats graphics grDevices utils datasets methods base
>
>
> other attached packages:
>
> [1] edgeR_2.6.0 limma_3.12.0
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
------------------------------
Message: 7
Date: Mon, 11 Feb 2013 13:08:11 -0500
From: "James W. MacDonald" <jmacdon@uw.edu>
To: Tom Oates <toates19@gmail.com>
Cc: "bioconductor@r-project.org" <bioconductor@r-project.org>
Subject: Re: [BioC] distanceToNearest in GenomicRanges
Message-ID: <5119338B.9080702@uw.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Hi Tom,
On 2/11/2013 11:35 AM, Tom Oates wrote:
> Hi
> I am very much a learner in R in general& GenomicRanges in general
> I am struggling to find documentation to help me get my head around
> distanceToNearest in GenomicRanges
> If I have a GRanges object:
>
> GRanges with 6 ranges and 4 metadata columns:
> seqnames ranges strand |
> <rle> <iranges> <rle> |
> [1] 10 [ 96723746, 96723747] - |
> [2] 7 [ 13641170, 13641171] + |
> [3] 16 [ 17772801, 17772802] - |
> [4] 3 [ 88173502, 88173503] - |
> [5] 13 [106979682, 106979683] + |
> [6] 9 [104393139, 104393140] + |
>
> (You will notice that all the regions are only dinucleotides& I
have
> removed the metadata )
>
> I have a 2nd GRanges object which is ensembl rat transcripts as
below:
> 39549 ranges and 2 metadata columns:
> seqnames ranges strand | tx_id
> tx_name
> <rle> <iranges> <rle> |<integer>
> <character>
> [1] 1 [5473, 16844] + | 1
> ENSRNOT00000044270
> [2] 1 [5526, 16968] + | 2
> ENSRNOT00000049921
> [3] 1 [5526, 16968] + | 3
> ENSRNOT00000051735
> [4] 1 [5598, 13520] + | 4
> ENSRNOT00000034630
> [5] 1 [8268, 16850] + | 5
> ENSRNOT00000044505
> [6] 1 [8316, 17577] + | 6
> ENSRNOT00000042693
> [7] 1 [8884, 16850] + | 7
> ENSRNOT00000044187
> [8] 1 [8956, 9955] + | 8
> ENSRNOT00000041082
> [9] 1 [9055, 17351] + | 9
> ENSRNOT00000050254
>
>
> If I invoke:
> xx<-distanceToNearestdiff.cpgs.gr, rat.transcripts,
ignore.strand=F)
>
> xx
> DataFrame with 1133 rows and 3 columns
> queryHits subjectHits distance
> <integer> <integer> <integer>
> 1 1 7752 0
> 2 2 32166 11946
> 3 3 14678 25377
> 4 4 24286 66747
> 5 5 10609 34242
> 6 6 37076 122683
> 7 7 35184 0
> 8 8 34180 45561
> 9 9 19351 50156
> ... ... ... ...
> etc
>
> I am uncertain how I would then use the xx output to gain
information (i.e.
> tx_id, tx_name) about the feature which the function has identified
as
> nearest?
> I would be happy to supply any more info as required
The subjectHits column gives the row of your transcript GRanges object
that matches the corresponding query row. I am assuming here that the
'diff.cpgs.gr' GRanges object is longer than 6? Anyway, here is an
example using your data and the TxDb.Mmusculus.UCSC.mm10.knownGene
package:
> x
GRanges with 6 ranges and 0 metadata columns:
seqnames ranges strand
<rle> <iranges> <rle>
[1] chr10 [ 96723746, 96723747] *
[2] chr7 [ 13641170, 13641171] *
[3] chr16 [ 17772801, 17772802] *
[4] chr3 [ 88173502, 88173503] *
[5] chr13 [106979682, 106979683] *
[6] chr9 [104393139, 104393140] *
---
> y <- transcripts(TxDb.Mmusculus.UCSC.mm10.knownGene)
> xx <- distanceToNearest(x, y, ignore.strand=F)
> xx
DataFrame with 6 rows and 3 columns
queryHits subjectHits distance
<integer> <integer> <integer>
1 1 4514 100935
2 2 45653 0
3 3 19383 0
4 4 34197 0
5 5 14383 0
6 6 54212 8108
> y[xx[,2],]
GRanges with 6 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<rle> <iranges> <rle> | <integer> <character>
[1] chr10 [ 96617001, 96622811] + | 33419 uc007gww.2
[2] chr7 [ 13623967, 13670807] + | 21400 uc012ezp.1
[3] chr16 [ 17759663, 17779206] + | 48288 uc007ylz.1
[4] chr3 [ 88171560, 88177785] - | 10107 uc008puf.2
[5] chr13 [106963757, 107022114] - | 43288 uc007rue.1
[6] chr9 [104361832, 104385031] + | 29956 uc009rhp.1
---
seqlengths:
chr1 chr2 ... chrUn_JH584304
195471971 182113224 ... 114452
Best,
Jim
> Tom
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
------------------------------
Message: 8
Date: Mon, 11 Feb 2013 11:16:17 -0800
From: Valerie Obenchain <vobencha@fhcrc.org>
To: Charles Berry <ccberry@ucsd.edu>
Cc: bioconductor@stat.math.ethz.ch
Subject: Re: [BioC] DataFrame bug?
Message-ID: <51194381.4000809@fhcrc.org>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Hi Charles,
Thanks for reporting the bug. Now fixed in 1.17.32 devel and 1.16.5 in
release.
Valerie
library(RUnit)
DF <- DataFrame("A"=1:5,row.names=letters[1:5])
df <- data.frame("A"=1:5,row.names=letters[1:5])
> checkIdentical(DF['a','B'] <- 1, df['a','B'] <- 1)
[1] TRUE
DF <- DataFrame("A"=1:5,row.names=letters[1:5])
df <- data.frame("A"=1:5,row.names=letters[1:5])
> checkIdentical(DF['c','B'] <- 1, df['c','B'] <- 1)
[1] TRUE
> sessionInfo()
...
other attached packages:
[1] RUnit_0.4.26 IRanges_1.16.5 BiocGenerics_0.4.0
loaded via a namespace (and not attached):
[1] parallel_2.15.1 stats4_2.15.1
On 02/09/2013 12:26 PM, Charles Berry wrote:
>>>
>
> Subset replacement like this
>
> df['a','c2']<-3
>
> depends on the position of row 'a' when 'c2' is a new column.
>
>
> Row 'a' first:
>
>> df1 <- DataFrame(c1=1:2,row.names=c("a","b"))
>> df1['a','c2'] <- 3
>> df1
> DataFrame with 2 rows and 2 columns
> c1 c2
> <integer> <numeric>
> a 1 3
> b 2 3
>
> Row 'a' second:
>
>> df2 <- DataFrame(c1=1:2,row.names= rev( c("a","b") ))
>> df2['a','c2'] <- 3
>> df2
> DataFrame with 2 rows and 2 columns
> c1 c2
> <integer> <numeric>
> b 1 NA
> a 2 3
>
> FWIW, the latter - but not the former - agrees with "the 'DataFrame'
behaves
> very similarly to 'data.frame'"(from the help page).
>
>> sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] IRanges_1.16.4 BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] parallel_2.15.2 stats4_2.15.2
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
------------------------------
Message: 9
Date: Mon, 11 Feb 2013 11:34:04 -0800
From: "Tim Triche, Jr." <tim.triche@gmail.com>
To: "Hahne, Florian" <florian.hahne@novartis.com>
Cc: "bioc-sig-sequencing-request@r-project.org"
<bioc-sig-sequencing-request@r-project.org>, Bogdan Tanasa
<tanasa@gmail.com>, Bioconductor mailing list
<bioconductor@r-project.org>
Subject: Re: [BioC] question about Gviz
Message-ID:
<cac+n9bw3pqvp2urshqh3fhgtjhf9v+eguvsv94aj5yxwjxrqiq@mail.gmail.com>
Content-Type: text/plain
Well, here's one approach. I'll start from the constructed 'txtr'
track.
## see ?select for more on the following
##
library(Homo.sapiens)
symbolMappings <- select(Homo.sapiens,
cols=c('UCSCKG','ENTREZID','SYMBOL'),
keys=symbol(txtr),
keytype='UCSCKG')
## "pork out" the mappings table so that duplicates expand
##
symbolMappings <- symbolMappings[match(symbol(txtr),
symbolMappings$UCSCKG),]
## if NAs can be accepted as keys, this next step might be unnecessary
## however, it seems that not all UCSC known genes have a Hugo symbol?
##
txtr <- txtr[ hasHugoSymbol ]
symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ]
plotTracks(txtr)
That works.
On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian
<florian.hahne@novartis.com>wrote:
> Well, the main culprit here is really the TranscriptDB package. It
does
> not seem to deal at all with gene symbols, so there is no way for
Gviz to
> automatically fetch those. If you come up with a way to match the
UCSC
> gene identifiers back to gene symbols you could stick those into the
> GeneRegionTrack using the 'symbol' replacement method. E.g.,
> symbol(foo) <- mappingTable[match(transcript(foo),
mappingTable$UCSCId),])
> I am not sure how this mapping is supposed to be done in the
Bioconductor
> world these days. You may be able to find a way using one of the
org.db
> packages. Or maybe you will have to download a mapping table
directly from
> the UCSC table browser.
> With the next release of Bioconductor (or already now if you are
working
> with the devel branch), Gviz supports building tracks from a whole
range
> of standard annotation files. You could then export the whole
known.gene
> table from UCSC as a GTF file and import in again as a
GeneRegionTrack
> object. Alternatively you may want to look into the
BiomartGeneRegionTrack
> class which will fetch the gene models from Ensembl, but this
includes the
> HUGO gene symbols.
> Florian
>
>
> --
>
>
>
>
>
>
> On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote:
>
> >Dear all,
> >
> >I am using the following code below in order to retrieve the gene
> >annotations in Gviz package :
> >please could advise on what shall I modify in order to display the
HUGO
[[elided Yahoo spam]]
> >
> >library("GenomicFeatures")
> >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename =
> >"knownGene")
> >saveDb(hg18db, file="hg18db_knownGene.sqlite")
> >txdb <-loadDb("hg18db_knownGene.sqlite")
> >txTr <-
> >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE,ge
neSymbo
> >l=TRUE,name="UCSC")
> >plotTracks(txTr,from=start,to=end)
> >
> >-- bogdan
> >------------------
> >
> > [[alternative HTML version deleted]]
> >
> >_______________________________________________
> >Bioconductor mailing list
> >Bioconductor@r-project.org
> >https://stat.ethz.ch/mailman/listinfo/bioconductor
> >Search the archives:
> >http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
*A model is a lie that helps you see the truth.*
*
*
Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
[[alternative HTML version deleted]]
------------------------------
Message: 10
Date: Mon, 11 Feb 2013 11:39:15 -0800
From: "Tim Triche, Jr." <tim.triche@gmail.com>
To: "Hahne, Florian" <florian.hahne@novartis.com>
Cc: "bioc-sig-sequencing-request@r-project.org"
<bioc-sig-sequencing-request@r-project.org>, Bogdan Tanasa
<tanasa@gmail.com>, Bioconductor mailing list
<bioconductor@r-project.org>
Subject: Re: [BioC] question about Gviz
Message-ID:
<cac+n9bvd4vrf8q+1z1ydc7zphuqpy5doyiu7kkr4eu8u6sqcua@mail.gmail.com>
Content-Type: text/plain
Oops, critical step omitted. See below.
## see ?select for more on the following
##
library(Homo.sapiens)
symbolMappings <- select(Homo.sapiens,
cols=c('UCSCKG','ENTREZID','SYMBOL'),
keys=symbol(txtr),
keytype='UCSCKG')
## "pork out" the mappings table so that duplicates expand
##
symbolMappings <- symbolMappings[match(symbol(txtr),
symbolMappings$UCSCKG),]
## if NAs could be accepted as keys, this next step might be
unnecessary
##
hasHugoSymbol <- !is.na(symbolMappings$SYMBOL)
## however, it seems that not all UCSC known genes have a Hugo symbol?
##
txtr <- txtr[ hasHugoSymbol ]
symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ]
plotTracks(txtr)
The above (minus comments) is what I used to generate a test result.
On Mon, Feb 11, 2013 at 11:34 AM, Tim Triche, Jr.
<tim.triche@gmail.com>wrote:
> Well, here's one approach. I'll start from the constructed 'txtr'
track.
>
> ## see ?select for more on the following
> ##
> library(Homo.sapiens)
> symbolMappings <- select(Homo.sapiens,
>
> cols=c('UCSCKG','ENTREZID','SYMBOL'),
> keys=symbol(txtr),
> keytype='UCSCKG')
>
> ## "pork out" the mappings table so that duplicates expand
> ##
> symbolMappings <- symbolMappings[match(symbol(txtr),
>
> symbolMappings$UCSCKG),]
>
> ## if NAs can be accepted as keys, this next step might be
unnecessary
> ## however, it seems that not all UCSC known genes have a Hugo
symbol?
> ##
> txtr <- txtr[ hasHugoSymbol ]
> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ]
> plotTracks(txtr)
>
> That works.
>
>
>
>
> On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian <
> florian.hahne@novartis.com> wrote:
>
>> Well, the main culprit here is really the TranscriptDB package. It
does
>> not seem to deal at all with gene symbols, so there is no way for
Gviz to
>> automatically fetch those. If you come up with a way to match the
UCSC
>> gene identifiers back to gene symbols you could stick those into
the
>> GeneRegionTrack using the 'symbol' replacement method. E.g.,
>> symbol(foo) <- mappingTable[match(transcript(foo),
mappingTable$UCSCId),])
>> I am not sure how this mapping is supposed to be done in the
Bioconductor
>> world these days. You may be able to find a way using one of the
org.db
>> packages. Or maybe you will have to download a mapping table
directly from
>> the UCSC table browser.
>> With the next release of Bioconductor (or already now if you are
working
>> with the devel branch), Gviz supports building tracks from a whole
range
>> of standard annotation files. You could then export the whole
known.gene
>> table from UCSC as a GTF file and import in again as a
GeneRegionTrack
>> object. Alternatively you may want to look into the
BiomartGeneRegionTrack
>> class which will fetch the gene models from Ensembl, but this
includes the
>> HUGO gene symbols.
>> Florian
>>
>>
>> --
>>
>>
>>
>>
>>
>>
>> On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote:
>>
>> >Dear all,
>> >
>> >I am using the following code below in order to retrieve the gene
>> >annotations in Gviz package :
>> >please could advise on what shall I modify in order to display the
HUGO
[[elided Yahoo spam]]
>> >
>> >library("GenomicFeatures")
>> >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename =
>> >"knownGene")
>> >saveDb(hg18db, file="hg18db_knownGene.sqlite")
>> >txdb <-loadDb("hg18db_knownGene.sqlite")
>> >txTr <-
>>
>> >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE,g
eneSymbo
>> >l=TRUE,name="UCSC")
>> >plotTracks(txTr,from=start,to=end)
>> >
>> >-- bogdan
>> >------------------
>> >
>> > [[alternative HTML version deleted]]
>> >
>> >_______________________________________________
>> >Bioconductor mailing list
>> >Bioconductor@r-project.org
>> >https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >Search the archives:
>> >http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
>
> --
> *A model is a lie that helps you see the truth.*
> *
> *
> Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
>
--
*A model is a lie that helps you see the truth.*
*
*
Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
[[alternative HTML version deleted]]
------------------------------
Message: 11
Date: Mon, 11 Feb 2013 14:45:59 -0500
From: Seb <seba.bat@gmail.com>
To: bioconductor@r-project.org
Subject: [BioC] plotting BED intervals to TSS regions
Message-ID:
<canstik=yenwn1-_drhpwjhaxczjj=c5jo_sghauecjcenvtnvq@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
hi gurus
i have several BED files containing chromosome #, start and end that
correspond to overlapping regions of different ChIP Seq experiments.
this part was done with Galaxy.
i also have a file containing TSS coordinates +/- 10kb.
what i want to do is to create a plot to show how many of my
overlapping intervals fall within the TSS regions, and, if they do,
have on the X axis the distance to the TSS and on the Y axis the
number of regions that overlap that certain part of the TSS
...i am a bit confused about how to do this tho...i looked in galaxy
[[elided Yahoo spam]]
thanks
------------------------------
Message: 12
Date: Mon, 11 Feb 2013 17:10:05 -0500
From: "James W. MacDonald" <jmacdon@uw.edu>
Cc: "Bioconductor@r-project.org" <bioconductor@r-project.org>
Subject: Re: [BioC] edgeR cpm filtering
Message-ID: <51196C3D.6000009@uw.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Hi John,
Please don't take things off-list. Even if you are not a subscriber
(and
if you are using BioC stuff you should be, and you can always stop
delivery but remain a subscriber), I believe that replying to an
existing thread will work.
I don't see any zero counts causing a problem. Using the example for
cpm() as a starting point, and modifying to have a set with zero
counts,
I get this:
> y
[,1] [,2] [,3] [,4]
[1,] 1 2 14 11
[2,] 11 25 1 26
[3,] 1 22 2 19
[4,] 5 6 15 6
[5,] 0 0 1 5
> d <-DGEList(counts=y, lib.size=1001:1004, group=factor(c(1,1,2,2)))
> d <- estimateCommonDisp(d)
> d <- estimateTagwiseDisp(d)
> topTags(exactTest(d))
Comparison of groups: 2-1
logFC logCPM PValue FDR
1 2.9550376 12.76964 6.109348e-05 0.0003054674
5 4.6421574 10.54712 1.283343e-01 0.3208358043
4 0.9149142 12.96222 2.668415e-01 0.4447357815
2 -0.4149407 13.93933 8.539261e-01 0.9783799675
3 -0.1325391 13.42121 9.783800e-01 0.9783799675
So the sample with zero counts (sample 5), is the second row in the
topTags() output, and it has no problem computing a logFC value.
Best,
Jim
On 2/11/2013 4:30 PM, John Sperry wrote:
> Hi again Jim,
>
> One more thing, in microarray days, people used to add a small
value,
> let say 1 to the 0 values to avoid non-sense fold changes. It's not
> the case in NGS any more right? so it's not possible to do that in
> edgeR, right? that's why I was thinking about filtering out with
cpm.
>
> Thanks,
> John
>
>
>
> --------------------------------------------------------------------
----
> *To:* "jmacdon@uw.edu" <jmacdon@uw.edu>
> *Sent:* Monday, February 11, 2013 1:47 PM
> *Subject:* [BioC] edgeR cpm filtering
>
> Hi Jim,
>
> I'm very new to edgeR and BioC. I couldn't reply to your post in
BioC,
> so here is my post in an email :D
>
> I still cannot see why 1M is chosen, but I appreciate your
explanations.
>
> About the cpm filtering, the reason that I chose '> 2' for 3 samples
> with each having 2 replicates was that I though edgeR must be smart
> enough to figure out that when I say more than 5 reads per million
for
[[elided Yahoo spam]]
[[elided Yahoo spam]]
>
> as for the reason for wanting to get rid of the sample 3 with 2
> replicates that have 0 reads mapped to them, I don't want them,
> because, they cause the logFC to become huge non-sense numbers! i
> guess dividing be 0 causes the problem! so I thought for not seeing
> weird values when the significant genes are selected, it's better to
> get rid of genes that have 0 reads mapped to any of their groups.
Does
> it make sense?
>
> d_DGEList<- d_DGEList[rowSums(cpm_filtered> 5)> 2,]
>
> Thanks,
> John
>
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
------------------------------
Message: 13
Date: Mon, 11 Feb 2013 12:24:58 +0100
From: Enrica <enrica.calura@gmail.com>
Cc: "chiara.romualdi@unipd.it" <chiara.romualdi@unipd.it>,
"gabriele.sales@unipd.it" <gabriele.sales@unipd.it>,
Bioconductor
mailing list <bioconductor@stat.math.ethz.ch>
Subject: Re: [BioC] runSPIA() in graphite package
Message-ID:
<capkcaw0ixxcknwxkj4vue25bfootjx2rd5flsgmgs9qk96hbsg@mail.gmail.com>
Content-Type: text/plain
Dear John,
the procedure you applied to analyse your data is right.
However, the runSpia function filters out those pathways that have no
edges
or less than 5 nodes with valid edges. Your pathway, after the
conversion
to Entrez Gene, do not satisfy neither the conditions, having no
edges.
> pe<-convertIdentifiers(biocarta[["estrogen responsive protein efp
controls cell cycle and breast tumors growth"]], "entrez")
> pe
"estrogen responsive protein efp controls cell cycle and breast tumors
growth" pathway from BioCarta
Number of nodes = 5
Number of edges = 0
Type of identifiers = Entrez Gene
Retrieved on = 2011-05-12
We equipped our runSpia() function with the checks described above in
order
to protect the user from unreliable results.
On a separate note, we have also re-checked how that specific pathway
was
converted. The original BioPax2 contained some edges, but their
endpoints
were unfortunately nodes which could not be associated with any entrez
gene. Our software, thus, dropped them from the pathway.
We are working on a new graphite release based on annotation released
more
recently. Those includes more edges; as a result that pathway is no
longer
empty.
Enrica Calura
> Hi all, I am using runSPIA() from graphite package to analyze a gene
list
> against biocarta pathway database.
>
> > library(graphite)
> > prepareSPIA(biocarta, "biocartaEx")
> > obj<-runSPIA(de=siggenes, all=allgenes, "biocartaEx")
>
> where "siggenes" are a list of 1332 significant genes selected from
> "allgenes" (a list of 17138 genes). The running process verbose
indicated a
> total of 178 pathways analyzed.
>
> One of the pathways I am particularly interested is "estrogen
responsive
> protein efp controls cell cycle and breast tumors growth":
>
> > p <- biocarta[["estrogen responsive protein efp controls cell
cycle and
> breast tumors growth"]]
> > nodes(p)
> [1] "CDKs" "Cyclin B1/2" "EntrezGene:2099"
> "EntrezGene:2810" "EntrezGene:57154" "EntrezGene:7157"
> "EntrezGene:7706" "ubiquitin"
>
> And my siggenes contains 2 of genes involved in this pathway:
>
> > siggenes[c('7706','2099')]
> 7706 2099
> 4.347012 -3.792425
>
> So I would assume that runSPIA will examine this pathway during the
> calculation, but surprisingly I didn't see this particular way being
> examined. Here is the section of runSPIA() verbose output that
started with
> "e":
>
> Done pathway 42 : e2f1 destruction pathway..
> Done pathway 43 : effects of calcineurin in kera..
> Done pathway 44 : egf signaling pathway..
> Done pathway 45 : endocytotic role of ndk phosph..
> Done pathway 46 : epo signaling pathway..
> Done pathway 47 : erk and pi-3 kinase are necess..
> Done pathway 48 : erk1/erk2 mapk signaling pathw..
> Done pathway 49 : eukaryotic protein translation..
> Done pathway 50 : extrinsic prothrombin activati..
>
> Can anyone tell me why runSPIA() missed this pathway? Attached are
> siggenes and allgenes for you to try.
>
> > siggenes<-as.matrix(read.table("siggenes.txt",row.names=1))[,1]
> > allgenes<-as.matrix(read.table("H:\\test\\allgenes.txt",
row.names=NULL,
> header=T, colClasses='character'))[,1]
>
> Many thanks
>
> John
>
>
[[alternative HTML version deleted]]
------------------------------
Message: 14
Date: Mon, 11 Feb 2013 16:41:25 -0600
From: Tengfei Yin <yintengfei@gmail.com>
To: Seb <seba.bat@gmail.com>
Cc: bioconductor@r-project.org
Subject: Re: [BioC] plotting BED intervals to TSS regions
Message-ID:
<capjsq9kn0rsob3zno6k=+ok1r=qdyk_l3=wapmefci1r6fuhzq@mail.gmail.com>
Content-Type: text/plain
Hi Seb,
I guess before visualization you need to get the summary statistics
ready
first, I got one idea, maybe you could give a try, and I assume the
count
you want is based on a per base resolution
1. 'import' function in package rtracklayer to import your BED files
and
TSS files as GRanges object into R, ready for analysis.
2. ?findOverlaps in package 'GenomicRanges', there are some utilities
to
summarize the overlapping between your BED and TSS region. Then you
can
easily get an answer to your first question: how many falls within
your TSS
region defined.
3. compute coverage for your imported BED intervals(GRanges object) ,
that
will give you an Rle/RleList. check 'coverage' function in package
IRanges/GenomicRanges.
4. then get views on this coverage data with you tss position object.
please check 'Views' method in GenomicRanges/IRanges. This step is
important, better make sure your TSS have equal length window, for
example
20kb in your case.
5. Covert this Views to a matrix by using as.matrix on previous views
object. You will get a matrix, whose columns correspond to position
around
tss, from -10kb to 10kb, each row correspond to one tss region. If you
want
to summarize over all tss, just use colSums over this matrix.
6. After you get this summary data, you can use any graphic package in
R to
visualize this data as lines and relabel the x-axis position from -10k
to
10k.
As far as I know, there is no direct way in bioc to
import/aggregate/visualize your BED/TSS file together with one or two
commands to get what you want yet ...
HTH
Tengfei
On Mon, Feb 11, 2013 at 1:45 PM, Seb <seba.bat@gmail.com> wrote:
> hi gurus
>
> i have several BED files containing chromosome #, start and end that
> correspond to overlapping regions of different ChIP Seq experiments.
> this part was done with Galaxy.
>
> i also have a file containing TSS coordinates +/- 10kb.
>
> what i want to do is to create a plot to show how many of my
> overlapping intervals fall within the TSS regions, and, if they do,
> have on the X axis the distance to the TSS and on the Y axis the
> number of regions that overlap that certain part of the TSS
>
> ...i am a bit confused about how to do this tho...i looked in galaxy
[[elided Yahoo spam]]
>
> thanks
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Tengfei Yin
MCDB PhD student
1620 Howe Hall, 2274,
Iowa State University
Ames, IA,50011-2274
[[alternative HTML version deleted]]
------------------------------
Message: 15
Date: Mon, 11 Feb 2013 16:08:00 -0800
From: Bogdan Tanasa <tanasa@gmail.com>
To: <ttriche@usc.edu>
Cc: bioconductor@stat.math.ethz.ch, bioc-sig-sequencing@r-project.org
Subject: Re: [BioC] question about Gviz
Message-ID:
<ca+jem02xqyqhkzpfagboltdrv_z+l6703jshpmpceq8rvqzrdg@mail.gmail.com>
Content-Type: text/plain
Thanks, Tim ...will do it accordingly.
On Mon, Feb 11, 2013 at 3:49 PM, Tim Triche, Jr.
<tim.triche@gmail.com>wrote:
> So the problem (at least as regards my initial code snippet) is that
there
> is no mapping in the Homo.sapiens OrganismDb for either uc002yjx.1
> or uc010gkz.1 and thus no symbol (since symbol-less features are
dropped).
> That's sort of odd since, it seems, these are NRIP1 isoforms.
>
> I noticed that p15 and p16 transcripts appeared to be missing when I
> looked at chr9 earlier. This might be an issue where mappings are
missing
> from Homo.sapiens / org.Hs.eg.db, in which case a posting to the
list would
> be good.
>
> Meanwhile, if you want to keep the UCSC names of the genes for which
there
> is not a Hugo mapping in the symbol table, the following replacement
for
> subsetting txtr should do it (I tested this):
>
> ## don't subset txtr. instead,
> symbol(txtr)[ hasHugoSymbol ] <-
symbolMappings$SYMBOL[hasHugoSymbol]
>
> Then proceed as before. I get the attached output from running
>
> plotTracks(txtr,from=start,to=end)
>
> since the NRIP1 isoforms, not being mapped to Hugo (?!), stay UCSC
IDs.
>
> Do you mind if I cc: the list on this? And/or bring it up with
Marc/Herve?
>
> --t
>
>
>
> On Mon, Feb 11, 2013 at 3:19 PM, Bogdan Tanasa <tanasa@gmail.com>
wrote:
>
>> Hi Tim,
>>
>> I followed your code, in the following way (please see below);
>> plotTracks(txtr) works; however, when I try to plot specific
regions
>> ( eg plotTracks(txtr,chr="chr21", start=15245000,end=15520000) ) it
does
>> not show the correct region : I suspect it may be an issue in my
code, but
>> not very sure where ... if you have 2-3 minutes, would appreciate
your
>> help. thanks !
>>
>> library(Homo.sapiens)
>> library("ggplot2")
>> library(Gviz)
>> library("GenomicFeatures")
>> library("ggplot2")
>> library("rtracklayer")
>> library("TxDb.Hsapiens.UCSC.hg18.knownGene")
>>
>> chr<-"chr21"
>> start<-15245000
>> end <-15520000
>>
>> hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename =
>> "knownGene")
>> saveDb(hg18db, file="hg18db_knownGene.sqlite")
>> txdb <-loadDb("hg18db_knownGene.sqlite")
>> txtr <-
>> GeneRegionTrack(txdb,genome="hg18",chromosome="chr21",showId=TRUE,g
eneSymbol=TRUE,name="UCSC")
>> plotTracks(txtr,from=start,to=end)
>>
>>
>>
>> symbolMappings <- select(Homo.sapiens,
>> cols=c('UCSCKG','ENTREZID','SYMBOL'),
>> keys=symbol(txtr), keytype='UCSCKG')
>> symbolMappings <- symbolMappings[match(symbol(txtr),
>> symbolMappings$UCSCKG),]
>>
>> hasHugoSymbol <- !is.na(symbolMappings$SYMBOL)
>>
>> txtr <- txtr[ hasHugoSymbol ]
>> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ]
>>
>> plotTracks(txtr)
>> plotTracks(txtr,from=start,to=end)
>>
>>
>>
>> On Mon, Feb 11, 2013 at 11:39 AM, Tim Triche, Jr.
<tim.triche@gmail.com>wrote:
>>
>>> Oops, critical step omitted. See below.
>>>
>>> ## see ?select for more on the following
>>> ##
>>> library(Homo.sapiens)
>>> symbolMappings <- select(Homo.sapiens,
>>>
>>> cols=c('UCSCKG','ENTREZID','SYMBOL'),
>>> keys=symbol(txtr),
>>> keytype='UCSCKG')
>>>
>>> ## "pork out" the mappings table so that duplicates expand
>>> ##
>>> symbolMappings <- symbolMappings[match(symbol(txtr),
>>>
>>> symbolMappings$UCSCKG),]
>>>
>>> ## if NAs could be accepted as keys, this next step might be
unnecessary
>>> ##
>>> hasHugoSymbol <- !is.na(symbolMappings$SYMBOL)
>>>
>>> ## however, it seems that not all UCSC known genes have a Hugo
symbol?
>>> ##
>>> txtr <- txtr[ hasHugoSymbol ]
>>> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ]
>>> plotTracks(txtr)
>>>
>>>
>>> The above (minus comments) is what I used to generate a test
result.
>>>
>>>
>>>
>>>
>>> On Mon, Feb 11, 2013 at 11:34 AM, Tim Triche, Jr.
<tim.triche@gmail.com>wrote:
>>>
>>>> Well, here's one approach. I'll start from the constructed
'txtr'
>>>> track.
>>>>
>>>> ## see ?select for more on the following
>>>> ##
>>>> library(Homo.sapiens)
>>>> symbolMappings <- select(Homo.sapiens,
>>>>
>>>> cols=c('UCSCKG','ENTREZID','SYMBOL'),
>>>> keys=symbol(txtr),
>>>> keytype='UCSCKG')
>>>>
>>>> ## "pork out" the mappings table so that duplicates expand
>>>> ##
>>>> symbolMappings <- symbolMappings[match(symbol(txtr),
>>>>
>>>> symbolMappings$UCSCKG),]
>>>>
>>>> ## if NAs can be accepted as keys, this next step might be
unnecessary
>>>> ## however, it seems that not all UCSC known genes have a Hugo
symbol?
>>>> ##
>>>> txtr <- txtr[ hasHugoSymbol ]
>>>> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ]
>>>> plotTracks(txtr)
>>>>
>>>> That works.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian <
>>>> florian.hahne@novartis.com> wrote:
>>>>
>>>>> Well, the main culprit here is really the TranscriptDB package.
It does
>>>>> not seem to deal at all with gene symbols, so there is no way
for Gviz
>>>>> to
>>>>> automatically fetch those. If you come up with a way to match
the UCSC
>>>>> gene identifiers back to gene symbols you could stick those into
the
>>>>> GeneRegionTrack using the 'symbol' replacement method. E.g.,
>>>>> symbol(foo) <- mappingTable[match(transcript(foo),
>>>>> mappingTable$UCSCId),])
>>>>> I am not sure how this mapping is supposed to be done in the
>>>>> Bioconductor
>>>>> world these days. You may be able to find a way using one of the
org.db
>>>>> packages. Or maybe you will have to download a mapping table
directly
>>>>> from
>>>>> the UCSC table browser.
>>>>> With the next release of Bioconductor (or already now if you are
>>>>> working
>>>>> with the devel branch), Gviz supports building tracks from a
whole
>>>>> range
>>>>> of standard annotation files. You could then export the whole
>>>>> known.gene
>>>>> table from UCSC as a GTF file and import in again as a
GeneRegionTrack
>>>>> object. Alternatively you may want to look into the
>>>>> BiomartGeneRegionTrack
>>>>> class which will fetch the gene models from Ensembl, but this
includes
>>>>> the
>>>>> HUGO gene symbols.
>>>>> Florian
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote:
>>>>>
>>>>> >Dear all,
>>>>> >
>>>>> >I am using the following code below in order to retrieve the
gene
>>>>> >annotations in Gviz package :
>>>>> >please could advise on what shall I modify in order to display
the
>>>>> HUGO
[[elided Yahoo spam]]
>>>>> >
>>>>> >library("GenomicFeatures")
>>>>> >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename =
>>>>> >"knownGene")
>>>>> >saveDb(hg18db, file="hg18db_knownGene.sqlite")
>>>>> >txdb <-loadDb("hg18db_knownGene.sqlite")
>>>>> >txTr <-
>>>>>
>>>>> >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRU
E,geneSymbo
>>>>> >l=TRUE,name="UCSC")
>>>>> >plotTracks(txTr,from=start,to=end)
>>>>> >
>>>>> >-- bogdan
>>>>> >------------------
>>>>> >
>>>>> > [[alternative HTML version deleted]]
>>>>> >
>>>>> >_______________________________________________
>>>>> >Bioconductor mailing list
>>>>> >Bioconductor@r-project.org
>>>>> >https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> >Search the archives:
>>>>>
>http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor@r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>>
http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *A model is a lie that helps you see the truth.*
>>>> *
>>>> *
>>>> Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
>>>>
>>>
>>>
>>>
>>> --
>>> *A model is a lie that helps you see the truth.*
>>> *
>>> *
>>> Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
>>>
>>
>>
>
>
> --
> *A model is a lie that helps you see the truth.*
> *
> *
> Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
>
[[alternative HTML version deleted]]
------------------------------
Message: 16
Date: Tue, 12 Feb 2013 00:41:02 +0000
From: Ian Sudbery <ian.sudbery@dpag.ox.ac.uk>
To: "bioconductor@stat.math.ethz.ch" <bioconductor@stat.math.ethz.ch>
Subject: [BioC] Error Bars on EdgeR
Message-ID:
<55D5020998905943B8A25CAFED69B2C219DCBCCF@MBX01.ad.oak.ox.ac.uk>
Content-Type: text/plain
Hi All,
I was wondering if anyone had any idea how one might
put error bars on the log fold changes calculated by edgeR? I'd like
to do this so I can show examples of differentially expressed genes on
a bar plot, and show them in comparison to qPCR validation.
I feel like one should be able to get the variance from the
dispersion:
CV^2 = 1/mu + BCV^2 = 1/mu + dispersion
So:
variance = mu + mu^2*dispersion ?
This could either be used to show the variance for each of the two
conditions, or used to calculate an error for the fold change. but I'm
a bit stuck as to how to actually implement this.
Any ideas appreciated,
Yours,
Ian
[[alternative HTML version deleted]]
------------------------------
Message: 17
Date: Tue, 12 Feb 2013 01:40:23 +0100
From: Jon Toledo <tintin_jb@hotmail.com>
To: <bioconductor@r-project.org>
Subject: [BioC] Differences in results analyzing Mouse Gene 1.0-ST
using oligo and affy package
Message-ID: <bay161-w16c78e461d37c096df6db698090@phx.gbl>
Content-Type: text/plain
Dear Bioconductor List,
I have repeated my workflow using the affy and oligo package
alternatively followed by the limma package to analyze and
experiment with two conditions using Mouse Gene 1.0-ST chips and I
arrive to different results.
I have been looking online and found that at least for the for the
Mouse Gene 1.1-ST the oligo package is preferred, but not anything
clear for Mouse Gene 1.0-ST .
I attach below my code and session info:
A) For running affy:
library(affy)
library(pd.mogene.1.0.st.v1)
library(mogene10sttranscriptcluster.db)
cellist=list.celfiles(full.names=T)
cellistD14=grep("CD1...14",cellist,value=T)
esetD14<- justRMA(filenames=gsub("\\./","",cellistD14))
B) For runnign oligo:
library(oligo)
library(pd.mogene.1.0.st.v1)
library(mogene10sttranscriptcluster.db)
cellist=list.celfiles(full.names=T)
cellistD14=grep("CD1...14",cellist,value=T)
rsetD14=read.celfiles(cellistD14)
esetpD14=rma(rsetD14,target="probeset")
esetD14=rma(rsetD14,target="core")
C) Finally running the same analysis:
designD14<-read.delim('designD14.txt')
contrast.matrix=model.matrix(~treatment,data=designD14)
library(limma)
fit <- lmFit(esetD14, contrast.matrix)
fit <- eBayes(fit,proportion=0.01)
m1=topTable(fit, coef="treatment",number=1e8,adjust.method="BH")
m1=m1[,c("ID","logFC","P.Value","adj.P.Val")]
m1=cbind(m1[,1:2],FCTreat=2**m1[,2],PTreat=m1[,3],adj.PTreat=m1[,4])
> sessionInfo() (This is for the affy run)
R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United
States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mogene10stv1cdf_2.11.0 limma_3.14.4
mogene10sttranscriptcluster.db_8.0.1
[4] org.Mm.eg.db_2.8.0 AnnotationDbi_1.20.3
pd.mogene.1.0.st.v1_3.8.0
[7] oligo_1.22.0 oligoClasses_1.20.0
RSQLite_0.11.2
[10] DBI_0.2-5 affy_1.36.1
Biobase_2.18.0
[13] BiocGenerics_0.4.0
loaded via a namespace (and not attached):
[1] affxparser_1.30.2 affyio_1.26.0 BiocInstaller_1.8.3
Biostrings_2.26.3 bit_1.1-9 codetools_0.2-8
[7] ff_2.2-10 foreach_1.4.0 GenomicRanges_1.10.6
IRanges_1.16.4 iterators_1.0.6 parallel_2.15.2
[13] preprocessCore_1.20.0 splines_2.15.2 stats4_2.15.2
tools_2.15.2 zlibbioc_1.4.0
Thank you very much
J Toledo
UPenn
USA
[[alternative HTML version deleted]]
------------------------------
Message: 18
Date: Mon, 11 Feb 2013 22:39:23 -0500
From: Juliet Hannah <juliet.hannah@gmail.com>
To: Bioconductor mailing list <bioconductor@r-project.org>
Subject: Re: [BioC] statistical test for time course data
Message-ID:
<calzuzrqt0ndnwgyekfphgc4z-bpj_mvhwxcvnijq- g12_4zszq@mail.gmail.com="">
Content-Type: text/plain; charset=ISO-8859-1
Hi Gordon,
I've been curious about this as well. Can limma model the following
situation? We have two
treatments and multiple time points. We are interested in if the mean
profile for each treatment
differs over time, treating time continuously. The measurements are
over time for each individual so
we have to account for correlations within individuals. Ideally, I
would like to allow
for a random intercept and a random slope (possible quadratic) if
needed.
EDGE uses splines, so that would be nice as well.
I am aware of duplicateCorrelation, Is this the way to proceed?
I'll post a reproducible example for guidance if you indicate that the
above is possible.
Thanks for your time and for your work on limma, which I use
frequently.
On Fri, Feb 1, 2013 at 6:46 PM, Gordon K Smyth <smyth@wehi.edu.au>
wrote:
> Dear Chris,
>
> Yes, limma can easily test for a difference at one point, or can
test for a
> significant change over the whole time course like EDGE.
>
> I don't understand your experiment well enough to give more specific
advice.
> You would need to tell us your experimental design, in terms of the
targets
> frame, and exactly what you want to test.
>
> Best wishes
> Gordon
>
>> Date: Fri, 1 Feb 2013 12:19:26 +0900
>> From: chris Jhon <cjhon217@gmail.com>
>> To: Richard Friedman <friedman@cancercenter.columbia.edu>
>> Cc: Bioconductor mailing list <bioconductor@r-project.org>
>> Subject: Re: [BioC] statistical test for time course data
>>
>> Hi Richard,
>>
>> Thank you for help.
>> In my data ,i have one point which i think it is different from
other
>> points and i would like to test statistical significance of the
difference
>> of this point.
>> Your suggestion means that there is no direct function in R that i
can
>> use,i have to use a package which implement an algorithm.
>> If so, i think edgeR can do the same analysis too,Am i right?
>>
>> Best Reagards,
>> Chris
>>
>> On Thu, Jan 31, 2013 at 11:53 PM, Richard Friedman <
>> friedman@cancercenter.columbia.edu> wrote:
>>
>>> Dear Chris,
>>>
>>> Limma can be used to test between time points
>>> treating each time point as a categorical variable.
>>> The program "EDGE" from the Storey lab, can test whether
>>> there is significant change over a whole time course.
>>>
>>> http://www.ncbi.nlm.nih.gov/pubmed/16357033
>>>
>>> with hopes that the above helps,
>>> Rich
>>> Richard A. Friedman, PhD
>>> Associate Research Scientist,
>>> Biomedical Informatics Shared Resource
>>> Herbert Irving Comprehensive Cancer Center (HICCC)
>>> Lecturer,
>>> Department of Biomedical Informatics (DBMI)
>>> Educational Coordinator,
>>> Center for Computational Biology and Bioinformatics (C2B2)/
>>> National Center for Multiscale Analysis of Genomic Networks
(MAGNet)/
>>> Columbia Initiative in Systems Biology
>>> Room 824
>>> Irving Cancer Research Center
>>> Columbia University
>>> 1130 St. Nicholas Ave
>>> New York, NY 10032
>>> (212)851-4765 (voice)
>>> friedman@cancercenter.columbia.edu
>>> http://cancercenter.columbia.edu/~friedman/
>>>
>>> "Complex numbers! Ha! Ha! There is nothing weirder
>>> than imaginary numbers. Architects don't need to know
>>> complex numbers. Whenever I get a negative root for
>>> an area, I throw it out. And don't talk to me about
>>> quaternions. I am not going into computer animation."
>>> -Rose Friedman, age 16
>>>
>>>
>>> On Jan 30, 2013, at 11:43 PM, chris Jhon wrote:
>>>
>>> Hi All,
>>>
>>> I have data at different time points for time course experiment.
>>> I have a response for each time point and i would like to test
whether
>>> the
>>> difference between response of two time points is statistically
>>> significant
>>> or not.
>>> my data is linear plot where response on y axis and time on x
axis.
>>>
>>> what statistical test shall i use?
>>>
>>>
>>> I appreciate any help.
>>>
>>> Best Regards,
>>> Chris
>>>
>
>
______________________________________________________________________
> The information in this email is confidential and
intend...{{dropped:4}}
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Message: 19
Date: Mon, 11 Feb 2013 23:31:31 -0500
From: Steve Lianoglou <mailinglist.honeypot@gmail.com>
To: Juliet Hannah <juliet.hannah@gmail.com>
Cc: Bioconductor mailing list <bioconductor@r-project.org>
Subject: Re: [BioC] statistical test for time course data
Message-ID:
<caha9mcosvlmerwxije_qnrpbqr0tc0qkb=y=kbmxeosf- 9xo+g@mail.gmail.com="">
Content-Type: text/plain; charset=ISO-8859-1
Hi Juliet,
On Mon, Feb 11, 2013 at 10:39 PM, Juliet Hannah
<juliet.hannah@gmail.com> wrote:
> Hi Gordon,
>
> I've been curious about this as well. Can limma model the following
> situation?
[snip]
I think you'll find a thread that appeared a bit after the one you are
replying to:
http://thread.gmane.org/gmane.science.biology.informatics.conductor/46
188
If you look in Section 8.6 of the limmaUserGuide (in development):
http://bioconductor.org/packages/2.12/bioc/vignettes/limma/inst/doc/us
ersguide.pdf
You'll find a section on time course analysis. 8.6.2, in particular,
uses splines.
HTH,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
------------------------------
Message: 20
Date: Mon, 11 Feb 2013 21:45:34 -0800 (PST)
To: "bioconductor@r-project.org" <bioconductor@r-project.org>
Subject: Re: [BioC] edgeR cpm filtering
Message-ID:
<1360647934.77288.YahooMailNeo@web162106.mail.bf1.yahoo.com>
Content-Type: text/plain
Hi Jim,
I posted my first question as a guest and just became a
member today, so it???s a bit of learning curve for me on how to use
the list. Sorry
for the email and I hope I am using the list correctly now by emailing
this
address.
About my edgeR question:
Thank you for your elaboration and also the example. I can
see that in your example, you have samples with 0 reads and logFC is
okay, and
that???s what logically should be. However, in my dataset, I see many
cases of
logFC of ~144269489 (or negative of ~ this value). When I check the
genes, I see
that these are the cases where all the replicates of one samples have
0 reads
mapped to them, whereas the other groups of samples have many reads.
These are
the cases that cpm didn???t filter them. That???s why I tried to use
more restrictive
cpm filtering to get rid of these genes.
Any thoughts on why this non-interpretive logFC cases happen
are greatly appreciated.
Thanks,
John
Hi John,
Please don't take things off-list. Even if you are not a
subscriber (and if you are using BioC stuff you should be, and you
can
always stop delivery but remain a subscriber), I believe that replying
to an existing thread will work.
I don't see any zero counts
causing a problem. Using the example for cpm() as a starting point,
and
modifying to have a set with zero counts, I get this:
> y
?? ?? [,1] [,2] [,3] [,4]
[1,]?? ?? 1?? ?? 2?? 14?? 11
[2,]?? 11?? 25?? ?? 1?? 26
[3,]?? ?? 1?? 22?? ?? 2?? 19
[4,]?? ?? 5?? ?? 6?? 15?? ?? 6
[5,]?? ?? 0?? ?? 0?? ?? 1?? ?? 5
> d <-DGEList(counts=y, lib.size=1001:1004, group=factor(c(1,1,2,2)))
> d <- estimateCommonDisp(d)
> d <- estimateTagwiseDisp(d)
> topTags(exactTest(d))
Comparison of groups:?? 2-1
?? ?? ?? logFC?? logCPM?? ?? ?? PValue?? ?? ?? ?? ?? FDR
1?? 2.9550376 12.76964 6.109348e-05 0.0003054674
5?? 4.6421574 10.54712 1.283343e-01 0.3208358043
4?? 0.9149142 12.96222 2.668415e-01 0.4447357815
2 -0.4149407 13.93933 8.539261e-01 0.9783799675
3 -0.1325391 13.42121 9.783800e-01 0.9783799675
So
the sample with zero counts (sample 5), is the second row in the
topTags() output, and it has no problem computing a logFC value.
Best,
Jim
On 2/11/2013 4:30 PM, John Sperry wrote:
> Hi again Jim,
>
>
One more thing, in microarray days, people used to add a small value,
let say 1 to the 0 values to avoid non-sense fold changes. It's not
the
case in NGS any more right? so it's not possible to do that in edgeR,
right? that's why I was thinking about filtering out with cpm.
>
> Thanks,
> John
>
>
>
> --------------------------------------------------------------------
----
> *To:* "jmacdon@uw.edu" <jmacdon@uw.edu>
> *Sent:* Monday, February 11, 2013 1:47 PM
> *Subject:* [BioC] edgeR cpm filtering
>
> Hi Jim,
>
> I'm very new to edgeR and BioC. I couldn't reply to your post in
BioC, so here is my post in an email :D
>
> I still cannot see why 1M is chosen, but I appreciate your
explanations.
>
>
About the cpm filtering, the reason that I chose '> 2' for 3 samples
with each having 2 replicates was that I though edgeR must be smart
enough to figure out that when I say more than 5 reads per million for
[[elided Yahoo spam]]
[[elided Yahoo spam]]
>
>
as for the reason for wanting to get rid of the sample 3 with 2
replicates that have 0 reads mapped to them, I don't want them,
because,
they cause the logFC to become huge non-sense numbers! i guess
dividing
be 0 causes the problem! so I thought for not seeing weird values
when
the significant genes are selected, it's better to get rid of genes
that
have 0 reads mapped to any of their groups. Does it make sense?
>
> d_DGEList<- d_DGEList[rowSums(cpm_filtered> 5)> 2,]
>
> Thanks,
> John
>
>
-- James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
[[alternative HTML version deleted]]
------------------------------
Message: 21
Date: Tue, 12 Feb 2013 01:40:55 -0500
From: Steve Lianoglou <mailinglist.honeypot@gmail.com>
Cc: "bioconductor@r-project.org" <bioconductor@r-project.org>
Subject: Re: [BioC] edgeR cpm filtering
Message-ID:
<caha9mcnreg0zpcykxpodjiemkjc0e9e7mawhftnbuqd1c1sk1q@mail.gmail.com>
Content-Type: text/plain; charset=windows-1252
Hi John,
e:
> Hi Jim,
>
> I posted my first question as a guest and just became a
> member today, so it?s a bit of learning curve for me on how to use
the list. Sorry
> for the email and I hope I am using the list correctly now by
emailing this
> address.
>
> About my edgeR question:
>
> Thank you for your elaboration and also the example. I can
> see that in your example, you have samples with 0 reads and logFC is
okay, and
> that?s what logically should be. However, in my dataset, I see many
cases of
> logFC of ~144269489 (or negative of ~ this value). When I check the
genes, I see
> that these are the cases where all the replicates of one samples
have 0 reads
> mapped to them, whereas the other groups of samples have many reads.
These are
> the cases that cpm didn?t filter them. That?s why I tried to use
more restrictive
> cpm filtering to get rid of these genes.
>
> Any thoughts on why this non-interpretive logFC cases happen
> are greatly appreciated.
>From the looks of your previous (original) email, it looks like
you've
landed w/ a (very) strange bioconductor setup given the version of R
you are using -- you are using R-2.15.0, with edgeR version 2.6.0
Please update to the latest version of R (2.15.2 -- cuz, hey it's a
good idea), but more importantly, the latest version of edgeR:
http://bioconductor.org/packages/2.11/bioc/html/edgeR.html
You should be installing your bioconductor packages w/ the
BiocInstaller package/biocLite functionality. Unwinding yourself from
the bizarre package versions you have running to ensure that you are
on the latest bioc release train (2.11) might get a bit tricky, and
sometimes the easiest way is to blow out your R install and start from
scratch.
After you do that, do you still get such extreme logFC's?
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
------------------------------
Message: 22
Date: Tue, 12 Feb 2013 07:55:21 +0000
From: "Hahne, Florian" <florian.hahne@novartis.com>
To: "ttriche@usc.edu" <ttriche@usc.edu>
Cc: "bioc-sig-sequencing-request@r-project.org"
<bioc-sig-sequencing-request@r-project.org>, Bogdan Tanasa
<tanasa@gmail.com>, Bioconductor mailing list
<bioconductor@r-project.org>
Subject: Re: [BioC] question about Gviz
Message-ID:
<aabaab0b27af5c418fe809f352fb7ba408486bed@023-db3mpn1-082.023d .mgd.msft.net="">
Content-Type: text/plain
Cool, thanks. Just learned something again today :-)
I will take a closer look at the Homo.sapiens package (and the other
new organism annotation packages as well) to figure out how to
integrate this more closely into the Gviz package. One should not have
to jump through all these hoops to get gene symbols in there if all
the necessary information is already on your fingertips.
Florian
--
From: "<tim triche="">", "Jr."
<tim.triche@gmail.com<mailto:tim.triche@gmail.com>>
Reply-To: "ttriche@usc.edu<mailto:ttriche@usc.edu>"
<ttriche@usc.edu<mailto:ttriche@usc.edu>>
Date: Monday, February 11, 2013 8:39 PM
To: NIBR
<florian.hahne@novartis.com<mailto:florian.hahne@novartis.com>>
Cc: Bogdan Tanasa <tanasa@gmail.com<mailto:tanasa@gmail.com>>,
Bioconductor mailing list
<bioconductor@r-project.org<mailto:bioconductor@r-project.org>>,
"bioc-sig-sequencing-request@r-project.org<mailto:bioc-sig-sequencing- request@r-project.org="">" <bioc-sig-sequencing- request@r-project.org<mailto:bioc-sig-sequencing-="" request@r-project.org="">>
Subject: Re: [BioC] question about Gviz
Oops, critical step omitted. See below.
## see ?select for more on the following
##
library(Homo.sapiens)
symbolMappings <- select(Homo.sapiens,
cols=c('UCSCKG','ENTREZID','SYMBOL'),
keys=symbol(txtr),
keytype='UCSCKG')
## "pork out" the mappings table so that duplicates expand
##
symbolMappings <- symbolMappings[match(symbol(txtr),
symbolMappings$UCSCKG),]
## if NAs could be accepted as keys, this next step might be
unnecessary
##
hasHugoSymbol <- !is.na<http: is.na="">(symbolMappings$SYMBOL)
## however, it seems that not all UCSC known genes have a Hugo symbol?
##
txtr <- txtr[ hasHugoSymbol ]
symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ]
plotTracks(txtr)
The above (minus comments) is what I used to generate a test result.
On Mon, Feb 11, 2013 at 11:34 AM, Tim Triche, Jr.
<tim.triche@gmail.com<mailto:tim.triche@gmail.com>> wrote:
Well, here's one approach. I'll start from the constructed 'txtr'
track.
## see ?select for more on the following
##
library(Homo.sapiens)
symbolMappings <- select(Homo.sapiens,
cols=c('UCSCKG','ENTREZID','SYMBOL'),
keys=symbol(txtr),
keytype='UCSCKG')
## "pork out" the mappings table so that duplicates expand
##
symbolMappings <- symbolMappings[match(symbol(txtr),
symbolMappings$UCSCKG),]
## if NAs can be accepted as keys, this next step might be unnecessary
## however, it seems that not all UCSC known genes have a Hugo symbol?
##
txtr <- txtr[ hasHugoSymbol ]
symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ]
plotTracks(txtr)
That works.
On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian
<florian.hahne@novartis.com<mailto:florian.hahne@novartis.com>> wrote:
Well, the main culprit here is really the TranscriptDB package. It
does
not seem to deal at all with gene symbols, so there is no way for Gviz
to
automatically fetch those. If you come up with a way to match the UCSC
gene identifiers back to gene symbols you could stick those into the
GeneRegionTrack using the 'symbol' replacement method. E.g.,
symbol(foo) <- mappingTable[match(transcript(foo),
mappingTable$UCSCId),])
I am not sure how this mapping is supposed to be done in the
Bioconductor
world these days. You may be able to find a way using one of the
org.db
packages. Or maybe you will have to download a mapping table directly
from
the UCSC table browser.
With the next release of Bioconductor (or already now if you are
working
with the devel branch), Gviz supports building tracks from a whole
range
of standard annotation files. You could then export the whole
known.gene
table from UCSC as a GTF file and import in again as a GeneRegionTrack
object. Alternatively you may want to look into the
BiomartGeneRegionTrack
class which will fetch the gene models from Ensembl, but this includes
the
HUGO gene symbols.
Florian
--
On 2/7/13 10:51 AM, "Bogdan Tanasa"
<tanasa@gmail.com<mailto:tanasa@gmail.com>> wrote:
>Dear all,
>
>I am using the following code below in order to retrieve the gene
>annotations in Gviz package :
>please could advise on what shall I modify in order to display the
HUGO
[[elided Yahoo spam]]
>
>library("GenomicFeatures")
>hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename =
>"knownGene")
>saveDb(hg18db, file="hg18db_knownGene.sqlite")
>txdb <-loadDb("hg18db_knownGene.sqlite")
>txTr <-
>GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE,gene
Symbo
>l=TRUE,name="UCSC")
>plotTracks(txTr,from=start,to=end)
>
>-- bogdan
>------------------
>
> [[alternative HTML version deleted]]
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor@r-project.org<mailto:bioconductor@r-project.org>
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor
_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org<mailto:bioconductor@r-project.org>
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
A model is a lie that helps you see the truth.
Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
--
A model is a lie that helps you see the truth.
Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
[[alternative HTML version deleted]]
------------------------------
Message: 23
Date: Tue, 12 Feb 2013 00:04:23 -0800
From: Bogdan Tanasa <tanasa@gmail.com>
To: "Hahne, Florian" <florian.hahne@novartis.com>
Cc: Bioconductor mailing list <bioconductor@r-project.org>,
"bioc-sig-sequencing-request@r-project.org"
<bioc-sig-sequencing-request@r-project.org>
Subject: Re: [BioC] question about Gviz
Message-ID:
<ca+jem018udk357wghxa14wp_ncz3qwaesr- cn00pwkrt5w497w@mail.gmail.com="">
Content-Type: text/plain
Thank you again gentlemen for your kind help ...
On Mon, Feb 11, 2013 at 11:55 PM, Hahne, Florian
<florian.hahne@novartis.com> wrote:
> Cool, thanks. Just learned something again today :-)
> I will take a closer look at the Homo.sapiens package (and the other
new
> organism annotation packages as well) to figure out how to integrate
this
> more closely into the Gviz package. One should not have to jump
through all
> these hoops to get gene symbols in there if all the necessary
information
> is already on your fingertips.
> Florian
> --
>
>
> From: "<tim triche="">", "Jr." <tim.triche@gmail.com>
> Reply-To: "ttriche@usc.edu" <ttriche@usc.edu>
> Date: Monday, February 11, 2013 8:39 PM
> To: NIBR <florian.hahne@novartis.com>
> Cc: Bogdan Tanasa <tanasa@gmail.com>, Bioconductor mailing list <
> bioconductor@r-project.org>, "bioc-sig-sequencing-
request@r-project.org" <
> bioc-sig-sequencing-request@r-project.org>
> Subject: Re: [BioC] question about Gviz
>
> Oops, critical step omitted. See below.
>
> ## see ?select for more on the following
> ##
> library(Homo.sapiens)
> symbolMappings <- select(Homo.sapiens,
>
> cols=c('UCSCKG','ENTREZID','SYMBOL'),
> keys=symbol(txtr),
> keytype='UCSCKG')
>
> ## "pork out" the mappings table so that duplicates expand
> ##
> symbolMappings <- symbolMappings[match(symbol(txtr),
>
> symbolMappings$UCSCKG),]
>
> ## if NAs could be accepted as keys, this next step might be
unnecessary
> ##
> hasHugoSymbol <- !is.na(symbolMappings$SYMBOL)
>
> ## however, it seems that not all UCSC known genes have a Hugo
symbol?
> ##
> txtr <- txtr[ hasHugoSymbol ]
> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ]
> plotTracks(txtr)
>
>
> The above (minus comments) is what I used to generate a test
result.
>
>
>
>
> On Mon, Feb 11, 2013 at 11:34 AM, Tim Triche, Jr.
<tim.triche@gmail.com>wrote:
>
>> Well, here's one approach. I'll start from the constructed 'txtr'
track.
>>
>> ## see ?select for more on the following
>> ##
>> library(Homo.sapiens)
>> symbolMappings <- select(Homo.sapiens,
>>
>> cols=c('UCSCKG','ENTREZID','SYMBOL'),
>> keys=symbol(txtr),
>> keytype='UCSCKG')
>>
>> ## "pork out" the mappings table so that duplicates expand
>> ##
>> symbolMappings <- symbolMappings[match(symbol(txtr),
>>
>> symbolMappings$UCSCKG),]
>>
>> ## if NAs can be accepted as keys, this next step might be
unnecessary
>> ## however, it seems that not all UCSC known genes have a Hugo
symbol?
>> ##
>> txtr <- txtr[ hasHugoSymbol ]
>> symbol(txtr) <- symbolMappings$SYMBOL[ hasHugoSymbol ]
>> plotTracks(txtr)
>>
>> That works.
>>
>>
>>
>>
>> On Mon, Feb 11, 2013 at 6:03 AM, Hahne, Florian <
>> florian.hahne@novartis.com> wrote:
>>
>>> Well, the main culprit here is really the TranscriptDB package. It
does
>>> not seem to deal at all with gene symbols, so there is no way for
Gviz to
>>> automatically fetch those. If you come up with a way to match the
UCSC
>>> gene identifiers back to gene symbols you could stick those into
the
>>> GeneRegionTrack using the 'symbol' replacement method. E.g.,
>>> symbol(foo) <- mappingTable[match(transcript(foo),
>>> mappingTable$UCSCId),])
>>> I am not sure how this mapping is supposed to be done in the
Bioconductor
>>> world these days. You may be able to find a way using one of the
org.db
>>> packages. Or maybe you will have to download a mapping table
directly
>>> from
>>> the UCSC table browser.
>>> With the next release of Bioconductor (or already now if you are
working
>>> with the devel branch), Gviz supports building tracks from a whole
range
>>> of standard annotation files. You could then export the whole
known.gene
>>> table from UCSC as a GTF file and import in again as a
GeneRegionTrack
>>> object. Alternatively you may want to look into the
>>> BiomartGeneRegionTrack
>>> class which will fetch the gene models from Ensembl, but this
includes
>>> the
>>> HUGO gene symbols.
>>> Florian
>>>
>>>
>>> --
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 2/7/13 10:51 AM, "Bogdan Tanasa" <tanasa@gmail.com> wrote:
>>>
>>> >Dear all,
>>> >
>>> >I am using the following code below in order to retrieve the gene
>>> >annotations in Gviz package :
>>> >please could advise on what shall I modify in order to display
the HUGO
[[elided Yahoo spam]]
>>> >
>>> >library("GenomicFeatures")
>>> >hg18db <- makeTranscriptDbFromUCSC(genome = "hg18",tablename =
>>> >"knownGene")
>>> >saveDb(hg18db, file="hg18db_knownGene.sqlite")
>>> >txdb <-loadDb("hg18db_knownGene.sqlite")
>>> >txTr <-
>>>
>>> >GeneRegionTrack(txdb,genome="hg18",chromosome="chr9",showId=TRUE,
geneSymbo
>>> >l=TRUE,name="UCSC")
>>> >plotTracks(txTr,from=start,to=end)
>>> >
>>> >-- bogdan
>>> >------------------
>>> >
>>> > [[alternative HTML version deleted]]
>>> >
>>> >_______________________________________________
>>> >Bioconductor mailing list
>>> >Bioconductor@r-project.org
>>> >https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> >Search the archives:
>>> >http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor@r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>>
>> --
>> *A model is a lie that helps you see the truth.*
>> *
>> *
>> Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
>>
>
>
>
> --
> *A model is a lie that helps you see the truth.*
> *
> *
> Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
>
>
[[alternative HTML version deleted]]
------------------------------
_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
End of Bioconductor Digest, Vol 120, Issue 12
*********************************************
[[alternative HTML version deleted]]