Entering edit mode
Sarah Pohl
▴
30
@sarah-pohl-6107
Last seen 10.4 years ago
Hey Mark,
I had a self-made file, but this one would also be fine if it worked.
This time, this happened:
> txdb <- makeTranscriptDbFromGFF(file="NC_008463_ncbi.gff",
format="gff3", dataSource="CDS", species="Pseudomonas aeruginosa",
chrominfo=inf)
extracting transcript information
Error in .prepareGFF3TXS(data) :
No Transcript information present in gff file
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods base
other attached packages:
[1] BSgenome_1.28.0 GenomicFeatures_1.12.3
AnnotationDbi_1.22.6
[4] Biobase_2.20.1 VariantAnnotation_1.6.7 Rsamtools_1.12.3
[7] Biostrings_2.28.0 GenomicRanges_1.12.4 IRanges_1.18.3
[10] BiocGenerics_0.6.0
loaded via a namespace (and not attached):
[1] biomaRt_2.16.0 bitops_1.0-6 DBI_0.2-7
RCurl_1.95-4.1 RSQLite_0.11.4
[6] rtracklayer_1.20.4 stats4_3.0.1 tools_3.0.1
XML_3.98-1.1 zlibbioc_1.6.0
What transcript information should be present in the gff3 file??
Sarah
> ------------------------------
>
> Message: 5
> Date: Fri, 23 Aug 2013 10:23:49 -0700
> From: Marc Carlson <mcarlson at="" fhcrc.org="">
> To: bioconductor at r-project.org
> Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria
> genomes
> Message-ID: <52179AA5.8070206 at fhcrc.org>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
> Thank you Sarah,
>
> That is much better. Is this the file you were parsing here?
>
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Pseudomonas_aeruginosa_U
CBPP_PA14_uid57977/NC_008463.gff
>
>
> Marc
On 08/23/2013 03:49 AM, Sarah Pohl wrote:
> Hey Marc,
>
> I'm sorry, I came here via gmane.org and didn't see the posting
guide. I'll attach the relevant information this time.
> I tried with the chrominfo argument, and in a sense it works. At
least there's no error about the missing chromosome size now. The main
error stays the same, though.
>
> I checked my gff3 file with http://modencode.oicr.on.ca/cgi-
bin/validate_gff3_online yesterday and according to them it is fine.
>
> Here's the code:
> library(VariantAnnotation)
> library(GenomicFeatures)
> library(BSgenome)
> inf <- data.frame(cbind("NC_008463", 6537648, TRUE))
> txdb <- makeTranscriptDbFromGFF(file="//CPI-
SL64001/spo12/BSgenome/annotation/NC_008463.gff", format="gff3",
dataSource="CDS", species="Pseudomonas aeruginosa", chrominfo=inf)
>
> the error:
> Prepare the 'metadata' data frame ... metadata: OK
> Error in is.data.frame(arg) : object 'tables' not found
>
> and the session info:
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
LC_MONETARY=German_Germany.1252
> [4] LC_NUMERIC=C LC_TIME=German_Germany.1252
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods base
>
> other attached packages:
> [1] BSgenome_1.28.0 GenomicFeatures_1.12.3
AnnotationDbi_1.22.6
> [4] Biobase_2.20.1 VariantAnnotation_1.6.7
Rsamtools_1.12.3
> [7] Biostrings_2.28.0 GenomicRanges_1.12.4 IRanges_1.18.3
> [10] BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.16.0 bitops_1.0-6 DBI_0.2-7
RCurl_1.95-4.1 RSQLite_0.11.4
> [6] rtracklayer_1.20.4 stats4_3.0.1 tools_3.0.1
XML_3.98-1.1 zlibbioc_1.6.0
> Date: Thu, 22 Aug 2013 11:27:39 -0700
> From: Marc Carlson <mcarlson at="" fhcrc.org="">
> To: bioconductor at r-project.org
> Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria
> genomes
> Message-ID: <5216581B.8090608 at fhcrc.org>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>
>
> On 08/22/2013 02:12 AM, Sarah Pohl wrote:
>> Cook, Malcolm <mec at="" ...=""> writes:
>>
>>> FYI, bioperl includes bp_genbank2gff3.pl
>>>
>>> which when run as
>>>
>>>> bp_genbank2gff3.pl NC_011025.gbk
>>> produces NC_011025.gbk.gff (attached)
>>>
>>> which loaded without error with transcript:
>>>
>>>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff",
format="gff3",
>> dataSource="NCBI",
>>> species="Some bact")
>>> extracting transcript information
>>> Extracting gene IDs
>>> extracting transcript information
>>> Processing splicing information for gff3 file.
>>> Deducing exon rank from relative coordinates provided
>>> Prepare the 'metadata' data frame ... metadata: OK
>>> Now generating chrominfo from available sequence names. No
chromosome
>> length information is available.
>>> Warning messages:
>>> 1: In .deduceExonRankings(exs, format = "gff") :
>>> Infering Exon Rankings. If this is not what you expected,
then please
>> be sure that you have provided a valid
>>> attribute for exonRankAttributeName
>>> 2: In matchCircularity(chroms, circ_seqs) :
>>> None of the strings in your circ_seqs argument match your
seqnames.
>>>> txdb
>>> TranscriptDb object:
>>> | Db type: TranscriptDb
>>> | Supporting package: GenomicFeatures
>>> | Data source: NCBI
>>> | Genus and Species: Some bact
>>> | miRBase build ID: NA
>>> | transcript_nrow: 631
>>> | exon_nrow: 631
>>> | cds_nrow: 631
>>> | Db created by: GenomicFeatures package from Bioconductor
>>> | Creation time: 2013-06-07 14:52:50 -0500 (Fri, 07 Jun 2013)
>>> | GenomicFeatures version at creation time: 1.10.2
>>> | RSQLite version at creation time: 0.11.2
>>> | DBSCHEMAVERSION: 1.0
>> Hey,
>>
>> I know I'm a bit late for this discussion, but I have a similar
problem.
>>
>> I have a bacterial GBK file which I tried to convert using the
>> bp_genbank2gff3.pl script,
>> perl bp_genbank2gff3.pl annotation/NC_008463.gbk -o
annotation/
>> but I got the following error:
>> "Can't call method "binomial" on an undefined value at
bp_genbank2gff3.pl
>> line 672, <fh> line 208948."
>> So instead I converted it with Biopython and the BCBio module,
which worked
>> fine.
>> Only now, when I try to load it with makeTranscriptDbFromGFF,
>> txdb <- makeTranscriptDbFromGFF(file="NC_008463.gff",
format="gff3",
>> dataSource="CDS", species="Pseudomonas aeruginosa")
>> I also get an error:
>> Error in unique(tables[["transcripts"]][["tx_chrom"]]) :
>> 'unique': Error: object 'tables' not found
>>
>> Why does this happen and what can I do about it?
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
> Hi Sarah,
>
> It's hard to help you because it's pretty difficult to know what
> actually happened after reading your post. I can't be sure if the
other
> scripts you mention produced a valid gff3 file and I have no idea
which
> version of the software you are using. Please see our posting guide
here:
>
> http://www.bioconductor.org/help/mailing-list/posting-guide/
>
> But I will go out on a limb anyways and guess (based only the error
code
> in your message), that your problem might get better if you passed
in a
> value to the chrominfo argument. You can see an example of how to
use
> that argument in the manual page by pulling the manual page up like
this:
>
> help(makeTranscriptDbFromGFF)
>
> Hope this helps,
>
>
> Marc
>
> ________________________________
>
> Helmholtz-Zentrum f?r Infektionsforschung GmbH | Inhoffenstra?e 7 |
38124 Braunschweig | www.helmholtz-hzi.de
> Das HZI ist seit 2007 zertifiziertes Mitglied im "audit
berufundfamilie"
>
> Vorsitzende des Aufsichtsrates: MinDir?in B?rbel Brumme-Bothe,
Bundesministerium f?r Bildung und Forschung
> Stellvertreter: R?diger Eichel, Abteilungsleiter Nieders?chsisches
Ministerium f?r Wissenschaft und Kultur
> Gesch?ftsf?hrung: Prof. Dr. Dirk Heinz; Ulf Richter, MBA
> Gesellschaft mit beschr?nkter Haftung (GmbH)
> Sitz der Gesellschaft: Braunschweig
> Handelsregister: Amtsgericht Braunschweig, HRB 477
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Message: 6
Date: Fri, 23 Aug 2013 11:27:39 -0700
From: Dan Tenenbaum <dtenenba@fhcrc.org>
To: Alleene <astrickland at="" med.miami.edu="">
Cc: Xi Wang <xi.wang at="" newcastle.edu.au="">,
"bioconductor at stat.math.ethz.ch bioconductor"
<bioconductor at="" stat.math.ethz.ch="">
Subject: Re: [BioC] SeqGSEA estiGeneNBstat()
Message-ID:
<caf42j23htulrxs=5vstqaoc2etpqqt8r1hkcij5xpwzu8dmfha at="" mail.gmail.com="">
Content-Type: text/plain; charset="ISO-8859-1"
CC'ing the maintainer. I think this person is referring to this post:
https://stat.ethz.ch/pipermail/bioconductor/2013-July/053717.html
Dan
On Wed, Aug 21, 2013 at 1:03 PM, Alleene <astrickland at="" med.miami.edu=""> wrote:
> I am having the same issue - any help would be appreciated.
>
> Thank you!
> -Alleene
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Message: 7
Date: Fri, 23 Aug 2013 11:31:46 -0700
From: Valerie Obenchain <vobencha@fhcrc.org>
To: Dan Du <tooyoung at="" gmail.com="">
Cc: Michael Lawrence <lawrence.michael at="" gene.com="">,
Bioconductor List
<bioconductor at="" stat.math.ethz.ch="">
Subject: Re: [BioC] Semantics of GenomicRanges gaps()
Message-ID: <5217AA92.9040703 at fhcrc.org>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Hi Dan,
Thanks for catching this. Herve is currently out of town. We'll put
this
on the TODO and address it when he is back.
Valerie
On 08/21/2013 12:13 AM, Dan Du wrote:
> Hi all,
>
> Sorry to wake up this sleeping beauty.
>
> Just something more to add and consider, there is another
inconsistency
> in the gaps function for GRanges, when there is no seqinfo present.
The
> default gaps will not produce extra ranges for those unused strands,
> unless one specify the "end" argument, however only setting "start"
> won't trigger the extra results. This is the current behavior of
both
> release and dev branch,
>
> ## rls
> GenomicRanges_1.13.36 IRanges_1.19.24
> ## dev
> GenomicRanges_1.12.4 IRanges_1.18.3
>
> ## case without seqinfo
> g<-GRanges(seqnames="1", IRanges(3, 10), strand='*')
> seqinfo=Seqinfo("1",seqlengths=1000))
> g
> GRanges with 1 range and 0 metadata columns:
> seqnames ranges strand
> <rle> <iranges> <rle>
> [1] 1 [3, 10] *
> ---
> seqlengths:
> 1
> NA
>
> ## default, no extra
> gaps(g) # this is expected, start always defaults to 1L
> GRanges with 1 range and 0 metadata columns:
> seqnames ranges strand
> <rle> <iranges> <rle>
> [1] 1 [1, 2] *
> ---
> seqlengths:
> 1
> NA
>
> ## with start, no extra
> gaps(g, start=2)
> GRanges with 1 range and 0 metadata columns:
> seqnames ranges strand
> <rle> <iranges> <rle>
> [1] 1 [2, 2] *
> ---
> seqlengths:
> 1
> NA
> ## with end, strand comes into play, giving two more
> gaps(g, end=20)
> GRanges with 4 ranges and 0 metadata columns:
> seqnames ranges strand
> <rle> <iranges> <rle>
> [1] 1 [ 1, 20] +
> [2] 1 [ 1, 20] -
> [3] 1 [ 1, 2] *
> [4] 1 [11, 20] *
> ---
> seqlengths:
> 1
> NA
>
> ## with both start and end, same as above
> gaps(g, start=2, end=20) #
> GRanges with 4 ranges and 0 metadata columns:
> seqnames ranges strand
> <rle> <iranges> <rle>
> [1] 1 [ 2, 20] +
> [2] 1 [ 2, 20] -
> [3] 1 [ 2, 2] *
> [4] 1 [11, 20] *
> ---
> seqlengths:
> 1
> ## reset seqinfo
> seqinfo(g)<-Seqinfo("1",seqlengths=1000)
>
> ## now the extra ranges are back
> gaps(g, start=2)
> GRanges with 4 ranges and 0 metadata columns:
> seqnames ranges strand
> <rle> <iranges> <rle>
> [1] 1 [ 2, 1000] +
> [2] 1 [ 2, 1000] -
> [3] 1 [ 2, 2] *
> [4] 1 [11, 1000] *
> ---
> seqlengths:
> 1
> 1000
>
> Best,
> Dan
>
> On Mon, 2013-06-03 at 05:54 -0700, Michael Lawrence wrote:
>> Not sure how confusing this would be, but a common case is (like
Alex's
>> input) a set of ranges where the strand is uniform, i.e., all
ranges have
>> strand 'x'. In that case, gaps,GenomicRanges could behave just like
>> gaps,Ranges and return only ranges on strand 'x'. That would
satisfy both
>> properties.
>>
>> A natural extension would be that if all ranges are '+' or '-'
(none of
>> them are '*'), then do not add the '*' range spanning the whole
sequence.
>> Also satisfies both properties.
>>
>> I only know of one use-case where all three strand types are mixed:
>> rna-seq, where the strand has been inferred only from the spliced
>> alignments. Not if gaps() would even be useful in that case, but if
it
>> were, I think the current gaps behavior would make sense.
>>
>> Michael
>>
>>
>>
>>
>> On Sun, Jun 2, 2013 at 11:36 PM, Herv Pags <hpages at="" fhcrc.org="">
wrote:
>>
>>> Hi Alex,
>>>
>>>
>>> On 05/31/2013 01:46 AM, Alex Gutteridge wrote:
>>>
>>>> I just have a quick question/comment about the behaviour of the
gaps
>>>> function in the GenomicRanges package, particularly with how it
handles
>>>> * strand ranges. The current behaviour is as below:
>>>>
>>>> range = GRanges(seqnames="1",IRanges(**start=300,end=700),
strand="*",
>>>>> seqinfo=Seqinfo("1",**seqlengths=1000))
>>>>> range
>>>>>
>>>> GRanges with 1 range and 0 metadata columns:
>>>> seqnames ranges strand
>>>> <rle> <iranges> <rle>
>>>> [1] 1 [300, 700] *
>>>> ---
>>>> seqlengths:
>>>> 1
>>>> 1000
>>>>
>>>>> gaps(range)
>>>>>
>>>> GRanges with 4 ranges and 0 metadata columns:
>>>> seqnames ranges strand
>>>> <rle> <iranges> <rle>
>>>> [1] 1 [ 1, 1000] +
>>>> [2] 1 [ 1, 1000] -
>>>> [3] 1 [ 1, 299] *
>>>> [4] 1 [701, 1000] *
>>>> ---
>>>> seqlengths:
>>>> 1
>>>> 1000
>>>>
>>>> My interpretation of "*" as a strand identifier is that it means
the
>>>> range covers both + and - strands and so intuitively I was
expecting the
>>>> 'gaps' to only cover the 3rd and 4th ranges returned above (not
the
>>>> full-length + and - strand ranges).
>>>>
>>>
>>> It's even worse than that. If there is no range at all on a
chromosome,
>>> gaps(gr) will return 3 ranges covering the full chromosome:
>>>
>>> > gr <- GRanges("chr1",
>>> IRanges(start=300,end=700),
>>> seqlengths=c(chr1=1000,chr2=**1000))
>>>
>>> > gr
>>>
>>> GRanges with 1 range and 0 metadata columns:
>>> seqnames ranges strand
>>> <rle> <iranges> <rle>
>>> [1] chr1 [300, 700] *
>>> ---
>>> seqlengths:
>>> chr1 chr2
>>> 1000 1000
>>>
>>> > gaps(gr)
>>>
>>> GRanges with 7 ranges and 0 metadata columns:
>>> seqnames ranges strand
>>> <rle> <iranges> <rle>
>>> [1] chr1 [ 1, 1000] +
>>> [2] chr1 [ 1, 1000] -
>>> [3] chr1 [ 1, 299] *
>>> [4] chr1 [701, 1000] *
>>> [5] chr2 [ 1, 1000] +
>>> [6] chr2 [ 1, 1000] -
>>> [7] chr2 [ 1, 1000] *
>>> ---
>>> seqlengths:
>>> chr1 chr2
>>> 1000 1000
>>>
>>>
>>> The semantics here implies to me
>>>> that the * strand is being thought of as a kind of imaginary
third
>>>> strand and hence doesn't overlap with the + or - strands. This is
>>>> contrary to the semantics of findOverlaps which does appear to
consider
>>>> a * strand range to overlap with + or - strand ranges:
>>>>
>>>> findOverlaps(range,gaps(range)**)
>>>>>
>>>> Hits of length 2
>>>> queryLength: 1
>>>> subjectLength: 4
>>>> queryHits subjectHits
>>>> <integer> <integer>
>>>> 1 1 1
>>>> 2 1 2
>>>>
>>>> To summarise I guess I was expecting
findOverlaps(range,gaps(range)**) to
>>>> never return any overlaps under any circumstance (that I can
think of!).
>>>>
>>>
>>> Let's call this property 2. This sounds indeed like a good
property
>>> that it would be nice to have. But an even more important property
is
>>> property 1: gaps() must be its own reverse operation i.e.
>>> 'gaps(gaps(gr))' must always return 'gr'.
>>>
>>> The current behavior of gaps() guarantees property 1. I'm not
against
>>> changing gaps() behavior to also have property 2, as long as
property
>>> 1 is preserved. Suggestions are welcome.
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>> Do people agree the behaviour of gaps() is not quite intuitive
in this
>>>> case?
>>>>
>>>> sessionInfo()
>>>>>
>>>> R version 2.15.2 (2012-10-26)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>> [7] LC_PAPER=C LC_NAME=C
>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods
base
>>>>
>>>> other attached packages:
>>>> [1] GenomicRanges_1.10.7 IRanges_1.16.6 BiocGenerics_0.4.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] parallel_2.15.2 stats4_2.15.2
>>>>
>>>>
>>> --
>>> Herv Pags
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>>
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor="">
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor="">
>>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
------------------------------
Message: 8
Date: Fri, 23 Aug 2013 14:30:32 -0700
From: Marc Carlson <mcarlson@fhcrc.org>
To: bioconductor at r-project.org
Subject: Re: [BioC] defining accessors (cols, keys, keytypes, select)
for annotation package built with AnnotationForge package
Message-ID: <5217D478.5070000 at fhcrc.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
On 08/20/2013 12:40 AM, Sashi wrote:
> Marc Carlson <mcarlson at="" ...=""> writes:
>
>> On 08/14/2013 04:58 AM, Sashi wrote:
>>> Marc Carlson <mcarlson <at=""> ...> writes:
>>>
>>>> Hi Sashi,
>>>>
>>>> The PDF from Gabor that you are looking at is much older and was
from
>>>> before we even had the select method. These days you probably
don't
>>>> want to do that. Especially if you want to implement a method
like
>>>> select(). I strongly suspect that you really just want to be
looking
> at
>>>> this vignette instead:
>>>>
>>>>
> http://www.bioconductor.org/packages/release/bioc/vignettes/Annotati
onForge/
> inst/doc/MakingNewAnnotationPackages.pdf
>>>> To answer your questions, GO is actually looking at a view that
is
>>>> created in the database of the three GO tables (one for BP, MF
and CC).
>>>> But you probably don't need that level of detail. If you are
using
>>>> org.At.tair.db to look at arabidopsis, then you may already have
>>>> everything you need. And if you need another organism, you
probably
>>>> want to look 1st at making an org package using
>>>> makeOrgPackageFromNCBI(). And if for some reason you want to
expose
>>>> some entirely new database resource (IOW you don't want to make
an
>>>> organism package but something else entirely), then you might
need to
>>>> use the vignette above.
>>>>
>>>> I hope this helps you,
>>>>
>>>> Marc
>>>>
>>>> On 08/13/2013 04:33 AM, Rameswara Sashi Kiran Challa wrote:
>>>>> Hi ,
>>>>>
>>>>> I am trying to build an annotation organism package by using
> Annotation
>>>>> Forge package. I followed this
>>>>>
> document<http: www.bioconductor.org="" packages="" 2.12="" bioc="" vignettes="" an="" notation=""> Forge/inst/doc/NewSchema.pdf>written
>>>>> by Gabor Csardi.
>>>>> I was able to build a sqlite database and create an Annotation
package
>>>>> using the makeAnnDbPkg() function.
>>>>>
>>>>> I understand cols(), keys(), keytypes() and select() are set as
> generic
>>>>> methods in AnnotationDbi.
>>>>>
>>>>> When I look into methods-AnnotationDb.R script in AnnotationDbi
> package, I
>>>>> see cols() method is set and it actually reads all the columns
of all
> the
>>>>> tables in the sqlite table.
>>>>>
>>>>> When I run *cols() *on *org.At.tair.db *I get few values which
are
>>>>> actually not field/column names in the sqlite db. For Eg. there
is no
> table
>>>>> called "GO" in org.At.tair.sqlite database. I am unable to
understand
> how
>>>>> it creates these values. Could someone please help me understand
how
> and
>>>>> where exactly these accessor functions are defined and how and
where
> are
>>>>> they to be modified to be able to access the data in the sqlite
db
> that I
>>>>> am creating for the organism I am working on.
>>>>>
>>>>> ==========================
>>>>>
>>>>>> cols(org.At.tair.db)
>>>>> [1] "TAIR" "CHRLOC" "CHRLOCEND" "ENZYME"
> "PATH"
>>>>>
>>>>> [6] "PMID" "REFSEQ" "SYMBOL" "GENENAME"
"GO"
>>>>>
>>>>>
>>>>> [11] "EVIDENCE" "ONTOLOGY" "GOALL" "EVIDENCEALL"
>>> "ONTOLOGYALL"
>>>>> [16] "ARACYC" "ARACYCENZYME" "ENTREZID" "CHR"
>>>>> =======================================
>>>>>
>>>>> Please point me to any documentation available for the same.
>>>>>
>>>>> Thanks for your time,
>>>>> Sashi
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor <at> ...
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor <at> ...
>>> Hi Marc,
>>>
>>> Thanks for your prompt reply. Referring to the document you
pointed me
> to, I
>>> created another R script within the organism package skeleton( an
R
> script
>>> apart from zzz.R) and set cols, keytypes accessor methods.
>>>
>>> As part of annotation packages Bimaps are created in every
annotation
>>> package. How do we use these Bimaps in these accessor methods? Am
I
> right in
>>> thinking that these Bimaps are to be used in these accessor
methods? Or
>>> those Bimaps have to be accessed only via get(), mget(), toTable()
> methods?
>>> Also, can you please let me know if there is any documentation
available
> on
>>> how the GO views are created? I see there are seperate tables like
> go_cc,
>>> go_mf, go_bp, etc under Arabidopsis annotation package. Is it
necessary
> to
>>> have go_cc, go_mf, go_bp, go_mf_all, like tables in the sqlite
database
> for
>>> the customized annotation package I am creating? Will not just a
single
>>> table for all GO annotations suffice?
>>>
>>> Thanks again for your time,
>>> Sashi
>> Hi Sashi,
>>
>> I really doubt that you need to think about bimaps at all. You
don't
>> need them to implement select, cols, keytypes or keys. And they
are
>> really only still supported for the sake of older legacy code. The
get,
>> mget, and toTable methods are defined to help with bimaps, but you
>> probably don't need to use these methods anyways. So it's very
unlikely
>> that you would even need to use bimaps let alone implement them.
>>
>> And the go view is just a SQLite database view. A view is sort of
like
>> a pre-canned database query that appears as a table. Our "go view"
is
>> really just the union of go_bp, go_mf, and go_cc tables. Those
three
>> separate tables allow us to still keep the different terms (from
the
>> different ontologies) as separate from each other in the database.
But
>> since we are using a view, we can also easily query all three of
them
>> (as if they were lumped together) WITHOUT actually duplicating all
that
>> data into another enormous table. And the performance for this is
still
>> great.
>>
>> You can read a bit about how SQLITE views are created here if you
are
>> curious:
>>
>> http://www.sqlite.org/lang_createview.html
>>
>> But if you are making an org package, why not just use
>> makeOrgPackageFromNCBI?
>>
>> Marc
>>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at ...
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at ...
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
> Thanks a lot Marc!!
>
> It's good to know that BioConductor community is trying to move away
from
> Bimaps and adopt cols, select, keys, keytypes methods for any sort
of
> queries. Are Reverse maps that are part of Bimaps taken care by
these
> accesors?
>
> As I understand, for older legacy code perhaps the
makeOrgPackageFromNCBI is
> also still generating Bimaps and in near future, perhaps all the
Annotation
> packages will just have a sqlite database and these accessors,
defined. Am I
> correct?
>
> I had started by looking at how to build a sqlite db with some of
the
> mappings we have and had not used makeOrgPackageFromNCBI function.
My
> thinking was that having an understanding of sqlite db building will
enable
> me to add any new mappings that are not part of NCBI.
>
> So, to summarize, for Annotation package development one approach is
using
> makeOrgPackageFromNCBI() and the other approach is to make a sqlite
db and
> then define these accessors, as given in the pdf you had linked me
to
> earlier. And there will be no need of any Bimaps for the package
development
> as such.
>
> Thanks for your time,
> -Sashi
Hi Sashi,
We don't aim to get rid of bimaps from packages that have already had
them before, but we definitely don't think that new packages need to
have anything other than cols, keys, keytypes and select. Reverse maps
are also unnecessary if you have defined those newer methods.
So yes any new stuff you are developing should really just focus on
those four accessors. The other stuff is just older stuff that you
shouldn't ever need. And if you do need it, when we need to change
something so that you no longer need it! We want to make it EASIER
for
people like you who are interested in exposing new resources.
Science is hard enough already. ;)
Marc
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Message: 9
Date: Fri, 23 Aug 2013 15:34:42 -0700
From: Valerie Obenchain <vobencha@fhcrc.org>
To: Peter Hickey <hickey at="" wehi.edu.au="">
Cc: bioconductor at r-project.org
Subject: Re: [BioC] findOverlaps() fails with type = 'equal' fails
when query is a CollapsedVCF object and subject is a GRanges
object
Message-ID: <5217E382.7090107 at fhcrc.org>
Content-Type: text/plain; charset=windows-1252; format=flowed
Hi Peter,
Thanks for catching this oversight. 'equal' was omitted when
implementing findOverlaps() methods for SummarizedExperiment. The
other
'overlap' methods call findOverlaps() so they were failing too.
Now fixed in GenomicRanges 1.13.37 (devel) and 1.12.5 (release). Both
should be available via biocLite() by Saturday noon PST.
Valerie
On 08/21/2013 09:04 PM, Peter Hickey wrote:
> Hi all,
>
> I got a bit lost trying to figure out why the following code does
not work. The same code does work when type = 'any', 'start', 'end' or
'within', but not when type = 'equal'. When type = 'equal' it fails
with one of the following:
> ## countOverlaps() error message
> Error in queryHits(findOverlaps(query, subject, maxgap = maxgap,
minoverlap = minoverlap, :
> error in evaluating the argument 'x' in selecting a method for
function 'queryHits': Error in match.arg(type) :
> 'arg' should be one of ?any?, ?start?, ?end?, ?within?
> ## findOverlaps() and overlapsAny() error message
> Error in match.arg(type) :
> 'arg' should be one of ?any?, ?start?, ?end?, ?within?
>
> So 'equal' isn't a valid option for when the subject is a
CollapsedVCF object, whereas it is a valid option for when the subject
is, say, a GRanges object, correct?
>
> Is it possible to extend findOverlaps(), countOverlaps() and
overlapsAny() to allow for type = 'equal' when the subject is a
CollapsedVCF object? Ideally this would also work if query and subject
were both of CollapsedVCF class because I'm often interested in
finding overlaps between two VCF files and I'd like to be able to do
that with type = 'equal'.
>
> Thanks
> Peter Hickey
>
> #### DESCRIPTION ####
> # Peter Hickey
> # 22/08/2013
> # findOverlaps(), countOverlaps() and overlapsAny() don't work when
`query` is a CollapsedVCF object, `subject` is a GRanges object and
`type` is 'equal' yet they do work when `type` is 'any'.
>
> #### Load packages ####
> library(VariantAnnotation)
>
> #### Create VCF ####
> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> vcf <- readVcf(fl, "hg19")
>
> #### Create GRanges object ####
> gr <- GRanges(seqnames = '20', ranges = IRanges(start = 14370, end =
14370))
>
> #### countOverlaps ####
> countOverlaps(vcf, gr, type = 'any') # Works
> countOverlaps(vcf, gr, type = 'equal') # Doesn't work
> findOverlaps(vcf, gr, type = 'any') # Works
> findOverlaps(vcf, gr, type = 'equal') # Doesn't work
> overlapsAny(vcf, gr, type = 'any') # Works
> overlapsAny(vcf, gr, type = 'equal') # Doesn't work
>
> #### sessionInfo ####
> sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods
> [8] base
>
> other attached packages:
> [1] VariantAnnotation_1.6.7 Rsamtools_1.12.3
Biostrings_2.28.0
> [4] GenomicRanges_1.12.4 IRanges_1.18.3
BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] AnnotationDbi_1.22.6 Biobase_2.20.1 biomaRt_2.16.0
> [4] bitops_1.0-6 BSgenome_1.28.0 DBI_0.2-7
> [7] GenomicFeatures_1.12.3 RCurl_1.95-4.1 RSQLite_0.11.4
> [10] rtracklayer_1.20.4 stats4_3.0.0 tools_3.0.0
> [13] XML_3.98-1.1 zlibbioc_1.6.0
>
>
>
>
> --------------------------------
> Peter Hickey,
> PhD Student/Research Assistant,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Ph: +613 9345 2324
>
> hickey at wehi.edu.au
> http://www.wehi.edu.au
>
>
>
______________________________________________________________________
> The information in this email is confidential and
intend...{{dropped:8}}
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
------------------------------
Message: 10
Date: Fri, 23 Aug 2013 16:05:07 -0700
From: "Tim Triche, Jr." <tim.triche@gmail.com>
To: Jon Manning <jonathan.manning at="" ed.ac.uk="">
Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] Control probe QC plots for Infinium 450k chips
(lumi, minfi...)
Message-ID:
<cac+n9bxw54a_k3ekwe9st5hzwifrhcqvrme-ezch1qej14g2tq at="" mail.gmail.com="">
Content-Type: text/plain
heh. yeah I wrote something like this a long time ago for
methylumi... not
sure if there's an issue with the Final Report Format parser or what,
but
normally the control probe names are preserved. Anyways, here's the
fugly
old code, which I might as well update in methylumi in SVN.
If a type of probe isn't found, the plot will try and search by name
for
the specified controls. If it finds those, it will break them out by
name
(the assumption here is that the user knows what they are doing if
they ask
for e.g. bisulfite conversion probes).
You might also look into shinyMethyl, which is another way of
approaching
this. There are coercions within methylumi that will take a
MethyLumiSet
or a MethyLumiM object and turn it into a MethylSet or a RGChannelSet,
so
presumably it would be possible to use shinyMethyl (which builds on
minfi)
for QC. I happen to think that minfi has a nice clean codebase and
makes
it relatively easy to build upon, but that's just me.
Fugly old code follows...
library(methylumi)
PBLs <- readRDS('PBLs.rds') ## a MethyLumiSet I had lying around
## caveat: I have always read in data from IDATs, so I didn't test the
parser here
## examples:
## qc.probe.plot(PBLs, controltype='DNP') ## finds the probes by name
## qc.probe.plot(PBLs, controltype='bisulfite') ## finds the probes by
type
## qc.probe.plot(PBLs, by.type=T, controltype='bisulfite') ## splits
the
probes by type
## avert your eyes, this code predates my learning how to be hygienic
in
R...
qc.probe.plot <- function (obj, controltype="negnorm", log2=T,
by.type=F,
...)
{
require("ggplot2")
require("reshape2")
require("scales")
by.probe <- FALSE
log2_trans <- log_trans(base = 2)
if (class(obj) %in% c("MethyLumiSet", "MethyLumiM")) {
qc <- controlData(obj)
if (!identical(sampleNames(qc), sampleNames(obj))) {
sampleNames(qc) <- sampleNames(obj)
}
}
else if (class(obj) == "MethyLumiOOB" & tolower(controltype) ==
"oob") {
qc <- obj
}
else if (class(obj) == "MethyLumiQC") {
qc <- obj
}
else {
stop("Don't know how to QC this data you've given me...")
}
if (tolower(controltype) == "negnorm" || missing(controltype)) {
rows <- grep("(Negative|Norm)", fData(qc)$Type, ignore.case =
T)
}
else {
rows <- grep(paste("^", controltype, sep = ""),
fData(qc)$Type,
ignore.case = T)
if(length(rows) < 1) {
## try looking by name instead
rows <- grep(paste("^", controltype, sep = ""),
featureNames(qc),
ignore.case = T)
by.probe <- TRUE
}
}
if (tolower(controltype) == "oob") {
dat <- intensities.OOB.allelic(obj)
rownames(dat$Cy5$M) <- paste(rownames(dat$Cy5$M), "M", sep =
"_")
rownames(dat$Cy5$U) <- paste(rownames(dat$Cy5$U), "U", sep =
"_")
rownames(dat$Cy3$M) <- paste(rownames(dat$Cy3$M), "M", sep =
"_")
rownames(dat$Cy3$U) <- paste(rownames(dat$Cy3$U), "U", sep =
"_")
dat$Cy5 <- rbind(dat$Cy5$M, dat$Cy5$U)
dat$Cy3 <- rbind(dat$Cy3$M, dat$Cy3$U)
names(dat) = gsub("Cy3", "green", gsub("Cy5", "red",
names(dat)))
dat <- lapply(dat, function(d) {
datum = data.frame(d)
colnames(datum) = gsub("^X", "", colnames(datum))
m.probes = grepl("_M$", rownames(d))
datum$probe = as.factor(gsub("_(M|U)$", "",
rownames(datum)))
datum$type = "unmethylated"
datum$type[which(m.probes)] = "methylated"
datum$type = as.factor(datum$type)
return(datum)
})
dat$green$channel = "Cy5"
dat$red$channel = "Cy3"
dat.frame <- rbind(dat$red, dat$green)
dat.frame$channel <- as.factor(dat.frame$channel)
a.title <- "Out-of-band probe intensity plot"
} else {
probes <- featureNames(qc)[rows]
type <- as.factor(fData(qc)$Type[rows])
if (tolower(controltype) == "negnorm") {
if ("NORM_C" %in% unique(fData(qc)$Type)) {
colour.settings <- c(NEGATIVE = "darkgray", NORM_A =
"red",
NORM_T = "darkred", NORM_C = "green", NORM_G =
"darkgreen")
type = factor(type, levels = names(colour.settings))
shape.settings <- c(20, 20, 20, 20, 20)
} else {
colour.settings <- c("darkgray", "darkgreen", "red")
shape.settings <- c(20, 20, 20)
}
}
dat <- c(red = "unmethylated", green = "methylated")
Cy5 <- as.data.frame(assayDataElement(qc, dat[1])[rows, ])
Cy5$channel <- "Cy5 (Red)"
Cy5$probe <- as.factor(probes)
Cy3 <- as.data.frame(assayDataElement(qc, dat[2])[rows, ])
Cy3$channel <- "Cy3 (Green)"
Cy3$probe <- as.factor(probes)
Cy3$type <- Cy5$type <- type
dat.frame <- rbind(Cy5, Cy3)
dat.frame$channel <- as.factor(dat.frame$channel)
a.title <- paste(controltype, "control probe plot")
if (tolower(controltype) == "negnorm") {
a.title <- "Negative & normalization control probes"
}
}
more.args = list(...)
if ("extra" %in% names(more.args))
a.title = paste(a.title, more.args[["extra"]])
qc <- melt(dat.frame, id = c("probe", "channel", "type"))
geometry <- ifelse(tolower(controltype) == "negnorm",
ifelse(tolower(controltype) == "oob", "boxplot", "jitter"), "point")
if (tolower(controltype) == "oob") {
qc$grouping = paste(qc$variable, qc$type, sep = ".")
p <- ggplot2::qplot(data = qc, colour = type, fill = type,
group = grouping, x = variable, y = value, geom =
"boxplot",
main = a.title, xlab = "Sample", ylab = "Intensities") +
coord_flip()
if (log2)
p <- p + scale_x_continuous(trans = "log2", limits =
c(2^4,
2^16))
else
p <- p + scale_x_continuous(limits = c(2^4, 2^16))
} else {
if(by.probe) {
p <- ggplot2::qplot(data = qc, colour = probe, shape =
probe,
x = value, y = variable, geom = geometry, main =
a.title,
ylab = "Sample", xlab = "Intensities")
} else {
p <- ggplot2::qplot(data = qc, colour = type, shape =
type,
x = value, y = variable, geom = geometry, main =
a.title,
ylab = "Sample", xlab = "Intensities")
}
p <- p + scale_y_discrete(limits = rev(sampleNames(obj)))
if (log2) p <- p + scale_x_continuous(trans="log2",
limits=c(2^4,
2^16))
else p <- p + scale_x_continuous(limits = c(2^4, 2^16))
}
if (by.type) {
p <- p + facet_grid(type ~ channel)
} else if(by.probe) {
p <- p + facet_grid(. ~ channel)
} else {
p <- p + facet_grid(. ~ channel)
}
if (tolower(controltype) == "negnorm") {
p <- p + scale_colour_manual(values = colour.settings)
p <- p + scale_shape_manual(values = shape.settings)
}
p <- p + theme_bw()
return(p)
}
On Fri, Aug 23, 2013 at 4:05 AM, Jon Manning <jonathan.manning at="" ed.ac.uk="">wrote:
> Hi Bioconductors,
>
> I'm trying to find out if there are good existing ways in
Bioconductor of
> plotting out the QC data from Illumina Methylation 450k chips. I
want to
> plot the control probe intensities separately for green and red
channels,
> label the individual probe types properly (such as 'DNP (High)',
'Biotin
> (Med)' for the staining controls), and ideally indicate expected
> (qualitative) intensities. I've actually done this myself, but want
to know
> if I've missed the BioC way.
>
> The detail:
>
> I have 'sample methylation profile', 'control probe profile' etc
files
> (not the IDAT files). The way Lumi and friends read the controlData
> (retrievable from a controlData slot in the methyLumiM object),
probe
> detail seems to be lost, such that for the staining controls ('DNP
(High)',
> 'Biotin (Med)') I just get identifiers like STAINING, STAINING.1.
The
> results of methylumi's qcplot() function are then supremely un-
useful.
> Likewise Minfi's QC plotting functions seem from the docs not to
label
> individual control types (though I don't have the IDAT data so Minfi
isn't
> very useful to me anyway).
>
> I needed to finish my work, so I brewed my own method using ggplot()
and
> cross-referencing the control data and annotation from the manifest
> (outside of Lumi et al), but I feel this is something others will
have done
> and that I've therefore missed something. Any pointers?
>
> Many thanks,
>
> Jon Manning
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
> ______________________________**_________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor="">
> Search the archives: http://news.gmane.org/gmane.**
> science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor="">
>
--
*He that would live in peace and at ease, *
*Must not speak all he knows, nor judge all he sees.*
*
*
Benjamin Franklin, Poor Richard's
Almanack<http: archive.org="" details="" poorrichardsalma00franrich="">
[[alternative HTML version deleted]]
------------------------------
Message: 11
Date: Fri, 23 Aug 2013 16:07:50 -0700
From: Marc Carlson <mcarlson@fhcrc.org>
To: bioconductor at r-project.org
Subject: Re: [BioC] error while using goProfiles package on
arabidopsis entrez gene IDs
Message-ID: <5217EB46.202 at fhcrc.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Hi,
So now I can see a little better what you are doing. The problem is
what is happening inside of goProfiles. Now this is not my package
and
I have never really used it much myself, so I just did a little
debugging to see what was happening, and this is what I found:
The basicProfile() function is expecting you to give it a central ID
for
the org package you name for it. It seems to be assuming that this
will
be an entrez gene ID. But that is *not* what the arabidopsis
community
usually uses. That community likes to use TAIR IDs. So the
org.At.tair.db, uses TAIR IDs as the central ID (this is why TAIR is
in
the middle of the package name). You can get and use entrez gene IDs
with the org.At.tair.db package, but they are not the central id that
is
expected by many of the older methods like mget() etc. These days,
we
have moved away from that model and now use the select method. We
feel
it's less confusing since there is no longer the need to pay attention
to which key type is most important for a package etc. Instead the
select() interface just asks you to provide the kind of key that you
are
using. We feel this is more transparent.
So anyways here is how I was able to make it run:
## 1st take some of your entrez gene IDs
egIDs <- c("839235", "838362", "838961", "837091", "837455", "837543")
## use select to quickly translate these into TAIR IDs, and then grab
that column of IDs back out.
## (You may find it more convenient to just start with the TAIR IDs
that
you said were in your file, but I don't have those here)
tairIDs <- as.character(select(org.At.tair.db, keys=egIDs,
cols="TAIR",
keytype="ENTREZID")[[2]])
## THEN call basicProfile function and pass in tair IDs instead...
## Now when it calls mget on the GO mapping, it will actually get some
matches.
basicProfile(tairIDs, idType="Entrez", onto ="ANY", level=2,
orgPackage="org.At.tair.db", ord=FALSE)
I hope this helps you,
Marc
On 08/22/2013 02:07 AM, dd [guest] wrote:
> Hi all,
> I was using goProfiles package for functional analysis using a
genelist of 316 Arabidopsis entrez gene IDs as shown below in the R
command sessionInfo().
>
> - Read a file containing Entrez IDs and TAIR IDs.
> - Subset the Entrez IDs and converted to character vector.
> - Used the vector as genelist.
> -Used goProfiles package function basicProfile for this genelist
with organism package of Arabidopsis.
>
> OUTPUT :Error in GOtermslist[[i]] : subscript out of bounds.
>
> Can somebody please help me in finding any mistake I might have
done?
>
> Thanks in advance.
>
> -- output of sessionInfo():
>
> Console output :
>
>>> a<-read.table("tair_ids to gene_ids.csv" ,header=TRUE,sep=",")
>>
>>> b<-as.character(a[,2])
>>> head(b)
>> [1] "839235" "838362" "838961" "837091" "837455" "837543"
>>
>>> h<-basicProfile(b,idType="Entrez",onto
="ANY",level=2,orgPackage="org.At.tair.db",ord=FALSE)
>> Error in GOtermslist[[i]] : subscript out of bounds
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Message: 12
Date: Fri, 23 Aug 2013 18:53:18 -0700 (PDT)
From: Luo Weijun <luo_weijun@yahoo.com>
To: Bioconductor at r-project.org, Oleg Moskvin <moskvin at="" wisc.edu="">
Subject: Re: [BioC] pathview puzzle
Message-ID:
<1377309198.3634.YahooMailBasic at
web125504.mail.ne1.yahoo.com>
Content-Type: text/plain; charset=utf-8
Hi Oleg,
Thanks for the note. This is indeed a problem I didn?t realize
previously! KEGG uses Entrez Gene ID for all other model organisms
I?ve checked.
I am working on a generic fix (not only for E coli but other species
with similar situation) and will incorporate that into the development
version of pathview soon. Will keep you posted.
Thanks for pointing this out.
Weijun
--------------------------------------------
On Fri, 8/23/13, Oleg Moskvin <moskvin at="" wisc.edu=""> wrote:
Subject: Re: [BioC] pathview puzzle
Date: Friday, August 23, 2013, 12:19 PM
Hi Weijun,
Thank you for the response.
The problem seems to be deeper than that and is connected to
special handling of a particular species - E.coli - by KEGG.
I looked into the pathview() code and here is what I see:
1) gene.data is remapped internally via mol.sum() to have
ENTREZ IDs;
2) remapped gene.data is used by node.map() to map onto KEGG
nodes using node.data
3) the node.data used in (2) was originally extracted from
the KEGG XML by node.info()
The above route implies that the "name" entries in the KEGG
XML of type="gene" have "speciesID:ENTREZ" format...
And in the case of E.coli this doesn't hold true! See the
examples of XML entries for H.sapience and E.coli from my
yesterday's message (below).
In fact, in KEGG XML for E.coli "gene" records b-numbers are
used as IDs!
So, for the cases like that, when KEGG fails to be
consistent in the supplied XML structure, one may suggest
introducing an "id.bypass" option to pathview() which will
take the gene.data as is (with the IDs supplied by user that
match KEGG XML ids; for example, b-numbers), and pass this
directly to the step 3 (node matching).
Thanks!
Oleg
On 08/22/13, Luo Weijun wrote:
> Hi Oleg,
> You are right, the problem is due to ID type
inconsistency.
> You have to specify gene.idtype when calling pathview
function, if your gene id type is not Entrez Gene. I don?t
think b-numbers are recognized for sure. For your gene name
example, if you mean official gene symbols by ?gene
name?, you should specify gene.idtype="SYMBOL" (lower case
is fine):
> eco2.out <- pathview(gene.data =
T2.CEBF095.crt115.ASCH.DROP3.rel.gn, pathway.id = "02010",
gene.idtype="SYMBOL", out.suffix = "T2ACSH", species =
"eco", kegg.native=TRUE)
On 08/22/13, Oleg Moskvin? wrote:
>
> <entry id="2" name="hsa:51343" type="gene"> link="http://www.kegg.jp/dbget-bin/www_bget?hsa:51343">
> <graphics name="FZR1, CDC20C, CDH1, FZR, FZR2, HCDH,
HCDH1" fgcolor="#000000" bgcolor="#BFFFBF"> type="rectangle" x="919" y="536" width="46"
height="17"/>
> </entry>
>
>
> <entry id="4" name="eco:b1513" type="gene"> link="http://www.kegg.jp/dbget-bin/www_bget?eco:b1513">
> <graphics name="lsrA" fgcolor="#000000" bgcolor="#BFFFBF"> type="rectangle" x="339" y="1882" width="46"
height="17"/>
> </entry>
------------------------------
Message: 13
Date: Sat, 24 Aug 2013 02:24:30 +0000
From: "Blanchette, Marco" <mab@stowers.org>
To: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
Subject: [BioC] Bug in makeOrgPackageFromNCBI from AnnotationForge?
Message-ID: <e05792c7-364d-4f0c-be9e-821c302effcd at="" stowers.org="">
Content-Type: text/plain; charset="us-ascii"
I am working on a project involving Schizosaccharomyces pombe as a
source for genomic analysis and love to use ReportingTools html
producing wrappers. However, I am struggling as there is no
AnnotationDbi package available for this organism. I decided to
finally take the plunge and try to see if I could be one myself using
AnnotationForge and was quite exciting to find that I could perhaps
melt one simply by using the makeOrgPackageFromNCBI(). Most likely,
something went wrong and I suspect a bug somewhere in the pipeline. I
have not dug deeper then trying to build the package and use it hoping
that someone closer to the code could shed some lights. Here the steps
I took:'
> library(AnnotationForge)
> makeOrgPackageFromNCBI(version = "0.1",
author = "Marco Blanchette <mab at="" stowers.org="">",
maintainer = "Marco Blanchette <mab at="" stowers.org="">",
outputDir = ".",
tax_id = "4896",
genus = "Schizosaccharomyces",
species = "pombe")
This step succeeded with only a warning:
Warning message:
In .makeSimpleTable(ug, table = "unigene", con) :
no values found for table unigene in this data chunk.
I didn't think this was critical enough to raise any red flag, so I
then proceeded with the installation that went smoothly
> library(devtools)
> install('org.Spombe.eg.db')
> library('org.Spombe.eg.db')
Then I try to use it with ReportingTools publish() but fail as it
returns an error related to Entrez ID which I had a conversion table
from biomaRt. I dug a bit deeper and found that none of the genes I
was querying were in the database to finally realize that there was
only 38 entries int the org.Spombe.eg.db database I had just created
and installed... Check this out:
> keytypes(org.Spombe.eg.db)
[1] "ENTREZID" "ACCNUM" "ALIAS" "CHR" "PMID" "REFSEQ"
[7] "SYMBOL" "UNIGENE" "GENENAME" "GO" "EVIDENCE" "ONTOLOGY"
Looking good! However:
> length(keys(org.Spombe.eg.db,'ENTREZID'))
[1] 38
Can someone close enough to the code shed some lights has to whether
there is a bug in AnnotationForge or whether it is the NCBI database
that is not conforming to what is expected? For instance, biomart has
5117 entrez ID
> library(biomaRt)
> mart <- useMart("fungi_mart_18","spombe_eg_gene")
> ensembl2entrez <- getBM(c('ensembl_gene_id','entrezgene'),mart=mart)
> sum(!is.na(ensembl2entrez$entrezgene))
[1] 5117
The ids I tested on the NCBI website return the correct genes.
However, only 10 of the AnnotationForge EntrezID (out of a skirmish 38
ids) are found in biomaRt
> sum(keys(org.Spombe.eg.db,'ENTREZID') %in%
ensembl2entrez$entrezgene)
[1] 10
Again, I would appreciate any comments or suggestions as to whether
this is a bug or something I did wrong or a miss alignment between the
NCBI S. pombe annotation and what is expected by AnnotationForge.
Thanks
--
Marco Blanchette, Ph.D.
Assistant Investigator
Stowers Institute for Medical Research
1000 East 50th St.
Kansas City, MO 64110
Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018
------------------------------
Message: 14
Date: Fri, 23 Aug 2013 23:46:53 -0400
From: Matthew McCall <mccallm@gmail.com>
To: Wolfgang Huber <whuber at="" embl.de="">
Cc: j.m.boer at erasmusmc.nl, bioconductor at r-project.org
Subject: Re: [BioC] frma normalization and batch effects
Message-ID:
<caddc-itkv9eo-ycttuwa+sb8enbexzyvv9igvdeb-tzfc9d93w at="" mail.gmail.com="">
Content-Type: text/plain
Judith,
You're probably fine using the default frma summarization unless your
data
are in some way atypical. The random effect summarization just allows
the
probe effects in your data to differ slightly from the global (frozen)
probe effects. Also if you're going to have singletons for which you
want
to predict in the future, then the default summarization is definitely
the
way to go.
Depending on how big your dataset is going to be, you might consider
creating your own custom frma implementation using the frmaTools
package.
Finally frma addresses one type of batch effect functioning at the
probe
level. It does nothing when the batch effect exists at the probeset
level.
So something like fSVA, after preprocessing with frma, would certainly
be a
good idea as well.
Best,
Matt
On Aug 23, 2013 7:56 AM, "Wolfgang Huber" <whuber at="" embl.de=""> wrote:
> Hi Judith
>
> I am sure the frma people will have more specific recommendations,
but in
> addition, both your questions below could be interpreted as
questions of
> parameter choice in a (somewhat complex, since it includes the
> preprocessing and batch adjustment) classifier. An often useful way
of
> making such choices is by cross-validation on a dataset that mimics
the
> kind of data you expect to see in the future.
>
> I guess you might also enjoy Jeff Leek's recent talk:
> http://www.birs.ca/events/2013/5-day-
workshops/13w5083/videos/watch/201308151110-Leek.mp4with frozen sva,
and top scoring pairs
>
> Best wishes
> Wolfgang
>
> On 23 Aug 2013, at 10:55, Judith Boer [guest] <guest at="" bioconductor.org="">
> wrote:
>
> >
> > Dear all,
> >
> > I am working on expression classifiers for leukemic subtypes using
> Affymetrix Plus2 arrays. The training data consists of several
batches. The
> developed classifier will be used to predict the subtype of new sets
of
> samples as well as single samples. So far, I co-normalized new
arrays with
> the training set, but this is not ideal.
> >
> > I have read the frma paper by McCall et al, and it seems the
perfect
> solutions. Before I start, I have a few conceptual questions:
> >
> > 1. The training data consists of several batches of different
sizes,
> some of them biased towards a single subtype. Does normalization
per batch
> using summarize=?? random_effect?? remove biology in this case?
ComBat
> clearly did, and I ended up not correcting for batch effect, which
worked
> fine for the classifiers I am using. Any suggestion which
summarization
> would be best to use in this case?
> >
> > 2. Is there a minimum of arrays to use with summarize=??
random_effect??
> ?
> >
> > Any suggestions on how to best implement frma in this project are
very
> welcome!
> >
> > Cheers, Judith
> >
> >
> > -- output of sessionInfo():
> >
> > R version 2.15.2 (2012-10-26)
> > Platform: i386-w64-mingw32/i386 (32-bit)
> >
> > locale:
> > [1] LC_COLLATE=English_United Kingdom.1252
LC_CTYPE=English_United
> Kingdom.1252
> > [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
> > [5] LC_TIME=English_United Kingdom.1252
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods
base
> >
> >
> > --
> > Sent via the guest posting facility at bioconductor.org.
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
[[alternative HTML version deleted]]
------------------------------
Message: 15
Date: Fri, 23 Aug 2013 15:58:59 +0100
From: James Floyd <j.a.floyd@qmul.ac.uk>
To: <bioconductor at="" r-project.org="">
Subject: [BioC] DESeq query
Message-ID: <ce3d3743.21a%j.a.floyd at="" qmul.ac.uk="">
Content-Type: text/plain
Hi,
I am very new to RNA-seq and differential expression analyses, but
have been
attempting to use DESeq in my analyses (package maintainer: Simon
Anders).
I was going through the vignette provided here
<http: bioconductor.org="" packages="" 2.12="" bioc="" vignettes="" deseq="" inst="" doc="" d="" eseq.p="" df=""> . Is the idea of performing the getVarianceStabilizedData to then
be
able to go ahead and re-analyse the transformed count data with
stabilised
variance? I ask because trying to use the value from
getVarianceStabilizedData as input to newCountDataSet doesn't work
since the
"counts" are no longer integers.
Thanks in advance,
Jamie
[[alternative HTML version deleted]]
------------------------------
Message: 16
Date: Fri, 23 Aug 2013 22:23:48 +0200
From: Xi Wang <xi.wang@newcastle.edu.au>
To: "Strickland, Alleene" <astrickland at="" med.miami.edu="">
Cc: "bioconductor at stat.math.ethz.ch bioconductor"
<bioconductor at="" stat.math.ethz.ch="">
Subject: Re: [BioC] SeqGSEA estiGeneNBstat()
Message-ID:
<capm7ood4hsskvt8=wtyls6wckpguf8qca9bnd8pj+ls66z5__w at="" mail.gmail.com="">
Content-Type: text/plain
Hi Alleene,
Have you solved your problems in running SeqGSEA? I'm very glad to
help
solve out any running issues regarding SeqGSEA.
The NA values generated after running estiExonNBstat() were due to the
only
one exon in each genes, where no differential splicing can happen.
Regards
Xi
On Fri, Aug 23, 2013 at 9:16 PM, Strickland, Alleene <
AStrickland at med.miami.edu> wrote:
> Dan,
> I was referring to that post - thank you! I didn't realize it went
to the
> wrong place.
>
> -----Original Message-----
> From: dandante at gmail.com [mailto:dandante at gmail.com] On Behalf
Of Dan
> Tenenbaum
> Sent: Friday, August 23, 2013 2:28 PM
> To: Strickland, Alleene
> Cc: bioconductor at stat.math.ethz.ch bioconductor; Xi Wang
> Subject: Re: [BioC] SeqGSEA estiGeneNBstat()
>
> CC'ing the maintainer. I think this person is referring to this
post:
> https://stat.ethz.ch/pipermail/bioconductor/2013-July/053717.html
>
> Dan
>
>
> On Wed, Aug 21, 2013 at 1:03 PM, Alleene <astrickland at="" med.miami.edu="">
> wrote:
> > I am having the same issue - any help would be appreciated.
> >
> > Thank you!
> > -Alleene
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
[[alternative HTML version deleted]]
------------------------------
Message: 17
Date: Fri, 23 Aug 2013 11:38:56 +0200
From: Moritz Hess <ssehztirom@googlemail.com>
To: "James W. MacDonald" <jmacdon at="" uw.edu="">,
"bioconductor at r-project.org" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] Proper set up to test interaction of continuous
covariate and factor levels (limma)
Message-ID:
<camvls2xlo3c4dsz0gw91hqg=7qkt900vgodcugjhj3lag56q0q at="" mail.gmail.com="">
Content-Type: text/plain
Hi James,
thanks for your reply! My intention to use the above mentioned
(although
improper as you mentioned) method was to be able to form clear tests.
I
assumed that when I am testing for genes responding to the covariate
e.g.
significant "cov" I am testing against the Baseline in the reference
group
(the group that has been assigned as reference by limma or lm). But I
want
to test against the baseline in all groups. Basically I wanted to
adapt the
procedure that has been proposed for factoral designs in the limma
userguide (Section 8.7).
I now realize that testing for significant coefficients for cov and
interaction of group only seems to be a good choice for me. However I
was a
little puzzled by the fact that the effect size for cov would be only
the
effect size / slope within GroupA, which I somehow confused with a
signifcant nonzero slope of the covariate exclusively in GroupA.
Best,
Moritz
2013/8/21 James W. MacDonald <jmacdon at="" uw.edu="">
> Hi Moritz,
>
>
> On 8/20/2013 2:30 PM, Moritz Hess wrote:
>
>> Hi,
>> I am investigating the global gene expression response to a
continuous
>> environmental covariate in two groups of individuals using limma.
>> I am interested in genes whose expression is:
>> A) correlated with the covariate
>> B) differentially correlated with the covariate in the two groups
of
>> individuals (Interaction of Group and Covariate)
>>
>> In order to be able to set up the proper contrasts I split the
Covariate
>> in
>> two Vectors where one Vector contains only the samples with the
lowest
>> level of the covariate
>> e.g. CovBase = c(0,0,0,3,0,3,0,0)
>> and where the other vector contains all the samples with higher
levels of
>> the covariate
>> e.g. Cov = c(34,2,5,0,5,0,2,34)
>> and set up a design matrix without intercept:
>>
>> ~ Group + CovBase + Cov
>>
>>
>> The contrasts I am testing are specified as follows:
>> Effect of Covariate: Cov-CovBase
>> Interaction of Group and Covariate: (GroupA-GroupB) - (Cov-CovBase)
>>
>> Does this procedure makes sense ?
>>
>
> It doesn't make sense to me. In the first place the formula you use
will
> create an intercept. Secondly, if you want the covariate to be
continuous,
> then why are you splitting it like that? And why do you think values
of 3
> are lower than values of 2?
>
> If I were trying to do what you say you are doing, then I would just
do
>
> cov <- c(34,2,5,3,5,3,2,34)
> group <- factor(whateveryourgroupsare)
>
> design <- model.matrix(~cov*group)
>
> Best,
>
> Jim
>
>
>
>> Thank you very much in advance
>>
>> Moritz
>>
>>
>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor="">
>> Search the archives: http://news.gmane.org/gmane.**
>> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor="">
>>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
>
--
*Moritz He?
PhD Candidate
*
*Research associate
Forest Research Institute
of Baden W?rttemberg (FVA)
Wonnhalde 4
79100 Freiburg (Germany)
phone +49 761 4018 301*
[[alternative HTML version deleted]]
------------------------------
Message: 18
Date: Sat, 24 Aug 2013 08:37:31 +0200
From: Michael Love <michaelisaiahlove@gmail.com>
To: James Floyd <j.a.floyd at="" qmul.ac.uk="">
Cc: bioconductor at r-project.org
Subject: Re: [BioC] DESeq query
Message-ID:
<cadqzidugqtocf7a=hsvx4via=azpir8dfgjfoxga=1u2dgmcxw at="" mail.gmail.com="">
Content-Type: text/plain
Hi Jamie,
The point of providing the variance stabilization is for applications
other
than DE testing, for example clustering of samples or other machine
learning applications.
For DE testing we suggest to use the count-based methods.
Mike
On Aug 24, 2013 8:07 AM, "James Floyd" <j.a.floyd at="" qmul.ac.uk="">
wrote:
> Hi,
>
> I am very new to RNA-seq and differential expression analyses, but
have
> been
> attempting to use DESeq in my analyses (package maintainer: Simon
Anders).
>
> I was going through the vignette provided here
> <
> http://bioconductor.org/packages/2.12/bioc/vignettes/DESeq/inst/doc/
DESeq.p
> df> . Is the idea of performing the getVarianceStabilizedData to
then be
> able to go ahead and re-analyse the transformed count data with
stabilised
> variance? I ask because trying to use the value from
> getVarianceStabilizedData as input to newCountDataSet doesn't work
since
> the
> "counts" are no longer integers.
>
> Thanks in advance,
> Jamie
>
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
[[alternative HTML version deleted]]
------------------------------
_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
End of Bioconductor Digest, Vol 126, Issue 23
*********************************************
________________________________
Helmholtz-Zentrum f?r Infektionsforschung GmbH | Inhoffenstra?e 7 |
38124 Braunschweig | www.helmholtz-hzi.de
Das HZI ist seit 2007 zertifiziertes Mitglied im "audit
berufundfamilie"
Vorsitzende des Aufsichtsrates: MinDir?in B?rbel Brumme-Bothe,
Bundesministerium f?r Bildung und Forschung
Stellvertreter: R?diger Eichel, Abteilungsleiter Nieders?chsisches
Ministerium f?r Wissenschaft und Kultur
Gesch?ftsf?hrung: Prof. Dr. Dirk Heinz; Ulf Richter, MBA
Gesellschaft mit beschr?nkter Haftung (GmbH)
Sitz der Gesellschaft: Braunschweig
Handelsregister: Amtsgericht Braunschweig, HRB 477