Entering edit mode
Vining, Kelly
▴
220
@vining-kelly-5029
Last seen 10.3 years ago
Thanks Lukas, Martin and Steve for all of your help. MEDIPS.createSet
almost works now. Alas, I'm still getting an error. Troubleshooting
these error messages is giving me a deeper understanding of what is
going on at each step, and that is good, but I am ready to get this
working once and for all. Any help with this most recent error is much
appreciated.
--Kelly
> bF1 = MEDIPS.createSet(file="FallBud_brep1_aln_sorted.bam",
chr.select=seqnames(BSgenome), BSgenome="BSgenome", extend=300,
uniq=TRUE, shift=0, window_size=1000)
Reading bam alignment FallBud_brep1_aln_sorted.bam
Selecting Chr01 Chr02 Chr03 Chr04 Chr05 Chr06 Chr07 Chr08 Chr09 Chr10
Chr11 Chr12 Chr13 Chr14 Chr15 Chr16 Chr17 Chr18 Chr19
Total number of imported short reads: 18306036
Extending reads...
Creating GRange Object...
Extract unique regions...
Number of unique short reads: 15614381
Error in seqlengths(seqinfo(x)) :
error in evaluating the argument 'x' in selecting a method for
function 'seqlengths': Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function âseqinfoâ for
signature â"function"â
________________________________
From: Lukas Chavez [lukas.chavez.mailings@googlemail.com]
Sent: Wednesday, April 02, 2014 1:03 AM
To: Martin Morgan
Cc: Vining, Kelly; Steve Lianoglou; bioconductor@r-project.org
Subject: Re: [BioC] Rsamtools filtering bam file: RE: MEDIPS.createSet
error
And yes, the argument BSgenome is supposed to be a character string
naming the BSgenome object. Please let me know, if this will work for
you.
Lukas
P.S. I like the chr.select=seqnames(BSgenome) idea- I think I will
implement this as a default together with a warning/ info message.
On Wed, Apr 2, 2014 at 12:46 AM, Lukas Chavez <lukas.chavez.mailings@g ooglemail.com<mailto:lukas.chavez.mailings@googlemail.com="">> wrote:
Yes, the chr.select parameter will allow you to import selected
chromosomes only from your bam files. I recommend to create an
according bam index file (samtools) and to store it together with the
bam file in the same directory.
Lukas
On Tue, Apr 1, 2014 at 1:49 PM, Martin Morgan
<mtmorgan@fhcrc.org<mailto:mtmorgan@fhcrc.org>> wrote:
On 04/01/2014 12:52 PM, Vining, Kelly wrote:
This is a follow-up question about the most efficient way to use
functions in Rsamtools to filter my bam files so that only alignments
to chromosomes are included (extrachromosomal scaffold alignments are
excluded).
I think from your original question you are really looking to provide
the names of the sequences in your BSgenome object as a value to the
chr.select argument of MEDIPS.createSet, I *think*
chr.select=seqnames(BSgenome) (be sure to only include seqnames
defined in the bam file, too).
>From the Rsamtools documentation, it looks like there are two ways to
accomplish this:
1) bamFile(bamFilter())
??? I'm not sure what you're talking about here, neither bamFile nor
bamFilter are functions defined in Rsamtools ???
Use filterBam() to create a new bam file from an old bam files. It's
use is illustrated in an example on the help page ?filterBam.
2) bamFile(ScanBamParam())
ScanBamParam() is a way to selectively input data, e.g., with the
readGAlignments or scanBam functions. Its use is illustrated for
instance on p. 2 of the vignette available in the Rsamtools package
and at
http://bioconductor.org/packages/release/bioc/vignettes/Rsamtools/inst
/doc/Rsamtools-Overview.pdf
MEDIPS.createSet uses a ScanBamParam internally, but as a user you do
not have access to it.
Martin
For the first method, a "destination" output file name has to be
designated. I don't want to create a separate file, so it appears that
the second method is what I want, as it allows control of which
records are imported and doesn't output a separate file.
I set up a FilterRules object like so:
filt <- FilterRules(list(isChr = function(x) rname(x) == "^Chr"))
I also set up a "which" variable to contain the seqnames in my
BSgenome:
which <- seqnames(BSgenome)
However, I'm getting stuck trying to apply either of these to
ScanBamParam. Perhaps I want ScanBamWhat instead of ScanBamParam? Can
someone clue me in about the proper syntax to accomplish this simple
filtering operation?
Thanks for any help,
--Kelly
________________________________________
From: bioconductor-bounces@r-project.org<mailto:bioconductor- bounces@r-project.org=""> [bioconductor-bounces@r-project.org<mailto :bioconductor-bounces@r-project.org="">] on behalf of Vining, Kelly
[Kelly.Vining@oregonstate.edu<mailto:kelly.vining@oregonstate.edu>]
Sent: Tuesday, April 01, 2014 8:12 AM
To: 'Lukas Chavez'
Cc: Steve Lianoglou;
bioconductor@r-project.org<mailto:bioconductor@r-project.org>
Subject: Re: [BioC] MEDIPS.createSet error
I see, Lukas. That makes sense. Thanks for explaining it. So, MEDIPS
looks in the BSgenome object for seqnames only, and doesn't look in
mseqnames. It seems like a lot of users of the MEDIPS package must
encounter this problem, because most genomes have sets of unassembled
scaffolds that, following the BSgenome forging instructions,
inevitably end up as mseq objects. It would be great if a future
version of MEDIPS looked for both types of sequences in the BSgenome.
I also meant to reply to Steve's question about this line from the
vignette:
BSgenome="BSgenome.Hsapiens.UCSC.hg19"
In my case, with my custom genome, this would be:
BSgenome = "BSgenome.Ptrichocarpa.Phytozome.v3"
But when I do this, the literal BSgenome.Ptrichocarpa.Phytozome.v3 is
assigned to the BSgenome variable as a character vector. So instead, I
do this:
BSgenome = BSgenome.Ptrichocarpa.Phytozome.v3
That works. So maybe quotes are only required for genomes listed under
available_genomes()?
Meanwhile, for my MEDIPS analysis, I'm willing to go back to the bam
files and filter out the scaffold alignments. It looks like Rsamtools
has methods for this. Should I be trying to set up a FilterRules
instance, or a ScanBamParam instance? Any help with the syntax for
this step will be appreciated.
I'm sure this is not going to work, but I'll start with something like
this:
bamFilt_Fall1 <- BamFile(filterBam(file,
filter=FilterRules("scaffold")))
Thanks much,
--Kelly
From: Lukas Chavez [mailto:lukas.chavez.mailings@googlemail.com<mailto :lukas.chavez.mailings@googlemail.com="">]
Sent: Tuesday, April 01, 2014 12:32 AM
To: Vining, Kelly
Cc: Steve Lianoglou;
bioconductor@r-project.org<mailto:bioconductor@r-project.org>
Subject: Re: [BioC] MEDIPS.createSet error
Thank you Steve for explaining in detail how to look at the content of
the BSgenome object and the bam file.
When MEDIPS imports the bam file, it will try to extract information
for the 'scaffold_' chromosomes from the BSgenome object. The
initially reported error
Calculating genomic coordinates...Error in vector(length =
supersize_chr[length(
chromosomes)], mode = "character") :
vector size cannot be NA/NaN
occurs because MEDIPS is trying to find the length of a chromosome
given in the bam file, but does not find the chromosome in the
BSgenome object. Subsequently, MEDIPS tries to calculate genomic
coordinates for the chromosome wide windows, but the length of the
chromosome is NA. I understand that it will be necessary to catch this
error and add a warning message in a future version of MEDIPS.
If you want to keep your hits on the 'scaffold_' chromosomes in the
bam file, and to include all scaffolds in your further analysis, you
will have to recreate your BSgenome object treating each scaffold as
an individual chromosome just as the other assembled chromosomes.
Lukas
On Mon, Mar 31, 2014 at 3:05 PM, Vining, Kelly <kelly.vining@oregonsta te.edu<mailto:kelly.vining@oregonstate.edu=""><mailto:kelly.vining@oregon state.edu<mailto:kelly.vining@oregonstate.edu="">>> wrote:
Thanks, Steve, for your advice - appreciate you working with a package
that isn't familiar to you.
Thanks also for catching that I didn't have Rsamtools installed.
I think this output that you suggested I generate gives an idea about
what might be going on:
si.bsg
Seqinfo of length 19
seqnames seqlengths isCircular genome
Chr01 50495391 FALSE 3.0
Chr02 25263035 FALSE 3.0
Chr03 21816808 FALSE 3.0
Chr04 24267051 FALSE 3.0
Chr05 25890704 FALSE 3.0
... ... ... ...
Chr15 15278577 FALSE 3.0
Chr16 14494361 FALSE 3.0
Chr17 16080358 FALSE 3.0
Chr18 16958300 FALSE 3.0
Chr19 15942145 FALSE 3.0
si.bam
Seqinfo of length 1446
seqnames seqlengths isCircular genome
Chr01 50495391 <na> <na>
Chr02 25263035 <na> <na>
Chr03 21816808 <na> <na>
Chr04 24267051 <na> <na>
Chr05 25890704 <na> <na>
... ... ... ...
scaffold_3584 3654 <na> <na>
scaffold_3595 2008 <na> <na>
scaffold_3648 2796 <na> <na>
scaffold_3664 3053 <na> <na>
scaffold_3681 1022 <na> <na>
When I created the reference genome structure, the scaffolds were all
entered in a single, multiple seq object, following instructions. But
in my bam files, the scaffold hits are as shown above. So that could
be one problem, but I don't know how to solve that other than
filtering out all of the alignments to the scaffolds from the bam
files. I really would like to retain the scaffold alignments and not
lose all of that precious data.
The other potential issue is the "NA" entries in the "isCircular" and
"genome" columns. I'm not sure whether those would be a problem.
Continued thanks,
--Kelly
________________________________________
From: Steve Lianoglou [lianoglou.steve@gene.com<mailto:lianoglou.steve @gene.com=""><mailto:lianoglou.steve@gene.com<mailto:lianoglou.steve@gene .com="">>]
Sent: Monday, March 31, 2014 1:15 PM
To: Vining, Kelly
Cc: Lukas Chavez; bioconductor@r-project.org<mailto:bioconductor@r-pro ject.org=""><mailto:bioconductor@r-project.org<mailto:bioconductor@r-proj ect.org="">>
Subject: Re: [BioC] MEDIPS.createSet error
Hi,
Caveat being that I've never used MEDIPS, and I'm just going along
with
the vignette.
So, comments inline:
On 31 Mar 2014, at 13:09, Vining, Kelly wrote:
Hi,
Thanks for the advice thus far! To confirm what is in my BSgenome
variable, I did this:
class(BSgenome)
[1] "BSgenome"
attr(,"package")
[1] "BSgenome"
names(BSgenome)
[1] "Chr01" "Chr02" "Chr03" "Chr04" "Chr05" "Chr06" "Chr07" "Chr08"
"Chr09"
[10] "Chr10" "Chr11" "Chr12" "Chr13" "Chr14" "Chr15" "Chr16" "Chr17"
"Chr18"
[19] "Chr19" "scaf"
And then:
print(BSgenome)
Black cottonwood genome
|
| organism: Populus trichocarpa (Black cottonwood)
| provider: Phytozome (JGI)
| provider version: 3.0
| release date: January 2010
| release name: Populus trichocarpa v3.0
|
| single sequences (see '?seqnames'):
| Chr01 Chr02 Chr03 Chr04 Chr05 Chr06 Chr07 Chr08 Chr09
Chr10 Chr11
| Chr12 Chr13 Chr14 Chr15 Chr16 Chr17 Chr18 Chr19
|
| multiple sequences (see '?mseqnames'):
| scaf
|
| (use the '$' or '[[' operator to access a given sequence)
So that looks ok. Interestingly, when I followed the vignette and did
the equivalent of
BSgenome="BSgenome.Hsapiens.UCSC.hg19"
The vignette suggests that BSgenome should be the package name of the
BSgenome package to open, so yours should be something like
"BSgenome.Ptrichocarpa.XXX.YY" or something -- I guess you built this
packge by yourself, or something, so you'd know its name ...
That didn't work for me. It only worked without quotes. If I included
quotes, it just assigned a character vector to that variable.
Sorry, what exactly didn't work for you? Can you show me the code that
failed?
Then, following your advice:
si.bsg <- seqinfo(BSgenome.Ptrichocarpa.Phytozome.v3)
si.bam <- seqinfo(BamFile("FallBud_brep1_aln_sorted.bam"))
Error in seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) :
error in evaluating the argument 'x' in selecting a method for
function 'seqinfo': Error: could not find function "BamFile"
The BamFile function is defined in the Rsamtools package, you need to
load that first (the first line of code I suggested you run was to
load
the Rsamtools package). Load it first, then redo the
seqinfo(BamFile(...)) stuff.
-steve
[[alternative HTML version deleted]]
_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org<mailto:bioconductor@r-project.org>
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org<mailto:bioconductor@r-project.org>
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793<tel:%28206%29%20667-2793>
[[alternative HTML version deleted]]