Rsamtools filtering bam file: RE: MEDIPS.createSet error

0

Entering edit mode

Vining, Kelly ▴ 220

@vining-kelly-5029

Last seen 10.2 years ago

This is a follow-up question about the most efficient way to use functions in Rsamtools to filter my bam files so that only alignments to chromosomes are included (extrachromosomal scaffold alignments are excluded). >From the Rsamtools documentation, it looks like there are two ways to accomplish this: 1) bamFile(bamFilter()) 2) bamFile(ScanBamParam()) For the first method, a "destination" output file name has to be designated. I don't want to create a separate file, so it appears that the second method is what I want, as it allows control of which records are imported and doesn't output a separate file. I set up a FilterRules object like so: > filt <- FilterRules(list(isChr = function(x) rname(x) == "^Chr")) I also set up a "which" variable to contain the seqnames in my BSgenome: > which <- seqnames(BSgenome) However, I'm getting stuck trying to apply either of these to ScanBamParam. Perhaps I want ScanBamWhat instead of ScanBamParam? Can someone clue me in about the proper syntax to accomplish this simple filtering operation? Thanks for any help, --Kelly ________________________________________ From: bioconductor-bounces@r-project.org [bioconductor- bounces@r-project.org] on behalf of Vining, Kelly [Kelly.Vining@oregonstate.edu] Sent: Tuesday, April 01, 2014 8:12 AM To: 'Lukas Chavez' Cc: Steve Lianoglou; bioconductor at r-project.org Subject: Re: [BioC] MEDIPS.createSet error I see, Lukas. That makes sense. Thanks for explaining it. So, MEDIPS looks in the BSgenome object for seqnames only, and doesn't look in mseqnames. It seems like a lot of users of the MEDIPS package must encounter this problem, because most genomes have sets of unassembled scaffolds that, following the BSgenome forging instructions, inevitably end up as mseq objects. It would be great if a future version of MEDIPS looked for both types of sequences in the BSgenome. I also meant to reply to Steve's question about this line from the vignette: BSgenome="BSgenome.Hsapiens.UCSC.hg19" In my case, with my custom genome, this would be: BSgenome = "BSgenome.Ptrichocarpa.Phytozome.v3" But when I do this, the literal BSgenome.Ptrichocarpa.Phytozome.v3 is assigned to the BSgenome variable as a character vector. So instead, I do this: BSgenome = BSgenome.Ptrichocarpa.Phytozome.v3 That works. So maybe quotes are only required for genomes listed under available_genomes()? Meanwhile, for my MEDIPS analysis, I'm willing to go back to the bam files and filter out the scaffold alignments. It looks like Rsamtools has methods for this. Should I be trying to set up a FilterRules instance, or a ScanBamParam instance? Any help with the syntax for this step will be appreciated. I'm sure this is not going to work, but I'll start with something like this: bamFilt_Fall1 <- BamFile(filterBam(file, filter=FilterRules("scaffold"))) Thanks much, --Kelly From: Lukas Chavez [mailto:lukas.chavez.mailings@googlemail.com] Sent: Tuesday, April 01, 2014 12:32 AM To: Vining, Kelly Cc: Steve Lianoglou; bioconductor at r-project.org Subject: Re: [BioC] MEDIPS.createSet error Thank you Steve for explaining in detail how to look at the content of the BSgenome object and the bam file. When MEDIPS imports the bam file, it will try to extract information for the 'scaffold_' chromosomes from the BSgenome object. The initially reported error Calculating genomic coordinates...Error in vector(length = supersize_chr[length( chromosomes)], mode = "character") : vector size cannot be NA/NaN occurs because MEDIPS is trying to find the length of a chromosome given in the bam file, but does not find the chromosome in the BSgenome object. Subsequently, MEDIPS tries to calculate genomic coordinates for the chromosome wide windows, but the length of the chromosome is NA. I understand that it will be necessary to catch this error and add a warning message in a future version of MEDIPS. If you want to keep your hits on the 'scaffold_' chromosomes in the bam file, and to include all scaffolds in your further analysis, you will have to recreate your BSgenome object treating each scaffold as an individual chromosome just as the other assembled chromosomes. Lukas On Mon, Mar 31, 2014 at 3:05 PM, Vining, Kelly <kelly.vining at="" oregonstate.edu<mailto:kelly.vining="" at="" oregonstate.edu="">> wrote: Thanks, Steve, for your advice - appreciate you working with a package that isn't familiar to you. Thanks also for catching that I didn't have Rsamtools installed. I think this output that you suggested I generate gives an idea about what might be going on: > si.bsg Seqinfo of length 19 seqnames seqlengths isCircular genome Chr01 50495391 FALSE 3.0 Chr02 25263035 FALSE 3.0 Chr03 21816808 FALSE 3.0 Chr04 24267051 FALSE 3.0 Chr05 25890704 FALSE 3.0 ... ... ... ... Chr15 15278577 FALSE 3.0 Chr16 14494361 FALSE 3.0 Chr17 16080358 FALSE 3.0 Chr18 16958300 FALSE 3.0 Chr19 15942145 FALSE 3.0 > si.bam Seqinfo of length 1446 seqnames seqlengths isCircular genome Chr01 50495391 <na> <na> Chr02 25263035 <na> <na> Chr03 21816808 <na> <na> Chr04 24267051 <na> <na> Chr05 25890704 <na> <na> ... ... ... ... scaffold_3584 3654 <na> <na> scaffold_3595 2008 <na> <na> scaffold_3648 2796 <na> <na> scaffold_3664 3053 <na> <na> scaffold_3681 1022 <na> <na> When I created the reference genome structure, the scaffolds were all entered in a single, multiple seq object, following instructions. But in my bam files, the scaffold hits are as shown above. So that could be one problem, but I don't know how to solve that other than filtering out all of the alignments to the scaffolds from the bam files. I really would like to retain the scaffold alignments and not lose all of that precious data. The other potential issue is the "NA" entries in the "isCircular" and "genome" columns. I'm not sure whether those would be a problem. Continued thanks, --Kelly ________________________________________ From: Steve Lianoglou [lianoglou.steve@gene.com<mailto:lianoglou.steve@gene.com>] Sent: Monday, March 31, 2014 1:15 PM To: Vining, Kelly Cc: Lukas Chavez; bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> Subject: Re: [BioC] MEDIPS.createSet error Hi, Caveat being that I've never used MEDIPS, and I'm just going along with the vignette. So, comments inline: On 31 Mar 2014, at 13:09, Vining, Kelly wrote: > Hi, > Thanks for the advice thus far! To confirm what is in my BSgenome > variable, I did this: > >> class(BSgenome) > [1] "BSgenome" > attr(,"package") > [1] "BSgenome" >> names(BSgenome) > [1] "Chr01" "Chr02" "Chr03" "Chr04" "Chr05" "Chr06" "Chr07" "Chr08" > "Chr09" > [10] "Chr10" "Chr11" "Chr12" "Chr13" "Chr14" "Chr15" "Chr16" "Chr17" > "Chr18" > [19] "Chr19" "scaf" > > And then: >> print(BSgenome) > Black cottonwood genome > | > | organism: Populus trichocarpa (Black cottonwood) > | provider: Phytozome (JGI) > | provider version: 3.0 > | release date: January 2010 > | release name: Populus trichocarpa v3.0 > | > | single sequences (see '?seqnames'): > | Chr01 Chr02 Chr03 Chr04 Chr05 Chr06 Chr07 Chr08 Chr09 > Chr10 Chr11 > | Chr12 Chr13 Chr14 Chr15 Chr16 Chr17 Chr18 Chr19 > | > | multiple sequences (see '?mseqnames'): > | scaf > | > | (use the '$' or '[[' operator to access a given sequence) > > > So that looks ok. Interestingly, when I followed the vignette and did > the equivalent of > BSgenome="BSgenome.Hsapiens.UCSC.hg19" The vignette suggests that BSgenome should be the package name of the BSgenome package to open, so yours should be something like "BSgenome.Ptrichocarpa.XXX.YY" or something -- I guess you built this packge by yourself, or something, so you'd know its name ... > That didn't work for me. It only worked without quotes. If I included > quotes, it just assigned a character vector to that variable. Sorry, what exactly didn't work for you? Can you show me the code that failed? > Then, following your advice: > >> si.bsg <- seqinfo(BSgenome.Ptrichocarpa.Phytozome.v3) >> si.bam <- seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) > Error in seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) : > error in evaluating the argument 'x' in selecting a method for > function 'seqinfo': Error: could not find function "BamFile" The BamFile function is defined in the Rsamtools package, you need to load that first (the first line of code I suggested you run was to load the Rsamtools package). Load it first, then redo the seqinfo(BamFile(...)) stuff. -steve [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

GO BSgenome BSgenome Rsamtools genomes MEDIPS GO BSgenome BSgenome Rsamtools genomes • 2.1k views

ADD COMMENT • link updated 10.7 years ago by Martin Morgan 25k • written 10.7 years ago by Vining, Kelly ▴ 220

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 4 months ago

United States

On 04/01/2014 12:52 PM, Vining, Kelly wrote: > This is a follow-up question about the most efficient way to use functions in Rsamtools to filter my bam files so that only alignments to chromosomes are included (extrachromosomal scaffold alignments are excluded). I think from your original question you are really looking to provide the names of the sequences in your BSgenome object as a value to the chr.select argument of MEDIPS.createSet, I *think* chr.select=seqnames(BSgenome) (be sure to only include seqnames defined in the bam file, too). > >>From the Rsamtools documentation, it looks like there are two ways to accomplish this: > 1) bamFile(bamFilter()) ??? I'm not sure what you're talking about here, neither bamFile nor bamFilter are functions defined in Rsamtools ??? Use filterBam() to create a new bam file from an old bam files. It's use is illustrated in an example on the help page ?filterBam. > 2) bamFile(ScanBamParam()) ScanBamParam() is a way to selectively input data, e.g., with the readGAlignments or scanBam functions. Its use is illustrated for instance on p. 2 of the vignette available in the Rsamtools package and at http://bioconductor.org/packages/release/bioc/vignettes/Rsamtools/inst /doc/Rsamtools-Overview.pdf MEDIPS.createSet uses a ScanBamParam internally, but as a user you do not have access to it. Martin > > For the first method, a "destination" output file name has to be designated. I don't want to create a separate file, so it appears that the second method is what I want, as it allows control of which records are imported and doesn't output a separate file. > > I set up a FilterRules object like so: >> filt <- FilterRules(list(isChr = function(x) rname(x) == "^Chr")) > > I also set up a "which" variable to contain the seqnames in my BSgenome: >> which <- seqnames(BSgenome) > > However, I'm getting stuck trying to apply either of these to ScanBamParam. Perhaps I want ScanBamWhat instead of ScanBamParam? Can someone clue me in about the proper syntax to accomplish this simple filtering operation? > > Thanks for any help, > --Kelly > > ________________________________________ > From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] on behalf of Vining, Kelly [Kelly.Vining at oregonstate.edu] > Sent: Tuesday, April 01, 2014 8:12 AM > To: 'Lukas Chavez' > Cc: Steve Lianoglou; bioconductor at r-project.org > Subject: Re: [BioC] MEDIPS.createSet error > > I see, Lukas. That makes sense. Thanks for explaining it. So, MEDIPS looks in the BSgenome object for seqnames only, and doesn't look in mseqnames. It seems like a lot of users of the MEDIPS package must encounter this problem, because most genomes have sets of unassembled scaffolds that, following the BSgenome forging instructions, inevitably end up as mseq objects. It would be great if a future version of MEDIPS looked for both types of sequences in the BSgenome. > > I also meant to reply to Steve's question about this line from the vignette: > BSgenome="BSgenome.Hsapiens.UCSC.hg19" > > In my case, with my custom genome, this would be: > BSgenome = "BSgenome.Ptrichocarpa.Phytozome.v3" > > But when I do this, the literal BSgenome.Ptrichocarpa.Phytozome.v3 is assigned to the BSgenome variable as a character vector. So instead, I do this: > BSgenome = BSgenome.Ptrichocarpa.Phytozome.v3 > > That works. So maybe quotes are only required for genomes listed under available_genomes()? > > Meanwhile, for my MEDIPS analysis, I'm willing to go back to the bam files and filter out the scaffold alignments. It looks like Rsamtools has methods for this. Should I be trying to set up a FilterRules instance, or a ScanBamParam instance? Any help with the syntax for this step will be appreciated. > > I'm sure this is not going to work, but I'll start with something like this: > bamFilt_Fall1 <- BamFile(filterBam(file, filter=FilterRules("scaffold"))) > > > Thanks much, > --Kelly > > From: Lukas Chavez [mailto:lukas.chavez.mailings at googlemail.com] > Sent: Tuesday, April 01, 2014 12:32 AM > To: Vining, Kelly > Cc: Steve Lianoglou; bioconductor at r-project.org > Subject: Re: [BioC] MEDIPS.createSet error > > Thank you Steve for explaining in detail how to look at the content of the BSgenome object and the bam file. > > When MEDIPS imports the bam file, it will try to extract information for the 'scaffold_' chromosomes from the BSgenome object. The initially reported error > > Calculating genomic coordinates...Error in vector(length = supersize_chr[length( > chromosomes)], mode = "character") : > vector size cannot be NA/NaN > occurs because MEDIPS is trying to find the length of a chromosome given in the bam file, but does not find the chromosome in the BSgenome object. Subsequently, MEDIPS tries to calculate genomic coordinates for the chromosome wide windows, but the length of the chromosome is NA. I understand that it will be necessary to catch this error and add a warning message in a future version of MEDIPS. > If you want to keep your hits on the 'scaffold_' chromosomes in the bam file, and to include all scaffolds in your further analysis, you will have to recreate your BSgenome object treating each scaffold as an individual chromosome just as the other assembled chromosomes. > Lukas > > > > On Mon, Mar 31, 2014 at 3:05 PM, Vining, Kelly <kelly.vining at="" oregonstate.edu<mailto:kelly.vining="" at="" oregonstate.edu="">> wrote: > Thanks, Steve, for your advice - appreciate you working with a package that isn't familiar to you. > Thanks also for catching that I didn't have Rsamtools installed. > > I think this output that you suggested I generate gives an idea about what might be going on: > > >> si.bsg > Seqinfo of length 19 > seqnames seqlengths isCircular genome > Chr01 50495391 FALSE 3.0 > Chr02 25263035 FALSE 3.0 > Chr03 21816808 FALSE 3.0 > Chr04 24267051 FALSE 3.0 > Chr05 25890704 FALSE 3.0 > ... ... ... ... > Chr15 15278577 FALSE 3.0 > Chr16 14494361 FALSE 3.0 > Chr17 16080358 FALSE 3.0 > Chr18 16958300 FALSE 3.0 > Chr19 15942145 FALSE 3.0 >> si.bam > Seqinfo of length 1446 > seqnames seqlengths isCircular genome > Chr01 50495391 <na> <na> > Chr02 25263035 <na> <na> > Chr03 21816808 <na> <na> > Chr04 24267051 <na> <na> > Chr05 25890704 <na> <na> > ... ... ... ... > scaffold_3584 3654 <na> <na> > scaffold_3595 2008 <na> <na> > scaffold_3648 2796 <na> <na> > scaffold_3664 3053 <na> <na> > scaffold_3681 1022 <na> <na> > > > When I created the reference genome structure, the scaffolds were all entered in a single, multiple seq object, following instructions. But in my bam files, the scaffold hits are as shown above. So that could be one problem, but I don't know how to solve that other than filtering out all of the alignments to the scaffolds from the bam files. I really would like to retain the scaffold alignments and not lose all of that precious data. > > The other potential issue is the "NA" entries in the "isCircular" and "genome" columns. I'm not sure whether those would be a problem. > > Continued thanks, > --Kelly > > ________________________________________ > From: Steve Lianoglou [lianoglou.steve at gene.com<mailto:lianoglou.steve at="" gene.com="">] > Sent: Monday, March 31, 2014 1:15 PM > To: Vining, Kelly > Cc: Lukas Chavez; bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > Subject: Re: [BioC] MEDIPS.createSet error > Hi, > > Caveat being that I've never used MEDIPS, and I'm just going along with > the vignette. > > So, comments inline: > > > On 31 Mar 2014, at 13:09, Vining, Kelly wrote: > >> Hi, >> Thanks for the advice thus far! To confirm what is in my BSgenome >> variable, I did this: >> >>> class(BSgenome) >> [1] "BSgenome" >> attr(,"package") >> [1] "BSgenome" >>> names(BSgenome) >> [1] "Chr01" "Chr02" "Chr03" "Chr04" "Chr05" "Chr06" "Chr07" "Chr08" >> "Chr09" >> [10] "Chr10" "Chr11" "Chr12" "Chr13" "Chr14" "Chr15" "Chr16" "Chr17" >> "Chr18" >> [19] "Chr19" "scaf" >> >> And then: >>> print(BSgenome) >> Black cottonwood genome >> | >> | organism: Populus trichocarpa (Black cottonwood) >> | provider: Phytozome (JGI) >> | provider version: 3.0 >> | release date: January 2010 >> | release name: Populus trichocarpa v3.0 >> | >> | single sequences (see '?seqnames'): >> | Chr01 Chr02 Chr03 Chr04 Chr05 Chr06 Chr07 Chr08 Chr09 >> Chr10 Chr11 >> | Chr12 Chr13 Chr14 Chr15 Chr16 Chr17 Chr18 Chr19 >> | >> | multiple sequences (see '?mseqnames'): >> | scaf >> | >> | (use the '$' or '[[' operator to access a given sequence) >> >> >> So that looks ok. Interestingly, when I followed the vignette and did >> the equivalent of >> BSgenome="BSgenome.Hsapiens.UCSC.hg19" > > The vignette suggests that BSgenome should be the package name of the > BSgenome package to open, so yours should be something like > "BSgenome.Ptrichocarpa.XXX.YY" or something -- I guess you built this > packge by yourself, or something, so you'd know its name ... > >> That didn't work for me. It only worked without quotes. If I included >> quotes, it just assigned a character vector to that variable. > > Sorry, what exactly didn't work for you? Can you show me the code that > failed? > >> Then, following your advice: >> >>> si.bsg <- seqinfo(BSgenome.Ptrichocarpa.Phytozome.v3) >>> si.bam <- seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) >> Error in seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) : >> error in evaluating the argument 'x' in selecting a method for >> function 'seqinfo': Error: could not find function "BamFile" > > The BamFile function is defined in the Rsamtools package, you need to > load that first (the first line of code I suggested you run was to load > the Rsamtools package). Load it first, then redo the > seqinfo(BamFile(...)) stuff. > > -steve > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 10.7 years ago Martin Morgan 25k

0

Entering edit mode

It'll be great if I can simply use the chr.select parameter instead of creating a separate subset. I tried that, and it looks like it almost worked, but now I'm getting a different error: > budFall1_MEDIP = MEDIPS.createSet(file="FallBud_brep1_aln_sorted.bam", chr.select=seqnames(BSgenome), BSgenome=BSgenome, extend=300, uniq=TRUE, shift=0, window_size=1000) Reading bam alignment FallBud_brep1_aln_sorted.bam Selecting Chr01 Chr02 Chr03 Chr04 Chr05 Chr06 Chr07 Chr08 Chr09 Chr10 Chr11 Chr12 Chr13 Chr14 Chr15 Chr16 Chr17 Chr18 Chr19 Total number of imported short reads: 18306036 Extending reads... Creating GRange Object... Extract unique regions... Number of unique short reads: 15614381 Error in as.environment(pos) : no item called "paste("package:", BSgenome, sep = "")" on the search list In addition: Warning message: In ls(paste("package:", BSgenome, sep = "")) : ?paste("package:", BSgenome, sep = "")? converted to character string Any idea what's going on here? It looks like this has to do with how MEDIPS.createSet is processing the BSgenome object. Thanks, --Kelly ________________________________________ From: Martin Morgan [mtmorgan@fhcrc.org] Sent: Tuesday, April 01, 2014 1:49 PM To: Vining, Kelly; 'Lukas Chavez' Cc: Steve Lianoglou; bioconductor at r-project.org Subject: Re: [BioC] Rsamtools filtering bam file: RE: MEDIPS.createSet error On 04/01/2014 12:52 PM, Vining, Kelly wrote: > This is a follow-up question about the most efficient way to use functions in Rsamtools to filter my bam files so that only alignments to chromosomes are included (extrachromosomal scaffold alignments are excluded). I think from your original question you are really looking to provide the names of the sequences in your BSgenome object as a value to the chr.select argument of MEDIPS.createSet, I *think* chr.select=seqnames(BSgenome) (be sure to only include seqnames defined in the bam file, too). > >>From the Rsamtools documentation, it looks like there are two ways to accomplish this: > 1) bamFile(bamFilter()) ??? I'm not sure what you're talking about here, neither bamFile nor bamFilter are functions defined in Rsamtools ??? Use filterBam() to create a new bam file from an old bam files. It's use is illustrated in an example on the help page ?filterBam. > 2) bamFile(ScanBamParam()) ScanBamParam() is a way to selectively input data, e.g., with the readGAlignments or scanBam functions. Its use is illustrated for instance on p. 2 of the vignette available in the Rsamtools package and at http://bioconductor.org/packages/release/bioc/vignettes/Rsamtools/inst /doc/Rsamtools-Overview.pdf MEDIPS.createSet uses a ScanBamParam internally, but as a user you do not have access to it. Martin > > For the first method, a "destination" output file name has to be designated. I don't want to create a separate file, so it appears that the second method is what I want, as it allows control of which records are imported and doesn't output a separate file. > > I set up a FilterRules object like so: >> filt <- FilterRules(list(isChr = function(x) rname(x) == "^Chr")) > > I also set up a "which" variable to contain the seqnames in my BSgenome: >> which <- seqnames(BSgenome) > > However, I'm getting stuck trying to apply either of these to ScanBamParam. Perhaps I want ScanBamWhat instead of ScanBamParam? Can someone clue me in about the proper syntax to accomplish this simple filtering operation? > > Thanks for any help, > --Kelly > > ________________________________________ > From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] on behalf of Vining, Kelly [Kelly.Vining at oregonstate.edu] > Sent: Tuesday, April 01, 2014 8:12 AM > To: 'Lukas Chavez' > Cc: Steve Lianoglou; bioconductor at r-project.org > Subject: Re: [BioC] MEDIPS.createSet error > > I see, Lukas. That makes sense. Thanks for explaining it. So, MEDIPS looks in the BSgenome object for seqnames only, and doesn't look in mseqnames. It seems like a lot of users of the MEDIPS package must encounter this problem, because most genomes have sets of unassembled scaffolds that, following the BSgenome forging instructions, inevitably end up as mseq objects. It would be great if a future version of MEDIPS looked for both types of sequences in the BSgenome. > > I also meant to reply to Steve's question about this line from the vignette: > BSgenome="BSgenome.Hsapiens.UCSC.hg19" > > In my case, with my custom genome, this would be: > BSgenome = "BSgenome.Ptrichocarpa.Phytozome.v3" > > But when I do this, the literal BSgenome.Ptrichocarpa.Phytozome.v3 is assigned to the BSgenome variable as a character vector. So instead, I do this: > BSgenome = BSgenome.Ptrichocarpa.Phytozome.v3 > > That works. So maybe quotes are only required for genomes listed under available_genomes()? > > Meanwhile, for my MEDIPS analysis, I'm willing to go back to the bam files and filter out the scaffold alignments. It looks like Rsamtools has methods for this. Should I be trying to set up a FilterRules instance, or a ScanBamParam instance? Any help with the syntax for this step will be appreciated. > > I'm sure this is not going to work, but I'll start with something like this: > bamFilt_Fall1 <- BamFile(filterBam(file, filter=FilterRules("scaffold"))) > > > Thanks much, > --Kelly > > From: Lukas Chavez [mailto:lukas.chavez.mailings at googlemail.com] > Sent: Tuesday, April 01, 2014 12:32 AM > To: Vining, Kelly > Cc: Steve Lianoglou; bioconductor at r-project.org > Subject: Re: [BioC] MEDIPS.createSet error > > Thank you Steve for explaining in detail how to look at the content of the BSgenome object and the bam file. > > When MEDIPS imports the bam file, it will try to extract information for the 'scaffold_' chromosomes from the BSgenome object. The initially reported error > > Calculating genomic coordinates...Error in vector(length = supersize_chr[length( > chromosomes)], mode = "character") : > vector size cannot be NA/NaN > occurs because MEDIPS is trying to find the length of a chromosome given in the bam file, but does not find the chromosome in the BSgenome object. Subsequently, MEDIPS tries to calculate genomic coordinates for the chromosome wide windows, but the length of the chromosome is NA. I understand that it will be necessary to catch this error and add a warning message in a future version of MEDIPS. > If you want to keep your hits on the 'scaffold_' chromosomes in the bam file, and to include all scaffolds in your further analysis, you will have to recreate your BSgenome object treating each scaffold as an individual chromosome just as the other assembled chromosomes. > Lukas > > > > On Mon, Mar 31, 2014 at 3:05 PM, Vining, Kelly <kelly.vining at="" oregonstate.edu<mailto:kelly.vining="" at="" oregonstate.edu="">> wrote: > Thanks, Steve, for your advice - appreciate you working with a package that isn't familiar to you. > Thanks also for catching that I didn't have Rsamtools installed. > > I think this output that you suggested I generate gives an idea about what might be going on: > > >> si.bsg > Seqinfo of length 19 > seqnames seqlengths isCircular genome > Chr01 50495391 FALSE 3.0 > Chr02 25263035 FALSE 3.0 > Chr03 21816808 FALSE 3.0 > Chr04 24267051 FALSE 3.0 > Chr05 25890704 FALSE 3.0 > ... ... ... ... > Chr15 15278577 FALSE 3.0 > Chr16 14494361 FALSE 3.0 > Chr17 16080358 FALSE 3.0 > Chr18 16958300 FALSE 3.0 > Chr19 15942145 FALSE 3.0 >> si.bam > Seqinfo of length 1446 > seqnames seqlengths isCircular genome > Chr01 50495391 <na> <na> > Chr02 25263035 <na> <na> > Chr03 21816808 <na> <na> > Chr04 24267051 <na> <na> > Chr05 25890704 <na> <na> > ... ... ... ... > scaffold_3584 3654 <na> <na> > scaffold_3595 2008 <na> <na> > scaffold_3648 2796 <na> <na> > scaffold_3664 3053 <na> <na> > scaffold_3681 1022 <na> <na> > > > When I created the reference genome structure, the scaffolds were all entered in a single, multiple seq object, following instructions. But in my bam files, the scaffold hits are as shown above. So that could be one problem, but I don't know how to solve that other than filtering out all of the alignments to the scaffolds from the bam files. I really would like to retain the scaffold alignments and not lose all of that precious data. > > The other potential issue is the "NA" entries in the "isCircular" and "genome" columns. I'm not sure whether those would be a problem. > > Continued thanks, > --Kelly > > ________________________________________ > From: Steve Lianoglou [lianoglou.steve at gene.com<mailto:lianoglou.steve at="" gene.com="">] > Sent: Monday, March 31, 2014 1:15 PM > To: Vining, Kelly > Cc: Lukas Chavez; bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > Subject: Re: [BioC] MEDIPS.createSet error > Hi, > > Caveat being that I've never used MEDIPS, and I'm just going along with > the vignette. > > So, comments inline: > > > On 31 Mar 2014, at 13:09, Vining, Kelly wrote: > >> Hi, >> Thanks for the advice thus far! To confirm what is in my BSgenome >> variable, I did this: >> >>> class(BSgenome) >> [1] "BSgenome" >> attr(,"package") >> [1] "BSgenome" >>> names(BSgenome) >> [1] "Chr01" "Chr02" "Chr03" "Chr04" "Chr05" "Chr06" "Chr07" "Chr08" >> "Chr09" >> [10] "Chr10" "Chr11" "Chr12" "Chr13" "Chr14" "Chr15" "Chr16" "Chr17" >> "Chr18" >> [19] "Chr19" "scaf" >> >> And then: >>> print(BSgenome) >> Black cottonwood genome >> | >> | organism: Populus trichocarpa (Black cottonwood) >> | provider: Phytozome (JGI) >> | provider version: 3.0 >> | release date: January 2010 >> | release name: Populus trichocarpa v3.0 >> | >> | single sequences (see '?seqnames'): >> | Chr01 Chr02 Chr03 Chr04 Chr05 Chr06 Chr07 Chr08 Chr09 >> Chr10 Chr11 >> | Chr12 Chr13 Chr14 Chr15 Chr16 Chr17 Chr18 Chr19 >> | >> | multiple sequences (see '?mseqnames'): >> | scaf >> | >> | (use the '$' or '[[' operator to access a given sequence) >> >> >> So that looks ok. Interestingly, when I followed the vignette and did >> the equivalent of >> BSgenome="BSgenome.Hsapiens.UCSC.hg19" > > The vignette suggests that BSgenome should be the package name of the > BSgenome package to open, so yours should be something like > "BSgenome.Ptrichocarpa.XXX.YY" or something -- I guess you built this > packge by yourself, or something, so you'd know its name ... > >> That didn't work for me. It only worked without quotes. If I included >> quotes, it just assigned a character vector to that variable. > > Sorry, what exactly didn't work for you? Can you show me the code that > failed? > >> Then, following your advice: >> >>> si.bsg <- seqinfo(BSgenome.Ptrichocarpa.Phytozome.v3) >>> si.bam <- seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) >> Error in seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) : >> error in evaluating the argument 'x' in selecting a method for >> function 'seqinfo': Error: could not find function "BamFile" > > The BamFile function is defined in the Rsamtools package, you need to > load that first (the first line of code I suggested you run was to load > the Rsamtools package). Load it first, then redo the > seqinfo(BamFile(...)) stuff. > > -steve > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 10.7 years ago Vining, Kelly ▴ 220

0

Entering edit mode

On 04/01/2014 02:43 PM, Vining, Kelly wrote: > It'll be great if I can simply use the chr.select parameter instead of creating a separate subset. I tried that, and it looks like it almost worked, but now I'm getting a different error: > > >> budFall1_MEDIP = MEDIPS.createSet(file="FallBud_brep1_aln_sorted.bam", chr.select=seqnames(BSgenome), BSgenome=BSgenome, extend=300, uniq=TRUE, shift=0, window_size=1000) > Reading bam alignment FallBud_brep1_aln_sorted.bam > Selecting Chr01 Chr02 Chr03 Chr04 Chr05 Chr06 Chr07 Chr08 Chr09 Chr10 Chr11 Chr12 Chr13 Chr14 Chr15 Chr16 Chr17 Chr18 Chr19 > Total number of imported short reads: 18306036 > Extending reads... > Creating GRange Object... > Extract unique regions... > Number of unique short reads: 15614381 > Error in as.environment(pos) : > no item called "paste("package:", BSgenome, sep = "")" on the search list > In addition: Warning message: > In ls(paste("package:", BSgenome, sep = "")) : > ?paste("package:", BSgenome, sep = "")? converted to character string I think the argument BSgenome is supposed to be a character string naming the BSgenome object, whereas you are providing the BSgenome object itself. I think you want budFall1_MEDIP = MEDIPS.createSet(file="FallBud_brep1_aln_sorted.bam", chr.select=seqnames(BSgenome), BSgenome="BSgenome", extend=300, uniq=TRUE, shift=0, window_size=1000) For instance, I get a very similar message if I do the following > library(BSgenome.Hsapiens.UCSC.hg19) > ls(paste("package:", BSgenome.Hsapiens.UCSC.hg19, sep="")) Error in as.environment(pos) : no item called "paste("package:", BSgenome.Hsapiens.UCSC.hg19)" on the search list In addition: Warning message: In ls(paste("package:", BSgenome.Hsapiens.UCSC.hg19)) : 'paste("package:", BSgenome.Hsapiens.UCSC.hg19)' converted to character string but not if I do > ls(paste("package:", "BSgenome.Hsapiens.UCSC.hg19", sep="")) [1] "BSgenome.Hsapiens.UCSC.hg19" "Hsapiens" I think Steve tried to point this out in an earlier message; see also the vignette and help page for MEDIPS.createSet. Martin > > > Any idea what's going on here? It looks like this has to do with how MEDIPS.createSet is processing the BSgenome object. > > Thanks, > --Kelly > > ________________________________________ > From: Martin Morgan [mtmorgan at fhcrc.org] > Sent: Tuesday, April 01, 2014 1:49 PM > To: Vining, Kelly; 'Lukas Chavez' > Cc: Steve Lianoglou; bioconductor at r-project.org > Subject: Re: [BioC] Rsamtools filtering bam file: RE: MEDIPS.createSet error > > On 04/01/2014 12:52 PM, Vining, Kelly wrote: >> This is a follow-up question about the most efficient way to use functions in Rsamtools to filter my bam files so that only alignments to chromosomes are included (extrachromosomal scaffold alignments are excluded). > > I think from your original question you are really looking to provide the names > of the sequences in your BSgenome object as a value to the chr.select argument > of MEDIPS.createSet, I *think* chr.select=seqnames(BSgenome) (be sure to only > include seqnames defined in the bam file, too). > >> >> >From the Rsamtools documentation, it looks like there are two ways to accomplish this: >> 1) bamFile(bamFilter()) > > ??? I'm not sure what you're talking about here, neither bamFile nor bamFilter > are functions defined in Rsamtools ??? > > Use filterBam() to create a new bam file from an old bam files. It's use is > illustrated in an example on the help page ?filterBam. > >> 2) bamFile(ScanBamParam()) > > ScanBamParam() is a way to selectively input data, e.g., with the > readGAlignments or scanBam functions. Its use is illustrated for instance on p. > 2 of the vignette available in the Rsamtools package and at > > http://bioconductor.org/packages/release/bioc/vignettes/Rsamtools/in st/doc/Rsamtools-Overview.pdf > > MEDIPS.createSet uses a ScanBamParam internally, but as a user you do not have > access to it. > > Martin > >> >> For the first method, a "destination" output file name has to be designated. I don't want to create a separate file, so it appears that the second method is what I want, as it allows control of which records are imported and doesn't output a separate file. >> >> I set up a FilterRules object like so: >>> filt <- FilterRules(list(isChr = function(x) rname(x) == "^Chr")) >> >> I also set up a "which" variable to contain the seqnames in my BSgenome: >>> which <- seqnames(BSgenome) >> >> However, I'm getting stuck trying to apply either of these to ScanBamParam. Perhaps I want ScanBamWhat instead of ScanBamParam? Can someone clue me in about the proper syntax to accomplish this simple filtering operation? >> >> Thanks for any help, >> --Kelly >> >> ________________________________________ >> From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] on behalf of Vining, Kelly [Kelly.Vining at oregonstate.edu] >> Sent: Tuesday, April 01, 2014 8:12 AM >> To: 'Lukas Chavez' >> Cc: Steve Lianoglou; bioconductor at r-project.org >> Subject: Re: [BioC] MEDIPS.createSet error >> >> I see, Lukas. That makes sense. Thanks for explaining it. So, MEDIPS looks in the BSgenome object for seqnames only, and doesn't look in mseqnames. It seems like a lot of users of the MEDIPS package must encounter this problem, because most genomes have sets of unassembled scaffolds that, following the BSgenome forging instructions, inevitably end up as mseq objects. It would be great if a future version of MEDIPS looked for both types of sequences in the BSgenome. >> >> I also meant to reply to Steve's question about this line from the vignette: >> BSgenome="BSgenome.Hsapiens.UCSC.hg19" >> >> In my case, with my custom genome, this would be: >> BSgenome = "BSgenome.Ptrichocarpa.Phytozome.v3" >> >> But when I do this, the literal BSgenome.Ptrichocarpa.Phytozome.v3 is assigned to the BSgenome variable as a character vector. So instead, I do this: >> BSgenome = BSgenome.Ptrichocarpa.Phytozome.v3 >> >> That works. So maybe quotes are only required for genomes listed under available_genomes()? >> >> Meanwhile, for my MEDIPS analysis, I'm willing to go back to the bam files and filter out the scaffold alignments. It looks like Rsamtools has methods for this. Should I be trying to set up a FilterRules instance, or a ScanBamParam instance? Any help with the syntax for this step will be appreciated. >> >> I'm sure this is not going to work, but I'll start with something like this: >> bamFilt_Fall1 <- BamFile(filterBam(file, filter=FilterRules("scaffold"))) >> >> >> Thanks much, >> --Kelly >> >> From: Lukas Chavez [mailto:lukas.chavez.mailings at googlemail.com] >> Sent: Tuesday, April 01, 2014 12:32 AM >> To: Vining, Kelly >> Cc: Steve Lianoglou; bioconductor at r-project.org >> Subject: Re: [BioC] MEDIPS.createSet error >> >> Thank you Steve for explaining in detail how to look at the content of the BSgenome object and the bam file. >> >> When MEDIPS imports the bam file, it will try to extract information for the 'scaffold_' chromosomes from the BSgenome object. The initially reported error >> >> Calculating genomic coordinates...Error in vector(length = supersize_chr[length( >> chromosomes)], mode = "character") : >> vector size cannot be NA/NaN >> occurs because MEDIPS is trying to find the length of a chromosome given in the bam file, but does not find the chromosome in the BSgenome object. Subsequently, MEDIPS tries to calculate genomic coordinates for the chromosome wide windows, but the length of the chromosome is NA. I understand that it will be necessary to catch this error and add a warning message in a future version of MEDIPS. >> If you want to keep your hits on the 'scaffold_' chromosomes in the bam file, and to include all scaffolds in your further analysis, you will have to recreate your BSgenome object treating each scaffold as an individual chromosome just as the other assembled chromosomes. >> Lukas >> >> >> >> On Mon, Mar 31, 2014 at 3:05 PM, Vining, Kelly <kelly.vining at="" oregonstate.edu<mailto:kelly.vining="" at="" oregonstate.edu="">> wrote: >> Thanks, Steve, for your advice - appreciate you working with a package that isn't familiar to you. >> Thanks also for catching that I didn't have Rsamtools installed. >> >> I think this output that you suggested I generate gives an idea about what might be going on: >> >> >>> si.bsg >> Seqinfo of length 19 >> seqnames seqlengths isCircular genome >> Chr01 50495391 FALSE 3.0 >> Chr02 25263035 FALSE 3.0 >> Chr03 21816808 FALSE 3.0 >> Chr04 24267051 FALSE 3.0 >> Chr05 25890704 FALSE 3.0 >> ... ... ... ... >> Chr15 15278577 FALSE 3.0 >> Chr16 14494361 FALSE 3.0 >> Chr17 16080358 FALSE 3.0 >> Chr18 16958300 FALSE 3.0 >> Chr19 15942145 FALSE 3.0 >>> si.bam >> Seqinfo of length 1446 >> seqnames seqlengths isCircular genome >> Chr01 50495391 <na> <na> >> Chr02 25263035 <na> <na> >> Chr03 21816808 <na> <na> >> Chr04 24267051 <na> <na> >> Chr05 25890704 <na> <na> >> ... ... ... ... >> scaffold_3584 3654 <na> <na> >> scaffold_3595 2008 <na> <na> >> scaffold_3648 2796 <na> <na> >> scaffold_3664 3053 <na> <na> >> scaffold_3681 1022 <na> <na> >> >> >> When I created the reference genome structure, the scaffolds were all entered in a single, multiple seq object, following instructions. But in my bam files, the scaffold hits are as shown above. So that could be one problem, but I don't know how to solve that other than filtering out all of the alignments to the scaffolds from the bam files. I really would like to retain the scaffold alignments and not lose all of that precious data. >> >> The other potential issue is the "NA" entries in the "isCircular" and "genome" columns. I'm not sure whether those would be a problem. >> >> Continued thanks, >> --Kelly >> >> ________________________________________ >> From: Steve Lianoglou [lianoglou.steve at gene.com<mailto:lianoglou.steve at="" gene.com="">] >> Sent: Monday, March 31, 2014 1:15 PM >> To: Vining, Kelly >> Cc: Lukas Chavez; bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >> Subject: Re: [BioC] MEDIPS.createSet error >> Hi, >> >> Caveat being that I've never used MEDIPS, and I'm just going along with >> the vignette. >> >> So, comments inline: >> >> >> On 31 Mar 2014, at 13:09, Vining, Kelly wrote: >> >>> Hi, >>> Thanks for the advice thus far! To confirm what is in my BSgenome >>> variable, I did this: >>> >>>> class(BSgenome) >>> [1] "BSgenome" >>> attr(,"package") >>> [1] "BSgenome" >>>> names(BSgenome) >>> [1] "Chr01" "Chr02" "Chr03" "Chr04" "Chr05" "Chr06" "Chr07" "Chr08" >>> "Chr09" >>> [10] "Chr10" "Chr11" "Chr12" "Chr13" "Chr14" "Chr15" "Chr16" "Chr17" >>> "Chr18" >>> [19] "Chr19" "scaf" >>> >>> And then: >>>> print(BSgenome) >>> Black cottonwood genome >>> | >>> | organism: Populus trichocarpa (Black cottonwood) >>> | provider: Phytozome (JGI) >>> | provider version: 3.0 >>> | release date: January 2010 >>> | release name: Populus trichocarpa v3.0 >>> | >>> | single sequences (see '?seqnames'): >>> | Chr01 Chr02 Chr03 Chr04 Chr05 Chr06 Chr07 Chr08 Chr09 >>> Chr10 Chr11 >>> | Chr12 Chr13 Chr14 Chr15 Chr16 Chr17 Chr18 Chr19 >>> | >>> | multiple sequences (see '?mseqnames'): >>> | scaf >>> | >>> | (use the '$' or '[[' operator to access a given sequence) >>> >>> >>> So that looks ok. Interestingly, when I followed the vignette and did >>> the equivalent of >>> BSgenome="BSgenome.Hsapiens.UCSC.hg19" >> >> The vignette suggests that BSgenome should be the package name of the >> BSgenome package to open, so yours should be something like >> "BSgenome.Ptrichocarpa.XXX.YY" or something -- I guess you built this >> packge by yourself, or something, so you'd know its name ... >> >>> That didn't work for me. It only worked without quotes. If I included >>> quotes, it just assigned a character vector to that variable. >> >> Sorry, what exactly didn't work for you? Can you show me the code that >> failed? >> >>> Then, following your advice: >>> >>>> si.bsg <- seqinfo(BSgenome.Ptrichocarpa.Phytozome.v3) >>>> si.bam <- seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) >>> Error in seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) : >>> error in evaluating the argument 'x' in selecting a method for >>> function 'seqinfo': Error: could not find function "BamFile" >> >> The BamFile function is defined in the Rsamtools package, you need to >> load that first (the first line of code I suggested you run was to load >> the Rsamtools package). Load it first, then redo the >> seqinfo(BamFile(...)) stuff. >> >> -steve >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 10.7 years ago Martin Morgan 25k

0

Entering edit mode

Yes, the chr.select parameter will allow you to import selected chromosomes only from your bam files. I recommend to create an according bam index file (samtools) and to store it together with the bam file in the same directory. Lukas On Tue, Apr 1, 2014 at 1:49 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > On 04/01/2014 12:52 PM, Vining, Kelly wrote: > >> This is a follow-up question about the most efficient way to use >> functions in Rsamtools to filter my bam files so that only alignments to >> chromosomes are included (extrachromosomal scaffold alignments are >> excluded). >> > > I think from your original question you are really looking to provide the > names of the sequences in your BSgenome object as a value to the chr.select > argument of MEDIPS.createSet, I *think* chr.select=seqnames(BSgenome) (be > sure to only include seqnames defined in the bam file, too). > > >> From the Rsamtools documentation, it looks like there are two ways to >>> accomplish this: >>> >> 1) bamFile(bamFilter()) >> > > ??? I'm not sure what you're talking about here, neither bamFile nor > bamFilter are functions defined in Rsamtools ??? > > Use filterBam() to create a new bam file from an old bam files. It's use > is illustrated in an example on the help page ?filterBam. > > 2) bamFile(ScanBamParam()) >> > > ScanBamParam() is a way to selectively input data, e.g., with the > readGAlignments or scanBam functions. Its use is illustrated for instance > on p. 2 of the vignette available in the Rsamtools package and at > > http://bioconductor.org/packages/release/bioc/ > vignettes/Rsamtools/inst/doc/Rsamtools-Overview.pdf > > MEDIPS.createSet uses a ScanBamParam internally, but as a user you do not > have access to it. > > Martin > > >> For the first method, a "destination" output file name has to be >> designated. I don't want to create a separate file, so it appears that the >> second method is what I want, as it allows control of which records are >> imported and doesn't output a separate file. >> >> I set up a FilterRules object like so: >> >>> filt <- FilterRules(list(isChr = function(x) rname(x) == "^Chr")) >>> >> >> I also set up a "which" variable to contain the seqnames in my BSgenome: >> >>> which <- seqnames(BSgenome) >>> >> >> However, I'm getting stuck trying to apply either of these to >> ScanBamParam. Perhaps I want ScanBamWhat instead of ScanBamParam? Can >> someone clue me in about the proper syntax to accomplish this simple >> filtering operation? >> >> Thanks for any help, >> --Kelly >> >> ________________________________________ >> From: bioconductor-bounces@r-project.org [bioconductor-bounces@r- >> project.org] on behalf of Vining, Kelly [Kelly.Vining@oregonstate.edu] >> Sent: Tuesday, April 01, 2014 8:12 AM >> To: 'Lukas Chavez' >> Cc: Steve Lianoglou; bioconductor@r-project.org >> Subject: Re: [BioC] MEDIPS.createSet error >> >> I see, Lukas. That makes sense. Thanks for explaining it. So, MEDIPS >> looks in the BSgenome object for seqnames only, and doesn't look in >> mseqnames. It seems like a lot of users of the MEDIPS package must >> encounter this problem, because most genomes have sets of unassembled >> scaffolds that, following the BSgenome forging instructions, inevitably end >> up as mseq objects. It would be great if a future version of MEDIPS looked >> for both types of sequences in the BSgenome. >> >> I also meant to reply to Steve's question about this line from the >> vignette: >> BSgenome="BSgenome.Hsapiens.UCSC.hg19" >> >> In my case, with my custom genome, this would be: >> BSgenome = "BSgenome.Ptrichocarpa.Phytozome.v3" >> >> But when I do this, the literal BSgenome.Ptrichocarpa.Phytozome.v3 is >> assigned to the BSgenome variable as a character vector. So instead, I do >> this: >> BSgenome = BSgenome.Ptrichocarpa.Phytozome.v3 >> >> That works. So maybe quotes are only required for genomes listed under >> available_genomes()? >> >> Meanwhile, for my MEDIPS analysis, I'm willing to go back to the bam >> files and filter out the scaffold alignments. It looks like Rsamtools has >> methods for this. Should I be trying to set up a FilterRules instance, or a >> ScanBamParam instance? Any help with the syntax for this step will be >> appreciated. >> >> I'm sure this is not going to work, but I'll start with something like >> this: >> bamFilt_Fall1 <- BamFile(filterBam(file, filter=FilterRules("scaffold"))) >> >> >> Thanks much, >> --Kelly >> >> From: Lukas Chavez [mailto:lukas.chavez.mailings@googlemail.com] >> Sent: Tuesday, April 01, 2014 12:32 AM >> To: Vining, Kelly >> Cc: Steve Lianoglou; bioconductor@r-project.org >> Subject: Re: [BioC] MEDIPS.createSet error >> >> Thank you Steve for explaining in detail how to look at the content of >> the BSgenome object and the bam file. >> >> When MEDIPS imports the bam file, it will try to extract information for >> the 'scaffold_' chromosomes from the BSgenome object. The initially >> reported error >> >> Calculating genomic coordinates...Error in vector(length = >> supersize_chr[length( >> chromosomes)], mode = "character") : >> vector size cannot be NA/NaN >> occurs because MEDIPS is trying to find the length of a chromosome given >> in the bam file, but does not find the chromosome in the BSgenome object. >> Subsequently, MEDIPS tries to calculate genomic coordinates for the >> chromosome wide windows, but the length of the chromosome is NA. I >> understand that it will be necessary to catch this error and add a warning >> message in a future version of MEDIPS. >> If you want to keep your hits on the 'scaffold_' chromosomes in the bam >> file, and to include all scaffolds in your further analysis, you will have >> to recreate your BSgenome object treating each scaffold as an individual >> chromosome just as the other assembled chromosomes. >> Lukas >> >> >> >> On Mon, Mar 31, 2014 at 3:05 PM, Vining, Kelly < >> Kelly.Vining@oregonstate.edu<mailto:kelly.vining@oregonstate.edu>> wrote: >> Thanks, Steve, for your advice - appreciate you working with a package >> that isn't familiar to you. >> Thanks also for catching that I didn't have Rsamtools installed. >> >> I think this output that you suggested I generate gives an idea about >> what might be going on: >> >> >> si.bsg >>> >> Seqinfo of length 19 >> seqnames seqlengths isCircular genome >> Chr01 50495391 FALSE 3.0 >> Chr02 25263035 FALSE 3.0 >> Chr03 21816808 FALSE 3.0 >> Chr04 24267051 FALSE 3.0 >> Chr05 25890704 FALSE 3.0 >> ... ... ... ... >> Chr15 15278577 FALSE 3.0 >> Chr16 14494361 FALSE 3.0 >> Chr17 16080358 FALSE 3.0 >> Chr18 16958300 FALSE 3.0 >> Chr19 15942145 FALSE 3.0 >> >>> si.bam >>> >> Seqinfo of length 1446 >> seqnames seqlengths isCircular genome >> Chr01 50495391 <na> <na> >> Chr02 25263035 <na> <na> >> Chr03 21816808 <na> <na> >> Chr04 24267051 <na> <na> >> Chr05 25890704 <na> <na> >> ... ... ... ... >> scaffold_3584 3654 <na> <na> >> scaffold_3595 2008 <na> <na> >> scaffold_3648 2796 <na> <na> >> scaffold_3664 3053 <na> <na> >> scaffold_3681 1022 <na> <na> >> >> >> When I created the reference genome structure, the scaffolds were all >> entered in a single, multiple seq object, following instructions. But in my >> bam files, the scaffold hits are as shown above. So that could be one >> problem, but I don't know how to solve that other than filtering out all of >> the alignments to the scaffolds from the bam files. I really would like to >> retain the scaffold alignments and not lose all of that precious data. >> >> The other potential issue is the "NA" entries in the "isCircular" and >> "genome" columns. I'm not sure whether those would be a problem. >> >> Continued thanks, >> --Kelly >> >> ________________________________________ >> From: Steve Lianoglou [lianoglou.steve@gene.com<mailto:>> lianoglou.steve@gene.com>] >> Sent: Monday, March 31, 2014 1:15 PM >> To: Vining, Kelly >> Cc: Lukas Chavez; bioconductor@r-project.org<mailto:>> bioconductor@r-project.org> >> Subject: Re: [BioC] MEDIPS.createSet error >> Hi, >> >> Caveat being that I've never used MEDIPS, and I'm just going along with >> the vignette. >> >> So, comments inline: >> >> >> On 31 Mar 2014, at 13:09, Vining, Kelly wrote: >> >> Hi, >>> Thanks for the advice thus far! To confirm what is in my BSgenome >>> variable, I did this: >>> >>> class(BSgenome) >>>> >>> [1] "BSgenome" >>> attr(,"package") >>> [1] "BSgenome" >>> >>>> names(BSgenome) >>>> >>> [1] "Chr01" "Chr02" "Chr03" "Chr04" "Chr05" "Chr06" "Chr07" "Chr08" >>> "Chr09" >>> [10] "Chr10" "Chr11" "Chr12" "Chr13" "Chr14" "Chr15" "Chr16" "Chr17" >>> "Chr18" >>> [19] "Chr19" "scaf" >>> >>> And then: >>> >>>> print(BSgenome) >>>> >>> Black cottonwood genome >>> | >>> | organism: Populus trichocarpa (Black cottonwood) >>> | provider: Phytozome (JGI) >>> | provider version: 3.0 >>> | release date: January 2010 >>> | release name: Populus trichocarpa v3.0 >>> | >>> | single sequences (see '?seqnames'): >>> | Chr01 Chr02 Chr03 Chr04 Chr05 Chr06 Chr07 Chr08 Chr09 >>> Chr10 Chr11 >>> | Chr12 Chr13 Chr14 Chr15 Chr16 Chr17 Chr18 Chr19 >>> | >>> | multiple sequences (see '?mseqnames'): >>> | scaf >>> | >>> | (use the '$' or '[[' operator to access a given sequence) >>> >>> >>> So that looks ok. Interestingly, when I followed the vignette and did >>> the equivalent of >>> BSgenome="BSgenome.Hsapiens.UCSC.hg19" >>> >> >> The vignette suggests that BSgenome should be the package name of the >> BSgenome package to open, so yours should be something like >> "BSgenome.Ptrichocarpa.XXX.YY" or something -- I guess you built this >> packge by yourself, or something, so you'd know its name ... >> >> That didn't work for me. It only worked without quotes. If I included >>> quotes, it just assigned a character vector to that variable. >>> >> >> Sorry, what exactly didn't work for you? Can you show me the code that >> failed? >> >> Then, following your advice: >>> >>> si.bsg <- seqinfo(BSgenome.Ptrichocarpa.Phytozome.v3) >>>> si.bam <- seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) >>>> >>> Error in seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) : >>> error in evaluating the argument 'x' in selecting a method for >>> function 'seqinfo': Error: could not find function "BamFile" >>> >> >> The BamFile function is defined in the Rsamtools package, you need to >> load that first (the first line of code I suggested you run was to load >> the Rsamtools package). Load it first, then redo the >> seqinfo(BamFile(...)) stuff. >> >> -steve >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> >> > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > [[alternative HTML version deleted]]

ADD REPLY • link 10.7 years ago Lukas Chavez ▴ 570

0

Entering edit mode

And yes, the argument BSgenome is supposed to be a character string naming the BSgenome object. Please let me know, if this will work for you. Lukas P.S. I like the chr.select=seqnames(BSgenome) idea- I think I will implement this as a default together with a warning/ info message. On Wed, Apr 2, 2014 at 12:46 AM, Lukas Chavez < lukas.chavez.mailings@googlemail.com> wrote: > Yes, the chr.select parameter will allow you to import selected > chromosomes only from your bam files. I recommend to create an according > bam index file (samtools) and to store it together with the bam file in the > same directory. > > Lukas > > > On Tue, Apr 1, 2014 at 1:49 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > >> On 04/01/2014 12:52 PM, Vining, Kelly wrote: >> >>> This is a follow-up question about the most efficient way to use >>> functions in Rsamtools to filter my bam files so that only alignments to >>> chromosomes are included (extrachromosomal scaffold alignments are >>> excluded). >>> >> >> I think from your original question you are really looking to provide the >> names of the sequences in your BSgenome object as a value to the chr.select >> argument of MEDIPS.createSet, I *think* chr.select=seqnames(BSgenome) (be >> sure to only include seqnames defined in the bam file, too). >> >> >>> From the Rsamtools documentation, it looks like there are two ways to >>>> accomplish this: >>>> >>> 1) bamFile(bamFilter()) >>> >> >> ??? I'm not sure what you're talking about here, neither bamFile nor >> bamFilter are functions defined in Rsamtools ??? >> >> Use filterBam() to create a new bam file from an old bam files. It's use >> is illustrated in an example on the help page ?filterBam. >> >> 2) bamFile(ScanBamParam()) >>> >> >> ScanBamParam() is a way to selectively input data, e.g., with the >> readGAlignments or scanBam functions. Its use is illustrated for instance >> on p. 2 of the vignette available in the Rsamtools package and at >> >> http://bioconductor.org/packages/release/bioc/ >> vignettes/Rsamtools/inst/doc/Rsamtools-Overview.pdf >> >> MEDIPS.createSet uses a ScanBamParam internally, but as a user you do not >> have access to it. >> >> Martin >> >> >>> For the first method, a "destination" output file name has to be >>> designated. I don't want to create a separate file, so it appears that the >>> second method is what I want, as it allows control of which records are >>> imported and doesn't output a separate file. >>> >>> I set up a FilterRules object like so: >>> >>>> filt <- FilterRules(list(isChr = function(x) rname(x) == "^Chr")) >>>> >>> >>> I also set up a "which" variable to contain the seqnames in my BSgenome: >>> >>>> which <- seqnames(BSgenome) >>>> >>> >>> However, I'm getting stuck trying to apply either of these to >>> ScanBamParam. Perhaps I want ScanBamWhat instead of ScanBamParam? Can >>> someone clue me in about the proper syntax to accomplish this simple >>> filtering operation? >>> >>> Thanks for any help, >>> --Kelly >>> >>> ________________________________________ >>> From: bioconductor-bounces@r-project.org [bioconductor-bounces@r- >>> project.org] on behalf of Vining, Kelly [Kelly.Vining@oregonstate.edu] >>> Sent: Tuesday, April 01, 2014 8:12 AM >>> To: 'Lukas Chavez' >>> Cc: Steve Lianoglou; bioconductor@r-project.org >>> Subject: Re: [BioC] MEDIPS.createSet error >>> >>> I see, Lukas. That makes sense. Thanks for explaining it. So, MEDIPS >>> looks in the BSgenome object for seqnames only, and doesn't look in >>> mseqnames. It seems like a lot of users of the MEDIPS package must >>> encounter this problem, because most genomes have sets of unassembled >>> scaffolds that, following the BSgenome forging instructions, inevitably end >>> up as mseq objects. It would be great if a future version of MEDIPS looked >>> for both types of sequences in the BSgenome. >>> >>> I also meant to reply to Steve's question about this line from the >>> vignette: >>> BSgenome="BSgenome.Hsapiens.UCSC.hg19" >>> >>> In my case, with my custom genome, this would be: >>> BSgenome = "BSgenome.Ptrichocarpa.Phytozome.v3" >>> >>> But when I do this, the literal BSgenome.Ptrichocarpa.Phytozome.v3 is >>> assigned to the BSgenome variable as a character vector. So instead, I do >>> this: >>> BSgenome = BSgenome.Ptrichocarpa.Phytozome.v3 >>> >>> That works. So maybe quotes are only required for genomes listed under >>> available_genomes()? >>> >>> Meanwhile, for my MEDIPS analysis, I'm willing to go back to the bam >>> files and filter out the scaffold alignments. It looks like Rsamtools has >>> methods for this. Should I be trying to set up a FilterRules instance, or a >>> ScanBamParam instance? Any help with the syntax for this step will be >>> appreciated. >>> >>> I'm sure this is not going to work, but I'll start with something like >>> this: >>> bamFilt_Fall1 <- BamFile(filterBam(file, filter=FilterRules("scaffold") >>> )) >>> >>> >>> Thanks much, >>> --Kelly >>> >>> From: Lukas Chavez [mailto:lukas.chavez.mailings@googlemail.com] >>> Sent: Tuesday, April 01, 2014 12:32 AM >>> To: Vining, Kelly >>> Cc: Steve Lianoglou; bioconductor@r-project.org >>> Subject: Re: [BioC] MEDIPS.createSet error >>> >>> Thank you Steve for explaining in detail how to look at the content of >>> the BSgenome object and the bam file. >>> >>> When MEDIPS imports the bam file, it will try to extract information for >>> the 'scaffold_' chromosomes from the BSgenome object. The initially >>> reported error >>> >>> Calculating genomic coordinates...Error in vector(length = >>> supersize_chr[length( >>> chromosomes)], mode = "character") : >>> vector size cannot be NA/NaN >>> occurs because MEDIPS is trying to find the length of a chromosome given >>> in the bam file, but does not find the chromosome in the BSgenome object. >>> Subsequently, MEDIPS tries to calculate genomic coordinates for the >>> chromosome wide windows, but the length of the chromosome is NA. I >>> understand that it will be necessary to catch this error and add a warning >>> message in a future version of MEDIPS. >>> If you want to keep your hits on the 'scaffold_' chromosomes in the bam >>> file, and to include all scaffolds in your further analysis, you will have >>> to recreate your BSgenome object treating each scaffold as an individual >>> chromosome just as the other assembled chromosomes. >>> Lukas >>> >>> >>> >>> On Mon, Mar 31, 2014 at 3:05 PM, Vining, Kelly < >>> Kelly.Vining@oregonstate.edu<mailto:kelly.vining@oregonstate.edu>> >>> wrote: >>> Thanks, Steve, for your advice - appreciate you working with a package >>> that isn't familiar to you. >>> Thanks also for catching that I didn't have Rsamtools installed. >>> >>> I think this output that you suggested I generate gives an idea about >>> what might be going on: >>> >>> >>> si.bsg >>>> >>> Seqinfo of length 19 >>> seqnames seqlengths isCircular genome >>> Chr01 50495391 FALSE 3.0 >>> Chr02 25263035 FALSE 3.0 >>> Chr03 21816808 FALSE 3.0 >>> Chr04 24267051 FALSE 3.0 >>> Chr05 25890704 FALSE 3.0 >>> ... ... ... ... >>> Chr15 15278577 FALSE 3.0 >>> Chr16 14494361 FALSE 3.0 >>> Chr17 16080358 FALSE 3.0 >>> Chr18 16958300 FALSE 3.0 >>> Chr19 15942145 FALSE 3.0 >>> >>>> si.bam >>>> >>> Seqinfo of length 1446 >>> seqnames seqlengths isCircular genome >>> Chr01 50495391 <na> <na> >>> Chr02 25263035 <na> <na> >>> Chr03 21816808 <na> <na> >>> Chr04 24267051 <na> <na> >>> Chr05 25890704 <na> <na> >>> ... ... ... ... >>> scaffold_3584 3654 <na> <na> >>> scaffold_3595 2008 <na> <na> >>> scaffold_3648 2796 <na> <na> >>> scaffold_3664 3053 <na> <na> >>> scaffold_3681 1022 <na> <na> >>> >>> >>> When I created the reference genome structure, the scaffolds were all >>> entered in a single, multiple seq object, following instructions. But in my >>> bam files, the scaffold hits are as shown above. So that could be one >>> problem, but I don't know how to solve that other than filtering out all of >>> the alignments to the scaffolds from the bam files. I really would like to >>> retain the scaffold alignments and not lose all of that precious data. >>> >>> The other potential issue is the "NA" entries in the "isCircular" and >>> "genome" columns. I'm not sure whether those would be a problem. >>> >>> Continued thanks, >>> --Kelly >>> >>> ________________________________________ >>> From: Steve Lianoglou [lianoglou.steve@gene.com<mailto:>>> lianoglou.steve@gene.com>] >>> Sent: Monday, March 31, 2014 1:15 PM >>> To: Vining, Kelly >>> Cc: Lukas Chavez; bioconductor@r-project.org<mailto:>>> bioconductor@r-project.org> >>> Subject: Re: [BioC] MEDIPS.createSet error >>> Hi, >>> >>> Caveat being that I've never used MEDIPS, and I'm just going along with >>> the vignette. >>> >>> So, comments inline: >>> >>> >>> On 31 Mar 2014, at 13:09, Vining, Kelly wrote: >>> >>> Hi, >>>> Thanks for the advice thus far! To confirm what is in my BSgenome >>>> variable, I did this: >>>> >>>> class(BSgenome) >>>>> >>>> [1] "BSgenome" >>>> attr(,"package") >>>> [1] "BSgenome" >>>> >>>>> names(BSgenome) >>>>> >>>> [1] "Chr01" "Chr02" "Chr03" "Chr04" "Chr05" "Chr06" "Chr07" "Chr08" >>>> "Chr09" >>>> [10] "Chr10" "Chr11" "Chr12" "Chr13" "Chr14" "Chr15" "Chr16" "Chr17" >>>> "Chr18" >>>> [19] "Chr19" "scaf" >>>> >>>> And then: >>>> >>>>> print(BSgenome) >>>>> >>>> Black cottonwood genome >>>> | >>>> | organism: Populus trichocarpa (Black cottonwood) >>>> | provider: Phytozome (JGI) >>>> | provider version: 3.0 >>>> | release date: January 2010 >>>> | release name: Populus trichocarpa v3.0 >>>> | >>>> | single sequences (see '?seqnames'): >>>> | Chr01 Chr02 Chr03 Chr04 Chr05 Chr06 Chr07 Chr08 Chr09 >>>> Chr10 Chr11 >>>> | Chr12 Chr13 Chr14 Chr15 Chr16 Chr17 Chr18 Chr19 >>>> | >>>> | multiple sequences (see '?mseqnames'): >>>> | scaf >>>> | >>>> | (use the '$' or '[[' operator to access a given sequence) >>>> >>>> >>>> So that looks ok. Interestingly, when I followed the vignette and did >>>> the equivalent of >>>> BSgenome="BSgenome.Hsapiens.UCSC.hg19" >>>> >>> >>> The vignette suggests that BSgenome should be the package name of the >>> BSgenome package to open, so yours should be something like >>> "BSgenome.Ptrichocarpa.XXX.YY" or something -- I guess you built this >>> packge by yourself, or something, so you'd know its name ... >>> >>> That didn't work for me. It only worked without quotes. If I included >>>> quotes, it just assigned a character vector to that variable. >>>> >>> >>> Sorry, what exactly didn't work for you? Can you show me the code that >>> failed? >>> >>> Then, following your advice: >>>> >>>> si.bsg <- seqinfo(BSgenome.Ptrichocarpa.Phytozome.v3) >>>>> si.bam <- seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) >>>>> >>>> Error in seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) : >>>> error in evaluating the argument 'x' in selecting a method for >>>> function 'seqinfo': Error: could not find function "BamFile" >>>> >>> >>> The BamFile function is defined in the Rsamtools package, you need to >>> load that first (the first line of code I suggested you run was to load >>> the Rsamtools package). Load it first, then redo the >>> seqinfo(BamFile(...)) stuff. >>> >>> -steve >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane. >>> science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane. >>> science.biology.informatics.conductor >>> >>> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 >> > > [[alternative HTML version deleted]]

ADD REPLY • link 10.7 years ago Lukas Chavez ▴ 570

Login before adding your answer.