filterBam

0

Entering edit mode

rcaloger ▴ 500

@rcaloger-1888

Last seen 10.1 years ago

European Union

Hi, I am trying to filter a bam file using a set of coordinates. I sorted the bam file using both sortBam or Picard SortSam tools but I always get the same error: filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons", param=ScanBamParam(which=exons.gr)) Error in FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic_d ata_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], : failed to build index file: /data03/calogero/Documents/singapore/transcripts/semi_synthetic_data_s et/for_tm_creation/m1/accepted_hits.tp_exons In addition: Warning messages: 1: In FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic_d ata_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], : [bam_index_core] the alignment is not sorted (D44TDFP1_1:1:1102:6940:117318): 85079683 > 84952092 in 1-th chr 2: In FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic_d ata_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], : [bam_index_build2] fail to index the BAM file. Any suggestion how to handle this problem? cheers Raf exons.gr GRanges with 3399 ranges and 1 metadata column: seqnames ranges strand | transcriptID <rle> <iranges> <rle> | <character> [1] chr1 [ 4764598, 4766882] * | uc007aff.2 [2] chr1 [ 7110275, 7110696] * | uc007agb.1 [3] chr1 [ 9858548, 9858769] * | uc007agw.1 [4] chr1 [10029961, 10029968] * | uc007ahk.1 ... ... ... ... ... ... [3397] chrX [147525139, 147525147] * | uc012hqq.1 [3398] chrX [148034128, 148034929] * | uc009upe.1 [3399] chrX [156357850, 156359017] * | uc009uss.2 --- seqlengths: chr1 chr10 chr11 chr12 chr13 chr14 ... chr5 chr6 chr7 chr8 chr9 chrX NA NA NA NA NA NA ... NA NA NA NA NA NA sessionInfo() R version 2.15.3 (2013-03-01) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 [4] IRanges_1.16.6 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] bitops_1.0-5 parallel_2.15.3 stats4_2.15.3 zlibbioc_1.4.0 -- ---------------------------------------- Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero at unito.it raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it

Alignment Alignment • 2.3k views

ADD COMMENT • link updated 11.7 years ago by raffaele calogero ▴ 500 • written 11.7 years ago by rcaloger ▴ 500

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 10 weeks ago

United States

On 07/15/2013 11:19 PM, rcaloger wrote: > Hi, > I am trying to filter a bam file using a set of coordinates. > I sorted the bam file using both sortBam or Picard SortSam tools but I always > get the same error: > > filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons", > param=ScanBamParam(which=exons.gr)) > > Error in > FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic _data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], > : > failed to build index > file: > /data03/calogero/Documents/singapore/transcripts/semi_synthetic_data _set/for_tm_creation/m1/accepted_hits.tp_exons > > In addition: Warning messages: > 1: In > FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic _data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], > : > [bam_index_core] the alignment is not sorted (D44TDFP1_1:1:1102:6940:117318): > 85079683 > 84952092 in 1-th chr > 2: In > FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic _data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], > : > [bam_index_build2] fail to index the BAM file. filterBam tries to place the GRanges into correct order by calling bf <- BamFile("accepted_hits.sorted.bam") xx <- Rsamtools:::.filterBam_preprocess(bf, param) after which bamWhich(xx) is supposed to have (a) non-overlapping ranges with (b) seqnames in the same order as in the original BAM file and (c) ordered by start position. I'm supposing that this fails for your exons.gr, and would be happy to trouble-shoot if you're willing to share (off-list is fine) the result of saveexons.gr, seqlevels(BamFile("accepted_hits.sorted.bam")), file="test.Rda") Martin > > > Any suggestion how to handle this problem? > cheers > Raf > > exons.gr > GRanges with 3399 ranges and 1 metadata column: > seqnames ranges strand | transcriptID > <rle> <iranges> <rle> | <character> > [1] chr1 [ 4764598, 4766882] * | uc007aff.2 > [2] chr1 [ 7110275, 7110696] * | uc007agb.1 > [3] chr1 [ 9858548, 9858769] * | uc007agw.1 > [4] chr1 [10029961, 10029968] * | uc007ahk.1 > ... ... ... ... ... ... > [3397] chrX [147525139, 147525147] * | uc012hqq.1 > [3398] chrX [148034128, 148034929] * | uc009upe.1 > [3399] chrX [156357850, 156359017] * | uc009uss.2 > --- > seqlengths: > chr1 chr10 chr11 chr12 chr13 chr14 ... chr5 chr6 chr7 chr8 chr9 chrX > NA NA NA NA NA NA ... NA NA NA NA NA NA > > sessionInfo() > R version 2.15.3 (2013-03-01) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 > [4] IRanges_1.16.6 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] bitops_1.0-5 parallel_2.15.3 stats4_2.15.3 zlibbioc_1.4.0 > > -- ---------------------------------------- > Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di > Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax > ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero at unito.it > raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 11.7 years ago Martin Morgan 25k

0

Entering edit mode

raffaele calogero ▴ 500

@raffaele-calogero-294

Last seen 9.4 years ago

Italy/Turin/University of Torino

Dear Martin, many thanks for the information. I got the point: in exons.gr: seqlevelsexons.gr) [1] "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" [10] "chr18" "chr19" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" [19] "chr9" "chrX" seqlevels(BamFile("accepted_hits.sorted.bam") [1] "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" [10] "chr18" "chr19" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" [19] "chr9" "chrM" "chrX" "chrY" "chrM" is missing in exons.gr There is any simple way to add to exons.gr also the chrM? you can get the test.Rda at : https://mail.unito.it/mailftp/uplister/get_file.php?file=ZmlsZXMvRjEvN DAvZDBjMGExMTk0OTkyYWM3YzY4Nzg0MTFlYjc1My80MzA0ZWU0MzA1OGMzYTBhMmQ1NWZ jMWZjNDM4M2M5ZQ== Many thanks ! Raf the file I used to generate exon.gr was ordered by chr and then by start position On 7/16/13 1:25 PM, Martin Morgan wrote: > On 07/15/2013 11:19 PM, rcaloger wrote: >> Hi, >> I am trying to filter a bam file using a set of coordinates. >> I sorted the bam file using both sortBam or Picard SortSam tools but >> I always >> get the same error: >> >> filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons", >> param=ScanBamParam(which=exons.gr)) >> >> Error in >> FUN("/data03/calogero/Documents/singapore/transcripts/semi_syntheti c_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], >> >> : >> failed to build index >> file: >> /data03/calogero/Documents/singapore/transcripts/semi_synthetic_dat a_set/for_tm_creation/m1/accepted_hits.tp_exons >> >> >> In addition: Warning messages: >> 1: In >> FUN("/data03/calogero/Documents/singapore/transcripts/semi_syntheti c_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], >> >> : >> [bam_index_core] the alignment is not sorted >> (D44TDFP1_1:1:1102:6940:117318): >> 85079683 > 84952092 in 1-th chr >> 2: In >> FUN("/data03/calogero/Documents/singapore/transcripts/semi_syntheti c_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], >> >> : >> [bam_index_build2] fail to index the BAM file. > > filterBam tries to place the GRanges into correct order by calling > > bf <- BamFile("accepted_hits.sorted.bam") > xx <- Rsamtools:::.filterBam_preprocess(bf, param) > > after which > > bamWhich(xx) > > is supposed to have (a) non-overlapping ranges with (b) seqnames in > the same order as in the original BAM file and (c) ordered by start > position. > > I'm supposing that this fails for your exons.gr, and would be happy to > trouble-shoot if you're willing to share (off-list is fine) the result of > > saveexons.gr, seqlevels(BamFile("accepted_hits.sorted.bam")), > file="test.Rda") > > Martin > >> >> >> Any suggestion how to handle this problem? >> cheers >> Raf >> >> exons.gr >> GRanges with 3399 ranges and 1 metadata column: >> seqnames ranges strand | transcriptID >> <rle> <iranges> <rle> | <character> >> [1] chr1 [ 4764598, 4766882] * | uc007aff.2 >> [2] chr1 [ 7110275, 7110696] * | uc007agb.1 >> [3] chr1 [ 9858548, 9858769] * | uc007agw.1 >> [4] chr1 [10029961, 10029968] * | uc007ahk.1 >> ... ... ... ... ... ... >> [3397] chrX [147525139, 147525147] * | uc012hqq.1 >> [3398] chrX [148034128, 148034929] * | uc009upe.1 >> [3399] chrX [156357850, 156359017] * | uc009uss.2 >> --- >> seqlengths: >> chr1 chr10 chr11 chr12 chr13 chr14 ... chr5 chr6 chr7 chr8 >> chr9 chrX >> NA NA NA NA NA NA ... NA NA NA NA >> NA NA >> >> sessionInfo() >> R version 2.15.3 (2013-03-01) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 >> [4] IRanges_1.16.6 BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-5 parallel_2.15.3 stats4_2.15.3 zlibbioc_1.4.0 >> >> -- ---------------------------------------- >> Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC >> Centro di >> Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 >> 0116706457 Fax >> ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero@unito.it >> raffaele[dot]calogero[at]gmail[dot]com www: >> http://www.bioinformatica.unito.it >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- ---------------------------------------- Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero@unito.it raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it [[alternative HTML version deleted]]

ADD COMMENT • link 11.7 years ago raffaele calogero ▴ 500

0

Entering edit mode

On 07/16/2013 05:01 AM, rcaloger wrote: > Dear Martin, > many thanks for the information. > > I got the point: > in exons.gr: > seqlevelsexons.gr) > [1] "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" > [10] "chr18" "chr19" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" > [19] "chr9" "chrX" > > seqlevels(BamFile("accepted_hits.sorted.bam") > [1] "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" > [10] "chr18" "chr19" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" > [19] "chr9" "chrM" "chrX" "chrY" > > "chrM" is missing in exons.gr > > There is any simple way to add to exons.gr also the chrM? Thanks for the file; the missing 'chrM' should not matter, and the ranges look like they are sorted to me. Maybe a workaround is to use the indexDestination=FALSE argument to filterBam, and then sort and index the filtered results filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons", indexDestination=FALSE, param=ScanBamParam(which=exons.gr)) sortBam("accepted_hits.tp_exons", "accepted_hits.tp_exons.sorted") indexBam("accepted_hits.tp_exons.sorted") Martin > you can get the test.Rda at : > https://mail.unito.it/mailftp/uplister/get_file.php?file=ZmlsZXMvRjE vNDAvZDBjMGExMTk0OTkyYWM3YzY4Nzg0MTFlYjc1My80MzA0ZWU0MzA1OGMzYTBhMmQ1N WZjMWZjNDM4M2M5ZQ== > Many thanks ! > Raf > > the file I used to generate exon.gr was ordered by chr and then by start position > > On 7/16/13 1:25 PM, Martin Morgan wrote: >> On 07/15/2013 11:19 PM, rcaloger wrote: >>> Hi, >>> I am trying to filter a bam file using a set of coordinates. >>> I sorted the bam file using both sortBam or Picard SortSam tools but I always >>> get the same error: >>> >>> filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons", >>> param=ScanBamParam(which=exons.gr)) >>> >>> Error in >>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthet ic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], >>> >>> : >>> failed to build index >>> file: >>> /data03/calogero/Documents/singapore/transcripts/semi_synthetic_da ta_set/for_tm_creation/m1/accepted_hits.tp_exons >>> >>> >>> In addition: Warning messages: >>> 1: In >>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthet ic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], >>> >>> : >>> [bam_index_core] the alignment is not sorted (D44TDFP1_1:1:1102:6940:117318): >>> 85079683 > 84952092 in 1-th chr >>> 2: In >>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthet ic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], >>> >>> : >>> [bam_index_build2] fail to index the BAM file. >> >> filterBam tries to place the GRanges into correct order by calling >> >> bf <- BamFile("accepted_hits.sorted.bam") >> xx <- Rsamtools:::.filterBam_preprocess(bf, param) >> >> after which >> >> bamWhich(xx) >> >> is supposed to have (a) non-overlapping ranges with (b) seqnames in the same >> order as in the original BAM file and (c) ordered by start position. >> >> I'm supposing that this fails for your exons.gr, and would be happy to >> trouble-shoot if you're willing to share (off-list is fine) the result of >> >> saveexons.gr, seqlevels(BamFile("accepted_hits.sorted.bam")), file="test.Rda") >> >> Martin >> >>> >>> >>> Any suggestion how to handle this problem? >>> cheers >>> Raf >>> >>> exons.gr >>> GRanges with 3399 ranges and 1 metadata column: >>> seqnames ranges strand | transcriptID >>> <rle> <iranges> <rle> | <character> >>> [1] chr1 [ 4764598, 4766882] * | uc007aff.2 >>> [2] chr1 [ 7110275, 7110696] * | uc007agb.1 >>> [3] chr1 [ 9858548, 9858769] * | uc007agw.1 >>> [4] chr1 [10029961, 10029968] * | uc007ahk.1 >>> ... ... ... ... ... ... >>> [3397] chrX [147525139, 147525147] * | uc012hqq.1 >>> [3398] chrX [148034128, 148034929] * | uc009upe.1 >>> [3399] chrX [156357850, 156359017] * | uc009uss.2 >>> --- >>> seqlengths: >>> chr1 chr10 chr11 chr12 chr13 chr14 ... chr5 chr6 chr7 chr8 chr9 chrX >>> NA NA NA NA NA NA ... NA NA NA NA NA NA >>> >>> sessionInfo() >>> R version 2.15.3 (2013-03-01) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 >>> [4] IRanges_1.16.6 BiocGenerics_0.4.0 >>> >>> loaded via a namespace (and not attached): >>> [1] bitops_1.0-5 parallel_2.15.3 stats4_2.15.3 zlibbioc_1.4.0 >>> >>> -- ---------------------------------------- >>> Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di >>> Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax >>> ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero at unito.it >>> raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > > -- > > ---------------------------------------- > Prof. Raffaele A. Calogero > Bioinformatics and Genomics Unit > MBC Centro di Biotecnologie Molecolari > Via Nizza 52, Torino 10126 > tel. ++39 0116706457 > Fax ++39 0112366457 > Mobile ++39 3333827080 > email:raffaele.calogero at unito.it > raffaele[dot]calogero[at]gmail[dot]com > www:http://www.bioinformatica.unito.it > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 11.7 years ago Martin Morgan 25k

0

Entering edit mode

raffaele calogero ▴ 500

@raffaele-calogero-294

Last seen 9.4 years ago

Italy/Turin/University of Torino

Hi Martin, thanks again for the quick response. I managed to build the filtered bam file: filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons", indexDestination=FALSE, param=ScanBamParam(which=exons.gr)) sortBam("accepted_hits.tp_exons", "accepted_hits.tp_exons.sorted") indexBam("accepted_hits.tp_exons.sorted.bam") Cheers raf On 7/16/13 3:04 PM, Martin Morgan wrote: > On 07/16/2013 05:01 AM, rcaloger wrote: >> Dear Martin, >> many thanks for the information. >> >> I got the point: >> in exons.gr: >> seqlevelsexons.gr) >> [1] "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" >> "chr17" >> [10] "chr18" "chr19" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" >> "chr8" >> [19] "chr9" "chrX" >> >> seqlevels(BamFile("accepted_hits.sorted.bam") >> [1] "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" >> "chr17" >> [10] "chr18" "chr19" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" >> "chr8" >> [19] "chr9" "chrM" "chrX" "chrY" >> >> "chrM" is missing in exons.gr >> >> There is any simple way to add to exons.gr also the chrM? > > Thanks for the file; the missing 'chrM' should not matter, and the > ranges look like they are sorted to me. Maybe a workaround is to use > the indexDestination=FALSE argument to filterBam, and then sort and > index the filtered results > > filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons", > indexDestination=FALSE, param=ScanBamParam(which=exons.gr)) > sortBam("accepted_hits.tp_exons", "accepted_hits.tp_exons.sorted") > indexBam("accepted_hits.tp_exons.sorted") > > Martin > >> you can get the test.Rda at : >> https://mail.unito.it/mailftp/uplister/get_file.php?file=ZmlsZXMvRj EvNDAvZDBjMGExMTk0OTkyYWM3YzY4Nzg0MTFlYjc1My80MzA0ZWU0MzA1OGMzYTBhMmQ1 NWZjMWZjNDM4M2M5ZQ== >> >> Many thanks ! >> Raf >> >> the file I used to generate exon.gr was ordered by chr and then by >> start position >> >> On 7/16/13 1:25 PM, Martin Morgan wrote: >>> On 07/15/2013 11:19 PM, rcaloger wrote: >>>> Hi, >>>> I am trying to filter a bam file using a set of coordinates. >>>> I sorted the bam file using both sortBam or Picard SortSam tools >>>> but I always >>>> get the same error: >>>> >>>> filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons", >>>> param=ScanBamParam(which=exons.gr)) >>>> >>>> Error in >>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthe tic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], >>>> >>>> >>>> : >>>> failed to build index >>>> file: >>>> /data03/calogero/Documents/singapore/transcripts/semi_synthetic_d ata_set/for_tm_creation/m1/accepted_hits.tp_exons >>>> >>>> >>>> >>>> In addition: Warning messages: >>>> 1: In >>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthe tic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], >>>> >>>> >>>> : >>>> [bam_index_core] the alignment is not sorted >>>> (D44TDFP1_1:1:1102:6940:117318): >>>> 85079683 > 84952092 in 1-th chr >>>> 2: In >>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthe tic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], >>>> >>>> >>>> : >>>> [bam_index_build2] fail to index the BAM file. >>> >>> filterBam tries to place the GRanges into correct order by calling >>> >>> bf <- BamFile("accepted_hits.sorted.bam") >>> xx <- Rsamtools:::.filterBam_preprocess(bf, param) >>> >>> after which >>> >>> bamWhich(xx) >>> >>> is supposed to have (a) non-overlapping ranges with (b) seqnames in >>> the same >>> order as in the original BAM file and (c) ordered by start position. >>> >>> I'm supposing that this fails for your exons.gr, and would be happy to >>> trouble-shoot if you're willing to share (off-list is fine) the >>> result of >>> >>> saveexons.gr, seqlevels(BamFile("accepted_hits.sorted.bam")), >>> file="test.Rda") >>> >>> Martin >>> >>>> >>>> >>>> Any suggestion how to handle this problem? >>>> cheers >>>> Raf >>>> >>>> exons.gr >>>> GRanges with 3399 ranges and 1 metadata column: >>>> seqnames ranges strand | transcriptID >>>> <rle> <iranges> <rle> | <character> >>>> [1] chr1 [ 4764598, 4766882] * | uc007aff.2 >>>> [2] chr1 [ 7110275, 7110696] * | uc007agb.1 >>>> [3] chr1 [ 9858548, 9858769] * | uc007agw.1 >>>> [4] chr1 [10029961, 10029968] * | uc007ahk.1 >>>> ... ... ... ... ... ... >>>> [3397] chrX [147525139, 147525147] * | uc012hqq.1 >>>> [3398] chrX [148034128, 148034929] * | uc009upe.1 >>>> [3399] chrX [156357850, 156359017] * | uc009uss.2 >>>> --- >>>> seqlengths: >>>> chr1 chr10 chr11 chr12 chr13 chr14 ... chr5 chr6 chr7 chr8 >>>> chr9 chrX >>>> NA NA NA NA NA NA ... NA NA NA NA >>>> NA NA >>>> >>>> sessionInfo() >>>> R version 2.15.3 (2013-03-01) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=C LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> other attached packages: >>>> [1] Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 >>>> [4] IRanges_1.16.6 BiocGenerics_0.4.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] bitops_1.0-5 parallel_2.15.3 stats4_2.15.3 zlibbioc_1.4.0 >>>> >>>> -- ---------------------------------------- >>>> Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC >>>> Centro di >>>> Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 >>>> 0116706457 Fax >>>> ++39 0112366457 Mobile ++39 3333827080 email: >>>> raffaele.calogero at unito.it >>>> raffaele[dot]calogero[at]gmail[dot]com www: >>>> http://www.bioinformatica.unito.it >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> >> >> -- >> >> ---------------------------------------- >> Prof. Raffaele A. Calogero >> Bioinformatics and Genomics Unit >> MBC Centro di Biotecnologie Molecolari >> Via Nizza 52, Torino 10126 >> tel. ++39 0116706457 >> Fax ++39 0112366457 >> Mobile ++39 3333827080 >> email:raffaele.calogero at unito.it >> raffaele[dot]calogero[at]gmail[dot]com >> www:http://www.bioinformatica.unito.it >> > > -- ---------------------------------------- Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero at unito.it raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it

ADD COMMENT • link 11.7 years ago raffaele calogero ▴ 500

0

Entering edit mode

On 07/16/2013 06:20 AM, rcaloger wrote: > Hi Martin, > thanks again for the quick response. > I managed to build the filtered bam file: > > filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons", > indexDestination=FALSE, param=ScanBamParam(which=exons.gr)) > sortBam("accepted_hits.tp_exons", "accepted_hits.tp_exons.sorted") > indexBam("accepted_hits.tp_exons.sorted.bam") OK, good. I think what happens is that you specify two ranges |-----| |-----| and then a read overlapping the second range starts before a read overlapping the first range 1 x-x 2 x---------------x |-----| |-----| The bam file filters the first range and writes read 1, then filters the second range and writes read 2. The output file is no longer sorted even though the input file was sorted. In some ways I'm ok with this requiring that the new BAM file needs to be sorted then indexed. There is a second hazard here, and that is because read 2 overlaps both ranges, it is duplicated in the output file. This is probably not at all what is desired, and I'll implement a fix. Martin > > Cheers > raf > > On 7/16/13 3:04 PM, Martin Morgan wrote: >> On 07/16/2013 05:01 AM, rcaloger wrote: >>> Dear Martin, >>> many thanks for the information. >>> >>> I got the point: >>> in exons.gr: >>> seqlevelsexons.gr) >>> [1] "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" >>> [10] "chr18" "chr19" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" >>> [19] "chr9" "chrX" >>> >>> seqlevels(BamFile("accepted_hits.sorted.bam") >>> [1] "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" >>> [10] "chr18" "chr19" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" >>> [19] "chr9" "chrM" "chrX" "chrY" >>> >>> "chrM" is missing in exons.gr >>> >>> There is any simple way to add to exons.gr also the chrM? >> >> Thanks for the file; the missing 'chrM' should not matter, and the ranges look >> like they are sorted to me. Maybe a workaround is to use the >> indexDestination=FALSE argument to filterBam, and then sort and index the >> filtered results >> >> filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons", >> indexDestination=FALSE, param=ScanBamParam(which=exons.gr)) >> sortBam("accepted_hits.tp_exons", "accepted_hits.tp_exons.sorted") >> indexBam("accepted_hits.tp_exons.sorted") >> >> Martin >> >>> you can get the test.Rda at : >>> https://mail.unito.it/mailftp/uplister/get_file.php?file=ZmlsZXMvR jEvNDAvZDBjMGExMTk0OTkyYWM3YzY4Nzg0MTFlYjc1My80MzA0ZWU0MzA1OGMzYTBhMmQ 1NWZjMWZjNDM4M2M5ZQ== >>> >>> Many thanks ! >>> Raf >>> >>> the file I used to generate exon.gr was ordered by chr and then by start >>> position >>> >>> On 7/16/13 1:25 PM, Martin Morgan wrote: >>>> On 07/15/2013 11:19 PM, rcaloger wrote: >>>>> Hi, >>>>> I am trying to filter a bam file using a set of coordinates. >>>>> I sorted the bam file using both sortBam or Picard SortSam tools but I always >>>>> get the same error: >>>>> >>>>> filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons", >>>>> param=ScanBamParam(which=exons.gr)) >>>>> >>>>> Error in >>>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synth etic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], >>>>> >>>>> >>>>> : >>>>> failed to build index >>>>> file: >>>>> /data03/calogero/Documents/singapore/transcripts/semi_synthetic_ data_set/for_tm_creation/m1/accepted_hits.tp_exons >>>>> >>>>> >>>>> >>>>> In addition: Warning messages: >>>>> 1: In >>>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synth etic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], >>>>> >>>>> >>>>> : >>>>> [bam_index_core] the alignment is not sorted >>>>> (D44TDFP1_1:1:1102:6940:117318): >>>>> 85079683 > 84952092 in 1-th chr >>>>> 2: In >>>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synth etic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]], >>>>> >>>>> >>>>> : >>>>> [bam_index_build2] fail to index the BAM file. >>>> >>>> filterBam tries to place the GRanges into correct order by calling >>>> >>>> bf <- BamFile("accepted_hits.sorted.bam") >>>> xx <- Rsamtools:::.filterBam_preprocess(bf, param) >>>> >>>> after which >>>> >>>> bamWhich(xx) >>>> >>>> is supposed to have (a) non-overlapping ranges with (b) seqnames in the same >>>> order as in the original BAM file and (c) ordered by start position. >>>> >>>> I'm supposing that this fails for your exons.gr, and would be happy to >>>> trouble-shoot if you're willing to share (off-list is fine) the result of >>>> >>>> saveexons.gr, seqlevels(BamFile("accepted_hits.sorted.bam")), >>>> file="test.Rda") >>>> >>>> Martin >>>> >>>>> >>>>> >>>>> Any suggestion how to handle this problem? >>>>> cheers >>>>> Raf >>>>> >>>>> exons.gr >>>>> GRanges with 3399 ranges and 1 metadata column: >>>>> seqnames ranges strand | transcriptID >>>>> <rle> <iranges> <rle> | <character> >>>>> [1] chr1 [ 4764598, 4766882] * | uc007aff.2 >>>>> [2] chr1 [ 7110275, 7110696] * | uc007agb.1 >>>>> [3] chr1 [ 9858548, 9858769] * | uc007agw.1 >>>>> [4] chr1 [10029961, 10029968] * | uc007ahk.1 >>>>> ... ... ... ... ... ... >>>>> [3397] chrX [147525139, 147525147] * | uc012hqq.1 >>>>> [3398] chrX [148034128, 148034929] * | uc009upe.1 >>>>> [3399] chrX [156357850, 156359017] * | uc009uss.2 >>>>> --- >>>>> seqlengths: >>>>> chr1 chr10 chr11 chr12 chr13 chr14 ... chr5 chr6 chr7 chr8 chr9 chrX >>>>> NA NA NA NA NA NA ... NA NA NA NA NA NA >>>>> >>>>> sessionInfo() >>>>> R version 2.15.3 (2013-03-01) >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>> >>>>> locale: >>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>> [7] LC_PAPER=C LC_NAME=C >>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>> >>>>> attached base packages: >>>>> [1] stats graphics grDevices utils datasets methods base >>>>> >>>>> other attached packages: >>>>> [1] Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 >>>>> [4] IRanges_1.16.6 BiocGenerics_0.4.0 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] bitops_1.0-5 parallel_2.15.3 stats4_2.15.3 zlibbioc_1.4.0 >>>>> >>>>> -- ---------------------------------------- >>>>> Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di >>>>> Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax >>>>> ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero at unito.it >>>>> raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >>> >>> -- >>> >>> ---------------------------------------- >>> Prof. Raffaele A. Calogero >>> Bioinformatics and Genomics Unit >>> MBC Centro di Biotecnologie Molecolari >>> Via Nizza 52, Torino 10126 >>> tel. ++39 0116706457 >>> Fax ++39 0112366457 >>> Mobile ++39 3333827080 >>> email:raffaele.calogero at unito.it >>> raffaele[dot]calogero[at]gmail[dot]com >>> www:http://www.bioinformatica.unito.it >>> >> >> > > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 11.7 years ago Martin Morgan 25k

Login before adding your answer.