By way of getting a reproducible example, I ran
library(Rsamtools)
example(filterBam)
qnames <- scanBam(fl, param=ScanBamParam(what="qname"))[[1]][[1]]
table(vapply(strsplit(qnames, "_"), "[[", character(1), 1))
with the output
> table(vapply(strsplit(qnames, "_"), "[[", character(1), 1))
B7 EAS1 EAS112 EAS114 EAS139 EAS188 EAS192 EAS218 EAS219 EAS220 EAS221
467 505 61 534 209 107 68 82 91 51 130
EAS51 EAS54 EAS56
222 355 425
Suppose I'm interested in the reads starting with 'B7'. I'd use ScanBamParam(what="qname")
to extract minimal information from the BAM file for the filter. This would give my filter a DataFrame with a single column, qname, and my filter's task would be to return a logical vector indicating which rows pass the filter, so maybe grepl("^B7_", df$qname)
or in R-devel the new-fangled startsWith(df$qname, "B7_")
. So...
dest <- tempfile()
filter <- FilterRules(list(iWantB7=function(x) startsWith(x$qname, "B7")))
filterBam(fl, dest, filter=filter, param=ScanBamParam(what="qname"))
and...
> qnames <- scanBam(dest, param=ScanBamParam(what="qname"))[[1]][[1]]
> table(vapply(strsplit(qnames, "_"), "[[", character(1), 1))
B7
467
I worked through this by consulting the help page (also for FilterRules) and getting off the ground with a basic filter
filter <- FilterRules(list(iWantB7=function(x) {
print(head(x, 3))
rep(FALSE, nrow(x))
}))
and
> filterBam(fl, dest, filter=filter)
DataFrame with 3 rows and 15 columns
qname flag rname strand pos qwidth
<character> <integer> <factor> <factor> <integer> <integer>
1 B7_591:4:96:693:509 73 seq1 + 1 36
2 EAS54_65:7:152:368:113 73 seq1 + 3 35
3 EAS51_64:8:5:734:57 137 seq1 + 5 35
mapq cigar mrnm mpos isize seq
<integer> <character> <factor> <integer> <integer> <DNAStringSet>
1 99 36M NA NA NA CACTAGTGGC...GTTTAACTCG
2 99 35M NA NA NA CTAGTGGCTC...TTTAACTCGT
3 99 35M NA NA NA AGTGGCTCAT...TAACTCGTCC
qual groupid mate_status
<PhredQuality> <integer> <factor>
1 <<<<<<<<<<...<<<<<;:<;7 0 NA
2 <<<<<<<<<<...9<<3/:<6): 0 NA
3 <<<<<<<<<<...<3;);3*8/5 0 NA
[1] "/tmp/RtmpOySFZo/file46ce6598a760"
Probably a better approach is to use formal tags (rather than munging qnames) to annotate your BAM file (you would do this 'upstream', when you're doing alignments or otherwise, and then use the tagFilter
argument to ScanBamParam()
to selectively input the data rather than making yet another set of BAM files.
Thanks a lot Dr. Morgan for a fast and helpful reply.
Actually I add these tags in my raw fastq files for demultiplexing the samples post-alignment (my fastq files are multiplexed using a novel multiplexing index). I was exploring the idea that demultiplexing after alignment might be more convenient than demultiplexing raw fastq files, followed by alignment.
The solution you suggested worked nicely. I was wondering how I can make it more efficient to filter a large bam file into 3-4 bamFiles this way..
Thanks again
Vivek
in release the only way to do this is with multiple passes through the file, perhaps using
BiocParallel::bplapply
asI added a more efficient implementation of this to the 'devel' version of Rsamtools, version 1.23.2; this requires use of the devel version of Bioconductor, and should be available via
biocLite()
after the builds complete tomorrow afternoon, January 15, with