Hi there,
Below is my biocValid() result.
* sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BiocInstaller_1.22.2
loaded via a namespace (and not attached):
[1] tools_3.3.0
* Out-of-date packages
Package
Biostrings "Biostrings"
limma "limma"
LibPath Installed
Biostrings "/projects/sequence_analysis/vol3/tools/R-3.3.0/library" "2.40.1"
limma "/projects/sequence_analysis/vol3/tools/R-3.3.0/library" "3.28.5"
Built ReposVer
Biostrings "3.3.0" "2.40.2"
limma "3.3.0" "3.28.6"
Repository
Biostrings "https://bioconductor.org/packages/3.3/bioc/src/contrib"
limma "https://bioconductor.org/packages/3.3/bioc/src/contrib"
I am trying to run some MEDIPS QC analyses, for example MEDIPS.saturation, MEDIPS.seqCoverage, and MEDIPS.CpGenrich, on paired-end reads of MBD-seq experiments using bowtie and bwa aligners. I have MEDIPS v1.22 (supporting bwa paired-end bam) installed under R-3.3.0 as well as v1.16 installed under R-3.1.3 which is not supporting bwa paired-end bam file. When analyzing a BWA bam file with 48.5 million reads and 98% mapped on to hg19, MEDIPS.seqCoverage reports 99.97% of CpG's are covered >5X. This sounds too good to be true. I have another bam file from another MBD-Seq experiment for the same sample aligned by BOWTIE (42.8 million reads, 92% mapped). When try to analyze it by MEDIPS v1.22, below error message returned from saturation analysis:
...Creating GRange Object...
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
solving row 4330: negative widths are not allowed
Calls: MEDIPS.saturation ... newGRanges -> IRanges -> solveUserSEW0 -> .Call2 -> .Call
Execution halted
However, the same bam file was processed by v1.16 without error and reported 18.37% of CpG's covered by >5X. I looked into the source code for handling non-bwa reads for creating regions in getPairedGRange function.
From v1.16:
regions = data.frame(chr = as.character(as.vector(regions$rname)),
start = as.numeric(as.vector(regions$pos)), stop = as.numeric(as.vector(regions$pos) +
as.vector(regions$qwidth) - 1), strand = as.character(as.vector(regions$strand)),
isize = as.numeric(as.vector(regions$isize)), stringsAsFactors = F)
plus = regions$strand == "+"
regions[plus, "stop"] = regions[plus, "stop"] + regions[plus,"isize"] + extend
regions[!plus, "start"] = regions[!plus, "start"] + regions[!plus,"isize"] - extend
From v1.22:
regions = data.frame(chr = as.character(as.vector(regions$rname)),
start = as.numeric(as.vector(regions$pos)), stop = as.numeric(as.vector(regions$pos) +
as.vector(regions$qwidth) - 1), strand = as.character(as.vector(regions$strand)),
isize = as.numeric(as.vector(regions$isize)), stringsAsFactors = F)
plus = regions$strand == "+"
regions[plus, "stop"] = regions[plus, "start"] + regions[plus,"isize"] + extend
regions[!plus, "start"] = regions[!plus, "stop"] + regions[!plus,"isize"] - extend
Not only there are differences between 2 versions but also, if I understand correctly, there seems to be an error in defining the coordinates of the region of sequenced DNA fragment in v1.16. Since the stop is set as stop = as.numeric(as.vector(regions$pos) + as.vector(regions$qwidth) - 1) which is the last base of the first read. When setting the stop (for strand +) or the start (for strand -), the fragment/insert size (regions[plus,"isize"], which is a positive value for strand + and negative value for strand -) is added. The problem is the insert size includes the query size (regions$qwidth) of both paired reads. In that case, shouldn't this be done like this before being extended?
regions[plus, "stop"] = regions[plus, "stop"] + regions[plus,"isize"] - regions[plus,"qwidth"]
regions[!plus, "start"] = regions[!plus, "start"] + regions[!plus,"isize"] + regions[plus,"qwidth"]
However, in my test even after applying this change, it still complained with the same error message and I am guessing using the regions$strand to identify reversed mapping paired reads probably is not reliable. It is safer to compare regions$pos and regions$mpos to identify the reversed mapping paired reads as implemented in the part of code for handling BWA reads.
qwidth = regions[, "qwidth"]
regions = data.frame(chr = as.character(as.vector(regions$rname)),
start = as.numeric(as.vector(regions$pos)), stop = as.numeric(as.vector(regions$mpos)),
strand = as.character(as.vector(regions$strand)),
isize = as.numeric(as.vector(regions$isize)), stringsAsFactors = F)
regionsToRev = regions$start > regions$stop
tmp = regions[regionsToRev, ]$start
regions[regionsToRev, ]$start = regions[regionsToRev,]$stop
regions[regionsToRev, ]$stop = tmp
regions[, "stop"] = regions[, "stop"] + qwidth - 1 + extend
In this part of code, it swaps the start and stop for reversed mapping reads then calculate the true stop position by adding qwidth. This is assuming the query read size is same between paired reads. If this assumption is hot held then the stop position is wrong for forward mapping read pair. In addition, why only the stop position got extended? Doesn't start position need to be extended as well?
Comparing my bam files from BWA and BOWTIE, I really don't see the difference between them about reversed mapping reads in BAM/SAM data. If that's true, does it really need to handle BWA and non-BWA bam files differently? I would suggest to consolidate the code by removing the checking for bwa and merging codes.
Using one set of scanParam when index available: scanParam = ScanBamParam(flag = scanFlag, simpleCigar=TRUE, what = c("rname", "pos", "strand", "qwidth", "isize", "mpos"), which = sel)
And another set when index ont available: scanParam = ScanBamParam(flag = scanFlag, simpleCigar=TRUE, what = c("rname", "pos", "strand", "qwidth", "isize", "mpos"))
And use the absolute value if the fragment/insert size (isize) to get the stop position.
regions = data.frame(chr = as.character(as.vector(regions$rname)),
start = as.numeric(as.vector(regions$pos)), stop = as.numeric(as.vector(regions$mpos)),
strand = as.character(as.vector(regions$strand)),
isize = as.numeric(as.vector(regions$isize)), stringsAsFactors = F)
regionsToRev = regions$start > regions$stop
regions[regionsToRev, ]$start = regions[regionsToRev,]$stop
regions[, "stop"] = regions[, "start"] + abs(regions[, "isize"]) - 1
regions[, "start"] = regions[, "start"] - extend
regions[, "stop"] = regions[, "stop"] + extend
With these changes implemented in v1.22, I recompiled and installed the MEDIPS package and run on both BWA and BOWTIE bam files for QC analyses and it went through without any error.
Does this make sense to you? I appreciate your feedback if there's any problems in my 2 cents.
Thanks,
David
[Correction] I overlooked the bam file header information. It's bowtie2 not bowtie. Sorry about the confusion. Does anyone know if there's major output differences between bowtie and bowtie2 for paired-end reads? Thanks a lot.