Questions for MEDIPS getPairedGRange function in MEDIPS.readRegionsFile.R
0
0
Entering edit mode
fkuo ▴ 10
@fkuo-10868
Last seen 7.2 years ago

Hi there,

Below is my biocValid() result.

* sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BiocInstaller_1.22.2

loaded via a namespace (and not attached):
[1] tools_3.3.0

* Out-of-date packages
           Package
Biostrings "Biostrings"
limma      "limma"
           LibPath                                                  Installed
Biostrings "/projects/sequence_analysis/vol3/tools/R-3.3.0/library" "2.40.1"
limma      "/projects/sequence_analysis/vol3/tools/R-3.3.0/library" "3.28.5"
           Built   ReposVer
Biostrings "3.3.0" "2.40.2"
limma      "3.3.0" "3.28.6"
           Repository
Biostrings "https://bioconductor.org/packages/3.3/bioc/src/contrib"
limma      "https://bioconductor.org/packages/3.3/bioc/src/contrib"

 

I am trying to run some MEDIPS QC analyses, for example MEDIPS.saturation, MEDIPS.seqCoverage, and MEDIPS.CpGenrich, on paired-end reads of MBD-seq experiments using bowtie and bwa aligners. I have MEDIPS v1.22 (supporting bwa paired-end bam) installed under R-3.3.0 as well as v1.16 installed under R-3.1.3 which is not supporting bwa paired-end bam file. When analyzing a BWA bam file with 48.5 million reads and 98% mapped on to hg19, MEDIPS.seqCoverage reports 99.97% of CpG's are covered >5X. This sounds too good to be true. I have another bam file from another MBD-Seq experiment for the same sample aligned by BOWTIE (42.8 million reads, 92% mapped). When try to analyze it by MEDIPS v1.22, below error message returned from saturation analysis:

...Creating GRange Object...
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
  solving row 4330: negative widths are not allowed
Calls: MEDIPS.saturation ... newGRanges -> IRanges -> solveUserSEW0 -> .Call2 -> .Call
Execution halted

However, the same bam file was processed by v1.16 without error and reported 18.37% of CpG's covered by >5X. I looked into the source code for handling non-bwa reads for creating regions in getPairedGRange function.

From v1.16:

    regions = data.frame(chr = as.character(as.vector(regions$rname)),
        start = as.numeric(as.vector(regions$pos)), stop = as.numeric(as.vector(regions$pos) +
            as.vector(regions$qwidth) - 1), strand = as.character(as.vector(regions$strand)),
        isize = as.numeric(as.vector(regions$isize)), stringsAsFactors = F)
    plus = regions$strand == "+"
    regions[plus, "stop"] = regions[plus, "stop"] + regions[plus,"isize"] + extend
    regions[!plus, "start"] = regions[!plus, "start"] + regions[!plus,"isize"] - extend
 

From v1.22:

        regions = data.frame(chr = as.character(as.vector(regions$rname)),
            start = as.numeric(as.vector(regions$pos)), stop = as.numeric(as.vector(regions$pos) +
                as.vector(regions$qwidth) - 1), strand = as.character(as.vector(regions$strand)),
            isize = as.numeric(as.vector(regions$isize)), stringsAsFactors = F)
        plus = regions$strand == "+"
        regions[plus, "stop"] = regions[plus, "start"] + regions[plus,"isize"] + extend
        regions[!plus, "start"] = regions[!plus, "stop"] + regions[!plus,"isize"] - extend

Not only there are differences between 2 versions but also, if I understand correctly, there seems to be an error in defining the coordinates of the region of sequenced DNA fragment in v1.16. Since the stop is set as stop = as.numeric(as.vector(regions$pos) + as.vector(regions$qwidth) - 1) which is the last base of the first read. When setting the stop (for strand +) or the start (for strand -), the fragment/insert size (regions[plus,"isize"], which is a positive value for strand + and negative value for strand -) is added. The problem is the insert size includes the query size (regions$qwidth) of both paired reads. In that case, shouldn't this be done like this before being extended?

regions[plus, "stop"] = regions[plus, "stop"] + regions[plus,"isize"] - regions[plus,"qwidth"]

regions[!plus, "start"] = regions[!plus, "start"] + regions[!plus,"isize"] + regions[plus,"qwidth"]

However, in my test even after applying this change, it still complained with the same error message and I am guessing using the regions$strand to identify reversed mapping paired reads probably is not reliable. It is safer to compare regions$pos and regions$mpos to identify the reversed mapping paired reads as implemented in the part of code for handling BWA reads.

        qwidth = regions[, "qwidth"]
        regions = data.frame(chr = as.character(as.vector(regions$rname)),
            start = as.numeric(as.vector(regions$pos)), stop = as.numeric(as.vector(regions$mpos)),
            strand = as.character(as.vector(regions$strand)),
            isize = as.numeric(as.vector(regions$isize)), stringsAsFactors = F)
        regionsToRev = regions$start > regions$stop
        tmp = regions[regionsToRev, ]$start
        regions[regionsToRev, ]$start = regions[regionsToRev,]$stop
        regions[regionsToRev, ]$stop = tmp
        regions[, "stop"] = regions[, "stop"] + qwidth - 1 + extend

In this part of code, it swaps the start and stop for reversed mapping reads then calculate the true stop position by adding qwidth. This is assuming the query read size is same between paired reads. If this assumption is hot held then the stop position is wrong for forward mapping read pair. In addition, why only the stop position got extended? Doesn't start position need to be extended as well?

Comparing my bam files from BWA and BOWTIE, I really don't see the difference between them about reversed mapping reads in BAM/SAM data. If that's true, does it really need to handle BWA and non-BWA bam files differently? I would suggest to consolidate the code by removing the checking for bwa and merging codes.

Using one set of scanParam when index available: scanParam = ScanBamParam(flag = scanFlag, simpleCigar=TRUE, what = c("rname", "pos", "strand", "qwidth", "isize", "mpos"), which = sel)

And another set when index ont available: scanParam = ScanBamParam(flag = scanFlag, simpleCigar=TRUE, what = c("rname", "pos", "strand", "qwidth", "isize", "mpos"))

And use the absolute value if the fragment/insert size (isize) to get the stop position.

    regions = data.frame(chr = as.character(as.vector(regions$rname)),
        start = as.numeric(as.vector(regions$pos)), stop = as.numeric(as.vector(regions$mpos)),
        strand = as.character(as.vector(regions$strand)),
        isize = as.numeric(as.vector(regions$isize)), stringsAsFactors = F)
    
    regionsToRev = regions$start > regions$stop
    regions[regionsToRev, ]$start = regions[regionsToRev,]$stop
    regions[, "stop"] = regions[, "start"] + abs(regions[, "isize"]) - 1            
    regions[, "start"] = regions[, "start"] - extend
    regions[, "stop"] = regions[, "stop"] + extend

With these changes implemented in v1.22, I recompiled and installed the MEDIPS package and run on both BWA and BOWTIE bam files for QC analyses and it went through without any error.

Does this make sense to you? I appreciate your feedback if there's any problems in my 2 cents.

Thanks,

David        

 

 

medips bam file • 1.4k views
ADD COMMENT
0
Entering edit mode

[Correction] I overlooked the bam file header information. It's bowtie2 not bowtie. Sorry about the confusion. Does anyone know if there's major output differences between bowtie and bowtie2 for paired-end reads? Thanks a lot.

ADD REPLY

Login before adding your answer.

Traffic: 591 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6