Hi,
I just recently tried using diffHic and everything seemed to be working until I tried loading the bamfile into the hdf5 format. None of the reads were mapped. see below:
ML275 <- preparePairs("275s.bam", hs.param, file="275.h5", dedup=TRUE)
> ML275
$pairs
total marked filtered mapped
81109522 63459873 28886930 0
$same.id
dangling self.circle
0 0
$singles
[1] 0
I saw a similar post about this and tried the following code thinking that it might be a chromosome name issue, but the chromosome names are the same in both the hs.frag object and the bam file. Thus, I am out of ideas as to why none of the reads have mapped. I'd appreciate any advice you could give. Thanks.
> names(Rsamtools::scanBamHeader("275s.bam")[[1]]$targets)
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chrM"
> as.character(runValue(seqnames(hs.frag)))
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chrM"
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] BSgenome.yeast.mine.saccer3_1.0 BSgenome_1.36.3
[3] rtracklayer_1.28.10 Biostrings_2.36.4
[5] XVector_0.8.0 diffHic_1.0.1
[7] GenomicRanges_1.20.8 GenomeInfoDb_1.4.3
[9] IRanges_2.2.7 S4Vectors_0.6.6
[11] BiocGenerics_0.14.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.30.1 edgeR_3.10.3 zlibbioc_1.14.0
[4] GenomicAlignments_1.4.1 BiocParallel_1.2.21 lattice_0.20-33
[7] tools_3.2.2 grid_3.2.2 Biobase_2.28.0
[10] rhdf5_2.12.0 csaw_1.2.1 DBI_0.3.1
[13] lambda.r_1.1.7 futile.logger_1.4.1 futile.options_1.0.0
[16] bitops_1.0-6 biomaRt_2.24.1 RCurl_1.95-4.7
[19] RSQLite_1.0.0 limma_3.24.15 GenomicFeatures_1.20.5
[22] Rsamtools_1.20.4 XML_3.98-1.3 locfit_1.5-9.1
They use a iterative approach which performs hard clipping, but I don't recall whether they report these hard clips in the CIGAR strings. You should be able to check this in your BAM file; however, even if it is present, it shouldn't have been a problem, as any hard clipping should occur from the 3' end and be ignored by
preparePairs
. The only remaining explanation is a lot of reads have the unmapped flag set. What does thecountBam
command above return?Edit: Another (somewhat bizarre) possibility is that the BAM file is sorted such that several alignments for one read (i.e., at different truncation lengths in the iterative mapping process) are located together in the order of the BAM file, but are not located alongside the alignments for the mate. This would be considered as a pair with an unmapped read and ignored, but it would not increment the
singles
count due to the presence of multiple alignments.You could be correct, but I am fairly certain that once a read is mapped, it is no longer included in the iterative mapping process. I should note that the bam file I was using for the diffHic input was made using samtools merge of several smaller bam files containing fragments mapped at different truncation lengths, so one pair read may not be mapped at the same length as its partner read. Whether this would affect anything, I don't know. I am not sure what portion of the bam file may be excluded by their pipeline that your pipeline might be parsing, so its probably easiest if I just use your mapping software.
Here was the output of the countBam function:
> test
space start end width file records nucleotides
1 NA NA NA NA test.bam 865367 865367
To be precise, once a read is mapped uniquely, it will not be extended and remapped. Also, it doesn't matter to
preparePairs
if the two reads in the pair are mapped at different lengths.It occurs to me that the
hiclib
pipeline does not automatically synchronize mate information, so thecountBam
function isn't actually returning the number of reads with a mapped mate (otherwise the number of records should be even). My guess is that there are very low numbers of mapped and non-duplicate reads, and that all their mates are unmapped. As such, they get thrown out bypreparePairs,
as they can't be used to identify interactions.If that's the case, then switching to the
diffHic
mapping pipeline won't help much. You might as well give it a go, but keep in mind that the problem might be more fundamental, i.e., at the read mapping step.