Hi,
I am trying to run ChIPQC with my 6 samples, with no success so far.
The commands I ran was:
exp <- ChIPQC(metadata, annotation = "mm10")
It returns:
samp1 brain TF 1 bed
samp2 brain TF 2 bed
samp3 brain TF 1 bed
samp4 brain TF 2 bed
samp5 brain TF 1 bed
samp6 brain TF 2 bed
Checking chromosomes:
[1] "chr1"
Compiling annotation...
Computing metrics for 6 samples...
Error: 6 errors; first error:
Error in sampleQC(bamFile = reads, bedFile = peaks, GeneAnnotation = annotation, : Contigs of interest are not all in Bam file!!
For more information, use bplasterror(). To resume calculation, re-call the function and set the argument 'BPRESUME' to TRUE or wrap the previous call in bpresume().
First traceback:
30: ChIPQC(metadata, annotation = "mm10")
29: bplapply(samplelist, doChIPQCsample, experiment, chromosomes,
annotation, mapQCth, blacklist, profileWin, fragmentLength,
shifts)
28: bplapply(samplelist, doChIPQCsample, experiment, chromosomes,
annotation, mapQCth, blacklist, profileWin, fragmentLength,
shifts)
27: bplapply(X, FUN, ..., BPRESUME = BPRESUME, BPPARAM = BPPARAM)
26: bplapply(X, FUN, ..., BPRESUME = BPRESUME, BPPARAM = BPPARAM)
25: mclapply(X = X, FUN = FUN, ..., mc.set.seed = BPPARAM$setSeed,
mc.silent = !BPPARAM$verbose, mc.cores = bpworkers(BPPARAM),
mc.cl
metadata
is the summary description of the experimental design, as a data.frame:
SampleID Tissue Factor Replicate bamReads Peaks
samp1 brain TF 1 /path2/samp1.bam /path2/samp1.bed
samp2 brain TF 2 /path2/samp2.bam /path2/samp2.bed
samp3 brain TF 1 /path2/samp3.bam /path2/samp3.bed
samp4 brain TF 2 /path2/samp4.bam /path2/samp4.bed
samp5 brain TF 1 /path2/samp5.bam /path2/samp5.bed
samp6 brain TF 2 /path2/samp6.bam /path2/samp6.bed
The .bam files (mapping with bowtie2) are of course sorted with samtools sort
.
They have been also indexed with samtools index
.
The .bed files (peak calling with MACS) are correctly formatted (tab separated, 5-columns as chr, start, end, name, score) and look like this:
chr1 12000000 12000200 MACS_peak_1 115.92
My sessionInfo() is as following:
R version 3.1.3 (2015-03-09)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_CH.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_CH.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=de_CH.UTF-8 LC_NAME=de_CH.UTF-8 LC_ADDRESS=de_CH.UTF-8 LC_TELEPHONE=de_CH.UTF-8 LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=de_CH.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] TxDb.Mmusculus.UCSC.mm10.knownGene_3.0.0 GenomicFeatures_1.18.3 AnnotationDbi_1.28.1 Biobase_2.26.0 XLConnect_0.2-11
[6] XLConnectJars_0.2-9 ChIPQC_1.2.2 DiffBind_1.12.3 GenomicAlignments_1.2.2 Rsamtools_1.18.3
[11] Biostrings_2.34.1 XVector_0.6.0 limma_3.22.7 GenomicRanges_1.18.4 GenomeInfoDb_1.2.4
[16] IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1 ggplot2_1.0.0
loaded via a namespace (and not attached):
[1] amap_0.8-14 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.9 BiocParallel_1.0.3 biomaRt_2.22.0 bitops_1.0-6 brew_1.0-6 BSgenome_1.34.1 caTools_1.17.1
[11] checkmate_1.5.1 chipseq_1.16.0 codetools_0.2-10 colorspace_1.2-6 DBI_0.3.1 digest_0.6.8 edgeR_3.8.6 fail_1.2 foreach_1.4.2 gdata_2.13.3
[21] gplots_2.16.0 grid_3.1.3 gtable_0.1.2 gtools_3.4.1 hwriter_1.3.2 iterators_1.0.7 KernSmooth_2.23-14 lattice_0.20-30 latticeExtra_0.6-26 MASS_7.3-39
[31] munsell_0.4.2 Nozzle.R1_1.1-1 plyr_1.8.1 proto_0.3-10 RColorBrewer_1.1-2 Rcpp_0.11.5 RCurl_1.95-4.5 reshape2_1.4.1 rJava_0.9-6 RSQLite_1.0.0
[41] rtracklayer_1.26.2 scales_0.2.4 sendmailR_1.2-1 ShortRead_1.24.0 stringr_0.6.2 tools_3.1.3 XML_3.98-1.1 zlibbioc_1.12.0
Also, when I try to run ChIPQCsample with a single sample instead, it "seems" to work, in the way that there is no error message, the method runs till the end. But the result is quite strange:
samp <- ChIPQCsample(reads = samp1.bam, peaks = samp1.bed, annotation = "mm10")
> samp
ChIPQCsample
Number of Mapped reads: 30767939
Number of Mapped reads passing MapQ filter: 24496740
Percentage Of Reads as Non-Duplicates (NRF): 100(0)
Percentage Of Reads in Blacklisted Regions: NA
SSD: 4.27506515515619
Fragment Length Cross-Coverage: 0.000228443139299127
Relative Cross-Coverage: 0.497333607597129
Percentage Of Reads in GenomicFeature:
ProportionOfCounts
Peaks 0
LongPromoter20000to2000 0
Promoters2000to500 0
Promoters500 0
All5utrs 0
Alltranscripts 0
Allcds 0
Allintrons 0
All3utrs 0
Percentage Of Reads in Peaks: 0
Number of Peaks: 0
GRanges object with 0 ranges and 2 metadata columns:
seqnames ranges strand | Counts bedRangesSummitsTemp
<Rle> <IRanges> <Rle> | <integer> <numeric>
-------
seqinfo: 22 sequences from an unspecified genome; no seqlengths
Although there are no many peaks (this is normal/expected for that TF in those conditions, ~100-1000 peaks per sample/condition), there are some specific significant peaks. So how come ChIPQCsample returns null values for the different proportions counts, especially for Percentage Of Reads in Peaks (instead of something like 4e-04 etc)? Why does it give null for the number of Peaks (I know there are at least ~100 peaks) just by looking at the bed file returned by MACS?
Thanks a lot for your help in advance,
Kind regards,
H.
For what it's worth a Bioconductor way to look at the names of the 'chromosomes' in BAM files is
seqinfo()
provides the seqlengths as well; these functions work on lots of GRanges-related objects and (e.g., rtracklayer) file types that have sequence information.Ah, thanks Martin,
Always good to show the Bioconductor way
best,
tom