I am trying to normalize ChIP-seq data using an exogenous spike in control with csaw and get the following error:
>filtered.data <- normOffsets(filtered.data.spike, se.out=filtered.data)
Error in .local(object, ...) :
library sizes of 'se.out' and 'object' are not identical
library(csaw)
library(edgeR)
library(rtracklayer)
library(GenomicRanges)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
param <- readParam(minq=20, restrict=paste0("chr", c(1:19, "X", "Y")))
#Load mouse bams (endogenous bamfiles)
dir <- "../bam/mm10"
samples <- read.table(file.path(dir, "bamfiles.txt"), header = TRUE)
colnames(design) <- c("intercept", "Treatment")
treatment = samples$Treatment
bam.files <- file.path(dir,samples$BAM)
data.frame(BAM=bam.files, treatment=treatment)
#Load human bams (exogenous bamfiles)
dir.spike <- "..bam/hg38"
samples.spike <- read.table(file.path(dir.spike, "spikebamfiles.txt"), header = TRUE)
bam.files.spike <- file.path(dir.spike,samples.spike$BAM)
#Counting mouse reads into windows
win.data <- windowCounts(bam.files, param=param, width=150, ext=200)
#filter by abundance | windows with abundances 5x higher than background.
bins <- windowCounts(bam.files, bin=TRUE, width=2000, param=param)
filter.stat <- filterWindows(win.data, bins, type="global")
min.fc <- 5
keep <- filter.stat$filter > log2(min.fc)
summary(keep)
#implement filter
filtered.data <- win.data[keep,]
#Counting spiked-in reads into windows
win.data.spike <- windowCounts(bam.files.spike, param=param, width=150, ext=200)
#filter by abundance | windows with abundances 5x higher than background.
bins.spike <- windowCounts(bam.files.spike, bin=TRUE, width=2000, param=param)
filter.stat.spike <- filterWindows(win.data.spike, bins.spike, type="global")
min.fc <- 5
keep.spike <- filter.stat.spike$filter > log2(min.fc)
#filter
filtered.data.spike <- win.data.spike[keep.spike,]
#Normalize by spike-in
filtered.data <- normOffsets(filtered.data.spike, se.out=filtered.data)
filtered.data$norm.factors
Error in .local(object, ...) :
library sizes of 'se.out' and 'object' are not identical
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin17.5.0 (64-bit)
Running under: macOS High Sierra 10.13.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0 GenomicFeatures_1.32.0
[3] AnnotationDbi_1.42.1 rtracklayer_1.40.2
[5] edgeR_3.22.1 limma_3.36.1
[7] csaw_1.14.0 SummarizedExperiment_1.10.1
[9] DelayedArray_0.6.0 BiocParallel_1.14.1
[11] matrixStats_0.53.1 Biobase_2.40.0
[13] GenomicRanges_1.32.3 GenomeInfoDb_1.16.0
[15] IRanges_2.14.10 S4Vectors_0.18.2
[17] BiocGenerics_0.26.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 compiler_3.5.0 XVector_0.20.0 prettyunits_1.0.2
[5] bitops_1.0-6 tools_3.5.0 zlibbioc_1.26.0 progress_1.1.2
[9] biomaRt_2.36.0 digest_0.6.15 bit_1.1-13 RSQLite_2.1.1
[13] memoise_1.1.0 lattice_0.20-35 pkgconfig_2.0.1 Matrix_1.2-14
[17] DBI_1.0.0 GenomeInfoDbData_1.1.0 httr_1.3.1 stringr_1.3.1
[21] Biostrings_2.48.0 locfit_1.5-9.1 bit64_0.9-7 grid_3.5.0
[25] R6_2.2.2 XML_3.98-1.11 magrittr_1.5 Rhtslib_1.12.0
[29] blob_1.1.1 GenomicAlignments_1.16.0 Rsamtools_1.32.0 assertthat_0.2.0
[33] stringi_1.2.2 RCurl_1.95-4.10
Thank you for your help!
Wouldn't aligning to a combined human/mouse reference (presumably by changing chrom names, i.e. mchr1 v hchr1) suffer from the same putative double mapping issue? Alternatively, one could throw out reads that map to both genomes, but this seems like a very bad idea because these reads will likely map to regions of the genome we're interested in. Do you have a suggested workflow for this?
FWIW: I checked cross-species mapping with Bowtie2 (--very-sensitive) using simulated libraries consisting of all possible 75mers and got <1% human vs. mouse mapping (and vice versa).
For now I'll take your advice and manually set totals in both objects to the sum across objects.
The expectation would be that the aligner would ignore reads that are not uniquely mapped. I believe this is true of most aligners nowadays, where non-uniquely mapped reads are either discarded automatically or thresholded based on MAPQ. This would protect you from reads from homologous regions of the mouse/human genome.
So yes, effectively I would throw out reads that map equally well to both genomes. This is conservative but hardly a "very bad idea", as the opposite situation would be disastrous. Imagine if you wrote up a manuscript where your major findings were driven by cross-mapping between genomes! Highly homologous regions in multi-genome experiments are equivalent to repeats, and we all know how problematic those are for short read sequencing.
If you have important things happening in homologous regions, the only reliable approach to retain information from these regions is to choose a spike-in genome that is less related. Drosophila and yeast seem to be popular choices, though I have not touched much ChIP-seq spike-in data - I've been told it's a bit underwhelming.
Regarding the cross-mapping results; simulated libraries are all well and good, but remember that the important parts of the genome are evolutionarily conserved and thus more likely to suffer from cross-mapping. If you're trying to assay something functional (and who isn't?), the problem is likely to be worse than you might predict from the simulations.
Do you have a suggested workflow for aligning to combined genomes?
I agree with the points that you raised here, thanks for your input! The primary advantage of spike-in normalization is to draw empirical quantitative normalization between control and treated samples, especially in cases where there are global changes in abundance/occupancy. Additionally, methodologies with very low background like CAR-seq require spike-in for proper normalization. It's unclear to me why this method of normalization would be 'underwhelming', care to explain?
In such experiments where one is interested in comparing differences between two samples, it's hard to imagine a case where an entire manuscript would be "driven by cross-mapping between genomes", since the reference is the same in both. Furthermore, any careful researcher would make sure that enriched sites are called correctly using non-mixed samples. Of course using less related genomes would be optimal but many antibodies do not work in both Drosophila and mouse/human, so in most cases this is not possible.
Regarding the suggested workflow: build a combined reference with all relevant genomes and align to that. I do that all the time for single-cell RNA-seq data, e.g., with ERCC spike-ins. You might have to change the names of some sequences in the FASTA file to get this to work properly. Some aligners will also refuse to handle genomes above a certain size - the solution would be to use a less restrictive aligner.
Regarding spike-in normalization: yes, I'm familiar with the how and why. Again, I have not dealt with much of it myself, but in at least some of those datasets I have handled, there has not been enough coverage of the spike-ins for them to be useful. You might say that this is not a problem with the overall spike-in strategy but rather is the fault of the experimentalists for not adding the right quantity of spike-in chromatin, and I would agree. Nonetheless, they weren't going to repeat the experiment for that reason alone, so I never used the spike-ins for any analysis of that data. I have also heard anecdotal stories regarding cases where the spike-in IP efficiencies differ from that of the endogenous genomes, though having not been shown solid evidence for this, I've taken it with a grain of salt.
Regarding DB analyses; well, who knows? At best, cross-mapping from your spike-in to the endogenous genome should would result in some spurious non-DB peaks that cause no harm. If you cross-map to an existing peak in the endogenous genome, you will shrink the log-fold change for that peak towards zero, which is not ideal but generally harmless (unless your major finding involves something about the lack of differences in binding between conditions). At worst, though, cross-mapping from your endogenous genome to your spike-ins can affect your normalization; and at that point, all bets are off regarding the final effects on your DB analysis.
And of course, it's easy to tell people to be careful. But repeating the entire sequencing experiment with non-mixed samples is pretty expensive, especially as you have to redo it right at the start of the protocol, i.e., before IP. We have to do the best we can with what we get.