I have been trying to use rtracklayer and GenomicRanges to find the overlaps between BED files that document different ChIP-seq data, e.g. nucleosome coverage, CTCF coverage, etc.
I then want a CSV file that documents the overlapping regions and their occupancy levels, e.g.:
"","seqnames","start","end","width","strand","ctcf","H3k27ac","H3k36me3","H3k4me1","H3k79me2"
"1","chr21",9419882,9419981,100,"+", 1.04, 3.32, 1, 0.16, 6.76
"2","chr21",9424072,9424171,100,"+", 1.84, 2, 4.64, 4, 4.52
"3","chr21",9426516,9426560,45,"+", 1.44, 2.92, 1, 1.16, 2.4
The script, based on a previous post (https://support.bioconductor.org/p/89510/#89521), goes as follows:
library(bigmemory)  # note: not actually used anywhere below
library(rtracklayer)
library(GenomicRanges)
Nuc_hESC<-import('Nucleosome_coverage.bed') # file too big for processing
ctcfe<-import('CtcfStdSig.bed')
H3k27ace<-import('H3k27acStdSig.bed')
H3k36me3e<-import('H3k36me3StdSig.bed')
H3k4me1e<-import('H3k4me1StdSig.bed')
H3k79me2e<-import('H3k79me2StdSig.bed')
dj <- disjoin(c(Nuc_hESC, ctcfe, H3k27ace, H3k36me3e, H3k4me1e, H3k79me2e))
# keep only the disjoint pieces covered by every track
djhESC <- dj[(dj %over% Nuc_hESC) & (dj %over% ctcfe) & (dj %over% H3k27ace) &
             (dj %over% H3k36me3e) & (dj %over% H3k4me1e) & (dj %over% H3k79me2e)]
mcols(djhESC) <- DataFrame(Nuc=NA, ctcf=NA, H3k27ac=NA, H3k36me3=NA, H3k4me1=NA, H3k79me2=NA)
annotate <- function(dj, gr, column) {
  # copy the 'score' of each interval in 'gr' into the given metadata
  # column of 'dj', for every range of 'dj' that it overlaps
  hits <- findOverlaps(dj, gr)
  mcols(dj)[queryHits(hits), column] <- mcols(gr)[subjectHits(hits), "score"]
  dj
}
dj_annotatedhESC<-annotate(djhESC, Nuc_hESC, "Nuc")
Error: cannot allocate vector of size 5.8 Gb  # debilitating error
dj_annotatedhESC<-annotate(dj_annotatedhESC, ctcfe, "ctcf")
dj_annotatedhESC<-annotate(dj_annotatedhESC, H3k27ace, "H3k27ac")
dj_annotatedhESC<-annotate(dj_annotatedhESC, H3k36me3e, "H3k36me3")
dj_annotatedhESC<-annotate(dj_annotatedhESC, H3k4me1e, "H3k4me1")
dj_annotatedhESC<-annotate(dj_annotatedhESC, H3k79me2e, "H3k79me2")
write.csv(as.data.frame(dj_annotatedhESC), "occup_embryo.csv")
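As an aside, I realise findOverlaps() can return more than one hit per disjoint range, in which case the assignment inside annotate() just keeps whichever hit comes last. If averaging the overlapping scores is actually wanted, I imagine a variant along these lines would do it (untested, and annotate_mean is my own name):

annotate_mean <- function(dj, gr, column) {
  hits <- findOverlaps(dj, gr)
  # average the scores of all intervals in 'gr' hitting each range of 'dj'
  avg <- tapply(mcols(gr)[subjectHits(hits), "score"], queryHits(hits), mean)
  mcols(dj)[as.integer(names(avg)), column] <- as.numeric(avg)
  dj
}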
The allocation error above, in the first annotate() call, is due to the sheer size of the nucleosome coverage object. I tried to get around it with the command memory.limit(.....), requesting different amounts of memory, but this does not work.
I did come up with a temporary workaround: shrinking the nucleosome coverage file by keeping only the best-scoring ChIP-seq reads (roughly as in the first sketch below). But is there a proper way around this, i.e. can I somehow request more memory for R, bearing in mind that I am working on a Linux cluster?
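For reference, the stopgap filtering I mention looks roughly like this (the 0.9 quantile is an arbitrary cutoff, purely for illustration):

# keep only the top 10% of nucleosome intervals by score (arbitrary cutoff)
Nuc_hESC_top <- Nuc_hESC[score(Nuc_hESC) >= quantile(score(Nuc_hESC), 0.9)]

Alternatively, would annotating one chromosome at a time, to keep each Hits object small, be a sensible workaround? Something like this untested sketch (annotate_by_chrom is my own name, not an existing function):

annotate_by_chrom <- function(dj, gr, column) {
  for (chr in seqlevels(dj)) {
    idx <- which(seqnames(dj) == chr)   # positions of this chromosome within dj
    gr_chr <- gr[seqnames(gr) == chr]   # restrict the big track to one chromosome
    hits <- findOverlaps(dj[idx], gr_chr)
    mcols(dj)[idx[queryHits(hits)], column] <-
      mcols(gr_chr)[subjectHits(hits), "score"]
  }
  dj
}

Or is the only real fix to ask the cluster's job scheduler for more memory when submitting the R job?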