Question

Memory issues in summarizeOverlaps function

Diana ▴ 10

@diana-19465

Last seen 6.0 years ago

Hi all,

I get a memory error ('error: cannot allocate vector of size 344.5 Mb') when running summarizeOverlaps from the GenomicAlignments package. I have 4 GB of RAM (with about 3.8 GB free) and I use 64-bit R. I also increased memory.limit to 3500 and tried --vanilla as well. Nothing seems to work. Do you have any ideas? Thanks a lot!

memory summarizeOverlaps • 1.7k views

ADD COMMENT • link updated 6.0 years ago by James W. MacDonald 67k • written 6.0 years ago by Diana ▴ 10

James W. MacDonald · Answer 1 · 2019-02-07

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 8 hours ago

United States

Assuming you are reading in data from BAM files, you should try reading the data in chunks. See ?BamFile, particularly the yieldSize argument, and the examples which show how it's used.
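
As a rough illustration of chunked reading, here is a minimal sketch that uses the small ex1.bam example file shipped with Rsamtools in place of a real data file. The chunk size of 500 is an arbitrary choice for this tiny file, not a recommendation.

```r
library(Rsamtools)
library(GenomicAlignments)

# Example BAM file bundled with Rsamtools (stand-in for a real data file)
fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# yieldSize caps the number of records returned by each read from the open file
bf <- BamFile(fl, yieldSize = 500)

open(bf)
total <- 0L
repeat {
  chunk <- readGAlignments(bf)      # next chunk of at most 500 alignments
  if (length(chunk) == 0L) break    # an empty chunk means the file is exhausted
  total <- total + length(chunk)    # only one chunk is held in memory at a time
}
close(bf)

total  # number of alignments seen across all chunks
```

Only the current chunk needs to fit in memory, so peak usage should depend on yieldSize rather than on the size of the whole file.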

ADD COMMENT • link 6.0 years ago James W. MacDonald 67k

Hi James,

Thanks for your answer! Yes, I am reading BAM files. I know the yieldSize argument, but the file itself is only about 500 MB, so isn't the memory error a bit strange? What could be an explanation besides too little RAM (which is not the case)?

ADD REPLY • link 6.0 years ago Diana ▴ 10

You say you are reading BAM files, but then you say 'the file itself', so it's not clear if you are reading in one or more files. Anyway, having a computer with 4 GB RAM doesn't mean you actually have that much RAM to allocate to R. It may be much less, depending on what else you have running. And reading in a 500 MB file will probably take more RAM than you would expect, given underlying copies that may be created. And if you are on Windows, which sometimes has problems releasing memory, that might be exacerbated.
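
One way to see how file size and memory use can differ is to compare the two directly. This is only a sketch on the small ex1.bam file bundled with Rsamtools; the numbers for a real 500 MB file will differ, and BAM files are compressed on disk, so the on-disk size is not a reliable guide to the in-memory footprint.

```r
library(Rsamtools)
library(GenomicAlignments)

# Example BAM file bundled with Rsamtools (stand-in for a real data file)
fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")

gc(reset = TRUE)                   # reset the "max used" memory counters

aln <- readGAlignments(fl)         # load every alignment at once

on_disk   <- file.size(fl)         # compressed size on disk, in bytes
in_memory <- object.size(aln)      # size of the resulting R object

print(on_disk)
print(in_memory, units = "Kb")

gc()                               # "max used" columns show the peak since the reset
```

The peak reported by gc() can be noticeably higher than the size of the final object, because intermediate copies are created while the data are read.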

I wouldn't use a Windows box with 4 GB RAM even for really basic stuff (16 GB RAM is about the lowest I would go, even for casual use), so it's not surprising to me at all that you would run out of RAM trying to do something real.

You say that you 'know the yieldSize argument'. Does that mean you are using it, or just that you know it exists?

ADD REPLY • link 6.0 years ago James W. MacDonald 67k

Sorry, currently I am reading in one file. As for the yieldSize argument, I know it exists. I haven't tried it yet, as I assumed it would take a looong time to read the whole file in separate chunks. I will try it with a yieldSize of 2000000 to start with.

ADD REPLY • link 6.0 years ago Diana ▴ 10

> bfl <- BamFileList("../../data/star_aligned/303360Aligned.sortedByCoord.out.bam")
> system.time(summarizeOverlaps(ensex, bfl))
   user  system elapsed 
233.280  34.764 268.371 

> bfl <- BamFileList("../../data/star_aligned/303360Aligned.sortedByCoord.out.bam", yieldSize = 2e5)
> system.time(summarizeOverlaps(ensex, bfl))
   user  system elapsed 
222.436   3.960 226.655

ADD REPLY • link 6.0 years ago James W. MacDonald 67k

Hi, I still have one question about the reduceByYield function. I have the following code:

> csvfile <- file.path("W29-1-1.csv")
> sampleTable <- read.csv(csvfile,row.names=1)
> sampleTable
       File
1 W29-1-1-B
2 W29-1-1-F
> setwd("C:/Program Files/BAM files")
> filename <- file.path(paste0(sampleTable$File, "_aligned_genome_anonymized.sorted29.bam"))
> file.exists(filename)
[1] TRUE TRUE
> library("Rsamtools")
> library(GenomicFiles)
> library(GenomicFeatures)
> library(GenomicRanges)
> library("GenomicAlignments")
> library("BiocParallel")
> library("Rsamtools")
> bamfiles <- BamFileList(filename, yieldSize=2000000)
> x <- bamfiles
> YIELD <- readGAlignments
> reduceByYield(x, YIELD, MAP=identity, REDUCE='+', parallel=FALSE)

However, I get the following error:

> Error in (function (classes, fdef, mtable)  :    unable to find an
> inherited method for function ‘readGAlignments’ for signature
> ‘"BamFileList"’

My next steps are counting reads with summarizeOverlaps and performing a differential expression analysis with edgeR. This works fine with my current yieldSize of 2000000, but I want to perform these analyses on complete BAM files. Do you know how I can make reduceByYield work?

ADD REPLY • link updated 6.0 years ago by James W. MacDonald 67k • written 6.0 years ago by Diana ▴ 10

Why are you doing that? Simply passing a BamFileList to summarizeOverlaps where you have specified the yieldSize for the BamFileList will cause the data to be read in chunks.
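
For completeness, the error reported for reduceByYield arises because readGAlignments has no method for a BamFileList; reduceByYield works on one file at a time. If it is still wanted for some other purpose, a minimal sketch on a single BamFile might look like the following. It uses the ex1.bam example file from Rsamtools, and the YIELD and MAP functions here are illustrative choices that simply count alignments per chunk.

```r
library(Rsamtools)
library(GenomicAlignments)
library(GenomicFiles)

# Example BAM file bundled with Rsamtools (stand-in for a real data file)
fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# A single BamFile, not a BamFileList
bf <- BamFile(fl, yieldSize = 500)

YIELD  <- function(x, ...) readGAlignments(x)   # read the next chunk
MAP    <- function(value, ...) length(value)    # summarize the chunk: count alignments
REDUCE <- `+`                                   # combine the per-chunk summaries

n_alignments <- reduceByYield(bf, YIELD, MAP, REDUCE)
n_alignments
```

To process several files this way, the same call could be applied to each file in turn, for example with lapply over the file paths.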

ADD REPLY • link 6.0 years ago James W. MacDonald 67k

Really? So simply running se will actually count all reads? That would be great... But how is it possible that tail(assay(se)) gives 9997 as last row and rowRanges(se) gives an object of length 25892? I am sorry for asking these probably basic questions...

ADD REPLY • link 6.0 years ago Diana ▴ 10

I think you might be confused. The row names for a SummarizedExperiment are the underlying IDs (which in your case might be Entrez Gene IDs?), not row numbers. The yieldSize argument simply sets the chunk size for the data being read in, not the total amount of data to read in:

> bams <- c("303301Aligned.sortedByCoord.out.bam","303362Aligned.sortedByCoord.out.bam")
> bfl <- BamFileList(bams)
> se_all <- summarizeOverlaps(ensex, bfl)
> bfl <- BamFileList(bams, yieldSize = 2e5)
> se_by_yield <- summarizeOverlaps(ensex, bfl)
> se_all
class: RangedSummarizedExperiment 
dim: 225589 2 
metadata(0):
assays(1): counts
rownames(225589): ENSSSCG00000000002 ENSSSCG00000000002 ...
  ENSSSCG00000040989 ENSSSCG00000040989
rowData names(0):
colnames(2): 303301Aligned.sortedByCoord.out.bam
  303362Aligned.sortedByCoord.out.bam
colData names(0):
> se_by_yield
class: RangedSummarizedExperiment 
dim: 225589 2 
metadata(0):
assays(1): counts
rownames(225589): ENSSSCG00000000002 ENSSSCG00000000002 ...
  ENSSSCG00000040989 ENSSSCG00000040989
rowData names(0):
colnames(2): 303301Aligned.sortedByCoord.out.bam
  303362Aligned.sortedByCoord.out.bam
colData names(0):

Please note that the dims of both SummarizedExperiments are identical, and that the rownames are (in this case) Ensembl Gene IDs.
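
The same comparison can be reproduced on a small scale. This sketch uses the ex1.bam example file from Rsamtools and three made-up features whose names ("101", "9997", "42") are hypothetical IDs chosen to show that row names are identifiers, not row numbers.

```r
library(Rsamtools)
library(GenomicAlignments)

# Example BAM file bundled with Rsamtools (stand-in for a real data file)
fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# Hypothetical features with deliberately non-sequential ID-like names
features <- GRanges(
  seqnames = c("seq1", "seq1", "seq2"),
  ranges   = IRanges(start = c(1, 801, 1), end = c(800, 1500, 1500))
)
names(features) <- c("101", "9997", "42")

# Count once reading everything at once, once in chunks of 500 records
se_all      <- summarizeOverlaps(features, BamFileList(fl))
se_by_yield <- summarizeOverlaps(features, BamFileList(fl, yieldSize = 500))

# Same number of rows either way; row names are the feature IDs
stopifnot(identical(dim(se_all), dim(se_by_yield)))
stopifnot(identical(rownames(se_all), names(features)))

# Compare the count matrices from the two runs
all.equal(assay(se_all), assay(se_by_yield), check.attributes = FALSE)

tail(assay(se_by_yield))   # the last row is labelled by its ID, not by its position
nrow(se_by_yield)          # number of features, independent of yieldSize
```

A large ID on the last row of tail(assay(se)) therefore says nothing about how many rows or reads were processed; nrow(se) and colSums(assay(se)) are the values to check.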

ADD REPLY • link 6.0 years ago James W. MacDonald 67k