SummarizeOverlaps in R
3
0
Entering edit mode
Lillyhchen ▴ 10
@lillyhchen-8385
Last seen 9.5 years ago
United States

Hi, 
I'm fairly new to R and bioinformatics and I'm working on a project with DESeq2 in R studio. I've been trying to run the "summarizeOverlaps" function on 12 bam files. However, I let it run over the weekend and the process still hasn't finished running. I don't think it's frozen, but i'm not sure if I should try and restart it again. Does anyone know of a faster way to work around SummarizeOverlap?

Thanks,
Lilly

summarizeoverlaps Bam rstudio • 6.9k views
ADD COMMENT
0
Entering edit mode

You will have to give more information than that. How big are the bam files? What are you summarizing overlaps against? What is the output of sessionInfo? How much RAM does your computer have?

ADD REPLY
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.9 years ago
United States

Hi Lilly,

As Jim says, we need more information in order to help you. Please show the code of how you are calling summarizeOverlaps(). There are several options for managing memory when counting bam files, 'yieldSize' in BamFile, ScanBamParam for reading in subsets, etc. You may want to look at the Counting reads with summarizeOverlaps vignette, specifically section 2.

Valerie

ADD COMMENT
0
Entering edit mode
Lillyhchen ▴ 10
@lillyhchen-8385
Last seen 9.5 years ago
United States

Hi, 

SorrI'm using RStudio 3.2.1 and I'm trying to run 12 BAM files that are each about 100,000 kb in size against a UCSC txdb GRangesList.  The computer I'm using is windows8 with 16gb of RAM and a 3.4 ghz processor. When I terminate the summarizeOverlap function I get this warning message:

"Warning message: running command 'env MASTER=localhost PORT=11880 OUT=/dev/null RPROG=C:/PROGRA~1/R/R-32~1.1/bin/R R_LIBS= C:/Users/lchen/Documents/R/win-library/3.2/BiocParallel/RSOCKnode.sh' had status 127"

I'm wasn't sure what status 127 meant, but I checked the folder and found that the file was still there, so i'm not sure why I keep getting this error.

 

Lilly

ADD COMMENT
0
Entering edit mode

You did not show your code.

  • Please show your code so we can see how you are calling the function.
  • Provide the output of sessionInfo() so we can see the versions of R and packages you are using.
  • Did you read the Counting reads with summarizeOverlaps vignette, section 2? The last paragraph starting with "By default ..." talks about iterating through files in chunks and controlling the number of cores used.

Start with one file. When you have success with that move on to processing more. summarizeOverlaps() iterates through files in chunks defined by 'yieldSize' in a BamFile object and processes files in parallel using bplapply().

Use BamFile and specificy a 'yieldSize':

bf = BamFile(myfile, yieldSize = 100000)

Define a BiocParallelParam with one worker:

param = SnowParam(workers = 1)

Call summarizeOverlaps:

summarizeOverlaps(features = yourGRrangesList, reads = bf, BPPARAM = param)

Valerie

ADD REPLY
0
Entering edit mode

My session info is

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] rtracklayer_1.28.6        DESeq2_1.8.1             
 [3] RcppArmadillo_0.5.200.1.0 Rcpp_0.11.6              
 [5] GenomicAlignments_1.4.1   Rsamtools_1.20.4         
 [7] Biostrings_2.36.1         XVector_0.8.0            
 [9] GenomicFeatures_1.20.1    AnnotationDbi_1.30.1     
[11] Biobase_2.28.0            GenomicRanges_1.20.5     
[13] GenomeInfoDb_1.4.1        IRanges_2.2.5            
[15] S4Vectors_0.6.1           BiocGenerics_0.14.0      
[17] BiocInstaller_1.18.3      BiocParallel_1.2.9       
[19] biomaRt_2.24.0           

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-2   futile.logger_1.4.1  plyr_1.8.3          
 [4] bitops_1.0-6         futile.options_1.0.0 tools_3.2.1         
 [7] zlibbioc_1.14.0      rpart_4.1-10         digest_0.6.8        
[10] annotate_1.46.1      lattice_0.20-31      RSQLite_1.0.0       
[13] gtable_0.1.2         DBI_0.3.1            proto_0.3-10        
[16] gridExtra_0.9.1      genefilter_1.50.0    cluster_2.0.2       
[19] stringr_1.0.0        locfit_1.5-9.1       nnet_7.3-10         
[22] grid_3.2.1           snow_0.3-13          survival_2.38-3     
[25] XML_3.98-1.3         foreign_0.8-65       latticeExtra_0.6-26 
[28] Formula_1.2-1        geneplotter_1.46.0   ggplot2_1.0.1       
[31] reshape2_1.4.1       lambda.r_1.1.7       magrittr_1.5        
[34] splines_3.2.1        Hmisc_3.16-0         MASS_7.3-40         
[37] scales_0.2.5         xtable_1.7-4         colorspace_1.2-6    
[40] stringi_0.5-5        acepack_1.3-3.3      RCurl_1.95-4.7      
[43] munsell_0.4.2       

This is the code I am trying to run to be used for DESeq 

sample=read.csv("sample.tmz.csv", header=T)
path=paste(getwd(),"/",sample$sample,".sorted.rmdup.bam",sep="")
list=BamFileList(path, index=character())
Txdb=makeTranscriptDbFromUCSC(
  genome="hg19",
  tablename="ensGene",
  transcript_ids=NULL,
  circ_seqs=DEFAULT_CIRC_SEQS,
  url="http://genome.ucsc.edu/cgi-bin/",
  goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath",
  miRBaseBuild=NA)
exons<-exonsBy(Txdb, by="gene")
se <- summarizeOverlaps(features=exons, reads=list,
                        mode="Union",
                        singleEnd=FALSE,
                        ignore.strand=FALSE,
                        fragments=TRUE )

colData(se)$group= sample$group
dds<-DESeqDataSet(se, design=~group)
dds<-DESeq(dds)

I tried that method and I still got the same status error 127.  I'm going to try and update the RSockNode file and see if that fixes anything.

Lilly

ADD REPLY
0
Entering edit mode

Thanks for posting your code and sessionInfo(). It looks like you're on the right track. Again, I think it's wise to start with a single file, make sure your set up is correct, then move to multiple files from there.

Set the path to a single file:

file = pathTOSingleFile

Check that the file exists:

file.exists(file)  ## should be TRUE

Create the BamFile with 'yieldSize':

bf = BamFile(file, index=character(), yieldSize=100000)

Count:

res <- summarizeOverlaps(exons, bf, singleEnd=FALSE, fragments=TRUE)

You may have done this already but it's good to start with a man page example to make sure all is working as expected. This codes counts a single bam with a 'yieldSize' and should work for you. If it doesn't then we have a different problem.

fl = system.file("extdata", "sm_treated1.bam", package="GenomicAlignments")
> file.exists(fl)
[1] TRUE
bf = BamFile(fl, yieldSize=1000)

genes <- GRanges(

    seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)),
    ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 
                                    4000, 7500, 5000, 5400), 
                                 width=c(rep(500, 3), 600, 900, 500, 300, 900, 
                                              300, 500, 500))) 

res <- summarizeOverlaps(genes, bf)
> res
class: RangedSummarizedExperiment 
dim: 11 1 
metadata(0):
assays(1): counts
rownames: NULL
rowRanges metadata column names(0):
colnames(1): sm_treated1.bam
colData names(0):

Valerie

 

ADD REPLY
0
Entering edit mode

I tried to run a single file around 2 hours ago and the summarizeOverlaps function is still trying to process.  

Here's the code of my input:

> file="C:/Users/lchen/Documents/transfer.lilly/TMZ.052215/accepted_hits.HF2303.CSC.Contc.sorted.rmdup.bam"
> file.exists(file)
[1] TRUE
> bf=BamFile(file, index=character(), yieldSize=100000)
> res<-summarizeOverlaps(exons, bf, singleEnd=FALSE, fragments=TRUE)

This was the same problem that was happening before, I had left the summarizeOverlap function to run over a weekend and it never finished

 

Lilly

 

ADD REPLY
0
Entering edit mode

I stopped the summarizeOverlap function in R and got this warning message:

Warning message:
running command 'env MASTER=localhost PORT=11444 OUT=/dev/null RPROG=C:/PROGRA~1/R/R-32~1.1/bin/R R_LIBS= C:/Users/lchen/Documents/R/win-library/3.2/BiocParallel/RSOCKnode.sh' had status 127 

I read that status 127 means the command wasn't found, could this be the potential problem as to why the function isn't running?  

 

Lilly

ADD REPLY
0
Entering edit mode

Thanks for posting the full error message. That is helpful.

It's possible the port being used to start the parallel workers is blocked. Do you have a system administrator you can ask about blocked ports? They should be able to tell you which are not blocked. Here is a similar derfinder: analyzeChr() freezes in f-stat calculationwith more discussion about this problem.

Once you have a non-blocked number you can use this to specify the R environment variable, R_PARALLEL_PORT, e.g., starting R with something like R_PARALLEL_PORT=12345 R --vanilla. Also see ?Sys.setenv for setting environment variables from within R.

A little more troubleshooting:

SerialParam (does not use workers/ports) should work:

library(BiocParallel)
bplapply(1:2, function(i) { Sys.sleep(1); i }, BPPARAM=SerialParam())

SnowParam will likely fail:

bplapply(1:2, function(i) { Sys.sleep(1); i }, BPPARAM=SnowParam(2))

Valerie

ADD REPLY
0
Entering edit mode

I never thought about the port being blocked, I think that might be the problem! 

SerialParam ran and gave me results:

> library(BiocParallel)
> bplapply(1:2, function(i) { Sys.sleep(1); i }, BPPARAM=SerialParam())
[[1]]
[1] 1

[[2]]
[1] 2

And when i run SnowParam, i get the same error message:

> bplapply(1:2, function(i) { Sys.sleep(1); i }, BPPARAM=SnowParam(2))

Warning message:
running command 'env MASTER=localhost PORT=11444 OUT=/dev/null RPROG=C:/PROGRA~1/R/R-32~1.1/bin/R R_LIBS= C:/Users/lchen/Documents/R/win-library/3.2/BiocParallel/RSOCKnode.sh' had status 127 

I'm an administrator on this comptuer, will I be able to check for blocked ports?  Or will SerialParam work to run summarizeOverlaps?  I added serialParam to the code but it was not working.

> se <- summarizeOverlaps(features=exons, reads=list,
+                         mode="Union",
+                         singleEnd=FALSE,
+                         ignore.strand=FALSE,
+                         fragments=TRUE, BPPARAM=SerialParam() )

Error in fun() : argument "e" is missing, with no default
> SerialParam()
class: SerialParam 
  bplog:FALSE; bpthreshold:INFO

Lilly

 

ADD REPLY
0
Entering edit mode

Yes, you can use SerialParam() with summarizeOverlaps(). Not sure why you're getting that error. Does this work for you?

library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
exons <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")

bf <- BamFile(untreated1_chr4(), yieldSize=50000)

se <- summarizeOverlaps(exons, bf, BPPARAM = SerialParam())

As for finding unblocked ports on windows, I'm not an expert. It looks like you can run netstat -a on the command line and look for LISTENING processes or use the Task Manager - more details here.

Valerie

ADD REPLY
0
Entering edit mode

Yes that code works:

> library(pasillaBamSubset)
> library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
> exons <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")

> bf <- BamFile(untreated1_chr4(), yieldSize=50000)

> se <- summarizeOverlaps(exons, bf, BPPARAM = SerialParam())
> se
class: SummarizedExperiment 
dim: 15682 1 
exptData(0):
assays(1): counts
rownames(15682): FBgn0000003 FBgn0000008 ... FBgn0264726
  FBgn0264727
rowRanges metadata column names(0):
colnames(1): untreated1_chr4.bam
colData names(0):

Then I replaced the packages with my own BamFiles and database and got a different error:

bf <- BamFile("list", yieldSize=100000)
> se <- summarizeOverlaps(exons, bf, BPPARAM = SerialParam())
Error in .dispatchBamFiles(features, BamFileList(reads), mode, match.arg(algorithm),  : 
  file(s): list do not exist

As for the unblocked ports, I'll check out that link to find out more about that and see if I can find a way to unblock them!

Lilly

ADD REPLY
0
Entering edit mode

I just tried running my own script and got this error instead:

> se <- summarizeOverlaps(features=exons, reads=list,
+                         mode="Union",
+                         singleEnd=FALSE,
+                         ignore.strand=FALSE,
+                         fragments=TRUE)

Error in fun() : argument "e" is missing, with no default

ADD REPLY
0
Entering edit mode

Also, when i looked at summarizeOverlaps I get these warning messages:

> summarizeOverlaps()
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘summarizeOverlaps’ for signature ‘"missing", "missing"’
In addition: Warning messages:
1: In read.dcf(file.path(p, "DESCRIPTION"), c("Package", "Version")) :
  cannot open compressed file 'C:/Users/lchen/Documents/R/win-library/3.2/lattice/DESCRIPTION', probable reason 'No such file or directory'
2: In read.dcf(file.path(p, "DESCRIPTION"), c("Package", "Version")) :
  cannot open compressed file 'C:/Users/lchen/Documents/R/win-library/3.2/survival/DESCRIPTION', probable reason 'No such file or directory'
3: In read.dcf(file.path(p, "DESCRIPTION"), c("Package", "Version")) :
  cannot open compressed file 'C:/Users/lchen/Documents/R/win-library/3.2/foreign/DESCRIPTION', probable reason 'No such file or directory'
4: In read.dcf(file.path(p, "DESCRIPTION"), c("Package", "Version")) :
  cannot open compressed file 'C:/Users/lchen/Documents/R/win-library/3.2/cluster/DESCRIPTION', probable reason 'No such file or directory'
5: In read.dcf(file.path(p, "DESCRIPTION"), c("Package", "Version")) :
  cannot open compressed file 'C:/Users/lchen/Documents/R/win-library/3.2/MASS/DESCRIPTION', probable reason 'No such file or directory'
6: In read.dcf(file.path(p, "DESCRIPTION"), c("Package", "Version")) :
  cannot open compressed file 'C:/Users/lchen/Documents/R/win-library/3.2/rpart/DESCRIPTION', probable reason 'No such file or directory'
7: In read.dcf(file.path(p, "DESCRIPTION"), c("Package", "Version")) :
  cannot open compressed file 'C:/Users/lchen/Documents/R/win-library/3.2/nnet/DESCRIPTION', probable reason 'No such file or directory'

 

Would not being able to open those files be part of the problem possibly?  I've never seen those warning messagesbefore.

Lilly

ADD REPLY
0
Entering edit mode

You now have a small example that works so you know summarizeOverlaps() runs on your system.

Next you need to check your input args to summarizeOverlaps and see why they aren't working. Start with one of your files (don't start with 12). Make sure the single file is found, use file.exists() as I showed earlier. Read the man page ?file.exists if you aren't sure how to use it. Check the GRangesList you're using as 'features'.  Does everything look as expected? Another thing you may need to check is that the chromosome names in the bam files match those in the GRangesList. Use scanBamHeader() on one of the files to see the chromosome names. If you need to rename chromosomes in the TxDb to match the bam, see ?renameSeqlevels. There are several functions on that page that can help.

That last error you see is telling you there is no method for summarizeOverlaps with missing 'features' and missing 'reads'. Makes sense. That means it can't find whatever you gave it as 'features' and 'reads'.

showMethods() shows available methods for a function:

> showMethods("summarizeOverlaps")
Function: summarizeOverlaps (package GenomicAlignments)
features="BamViews", reads="missing"
features="GRanges", reads="BamFile"
features="GRanges", reads="BamFileList"
features="GRanges", reads="character"
features="GRanges", reads="GAlignmentPairs"
features="GRanges", reads="GAlignments"
features="GRanges", reads="GAlignmentsList"
features="GRanges", reads="GRanges"
features="GRanges", reads="GRangesList"
features="GRangesList", reads="BamFile"
features="GRangesList", reads="BamFileList"
features="GRangesList", reads="character"
features="GRangesList", reads="GAlignmentPairs"
features="GRangesList", reads="GAlignments"
features="GRangesList", reads="GAlignmentsList"
features="GRangesList", reads="GRanges"
features="GRangesList", reads="GRangesList"

 

selectMethod() returns a particular method:

> selectMethod("summarizeOverlaps", c("GRanges", "BamFile"))
Method Definition:

function (features, reads, mode = Union, algorithm = c("nclist",
    "intervaltree"), ignore.strand = FALSE, ...)
{
    .local <- function (features, reads, mode = Union, algorithm = c("nclist",
        "intervaltree"), ignore.strand = FALSE, inter.feature = TRUE,
        singleEnd = TRUE, fragments = FALSE, param = ScanBamParam(),
        preprocess.reads = NULL, ...)
    {
        .checkArgs(reads, singleEnd, fragments)
        .dispatchBamFiles(features, BamFileList(reads), mode,
            match.arg(algorithm), ignore.strand, inter.feature = inter.feature,
            singleEnd = singleEnd, fragments = fragments, param = param,
            preprocess.reads = preprocess.reads, ...)
    }
    .local(features, reads, mode, algorithm, ignore.strand, ...)
}
<environment: namespace:GenomicAlignments>

Signatures:
        features  reads   
target  "GRanges" "BamFile"
defined "GRanges" "BamFile"

Valerie

ADD REPLY
0
Entering edit mode

Hi valerie,

I played around with it this morning and it worked! I ended up checking my file names, restarted R, and used serialParam instead of snow! Thanks for all your help :)

Lilly

ADD REPLY
0
Entering edit mode

Great! I'm glad it worked.

Valerie

ADD REPLY
0
Entering edit mode
Jake ▴ 90
@jake-7236
Last seen 2.4 years ago
United States

I've found that featureCounts (http://bioinf.wehi.edu.au/featureCounts/) outside of R is great and super fast for counting reads. Then I just import the counts into R.

ADD COMMENT
0
Entering edit mode

featureCounts() is from Rsubread which is a great package. Unfortunately it's not supported on Windows.

Valerie

ADD REPLY

Login before adding your answer.

Traffic: 557 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6