Question

CAGEr: Problem when running getCTSS

0

Entering edit mode

Sergio • 0

@3cdf8a1c

Last seen 3.6 years ago

United States

Hi,

I am getting the following error (code below) when running the getCTSS function from the CAGEr package using bam files:


cage_rep1_bam <- "/n/projects/sga/analysis/pipeline_test/cage/kc167_cage_1.bam"
cage_rep2_bam <- "/n/projects/sga/analysis/pipeline_test/cage/kc167_cage_2.bam"

ce <- CAGEexp( genomeName     = "BSgenome.Dmelanogaster.UCSC.dm6"
               , inputFiles     = c(cage_rep1_bam, cage_rep2_bam)
               , inputFilesType = "bam"
               , sampleLabels   = c("cage_rep1", "cage_rep2"))

getCTSS(ce, removeFirstG = F, correctSystematicG = F, nrCores = 10)

Reading in file: /n/projects/sga/analysis/pipeline_test/cage/kc167_cage_1.bam...
    -> Filtering out low quality reads...
Error in `[[<-`(`*tmp*`, name, value = new("Rle", values = c(1L, 38L,  : 
  2084199 elements in value to replace 104203438 elements

Interestingly, I have not had this error before when running the same code with the same samples. Does anyone have an idea of what may be causing this?

Thanks a lot!

Sergio

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /n/apps/CentOS7/install/r-4.1.0/lib64/R/lib/libRblas.so
LAPACK: /n/apps/CentOS7/install/r-4.1.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicAlignments_1.28.0              Rsamtools_2.8.0                       data.table_1.14.0                     lattice_0.20-44                      
 [5] cowplot_1.1.1                         gridExtra_2.3                         ggseqlogo_0.1                         CAGEr_1.34.0                         
 [9] MultiAssayExperiment_1.18.0           SummarizedExperiment_1.22.0           Biobase_2.52.0                        MatrixGenerics_1.4.0                 
[13] matrixStats_0.59.0                    plyranges_1.12.0                      reshape2_1.4.4                        dplyr_1.0.6                          
[17] ggplot2_3.3.3                         BSgenome.Dmelanogaster.UCSC.dm6_1.4.1 BSgenome_1.60.0                       Biostrings_2.60.1                    
[21] XVector_0.32.0                        rtracklayer_1.52.0                    magrittr_2.0.1                        GenomicRanges_1.44.0                 
[25] GenomeInfoDb_1.28.0                   IRanges_2.26.0                        S4Vectors_0.30.0                      BiocGenerics_0.38.0                  

loaded via a namespace (and not attached):
 [1] nlme_3.1-152           bitops_1.0-7           tools_4.1.0            utf8_1.2.1             R6_2.5.0               vegan_2.5-7            KernSmooth_2.23-20    
 [8] DBI_1.1.1              mgcv_1.8-36            colorspace_2.0-1       permute_0.9-5          withr_2.4.2            tidyselect_1.1.1       compiler_4.1.0        
[15] DelayedArray_0.18.0    scales_1.1.1           stringr_1.4.0          digest_0.6.27          rmarkdown_2.8          stringdist_0.9.6.3     pkgconfig_2.0.3       
[22] htmltools_0.5.1.1      fastmap_1.1.0          rlang_0.4.11           rstudioapi_0.13        VGAM_1.1-5             BiocIO_1.2.0           generics_0.1.0        
[29] BiocParallel_1.26.0    gtools_3.9.2           RCurl_1.98-1.3         GenomeInfoDbData_1.2.6 Matrix_1.3-4           Rcpp_1.0.6             munsell_0.5.0         
[36] fansi_0.5.0            lifecycle_1.0.0        stringi_1.6.2          yaml_2.2.1             MASS_7.3-54            zlibbioc_1.38.0        plyr_1.8.6            
[43] grid_4.1.0             formula.tools_1.7.1    crayon_1.4.1           splines_4.1.0          knitr_1.33             beanplot_1.2           pillar_1.6.1          
[50] rjson_0.2.20           XML_3.99-0.6           glue_1.4.2             evaluate_0.14          operator.tools_1.6.3   vctrs_0.3.8            gtable_0.3.0          
[57] purrr_0.3.4            reshape_0.8.8          assertthat_0.2.1       cachem_1.0.5           xfun_0.23              restfulr_0.0.13        tibble_3.1.2

CAGEr • 1.3k views

ADD COMMENT • link updated 3.5 years ago by Charles Plessy ▴ 180 • written 3.7 years ago by Sergio • 0

0

Entering edit mode

I actually found where the error is occurring within the getCTSS function, but still unsure how to fix it. There seems to be a difference in length when trying to assign a score to the CTSS granges object generated by the CTSS function.

This what I tried using the code within the getCTSS function I got from Github:

1) Filter low quality reads from bam file

param <- ScanBamParam(what = c("seq", "qual"), flag = scanBamFlag(isUnmappedQuery = FALSE))
bam <- readGAlignments(cage_rep1_bam, param = param)

qual <- mcols(bam)$qual
start <- 1
chunksize <- 1e6
qual.avg <- vector(mode = "integer")
  repeat {
    if (start + chunksize <= length(qual)) {
      end <- start + chunksize
    } else {
      end <- length(qual)
    }
    qual.avg <- c(qual.avg, as.integer(mean(as(qual[start:end], "IntegerList"))))
    if (end == length(qual)) {
      break
    } else {
      start <- end + 1
    }
  }
bam_filtered <- bam[qual.avg >= 10]
gr <- GRanges(bam_filtered)
gr$read.length <- qwidth(bam_filtered)
gr

2) Make sure information from aligned data and BSgenome match

coerceInBSgenome <- function(gr, genome) {
  if (is.null(genome)) return(gr)
  genome <- BSgenome.Dmelanogaster.UCSC.dm6
  gr <- gr[seqnames(gr) %in% seqnames(genome)]
  gr <- gr[! end(gr) > seqlengths(genome)[as.character(seqnames(gr))]]
  seqlevels(gr) <- seqlevels(genome)
  seqinfo(gr) <- seqinfo(genome)
  gr
}

new_gr <- coerceInBSgenome(gr, BSgenome.Dmelanogaster.UCSC.dm6)

3) Create a CTSS object and assign scores (here is the problem)

tb <- table(new_gr)
gp <- CTSS(names(tb), bsgenomeName = BSgenome.Dmelanogaster.UCSC.dm6)
score(gp) <- Rle(unclass(tb))

Error in `[[<-`(`*tmp*`, name, value = new("Rle", values = c(1L, 6L, 4L,  : 
  6447675 elements in value to replace 327809968 elements

Also, while I am able to run the CTSS function and obtain the gp object, I get this error when printing it:

gp

CTSS object with 327809968 positions and 0 metadata columns:
                      seqnames       pos strand
                         <Rle> <integer>  <Rle>
          [1]            chr2L      5875      +
          [2]            chr2L      5876      +
          [3]            chr2L      5877      +
          [4]            chr2L      5878      +
          [5]            chr2L      5879      +
          ...              ...       ...    ...
  [327809964] chrUn_DS486008v1       962      -
  [327809965] chrUn_DS486008v1       963      -
  [327809966] chrUn_DS486008v1       964      -
  [327809967] chrUn_DS486008v1       965      -
  [327809968] chrUn_DS486008v1       966      -
  -------
  seqinfo: 1860 sequences from an unspecified genome; no seqlengths
Error in rep(no, length.out = len) : 
  attempt to replicate an object of type 'S4'

Thank you!

Sergio

ADD REPLY • link 3.7 years ago Sergio • 0

0

Entering edit mode

Hi, we are preparing a version 2.0 of CAGEr. Can you open an issue in our GitHub repository ? https://github.com/charles-plessy/CAGEr/issues

ADD REPLY • link 3.6 years ago Charles • 0

score 0 · Answer 1 · 2021-11-01

0

Entering edit mode

Charles Plessy ▴ 180

@charles-plessy-7857

Last seen 19 months ago

Japan

I finally found the bug and fixed it in _CAGEr_ versions 2.0.1 and 2.1.1. I just pushed them to Bioconductor, so please give them a day or to be ready for download.

ADD COMMENT • link 3.5 years ago Charles Plessy ▴ 180