Aspli: error with gbCounts
2
0
Entering edit mode
jbono ▴ 10
@jbono-7682
Last seen 12 months ago
United States

I am trying to run gbCounts in Aspli. It appears to work through a few samples, but then I get an error message that I do not know how to address. I appreciate any insight. Here is the code I used:

> Dmel.6.38.TxDb=makeTxDbFromGFF(
+     file="dmel-all-r6.38.gtf",
+     format="gtf")

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This information was
  ignored.
2: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because their exon ranks could not be inferred (either because the
  exons are not on the same chromosome/strand or because they are not separated by introns): FBtr0084079,
  FBtr0084080, FBtr0084081, FBtr0084082, FBtr0084083, FBtr0084084, FBtr0084085, FBtr0307759, FBtr0307760
3: In .reject_transcripts(bad_tx, because) :
  The following transcripts were rejected because they have stop codons that cannot be mapped to an exon:
  FBtr0100857, FBtr0100863, FBtr0433500, FBtr0433501

> saveDb(Dmel.6.38.TxDb,file="Dmel.6.38.TxDb.sqlite")

TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: dmel-all-r6.38.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 35367
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2021-03-18 12:37:55 -0600 (Thu, 18 Mar 2021)
# GenomicFeatures version at creation time: 1.42.2
# RSQLite version at creation time: 2.2.4
# DBSCHEMAVERSION: 1.2

> features <- binGenome( Dmel.6.38.TxDb )
* Number of extracted Genes = 17869
* Number of extracted Exon Bins = 80637
* Number of extracted intron bins = 72288
* Number of extracted trascripts = 35367
* Number of extracted junctions = 60431
* Number of AS bins (not include external) = 9547
* Number of AS bins (include external) = 9557
* Classified as: 
    ES bins = 2427  (25%)
    IR bins = 1257  (13%)
    Alt5'ss bins = 1497 (16%)
    Alt3'ss bins = 1622 (17%)
    Multiple AS bins = 2744 (29%)
    classified as:
            ES bins = 531   (19%)
            IR bins = 492   (18%)
            Alt5'ss bins = 885  (32%)
            Alt3'ss bins = 725  (26%)

> targets=read.csv("Targets.csv")

> getConditions(targets)
[1] "Mutant_F_1D"   "Mutant_M_1D"   "Mutant_F_28D"  "Mutant_M_28D"  "Control_F_1D"  "Control_M_1D"  "Control_F_28D"
[8] "Control_M_28D"

> gbcounts <- gbCounts( features = features,
+                       targets = targets,
+                       minReadLength = 100, maxISize = 50000,
+                       strandMode=0)
Summarizing Mutant_F_1D_1
ETA: 53 min
Summarizing Mutant_F_1D_2
ETA: 49 min
Summarizing Mutant_F_1D_3
ETA: 46 min
Summarizing Mutant_M_1D_1
ETA: 43 min
Summarizing Mutant_M_1D_2
ETA: 41 min
Summarizing Mutant_M_1D_3
ETA: 38 min
Summarizing Mutant_F_28D_1
[1] 7
Error in .subset(x, j) : only 0's may be mixed with negative subscripts
In addition: Warning message:
In colnames(counts@junction.counts)[9:ncol(counts@junction.counts)] <- rownames(targets) :
  number of items to replace is not a multiple of replacement length**

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicFeatures_1.42.2 GenomicRanges_1.42.0   GenomeInfoDb_1.26.4    ASpli_2.0.0            AnnotationDbi_1.52.0  
 [6] IRanges_2.24.1         S4Vectors_0.28.1       Biobase_2.50.0         BiocGenerics_0.36.0    edgeR_3.32.1          
[11] limma_3.46.0          

loaded via a namespace (and not attached):
  [1] colorspace_2.0-0            ellipsis_0.3.1              biovizBase_1.38.0           htmlTable_2.1.0            
  [5] XVector_0.30.0              base64enc_0.1-3             dichromat_2.0-0             rstudioapi_0.13            
  [9] DT_0.17                     bit64_4.0.5                 fansi_0.4.2                 xml2_1.3.2                 
 [13] splines_4.0.4               cachem_1.0.4                knitr_1.31                  Formula_1.2-4              
 [17] Rsamtools_2.6.0             cluster_2.1.1               dbplyr_2.1.0                png_0.1-7                  
 [21] BiocManager_1.30.10         compiler_4.0.4              httr_1.4.2                  backports_1.2.1            
 [25] lazyeval_0.2.2              assertthat_0.2.1            Matrix_1.3-2                fastmap_1.1.0              
 [29] htmltools_0.5.1.1           prettyunits_1.1.1           tools_4.0.4                 igraph_1.2.6               
 [33] gtable_0.3.0                glue_1.4.2                  GenomeInfoDbData_1.2.4      dplyr_1.0.5                
 [37] rappdirs_0.3.3              tinytex_0.30                Rcpp_1.0.6                  vctrs_0.3.6                
 [41] Biostrings_2.58.0           rtracklayer_1.50.0          xfun_0.22                   stringr_1.4.0              
 [45] lifecycle_1.0.0             ensembldb_2.14.0            XML_3.99-0.6                zlibbioc_1.36.0            
 [49] MASS_7.3-53.1               scales_1.1.1                BiocStyle_2.18.1            BSgenome_1.58.0            
 [53] VariantAnnotation_1.36.0    ProtGenerics_1.22.0         hms_1.0.0                   MatrixGenerics_1.2.1       
 [57] SummarizedExperiment_1.20.0 AnnotationFilter_1.14.0     RColorBrewer_1.1-2          yaml_2.2.1                 
 [61] curl_4.3                    memoise_2.0.0               gridExtra_2.3               ggplot2_3.3.3              
 [65] UpSetR_1.4.0                biomaRt_2.46.3              rpart_4.1-15                latticeExtra_0.6-29        
 [69] stringi_1.5.3               RSQLite_2.2.4               checkmate_2.0.0             BiocParallel_1.24.1        
 [73] rlang_0.4.10                pkgconfig_2.0.3             matrixStats_0.58.0          bitops_1.0-6               
 [77] evaluate_0.14               lattice_0.20-41             purrr_0.3.4                 htmlwidgets_1.5.3          
 [81] GenomicAlignments_1.26.0    bit_4.0.4                   tidyselect_1.1.0            plyr_1.8.6                 
 [85] magrittr_2.0.1              R6_2.5.0                    generics_0.1.0              Hmisc_4.5-0                
 [89] DelayedArray_0.16.2         DBI_1.1.1                   pillar_1.5.1                foreign_0.8-81             
 [93] survival_3.2-10             RCurl_1.98-1.3              nnet_7.3-15                 tibble_3.1.0               
 [97] crayon_1.4.1                utf8_1.2.1                  BiocFileCache_1.14.0        rmarkdown_2.7              
[101] jpeg_0.1-8.1                progress_1.2.2              locfit_1.5-9.4              grid_4.0.4                 
[105] data.table_1.14.0           blob_1.2.1                  digest_0.6.27               tidyr_1.1.3                
[109] openssl_1.4.3               munsell_0.5.0               Gviz_1.34.1                 askpass_1.1
Aspli ASpli • 2.3k views
ADD COMMENT
0
Entering edit mode
@b6a1dc8b
Last seen 3.7 years ago
Buenos Aires

Hi jbono,

The second and third warnings report some oddities concerning 'FBgn0002781','FBgn0013680','FBgn0013675','FBgn0262952','FBgn0013684' gene annotations. Could you please try removing them from the gtf file and re-run the analysis?

You can easily create an auxiliary foo.gtf file without the offending genes using the grep command:

   grep -iv 'FBgn0002781\|FBgn0013680\|FBgn0013675\|FBgn0262952\|FBgn0013684' dmel-all-r6.38.gtf > foo.gtf

and then proceed with 'foo.gtf' instead of the original file.

Best Ariel

ADD COMMENT
0
Entering edit mode

Hi Ariel,

Thanks for the advice! I took out those genes as suggested. Some of the warning messages from 'makeTxDbFromGFF' went away, but I still got the same error message when running gbCounts: Error in .subset(x, j) : only 0's may be mixed with negative subscripts In addition: Warning message: In colnames(counts@junction.counts)[9:ncol(counts@junction.counts)] <- rownames(targets) : number of items to replace is not a multiple of replacement length.

I've listed the code below:

Dmel.6.38.TxDb.edited=makeTxDbFromGFF(
+     file="dmel-all-r6.gt.edited.gtf",
+     format="gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This information was
  ignored.
2: In for (i in seq_along(snames)) { :
  closing unused connection 3 (dmel-all-r6.38.edited.gtf)
> saveDb(Dmel.6.38.TxDb.edited,file="Dmel.6.38.TxDb.edited.sqlite")
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: dmel-all-r6.gt.edited.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 35345
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2021-03-31 10:57:52 -0600 (Wed, 31 Mar 2021)
# GenomicFeatures version at creation time: 1.42.2
# RSQLite version at creation time: 2.2.4
# DBSCHEMAVERSION: 1.2
> features <- binGenome( Dmel.6.38.TxDb.edited )
* Number of extracted Genes = 17868
* Number of extracted Exon Bins = 80610
* Number of extracted intron bins = 72241
* Number of extracted trascripts = 35345
* Number of extracted junctions = 60405
* Number of AS bins (not include external) = 9544
* Number of AS bins (include external) = 9554
* Classified as: 
    ES bins = 2427  (25%)
    IR bins = 1267  (13%)
    Alt5'ss bins = 1497 (16%)
    Alt3'ss bins = 1612 (17%)
    Multiple AS bins = 2741 (29%)
    classified as:
            ES bins = 530   (19%)
            IR bins = 491   (18%)
            Alt5'ss bins = 885  (32%)
            Alt3'ss bins = 724  (26%)

> targets=read.csv("Targets.csv")
> getConditions(targets)
[1] "Mutant_F_1D"   "Mutant_M_1D"   "Mutant_F_28D"  "Mutant_M_28D"  "Control_F_1D"  "Control_M_1D"  "Control_F_28D"
[8] "Control_M_28D"
> gbcounts <- gbCounts( features = features,
+                       targets = targets,
+                       minReadLength = 100, maxISize = 50000,
+                       strandMode=0)
Summarizing Mutant_F_1D_1
ETA: 53 min
Summarizing Mutant_F_1D_2
ETA: 50 min
Summarizing Mutant_F_1D_3
ETA: 46 min
Summarizing Mutant_M_1D_1
ETA: 44 min
Summarizing Mutant_M_1D_2
ETA: 40 min
Summarizing Mutant_M_1D_3
ETA: 38 min
Summarizing Mutant_F_28D_1
[1] 7
Error in .subset(x, j) : only 0's may be mixed with negative subscripts
In addition: Warning message:
In colnames(counts@junction.counts)[9:ncol(counts@junction.counts)] <- rownames(targets) :
  number of items to replace is not a multiple of replacement length
ADD REPLY
0
Entering edit mode

Hi Jeremy, It seems that there could be some problems with junction information in Mutant_F_28D_1 bam file. Is this file different from the rest from a QC point of view? Could you eventually share it with me in order to see what´s going on specifically with that file? Ariel

ADD REPLY
0
Entering edit mode

Hi Ariel,

I also remade the bam file just in case but still got the error. I don't think there is a big difference from a QC point of view and the file has worked for a few other workflows. I can send you the file over email unless there is a way to do it here. Thanks for your help!

Jeremy

ADD REPLY
0
Entering edit mode

Hi Jeremy,

I could not reproduce the error you reported with the bam you sent me....could you run the code I used and tell me how it goes?

library(GenomicFeatures)
library(ASpli)
dme      <- makeTxDbFromGFF(file='/data1/genomeData/dme/foo.gtf')
features <- binGenome( dme )
targets  <- data.frame(row.names = 'C28F1_1',
                       bam = 'C28F1_1.fq.gz.subjunc.sorted.BAM',
                       f1  = 'C28F1')

gb <- gbCounts(features, targets, minReadLength = 100, maxISize = 50000, strandMode = 0)
# Summarizing C28F1_1
# Read summarization by gene completed
# Read summarization by bin completed
# Read summarization by ei1 region completed
# Read summarization by ie2 region completed
# Junction summarization completed 

gb
# Object of class ASpliCounts 
# Gene counts: 17868 genes analysed. Access using countsg(object) 
# Gene RD: 17868 genes analysed. Access using rdsg(object) 
# Bin counts: 143297 bins analysed. Access using countsb(object) 
# Bin RD: 143297 bins analysed. Access using rdsb(object) 
# Junction counts: 59250 junctions analysed. Access using countsj(object) 
ADD REPLY
0
Entering edit mode

Hi Ariel,

That also worked fine for me (see below). Just to be sure I also reran the original analysis and got the same error. I also tried the test below with the next bam file in the sequence just in case that was actually causing the problem, but that worked fine as well.

> dme      <- makeTxDbFromGFF(file="dmel-all-r6.gt.edited.gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This information was
  ignored.
> features_test <- binGenome( dme )
* Number of extracted Genes = 17868
* Number of extracted Exon Bins = 80610
* Number of extracted intron bins = 72241
* Number of extracted trascripts = 35345
* Number of extracted junctions = 60405
* Number of AS bins (not include external) = 9544
* Number of AS bins (include external) = 9554
* Classified as: 
    ES bins = 2427  (25%)
    IR bins = 1267  (13%)
    Alt5'ss bins = 1497 (16%)
    Alt3'ss bins = 1612 (17%)
    Multiple AS bins = 2741 (29%)
    classified as:
            ES bins = 530   (19%)
            IR bins = 491   (18%)
            Alt5'ss bins = 885  (32%)
            Alt3'ss bins = 724  (26%)

> targets_test  <- data.frame(row.names = 'C28F1_1',
+                             bam = 'C28F1_1.fq.gz.subjunc.sorted.BAM',
+                             f1  = 'C28F1')
> gb_test <- gbCounts(features_test, targets_test, minReadLength = 100, maxISize = 50000, strandMode = 0)
Summarizing C28F1_1
Read summarization by gene completed
Read summarization by bin completed
Read summarization by ei1 region completed
Read summarization by ie2 region completed
Junction summarization completed
[1] 1
> gb_test
Object of class ASpliCounts 
Gene counts: 17868 genes analysed. Access using countsg(object) 
Gene RD: 17868 genes analysed. Access using rdsg(object) 
Bin counts: 143297 bins analysed. Access using countsb(object) 
Bin RD: 143297 bins analysed. Access using rdsb(object) 
Junction counts: 59250 junctions analysed. Access using countsj(object) 

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tximport_1.18.0              IsoformSwitchAnalyzeR_1.12.0 ggplot2_3.3.3                DEXSeq_1.36.0               
 [5] RColorBrewer_1.1-2           DESeq2_1.30.1                BiocParallel_1.24.1          GenomicFeatures_1.42.3      
 [9] GenomicAlignments_1.26.0     Rsamtools_2.6.0              Biostrings_2.58.0            XVector_0.30.0              
[13] SummarizedExperiment_1.20.0  MatrixGenerics_1.2.1         matrixStats_0.58.0           GenomicRanges_1.42.0        
[17] GenomeInfoDb_1.26.4          GenomeInfoDbData_1.2.4       ASpli_2.0.0                  AnnotationDbi_1.52.0        
[21] IRanges_2.24.1               S4Vectors_0.28.1             Biobase_2.50.0               BiocGenerics_0.36.0         
[25] Rsubread_2.4.3               edgeR_3.32.1                 limma_3.46.0                

loaded via a namespace (and not attached):
  [1] backports_1.2.1               AnnotationHub_2.22.0          Hmisc_4.5-0                  
  [4] DRIMSeq_1.18.0                BiocFileCache_1.14.0          plyr_1.8.6                   
  [7] igraph_1.2.6                  lazyeval_0.2.2                tximeta_1.8.4                
 [10] splines_4.0.4                 digest_0.6.27                 ensembldb_2.14.0             
 [13] htmltools_0.5.1.1             fansi_0.4.2                   magrittr_2.0.1               
 [16] checkmate_2.0.0               memoise_2.0.0                 BSgenome_1.58.0              
 [19] cluster_2.1.1                 readr_1.4.0                   annotate_1.68.0              
 [22] askpass_1.1                   prettyunits_1.1.1             jpeg_0.1-8.1                 
 [25] colorspace_2.0-0              blob_1.2.1                    rappdirs_0.3.3               
 [28] xfun_0.22                     dplyr_1.0.5                   jsonlite_1.7.2               
 [31] crayon_1.4.1                  RCurl_1.98-1.3                genefilter_1.72.1            
 [34] survival_3.2-10               VariantAnnotation_1.36.0      glue_1.4.2                   
 [37] gtable_0.3.0                  zlibbioc_1.36.0               UpSetR_1.4.0                 
 [40] DelayedArray_0.16.3           scales_1.1.1                  futile.options_1.0.1         
 [43] DBI_1.1.1                     Rcpp_1.0.6                    xtable_1.8-4                 
 [46] progress_1.2.2                htmlTable_2.1.0               foreign_0.8-81               
 [49] bit_4.0.4                     Formula_1.2-4                 DT_0.17                      
 [52] htmlwidgets_1.5.3             httr_1.4.2                    ellipsis_0.3.1               
 [55] pkgconfig_2.0.3               XML_3.99-0.6                  Gviz_1.34.1                  
 [58] nnet_7.3-15                   dbplyr_2.1.0                  locfit_1.5-9.4               
 [61] utf8_1.2.1                    later_1.1.0.1                 tidyselect_1.1.0             
 [64] rlang_0.4.10                  reshape2_1.4.4                BiocVersion_3.12.0           
 [67] munsell_0.5.0                 tools_4.0.4                   cachem_1.0.4                 
 [70] generics_0.1.0                RSQLite_2.2.5                 evaluate_0.14                
 [73] stringr_1.4.0                 fastmap_1.1.0                 yaml_2.2.1                   
 [76] knitr_1.31                    bit64_4.0.5                   purrr_0.3.4                  
 [79] AnnotationFilter_1.14.0       mime_0.10                     formatR_1.8                  
 [82] xml2_1.3.2                    biomaRt_2.46.3                BiocStyle_2.18.1             
 [85] compiler_4.0.4                rstudioapi_0.13               interactiveDisplayBase_1.28.0
 [88] curl_4.3                      png_0.1-7                     tibble_3.1.0                 
 [91] statmod_1.4.35                geneplotter_1.68.0            stringi_1.5.3                
 [94] futile.logger_1.4.3           lattice_0.20-41               ProtGenerics_1.22.0          
 [97] Matrix_1.3-2                  vctrs_0.3.7                   pillar_1.5.1                 
[100] lifecycle_1.0.0               BiocManager_1.30.12           data.table_1.14.0            
[103] bitops_1.0-6                  httpuv_1.5.5                  rtracklayer_1.50.0           
[106] R6_2.5.0                      latticeExtra_0.6-29           hwriter_1.3.2                
[109] promises_1.2.0.1              gridExtra_2.3                 lambda.r_1.2.4               
[112] dichromat_2.0-0               MASS_7.3-53.1                 assertthat_0.2.1             
[115] openssl_1.4.3                 withr_2.4.1                   hms_1.0.0                    
[118] VennDiagram_1.6.20            grid_4.0.4                    rpart_4.1-15                 
[121] tidyr_1.1.3                   rmarkdown_2.7                 biovizBase_1.38.0            
[124] shiny_1.6.0                   base64enc_0.1-3               tinytex_0.31
ADD REPLY
0
Entering edit mode

Could you send me your Targets.csv file?

ADD REPLY
0
Entering edit mode

Sorry, I forgot to respond here that I sent the file via email. Please let me know if you didn't receive it, thanks!

ADD REPLY
0
Entering edit mode

Hi Jeremy. We got it and we are working on the issue. Thanks for your patience. A

ADD REPLY
0
Entering edit mode
@b6a1dc8b
Last seen 3.7 years ago
Buenos Aires

Hi Jeremy,

Thanks for your patience. Finally, we could reproduce the error... It was produced by a single rDNA aligned junction (rDNA.11456.11467) detected in the problematic BAM that was not correctly parsed by ASpli.

We will be pushing today a new version of ASpli to the BioC devel branch [https://bioconductor.org/packages/devel/bioc/html/ASpli.html]. This version (it will be version 2.1.2) correctly handles your dataset and should be available for installation in a couple of days.

In the meantime, a fast and dirty workaround could be just keeping standard chromosome data in your bams. To do that you could use samtools:

samtools view -b input.bam  2L 2R 3L 3R 4  X Y > output.bam

Regards

Ariel

ADD COMMENT
1
Entering edit mode

Hi Ariel,

Thanks so much for working on this, I really appreciate it!

ADD REPLY

Login before adding your answer.

Traffic: 587 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6