Question

getGeneLengthAndGCContent: "zero or more than one input sequence"

mark.ebbert • 0

Hi,

I'm trying to use getGeneLengthAndGCContent to get gene lengths and GC content for normalizing some RNA-seq data. My data were aligned to hg38, and I used featureCounts to aggregate counts by Ensembl gene ID (GRCh38, v. 87). I used the following call:

> hsa.len.gc <- getGeneLengthAndGCContent(id=rownames(counts.no.sex), org="hsa", mode=c("biomart"))

I received the following error:

NAs produced by integer overflowError in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW),  : 
  zero or more than one input sequence

Oddly, when I ran it a second time, the message changed slightly, but the result was the same:

Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW),  : 
  zero or more than one input sequence
In addition: Warning message:
In nchar(str, "bytes") * 4L : NAs produced by integer overflow

I then switched to org.db mode with the following call to see if it could map Ensembl IDs:

> hsa.len.gc <- getGeneLengthAndGCContent(id=rownames(counts.no.sex), org="hg38", mode=c("org.db"))

This completed without errors, but most of the genes came back as NA:

> summary(hsa.len.gc)
     length             gc       
 Min.   :    41   Min.   :0.20   
 1st Qu.:  1800   1st Qu.:0.45   
 Median :  3582   Median :0.51   
 Mean   :  4566   Mean   :0.51   
 3rd Qu.:  6144   3rd Qu.:0.57   
 Max.   :156366   Max.   :0.93   
 NA's   :33901    NA's   :33901

It seems I need to get the biomart mode working. I suspect the issue is related to many:1 mappings. Does anyone know how to fix this?

Really appreciate your help.

Reproducible example set

Here is a link to the smallest subset of Ensembl IDs I could get to fail: https://www.dropbox.com/s/gthmo1rb5lcrbvr/gene_ids.txt?dl=0

> tmp<-read.delim("gene_ids.txt", header=FALSE)
> head(tmp)
               V1
1 ENSG00000243477
2 ENSG00000114378
3 ENSG00000068001
4 ENSG00000114383
5 ENSG00000068028
6 ENSG00000281358
> hsa.len.gc <- getGeneLengthAndGCContent(id=tmp$V1, org="hsa", mode=c("biomart"))
Connecting to BioMart ...
Downloading sequences ...
This may take a few minutes ...
Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW),  : 
  zero or more than one input sequence
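Before querying BioMart, it may be worth sanity-checking the ID vector itself. The sketch below is only an illustration: `check_gene_ids` is a hypothetical helper (not part of EDASeq or biomaRt), and it assumes human Ensembl gene IDs of the form `ENSG` followed by 11 digits, optionally with a version suffix. It does not diagnose the error above; it only rules out duplicated, empty, versioned, or malformed IDs as contributing factors.

```r
# Hypothetical helper (not part of EDASeq): summarise an ID vector
# before passing it to getGeneLengthAndGCContent().
check_gene_ids <- function(ids) {
  # read.delim() returns a factor by default in R < 4.0, so coerce first
  ids <- as.character(ids)
  list(
    n           = length(ids),
    n_missing   = sum(is.na(ids) | ids == ""),
    n_duplicate = sum(duplicated(ids)),
    # IDs carrying a version suffix such as "ENSG00000068001.5"
    n_versioned = sum(grepl("\\.[0-9]+$", ids)),
    # Assumption: human gene IDs look like ENSG + 11 digits (+ optional version)
    n_not_ensg  = sum(!grepl("^ENSG[0-9]{11}(\\.[0-9]+)?$", ids))
  )
}

# Toy input for illustration only
ids <- c("ENSG00000243477", "ENSG00000114378", "ENSG00000114378",
         "ENSG00000068001.5", "")
str(check_gene_ids(ids))
```

With the real data, the call would be `check_gene_ids(tmp$V1)` or `check_gene_ids(rownames(counts.no.sex))`.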

edaseq normalization hg38 biomart org.db • 2.9k views

ADD COMMENT • link updated 7.3 years ago by davide risso ▴ 980 • written 7.4 years ago by mark.ebbert • 0

It's hard to say what's going on without knowing what your row names are.

Can you please provide an example so that we can reproduce and diagnose the problem? For instance, would you be able to share the row names that you're using? How many are there? Is the error still there if you apply the function to only the first 10 genes? 100? 1000?

Please share the smallest possible reproducible example that produces the error.
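One way to narrow a failing ID set down is repeated halving. The following is a minimal sketch, not an EDASeq feature: `shrink_failing` and the `fails` predicate are hypothetical names, and the approach assumes the failure is triggered by a subset that survives splitting the vector in half. If neither half fails on its own, the loop stops and returns the current set.

```r
# Hypothetical helper: repeatedly halve a failing ID vector, keeping
# whichever half still fails, until no half fails on its own.
shrink_failing <- function(ids, fails) {
  stopifnot(fails(ids))            # the full set must fail to begin with
  while (length(ids) > 1) {
    mid   <- length(ids) %/% 2
    left  <- ids[seq_len(mid)]
    right <- ids[-seq_len(mid)]
    if (fails(left)) {
      ids <- left
    } else if (fails(right)) {
      ids <- right
    } else {
      break                        # failure needs IDs from both halves
    }
  }
  ids
}

# Mock predicate for illustration: "fails" when one particular ID is present.
fails_mock <- function(x) "gene637" %in% x
shrink_failing(paste0("gene", 1:1000), fails_mock)

# For the real case, a predicate along these lines could be used
# (assumption: each call contacts BioMart, so this can be slow):
# fails_real <- function(x) inherits(
#   try(getGeneLengthAndGCContent(id = x, org = "hsa", mode = "biomart"),
#       silent = TRUE),
#   "try-error")
```

Each round makes at most two calls, so the number of BioMart queries grows with the logarithm of the number of IDs rather than linearly.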

ADD REPLY • link 7.4 years ago davide risso ▴ 980

@daviderisso, I apologize for the slow response. I've been trying to generate a *small* reproducible example. So far, the smallest failing group I've been able to find is 5000 Ensembl gene IDs. Let me see if I can narrow it down further.

ADD REPLY • link 7.4 years ago mark.ebbert • 0

@daviderisso, I updated the post to include the smallest subset (with code) I could get to fail. Will that work?

ADD REPLY • link 7.4 years ago mark.ebbert • 0

score 1 · Accepted Answer · 2017-12-01

Hi Mark,

I've just tested your code and it works on my machine.

Here's what I did:

library(EDASeq)
tmp <- read.delim("gene_ids.txt", header=FALSE)
hsa.len.gc <- getGeneLengthAndGCContent(id=tmp$V1, org="hsa", mode=c("biomart"))

and this is the resulting object:

> summary(hsa.len.gc)
     length            gc        
 Min.   :   23   Min.   :0.1633  
 1st Qu.:  406   1st Qu.:0.3942  
 Median :  897   Median :0.4347  
 Mean   : 2341   Mean   :0.4477  
 3rd Qu.: 3089   3rd Qu.:0.4910  
 Max.   :42646   Max.   :0.8636  
 NA's   :40      NA's   :40

Are you using the latest versions of EDASeq and biomaRt? I'm using EDASeq 2.12.0 and biomaRt 2.34.0. Here is my full session info:

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] EDASeq_2.12.0              ShortRead_1.36.0          
 [3] GenomicAlignments_1.14.1   SummarizedExperiment_1.8.0
 [5] DelayedArray_0.4.1         matrixStats_0.52.2        
 [7] Rsamtools_1.30.0           GenomicRanges_1.30.0      
 [9] GenomeInfoDb_1.14.0        Biostrings_2.46.0         
[11] XVector_0.18.0             IRanges_2.12.0            
[13] S4Vectors_0.16.0           BiocParallel_1.12.0       
[15] Biobase_2.38.0             BiocGenerics_0.24.0       

loaded via a namespace (and not attached):
 [1] genefilter_1.60.0       progress_1.1.2          splines_3.4.2          
 [4] lattice_0.20-35         rtracklayer_1.38.0      GenomicFeatures_1.30.0 
 [7] blob_1.1.0              XML_3.98-1.9            survival_2.41-3        
[10] rlang_0.1.4             R.oo_1.21.0             DBI_0.7                
[13] R.utils_2.6.0           bit64_0.9-7             aroma.light_3.8.0      
[16] RColorBrewer_1.1-2      GenomeInfoDbData_0.99.1 stringr_1.2.0          
[19] zlibbioc_1.24.0         hwriter_1.3.2           R.methodsS3_1.7.1      
[22] memoise_1.1.0           latticeExtra_0.6-28     geneplotter_1.56.0     
[25] biomaRt_2.34.0          AnnotationDbi_1.40.0    Rcpp_0.12.14           
[28] xtable_1.8-2            annotate_1.56.1         bit_1.1-12             
[31] RMySQL_0.10.13          digest_0.6.12           stringi_1.1.6          
[34] DESeq_1.30.0            grid_3.4.2              tools_3.4.2            
[37] bitops_1.0-6            magrittr_1.5            RCurl_1.95-4.8         
[40] RSQLite_2.0             tibble_1.3.4            Matrix_1.2-12          
[43] prettyunits_1.0.2       assertthat_0.2.0        R6_2.2.2               
[46] compiler_3.4.2
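
To compare a local installation against the versions reported above, something like the following sketch could be used. `compare_versions` is a hypothetical helper written for this illustration; it relies only on `utils::packageVersion()` and reports `NA` for packages that are not installed.

```r
# Versions reported in the answer above
reported <- c(EDASeq = "2.12.0", biomaRt = "2.34.0")

# Hypothetical helper: tabulate reported vs. locally installed versions.
compare_versions <- function(reported) {
  installed <- vapply(names(reported), function(pkg) {
    # packageVersion() raises an error if the package is not installed
    tryCatch(as.character(utils::packageVersion(pkg)),
             error = function(e) NA_character_)
  }, character(1))
  data.frame(
    package   = names(reported),
    reported  = unname(reported),
    installed = unname(installed),
    stringsAsFactors = FALSE
  )
}

print(compare_versions(reported))
```

Posting the output of `sessionInfo()` alongside this table makes it easier for others to spot version differences.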