Hi,
I'm trying to use getGeneLengthAndGCContent to normalize some RNASeq data. My data was aligned to hg38 and I used featureCounts to aggregate by Ensembl gene ID (GRCh38 v. 87). I used the following call:
> hsa.len.gc <- getGeneLengthAndGCContent(id=rownames(counts.no.sex), org="hsa", mode=c("biomart"))
I received the following error:
NAs produced by integer overflow
Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW), :
  zero or more than one input sequence
Oddly, when I ran it a second time, the message changed slightly, but the outcome was the same:
Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW), :
  zero or more than one input sequence
In addition: Warning message:
In nchar(str, "bytes") * 4L : NAs produced by integer overflow
I then switched to org.db mode with the following call to see if it could map Ensembl IDs:
> hsa.len.gc <- getGeneLengthAndGCContent(id=rownames(counts.no.sex), org="hg38", mode=c("org.db"))
This completed without errors, but most of the genes came back as NA:
> summary(hsa.len.gc)
     length             gc
 Min.   :    41   Min.   :0.20
 1st Qu.:  1800   1st Qu.:0.45
 Median :  3582   Median :0.51
 Mean   :  4566   Mean   :0.51
 3rd Qu.:  6144   3rd Qu.:0.57
 Max.   :156366   Max.   :0.93
 NA's   :33901    NA's   :33901
Seems I need to get the biomart version working. I suspect the issue is related to many:1 mappings. Does anyone know how to fix this?
Really appreciate your help.
Reproducible example set
Here is a link to the smallest subset of Ensembl IDs I could get to fail: https://www.dropbox.com/s/gthmo1rb5lcrbvr/gene_ids.txt?dl=0
> tmp<-read.delim("gene_ids.txt", header=FALSE)
> head(tmp)
               V1
1 ENSG00000243477
2 ENSG00000114378
3 ENSG00000068001
4 ENSG00000114383
5 ENSG00000068028
6 ENSG00000281358
> hsa.len.gc <- getGeneLengthAndGCContent(id=tmp$V1, org="hsa", mode=c("biomart"))
Connecting to BioMart ...
Downloading sequences ...
This may take a few minutes ...
Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW), :
  zero or more than one input sequence
It's hard to say what's going on without knowing what your row names are.
Can you please provide an example so that we can reproduce and diagnose the problem? For instance, would you be able to share the row names that you're using? How many are there? Is the error still there if you apply the function to only the first 10 genes? 100? 1000?
Please share the smallest possible reproducible example that produces the error.
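One way to narrow a large ID list down to a small failing subset is to halve it repeatedly and keep whichever half still errors. The sketch below is only an illustration of that idea: shrink_failing_ids, call_fails, mock_lookup and the "ENSG_OK"/"ENSG_BAD" IDs are all made-up names, and the mock stands in for the real lookup so the example runs offline.

```r
# Hypothetical helper: repeatedly halve a vector of IDs, keeping a half that
# still triggers an error, until neither half fails on its own.
shrink_failing_ids <- function(ids, fails) {
  stopifnot(fails(ids))            # the full set must fail to begin with
  while (length(ids) > 1) {
    mid    <- length(ids) %/% 2
    first  <- ids[seq_len(mid)]
    second <- ids[(mid + 1):length(ids)]
    if (fails(first)) {
      ids <- first
    } else if (fails(second)) {
      ids <- second
    } else {
      break                        # the failure needs IDs from both halves
    }
  }
  ids
}

# Wrap any call so that an error becomes TRUE and success becomes FALSE.
call_fails <- function(f) {
  function(ids) inherits(try(f(ids), silent = TRUE), "try-error")
}

# Stand-in for the real call, used here only so the sketch runs offline:
# it errors whenever a made-up "bad" ID is present.
mock_lookup <- function(ids) {
  if ("ENSG_BAD" %in% ids) stop("zero or more than one input sequence")
  invisible(ids)
}

ids <- c(paste0("ENSG_OK", 1:20), "ENSG_BAD", paste0("ENSG_OK", 21:30))
culprit <- shrink_failing_ids(ids, call_fails(mock_lookup))
print(culprit)
```

To try this on real data, one would presumably replace mock_lookup with a wrapper such as function(x) getGeneLengthAndGCContent(id=x, org="hsa", mode="biomart"). Each probe then queries BioMart, so it may be slow, and if the error only appears when certain IDs are combined, the loop stops early and returns a larger subset.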
@daviderisso, I apologize for the slow response. I've been trying to generate a *small* reproducible example. So far, the smallest group I've been able to find is 5000 Ensembl gene IDs. Let me see if I can narrow it down more.
@daviderisso, I updated the post to include the smallest subset (with code) I could get to fail. Will that work?