Entering edit mode
Veerendra GP
▴
100
@veerendra-gp-4214
Last seen 10.4 years ago
Dear Members,
Greetings!
I am trying to create a Transcript db object, from the GTF file for
Musmusculus transcriptome data available at ensembl FTP site
(version 75). ftp://ftp.ensembl.org/pub/release-75/gtf/mus_musculus
I am using the function, "makeTranscriptDbFromGFF()" from the "
GenomicFeatures" bioconductor library.
*Command used:*
> library("GenomicFeatures")
> txdb_75 = makeTranscriptDbFromGFF (file =
'/Mus_musculus.GRCm38.75.gtf',
exonRankAttributeName='exon_number', format='gtf',
dataSource='ENSEMBL',
species='Mus musculus')
*Error message:*
extracting transcript information
Estimating transcript ranges.
Error in unique(sub$transcript_id) :
error in evaluating the argument 'x' in selecting a method for
function
'unique': Error in subs[[i]] : subscript out of bounds
I am not able to understand the above error message. However, I could
create the transcript db object successfully for the GTF file
downloaded from ensembl version 74.
ftp://ftp.ensembl.org/pub/release-74/gtf/mus_musculus/
*Session information:*
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods
[8] base
other attached packages:
[1] data.table_1.9.2 biomaRt_2.16.0
GenomicFeatures_1.12.4
[4] AnnotationDbi_1.22.6 Biobase_2.20.1 GenomicRanges_1.12.5
[7] IRanges_1.18.4 BiocGenerics_0.6.0 BiocInstaller_1.10.4
loaded via a namespace (and not attached):
[1] Biostrings_2.28.0 bitops_1.0-6 BSgenome_1.28.0
DBI_0.2-7
[5] plyr_1.8.1 Rcpp_0.11.0 RCurl_1.95-4.1
reshape2_1.2.2
[9] Rsamtools_1.12.4 RSQLite_0.11.4 rtracklayer_1.20.4
stats4_3.0.2
[13] stringr_0.6.2 tools_3.0.2 XML_3.98-1.1
zlibbioc_1.6.0
Your assistance in this regard would be very much appreciated.
Thank you in advance.
Best Regards,
Veerendra
On Wed, Feb 13, 2013 at 1:32 PM, Veerendra GP
<gpveerendra09@gmail.com>wrote:
> Dear Valerie,
>
> Thank you very much for the descriptive reply and your suggestion,
it
> indeed was a cryptic error for me but now I understand the cause.
>
> Thanks,
> Veerendra
>
> On Wed, Feb 13, 2013 at 12:19 AM, Valerie Obenchain
<vobencha@fhcrc.org>wrote:
>
>> Hello,
>>
>> Thanks for sending a reproducible example. Sorry for the cryptic
error
>> message. That is something we could improve.
>>
>> The problem is that for the UCSC data, some transcripts in a single
gene
>> are from different strands. In this case the strand Rle will have
multiple
>> runValues (i.e. runValue longer than 1).
>>
>> When we summarize strand runValues by elementLength we see the UCSC
data
>> has runValues of 1, 2, 3, 4 and 5.
>>
>> table(elementLengths(runValue(strand(GenegroupedTranscripts))))
>>>
>>
>> 1
>> 37315
>>
>>>
table(elementLengths(runValue(strand(UCSCGenegroupedTranscripts))))
>>>
>>
>> 1 2 3 4 5
>> 21686 72 1 1 1
>>
>> Here is a closer look at the entry that has an elementLength of 3:
>>
>> UCSC_elts <- elementLengths(runValue(strand(
>>> UCSCGenegroupedTranscripts)))
>>> UCSCGenegroupedTranscripts[UCSC_elts == 3]
>>>
>>
>> GRangesList of length 1:
>> $108816
>> GRanges with 5 ranges and 2 metadata columns:
>> seqnames ranges strand | tx_id
tx_name
>> <rle> <iranges> <rle> | <integer>
<character>
>> [1] chr4 [42452733, 42476567] + | 10641
uc008smu.1
>> [2] chr4 [42471253, 42471291] + | 10642
uc008smv.1
>> [3] chr4 [41902019, 41925853] - | 12216
uc008slb.1
>> [4] chr4 [41907295, 41907333] - | 12217
uc008slg.1
>> [5] chrUn_random [ 588639, 612480] + | 55359
uc012hdn.1
>>
>>
>> The strand Rle for this gene is,
>>
>> strand(UCSCGenegroupedTranscripts[UCSC_elts == 3])
>>>
>> CompressedRleList of length 1
>> $`108816`
>> factor-Rle of length 5 with 3 runs
>> Lengths: 2 2 1
>> Values : + - +
>> Levels(3): + - *
>>
>> In the case of multiple stands per list element (per gene in this
case)
>> you can't use as.integer(). It looks like you are after a vector of
a
>> single strand per gene. Depending on your use case you could remove
these
>> genes or set the strand of these genes to '*'.
>>
>>
>> Valerie
>>
>>
>>
>>
>> On 02/12/2013 08:11 AM, Veerendra GP wrote:
>>
>>> Hello Everyone!
>>>
>>> Greetings!
>>>
>>> I am using "GenomicFeatures" library to create Transcript db
objects for
>>> the mouse genome.
>>> I did it by using biomart, *makeTranscriptDbFromBiomart()* and
also
>>>
>>> UCSC, makeTranscriptDbFromUCSC()
>>> functions.
>>>
>>> As I am interested in the strand information, I tried fetching the
it
>>> using
>>> *as.integer(),* *runValue**()* & *strand()* functions. I am able
to get
>>> it
>>> from the transcriptdb object, created using
>>> *makeTranscriptDbFromBiomart()* but
>>>
>>> ending up with an error when I am trying to do the same with
>>> transcriptdb object created using makeTranscriptDbFromUCSC()
function*.*
>>>
>>>
>>> working code:
>>>
>>> mouseGNM <- makeTranscriptDbFromBiomart (biomart = "ensembl",
dataset=
>>> "mmusculus_gene_ensembl");
>>> saveDb(mouseGNM, file="MouseGNM.sqlite");
>>> library("GenomicFeatures");
>>> MouseGNM<- loadDb("MouseGNM.sqlite");
>>> GenegroupedTranscripts<- transcriptsBy(MouseGNM, by = "gene");
>>> strand.info
<-as.integer(runValue(strand(GenegroupedTranscripts)));
>>>
>>> runValue(strand(GenegroupedTranscripts)) CompressedIntegerList of
>>>> length
>>>>
>>> 37315 [["ENSMUSG00000000001"]] 2 [["ENSMUSG00000000003"]] 2
>>> [["ENSMUSG00000000028"]] 2 [["ENSMUSG00000000031"]] 2
>>> [["ENSMUSG00000000037"]] 1 [["ENSMUSG00000000049"]] 1
>>> [["ENSMUSG00000000056"]] 1 [["ENSMUSG00000000058"]] 1
>>> [["ENSMUSG00000000078"]] 1 [["ENSMUSG00000000085"]] 1 ... <37305
more
>>> elements>
>>>
>>> strand.info[1:100]
>>>>
>>> [1] 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 2 2 1
1 2 1
>>> 2 2
>>> 1 2 [38] 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 2 1 2 1 2 2 1 1 1
1 1 2
>>> 1
>>> 2 2 2 2 [75] 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 1 1 2 1 2 2 2 1 1 2 1
>>>
>>> here 2 is the antisense strand and 1 the sense strand
>>>
>>>
>>> UCSC.mouseGNM <- makeTranscriptDbFromUCSC(genome = "mm9",
tablename =
>>> "knownGene");
>>> saveDb(UCSC.mouseGNM, file="UCSCMouseGNM.sqlite");
>>> UCSCMouseGNM<- loadDb("UCSCMouseGNM.sqlite");
>>> UCSCGenegroupedTranscripts<- transcriptsBy(UCSCMouseGNM, by =
"gene");
>>> strand.info2<-
as.integer(runValue(strand(UCSCGenegroupedTranscripts)));
>>>
>>> runValue(strand(UCSCGenegroupedTranscripts))
>>>>
>>> CompressedIntegerList of length 21761
>>> [["100009600"]] 2
>>> [["100009609"]] 2
>>> [["100009614"]] 1
>>> [["100012"]] 2
>>> [["100017"]] 2
>>> [["100019"]] 1
>>> [["100033459"]] 1
>>> [["100034251"]] 1
>>> [["100034361"]] 2
>>> [["100034684"]] 2
>>> ...
>>> <21751 more elements>
>>>
>>> *ERROR:*
>>>
>>> strand.info2<- as.integer(runValue(strand(
>>>> UCSCGenegroupedTranscripts)));
>>>>
>>> Error in as.vector(x, mode = "integer") : coercing an AtomicList
object
>>> to
>>> an atomic vector is supported only for objects with top-level
elements of
>>> length <= 1
>>>
>>> I am not able to understand this error message, could anyone let
me know
>>> what is going wrong in the code.
>>> Here is the session information.
>>>
>>> sessionInfo() R version 2.15.2 (2012-10-26) Platform:
>>>> x86_64-pc-linux-gnu
>>>>
>>> (64-bit) locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C [3]
>>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_AU.UTF-8 [5]
LC_MONETARY=en_US.UTF-8
>>> LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C
>>> LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> attached
>>> base packages: [1] stats graphics grDevices utils datasets methods
base
>>> other attached packages: [1] GenomicFeatures_1.10.1
AnnotationDbi_1.20.3
>>> Biobase_2.18.0 [4] GenomicRanges_1.10.5 IRanges_1.16.4
BiocGenerics_0.4.0
>>> loaded via a namespace (and not attached): [1] biomaRt_2.14.0
>>> Biostrings_2.26.2 bitops_1.0-5 BSgenome_1.26.1 [5] DBI_0.2-5
>>> parallel_2.15.2 RCurl_1.95-3 Rsamtools_1.10.2 [9] RSQLite_0.11.2
>>> rtracklayer_1.18.2 stats4_2.15.2 tools_2.15.2 [13] XML_3.95-0.1
>>> zlibbioc_1.4.0
>>>
>>>
>>> Thank you in advance.
>>> Regards,
>>>
>>> Veerendra.
>>>
>>>
>>>
>>>
>>>
>>
>
>
> --
> GP.Veerendra
> PhD Student (Bioinformatics)
> Stazione Zoologica Anton Dohrn
> Naples, Italy
> M:+393663915221
>
--
Veerendra Gadekar
PhD Student (Bioinformatics)
Animal Physiology and Evolution
Stazione Zoologica Anton Dohrn
Villa Comunale, 80121 Naples - Italy
M:+393663915221
[[alternative HTML version deleted]]