On 04/15/2014 10:53 AM, Maintainer wrote:
> Dear Bioconductor maintainers,
>
> I still have a problem using makeTranscriptDbFromGFF:
>
> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf",
> format="gtf")
> extracting transcript information
> Estimating transcript ranges.
> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
> number of items to replace is not a multiple of replacement
>
If I
options(error=recover)
then
> xx = makeTranscriptDbFromGFF(file="mm9_test_noSpace.gtf",
format="gtf")
extracting transcript information
Estimating transcript ranges.
Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
number of items to replace is not a multiple of replacement length
Enter a frame number, or 0 to exit
1: makeTranscriptDbFromGFF(file = "mm9_test_noSpace.gtf", format =
"gtf")
2: .prepareGTFTables(gff, exonRankAttributeName)
3: .prepareGTFtranscripts(data)
4: .deduceTranscriptsFromGTF(transcripts)
Selection: 4
Called from: .deduceTranscriptsFromGTF(transcripts)
Browse[1]>
then look around a little, e.g.,
Browse[1]> .deduceTranscriptsFromGTF
## ...
I spot the offending line
for (i in seq_len(length(trns))) {
res[i, ] <- .deduceTranscriptRangeData(subs[[i]])
and look at subs[[i]] and .deduceTranscriptRangeData(subs[[i]])
Browse[1]> subs[[i]]
seqnames start end strand type gene_id transcript_id
exon_rank
47369 chr2 36245431 36245515 + exon Mir684-1 Mir684-1
NA
147043 chr5 23763346 23763430 + exon Mir684-1 Mir684-1
NA
Browse[1]> .deduceTranscriptRangeData(subs[[i]])
[1] "Mir684-1" "chr2" "chr5" "+" "23763346" "36245515"
"Mir684-1"
I guess the problem is that the exons are coming from two separate
chromosomes,
and .deduceTranscriptRangeData does not handle that case successfully.
Hopefully the code will be made more robust / efficient, but in the
mean time
you could edit your file to remove the troublesome Mir684-1.
Martin
> I uploaded the file here:
>
https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf
>
> I tried to reduce the file further, in order to get a smaller
example file and to check which lines cause the problem, but:
> mm9_test_noSpace.gtf is 200,000 lines long and it results in an
error.
> When I am using the first 100,000 (or, last 100,000 lines,
respectively) it works!
>
> head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
> --> works!
> tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
> --> works!
>
> I have no idea what is going on. Also, the .gtf file is from UCSC.
>
> Do you have any idea what the problem is?
> Thank you so much!
> Katja
>
>
> ----- Forwarded Message -----
> From: "Michael Lawrence" <lawrence.michael at="" gene.com="">
> To: "Katja Hebestreit" <katjah at="" stanford.edu="">
> Cc: "Michael Lawrence" <lawrence.michael at="" gene.com="">
> Sent: Monday, April 14, 2014 7:37:23 PM
> Subject: Re: [BioC] GenomicFeatures: Problem with
makeTranscriptDbFromGFF
>
> Sorry, I hadn't tested the makeTranscriptDbFromGFF function. I
didn't write
> that one, so the maintainers of GenomicFeatures will need to help
now.
>
>
> On Mon, Apr 14, 2014 at 5:31 PM, Katja Hebestreit <katjah at="" stanford.edu="">wrote:
>
>> Great! Thank you so much, Micheal!
>>
>> Unfortunately, I still have problems with this file:
>>
>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf",
>> format="gtf")
>> extracting transcript information
>> Estimating transcript ranges.
>> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
>> number of items to replace is not a multiple of replacement
>>
>> To determine the gene lengths I wanted to use the exact .gtf file
we used
>> for counting the reads per gene some time ago.
>>
>>
>> ----- Original Message -----
>> From: "Michael Lawrence" <lawrence.michael at="" gene.com="">
>> To: "Katja Hebestreit" <katjah at="" stanford.edu="">
>> Cc: "Vincent Carey" <stvjc at="" channing.harvard.edu="">, "Michael
Lawrence" <
>> lawrence.michael at gene.com>, "Rsamtools Maintainer" <
>> maintainer at bioconductor.org>, bioconductor at r-project.org
>> Sent: Monday, April 14, 2014 2:55:06 PM
>> Subject: Re: [BioC] GenomicFeatures: Problem with
makeTranscriptDbFromGFF
>>
>> I've made rtracklayer 1.25.1 ignore whitespace after the last
attribute.
>> After that change, both the truncated and full file parse
successfully.
>> Btw, which genes are these? The UCSC genes are already available as
a TxDb
>> package.
>>
>>
>> On Mon, Apr 14, 2014 at 2:36 PM, Katja Hebestreit <katjah at="" stanford.edu="">>> wrote:
>>
>>> Okay, I removed the whitspaces with
>>> sed 's/.$//' mm9_test.gtf > mm9_test_noSpace.gtf
>>>
>>> Here is the file:
>>>
https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf
>>>
>>> But still, I get an error:
>>>
>>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf",
>>> format="gtf")
>>> extracting transcript information
>>> Estimating transcript ranges.
>>> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
>>> number of items to replace is not a multiple of replacement
>>>
>>>
>>> I tried to reduce the file further, in order to get a smaller
example
>> file
>>> and to check which lines cause the problem, but:
>>>
>>> mm9_test_noSpace.gtf is 200,000 lines long and it results in an
error.
>>> When I am using the first 100,000 (or, last 100,000 lines,
respectively)
>> it
>>> works!
>>>
>>> head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
>>> --> works!
>>> tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
>>> --> works!
>>>
>>> I have no idea what is going on. Also, the .gtf file is from UCSC.
>>>
>>> Thanks again for helping with that annoying problem!
>>> Katja
>>>
>>>
>>> ----- Original Message -----
>>> From: "Vincent Carey" <stvjc at="" channing.harvard.ed="">>> To: "Katja Hebestreit" <katjah at="" stanford.edu="">
>>> Cc: "Michael Lawrence" <lawrence.michael at="" gene.com="">, "Rsamtools
>>> Maintainer" <maintainer at="" bioconductor.org="">, bioconductor at
r-project.org
>>> Sent: Monday, April 14, 2014 11:45:30 AM
>>> Subject: Re: [BioC] GenomicFeatures: Problem with
makeTranscriptDbFromGFF
>>>
>>> remove the trailing whitespace at the end of every line
>>>
>>>
>>> On Mon, Apr 14, 2014 at 2:24 PM, Katja Hebestreit <katjah at="" stanford.edu="">>>> wrote:
>>>
>>>> You can download the file here:
>>>>
>>>>
https://www.dropbox.com/s/04nck83jq6r91bc/mm9_test.gtf
>>>>
>>>> Using file I get the error:
>>>>
>>>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test.gtf",
format="gtf")
>>>> Error in .parse_attrCol(attrCol, file, colnames) :
>>>> Some attributes do not conform to 'tag value' format
>>>>
>>>> Thank you so much for helping!!
>>>> Katja
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Michael Lawrence" <lawrence.michael at="" gene.com="">
>>>> To: "Katja Hebestreit" <katjah at="" stanford.edu="">
>>>> Cc: "Michael Lawrence" <lawrence.michael at="" gene.com="">,
>>>> bioconductor at r-project.org, "Rsamtools Maintainer" <
>>>> maintainer at bioconductor.org>
>>>> Sent: Monday, April 14, 2014 7:27:26 AM
>>>> Subject: Re: [BioC] GenomicFeatures: Problem with
>> makeTranscriptDbFromGFF
>>>>
>>>> Well, I copied the text and replaced the spaces with tabs as
>> appropriate
>>>> and everything seemed to work fine, so you might to attach that
>> fragment
>>> of
>>>> the file, just to be sure it isn't a formatting issue.
>>>>
>>>> Does import("file.gtf") work for you? If so, that should be good
enough
>>> for
>>>> your use case.
>>>>
>>>> Michael
>>>>
>>>>
>>>> On Sun, Apr 13, 2014 at 10:14 PM, Katja Hebestreit <
>> katjah at stanford.edu
>>>>> wrote:
>>>>
>>>>> Actually, the error was not reproducible with the lines I
attached.
>> But
>>>> it
>>>>> is reproducible with those lines (four additional lines):
>>>>>
>>>>> chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000
>>> -
>>>>> . gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat CDS 3206106 3207049 0.000000
-
>>> 2
>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat exon 3204563 3207049 0.000000
-
>>> .
>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat CDS 3411783 3411982 0.000000
-
>>> 1
>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat exon 3411783 3411982 0.000000
-
>>> .
>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat CDS 3660633 3661429 0.000000
-
>>> 0
>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat start_codon 3661427 3661429 0.000000
>>> -
>>>>> . gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat exon 3660633 3661579 0.000000
-
>>> .
>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat stop_codon 4283062 4283064 0.000000
>>> -
>>>>> . gene_id "Rp1"; transcript_id "Rp1";
>>>>> chr1 mm9_refFlat CDS 4283065 4283093 0.000000
-
>>> 2
>>>>> gene_id "Rp1"; transcript_id "Rp1";
>>>>>
>>>>> Let me know if you like to get the entire file.
>>>>>
>>>>> Thank you!!
>>>>> Katja
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Michael Lawrence" <lawrence.michael at="" gene.com="">
>>>>> To: "Katja Hebestreit" <katjah at="" stanford.edu="">
>>>>> Cc: bioconductor at r-project.org, "Rsamtools Maintainer" <
>>>>> maintainer at bioconductor.org>
>>>>> Sent: Sunday, April 13, 2014 10:02:13 PM
>>>>> Subject: Re: [BioC] GenomicFeatures: Problem with
>>> makeTranscriptDbFromGFF
>>>>>
>>>>> On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit <
>> katjah at stanford.edu
>>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I get an error when I try to import my gff file:
>>>>>>
>>>>>> txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf")
>>>>>>
>>>>>> Error in .parse_attrCol(attrCol, file, colnames) :
>>>>>> Some attributes do not conform to 'tag value' format
>>>>>>
>>>>>> This is how my file looks like:
>>>>>>
>>>>>> chr1 mm9_refFlat stop_codon 3206103 3206105
0.000000
>>>> -
>>>>>> . gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1 mm9_refFlat CDS 3206106 3207049 0.000000
-
>>>> 2
>>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1 mm9_refFlat exon 3204563 3207049 0.000000
-
>>>> .
>>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1 mm9_refFlat CDS 3411783 3411982 0.000000
-
>>>> 1
>>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1 mm9_refFlat exon 3411783 3411982 0.000000
-
>>>> .
>>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1 mm9_refFlat CDS 3660633 3661429 0.000000
-
>>>> 0
>>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>>
>>>>>> I have the feeling that this has something to do with the
missing
>>> exon
>>>>>> rank information in my file. Is that true? Is there a way to
import
>>>> this
>>>>>> file? All I want to do is to determine the gene lengths.
>>>>>>
>>>>>
>>>>> It is most likely as the error says: some of your attributes are
>>>> malformed.
>>>>> Is that the entire file listed above, or is there more? If you
could
>>> get
>>>> me
>>>>> the file somehow I could diagnose the issue.
>>>>>
>>>>>
>>>>>>
>>>>>> Could anyone help? That would be awesome!
>>>>>> Cheers,
>>>>>> Katja
>>>>>>
>>>>>>
>>>>>> sessionInfo()
>>>>>> R version 3.1.0 (2014-04-10)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
>>>>>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
>>>>>> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
>>>>>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel stats graphics grDevices utils datasets
>>> methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19
Biobase_2.23.6
>>>>>> [4] GenomicRanges_1.16.0 GenomeInfoDb_0.99.32
IRanges_1.21.45
>>>>>> [7] BiocGenerics_0.9.3 BiocInstaller_1.14.1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] BatchJobs_1.2 BBmisc_1.5
>>> BiocParallel_0.6.0
>>>>>> [4] biomaRt_2.20.0 Biostrings_2.32.0
bitops_1.0-6
>>>>>> [7] brew_1.0-6 BSgenome_1.32.0
>> codetools_0.2-8
>>>>>> [10] DBI_0.2-7 digest_0.6.4 fail_1.2
>>>>>> [13] foreach_1.4.2 GenomicAlignments_1.0.0
>> iterators_1.0.7
>>>>>> [16] plyr_1.8.1 Rcpp_0.11.1
RCurl_1.95-4.1
>>>>>> [19] Rsamtools_1.16.0 RSQLite_0.11.4
>>> rtracklayer_1.24.0
>>>>>> [22] sendmailR_1.1-2 stats4_3.1.0
stringr_0.6.2
>>>>>> [25] tools_3.1.0 XML_3.98-1.1
XVector_0.4.0
>>>>>> [28] zlibbioc_1.10.0
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>>
https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>>
http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>>
https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>>
http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>
>
> ____________________________________________________________________
____
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
>
https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793