rtracklayer importing gtf files

Guest User ★ 13k

My name is Sam, a grad student at the University of Missouri, and I am having trouble importing my .gtf file using the rtracklayer function import(). When I look at the file in a text editor it seems to have the 9 columns specified by the GTF format, but on import I get 9 columns labeled "X1.", "X2", ..., "X9." and nearly all of the entries are NA. In X1. there are 817 "1\" entries (of 474351), in X2. there are 534 "2\" entries, and so on.

The .gtf file was downloaded from http://tophat.cbcb.umd.edu/igenomes.html (Arabidopsis NCBI TAIR10 release); I am using the genes.gtf file obtained after extracting the .tar.gz.

I import my file with:

    myGTF <- "path/to/file.gtf"
    newGTF <- import(myGTF, asRangedData = FALSE)

The way I read the import.gff manual, the .gtf extension tells the function how to parse the file without my specifying the version parameter.

I am trying to follow the summarizeOverlaps() method of generating read counts from the GenomicRanges package for differential expression using DESeq.

Does anyone know what has happened or, more generally, what I can do to import my file?

-- output of sessionInfo():

    > sessionInfo()
    R version 2.15.0 (2012-03-30)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=C                 LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] DESeq_1.8.3         locfit_1.5-8     Biobase_2.16.0
    [4] rtracklayer_1.16.3  Rsamtools_1.8.6  Biostrings_2.24.1
    [7] GenomicRanges_1.8.6 IRanges_1.14.3   BiocGenerics_0.2.0

    loaded via a namespace (and not attached):
     [1] annotate_1.34.0    AnnotationDbi_1.18.0 bitops_1.0-4.1
     [4] BSgenome_1.24.0    DBI_0.2-5            genefilter_1.38.0
     [7] geneplotter_1.34.0 grid_2.15.0          lattice_0.20-6
    [10] RColorBrewer_1.0-5 RCurl_1.91-1         RSQLite_0.11.1
    [13] splines_2.15.0     stats4_2.15.0        survival_2.36-14
    [16] tools_2.15.0       XML_3.9-4            xtable_1.7-0
    [19] zlibbioc_1.2.0

-- Sent via the guest posting facility at bioconductor.org.

rtracklayer DESeq GenomicRanges • 9.2k views

updated 11.9 years ago by Michael Lawrence ★ 11k • written 11.9 years ago by Guest User ★ 13k

Michael Lawrence ★ 11k

Thanks for the report. The issue is that there are semi-colons embedded in the gene names. This is perfectly valid, but rtracklayer is not smart enough to check for quote escapes. I'll look into fixing this.

Michael
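To illustrate the kind of failure described here, the following is a minimal sketch in base R of attribute parsing that respects double quotes. It is not rtracklayer's parser: `parse_gtf_attributes` is a hypothetical helper, and the attribute string is a made-up example with a semicolon placed inside the quoted gene name.

```r
# Hypothetical helper (not rtracklayer code): split a GTF attribute string
# into key/value pairs without breaking on semicolons inside double quotes.
parse_gtf_attributes <- function(attr_string) {
  # A key, some whitespace, then either a quoted value or a bare token.
  pattern <- '([A-Za-z_][A-Za-z0-9_]*)\\s+("[^"]*"|[^;"\\s]+)'
  hits <- regmatches(attr_string,
                     gregexpr(pattern, attr_string, perl = TRUE))[[1]]
  keys <- sub('\\s.*$', '', hits, perl = TRUE)        # text before first space
  values <- sub('^\\S+\\s+', '', hits, perl = TRUE)   # text after the key
  values <- gsub('^"|"$', '', values)                 # drop surrounding quotes
  names(values) <- keys
  values
}

# Made-up attribute column; the embedded ";" is for illustration only.
attrs <- 'gene_id "ATMG00010"; gene_name "ORF153A;B";'

# Splitting on every semicolon cuts the quoted gene name in two.
naive <- strsplit(attrs, ";", fixed = TRUE)[[1]]

# The quote-aware version keeps the gene name in one piece.
parsed <- parse_gtf_attributes(attrs)
```

Here `naive` has three pieces for two attributes, while `parsed` has exactly two named values. A real parser would also need to handle escaped quotes and repeated keys, which this sketch ignores.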

written 11.9 years ago by Michael Lawrence ★ 11k

Michael,

Are the gene names you are referring to the ones in the last column?

gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; ...

What is an appropriate delimiter? Can I just use a perl substitution for the new delimiter?

Thanks

--
Sam McInturf
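As a rough sketch of the substitution idea raised in this reply, the same thing can be done in base R rather than perl. Everything below is an assumption for illustration: `mask_quoted_semicolons` is a hypothetical helper, the replacement character is arbitrary, and the record (including its coordinates) is made up. Note that this alters the gene names themselves, so it is only a stopgap.

```r
# Hypothetical workaround: replace semicolons that fall inside double-quoted
# values, so that a parser splitting on ";" sees one separator per attribute.
mask_quoted_semicolons <- function(lines, replacement = ",") {
  quoted <- gregexpr('"[^"]*"', lines)          # locate each quoted value
  regmatches(lines, quoted) <- lapply(
    regmatches(lines, quoted),
    function(q) gsub(";", replacement, q, fixed = TRUE)
  )
  lines
}

# Made-up 9-column record; the embedded ";" in gene_name is illustrative.
gtf_line <- paste(
  "Mt", "TAIR10", "exon", "273", "734", ".", "-", ".",
  'gene_id "ATMG00010"; gene_name "ORF153A;B";',
  sep = "\t"
)
fixed_line <- mask_quoted_semicolons(gtf_line)
```

For a whole file, the lines could be read with `readLines()`, passed through the helper, and written to a new file with `writeLines()` before calling import() on the copy. Semicolons outside quotes, which separate the attributes, are left untouched.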

written 11.9 years ago by Sam McInturf ▴ 300

I just need to fix rtracklayer so that it handles the quoting. I'll work on it now.

Michael

written 11.9 years ago by Michael Lawrence ★ 11k

Should be resolved as of 1.19.12. Had to rewrite the attribute parser though, so there could be breakage elsewhere.

Michael
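One way to check whether an installation includes this change is to compare version numbers. The sketch below uses only base R; `has_quoting_fix` is a hypothetical helper name, and the only facts taken from the thread are the fixed version (1.19.12) and the version in the original sessionInfo() (1.16.3).

```r
# Hypothetical helper: is a given rtracklayer version at least the one
# reported in this thread as containing the fix?
has_quoting_fix <- function(installed, fixed_in = "1.19.12") {
  package_version(installed) >= package_version(fixed_in)
}

# Version from the sessionInfo() in the question: predates the fix.
old_ok <- has_quoting_fix("1.16.3")   # FALSE

# Check the locally installed copy, if there is one.
if (requireNamespace("rtracklayer", quietly = TRUE)) {
  print(has_quoting_fix(as.character(utils::packageVersion("rtracklayer"))))
}
```

`package_version()` compares versions component by component, so "1.19.12" correctly sorts after "1.19.2", which a plain string comparison would get wrong.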

written 11.9 years ago by Michael Lawrence ★ 11k
