Hi everyone,
I am trying to get the genomic ranges of all 5' UTRs in Arabidopsis using GenomicFeatures, and I have two questions.
1. Some genes that have a 5' UTR seem to be missing from my results below. For example, AT1G01020 and AT1G01030 both have a 5' UTR, but they do not appear in the list. Does anyone know why this might happen?
2. Is there a way to get the 5' UTRs from all gene models of a given gene, grouped by gene rather than by individual gene model (transcript)? For exons we can use exonsBy(txdb, by="gene"). Is there anything similar for 5' UTRs?
Many thanks!
- Polly
# The GFF file was downloaded from ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3
> txdb <- makeTranscriptDbFromGFF("TAIR10_GFF3_genes.gff", format="gff")
> fiveUTR <- fiveUTRsByTranscript(txdb, use.names=TRUE)
> head(fiveUTR)
GRangesList object of length 6:
$AT1G01010.1
GRanges object with 1 range and 3 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank
<Rle> <IRanges> <Rle> | <integer> <character> <integer>
[1] Chr1 [3631, 3759] + | 1 <NA> 1
$AT1G01040.1
GRanges object with 1 range and 3 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank
[1] Chr1 [23146, 23518] + | 7 <NA> 1
$AT1G01040.2
GRanges object with 1 range and 3 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank
[1] Chr1 [23416, 23518] + | 8 <NA> 1
...
<3 more elements>
-------
seqinfo: 7 sequences (1 circular) from an unspecified genome; no seqlengths
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] rtracklayer_1.26.1 GenomicFeatures_1.18.0 AnnotationDbi_1.28.0
[4] Biobase_2.26.0 GenomicAlignments_1.2.0 Rsamtools_1.18.0
[7] Biostrings_2.34.0 XVector_0.6.0 GenomicRanges_1.18.1
[10] GenomeInfoDb_1.2.0 IRanges_2.0.0 S4Vectors_0.4.0
[13] BiocGenerics_0.12.0
loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.4 BBmisc_1.7 BiocParallel_1.0.0
[5] biomaRt_2.22.0 bitops_1.0-6 brew_1.0-6 checkmate_1.5.0
[9] codetools_0.2-8 DBI_0.3.1 digest_0.6.4 fail_1.2
[13] foreach_1.4.2 iterators_1.0.7 RCurl_1.95-4.3 RSQLite_0.11.4
[17] sendmailR_1.2-1 stringr_0.6.2 tools_3.1.1 XML_3.98-1.1
[21] zlibbioc_1.12.0
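
Two things that may help with the questions above, offered as a minimal sketch rather than a definitive answer. First, head() prints only the first six elements, so a gene that is not shown there is not necessarily absent. Looking it up by name is a more direct check. Second, a per-transcript list can be regrouped by gene by hand. The sketch assumes that transcript names follow the TAIR "<gene>.<n>" pattern seen in the output, and it uses a small hand-built GRangesList (three of the ranges printed above) as a stand-in for the real fiveUTR object. The variable names gene_of_tx, fiveUTR_by_gene and fiveUTR_by_gene_merged are made up for illustration.

```r
library(GenomicRanges)

# Toy stand-in for the result of fiveUTRsByTranscript(txdb, use.names=TRUE),
# built from ranges shown in the output above (assumption: the real object
# has the same shape, one GRanges per transcript).
fiveUTR <- GRangesList(
  AT1G01010.1 = GRanges("Chr1", IRanges(3631, 3759), strand = "+"),
  AT1G01040.1 = GRanges("Chr1", IRanges(23146, 23518), strand = "+"),
  AT1G01040.2 = GRanges("Chr1", IRanges(23416, 23518), strand = "+")
)

# Derive a gene ID for each transcript by dropping the trailing ".<n>".
# Assumption: names follow the TAIR "<gene>.<n>" convention.
gene_of_tx <- sub("\\.[0-9]+$", "", names(fiveUTR))

# Question 1: check by name instead of relying on head().
# On the real object this tells you whether the gene has any 5' UTR entry.
"AT1G01020" %in% gene_of_tx
"AT1G01040" %in% gene_of_tx

# Question 2: flatten to one GRanges, then split again by gene ID.
utr_flat <- unlist(fiveUTR, use.names = FALSE)
gene_per_range <- rep(gene_of_tx, lengths(fiveUTR))
fiveUTR_by_gene <- split(utr_flat, gene_per_range)

# Optionally merge overlapping 5' UTR ranges within each gene.
fiveUTR_by_gene_merged <- reduce(fiveUTR_by_gene)

fiveUTR_by_gene_merged[["AT1G01040"]]
```

With the toy data, the two AT1G01040 transcripts end up in one list element, and reduce() merges their overlapping ranges into a single range. Whether merged or unmerged ranges are more useful depends on the downstream analysis.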