Question

Write multiple UTRs for a gene to a fasta file

0

Entering edit mode

Jake ▴ 90

@jake-7236

Last seen 2.6 years ago

United States

I have a GRangesList with multiple transcripts per gene (see example below).

GRangesList object of length 9216:
$ENSMUSG00000000001.4 
GRanges object with 2 ranges and 4 metadata columns:
                       seqnames                 ranges strand |   exon_id            exon_name exon_rank               geneId
                          <Rle>              <IRanges>  <Rle> | <integer>          <character> <integer>          <character>
  ENSMUST00000000001.4     chr3 [108109403, 108109421]      - |     57602 ENSMUSE00000404895.1         8 ENSMUSG00000000001.4
  ENSMUST00000000001.4     chr3 [108107280, 108109316]      - |     57601 ENSMUSE00000363317.2         9 ENSMUSG00000000001.4

$ENSMUSG00000000028.14 
GRanges object with 4 ranges and 4 metadata columns:
                        seqnames               ranges strand | exon_id            exon_name exon_rank                geneId
  ENSMUST00000000028.13    chr16 [18781896, 18781897]      - |  245213 ENSMUSE00000645072.1        19 ENSMUSG00000000028.14
  ENSMUST00000000028.13    chr16 [18780447, 18780573]      - |  245211 ENSMUSE00000788192.1        20 ENSMUSG00000000028.14
   ENSMUST00000096990.9    chr16 [18781896, 18781897]      - |  245213 ENSMUSE00000645072.1        17 ENSMUSG00000000028.14
   ENSMUST00000096990.9    chr16 [18780453, 18780573]      - |  245212 ENSMUSE00000628024.1        18 ENSMUSG00000000028.14

I'd like to write the nucleotide sequences for all transcripts of a given gene to a fasta file grouped by gene id. Ex)

>ENSMUSG00000000028.14
[transcript 28 sequence][transcript 96990]

However, extractTranscriptSeqs requires that the exons be ordered by genomic location which would disrupt the sequence of each transcript for overlapping multiexon transcripts. I tried the more generic getSeqs, but it inserts transcript ids into the text and names rather than grouping everything just by gene id. Is there a way to do this? Thanks

genomicranges genomicfeatures • 1.5k views

ADD COMMENT • link updated 9.2 years ago by Michael Lawrence ★ 11k • written 9.2 years ago by Jake ▴ 90

score 2 · Accepted Answer · 2016-02-04

2

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.3 years ago

United States

Probably easiest to form a GRangesList with exons grouped by transcript and extract those sequences. Then concatenate the sequences from the same gene. This would be easiest / most efficient if the transcripts were sorted by gene. Then it would be something like this?

First, munge the transcripts and gene IDs:

library(Homo.sapiens)
tx <- exonsBy(Homo.sapiens, columns="GENEID", outerMcols=TRUE)
mcols(tx)$GENEID <- drop(mcols(tx)$GENEID)
tx <- tx[!is.na(mcols(tx)$GENEID)]
tx <- tx[order(mcols(tx)$GENEID)]

Extract the transcripts:

txseqs <- extractTranscriptSeqs(Hsapiens, tx)

And this is the key aggregation and regrouping to concatenate the sequences:

gene.widths <- sum(splitAsList(sum(width(tx)), mcols(tx)$GENEID))
relist(unlist(txseqs), PartitioningByWidth(gene.widths))

ADD COMMENT • link 9.2 years ago Michael Lawrence ★ 11k

0

Entering edit mode

This looks great. However, when I tried to run it on my own data, I couldn't access the mcols()$geneId in order to sort. Here is my full code.

library('GenomicFeatures')
library('hash')
library('BSgenome.Mmusculus.UCSC.mm10')

# Load genome and annotation files
genome <- BSgenome.Mmusculus.UCSC.mm10
gencode <- loadDb('../gencode_m8_basic.sqlite')

# Get transcript <-> gene annotation
trans <- transcriptsBy(gencode, by='gene')
transcript_gene_hash <- hash(mcols(unlist(trans))$tx_name, names(unlist(trans)))

# Get 5' UTRs and 3' UTRs by transcript
fiveutrByTx <- fiveUTRsByTranscript(gencode, use.names=T)

# Get UTR of each transcript, concatenate together, group by gene

# Get all UTRs, GENEID in mcols
fiveutr <- unlist(fiveutrByTx)
mcols(fiveutr)$geneId <- hash::values(transcript_gene_hash, keys=names(fiveutr))
fiveutr2 <- relist(fiveutr, fiveutrByTx)

problem with mcols:

# Order by GENEID
> fiveutr2
GRangesList object of length 36036:
$ENSMUST00000027036.10
GRanges object with 1 range and 4 metadata columns:
                        seqnames             ranges strand |   exon_id            exon_name exon_rank                geneId
                           <Rle>          <IRanges>  <Rle> | <integer>          <character> <integer>           <character>
  ENSMUST00000027036.10     chr1 [4807823, 4807913]      + |        17 ENSMUSE00000792454.1         1 ENSMUSG00000025903.14

$ENSMUST00000150971.7
GRanges object with 1 range and 4 metadata columns:
                       seqnames             ranges strand | exon_id            exon_name exon_rank                geneId
  ENSMUST00000150971.7     chr1 [4807830, 4807913]      + |      18 ENSMUSE00000737251.2         1 ENSMUSG00000025903.14

$ENSMUST00000155020.1
GRanges object with 1 range and 4 metadata columns:
                       seqnames             ranges strand | exon_id            exon_name exon_rank               geneId
  ENSMUST00000155020.1     chr1 [4807892, 4807913]      + |      19 ENSMUSE00000730558.1         1 ENSMUSG00000104217.1

...
<36033 more elements>
-------
seqinfo: 22 sequences (1 circular) from an unspecified genome; no seqlengths

test <- fiveutr[order(mcols(fiveutr2)$geneId)]

Error in fiveutr[order(mcols(fiveutr2)$geneId)] :
  error in evaluating the argument 'i' in selecting a method for function '[': Error in .Method(..., na.last = na.last, decreasing = decreasing) :
  argument 1 is not a vector

> mcols(fiveutr2)$geneId
NULL

sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin14.5.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocInstaller_1.20.1                    BSgenome.Hsapiens.UCSC.hg38_1.4.1       Homo.sapiens_1.3.1                   
[4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 org.Hs.eg.db_3.2.3                      GO.db_3.2.2                          
[7] RSQLite_1.0.0                           DBI_0.3.1                               OrganismDbi_1.12.1                   
[10] BSgenome.Mmusculus.UCSC.mm10_1.4.0      BSgenome_1.38.0                         rtracklayer_1.30.1                   
[13] Biostrings_2.38.3                       XVector_0.10.0                          hash_2.2.6                           
[16] GenomicFeatures_1.22.12                 AnnotationDbi_1.32.3                    Biobase_2.30.0                       
[19] GenomicRanges_1.22.4                    GenomeInfoDb_1.6.3                      IRanges_2.4.6                        
[22] S4Vectors_0.8.11                        BiocGenerics_0.16.1                    

loaded via a namespace (and not attached):
[1] graph_1.48.0               zlibbioc_1.16.0            GenomicAlignments_1.6.3    BiocParallel_1.4.3      
[5] tools_3.2.3                SummarizedExperiment_1.0.2 lambda.r_1.1.7             futile.logger_1.4.1     
[9] RBGL_1.46.0                futile.options_1.0.0       bitops_1.0-6               RCurl_1.95-4.7          
[13] biomaRt_2.26.1             Rsamtools_1.22.0           XML_3.98-1.3

ADD REPLY • link 9.2 years ago Jake ▴ 90

0

Entering edit mode

Yea, if you don't have an OrganismDbi package available, then things get annoying, since the TxDb methods for exonsBy() etc are frustratingly limited.

Maybe something like this will work?

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(BSgenome.Hsapiens.UCSC.hg19)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

tx <- transcripts(txdb, columns=c("tx_id", "gene_id"))
exons <- exonsBy(txdb)
mcols(exons)$gene_id <- tx$gene_id[match(names(exons), tx$tx_id)]
cds <- cdsBy(txdb)
utrs <- psetdiff(exons[names(cds)], cds)

mcols(utrs)$gene_id <- drop(mcols(utrs)$gene_id)
utrs <- utrs[!is.na(mcols(utrs)$gene_id)]
utrseqs <- extractTranscriptSeqs(Hsapiens, utrs)

resplit <- function(x, f) {
    widths <- rowsum(lengths(x), f)
    widths <- setNames(as.integer(widths), rownames(widths))
    relist(unlist(x[order(f)], use.names=FALSE), PartitioningByWidth(widths))
}

resplit(utrseqs, mcols(utrs)$gene_id)

ADD REPLY • link 9.2 years ago Michael Lawrence ★ 11k

1

Entering edit mode

If you want to keep the 3' and 5' UTRs separate, then maybe this code (3' UTR only, repeat for 5'):

tx <- transcripts(txdb, columns=c("tx_name", "gene_id"))
utrs <- threeUTRsByTranscript(txdb, use.names=TRUE)
mcols(utrs)$gene_id <- drop(tx$gene_id[match(names(utrs), tx$tx_name)])
utrseqs <- extractTranscriptSeqs(Hsapiens, utrs)
resplit(utrseqs, mcols(utrs)$gene_id) # resplit defined in previous answer

ADD REPLY • link 9.2 years ago Michael Lawrence ★ 11k

1

Entering edit mode

Yep, that will do it.

A few more things:

Instead of using resplit(), you can just do unstrsplit(split(utrseqs, mcols(utrs)$gene_id)) for concatenating the UTRs sequences from the same gene. There is one difference though: resplit() will concatenate together all the transcripts that are not linked to a gene (orphan transcripts) but the unstrsplit(split()) solution will drop them:

sum(is.na(mcols(utrs)$gene_id))  # number of orphan transcripts
## [1] 988

answer1 <- resplit(utrseqs, mcols(utrs)$gene_id)
## Warning message:
## In rowsum.default(lengths(x), f) : missing values for 'group'

answer2 <- unstrsplit(split(utrseqs, mcols(utrs)$gene_id))

length(answer1)
## [1] 19189
length(answer2)
## [1] 19188

tail(answer1, n=4)
##   A DNAStringSet instance of length 4
##      width seq                                              names
## [1]  15498 ACGGCCTGGCCTGTACCCCAACG...TAAACTTGTGTTCCGGAAGAGG 9993
## [2]   1883 GTGTTGCTGAATTCCTGGGACTC...AACCATGTAAAGTTTTGTTGTA 9994
## [3]    172 GCCACTGCAGTCTGGGCCCCATC...TCATTAAACGGGCTGCGTTTAA 9997
## [4] 867733 CTCTCGCCTCACTGCGGCCTCCC...TGGAGGAGCTTTAGGACGCGGG <NA>

tail(answer2, n=3)
##   A DNAStringSet instance of length 3
##     width seq                                               names
## [1] 15498 ACGGCCTGGCCTGTACCCCAACG...ATAAACTTGTGTTCCGGAAGAGG 9993
## [2]  1883 GTGTTGCTGAATTCCTGGGACTC...AAACCATGTAAAGTTTTGTTGTA 9994
## [3]   172 GCCACTGCAGTCTGGGCCCCATC...ATCATTAAACGGGCTGCGTTTAA 9997

The additional sequence in answer1 is from all the orphan transcripts.

all(head(answer1, n=-1) == answer2)
## [1] TRUE

If you want the unstrsplit(split()) solution to keep them:

mcols(utrs)$gene_id[is.na(mcols(utrs)$gene_id)] <- "NA"
answer3 <- unstrsplit(split(utrseqs, mcols(utrs)$gene_id))
all(answer1 == answer3)
## [1] TRUE

Finally, in your original question you say that "extractTranscriptSeqs requires that the exons be ordered by genomic location" which is not correct. From its man page:

Note that, for each transcript, the exons must be ordered by
ascending rank, that is, by their position in the transcript.

See ?extractTranscriptSeqs for more information.

Cheers,
H.

ADD REPLY • link 9.2 years ago Hervé Pagès 16k