I have a GRangesList
with multiple transcripts per gene (see example below).
GRangesList object of length 9216: $ENSMUSG00000000001.4 GRanges object with 2 ranges and 4 metadata columns: seqnames ranges strand | exon_id exon_name exon_rank geneId <Rle> <IRanges> <Rle> | <integer> <character> <integer> <character> ENSMUST00000000001.4 chr3 [108109403, 108109421] - | 57602 ENSMUSE00000404895.1 8 ENSMUSG00000000001.4 ENSMUST00000000001.4 chr3 [108107280, 108109316] - | 57601 ENSMUSE00000363317.2 9 ENSMUSG00000000001.4 $ENSMUSG00000000028.14 GRanges object with 4 ranges and 4 metadata columns: seqnames ranges strand | exon_id exon_name exon_rank geneId ENSMUST00000000028.13 chr16 [18781896, 18781897] - | 245213 ENSMUSE00000645072.1 19 ENSMUSG00000000028.14 ENSMUST00000000028.13 chr16 [18780447, 18780573] - | 245211 ENSMUSE00000788192.1 20 ENSMUSG00000000028.14 ENSMUST00000096990.9 chr16 [18781896, 18781897] - | 245213 ENSMUSE00000645072.1 17 ENSMUSG00000000028.14 ENSMUST00000096990.9 chr16 [18780453, 18780573] - | 245212 ENSMUSE00000628024.1 18 ENSMUSG00000000028.14
I'd like to write the nucleotide sequences for all transcripts of a given gene to a fasta file grouped by gene id. Ex)
>ENSMUSG00000000028.14 [transcript 28 sequence][transcript 96990]
However, extractTranscriptSeqs
requires that the exons be ordered by genomic location which would disrupt the sequence of each transcript for overlapping multiexon transcripts. I tried the more generic getSeqs
, but it inserts transcript ids into the text and names rather than grouping everything just by gene id. Is there a way to do this? Thanks
This looks great. However, when I tried to run it on my own data, I couldn't access the
mcols()$geneId
in order to sort. Here is my full code.problem with mcols:
Yea, if you don't have an OrganismDbi package available, then things get annoying, since the TxDb methods for
exonsBy()
etc are frustratingly limited.Maybe something like this will work?
If you want to keep the 3' and 5' UTRs separate, then maybe this code (3' UTR only, repeat for 5'):
Yep, that will do it.
A few more things:
Instead of using
resplit()
, you can just dounstrsplit(split(utrseqs, mcols(utrs)$gene_id))
for concatenating the UTRs sequences from the same gene. There is one difference though:resplit()
will concatenate together all the transcripts that are not linked to a gene (orphan transcripts) but theunstrsplit(split())
solution will drop them:The additional sequence in
answer1
is from all the orphan transcripts.If you want the
unstrsplit(split())
solution to keep them:Finally, in your original question you say that "extractTranscriptSeqs requires that the exons be ordered by genomic location" which is not correct. From its man page:
Note that, for each transcript, the exons must be ordered by
ascending rank, that is, by their position in the transcript.
See
?extractTranscriptSeqs
for more information.Cheers,
H.